Tiered Chunking Strategy
Different content needs different treatment:

| Tier | Content Type | Max Size | Overlap | Splitter |
|---|---|---|---|---|
| 1 | Code (AST) | 3500 chars | 0 | AST-based |
| 2 | Documentation | 1500 chars | 150 | LangChain |
| 3 | Configuration | 1500 chars | 100 | LangChain |
| 4 | Other Code | 1500 chars | 100 | LangChain |
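
As a rough illustration of how the table could drive chunking, the sketch below maps a file path to one of the four tiers. The sizes and overlaps come from the table; the extension sets and the names `Tier` and `pick_tier` are hypothetical and are not taken from SHARC.

```python
from dataclasses import dataclass
from pathlib import PurePosixPath


@dataclass(frozen=True)
class Tier:
    number: int
    max_chars: int
    overlap: int
    splitter: str


# Values copied from the tier table above.
TIERS = {
    1: Tier(1, 3500, 0, "ast"),
    2: Tier(2, 1500, 150, "langchain"),
    3: Tier(3, 1500, 100, "langchain"),
    4: Tier(4, 1500, 100, "langchain"),
}

# Hypothetical extension sets, for illustration only.
AST_EXTS = {".ts", ".js", ".py", ".java", ".cs", ".rs", ".scala"}
DOC_EXTS = {".md", ".txt"}
CONFIG_EXTS = {".json", ".yaml", ".yml", ".toml"}


def pick_tier(path: str) -> Tier:
    """Return the tier settings for a file, based only on its extension."""
    ext = PurePosixPath(path).suffix.lower()
    if ext in AST_EXTS:
        return TIERS[1]
    if ext in DOC_EXTS:
        return TIERS[2]
    if ext in CONFIG_EXTS:
        return TIERS[3]
    return TIERS[4]  # everything else falls back to "Other Code"
```

For example, `pick_tier("README.md")` yields the Tier 2 settings, while a file with an unrecognized extension falls through to Tier 4.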
Tier 1: AST-Based Chunking
For languages with AST support, code is split at semantic boundaries.
Supported Languages
How It Works
Context Injection
Each chunk receives context about its location, including parent classes, modules, and decorators/annotations.
Decorator/Annotation Context
SHARC extracts decorators and annotations, embedding them in the context for better semantic understanding:
- TypeScript/JavaScript: @Decorator(), @decorator
- Python: @decorator, @decorator(args)
- Java: @Annotation, @Annotation(value)
- C#: [Attribute], [Attribute(args)]
- Rust: #[attribute], #[derive(...)]
- Scala: @annotation
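
To make the Python case concrete, here is a minimal sketch that collects decorators per definition using Python's standard `ast` module. It is an assumption-laden stand-in: the parser SHARC actually uses is not stated here, and the function name `decorator_context` is hypothetical.

```python
import ast


def decorator_context(source: str) -> dict[str, list[str]]:
    """Map each class/function name to its decorators, rendered as source text."""
    found: dict[str, list[str]] = {}
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            # ast.unparse (Python 3.9+) turns the decorator expression back into code.
            found[node.name] = ["@" + ast.unparse(d) for d in node.decorator_list]
    return found


sample = '''
@dataclass(frozen=True)
class Point:
    @property
    def norm(self):
        return 0
'''

print(decorator_context(sample))
```

The resulting strings could then be placed in a chunk's context so that a search for, say, "frozen dataclass" has something to match against.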
Container Nodes
These nodes provide context but aren’t chunked separately:
- Classes
- Interfaces
- Modules
- Namespaces
- Impl blocks (Rust)
Leaf Nodes
These are extracted as individual chunks:
- Functions
- Methods
- Arrow functions
- Property assignments
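
The container/leaf split can be sketched for Python source with the standard `ast` module: classes are walked for context but never emitted, while functions and methods become chunks. This is only an illustration. The context string format and the name `chunk_python` are hypothetical, arrow functions and property assignments have no Python equivalent here, and the 3500-character cap is not enforced.

```python
import ast


def chunk_python(source: str, path: str) -> list[dict]:
    """Emit one chunk per function/method, tagged with its parent containers."""
    chunks: list[dict] = []

    def visit(node: ast.AST, parents: list[str]) -> None:
        for child in ast.iter_child_nodes(node):
            if isinstance(child, ast.ClassDef):
                # Container node: contributes context only, then recurse.
                visit(child, parents + [child.name])
            elif isinstance(child, (ast.FunctionDef, ast.AsyncFunctionDef)):
                # Leaf node: extracted as an individual chunk.
                parent = ".".join(parents) or "<module>"
                chunks.append({
                    "context": f"File: {path} | Parent: {parent}",
                    "name": child.name,
                    "text": ast.get_source_segment(source, child),
                })

    visit(ast.parse(source), [])
    return chunks


sample = '''class Greeter:
    def hello(self):
        return "hi"

def main():
    return Greeter().hello()
'''

for chunk in chunk_python(sample, "greeter.py"):
    print(chunk["context"], "->", chunk["name"])
```

Keeping the class out of the chunk body while recording it in the context avoids duplicating the whole class for every method.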
Tier 2: Documentation Chunking
For markdown and text files:
- Smaller chunks (1500 chars)
- 150 char overlap preserves context
- Section headers kept with content
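
The size and overlap figures above can be illustrated with a plain sliding-window splitter. This is a simplified sketch, not the LangChain splitter itself, which additionally tries to break at natural separators; the function name `split_with_overlap` is hypothetical.

```python
def split_with_overlap(text: str, max_chars: int = 1500, overlap: int = 150) -> list[str]:
    """Cut text into windows of max_chars, each sharing `overlap` chars with the previous one."""
    if overlap >= max_chars:
        raise ValueError("overlap must be smaller than max_chars")
    step = max_chars - overlap
    chunks: list[str] = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + max_chars])
        if start + max_chars >= len(text):
            break  # the window already reached the end of the text
        start += step
    return chunks
```

With the defaults, the last 150 characters of each chunk are repeated at the start of the next one, so a sentence that straddles a boundary appears whole in at least one of the two chunks when it is shorter than the overlap.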
Why Overlap?
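Documentation flows between sections. Overlap repeats up to 150 characters from the end of one chunk at the start of the next, so context that spans a chunk boundary is preserved.
Tier 3: Configuration Chunking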
For JSON, YAML, TOML:
- Groups related keys together
- 100 char overlap for context
- Preserves structure
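
One way to picture "groups related keys together" is to pack consecutive top-level keys of a parsed config into chunks until the size budget is reached, keeping each key's subtree intact. The sketch below assumes that interpretation; the name `chunk_config` is hypothetical, overlap is not modeled, and a single oversized key still ends up in its own (too large) chunk.

```python
import json


def chunk_config(data: dict, max_chars: int = 1500) -> list[dict]:
    """Pack consecutive top-level keys into chunks whose JSON form fits max_chars."""
    chunks: list[dict] = []
    current: dict = {}
    for key, value in data.items():
        candidate = {**current, key: value}
        if current and len(json.dumps(candidate)) > max_chars:
            # Adding this key would overflow: close the current chunk first.
            chunks.append(current)
            current = {key: value}
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks
```

Because each chunk is itself a valid mapping, it can be serialized back to JSON, YAML, or TOML without breaking nesting.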
Tier 4: Other Code
For languages without AST support:
- Character-based splitting
- 100 char overlap
- May split mid-function
File Context
All chunks include file information.
Size Limits
Per-Chunk Limits
| Tier | Max Characters |
|---|---|
| Code | 3500 |
| Docs | 1500 |
| Config | 1500 |
| Other | 1500 |
Per-File Limits
- Maximum file size: 1MB
- Files larger than this are skipped
- Prevents indexing minified bundles
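
A per-file guard of this kind might look like the following sketch. Whether "1MB" means 1,000,000 or 1,048,576 bytes is not stated, so the constant below is an assumption, and both function names are hypothetical.

```python
import os

# Assumption: 1MB is read as 1,000,000 bytes; it could also mean 1_048_576.
MAX_FILE_BYTES = 1_000_000


def within_size_limit(size_bytes: int, max_bytes: int = MAX_FILE_BYTES) -> bool:
    """Files larger than the limit are skipped; files at the limit are kept."""
    return size_bytes <= max_bytes


def should_index(path: str) -> bool:
    """Check a file on disk against the size limit before reading it."""
    return within_size_limit(os.path.getsize(path))
```

Checking the size before opening the file means a large minified bundle is never read into memory.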
Custom Extensions
Additional file types can be added during indexing.
Ignore Patterns
Files and directories can be skipped with ignore patterns.
Default Ignores
Some paths are always skipped.
ignorePatterns uses glob semantics, so patterns like static/**, *.tmp, and private/** behave as expected.
Optional team-dependent patterns such as vendor/**, third_party/**, and generated/** are opt-in.
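
The matching behavior can be approximated with Python's standard `fnmatch` module, as in the sketch below. This is an approximation under stated assumptions: in `fnmatch`, `*` also crosses directory separators, which may differ from the glob library SHARC actually uses, and the helper name `is_ignored` is hypothetical.

```python
from fnmatch import fnmatchcase


def is_ignored(path: str, patterns: list[str]) -> bool:
    """Return True if the path matches any ignore pattern (case-sensitive)."""
    return any(fnmatchcase(path, pattern) for pattern in patterns)


# Example patterns taken from the text above.
patterns = ["static/**", "*.tmp", "private/**"]

print(is_ignored("static/js/app.js", patterns))
print(is_ignored("src/main.py", patterns))

# Opt-in patterns are simply appended by teams that want them.
with_vendor = patterns + ["vendor/**"]
print(is_ignored("vendor/lib.js", with_vendor))
```

Since vendor/** is opt-in, a path such as vendor/lib.js is only skipped once that pattern has been added explicitly.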