Chunking & Retrieval
Chunking is how Hoard splits documents into small, overlapping passages for search and retrieval.
What Is a Chunk?
A chunk is a text span optimized for:
- Search relevance — Enough context for accurate ranking
- LLM context — Right size for AI consumption
- Citation precision — Specific enough to be useful
Target size: 200-500 tokens per chunk, where a token is a whitespace-delimited word (see Chunking Algorithm)
Why Chunking Matters
| Without Chunking | With Chunking |
|---|---|
| Return whole documents | Return specific passages |
| Vague citations | Precise citations |
| Wastes context window | Efficient context use |
| Poor ranking granularity | Fine-grained ranking |
Chunking Algorithm
Hoard uses a simple whitespace-based sliding window algorithm:
- Tokenize: Split text on whitespace using the `\S+` regex
- Window: Create chunks of `max_tokens` tokens
- Slide: Advance by `max_tokens - overlap_tokens` tokens
- Overlap: Each chunk shares `overlap_tokens` tokens with the next
All content (markdown, plain text, etc.) is processed the same way — no special parsing for headings or structure.
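The four steps above can be sketched in a few lines of Python. This is a minimal illustration, not Hoard's actual implementation: the function name `window_token_ranges` is hypothetical, and stopping as soon as a window reaches the last token is an assumption about edge handling.

```python
import re

def window_token_ranges(text, max_tokens=400, overlap_tokens=50):
    """Return inclusive (first_token, last_token) index pairs, one per chunk."""
    if not 0 <= overlap_tokens < max_tokens:
        raise ValueError("overlap_tokens must be smaller than max_tokens")

    tokens = re.findall(r"\S+", text)       # Tokenize: split on whitespace
    step = max_tokens - overlap_tokens      # Slide: distance between chunk starts
    ranges = []
    for start in range(0, len(tokens), step):
        end = min(start + max_tokens, len(tokens))  # Window: at most max_tokens
        ranges.append((start, end - 1))
        if end == len(tokens):              # Assumption: stop once the tail is covered
            break
    return ranges

doc = " ".join(f"w{i}" for i in range(1000))  # a synthetic 1000-token document
print(window_token_ranges(doc))
# → [(0, 399), (350, 749), (700, 999)]
```

With the defaults, each chunk starts 350 tokens after the previous one, so the last 50 tokens of one chunk are the first 50 of the next.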
Example:
Original Document (1000 tokens)
├── Tokens 0-399 → Chunk 0
├── Tokens 350-749 → Chunk 1 (50 token overlap)
└── Tokens 700-999 → Chunk 2 (50 token overlap)
Overlap
Chunks include overlap for context continuity:
Chunk 0: ████████████████
                    ▼
         ┌──────────┘
         │
Chunk 1: ████████████████████
         └─────┘ overlap (50 tokens)
Default overlap: 50 tokens
This ensures:
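- Text near a boundary appears in both neighboring chunks, so a reference there keeps some of its surrounding context
- Search finds matches near chunk boundaries
- Adjacent chunks share a lead-in, so reading from one chunk to the next is not abrupt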
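Overlap has a cost: repeated tokens mean more chunks to store and index. As a rough worked example, assuming the sliding window stops as soon as a window reaches the final token (an assumption about edge handling, not confirmed Hoard behavior), an n-token document yields ceil((n - overlap) / (max_tokens - overlap)) chunks. The helper name `estimate_chunk_count` is hypothetical.

```python
import math

def estimate_chunk_count(n_tokens, max_tokens=400, overlap_tokens=50):
    """Estimate how many chunks a sliding window produces for n_tokens."""
    if not 0 <= overlap_tokens < max_tokens:
        raise ValueError("overlap_tokens must be smaller than max_tokens")
    if n_tokens <= 0:
        return 0
    step = max_tokens - overlap_tokens
    # Any non-empty document produces at least one chunk.
    return max(1, math.ceil((n_tokens - overlap_tokens) / step))

print(estimate_chunk_count(1000))                        # → 3
print(estimate_chunk_count(10000, overlap_tokens=0))     # → 25
print(estimate_chunk_count(10000, overlap_tokens=50))    # → 29
print(estimate_chunk_count(10000, overlap_tokens=200))   # → 49
```

Under these assumptions, the default 50-token overlap adds about 16% more chunks than no overlap, while a 200-token overlap roughly doubles the count.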
Configuration
In ~/.hoard/config.yaml:
connectors:
  local_files:
    chunk_max_tokens: 400      # Target chunk size
    chunk_overlap_tokens: 50   # Overlap between chunks
Chunk ID Format
Every chunk has a predictable ID:
{entity_id}:{chunk_index}
Example:
abc-123:0 (first chunk)
abc-123:1 (second chunk)
abc-123:2 (third chunk)
This enables:
- Direct chunk retrieval
- Stable references
- Cross-reference in citations
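Because the format is fixed, IDs can be built and split with ordinary string operations. A small sketch: the helper names are hypothetical, and splitting on the last colon is an assumption that guards against entity IDs which themselves contain colons.

```python
def make_chunk_id(entity_id: str, chunk_index: int) -> str:
    """Build an ID of the form {entity_id}:{chunk_index}."""
    return f"{entity_id}:{chunk_index}"

def parse_chunk_id(chunk_id: str) -> tuple[str, int]:
    """Split a chunk ID into (entity_id, chunk_index)."""
    # rpartition splits on the LAST colon, so "a:b:3" -> ("a:b", 3).
    entity_id, _, index = chunk_id.rpartition(":")
    if not entity_id or not (index.isascii() and index.isdigit()):
        raise ValueError(f"not a valid chunk ID: {chunk_id!r}")
    return entity_id, int(index)

print(make_chunk_id("abc-123", 0))   # → abc-123:0
print(parse_chunk_id("abc-123:2"))   # → ('abc-123', 2)
```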
Retrieval
Use the MCP tools to retrieve chunks:
- `get`: Returns the full entity with all of its chunks
- `get_chunk`: Returns a single chunk with optional surrounding context
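MCP clients invoke tools through a JSON-RPC `tools/call` request. The sketch below only builds the request payloads. The argument names (`entity_id`, `chunk_id`, `context_chunks`) are guesses for illustration and are not taken from Hoard's tool schema, so check the schema your server advertises before relying on them.

```python
import json

def tool_call(request_id, tool_name, arguments):
    """Build an MCP tools/call request as a plain dict."""
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": tool_name, "arguments": arguments},
    }

# Hypothetical arguments: the real parameter names may differ.
get_request = tool_call(1, "get", {"entity_id": "abc-123"})
get_chunk_request = tool_call(
    2, "get_chunk", {"chunk_id": "abc-123:2", "context_chunks": 1}
)

print(json.dumps(get_chunk_request, indent=2))
```

In practice an MCP client library sends these requests for you; the raw payload is shown only to make the shape of a call visible.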
Search Results
Search returns matching chunks grouped by entity:
{
  "results": [
    {
      "entity_id": "abc-123",
      "entity_title": "Project Notes",
      "chunks": [
        {
          "chunk_id": "abc-123:2",
          "content": "The meeting discussed...",
          "score": 0.87,
          "char_offset_start": 1200,
          "char_offset_end": 1850
        }
      ]
    }
  ]
}
Character Offsets
Each chunk tracks its position in the original:
| Field | Description |
|---|---|
| `char_offset_start` | First character position |
| `char_offset_end` | Last character position |
This enables:
- Highlighting in original document
- Precise citations with line numbers
- Linking back to exact location
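To illustrate, here is a sketch that turns one chunk from a search result into an excerpt and a line-based citation. It assumes offsets count characters from 0 and that `char_offset_end` is exclusive, like a Python slice. The field descriptions do not settle this, so adjust by one if the end offset turns out to be inclusive. The function names are hypothetical.

```python
def line_number_at(text: str, offset: int) -> int:
    """Return the 1-based line number containing the given character offset."""
    return text.count("\n", 0, offset) + 1

def cite(text: str, chunk: dict) -> dict:
    """Map a chunk's character offsets back onto the original document."""
    start, end = chunk["char_offset_start"], chunk["char_offset_end"]
    return {
        "chunk_id": chunk["chunk_id"],
        "excerpt": text[start:end],                 # assumes an exclusive end offset
        "first_line": line_number_at(text, start),
        "last_line": line_number_at(text, max(start, end - 1)),
    }

original = "Intro line\nThe meeting discussed budgets.\nIt ended early.\n"
passage = "The meeting discussed budgets."
start = original.index(passage)
chunk = {
    "chunk_id": "abc-123:0",
    "char_offset_start": start,
    "char_offset_end": start + len(passage),
}
print(cite(original, chunk))
```

Here the citation resolves to line 2 of the original text, and the excerpt is exactly the stored passage.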
Best Practices
For Connectors
- Use the SDK chunking helper (`chunk_plain_text`)
- Preserve character offsets
- Don’t create very small chunks (< 50 tokens)
- Don’t create very large chunks (> 1000 tokens)
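A connector that produces its own chunks could enforce these bounds with a quick check before indexing. This is a sketch under assumptions: tokens are counted with the same `\S+` whitespace rule described earlier, the 50 and 1000 limits come from the list above, and `check_chunk_sizes` is a hypothetical helper, not part of the SDK.

```python
import re

MIN_TOKENS = 50     # lower bound from the guidance above
MAX_TOKENS = 1000   # upper bound from the guidance above

def check_chunk_sizes(chunk_texts):
    """Return (index, token_count) for each chunk outside the recommended range."""
    problems = []
    for index, text in enumerate(chunk_texts):
        count = len(re.findall(r"\S+", text))   # whitespace-delimited tokens
        if count < MIN_TOKENS or count > MAX_TOKENS:
            problems.append((index, count))
    return problems

chunks = ["word " * 400, "word " * 10, "word " * 1200]
print(check_chunk_sizes(chunks))
# → [(1, 10), (2, 1200)]
```

The final chunk of a document is often short by necessity, so it may be more practical to treat a small last chunk as a warning than as an error.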
For Users
- Shorter, focused documents chunk more efficiently
- One topic per file improves retrieval precision
SDK Chunking Helper
from hoard.sdk.chunking import chunk_plain_text, ChunkSpan

# Chunk any text content
chunks: list[ChunkSpan] = chunk_plain_text(
    content,
    max_tokens=400,
    overlap_tokens=50,
)

# Each ChunkSpan has:
# - text: The chunk content
# - start: char_offset_start (integer)
# - end: char_offset_end (integer)
Next Steps
- Search — How chunks are searched
- Data Model — Chunk schema details
- SDK Utilities — Chunking functions