Chunking & Retrieval

Chunking is how Hoard splits documents into small, overlapping pieces that can be searched, ranked, and cited individually.

What Is a Chunk?

A chunk is a text span optimized for:

  • Search relevance — Enough context for accurate ranking
  • LLM context — Right size for AI consumption
  • Citation precision — Specific enough to be useful

Target size: 200-500 tokens per chunk

Why Chunking Matters

Without Chunking            With Chunking
Return whole documents      Return specific passages
Vague citations             Precise citations
Wastes context window       Efficient context use
Poor ranking granularity    Fine-grained ranking

Chunking Algorithm

Hoard uses a simple whitespace-based sliding window algorithm:

  1. Tokenize — Split text on whitespace using \S+ regex
  2. Window — Create chunks of max_tokens tokens
  3. Slide — Advance by max_tokens - overlap_tokens
  4. Overlap — Each chunk shares overlap_tokens with the next

All content (markdown, plain text, etc.) is processed the same way, with no special parsing for headings or structure. A "token" here is a whitespace-delimited word, not a model tokenizer token.

Example:

Original Document (1000 tokens)
├── Tokens 0-399 → Chunk 0
├── Tokens 350-749 → Chunk 1 (50 token overlap)
└── Tokens 700-999 → Chunk 2 (50 token overlap)
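
The steps and the 1000-token example above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not Hoard's actual implementation: the function name `sliding_window_chunks`, the tuple return shape, and the exclusive end offset are all hypothetical.

```python
import re


def sliding_window_chunks(text, max_tokens=400, overlap_tokens=50):
    """Split text into overlapping windows of whitespace-delimited tokens.

    Returns a list of (chunk_text, start, end) tuples. start and end are
    character offsets into `text`; treating end as exclusive is an assumption.
    """
    if not 0 <= overlap_tokens < max_tokens:
        raise ValueError("overlap_tokens must be >= 0 and < max_tokens")

    tokens = list(re.finditer(r"\S+", text))   # 1. Tokenize on whitespace
    step = max_tokens - overlap_tokens         # 3. Slide distance
    chunks = []
    i = 0
    while i < len(tokens):
        window = tokens[i:i + max_tokens]      # 2. Window of up to max_tokens
        start, end = window[0].start(), window[-1].end()
        chunks.append((text[start:end], start, end))
        if i + max_tokens >= len(tokens):      # this window reached the end
            break
        i += step                              # 4. Next window re-reads the overlap
    return chunks


# Build a 1000-token document: "t0 t1 ... t999"
doc = " ".join(f"t{n}" for n in range(1000))
chunks = sliding_window_chunks(doc)
for index, (chunk_text, start, end) in enumerate(chunks):
    words = chunk_text.split()
    print(index, words[0], words[-1])
```

For the 1000-token input this produces three chunks covering tokens 0-399, 350-749, and 700-999, which matches the example.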

Overlap

Chunks include overlap for context continuity:

Chunk 0: ████████████████
Chunk 1:            █████████████████
                    └───┘
                    overlap (50 tokens)

Default overlap: 50 tokens

This ensures:

  • References near the start of a chunk keep the preceding context they depend on
  • Search finds matches that fall near or across chunk boundaries
  • Context flows naturally from one chunk to the next
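
Overlap has a cost: consecutive chunks repeat tokens, so a document yields more chunks than it would without overlap. The arithmetic can be sketched as below. The helper `estimate_chunk_count` is hypothetical, and it assumes windowing stops as soon as a window reaches the end of the document.

```python
import math


def estimate_chunk_count(total_tokens, max_tokens=400, overlap_tokens=50):
    """Estimate how many chunks a sliding window produces (hypothetical helper)."""
    if not 0 <= overlap_tokens < max_tokens:
        raise ValueError("overlap_tokens must be >= 0 and < max_tokens")
    if total_tokens <= 0:
        return 0
    if total_tokens <= max_tokens:
        return 1
    step = max_tokens - overlap_tokens
    # One full first window, then one extra window per `step` of remaining tokens.
    return 1 + math.ceil((total_tokens - max_tokens) / step)


with_overlap = estimate_chunk_count(10_000, max_tokens=400, overlap_tokens=50)
without_overlap = estimate_chunk_count(10_000, max_tokens=400, overlap_tokens=0)
print(with_overlap, without_overlap)
```

With the defaults, a 10,000-token document yields 29 chunks instead of 25, so the 50-token overlap adds about 16% more chunks to store and index.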

Configuration

In ~/.hoard/config.yaml:

connectors:
  local_files:
    chunk_max_tokens: 400      # Target chunk size
    chunk_overlap_tokens: 50   # Overlap between chunks
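
Because the window advances by chunk_max_tokens - chunk_overlap_tokens, the overlap must stay smaller than the chunk size or the window never moves forward. The sketch below checks that on an already-parsed config. `read_chunk_settings` is a hypothetical helper, the fallback values are assumptions taken from the numbers on this page, and the dict stands in for what a YAML loader such as PyYAML's `yaml.safe_load` would return.

```python
DEFAULT_MAX_TOKENS = 400      # value used in the example config (assumed default)
DEFAULT_OVERLAP_TOKENS = 50   # documented default overlap


def read_chunk_settings(config, connector="local_files"):
    """Return (max_tokens, overlap_tokens) for a connector, with sanity checks."""
    section = (config.get("connectors") or {}).get(connector) or {}
    max_tokens = int(section.get("chunk_max_tokens", DEFAULT_MAX_TOKENS))
    overlap = int(section.get("chunk_overlap_tokens", DEFAULT_OVERLAP_TOKENS))
    if max_tokens <= 0:
        raise ValueError("chunk_max_tokens must be positive")
    if not 0 <= overlap < max_tokens:
        raise ValueError("chunk_overlap_tokens must be >= 0 and < chunk_max_tokens")
    return max_tokens, overlap


# Parsed equivalent of the YAML above.
config = {
    "connectors": {
        "local_files": {"chunk_max_tokens": 400, "chunk_overlap_tokens": 50}
    }
}
print(read_chunk_settings(config))
```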

Chunk ID Format

Every chunk has a predictable ID:

{entity_id}:{chunk_index}

Example: abc-123:0 (first chunk)
         abc-123:1 (second chunk)
         abc-123:2 (third chunk)

This enables:

  • Direct chunk retrieval
  • Stable references
  • Cross-reference in citations
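
Since the format is predictable, IDs can be built and taken apart with plain string handling. The helpers below are a hypothetical sketch, not part of the Hoard SDK. Splitting on the last colon is an assumption that keeps things working if an entity ID itself contains a colon.

```python
from typing import NamedTuple


class ChunkRef(NamedTuple):
    entity_id: str
    chunk_index: int


def format_chunk_id(entity_id: str, chunk_index: int) -> str:
    """Build an ID of the form {entity_id}:{chunk_index}."""
    return f"{entity_id}:{chunk_index}"


def parse_chunk_id(chunk_id: str) -> ChunkRef:
    """Split a chunk ID on its last colon into entity ID and zero-based index."""
    entity_id, sep, index = chunk_id.rpartition(":")
    if not sep or not entity_id or not (index.isascii() and index.isdigit()):
        raise ValueError(f"not a valid chunk ID: {chunk_id!r}")
    return ChunkRef(entity_id, int(index))


ref = parse_chunk_id("abc-123:2")
print(ref.entity_id, ref.chunk_index)
# Neighbouring chunk in the same entity
print(format_chunk_id(ref.entity_id, ref.chunk_index + 1))
```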

Retrieval

Use the MCP tools to retrieve chunks:

  • get — Returns full entity with all chunks
  • get_chunk — Returns single chunk with optional surrounding context

Search Results

Search returns matching chunks grouped by entity:

{
  "results": [
    {
      "entity_id": "abc-123",
      "entity_title": "Project Notes",
      "chunks": [
        {
          "chunk_id": "abc-123:2",
          "content": "The meeting discussed...",
          "score": 0.87,
          "char_offset_start": 1200,
          "char_offset_end": 1850
        }
      ]
    }
  ]
}
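
A client consuming this response often wants a flat, ranked list of passages rather than the nested grouping. The sketch below shows one way to do that with the field names from the response above. `top_chunks` is a hypothetical helper, and treating a higher score as more relevant is an assumption.

```python
import json

response_text = """
{
  "results": [
    {
      "entity_id": "abc-123",
      "entity_title": "Project Notes",
      "chunks": [
        {
          "chunk_id": "abc-123:2",
          "content": "The meeting discussed...",
          "score": 0.87,
          "char_offset_start": 1200,
          "char_offset_end": 1850
        }
      ]
    }
  ]
}
"""


def top_chunks(response, min_score=0.0):
    """Flatten entity-grouped results into one list sorted by descending score."""
    hits = []
    for entity in response.get("results", []):
        for chunk in entity.get("chunks", []):
            if chunk["score"] >= min_score:
                hits.append({
                    "entity_title": entity["entity_title"],
                    "chunk_id": chunk["chunk_id"],
                    "score": chunk["score"],
                    "content": chunk["content"],
                })
    hits.sort(key=lambda hit: hit["score"], reverse=True)
    return hits


response = json.loads(response_text)
for hit in top_chunks(response, min_score=0.5):
    print(f'{hit["score"]:.2f}  {hit["chunk_id"]}  {hit["entity_title"]}')
```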

Character Offsets

Each chunk tracks its position in the original:

Field                Description
char_offset_start    First character position
char_offset_end      Last character position

This enables:

  • Highlighting in original document
  • Precise citations with line numbers
  • Linking back to exact location
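
Offsets map back to the original text with ordinary string slicing. The sketch below highlights a span and derives its line number. Both helpers are hypothetical, and the code assumes the end offset is exclusive (Python slice style), which this page does not confirm. If the end offset turns out to be inclusive, slice to end + 1 instead.

```python
def highlight(original: str, start: int, end: int, marker: str = "**") -> str:
    """Wrap original[start:end] in markers, leaving the rest untouched."""
    return original[:start] + marker + original[start:end] + marker + original[end:]


def line_number_at(original: str, offset: int) -> int:
    """Return the 1-based line number containing the given character offset."""
    return original.count("\n", 0, offset) + 1


original = "first line\nsecond line with a match\nthird line"
start = original.index("match")
end = start + len("match")

print(highlight(original, start, end))
print("line", line_number_at(original, start))
```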

Best Practices

For Connectors

  • Use the SDK chunking helper (chunk_plain_text)
  • Preserve character offsets
  • Don’t create very small chunks (< 50 tokens)
  • Don’t create very large chunks (> 1000 tokens)
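
A connector that produces its own chunks could check them against these size limits before indexing. The sketch below is a hypothetical check, not an SDK feature. It counts whitespace-delimited tokens, and it exempts the final chunk from the minimum on the assumption that a document's tail is naturally short.

```python
import re


def out_of_range_chunks(chunk_texts, min_tokens=50, max_tokens=1000):
    """Return (index, token_count) for chunks outside the recommended size range."""
    flagged = []
    last_index = len(chunk_texts) - 1
    for index, text in enumerate(chunk_texts):
        count = len(re.findall(r"\S+", text))
        too_small = count < min_tokens and index != last_index
        too_large = count > max_tokens
        if too_small or too_large:
            flagged.append((index, count))
    return flagged


sample = [
    "word " * 10,     # too small
    "word " * 400,    # fine
    "word " * 1200,   # too large
    "word " * 20,     # short final chunk, allowed
]
print(out_of_range_chunks(sample))
```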

For Users

  • Shorter, focused documents chunk more efficiently
  • One topic per file improves retrieval precision

SDK Chunking Helper

from hoard.sdk.chunking import chunk_plain_text, ChunkSpan

# Placeholder input; replace with the text your connector extracted
content = "Example document text to index."

# Chunk any text content
chunks: list[ChunkSpan] = chunk_plain_text(
    content,
    max_tokens=400,
    overlap_tokens=50,
)

# Each ChunkSpan has:
# - text: The chunk content
# - start: char_offset_start (integer)
# - end: char_offset_end (integer)

Next Steps