Chunking & Retrieval

Chunking is how Hoard splits documents into smaller, overlapping pieces for search and retrieval.

What Is a Chunk?

A chunk is a text span optimized for:

Search relevance — Enough context for accurate ranking
LLM context — Right size for AI consumption
Citation precision — Specific enough to be useful

Target size: 200-500 tokens per chunk

Why Chunking Matters

Without Chunking	With Chunking
Return whole documents	Return specific passages
Vague citations	Precise citations
Wastes context window	Efficient context use
Poor ranking granularity	Fine-grained ranking

Chunking Algorithm

Hoard uses a simple whitespace-based sliding window algorithm:

Tokenize — Split text on whitespace using \S+ regex
Window — Create chunks of max_tokens tokens
Slide — Advance by max_tokens - overlap_tokens
Overlap — Each chunk shares overlap_tokens with the next

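All content (markdown, plain text, etc.) is processed the same way, with no special parsing for headings or structure. A "token" here is a whitespace-delimited word, not a model tokenizer token, so these counts will differ from LLM token counts.

Example (max_tokens = 400, overlap_tokens = 50, so the window advances 350 tokens per step):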

Original Document (1000 tokens)
├── Tokens 0-399   → Chunk 0
├── Tokens 350-749 → Chunk 1 (50 token overlap)
└── Tokens 700-999 → Chunk 2 (50 token overlap)
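
The steps above can be sketched in a few lines of Python. This is an illustrative reimplementation based only on the description in this section, not Hoard's actual code; the function name sliding_window_chunks and its return shape are assumptions made for the example.

```python
import re


def sliding_window_chunks(text, max_tokens=400, overlap_tokens=50):
    """Return (start_token, end_token, chunk_text) tuples; end_token is exclusive.

    Hypothetical helper for illustration, not part of the Hoard SDK.
    """
    if not 0 <= overlap_tokens < max_tokens:
        raise ValueError("overlap_tokens must be >= 0 and smaller than max_tokens")

    # Tokenize: every run of non-whitespace characters is one token.
    matches = list(re.finditer(r"\S+", text))
    step = max_tokens - overlap_tokens

    chunks = []
    start = 0
    while start < len(matches):
        end = min(start + max_tokens, len(matches))
        # Slice the original text so the whitespace between tokens is preserved.
        chunk_text = text[matches[start].start():matches[end - 1].end()]
        chunks.append((start, end, chunk_text))
        if end == len(matches):
            break  # the window reached the end of the document
        start += step
    return chunks


# A synthetic 1000-token document reproduces the example above.
doc = " ".join(f"w{i}" for i in range(1000))
spans = [(s, e - 1) for s, e, _ in sliding_window_chunks(doc)]
print(spans)  # [(0, 399), (350, 749), (700, 999)]
```

The sketch stops as soon as a window reaches the last token, which is why the final chunk is shorter than max_tokens. Whether Hoard handles the tail the same way is an assumption.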

Overlap

Chunks include overlap for context continuity:

Chunk 0: ████████████████
                    ▼
         ┌──────────┘
         │
Chunk 1: ████████████████████
              └─────┘
              overlap (50 tokens)

Default overlap: 50 tokens

This ensures:

Text at the start of chunk N+1 repeats the end of chunk N, so references stay understandable
Search finds matches that fall near chunk boundaries
Context flows naturally from one chunk to the next

Configuration

In ~/.hoard/config.yaml:

connectors:
  local_files:
    chunk_max_tokens: 400    # Target chunk size
    chunk_overlap_tokens: 50  # Overlap between chunks
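
To get a feel for how these two settings interact, the following sketch estimates how many chunks a document of a given length produces. It assumes the sliding-window behavior described under Chunking Algorithm; estimate_chunk_count is a hypothetical helper, not a Hoard function.

```python
def estimate_chunk_count(total_tokens, chunk_max_tokens=400, chunk_overlap_tokens=50):
    """Estimate the number of chunks for a document of total_tokens tokens.

    Hypothetical helper: assumes the window advances by
    chunk_max_tokens - chunk_overlap_tokens and stops at the end of the text.
    """
    if chunk_max_tokens <= 0:
        raise ValueError("chunk_max_tokens must be positive")
    if not 0 <= chunk_overlap_tokens < chunk_max_tokens:
        raise ValueError("chunk_overlap_tokens must be >= 0 and below chunk_max_tokens")
    if total_tokens <= 0:
        return 0
    if total_tokens <= chunk_max_tokens:
        return 1
    step = chunk_max_tokens - chunk_overlap_tokens
    remaining = total_tokens - chunk_overlap_tokens
    # Integer ceiling division: how many steps are needed to cover the text.
    return -(-remaining // step)


print(estimate_chunk_count(1000))   # 3
print(estimate_chunk_count(10000))  # 29
```

A larger overlap shrinks the step, so the same document yields more chunks and a larger index. An overlap equal to or greater than the chunk size would never advance, which is why the sketch rejects it.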

Chunk ID Format

Every chunk has a predictable ID:

{entity_id}:{chunk_index}

Example: abc-123:0  (first chunk)
         abc-123:1  (second chunk)
         abc-123:2  (third chunk)

This enables:

Direct chunk retrieval
Stable references
Cross-reference in citations
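
Because the format is fixed, chunk IDs can be built and split with plain string operations. The helpers below are a sketch with hypothetical names; they assume the chunk index is always the part after the last colon, so an entity ID that itself contains a colon would still parse.

```python
def make_chunk_id(entity_id: str, chunk_index: int) -> str:
    """Build an ID in the {entity_id}:{chunk_index} format."""
    return f"{entity_id}:{chunk_index}"


def parse_chunk_id(chunk_id: str) -> tuple[str, int]:
    """Split a chunk ID into (entity_id, chunk_index)."""
    # rpartition splits on the LAST colon (assumption: the index comes last).
    entity_id, sep, index = chunk_id.rpartition(":")
    if not sep or not entity_id or not (index.isascii() and index.isdigit()):
        raise ValueError(f"not a valid chunk ID: {chunk_id!r}")
    return entity_id, int(index)


print(make_chunk_id("abc-123", 0))   # abc-123:0
print(parse_chunk_id("abc-123:2"))   # ('abc-123', 2)
```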

Retrieval

Use the MCP tools to retrieve chunks:

get — Returns full entity with all chunks
get_chunk — Returns single chunk with optional surrounding context

Search Results

Search returns matching chunks grouped by entity:

{
  "results": [
    {
      "entity_id": "abc-123",
      "entity_title": "Project Notes",
      "chunks": [
        {
          "chunk_id": "abc-123:2",
          "content": "The meeting discussed...",
          "score": 0.87,
          "char_offset_start": 1200,
          "char_offset_end": 1850
        }
      ]
    }
  ]
}
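
A client usually wants a flat, ranked list of hits rather than the nested structure. The sketch below parses a response shaped like the one above and sorts chunks by score; it relies only on the fields shown in this example, and flatten_hits is a hypothetical helper.

```python
import json

# Sample response copied from the example above.
response_text = """
{
  "results": [
    {
      "entity_id": "abc-123",
      "entity_title": "Project Notes",
      "chunks": [
        {
          "chunk_id": "abc-123:2",
          "content": "The meeting discussed...",
          "score": 0.87,
          "char_offset_start": 1200,
          "char_offset_end": 1850
        }
      ]
    }
  ]
}
"""


def flatten_hits(response):
    """Return (score, chunk_id, entity_title) tuples, best score first."""
    hits = [
        (chunk["score"], chunk["chunk_id"], result["entity_title"])
        for result in response["results"]
        for chunk in result["chunks"]
    ]
    return sorted(hits, key=lambda hit: hit[0], reverse=True)


response = json.loads(response_text)
for score, chunk_id, title in flatten_hits(response):
    print(f"{score:.2f}  {chunk_id}  {title}")
```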

Character Offsets

Each chunk tracks its position in the original document:

Field	Description
`char_offset_start`	First character position
`char_offset_end`	Last character position

This enables:

Highlighting in original document
Precise citations with line numbers
Linking back to exact location
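
As a sketch of how offsets support citations and highlighting, the helpers below turn an offset into a line number and wrap a span in markers. Both function names are hypothetical, and the code assumes char_offset_end is exclusive (Python slice style); if Hoard's end offset is inclusive, slice with end + 1 instead.

```python
def line_number_at(text: str, char_offset: int) -> int:
    """Return the 1-based line number that contains char_offset."""
    return text.count("\n", 0, char_offset) + 1


def highlight(text: str, start: int, end: int, marker: str = "**") -> str:
    """Wrap text[start:end] in marker (assumes end is exclusive)."""
    return text[:start] + marker + text[start:end] + marker + text[end:]


original = "Title\n\nThe meeting discussed budgets.\nNext steps follow.\n"
passage = "The meeting discussed budgets."
start = original.index(passage)
end = start + len(passage)

print(line_number_at(original, start))  # 3
print(highlight(original, start, end))
```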

Best Practices

For Connectors

Use the SDK chunking helper (chunk_plain_text)
Preserve character offsets
Don’t create very small chunks (< 50 tokens)
Don’t create very large chunks (> 1000 tokens)

For Users

Shorter, focused documents chunk more efficiently
One topic per file improves retrieval precision

SDK Chunking Helper

from hoard.sdk.chunking import chunk_plain_text, ChunkSpan

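# Sample input; any text works, since markdown and plain text are chunked the same way
content = "The meeting discussed the release plan and open issues."

# Chunk any text content
chunks: list[ChunkSpan] = chunk_plain_text(
    content,
    max_tokens=400,
    overlap_tokens=50,
)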

# Each ChunkSpan has:
# - text: The chunk content
# - start: char_offset_start (integer)
# - end: char_offset_end (integer)

Next Steps

Search — How chunks are searched
Data Model — Chunk schema details
SDK Utilities — Chunking functions