
SDK Utilities

The SDK provides utilities for two common connector tasks: chunking text and hashing content.

Chunking

from hoard.sdk import chunk_plain_text, ChunkSpan

chunk_plain_text

Whitespace-based chunking with a sliding window:

chunks: list[ChunkSpan] = chunk_plain_text(
    content,
    max_tokens=400,     # Target chunk size
    overlap_tokens=50,  # Overlap between chunks
)
for chunk in chunks:
    print(f"{chunk.text[:50]}...")
    print(f"  Offset: {chunk.start}-{chunk.end}")

ChunkSpan

@dataclass
class ChunkSpan:
    text: str   # Chunk content
    start: int  # char_offset_start
    end: int    # char_offset_end

How Chunking Works

The algorithm:

  1. Split text into tokens using \S+ regex (whitespace-based)
  2. Create windows of max_tokens tokens
  3. Slide window by max_tokens - overlap_tokens
  4. Track character offsets for each chunk

# Example: chunking a document
from hoard.sdk import ChunkInput

content = "Your document content here..."
chunks = chunk_plain_text(content, max_tokens=400, overlap_tokens=50)

# Convert to ChunkInput for yielding
chunk_inputs = [
    ChunkInput(
        content=c.text,
        char_offset_start=c.start,
        char_offset_end=c.end,
    )
    for c in chunks
]
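
The four steps listed above can be sketched in plain Python. This is an illustration only, not the SDK's implementation: the names Span and sketch_chunk are hypothetical, and the handling of the final window and of invalid overlap values is an assumption.

```python
import re
from dataclasses import dataclass


@dataclass
class Span:
    """Hypothetical stand-in for ChunkSpan."""
    text: str
    start: int
    end: int


def sketch_chunk(text: str, max_tokens: int = 400, overlap_tokens: int = 50) -> list[Span]:
    # Assumption: the overlap must be smaller than the window, or it never advances.
    if not 0 <= overlap_tokens < max_tokens:
        raise ValueError("overlap_tokens must be >= 0 and < max_tokens")

    # Step 1: whitespace-based tokens; each match keeps its character offsets.
    tokens = list(re.finditer(r"\S+", text))
    # Step 3: the window advances by max_tokens - overlap_tokens.
    step = max_tokens - overlap_tokens

    spans: list[Span] = []
    for i in range(0, len(tokens), step):
        # Step 2: a window of at most max_tokens tokens.
        window = tokens[i:i + max_tokens]
        # Step 4: offsets run from the first token's start to the last token's end.
        start, end = window[0].start(), window[-1].end()
        spans.append(Span(text[start:end], start, end))
        # Assumption: stop once a window reaches the last token.
        if i + max_tokens >= len(tokens):
            break
    return spans


spans = sketch_chunk("a b c d e f g h i j", max_tokens=4, overlap_tokens=1)
for s in spans:
    print(repr(s.text), s.start, s.end)
```

With a window of 4 and an overlap of 1, each chunk starts with the last token of the previous one, and text[start:end] always reproduces the chunk text.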

Hashing

from hoard.sdk import compute_content_hash

compute_content_hash

SHA-256 hash, truncated to 32 hex characters:

content_hash = compute_content_hash(text)
# "a1b2c3d4e5f6..." (32-char hex)
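
For intuition, a truncated SHA-256 hash can be produced with the standard library alone. This is a sketch under two assumptions the page does not confirm: that the text is encoded as UTF-8, and that the first 32 hex characters are the ones kept. The name sketch_content_hash is hypothetical.

```python
import hashlib


def sketch_content_hash(text: str) -> str:
    # Assumption: UTF-8 encoding; the SDK may normalize or encode differently.
    digest = hashlib.sha256(text.encode("utf-8")).hexdigest()  # 64 hex characters
    # Assumption: truncation keeps the leading 32 characters.
    return digest[:32]


example = sketch_content_hash("hello world")
print(len(example))  # 32
```

Because the assumptions may not match the SDK, use compute_content_hash for values stored in Hoard so that hashes stay comparable across syncs.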

Use this for the content_hash field in EntityInput:

entity = EntityInput(
    source="my_source",
    source_id="doc-123",
    entity_type="document",
    title="My Document",
    content_hash=compute_content_hash(full_text),
)

This enables change detection — Hoard skips re-indexing if the hash matches.
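
The idea behind hash-based change detection can be shown with an in-memory dictionary. This sketch does not describe how Hoard stores or compares hashes: needs_reindex, the stored mapping, and the local content_hash helper are all hypothetical.

```python
import hashlib


def content_hash(text: str) -> str:
    # Hypothetical local stand-in for compute_content_hash.
    return hashlib.sha256(text.encode("utf-8")).hexdigest()[:32]


def needs_reindex(stored: dict[str, str], source_id: str, text: str) -> bool:
    """Return True when the content is new or changed, and record its hash."""
    new_hash = content_hash(text)
    if stored.get(source_id) == new_hash:
        return False  # Same hash as last sync: skip re-indexing.
    stored[source_id] = new_hash
    return True


stored: dict[str, str] = {}
first = needs_reindex(stored, "doc-123", "hello")          # never seen before
second = needs_reindex(stored, "doc-123", "hello")         # unchanged content
third = needs_reindex(stored, "doc-123", "hello, world")   # edited content
print(first, second, third)
```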

Best Practices

Use SDK Chunking

# Good: Consistent sizing with offsets
chunks = chunk_plain_text(content, max_tokens=400)
chunk_inputs = [
    ChunkInput(content=c.text, char_offset_start=c.start, char_offset_end=c.end)
    for c in chunks
]

# Bad: Arbitrary splitting
chunks = content.split("\n\n")  # Inconsistent sizes, no offsets!

Always Hash Content

entity = EntityInput(
    ...,
    content_hash=compute_content_hash(full_text),
)

This enables change detection — unchanged content is skipped on sync.

Handle Errors Gracefully

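import logging

logger = logging.getLogger(__name__)

def scan(self, config):
    for item in items:
        try:
            entity, chunks = process(item)
            yield entity, chunks
        except Exception as e:
            # Log the failure and move on to the next item
            logger.warning("Skipping %s: %s", item, e)
            continue  # Don't crash the whole sync!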
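
The skip-and-continue pattern can be tried in isolation with a stand-in for the per-item work. In this sketch, process and the sample items are hypothetical, and scan is a plain function rather than a connector method.

```python
import logging

logger = logging.getLogger(__name__)


def process(item: str):
    # Hypothetical per-item work: reject blank items, otherwise build a pair.
    if not item.strip():
        raise ValueError("empty item")
    return {"title": item}, [item]


def scan(items):
    for item in items:
        try:
            result = process(item)
        except Exception as exc:
            # One bad item is logged and skipped; the rest still sync.
            logger.warning("Skipping %r: %s", item, exc)
            continue
        yield result


results = list(scan(["a", "  ", "b"]))
print(len(results))  # 2
```

Here the yield sits outside the try block, which keeps the guarded region limited to the call that is expected to fail.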

Next Steps

  • Types — EntityInput, ChunkInput details
  • Examples — See utilities in action