SDK Utilities
The SDK provides utilities for common connector tasks.
Chunking
    from hoard.sdk import chunk_plain_text, ChunkSpan

chunk_plain_text
Whitespace-based chunking with sliding window:
    chunks: list[ChunkSpan] = chunk_plain_text(
        content,
        max_tokens=400,     # Target chunk size
        overlap_tokens=50,  # Overlap between chunks
    )

    for chunk in chunks:
        print(f"{chunk.text[:50]}...")
        print(f" Offset: {chunk.start}-{chunk.end}")

ChunkSpan
    from dataclasses import dataclass

    @dataclass
    class ChunkSpan:
        text: str   # Chunk content
        start: int  # char_offset_start
        end: int    # char_offset_end

How Chunking Works
The algorithm:
- Split text into tokens using the \S+ regex (whitespace-based)
- Create windows of max_tokens tokens
- Slide the window by max_tokens - overlap_tokens
- Track character offsets for each chunk
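
The four steps above can be sketched in plain Python. This is an illustrative reimplementation, not Hoard's actual source: the names Span and sliding_window_chunks are hypothetical, and the handling of the final window and of invalid arguments is an assumption.

```python
import re
from dataclasses import dataclass


@dataclass
class Span:
    """Hypothetical stand-in for ChunkSpan."""
    text: str
    start: int  # character offset of the first token in the window
    end: int    # character offset just past the last token


def sliding_window_chunks(content: str, max_tokens: int = 400,
                          overlap_tokens: int = 50) -> list[Span]:
    if max_tokens <= 0:
        raise ValueError("max_tokens must be positive")
    if not 0 <= overlap_tokens < max_tokens:
        raise ValueError("overlap_tokens must be in [0, max_tokens)")

    # Step 1: whitespace-based tokens, keeping their character positions.
    tokens = list(re.finditer(r"\S+", content))
    # Step 3: how far the window moves between chunks.
    step = max_tokens - overlap_tokens

    spans: list[Span] = []
    for i in range(0, len(tokens), step):
        # Step 2: take up to max_tokens tokens starting at position i.
        window = tokens[i:i + max_tokens]
        # Step 4: offsets come straight from the regex matches.
        start, end = window[0].start(), window[-1].end()
        spans.append(Span(content[start:end], start, end))
        if i + max_tokens >= len(tokens):
            break  # this window already reached the last token
    return spans


spans = sliding_window_chunks("a b c d e f g", max_tokens=3, overlap_tokens=1)
for s in spans:
    print(s.start, s.end, repr(s.text))
```

With max_tokens=3 and overlap_tokens=1 the window advances two tokens at a time, so each chunk shares its first token with the end of the previous one.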
    # Example: chunking a document
    content = "Your document content here..."
    chunks = chunk_plain_text(content, max_tokens=400, overlap_tokens=50)

    # Convert to ChunkInput for yielding
    from hoard.sdk import ChunkInput

    chunk_inputs = [
        ChunkInput(
            content=c.text,
            char_offset_start=c.start,
            char_offset_end=c.end,
        )
        for c in chunks
    ]

Hashing
    from hoard.sdk import compute_content_hash

compute_content_hash
SHA256 hash truncated to 32 hex characters:
    content_hash = compute_content_hash(text)
    # "a1b2c3d4e5f6..." (32-char hex)

Use this for the content_hash field in EntityInput:

    entity = EntityInput(
        source="my_source",
        source_id="doc-123",
        entity_type="document",
        title="My Document",
        content_hash=compute_content_hash(full_text),
    )

This enables change detection: Hoard skips re-indexing if the hash matches.
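
For intuition, a hash of the shape described above (SHA-256 truncated to 32 hex characters) can be produced with the standard library alone. This is a sketch, not the SDK's implementation: the function name short_sha256 is hypothetical, and both the UTF-8 encoding and the choice to keep the first 32 characters are assumptions.

```python
import hashlib


def short_sha256(text: str) -> str:
    """Hypothetical stand-in for compute_content_hash.

    Assumes UTF-8 encoding and that truncation keeps the leading
    32 hex characters of the digest.
    """
    return hashlib.sha256(text.encode("utf-8")).hexdigest()[:32]


digest = short_sha256("Your document content here...")
print(len(digest))                          # 32
print(digest == short_sha256("Your document content here..."))  # True
```

The same input always yields the same 32-character value, which is the property change detection relies on.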
Best Practices
Use SDK Chunking
    # Good: Consistent sizing with offsets
    chunks = chunk_plain_text(content, max_tokens=400)
    chunk_inputs = [
        ChunkInput(content=c.text, char_offset_start=c.start, char_offset_end=c.end)
        for c in chunks
    ]

    # Bad: Arbitrary splitting
    chunks = content.split("\n\n")  # Inconsistent sizes, no offsets!

Always Hash Content
    entity = EntityInput(
        ...,
        content_hash=compute_content_hash(full_text),
    )

This enables change detection: unchanged content is skipped on sync.
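
How a stored hash lets a sync skip unchanged content can be illustrated with a small comparison loop. Everything here is hypothetical: Hoard's real bookkeeping is not shown in this document, so the in-memory dict, the function names, and the stand-in hash are assumptions made only to show the idea.

```python
import hashlib


def short_hash(text: str) -> str:
    # Stand-in for compute_content_hash (assumed: UTF-8, first 32 hex chars).
    return hashlib.sha256(text.encode("utf-8")).hexdigest()[:32]


def needs_reindex(source_id: str, text: str, seen: dict[str, str]) -> bool:
    """Return True when the content is new or changed since the last sync."""
    new_hash = short_hash(text)
    if seen.get(source_id) == new_hash:
        return False          # hash matches: nothing to do
    seen[source_id] = new_hash  # remember the latest version
    return True


seen: dict[str, str] = {}
first = needs_reindex("doc-123", "hello", seen)          # new document
second = needs_reindex("doc-123", "hello", seen)         # unchanged
third = needs_reindex("doc-123", "hello, world", seen)   # edited
print(first, second, third)
```

Only the first and third calls report work to do; the unchanged second sync is skipped.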
Handle Errors Gracefully
    import logging

    logger = logging.getLogger(__name__)

    def scan(self, config):
        for item in items:
            try:
                entity, chunks = process(item)
                yield entity, chunks
            except Exception as e:
                logger.warning(f"Skipping {item}: {e}")
                continue  # Don't crash the whole sync!
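
A self-contained variant of the same pattern makes the effect visible. The process function and the sample items below are hypothetical, chosen only so that one item fails. As a design choice in this sketch, the yield sits outside the try block so that only errors from processing are caught.

```python
import logging

logger = logging.getLogger(__name__)


def process(item: str) -> tuple[str, list[str]]:
    """Hypothetical per-item step that rejects empty input."""
    if not item:
        raise ValueError("empty item")
    return item.upper(), item.split()


def scan(items):
    for item in items:
        try:
            entity, chunks = process(item)
        except Exception as exc:
            # Log and move on so one bad item does not abort the sync.
            logger.warning("Skipping %r: %s", item, exc)
            continue
        yield entity, chunks


results = list(scan(["alpha beta", "", "gamma"]))
print(results)
```

The empty item is logged and skipped, while the two valid items are still yielded.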