SDK Utilities

The SDK provides utilities for two common connector tasks: chunking text and hashing content.

Chunking

from hoard.sdk import chunk_plain_text, ChunkSpan

chunk_plain_text

Splits text on whitespace and groups the tokens into overlapping sliding windows:

chunks: list[ChunkSpan] = chunk_plain_text(
    content,
    max_tokens=400,      # Tokens (whitespace-separated words) per window
    overlap_tokens=50,   # Tokens shared between consecutive chunks
)

for chunk in chunks:
    print(f"{chunk.text[:50]}...")
    print(f"  Offset: {chunk.start}-{chunk.end}")

ChunkSpan

from dataclasses import dataclass

@dataclass
class ChunkSpan:
    text: str   # Chunk content
    start: int  # Start character offset (use as char_offset_start)
    end: int    # End character offset (use as char_offset_end)

How Chunking Works

The algorithm:

Split the text into tokens using the \S+ regex (whitespace-based)
Create windows of max_tokens tokens each
Slide the window forward by max_tokens - overlap_tokens tokens
Track the character offsets of each chunk
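The steps above can be sketched in plain Python. This is an illustrative re-implementation and not the SDK's actual code: the name sketch_chunk_plain_text, the default values, the argument validation, and the rule for ending the final window are all assumptions made for the example.

```python
import re
from dataclasses import dataclass


@dataclass
class ChunkSpan:
    text: str
    start: int
    end: int


def sketch_chunk_plain_text(
    content: str, max_tokens: int = 400, overlap_tokens: int = 50
) -> list[ChunkSpan]:
    """Hypothetical stand-in for chunk_plain_text, for illustration only."""
    if max_tokens <= 0 or not 0 <= overlap_tokens < max_tokens:
        raise ValueError("need max_tokens > 0 and 0 <= overlap_tokens < max_tokens")

    # Step 1: whitespace-based tokens, keeping each match's character span.
    tokens = list(re.finditer(r"\S+", content))
    # Step 3: the window advances by this many tokens each time.
    step = max_tokens - overlap_tokens

    chunks: list[ChunkSpan] = []
    for i in range(0, len(tokens), step):
        # Step 2: take up to max_tokens tokens starting at token i.
        window = tokens[i : i + max_tokens]
        # Step 4: offsets run from the first token's start to the last token's end.
        start, end = window[0].start(), window[-1].end()
        chunks.append(ChunkSpan(content[start:end], start, end))
        if i + max_tokens >= len(tokens):
            break  # Assumption: stop once a window reaches the last token.
    return chunks


demo = sketch_chunk_plain_text("a b c d e f g h i j", max_tokens=4, overlap_tokens=1)
for span in demo:
    print(span.start, span.end, repr(span.text))
```

In this sketch each chunk's text is exactly content[start:end], so the offsets can be used to locate a chunk in the original document. Whether the SDK guarantees the same property is not stated here.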

# Example: chunking a document
content = "Your document content here..."
chunks = chunk_plain_text(content, max_tokens=400, overlap_tokens=50)

# Convert to ChunkInput for yielding
from hoard.sdk import ChunkInput

chunk_inputs = [
    ChunkInput(
        content=c.text,
        char_offset_start=c.start,
        char_offset_end=c.end,
    )
    for c in chunks
]

Hashing

from hoard.sdk import compute_content_hash

compute_content_hash

Returns the SHA-256 hash of the text, truncated to 32 hex characters:

content_hash = compute_content_hash(text)  # avoids shadowing the built-in hash()
# "a1b2c3d4e5f6..."  (32-char hex)
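For intuition, a truncated SHA-256 can be reproduced with the standard library. This is a minimal sketch, assuming the text is encoded as UTF-8 before hashing. The SDK's actual encoding and any normalization are not documented here, so the hypothetical sketch_content_hash below may not produce the same values as compute_content_hash.

```python
import hashlib


def sketch_content_hash(text: str) -> str:
    """Hypothetical stand-in: SHA-256 of the UTF-8 bytes, first 32 hex characters."""
    # Assumption: UTF-8 encoding; the SDK may encode or normalize differently.
    digest = hashlib.sha256(text.encode("utf-8")).hexdigest()  # 64 hex characters
    return digest[:32]


print(sketch_content_hash("Your document content here..."))
```

Keeping 32 hex characters retains 128 bits of the digest, which is ample for detecting changes between versions of a document.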

Use compute_content_hash to populate the content_hash field in EntityInput:

entity = EntityInput(
    source="my_source",
    source_id="doc-123",
    entity_type="document",
    title="My Document",
    content_hash=compute_content_hash(full_text),
)

This enables change detection: Hoard skips re-indexing when the hash matches.
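Hash-based change detection can be pictured as a lookup against the hash recorded on the previous sync. The sketch below is a conceptual model only: Hoard performs this check internally, and the seen_hashes dictionary, the needs_reindex helper, and the UTF-8 hashing are hypothetical names and assumptions introduced for the example.

```python
import hashlib


def sketch_content_hash(text: str) -> str:
    """Hypothetical stand-in for compute_content_hash (assumes UTF-8)."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()[:32]


def needs_reindex(seen_hashes: dict[str, str], source_id: str, full_text: str) -> bool:
    """Return True when the content is new or changed since the last sync."""
    new_hash = sketch_content_hash(full_text)
    if seen_hashes.get(source_id) == new_hash:
        return False  # Same hash as last time: skip re-indexing.
    seen_hashes[source_id] = new_hash  # Record the hash for the next sync.
    return True


seen_hashes: dict[str, str] = {}  # Hypothetical store of hashes from earlier syncs
print(needs_reindex(seen_hashes, "doc-123", "version one"))
print(needs_reindex(seen_hashes, "doc-123", "version one"))
print(needs_reindex(seen_hashes, "doc-123", "version two"))
```

The first and third calls report that indexing is needed, while the second, with identical text, does not. This is why the hash should cover the full text of the entity and not just a title or timestamp.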

Best Practices

Use SDK Chunking

# Good: Consistent sizing with offsets
chunks = chunk_plain_text(content, max_tokens=400)
chunk_inputs = [
    ChunkInput(content=c.text, char_offset_start=c.start, char_offset_end=c.end)
    for c in chunks
]

# Bad: Arbitrary splitting
chunks = content.split("\n\n")  # Inconsistent sizes, no offsets!

Always Hash Content

entity = EntityInput(
    ...,
    content_hash=compute_content_hash(full_text),
)

Hashing every entity lets Hoard skip unchanged content on sync.

Handle Errors Gracefully

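import logging

logger = logging.getLogger(__name__)

def scan(self, config):
    for item in items:  # items: whatever your connector enumerates
        try:
            entity, chunks = process(item)  # process: your per-item conversion
        except Exception as e:
            logger.warning("Skipping %s: %s", item, e)
            continue  # Don't crash the whole sync!
        yield entity, chunks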
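The skip-and-continue pattern can be tried outside a connector. The following is a self-contained sketch, assuming a hypothetical process function that rejects empty items and a free-standing scan that takes its items as an argument. Neither is part of the SDK, and the tuples it yields are plain stand-ins for EntityInput and ChunkInput values.

```python
import logging

logger = logging.getLogger("connector_sketch")


def process(item: str) -> tuple[str, list[str]]:
    """Hypothetical per-item conversion that fails on empty input."""
    if not item:
        raise ValueError("empty item")
    return item.upper(), item.split()


def scan(items):
    """Yield results for good items; log and skip the ones that fail."""
    for item in items:
        try:
            entity, chunks = process(item)
        except Exception as exc:
            logger.warning("Skipping %r: %s", item, exc)
            continue  # One bad item should not end the whole sync.
        yield entity, chunks


results = list(scan(["first doc", "", "second doc"]))
print(results)
```

Keeping the yield outside the try block limits the except clause to failures in process itself, so an exception thrown into the generator at the yield point is not mistaken for a bad item and swallowed.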

Next Steps

Types — EntityInput, ChunkInput details
Examples — See utilities in action