
Search

Hoard uses hybrid search: each query runs through both keyword (BM25) and semantic (vector) matching, and the two rankings are merged into a single result list.

Hybrid Search Pipeline

┌─────────────────────────────────────────┐
│         Query: "meeting notes"          │
└───────────────────┬─────────────────────┘
        ┌───────────┴───────────┐
        ▼                       ▼
┌───────────────┐       ┌───────────────┐
│  BM25 Search  │       │ Vector Search │
│  (keywords)   │       │  (semantic)   │
└───────────────┘       └───────────────┘
        │                       │
        └───────────┬───────────┘
                    ▼
        ┌───────────────────────┐
        │ Reciprocal Rank Fusion│
        │   (merge rankings)    │
        └───────────────────────┘
                    ▼
        ┌───────────────────────┐
        │    Group by Entity    │
        └───────────────────────┘
                    ▼
        ┌───────────────────────┐
        │    Return Results     │
        └───────────────────────┘

Unified Results (Documents + Memory)

Search can return both indexed documents and memory entries. To filter by result type, use the types argument in MCP or the --types / --no-memory flags in the CLI.

Results include:

  • result_type: either entity or memory
  • source: the connector name, or memory for memory entries
  • chunks: memory entries always contain exactly one chunk

BM25 Search

BM25 (Best Matching 25) is a keyword-based ranking algorithm:

  • Exact matching — Finds documents containing query terms
  • Term frequency — More occurrences = higher rank
  • Document length normalization — Long docs don’t dominate
  • IDF weighting — Rare terms matter more

Implemented via SQLite FTS5:

SELECT * FROM chunks_fts
WHERE chunks_fts MATCH 'meeting notes'
ORDER BY rank;
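
The query above can be tried from Python's built-in sqlite3 module. This is a minimal sketch, assuming the local SQLite build includes FTS5; the single-column table layout and the sample rows are hypothetical and are not Hoard's actual schema.

```python
import sqlite3


def bm25_search(conn, query, limit=5):
    """Return (rowid, content) pairs, best BM25 match first."""
    return conn.execute(
        "SELECT rowid, content FROM chunks_fts "
        "WHERE chunks_fts MATCH ? ORDER BY rank LIMIT ?",
        (query, limit),
    ).fetchall()


# Hypothetical in-memory index with one text column.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE VIRTUAL TABLE chunks_fts USING fts5(content)")
conn.executemany(
    "INSERT INTO chunks_fts(content) VALUES (?)",
    [
        ("Meeting notes from the planning session",),
        ("Grocery list: eggs, milk, bread",),
        ("Conference call summary",),
    ],
)

# FTS5 treats space-separated terms as an implicit AND, so only the
# row containing both "meeting" and "notes" matches.
hits = bm25_search(conn, "meeting notes")
```

Note that the "Conference call summary" row is not returned: keyword matching has no notion of "conference" being related to "meeting".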

Strengths:

  • Fast — Pure SQL
  • Precise — Exact keyword matches
  • No model needed — Works immediately

Weaknesses:

  • Literal — “meeting” won’t find “conference”
  • No synonyms — Requires exact terms

Vector Search

Vector search uses embeddings for semantic similarity:

  1. Embed query — Convert to vector (e.g., 384 dimensions)
  2. Compare — Find nearest chunk vectors
  3. Rank — Order by cosine similarity
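
The compare and rank steps can be sketched in plain Python. This is an illustration only: the three-dimensional vectors and chunk ids are made up (real embeddings would come from the model), and the function names are hypothetical rather than part of Hoard.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat a zero vector as dissimilar to everything
    return dot / (norm_a * norm_b)


def rank_chunks(query_vec, chunk_vecs):
    """Brute-force scan: score every chunk, then sort best first."""
    scored = [
        (chunk_id, cosine_similarity(query_vec, vec))
        for chunk_id, vec in chunk_vecs.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy 3-dimensional "embeddings" keyed by hypothetical chunk ids.
chunk_vecs = {
    "a:0": [1.0, 0.0, 0.0],
    "b:0": [0.7, 0.7, 0.0],
    "c:0": [0.0, 0.0, 1.0],
}
ranking = rank_chunks([1.0, 0.1, 0.0], chunk_vecs)
```

Here "a:0" ranks first because it points in nearly the same direction as the query, and "c:0" ranks last because it is orthogonal to it.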

Default model: sentence-transformers/all-MiniLM-L6-v2

Strengths:

  • Semantic — “meeting” finds “conference”
  • Conceptual — Understands meaning
  • Fuzzy — Handles paraphrasing

Weaknesses:

  • Requires model download (~90MB)
  • Slower than pure keyword
  • May miss exact matches

Reciprocal Rank Fusion

RRF merges BM25 and vector rankings:

RRF_score = 1/(k + rank_bm25) + 1/(k + rank_vector)

Where k = 60 by default (the value conventionally used for RRF; configurable via rrf_k)

This ensures:

  • Documents ranked highly by both methods score best
  • Neither method dominates unfairly
  • Diverse results from both approaches
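
The formula can be sketched as a small Python function. It assumes ranks start at 1 and that a chunk missing from one ranking simply receives no contribution from that ranking; the source does not say how Hoard handles that case, and the function name is hypothetical.

```python
def rrf_merge(bm25_ranking, vector_ranking, k=60):
    """Merge two best-first lists of chunk ids with Reciprocal Rank Fusion.

    Returns (chunk_id, rrf_score) pairs sorted by descending score.
    """
    scores = {}
    for ranking in (bm25_ranking, vector_ranking):
        for rank, chunk_id in enumerate(ranking, start=1):
            # Each list contributes 1 / (k + rank) for the chunks it contains.
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


bm25 = ["a", "b", "c"]    # keyword ranking, best first
vector = ["b", "d", "a"]  # semantic ranking, best first
merged = rrf_merge(bm25, vector)
```

In this example "b" (ranks 2 and 1) edges out "a" (ranks 1 and 3), and both beat "c" and "d", which each appear in only one list.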

Corpus Size & Prefilter

Size    Chunks       Backend
Any     All sizes    SQLite brute-force

Hoard uses SQLite for all vector operations (no external vector DBs like FAISS or Chroma).

Prefilter Strategy

When the corpus exceeds 50,000 chunks:

  1. BM25 retrieves the top candidates (the count is set by prefilter_limit)
  2. Vector search runs on those candidates only
  3. The two rankings are merged with RRF

This avoids scanning all embeddings for large corpora.
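
The prefilter steps can be sketched as follows. This is a rough illustration, not Hoard's implementation: the function and parameter names are hypothetical, the dot product stands in for cosine similarity (equivalent for unit-length vectors), and merging against the full BM25 list is an assumption.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def hybrid_search(query_vec, bm25_ranking, embeddings,
                  prefilter_limit=1000, threshold=50_000, k=60):
    """Return chunk ids, best first, after an optional BM25 prefilter."""
    # Step 1: on large corpora, only the top BM25 hits reach the vector stage.
    if len(embeddings) > threshold:
        candidates = bm25_ranking[:prefilter_limit]
    else:
        candidates = list(embeddings)

    # Step 2: score the candidates only, not every stored embedding.
    vector_ranking = sorted(
        candidates, key=lambda cid: dot(query_vec, embeddings[cid]), reverse=True
    )

    # Step 3: merge both rankings with RRF.
    scores = {}
    for ranking in (bm25_ranking, vector_ranking):
        for rank, cid in enumerate(ranking, start=1):
            scores[cid] = scores.get(cid, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)


# Tiny corpus with the threshold lowered so the prefilter path runs.
embeddings = {
    "a": [1.0, 0.0],
    "b": [0.0, 1.0],
    "c": [0.6, 0.8],
    "d": [0.8, 0.6],  # never scored: BM25 did not return it
}
results = hybrid_search([0.0, 1.0], ["a", "b", "c"], embeddings,
                        prefilter_limit=3, threshold=3)
```

The trade-off is visible in the example: chunk "d" is semantically close to the query but never appears, because it was not among the BM25 candidates.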

Search Options

hoard search "meeting notes"

Limit Results

hoard search "query" --limit 5

Filter by Source

hoard search "query" --source obsidian

Filter by Result Type

hoard search "query" --types entity
hoard search "query" --types memory
hoard search "query" --no-memory

Search via MCP

The search tool accepts:

{
  "name": "search",
  "arguments": {
    "query": "meeting notes",
    "limit": 20,
    "types": ["entity", "memory"],
    "source": "obsidian"
  }
}

Result Format

Results are grouped by entity:

{
  "results": [
    {
      "result_type": "entity",
      "entity_id": "abc-123",
      "entity_title": "Project Notes",
      "source": "local_files",
      "uri": "file:///path/to/file.md",
      "chunks": [
        {
          "chunk_id": "abc-123:2",
          "content": "In the meeting, we discussed...",
          "score": 0.87,
          "char_offset_start": 1200,
          "char_offset_end": 1850
        }
      ]
    },
    {
      "result_type": "memory",
      "entity_id": "5421d0503fadb55a413761f3745891ac",
      "entity_title": "user_preferences",
      "source": "memory",
      "memory_key": "user_preferences",
      "chunks": [
        {
          "chunk_id": "5421d0503fadb55a413761f3745891ac",
          "content": "Prefers concise responses.",
          "score": 0.71
        }
      ]
    }
  ],
  "next_cursor": null
}

Note: Results use uri (not entity_uri) and do not include totals — use next_cursor for pagination.
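
A client might walk a response of this shape as sketched below. The helper name is hypothetical and the response is a trimmed copy of the example above; how next_cursor is passed back on a follow-up request is not shown here, so the sketch only reads one page.

```python
def best_hits(response):
    """Return (result_type, title, top_chunk_score) per result, best first."""
    hits = []
    for result in response["results"]:
        # Each result carries one or more chunks; keep the best chunk score.
        top_score = max(chunk["score"] for chunk in result["chunks"])
        hits.append((result["result_type"], result["entity_title"], top_score))
    return sorted(hits, key=lambda hit: hit[2], reverse=True)


# Trimmed copy of the documented response format.
response = {
    "results": [
        {
            "result_type": "memory",
            "entity_title": "user_preferences",
            "source": "memory",
            "chunks": [{"content": "Prefers concise responses.", "score": 0.71}],
        },
        {
            "result_type": "entity",
            "entity_title": "Project Notes",
            "source": "local_files",
            "uri": "file:///path/to/file.md",
            "chunks": [{"content": "In the meeting, we discussed...", "score": 0.87}],
        },
    ],
    "next_cursor": None,
}

hits = best_hits(response)
has_more = response["next_cursor"] is not None
```

Because memory entries carry no uri or character offsets, code that reads those fields should check result_type first or use dict.get.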

Enabling Vector Search

Vector search is optional. To enable it:

# Install dependencies
pip install "hoard[vectors]"  # quotes keep shells such as zsh from expanding the brackets
# Build embeddings
hoard embeddings build

Search Tips

  1. Use multiple terms — “project meeting notes” beats “notes”
  2. Be specific — Include distinguishing words
  3. Try variations — If no results, try synonyms
  4. Check sync — New content needs hoard sync

Performance Tuning

In ~/.hoard/config.yaml:

search:
  rrf_k: 60                  # RRF constant (higher = more even blending)
  max_chunks_per_entity: 3   # Max chunks returned per entity
vectors:
  prefilter_limit: 1000      # BM25 candidates when corpus > 50K chunks
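
To make the effect of max_chunks_per_entity concrete, here is a minimal sketch of grouping scored chunks by entity and keeping only the best few per entity. The function name and the input tuple layout are hypothetical; Hoard's own grouping logic may differ.

```python
from collections import defaultdict


def group_by_entity(scored_chunks, max_chunks_per_entity=3):
    """Group (entity_id, chunk_id, score) tuples, keeping the top chunks.

    Entities appear in order of their best chunk; chunks are best first.
    """
    groups = defaultdict(list)
    ordered = sorted(scored_chunks, key=lambda chunk: chunk[2], reverse=True)
    for entity_id, chunk_id, score in ordered:
        if len(groups[entity_id]) < max_chunks_per_entity:
            groups[entity_id].append((chunk_id, score))
    return dict(groups)


scored = [
    ("abc", "abc:0", 0.5),
    ("abc", "abc:1", 0.9),
    ("abc", "abc:2", 0.7),
    ("xyz", "xyz:0", 0.8),
]
grouped = group_by_entity(scored, max_chunks_per_entity=2)
```

With a cap of 2, the lowest-scoring chunk of entity "abc" is dropped, so one long document cannot crowd out other entities in the result list.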
