Search

Hoard uses hybrid search, which combines keyword (BM25) and semantic (vector) matching and merges the two rankings into a single result list.

Hybrid Search Pipeline

┌─────────────────────────────────────────┐
│            Query: "meeting notes"       │
└───────────────────┬─────────────────────┘
                    │
        ┌───────────┴───────────┐
        ▼                       ▼
┌───────────────┐       ┌───────────────┐
│   BM25 Search │       │ Vector Search │
│   (keywords)  │       │  (semantic)   │
└───────────────┘       └───────────────┘
        │                       │
        └───────────┬───────────┘
                    ▼
        ┌───────────────────────┐
        │ Reciprocal Rank Fusion│
        │    (merge rankings)   │
        └───────────────────────┘
                    │
                    ▼
        ┌───────────────────────┐
        │   Group by Entity     │
        └───────────────────────┘
                    │
                    ▼
        ┌───────────────────────┐
        │    Return Results     │
        └───────────────────────┘

Unified Results (Documents + Memory)

Search can return both indexed documents and memory entries. To filter by result type, use the types argument in MCP or the --types / --no-memory flags in the CLI.

Results include:

result_type, which distinguishes entity results from memory results
source, which holds the connector name, or memory for memory entries
A single chunk for memory entries

BM25 Search

BM25 (Best Matching 25) is a keyword-based ranking algorithm:

Exact matching — Finds documents containing query terms
Term frequency — More occurrences = higher rank
Document length normalization — Long docs don’t dominate
IDF weighting — Rare terms matter more
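
The four properties above can be seen in a small, self-contained scorer. This is a sketch of the textbook BM25 formula, not Hoard's implementation (Hoard delegates scoring to SQLite FTS5); the function name, the whitespace tokenizer, and the sample documents are assumptions made for illustration, and k1 = 1.2, b = 0.75 are the conventional parameter values.

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score each document in `docs` against `query` with textbook BM25."""
    # Assumption: naive lowercase/whitespace tokenization, for illustration only.
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized) / n_docs
    # Document frequency: how many documents contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))

    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue  # exact matching: absent terms contribute nothing
            # IDF weighting: rare terms contribute more.
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Term frequency saturates, and long documents are normalized down.
            norm = tf[term] + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores

docs = [
    "meeting notes from the weekly sync",
    "conference agenda for next week",
    "notes",
]
scores = bm25_scores("meeting notes", docs)
```

The first document matches both query terms and scores highest, the third matches only the more common term, and the second matches neither and scores zero.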

Implemented via SQLite FTS5:

SELECT * FROM chunks_fts
WHERE chunks_fts MATCH 'meeting notes'
ORDER BY rank;
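
To try the query above outside Hoard, the following sketch builds a throwaway FTS5 table with Python's sqlite3 module. The single-column schema and the sample rows are assumptions (Hoard's actual chunks_fts schema is not shown here), and it requires a SQLite build with FTS5 enabled. In FTS5, space-separated terms are combined with an implicit AND, and rank is a BM25-based value where lower means a better match, which is why a plain ORDER BY rank returns the best rows first.

```python
import sqlite3

def search_fts(conn, query, limit=10):
    """Return (rowid, content, rank) rows, best match first."""
    return conn.execute(
        "SELECT rowid, content, rank FROM chunks_fts "
        "WHERE chunks_fts MATCH ? ORDER BY rank LIMIT ?",
        (query, limit),
    ).fetchall()

conn = sqlite3.connect(":memory:")
# Assumption: a minimal one-column table, not Hoard's real schema.
conn.execute("CREATE VIRTUAL TABLE chunks_fts USING fts5(content)")
conn.executemany(
    "INSERT INTO chunks_fts(content) VALUES (?)",
    [
        ("Meeting notes from the weekly sync",),
        ("Conference agenda for next week",),
        ("Notes on the budget meeting",),
    ],
)
rows = search_fts(conn, "meeting notes")
for rowid, content, rank in rows:
    print(rowid, content, rank)
```

Only the first and third rows are returned, because the second contains neither query term.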

Strengths:

Fast — Pure SQL
Precise — Exact keyword matches
No model needed — Works immediately

Weaknesses:

Literal — “meeting” won’t find “conference”
No synonyms — Requires exact terms

Vector Search

Vector search uses embeddings for semantic similarity:

Embed query — Convert to vector (e.g., 384 dimensions)
Compare — Find nearest chunk vectors
Rank — Order by cosine similarity

Default model: sentence-transformers/all-MiniLM-L6-v2
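
The compare and rank steps can be sketched without any model. The example below uses hand-written 3-dimensional toy vectors in place of real embeddings (the default model produces 384 dimensions), and the function names are hypothetical rather than part of Hoard.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # a zero vector has no direction to compare
    return dot / (norm_a * norm_b)

def rank_by_similarity(query_vec, chunk_vecs, top_k=5):
    """Brute-force nearest neighbours: score every chunk, keep the best."""
    scored = [
        (chunk_id, cosine_similarity(query_vec, vec))
        for chunk_id, vec in chunk_vecs.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]

# Toy vectors standing in for embeddings; the values are invented.
chunk_vecs = {
    "meeting": [0.9, 0.1, 0.0],
    "conference": [0.8, 0.3, 0.1],
    "recipe": [0.0, 0.2, 0.9],
}
ranked = rank_by_similarity([1.0, 0.2, 0.0], chunk_vecs, top_k=2)
```

With these vectors, "meeting" and "conference" point in nearly the same direction as the query and rank first and second, while "recipe" is almost orthogonal and is cut off by top_k.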

Strengths:

Semantic — “meeting” finds “conference”
Conceptual — Understands meaning
Fuzzy — Handles paraphrasing

Weaknesses:

Requires model download (~90MB)
Slower than pure keyword
May miss exact matches

Reciprocal Rank Fusion

RRF merges BM25 and vector rankings:

RRF_score = 1/(k + rank_bm25) + 1/(k + rank_vector)

Where k = 60 (the standard constant, configurable via rrf_k) and each rank is the document's position in that method's result list.

This ensures:

Documents ranked highly by both methods score best
Neither method dominates unfairly
Diverse results from both approaches
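
A minimal sketch of the formula, assuming ranks start at 1 and that a document missing from one ranking simply receives no contribution from it (Hoard's handling of that case is not documented here). The function name and the sample IDs are hypothetical.

```python
def rrf_merge(bm25_ranking, vector_ranking, k=60):
    """Merge two ranked lists of IDs with Reciprocal Rank Fusion."""
    scores = {}
    for ranking in (bm25_ranking, vector_ranking):
        # Assumption: ranks are 1-based positions in each list.
        for rank, item in enumerate(ranking, start=1):
            scores[item] = scores.get(item, 0.0) + 1.0 / (k + rank)
    # Highest fused score first.
    return sorted(scores.items(), key=lambda pair: pair[1], reverse=True)

bm25_ranking = ["a", "b", "c"]
vector_ranking = ["b", "d", "a"]
merged = rrf_merge(bm25_ranking, vector_ranking)
```

Here "b" (ranks 2 and 1) edges out "a" (ranks 1 and 3), and both beat "d" and "c", which appear in only one list. Because k is large relative to the ranks, the gap between adjacent positions is small, so appearing in both lists matters more than being first in one.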

Corpus Size & Prefilter

Size	Chunks	Backend
Any	All sizes	SQLite brute-force

Hoard uses SQLite for all vector operations (no external vector stores or libraries such as FAISS or Chroma).

Prefilter Strategy

When the corpus exceeds 50,000 chunks:

BM25 retrieves the top candidates (the count is configurable via prefilter_limit)
Vector search runs on those candidates only
The results are merged with RRF

This avoids scanning all embeddings for large corpora.
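
The decision can be sketched as a single function. The name vector_search_scope and its return convention (None meaning a full scan) are hypothetical; only the 50,000-chunk threshold and the prefilter_limit setting come from this page.

```python
def vector_search_scope(bm25_ranking, corpus_size,
                        prefilter_limit=1000, threshold=50_000):
    """Return the chunk IDs the vector stage should score.

    None means "scan every embedding"; a list means "score only these".
    """
    if corpus_size <= threshold:
        return None  # small corpus: brute-force over all chunks
    # Large corpus: keep only the top BM25 candidates.
    return bm25_ranking[:prefilter_limit]

small = vector_search_scope(["c1", "c2"], corpus_size=10_000)
large = vector_search_scope([f"c{i}" for i in range(5000)], corpus_size=80_000)
```

The trade-off is that a chunk with no keyword overlap cannot reach the vector stage once the prefilter is active, so raising prefilter_limit buys recall at the cost of more similarity computations.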

Search Options

Basic Search

hoard search "meeting notes"

Limit Results

hoard search "query" --limit 5

Filter by Source

hoard search "query" --source obsidian

Filter by Result Type

hoard search "query" --types entity
hoard search "query" --types memory
hoard search "query" --no-memory

Search via MCP

The search tool accepts:

{
  "name": "search",
  "arguments": {
    "query": "meeting notes",
    "limit": 20,
    "types": ["entity", "memory"],
    "source": "obsidian"
  }
}
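
For clients that build requests by hand, the tool call above is wrapped in a JSON-RPC envelope using the MCP tools/call method. The sketch below only constructs the message; the helper name is hypothetical, treating limit, types, and source as optional is an assumption, and how the message is delivered depends on the transport your MCP client uses.

```python
import json

def build_search_request(query, limit=20, types=None, source=None, request_id=1):
    """Build an MCP tools/call message for the search tool."""
    arguments = {"query": query, "limit": limit}
    # Assumption: filters are optional and omitted when unset.
    if types is not None:
        arguments["types"] = list(types)
    if source is not None:
        arguments["source"] = source
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": "search", "arguments": arguments},
    }

request = build_search_request(
    "meeting notes", types=["entity", "memory"], source="obsidian"
)
payload = json.dumps(request)
```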

Result Format

Results are grouped by entity:

{
  "results": [
    {
      "result_type": "entity",
      "entity_id": "abc-123",
      "entity_title": "Project Notes",
      "source": "local_files",
      "uri": "file:///path/to/file.md",
      "chunks": [
        {
          "chunk_id": "abc-123:2",
          "content": "In the meeting, we discussed...",
          "score": 0.87,
          "char_offset_start": 1200,
          "char_offset_end": 1850
        }
      ]
    },
    {
      "result_type": "memory",
      "entity_id": "5421d0503fadb55a413761f3745891ac",
      "entity_title": "user_preferences",
      "source": "memory",
      "memory_key": "user_preferences",
      "chunks": [
        {
          "chunk_id": "5421d0503fadb55a413761f3745891ac",
          "content": "Prefers concise responses.",
          "score": 0.71
        }
      ]
    }
  ],
  "next_cursor": null
}

Note: Results use uri (not entity_uri) and do not include totals — use next_cursor for pagination.
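
Since there is no total count, a client has to follow next_cursor until it is null. The loop below is a sketch under two assumptions that this page does not confirm: that the cursor is passed back as an argument named cursor, and that call_search is some function of yours that sends the arguments and returns the parsed response. The fake_search stub and its page contents exist only to make the example runnable.

```python
def search_all(call_search, query):
    """Yield every result across pages by following next_cursor."""
    cursor = None
    while True:
        args = {"query": query}
        if cursor is not None:
            args["cursor"] = cursor  # assumed argument name
        response = call_search(args)
        yield from response["results"]
        cursor = response.get("next_cursor")
        if cursor is None:
            break  # null cursor: no more pages

# Stub standing in for a real client; the data is invented.
pages = {
    None: {
        "results": [{"result_type": "entity", "source": "local_files",
                     "uri": "file:///path/to/file.md", "chunks": []}],
        "next_cursor": "page-2",
    },
    "page-2": {
        "results": [{"result_type": "memory", "source": "memory", "chunks": []}],
        "next_cursor": None,
    },
}

def fake_search(args):
    return pages[args.get("cursor")]

results = list(search_all(fake_search, "meeting notes"))
# The memory result in the sample response has no uri, so read it defensively.
uris = [result.get("uri") for result in results]
```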

Enabling Vector Search

Vector search is optional. To enable:

# Install dependencies (quotes keep shells such as zsh from globbing the brackets)
pip install "hoard[vectors]"

# Build embeddings
hoard embeddings build

Search Tips

Use multiple terms — “project meeting notes” beats “notes”
Be specific — Include distinguishing words
Try variations — If no results, try synonyms
Check sync — New content needs hoard sync

Performance Tuning

In ~/.hoard/config.yaml:

search:
  rrf_k: 60                # RRF constant (higher = more even blending)
  max_chunks_per_entity: 3 # Max chunks returned per entity

vectors:
  prefilter_limit: 1000    # BM25 candidates when corpus > 50K chunks
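
To illustrate what max_chunks_per_entity controls, here is a sketch of the grouping step. It assumes scored chunks arrive as (entity_id, chunk_id, score) tuples and that entities are ordered by their best chunk; both are illustrative choices, not confirmed Hoard behaviour, and the function name and sample IDs are hypothetical.

```python
from collections import defaultdict

def group_by_entity(scored_chunks, max_chunks_per_entity=3):
    """Group (entity_id, chunk_id, score) tuples and cap chunks per entity."""
    grouped = defaultdict(list)
    for entity_id, chunk_id, score in scored_chunks:
        grouped[entity_id].append((chunk_id, score))

    results = []
    for entity_id, chunks in grouped.items():
        chunks.sort(key=lambda chunk: chunk[1], reverse=True)
        results.append({
            "entity_id": entity_id,
            "chunks": chunks[:max_chunks_per_entity],  # keep the best few
        })
    # Assumption: an entity ranks by its highest-scoring chunk.
    results.sort(key=lambda result: result["chunks"][0][1], reverse=True)
    return results

scored = [
    ("abc-123", "abc-123:0", 0.40),
    ("abc-123", "abc-123:2", 0.87),
    ("abc-123", "abc-123:1", 0.55),
    ("abc-123", "abc-123:3", 0.10),
    ("xyz-9", "xyz-9:0", 0.60),
]
grouped = group_by_entity(scored)
```

With the cap at 3, the lowest-scoring chunk of abc-123 is dropped, and abc-123 is listed before xyz-9 because its best chunk scores higher.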

Next Steps

MCP Interface — How AI tools use search
Configuration — Search settings
MCP Tools — All search parameters