# Search Modes

Text Search supports 15 search modes, each with its own algorithm (plus a multi-field BM25F variant and a batch helper). Any combination can be fused via hybrid search using Reciprocal Rank Fusion (RRF) or linear combination.

---

## BM25 Search

BM25 (Best Matching 25) is the default ranking algorithm. It is a probabilistic relevance model that scores each document based on term frequency (TF), inverse document frequency (IDF), and document length normalization.

### When to Use

- **General-purpose full-text search** -- the best default for most use cases.
- **Ranked document retrieval** -- when you need results ordered by relevance.
- **Search across corpora with varying document lengths** -- length normalization prevents long documents from dominating results.

### Formula

```
// BM25 score for document d given query Q = {q1, q2, ..., qn}

score(d, Q) = SUM_i IDF(q_i) * TF_norm(q_i, d)

// where IDF (Lucene variant, the default; always positive):
IDF(q) = log(1 + (N - df(q) + 0.5) / (df(q) + 0.5))

// and TF_norm (standard Lucene/ATIRE/Robertson):
TF_norm(q, d) = tf(q, d) * (k1 + 1) / (tf(q, d) + k1 * (1 - b + b * |d| / avgdl))

// Parameters:
//   N         = number of documents in the corpus
//   df(q)     = number of documents containing q
//   tf(q, d)  = raw term frequency of q in document d
//   k1        = term frequency saturation (default: 1.5)
//               higher k1 = TF matters more; lower k1 = TF saturates faster
//   b         = length normalization (default: 0.75)
//               b=0: no length normalization; b=1: full normalization
//   |d|       = document length in tokens
//   avgdl     = average document length across corpus
```

### BM25+ Variant

Adds a delta term that prevents long documents from being unfairly penalized. Good for corpora with high length variance.

```
// BM25+ adds delta as a lower bound on the TF contribution of any matching term:
TF_norm(q, d) = delta + tf(q, d) * (k1 + 1) / (tf(q, d) + k1 * (1 - b + b * |d| / avgdl))

// delta (default: 0.5) ensures a matching term contributes at least delta,
// even in very long documents where the length-normalized TF approaches 0
```

### BM25L Variant

Modified length normalization that is less aggressive on long documents.

```
// BM25L applies delta differently, adjusting the normalized TF:
ctd = tf(q, d) / (1 - b + b * |d| / avgdl)
TF_norm(q, d) = (k1 + 1) * (ctd + delta) / (k1 + ctd + delta)
```
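The three TF normalizations can be compared side by side. The following is a minimal Python sketch of the formulas on this page, not the engine's implementation; the function names are illustrative, and the defaults (`k1 = 1.5`, `b = 0.75`, `delta = 0.5`) are the ones quoted here.

```python
K1, B, DELTA = 1.5, 0.75, 0.5  # defaults quoted on this page


def length_norm(doc_len, avgdl, b=B):
    """Shared length factor: 1 - b + b * |d| / avgdl."""
    return 1.0 - b + b * doc_len / avgdl


def tf_norm_bm25(tf, doc_len, avgdl, k1=K1, b=B):
    """Standard BM25 term-frequency normalization."""
    return tf * (k1 + 1) / (tf + k1 * length_norm(doc_len, avgdl, b))


def tf_norm_bm25_plus(tf, doc_len, avgdl, k1=K1, b=B, delta=DELTA):
    """BM25+: delta is added on top of the standard normalized TF."""
    return delta + tf_norm_bm25(tf, doc_len, avgdl, k1, b)


def tf_norm_bm25l(tf, doc_len, avgdl, k1=K1, b=B, delta=DELTA):
    """BM25L: delta shifts the length-normalized TF before saturation."""
    ctd = tf / length_norm(doc_len, avgdl, b)
    return (k1 + 1) * (ctd + delta) / (k1 + ctd + delta)


# One occurrence in an average-length document vs. a very long one (avgdl = 8)
for name, fn in [("BM25", tf_norm_bm25), ("BM25+", tf_norm_bm25_plus), ("BM25L", tf_norm_bm25l)]:
    print(name, round(fn(1, 8, 8.0), 3), round(fn(1, 800, 8.0), 3))
```

For the very long document, plain BM25 drops towards zero, while BM25+ stays above `delta` and BM25L decays more gently.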

### Example Calculation

```
// Corpus: 3 documents, avgdl = 8.0
// Query: "quick fox"
// Doc1: "the quick brown fox jumps over the lazy dog" (9 tokens)
// k1 = 1.5, b = 0.75
//
// Term "quick" in Doc1: tf=1, df=1, N=3
//   IDF = log(1 + (3 - 1 + 0.5) / (1 + 0.5)) = log(1 + 1.667) = 0.981
//   TF_norm = 1 * 2.5 / (1 + 1.5 * (1 - 0.75 + 0.75 * 9/8)) = 2.5 / 2.641 = 0.947
//   contribution = 0.981 * 0.947 = 0.929
//
// Term "fox" in Doc1: tf=1, df=2, N=3
//   IDF = log(1 + (3 - 2 + 0.5) / (2 + 0.5)) = log(1 + 0.6) = 0.470
//   TF_norm = 1 * 2.5 / (1 + 1.5 * 1.094) = 2.5 / 2.641 = 0.947
//   contribution = 0.470 * 0.947 = 0.445
//
// Total BM25 score for Doc1 = 0.929 + 0.445 = 1.374
```
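The hand calculation can be checked with a few lines of Python. This is a sketch under the same inputs (`N = 3`, `avgdl = 8.0`, `k1 = 1.5`, `b = 0.75`); the helper names are illustrative and are not part of the library.

```python
import math


def lucene_idf(n_docs, df):
    """Lucene-style IDF: log(1 + (N - df + 0.5) / (df + 0.5))."""
    return math.log(1 + (n_docs - df + 0.5) / (df + 0.5))


def tf_norm(tf, doc_len, avgdl, k1=1.5, b=0.75):
    """Standard BM25 term-frequency normalization."""
    return tf * (k1 + 1) / (tf + k1 * (1 - b + b * doc_len / avgdl))


def bm25_score(query_terms, doc_tf, doc_len, avgdl, df, n_docs):
    """Sum IDF * TF_norm over the query terms present in the document."""
    score = 0.0
    for term in query_terms:
        tf = doc_tf.get(term, 0)
        if tf == 0:
            continue  # absent terms contribute nothing
        score += lucene_idf(n_docs, df[term]) * tf_norm(tf, doc_len, avgdl)
    return score


# Doc1 from the example: 9 tokens, one occurrence each of "quick" and "fox"
doc1_tf = {"quick": 1, "fox": 1}
df = {"quick": 1, "fox": 2}
score = bm25_score(["quick", "fox"], doc1_tf, doc_len=9, avgdl=8.0, df=df, n_docs=3)
print(round(score, 3))
```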

### Compact Scoring with TF Cache

The BM25 engine uses compact posting arrays with a TF cache for efficient scoring:

1. **TF Cache**: A 256-bucket cache maps fieldnorm bucket IDs to precomputed BM25 denominators. Each document's length is encoded as a bucket ID (0-255) via Tantivy-style logarithmic encoding.
2. **Batch scoring**: For each query term, the posting list is traversed as parallel arrays (`postingEntries`, `postingTFs`, `postingFieldnormIds`). Scoring uses the TF cache for O(1) normalization lookup per document.
3. **WAND pruning**: Block-level max scores (`postingBlockMaxScores`, 64-entry blocks) enable skipping entire blocks whose upper-bound score cannot beat the current top-k threshold.
4. **AVX2 vectorization**: Batch scoring processes 4 documents per SIMD cycle on supported platforms.
5. **Fused multi-term search**: The `search_compact` C function fuses term resolution, batch scoring, score accumulation, and top-k selection into a single pass with software prefetching.
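The idea behind the TF cache in step 1 can be sketched in Python. This is only an illustration: `decode_bucket` is a hypothetical stand-in for the real fieldnorm table (which is not reproduced on this page), and the function names are assumptions. The point it shows is that the length-dependent part of the BM25 denominator is computed once per bucket, so scoring a posting needs one table lookup.

```python
from bisect import bisect_right

K1, B = 1.5, 0.75


def decode_bucket(bucket):
    """Hypothetical bucket -> length decoding (NOT the real table):
    exact below 32 tokens, then roughly 4.5% growth per bucket."""
    return bucket if bucket < 32 else int(32 * 1.045 ** (bucket - 32))


LENGTHS = [decode_bucket(i) for i in range(256)]


def encode_length(doc_len):
    """Largest bucket whose decoded length does not exceed doc_len."""
    return max(0, bisect_right(LENGTHS, doc_len) - 1)


def build_tf_cache(avgdl):
    """Precompute k1 * (1 - b + b * len / avgdl) once per bucket."""
    return [K1 * (1 - B + B * length / avgdl) for length in LENGTHS]


def tf_norm_cached(tf, bucket, cache):
    """BM25 TF normalization using the cached denominator term."""
    return tf * (K1 + 1) / (tf + cache[bucket])


cache = build_tf_cache(avgdl=8.0)
print(tf_norm_cached(1, encode_length(9), cache))
```

Because long documents share buckets, the cached value is an approximation of the exact length normalization for them; short documents in this sketch are encoded exactly.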

### GCL Examples

#### Basic BM25

```gcl
// Add documents
index.add("The quick brown fox jumps over the lazy dog", "doc1");
index.add("Machine learning algorithms for text analysis", "doc2");
index.add("Natural language processing with neural networks", "doc3");
index.build();

// Search
var results = index.search_bm25("machine learning", 10);

for (var i = 0; i < results.size(); i++) {
    var r = results[i];
    info("${r.score}: ${r.key}");
}
```

#### BM25 Variants

Five BM25 variants are available, each with different trade-offs for term frequency saturation and length normalization.

**Lucene (default)** -- Always produces positive IDF values. The recommended default for most use cases.

```gcl
var _indexLucene = TextIndex<String> {
    config: TextIndexConfig {
        bm25: BM25Options { variant: BM25Variant::lucene }
    }
};
```

**BM25+** -- Adds a delta term that prevents long documents from being unfairly penalized.

```gcl
var _indexPlus = TextIndex<String> {
    config: TextIndexConfig {
        bm25: BM25Options { variant: BM25Variant::plus, delta: 0.5 }
    }
};
```

**BM25L** -- Modified length normalization that is less aggressive on long documents.

```gcl
var _indexL = TextIndex<String> {
    config: TextIndexConfig {
        bm25: BM25Options { variant: BM25Variant::bm25l }
    }
};
```

**ATIRE** -- Simple TF/IDF ratio variant used by the ATIRE search engine.

```gcl
var _indexATIRE = TextIndex<String> {
    config: TextIndexConfig {
        bm25: BM25Options { variant: BM25Variant::atire }
    }
};
```

**Robertson** -- The original BM25 formulation. Can produce negative IDF for very frequent terms.

```gcl
var _indexRobertson = TextIndex<String> {
    config: TextIndexConfig {
        bm25: BM25Options { variant: BM25Variant::robertson }
    }
};
```

#### Boosted Search (Term Weighting)

Use `SearchOptions.termBoosts` to boost specific terms. Higher weights make a term more important in the ranking.

```gcl
var boosts = Array<TermBoost> {};
boosts.add(TermBoost { term: "machine", boost: 2.0 });
boosts.add(TermBoost { term: "data", boost: 0.5 });

var modes = Array<SearchMode> {};
modes.add(SearchMode::bm25);

var options = SearchOptions { modes: modes, termBoosts: boosts };
var _results = index.search("machine learning data", 10, options);
// "machine" gets 2x weight, "data" gets 0.5x weight
```

#### Batch Search

Execute multiple queries at once. Useful for running a set of predefined queries against the same index.

```gcl
var queries = Array<String> {};
queries.add("machine learning");
queries.add("neural networks");
queries.add("text analysis");

// Execute multiple searches
var allResults = index.search_bm25_batch(queries, 10);

for (var i = 0; i < allResults.size(); i++) {
    info("Query ${i}: ${allResults[i].size()} results");
}
```

#### Subset Search

Restrict search to a specific set of document keys via `SearchOptions.filter`.

```gcl
// Search only within a specific subset of documents.
// `filter` matches against the original document text (the first argument
// passed to `add()`), normalized the same way add() does it.
var allowedKeys = Array<String> {};
allowedKeys.add("The quick brown fox jumps over the lazy dog");
allowedKeys.add("Natural language processing with neural networks");

var modes = Array<SearchMode> {};
modes.add(SearchMode::bm25);

var options = SearchOptions { modes: modes, filter: allowedKeys };
var _results = index.search("query", 10, options);
```

#### Score Explanation

Debug and understand BM25 scoring with a full breakdown of each term's contribution.

```gcl
// Debug BM25 scoring. The second argument is the original document text
// that was passed to `add()`, not the stored value.
var explanation = index.explain("machine learning", "Machine learning algorithms for text analysis");

if (explanation != null) {
    info("Total Score: ${explanation.totalScore}");
    info("Variant: ${explanation.variant}");
    info("k1: ${explanation.k1}, b: ${explanation.b}");
    info("Doc Length: ${explanation.docLen}");
    info("Avg Doc Length: ${explanation.avgDocLen}");

    for (var i = 0; i < explanation.terms.size(); i++) {
        var term = explanation.terms[i];
        info("  Term: ${term.term}");
        info("    TF: ${term.tf}, IDF: ${term.idf}");
        info("    TF Norm: ${term.tfNorm}, Score: ${term.score}");
    }
}
```

### Configuration Parameters

All under `bm25: BM25Options` in `TextIndexConfig`.

| Parameter | Default | Description |
|-----------|---------|-------------|
| `bm25.variant` | `BM25Variant::lucene` | BM25 variant to use (lucene, plus, bm25l, atire, robertson) |
| `bm25.k1` | `1.5` | Term frequency saturation; higher = TF matters more |
| `bm25.b` | `0.75` | Length normalization; 0 = none, 1 = full |
| `bm25.delta` | `0.5` | Delta for BM25+ and BM25L variants |

---

## BM25F Search (Multi-Field)

BM25F (BM25 with Fields) extends BM25 to support multi-field documents where different fields have different importance weights. For example, a match in the title might be worth 3x more than a match in the body.

### When to Use

- **Structured documents** with distinct fields (title, body, tags, abstract).
- **Field-weighted ranking** -- when a match in one field should count more than in another.
- **Product catalogs, research papers, or CMS content** where metadata fields carry higher signal.

### Formula

```
// BM25F weighted term frequency:
TF_weighted(q, d) = SUM_f weight_f * tf(q, d, f) / (1 - b_f + b_f * |d_f| / avgdl_f)

// Then use weighted TF in standard BM25:
score(d, Q) = SUM_i IDF(q_i) * TF_weighted(q_i, d) * (k1 + 1) / (TF_weighted(q_i, d) + k1)

// Configuration example (typed field refs into a Doc type):
fields: [
    FieldConfig { f: Doc::title, weight: 3.0, fieldB: 0.4 },
    FieldConfig { f: Doc::body,  weight: 1.0, fieldB: 0.75 },
    FieldConfig { f: Doc::tags,  weight: 2.0, fieldB: 0.0 }
]
```
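The two-step BM25F computation can be sketched in Python as follows. The `Field` class, the function names, and the sample average lengths are illustrative assumptions; only the formulas come from this page.

```python
from dataclasses import dataclass


@dataclass
class Field:
    weight: float   # relative importance of the field
    b: float        # per-field length normalization
    avg_len: float  # average length of this field across the corpus


def weighted_tf(tf_by_field, len_by_field, fields):
    """TF_weighted = SUM_f weight_f * tf_f / (1 - b_f + b_f * |d_f| / avgdl_f)."""
    total = 0.0
    for name, field in fields.items():
        tf = tf_by_field.get(name, 0)
        if tf:
            norm = 1 - field.b + field.b * len_by_field[name] / field.avg_len
            total += field.weight * tf / norm
    return total


def bm25f_term_score(idf, tf_by_field, len_by_field, fields, k1=1.5):
    """Saturate the weighted TF once, after combining all fields."""
    tf_w = weighted_tf(tf_by_field, len_by_field, fields)
    return idf * tf_w * (k1 + 1) / (tf_w + k1)


fields = {
    "title": Field(weight=3.0, b=0.4, avg_len=4.0),
    "body": Field(weight=1.0, b=0.75, avg_len=50.0),
}
lens = {"title": 4, "body": 50}
title_hit = bm25f_term_score(1.0, {"title": 1}, lens, fields)
body_hit = bm25f_term_score(1.0, {"body": 1}, lens, fields)
print(title_hit, body_hit)
```

Because saturation is applied after the fields are summed, a 3x title weight does not produce a 3x score: the title match scores higher than the body match, but by less than the weight ratio.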

### GCL Examples

#### Basic Field Weighting

```gcl
type Article { title: String; body: String; tags: String?; }

var fields = Array<FieldConfig> {};
fields.add(FieldConfig { f: Article::title, weight: 3.0 });
fields.add(FieldConfig { f: Article::body,  weight: 1.0 });
fields.add(FieldConfig { f: Article::tags,  weight: 2.0 });

var index = TextIndex<Article> {
    config: TextIndexConfig {
        fields: fields,
        stopWords: StopWordOptions { mode: StopWordMode::default }
    }
};

index.add_fields(Article {
    title: "Machine Learning Tutorial",
    body: "Learn ML algorithms and techniques",
    tags: "ml ai tutorial"
});
index.add_fields(Article {
    title: "Natural Language Processing",
    body: "NLP with transformers and attention",
    tags: "nlp transformers"
});

index.build();

// Search with field weighting
var _results = index.search_bm25_f("machine learning", 10);
// Title matches score higher due to 3x weight
```

#### Field-Specific Length Normalization

Each field can have its own length normalization parameter (`fieldB`). Shorter fields like titles benefit from less aggressive normalization.

```gcl
type Paper { title: String; abstract: String; body: String; }

var fields = Array<FieldConfig> {};
fields.add(FieldConfig { f: Paper::title,    weight: 3.0, fieldB: 0.3  }); // Less length penalty for titles
fields.add(FieldConfig { f: Paper::body,     weight: 1.0, fieldB: 0.75 }); // Standard length normalization
fields.add(FieldConfig { f: Paper::abstract, weight: 2.0, fieldB: 0.5  });

var _index = TextIndex<Paper> {
    config: TextIndexConfig {
        fields: fields,
        stopWords: StopWordOptions { mode: StopWordMode::default },
        bm25: BM25Options { b: 0.75 }  // Global default
    }
};
```

### Configuration Parameters

| Parameter | Default | Description |
|-----------|---------|-------------|
| `fields` | `null` (auto-discover) | Array of `FieldConfig` entries with `f` (typed field ref), `weight`, and optional `fieldB`. When null, `add_fields` auto-discovers every `String`/`String?` field on `T` at weight 1.0. |
| `FieldConfig.weight` | `1.0` | Relative importance of this field |
| `FieldConfig.fieldB` | (global `bm25.b`) | Per-field length normalization override |

---

## Semantic Search

Vector similarity search using AI embeddings. The query text is embedded using the same model as the documents, then the `VectorIndex` finds the nearest neighbors by cosine similarity.

### When to Use

- **Conceptual matching** -- finding documents related by meaning even when they use different words.
- **Natural language queries** -- users searching with conversational language rather than keywords.
- **Cross-lingual or synonym-heavy retrieval** -- embeddings capture semantic relationships that keyword search misses.

### How It Works

Semantic search requires a user-provided embedding function (`config.embed`) that converts text to a `Tensor` vector. During indexing, each document is embedded and stored in a `VectorIndex`. At query time, the query is embedded with the same function and nearest neighbors are found by vector distance.

```
// Distance-to-similarity conversion:
similarity = 1.0 / (1.0 + distance)

// Steps:
// 1. Embed query text using user-provided embed function
// 2. Search VectorIndex for k nearest neighbors
// 3. Score = 1.0 / (1.0 + distance)
//
// With chunk-based embedding, matches return the most relevant chunk
// and link back to the parent IndexEntry for full document access
```
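The scoring steps above can be sketched with a brute-force nearest-neighbor scan in Python. This is a simplification: a real `VectorIndex` would not scan every vector, and the use of cosine distance (`1 - cosine similarity`) as the `distance` value is an assumption made for this example.

```python
import math


def cosine_distance(a, b):
    """1 - cosine similarity; assumes non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm


def semantic_search(query_vec, doc_vecs, k):
    """Score every document as 1 / (1 + distance) and keep the top k."""
    scored = [
        (1.0 / (1.0 + cosine_distance(query_vec, vec)), key)
        for key, vec in doc_vecs.items()
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


# Toy 2-dimensional "embeddings"
docs = {"doc1": [1.0, 0.0], "doc2": [0.6, 0.8], "doc3": [0.0, 1.0]}
hits = semantic_search([1.0, 0.0], docs, k=2)
for score, key in hits:
    print(key, round(score, 3))
```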

### GCL Examples

#### Basic Semantic Search

```gcl
// Define an embedding function (e.g., using the ai library)
fn my_embed(text: String): Tensor {
    // Call your embedding model here; `model` stands for a model handle
    // that you load elsewhere.
    return ai::embed(text, model);
}

var index = TextIndex<String> {
    config: TextIndexConfig {
        // Provide a function that maps text to a Tensor
        embed: my_embed,
        stopWords: StopWordOptions { mode: StopWordMode::none }
    }
};

index.add("Machine learning is a subset of artificial intelligence", "doc1");
index.add("Neural networks learn from data patterns", "doc2");
index.build();

// Semantic similarity search
var results = index.search_semantic("AI and deep learning", 10);
// Finds conceptually similar documents even without exact terms
```

#### With Pre-Computed Vectors

Use `add_batch()` with pre-computed embeddings to skip model inference during indexing:

```gcl
var entries = Array<TextEntry> {};
entries.add(TextEntry { key: "doc1 text", value: "doc1", vector: precomputedTensor1 });
entries.add(TextEntry { key: "doc2 text", value: "doc2", vector: precomputedTensor2 });
index.add_batch(entries);
index.build();
```

#### With Chunking

For long documents, chunking splits text into smaller pieces before embedding. Each chunk is indexed independently for more precise retrieval.

```gcl
var index = TextIndex<String> {
    config: TextIndexConfig {
        embed: my_embed,
        stopWords: StopWordOptions { mode: StopWordMode::none },

        // Chunk long documents
        chunking: ChunkingOptions {
            strategy: ChunkStrategy::sentence,
            size: 128,      // Words per chunk
            overlap: 20     // Overlapping words
        }
    }
};

index.add("Long document with many paragraphs...", "doc1");
index.build();

var results = index.search_semantic("specific topic", 10);
// Each result carries `chunkKey = "<doc-key>#<chunk-position>"`. Multiple
// chunks from the same document collapse to a single result (best-scoring
// chunk per parent) so callers don't see duplicates.
```

#### Chunking Strategies

Four chunking strategies are available. Choose based on your document structure and retrieval needs.

**Fixed** -- Predictable, uniform-size chunks. Splits text every `size` words regardless of sentence or paragraph structure.

```gcl
chunking: ChunkingOptions { strategy: ChunkStrategy::fixed, size: 256, overlap: 50 }
```

**Sentence** -- Respects sentence structure. Chunks are formed by grouping complete sentences up to the target size.

```gcl
chunking: ChunkingOptions { strategy: ChunkStrategy::sentence, size: 128, overlap: 20 }
```

**Paragraph** -- Preserves topical coherence by splitting at paragraph boundaries.

```gcl
chunking: ChunkingOptions { strategy: ChunkStrategy::paragraph, size: 512, overlap: 100 }
```

**Recursive (Adaptive)** -- Tries paragraph boundaries first, then falls back to sentence, then to fixed-size splitting. Best general-purpose strategy.

```gcl
chunking: ChunkingOptions { strategy: ChunkStrategy::recursive, size: 256, overlap: 50 }
```

| Strategy | Best For | Trade-off |
|----------|----------|-----------|
| `fixed` | Uniform embedding sizes | May split mid-sentence |
| `sentence` | General text | Chunk sizes may vary |
| `paragraph` | Structured documents | Large chunks if paragraphs are long |
| `recursive` | Mixed content | Most adaptive, slightly more processing |
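To make `size` and `overlap` concrete, here is a minimal Python sketch of fixed-size chunking. It assumes whitespace tokenization and that each chunk starts `size - overlap` words after the previous one; the library's exact boundary handling may differ.

```python
def chunk_fixed(text, size, overlap):
    """Split text into chunks of `size` words, each sharing `overlap`
    words with the previous chunk."""
    if overlap >= size:
        raise ValueError("overlap must be smaller than size")
    words = text.split()
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # the last chunk already reaches the end of the text
    return chunks


for chunk in chunk_fixed("a b c d e f g h i j", size=4, overlap=2):
    print(chunk)
```

A larger `overlap` reduces the chance that a relevant passage is cut at a chunk boundary, at the cost of more chunks to embed and store.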

### Configuration Parameters

| Parameter | Default | Description |
|-----------|---------|-------------|
| `embed` | `null` | User-provided embedding function: `fn(text: String): Tensor` |
| `chunking.strategy` | `ChunkStrategy::none` | Document chunking strategy |
| `chunking.size` | `256` | Words per chunk |
| `chunking.overlap` | `50` | Overlapping words between chunks |

---

## Exact Search

Normalized substring matching. The query is normalized using the same pipeline as documents, then every document's normalized text is checked for substring containment. Results score `1.0` for any match.

### When to Use

- **Exact phrase lookup** -- finding documents that contain a specific string exactly.
- **Case-insensitive substring matching** -- searching for product codes, identifiers, or known phrases.
- **Lookup by key or name** -- when you know the exact text you are looking for (after normalization).

### Algorithm

```
// Algorithm:
// 1. Normalize query: "Quick FOX" -> "quick fox"
// 2. For each entry in entries index:
//    if entry.normalizedText contains "quick fox" -> match (score: 1.0)
// 3. Collect up to k matches
```

Normalization includes case folding, accent removal, and whitespace collapsing, so `"machine learning"` matches `"Machine Learning"` and `"MACHINE  LEARNING"`.
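The normalize-then-contain check can be sketched in Python. The normalization steps below (Unicode decomposition, dropping combining marks, case folding, whitespace collapsing) are one plausible reading of the pipeline described here, not the library's actual code, and the `docs` mapping of key to text is an assumption for the example.

```python
import unicodedata


def normalize(text):
    """Case-fold, strip accents, and collapse runs of whitespace."""
    decomposed = unicodedata.normalize("NFKD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return " ".join(stripped.casefold().split())


def search_exact(query, docs, k):
    """Return up to k (key, 1.0) pairs whose normalized text contains the query."""
    needle = normalize(query)
    hits = [(key, 1.0) for key, text in docs.items() if needle in normalize(text)]
    return sorted(hits)[:k]


docs = {"d1": "The Quick  FOX jumps", "d2": "A slow fox", "d3": "Café au lait"}
print(search_exact("quick fox", docs, 10))
print(search_exact("cafe", docs, 10))
```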

### GCL Examples

```gcl
// Finds documents containing the exact phrase (normalized)
var _results = index.search_exact("machine learning", 10);
// Matches "Machine Learning", "MACHINE LEARNING", etc.
```

### Configuration Parameters

No mode-specific parameters. Exact search uses the same normalization pipeline configured for the index (case folding, accent removal, etc.).

> **Note:** `search_exact` is a binary substring filter. Every match returns `score = 1.0` and results are sorted alphabetically by key, not by relevance. It is intended as a binary boost in hybrid fusion (e.g. the `keyword` preset). For graded relevance ranking, use `search_bm25` or `search_phrase`.

---

## Fuzzy Search (Document-Level)

Compares the full query string against each document's normalized key using string similarity. Jaro-Winkler similarity is used for short strings and Levenshtein edit distance for longer ones.

### When to Use

- **Short keys** -- product names, titles, person names where the entire key should be similar to the query.
- **Typo tolerance at the whole-key level** -- when user input may have misspellings.
- **Name matching** -- finding "Jon Smith" when searching for "John Smth".

### Algorithm

```
// For short strings (byte length < 20):
// Use Jaro-Winkler similarity (faster, good for short strings)
score = query.jarowinkler(target)
threshold = 1.0 - (maxEdits / maxLength)
matched = score >= threshold

// For longer strings:
// Use Levenshtein distance (edit distance)
distance = query.levenshtein(target)
score = 1.0 - (distance / maxLength)
matched = distance <= maxEdits

// Length rejection fast path: if |len1 - len2| > maxEdits, skip immediately
```
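The Levenshtein branch and the length rejection fast path can be sketched in Python. The Jaro-Winkler branch is omitted for brevity, and treating `maxLength` as the length of the longer string is an assumption; the function names are illustrative.

```python
def levenshtein(a, b):
    """Edit distance with insert, delete, and substitute, using two rows."""
    if len(a) < len(b):
        a, b = b, a
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def fuzzy_levenshtein(query, target, max_edits=2):
    """Return a score in [0, 1], or None when the pair is rejected."""
    if abs(len(query) - len(target)) > max_edits:
        return None  # length rejection: the distance cannot be <= max_edits
    distance = levenshtein(query, target)
    if distance > max_edits:
        return None
    max_length = max(len(query), len(target))
    return 1.0 if max_length == 0 else 1.0 - distance / max_length


print(fuzzy_levenshtein("machne lerning", "machine learning"))
print(fuzzy_levenshtein("abc", "abcdefgh"))
```

The fast path is safe because every unit of length difference requires at least one insertion or deletion.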

### GCL Examples

```gcl
// Default key-level fuzzy: maxEdits=2
var _results = index.search_fuzzy("machne lerning", 10, null);

// Custom max-edits
var _custom = index.search_fuzzy("machne lerning", 10, FuzzyOptions { maxEdits: 1 });
// Finds "machine learning" with typos
```

The third argument is a `FuzzyOptions?` block. Pass `null` for defaults, or set `maxEdits`, `mode`, and `maxTextLength` to customize.

### Configuration Parameters

| Parameter | Default | Description |
|-----------|---------|-------------|
| `FuzzyOptions.maxEdits` | `2` | Maximum Levenshtein distance allowed |
| `FuzzyOptions.mode` | `FuzzyMode::key` | `key` = whole-document, `term` = per-token vocabulary fuzzy |
| `FuzzyOptions.maxTextLength` | from `config.fuzzyMaxTextLength` | Skip docs longer than this for `key` mode |

---

## Term-Level Fuzzy Search (FuzzyMode::term)

Instead of comparing the full query against full documents, set `FuzzyOptions.mode = FuzzyMode::term` to tokenize the query and match each query term against the vocabulary using the trigram pre-filter for candidate selection.

### When to Use

- **Longer documents** -- where you want to find terms that are close to the query terms, regardless of the overall document content.
- **Vocabulary-level typo tolerance** -- matching misspelled individual words against the indexed vocabulary.
- **Search-as-you-type with tolerance** -- forgiving partial or mistyped terms.

### Algorithm

```
// Term-level fuzzy with trigram pre-filter:
// 1. Tokenize query into terms
// 2. For each query term:
//    a. Extract trigrams from query term
//    b. Look up trigramIndex to find candidate vocabulary terms
//    c. Compute Levenshtein distance only against candidates
//    d. If distance <= maxEdits, treat as match
// 3. Collect documents containing any matched vocabulary term
// 4. Score = BM25 using matched terms

// Automatic typo tolerance (Meilisearch-style):
//   word length <= 4: 0 typos allowed
//   word length 5-8:  1 typo allowed
//   word length 9+:   2 typos allowed
```
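Steps 2a to 2d and the word-length rule can be sketched in Python. The trigram extraction (plain character trigrams, no padding) and the "shares at least one trigram" candidate rule are assumptions made for this example; the real pre-filter may be stricter.

```python
from collections import defaultdict


def levenshtein(a, b):
    """Edit distance with insert, delete, and substitute."""
    if len(a) < len(b):
        a, b = b, a
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]


def trigrams(term):
    """Character trigrams; terms shorter than 3 characters map to themselves."""
    return {term[i:i + 3] for i in range(len(term) - 2)} or {term}


def allowed_typos(word):
    """Word-length rule from the comment block above."""
    if len(word) <= 4:
        return 0
    return 1 if len(word) <= 8 else 2


def build_trigram_index(vocab):
    index = defaultdict(set)
    for term in vocab:
        for gram in trigrams(term):
            index[gram].add(term)
    return index


def fuzzy_terms(query_term, index, max_edits):
    """Run Levenshtein only against terms sharing a trigram with the query."""
    candidates = set()
    for gram in trigrams(query_term):
        candidates |= index.get(gram, set())
    return sorted(t for t in candidates if levenshtein(query_term, t) <= max_edits)


index = build_trigram_index(["algorithm", "logarithm", "alligator", "fox"])
print(fuzzy_terms("algoritm", index, max_edits=2))
```

Here `logarithm` passes the trigram filter (it shares `rit`) but is rejected by the edit-distance check, while `alligator` and `fox` are never compared at all.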

### GCL Examples

```gcl
// Fuzzy match individual query terms against vocabulary
var _results = index.search_fuzzy("algoritm", 10, FuzzyOptions {
    mode: FuzzyMode::term,
    maxEdits: 2
});
// Matches documents containing "algorithm"
```

### Comparison: Document-Level vs. Term-Level Fuzzy

| Mode | Scope | Best For |
|------|-------|----------|
| `FuzzyMode::key` | Whole-key similarity (Jaro-Winkler or Levenshtein) | Short keys, typo tolerance |
| `FuzzyMode::term` | Per-term Levenshtein | Long docs, vocabulary-level matching |

### Configuration Parameters

| Parameter | Default | Description |
|-----------|---------|-------------|
| `FuzzyOptions.maxEdits` | `2` | Maximum Levenshtein distance per term |
| `typoTolerance.enabled` (config) | `false` | Enable automatic typo tolerance for BM25 (Meilisearch-style word-length rules) |

---

## Boolean Search

A recursive descent parser converts boolean query strings into an AST, which is then evaluated against the index using set operations on posting lists.

### When to Use

- **Precise document filtering** -- when you need exact control over which terms must, may, or must not appear.
- **Research and legal search** -- complex queries like `"privacy AND (gdpr OR regulation) NOT marketing"`.
- **Product search with exclusions** -- `"(laptop OR notebook) AND (intel OR amd) NOT refurbished"`.
- **Minimum-should-match queries** -- WEAKAND for flexible matching without requiring all terms.

### Algorithm

```
// Supported operators: AND, OR, NOT, parentheses, and
// WEAKAND(N, t1, t2, ...) -- at least N of the listed terms must match
// Parser: recursive descent (boolean_parser.gcl)
// Executor: set operations (boolean_engine.gcl)

// Query: "fox AND NOT dog"
// AST:   AND(Term("fox"), NOT(Term("dog")))
//
// Evaluation:
// 1. Resolve "fox" -> posting list {doc1, doc2, doc3}
// 2. Resolve "dog" -> posting list {doc1, doc3}
// 3. NOT("dog") = all docs - {doc1, doc3}
// 4. AND = intersection: {doc1,doc2,doc3} intersect (all - {doc1,doc3}) = {doc2}
//
// Score: BM25 of matching terms for ranked results

// Complex: "(cat OR dog) AND NOT (fish AND bird)"
// AST:   AND(OR(Term("cat"), Term("dog")), NOT(AND(Term("fish"), Term("bird"))))
```
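The evaluation step can be sketched in Python with sets standing in for posting lists. The tuple-based AST and the names below are illustrative assumptions, not the structures used by `boolean_engine.gcl`; parsing and BM25 scoring are left out.

```python
ALL_DOCS = {"doc1", "doc2", "doc3", "doc4"}
POSTINGS = {
    "fox": {"doc1", "doc2", "doc3"},
    "dog": {"doc1", "doc3"},
    "cat": {"doc4"},
}


def evaluate(node):
    """Evaluate a tuple AST such as ("AND", left, right) to a set of doc ids."""
    op = node[0]
    if op == "TERM":
        return set(POSTINGS.get(node[1], set()))
    if op == "NOT":
        return ALL_DOCS - evaluate(node[1])
    if op == "AND":
        return evaluate(node[1]) & evaluate(node[2])
    if op == "OR":
        return evaluate(node[1]) | evaluate(node[2])
    if op == "WEAKAND":
        min_match, children = node[1], node[2]
        counts = {}
        for child in children:
            for doc in evaluate(child):
                counts[doc] = counts.get(doc, 0) + 1
        return {doc for doc, n in counts.items() if n >= min_match}
    raise ValueError(f"unknown operator: {op}")


# "fox AND NOT dog"
print(evaluate(("AND", ("TERM", "fox"), ("NOT", ("TERM", "dog")))))
# "WEAKAND(2, fox, dog, cat)"
print(sorted(evaluate(("WEAKAND", 2, [("TERM", "fox"), ("TERM", "dog"), ("TERM", "cat")]))))
```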

### Operator Summary

| Operator | Meaning | Example |
|----------|---------|---------|
| `AND` | Both terms must be present | `"machine AND learning"` |
| `OR` | Either term can be present | `"machine OR learning"` |
| `NOT` | Exclude documents with term | `"machine NOT deep"` |
| `WEAKAND(N, ...)` | At least N terms must match | `"WEAKAND(2, a, b, c)"` |
| (implicit) | Adjacent terms use `AND` | `"machine learning"` = `"machine AND learning"` |

### GCL Examples

#### Basic Boolean Operators

Adjacent terms without operators are implicitly joined with `AND`.

```gcl
// AND - both terms required
var _andResults = index.search_boolean("machine AND learning", 10);

// OR - either term matches
var _orResults = index.search_boolean("machine OR learning", 10);

// NOT - exclude documents with term
var _notResults = index.search_boolean("machine NOT deep", 10);
```

#### Complex Queries with Parentheses

```gcl
// Parentheses for grouping
var _grouped = index.search_boolean("(machine OR deep) AND learning", 10);

// Nested queries
var _nested = index.search_boolean(
    "((neural OR deep) AND network) NOT convolutional", 10
);

// Implicit AND (adjacent terms)
var _implicit = index.search_boolean("machine learning algorithms", 10);
// Equivalent to: machine AND learning AND algorithms
```

#### WEAKAND (Minimum-Should-Match)

`WEAKAND(N, term1, term2, ...)` requires at least N of the listed terms to match. This is useful when you want flexible matching without requiring all terms.

```gcl
var _results = index.search_boolean("WEAKAND(2, machine, learning, neural, network)", 10);
// Matches documents containing any 2+ of the 4 terms
```

```gcl
// At least 2 of 3 terms
var _twoOfThree = index.search_boolean("WEAKAND(2, quick, brown, fox)", 10);

// At least 1 term (broadest match)
var _anyTerm = index.search_boolean("WEAKAND(1, apple, banana, cherry)", 10);

// All 3 required (equivalent to AND)
var _allTerms = index.search_boolean("WEAKAND(3, alpha, beta, gamma)", 10);
```

#### Real-World Examples

```gcl
// Research papers
index.search_boolean("(neural OR deep) AND (network OR learning) NOT survey", 10);

// Legal documents
index.search_boolean("privacy AND (gdpr OR regulation) NOT marketing", 10);

// Product search
index.search_boolean("(laptop OR notebook) AND (intel OR amd) NOT refurbished", 10);

// Medical records
index.search_boolean("diabetes AND (type1 OR type2) NOT gestational", 10);
```

#### Pattern: Required + Optional + Excluded

A common pattern combines required terms (`AND`), alternative terms (`OR`), and exclusions (`NOT`):

```
<required terms> AND (<alternative1> OR <alternative2>) NOT <excluded>
```

This gives users precise control over which documents match without needing to write code-level filters.

### Configuration Parameters

No mode-specific configuration parameters. Boolean search uses the index's existing inverted index and BM25 configuration for scoring matched documents.

---

## Proximity Search

Finds documents where two terms appear within a specified token distance. Uses a two-pointer scan on sorted position arrays for efficient minimum distance computation.

### When to Use

- **Compound terms and collocations** -- finding "new york" even when separated by other tokens.
- **Related term co-occurrence** -- ensuring "cause" and "effect" appear near each other.
- **Same-sentence or same-paragraph context** -- using larger distances (10-50) for thematic co-occurrence.

### Algorithm

```
// Algorithm: two-pointer merge on sorted position arrays
// Given positions1 = [2, 7, 15] and positions2 = [4, 8, 20]

i = 0, j = 0, minDist = infinity
while i < len(positions1) and j < len(positions2):
    dist = |positions1[i] - positions2[j]|
    minDist = min(minDist, dist)
    if positions1[i] < positions2[j]:
        i++
    else:
        j++

// Example: |2-4| = 2, |7-4| = 3, |7-8| = 1, |15-8| = 7, |15-20| = 5 -> minDist = 1

// Score: max(0.0, 1.0 - (minDist / (distance + 1)))
// Closer terms score higher (minDist=1, distance=5 -> score=0.833)
```
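The same two-pointer scan and scoring formula, written as runnable Python. The function names are illustrative; the inputs are the position arrays from the example above.

```python
def min_distance(positions1, positions2):
    """Smallest |p1 - p2| over two sorted position lists, in O(n + m)."""
    i = j = 0
    best = float("inf")
    while i < len(positions1) and j < len(positions2):
        best = min(best, abs(positions1[i] - positions2[j]))
        # Advance the pointer at the smaller position: moving the other
        # one could only increase the gap.
        if positions1[i] < positions2[j]:
            i += 1
        else:
            j += 1
    return best


def proximity_score(min_dist, distance):
    """max(0, 1 - minDist / (distance + 1)); closer terms score higher."""
    return max(0.0, 1.0 - min_dist / (distance + 1))


d = min_distance([2, 7, 15], [4, 8, 20])
print(d, round(proximity_score(d, distance=5), 3))
```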

### GCL Examples

```gcl
// Terms within N positions of each other
var _results = index.search_proximity("machine", "learning", 10, ProximityOptions { distance: 5 });
// Matches if terms are within 5 tokens:
//   "machine learning"
//   "machine and deep learning"
//   "machine, a subset of learning"
```

The arguments are: `term1`, `term2`, `k` (max results), `ProximityOptions?`.

```gcl
// Tight proximity (adjacent or near)
index.search_proximity("new", "york", 10, ProximityOptions { distance: 2 });
// Matches: "new york", "new york city"

// Loose proximity (same paragraph)
index.search_proximity("climate", "change", 10, ProximityOptions { distance: 50 });

// Collocations
index.search_proximity("artificial", "intelligence", 10, ProximityOptions { distance: 3 });
```

### Choosing Distance Values

| Distance | Use Case | Example |
|----------|----------|---------|
| 1-2 | Compound terms, names | `"new"`, `"york"` |
| 3-5 | Related terms, phrases | `"artificial"`, `"intelligence"` |
| 10-20 | Same sentence context | `"cause"`, `"effect"` |
| 30-50 | Same paragraph context | `"climate"`, `"change"` |

### Configuration Parameters

| Parameter | Default | Description |
|-----------|---------|-------------|
| `ProximityOptions.distance` | `5` | Maximum token distance between the two terms |

---

## Phrase Search

Searches for exact sequences of terms by verifying consecutive positions. Optional slop tolerance allows terms to be out of order or have gaps.

### When to Use

- **Exact phrase matching** -- finding `"machine learning"` as a consecutive sequence, not as separate terms.
- **Quoted search queries** -- when users enclose queries in quotes to indicate exact-phrase intent.
- **Sloppy phrase matching** -- using slop to allow near-phrase matches with intervening words.

### Algorithm

```
// Phrase: "brown fox" (slop=0, exact consecutive)
// Algorithm:
// 1. Tokenize phrase: ["brown", "fox"]
// 2. Find candidate docs (intersection of all term posting lists)
// 3. For each candidate doc:
//    a. Get positions of "brown": [2, 10]
//    b. Get positions of "fox":   [3, 8]
//    c. Check: exists p1 in brown_pos, p2 in fox_pos
//       where p2 = p1 + 1 (consecutive)
//    d. Found: p1=2, p2=3 -> match!
// 4. Score matching docs with BM25

// With slop=2:
// Allow up to 2 positions of total deviation across the phrase
// "brown ... fox" or "fox brown" can match
// Each deviation adds to total slop; total must be <= slop parameter
```
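The exact (slop = 0) position check in step 3 can be sketched in Python. It takes one sorted position list per phrase term, in phrase order, and covers only the consecutive case; slop handling is not shown. The function name is illustrative.

```python
def has_exact_phrase(term_positions):
    """True if some start position p has term i at position p + i for all i."""
    if not term_positions:
        return False
    later = [set(positions) for positions in term_positions[1:]]
    return any(
        all(start + offset in positions for offset, positions in enumerate(later, 1))
        for start in term_positions[0]
    )


# "brown fox": brown at [2, 10], fox at [3, 8] -> match at 2, 3
print(has_exact_phrase([[2, 10], [3, 8]]))
# fox never directly follows brown
print(has_exact_phrase([[2, 10], [5, 8]]))
```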

### GCL Examples

```gcl
// Terms must appear consecutively
var _results = index.search_phrase("machine learning", 10, null);
// Matches "machine learning" but not "machine deep learning"
```

Phrase search normalizes and tokenizes the query the same way as indexing, so it respects case folding and stop word removal when configured.

```gcl
// Allow up to 2 words between phrase terms
var _results = index.search_phrase("machine learning", 10, PhraseOptions { slop: 2 });
// Matches: "machine learning" (slop 0)
// Matches: "machine deep learning" (slop 1)
// Matches: "machine and deep learning" (slop 2)
```

### Configuration Parameters

| Parameter | Default | Description |
|-----------|---------|-------------|
| `PhraseOptions.slop` | `0` | Maximum total position deviation allowed (intervening terms or reordering) |

---

## Prefix Search

Finds all terms in the vocabulary that start with a given prefix, then collects documents containing any matched term. With edge n-grams enabled, prefix lookup is O(1).

### When to Use

- **Search-as-you-type / autocomplete** -- showing results as the user types partial words.
- **Partial term matching** -- when users enter incomplete terms.
- **Vocabulary exploration** -- discovering what terms exist in the index that start with a given string.

### Algorithm

```
// Without edge n-grams (vocabulary scan):
// 1. Normalize prefix: "Qui" -> "qui"
// 2. Scan normalizedTerms index for terms starting with "qui"
//    Matches: "quick", "quiet", "quilt", "quit"
// 3. Union posting lists of all matched terms
// 4. Score by BM25

// With edge n-grams (O(1) lookup):
// 1. Look up edgeNgramTerms.get("qui") directly
// 2. Get the NormalizedTerm node with its posting list
// 3. All documents containing any term starting with "qui" are returned
```

Prefix lookup uses a trie built at `build()` time, giving O(prefix_length + matches) lookup. A reverse trie is also built for leading-wildcard patterns like `*tion`.
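The edge n-gram idea can be illustrated in Python: every prefix of every term, within a length range, is stored as a dictionary key at build time, so a prefix query becomes a single hash lookup. This is a simplified sketch that maps prefixes to terms rather than to posting lists; the `min_len` and `max_len` defaults mirror the `edgeNgram.min` and `edgeNgram.max` defaults listed on this page.

```python
from collections import defaultdict


def build_edge_ngrams(vocab, min_len=2, max_len=20):
    """Map each prefix of length min_len..max_len to the terms starting with it."""
    index = defaultdict(set)
    for term in vocab:
        for n in range(min_len, min(len(term), max_len) + 1):
            index[term[:n]].add(term)
    return index


index = build_edge_ngrams(["quick", "quiet", "quilt", "quit", "fox"])
print(sorted(index.get("qui", set())))
```

The trade-off is index size: each term contributes one entry per indexed prefix length, which is why the range is bounded.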

### GCL Examples

```gcl
// Find documents with terms starting with prefix
var _results = index.search_prefix("algo", 10);
// Matches documents containing: algorithm, algorithms, algorithmic
```

#### Suggest (term completions)

Use `suggest()` to return vocabulary terms that match a prefix. Ideal for search-as-you-type interfaces.

```gcl
// Get vocabulary term completions
var _completions = index.suggest("mach", 10);
// Returns Array<Suggestion> for: machine, machinery, machines, machining

// Useful for search-as-you-type
var input = "neur";
var suggestions = index.suggest(input, 5);
for (var i = 0; i < suggestions.size(); i++) {
    info(suggestions[i].term);
}
// neural, neuron, neuroscience, ...
```

### Configuration Parameters

| Parameter | Default | Description |
|-----------|---------|-------------|
| `edgeNgram.enabled` (config) | `false` | Build edge n-gram index for O(1) prefix lookup |
| `edgeNgram.min` (config) | `2` | Minimum prefix length to index |
| `edgeNgram.max` (config) | `20` | Maximum prefix length to index |

---

## Wildcard Search

Supports `*` (any sequence of characters) and `?` (any single character) patterns matched against the vocabulary. Each pattern is compiled once and then tested against the vocabulary terms.

### When to Use

- **Flexible pattern matching** -- when the user knows part of a term but not the exact spelling.
- **Suffix matching** -- finding all terms ending with a pattern like `*tion`.
- **Single-character variants** -- using `te?t` to match both `"test"` and `"text"`.

### Algorithm

```
// Pattern: "qu*ck" matches "quick", "quack"
// Pattern: "fo?" matches "fox", "for", "fog"
// Pattern: "*tion" matches "action", "motion", "station"
//
// Algorithm:
// 1. Normalize pattern
// 2. Scan vocabulary, test each term against pattern
// 3. Collect documents from matching term posting lists
// 4. Score by BM25 of matched terms
```
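
Steps 1 and 2 above can be sketched in Python by compiling the wildcard pattern into a regular expression and testing each vocabulary term against it. This is an illustrative sketch, not the library's matcher; `compile_wildcard` and `match_vocabulary` are hypothetical names, and normalization is assumed to have happened already.

```python
import re

def compile_wildcard(pattern):
    """Translate `*` and `?` into regex syntax, escaping everything else."""
    parts = []
    for ch in pattern:
        if ch == "*":
            parts.append(".*")   # any sequence of characters, including none
        elif ch == "?":
            parts.append(".")    # exactly one character
        else:
            parts.append(re.escape(ch))
    return re.compile("".join(parts), re.DOTALL)

def match_vocabulary(pattern, vocab):
    """Return the vocabulary terms that match the whole pattern."""
    regex = compile_wildcard(pattern)
    return [term for term in vocab if regex.fullmatch(term)]

print(match_vocabulary("colo?r", ["color", "colour"]))
print(match_vocabulary("*tion", ["action", "motion", "learned"]))
```

Using `fullmatch` matters: a plain search would let `te?t` match inside longer terms such as `"tester"`.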

### GCL Examples

```gcl
// Match any term starting with "algo" and ending with "m"
var _results = index.search_wildcard("algo*m", 10);
// Matches: "algorithm"

// Match terms with a single character variant
var _results2 = index.search_wildcard("colo?r", 10);
// Matches: "colour" (but not "color" -- ? requires exactly one char)

// Match any term ending with "tion"
var _results3 = index.search_wildcard("*tion", 10);
// Matches: "information", "classification", "generation", ...

// Match terms with a specific pattern
var _results4 = index.search_wildcard("te?t", 10);
// Matches: "test", "text"
```

### Wildcard Pattern Summary

| Pattern | Matches | Does Not Match |
|---------|---------|----------------|
| `algo*` | algorithm, algorithms, algorithmic | algebra |
| `*ing` | learning, processing, mining | learned |
| `te?t` | test, text | treat |
| `*learn*` | learning, learned, unlearned | lean |
| `n??ral` | neural | natural |

### Configuration Parameters

No mode-specific configuration parameters. Wildcard search operates on the vocabulary built during `build()`.

---

## Span Queries

Positional span queries for fine-grained proximity control. Three span operators are supported:

- **NEAR(a, b, n)** -- terms a and b must appear within n tokens of each other (unordered)
- **ONEAR(a, b, n)** -- ordered NEAR: a must appear before b within n tokens
- **FIRST(term, n)** -- term must appear in the first n positions of the document

### When to Use

- **Ordered proximity constraints** -- when the order of terms matters (ONEAR).
- **Document-head matching** -- finding terms that appear early in a document (FIRST), useful for matching titles or lead sentences.
- **Fine-grained positional control** -- when standard proximity search is not specific enough.

### Algorithm

```
// Algorithm (NEAR):
// 1. Parse span query: NEAR(machine, learning, 3)
// 2. Resolve both terms to posting lists
// 3. Intersect posting lists to find candidate documents
// 4. For each candidate, get position arrays for both terms
// 5. Two-pointer scan: check if any pair of positions is within distance 3
// 6. Return matching documents (score = 1.0)

// ONEAR adds ordering constraint: position(t1) < position(t2)
// FIRST checks: position(term) < window
```
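
The two-pointer scan in step 5 can be sketched as follows. This Python snippet is a minimal illustration, not the library's code. It assumes sorted, 0-based position lists and interprets "within n tokens" as an absolute position difference of at most `n`; the function names `near` and `first` are hypothetical.

```python
def near(pos_a, pos_b, n, ordered=False):
    """NEAR / ONEAR check over two sorted position lists.

    Always advances the pointer at the smaller position, so each list is
    traversed once."""
    i = j = 0
    while i < len(pos_a) and j < len(pos_b):
        a, b = pos_a[i], pos_b[j]
        if a < b:
            if b - a <= n:
                return True
            i += 1  # a later occurrence of `a` may be closer to b
        else:
            if not ordered and a - b <= n:
                return True
            j += 1  # b is at or before a; try the next occurrence of b
    return False

def first(positions, n):
    """FIRST check: the earliest occurrence lies inside the first n positions."""
    return bool(positions) and positions[0] < n

print(near([5], [3], 3))                # unordered: distance 2
print(near([5], [3], 3, ordered=True))  # ordered: the first term comes too late
print(first([4], 5))
```

Because both lists are sorted, advancing the smaller position never skips a closer pair, which keeps the check linear in the combined list length.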

### GCL Examples

Span queries are invoked through the span search API using the span query syntax:

```gcl
// NEAR -- unordered proximity
var _results = index.search_span("NEAR(machine, learning, 3)", 10);

// ONEAR -- ordered proximity (machine must appear before learning)
var _results2 = index.search_span("ONEAR(machine, learning, 3)", 10);

// FIRST -- term must appear in the first N positions
var _results3 = index.search_span("FIRST(introduction, 5)", 10);
```

### Configuration Parameters

| Parameter | Default | Description |
|-----------|---------|-------------|
| `n` (in query) | N/A | Window size in tokens for NEAR/ONEAR/FIRST |

---

## DFR Search (Divergence From Randomness)

DFR is an alternative to BM25 based on information theory. It models term distribution as a combination of three components: a basic model of information content, an after-effect of sampling, and a length normalization factor.

### When to Use

- **Alternative to BM25** -- when BM25 does not perform well for your corpus characteristics.
- **Information-theoretic scoring** -- when you want a principled scoring model based on how much a term's distribution in a document diverges from its expected random distribution.
- **Tunable retrieval** -- DFR offers many combinations of basic models, after-effects, and normalizations for fine-grained control.

### Formula

```
// DFR score for term t in document d:
score(t, d) = BasicModel(t, d) * AfterEffect(t, d) * Normalization(t, d)

// Basic Models:
//   G    (Geometric):  log2(1 + cf/N)         -- cf = collection frequency
//   In   (Inverse DF): tf * log2((N+1)/(df+0.5))
//   Ine  (Inverse NE): tf * log2((N+1)/ne)    -- ne = expected docs
//   IF   (Inverse TF): tf * log2(1 + avgdl/tf)

// After Effects:
//   Laplace:    1 / (tf + 1)
//   Bernoulli:  (totalTokens - tf + 1) / ((df + 1) * (avgdl + 1))

// Normalizations (document length adjustment; c is a tunable normalization constant):
//   H1:  tf * log2(1 + c * avgdl / docLen)       -- pivoted
//   H2:  tf * log2(1 + c * avgdl / docLen)^2      -- logarithmic
//   H3:  (tf + c * avgdl / docLen) / (1 + c)      -- Dirichlet-style
//   Z:   tf / (tf + c * avgdl / docLen)            -- bounded
```

### GCL Examples

```gcl
var _index = TextIndex<String> {
    config: TextIndexConfig {
        dfr: DFROptions {
            basicModel: DFRBasicModel::G,
            afterEffect: DFRAfterEffect::Laplace,
            normalization: DFRNormalization::H2
        }
    }
};
```

### Configuration Parameters

All under `dfr: DFROptions` in `TextIndexConfig`.

| Parameter | Default | Description |
|-----------|---------|-------------|
| `dfr.basicModel` | `DFRBasicModel::G` | Basic model: `G`, `In`, `Ine`, `IF` |
| `dfr.afterEffect` | `DFRAfterEffect::Laplace` | After-effect: `Laplace`, `Bernoulli` |
| `dfr.normalization` | `DFRNormalization::H2` | Normalization: `H1`, `H2`, `H3`, `Z` |

---

## LM-Dirichlet Search (Language Model)

Language Model scoring with Dirichlet prior smoothing. Models documents as probability distributions over terms and scores queries by their likelihood under the document model.

### When to Use

- **Short queries** -- Dirichlet smoothing works well for short queries where BM25 may over-penalize missing terms.
- **Corpus with short documents** -- the smoothing parameter controls how much to fall back to the collection model for unseen terms.
- **Probabilistic retrieval** -- when you want a language-model-based ranking that naturally handles term absence.

### Formula

```
// LM-Dirichlet score for term t in document d:
score(t, d) = log((tf + mu * P(t|C)) / (docLen + mu))

// Parameters:
//   tf      = term frequency in document d
//   mu      = Dirichlet smoothing parameter (default: 2000)
//   P(t|C)  = collection-level probability: totalTermFreq / totalTokens
//   docLen  = document length in tokens

// Higher mu = more smoothing toward the collection model
//   mu = 500:   light smoothing, scores follow document term counts closely
//   mu = 2000:  balanced (default)
//   mu = 5000:  heavy smoothing, scores lean more on collection statistics
```
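
The formula above translates directly into a few lines of Python. This is a sketch for intuition rather than the library's implementation. Summing per-term scores over the query and skipping terms that never occur in the collection are assumptions made here, and the function names are hypothetical.

```python
import math

def lm_dirichlet_term_score(tf, doc_len, collection_tf, total_tokens, mu=2000.0):
    """log((tf + mu * P(t|C)) / (docLen + mu)) for a single term."""
    p_collection = collection_tf / total_tokens
    return math.log((tf + mu * p_collection) / (doc_len + mu))

def lm_dirichlet_score(query_terms, doc_tf, doc_len, collection_tf, total_tokens, mu=2000.0):
    """Sum per-term scores; terms absent from the collection are skipped
    (an assumption, since log(0) is undefined)."""
    total = 0.0
    for term in query_terms:
        ctf = collection_tf.get(term, 0)
        if ctf == 0:
            continue
        total += lm_dirichlet_term_score(doc_tf.get(term, 0), doc_len, ctf, total_tokens, mu)
    return total

collection_tf = {"neural": 1000, "network": 2000}
doc_a = {"neural": 3, "network": 2}   # contains both query terms
doc_b = {"network": 1}                # misses "neural"
for doc in (doc_a, doc_b):
    print(lm_dirichlet_score(["neural", "network"], doc, 100, collection_tf, 1_000_000))
```

A document that lacks a query term still receives a finite score through the `mu * P(t|C)` term, which is the smoothing at work.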

### GCL Examples

```gcl
var _index = TextIndex<String> {
    config: TextIndexConfig {
        lmDirichlet: LMDirichletOptions { mu: 2000.0 }
    }
};
```

### Configuration Parameters

| Parameter | Default | Description |
|-----------|---------|-------------|
| `lmDirichlet.mu` | `2000.0` | Dirichlet smoothing parameter; higher = more smoothing toward the collection model |

---

## Phonetic Search

Sound-alike matching using the Double Metaphone algorithm. Finds documents that sound similar to the query even when spelled differently.

### When to Use

- **Name search** -- finding people by name when the spelling is uncertain ("Smith" vs. "Smyth").
- **Sound-alike retrieval** -- matching homophones and near-homophones ("Knight" vs. "Night").
- **Multilingual name matching** -- handling phonetic variations across transliterations.

### Algorithm

```
// Double Metaphone generates phonetic codes:
// "Smith"  -> "SM0"   (primary)
// "Smyth"  -> "SM0"   (primary)  -- matches "Smith"
// "Knight" -> "NT"    (primary)
// "Night"  -> "NT"    (primary)  -- matches "Knight"

// Algorithm handles:
// - Silent prefixes: KN, GN, PN, AE, WR
// - Digraphs: CH, PH, SH, TH
// - Double consonant collapsing
// - Primary and secondary codes for ambiguous pronunciations

// Requires config:
usePhonetic: true  // builds phonetic index during build()

// The phonetic index maps each phonetic code to all terms
// sharing that code, enabling sound-alike retrieval
```
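
The phonetic index described above is essentially an inverted map from code to terms. The Python sketch below illustrates that structure only. It does not implement Double Metaphone: the encoder is a stand-in lookup table holding the primary codes listed above, and `build_phonetic_index` and `sound_alikes` are hypothetical names.

```python
from collections import defaultdict

# Stand-in encoder: primary codes copied from the examples above.
# A real index would call a Double Metaphone implementation here.
PRIMARY_CODES = {"smith": "SM0", "smyth": "SM0", "knight": "NT", "night": "NT"}

def build_phonetic_index(vocab, encode):
    """Group vocabulary terms by their phonetic code."""
    index = defaultdict(set)
    for term in vocab:
        code = encode(term)
        if code:
            index[code].add(term)
    return index

def sound_alikes(query_term, index, encode):
    """Return every indexed term sharing the query term's phonetic code."""
    return sorted(index.get(encode(query_term), set()))

index = build_phonetic_index(["smith", "smyth", "knight", "night"], PRIMARY_CODES.get)
print(sound_alikes("smith", index, PRIMARY_CODES.get))
print(sound_alikes("night", index, PRIMARY_CODES.get))
```

Once the matching terms are known, retrieval proceeds as in any term lookup: the posting lists of those terms are combined.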

### GCL Examples

```gcl
var index = TextIndex<String> {
    config: TextIndexConfig {
        usePhonetic: true,
        stopWords: StopWordOptions { mode: StopWordMode::none }
    }
};

index.add("Smith and associates", "doc1");
index.add("Smyth consulting group", "doc2");
index.build();

var _results = index.search_phonetic("Smith", 10);
// Matches both "Smith" and "Smyth"
```

### Configuration Parameters

| Parameter | Default | Description |
|-----------|---------|-------------|
| `usePhonetic` (config) | `false` | Enable phonetic index construction during `build()` |

---

## Quorum Search

Minimum-should-match search. Documents must contain at least `minMatch` of the query terms. More flexible than boolean AND (which requires all terms) but more precise than OR (which requires any one term).

### When to Use

- **Flexible multi-term queries** -- when requiring all terms is too strict but any-one-term is too loose.
- **Long user queries** -- where matching most but not all terms is acceptable.
- **Recall-precision trade-off** -- tuning `minMatch` to control the balance.

### Algorithm

```
// Query: "machine learning neural network" with minMatch=2
// Matches documents containing any 2+ of the 4 terms:
//   "machine learning" -- matches (2 of 4)
//   "neural network architecture" -- matches (2 of 4)
//   "machine" alone -- does NOT match (only 1 of 4)

// Score: matchedCount / queryTermCount (0.0 to 1.0 range, representing the fraction of query terms matched)
// Higher minMatch = higher precision, lower recall
```
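
The matching rule and score above fit in a short Python sketch. It is illustrative only: documents are modelled as plain sets of terms, duplicate query terms are counted once (an assumption), and `quorum_search` is a hypothetical name.

```python
def quorum_search(query_terms, docs, min_match):
    """docs maps a document id to the set of terms it contains.

    Returns (doc_id, score) pairs where score = matched / query term count."""
    unique_terms = set(query_terms)
    if not unique_terms:
        return []
    results = []
    for doc_id, terms in docs.items():
        matched = len(unique_terms & terms)
        if matched >= min_match:
            results.append((doc_id, matched / len(unique_terms)))
    # Highest fraction first; ties broken by document id for stable output
    results.sort(key=lambda r: (-r[1], r[0]))
    return results

docs = {
    "a": {"machine", "learning"},
    "b": {"neural", "network", "architecture"},
    "c": {"machine"},
    "d": {"machine", "learning", "neural", "network"},
}
print(quorum_search("machine learning neural network".split(), docs, 2))
```

Document `c` is dropped because it contains only one of the four terms, which mirrors the example in the algorithm block.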

### GCL Examples

```gcl
var _results = index.search_quorum("machine learning neural network", 10, 2);
// Matches documents containing at least 2 of the 4 query terms
```

### Configuration Parameters

| Parameter | Default | Description |
|-----------|---------|-------------|
| `minMatch` (argument) | N/A | Minimum number of query terms that must be present in a document |

---

## Hybrid Search

Hybrid search combines multiple search modes using score fusion. The unified `search()` method dispatches to any combination of engines and merges results using Reciprocal Rank Fusion (RRF) or linear combination.

### When to Use

- **Best overall relevance** -- combining BM25 keyword matching with exact, fuzzy, or semantic search for comprehensive results.
- **Compensating for individual mode weaknesses** -- BM25 misses synonyms, semantic search misses exact terms; combining them covers both.
- **Tunable multi-signal ranking** -- adjusting weights to balance precision and recall from different engines.

### Fusion Mechanics

#### Reciprocal Rank Fusion (RRF)

RRF merges ranked lists by summing the reciprocal of each result's rank across modes. It is rank-based and does not require score normalization.

```
// RRF score for document d across modes:
RRF(d) = SUM_m weight_m / (k + rank_m(d))

// k = 60 (standard constant)
// weight_m = per-mode weight from SearchOptions
// rank_m(d) = rank of document d in mode m's result list (1-based)
```
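
As a worked illustration of the formula, the following Python sketch fuses two ranked lists. It is not the library's implementation: `rrf_fuse` is a hypothetical name, and each mode's input is assumed to be a list of document ids ordered best first.

```python
def rrf_fuse(rankings, weights, k=60):
    """rankings maps a mode name to its ranked document ids (best first).

    Each appearance contributes weight / (k + rank), with 1-based ranks."""
    scores = {}
    for mode, ranked in rankings.items():
        for rank, doc_id in enumerate(ranked, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + weights[mode] / (k + rank)
    # Highest fused score first; ties broken by document id
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))

rankings = {"bm25": ["a", "b", "c"], "semantic": ["b", "d"]}
weights = {"bm25": 0.5, "semantic": 0.5}
for doc_id, score in rrf_fuse(rankings, weights):
    print(doc_id, round(score, 6))
```

Document `b` ranks first because it appears in both lists, even though it is not the top BM25 hit. Only ranks enter the formula, so raw scores from different modes never need to be comparable.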

#### Linear Fusion

Linear fusion normalizes raw scores from each mode and combines them with weighted linear combination.

```
// Linear fusion:
score(d) = SUM_m weight_m * normalize(score_m(d))
```

Two normalization methods are supported:

| Method | Description |
|--------|-------------|
| `Normalization::minmax` | Scales scores to [0, 1] range using min/max values |
| `Normalization::zscore` | Subtracts the mean and divides by the standard deviation |
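
A minimal Python sketch of linear fusion with both normalization methods follows. Several choices here are assumptions rather than documented behaviour: the population standard deviation is used for z-scores, a mode whose scores are all equal normalizes to 0, and a document missing from a mode contributes nothing for that mode. The function names are hypothetical.

```python
from statistics import mean, pstdev

def minmax(scores):
    """Scale one mode's raw scores to the [0, 1] range."""
    low, high = min(scores.values()), max(scores.values())
    if high == low:
        return {doc_id: 0.0 for doc_id in scores}
    return {doc_id: (s - low) / (high - low) for doc_id, s in scores.items()}

def zscore(scores):
    """Center one mode's raw scores on the mean, in units of standard deviation."""
    values = list(scores.values())
    center, spread = mean(values), pstdev(values)
    if spread == 0:
        return {doc_id: 0.0 for doc_id in scores}
    return {doc_id: (s - center) / spread for doc_id, s in scores.items()}

def linear_fuse(mode_scores, weights, normalize=minmax):
    """Weighted sum of normalized per-mode scores."""
    fused = {}
    for mode, scores in mode_scores.items():
        if not scores:
            continue
        for doc_id, value in normalize(scores).items():
            fused[doc_id] = fused.get(doc_id, 0.0) + weights[mode] * value
    return sorted(fused.items(), key=lambda item: (-item[1], item[0]))

mode_scores = {
    "bm25": {"a": 10.0, "b": 5.0, "c": 0.0},
    "semantic": {"b": 0.9, "c": 0.1},
}
print(linear_fuse(mode_scores, {"bm25": 0.6, "semantic": 0.4}))
```

Unlike RRF, linear fusion is sensitive to the score distribution inside each mode, which is why the normalization step comes first.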

### GCL Examples

#### Default Hybrid Search

```gcl
// Automatically combines BM25 + exact + fuzzy + semantic
var _results = index.search("machine learning", 10, null);
// Uses RRF fusion with default weights
```

#### Custom Fusion Weights

```gcl
var w = Map<SearchMode, float> {};
w.set(SearchMode::bm25, 0.5);
w.set(SearchMode::exact, 0.3);
w.set(SearchMode::fuzzy, 0.2);

var options = SearchOptions {
    weights: w,
    fusionMethod: FusionMethod::rrf
};

var _results = index.search("query", 10, options);
```

#### Linear Fusion

```gcl
var w = Map<SearchMode, float> {};
w.set(SearchMode::bm25, 0.6);
w.set(SearchMode::semantic, 0.4);

var options = SearchOptions {
    fusionMethod: FusionMethod::linear,
    normalization: Normalization::minmax,  // or zscore
    weights: w
};

var _results = index.search("query", 10, options);
```

#### Hybrid with Semantic

When an embedding function is configured (`config.embed`), hybrid search includes semantic similarity in the fusion.

```gcl
// `model` is assumed to be an embedding model handle defined elsewhere
fn my_embed(text: String): Tensor {
    return ai::embed(text, model);
}

var w = Map<SearchMode, float> {};
w.set(SearchMode::bm25, 0.4);
w.set(SearchMode::semantic, 0.6);

var index = TextIndex<String> {
    config: TextIndexConfig {
        embed: my_embed,
        stopWords: StopWordOptions { mode: StopWordMode::default },
        // Weights for hybrid fusion
        fusion: FusionOptions {
            method: FusionMethod::rrf,
            weights: w
        }
    }
};

// Hybrid search now fuses semantic similarity with the lexical modes
var results = index.search("artificial intelligence", 10, null);
```

#### Min Score Filtering

```gcl
var options = SearchOptions {
    minScore: 0.3  // Filter results below threshold
};

var _results = index.search("query", 10, options);
```

#### Selecting Search Modes via SearchOptions

The `modes` field in `SearchOptions` lets you select exactly which engines to run:

```gcl
// Single mode -- BM25 only, no fusion
var bm25Only = Array<SearchMode> {};
bm25Only.add(SearchMode::bm25);
var _options = SearchOptions { modes: bm25Only };

// Two modes -- fuses BM25 + phrase
var twoModes = Array<SearchMode> {};
twoModes.add(SearchMode::bm25);
twoModes.add(SearchMode::phrase);
var _options2 = SearchOptions { modes: twoModes };

// Default (null) -- BM25 + exact + fuzzy + semantic
var _options3 = SearchOptions {};
```

### Available Search Modes

| Mode | Description |
|------|-------------|
| `SearchMode::hybrid` | Default multi-mode fusion |
| `SearchMode::bm25` | BM25 probabilistic ranking (TF, IDF, length normalization) |
| `SearchMode::semantic` | Vector similarity via embeddings (requires AI library) |
| `SearchMode::exact` | Normalized substring matching |
| `SearchMode::fuzzy` | Levenshtein distance matching |
| `SearchMode::boolean` | AND/OR/NOT with parentheses |
| `SearchMode::proximity` | Two-term distance scoring |
| `SearchMode::phrase` | Exact sequence matching |
| `SearchMode::prefix` | Prefix term matching |
| `SearchMode::wildcard` | Wildcard pattern matching |
| `SearchMode::span` | Positional span queries (NEAR/ONEAR/FIRST) |
| `SearchMode::dfr` | Divergence From Randomness scoring |
| `SearchMode::lm_dirichlet` | Language Model with Dirichlet smoothing |
| `SearchMode::phonetic` | Phonetic matching (Double Metaphone) |
| `SearchMode::quorum` | Minimum-should-match queries |

### Configuration Parameters

| Parameter | Default | Description |
|-----------|---------|-------------|
| `fusion.method` | `FusionMethod::rrf` | Fusion method: `rrf` or `linear` |
| `fusion.normalization` | `Normalization::minmax` | Score normalization for linear fusion: `minmax` or `zscore` |
| `fusion.weights[bm25]` | `0.4` | Weight for BM25 mode in fusion |
| `fusion.weights[exact]` | `0.3` | Weight for exact mode in fusion |
| `fusion.weights[fuzzy]` | `0.2` | Weight for fuzzy mode in fusion |
| `fusion.weights[semantic]` | `0.6` | Weight for semantic mode in fusion |
| `SearchOptions.minScore` | `0.0` | Minimum score threshold for result filtering |
| `SearchOptions.modes` | `null` (= BM25 + exact + fuzzy + semantic) | Array of search modes to run |
