# Architecture

Text Search for GreyCat is a full-text search engine built on GreyCat's persistent graph database. It combines **inverted index** techniques from classical information retrieval with GreyCat's native graph storage primitives (`node`, `nodeIndex`, `nodeList`) to deliver ranked search across 15 modes without any external dependencies.

## Overview

### Core Concepts

- **Inverted Index** -- Maps each normalized term to the set of documents containing it, with term frequency counts. This is the backbone of BM25, boolean, phrase, and prefix search.
- **Bidirectional Links** -- Every `NormalizedTerm` links to its `IndexEntry` documents via compact posting arrays (`postingEntries`), and every `IndexEntry` links back to its `NormalizedTerm`s via compact forward index arrays (`entryTermNodes`/`entryTermTFs`). This enables both term-to-doc lookups (search) and doc-to-term lookups (More Like This, MMR similarity).
- **Positional Index** -- Each document stores the exact positions where each term appears via packed position arrays (`positionData`, `positionOffsets`, `positionCounts`) -- a flat concatenated array with per-term offsets instead of nested nodeIndex. This powers phrase search, proximity search, and span queries (NEAR/ONEAR/FIRST).
- **Vector Index** -- Optional AI embeddings stored in a `VectorIndex` for semantic similarity search. Supports full-document and chunk-based embeddings with configurable chunking strategies.
- **Term Variants** -- The `NormalizedTerm` type stores the casefolded, stemmed, accent-stripped form used for matching. The original surface form is tracked at the tokenizer level via the `TokenInfo.original` field rather than as a separate graph type, keeping the data model compact.

### Design Principles

- **Precomputation over runtime cost** -- BM25 scoring uses a 256-bucket TF cache built during `build()`, enabling O(1) TF normalization lookup per document. Compact posting arrays with fieldnorm bucket IDs avoid per-document computation at query time.
- **Memory deduplication** -- All text is stored as `node<String>`, which GreyCat automatically deduplicates. Identical strings share a single graph node.
- **Hot paths in C** -- Character classification, string manipulation, accent stripping, BM25 scoring, DFR/LM-Dirichlet scoring, tokenization, Porter stemming, phonetic encoding, phrase/proximity/span positional verification, regex matching, and decay functions are implemented as native C functions for performance.
- **Cache once, use everywhere** -- Character maps, punctuation sets, combining marks, and HTML entity maps are built once per index lifetime and passed to every operation.

## Data Model

The graph structure centers on `TextIndex<T>`, which owns two primary indices and several auxiliary structures. Here is the complete type hierarchy:

```
// Graph structure: TextIndex owns two primary nodeIndex maps

TextIndex<T>
  // Configuration
  +-- config:           TextIndexConfig
  |
  // Primary indices
  +-- entries:          nodeIndex<String, node<IndexEntry>>?
  +-- normalizedTerms:  nodeIndex<String, node<NormalizedTerm>>?
  |
  // Statistics
  +-- totalEntries:     int?
  +-- totalTokens:      int?
  +-- totalTerms:       int?
  +-- avgTokenCount:    float?
  +-- built:            bool?
  +-- dirtyCount:       int?
  |
  // Auxiliary indices
  +-- contentHashes:    nodeIndex<String, node<IndexEntry>>?
  +-- trigramIndex:     nodeIndex<String, node<TrigramPostings>>?
  +-- edgeNgramTerms:   nodeIndex<String, node<NormalizedTerm>>?
  +-- resolvedFields:   Array<FieldConfig>?          // lazy cache: explicit config.fields, or auto-discovered String fields on T at first add_fields()
  +-- vectorIndex:      node<VectorIndex<node<IndexEntry>>>?
  +-- phoneticIndex:    nodeIndex<String, node<PhoneticPostings>>?
  +-- trieRoot:         node<TrieNode>?                    // trie for prefix/wildcard search
  +-- reverseTrieRoot:  node<TrieNode>?                    // reverse trie for leading wildcard search
  |
  // Cached maps (built once, reused)
  +-- cachedCharToChar:       Map<char, String>?
  +-- cachedCharToString:     Map<char, String>?
  +-- cachedPunctuationChars: Map<char, bool>?
  +-- cachedCombiningMarks:   Map<char, bool>?
  +-- cachedStopWordMap:      Map<String, bool>?
  +-- cachedTFCache:          Array<float>?                 // 256-entry BM25 TF normalization cache
  +-- normalizedSynonyms:     Map<String, Array<String>>?

IndexEntry
  +-- value:           any?               // stored document value (type T or node<T>)
  +-- normalizedText:  String             // normalized text (plain String)
  +-- contentHash:     String?            // SHA-256 for dedup
  +-- tokenCount:      int                // document length in tokens
  +-- entryTermNodes:  Array<node<NormalizedTerm>>?  // compact forward index: term refs
  +-- entryTermTFs:    Array<int>?                   // compact forward index: TFs (parallel)
  +-- positionData:    Array<int>?                   // packed positions: all positions flat
  +-- positionOffsets: Array<int>?                   // packed positions: start offset per term
  +-- positionCounts:  Array<int>?                   // packed positions: count per term
  +-- chunks:          nodeList<node<IndexChunk>>?
  +-- vector:          node<Tensor>?

NormalizedTerm extends Term
  +-- text:              node<String>     // deduplicated term string
  +-- originalForm:      String?          // best original (pre-stemmed) form
  +-- idf:               float?           // inverse document frequency
  +-- maxTermScore:      float?           // WAND upper bound
  +-- postingEntries:    Array<node<IndexEntry>>?  // compact posting list
  +-- postingTFs:        Array<int>?               // term frequencies (parallel)
  +-- postingDocLens:    Array<float>?             // document lengths (parallel)
  +-- postingFieldnormIds: Array<int>?             // fieldnorm buckets (parallel)
  +-- postingBlockMaxScores: Array<float>?         // block WAND scores
```

### Why This Design?

#### Bidirectional Links

Every `NormalizedTerm` has compact posting arrays (`postingEntries`) pointing to the documents that contain it (term -> doc). Every `IndexEntry` has compact forward index arrays (`entryTermNodes`/`entryTermTFs`) pointing back to the terms it contains (doc -> term). This bidirectionality enables:

- **Search**: given a query term, find all documents (forward lookup)
- **More Like This**: given a document, find its terms, pick the top TF-IDF terms, then search for similar documents
- **MMR diversity**: compute Jaccard similarity between two documents by intersecting their term sets
- **Explain**: given a document and query, compute per-term BM25 contribution

#### Normalized Terms and Original Forms

`NormalizedTerm` holds the casefolded, stemmed, accent-stripped form used for matching. The original surface form is tracked at the tokenizer level via `TokenInfo`, which has both `text` (normalized) and `original` (surface form) fields. This means search is accent-insensitive and case-insensitive, while highlighting and display use the original characters -- without requiring a separate `OriginalTerm` graph type.

#### Positional Index

Packed position arrays store all positions in a single flat `Array<int>` (`positionData`) with per-term offset (`positionOffsets`) and count (`positionCounts`) arrays. This avoids the overhead of nested `Array<Array<int>>` which cannot be persisted in graph nodes. Without pre-stored positions, phrase search would require re-tokenizing documents at query time. With positions pre-stored, phrase matching is a two-pointer merge on sorted integer arrays.

#### Hybrid Storage: nodeIndex and Compact Arrays

GreyCat's `nodeIndex<K, V>` is a persistent sorted map with O(log n) lookup, insertion, and iteration. It serves as the primary index for top-level lookups (`entries`, `normalizedTerms`). For posting lists and forward indices, compact parallel arrays (`postingEntries`/`postingTFs`, `entryTermNodes`/`entryTermTFs`) provide better cache locality and lower overhead than nested nodeIndex structures. This hybrid approach uses nodeIndex where flexible key-based lookup is needed, and flat arrays where sequential access and compact storage matter most.

### Core Graph Types

#### IndexChunk

Chunk record used by chunk-level semantic search. Populated by `add()` when `TextIndexConfig.chunking.strategy != none` and `config.embed != null`. The chunker (`TextChunker::chunk`) splits the original text, each chunk is embedded individually, and the vector is pushed into the index's `chunkVectorIndex` (a `VectorIndex<node<IndexChunk>>`). At query time, `search_semantic` over-fetches chunks and deduplicates to the best-scoring chunk per parent document.

```
IndexChunk
  content:    node<String>
  parent:     node<IndexEntry>
  vector:     node<Tensor>?
  tokenCount: int
  position:   int
```

#### TrigramPostings

Used for fast fuzzy pre-filtering.

```
TrigramPostings
  terms: Array<node<NormalizedTerm>>
```

#### PhoneticPostings

Posting list for phonetic codes, mapping each code to term texts sharing that pronunciation.

```
PhoneticPostings
  terms: Array<String>
```

#### TrieNode

Trie (prefix tree) node for O(prefix_len + matches) prefix/wildcard search.

```
TrieNode
  children:   Map<int, node<TrieNode>>?  // codepoint → child
  terms:      Array<node<NormalizedTerm>>?
  isTerminal: bool?                       // nullable for ABI safety; treat null as false
```

#### TextIndexConfig

Configuration type with standalone fields and nested option blocks controlling all aspects of indexing and search.

```
TextIndexConfig
  // standalone fields
  embed | synonyms | fields
  deduplicateContent | fuzzyMaxTextLength | usePhonetic
  // nested option blocks
  tokenization | stopWords | bm25 | fusion
  typoTolerance | edgeNgram | shortCircuit | diversify
  chunking | dfr | lmDirichlet | highlight
```

#### SearchOptions

Per-query overrides for configuration parameters.

```
SearchOptions  (@volatile)
  modes | weights | fusionMethod | normalization | rrf_k
  fuzzy | phrase | proximity         (nested options)
  typoTolerance | minScore | diversify
  offset | filter | termBoosts
```

### Relationships

| From | To | Relationship | Description |
|------|----|--------------|-------------|
| TextIndex | IndexEntry | `entries` | Ownership -- all indexed documents |
| TextIndex | NormalizedTerm | `normalizedTerms` | Ownership -- term vocabulary |
| TextIndex | TextIndexConfig | `config` | Configuration reference |
| TextIndex | TrigramPostings | `trigramIndex` | Fuzzy search trigrams |
| IndexEntry | NormalizedTerm | `entryTermNodes` | Term references (Array\<node\<NormalizedTerm\>\>) |
| IndexEntry | NormalizedTerm | `positionData/Offsets/Counts` | Positional offsets (Array\<int\>) |
| IndexEntry | IndexChunk | `chunks` | Document chunks for semantic search |
| IndexChunk | IndexEntry | `parent` | Back-reference to parent document |
| NormalizedTerm | IndexEntry | `postingEntries` | Posting list (Array\<node\<IndexEntry\>\>) |

### Volatile Result Types

All result types are annotated `@volatile` (not persisted to the graph).

#### TextResult

Unified search result returned by every public search method on `TextIndex<T>` (`search`, `search_bm25`, `search_semantic`, `search_exact`, `search_fuzzy`, `search_boolean`, `search_proximity`, `search_phrase`, `search_prefix`, `search_wildcard`, `search_span`, `search_dfr`, `search_lm_dirichlet`, `search_phonetic`, `search_quorum`, `more_like_this`, ...). Each engine produces this same shape; per-mode score breakdowns are merged into the single `score` value during fusion.

| Field | Type | Notes |
|-------|------|-------|
| `key` | `String` | Document key |
| `value` | `any?` | Stored document value (type T) |
| `score` | `float` | Relevance score (higher = more relevant) |
| `matchedTerms` | `Array<String>?` | Query terms that matched in this document |
| `chunkKey` | `String?` | `"<key>#<position>"` when chunk-level semantic search is enabled (config.chunking.strategy != none + config.embed != null). Null on the document-level path. |

#### BM25Result

Internal scoring record used by `BM25Engine` (and by helper engines that piggyback on the BM25 scorer such as Wildcard, Prefix, Boolean, and Quorum) before results are mapped to `TextResult` for the public API. Defined in `engine/bm25_engine.gcl`.

| Field | Type |
|-------|------|
| `key` | `String` |
| `value` | `any?` |
| `score` | `float` |
| `matchedTerms` | `Array<String>` |

#### ScoreExplanation

BM25 scoring debug breakdown with per-term IDF, TF, and contribution details.

| Field | Type |
|-------|------|
| `totalScore` | `float` |
| `terms` | `Array<TermExplanation>` |
| `variant` | `BM25Variant` |
| `k1` | `float` |
| `b` | `float` |
| `docLen` | `int` |
| `avgDocLen` | `float` |

#### TextIndexStats

Index statistics summary: corpus size, vocabulary size, average length.

| Field | Type |
|-------|------|
| `totalEntries` | `int` |
| `totalTerms` | `int` |
| `avgTokenCount` | `float` |

#### TextEntry

Batch input entry for `add_batch()` with optional pre-computed vector.

| Field | Type |
|-------|------|
| `key` | `String` |
| `value` | `any?` |
| `vector` | `Tensor?` |

### Enumerations

#### StopWordMode

Stop word handling strategy for the indexing pipeline.

Values: `none`, `auto`, `default`, `custom`

#### BM25Variant

BM25 IDF scoring formula variant selection.

Values: `lucene`, `plus`, `bm25l`, `atire`, `robertson`

#### FusionMethod

Score fusion algorithm for combining multi-mode results.

Values: `rrf`, `linear`

#### Normalization

Score normalization methods for linear fusion.

Values: `minmax`, `zscore`

#### ChunkStrategy

Text chunking strategies for semantic search and RAG.

Values: `none`, `fixed`, `sentence`, `paragraph`, `recursive`

#### SearchMode

Available search strategies for querying the index. 15 modes total.

Values: `hybrid`, `bm25`, `semantic`, `exact`, `fuzzy`, `boolean`, `proximity`, `phrase`, `prefix`, `wildcard`, `span`, `dfr`, `lm_dirichlet`, `phonetic`, `quorum`

#### TextSearchLanguage

Supported languages for stop words and text processing. 33 languages.

Values: `ar`, `bg`, `ca`, `cs`, `da`, `de`, `el`, `en`, `es`, `fa`, `fi`, `fr`, `gu`, `he`, `hi`, `hu`, `id`, `it`, `ja`, `ko`, `ms`, `nl`, `no`, `pl`, `pt`, `ro`, `ru`, `sk`, `sv`, `tr`, `uk`, `vi`, `zh`

#### BooleanOperator

Boolean query operators for the recursive descent parser.

Values: `AND`, `OR`, `NOT`, `WEAKAND`

## Indexing Pipeline

When you call `index.add(key, value)`, the text goes through a multi-step normalization and indexing pipeline. Each step transforms the text closer to its canonical form for matching.

### Indexing Pipeline Diagram

From raw text input to a fully searchable inverted index with compact posting arrays and TF cache. Every hot-path operation runs in native C.

#### Phase 1 -- Ingestion

**01. Document Input**

**Methods:** `add(key, value)` / `add_batch()` / `add_fields()`

Raw text enters the pipeline via `TextIndex.add(key, value)`. For structured documents, `add_fields(value: T)` accepts a typed document; per-field text is read off `value` via the `FieldConfig.f` typed field references (auto-discovered from `T` if `config.fields` is null). The first `String` of each configured field is concatenated to drive normalization/indexing, and at query time `search_bm25_f`, `search_faceted`, `AggregationEngine`, and `FunctionScoreEngine` read fields directly off `IndexEntry.value` via the same field refs — there is no separate "field data" side index. `add_batch()` accepts `Array<TextEntry>`.

**Related types:** `TextIndex<T>`, `TextEntry`, `FieldConfig`

**02. Content Deduplication**

**Method:** SHA-256 hash

When `config.deduplicateContent` is set, `Crypto::sha256hex(key)` is computed and looked up in `contentHashes`; an existing hit short-circuits the rest of the pipeline.

#### Phase 2 -- Normalization

**03. Advanced Normalization**

**Method:** `TextNormalizer.normalize_advanced_with_maps()` (companion to `normalize()` / `normalize_with_maps()`)

Strips HTML tags, removes URLs and email addresses, normalizes quotes and dashes, removes control characters, limits repeating chars. Configurable via `NormOptions`. Single fused pass for quotes + control + repeat.

**Native:** `text_normalizer.c` (`strip_html_tags`, `strip_urls`, fused quote/control/repeat pass)

**04. Character Mapping**

**Method:** `CharMap.apply()`

170+ Unicode-to-ASCII character mappings: em/en dashes to hyphens, ligatures, fullwidth to ASCII, smart quotes, mathematical symbols, Cyrillic/Greek look-alikes. Custom mappings via `TextIndexConfig.tokenization.charMap`. Built once via `ensure_cached_maps()` and reused for every document and query. Source: `util/char_map.gcl`, native init in `char_map_native.c`.

**05. Case Folding**

**Method:** `nfkd_casefold()`

NFKD Unicode decomposition followed by lowercasing. Accent stripping maps diacritics to base characters (e.g., cafe -> cafe). Controlled by `caseFold` config flag.

**Native:** `text_normalizer.c` -- NFKD, lowercase, accent strip

**06. Whitespace Collapse**

**Method:** `StringUtils.collapse_whitespace()`

All runs of whitespace (spaces, tabs, newlines) collapsed to a single space. Leading and trailing whitespace trimmed. Single-pass O(N) in C.

**Native:** `string_utils.c`

#### Phase 3 -- Tokenization

**07. Tokenize & Position Track**

**Methods:** `TextTokenizer.tokenize()`, `tokenize_normalized()`, `tokenize_with_maps()`, `tokenize_with_originals()`

Splits normalized text into terms, returning `Array<TokenInfo>` (with positional offsets baked into each `TokenInfo`) or `Map<String, TermFrequency>`. Strips punctuation, filters by min/max term length, checks `has_alphanumeric()`, optionally removes numeric tokens. Recorded positional offsets enable phrase search, proximity search, slop matching, and span queries.

**Native:** `string_utils.c` -- split on separators, strip punctuation, length filter, numeric filter, positional offsets

**08. Stemming**

**Method:** `PorterStemmer.stem()`

Optional Porter stemmer reduces terms to root forms: "running" -> "run", "connections" -> "connect", "computational" -> "comput". 5-step suffix stripping. Improves recall at the cost of some precision.

**Native:** `porter_stemmer.c` -- Optional, 5-step suffix stripping

**09. Stop Word Filtering**

**Method:** `StopWords::init_stop_words_map()` populates a `Map<String, bool>` looked up via `StopWords::is_stop_word_map()`. Auto-detection uses `StopWords::detect_auto_stop_words_map()`.

Removes high-frequency function words. Checked against the original pre-stemmed form, not the stemmed form. Four modes (`StopWordMode`): `none`, `default` (33 languages), `custom`, `auto` (df / N >= `stopWords.autoThreshold`, default 0.85, computed at `build()`).

#### Phase 4 -- Index Construction

**10. Inverted Index**

**Structure:** `nodeIndex<String, node<NormalizedTerm>>`

For each unique term: get-or-create `NormalizedTerm`, append the entry to its forward index (`entryTermNodes` + `entryTermTFs` parallel arrays plus `termIndexMap` for O(1) lookup), and append packed positions (`positionData` + `positionOffsets` + `positionCounts`). Term text is stored as `node<String>`, sharing storage across documents.

**11. Semantic Indexing**

**Method:** `VectorIndex.add()`

When `config.embed` is set, text is chunked per `ChunkStrategy` (fixed / sentence / paragraph / recursive, with `chunking.size` and `chunking.overlap`), each chunk is embedded via the supplied function, and vectors are appended to `vectorIndex` (a `node<VectorIndex<node<IndexEntry>>>`).

#### Phase 5 -- Build (IDF + Precompute)

**12. IDF Computation & BM25 Precompute**

**Methods:** `build()`

Computes average document length. Auto-detects stop words if `StopWordMode::auto`. Pre-normalizes synonyms. Iterates all terms, computes IDF using the configured variant, then **builds compact posting arrays** (`postingEntries`, `postingTFs`, `postingFieldnormIds`) and a **256-bucket TF normalization cache** for O(1) per-document scoring at query time. Also computes `maxTermScore` per term and block-level max scores (`postingBlockMaxScores`) for WAND pruning. Builds trigram index for fuzzy pre-filtering and trie/reverse-trie for prefix/wildcard search.

**IDF Formulas**

| Variant | Formula | Notes |
|---------|---------|-------|
| Lucene (default) | `log(1 + (N - df + 0.5) / (df + 0.5))` | Always positive, even for high-frequency terms |
| BM25+ | `log((N + 1) / df)` | Adds delta to prevent zero TF contribution |
| BM25L | `log((N + 1) / (df + 0.5))` | Better for long documents |
| ATIRE | `log(N / df)` | Simplest variant |
| Robertson | `log((N - df + 0.5) / (df + 0.5))` | Original BM25 -- can go negative |

**Native:** `bm25_scorer.c` -- TF cache, compact posting arrays, auto stop words, trigramIndex, WAND maxTermScore

#### Legend

| Tag | Meaning |
|-----|---------|
| **C native** | Implemented in native C for performance |
| **Core type** | Fundamental data model type |
| **Data structure** | Index or storage structure |
| **Algorithm** | Algorithmic component |
| **Config-dependent** | Behavior controlled by configuration |
| **Optional step** | Only runs when enabled |
| **AI / Embedding** | Requires AI model |

### Build Phase Detail

`build()` walks the vocabulary once and materialises the structures that turn the raw graph into a queryable index. Steps:

1. **Average doc length:** `avgTokenCount = totalTokens / totalEntries`, used by BM25 length normalization.
2. **Auto stop words:** when `StopWordMode::auto`, terms with `df / totalEntries >= autoThreshold` (default 0.85) join `cachedStopWordMap` (`StopWords::detect_auto_stop_words_map`).
3. **IDF per term**, using the variant selected by `config.bm25.variant` (formulas in Phase 5 above).
4. **Compact posting arrays** on every `NormalizedTerm`: `postingEntries`, `postingTFs`, `postingDocLens`, `postingFieldnormIds`, sorted by entry ref.
5. **256-bucket TF cache** (`cachedTFCache`, Tantivy-style log length encoding): each entry maps a fieldnorm bucket id to the precomputed denominator `k1 * (1 - b + b * decoded_length / avgDocLen)`. TF normalization at query time is one array load.
6. **WAND upper bounds:** `maxTermScore` per term plus per-64-entry `postingBlockMaxScores` enable both global and block-level pruning during `search_compact`.
7. **Trigram index** (`trigramIndex: nodeIndex<String, node<TrigramPostings>>`) when `config.typoTolerance.enabled`. Built once, queried per fuzzy search to narrow Levenshtein candidates from O(V) to ~1-10% of V.
8. **Edge n-gram index** (`edgeNgramTerms`) when `config.edgeNgram.enabled`: prefix lengths `[edgeNgram.min, edgeNgram.max]` of each term land in a `nodeIndex` for O(1) prefix lookup.
9. **Trie + reverse trie** (`trieRoot`, `reverseTrieRoot`) over the normalized vocabulary for `search_prefix` (O(prefix_len + matches)) and leading-wildcard `search_wildcard`.
10. **Phonetic index** (`build_phonetic_index`) when `config.usePhonetic`: maps Double Metaphone codes to term strings.
11. **Synonym pre-normalization** (`normalizedSynonyms`): synonyms are normalized once so query-time expansion is a map lookup.

## Score Fusion

Hybrid search combines per-mode result lists into one ranking. Worked examples and configuration knobs live in [search-modes.md > Hybrid Search](./search-modes.md#hybrid-search); this section records only the formulas the engines implement.

### Reciprocal Rank Fusion (RRF)

Rank-based, independent of raw score magnitude. Implemented in `scoring/fusion.gcl` (`Fusion::rrf`).

```
score(d) = SUM_i weight_i / (k + rank_i(d))

// k = config.fusion.rrf.k (default 60)
// rank_i(d) = 1-indexed rank of d in mode i's result list; omitted if d is absent
```

When `config.fusion.rrf.topRankBonus` is set (default), rank-1 hits get `topBonus` (default +0.05) and ranks within `nearTopCutoff` get `nearTopBonus` (default +0.02).

### Linear Fusion

Score-based, requires per-mode normalization first. Native normalizers live in `fusion_accel.c` (`normalize_min_max`, `normalize_z_score`) and operate in-place over `Array<BM25Result>`.

```
score(d) = SUM_i weight_i * normalize(score_i(d))

minmax(s) = (s - min) / (max - min)        // collapses to 0.5 when range = 0
zscore(s) = (s - mean) / stddev            // collapses to 0 when stddev = 0
```

### Search Architecture Diagram

15 search modes (including hybrid). Each non-hybrid mode is an independent engine. Hybrid search fuses multiple signals via RRF or linear combination, with optional MMR diversity re-ranking.

<svg width="1050" height="640" viewBox="0 0 1050 640" fill="none" xmlns="http://www.w3.org/2000/svg">
  <defs>
    <pattern id="grid" width="24" height="24" patternUnits="userSpaceOnUse">
      <path d="M 24 0 L 0 0 0 24" fill="none" stroke="#1e1e24" stroke-width="0.5"/>
    </pattern>
    <marker id="arrow" markerWidth="8" markerHeight="6" refX="8" refY="3" orient="auto">
      <path d="M0,0 L8,3 L0,6" fill="none" stroke="#4a4a52" stroke-width="1.2"/>
    </marker>
  </defs>
  <rect width="1050" height="640" fill="#0c0c0f"/>
  <rect width="1050" height="640" fill="url(#grid)"/>

  <!-- Query Input -->
  <rect x="30" y="270" width="140" height="52" rx="6" fill="#1e1e24" stroke="#c4f562" stroke-width="1.5"/>
  <text x="100" y="292" text-anchor="middle" fill="#c4f562" font-size="11" font-weight="500">QUERY INPUT</text>
  <text x="100" y="308" text-anchor="middle" fill="#8a8a8e" font-size="10">"quick brown fox"</text>

  <!-- Query Processing -->
  <rect x="210" y="262" width="130" height="68" rx="6" fill="#16161a" stroke="#2a2a32" stroke-width="1"/>
  <text x="275" y="284" text-anchor="middle" fill="#e8e6e3" font-size="11" font-weight="500">Query Processing</text>
  <text x="275" y="300" text-anchor="middle" fill="#8a8a8e" font-size="9">normalize</text>
  <text x="275" y="312" text-anchor="middle" fill="#8a8a8e" font-size="9">tokenize + expand</text>
  <text x="275" y="324" text-anchor="middle" fill="#8a8a8e" font-size="9">synonyms</text>

  <line x1="170" y1="296" x2="207" y2="296" stroke="#4a4a52" stroke-width="1.2" marker-end="url(#arrow)"/>

  <!-- ENGINE BOXES -->
  <rect x="400" y="16" width="200" height="40" rx="5" fill="#16161a" stroke="#c4f562" stroke-width="1"/>
  <text x="500" y="32" text-anchor="middle" fill="#c4f562" font-size="11" font-weight="500">BM25 Engine</text>
  <text x="500" y="46" text-anchor="middle" fill="#8a8a8e" font-size="9">precomputed scores / 5 variants / WAND</text>

  <rect x="400" y="68" width="200" height="40" rx="5" fill="#16161a" stroke="#c4f562" stroke-width="1"/>
  <text x="500" y="84" text-anchor="middle" fill="#c4f562" font-size="11" font-weight="500">BM25F Engine</text>
  <text x="500" y="98" text-anchor="middle" fill="#8a8a8e" font-size="9">multi-field weighted scoring</text>

  <rect x="400" y="120" width="200" height="40" rx="5" fill="#16161a" stroke="#f562b0" stroke-width="1"/>
  <text x="500" y="136" text-anchor="middle" fill="#f562b0" font-size="11" font-weight="500">Semantic Engine</text>
  <text x="500" y="150" text-anchor="middle" fill="#8a8a8e" font-size="9">vector similarity (AI embed)</text>

  <rect x="400" y="172" width="200" height="40" rx="5" fill="#16161a" stroke="#62b0f5" stroke-width="1"/>
  <text x="500" y="188" text-anchor="middle" fill="#62b0f5" font-size="11" font-weight="500">Exact Engine</text>
  <text x="500" y="202" text-anchor="middle" fill="#8a8a8e" font-size="9">normalized substring match</text>

  <rect x="400" y="224" width="200" height="40" rx="5" fill="#16161a" stroke="#f5a262" stroke-width="1"/>
  <text x="500" y="240" text-anchor="middle" fill="#f5a262" font-size="11" font-weight="500">Fuzzy Engine</text>
  <text x="500" y="254" text-anchor="middle" fill="#8a8a8e" font-size="9">Levenshtein / Jaro-Winkler / trigrams</text>

  <rect x="400" y="276" width="200" height="40" rx="5" fill="#16161a" stroke="#f56262" stroke-width="1"/>
  <text x="500" y="292" text-anchor="middle" fill="#f56262" font-size="11" font-weight="500">Boolean Engine</text>
  <text x="500" y="306" text-anchor="middle" fill="#8a8a8e" font-size="9">AND / OR / NOT / AST tree eval</text>

  <rect x="400" y="328" width="200" height="40" rx="5" fill="#16161a" stroke="#62e5d0" stroke-width="1"/>
  <text x="500" y="344" text-anchor="middle" fill="#62e5d0" font-size="11" font-weight="500">Proximity Engine</text>
  <text x="500" y="358" text-anchor="middle" fill="#8a8a8e" font-size="9">term distance scoring</text>

  <rect x="400" y="380" width="200" height="40" rx="5" fill="#16161a" stroke="#b08af5" stroke-width="1"/>
  <text x="500" y="396" text-anchor="middle" fill="#b08af5" font-size="11" font-weight="500">Phrase Engine</text>
  <text x="500" y="410" text-anchor="middle" fill="#8a8a8e" font-size="9">positional sequence + slop</text>

  <rect x="400" y="432" width="200" height="40" rx="5" fill="#16161a" stroke="#a8e662" stroke-width="1"/>
  <text x="500" y="448" text-anchor="middle" fill="#a8e662" font-size="11" font-weight="500">Prefix Engine</text>
  <text x="500" y="462" text-anchor="middle" fill="#8a8a8e" font-size="9">prefix match / suggest / edge n-gram</text>

  <rect x="400" y="484" width="200" height="40" rx="5" fill="#16161a" stroke="#f5e262" stroke-width="1"/>
  <text x="500" y="500" text-anchor="middle" fill="#f5e262" font-size="11" font-weight="500">Wildcard Engine</text>
  <text x="500" y="514" text-anchor="middle" fill="#8a8a8e" font-size="9">* and ? on vocabulary</text>

  <rect x="400" y="536" width="200" height="40" rx="5" fill="#16161a" stroke="#8a8af5" stroke-width="1"/>
  <text x="500" y="552" text-anchor="middle" fill="#8a8af5" font-size="11" font-weight="500">Span Engine</text>
  <text x="500" y="566" text-anchor="middle" fill="#8a8a8e" font-size="9">NEAR / ONEAR / FIRST</text>

  <rect x="400" y="588" width="200" height="40" rx="5" fill="#16161a" stroke="#62b0f5" stroke-width="1"/>
  <text x="500" y="604" text-anchor="middle" fill="#62b0f5" font-size="11" font-weight="500">More Like This</text>
  <text x="500" y="618" text-anchor="middle" fill="#8a8a8e" font-size="9">top TF-IDF term extraction</text>

  <!-- Fan-out lines -->
  <line x1="340" y1="285" x2="397" y2="36"  stroke="#2a2a32" stroke-width="0.8"/>
  <line x1="340" y1="287" x2="397" y2="88"  stroke="#2a2a32" stroke-width="0.8"/>
  <line x1="340" y1="289" x2="397" y2="140" stroke="#2a2a32" stroke-width="0.8"/>
  <line x1="340" y1="291" x2="397" y2="192" stroke="#2a2a32" stroke-width="0.8"/>
  <line x1="340" y1="293" x2="397" y2="244" stroke="#2a2a32" stroke-width="0.8"/>
  <line x1="340" y1="295" x2="397" y2="296" stroke="#2a2a32" stroke-width="0.8"/>
  <line x1="340" y1="297" x2="397" y2="348" stroke="#2a2a32" stroke-width="0.8"/>
  <line x1="340" y1="299" x2="397" y2="400" stroke="#2a2a32" stroke-width="0.8"/>
  <line x1="340" y1="301" x2="397" y2="452" stroke="#2a2a32" stroke-width="0.8"/>
  <line x1="340" y1="303" x2="397" y2="504" stroke="#2a2a32" stroke-width="0.8"/>
  <line x1="340" y1="305" x2="397" y2="556" stroke="#2a2a32" stroke-width="0.8"/>
  <line x1="340" y1="307" x2="397" y2="608" stroke="#2a2a32" stroke-width="0.8"/>

  <!-- FUSION BOX -->
  <rect x="660" y="220" width="140" height="130" rx="6" fill="#16161a" stroke="#c4f562" stroke-width="1.5"/>
  <text x="730" y="248" text-anchor="middle" fill="#c4f562" font-size="12" font-weight="500">Score Fusion</text>
  <text x="730" y="268" text-anchor="middle" fill="#8a8a8e" font-size="9">RRF or Linear</text>
  <line x1="680" y1="280" x2="780" y2="280" stroke="#2a2a32" stroke-width="0.5"/>
  <text x="730" y="298" text-anchor="middle" fill="#8a8a8e" font-size="9">fusion.weights[bm25]: 0.4</text>
  <text x="730" y="312" text-anchor="middle" fill="#8a8a8e" font-size="9">fusion.weights[semantic]: 0.6</text>
  <text x="730" y="326" text-anchor="middle" fill="#8a8a8e" font-size="9">fusion.weights[exact]: 0.3</text>
  <text x="730" y="340" text-anchor="middle" fill="#8a8a8e" font-size="9">fusion.weights[fuzzy]: 0.2</text>

  <!-- Fan-in: engines to fusion -->
  <line x1="600" y1="36"  x2="657" y2="260" stroke="#2a2a32" stroke-width="0.8" stroke-dasharray="4,3"/>
  <line x1="600" y1="140" x2="657" y2="268" stroke="#2a2a32" stroke-width="0.8" stroke-dasharray="4,3"/>
  <line x1="600" y1="192" x2="657" y2="280" stroke="#2a2a32" stroke-width="0.8" stroke-dasharray="4,3"/>
  <line x1="600" y1="244" x2="657" y2="290" stroke="#2a2a32" stroke-width="0.8" stroke-dasharray="4,3"/>
  <line x1="600" y1="348" x2="657" y2="310" stroke="#2a2a32" stroke-width="0.8" stroke-dasharray="4,3"/>

  <!-- POST PROCESSING -->
  <rect x="850" y="220" width="160" height="130" rx="6" fill="#16161a" stroke="#2a2a32" stroke-width="1"/>
  <text x="930" y="248" text-anchor="middle" fill="#e8e6e3" font-size="11" font-weight="500">Post-Processing</text>
  <line x1="870" y1="258" x2="990" y2="258" stroke="#2a2a32" stroke-width="0.5"/>
  <text x="930" y="278" text-anchor="middle" fill="#62e5d0" font-size="10">Score Normalize</text>
  <text x="930" y="296" text-anchor="middle" fill="#8a8a8e" font-size="9">min-max / z-score</text>
  <text x="930" y="316" text-anchor="middle" fill="#b08af5" font-size="10">MMR Diversity</text>
  <text x="930" y="334" text-anchor="middle" fill="#8a8a8e" font-size="9">lambda*relevance - (1-lambda)*similarity</text>

  <line x1="800" y1="285" x2="847" y2="285" stroke="#4a4a52" stroke-width="1.2" marker-end="url(#arrow)"/>

  <!-- OUTPUT -->
  <rect x="880" y="390" width="120" height="52" rx="6" fill="#1e1e24" stroke="#c4f562" stroke-width="1.5"/>
  <text x="940" y="412" text-anchor="middle" fill="#c4f562" font-size="11" font-weight="500">RESULTS</text>
  <text x="940" y="428" text-anchor="middle" fill="#8a8a8e" font-size="10">Array&lt;TextResult&gt;</text>

  <line x1="930" y1="350" x2="930" y2="387" stroke="#4a4a52" stroke-width="1.2" marker-end="url(#arrow)"/>

  <!-- Short circuit -->
  <rect x="700" y="385" width="120" height="34" rx="4" fill="rgba(245,162,98,0.08)" stroke="rgba(245,162,98,0.2)" stroke-width="1"/>
  <text x="760" y="399" text-anchor="middle" fill="#f5a262" font-size="9" font-weight="500">SHORT CIRCUIT</text>
  <text x="760" y="411" text-anchor="middle" fill="#8a8a8e" font-size="8">score &ge; 0.85, gap &ge; 0.15</text>
  <line x1="760" y1="419" x2="877" y2="416" stroke="rgba(245,162,98,0.3)" stroke-width="1" stroke-dasharray="3,3"/>

  <!-- Utilities -->
  <rect x="880" y="460" width="120" height="68" rx="6" fill="#16161a" stroke="#2a2a32" stroke-width="1"/>
  <text x="940" y="480" text-anchor="middle" fill="#8a8a8e" font-size="10">+ Utilities</text>
  <text x="940" y="496" text-anchor="middle" fill="#62b0f5" font-size="9">snippet()</text>
  <text x="940" y="508" text-anchor="middle" fill="#62b0f5" font-size="9">snippets()</text>
  <text x="940" y="520" text-anchor="middle" fill="#62b0f5" font-size="9">explain()</text>
</svg>

The diagram shows the 11 highest-traffic engines plus the BM25F multi-field
variant and More-Like-This. Four additional engines follow the same dispatch
pattern but specialize the scoring: **DFR** (Divergence From Randomness),
**LM-Dirichlet** (language model smoothing), **Phonetic** (Double Metaphone),
and **Quorum** (minimum-should-match). They sit in the same engine layer as
BM25 and reuse the same posting lists; see the per-mode entries in the table
below.

### Search Modes

User-facing semantics, query syntax, and examples for each mode live in [search-modes.md](./search-modes.md). The table below summarises the algorithmic backbone and native helper consumed by each engine.

| Mode | Method | Algorithm / data structure | Native helper |
|------|--------|----------------------------|----------------|
| BM25 | `search_bm25(query, k)` | Compact posting arrays + 256-bucket TF cache + WAND (`maxTermScore`, `postingBlockMaxScores`); 5 IDF variants | `bm25_scorer.c` (`search_compact`) |
| BM25F | `search_bm25_f(query, k)` | Per-field length normalization (`FieldConfig.fieldB`), weighted field-score sum then IDF | `bm25_scorer.c` |
| Semantic | `search_semantic(query, k)` | Nearest-neighbor over `VectorIndex<IndexEntry>` (doc-level) or `VectorIndex<IndexChunk>` (chunk-level, when `config.chunking.strategy != none`) | -- (user-provided embed function) |
| Exact | `search_exact(query, k)` | Normalized substring scan over `IndexEntry.normalizedText` | -- |
| Fuzzy | `search_fuzzy(query, k, options?)` | Banded Levenshtein on keys (`FuzzyMode::key`) or trigram-prefiltered vocabulary scan (`FuzzyMode::term`) via `trigramIndex` | `fuzzy_matcher.c` |
| Boolean | `search_boolean(query, k)` | `BooleanParser` AST + sorted-array set ops (intersect / union / difference) on posting refs | `boolean_accel.c`, `boolean_prefix_accel.c` |
| Proximity | `search_proximity(term1, term2, k, options?)` | Two-pointer min-distance on packed `positionData`; distance from `ProximityOptions.distance` | `proximity_search.c` |
| Phrase | `search_phrase(phrase, k, options?)` | Two-pointer positional verification with `PhraseOptions.slop`; BM25-scored | `phrase_engine.c`, `bm25_scorer.c` (`search_phrase_compact`) |
| Prefix | `search_prefix(prefix, k)` | Trie walk (`trieRoot`) or sorted-vocabulary range via `search_vocab_scan`; optional `edgeNgramTerms` for O(1) | `prefix_engine.c` |
| Wildcard | `search_wildcard(pattern, k)` | Codepoint-level `*` / `?` match (`match_full_pattern_cps`) over vocabulary; reverse trie for leading wildcards | `wildcard_matcher.c` |
| Span | `search_span(query, k)` | `SpanParser` AST + monotonic-pointer span verification (`NEAR` / `ONEAR` / `FIRST`) | `span_engine.c` |
| DFR | `search_dfr(query, k)` | 4 BasicModels x 2 AfterEffects x 4 Normalizations | `dfr_scorer.c` (`search_dfr_compact`) |
| LM-Dirichlet | `search_lm_dirichlet(query, k)` | `log((tf + mu * P(w|C)) / (docLen + mu))` Dirichlet smoothing | `lm_dirichlet_accel.c` (`search_lm_dirichlet_compact`) |
| Phonetic | `search_phonetic(query, k)` | Double Metaphone codes via `phoneticIndex`; requires `config.usePhonetic` | `phonetic_engine.c` |
| Quorum | `search_quorum(query, k, minMatch)` | Posting-list union, filter by matched-term count; BM25 score on the matched subset | `bm25_scorer.c` |

## Optimization & Performance

### Optimization Techniques

| Technique | Where | What it skips / saves |
|-----------|-------|-----------------------|
| Short-circuit | `model/text_index.gcl::dispatch_mode`; gated by `config.shortCircuit` | Skips non-BM25 engines when BM25 already has a clear winner: `s0/(s0+s1) >= minScore` and gap `>= minGap`. |
| Cached char maps | `ensure_cached_maps()`, stored on `TextIndex` | Reuses CharMap, punctuation set, combining marks, smart-quote map, HTML entity map across every `add()` and `search()`. |
| Pre-resolved term nodes | `search_with_term_nodes` / `search_compact` entrypoints | One `normalizedTerms.get` per query term instead of one per scored doc. |
| 256-bucket TF cache | `precompute_tf_cache` (`bm25_scorer.c`) | TF normalization becomes one array load (`cachedTFCache[fieldnormId]`) instead of `k1 * (1 - b + b * len/avgLen)`. |
| Compact postings | `postingEntries`/`postingTFs`/`postingFieldnormIds`/`postingBlockMaxScores` on `NormalizedTerm` | Sequential, cache-friendly scoring; AVX2 batch scoring in `compute_batch_scores`. |
| Compact forward index | `entryTermNodes`/`entryTermTFs` on `IndexEntry`, packed positions | O(1) `get_term_positions` via `termIndexMap`, fast Jaccard for MMR/MoreLikeThis. |
| WAND + block-max | `maxTermScore` (per term) and `postingBlockMaxScores` (per 64-entry block) | Skips documents/blocks whose upper bound cannot beat the current k-th score (theta). |
| MMR diversity | `scoring/mmr.gcl` (`MMR::apply`); Jaccard over `entryTermNodes` | Greedy `lambda * relevance - (1 - lambda) * max_sim` re-ranking. |
| Synonym pre-norm | `normalizedSynonyms` populated at `build()` | Query-time synonym expansion is a map lookup. |
| Content dedup | `contentHashes: nodeIndex<String, node<IndexEntry>>` keyed by `Crypto::sha256hex` | O(log n) duplicate skip before tokenization. |
| Trigram pre-filter | `trigramIndex` + `extract_trigrams` (`string_utils.c`) | Fuzzy candidates reduced from O(V) to ~1-10% of V before Levenshtein. |
| Trie + reverse trie | `trieRoot`, `reverseTrieRoot` built in `build_trie_index` / `build_reverse_trie_index` | O(prefix_len + matches) prefix search; leading-wildcard support without full vocabulary scan. |
| Edge n-gram index | `edgeNgramTerms` when `config.edgeNgram.enabled` | O(1) prefix lookup on a fixed-length range. |

The MMR formula:

```
MMR(d) = lambda * relevance(d) - (1 - lambda) * max_sim(d, selected)
sim(d1, d2) = |terms(d1) intersect terms(d2)| / |terms(d1) union terms(d2)|
```

### Native C Hot Paths

29 C source files implement the performance-critical operations and are wired through `nativegen.c`. Compiled with `-O3`, LTO, and optional SSE2/AVX2/NEON SIMD vectorization:

| Module | Operations | SIMD |
|--------|------------|------|
| `string_utils.c` | Character replace/strip, separator replacement, whitespace collapse, alphanumeric check, word count, trigram extraction | SSE2/AVX2 |
| `text_normalizer.c` | HTML/URL/email stripping, accent removal, quote normalization, entity decoding | -- |
| `bm25_scorer.c` | BM25 scoring (5 variants), batch scoring, IDF, TF cache, `search_compact` (fused multi-term search with WAND), `search_boolean_and` (fused boolean-AND BM25), `search_phrase_compact` (fused phrase search) | AVX2 |
| `porter_stemmer.c` | Porter stemming (single + batch), 7-step suffix stripping | -- |
| `dfr_scorer.c` | DFR scoring (4 models x 2 effects x 4 norms), batch scoring, `search_dfr_compact` (fused DFR search) | AVX2 |
| `lm_dirichlet_accel.c` | Language Model Dirichlet smoothing, batch scoring, `search_lm_dirichlet_compact` (fused LM-Dirichlet search) | AVX2 |
| `phonetic_engine.c` | Double Metaphone encoding (primary + secondary codes) | -- |
| `phrase_engine.c` | Phrase positional verification with slop tolerance | -- |
| `span_engine.c` | NEAR/ONEAR span position checking | -- |
| `proximity_search.c` | Two-pointer minimum distance on sorted position arrays | -- |
| `boolean_accel.c` | Adaptive sorted intersection with galloping search | -- |
| `regex_accel.c` | Iterative NFA regex simulation with greedy backtracking | -- |
| `regex_utils_native.c` | UTF-8↔codepoint conversion with byte-offset tracking, regex match iteration into `RegexMatch` arrays | -- |
| `function_score_accel.c` | Gaussian/Linear/Exponential decay functions, `compute_batch_decay` (batch decay computation) | -- |
| `fusion_accel.c` | Min-max and z-score normalization of `BM25Result` arrays, in-place | -- |
| `fuzzy_matcher.c` | Banded Levenshtein edit distance, typo max-edits policy (Meilisearch-style length thresholds) | -- |
| `char_tables.c` | O(1) lookup tables for character classification (bitsets, hash sets, maps) | -- |
| `char_map_native.c` | Native construction of default char maps, smart-quote map, and combining-marks set without GCL `Map::put` overhead | -- |
| `prefix_engine.c` | Prefix term scoring, autocomplete from terms, trie prefix collection, sorted vocabulary prefix range search | -- |
| `boolean_prefix_accel.c` | Posting list sort by ref, set operations (intersect/union/difference) on node refs, fused boolean query execution | -- |
| `wildcard_matcher.c` | Codepoint-level wildcard pattern matching (`*` and `?`) | -- |
| `ranking_rules_native.c` | qsort-based multi-criteria ranking rules sort (words/typo/proximity/attribute/sort/exactness) | -- |
| `doc_reader_native.c` | `DocReader::resolve_if_node` — unwraps `any?` to its underlying object when it is a `node<T>`, enabling typed multi-field ingestion (`add_fields`) for both `T` and `node<T>` documents | -- |
| `stop_words_native.c` | Native initialization of stop-word maps for all 33 languages (single table, zero per-call GCL allocations) | -- |
| `text_chunker_native.c` | Native fixed/sentence/paragraph/recursive chunking with word-counted spans | -- |
| `text_parser_native.c` | Native section parsing (markdown headings, code blocks, prose) into `ParsedSection` arrays | -- |
| `tokenizer_native.c` | `TokenInfo → Map<String, TermFrequency>` frequency-map construction with hash cache for hot-term reuse | -- |

All C modules use `char_tables.h` for O(1) character classification via precomputed lookup tables. SIMD modules include scalar fallbacks for non-SIMD platforms.

### Query Execution Flow

Trace a hybrid search from query input through normalization, parallel engine execution, score fusion, and post-processing to final ranked results.

<svg width="1050" height="800" viewBox="0 0 1050 800" fill="none" xmlns="http://www.w3.org/2000/svg">
  <!-- Background grid -->
  <defs>
    <pattern id="grid2" width="24" height="24" patternUnits="userSpaceOnUse">
      <path d="M 24 0 L 0 0 0 24" fill="none" stroke="#1e1e24" stroke-width="0.5"/>
    </pattern>
    <marker id="arrow2" markerWidth="8" markerHeight="6" refX="8" refY="3" orient="auto">
      <path d="M0,0 L8,3 L0,6" fill="none" stroke="#4a4a52" stroke-width="1.2"/>
    </marker>
    <marker id="arrow-accent2" markerWidth="8" markerHeight="6" refX="8" refY="3" orient="auto">
      <path d="M0,0 L8,3 L0,6" fill="none" stroke="#c4f562" stroke-width="1.2"/>
    </marker>
    <marker id="arrow-orange2" markerWidth="8" markerHeight="6" refX="8" refY="3" orient="auto">
      <path d="M0,0 L8,3 L0,6" fill="none" stroke="#f5a262" stroke-width="1.2"/>
    </marker>
    <marker id="arrow-teal2" markerWidth="8" markerHeight="6" refX="8" refY="3" orient="auto">
      <path d="M0,0 L8,3 L0,6" fill="none" stroke="#62e5d0" stroke-width="1.2"/>
    </marker>
    <marker id="arrow-purple2" markerWidth="8" markerHeight="6" refX="8" refY="3" orient="auto">
      <path d="M0,0 L8,3 L0,6" fill="none" stroke="#b08af5" stroke-width="1.2"/>
    </marker>
    <marker id="arrow-blue2" markerWidth="8" markerHeight="6" refX="8" refY="3" orient="auto">
      <path d="M0,0 L8,3 L0,6" fill="none" stroke="#62b0f5" stroke-width="1.2"/>
    </marker>
  </defs>
  <rect width="1050" height="800" fill="#0c0c0f"/>
  <rect width="1050" height="800" fill="url(#grid2)"/>

  <!-- COLUMN LABELS -->
  <text x="110" y="24" text-anchor="middle" fill="#5a5a5e" font-size="9" letter-spacing="0.12em">QUERY PROCESSING</text>
  <text x="485" y="24" text-anchor="middle" fill="#5a5a5e" font-size="9" letter-spacing="0.12em">PARALLEL ENGINES</text>
  <text x="850" y="24" text-anchor="middle" fill="#5a5a5e" font-size="9" letter-spacing="0.12em">POST-PROCESSING</text>

  <!-- LEFT COLUMN: QUERY PROCESSING -->

  <!-- Box 1: Query Input -->
  <rect x="30" y="48" width="160" height="56" rx="6" fill="#1e1e24" stroke="#c4f562" stroke-width="1.5"/>
  <text x="110" y="70" text-anchor="middle" fill="#c4f562" font-size="11" font-weight="500">QUERY INPUT</text>
  <text x="110" y="86" text-anchor="middle" fill="#8a8a8e" font-size="9">search(query, k, options?)</text>
  <text x="110" y="98" text-anchor="middle" fill="#62b0f5" font-size="9">"quick brown fox"</text>

  <line x1="110" y1="104" x2="110" y2="126" stroke="#c4f562" stroke-width="1.2" marker-end="url(#arrow-accent2)"/>

  <!-- Box 2: Normalize -->
  <rect x="30" y="130" width="160" height="62" rx="6" fill="#16161a" stroke="#2a2a32" stroke-width="1"/>
  <text x="110" y="150" text-anchor="middle" fill="#e8e6e3" font-size="11" font-weight="500">Normalize</text>
  <text x="110" y="166" text-anchor="middle" fill="#8a8a8e" font-size="9">casefold + charmap</text>
  <text x="110" y="180" text-anchor="middle" fill="#8a8a8e" font-size="9">collapse whitespace</text>

  <line x1="110" y1="192" x2="110" y2="214" stroke="#4a4a52" stroke-width="1.2" marker-end="url(#arrow2)"/>

  <!-- Box 3: Tokenize -->
  <rect x="30" y="218" width="160" height="68" rx="6" fill="#16161a" stroke="#2a2a32" stroke-width="1"/>
  <text x="110" y="238" text-anchor="middle" fill="#e8e6e3" font-size="11" font-weight="500">Tokenize</text>
  <text x="110" y="254" text-anchor="middle" fill="#8a8a8e" font-size="9">split + filter + stem</text>
  <text x="110" y="270" text-anchor="middle" fill="#62b0f5" font-size="9">["quick", "brown", "fox"]</text>
  <text x="110" y="282" text-anchor="middle" fill="#8a8a8e" font-size="8">+ positional offsets</text>

  <line x1="110" y1="286" x2="110" y2="308" stroke="#4a4a52" stroke-width="1.2" marker-end="url(#arrow2)"/>

  <!-- Box 4: Expand Synonyms -->
  <rect x="30" y="312" width="160" height="50" rx="6" fill="#16161a" stroke="#2a2a32" stroke-width="1"/>
  <text x="110" y="334" text-anchor="middle" fill="#e8e6e3" font-size="11" font-weight="500">Expand Synonyms</text>
  <text x="110" y="350" text-anchor="middle" fill="#8a8a8e" font-size="9">add synonym terms</text>

  <line x1="110" y1="362" x2="110" y2="384" stroke="#4a4a52" stroke-width="1.2" marker-end="url(#arrow2)"/>

  <!-- Box 5: Resolve Term Nodes -->
  <rect x="30" y="388" width="160" height="56" rx="6" fill="#16161a" stroke="#2a2a32" stroke-width="1"/>
  <text x="110" y="410" text-anchor="middle" fill="#e8e6e3" font-size="11" font-weight="500">Resolve Term Nodes</text>
  <text x="110" y="426" text-anchor="middle" fill="#8a8a8e" font-size="9">lookup normalizedTerms</text>
  <text x="110" y="438" text-anchor="middle" fill="#8a8a8e" font-size="8">nodeIndex&lt;String, NormalizedTerm&gt;</text>

  <!-- MIDDLE: PARALLEL ENGINE EXECUTION -->

  <!-- BM25 -->
  <rect x="310" y="130" width="200" height="52" rx="5" fill="#16161a" stroke="#c4f562" stroke-width="1"/>
  <text x="410" y="150" text-anchor="middle" fill="#c4f562" font-size="11" font-weight="500">BM25 Engine</text>
  <text x="410" y="166" text-anchor="middle" fill="#8a8a8e" font-size="9">precomputed score lookup</text>
  <text x="410" y="178" text-anchor="middle" fill="#62b0f5" font-size="8">O(1) per term per doc</text>

  <!-- Exact -->
  <rect x="310" y="198" width="200" height="52" rx="5" fill="#16161a" stroke="#62b0f5" stroke-width="1"/>
  <text x="410" y="218" text-anchor="middle" fill="#62b0f5" font-size="11" font-weight="500">Exact Engine</text>
  <text x="410" y="234" text-anchor="middle" fill="#8a8a8e" font-size="9">normalized substring match</text>
  <text x="410" y="246" text-anchor="middle" fill="#8a8a8e" font-size="8">scored by match position</text>

  <!-- Fuzzy -->
  <rect x="310" y="266" width="200" height="52" rx="5" fill="#16161a" stroke="#f5a262" stroke-width="1"/>
  <text x="410" y="286" text-anchor="middle" fill="#f5a262" font-size="11" font-weight="500">Fuzzy Engine</text>
  <text x="410" y="302" text-anchor="middle" fill="#8a8a8e" font-size="9">Levenshtein on keys</text>
  <text x="410" y="314" text-anchor="middle" fill="#8a8a8e" font-size="8">edit distance scoring</text>

  <!-- Semantic (dashed border - optional) -->
  <rect x="310" y="334" width="200" height="52" rx="5" fill="#16161a" stroke="#f562b0" stroke-width="1" stroke-dasharray="5,3"/>
  <text x="410" y="354" text-anchor="middle" fill="#f562b0" font-size="11" font-weight="500">Semantic Engine</text>
  <text x="410" y="370" text-anchor="middle" fill="#8a8a8e" font-size="9">embed query + vector search</text>
  <text x="410" y="382" text-anchor="middle" fill="#62e5d0" font-size="8">optional (requires AI model)</text>

  <!-- Fan-out lines from Resolve to Engines -->
  <path d="M 190 416 C 240 416, 260 156, 307 156" fill="none" stroke="#2a2a32" stroke-width="1"/>
  <path d="M 190 416 C 240 416, 260 224, 307 224" fill="none" stroke="#2a2a32" stroke-width="1"/>
  <path d="M 190 416 C 240 416, 260 292, 307 292" fill="none" stroke="#2a2a32" stroke-width="1"/>
  <path d="M 190 416 C 240 416, 260 360, 307 360" fill="none" stroke="#2a2a32" stroke-width="1" stroke-dasharray="5,3"/>

  <!-- Fan-out label -->
  <text x="248" y="310" text-anchor="middle" fill="#5a5a5e" font-size="8" transform="rotate(-90, 248, 310)">PARALLEL DISPATCH</text>

  <!-- Fan-in lines from Engines to Decision diamond -->
  <line x1="510" y1="156" x2="545" y2="248" stroke="#c4f562" stroke-width="1" opacity="0.6"/>
  <line x1="510" y1="224" x2="545" y2="260" stroke="#62b0f5" stroke-width="1" opacity="0.6"/>
  <line x1="510" y1="292" x2="545" y2="272" stroke="#f5a262" stroke-width="1" opacity="0.6"/>
  <line x1="510" y1="360" x2="545" y2="284" stroke="#f562b0" stroke-width="1" opacity="0.6" stroke-dasharray="5,3"/>

  <!-- DECISION DIAMOND: Short Circuit? -->
  <polygon points="590,266 545,226 590,186 635,226" fill="#1e1e24" stroke="#f5a262" stroke-width="1.2"/>
  <text x="590" y="222" text-anchor="middle" fill="#f5a262" font-size="9" font-weight="500">Short</text>
  <text x="590" y="234" text-anchor="middle" fill="#f5a262" font-size="9" font-weight="500">Circuit?</text>

  <!-- Yes branch -->
  <line x1="635" y1="226" x2="700" y2="226" stroke="#f5a262" stroke-width="1" marker-end="url(#arrow-orange2)"/>
  <text x="667" y="218" text-anchor="middle" fill="#f5a262" font-size="8" font-weight="500">YES</text>

  <rect x="703" y="206" width="140" height="40" rx="4" fill="rgba(245,162,98,0.06)" stroke="rgba(245,162,98,0.2)" stroke-width="1"/>
  <text x="773" y="222" text-anchor="middle" fill="#f5a262" font-size="8" font-weight="500">SKIP TO RESULTS</text>
  <text x="773" y="236" text-anchor="middle" fill="#8a8a8e" font-size="8">score &ge; 0.85, gap &ge; 0.15</text>

  <path d="M 773 246 L 773 280 C 773 300, 850 440, 850 530" fill="none" stroke="rgba(245,162,98,0.3)" stroke-width="1" stroke-dasharray="4,3"/>

  <!-- No branch -->
  <line x1="590" y1="266" x2="590" y2="310" stroke="#4a4a52" stroke-width="1.2" marker-end="url(#arrow2)"/>
  <text x="600" y="290" fill="#8a8a8e" font-size="8">NO</text>

  <!-- RIGHT COLUMN: POST-PROCESSING -->

  <!-- Box: Score Fusion -->
  <rect x="510" y="316" width="160" height="62" rx="6" fill="#16161a" stroke="#b08af5" stroke-width="1.2"/>
  <text x="590" y="338" text-anchor="middle" fill="#b08af5" font-size="11" font-weight="500">Score Fusion</text>
  <text x="590" y="354" text-anchor="middle" fill="#8a8a8e" font-size="9">RRF or Linear weighted</text>
  <text x="590" y="368" text-anchor="middle" fill="#8a8a8e" font-size="8">combine engine rankings</text>

  <line x1="590" y1="378" x2="590" y2="402" stroke="#b08af5" stroke-width="1.2" marker-end="url(#arrow-purple2)"/>

  <!-- Box: Normalize Scores -->
  <rect x="510" y="406" width="160" height="50" rx="6" fill="#16161a" stroke="#b08af5" stroke-width="1"/>
  <text x="590" y="428" text-anchor="middle" fill="#b08af5" font-size="11" font-weight="500">Normalize Scores</text>
  <text x="590" y="444" text-anchor="middle" fill="#8a8a8e" font-size="9">min-max or z-score</text>

  <line x1="590" y1="456" x2="590" y2="480" stroke="#4a4a52" stroke-width="1.2" marker-end="url(#arrow2)"/>

  <!-- Decision diamond: Diversify? -->
  <polygon points="590,530 545,490 590,450 635,490" fill="#1e1e24" stroke="#62e5d0" stroke-width="1.2"/>
  <text x="590" y="488" text-anchor="middle" fill="#62e5d0" font-size="9" font-weight="500">Diversify?</text>

  <!-- Yes branch to MMR -->
  <line x1="635" y1="490" x2="700" y2="490" stroke="#62e5d0" stroke-width="1" marker-end="url(#arrow-teal2)"/>
  <text x="667" y="482" text-anchor="middle" fill="#62e5d0" font-size="8" font-weight="500">YES</text>

  <!-- Box: MMR Re-ranking -->
  <rect x="703" y="466" width="180" height="50" rx="6" fill="#16161a" stroke="#62e5d0" stroke-width="1"/>
  <text x="793" y="486" text-anchor="middle" fill="#62e5d0" font-size="11" font-weight="500">MMR Re-ranking</text>
  <text x="793" y="504" text-anchor="middle" fill="#8a8a8e" font-size="9">lambda*relevance - (1-lambda)*similarity</text>

  <path d="M 793 516 L 793 542 L 610 542 L 610 556" fill="none" stroke="#62e5d0" stroke-width="1" marker-end="url(#arrow-teal2)"/>

  <!-- No branch -->
  <line x1="590" y1="530" x2="590" y2="556" stroke="#4a4a52" stroke-width="1.2" marker-end="url(#arrow2)"/>
  <text x="600" y="548" fill="#8a8a8e" font-size="8">NO</text>

  <!-- Box: Apply Filters -->
  <rect x="510" y="560" width="160" height="56" rx="6" fill="#16161a" stroke="#2a2a32" stroke-width="1"/>
  <text x="590" y="582" text-anchor="middle" fill="#e8e6e3" font-size="11" font-weight="500">Apply Filters</text>
  <text x="590" y="598" text-anchor="middle" fill="#8a8a8e" font-size="9">minScore threshold</text>
  <text x="590" y="610" text-anchor="middle" fill="#8a8a8e" font-size="9">offset pagination</text>

  <line x1="590" y1="616" x2="590" y2="640" stroke="#c4f562" stroke-width="1.5" marker-end="url(#arrow-accent2)"/>

  <!-- RESULTS BOX -->
  <rect x="510" y="644" width="160" height="60" rx="8" fill="#1e1e24" stroke="#c4f562" stroke-width="2"/>
  <text x="590" y="668" text-anchor="middle" fill="#c4f562" font-size="14" font-weight="500">RESULTS</text>
  <text x="590" y="690" text-anchor="middle" fill="#8a8a8e" font-size="10">Array&lt;TextResult&gt;</text>

  <circle cx="590" cy="644" r="3" fill="#f5a262" opacity="0.4"/>

  <!-- Utilities from results -->
  <line x1="670" y1="674" x2="730" y2="674" stroke="#2a2a32" stroke-width="1"/>

  <rect x="733" y="636" width="170" height="80" rx="6" fill="#16161a" stroke="#2a2a32" stroke-width="1"/>
  <text x="818" y="658" text-anchor="middle" fill="#8a8a8e" font-size="10">Post-result Utilities</text>
  <line x1="750" y1="665" x2="886" y2="665" stroke="#2a2a32" stroke-width="0.5"/>
  <text x="818" y="682" text-anchor="middle" fill="#62b0f5" font-size="10">snippet()</text>
  <text x="818" y="698" text-anchor="middle" fill="#62b0f5" font-size="10">snippets()</text>
  <text x="818" y="714" text-anchor="middle" fill="#62b0f5" font-size="10">explain()</text>

  <!-- Parallel execution bracket -->
  <rect x="300" y="118" width="220" height="278" rx="8" fill="none" stroke="#2a2a32" stroke-width="1" stroke-dasharray="6,4"/>
  <text x="410" y="408" text-anchor="middle" fill="#5a5a5e" font-size="8" letter-spacing="0.08em">CONCURRENT EXECUTION</text>

  <!-- Annotation -->
  <text x="110" y="468" text-anchor="middle" fill="#5a5a5e" font-size="8" font-style="italic">query normalized once,</text>
  <text x="110" y="480" text-anchor="middle" fill="#5a5a5e" font-size="8" font-style="italic">shared across all engines</text>

  <!-- LEGEND -->
  <rect x="30" y="500" width="190" height="160" rx="6" fill="#16161a" stroke="#2a2a32" stroke-width="1"/>
  <text x="48" y="522" fill="#e8e6e3" font-size="11" font-weight="500">Legend</text>
  <line x1="48" y1="530" x2="200" y2="530" stroke="#2a2a32" stroke-width="0.5"/>

  <line x1="48" y1="548" x2="72" y2="548" stroke="#c4f562" stroke-width="1.5"/>
  <text x="80" y="552" fill="#8a8a8e" font-size="9">Main flow (accent)</text>

  <rect x="48" y="562" width="12" height="8" rx="2" fill="none" stroke="#62b0f5" stroke-width="1"/>
  <text x="66" y="570" fill="#8a8a8e" font-size="9">Engine (blue)</text>

  <rect x="48" y="580" width="12" height="8" rx="2" fill="none" stroke="#b08af5" stroke-width="1"/>
  <text x="66" y="588" fill="#8a8a8e" font-size="9">Scoring (purple)</text>

  <rect x="48" y="598" width="12" height="8" rx="2" fill="none" stroke="#62e5d0" stroke-width="1"/>
  <text x="66" y="606" fill="#8a8a8e" font-size="9">Diversity (teal)</text>

  <rect x="48" y="616" width="12" height="8" rx="2" fill="none" stroke="#f5a262" stroke-width="1"/>
  <text x="66" y="624" fill="#8a8a8e" font-size="9">Optimization (orange)</text>

  <line x1="48" y1="640" x2="72" y2="640" stroke="#f562b0" stroke-width="1" stroke-dasharray="5,3"/>
  <text x="80" y="644" fill="#8a8a8e" font-size="9">Optional (dashed)</text>
</svg>

### Query Processing Steps

For a single `search(query, k, options?)` call:

1. **Normalize** — same pipeline as indexing (CharMap, NFKD casefold, whitespace collapse) using the cached maps.
2. **Tokenize** — split + filter + optional stem; positional offsets recorded for phrase/proximity even when not used by all engines.
3. **Expand synonyms** — `normalizedSynonyms` lookup (built at `build()`).
4. **Resolve term nodes** — `normalizedTerms.get(...)` once per term, then passed to every engine via `search_with_term_nodes` / `search_compact` style entrypoints.
5. **Dispatch active engines** — `dispatch_mode` runs the requested modes; results land as `Array<BM25Result>` (or per-mode equivalent) to be fused.
6. **Short-circuit check** — if `config.shortCircuit.enabled` and `bm25Results[0].score / (bm25Results[0].score + bm25Results[1].score) >= shortCircuit.minScore` with the next-rank gap `>= shortCircuit.minGap`, return BM25 directly.
7. **Fuse** — RRF or linear (see [Score Fusion](#score-fusion)).
8. **Normalize fused scores** — `Fusion::normalize_min_max` / `normalize_z_score` (in-place, native).
9. **MMR diversify** (when `config.diversify.enabled` or `SearchOptions.diversify`) — Jaccard over `entryTermNodes`, lambda from `diversify.lambda`.
10. **Apply filters** — `SearchOptions.minScore`, `offset`, and `filter` predicate.

### Performance Analysis

#### Time Complexity

| Operation | Time Complexity | Notes |
|-----------|----------------|-------|
| `add(key, value)` | O(t * log V) | t = tokens in document, V = vocabulary size. Each token requires nodeIndex lookups/inserts. |
| `build()` | O(V * D_avg) | Iterates all terms, builds compact posting arrays, TF cache, trie, and computes IDF/WAND bounds. D_avg = avg posting list length. |
| `search_bm25(q, k)` | O(q * D_q) | q = query terms, D_q = total postings for query terms. With TF cache: O(1) per-document scoring via fieldnorm bucket lookup. WAND pruning skips low-scoring blocks. |
| `search_exact(q, k)` | O(N * \|d_avg\|) | Scans all N documents, substring match on each. N = total documents. |
| `search_fuzzy(q, k, options?)` | O(N * max(\|q\|, \|d\|)) | Levenshtein/Jaro-Winkler against each document key (`FuzzyMode::key`, default). |
| `search_fuzzy(q, k, FuzzyOptions { mode: term })` | O(q * C * max(\|q_t\|, \|v\|)) | Term-level mode with trigram pre-filter: C = candidate terms (typically << V). |
| `search_phrase(q, k, options?)` | O(q * D_q * P) | Intersection of posting lists, then position verification. P = avg positions per term per doc. Slop carried in `PhraseOptions.slop`. |
| `search_boolean(q, k)` | O(q * D_q) | Set operations on posting lists. AND = intersection, OR = union, NOT = complement. |
| `search_proximity(t1, t2, k, options?)` | O(D_intersect * P) | Two-pointer scan on position arrays for intersection of two term posting lists. Distance carried in `ProximityOptions.distance`. |
| `search_prefix(p, k)` | O(1) or O(V) | O(1) with edge n-grams; O(V) without (vocabulary scan). |
| `search_wildcard(p, k)` | O(V * \|p\|) | Pattern match against each vocabulary term. |
| `search_semantic(q, k)` | O(N * d) | Vector similarity search. d = embedding dimension. N = documents (or chunks). |
| `search(q, k, opts)` | O(sum of active modes) | Hybrid search runs active modes, then fuses. Short-circuit may reduce to O(BM25) only. |
| `more_like_this(key, k)` | O(t_doc + BM25) | Extract top TF-IDF terms from document, then BM25 search with those terms. |
| MMR re-ranking | O(k^2 * T) | k = results to re-rank, T = avg terms per doc for Jaccard similarity. |
| RRF fusion | O(R * log R) | R = total results across all modes. Linear merge + sort. |

#### Space Complexity

- **Inverted index**: O(T) where T = total (term, document) pairs across all documents
- **Positional index**: O(P) where P = total token occurrences (every position stored)
- **TF Cache**: O(256) floats + O(T) fieldnorm bucket IDs (1 byte each)
- **Trie**: O(V * L_avg) nodes where L_avg = average term length
- **Trigram index**: O(V * 3) postings (each term has ~|term|-2 trigrams, each pointing back)
- **Edge n-grams**: O(V * L) where L = avg prefix length range
- **Vector index**: O(N * d) where d = embedding dimension (typically 384-768)
- **String deduplication**: `node<String>` ensures identical strings share storage automatically. Normalized texts, term strings, and chunk contents all benefit from this.
- **Content hash index**: O(N) string hashes when deduplication is enabled

## Type Relationships

Field-level shapes for graph types (`TextIndex`, `IndexEntry`, `NormalizedTerm`, `IndexChunk`, `TrigramPostings`, `PhoneticPostings`, `TrieNode`) and config types (`TextIndexConfig`, `SearchOptions`) are listed in [Data Model](#data-model). The "Volatile Result Types" subsection there enumerates `TextResult`, `BM25Result`, `ScoreExplanation`, `TextIndexStats`, and `TextEntry`. The relationship table at [Relationships](#relationships) lists the persistent edges between these types.

Highlights worth re-stating in one place:

- **Bidirectional**: every term carries `postingEntries`/`postingTFs`/`postingDocLens`/`postingFieldnormIds`/`postingBlockMaxScores`; every entry carries `entryTermNodes`/`entryTermTFs` (sorted, parallel) plus packed `positionData`/`positionOffsets`/`positionCounts` and a `termIndexMap` for O(1) term-to-slot lookup.
- **Compact-array preference**: parallel `Array<...>` is used wherever sequential access dominates (postings, forward index, positions). `nodeIndex` is reserved for top-level keyed lookup (`entries`, `normalizedTerms`, `trigramIndex`, `edgeNgramTerms`, `phoneticIndex`, `contentHashes`).
- **String dedup**: term text and chunk content are stored as `node<String>`, sharing storage across documents.

## Module Map

How the GCL modules and 29 native C source files connect. Layered from the top-level orchestrator through engines, scoring, processing, and utilities down to the C native layer.

### Architecture Layers

| Layer | Modules | Purpose |
|-------|---------|---------|
| **Orchestrator** | TextIndex\<T\>, Document | Top-level entry point, coordinates all operations |
| **Engines** | BM25, Boolean, Fuzzy, Phrase, Proximity, Prefix, Wildcard, Span, Phonetic, Suggest, Percolate, Quorum, Curation | Independent search mode implementations |
| **Scoring** | Fusion, MMR, DFR, LMDirichlet, FunctionScore, RankingRules | Score combination, alternative models, and re-ranking |
| **Processing** | TextNormalizer, TextTokenizer, TextChunker, TextParser, PorterStemmer, BooleanParser, SpanParser | Text pipeline and query parsing |
| **Utilities** | CharMap, StopWords, StringUtils, RegexUtils, SnippetExtractor | Low-level helpers and data |
| **C Native** | 29 source files: string_utils, text_normalizer, bm25_scorer, porter_stemmer, dfr_scorer, lm_dirichlet_accel, phonetic_engine, phrase_engine, span_engine, proximity_search, boolean_accel, regex_accel, regex_utils_native, function_score_accel, fusion_accel, fuzzy_matcher, wildcard_matcher, char_tables, char_map_native, prefix_engine, boolean_prefix_accel, ranking_rules_native, stop_words_native, text_chunker_native, text_parser_native, tokenizer_native, doc_reader_native, nativegen, link_native | Performance-critical hot paths |

### Orchestrator

#### TextIndex\<T\>

**File:** `model/text_index.gcl`

The main generic index type. Orchestrates all search modes, manages the inverted index, coordinates normalization/tokenization, and holds configuration. Entry point for all indexing and search operations.

**Exports:** `TextIndex<T>`, `TextIndexConfig`, `TextResult`, `SearchOptions`, `TextEntry`, `TextIndexStats`, `ScoreExplanation`

**Key methods:** `add()`, `build()`, `search()`, `search_bm25()`, `search_phrase()`, `search_semantic()`, `ensure_cached_maps()`

#### Document

**File:** `model/document.gcl`

Structured document types with section and sentence decomposition. Provides hierarchical text organization for section-level search and structured ingestion.

**Exports:** `Document`, `Section`, `Sentence`

### Engines

#### BM25Engine

**File:** `engine/bm25_engine.gcl`

BM25 scoring with precomputed per-term per-document scores. Supports 5 IDF variants (Lucene, BM25+, BM25L, ATIRE, Robertson), term boosting, and document subset search. Includes fused native methods: `search_compact` (multi-term with WAND), `search_boolean_and` (fused boolean-AND BM25), and `search_phrase_compact` (fused phrase search).

**Exports:** `BM25Engine`, `BM25Variant`, `BM25Result`

**Native dependency:** `bm25_scorer.c`

#### BooleanEngine

**File:** `engine/boolean_engine.gcl`

Evaluates boolean query ASTs (AND/OR/NOT with parentheses) against the inverted index. Delegates parsing to BooleanParser. Results scored by BM25.

**Exports:** `BooleanEngine`

**Dependencies:** BooleanParser

#### FuzzyEngine

**File:** `engine/fuzzy_engine.gcl`

Levenshtein edit-distance fuzzy search with trigram pre-filtering and Meilisearch-style typo tolerance. Supports whole-key and term-level fuzzy with trigram pre-filtering. Jaro-Winkler distance also available.

**Exports:** `FuzzyEngine`

**Native dependency:** `fuzzy_matcher.c`

#### PhraseEngine

**File:** `engine/phrase_engine.gcl`

Exact phrase matching via positional verification. Tokens must appear in consecutive positions (slop=0) or within N positions of expected (slop=N).

**Exports:** `PhraseEngine`

#### ProximityEngine

**File:** `engine/proximity_engine.gcl`

Finds documents where two terms appear within N positions. Score inversely proportional to distance. Uses positional offset arrays from the inverted index.

**Exports:** `ProximityEngine`

**Native dependency:** `proximity_search.c`

#### PrefixEngine

**File:** `engine/prefix_engine.gcl`

Prefix matching across the term vocabulary. Also provides autocomplete functionality returning matching terms directly for type-ahead search UI. Supports edge n-gram indexing for O(1) prefix lookups.

**Exports:** `PrefixEngine`

#### WildcardEngine

**File:** `engine/wildcard_engine.gcl`

Pattern matching with `*` (any sequence) and `?` (any char) against the term vocabulary. Internally scores via `BM25Result` and is mapped to `TextResult` (with `matchedTerms` populated) for the public API.

**Exports:** `WildcardEngine`

#### SpanEngine

**File:** `engine/span_engine.gcl`

Span queries for positional search: NEAR (unordered proximity), ONEAR (ordered proximity), and FIRST (term must appear in first N positions).

**Exports:** `SpanEngine`, `SpanQuery`

**Dependencies:** SpanParser

#### PhoneticEngine

**File:** `engine/phonetic_engine.gcl`

Sound-alike matching via Double Metaphone encoding. Builds a phonetic index mapping phonetic codes to terms during `build()`.

**Exports:** `PhoneticEngine`

**Native dependency:** `phonetic_engine.c`

#### SuggestEngine

**File:** `engine/suggest_engine.gcl`

Auto-complete and "did you mean?" spell correction. Prefix matching on vocabulary with document-frequency ranking. Trigram-based candidate selection for spell correction.

**Exports:** `SuggestEngine`, `Suggestion`, `DidYouMeanResult`

#### QuorumEngine

**File:** `engine/quorum_engine.gcl`

Minimum-should-match search. Documents must contain at least `minMatch` of the query terms. Scored by `matchedCount / queryTermCount` (range 0.0 – 1.0).

**Exports:** `QuorumEngine`

#### CurationEngine

**File:** `engine/curation_engine.gcl`

Manual ranking adjustments: pin documents to specific positions, boost/suppress scores, or remove documents from results entirely.

**Exports:** `CurationHelper`

#### PercolateEngine

**File:** `engine/percolate_engine.gcl`

Reverse search: register stored queries, then match incoming documents against them. Useful for alerting and content routing where documents arrive after queries are defined.

**Exports:** `PercolateEngine`, `PercolateIndex`, `PercolatedQuery`

### Scoring

#### Fusion

**File:** `scoring/fusion.gcl`

Score fusion for hybrid search. Reciprocal Rank Fusion (RRF) combines rankings independently of raw score magnitudes. Linear combination applies weighted normalized scores (min-max or z-score).

**Exports:** `Fusion`, `FusionInput`, `FusionAccum` (the `FusionMethod` enum it consumes is declared in `model/text_index.gcl`)

#### MMR

**File:** `scoring/mmr.gcl`

Maximal Marginal Relevance for diversity re-ranking. Balances relevance against redundancy using a lambda parameter. Reduces near-duplicate results in the final output.

**Exports:** `MMR`

#### DFREngine

**File:** `scoring/dfr_engine.gcl`

Divergence From Randomness scoring model. Configurable with 4 BasicModels (G/In/Ine/IF), 2 AfterEffects (Laplace/Bernoulli), and 4 Normalizations (H1/H2/H3/Z).

**Exports:** `DFREngine`, `DFRBasicModel`, `DFRAfterEffect`, `DFRNormalization`

**Native dependency:** `dfr_scorer.c`

#### LMDirichletEngine

**File:** `scoring/lm_dirichlet_engine.gcl`

Language Model scoring with Dirichlet prior smoothing. Alternative to BM25 for probabilistic ranking.

**Native dependency:** `lm_dirichlet_accel.c`

#### FunctionScoreEngine

**File:** `scoring/function_score.gcl`

Decay functions (gaussian, linear, exponential) and field value factors for result re-scoring based on numeric metadata fields.

**Exports:** `FunctionScoreEngine`, `FunctionScoreConfig`, `DecayFunction`, `DecayType`, `FieldValueFactor`

**Native dependency:** `function_score_accel.c`

#### RankingRulesEngine

**File:** `scoring/ranking_rules.gcl`

Meilisearch-style multi-factor ranking pipeline. Applies ordered ranking rules (words, typo, proximity, attribute, sort, exactness) for fine-grained result ordering.

**Exports:** `RankingRulesEngine`, `RankingRule`, `RankingCandidate`

### Processing

#### TextNormalizer

**File:** `processing/text_normalizer.gcl`

Multi-stage text normalization: HTML tag stripping, URL removal, Unicode NFKD decomposition, accent stripping, case folding, whitespace collapse. Configurable via NormOptions.

**Exports:** `TextNormalizer`, `NormOptions`

**Native dependency:** `text_normalizer.c`

**Dependencies:** CharMap, StringUtils

#### TextTokenizer

**File:** `processing/text_tokenizer.gcl`

Splits normalized text into terms with positional tracking. Handles punctuation stripping, min/max term length filtering, numeric token filtering. Records positional offsets for phrase and proximity search.

**Exports:** `TextTokenizer`, `TokenizerAccel`, `TokenInfo`, `TermFrequency`

The tokenizer returns either `Array<TokenInfo>` (carrying the normalized text, original surface form, and positional offset for each token) via `tokenize`, `tokenize_normalized`, `tokenize_with_maps`, and `tokenize_with_originals`'s underlying core, or `Map<String, TermFrequency>` (term → aggregated frequency / position list) via `tokenize_with_originals` and `tokenize_with_originals_normalized_cached`.

**Dependencies:** PorterStemmer, StringUtils, CharMap

#### TextChunker

**File:** `processing/text_chunker.gcl`

Splits text into chunks for semantic embedding. Strategies: fixed-size, sentence-boundary, paragraph-boundary, recursive. Controls chunk size and overlap for vector indexing.

**Exports:** `TextChunker` (the `ChunkStrategy` enum it consumes is declared in `model/text_index.gcl`)

**Dependencies:** StringUtils

#### PorterStemmer

**File:** `processing/stemmer.gcl`

Porter stemming algorithm for English. 5-step suffix stripping reduces terms to root forms. Improves recall by matching morphological variants.

**Exports:** `PorterStemmer`

**Native dependency:** `porter_stemmer.c`

#### BooleanParser

**File:** `parser/boolean_parser.gcl`

Recursive descent parser for boolean query syntax. Tokenizes query strings into AND/OR/NOT operators with parenthesized grouping, producing an AST for BooleanEngine evaluation.

**Exports:** `BooleanParser`, `BooleanQuery`, `ParseResult`, `BooleanOperator` (enum: `AND`, `OR`, `NOT`, `WEAKAND`)

#### SpanParser

**File:** `parser/span_parser.gcl`

Parses span query syntax (NEAR, ONEAR, FIRST operators) into structured SpanQuery objects for the SpanEngine to execute.

**Exports:** `SpanParser`

#### SnippetExtractor

**File:** `util/snippet.gcl`

Query-aware snippet extraction. Locates the best text window around matched terms, produces highlighted excerpts with configurable length and marker tags.

**Exports:** `SnippetExtractor`

**Dependencies:** TextTokenizer, TextNormalizer

### Utilities

#### CharMap

**File:** `util/char_map.gcl`

170+ Unicode-to-ASCII character mappings. Normalizes ligatures, fullwidth characters, smart quotes, mathematical symbols, em-dashes, and more. Supports custom user-defined mappings.

**Exports:** `CharMap`

#### StopWords

**File:** `util/stop_words.gcl`

Stop word lists for 33 languages. Four modes: none, default (language-specific), custom (user-provided), auto (terms in greater than 85% of documents). Removes high-frequency noise terms.

**Exports:** `StopWords` (the `StopWordMode` enum it consumes is declared in `model/text_index.gcl`)

#### StringUtils

**File:** `processing/string_utils.gcl`

String helper functions implemented in native C for performance. Whitespace collapse, character replacement, stripping, and other hot-path string operations.

**Exports:** `StringUtils`

**Native dependency:** `string_utils.c`

#### TextParser

**File:** `processing/text_parser.gcl`

Section parser that splits raw input into `ParsedSection { sectionType, content, title, startLine, endLine }` arrays. Detects markdown headings, fenced code blocks, and prose paragraphs on a line-by-line scan. Drives `add_fields` and section-aware chunking.

**Exports:** `TextParser`, `ParsedSection`

**Native dependency:** `text_parser_native.c`

#### RegexUtils

**File:** `util/regex_utils.gcl`

Regex helpers on top of the native NFA engine: `RegexMatch` iteration, UTF-8 boundary-safe codepoint conversions, and byte-offset-accurate match extraction.

**Exports:** `RegexUtils`, `RegexMatch`

**Native dependency:** `regex_accel.c`, `regex_utils_native.c`

#### SearchAccel

**File:** `scoring/search_accel.gcl`

Native top-k helpers shared by the scoring engines: `find_kth_with_threshold` (threshold-driven top-k), `partial_sort_top_k` (partial ranked sort), and `trim_to_k` (in-place truncation).

**Exports:** `SearchAccel`

**Native dependency:** (nativegen-wired)

### C Native Layer

29 C source files implement all performance-critical operations and are wired through `nativegen.c`. Compiled with `-O3`, LTO, and optional SSE2/AVX2/NEON SIMD vectorization.

#### string_utils.c

Character-level operations: replace_chars, replace_separators, strip_punctuation, collapse_whitespace, count_words, has_alphanumeric, extract_trigrams. All single-pass O(N) implementations with optional SIMD acceleration.

#### text_normalizer.c

Native HTML tag stripping, URL/email stripping, accent removal, quote normalization, line break normalization, HTML entity decoding, and fused multi-pass normalization via `normalize_advanced_with_maps`. Avoids multiple string allocations by combining operations.

#### bm25_scorer.c

BM25 term scoring with 5 variant formulas (Lucene, Plus, L, ATIRE, Robertson). Includes compute_term_score, compute_idf, compute_batch_scores (AVX2-vectorized), precompute_tf_cache, search_compact (fused multi-term search with WAND), search_boolean_and (fused boolean-AND BM25 search), and search_phrase_compact (fused phrase search). SIMD intrinsics process 4-8 doubles per cycle.

#### porter_stemmer.c

Porter stemming algorithm in C. All 7 steps (1a/1b/1c, 2 with 20 suffix rules, 3, 4 with -ion case, 5). Single-word `stem` and batch `stem_batch` functions.

#### dfr_scorer.c

DFR (Divergence From Randomness) scoring. Supports 4 BasicModels (G/In/Ine/IF) x 2 AfterEffects (Laplace/Bernoulli) x 4 Normalizations (H1/H2/H3/Z). Single and batch scoring functions, plus `search_dfr_compact` (fused DFR search).

#### lm_dirichlet_accel.c

Language Model scoring with Dirichlet smoothing. Formula: `log((tf + mu * P(w|C)) / (docLen + mu))` with NaN guard. Single and batch scoring, plus `search_lm_dirichlet_compact` (fused LM-Dirichlet search).

#### phonetic_engine.c

Double Metaphone phonetic encoding. Handles silent prefixes (KN/GN/PN/AE/WR), digraphs (CH/PH/SH/TH), and double consonant collapsing. Returns primary and optional secondary codes.

#### phrase_engine.c

Phrase positional verification with slop tolerance. Two-pointer merge on sorted position arrays. O(n+m) amortized complexity.

#### span_engine.c

NEAR/ONEAR proximity span queries. Ordered and unordered paths with two-pointer monotonic pointer amortization.

#### proximity_search.c

Minimum distance between two sorted position arrays via two-pointer merge. Used for proximity scoring.

#### boolean_accel.c

Adaptive sorted array intersection with galloping search for skewed sizes. O(m log(n/m)) when one array is 16x+ larger.

#### regex_accel.c

Iterative NFA regex simulation with greedy backtracking. Supports `.` `*` `+` `?` `[abc]` `[a-z]` `[^abc]` and `\`-escapes on int32 codepoints.

#### function_score_accel.c

Gaussian, linear, and exponential decay functions for function scoring. Includes `compute_batch_decay` for batch decay computation. Pure math, zero allocation.

#### fuzzy_matcher.c

Banded Levenshtein edit distance (`banded_levenshtein`) with O(n * (2*maxEdits+1)) complexity for fuzzy matching. Also implements Meilisearch-style typo tolerance policy: `compute_typo_max_edits` returns 0/1/2 max edits based on term length.

#### nativegen.c / nativegen.h

Auto-generated native function registration. Links every `native fn` declared in GCL to its C implementation via `gc_program__link_type_fn` and exposes type/field/enum symbols (`gc_<type>__<field>`) that native modules read directly for zero-reflection field access.

#### link_native.c

Library lifecycle hooks (`lib_start`, `lib_stop`, `worker_start`, `worker_stop`). Hoists the BM25 TF-bucket lookup-table init out of the double-checked-locking fast path so the first search does not pay for bucket initialization. Per-function wiring is now fully handled by `nativegen.c`.

#### prefix_engine.c

Native prefix search operations: `score_matching_terms_native` (IDF-based scoring of prefix-matching terms), `autocomplete_from_terms_native` (return matching term strings), and `trie_collect_prefix_native` (collect term nodes from trie by prefix). The vocabulary-range fallback path is the GCL `PrefixEngine::search_vocab_scan`, which scans `normalizedTerms` directly when no trie or edge-ngram index is available.

#### boolean_prefix_accel.c

Native boolean operations on posting lists: `sort_postings_by_ref` (sort postings by entry reference for merge operations), `intersect_node_refs` / `union_node_refs` / `difference_node_refs` (set operations on `Array<node<IndexEntry>>`), and `execute_boolean_native` (fused boolean query execution from compiled operation arrays).

#### wildcard_matcher.c

Codepoint-level wildcard pattern matching via `match_full_pattern_cps`. Supports `*` (any sequence) and `?` (any single character) on int32 codepoint arrays for correct Unicode handling.

#### char_tables.c / char_tables.h

Static O(1) lookup tables for character classification: ASCII punctuation/whitespace/control bitsets, Unicode combining marks hash set, smart quote replacement map (16 entries), HTML entity map (22 named entities), and binary search helpers.

#### char_map_native.c

Native construction of `CharMap` defaults: the default char replacement map (smart quotes, dashes, ligatures, diacritics), the Unicode combining-marks set used by `strip_accents`, and configurable smart-quote / punctuation overlays. Builds GCL `Map<char, String>` and `Map<char, bool>` structures in a single pass without per-entry `Map::put` overhead.

#### ranking_rules_native.c

Native qsort-based sort for `RankingRulesEngine::apply()`. Reads `RankingCandidate` fields directly through `nativegen.h` object offsets (matchedWordCount, typoCount, minProximity, firstMatchPosition, isExactMatch, sortValue) and dispatches comparator decisions via the compiled `Array<RankingRule>` (words/typo/proximity/attribute/sort/exactness). Replaces the GCL merge sort.

#### doc_reader_native.c

`DocReader::resolve_if_node` — resolves a generic `any?` value to its underlying object when it is a `node<T>`. Used by typed multi-field ingestion (`add_fields`) so the same code path works whether `T` or `node<T>` is passed. Returns the resolved object directly for plain objects, and null for primitives/enums.

#### regex_utils_native.c

UTF-8 ↔ codepoint conversion with byte-offset tracking, plus regex match iteration that materializes `RegexMatch` arrays (matched/startPos/endPos/text). Delegates the per-match NFA step to `regex_accel.c`'s `match_pattern_internal` so that offsets are always byte-accurate on multi-byte input.

#### fusion_accel.c

`Fusion::normalize_min_max` and `Fusion::normalize_z_score` for `Array<BM25Result>`. Two-pass in-place rewrite of the `score` field via nativegen field offsets — min/max (or mean/stddev) in the first pass, rewrite in the second. Short-circuits to 0.5 (min-max) or 0 (z-score) when the range is zero.

#### stop_words_native.c

Native initialization of per-language stop-word maps for the 33 supported languages. Inlines every language table as a packed C array of interned strings and populates a single `Map<String, bool>` in one pass, avoiding thousands of GCL `Map::put` calls at each `build()`.

#### text_chunker_native.c

Native implementations of the four chunking strategies (`fixed`, `sentence`, `paragraph`, `recursive`). Emits `ChunkInfo { content, position, startChar, endChar }` directly without round-tripping through GCL strings; word counts are computed inline on the input buffer.

#### text_parser_native.c

Native section parsing into `ParsedSection { sectionType, content, title, startLine, endLine }`. Detects markdown headings, fenced code blocks, and contiguous prose on a line-by-line scan with in-place span trimming.

#### tokenizer_native.c

`TextTokenizer::build_freq_map` — builds `Map<String, TermFrequency>` from a `TokenInfo[]` array. Uses an open-addressing 256-slot cache keyed by FNV-1a hash so that repeated occurrences of hot terms skip the full `Map::get` path (typical docs reuse terms in ~80% of token positions).

### Dependency Summary

| GCL Module | Depends On (GCL) | Depends On (C Native) |
|------------|-------------------|----------------------|
| TextIndex\<T\> | All engines, Fusion, MMR, TextNormalizer, TextTokenizer, StopWords, SnippetExtractor | -- |
| BM25Engine | -- | bm25_scorer.c |
| BooleanEngine | BooleanParser | boolean_accel.c |
| FuzzyEngine | -- | fuzzy_matcher.c |
| ProximityEngine | -- | proximity_search.c |
| SpanEngine | SpanParser | span_engine.c |
| PhraseEngine | -- | phrase_engine.c |
| DFREngine | -- | dfr_scorer.c |
| LMDirichletEngine | -- | lm_dirichlet_accel.c |
| PhoneticEngine | -- | phonetic_engine.c |
| FunctionScoreEngine | -- | function_score_accel.c |
| PrefixEngine | -- | prefix_engine.c |
| BooleanAccel | -- | boolean_prefix_accel.c |
| WildcardEngine | -- | wildcard_matcher.c |
| TokenizerAccel | -- | (nativegen-wired) |
| TextNormalizer | CharMap, StringUtils | text_normalizer.c |
| TextTokenizer | PorterStemmer, StringUtils, CharMap | tokenizer_native.c |
| TextChunker | StringUtils | text_chunker_native.c |
| TextParser | StringUtils | text_parser_native.c |
| PorterStemmer | -- | porter_stemmer.c |
| SnippetExtractor | TextTokenizer, TextNormalizer | -- |
| DocReader | -- | doc_reader_native.c |
| StringUtils | -- | string_utils.c |
| CharMap | -- | char_map_native.c |
| StopWords | -- | stop_words_native.c |
| RegexUtils | -- | regex_accel.c, regex_utils_native.c |
| Fusion | -- | fusion_accel.c |
| RankingRulesEngine | -- | ranking_rules_native.c |
| SearchAccel | -- | (nativegen-wired top-k helpers) |
