# Architecture Text Search for GreyCat is a full-text search engine built on GreyCat's persistent graph database. It combines **inverted index** techniques from classical information retrieval with GreyCat's native graph storage primitives (`node`, `nodeIndex`, `nodeList`) to deliver ranked search across 15 modes without any external dependencies. ## Overview ### Core Concepts - **Inverted Index** -- Maps each normalized term to the set of documents containing it, with term frequency counts. This is the backbone of BM25, boolean, phrase, and prefix search. - **Bidirectional Links** -- Every `NormalizedTerm` links to its `IndexEntry` documents via compact posting arrays (`postingEntries`), and every `IndexEntry` links back to its `NormalizedTerm`s via compact forward index arrays (`entryTermNodes`/`entryTermTFs`). This enables both term-to-doc lookups (search) and doc-to-term lookups (More Like This, MMR similarity). - **Positional Index** -- Each document stores the exact positions where each term appears via packed position arrays (`positionData`, `positionOffsets`, `positionCounts`) -- a flat concatenated array with per-term offsets instead of nested nodeIndex. This powers phrase search, proximity search, and span queries (NEAR/ONEAR/FIRST). - **Vector Index** -- Optional AI embeddings stored in a `VectorIndex` for semantic similarity search. Supports full-document and chunk-based embeddings with configurable chunking strategies. - **Term Variants** -- The `NormalizedTerm` type stores the casefolded, stemmed, accent-stripped form used for matching. The original surface form is tracked at the tokenizer level via the `TokenInfo.original` field rather than as a separate graph type, keeping the data model compact. ### Design Principles - **Precomputation over runtime cost** -- BM25 scoring uses a 256-bucket TF cache built during `build()`, enabling O(1) TF normalization lookup per document. Compact posting arrays with fieldnorm bucket IDs avoid per-document computation at query time. - **Memory deduplication** -- All text is stored as `node`, which GreyCat automatically deduplicates. Identical strings share a single graph node. - **Hot paths in C** -- Character classification, string manipulation, accent stripping, BM25 scoring, DFR/LM-Dirichlet scoring, tokenization, Porter stemming, phonetic encoding, phrase/proximity/span positional verification, regex matching, and decay functions are implemented as native C functions for performance. - **Cache once, use everywhere** -- Character maps, punctuation sets, combining marks, and HTML entity maps are built once per index lifetime and passed to every operation. ## Data Model The graph structure centers on `TextIndex`, which owns two primary indices and several auxiliary structures. Here is the complete type hierarchy: ``` // Graph structure: TextIndex owns two primary nodeIndex maps TextIndex // Configuration +-- config: TextIndexConfig | // Primary indices +-- entries: nodeIndex>? +-- normalizedTerms: nodeIndex>? | // Statistics +-- totalEntries: int? +-- totalTokens: int? +-- totalTerms: int? +-- avgTokenCount: float? +-- built: bool? +-- dirtyCount: int? | // Auxiliary indices +-- contentHashes: nodeIndex>? +-- trigramIndex: nodeIndex>? +-- edgeNgramTerms: nodeIndex>? +-- resolvedFields: Array? // lazy cache: explicit config.fields, or auto-discovered String fields on T at first add_fields() +-- vectorIndex: node>>? +-- phoneticIndex: nodeIndex>? +-- trieRoot: node? // trie for prefix/wildcard search +-- reverseTrieRoot: node? // reverse trie for leading wildcard search | // Cached maps (built once, reused) +-- cachedCharToChar: Map? +-- cachedCharToString: Map? +-- cachedPunctuationChars: Map? +-- cachedCombiningMarks: Map? +-- cachedStopWordMap: Map? +-- cachedTFCache: Array? // 256-entry BM25 TF normalization cache +-- normalizedSynonyms: Map>? IndexEntry +-- value: any? // stored document value (type T or node) +-- normalizedText: String // normalized text (plain String) +-- contentHash: String? // SHA-256 for dedup +-- tokenCount: int // document length in tokens +-- entryTermNodes: Array>? // compact forward index: term refs +-- entryTermTFs: Array? // compact forward index: TFs (parallel) +-- positionData: Array? // packed positions: all positions flat +-- positionOffsets: Array? // packed positions: start offset per term +-- positionCounts: Array? // packed positions: count per term +-- chunks: nodeList>? +-- vector: node? NormalizedTerm extends Term +-- text: node // deduplicated term string +-- originalForm: String? // best original (pre-stemmed) form +-- idf: float? // inverse document frequency +-- maxTermScore: float? // WAND upper bound +-- postingEntries: Array>? // compact posting list +-- postingTFs: Array? // term frequencies (parallel) +-- postingDocLens: Array? // document lengths (parallel) +-- postingFieldnormIds: Array? // fieldnorm buckets (parallel) +-- postingBlockMaxScores: Array? // block WAND scores ``` ### Why This Design? #### Bidirectional Links Every `NormalizedTerm` has compact posting arrays (`postingEntries`) pointing to the documents that contain it (term -> doc). Every `IndexEntry` has compact forward index arrays (`entryTermNodes`/`entryTermTFs`) pointing back to the terms it contains (doc -> term). This bidirectionality enables: - **Search**: given a query term, find all documents (forward lookup) - **More Like This**: given a document, find its terms, pick the top TF-IDF terms, then search for similar documents - **MMR diversity**: compute Jaccard similarity between two documents by intersecting their term sets - **Explain**: given a document and query, compute per-term BM25 contribution #### Normalized Terms and Original Forms `NormalizedTerm` holds the casefolded, stemmed, accent-stripped form used for matching. The original surface form is tracked at the tokenizer level via `TokenInfo`, which has both `text` (normalized) and `original` (surface form) fields. This means search is accent-insensitive and case-insensitive, while highlighting and display use the original characters -- without requiring a separate `OriginalTerm` graph type. #### Positional Index Packed position arrays store all positions in a single flat `Array` (`positionData`) with per-term offset (`positionOffsets`) and count (`positionCounts`) arrays. This avoids the overhead of nested `Array>` which cannot be persisted in graph nodes. Without pre-stored positions, phrase search would require re-tokenizing documents at query time. With positions pre-stored, phrase matching is a two-pointer merge on sorted integer arrays. #### Hybrid Storage: nodeIndex and Compact Arrays GreyCat's `nodeIndex` is a persistent sorted map with O(log n) lookup, insertion, and iteration. It serves as the primary index for top-level lookups (`entries`, `normalizedTerms`). For posting lists and forward indices, compact parallel arrays (`postingEntries`/`postingTFs`, `entryTermNodes`/`entryTermTFs`) provide better cache locality and lower overhead than nested nodeIndex structures. This hybrid approach uses nodeIndex where flexible key-based lookup is needed, and flat arrays where sequential access and compact storage matter most. ### Core Graph Types #### IndexChunk Chunk record used by chunk-level semantic search. Populated by `add()` when `TextIndexConfig.chunking.strategy != none` and `config.embed != null`. The chunker (`TextChunker::chunk`) splits the original text, each chunk is embedded individually, and the vector is pushed into the index's `chunkVectorIndex` (a `VectorIndex>`). At query time, `search_semantic` over-fetches chunks and deduplicates to the best-scoring chunk per parent document. ``` IndexChunk content: node parent: node vector: node? tokenCount: int position: int ``` #### TrigramPostings Used for fast fuzzy pre-filtering. ``` TrigramPostings terms: Array> ``` #### PhoneticPostings Posting list for phonetic codes, mapping each code to term texts sharing that pronunciation. ``` PhoneticPostings terms: Array ``` #### TrieNode Trie (prefix tree) node for O(prefix_len + matches) prefix/wildcard search. ``` TrieNode children: Map>? // codepoint → child terms: Array>? isTerminal: bool? // nullable for ABI safety; treat null as false ``` #### TextIndexConfig Configuration type with standalone fields and nested option blocks controlling all aspects of indexing and search. ``` TextIndexConfig // standalone fields embed | synonyms | fields deduplicateContent | fuzzyMaxTextLength | usePhonetic // nested option blocks tokenization | stopWords | bm25 | fusion typoTolerance | edgeNgram | shortCircuit | diversify chunking | dfr | lmDirichlet | highlight ``` #### SearchOptions Per-query overrides for configuration parameters. ``` SearchOptions (@volatile) modes | weights | fusionMethod | normalization | rrf_k fuzzy | phrase | proximity (nested options) typoTolerance | minScore | diversify offset | filter | termBoosts ``` ### Relationships | From | To | Relationship | Description | |------|----|--------------|-------------| | TextIndex | IndexEntry | `entries` | Ownership -- all indexed documents | | TextIndex | NormalizedTerm | `normalizedTerms` | Ownership -- term vocabulary | | TextIndex | TextIndexConfig | `config` | Configuration reference | | TextIndex | TrigramPostings | `trigramIndex` | Fuzzy search trigrams | | IndexEntry | NormalizedTerm | `entryTermNodes` | Term references (Array\\>) | | IndexEntry | NormalizedTerm | `positionData/Offsets/Counts` | Positional offsets (Array\) | | IndexEntry | IndexChunk | `chunks` | Document chunks for semantic search | | IndexChunk | IndexEntry | `parent` | Back-reference to parent document | | NormalizedTerm | IndexEntry | `postingEntries` | Posting list (Array\\>) | ### Volatile Result Types All result types are annotated `@volatile` (not persisted to the graph). #### TextResult Unified search result returned by every public search method on `TextIndex` (`search`, `search_bm25`, `search_semantic`, `search_exact`, `search_fuzzy`, `search_boolean`, `search_proximity`, `search_phrase`, `search_prefix`, `search_wildcard`, `search_span`, `search_dfr`, `search_lm_dirichlet`, `search_phonetic`, `search_quorum`, `more_like_this`, ...). Each engine produces this same shape; per-mode score breakdowns are merged into the single `score` value during fusion. | Field | Type | Notes | |-------|------|-------| | `key` | `String` | Document key | | `value` | `any?` | Stored document value (type T) | | `score` | `float` | Relevance score (higher = more relevant) | | `matchedTerms` | `Array?` | Query terms that matched in this document | | `chunkKey` | `String?` | `"#"` when chunk-level semantic search is enabled (config.chunking.strategy != none + config.embed != null). Null on the document-level path. | #### BM25Result Internal scoring record used by `BM25Engine` (and by helper engines that piggyback on the BM25 scorer such as Wildcard, Prefix, Boolean, and Quorum) before results are mapped to `TextResult` for the public API. Defined in `engine/bm25_engine.gcl`. | Field | Type | |-------|------| | `key` | `String` | | `value` | `any?` | | `score` | `float` | | `matchedTerms` | `Array` | #### ScoreExplanation BM25 scoring debug breakdown with per-term IDF, TF, and contribution details. | Field | Type | |-------|------| | `totalScore` | `float` | | `terms` | `Array` | | `variant` | `BM25Variant` | | `k1` | `float` | | `b` | `float` | | `docLen` | `int` | | `avgDocLen` | `float` | #### TextIndexStats Index statistics summary: corpus size, vocabulary size, average length. | Field | Type | |-------|------| | `totalEntries` | `int` | | `totalTerms` | `int` | | `avgTokenCount` | `float` | #### TextEntry Batch input entry for `add_batch()` with optional pre-computed vector. | Field | Type | |-------|------| | `key` | `String` | | `value` | `any?` | | `vector` | `Tensor?` | ### Enumerations #### StopWordMode Stop word handling strategy for the indexing pipeline. Values: `none`, `auto`, `default`, `custom` #### BM25Variant BM25 IDF scoring formula variant selection. Values: `lucene`, `plus`, `bm25l`, `atire`, `robertson` #### FusionMethod Score fusion algorithm for combining multi-mode results. Values: `rrf`, `linear` #### Normalization Score normalization methods for linear fusion. Values: `minmax`, `zscore` #### ChunkStrategy Text chunking strategies for semantic search and RAG. Values: `none`, `fixed`, `sentence`, `paragraph`, `recursive` #### SearchMode Available search strategies for querying the index. 15 modes total. Values: `hybrid`, `bm25`, `semantic`, `exact`, `fuzzy`, `boolean`, `proximity`, `phrase`, `prefix`, `wildcard`, `span`, `dfr`, `lm_dirichlet`, `phonetic`, `quorum` #### TextSearchLanguage Supported languages for stop words and text processing. 33 languages. Values: `ar`, `bg`, `ca`, `cs`, `da`, `de`, `el`, `en`, `es`, `fa`, `fi`, `fr`, `gu`, `he`, `hi`, `hu`, `id`, `it`, `ja`, `ko`, `ms`, `nl`, `no`, `pl`, `pt`, `ro`, `ru`, `sk`, `sv`, `tr`, `uk`, `vi`, `zh` #### BooleanOperator Boolean query operators for the recursive descent parser. Values: `AND`, `OR`, `NOT`, `WEAKAND` ## Indexing Pipeline When you call `index.add(key, value)`, the text goes through a multi-step normalization and indexing pipeline. Each step transforms the text closer to its canonical form for matching. ### Indexing Pipeline Diagram From raw text input to a fully searchable inverted index with compact posting arrays and TF cache. Every hot-path operation runs in native C. #### Phase 1 -- Ingestion **01. Document Input** **Methods:** `add(key, value)` / `add_batch()` / `add_fields()` Raw text enters the pipeline via `TextIndex.add(key, value)`. For structured documents, `add_fields(value: T)` accepts a typed document; per-field text is read off `value` via the `FieldConfig.f` typed field references (auto-discovered from `T` if `config.fields` is null). The first `String` of each configured field is concatenated to drive normalization/indexing, and at query time `search_bm25_f`, `search_faceted`, `AggregationEngine`, and `FunctionScoreEngine` read fields directly off `IndexEntry.value` via the same field refs — there is no separate "field data" side index. `add_batch()` accepts `Array`. **Related types:** `TextIndex`, `TextEntry`, `FieldConfig` **02. Content Deduplication** **Method:** SHA-256 hash When `config.deduplicateContent` is set, `Crypto::sha256hex(key)` is computed and looked up in `contentHashes`; an existing hit short-circuits the rest of the pipeline. #### Phase 2 -- Normalization **03. Advanced Normalization** **Method:** `TextNormalizer.normalize_advanced_with_maps()` (companion to `normalize()` / `normalize_with_maps()`) Strips HTML tags, removes URLs and email addresses, normalizes quotes and dashes, removes control characters, limits repeating chars. Configurable via `NormOptions`. Single fused pass for quotes + control + repeat. **Native:** `text_normalizer.c` (`strip_html_tags`, `strip_urls`, fused quote/control/repeat pass) **04. Character Mapping** **Method:** `CharMap.apply()` 170+ Unicode-to-ASCII character mappings: em/en dashes to hyphens, ligatures, fullwidth to ASCII, smart quotes, mathematical symbols, Cyrillic/Greek look-alikes. Custom mappings via `TextIndexConfig.tokenization.charMap`. Built once via `ensure_cached_maps()` and reused for every document and query. Source: `util/char_map.gcl`, native init in `char_map_native.c`. **05. Case Folding** **Method:** `nfkd_casefold()` NFKD Unicode decomposition followed by lowercasing. Accent stripping maps diacritics to base characters (e.g., cafe -> cafe). Controlled by `caseFold` config flag. **Native:** `text_normalizer.c` -- NFKD, lowercase, accent strip **06. Whitespace Collapse** **Method:** `StringUtils.collapse_whitespace()` All runs of whitespace (spaces, tabs, newlines) collapsed to a single space. Leading and trailing whitespace trimmed. Single-pass O(N) in C. **Native:** `string_utils.c` #### Phase 3 -- Tokenization **07. Tokenize & Position Track** **Methods:** `TextTokenizer.tokenize()`, `tokenize_normalized()`, `tokenize_with_maps()`, `tokenize_with_originals()` Splits normalized text into terms, returning `Array` (with positional offsets baked into each `TokenInfo`) or `Map`. Strips punctuation, filters by min/max term length, checks `has_alphanumeric()`, optionally removes numeric tokens. Recorded positional offsets enable phrase search, proximity search, slop matching, and span queries. **Native:** `string_utils.c` -- split on separators, strip punctuation, length filter, numeric filter, positional offsets **08. Stemming** **Method:** `PorterStemmer.stem()` Optional Porter stemmer reduces terms to root forms: "running" -> "run", "connections" -> "connect", "computational" -> "comput". 5-step suffix stripping. Improves recall at the cost of some precision. **Native:** `porter_stemmer.c` -- Optional, 5-step suffix stripping **09. Stop Word Filtering** **Method:** `StopWords::init_stop_words_map()` populates a `Map` looked up via `StopWords::is_stop_word_map()`. Auto-detection uses `StopWords::detect_auto_stop_words_map()`. Removes high-frequency function words. Checked against the original pre-stemmed form, not the stemmed form. Four modes (`StopWordMode`): `none`, `default` (33 languages), `custom`, `auto` (df / N >= `stopWords.autoThreshold`, default 0.85, computed at `build()`). #### Phase 4 -- Index Construction **10. Inverted Index** **Structure:** `nodeIndex>` For each unique term: get-or-create `NormalizedTerm`, append the entry to its forward index (`entryTermNodes` + `entryTermTFs` parallel arrays plus `termIndexMap` for O(1) lookup), and append packed positions (`positionData` + `positionOffsets` + `positionCounts`). Term text is stored as `node`, sharing storage across documents. **11. Semantic Indexing** **Method:** `VectorIndex.add()` When `config.embed` is set, text is chunked per `ChunkStrategy` (fixed / sentence / paragraph / recursive, with `chunking.size` and `chunking.overlap`), each chunk is embedded via the supplied function, and vectors are appended to `vectorIndex` (a `node>>`). #### Phase 5 -- Build (IDF + Precompute) **12. IDF Computation & BM25 Precompute** **Methods:** `build()` Computes average document length. Auto-detects stop words if `StopWordMode::auto`. Pre-normalizes synonyms. Iterates all terms, computes IDF using the configured variant, then **builds compact posting arrays** (`postingEntries`, `postingTFs`, `postingFieldnormIds`) and a **256-bucket TF normalization cache** for O(1) per-document scoring at query time. Also computes `maxTermScore` per term and block-level max scores (`postingBlockMaxScores`) for WAND pruning. Builds trigram index for fuzzy pre-filtering and trie/reverse-trie for prefix/wildcard search. **IDF Formulas** | Variant | Formula | Notes | |---------|---------|-------| | Lucene (default) | `log(1 + (N - df + 0.5) / (df + 0.5))` | Always positive, even for high-frequency terms | | BM25+ | `log((N + 1) / df)` | Adds delta to prevent zero TF contribution | | BM25L | `log((N + 1) / (df + 0.5))` | Better for long documents | | ATIRE | `log(N / df)` | Simplest variant | | Robertson | `log((N - df + 0.5) / (df + 0.5))` | Original BM25 -- can go negative | **Native:** `bm25_scorer.c` -- TF cache, compact posting arrays, auto stop words, trigramIndex, WAND maxTermScore #### Legend | Tag | Meaning | |-----|---------| | **C native** | Implemented in native C for performance | | **Core type** | Fundamental data model type | | **Data structure** | Index or storage structure | | **Algorithm** | Algorithmic component | | **Config-dependent** | Behavior controlled by configuration | | **Optional step** | Only runs when enabled | | **AI / Embedding** | Requires AI model | ### Build Phase Detail `build()` walks the vocabulary once and materialises the structures that turn the raw graph into a queryable index. Steps: 1. **Average doc length:** `avgTokenCount = totalTokens / totalEntries`, used by BM25 length normalization. 2. **Auto stop words:** when `StopWordMode::auto`, terms with `df / totalEntries >= autoThreshold` (default 0.85) join `cachedStopWordMap` (`StopWords::detect_auto_stop_words_map`). 3. **IDF per term**, using the variant selected by `config.bm25.variant` (formulas in Phase 5 above). 4. **Compact posting arrays** on every `NormalizedTerm`: `postingEntries`, `postingTFs`, `postingDocLens`, `postingFieldnormIds`, sorted by entry ref. 5. **256-bucket TF cache** (`cachedTFCache`, Tantivy-style log length encoding): each entry maps a fieldnorm bucket id to the precomputed denominator `k1 * (1 - b + b * decoded_length / avgDocLen)`. TF normalization at query time is one array load. 6. **WAND upper bounds:** `maxTermScore` per term plus per-64-entry `postingBlockMaxScores` enable both global and block-level pruning during `search_compact`. 7. **Trigram index** (`trigramIndex: nodeIndex>`) when `config.typoTolerance.enabled`. Built once, queried per fuzzy search to narrow Levenshtein candidates from O(V) to ~1-10% of V. 8. **Edge n-gram index** (`edgeNgramTerms`) when `config.edgeNgram.enabled`: prefix lengths `[edgeNgram.min, edgeNgram.max]` of each term land in a `nodeIndex` for O(1) prefix lookup. 9. **Trie + reverse trie** (`trieRoot`, `reverseTrieRoot`) over the normalized vocabulary for `search_prefix` (O(prefix_len + matches)) and leading-wildcard `search_wildcard`. 10. **Phonetic index** (`build_phonetic_index`) when `config.usePhonetic`: maps Double Metaphone codes to term strings. 11. **Synonym pre-normalization** (`normalizedSynonyms`): synonyms are normalized once so query-time expansion is a map lookup. ## Score Fusion Hybrid search combines per-mode result lists into one ranking. Worked examples and configuration knobs live in [search-modes.md > Hybrid Search](./search-modes.md#hybrid-search); this section records only the formulas the engines implement. ### Reciprocal Rank Fusion (RRF) Rank-based, independent of raw score magnitude. Implemented in `scoring/fusion.gcl` (`Fusion::rrf`). ``` score(d) = SUM_i weight_i / (k + rank_i(d)) // k = config.fusion.rrf.k (default 60) // rank_i(d) = 1-indexed rank of d in mode i's result list; omitted if d is absent ``` When `config.fusion.rrf.topRankBonus` is set (default), rank-1 hits get `topBonus` (default +0.05) and ranks within `nearTopCutoff` get `nearTopBonus` (default +0.02). ### Linear Fusion Score-based, requires per-mode normalization first. Native normalizers live in `fusion_accel.c` (`normalize_min_max`, `normalize_z_score`) and operate in-place over `Array`. ``` score(d) = SUM_i weight_i * normalize(score_i(d)) minmax(s) = (s - min) / (max - min) // collapses to 0.5 when range = 0 zscore(s) = (s - mean) / stddev // collapses to 0 when stddev = 0 ``` ### Search Architecture Diagram 15 search modes (including hybrid). Each non-hybrid mode is an independent engine. Hybrid search fuses multiple signals via RRF or linear combination, with optional MMR diversity re-ranking. The diagram shows the 11 highest-traffic engines plus the BM25F multi-field variant and More-Like-This. Four additional engines follow the same dispatch pattern but specialize the scoring: **DFR** (Divergence From Randomness), **LM-Dirichlet** (language model smoothing), **Phonetic** (Double Metaphone), and **Quorum** (minimum-should-match). They sit in the same engine layer as BM25 and reuse the same posting lists; see the per-mode entries in the table below. ### Search Modes User-facing semantics, query syntax, and examples for each mode live in [search-modes.md](./search-modes.md). The table below summarises the algorithmic backbone and native helper consumed by each engine. | Mode | Method | Algorithm / data structure | Native helper | |------|--------|----------------------------|----------------| | BM25 | `search_bm25(query, k)` | Compact posting arrays + 256-bucket TF cache + WAND (`maxTermScore`, `postingBlockMaxScores`); 5 IDF variants | `bm25_scorer.c` (`search_compact`) | | BM25F | `search_bm25_f(query, k)` | Per-field length normalization (`FieldConfig.fieldB`), weighted field-score sum then IDF | `bm25_scorer.c` | | Semantic | `search_semantic(query, k)` | Nearest-neighbor over `VectorIndex` (doc-level) or `VectorIndex` (chunk-level, when `config.chunking.strategy != none`) | -- (user-provided embed function) | | Exact | `search_exact(query, k)` | Normalized substring scan over `IndexEntry.normalizedText` | -- | | Fuzzy | `search_fuzzy(query, k, options?)` | Banded Levenshtein on keys (`FuzzyMode::key`) or trigram-prefiltered vocabulary scan (`FuzzyMode::term`) via `trigramIndex` | `fuzzy_matcher.c` | | Boolean | `search_boolean(query, k)` | `BooleanParser` AST + sorted-array set ops (intersect / union / difference) on posting refs | `boolean_accel.c`, `boolean_prefix_accel.c` | | Proximity | `search_proximity(term1, term2, k, options?)` | Two-pointer min-distance on packed `positionData`; distance from `ProximityOptions.distance` | `proximity_search.c` | | Phrase | `search_phrase(phrase, k, options?)` | Two-pointer positional verification with `PhraseOptions.slop`; BM25-scored | `phrase_engine.c`, `bm25_scorer.c` (`search_phrase_compact`) | | Prefix | `search_prefix(prefix, k)` | Trie walk (`trieRoot`) or sorted-vocabulary range via `search_vocab_scan`; optional `edgeNgramTerms` for O(1) | `prefix_engine.c` | | Wildcard | `search_wildcard(pattern, k)` | Codepoint-level `*` / `?` match (`match_full_pattern_cps`) over vocabulary; reverse trie for leading wildcards | `wildcard_matcher.c` | | Span | `search_span(query, k)` | `SpanParser` AST + monotonic-pointer span verification (`NEAR` / `ONEAR` / `FIRST`) | `span_engine.c` | | DFR | `search_dfr(query, k)` | 4 BasicModels x 2 AfterEffects x 4 Normalizations | `dfr_scorer.c` (`search_dfr_compact`) | | LM-Dirichlet | `search_lm_dirichlet(query, k)` | `log((tf + mu * P(w|C)) / (docLen + mu))` Dirichlet smoothing | `lm_dirichlet_accel.c` (`search_lm_dirichlet_compact`) | | Phonetic | `search_phonetic(query, k)` | Double Metaphone codes via `phoneticIndex`; requires `config.usePhonetic` | `phonetic_engine.c` | | Quorum | `search_quorum(query, k, minMatch)` | Posting-list union, filter by matched-term count; BM25 score on the matched subset | `bm25_scorer.c` | ## Optimization & Performance ### Optimization Techniques | Technique | Where | What it skips / saves | |-----------|-------|-----------------------| | Short-circuit | `model/text_index.gcl::dispatch_mode`; gated by `config.shortCircuit` | Skips non-BM25 engines when BM25 already has a clear winner: `s0/(s0+s1) >= minScore` and gap `>= minGap`. | | Cached char maps | `ensure_cached_maps()`, stored on `TextIndex` | Reuses CharMap, punctuation set, combining marks, smart-quote map, HTML entity map across every `add()` and `search()`. | | Pre-resolved term nodes | `search_with_term_nodes` / `search_compact` entrypoints | One `normalizedTerms.get` per query term instead of one per scored doc. | | 256-bucket TF cache | `precompute_tf_cache` (`bm25_scorer.c`) | TF normalization becomes one array load (`cachedTFCache[fieldnormId]`) instead of `k1 * (1 - b + b * len/avgLen)`. | | Compact postings | `postingEntries`/`postingTFs`/`postingFieldnormIds`/`postingBlockMaxScores` on `NormalizedTerm` | Sequential, cache-friendly scoring; AVX2 batch scoring in `compute_batch_scores`. | | Compact forward index | `entryTermNodes`/`entryTermTFs` on `IndexEntry`, packed positions | O(1) `get_term_positions` via `termIndexMap`, fast Jaccard for MMR/MoreLikeThis. | | WAND + block-max | `maxTermScore` (per term) and `postingBlockMaxScores` (per 64-entry block) | Skips documents/blocks whose upper bound cannot beat the current k-th score (theta). | | MMR diversity | `scoring/mmr.gcl` (`MMR::apply`); Jaccard over `entryTermNodes` | Greedy `lambda * relevance - (1 - lambda) * max_sim` re-ranking. | | Synonym pre-norm | `normalizedSynonyms` populated at `build()` | Query-time synonym expansion is a map lookup. | | Content dedup | `contentHashes: nodeIndex>` keyed by `Crypto::sha256hex` | O(log n) duplicate skip before tokenization. | | Trigram pre-filter | `trigramIndex` + `extract_trigrams` (`string_utils.c`) | Fuzzy candidates reduced from O(V) to ~1-10% of V before Levenshtein. | | Trie + reverse trie | `trieRoot`, `reverseTrieRoot` built in `build_trie_index` / `build_reverse_trie_index` | O(prefix_len + matches) prefix search; leading-wildcard support without full vocabulary scan. | | Edge n-gram index | `edgeNgramTerms` when `config.edgeNgram.enabled` | O(1) prefix lookup on a fixed-length range. | The MMR formula: ``` MMR(d) = lambda * relevance(d) - (1 - lambda) * max_sim(d, selected) sim(d1, d2) = |terms(d1) intersect terms(d2)| / |terms(d1) union terms(d2)| ``` ### Native C Hot Paths 29 C source files implement the performance-critical operations and are wired through `nativegen.c`. Compiled with `-O3`, LTO, and optional SSE2/AVX2/NEON SIMD vectorization: | Module | Operations | SIMD | |--------|------------|------| | `string_utils.c` | Character replace/strip, separator replacement, whitespace collapse, alphanumeric check, word count, trigram extraction | SSE2/AVX2 | | `text_normalizer.c` | HTML/URL/email stripping, accent removal, quote normalization, entity decoding | -- | | `bm25_scorer.c` | BM25 scoring (5 variants), batch scoring, IDF, TF cache, `search_compact` (fused multi-term search with WAND), `search_boolean_and` (fused boolean-AND BM25), `search_phrase_compact` (fused phrase search) | AVX2 | | `porter_stemmer.c` | Porter stemming (single + batch), 7-step suffix stripping | -- | | `dfr_scorer.c` | DFR scoring (4 models x 2 effects x 4 norms), batch scoring, `search_dfr_compact` (fused DFR search) | AVX2 | | `lm_dirichlet_accel.c` | Language Model Dirichlet smoothing, batch scoring, `search_lm_dirichlet_compact` (fused LM-Dirichlet search) | AVX2 | | `phonetic_engine.c` | Double Metaphone encoding (primary + secondary codes) | -- | | `phrase_engine.c` | Phrase positional verification with slop tolerance | -- | | `span_engine.c` | NEAR/ONEAR span position checking | -- | | `proximity_search.c` | Two-pointer minimum distance on sorted position arrays | -- | | `boolean_accel.c` | Adaptive sorted intersection with galloping search | -- | | `regex_accel.c` | Iterative NFA regex simulation with greedy backtracking | -- | | `regex_utils_native.c` | UTF-8↔codepoint conversion with byte-offset tracking, regex match iteration into `RegexMatch` arrays | -- | | `function_score_accel.c` | Gaussian/Linear/Exponential decay functions, `compute_batch_decay` (batch decay computation) | -- | | `fusion_accel.c` | Min-max and z-score normalization of `BM25Result` arrays, in-place | -- | | `fuzzy_matcher.c` | Banded Levenshtein edit distance, typo max-edits policy (Meilisearch-style length thresholds) | -- | | `char_tables.c` | O(1) lookup tables for character classification (bitsets, hash sets, maps) | -- | | `char_map_native.c` | Native construction of default char maps, smart-quote map, and combining-marks set without GCL `Map::put` overhead | -- | | `prefix_engine.c` | Prefix term scoring, autocomplete from terms, trie prefix collection, sorted vocabulary prefix range search | -- | | `boolean_prefix_accel.c` | Posting list sort by ref, set operations (intersect/union/difference) on node refs, fused boolean query execution | -- | | `wildcard_matcher.c` | Codepoint-level wildcard pattern matching (`*` and `?`) | -- | | `ranking_rules_native.c` | qsort-based multi-criteria ranking rules sort (words/typo/proximity/attribute/sort/exactness) | -- | | `doc_reader_native.c` | `DocReader::resolve_if_node` — unwraps `any?` to its underlying object when it is a `node`, enabling typed multi-field ingestion (`add_fields`) for both `T` and `node` documents | -- | | `stop_words_native.c` | Native initialization of stop-word maps for all 33 languages (single table, zero per-call GCL allocations) | -- | | `text_chunker_native.c` | Native fixed/sentence/paragraph/recursive chunking with word-counted spans | -- | | `text_parser_native.c` | Native section parsing (markdown headings, code blocks, prose) into `ParsedSection` arrays | -- | | `tokenizer_native.c` | `TokenInfo → Map` frequency-map construction with hash cache for hot-term reuse | -- | All C modules use `char_tables.h` for O(1) character classification via precomputed lookup tables. SIMD modules include scalar fallbacks for non-SIMD platforms. ### Query Execution Flow Trace a hybrid search from query input through normalization, parallel engine execution, score fusion, and post-processing to final ranked results. ### Query Processing Steps For a single `search(query, k, options?)` call: 1. **Normalize** — same pipeline as indexing (CharMap, NFKD casefold, whitespace collapse) using the cached maps. 2. **Tokenize** — split + filter + optional stem; positional offsets recorded for phrase/proximity even when not used by all engines. 3. **Expand synonyms** — `normalizedSynonyms` lookup (built at `build()`). 4. **Resolve term nodes** — `normalizedTerms.get(...)` once per term, then passed to every engine via `search_with_term_nodes` / `search_compact` style entrypoints. 5. **Dispatch active engines** — `dispatch_mode` runs the requested modes; results land as `Array` (or per-mode equivalent) to be fused. 6. **Short-circuit check** — if `config.shortCircuit.enabled` and `bm25Results[0].score / (bm25Results[0].score + bm25Results[1].score) >= shortCircuit.minScore` with the next-rank gap `>= shortCircuit.minGap`, return BM25 directly. 7. **Fuse** — RRF or linear (see [Score Fusion](#score-fusion)). 8. **Normalize fused scores** — `Fusion::normalize_min_max` / `normalize_z_score` (in-place, native). 9. **MMR diversify** (when `config.diversify.enabled` or `SearchOptions.diversify`) — Jaccard over `entryTermNodes`, lambda from `diversify.lambda`. 10. **Apply filters** — `SearchOptions.minScore`, `offset`, and `filter` predicate. ### Performance Analysis #### Time Complexity | Operation | Time Complexity | Notes | |-----------|----------------|-------| | `add(key, value)` | O(t * log V) | t = tokens in document, V = vocabulary size. Each token requires nodeIndex lookups/inserts. | | `build()` | O(V * D_avg) | Iterates all terms, builds compact posting arrays, TF cache, trie, and computes IDF/WAND bounds. D_avg = avg posting list length. | | `search_bm25(q, k)` | O(q * D_q) | q = query terms, D_q = total postings for query terms. With TF cache: O(1) per-document scoring via fieldnorm bucket lookup. WAND pruning skips low-scoring blocks. | | `search_exact(q, k)` | O(N * \|d_avg\|) | Scans all N documents, substring match on each. N = total documents. | | `search_fuzzy(q, k, options?)` | O(N * max(\|q\|, \|d\|)) | Levenshtein/Jaro-Winkler against each document key (`FuzzyMode::key`, default). | | `search_fuzzy(q, k, FuzzyOptions { mode: term })` | O(q * C * max(\|q_t\|, \|v\|)) | Term-level mode with trigram pre-filter: C = candidate terms (typically << V). | | `search_phrase(q, k, options?)` | O(q * D_q * P) | Intersection of posting lists, then position verification. P = avg positions per term per doc. Slop carried in `PhraseOptions.slop`. | | `search_boolean(q, k)` | O(q * D_q) | Set operations on posting lists. AND = intersection, OR = union, NOT = complement. | | `search_proximity(t1, t2, k, options?)` | O(D_intersect * P) | Two-pointer scan on position arrays for intersection of two term posting lists. Distance carried in `ProximityOptions.distance`. | | `search_prefix(p, k)` | O(1) or O(V) | O(1) with edge n-grams; O(V) without (vocabulary scan). | | `search_wildcard(p, k)` | O(V * \|p\|) | Pattern match against each vocabulary term. | | `search_semantic(q, k)` | O(N * d) | Vector similarity search. d = embedding dimension. N = documents (or chunks). | | `search(q, k, opts)` | O(sum of active modes) | Hybrid search runs active modes, then fuses. Short-circuit may reduce to O(BM25) only. | | `more_like_this(key, k)` | O(t_doc + BM25) | Extract top TF-IDF terms from document, then BM25 search with those terms. | | MMR re-ranking | O(k^2 * T) | k = results to re-rank, T = avg terms per doc for Jaccard similarity. | | RRF fusion | O(R * log R) | R = total results across all modes. Linear merge + sort. | #### Space Complexity - **Inverted index**: O(T) where T = total (term, document) pairs across all documents - **Positional index**: O(P) where P = total token occurrences (every position stored) - **TF Cache**: O(256) floats + O(T) fieldnorm bucket IDs (1 byte each) - **Trie**: O(V * L_avg) nodes where L_avg = average term length - **Trigram index**: O(V * 3) postings (each term has ~|term|-2 trigrams, each pointing back) - **Edge n-grams**: O(V * L) where L = avg prefix length range - **Vector index**: O(N * d) where d = embedding dimension (typically 384-768) - **String deduplication**: `node` ensures identical strings share storage automatically. Normalized texts, term strings, and chunk contents all benefit from this. - **Content hash index**: O(N) string hashes when deduplication is enabled ## Type Relationships Field-level shapes for graph types (`TextIndex`, `IndexEntry`, `NormalizedTerm`, `IndexChunk`, `TrigramPostings`, `PhoneticPostings`, `TrieNode`) and config types (`TextIndexConfig`, `SearchOptions`) are listed in [Data Model](#data-model). The "Volatile Result Types" subsection there enumerates `TextResult`, `BM25Result`, `ScoreExplanation`, `TextIndexStats`, and `TextEntry`. The relationship table at [Relationships](#relationships) lists the persistent edges between these types. Highlights worth re-stating in one place: - **Bidirectional**: every term carries `postingEntries`/`postingTFs`/`postingDocLens`/`postingFieldnormIds`/`postingBlockMaxScores`; every entry carries `entryTermNodes`/`entryTermTFs` (sorted, parallel) plus packed `positionData`/`positionOffsets`/`positionCounts` and a `termIndexMap` for O(1) term-to-slot lookup. - **Compact-array preference**: parallel `Array<...>` is used wherever sequential access dominates (postings, forward index, positions). `nodeIndex` is reserved for top-level keyed lookup (`entries`, `normalizedTerms`, `trigramIndex`, `edgeNgramTerms`, `phoneticIndex`, `contentHashes`). - **String dedup**: term text and chunk content are stored as `node`, sharing storage across documents. ## Module Map How the GCL modules and 29 native C source files connect. Layered from the top-level orchestrator through engines, scoring, processing, and utilities down to the C native layer. ### Architecture Layers | Layer | Modules | Purpose | |-------|---------|---------| | **Orchestrator** | TextIndex\, Document | Top-level entry point, coordinates all operations | | **Engines** | BM25, Boolean, Fuzzy, Phrase, Proximity, Prefix, Wildcard, Span, Phonetic, Suggest, Percolate, Quorum, Curation | Independent search mode implementations | | **Scoring** | Fusion, MMR, DFR, LMDirichlet, FunctionScore, RankingRules | Score combination, alternative models, and re-ranking | | **Processing** | TextNormalizer, TextTokenizer, TextChunker, TextParser, PorterStemmer, BooleanParser, SpanParser | Text pipeline and query parsing | | **Utilities** | CharMap, StopWords, StringUtils, RegexUtils, SnippetExtractor | Low-level helpers and data | | **C Native** | 29 source files: string_utils, text_normalizer, bm25_scorer, porter_stemmer, dfr_scorer, lm_dirichlet_accel, phonetic_engine, phrase_engine, span_engine, proximity_search, boolean_accel, regex_accel, regex_utils_native, function_score_accel, fusion_accel, fuzzy_matcher, wildcard_matcher, char_tables, char_map_native, prefix_engine, boolean_prefix_accel, ranking_rules_native, stop_words_native, text_chunker_native, text_parser_native, tokenizer_native, doc_reader_native, nativegen, link_native | Performance-critical hot paths | ### Orchestrator #### TextIndex\ **File:** `model/text_index.gcl` The main generic index type. Orchestrates all search modes, manages the inverted index, coordinates normalization/tokenization, and holds configuration. Entry point for all indexing and search operations. **Exports:** `TextIndex`, `TextIndexConfig`, `TextResult`, `SearchOptions`, `TextEntry`, `TextIndexStats`, `ScoreExplanation` **Key methods:** `add()`, `build()`, `search()`, `search_bm25()`, `search_phrase()`, `search_semantic()`, `ensure_cached_maps()` #### Document **File:** `model/document.gcl` Structured document types with section and sentence decomposition. Provides hierarchical text organization for section-level search and structured ingestion. **Exports:** `Document`, `Section`, `Sentence` ### Engines #### BM25Engine **File:** `engine/bm25_engine.gcl` BM25 scoring with precomputed per-term per-document scores. Supports 5 IDF variants (Lucene, BM25+, BM25L, ATIRE, Robertson), term boosting, and document subset search. Includes fused native methods: `search_compact` (multi-term with WAND), `search_boolean_and` (fused boolean-AND BM25), and `search_phrase_compact` (fused phrase search). **Exports:** `BM25Engine`, `BM25Variant`, `BM25Result` **Native dependency:** `bm25_scorer.c` #### BooleanEngine **File:** `engine/boolean_engine.gcl` Evaluates boolean query ASTs (AND/OR/NOT with parentheses) against the inverted index. Delegates parsing to BooleanParser. Results scored by BM25. **Exports:** `BooleanEngine` **Dependencies:** BooleanParser #### FuzzyEngine **File:** `engine/fuzzy_engine.gcl` Levenshtein edit-distance fuzzy search with trigram pre-filtering and Meilisearch-style typo tolerance. Supports whole-key and term-level fuzzy with trigram pre-filtering. Jaro-Winkler distance also available. **Exports:** `FuzzyEngine` **Native dependency:** `fuzzy_matcher.c` #### PhraseEngine **File:** `engine/phrase_engine.gcl` Exact phrase matching via positional verification. Tokens must appear in consecutive positions (slop=0) or within N positions of expected (slop=N). **Exports:** `PhraseEngine` #### ProximityEngine **File:** `engine/proximity_engine.gcl` Finds documents where two terms appear within N positions. Score inversely proportional to distance. Uses positional offset arrays from the inverted index. **Exports:** `ProximityEngine` **Native dependency:** `proximity_search.c` #### PrefixEngine **File:** `engine/prefix_engine.gcl` Prefix matching across the term vocabulary. Also provides autocomplete functionality returning matching terms directly for type-ahead search UI. Supports edge n-gram indexing for O(1) prefix lookups. **Exports:** `PrefixEngine` #### WildcardEngine **File:** `engine/wildcard_engine.gcl` Pattern matching with `*` (any sequence) and `?` (any char) against the term vocabulary. Internally scores via `BM25Result` and is mapped to `TextResult` (with `matchedTerms` populated) for the public API. **Exports:** `WildcardEngine` #### SpanEngine **File:** `engine/span_engine.gcl` Span queries for positional search: NEAR (unordered proximity), ONEAR (ordered proximity), and FIRST (term must appear in first N positions). **Exports:** `SpanEngine`, `SpanQuery` **Dependencies:** SpanParser #### PhoneticEngine **File:** `engine/phonetic_engine.gcl` Sound-alike matching via Double Metaphone encoding. Builds a phonetic index mapping phonetic codes to terms during `build()`. **Exports:** `PhoneticEngine` **Native dependency:** `phonetic_engine.c` #### SuggestEngine **File:** `engine/suggest_engine.gcl` Auto-complete and "did you mean?" spell correction. Prefix matching on vocabulary with document-frequency ranking. Trigram-based candidate selection for spell correction. **Exports:** `SuggestEngine`, `Suggestion`, `DidYouMeanResult` #### QuorumEngine **File:** `engine/quorum_engine.gcl` Minimum-should-match search. Documents must contain at least `minMatch` of the query terms. Scored by `matchedCount / queryTermCount` (range 0.0 – 1.0). **Exports:** `QuorumEngine` #### CurationEngine **File:** `engine/curation_engine.gcl` Manual ranking adjustments: pin documents to specific positions, boost/suppress scores, or remove documents from results entirely. **Exports:** `CurationHelper` #### PercolateEngine **File:** `engine/percolate_engine.gcl` Reverse search: register stored queries, then match incoming documents against them. Useful for alerting and content routing where documents arrive after queries are defined. **Exports:** `PercolateEngine`, `PercolateIndex`, `PercolatedQuery` ### Scoring #### Fusion **File:** `scoring/fusion.gcl` Score fusion for hybrid search. Reciprocal Rank Fusion (RRF) combines rankings independently of raw score magnitudes. Linear combination applies weighted normalized scores (min-max or z-score). **Exports:** `Fusion`, `FusionInput`, `FusionAccum` (the `FusionMethod` enum it consumes is declared in `model/text_index.gcl`) #### MMR **File:** `scoring/mmr.gcl` Maximal Marginal Relevance for diversity re-ranking. Balances relevance against redundancy using a lambda parameter. Reduces near-duplicate results in the final output. **Exports:** `MMR` #### DFREngine **File:** `scoring/dfr_engine.gcl` Divergence From Randomness scoring model. Configurable with 4 BasicModels (G/In/Ine/IF), 2 AfterEffects (Laplace/Bernoulli), and 4 Normalizations (H1/H2/H3/Z). **Exports:** `DFREngine`, `DFRBasicModel`, `DFRAfterEffect`, `DFRNormalization` **Native dependency:** `dfr_scorer.c` #### LMDirichletEngine **File:** `scoring/lm_dirichlet_engine.gcl` Language Model scoring with Dirichlet prior smoothing. Alternative to BM25 for probabilistic ranking. **Native dependency:** `lm_dirichlet_accel.c` #### FunctionScoreEngine **File:** `scoring/function_score.gcl` Decay functions (gaussian, linear, exponential) and field value factors for result re-scoring based on numeric metadata fields. **Exports:** `FunctionScoreEngine`, `FunctionScoreConfig`, `DecayFunction`, `DecayType`, `FieldValueFactor` **Native dependency:** `function_score_accel.c` #### RankingRulesEngine **File:** `scoring/ranking_rules.gcl` Meilisearch-style multi-factor ranking pipeline. Applies ordered ranking rules (words, typo, proximity, attribute, sort, exactness) for fine-grained result ordering. **Exports:** `RankingRulesEngine`, `RankingRule`, `RankingCandidate` ### Processing #### TextNormalizer **File:** `processing/text_normalizer.gcl` Multi-stage text normalization: HTML tag stripping, URL removal, Unicode NFKD decomposition, accent stripping, case folding, whitespace collapse. Configurable via NormOptions. **Exports:** `TextNormalizer`, `NormOptions` **Native dependency:** `text_normalizer.c` **Dependencies:** CharMap, StringUtils #### TextTokenizer **File:** `processing/text_tokenizer.gcl` Splits normalized text into terms with positional tracking. Handles punctuation stripping, min/max term length filtering, numeric token filtering. Records positional offsets for phrase and proximity search. **Exports:** `TextTokenizer`, `TokenizerAccel`, `TokenInfo`, `TermFrequency` The tokenizer returns either `Array` (carrying the normalized text, original surface form, and positional offset for each token) via `tokenize`, `tokenize_normalized`, `tokenize_with_maps`, and `tokenize_with_originals`'s underlying core, or `Map` (term → aggregated frequency / position list) via `tokenize_with_originals` and `tokenize_with_originals_normalized_cached`. **Dependencies:** PorterStemmer, StringUtils, CharMap #### TextChunker **File:** `processing/text_chunker.gcl` Splits text into chunks for semantic embedding. Strategies: fixed-size, sentence-boundary, paragraph-boundary, recursive. Controls chunk size and overlap for vector indexing. **Exports:** `TextChunker` (the `ChunkStrategy` enum it consumes is declared in `model/text_index.gcl`) **Dependencies:** StringUtils #### PorterStemmer **File:** `processing/stemmer.gcl` Porter stemming algorithm for English. 5-step suffix stripping reduces terms to root forms. Improves recall by matching morphological variants. **Exports:** `PorterStemmer` **Native dependency:** `porter_stemmer.c` #### BooleanParser **File:** `parser/boolean_parser.gcl` Recursive descent parser for boolean query syntax. Tokenizes query strings into AND/OR/NOT operators with parenthesized grouping, producing an AST for BooleanEngine evaluation. **Exports:** `BooleanParser`, `BooleanQuery`, `ParseResult`, `BooleanOperator` (enum: `AND`, `OR`, `NOT`, `WEAKAND`) #### SpanParser **File:** `parser/span_parser.gcl` Parses span query syntax (NEAR, ONEAR, FIRST operators) into structured SpanQuery objects for the SpanEngine to execute. **Exports:** `SpanParser` #### SnippetExtractor **File:** `util/snippet.gcl` Query-aware snippet extraction. Locates the best text window around matched terms, produces highlighted excerpts with configurable length and marker tags. **Exports:** `SnippetExtractor` **Dependencies:** TextTokenizer, TextNormalizer ### Utilities #### CharMap **File:** `util/char_map.gcl` 170+ Unicode-to-ASCII character mappings. Normalizes ligatures, fullwidth characters, smart quotes, mathematical symbols, em-dashes, and more. Supports custom user-defined mappings. **Exports:** `CharMap` #### StopWords **File:** `util/stop_words.gcl` Stop word lists for 33 languages. Four modes: none, default (language-specific), custom (user-provided), auto (terms in greater than 85% of documents). Removes high-frequency noise terms. **Exports:** `StopWords` (the `StopWordMode` enum it consumes is declared in `model/text_index.gcl`) #### StringUtils **File:** `processing/string_utils.gcl` String helper functions implemented in native C for performance. Whitespace collapse, character replacement, stripping, and other hot-path string operations. **Exports:** `StringUtils` **Native dependency:** `string_utils.c` #### TextParser **File:** `processing/text_parser.gcl` Section parser that splits raw input into `ParsedSection { sectionType, content, title, startLine, endLine }` arrays. Detects markdown headings, fenced code blocks, and prose paragraphs on a line-by-line scan. Drives `add_fields` and section-aware chunking. **Exports:** `TextParser`, `ParsedSection` **Native dependency:** `text_parser_native.c` #### RegexUtils **File:** `util/regex_utils.gcl` Regex helpers on top of the native NFA engine: `RegexMatch` iteration, UTF-8 boundary-safe codepoint conversions, and byte-offset-accurate match extraction. **Exports:** `RegexUtils`, `RegexMatch` **Native dependency:** `regex_accel.c`, `regex_utils_native.c` #### SearchAccel **File:** `scoring/search_accel.gcl` Native top-k helpers shared by the scoring engines: `find_kth_with_threshold` (threshold-driven top-k), `partial_sort_top_k` (partial ranked sort), and `trim_to_k` (in-place truncation). **Exports:** `SearchAccel` **Native dependency:** (nativegen-wired) ### C Native Layer 29 C source files implement all performance-critical operations and are wired through `nativegen.c`. Compiled with `-O3`, LTO, and optional SSE2/AVX2/NEON SIMD vectorization. #### string_utils.c Character-level operations: replace_chars, replace_separators, strip_punctuation, collapse_whitespace, count_words, has_alphanumeric, extract_trigrams. All single-pass O(N) implementations with optional SIMD acceleration. #### text_normalizer.c Native HTML tag stripping, URL/email stripping, accent removal, quote normalization, line break normalization, HTML entity decoding, and fused multi-pass normalization via `normalize_advanced_with_maps`. Avoids multiple string allocations by combining operations. #### bm25_scorer.c BM25 term scoring with 5 variant formulas (Lucene, Plus, L, ATIRE, Robertson). Includes compute_term_score, compute_idf, compute_batch_scores (AVX2-vectorized), precompute_tf_cache, search_compact (fused multi-term search with WAND), search_boolean_and (fused boolean-AND BM25 search), and search_phrase_compact (fused phrase search). SIMD intrinsics process 4-8 doubles per cycle. #### porter_stemmer.c Porter stemming algorithm in C. All 7 steps (1a/1b/1c, 2 with 20 suffix rules, 3, 4 with -ion case, 5). Single-word `stem` and batch `stem_batch` functions. #### dfr_scorer.c DFR (Divergence From Randomness) scoring. Supports 4 BasicModels (G/In/Ine/IF) x 2 AfterEffects (Laplace/Bernoulli) x 4 Normalizations (H1/H2/H3/Z). Single and batch scoring functions, plus `search_dfr_compact` (fused DFR search). #### lm_dirichlet_accel.c Language Model scoring with Dirichlet smoothing. Formula: `log((tf + mu * P(w|C)) / (docLen + mu))` with NaN guard. Single and batch scoring, plus `search_lm_dirichlet_compact` (fused LM-Dirichlet search). #### phonetic_engine.c Double Metaphone phonetic encoding. Handles silent prefixes (KN/GN/PN/AE/WR), digraphs (CH/PH/SH/TH), and double consonant collapsing. Returns primary and optional secondary codes. #### phrase_engine.c Phrase positional verification with slop tolerance. Two-pointer merge on sorted position arrays. O(n+m) amortized complexity. #### span_engine.c NEAR/ONEAR proximity span queries. Ordered and unordered paths with two-pointer monotonic pointer amortization. #### proximity_search.c Minimum distance between two sorted position arrays via two-pointer merge. Used for proximity scoring. #### boolean_accel.c Adaptive sorted array intersection with galloping search for skewed sizes. O(m log(n/m)) when one array is 16x+ larger. #### regex_accel.c Iterative NFA regex simulation with greedy backtracking. Supports `.` `*` `+` `?` `[abc]` `[a-z]` `[^abc]` and `\`-escapes on int32 codepoints. #### function_score_accel.c Gaussian, linear, and exponential decay functions for function scoring. Includes `compute_batch_decay` for batch decay computation. Pure math, zero allocation. #### fuzzy_matcher.c Banded Levenshtein edit distance (`banded_levenshtein`) with O(n * (2*maxEdits+1)) complexity for fuzzy matching. Also implements Meilisearch-style typo tolerance policy: `compute_typo_max_edits` returns 0/1/2 max edits based on term length. #### nativegen.c / nativegen.h Auto-generated native function registration. Links every `native fn` declared in GCL to its C implementation via `gc_program__link_type_fn` and exposes type/field/enum symbols (`gc___`) that native modules read directly for zero-reflection field access. #### link_native.c Library lifecycle hooks (`lib_start`, `lib_stop`, `worker_start`, `worker_stop`). Hoists the BM25 TF-bucket lookup-table init out of the double-checked-locking fast path so the first search does not pay for bucket initialization. Per-function wiring is now fully handled by `nativegen.c`. #### prefix_engine.c Native prefix search operations: `score_matching_terms_native` (IDF-based scoring of prefix-matching terms), `autocomplete_from_terms_native` (return matching term strings), and `trie_collect_prefix_native` (collect term nodes from trie by prefix). The vocabulary-range fallback path is the GCL `PrefixEngine::search_vocab_scan`, which scans `normalizedTerms` directly when no trie or edge-ngram index is available. #### boolean_prefix_accel.c Native boolean operations on posting lists: `sort_postings_by_ref` (sort postings by entry reference for merge operations), `intersect_node_refs` / `union_node_refs` / `difference_node_refs` (set operations on `Array>`), and `execute_boolean_native` (fused boolean query execution from compiled operation arrays). #### wildcard_matcher.c Codepoint-level wildcard pattern matching via `match_full_pattern_cps`. Supports `*` (any sequence) and `?` (any single character) on int32 codepoint arrays for correct Unicode handling. #### char_tables.c / char_tables.h Static O(1) lookup tables for character classification: ASCII punctuation/whitespace/control bitsets, Unicode combining marks hash set, smart quote replacement map (16 entries), HTML entity map (22 named entities), and binary search helpers. #### char_map_native.c Native construction of `CharMap` defaults: the default char replacement map (smart quotes, dashes, ligatures, diacritics), the Unicode combining-marks set used by `strip_accents`, and configurable smart-quote / punctuation overlays. Builds GCL `Map` and `Map` structures in a single pass without per-entry `Map::put` overhead. #### ranking_rules_native.c Native qsort-based sort for `RankingRulesEngine::apply()`. Reads `RankingCandidate` fields directly through `nativegen.h` object offsets (matchedWordCount, typoCount, minProximity, firstMatchPosition, isExactMatch, sortValue) and dispatches comparator decisions via the compiled `Array` (words/typo/proximity/attribute/sort/exactness). Replaces the GCL merge sort. #### doc_reader_native.c `DocReader::resolve_if_node` — resolves a generic `any?` value to its underlying object when it is a `node`. Used by typed multi-field ingestion (`add_fields`) so the same code path works whether `T` or `node` is passed. Returns the resolved object directly for plain objects, and null for primitives/enums. #### regex_utils_native.c UTF-8 ↔ codepoint conversion with byte-offset tracking, plus regex match iteration that materializes `RegexMatch` arrays (matched/startPos/endPos/text). Delegates the per-match NFA step to `regex_accel.c`'s `match_pattern_internal` so that offsets are always byte-accurate on multi-byte input. #### fusion_accel.c `Fusion::normalize_min_max` and `Fusion::normalize_z_score` for `Array`. Two-pass in-place rewrite of the `score` field via nativegen field offsets — min/max (or mean/stddev) in the first pass, rewrite in the second. Short-circuits to 0.5 (min-max) or 0 (z-score) when the range is zero. #### stop_words_native.c Native initialization of per-language stop-word maps for the 33 supported languages. Inlines every language table as a packed C array of interned strings and populates a single `Map` in one pass, avoiding thousands of GCL `Map::put` calls at each `build()`. #### text_chunker_native.c Native implementations of the four chunking strategies (`fixed`, `sentence`, `paragraph`, `recursive`). Emits `ChunkInfo { content, position, startChar, endChar }` directly without round-tripping through GCL strings; word counts are computed inline on the input buffer. #### text_parser_native.c Native section parsing into `ParsedSection { sectionType, content, title, startLine, endLine }`. Detects markdown headings, fenced code blocks, and contiguous prose on a line-by-line scan with in-place span trimming. #### tokenizer_native.c `TextTokenizer::build_freq_map` — builds `Map` from a `TokenInfo[]` array. Uses an open-addressing 256-slot cache keyed by FNV-1a hash so that repeated occurrences of hot terms skip the full `Map::get` path (typical docs reuse terms in ~80% of token positions). ### Dependency Summary | GCL Module | Depends On (GCL) | Depends On (C Native) | |------------|-------------------|----------------------| | TextIndex\ | All engines, Fusion, MMR, TextNormalizer, TextTokenizer, StopWords, SnippetExtractor | -- | | BM25Engine | -- | bm25_scorer.c | | BooleanEngine | BooleanParser | boolean_accel.c | | FuzzyEngine | -- | fuzzy_matcher.c | | ProximityEngine | -- | proximity_search.c | | SpanEngine | SpanParser | span_engine.c | | PhraseEngine | -- | phrase_engine.c | | DFREngine | -- | dfr_scorer.c | | LMDirichletEngine | -- | lm_dirichlet_accel.c | | PhoneticEngine | -- | phonetic_engine.c | | FunctionScoreEngine | -- | function_score_accel.c | | PrefixEngine | -- | prefix_engine.c | | BooleanAccel | -- | boolean_prefix_accel.c | | WildcardEngine | -- | wildcard_matcher.c | | TokenizerAccel | -- | (nativegen-wired) | | TextNormalizer | CharMap, StringUtils | text_normalizer.c | | TextTokenizer | PorterStemmer, StringUtils, CharMap | tokenizer_native.c | | TextChunker | StringUtils | text_chunker_native.c | | TextParser | StringUtils | text_parser_native.c | | PorterStemmer | -- | porter_stemmer.c | | SnippetExtractor | TextTokenizer, TextNormalizer | -- | | DocReader | -- | doc_reader_native.c | | StringUtils | -- | string_utils.c | | CharMap | -- | char_map_native.c | | StopWords | -- | stop_words_native.c | | RegexUtils | -- | regex_accel.c, regex_utils_native.c | | Fusion | -- | fusion_accel.c | | RankingRulesEngine | -- | ranking_rules_native.c | | SearchAccel | -- | (nativegen-wired top-k helpers) |