Text Search Library

Full-text search library for GreyCat providing 15 search modes (plus a multi-field BM25F variant and a batch helper), multi-language support (33 languages), and C-accelerated hot paths. The main entry point is TextIndex<T>, a generic type where T is the value stored alongside each document.

Quick Start

// 1. Create an index. The default config is a sensible BM25 + stop words setup.
var index = TextIndex<String> { config: TextIndexConfig {} };

// 2. Add documents: add(text, value)
index.add("Machine learning is a subset of artificial intelligence", "doc1");
index.add("Neural networks learn patterns from data", "doc2");
index.add("Natural language processing enables text understanding", "doc3");

// 3. Build the index (required before searching)
index.build();

// 4. Search
var results = index.search_bm25("machine learning", 10);
for (var i = 0; i < results.size(); i++) {
    var r = results[i];
    info("${r.value} score=${r.score}");
}

Pipeline

Every usage follows the same three-step pattern:

1. Add documents with add(key, value), add_batch(entries), or add_fields(value). add_fields is the typed multi-field variant: it uses field references on the document type and accepts either T or node<T>.
2. Build the index with build(), which computes IDF, TF caches, tries, and posting arrays.
3. Search with any of the search methods listed below.

After build(), call add() again for incremental updates. To delete or replace a document, use remove(key) or update(key, value).
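To make the add/build/search split concrete, here is a minimal Python sketch of a toy inverted index. It is not the library's implementation: `ToyIndex`, its whitespace tokenizer, and the simple TF x IDF score are all assumptions chosen for brevity. The point it illustrates is why a build step exists at all: IDF depends on the whole collection, so it can only be computed once the documents are known. Unlike the library, which supports incremental add() after build(), this toy requires a full rebuild after every add.

```python
import math
from collections import Counter, defaultdict


class ToyIndex:
    """Hypothetical toy index; illustrates add -> build -> search only."""

    def __init__(self):
        self._docs = []                     # list of (tokens, value)
        self._postings = defaultdict(dict)  # term -> {doc_id: term frequency}
        self._idf = {}                      # term -> inverse document frequency
        self._built = False

    def add(self, text, value):
        # Naive lowercase + whitespace tokenization (an assumption).
        self._docs.append((text.lower().split(), value))
        self._built = False  # toy limitation: any add invalidates the build

    def build(self):
        self._postings.clear()
        for doc_id, (tokens, _) in enumerate(self._docs):
            for term, tf in Counter(tokens).items():
                self._postings[term][doc_id] = tf
        n_docs = len(self._docs)
        # IDF needs collection-wide counts, hence the separate build step.
        self._idf = {term: math.log(1.0 + n_docs / len(posting))
                     for term, posting in self._postings.items()}
        self._built = True

    def search(self, query, k):
        if not self._built:
            raise RuntimeError("call build() before searching")
        scores = Counter()
        for term in query.lower().split():
            for doc_id, tf in self._postings.get(term, {}).items():
                scores[doc_id] += tf * self._idf[term]
        return [self._docs[doc_id][1] for doc_id, _ in scores.most_common(k)]


toy = ToyIndex()
toy.add("Machine learning is a subset of artificial intelligence", "doc1")
toy.add("Neural networks learn patterns from data", "doc2")
toy.build()
top = toy.search("machine learning", 10)
```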

Choosing a Search Mode

Mode	Best for	Method
Hybrid	User-facing search bars (combines modes via fusion)	`search(query, k, options)`
BM25	General keyword search	`search_bm25(query, k)`
BM25F	Multi-field weighted scoring (title/body/tags)	`search_bm25_f(query, k)`
BM25 Batch	Many queries against one index	`search_bm25_batch(queries, k)`
Semantic	Vector similarity using embeddings	`search_semantic(query, k)`
Exact	Known identifiers, error codes, log search	`search_exact(query, k)`
Fuzzy	Typo-tolerant search, name/address lookup	`search_fuzzy(query, k, options)`
Boolean	Advanced search with AND/OR/NOT	`search_boolean(query, k)`
Phrase	Exact word sequences, quoted search	`search_phrase(phrase, k, options)`
Proximity	Two concepts appearing near each other	`search_proximity(term1, term2, k, options)`
Prefix	Search-as-you-type	`search_prefix(prefix, k)`
Wildcard	Pattern matching (`*`, `?`)	`search_wildcard(pattern, k)`
Span	Positional constraints (NEAR, ONEAR, FIRST)	`search_span(spanQuery, k)`
DFR	Alternative scoring for long documents	`search_dfr(query, k)`
LM-Dirichlet	Short queries against long documents, QA	`search_lm_dirichlet(query, k)`
Phonetic	Sound-alike name search (Smith/Smyth)	`search_phonetic(query, k)`
Quorum	At least N of M terms must match	`search_quorum(query, k, minMatch)`

For per-mode details, scoring formulas, and examples see Search Modes.
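As a small illustration of one of the less familiar rows, the Quorum rule ("at least N of M terms must match") can be sketched in a few lines of Python. This is only a sketch of the matching rule, not the library's code: `quorum_match` and the whitespace tokenization are hypothetical, and the real search_quorum presumably also scores and ranks the matches.

```python
def quorum_match(query: str, document: str, min_match: int) -> bool:
    """Return True when at least min_match distinct query terms occur in document."""
    query_terms = set(query.lower().split())
    doc_terms = set(document.lower().split())
    return len(query_terms & doc_terms) >= min_match


docs = {
    "doc1": "Machine learning is a subset of artificial intelligence",
    "doc2": "Neural networks learn patterns from data",
    "doc3": "Natural language processing enables text understanding",
}

# Require 2 of the 3 query terms: only doc1 contains both "machine" and "learning".
hits = [key for key, text in docs.items()
        if quorum_match("machine learning data", text, 2)]
```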

Key Features

Faceted Search & Aggregations – Term facets, numeric range facets, metric/histogram aggregations
Function Scoring & Curation – Decay functions, field value factors, document pinning/boosting/suppression, ranking rules
Percolation (Reverse Search) – Register queries, match incoming documents against them
Text Processing Utilities – Tokenizer, parser, chunker for preprocessing pipelines
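Percolation inverts the usual direction of search: queries are stored, and each incoming document is checked against them. The Python sketch below shows the idea under strong simplifying assumptions. `ToyPercolator` is hypothetical, queries are plain "all terms must appear" conjunctions, and nothing here reflects the library's actual percolation API.

```python
class ToyPercolator:
    """Hypothetical reverse-search sketch: stored queries, incoming documents."""

    def __init__(self):
        self._queries = {}  # query_id -> set of required terms

    def register(self, query_id, query):
        self._queries[query_id] = set(query.lower().split())

    def percolate(self, document):
        """Return the ids of all registered queries that the document satisfies."""
        doc_terms = set(document.lower().split())
        return sorted(query_id for query_id, terms in self._queries.items()
                      if terms and terms <= doc_terms)


percolator = ToyPercolator()
percolator.register("disk-alert", "disk full")
percolator.register("auth-alert", "login failed")
matches = percolator.percolate("ERROR disk full on node 7")
```

This shape suits alerting: the set of standing queries changes rarely, while documents (log lines, events) arrive continuously.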

Utility Methods

var _snippet = index.snippet("doc1", "machine learning", null);  // Snippet { text, highlighted }
var _explanation = index.explain("machine learning", "doc1");    // ScoreExplanation
var _suggestions = index.suggest("mach", 5);                     // Array<Suggestion>
var _correction = index.did_you_mean("machin lerning");          // DidYouMeanResult
var _similar = index.more_like_this("doc1", 10, null);           // Array<TextResult>
var _stats = index.stats();                                       // TextIndexStats

snippet() returns a Snippet { text, highlighted } for a single document. To extract snippets for several keys in one call, use index.snippets(keys, query, options).

See Utility Methods for full API details.

Configuration

Every field of TextIndexConfig is optional (nullable), so an empty TextIndexConfig {} is a valid configuration. Standalone fields stay at the top level; everything else is grouped into nested option blocks.

// Minimal: tweak BM25 only
var cfg = TextIndexConfig {
    bm25: BM25Options { k1: 1.5, b: 0.75 }
};
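For readers unfamiliar with what k1 and b control, the following Python sketch shows the textbook Okapi BM25 score for a single term. It is an illustration, not the library's scoring code: the function name, the Lucene-style IDF, and the sample numbers are assumptions, and the library may use a different variant (the presets mention BM25+, for example). In this form, k1 sets how quickly repeated occurrences of a term stop adding score, and b sets how strongly long documents are penalized (b = 0 disables length normalization).

```python
import math


def bm25_term_score(tf, doc_len, avg_doc_len, n_docs, doc_freq, k1=1.5, b=0.75):
    """Textbook BM25 contribution of one query term to one document.

    tf: occurrences of the term in the document
    doc_len / avg_doc_len: document length and collection average, in tokens
    n_docs / doc_freq: collection size and number of documents containing the term
    """
    # One common IDF form (an assumption; other variants exist).
    idf = math.log(1.0 + (n_docs - doc_freq + 0.5) / (doc_freq + 0.5))
    length_norm = 1.0 - b + b * (doc_len / avg_doc_len)
    return idf * (tf * (k1 + 1.0)) / (tf + k1 * length_norm)


# Same term frequency, different document lengths: the shorter document scores higher.
short_doc = bm25_term_score(tf=3, doc_len=50, avg_doc_len=100, n_docs=1000, doc_freq=10)
long_doc = bm25_term_score(tf=3, doc_len=200, avg_doc_len=100, n_docs=1000, doc_freq=10)
```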

// Combine a few blocks (synonymMap is assumed to be a synonym map defined elsewhere)
var cfg2 = TextIndexConfig {
    usePhonetic: true,
    synonyms: synonymMap,
    tokenization: TokenizationOptions {
        stemming: true,
        normOptions: NormOptions { stripHtmlTags: true, stripUrls: true, stripAccents: true }
    },
    stopWords: StopWordOptions {
        mode: StopWordMode::default,
        language: TextSearchLanguage::en
    },
    typoTolerance: TypoOptions { enabled: true }
};

Standalone fields: embed, synonyms, fields, deduplicateContent, fuzzyMaxTextLength, usePhonetic.

Nested option blocks: tokenization, stopWords, bm25, fusion, typoTolerance, edgeNgram, shortCircuit, diversify, chunking, dfr, lmDirichlet, highlight.

For hybrid fusion weights, build a map and pass it via fusion.weights:

var w = Map<SearchMode, float> {};
w.set(SearchMode::bm25, 0.7);
w.set(SearchMode::semantic, 0.3);
var _cfg = TextIndexConfig {
    fusion: FusionOptions { method: FusionMethod::rrf, weights: w }
};
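To show what such weights do under reciprocal rank fusion (RRF), here is a short Python sketch. It assumes the common weighted form, in which each mode contributes weight / (k + rank) for every document it returns; whether the library applies weights in exactly this way, and which smoothing constant it uses, is not stated here. The function name, the k = 60 default, and the sample rankings are illustrative assumptions.

```python
from collections import defaultdict


def weighted_rrf(rankings, weights, k=60):
    """Fuse per-mode ranked key lists into one ranking.

    rankings: mode name -> list of document keys, best first
    weights:  mode name -> weight (modes without a weight default to 1.0)
    k:        smoothing constant; 60 is a conventional choice (an assumption)
    """
    scores = defaultdict(float)
    for mode, ranked_keys in rankings.items():
        weight = weights.get(mode, 1.0)
        for rank, key in enumerate(ranked_keys, start=1):
            scores[key] += weight / (k + rank)
    # Highest fused score first; ties broken by key for determinism.
    return sorted(scores, key=lambda key: (-scores[key], key))


rankings = {
    "bm25": ["doc1", "doc2", "doc3"],
    "semantic": ["doc3", "doc1"],
}
weights = {"bm25": 0.7, "semantic": 0.3}
fused = weighted_rrf(rankings, weights)
```

Because RRF works on ranks rather than raw scores, modes with incompatible score scales (BM25 scores versus cosine similarities, for instance) can be combined without normalization.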

See Configuration Reference for every field.

Presets

Static factories on TextIndexConfig produce ready-made configurations. Use them directly or as a starting point.

var _index = TextIndex<String> { config: TextIndexConfig::keyword() };

Preset	Use case
`TextIndexConfig::keyword()`	Traditional keyword search (BM25 + exact) for help desks and internal docs
`TextIndexConfig::semantic(embed)`	Vector similarity over a user-supplied embedding function, with sentence chunking
`TextIndexConfig::fuzzy()`	Typo-tolerant product/UI search (BM25 + fuzzy + exact, automatic typo tolerance)
`TextIndexConfig::multilingual(lang)`	Language-specific stop words and accent stripping
`TextIndexConfig::ecommerce()`	BM25F over name/description/brand with `<mark>` highlighting
`TextIndexConfig::code_search()`	Source code/logs: case-sensitive, keeps punctuation/numerics, no stemming
`TextIndexConfig::phonetic_name()`	People/contact directories with Double-Metaphone phonetic matching
`TextIndexConfig::social()`	Short-text/social: auto stop words, repeating-char normalization, URL/email cleanup
`TextIndexConfig::academic()`	Long-form papers: stemming, BM25+, MMR diversity re-ranking
`TextIndexConfig::logs()`	Server/application logs: no stop words, no stemming, single-char terms
`TextIndexConfig::realtime_alert()`	Standing-query baseline; pair with PercolateIndex for reverse search

See Presets for full preset bodies and tuning notes.

Documentation

Document	Contents
Search Modes	All 15 modes, with when-to-use guidance, formulas, and examples
Search Methods	Complete API: all search method signatures
Indexing	add, add_batch, add_fields, remove, update, build
Utility Methods	snippet, suggest, explain, did_you_mean, more_like_this, stats
Types	TextResult, SearchOptions, ScoreExplanation, Snippet, and more
Enums	SearchMode, BM25Variant, StopWordMode, and 15+ enums
Configuration	TextIndexConfig + SearchOptions full reference
Presets	Ready-to-use configs for common use cases
Function Scoring & Curation	Decay functions, field boosting, curation rules
Facets & Aggregations	Faceted search, metrics, histograms
Percolation	Reverse search for alerts and monitoring
Text Processing	TextTokenizer, TextParser, TextChunker
Architecture	Indexing pipeline, score fusion, optimization, diagrams
Demo	Complete working example covering all search modes and utilities

Supported Languages

33 languages for stop words and text processing:

Arabic, Bulgarian, Catalan, Chinese, Czech, Danish, Dutch, English, Finnish, French, German, Greek, Gujarati, Hebrew, Hindi, Hungarian, Indonesian, Italian, Japanese, Korean, Malay, Norwegian, Persian, Polish, Portuguese, Romanian, Russian, Slovak, Spanish, Swedish, Turkish, Ukrainian, Vietnamese.