Text Search Library
Full-text search library for GreyCat providing 15 search modes (plus a multi-field BM25F variant and a batch helper), multi-language support (33 languages), and C-accelerated hot paths. The main entry point is TextIndex<T>, a generic type where T is the value stored alongside each document.
Quick Start
// 1. Create an index. The default config is a sensible BM25 + stop words setup.
var index = TextIndex<String> { config: TextIndexConfig {} };
// 2. Add documents: add(text, value)
index.add("Machine learning is a subset of artificial intelligence", "doc1");
index.add("Neural networks learn patterns from data", "doc2");
index.add("Natural language processing enables text understanding", "doc3");
// 3. Build the index (required before searching)
index.build();
// 4. Search
var results = index.search_bm25("machine learning", 10);
for (var i = 0; i < results.size(); i++) {
    var r = results[i];
    info("${r.value} score=${r.score}");
}
Pipeline
Every usage follows the same three-step pattern:
- Add documents with add(key, value), add_batch(entries), or add_fields(value) (typed multi-field indexing with field references on the document type; supports T or node<T>)
- Build the index with build(), which computes IDF, TF caches, tries, and posting arrays
- Search with any of the search methods listed below
Use add() after build() for incremental updates. For deletions/updates use remove(key) and update(key, value).
Choosing a Search Mode
| Mode | Best for | Method |
|---|---|---|
| Hybrid | User-facing search bars (combines modes via fusion) | search(query, k, options) |
| BM25 | General keyword search | search_bm25(query, k) |
| BM25F | Multi-field weighted scoring (title/body/tags) | search_bm25_f(query, k) |
| BM25 Batch | Many queries against one index | search_bm25_batch(queries, k) |
| Semantic | Vector similarity using embeddings | search_semantic(query, k) |
| Exact | Known identifiers, error codes, log search | search_exact(query, k) |
| Fuzzy | Typo-tolerant search, name/address lookup | search_fuzzy(query, k, options) |
| Boolean | Advanced search with AND/OR/NOT | search_boolean(query, k) |
| Phrase | Exact word sequences, quoted search | search_phrase(phrase, k, options) |
| Proximity | Two concepts appearing near each other | search_proximity(term1, term2, k, options) |
| Prefix | Search-as-you-type | search_prefix(prefix, k) |
| Wildcard | Pattern matching (*, ?) | search_wildcard(pattern, k) |
| Span | Positional constraints (NEAR, ONEAR, FIRST) | search_span(spanQuery, k) |
| DFR | Alternative scoring for long documents | search_dfr(query, k) |
| LM-Dirichlet | Short queries against long documents, QA | search_lm_dirichlet(query, k) |
| Phonetic | Sound-alike name search (Smith/Smyth) | search_phonetic(query, k) |
| Quorum | At least N of M terms must match | search_quorum(query, k, minMatch) |
For per-mode details, scoring formulas, and examples see Search Modes.
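As a rough illustration of the Quorum row in the table above, the "at least N of M terms must match" rule can be sketched in a few lines of Python. This is a conceptual sketch only, not the library's implementation: the function name `quorum_match`, the whitespace tokenization, and the ordering by number of matched terms are all assumptions made for the example.

```python
def quorum_match(query, documents, min_match):
    """Return (key, matched_count) pairs for documents that contain
    at least min_match distinct query terms."""
    terms = set(query.lower().split())
    hits = []
    for key, text in documents.items():
        doc_terms = set(text.lower().split())
        matched = len(terms & doc_terms)  # distinct query terms present
        if matched >= min_match:
            hits.append((key, matched))
    # Most matched terms first, then by key for a stable order.
    hits.sort(key=lambda h: (-h[1], h[0]))
    return hits

docs = {
    "doc1": "Machine learning is a subset of artificial intelligence",
    "doc2": "Neural networks learn patterns from data",
    "doc3": "Natural language processing enables text understanding",
}
# Four query terms, any two of which must appear in a document.
print(quorum_match("machine learning data patterns", docs, 2))
```

Here doc1 matches "machine" and "learning", doc2 matches "data" and "patterns", and doc3 matches nothing, so only the first two pass a quorum of 2.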
Key Features
- Faceted Search & Aggregations – Term facets, numeric range facets, metric/histogram aggregations
- Function Scoring & Curation – Decay functions, field value factors, document pinning/boosting/suppression, ranking rules
- Percolation (Reverse Search) – Register queries, match incoming documents against them
- Text Processing Utilities – Tokenizer, parser, chunker for preprocessing pipelines
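To make the percolation idea concrete (queries are stored first, and each incoming document is then checked against them), here is a minimal Python sketch. It does not use or mirror the library's PercolateIndex API: the `Percolator` class, its method names, and the rule that every query term must appear in the document are hypothetical choices for illustration.

```python
class Percolator:
    """Stores queries and reports which ones an incoming document satisfies."""

    def __init__(self):
        self.queries = {}  # query id -> set of required terms

    def register(self, query_id, query):
        # Assumption: a query is a bag of terms that must all be present.
        self.queries[query_id] = set(query.lower().split())

    def percolate(self, text):
        doc_terms = set(text.lower().split())
        # A query matches when its term set is a subset of the document's terms.
        return sorted(qid for qid, terms in self.queries.items() if terms <= doc_terms)

p = Percolator()
p.register("disk-alert", "disk full")
p.register("auth-alert", "login failed")
print(p.percolate("ERROR disk full on /dev/sda1"))
```

The roles are reversed compared with normal search: the stored items are queries, and the "query" at match time is a document, which is what makes this pattern a fit for alerting.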
Utility Methods
var _snippet = index.snippet("doc1", "machine learning", null); // Snippet { text, highlighted }
var _explanation = index.explain("machine learning", "doc1"); // ScoreExplanation
var _suggestions = index.suggest("mach", 5); // Array<Suggestion>
var _correction = index.did_you_mean("machin lerning"); // DidYouMeanResult
var _similar = index.more_like_this("doc1", 10, null); // Array<TextResult>
var _stats = index.stats(); // TextIndexStats
snippet() returns a Snippet { text, highlighted }. Use index.snippets(keys, query, options) for batched extraction.
See Utility Methods for full API details.
Configuration
TextIndexConfig is fully optional. All fields are nullable. Standalone fields stay at the top level; everything else is grouped into nested option blocks.
// Minimal: tweak BM25 only
var cfg = TextIndexConfig {
bm25: BM25Options { k1: 1.5, b: 0.75 }
};
// Combine a few blocks
var cfg2 = TextIndexConfig {
usePhonetic: true,
synonyms: synonymMap,
tokenization: TokenizationOptions {
stemming: true,
normOptions: NormOptions { stripHtmlTags: true, stripUrls: true, stripAccents: true }
},
stopWords: StopWordOptions {
mode: StopWordMode::default,
language: TextSearchLanguage::en
},
typoTolerance: TypoOptions { enabled: true }
};
Standalone fields: embed, synonyms, fields, deduplicateContent, fuzzyMaxTextLength, usePhonetic.
Nested option blocks: tokenization, stopWords, bm25, fusion, typoTolerance, edgeNgram, shortCircuit, diversify, chunking, dfr, lmDirichlet, highlight.
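For readers tuning the bm25 block, it may help to see what k1 and b do in the standard BM25 formula: k1 limits how much repeated occurrences of a term can add to the score, and b controls how strongly the score is normalized by document length. The Python sketch below implements the textbook per-term formula with a Lucene-style smoothed IDF. It is an illustration under those assumptions, not the library's code; the exact variant and IDF smoothing used by TextIndex may differ, and `bm25_term_score` is a hypothetical name.

```python
import math

def bm25_term_score(tf, doc_len, avg_doc_len, n_docs, doc_freq, k1=1.5, b=0.75):
    """Textbook BM25 score of one query term in one document."""
    # Smoothed IDF: stays positive even for very common terms.
    idf = math.log(1.0 + (n_docs - doc_freq + 0.5) / (doc_freq + 0.5))
    # Length normalization: b=0 ignores document length, b=1 applies it fully.
    norm = 1.0 - b + b * (doc_len / avg_doc_len)
    # Term-frequency saturation: the ratio approaches (k1 + 1) as tf grows.
    return idf * (tf * (k1 + 1.0)) / (tf + k1 * norm)

# Same term frequency, different document lengths.
short_doc = bm25_term_score(tf=3, doc_len=50, avg_doc_len=100, n_docs=1000, doc_freq=10)
long_doc = bm25_term_score(tf=3, doc_len=400, avg_doc_len=100, n_docs=1000, doc_freq=10)
print(short_doc > long_doc)  # True: longer documents are penalized when b > 0
```

Raising k1 lets term frequency keep contributing for longer before it saturates; lowering b toward 0 makes long and short documents score alike for the same term counts.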
For hybrid fusion weights, build a map and pass it via fusion.weights:
var w = Map<SearchMode, float> {};
w.set(SearchMode::bm25, 0.7);
w.set(SearchMode::semantic, 0.3);
var _cfg = TextIndexConfig {
fusion: FusionOptions { method: FusionMethod::rrf, weights: w }
};
See Configuration Reference for every field.
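As background on what fusion weights mean with reciprocal rank fusion (RRF), the sketch below shows one common weighted form: each mode contributes weight / (k + rank) for every document it returns, and the contributions are summed. This is a Python illustration under stated assumptions, not the library's implementation: the constant k = 60 is the conventional RRF default, the weighted form is one of several possible variants, and `weighted_rrf` plus the string mode names (standing in for SearchMode::bm25 and SearchMode::semantic) are hypothetical.

```python
from collections import defaultdict

def weighted_rrf(rankings, weights, k=60):
    """Fuse ranked lists: each mode contributes weight / (k + rank)."""
    fused = defaultdict(float)
    for mode, ranked_keys in rankings.items():
        w = weights.get(mode, 1.0)  # modes without a weight count fully
        for rank, key in enumerate(ranked_keys, start=1):
            fused[key] += w / (k + rank)
    # Highest fused score first; ties broken by key for determinism.
    return sorted(fused.items(), key=lambda kv: (-kv[1], kv[0]))

rankings = {
    "bm25": ["doc1", "doc3", "doc2"],  # best match first
    "semantic": ["doc2", "doc1"],
}
weights = {"bm25": 0.7, "semantic": 0.3}
for key, score in weighted_rrf(rankings, weights):
    print(key, round(score, 5))
```

With these inputs doc1 comes first because it ranks well in both lists, doc2 second, and doc3 last since only one mode returned it. Because RRF works on ranks rather than raw scores, modes with very different score scales can be combined without calibration.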
Presets
Static factories on TextIndexConfig produce ready-made configurations. Use them directly or as a starting point.
var _index = TextIndex<String> { config: TextIndexConfig::keyword() };
| Preset | Use case |
|---|---|
| TextIndexConfig::keyword() | Traditional keyword search (BM25 + exact) for help desks and internal docs |
| TextIndexConfig::semantic(embed) | Vector similarity over a user-supplied embedding function, with sentence chunking |
| TextIndexConfig::fuzzy() | Typo-tolerant product/UI search (BM25 + fuzzy + exact, automatic typo tolerance) |
| TextIndexConfig::multilingual(lang) | Language-specific stop words and accent stripping |
| TextIndexConfig::ecommerce() | BM25F over name/description/brand with <mark> highlighting |
| TextIndexConfig::code_search() | Source code/logs: case-sensitive, keeps punctuation/numerics, no stemming |
| TextIndexConfig::phonetic_name() | People/contact directories with Double-Metaphone phonetic matching |
| TextIndexConfig::social() | Short-text/social: auto stop words, repeating-char normalization, URL/email cleanup |
| TextIndexConfig::academic() | Long-form papers: stemming, BM25+, MMR diversity re-ranking |
| TextIndexConfig::logs() | Server/application logs: no stop words, no stemming, single-char terms |
| TextIndexConfig::realtime_alert() | Standing-query baseline; pair with PercolateIndex for reverse search |
See Presets for full preset bodies and tuning notes.
Documentation
| Document | Contents |
|---|---|
| Search Modes | All 15 modes with when-to-use, formulas, examples |
| Search Methods | Complete API: all search method signatures |
| Indexing | add, add_batch, add_fields, remove, update, build |
| Utility Methods | snippet, suggest, explain, did_you_mean, more_like_this, stats |
| Types | TextResult, SearchOptions, ScoreExplanation, Snippet, and more |
| Enums | SearchMode, BM25Variant, StopWordMode, and 15+ enums |
| Configuration | TextIndexConfig + SearchOptions full reference |
| Presets | Ready-to-use configs for common use cases |
| Function Scoring & Curation | Decay functions, field boosting, curation rules |
| Facets & Aggregations | Faceted search, metrics, histograms |
| Percolation | Reverse search for alerts and monitoring |
| Text Processing | TextTokenizer, TextParser, TextChunker |
| Architecture | Indexing pipeline, score fusion, optimization, diagrams |
| Demo | Complete working example covering all search modes and utilities |
Supported Languages
33 languages for stop words and text processing:
Arabic, Bulgarian, Catalan, Chinese, Czech, Danish, Dutch, English, Finnish, French, German, Greek, Gujarati, Hebrew, Hindi, Hungarian, Indonesian, Italian, Japanese, Korean, Malay, Norwegian, Persian, Polish, Portuguese, Romanian, Russian, Slovak, Spanish, Swedish, Turkish, Ukrainian, Vietnamese.