
Text Search Library

Full-text search library for GreyCat providing 15 search modes (plus a multi-field BM25F variant and a batch helper), multi-language support (33 languages), and C-accelerated hot paths. The main entry point is TextIndex<T>, a generic type where T is the value stored alongside each document.

Quick Start

// 1. Create an index. The default config is a sensible BM25 + stop words setup.
var index = TextIndex<String> { config: TextIndexConfig {} };

// 2. Add documents: add(text, value)
index.add("Machine learning is a subset of artificial intelligence", "doc1");
index.add("Neural networks learn patterns from data", "doc2");
index.add("Natural language processing enables text understanding", "doc3");

// 3. Build the index (required before searching)
index.build();

// 4. Search
var results = index.search_bm25("machine learning", 10);
for (var i = 0; i < results.size(); i++) {
    var r = results[i];
    info("${r.value} score=${r.score}");
}

Pipeline

Every usage follows the same three-step pattern:

  1. Add documents with add(key, value), add_batch(entries), or add_fields(value). add_fields() indexes multiple typed fields through field references on the document type and accepts either T or node<T>.
  2. Build the index with build() – computes IDF, TF caches, tries, and posting arrays
  3. Search with any of the search methods listed below

Documents can still be added after build() by calling add() again (incremental update). To delete or replace an existing document, use remove(key) and update(key, value).
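To make the add/build/search split concrete, here is a minimal Python sketch of a toy inverted index. It is not the library's implementation: TinyIndex, its whitespace tokenizer, and the simplified IDF formula are illustrative assumptions. Unlike the library, this toy does not model incremental add(), remove(), or update(); it simply requires build() to be called again after new documents are added.

```python
import math
from collections import defaultdict


class TinyIndex:
    """Toy index (hypothetical): add() stores documents, build() derives postings and IDF."""

    def __init__(self):
        self.docs = []        # list of (tokens, value) pairs
        self.postings = {}    # term -> set of document ids
        self.idf = {}         # term -> inverse document frequency
        self.built = False

    def add(self, text, value):
        # Only stores the document; nothing is searchable until build() runs.
        self.docs.append((text.lower().split(), value))
        self.built = False

    def build(self):
        postings = defaultdict(set)
        for doc_id, (tokens, _value) in enumerate(self.docs):
            for token in tokens:
                postings[token].add(doc_id)
        n = len(self.docs)
        self.postings = dict(postings)
        # Simplified IDF (an assumption): rarer terms get larger weights.
        self.idf = {t: math.log(1 + n / len(ids)) for t, ids in postings.items()}
        self.built = True

    def search(self, query, k):
        if not self.built:
            raise RuntimeError("call build() before searching")
        scores = defaultdict(float)
        for term in query.lower().split():
            for doc_id in self.postings.get(term, ()):
                scores[doc_id] += self.idf[term]
        top = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:k]
        return [(self.docs[doc_id][1], score) for doc_id, score in top]
```

The point of the sketch is that IDF and posting lists depend on the whole collection, which is why a build step sits between adding and searching.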

Choosing a Search Mode

| Mode | Best for | Method |
| --- | --- | --- |
| Hybrid | User-facing search bars (combines modes via fusion) | search(query, k, options) |
| BM25 | General keyword search | search_bm25(query, k) |
| BM25F | Multi-field weighted scoring (title/body/tags) | search_bm25_f(query, k) |
| BM25 Batch | Many queries against one index | search_bm25_batch(queries, k) |
| Semantic | Vector similarity using embeddings | search_semantic(query, k) |
| Exact | Known identifiers, error codes, log search | search_exact(query, k) |
| Fuzzy | Typo-tolerant search, name/address lookup | search_fuzzy(query, k, options) |
| Boolean | Advanced search with AND/OR/NOT | search_boolean(query, k) |
| Phrase | Exact word sequences, quoted search | search_phrase(phrase, k, options) |
| Proximity | Two concepts appearing near each other | search_proximity(term1, term2, k, options) |
| Prefix | Search-as-you-type | search_prefix(prefix, k) |
| Wildcard | Pattern matching (*, ?) | search_wildcard(pattern, k) |
| Span | Positional constraints (NEAR, ONEAR, FIRST) | search_span(spanQuery, k) |
| DFR | Alternative scoring for long documents | search_dfr(query, k) |
| LM-Dirichlet | Short queries against long documents, QA | search_lm_dirichlet(query, k) |
| Phonetic | Sound-alike name search (Smith/Smyth) | search_phonetic(query, k) |
| Quorum | At least N of M terms must match | search_quorum(query, k, minMatch) |

For per-mode details, scoring formulas, and examples see Search Modes.
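As one illustration of how a mode in the table behaves, the Quorum rule ("at least N of M terms must match") can be sketched in a few lines of Python. This is a conceptual sketch only: quorum_match, its whitespace tokenizer, and ranking by matched-term count are assumptions, not the behavior of search_quorum.

```python
def quorum_match(docs, query, k, min_match):
    """Return up to k (value, matched_count) pairs where at least
    min_match distinct query terms occur in the document text.

    docs maps a value (e.g. "doc1") to its text. Hypothetical helper.
    """
    terms = set(query.lower().split())
    hits = []
    for value, text in docs.items():
        matched = terms & set(text.lower().split())
        if len(matched) >= min_match:
            hits.append((value, len(matched)))
    # Documents matching more query terms come first.
    hits.sort(key=lambda hit: hit[1], reverse=True)
    return hits[:k]
```

With a three-term query and min_match set to 2, a document containing only one of the terms is dropped, while plain keyword search would still return it with a low score.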

Key Features

Utility Methods

var _snippet = index.snippet("doc1", "machine learning", null);  // Snippet { text, highlighted }
var _explanation = index.explain("machine learning", "doc1");    // ScoreExplanation
var _suggestions = index.suggest("mach", 5);                     // Array<Suggestion>
var _correction = index.did_you_mean("machin lerning");          // DidYouMeanResult
var _similar = index.more_like_this("doc1", 10, null);           // Array<TextResult>
var _stats = index.stats();                                      // TextIndexStats

snippet() returns a Snippet { text, highlighted }. Use index.snippets(keys, query, options) for batched extraction.

See Utility Methods for full API details.
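To show what a Snippet { text, highlighted } pair might carry, here is a small Python sketch that cuts a window around the first query hit and wraps matches in <mark> tags. The function make_snippet, the width parameter, and the <mark> wrapping are assumptions for illustration; the library's actual extraction and highlighting rules are described in Utility Methods.

```python
import re
from dataclasses import dataclass


@dataclass
class Snippet:
    text: str          # plain excerpt
    highlighted: str   # same excerpt with matches wrapped in <mark>


def make_snippet(document, query, width=60):
    """Hypothetical helper: excerpt `width` characters around the first match."""
    terms = [re.escape(t) for t in query.split()]
    if not terms:
        excerpt = document[:width]
        return Snippet(excerpt, excerpt)
    # Whole-word, case-insensitive match on any query term.
    pattern = re.compile(r"\b(" + "|".join(terms) + r")\b", re.IGNORECASE)
    first = pattern.search(document)
    start = max(0, first.start() - width // 2) if first else 0
    excerpt = document[start:start + width]
    highlighted = pattern.sub(lambda m: "<mark>" + m.group(0) + "</mark>", excerpt)
    return Snippet(excerpt, highlighted)
```

Keeping both fields lets a caller show the plain text in contexts where markup is unwanted and the highlighted form in a results page.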

Configuration

TextIndexConfig is fully optional. All fields are nullable. Standalone fields stay at the top level; everything else is grouped into nested option blocks.

// Minimal: tweak BM25 only
var cfg = TextIndexConfig {
    bm25: BM25Options { k1: 1.5, b: 0.75 }
};
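To show what k1 and b control, here is a Python sketch of textbook BM25 scoring. It assumes whitespace tokenization and the Lucene-style IDF variant; bm25_scores is a hypothetical helper, and the library's exact formula (see Search Modes) may differ.

```python
import math
from collections import Counter


def bm25_scores(docs, query, k1=1.5, b=0.75):
    """Score every document in `docs` (a list of strings) against `query`.

    k1 caps how much repeated occurrences of a term can add;
    b sets how strongly long documents are penalized (0 = not at all).
    """
    if not docs:
        return []
    tokenized = [d.lower().split() for d in docs]
    n = len(tokenized)
    avgdl = sum(len(tokens) for tokens in tokenized) / n
    # Document frequency: number of documents containing each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            length_norm = 1 - b + b * len(tokens) / avgdl
            score += idf * tf[term] * (k1 + 1) / (tf[term] + k1 * length_norm)
        scores.append(score)
    return scores
```

Raising k1 lets term frequency keep contributing for longer before it saturates; lowering b toward 0 reduces the penalty applied to documents longer than the collection average.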

// Combine a few blocks
var cfg2 = TextIndexConfig {
    usePhonetic: true,
    synonyms: synonymMap,  // synonymMap: a synonym map defined elsewhere (not shown)
    tokenization: TokenizationOptions {
        stemming: true,
        normOptions: NormOptions { stripHtmlTags: true, stripUrls: true, stripAccents: true }
    },
    stopWords: StopWordOptions {
        mode: StopWordMode::default,
        language: TextSearchLanguage::en
    },
    typoTolerance: TypoOptions { enabled: true }
};
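The normalization and stop-word options above describe a text-cleaning pipeline that runs before terms are indexed. A rough Python sketch of such a pipeline follows. The function normalize, the regular expressions, and the tiny stop-word list are assumptions for illustration; stemming is left out, and the library's own rules may differ.

```python
import re
import unicodedata

# Tiny illustrative stop-word list; real lists are much longer.
STOP_WORDS = {"a", "an", "the", "is", "of", "and", "to", "in"}


def normalize(text):
    """Hypothetical pipeline: strip HTML tags, URLs, and accents, then drop stop words."""
    text = re.sub(r"<[^>]+>", " ", text)         # like stripHtmlTags
    text = re.sub(r"https?://\S+", " ", text)    # like stripUrls
    # Decompose accented characters, then drop the combining marks (like stripAccents).
    decomposed = unicodedata.normalize("NFKD", text)
    text = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]
```

Because queries pass through the same steps as documents, a search for "creme brulee" can match text that was written with accents.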

Standalone fields: embed, synonyms, fields, deduplicateContent, fuzzyMaxTextLength, usePhonetic.

Nested option blocks: tokenization, stopWords, bm25, fusion, typoTolerance, edgeNgram, shortCircuit, diversify, chunking, dfr, lmDirichlet, highlight.

For hybrid fusion weights, build a map and pass it via fusion.weights:

var w = Map<SearchMode, float> {};
w.set(SearchMode::bm25, 0.7);
w.set(SearchMode::semantic, 0.3);
var _cfg = TextIndexConfig {
    fusion: FusionOptions { method: FusionMethod::rrf, weights: w }
};
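For intuition about what weighted reciprocal rank fusion does with such a map, here is a Python sketch. It assumes the common formulation score(d) = sum over modes of weight / (k + rank), with k = 60 as a widely used default; weighted_rrf is a hypothetical helper, and the constant and exact formula the library applies are not stated here.

```python
def weighted_rrf(rankings, weights, k=60):
    """Fuse per-mode ranked lists into one ranking.

    rankings: mode name -> list of document ids, best first.
    weights:  mode name -> weight (modes missing from the map count as 1.0).
    """
    fused = {}
    for mode, ranked_ids in rankings.items():
        weight = weights.get(mode, 1.0)
        for rank, doc_id in enumerate(ranked_ids, start=1):
            fused[doc_id] = fused.get(doc_id, 0.0) + weight / (k + rank)
    # Highest fused score first.
    return sorted(fused.items(), key=lambda item: item[1], reverse=True)


rankings = {
    "bm25": ["doc1", "doc3", "doc2"],
    "semantic": ["doc2", "doc1"],
}
fused = weighted_rrf(rankings, {"bm25": 0.7, "semantic": 0.3})
```

RRF only looks at ranks, never raw scores, so modes with very different score scales (BM25 versus cosine similarity) can be combined without normalization.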

See Configuration Reference for every field.

Presets

Static factories on TextIndexConfig produce ready-made configurations. Use them directly or as a starting point.

var _index = TextIndex<String> { config: TextIndexConfig::keyword() };

| Preset | Use case |
| --- | --- |
| TextIndexConfig::keyword() | Traditional keyword search (BM25 + exact) for help desks and internal docs |
| TextIndexConfig::semantic(embed) | Vector similarity over user-supplied embedding function with sentence chunking |
| TextIndexConfig::fuzzy() | Typo-tolerant product/UI search (BM25 + fuzzy + exact, automatic typo tolerance) |
| TextIndexConfig::multilingual(lang) | Language-specific stop words and accent stripping |
| TextIndexConfig::ecommerce() | BM25F over name/description/brand with <mark> highlighting |
| TextIndexConfig::code_search() | Source code/logs: case-sensitive, keeps punctuation/numerics, no stemming |
| TextIndexConfig::phonetic_name() | People/contact directories with Double-Metaphone phonetic matching |
| TextIndexConfig::social() | Short-text/social: auto stop words, repeating-char normalization, URL/email cleanup |
| TextIndexConfig::academic() | Long-form papers: stemming, BM25+, MMR diversity re-ranking |
| TextIndexConfig::logs() | Server/application logs: no stop words, no stemming, single-char terms |
| TextIndexConfig::realtime_alert() | Standing-query baseline; pair with PercolateIndex for reverse search |

See Presets for full preset bodies and tuning notes.

Documentation

| Document | Contents |
| --- | --- |
| Search Modes | All 15 modes with when-to-use, formulas, examples |
| Search Methods | Complete API: all search method signatures |
| Indexing | add, add_batch, add_fields, remove, update, build |
| Utility Methods | snippet, suggest, explain, did_you_mean, more_like_this, stats |
| Types | TextResult, SearchOptions, ScoreExplanation, Snippet, and more |
| Enums | SearchMode, BM25Variant, StopWordMode, and 15+ enums |
| Configuration | TextIndexConfig + SearchOptions full reference |
| Presets | Ready-to-use configs for common use cases |
| Function Scoring & Curation | Decay functions, field boosting, curation rules |
| Facets & Aggregations | Faceted search, metrics, histograms |
| Percolation | Reverse search for alerts and monitoring |
| Text Processing | TextTokenizer, TextParser, TextChunker |
| Architecture | Indexing pipeline, score fusion, optimization, diagrams |
| Demo | Complete working example covering all search modes and utilities |

Supported Languages

33 languages for stop words and text processing:

Arabic, Bulgarian, Catalan, Chinese, Czech, Danish, Dutch, English, Finnish, French, German, Greek, Gujarati, Hebrew, Hindi, Hungarian, Indonesian, Italian, Japanese, Korean, Malay, Norwegian, Persian, Polish, Portuguese, Romanian, Russian, Slovak, Spanish, Swedish, Turkish, Ukrainian, Vietnamese.