# Utility Methods

Helper methods for snippet extraction, term suggestion, spell correction, score explanation, content discovery, and index statistics.

---

## snippet()

Extract a query-aware text snippet from a document. Returns both plain text and highlighted variants in one call. The algorithm uses a sliding window to find the passage with the highest density of query terms, so the returned snippet centers on the most relevant part of the document.

```gcl
fn snippet(key: String, query: String, options: SnippetOptions?): Snippet?
```

**Parameters**

| Parameter | Type | Description |
|-----------|------|-------------|
| `key` | `String` | Original document text (the first argument passed to `add()`) |
| `query` | `String` | Search query to locate relevant passage |
| `options` | `SnippetOptions?` | Optional snippet options (default: `maxLength` 200) |

**Returns:** `Snippet { text, highlighted }`, or `null` if the document is not found or no query terms match.

> Note: throughout this document, `key` refers to the original indexed text content (i.e. the first argument passed to `add()`), not the metadata stored as `value`. On a `TextResult`, that text is exposed as `r.key`.

**`SnippetOptions`**

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| `maxLength` | `int?` | `200` | Maximum snippet length in characters |

**`Snippet`**

| Field | Type | Description |
|-------|------|-------------|
| `text` | `String` | Plain snippet text (no markup) |
| `highlighted` | `String` | Same snippet with matched query terms wrapped in `config.highlight.preTag` / `postTag` |

**Example: Search results page**

```gcl
var results = index.search_bm25("machine learning algorithms", 10);
for (var i = 0; i < results.size(); i++) {
    var r = results[i];
    // r.key is the original indexed text; r.value is the metadata stored alongside it.
    var snip = index.snippet(r.key, "machine learning algorithms", null);
    if (snip != null) {
        info("${r.value}: ${snip.highlighted}");
    }
}
// Output: "doc1: ...supervised machine learning algorithms are trained on labeled data..."
```

**Example: Custom snippet length for mobile UI**

```gcl
// The key is the same original text that was passed to add().
var articleText = "Deep neural networks have revolutionized many areas of machine learning...";
index.add(articleText, "article-42");
index.build();

// Short snippets for mobile cards
var _shortSnip = index.snippet(articleText, "neural networks", SnippetOptions { maxLength: 80 });
// _shortSnip!!.text -> "...deep neural networks have revolutionized..."
// _shortSnip!!.highlighted -> "...deep neural networks have revolutionized..."

// Longer snippets for desktop detail view
var _longSnip = index.snippet(articleText, "neural networks", SnippetOptions { maxLength: 400 });
```

**Example: HTML highlighting via config**

```gcl
var index = TextIndex {
    config: TextIndexConfig {
        stopWords: StopWordOptions { mode: StopWordMode::default },
        highlight: HighlightOptions { preTag: "", postTag: "" }
    }
};
var articleText = "An introduction about machine learning algorithms that transform raw data...";
index.add(articleText, "article-1");
index.build();

var _snip = index.snippet(articleText, "machine learning", null);
// _snip!!.highlighted: "...about machine learning algorithms that transform..."
```

**Example: Terminal output with ANSI colors**

```gcl
// ANSI color codes begin with the ESC control character (0x1B),
// which must precede the "[" in each tag below.
var index = TextIndex {
    config: TextIndexConfig {
        stopWords: StopWordOptions { mode: StopWordMode::default },
        highlight: HighlightOptions { preTag: "[1;33m", postTag: "[0m" }
    }
};
// Highlighted terms appear bold yellow in terminal
var logLine = "2026-01-12 ERROR: request timeout while contacting upstream service";
index.add(logLine, "doc1");
index.build();
var _snip = index.snippet(logLine, "error timeout", SnippetOptions { maxLength: 150 });
```

**Notes**

- The sliding window scores each candidate passage by counting the distinct query terms it contains
- Stop words in the query are filtered out before matching (respects `stopWords.mode`)
- Stemming applies if enabled: a query for "running" will match "runs" in the document
- Both `text` and `highlighted` are computed in a single pass

---

## snippets()

Extract `Snippet { text, highlighted }` for multiple documents in one call. More efficient than calling `snippet()` in a loop because it avoids repeated query parsing.

```gcl
fn snippets(keys: Array<String>, query: String, options: SnippetOptions?): Map<String, Snippet>
```

**Parameters**

| Parameter | Type | Description |
|-----------|------|-------------|
| `keys` | `Array<String>` | Array of original document texts (the first arguments passed to `add()`) |
| `query` | `String` | Search query |
| `options` | `SnippetOptions?` | Optional snippet options |

**Returns:** Map of key (original document text) to `Snippet { text, highlighted }`. Only keys where at least one query term matched are included; all other keys are omitted.

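
As a rough illustration of why batching helps, the sketch below (plain Python, not the library's implementation) parses the query once and reuses the parsed terms for every document, scoring each candidate window by the number of distinct query terms it contains. The helper names `tokenize`, `best_window`, and `snippets_batch`, the regex tokenizer, and the brute-force window scan are all assumptions made for this example; stemming, stop-word filtering, and highlighting are left out.

```python
import re


def tokenize(text):
    """Lowercase word tokens with their character spans (assumed tokenizer)."""
    return [(m.group().lower(), m.start(), m.end()) for m in re.finditer(r"\w+", text)]


def best_window(text, query_terms, max_length=200):
    """Return the passage of at most max_length characters that contains
    the most distinct query terms, or None if no term matches."""
    tokens = tokenize(text)
    best = None  # (distinct_term_count, start_offset, end_offset)
    for i, (_, start, _) in enumerate(tokens):
        seen = set()
        end = start
        # Grow the window token by token until it would exceed max_length.
        for term, _, tok_end in tokens[i:]:
            if tok_end - start > max_length:
                break
            end = tok_end
            if term in query_terms:
                seen.add(term)
        if seen and (best is None or len(seen) > best[0]):
            best = (len(seen), start, end)
    return None if best is None else text[best[1]:best[2]]


def snippets_batch(texts, query, max_length=200):
    """Parse the query once, then reuse the parsed terms for every document."""
    query_terms = {term for term, _, _ in tokenize(query)}
    result = {}
    for text in texts:
        passage = best_window(text, query_terms, max_length)
        if passage is not None:  # documents with no matching term are omitted
            result[text] = passage
    return result
```

The scan above is quadratic in the number of tokens, which keeps the sketch short; a production implementation would more likely slide the window incrementally.
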
**Example: Render a page of search results**

```gcl
var results = index.search_bm25("database optimization", 20);

// Collect keys (original indexed text) from results
var keys = Array<String> {};
for (var i = 0; i < results.size(); i++) {
    keys.add(results[i].key);
}

// Single batch call instead of 20 individual calls
var snips = index.snippets(keys, "database optimization", SnippetOptions { maxLength: 150 });

// Render results with snippets
for (var i = 0; i < results.size(); i++) {
    var r = results[i];
    var snip = snips.get(r.key);
    if (snip != null) {
        // r.value is the metadata stored alongside the text
        info("${r.value}: ${snip.highlighted}");
    }
}
```

**Notes**

- Uses `highlight.preTag` and `highlight.postTag` from config for the `highlighted` field
- Preferred over calling `snippet()` in a loop once a result set has more than a few (roughly 3-5) documents

---

## suggest()

Get weighted term suggestions for a prefix. Returns `Suggestion` objects with IDF-weighted scores and document frequency counts, giving you control over how suggestions are ranked and displayed.

```gcl
fn suggest(prefix: String, k: int): Array<Suggestion>
```

**Parameters**

| Parameter | Type | Description |
|-----------|------|-------------|
| `prefix` | `String` | Term prefix to complete |
| `k` | `int` | Maximum number of suggestions |

**Returns:** Array of `Suggestion` objects sorted by score (IDF-weighted).

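
To make the IDF weighting concrete, here is a minimal Python sketch of prefix completion ranked by inverse document frequency. It is not the library's implementation: the function name, the `doc_freq` mapping (term to document count), and the smoothed IDF formula `ln(1 + (N - df + 0.5) / (df + 0.5))` are assumptions chosen for illustration.

```python
import math


def suggest(doc_freq, total_docs, prefix, k):
    """Return up to k (term, score, df) tuples for vocabulary terms that
    start with prefix, highest IDF score first."""
    matches = []
    for term, df in doc_freq.items():
        if term.startswith(prefix):
            # Assumed smoothed IDF: rarer terms receive higher scores.
            idf = math.log(1 + (total_docs - df + 0.5) / (df + 0.5))
            matches.append((term, idf, df))
    # Sort by descending score, breaking ties alphabetically.
    matches.sort(key=lambda m: (-m[1], m[0]))
    return matches[:k]
```

Because the score depends only on document frequency in this sketch, a caller who prefers popular completions can re-sort the returned tuples by `df` instead.
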
**Return type: `Suggestion`**

| Field | Type | Description |
|-------|------|-------------|
| `term` | `String` | Suggested term |
| `score` | `float` | IDF-weighted relevance score |
| `df` | `int` | Number of documents containing this term |

**Example: Autocomplete dropdown with frequency badges**

```gcl
var suggestions = index.suggest("mach", 5);
for (var i = 0; i < suggestions.size(); i++) {
    var s = suggestions[i];
    info("${s.term} (${s.df} docs, score=${s.score})");
}
// Output (highest score first):
// machinist (5 docs, score=6.12)
// machining (12 docs, score=5.44)
// machinery (28 docs, score=4.87)
// machines (95 docs, score=3.58)
// machine (142 docs, score=3.21)
```

**Example: Filtering suggestions by minimum document frequency**

```gcl
var suggestions = index.suggest("pro", 20);
// Filter out rare terms that might confuse users
for (var i = 0; i < suggestions.size(); i++) {
    var s = suggestions[i];
    if (s.df >= 5) {
        info(s.term);
    }
}
```

**Example: Multi-term autocomplete**

```gcl
// User has typed "neural net": complete the last word
var query = "neural net";
var lastSpace = query.lastIndexOf(' ');
var prefix = query.slice(lastSpace + 1, query.size()); // "net"
var completions = index.suggest(prefix, 5);
for (var i = 0; i < completions.size(); i++) {
    info("${completions[i].term}");
}
// networks, network, ...
// UI can show: "neural networks", "neural network", ...
```

**Notes**

- IDF-weighted scoring ranks rare but informative terms above common ones
- Works with edge n-gram terms if `edgeNgram.enabled = true`
- Returns terms from the vocabulary after normalization (stemmed/lowercased as configured)

---

## did_you_mean()

Spell correction that suggests a corrected query when the user's input contains typos. Uses trigram similarity against the index vocabulary to find the closest known terms for each query word.

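
The idea behind trigram-based correction can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the library's code: the padding scheme, the Jaccard similarity measure, the `threshold` value of 0.3, and all function names are hypothetical, and the sketch scans the whole vocabulary rather than using a trigram index for candidate lookup.

```python
def trigrams(word):
    """Set of 3-character substrings, padded so word boundaries count."""
    padded = f"  {word.lower()} "
    return {padded[i:i + 3] for i in range(len(padded) - 2)}


def similarity(a, b):
    """Jaccard similarity of the two trigram sets (assumed measure)."""
    ta, tb = trigrams(a), trigrams(b)
    return len(ta & tb) / len(ta | tb)


def correct_term(term, vocabulary, threshold=0.3):
    """Closest vocabulary term, or None if the term is already known
    or nothing is similar enough."""
    if term in vocabulary:
        return None
    best = max(sorted(vocabulary), key=lambda v: similarity(term, v), default=None)
    if best is None or similarity(term, best) < threshold:
        return None
    return best


def did_you_mean(query, vocabulary):
    """Correct each query term independently; None means nothing changed."""
    terms = query.lower().split()
    corrected = [correct_term(t, vocabulary) or t for t in terms]
    return " ".join(corrected) if corrected != terms else None
```

Each term is corrected on its own, so a query can mix valid and misspelled words and only the misspelled ones change.
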
```gcl
fn did_you_mean(query: String): DidYouMeanResult
```

**Parameters**

| Parameter | Type | Description |
|-----------|------|-------------|
| `query` | `String` | Possibly misspelled query |

**Returns:** `DidYouMeanResult` with original query, corrected query, and per-term corrections.

**Return type: `DidYouMeanResult`**

| Field | Type | Description |
|-------|------|-------------|
| `originalQuery` | `String` | The original (possibly misspelled) query |
| `correctedQuery` | `String?` | The corrected query, or `null` if no correction needed |
| `corrections` | `Array` | Per-term corrections (one per query term) |

**Example: Search with spell correction fallback**

```gcl
var query = "machin lerning";
var results = index.search_bm25(query, 10);
if (results.size() == 0) {
    var correction = index.did_you_mean(query);
    if (correction.correctedQuery != null) {
        info("Did you mean: ${correction.correctedQuery}");
        // "Did you mean: machine learning"
        results = index.search_bm25(correction.correctedQuery, 10);
    }
}
```

**Example: Always show suggestion even with results**

```gcl
var query = "nural netwerk";
var correction = index.did_you_mean(query);
if (correction.correctedQuery != null) {
    info("Showing results for: ${correction.correctedQuery}");
    info("Search instead for: ${correction.originalQuery}");
}
// Showing results for: neural network
// Search instead for: nural netwerk
```

**Notes**

- Uses trigram index for fast candidate lookup
- Each query term is corrected independently against the vocabulary
- Terms already in the vocabulary are not modified
- `correctedQuery` is `null` when all terms are already valid

---

## more_like_this()

Find documents similar to a given document. Extracts the most distinctive terms (by TF-IDF) from the source document and uses them as a query to find related content. Useful for "related articles", "similar products", and content recommendation.

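
The term-extraction step can be pictured with the following Python sketch. It is a simplified illustration, not the library's implementation: the function name, the raw-count TF, and the unsmoothed `ln(N / df)` IDF are assumptions, and the resulting terms would then be run as an ordinary search with the source document filtered out.

```python
import math
from collections import Counter


def top_tfidf_terms(source_tokens, doc_freq, total_docs, max_query_terms=10):
    """Rank the source document's terms by TF-IDF and keep the top few."""
    tf = Counter(source_tokens)
    weights = {
        term: count * math.log(total_docs / doc_freq[term])
        for term, count in tf.items()
        if doc_freq.get(term, 0) > 0  # skip terms unknown to the index
    }
    # Highest weight first; ties broken alphabetically for determinism.
    ranked = sorted(weights, key=lambda t: (-weights[t], t))
    return ranked[:max_query_terms]
```

Terms that appear in nearly every document get a weight close to zero, so they drop out of the generated query first when `max_query_terms` is small.
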
```gcl
fn more_like_this(key: String, k: int, options: MoreLikeThisOptions?): Array<TextResult>
```

**Parameters**

| Parameter | Type | Description |
|-----------|------|-------------|
| `key` | `String` | Original document text (the first argument passed to `add()`) to find similar documents for |
| `k` | `int` | Number of results |
| `options` | `MoreLikeThisOptions?` | Optional MLT options (default: `maxQueryTerms` 10) |

**Returns:** Array of `TextResult` ranked by similarity to the source document. The source document itself is excluded from results.

**`MoreLikeThisOptions`**

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| `maxQueryTerms` | `int?` | `10` | Maximum number of top TF-IDF terms extracted from the source document |

**Example: Related articles sidebar**

```gcl
var seedText = "Introduction to neural networks and deep learning";
index.add(seedText, "article-1");
index.add("Convolutional neural networks for image recognition", "article-2");
index.add("Recurrent networks for natural language processing", "article-3");
index.add("Database indexing strategies for performance", "article-4");
index.build();

// Pass the original text of the seed document, not its metadata id.
var similar = index.more_like_this(seedText, 5, null);
for (var i = 0; i < similar.size(); i++) {
    info("${similar[i].value} (score=${similar[i].score})");
}
// article-2 (score=4.12) -- shares "neural networks"
// article-3 (score=3.87) -- shares "networks"
// article-4 is ranked lower (no shared key terms)
```

**Example: Product recommendations with fewer query terms**

```gcl
// Pass the original product description text (the first arg to add()), not its SKU/metadata id.
var prod100Text = "Wireless noise-cancelling over-ear headphones with 30h battery life";
index.add(prod100Text, "prod-100");
index.build();

// Use fewer query terms for broader recommendations
var _broad = index.more_like_this(prod100Text, 10, MoreLikeThisOptions { maxQueryTerms: 3 });
// Uses only top 3 terms: casts a wider net

// Use more query terms for precise recommendations
var _precise = index.more_like_this(prod100Text, 10, MoreLikeThisOptions { maxQueryTerms: 20 });
// Uses top 20 terms: more specific matches
```

**Notes**

- Extracts terms with highest TF-IDF weight from the source document
- Increase `maxQueryTerms` for more precise similarity (at the cost of recall)
- Decrease `maxQueryTerms` for broader "you might also like" recommendations
- The source document is automatically excluded from results

---

## explain()

Get a detailed breakdown of how a document's BM25 score was computed for a given query. Shows per-term TF, IDF, normalized TF, and individual score contributions. Essential for debugging ranking issues.

```gcl
fn explain(query: String, key: String): ScoreExplanation?
```

**Parameters**

| Parameter | Type | Description |
|-----------|------|-------------|
| `query` | `String` | Search query |
| `key` | `String` | Original document text (the first argument passed to `add()`) to explain |

**Returns:** `ScoreExplanation` with full breakdown, or `null` if the document is not found.

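
For readers who want to see how TF, IDF, and length normalization might combine into a per-term score, here is a hedged Python sketch of the classic textbook BM25 term formula. The exact formula depends on the configured variant, so treat this as an assumption rather than the library's computation; the function name and the dictionary keys are chosen only to mirror the field names in this section.

```python
import math


def explain_term(tf, df, total_docs, doc_len, avg_doc_len, k1=1.5, b=0.75):
    """Per-term BM25 breakdown using the classic formula (assumed variant)."""
    # Smoothed IDF: rarer terms score higher and the value is never negative.
    idf = math.log(1 + (total_docs - df + 0.5) / (df + 0.5))
    # Length normalization: above 1 for documents longer than average.
    length_norm = 1 - b + b * doc_len / avg_doc_len
    # Saturating term frequency: k1 controls how quickly repeats stop helping.
    tf_norm = tf * (k1 + 1) / (tf + k1 * length_norm) if tf > 0 else 0.0
    return {"tf": tf, "idf": idf, "tfNorm": tf_norm, "score": idf * tf_norm}
```

Calling `explain_term` with the same `tf` and `df` but a larger `doc_len` gives a lower score, and setting `b=0` removes the length effect entirely, which is the behavior that tuning `b` is meant to control.
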
**Return type: `ScoreExplanation`**

| Field | Type | Description |
|-------|------|-------------|
| `totalScore` | `float` | Final BM25 score |
| `terms` | `Array<TermExplanation>` | Per-term score breakdowns |
| `variant` | `BM25Variant` | BM25 variant used (lucene, plus, bm25l, atire, robertson) |
| `k1` | `float` | Term frequency saturation parameter |
| `b` | `float` | Length normalization parameter |
| `docLen` | `int` | Document length in tokens |
| `avgDocLen` | `float` | Average document length across the index |

**`TermExplanation` fields:**

| Field | Type | Description |
|-------|------|-------------|
| `term` | `String` | Query term |
| `tf` | `float` | Raw term frequency in document |
| `idf` | `float` | Inverse document frequency |
| `tfNorm` | `float` | Length-normalized TF component |
| `score` | `float` | This term's contribution to total score |

**Example: Debug why a document ranks lower than expected**

```gcl
// Pass the same original text that was indexed via add(), not the metadata id.
var article42Text = "Supervised machine learning algorithms are trained on labeled data to predict outcomes...";
index.add(article42Text, "article-42");
index.build();

var explanation = index.explain("machine learning algorithms", article42Text);
if (explanation != null) {
    info("Total BM25 score: ${explanation.totalScore}");
    info("Variant: ${explanation.variant}, k1=${explanation.k1}, b=${explanation.b}");
    info("Doc length: ${explanation.docLen}, Avg: ${explanation.avgDocLen}");
    for (var i = 0; i < explanation.terms.size(); i++) {
        var t = explanation.terms[i];
        info("  '${t.term}': TF=${t.tf}, IDF=${t.idf}, TF_norm=${t.tfNorm}, score=${t.score}");
    }
}
// Total BM25 score: 5.23
// Variant: lucene, k1=1.5, b=0.75
// Doc length: 342, Avg: 156.7
//   'machine': TF=3, IDF=2.14, TF_norm=1.82, score=3.89
//   'learn': TF=1, IDF=1.87, TF_norm=0.72, score=1.34 (stemmed from "learning")
//   'algorithm': TF=0, IDF=3.21, TF_norm=0, score=0 (term not in document!)
```

**Example: Compare scoring across documents**

```gcl
// Each entry in `docs` is the original text that was passed to add() as `key`.
var docs = Array<String> {};
docs.add("Brief overview of transformer architecture and self-attention.");
docs.add("A long survey covering many neural architectures including transformer architecture details, history, variants, training tricks, and benchmarks across dozens of tasks...");
docs.add("Comparing transformer architecture against convolutional networks for sequence tasks.");

for (var i = 0; i < docs.size(); i++) {
    var exp = index.explain("transformer architecture", docs[i]);
    if (exp != null) {
        info("doc ${i}: score=${exp.totalScore}, len=${exp.docLen}");
    }
}
// doc 0: score=6.12, len=89 (short, focused document)
// doc 1: score=4.87, len=1542 (long document, length penalty)
// doc 2: score=5.44, len=234 (medium length)
```

**Notes**

- Query terms are stemmed/normalized before lookup (matching the indexing pipeline)
- A term with `tf=0` does not appear in that document
- Long documents get a length penalty controlled by the `b` parameter: `docLen > avgDocLen` reduces scores
- Use this to understand how tuning `bm25.k1` and `bm25.b` changes ranking

---

## stats()

Get aggregate index statistics. Useful for monitoring index size, checking vocabulary growth, and informing BM25 parameter tuning decisions.

```gcl
fn stats(): TextIndexStats
```

**Returns:** `TextIndexStats` with index metrics.

**Return type: `TextIndexStats`**

| Field | Type | Description |
|-------|------|-------------|
| `totalEntries` | `int` | Number of indexed documents |
| `totalTerms` | `int` | Vocabulary size (unique terms after normalization) |
| `avgTokenCount` | `float` | Average document length in tokens |

**Example: Index health monitoring**

```gcl
var s = index.stats();
info("Documents: ${s.totalEntries}");
info("Vocabulary size: ${s.totalTerms}");
info("Avg document length: ${s.avgTokenCount} tokens");
// Documents: 15420
// Vocabulary size: 28763
// Avg document length: 87.3 tokens
```

**Example: Refresh IDF after many incremental writes**

```gcl
// After many add()/remove() calls following the previous build(),
// IDF values drift. Call build() again to recompute them.
index.build();
var s = index.stats();
info("Avg length: ${s.avgTokenCount}");
```

**Notes**

- `totalTerms` reflects the vocabulary after normalization (stemming, case folding, stop word removal)
- `avgTokenCount` is the value used by BM25 as `avgdl` in the length normalization formula
- Call after `build()` for accurate values

---

## Common Patterns

### Search Results Page

Combine search, snippet extraction, and spell correction for a complete search experience:

```gcl
fn search_with_ui(index: TextIndex, query: String, k: int) {
    // 1. Check for typos
    var correction = index.did_you_mean(query);
    var effectiveQuery = query;
    if (correction.correctedQuery != null) {
        info("Did you mean: ${correction.correctedQuery}");
        effectiveQuery = correction.correctedQuery;
    }

    // 2. Search
    var results = index.search_bm25(effectiveQuery, k);

    // 3. Batch extract snippets (with highlighted variant)
    // snippet/snippets are keyed by the original indexed text, which is r.key.
    var keys = Array<String> {};
    for (var i = 0; i < results.size(); i++) {
        keys.add(results[i].key);
    }
    var snips = index.snippets(keys, effectiveQuery, SnippetOptions { maxLength: 200 });

    // 4. Render
    for (var i = 0; i < results.size(); i++) {
        var r = results[i];
        var snip = snips.get(r.key);
        // r.value carries the metadata stored alongside the document text.
        info("${r.value} (score=${r.score})");
        if (snip != null) {
            info("  ${snip.highlighted}");
        }
    }
}
```

### Autocomplete with Suggestions

Build a search-as-you-type dropdown:

```gcl
fn on_keystroke(index: TextIndex, input: String) {
    if (input.size() < 2) {
        return;
    }

    // Use suggest() for rich metadata
    var suggestions = index.suggest(input, 8);
    for (var i = 0; i < suggestions.size(); i++) {
        var s = suggestions[i];
        info("${s.term} (${s.df} results)");
    }
}
```

### Relevance Debugging

When search results seem wrong, use `explain()` and `stats()` together:

```gcl
// docText is the original text passed to add() (i.e. the `key` argument).
fn debug_ranking(index: TextIndex, query: String, docText: String) {
    var s = index.stats();
    info("Index: ${s.totalEntries} docs, ${s.totalTerms} terms, avgLen=${s.avgTokenCount}");

    var exp = index.explain(query, docText);
    if (exp != null) {
        info("Score for document: ${exp.totalScore}");
        info("Doc length: ${exp.docLen} (avg: ${exp.avgDocLen})");

        var lengthRatio = exp.docLen / exp.avgDocLen;
        if (lengthRatio > 2.0) {
            info("  WARNING: Document is ${lengthRatio}x longer than average; length penalty may be suppressing score");
            info("  Consider lowering bm25.b (currently ${exp.b})");
        }

        for (var i = 0; i < exp.terms.size(); i++) {
            var t = exp.terms[i];
            if (t.tf == 0.0) {
                info("  MISSING: '${t.term}' not found in document");
            } else {
                info("  '${t.term}': TF=${t.tf}, IDF=${t.idf}, score=${t.score}");
            }
        }
    }
}
```