
# Utility Methods

Helper methods for snippet extraction, term suggestion, spell correction, score explanation, content discovery, and index statistics.

---

## snippet()

Extract a query-aware text snippet from a document. Returns both plain text and highlighted variants in one call. The algorithm uses a sliding window to find the passage with the highest density of query terms, so the returned snippet centers on the most relevant part of the document.

```gcl
fn snippet(key: String, query: String, options: SnippetOptions?): Snippet?
```

**Parameters**

| Parameter | Type | Description |
|-----------|------|-------------|
| `key` | `String` | Original document text (the first argument passed to `add()`) |
| `query` | `String` | Search query to locate relevant passage |
| `options` | `SnippetOptions?` | Optional snippet options (default: `maxLength` 200) |

**Returns:** `Snippet { text, highlighted }`, or `null` if the document is not found or no query terms match.

> Note: throughout this document, `key` refers to the original indexed text content (i.e. the first argument passed to `add()`), not the metadata stored as `value`. On a `TextResult`, that text is exposed as `r.key`.

**`SnippetOptions`**

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| `maxLength` | `int?` | `200` | Maximum snippet length in characters |

**`Snippet`**

| Field | Type | Description |
|-------|------|-------------|
| `text` | `String` | Plain snippet text (no markup) |
| `highlighted` | `String` | Same snippet with matched query terms wrapped in `config.highlight.preTag` / `postTag` |

**Example: Search results page**

```gcl
var results = index.search_bm25("machine learning algorithms", 10);
for (var i = 0; i < results.size(); i++) {
    var r = results[i];
    // r.key is the original indexed text; r.value is the metadata stored alongside it.
    var snip = index.snippet(r.key, "machine learning algorithms", null);
    if (snip != null) {
        info("${r.value}: ${snip.highlighted}");
    }
}
// Output: "doc1: ...supervised <em>machine</em> <em>learning</em> <em>algorithms</em> are trained on labeled data..."
```

**Example: Custom snippet length for mobile UI**

```gcl
// The key is the same original text that was passed to add().
var articleText = "Deep neural networks have revolutionized many areas of machine learning...";
index.add(articleText, "article-42");
index.build();

// Short snippets for mobile cards
var _shortSnip = index.snippet(articleText, "neural networks", SnippetOptions { maxLength: 80 });
// _shortSnip!!.text         -> "...deep neural networks have revolutionized..."
// _shortSnip!!.highlighted  -> "...deep <em>neural</em> <em>networks</em> have revolutionized..."

// Longer snippets for desktop detail view
var _longSnip = index.snippet(articleText, "neural networks", SnippetOptions { maxLength: 400 });
```

**Example: HTML highlighting via config**

```gcl
var index = TextIndex<String> {
    config: TextIndexConfig {
        stopWords: StopWordOptions { mode: StopWordMode::default },
        highlight: HighlightOptions { preTag: "<mark>", postTag: "</mark>" }
    }
};
var articleText = "An introduction about machine learning algorithms that transform raw data...";
index.add(articleText, "article-1");
index.build();

var _snip = index.snippet(articleText, "machine learning", null);
// _snip!!.highlighted: "...about <mark>machine</mark> <mark>learning</mark> algorithms that transform..."
```

**Example: Terminal output with ANSI colors**

```gcl
var index = TextIndex<String> {
    config: TextIndexConfig {
        stopWords: StopWordOptions { mode: StopWordMode::default },
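        // ANSI sequences must start with the ESC character (0x1B). The "\u001b" escape is an
        // assumption about GCL string literals; embed the raw ESC byte if it is not supported.
        highlight: HighlightOptions { preTag: "\u001b[1;33m", postTag: "\u001b[0m" }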
    }
};
// Highlighted terms appear bold yellow in terminal
var logLine = "2026-01-12 ERROR: request timeout while contacting upstream service";
index.add(logLine, "doc1");
index.build();
var _snip = index.snippet(logLine, "error timeout", SnippetOptions { maxLength: 150 });
```

**Notes**

- The sliding window scores each candidate passage by counting distinct query terms present
- If the query has stop words, they are filtered before matching (respects `stopWords.mode`)
- Stemming applies if enabled: a query for "running" will match "runs" in the document
- Both `text` and `highlighted` are computed in a single pass
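
The sliding-window idea described in these notes can be sketched in Python. This is an illustrative approximation, not the library's implementation: `tokenize` and `best_snippet` are hypothetical names, stop-word filtering and stemming are left out, and ties go to the earliest window.

```python
import re

def tokenize(text):
    """Lowercased word tokens with their character offsets."""
    return [(m.group().lower(), m.start(), m.end()) for m in re.finditer(r"\w+", text)]

def best_snippet(text, query, max_length=200, pre="<em>", post="</em>"):
    """Return (plain, highlighted) for the window with the most distinct query terms."""
    terms = {tok for tok, _, _ in tokenize(query)}
    tokens = tokenize(text)
    best = None  # (distinct term count, window start, window end)
    for i, (_, start, _) in enumerate(tokens):
        seen, end = set(), start
        # Extend the window token by token while it still fits in max_length.
        for tok, _, tok_end in tokens[i:]:
            if tok_end - start > max_length:
                break
            end = tok_end
            if tok in terms:
                seen.add(tok)
        if seen and (best is None or len(seen) > best[0]):
            best = (len(seen), start, end)
    if best is None:
        return None  # no query term matched anywhere
    _, start, end = best
    plain = text[start:end]
    # Wrap every token that matches a query term, keeping the original casing.
    highlighted = re.sub(
        r"\w+",
        lambda m: pre + m.group() + post if m.group().lower() in terms else m.group(),
        plain,
    )
    return plain, highlighted
```

Producing both variants from the same window mirrors the way `snippet()` returns `text` and `highlighted` together.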

---

## snippets()

Extract `Snippet { text, highlighted }` for multiple documents in one call. More efficient than calling `snippet()` in a loop because it avoids repeated query parsing.

```gcl
fn snippets(keys: Array<String>, query: String, options: SnippetOptions?): Map<String, Snippet>
```

**Parameters**

| Parameter | Type | Description |
|-----------|------|-------------|
| `keys` | `Array<String>` | Array of original document texts (the first arguments passed to `add()`) |
| `query` | `String` | Search query |
| `options` | `SnippetOptions?` | Optional snippet options |

**Returns:** Map of key (original document text) to `Snippet { text, highlighted }`. The map contains only keys where at least one query term matched; keys that are not found or have no matching terms are omitted.

**Example: Render a page of search results**

```gcl
var results = index.search_bm25("database optimization", 20);

// Collect keys (original indexed text) from results
var keys = Array<String> {};
for (var i = 0; i < results.size(); i++) {
    keys.add(results[i].key);
}

// Single batch call instead of 20 individual calls
var snips = index.snippets(keys, "database optimization", SnippetOptions { maxLength: 150 });

// Render results with snippets
for (var i = 0; i < results.size(); i++) {
    var r = results[i];
    var snip = snips.get(r.key);
    if (snip != null) {
        // r.value is the metadata stored alongside the text
        info("${r.value}: ${snip.highlighted}");
    }
}
```

**Notes**

- Uses `highlight.preTag` and `highlight.postTag` from config for the `highlighted` field
- Preferred over calling `snippet()` in a loop for result sets larger than 3-5 documents

---

## suggest()

Get weighted term suggestions for a prefix. Returns `Suggestion` objects with IDF-weighted scores and document frequency counts, giving you control over how suggestions are ranked and displayed.

```gcl
fn suggest(prefix: String, k: int): Array<Suggestion>
```

**Parameters**

| Parameter | Type | Description |
|-----------|------|-------------|
| `prefix` | `String` | Term prefix to complete |
| `k` | `int` | Maximum number of suggestions |

**Returns:** Array of `Suggestion` objects sorted by IDF-weighted score, highest first.

**Return type: `Suggestion`**

| Field | Type | Description |
|-------|------|-------------|
| `term` | `String` | Suggested term |
| `score` | `float` | IDF-weighted relevance score |
| `df` | `int` | Number of documents containing this term |

**Example: Autocomplete dropdown with frequency badges**

```gcl
var suggestions = index.suggest("mach", 5);
for (var i = 0; i < suggestions.size(); i++) {
    var s = suggestions[i];
    info("${s.term} (${s.df} docs, score=${s.score})");
}
// Output (highest score first):
//   machinist (5 docs, score=6.12)
//   machining (12 docs, score=5.44)
//   machinery (28 docs, score=4.87)
//   machines (95 docs, score=3.58)
//   machine (142 docs, score=3.21)
```

**Example: Filtering suggestions by minimum document frequency**

```gcl
var suggestions = index.suggest("pro", 20);
// Filter out rare terms that might confuse users
for (var i = 0; i < suggestions.size(); i++) {
    var s = suggestions[i];
    if (s.df >= 5) {
        info(s.term);
    }
}
```

**Example: Multi-term autocomplete**

```gcl
// User has typed "neural net" — complete the last word
var query = "neural net";
var lastSpace = query.lastIndexOf(' ');
var prefix = query.slice(lastSpace + 1, query.size());  // "net"

var completions = index.suggest(prefix, 5);
for (var i = 0; i < completions.size(); i++) {
    info("${completions[i].term}");
}
// networks, network, ...
// UI can show: "neural networks", "neural network", ...
```

**Notes**

- IDF-weighted scoring surfaces rare but informative terms higher than common ones
- Works with edge n-gram terms if `edgeNgram.enabled = true`
- Returns terms from the vocabulary after normalization (stemmed/lowercased as configured)
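
To make the ranking concrete, here is a minimal Python sketch of prefix completion with IDF weighting. It is an assumption-laden illustration, not the library's code: `suggest_terms` is a hypothetical helper, the vocabulary is a plain dict of term to document frequency, and the IDF formula is the common BM25-style `ln(1 + (N - df + 0.5) / (df + 0.5))`, which may differ from what the index uses.

```python
import math

def suggest_terms(doc_freq, total_docs, prefix, k):
    """Return up to k (term, score, df) tuples for terms starting with prefix, rarest first."""
    matches = []
    for term, df in doc_freq.items():
        if term.startswith(prefix):
            # Rare terms (low df) receive a higher IDF score.
            idf = math.log(1 + (total_docs - df + 0.5) / (df + 0.5))
            matches.append((term, idf, df))
    # Sort by descending score; break ties alphabetically for stable output.
    matches.sort(key=lambda s: (-s[1], s[0]))
    return matches[:k]
```

Because the score falls as `df` grows, callers that prefer popular completions can re-sort the returned list by `df` instead.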

---

## did_you_mean()

Spell correction that suggests a corrected query when the user's input contains typos. Uses trigram similarity against the index vocabulary to find the closest known terms for each query word.

```gcl
fn did_you_mean(query: String): DidYouMeanResult
```

**Parameters**

| Parameter | Type | Description |
|-----------|------|-------------|
| `query` | `String` | Possibly misspelled query |

**Returns:** `DidYouMeanResult` with original query, corrected query, and per-term corrections.

**Return type: `DidYouMeanResult`**

| Field | Type | Description |
|-------|------|-------------|
| `originalQuery` | `String` | The original (possibly misspelled) query |
| `correctedQuery` | `String?` | The corrected query, or `null` if no correction is needed |
| `corrections` | `Array<String>` | Per-term corrections (one per query term) |

**Example: Search with spell correction fallback**

```gcl
var query = "machin lerning";
var results = index.search_bm25(query, 10);

if (results.size() == 0) {
    var correction = index.did_you_mean(query);
    if (correction.correctedQuery != null) {
        info("Did you mean: ${correction.correctedQuery}");
        // "Did you mean: machine learning"
        results = index.search_bm25(correction.correctedQuery, 10);
    }
}
```

**Example: Always show suggestion even with results**

```gcl
var query = "nural netwerk";
var correction = index.did_you_mean(query);
if (correction.correctedQuery != null) {
    info("Showing results for: ${correction.correctedQuery}");
    info("Search instead for: ${correction.originalQuery}");
}
// Showing results for: neural network
// Search instead for: nural netwerk
```

**Notes**

- Uses trigram index for fast candidate lookup
- Each query term is corrected independently against the vocabulary
- Terms already in the vocabulary are not modified
- `correctedQuery` is `null` when all terms are already valid
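
The per-term correction described above can be approximated with trigram Jaccard similarity. The Python sketch below is hypothetical throughout: the helper names, the padding scheme, and the `0.3` threshold are assumptions for illustration, and a real index would look up candidates through its trigram index instead of scanning the whole vocabulary.

```python
def trigrams(word):
    """Character trigrams of a word, padded so prefixes and suffixes carry weight."""
    padded = f"  {word} "
    return {padded[i:i + 3] for i in range(len(padded) - 2)}

def similarity(a, b):
    """Jaccard similarity between the trigram sets of two words."""
    ta, tb = trigrams(a), trigrams(b)
    return len(ta & tb) / len(ta | tb)

def correct_query(query, vocabulary, threshold=0.3):
    """Return (corrected_query or None, per-term corrections)."""
    corrections, changed = [], False
    for term in query.lower().split():
        if term in vocabulary:
            corrections.append(term)  # known terms are left untouched
            continue
        best = max(vocabulary, key=lambda v: similarity(term, v), default=None)
        if best is not None and similarity(term, best) >= threshold:
            corrections.append(best)
            changed = True
        else:
            corrections.append(term)  # nothing close enough; keep the input
    return (" ".join(corrections) if changed else None), corrections
```

Returning `None` when nothing changed matches the `correctedQuery` contract in the table above.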

---

## more_like_this()

Find documents similar to a given document. Extracts the most distinctive terms (by TF-IDF) from the source document and uses them as a query to find related content. Useful for "related articles", "similar products", and content recommendation.

```gcl
fn more_like_this(key: String, k: int, options: MoreLikeThisOptions?): Array<TextResult>
```

**Parameters**

| Parameter | Type | Description |
|-----------|------|-------------|
| `key` | `String` | Original document text (the first argument passed to `add()`) to find similar documents for |
| `k` | `int` | Number of results |
| `options` | `MoreLikeThisOptions?` | Optional MLT options (default: `maxQueryTerms` 10) |

**Returns:** Array of `TextResult` ranked by similarity to the source document. The source document itself is excluded from results.

**`MoreLikeThisOptions`**

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| `maxQueryTerms` | `int?` | `10` | Maximum number of top TF-IDF terms extracted from the source document |

**Example: Related articles sidebar**

```gcl
var seedText = "Introduction to neural networks and deep learning";
index.add(seedText, "article-1");
index.add("Convolutional neural networks for image recognition", "article-2");
index.add("Recurrent networks for natural language processing", "article-3");
index.add("Database indexing strategies for performance", "article-4");
index.build();

// Pass the original text of the seed document, not its metadata id.
var similar = index.more_like_this(seedText, 5, null);
for (var i = 0; i < similar.size(); i++) {
    info("${similar[i].value} (score=${similar[i].score})");
}
// article-2 (score=4.12)  -- shares "neural networks"
// article-3 (score=3.87)  -- shares "networks"
// article-4 shares no key terms with the seed, so it ranks last or is omitted
```

**Example: Product recommendations with fewer query terms**

```gcl
// Pass the original product description text (the first arg to add()), not its SKU/metadata id.
var prod100Text = "Wireless noise-cancelling over-ear headphones with 30h battery life";
index.add(prod100Text, "prod-100");
index.build();

// Use fewer query terms for broader recommendations
var _broad = index.more_like_this(prod100Text, 10, MoreLikeThisOptions { maxQueryTerms: 3 });
// Uses only top 3 terms: casts a wider net

// Use more query terms for precise recommendations
var _precise = index.more_like_this(prod100Text, 10, MoreLikeThisOptions { maxQueryTerms: 20 });
// Uses top 20 terms: more specific matches
```

**Notes**

- Extracts terms with highest TF-IDF weight from the source document
- Increase `maxQueryTerms` for more precise similarity (at the cost of recall)
- Decrease `maxQueryTerms` for broader "you might also like" recommendations
- The source document is automatically excluded from results
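
The term-extraction step can be sketched as follows. This Python snippet is illustrative only: `top_terms` is a hypothetical helper, the input is an already tokenized and normalized document, and the IDF formula is an assumed BM25-style variant. The selected terms would then be run as an ordinary query, with the source document filtered out of the results.

```python
import math
from collections import Counter

def top_terms(source_tokens, doc_freq, total_docs, max_query_terms=10):
    """Pick the source document's most distinctive terms by TF * IDF."""
    tf = Counter(source_tokens)
    scored = []
    for term, count in tf.items():
        df = doc_freq.get(term, 0)
        if df == 0:
            continue  # term is not in the index vocabulary
        idf = math.log(1 + (total_docs - df + 0.5) / (df + 0.5))
        scored.append((count * idf, term))
    # Highest weight first; ties broken alphabetically.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [term for _, term in scored[:max_query_terms]]
```

Lowering `max_query_terms` keeps only the strongest signals from the source document, which is the knob the notes above describe.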

---

## explain()

Get a detailed breakdown of how a document's BM25 score was computed for a given query. Shows per-term TF, IDF, normalized TF, and individual score contributions. Essential for debugging ranking issues.

```gcl
fn explain(query: String, key: String): ScoreExplanation?
```

**Parameters**

| Parameter | Type | Description |
|-----------|------|-------------|
| `query` | `String` | Search query |
| `key` | `String` | Original document text (the first argument passed to `add()`) to explain |

**Returns:** `ScoreExplanation` with full breakdown, or `null` if the document is not found.

**Return type: `ScoreExplanation`**

| Field | Type | Description |
|-------|------|-------------|
| `totalScore` | `float` | Final BM25 score |
| `terms` | `Array<TermExplanation>` | Per-term score breakdowns |
| `variant` | `BM25Variant` | BM25 variant used (lucene, plus, bm25l, atire, robertson) |
| `k1` | `float` | Term frequency saturation parameter |
| `b` | `float` | Length normalization parameter |
| `docLen` | `int` | Document length in tokens |
| `avgDocLen` | `float` | Average document length across the index |

**`TermExplanation`**

| Field | Type | Description |
|-------|------|-------------|
| `term` | `String` | Query term |
| `tf` | `float` | Raw term frequency in document |
| `idf` | `float` | Inverse document frequency |
| `tfNorm` | `float` | Length-normalized TF component |
| `score` | `float` | This term's contribution to total score |

**Example: Debug why a document ranks lower than expected**

```gcl
// Pass the same original text that was indexed via add(), not the metadata id.
var article42Text = "Supervised machine learning models are trained on labeled data to predict outcomes...";
index.add(article42Text, "article-42");
index.build();

var explanation = index.explain("machine learning algorithms", article42Text);
if (explanation != null) {
    info("Total BM25 score: ${explanation.totalScore}");
    info("Variant: ${explanation.variant}, k1=${explanation.k1}, b=${explanation.b}");
    info("Doc length: ${explanation.docLen}, Avg: ${explanation.avgDocLen}");

    for (var i = 0; i < explanation.terms.size(); i++) {
        var t = explanation.terms[i];
        info("  '${t.term}': TF=${t.tf}, IDF=${t.idf}, TF_norm=${t.tfNorm}, score=${t.score}");
    }
}
// Total BM25 score: 5.23
// Variant: lucene, k1=1.5, b=0.75
// Doc length: 342, Avg: 156.7
//   'machine': TF=3, IDF=2.14, TF_norm=1.82, score=3.89
//   'learn': TF=1, IDF=1.87, TF_norm=0.72, score=1.34  (stemmed from "learning")
//   'algorithm': TF=0, IDF=3.21, TF_norm=0, score=0     (term not in document!)
```

**Example: Compare scoring across documents**

```gcl
// Each entry in `docs` is the original text that was passed to add() as `key`.
var docs = Array<String> {};
docs.add("Brief overview of transformer architecture and self-attention.");
docs.add("A long survey covering many neural architectures including transformer architecture details, history, variants, training tricks, and benchmarks across dozens of tasks...");
docs.add("Comparing transformer architecture against convolutional networks for sequence tasks.");

for (var i = 0; i < docs.size(); i++) {
    var exp = index.explain("transformer architecture", docs[i]);
    if (exp != null) {
        info("doc ${i}: score=${exp.totalScore}, len=${exp.docLen}");
    }
}
// doc 0: score=6.12, len=89     (short, focused document)
// doc 1: score=4.87, len=1542   (long document, length penalty)
// doc 2: score=5.44, len=234    (medium length)
```

**Notes**

- Query terms are stemmed/normalized before lookup (matching the indexing pipeline)
- A term with `tf=0` means the term doesn't appear in that document
- Documents longer than average (`docLen > avgDocLen`) receive a length penalty that lowers `tfNorm`; the `b` parameter controls its strength (`b = 0` disables length normalization)
- Use this to understand why `bm25.k1` and `bm25.b` tuning changes ranking
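
For readers who want to check the numbers by hand, the Python sketch below computes one term's contribution using the classic BM25 formulation. It is an assumption, not the library's exact code: the configured variant (lucene, plus, bm25l, atire, robertson) may compute `idf` or `tfNorm` differently, and `explain_term` is a hypothetical helper. The defaults `k1=1.5` and `b=0.75` are taken from the example output above.

```python
import math

def explain_term(tf, df, total_docs, doc_len, avg_doc_len, k1=1.5, b=0.75):
    """Break one term's BM25 contribution into tf, idf, tfNorm, and score."""
    idf = math.log(1 + (total_docs - df + 0.5) / (df + 0.5))
    # Length normalization: above 1 for long documents, below 1 for short ones.
    length_norm = 1 - b + b * doc_len / avg_doc_len
    tf_norm = tf * (k1 + 1) / (tf + k1 * length_norm) if tf > 0 else 0.0
    return {"tf": tf, "idf": idf, "tfNorm": tf_norm, "score": idf * tf_norm}

# Same term statistics, different document lengths.
short_doc = explain_term(tf=2, df=10, total_docs=1000, doc_len=50, avg_doc_len=100)
long_doc = explain_term(tf=2, df=10, total_docs=1000, doc_len=200, avg_doc_len=100)
```

With identical term statistics, `short_doc["score"]` is higher than `long_doc["score"]`, which is the length penalty that `b` controls.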

---

## stats()

Get aggregate index statistics. Useful for monitoring index size, checking vocabulary growth, and informing BM25 parameter tuning decisions.

```gcl
fn stats(): TextIndexStats
```

**Returns:** `TextIndexStats` with index metrics.

**Return type: `TextIndexStats`**

| Field | Type | Description |
|-------|------|-------------|
| `totalEntries` | `int` | Number of indexed documents |
| `totalTerms` | `int` | Vocabulary size (unique terms after normalization) |
| `avgTokenCount` | `float` | Average document length in tokens |

**Example: Index health monitoring**

```gcl
var s = index.stats();
info("Documents: ${s.totalEntries}");
info("Vocabulary size: ${s.totalTerms}");
info("Avg document length: ${s.avgTokenCount} tokens");
// Documents: 15420
// Vocabulary size: 28763
// Avg document length: 87.3 tokens
```

**Example: Refresh IDF after many incremental writes**

```gcl
// After many add()/remove() calls following the previous build(),
// IDF values drift. Call build() again to recompute them.
index.build();
var s = index.stats();
info("Avg length: ${s.avgTokenCount}");
```

**Notes**

- `totalTerms` reflects the vocabulary after normalization (stemming, case folding, stop word removal)
- `avgTokenCount` is the value used by BM25 as `avgdl` in the length normalization formula
- Call after `build()` for accurate values

---

## Common Patterns

### Search Results Page

Combine search, snippet extraction, and spell correction for a complete search experience:

```gcl
fn search_with_ui(index: TextIndex<String>, query: String, k: int) {
    // 1. Check for typos
    var correction = index.did_you_mean(query);
    var effectiveQuery = query;
    if (correction.correctedQuery != null) {
        info("Did you mean: ${correction.correctedQuery}");
        effectiveQuery = correction.correctedQuery;
    }

    // 2. Search
    var results = index.search_bm25(effectiveQuery, k);

    // 3. Batch extract snippets (with highlighted variant)
    //    snippet/snippets are keyed by the original indexed text, which is r.key.
    var keys = Array<String> {};
    for (var i = 0; i < results.size(); i++) {
        keys.add(results[i].key);
    }
    var snips = index.snippets(keys, effectiveQuery, SnippetOptions { maxLength: 200 });

    // 4. Render
    for (var i = 0; i < results.size(); i++) {
        var r = results[i];
        var snip = snips.get(r.key);
        // r.value carries the metadata stored alongside the document text.
        info("${r.value} (score=${r.score})");
        if (snip != null) {
            info("  ${snip.highlighted}");
        }
    }
}
```

### Autocomplete with Suggestions

Build a search-as-you-type dropdown:

```gcl
fn on_keystroke(index: TextIndex<String>, input: String) {
    if (input.size() < 2) {
        return;
    }
    // Use suggest() for rich metadata
    var suggestions = index.suggest(input, 8);
    for (var i = 0; i < suggestions.size(); i++) {
        var s = suggestions[i];
        info("${s.term} (${s.df} results)");
    }
}
```

### Relevance Debugging

When search results seem wrong, use `explain()` and `stats()` together:

```gcl
// docText is the original text passed to add() (i.e. the `key` argument).
fn debug_ranking(index: TextIndex<String>, query: String, docText: String) {
    var s = index.stats();
    info("Index: ${s.totalEntries} docs, ${s.totalTerms} terms, avgLen=${s.avgTokenCount}");

    var exp = index.explain(query, docText);
    if (exp != null) {
        info("Score for document: ${exp.totalScore}");
        info("Doc length: ${exp.docLen} (avg: ${exp.avgDocLen})");
        var lengthRatio = exp.docLen / exp.avgDocLen;
        if (lengthRatio > 2.0) {
            info("  WARNING: Document is ${lengthRatio}x longer than average — length penalty may be suppressing score");
            info("  Consider lowering bm25.b (currently ${exp.b})");
        }
        for (var i = 0; i < exp.terms.size(); i++) {
            var t = exp.terms[i];
            if (t.tf == 0.0) {
                info("  MISSING: '${t.term}' not found in document");
            } else {
                info("  '${t.term}': TF=${t.tf}, IDF=${t.idf}, score=${t.score}");
            }
        }
    }
}
```
