# Configuration Presets

Ready-to-use configurations for common use cases. Each preset is a static factory on `TextIndexConfig` that returns a fully-formed `TextIndexConfig` tuned for the scenario; values not explicitly set keep their library defaults.

```gcl
// One line — index ready to use
var _index = TextIndex<String> { config: TextIndexConfig::keyword() };
```

| Preset | Use case |
|--------|----------|
| `TextIndexConfig::keyword()` | Traditional keyword search (BM25 + exact) |
| `TextIndexConfig::semantic(embed)` | Vector similarity over a user-supplied embedding function |
| `TextIndexConfig::fuzzy()` | Typo-tolerant product/UI search |
| `TextIndexConfig::multilingual(lang)` | Language-specific stop words + accent stripping |
| `TextIndexConfig::ecommerce()` | Product catalogs: typo tolerance, stop words, `<mark>` highlighting (BM25F field weights supplied by you) |
| `TextIndexConfig::code_search()` | Source code/logs: case-sensitive, keeps punctuation/numerics |
| `TextIndexConfig::phonetic_name()` | People/contact directories with Double-Metaphone matching |
| `TextIndexConfig::social()` | Short-text/social: auto stop words, repeating-char normalization |
| `TextIndexConfig::academic()` | Long-form papers: stemming, BM25+, MMR diversity |
| `TextIndexConfig::logs()` | Server/application logs: no stop words, single-char terms |
| `TextIndexConfig::realtime_alert()` | Standing-query baseline; pair with `PercolateIndex` |

You can use a preset directly, or use it as a base and patch in your own customizations.

```gcl
// Use a preset as-is
var _cfg = TextIndexConfig::keyword();

// Or grab a preset and tweak it
var cfg = TextIndexConfig::keyword();
cfg.deduplicateContent = true;
cfg.bm25 = BM25Options { k1: 1.8, b: 0.6 };
```

---

## Keyword Search — `TextIndexConfig::keyword()`

**Use case:** Traditional search bar, internal document search, help desk knowledge base.

BM25 combined with exact matching, with short-circuit optimization. The preset prioritizes precision by using hybrid fusion with weights `bm25: 0.7, exact: 0.3`: the exact-match score contributes 30% of the fused result, so known-item lookups surface alongside BM25 relevance.

```gcl
var config = TextIndexConfig::keyword();

var index = TextIndex<String> { config: config };

// Index knowledge base articles
index.add("How to reset your password", "kb-001");
index.add("Two-factor authentication setup guide", "kb-002");
index.add("Account billing and subscription management", "kb-003");
index.build();

var _results = index.search("reset password", 10, null);
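// kb-001 is the only article containing both "reset" and "password", so it is expected to rank first.
// Note that the phrase "reset password" is not a verbatim substring of any article above.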
```
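
The fusion weights can be read as a weighted sum over per-signal scores. The sketch below is a minimal Python illustration, not the library's implementation: it assumes each signal's scores are already normalized to [0, 1], and the `fuse` function, document IDs, and score values are hypothetical.

```python
def fuse(scores_by_signal, weights):
    """Weighted-sum fusion over per-signal scores (assumed normalized to [0, 1])."""
    fused = {}
    for signal, doc_scores in scores_by_signal.items():
        w = weights.get(signal, 0.0)
        for doc_id, score in doc_scores.items():
            fused[doc_id] = fused.get(doc_id, 0.0) + w * score
    # Highest fused score first
    return sorted(fused.items(), key=lambda kv: kv[1], reverse=True)

# Weights quoted for the keyword preset
weights = {"bm25": 0.7, "exact": 0.3}

# Hypothetical scores: doc-b has the better BM25 score,
# but only doc-a contains the query as an exact match.
scores = {
    "bm25": {"doc-a": 0.8, "doc-b": 0.9},
    "exact": {"doc-a": 1.0},
}

ranking = fuse(scores, weights)
```

With these numbers the exact-match signal is enough to lift `doc-a` above `doc-b`, which is the known-item behavior the weights are meant to produce.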

**When to tune:**
- Increase `bm25.k1` (1.8-2.0) if documents repeat key terms and you want that signal
- Lower `bm25.b` (0.3-0.5) if document lengths vary widely and you want to reduce length penalty
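
To see what `k1` and `b` do, the following Python sketch implements the textbook BM25 per-term score. It is an illustration under stated assumptions: the function name, the defaults `k1=1.2` and `b=0.75` (common textbook values, not necessarily this library's), and the sample statistics are all hypothetical.

```python
import math

def bm25_term(tf, doc_len, avg_len, n_docs, doc_freq, k1=1.2, b=0.75):
    """Textbook BM25 score of one query term in one document."""
    idf = math.log(1 + (n_docs - doc_freq + 0.5) / (doc_freq + 0.5))
    # Length normalization: b=0 ignores length, b=1 normalizes fully.
    norm = 1 - b + b * doc_len / avg_len
    # k1 controls how quickly repeated occurrences saturate.
    return idf * tf * (k1 + 1) / (tf + k1 * norm)

# Hypothetical corpus statistics
stats = dict(avg_len=100, n_docs=1000, doc_freq=50)

# Higher k1: five occurrences gain more over a single occurrence.
gain_low_k1 = bm25_term(5, 100, k1=1.2, **stats) / bm25_term(1, 100, k1=1.2, **stats)
gain_high_k1 = bm25_term(5, 100, k1=2.0, **stats) / bm25_term(1, 100, k1=2.0, **stats)

# Lower b: a document twice the average length is penalized less.
long_doc_default_b = bm25_term(2, 200, b=0.75, **stats)
long_doc_low_b = bm25_term(2, 200, b=0.3, **stats)
```

Comparing `gain_high_k1` with `gain_low_k1`, and `long_doc_low_b` with `long_doc_default_b`, shows the direction of each knob without touching the index.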

---

## Semantic Search — `TextIndexConfig::semantic(embed)`

**Use case:** Conceptual search, research exploration, question answering.

Vector similarity for conceptual matching. Requires a user-provided embedding function. Sentence chunking preserves semantic boundaries in long documents.

```gcl
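// `model` is a placeholder for your embedding model handle; it is assumed
// to be defined elsewhere and is not created by this snippet.
fn my_embed(text: String): Tensor {
    return ai::embed(text, model);
}

var config = TextIndexConfig::semantic(my_embed);

var index = TextIndex<String> { config: config };

// Index a document before searching (same add/build flow as the other presets)
index.add("Machine learning use cases in hospital diagnostics", "doc-001");
index.build();

// "AI applications" is expected to match "machine learning use cases" conceptually
var results = index.search("AI applications in healthcare", 10, null);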
```

**When to tune:**
- Increase `chunking.size` (256-512) for technical documents where context matters
- Increase `chunking.overlap` (30-50) if retrieval misses passages near chunk boundaries
- Use `ChunkStrategy::recursive` for mixed-format documents
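
The interaction between chunk size and overlap can be sketched as a sliding window. This Python example is only an illustration: the source does not say whether `chunking.size` and `chunking.overlap` count tokens, characters, or sentences, so the sketch assumes tokens, and the `chunk` function is hypothetical.

```python
def chunk(tokens, size, overlap):
    """Split tokens into windows of `size` that share `overlap` tokens."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        # Stop once the current window reaches the end of the input.
        if start + size >= len(tokens):
            break
    return chunks

tokens = list(range(10))          # stand-in for 10 tokens
windows = chunk(tokens, size=4, overlap=1)
```

Each window repeats the last token of the previous one, so a passage that straddles a boundary still appears intact in at least one chunk when the overlap is large enough.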

---

## Fuzzy Search — `TextIndexConfig::fuzzy()`

**Use case:** Customer-facing search, name/address lookup, product search where users frequently mistype.

Combines BM25 relevance with fuzzy and exact matching. Fuzzy search uses an internal trigram pre-filter for speed; the pre-filter is always on and is not configured by this preset. Automatic typo tolerance is enabled.

```gcl
var config = TextIndexConfig::fuzzy();

var index = TextIndex<String> { config: config };

// Index product catalog
index.add("Apple MacBook Pro 16 inch", "prod-001");
index.add("Samsung Galaxy S24 Ultra", "prod-002");
index.add("Sony WH-1000XM5 Headphones", "prod-003");
index.build();

// Handles typos: "macbok" -> matches "MacBook"
var _results = index.search("macbok pro", 10, null);
```

**When to tune:**
- Lower `fusion.weights[fuzzy]` (e.g. to 0.2) if false positives from typo matches are a problem
- Adjust `typoTolerance.maxEdits1` / `maxEdits2` to tighten or loosen typo tolerance
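
The general idea of a trigram pre-filter followed by an edit-distance check can be sketched as follows. This is a generic Python illustration, not the library's internals: the function names, the padding scheme, and the `min_shared` threshold are assumptions.

```python
def trigrams(word):
    """Character trigrams of a lowercased, space-padded word."""
    padded = f"  {word.lower()} "
    return {padded[i:i + 3] for i in range(len(padded) - 2)}

def edit_distance(a, b):
    """Levenshtein distance using a rolling row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def fuzzy_candidates(query, vocabulary, max_edits=1, min_shared=2):
    """Cheap trigram filter first, exact edit distance only on survivors."""
    q = trigrams(query)
    hits = []
    for term in vocabulary:
        if len(q & trigrams(term)) < min_shared:
            continue  # too few shared trigrams: skip the costly check
        if edit_distance(query.lower(), term.lower()) <= max_edits:
            hits.append(term)
    return hits

matches = fuzzy_candidates("macbok", ["MacBook", "Samsung", "Macbeth"])
```

"Samsung" is rejected by the trigram filter alone, while "Macbeth" passes the filter but fails the edit-distance bound, which is why the filter saves work without changing the result.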

---

## Multi-Field — Custom config

**Use case:** Structured documents with title/body/tags, product catalogs, article databases.

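BM25F applies a separate weight to each field. In the example below, `title` is weighted 5.0 against 1.0 for `body`, so title matches contribute much more to the score than body-only matches (the final score ratio is not necessarily exactly 5x). There is no preset for this: define a document type and set field weights on its `String` fields directly: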

```gcl
type Article {
    title: String;
    abstract: String;
    body: String;
    tags: String;
}

var fields = Array<FieldConfig> {};
fields.add(FieldConfig { f: Article::title,    weight: 5.0, fieldB: 0.3 });
fields.add(FieldConfig { f: Article::abstract, weight: 3.0 });
fields.add(FieldConfig { f: Article::body,     weight: 1.0 });
fields.add(FieldConfig { f: Article::tags,     weight: 2.0, fieldB: 0.0 });

var index = TextIndex<Article> {
    config: TextIndexConfig {
        fields: fields,
        stopWords: StopWordOptions { mode: StopWordMode::default }
    }
};

index.add_fields(Article {
    title: "Introduction to Neural Networks",
    abstract: "A comprehensive guide to neural network architectures",
    body: "Neural networks are computational models inspired by the brain...",
    tags: "deep learning neural networks ai"
});
index.build();

// Title match "neural networks" scores higher than body-only match
var _results = index.search_bm25_f("neural networks", 10);
```

> **Auto-discovery shortcut.** Omit `fields:` and `add_fields` will index every
> `String`/`String?` field on the document type at weight 1.0. Add an explicit
> `fields:` list when per-field weights or `fieldB` overrides matter.

**When to tune:**
- Set `fieldB: 0.0` for short fields (tags, titles) to disable length penalty
- Set `fieldB: 0.9` for long fields (body) to normalize for length variation
- Increase title weight (8.0-10.0) for known-item search where exact title matches matter
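
One common formulation of BM25F combines the per-field term frequencies into a single weighted, length-normalized pseudo-frequency before saturation. The Python sketch below assumes that formulation; whether the library uses the same variant is not stated in the source, and the function name and sample numbers are hypothetical.

```python
def bm25f_term(field_tf, field_len, avg_field_len, weights, field_b, idf, k1=1.2):
    """BM25F score of one term: weight and normalize per field, then saturate once."""
    pseudo_tf = 0.0
    for field, tf in field_tf.items():
        b = field_b.get(field, 0.75)  # per-field length normalization (fieldB)
        norm = 1 - b + b * field_len[field] / avg_field_len[field]
        pseudo_tf += weights[field] * tf / norm
    return idf * pseudo_tf / (k1 + pseudo_tf)

weights = {"title": 5.0, "body": 1.0}
field_b = {"title": 0.3, "body": 0.75}
lengths = {"title": 5, "body": 200}    # this document's field lengths
averages = {"title": 5, "body": 200}   # corpus averages (hypothetical)

title_hit = bm25f_term({"title": 1}, lengths, averages, weights, field_b, idf=1.0)
body_hit = bm25f_term({"body": 1}, lengths, averages, weights, field_b, idf=1.0)
```

Under this formulation a single title hit outscores a single body hit, but by less than the raw 5:1 weight ratio, because saturation is applied after the weighted frequencies are combined.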

---

## Code Search — `TextIndexConfig::code_search()`

**Use case:** Source code repositories, log files, configuration files, error messages.

Preserves case, punctuation, and numeric tokens. No stemming (you want exact matches for `getElementById` vs `getElementByName`). Single-character terms enabled for operators and variable names.

```gcl
var config = TextIndexConfig::code_search();

var index = TextIndex<String> { config: config };

// Index source code files
index.add("function calculateTotalPrice(items: Array<Item>): float", "pricing.gcl:42");
index.add("var connection = db.connect(host, port, credentials)", "database.gcl:15");
index.add("if (response.statusCode == 404) { throw NotFoundException(); }", "api.gcl:88");
index.build();

// Case-sensitive search for exact function name
var _results = index.search_exact("calculateTotalPrice", 10);

// Search for error codes
var _results2 = index.search_exact("404", 10);
```

**When to tune:**
- Use `search_exact()` for identifier lookup (exact substring)
- Use `search_bm25()` for broader code search across function bodies
- Add custom separators in `tokenization.separators` for camelCase splitting if needed
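
If separator characters alone cannot express camelCase boundaries (an assumption; the source does not say how `tokenization.separators` treats case changes), one option is to pre-split identifiers before indexing. The Python sketch below shows a regex-based splitter; the function name is hypothetical and it is not part of the library.

```python
import re

# lowercase run with optional leading capital | acronym run | digit run
_PARTS = re.compile(r"[A-Z]?[a-z]+|[A-Z]+(?![a-z])|\d+")

def split_identifier(identifier):
    """Split a camelCase or PascalCase identifier into its word parts."""
    return _PARTS.findall(identifier)

parts = split_identifier("calculateTotalPrice")
```

Indexing both the original identifier and its parts would keep exact lookups working while letting a query such as "total price" reach `calculateTotalPrice`.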

---

## Social Media — `TextIndexConfig::social()`

**Use case:** Tweets, comments, chat messages, reviews, forum posts.

Auto stop words adapt to the corpus. Low `bm25.b` (0.3) reduces length penalty for short posts. URL/HTML stripping and character normalization handle noisy user input.

```gcl
var config = TextIndexConfig::social();

var index = TextIndex<String> { config: config };

// Index social media posts
index.add("Just tried the new restaurant downtown and it was amaaaazing!!!", "post-001");
index.add("Check out https://example.com for deals!! Best prices ever", "post-002");
index.add("Can't believe the service at that caf&eacute; was so bad", "post-003");
index.build();

// "amaaaazing" normalized to "amazing", URLs stripped, entities decoded
var _results = index.search_bm25("amazing restaurant", 10);
```
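
The kind of cleanup described above (entity decoding, URL/HTML stripping, collapsing repeated characters) can be approximated in a few lines. This Python sketch is a generic illustration and not the preset's actual pipeline; the function name, the regexes, and the choice to collapse runs of three or more characters down to one are assumptions.

```python
import html
import re

URL_RE = re.compile(r"https?://\S+")
TAG_RE = re.compile(r"<[^>]+>")
REPEAT_RE = re.compile(r"(.)\1{2,}")  # the same character 3+ times in a row

def normalize_post(text):
    text = html.unescape(text)         # "caf&eacute;" -> "café"
    text = URL_RE.sub(" ", text)       # drop URLs
    text = TAG_RE.sub(" ", text)       # drop HTML tags
    text = REPEAT_RE.sub(r"\1", text)  # "amaaaazing" -> "amazing"
    return " ".join(text.split())      # tidy whitespace

cleaned = normalize_post("it was amaaaazing!!! see <b>https://example.com</b>")
```

Collapsing to a single character is lossy for words with a legitimate double letter typed with extra repeats (for example "coool" becomes "col"), so some pipelines collapse to two characters instead.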

**When to tune:**
- The preset sets `stopWords.autoThreshold` to `0.6`. Lower it (0.4-0.5) for corpora of very short posts, where common words appear in many documents
- Set `tokenization.normOptions.stripAccents = true` for multilingual social media
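
Corpus-adaptive stop words are often derived from document frequency. The Python sketch below assumes the threshold is the fraction of documents a term must exceed to be treated as a stop word; the source does not define `autoThreshold` precisely, so this interpretation and the function name are hypothetical.

```python
def auto_stop_words(docs, threshold=0.6):
    """Terms whose document frequency exceeds `threshold` (a fraction of all docs)."""
    n = len(docs)
    doc_freq = {}
    for doc in docs:
        # Count each term once per document
        for term in set(doc.lower().split()):
            doc_freq[term] = doc_freq.get(term, 0) + 1
    return {term for term, count in doc_freq.items() if count / n > threshold}

posts = [
    "the food was great",
    "the service was bad",
    "the best deals ever",
]
stops_default = auto_stop_words(posts, threshold=0.6)
stops_strict = auto_stop_words(posts, threshold=0.7)
```

Under this reading, lowering the threshold marks more terms as stop words, which matches the tuning advice for corpora of short posts.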

---

## Academic Papers — `TextIndexConfig::academic()`

**Use case:** Research papers, technical reports, legal documents, patents.

Stemming improves recall (matching "algorithms" when searching "algorithm"). High `bm25.b` (0.9) normalizes for paper length variance. MMR diversity reduces redundant results from similar papers.

```gcl
var config = TextIndexConfig::academic();

var index = TextIndex<String> { config: config };

// Index research papers
index.add("Attention Is All You Need: a novel transformer architecture for sequence transduction", "paper-001");
index.add("BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding", "paper-002");
index.add("An empirical study of transformer architectures for neural machine translation", "paper-003");
index.build();

// Stemming: "transformers" matches "transformer"
// BM25+ avoids over-penalizing longer abstracts
// Diversity reduces near-duplicate transformer papers
var _results = index.search_bm25("transformer architectures", 10);
```

**When to tune:**
- Keep `BM25Variant::plus` (BM25+, which this preset is described as using) for corpora with high length variance (papers from 2 to 50 pages)
- Lower `diversify.lambda` (0.3) for broader topic exploration
- Raise `diversify.lambda` (0.8) for focused retrieval on a specific topic
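
The effect of the lambda parameter can be seen in a small Maximal Marginal Relevance (MMR) sketch. This Python example assumes `diversify.lambda` follows the usual MMR convention (1.0 is pure relevance, 0.0 is pure diversity); the function, paper IDs, relevance values, and similarity function are hypothetical.

```python
def mmr(candidates, relevance, similarity, lam=0.5, k=10):
    """Greedy MMR: trade relevance against similarity to already selected items."""
    selected = []
    remaining = list(candidates)
    while remaining and len(selected) < k:
        def score(c):
            # Redundancy is the closest match among items picked so far
            redundancy = max((similarity(c, s) for s in selected), default=0.0)
            return lam * relevance[c] - (1 - lam) * redundancy
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected

relevance = {"paper-a": 0.9, "paper-b": 0.85, "paper-c": 0.5}
near_duplicates = {frozenset(("paper-a", "paper-b"))}

def similarity(x, y):
    return 0.95 if frozenset((x, y)) in near_duplicates else 0.1

diverse = mmr(relevance, relevance, similarity, lam=0.3, k=2)
focused = mmr(relevance, relevance, similarity, lam=0.8, k=2)
```

At `lam=0.3` the near-duplicate `paper-b` is skipped in favor of the less relevant but different `paper-c`; at `lam=0.8` relevance dominates and `paper-b` is kept.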

---

## E-Commerce Product Search — `TextIndexConfig::ecommerce()`

**Use case:** Product catalogs, shopping sites, inventory search with faceting.

Typo tolerance, stop words, and `<mark>` highlighting. Field weights are not
preset, because typed `field` refs need your document type. By default `add_fields`
auto-discovers `String` fields at weight 1.0; to override, assign an explicit list to `config.fields`.

```gcl
type Product {
    name: String;
    description: String;
    brand: String;
    category: String;
    price: float;
}

var config = TextIndexConfig::ecommerce();
config.fields = [
    FieldConfig { f: Product::name,        weight: 5.0, fieldB: 0.3 },
    FieldConfig { f: Product::description, weight: 1.0 },
    FieldConfig { f: Product::brand,       weight: 3.0 }
];

var index = TextIndex<Product> { config: config };

index.add_fields(Product {
    name: "Wireless Noise-Cancelling Headphones",
    description: "Premium over-ear headphones with 30-hour battery life",
    brand: "AudioMax",
    category: "electronics",
    price: 299.99
});
index.build();

// Faceted search with category and brand breakdowns
var requests = Array<FacetRequest> {};
requests.add(FacetRequest { f: Product::category, facetType: FacetType::term });
requests.add(FacetRequest { f: Product::brand,    facetType: FacetType::term });
var result = index.search_faceted("wireless headphones", 20, requests);

// Highlighted snippet for search results display.
// `add_fields()` builds an indexable key by concatenating the configured
// `FieldConfig.f` values, so pass the `key` of a returned result (here the first hit).
var _snip = index.snippet(result.results[0].key, "wireless headphones", null);
```
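
For readers unfamiliar with `<mark>` highlighting, the output shape can be illustrated with a small Python sketch. It is not how `snippet()` is implemented: the function name is hypothetical, and it matches whole words case-insensitively with no stemming or typo tolerance.

```python
import re

def highlight(text, query, tag="mark"):
    """Wrap each whole-word occurrence of a query term in <tag>...</tag>."""
    terms = [re.escape(t) for t in query.split() if t]
    if not terms:
        return text
    pattern = re.compile(r"\b(" + "|".join(terms) + r")\b", re.IGNORECASE)
    # Keep the original casing of the matched text
    return pattern.sub(lambda m: f"<{tag}>{m.group(0)}</{tag}>", text)

snippet = highlight("Wireless Noise-Cancelling Headphones", "wireless headphones")
```

The resulting string can be rendered directly in HTML, where browsers style `<mark>` as highlighted text by default.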

**When to tune:**
- Use `search_faceted()` with `NumericRangeBucket` for price range facets
- Adjust `typoTolerance` thresholds to make typo tolerance tighter/looser

---

## Log & Event Search — `TextIndexConfig::logs()`

**Use case:** Server logs, application logs, audit trails, security events.

No stop words (log messages are structured, not natural language). No stemming. Low minimum term length to catch error codes and short tokens.

```gcl
var config = TextIndexConfig::logs();

var index = TextIndex<String> { config: config };

// Index log entries
index.add("2024-01-15 ERROR [api-gateway] Connection refused to db-primary:5432", "log-001");
index.add("2024-01-15 WARN [auth-service] Rate limit exceeded for IP 10.0.0.42", "log-002");
index.add("2024-01-15 ERROR [payment] Timeout after 30s connecting to payment-provider", "log-003");
index.build();

// Search for connection errors
var _results = index.search_boolean("ERROR AND (connection OR timeout)", 50);

// Exact match for a specific port number
var _results2 = index.search_exact("5432", 10);
```

**When to tune:**
- Use `search_boolean()` for structured log queries with AND/OR/NOT
- Use `search_exact()` for IP addresses, port numbers, and error codes
- Consider `PercolateIndex` for real-time alerting on incoming log streams

---

## Multilingual — `TextIndexConfig::multilingual(lang)`

**Use case:** International content, multilingual documentation, localized product descriptions.

Language-specific stop words and character normalization. Accent stripping ensures that "café" matches "cafe" across languages.

```gcl
// French content index
var config = TextIndexConfig::multilingual(TextSearchLanguage::fr);

var frIndex = TextIndex<String> { config: config };

frIndex.add("Les algorithmes d'apprentissage automatique transforment l'industrie", "doc-fr-001");
frIndex.add("Introduction au traitement du langage naturel", "doc-fr-002");
frIndex.build();

// Accent-insensitive: "apprentissage" matches regardless of accents
var _results = frIndex.search_bm25("apprentissage automatique", 10);
```
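
Accent stripping is commonly done by Unicode decomposition followed by removal of combining marks. The Python sketch below shows that general technique using the standard library; it is an illustration and not necessarily how the preset normalizes text, and the function name is hypothetical.

```python
import unicodedata

def strip_accents(text):
    """Decompose characters (NFD) and drop combining marks (category Mn)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")

plain = strip_accents("café français")
```

Applying the same function to both indexed text and queries makes "cafe" and "café" compare equal.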

**Supported languages (33):**

`ar` `bg` `ca` `cs` `da` `de` `el` `en` `es` `fa` `fi` `fr` `gu` `he` `hi` `hu` `id` `it` `ja` `ko` `ms` `nl` `no` `pl` `pt` `ro` `ru` `sk` `sv` `tr` `uk` `vi` `zh`

---

## Real-Time Alerting — `TextIndexConfig::realtime_alert()`

**Use case:** Content monitoring, compliance alerts, news feeds, security scanning.

Pair with `PercolateIndex` for reverse search. Register standing queries, then match incoming documents against them.

```gcl
var alertEngine = PercolateIndex {
    config: TextIndexConfig::realtime_alert()
};

// Register alert rules
alertEngine.add_query("outage-critical", "outage AND production", PercolateMode::boolean);
alertEngine.add_query("security-breach", "unauthorized AND access", PercolateMode::boolean);
alertEngine.add_query("performance-issue", "latency timeout degradation", PercolateMode::bm25);

// Process incoming events
var _alerts = alertEngine.percolate("Production outage in us-east-1 region", 10);
// _alerts is expected to contain only "outage-critical"
```

**When to tune:**
- Use `PercolateMode::boolean` for precise matching (compliance rules)
- Use `PercolateMode::bm25` for broad topic matching (news feeds)
- Stemming is on by default, so "running" in alert queries matches "runs" in incoming documents
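
Reverse search inverts the usual roles: queries are stored and each incoming document is tested against them. The Python sketch below models only the simplest case, rules that require all of their terms to be present, with lowercase whitespace tokenization and no stemming; the function name and rule representation are hypothetical and do not reflect `PercolateIndex` internals.

```python
def percolate(rules, document):
    """Return IDs of rules whose required terms all occur in the document."""
    doc_terms = set(document.lower().split())
    return [rule_id for rule_id, required in rules.items() if required <= doc_terms]

# Each rule is the set of terms joined by AND in the examples above
rules = {
    "outage-critical": {"outage", "production"},
    "security-breach": {"unauthorized", "access"},
}

fired = percolate(rules, "Production outage in us-east-1 region")
```

Only the rule whose terms are all present fires, which mirrors the `PercolateMode::boolean` behavior for the first two alert rules.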

---

## Phonetic Name Search — `TextIndexConfig::phonetic_name()`

**Use case:** People directories, customer databases, genealogy, contact lookup.

Double Metaphone phonetic matching finds names that sound alike regardless of spelling.

```gcl
var config = TextIndexConfig::phonetic_name();

var index = TextIndex<String> { config: config };

// Index names
index.add("John Smith", "contact-001");
index.add("Jon Smythe", "contact-002");
index.add("Catherine Johnson", "contact-003");
index.add("Kathryn Jonson", "contact-004");
index.build();

// Phonetic: "Smith" matches "Smythe", "Catherine" matches "Kathryn"
var _results = index.search_phonetic("John Smith", 10);
// Returns both "John Smith" and "Jon Smythe"

var _results2 = index.search_phonetic("Catherine Johnson", 10);
// Returns both "Catherine Johnson" and "Kathryn Jonson"
```

**When to tune:**
- Combine with `search_fuzzy()` for both phonetic and edit-distance matching
- Use `FuzzyOptions { mode: FuzzyMode::term }` alongside phonetic for maximum recall on name variants