# Text Processing Utilities

The text search library includes three standalone text processing utilities that can be used independently of `TextIndex`: **TextTokenizer** for tokenization and term frequency analysis, **TextParser** for splitting documents into semantic sections, and **TextChunker** for splitting long texts into sized chunks for RAG pipelines and semantic search.

## TextTokenizer

`TextTokenizer` converts raw text into normalized tokens with positional metadata. It applies the full normalization pipeline: character mapping, NFKD casefolding, splitting on separators, length filtering, numeric filtering, and optional stemming.

Note: stop word filtering is **not** performed by the tokenizer. Tokenization is purely about splitting and normalization; stop word filtering is a separate index-level concern, applied when populating term postings inside `TextIndex`. The `stopWords` fields on `TextIndexConfig` are ignored by `TextTokenizer::tokenize` / `tokenize_normalized` / `tokenize_core`.

### tokenize()

The primary method. Takes raw text and a `TextIndexConfig`, returns an array of `TokenInfo` objects with normalized text, the original form, and position:

```gcl
var config = TextIndexConfig {
    tokenization: TokenizationOptions { stemming: true }
};
var tokens = TextTokenizer::tokenize("Machine Learning algorithms are powerful tools!", config);
for (var i = 0; i < tokens.size(); i++) {
    var t = tokens[i];
    info("pos=${t.position} text='${t.text}' original='${t.original}'");
}
// Output:
// pos=0 text='machin' original='machine' (stemmed)
// pos=1 text='learn' original='learning' (stemmed)
// pos=2 text='algorithm' original='algorithms'
// pos=3 text='are' original='are'
// pos=4 text='power' original='powerful' (stemmed)
// pos=5 text='tool' original='tools' (stemmed)
// Note: "are" is preserved by the tokenizer; stop word filtering is an
// index-level step (it would skip "are" only when populating term postings
// inside TextIndex, not during tokenization).
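
// A minimal sketch, not library behavior: apply stop word filtering by hand
// to the tokens above. The two-word stop list is a hypothetical example, and
// comparing strings with `==` is an assumption about GCL.
for (var j = 0; j < tokens.size(); j++) {
    var tok = tokens[j];
    if (tok.text == "are" || tok.text == "the") {
        continue; // dropped; the remaining tokens keep their original positions
    }
    info("kept pos=${tok.position} text='${tok.text}'");
}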
```

### TokenInfo Type

```gcl
@volatile
type TokenInfo {
    text: String;      // Normalized token text (stemmed, casefolded)
    original: String;  // Original token form before normalization
    position: int;     // Token position in document (0-indexed)
}
```

Positions are preserved even when tokens are filtered out (by length or numeric filtering in the tokenizer, or by stop words at the index level), so they can be used for phrase and proximity queries.

### tokenize_with_originals()

Tokenizes text and returns a term frequency map with original forms and positional offsets. This is useful for building term statistics or analyzing document term distributions:

```gcl
var config = TextIndexConfig {};
var freqMap = TextTokenizer::tokenize_with_originals(
    "The quick brown fox jumps over the quick red fox",
    config
);
for (term, freq in freqMap) {
    info("'${term}': count=${freq.count}, original='${freq.original}', positions=${freq.positions}");
}
// Output (the tokenizer does not filter stop words):
// 'the': count=2, original='the', positions=[0, 6]
// 'quick': count=2, original='quick', positions=[1, 7]
// 'brown': count=1, original='brown', positions=[2]
// 'fox': count=2, original='fox', positions=[3, 9]
// 'jumps': count=1, original='jumps', positions=[4]
// 'over': count=1, original='over', positions=[5]
// 'red': count=1, original='red', positions=[8]
```

### TermFrequency Type

```gcl
@volatile
type TermFrequency {
    original: String;  // Original term form (before normalization)
    count: int;        // Number of occurrences in document
    positions: Array;  // Positional offsets of each occurrence
}
```

### tokenize_normalized()

Tokenizes pre-normalized text, skipping the normalization step.
Use this when you have already normalized the input (for example, if you store normalized text separately):

```gcl
var config = TextIndexConfig {
    stopWords: StopWordOptions { mode: StopWordMode::none }
};
// Input is already lowercased and cleaned
var _tokens = TextTokenizer::tokenize_normalized("machine learning algorithms", config);
```

### Configuration Effects on Tokenization

The `TextIndexConfig` controls every step of the tokenization pipeline:

| Config Field | Effect |
|-------------|--------|
| `tokenization.separators` | Characters used to split text into tokens (default: space) |
| `tokenization.minTermLength` | Tokens shorter than this are discarded (default: 2) |
| `tokenization.maxTermLength` | Tokens longer than this are discarded (default: 100) |
| `tokenization.filterNumericTerms` | If true, purely numeric tokens are removed (default: true) |
| `tokenization.stripPunctuation` | If true, punctuation is stripped from tokens (default: true) |
| `tokenization.stemming` | If true, Porter stemmer is applied (default: false) |

The `stopWords.*` fields on `TextIndexConfig` are not read by the tokenizer; they only take effect when `TextIndex` populates its term postings. If you need stop word filtering on raw `TextTokenizer` output, apply it yourself after calling `tokenize()`.

## TextParser

`TextParser` splits document text into semantic sections, detecting structure such as headings, code blocks, lists, tables, blockquotes, and paragraphs. It is designed primarily for Markdown content but works with plain text as well.

### split_into_sections()

The main method.
Takes a document string and returns an array of `ParsedSection` objects:

```gcl
var markdown = "# Introduction\n\nThis is the introduction paragraph.\n\n## Methods\n\n- Step one\n- Step two\n- Step three\n\n```python\ndef hello():\n print('world')\n```\n\n> Important note about the results.\n\n---\n\nFinal paragraph here.";
var sections = TextParser::split_into_sections(markdown);
for (var i = 0; i < sections.size(); i++) {
    var s = sections[i];
    info("${s.sectionType} [lines ${s.startLine}-${s.endLine}]: '${s.title}' => '${s.content}'");
}
// Output:
// heading [lines 0-1]: 'Introduction' => '# Introduction'
// paragraph [lines 2-4]: '' => 'This is the introduction paragraph.'
// heading [lines 4-5]: 'Methods' => '## Methods'
// list [lines 6-9]: '' => '- Step one\n- Step two\n- Step three'
// code [lines 9-12]: '' => 'def hello():\n print(\'world\')'
// blockquote [lines 13-14]: '' => 'Important note about the results.'
// horizontalRule [lines 15-16]: '' => '---'
// paragraph [lines 17-18]: '' => 'Final paragraph here.'
```

### ParsedSection Type

```gcl
@volatile
type ParsedSection {
    sectionType: SectionType;  // Type of section detected
    content: String;           // Section content (text without markers for blockquotes)
    title: String;             // Heading text (only populated for heading sections)
    startLine: int;            // Starting line number in original document
    endLine: int;              // Ending line number in original document
}
```

### SectionType Enum

| Type | Detection Rule | Content |
|------|---------------|---------|
| `paragraph` | Contiguous non-empty lines that are not another type | Raw paragraph text |
| `heading` | Lines starting with `#` | Full heading line; `title` has text without `#` markers |
| `code` | Lines between `` ``` `` or `~~~` fences | Code content without fence markers |
| `list` | Lines starting with `- `, `* `, `+ `, or numbered (`1.`, `2)`) | All list items joined with newlines |
| `table` | Lines containing `\|` followed by a separator line (`\|---\|`) | All table lines joined with newlines |
| `blockquote` | Lines starting with `>` | Blockquote content with `>` prefix removed |
| `horizontalRule` | Lines with 3+ of `---`, `***`, or `___` | The rule characters |

### Sentence Splitting

`TextParser` also provides sentence splitting, with different strategies based on section type:

```gcl
// Prose: splits on sentence-ending punctuation (. ! ?) with abbreviation handling
var _sentences = TextParser::split_sentences(
    "Dr. Smith published results. The findings were significant!",
    SectionType::paragraph
);
// sentences == ["Dr. Smith published results.", "The findings were significant!"]
// Note: "Dr." is recognized as an abbreviation and does not trigger a split.
// Structured content (lists, code, tables): splits by line
var _lines = TextParser::split_sentences(
    "- Item one\n- Item two\n- Item three",
    SectionType::list
);
// lines == ["- Item one", "- Item two", "- Item three"]
```

The sentence splitter handles:

- Common abbreviations (Dr., Mr., Mrs., Ms., Prof., Inc., Ltd., Corp., vs., etc., e.g., i.e., and others)
- Quoted text boundaries (straight and curly quotes)
- CJK sentence-ending punctuation (period, exclamation mark, question mark, ellipsis)

## TextChunker

`TextChunker` splits long texts into sized chunks suitable for embedding-based semantic search and retrieval-augmented generation (RAG). Each chunk includes position metadata and character offsets back to the original document.

### chunk()

The main method. Takes text, a chunking strategy, a target chunk size in words, and an overlap size in words:

```gcl
var chunks = TextChunker::chunk(longText, ChunkStrategy::sentence, 128, 20);
```

### ChunkStrategy Enum

| Strategy | Behavior | Best For |
|----------|----------|----------|
| `none` | Returns empty array (no chunking) | Disabling chunking |
| `fixed` | Splits by word count with configurable overlap | Uniform chunk sizes, predictable token counts |
| `sentence` | Splits on sentence boundaries, groups into chunks | Prose text, maintaining sentence integrity |
| `paragraph` | Splits on paragraph boundaries (blank lines), groups into chunks | Structured documents, maintaining paragraph integrity |
| `recursive` | Tries paragraph first, then sentence, then fixed | General-purpose; adapts to document structure |

### ChunkInfo Type

```gcl
@volatile
type ChunkInfo {
    content: String;  // Chunk text content
    position: int;    // Chunk index within source document (0-indexed)
    startChar: int;   // Start character offset in source document
    endChar: int;     // End character offset in source document
}
```

### Fixed Chunking

Splits text into chunks of a fixed word count with optional overlap.
Overlap ensures context is shared between adjacent chunks, which improves retrieval quality.

```gcl
var text = "The quick brown fox jumps over the lazy dog and then runs across the field toward the distant hills beyond the river";
var chunks = TextChunker::chunk(text, ChunkStrategy::fixed, 5, 2);
for (var i = 0; i < chunks.size(); i++) {
    var c = chunks[i];
    info("chunk ${c.position}: '${c.content}' [${c.startChar}-${c.endChar}]");
}
// Output:
// chunk 0: 'The quick brown fox jumps' [0-25]
// chunk 1: 'fox jumps over the lazy' [16-39]
// chunk 2: 'the lazy dog and then' [31-52]
// ...
// Each chunk has 5 words; consecutive chunks overlap by 2 words
```

The step between chunks is `chunking.size - chunking.overlap`. If `overlap >= size`, the step is clamped to 1. In the example above, a size of 5 and an overlap of 2 give a step of 3, so chunks start at words 0, 3, 6, and so on.

### Sentence Chunking

Splits text on sentence boundaries (`.`, `!`, `?`, and CJK equivalents), then groups consecutive sentences until the target word count is reached:

```gcl
var text = "Machine learning is transforming industries. Neural networks power modern AI systems. Natural language processing enables text understanding. Computer vision handles image recognition tasks.";
var chunks = TextChunker::chunk(text, ChunkStrategy::sentence, 15, 0);
for (var i = 0; i < chunks.size(); i++) {
    var c = chunks[i];
    info("chunk ${c.position}: '${c.content}'");
}
// Output:
// chunk 0: 'Machine learning is transforming industries. Neural networks power modern AI systems.'
// chunk 1: 'Natural language processing enables text understanding. Computer vision handles image recognition tasks.'
// Sentences are grouped to stay near the 15-word target
```

### Paragraph Chunking

Splits text on paragraph boundaries (blank lines), then groups consecutive paragraphs until the target word count is reached:

```gcl
var text = "First paragraph about machine learning.\n\nSecond paragraph about neural networks and deep learning architectures.\n\nThird paragraph covering natural language processing.\n\nFourth paragraph on computer vision.";
var chunks = TextChunker::chunk(text, ChunkStrategy::paragraph, 20, 0);
for (var i = 0; i < chunks.size(); i++) {
    var c = chunks[i];
    info("chunk ${c.position}: '${c.content}'");
}
// Output:
// chunk 0: 'First paragraph about machine learning. Second paragraph about neural networks and deep learning architectures.'
// chunk 1: 'Third paragraph covering natural language processing. Fourth paragraph on computer vision.'
```

### Recursive Chunking

The recursive strategy adaptively selects the best splitting method. It tries paragraph splitting first; if that produces only one chunk, it falls back to sentence splitting; if that also produces one chunk, it falls back to fixed splitting. This makes it a good default choice when you do not know the structure of incoming text.
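
The fallback order described above can be approximated by hand with the three documented strategies. This is only a sketch of the described behavior, not the library's implementation: the sample text, the `manual` variable, the `<= 1` test for "only one chunk", and the size and overlap values are assumptions chosen for illustration.

```gcl
// Sample input without blank lines, so paragraph splitting cannot separate it (illustrative).
var sample = "Machine learning is transforming industries. Neural networks power modern AI systems.";

// 1. Try paragraph boundaries first.
var manual = TextChunker::chunk(sample, ChunkStrategy::paragraph, 5, 0);
// 2. Fall back to sentence boundaries if that yields at most one chunk (assumed check).
if (manual.size() <= 1) {
    manual = TextChunker::chunk(sample, ChunkStrategy::sentence, 5, 0);
}
// 3. Fall back to fixed word counts as a last resort.
if (manual.size() <= 1) {
    manual = TextChunker::chunk(sample, ChunkStrategy::fixed, 5, 0);
}
info("manual fallback produced ${manual.size()} chunks");
```

A single call with `ChunkStrategy::recursive` performs this selection for you:
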
```gcl
// For structured text with paragraphs, this behaves like paragraph chunking
// For prose without paragraph breaks, it falls back to sentence chunking
// For text without sentence boundaries, it falls back to fixed chunking
var chunks = TextChunker::chunk(text, ChunkStrategy::recursive, 100, 10);
```

### RAG Pipeline Example

A typical RAG pipeline: chunk documents, index the chunks, then search:

```gcl
var index = TextIndex {
    config: TextIndexConfig {
        stopWords: StopWordOptions { mode: StopWordMode::default }
    }
};
var document = "Long document text with multiple paragraphs and sections...";

// Chunk the document
var chunks = TextChunker::chunk(document, ChunkStrategy::sentence, 128, 20);

// Index each chunk with a reference to its source
for (var i = 0; i < chunks.size(); i++) {
    var chunk = chunks[i];
    var chunkId = "doc1::chunk_${chunk.position}";
    index.add(chunk.content, chunkId);
}
index.build();

// Search across chunks
var results = index.search_bm25("specific topic", 5);
for (var i = 0; i < results.size(); i++) {
    var r = results[i];
    info("Match in ${r.value}: score=${r.score}");
}
```

### Chunking with Section Awareness

Combine `TextParser` with `TextChunker` for section-aware chunking: first parse the document into sections, then chunk each section independently:

```gcl
var markdownDoc = "# Chapter 1\n\nLong introduction text...\n\n## Section 1.1\n\nDetailed content here...\n\n```python\ncode example\n```";

// Parse into sections
var sections = TextParser::split_into_sections(markdownDoc);
var allChunks = Array {};
for (var i = 0; i < sections.size(); i++) {
    var section = sections[i];
    // Skip code blocks and horizontal rules
    if (section.sectionType == SectionType::code || section.sectionType == SectionType::horizontalRule) {
        continue;
    }
    // Chunk each prose section independently
    var sectionChunks = TextChunker::chunk(section.content, ChunkStrategy::sentence, 100, 10);
    for (var j = 0; j < sectionChunks.size(); j++) {
        allChunks.add(sectionChunks[j]);
    }
}
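
// Illustrative follow-up, not part of the original example: inspect the
// collected chunks. Because each section was chunked on its own, position,
// startChar and endChar are presumably relative to that section's content,
// not to markdownDoc, so the loop index k is used as the global chunk number.
for (var k = 0; k < allChunks.size(); k++) {
    var c = allChunks[k];
    info("chunk ${k}: '${c.content}' [${c.startChar}-${c.endChar}]");
}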
```

## When to Use Each Utility

| Utility | Use Case |
|---------|----------|
| **TextTokenizer::tokenize()** | Get normalized tokens with positions for custom indexing or analysis |
| **TextTokenizer::tokenize_with_originals()** | Build term frequency statistics, analyze document vocabulary |
| **TextTokenizer::tokenize_normalized()** | Tokenize pre-normalized text on hot paths to avoid redundant normalization |
| **TextParser::split_into_sections()** | Parse Markdown/text into structured sections for section-aware processing |
| **TextParser::split_sentences()** | Split text into sentences with abbreviation handling |
| **TextChunker::chunk() with fixed** | Uniform chunk sizes for embedding models with fixed context windows |
| **TextChunker::chunk() with sentence** | Preserve sentence boundaries in chunks for better semantic coherence |
| **TextChunker::chunk() with paragraph** | Preserve paragraph boundaries for structured documents |
| **TextChunker::chunk() with recursive** | General-purpose default; adapts to document structure automatically |
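
As one illustration of combining the utilities in the table, the sketch below chunks a text and then reports term counts per chunk. It uses only calls shown earlier in this document; the sample text and the chunk size of 15 words are arbitrary choices made for the example.

```gcl
var config = TextIndexConfig {};
var text = "Machine learning is transforming industries. Neural networks power modern AI systems. Natural language processing enables text understanding.";

// Split into sentence-aligned chunks, then build term statistics for each chunk.
var chunks = TextChunker::chunk(text, ChunkStrategy::sentence, 15, 0);
for (var i = 0; i < chunks.size(); i++) {
    var c = chunks[i];
    var freqMap = TextTokenizer::tokenize_with_originals(c.content, config);
    for (term, freq in freqMap) {
        info("chunk ${c.position}: '${term}' count=${freq.count}");
    }
}
```

Since the tokenizer does not filter stop words, words such as "is" appear in these counts unless you remove them yourself.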