# GreyCat AI Library

```gcl
@library("ai", "0.0.0");
```

Local AI inference library for GreyCat, powered by llama.cpp. Run GGUF-quantized models (Llama, Qwen, Mistral, etc.) directly within your GreyCat application; no external API calls required.
## Features

- **Local inference**: text generation, chat, embeddings, and tokenization
- **GPU acceleration**: CUDA, Metal, OpenCL, and Vulkan backends
- **Quantization**: 2-bit to 16-bit formats for memory efficiency
- **Chat templates**: built-in conversation formatting per model
- **Streaming**: token-by-token generation via callbacks
- **Embeddings**: dense vector representations for semantic search
- **LoRA adapters**: support for fine-tuned models
- **State save/load**: resumable inference sessions
## Quick Start

### Loading a Model

```gcl
var model = Model::load("my_model", "./model.gguf", ModelParams {
  n_gpu_layers: -1, // offload all layers to the GPU
});
```

`n_gpu_layers: -1` offloads all layers to the GPU. Use `0` for CPU-only inference, or a positive number to offload only that many layers.
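Since `Model::load` returns `null` on failure, the result is typically guarded before use. A minimal sketch of a guarded, CPU-only load (the path is illustrative):

```gcl
// CPU-only load; fail fast if the file is missing or not a valid GGUF.
var model = Model::load("cpu_model", "./model.gguf", ModelParams {
  n_gpu_layers: 0,
});
if (model == null) {
  info("failed to load ./model.gguf");
}
```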
### Chat Completion

```gcl
var messages = [
  ChatMessage { role: "system", content: "You are a helpful assistant." },
  ChatMessage { role: "user", content: "Hello!" },
];
var result = model.chat(messages, GenerationParams { max_tokens: 256 }, null);
info("Response: ${result.text}");
```
### Chat with Streaming

```gcl
var result = model.chat_stream(messages, fn (token: String, is_final: bool) {
  print(token);
}, GenerationParams { max_tokens: 256 }, null);
```
### Text Generation (Raw Prompt)

```gcl
var result = model.generate("Once upon a time", GenerationParams {
  max_tokens: 128,
  temperature: 0.7,
}, null);
info(result.text);
```
### Embeddings

```gcl
var model = Model::load("embed_model", "./Qwen3-Embedding-0.6B-f16.gguf", ModelParams {
  n_gpu_layers: -1,
  use_mmap: true,
});
var embedding = model.embed("Hello World!", TensorType::f32, ContextParams {
  n_ctx: 1024,
  n_batch: 512,
});
info("Embedding dimension: ${embedding.size()}");
info("Sum: ${embedding.sum()}");
```
### Batch Embeddings

```gcl
var embeddings = model.embed_batch(
  ["first document", "second document", "third document"],
  TensorType::f32,
  null,
);
```
### Tokenization

```gcl
var tokens = model.tokenize("Hello world", true, false);
var text = model.detokenize(tokens, true, false);
```
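Token counts from `tokenize` are also handy for checking that a prompt fits the context window before generating. A sketch (the `size()` accessor mirrors the one used on embeddings above):

```gcl
var prompt = "Summarize the following report: ...";
var tokens = model.tokenize(prompt, true, false); // add_special = true, parse_special = false
info("Prompt uses ${tokens.size()} tokens");
```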
## API Reference

### Core Types

#### Model

The main type for interacting with a loaded language model.

**Static Methods:**
| Method | Description |
|---|---|
| `Model::load(id, path, params?)` | Load a GGUF model from disk. Returns `null` on failure. |
| `Model::load_from_splits(id, paths, params?)` | Load a model split across multiple files. |
| `Model::get(id)` | Retrieve a previously loaded model by its ID. |
| `Model::quantize(input, output, params?)` | Convert a model to a different quantization format. |
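The static helpers compose: load a model once under an ID, then retrieve it elsewhere with `Model::get`. A sketch (both paths are illustrative, and the quantize call passes `null` to use default quantization parameters):

```gcl
// Retrieve a model that was loaded earlier in the application.
var model = Model::get("my_model");

// Re-quantize an f16 model into a smaller Q8_0 file on disk.
Model::quantize("./model-f16.gguf", "./model-q8_0.gguf", null);
```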
**Instance Methods — Generation:**

| Method | Description |
|---|---|
| `chat(messages, params?, ctx_params?)` | Chat completion from a list of messages. |
| `chat_stream(messages, callback, params?, ctx_params?)` | Streaming chat completion. |
| `generate(prompt, params?, ctx_params?)` | Raw text generation from a prompt. |
| `generate_stream(prompt, callback, params?, ctx_params?)` | Streaming text generation. |
| `format_chat(messages, add_assistant)` | Format messages using the model’s chat template. |
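`format_chat` is useful for inspecting exactly what the model will see after template expansion. A sketch reusing the message shape from the Quick Start (`add_assistant = true` appends the assistant turn prefix):

```gcl
var messages = [
  ChatMessage { role: "system", content: "You are a helpful assistant." },
  ChatMessage { role: "user", content: "Hello!" },
];
var prompt = model.format_chat(messages, true);
info(prompt); // the fully formatted prompt string
```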
**Instance Methods — Embeddings:**

| Method | Description |
|---|---|
| `embed(text, tensor_type, ctx_params?)` | Compute the embedding vector for a single text. |
| `embed_batch(texts, tensor_type, ctx_params?)` | Compute embeddings for multiple texts (batched). |
**Instance Methods — Tokenization:**

| Method | Description |
|---|---|
| `tokenize(text, add_special, parse_special)` | Convert text to token IDs. |
| `detokenize(tokens, remove_special, unparse_special)` | Convert token IDs back to text. |
| `token_to_text(token)` | Convert a single token ID to text. |
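To see how a string is segmented, each token ID can be mapped back with `token_to_text`. A sketch (the `for (index, value in ...)` iteration form over the token array is assumed here):

```gcl
var tokens = model.tokenize("Hello world", true, false);
for (i, tok in tokens) {
  info("${tok} -> ${model.token_to_text(tok)}");
}
```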
**Instance Methods — Info & Management:**

| Method | Description |
|---|---|
| `info()` | Get comprehensive model metadata (`ModelInfo`). |
| `desc()` | Get a human-readable model description. |
| `meta(key)` | Get a specific metadata value by key. |
| `chat_template(name?)` | Get the model’s chat template. |
| `perf()` | Get performance metrics from the last operation. |
| `save(path)` | Save the model to a GGUF file. |
| `free()` | Explicitly release model resources. |
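A typical inspection-and-cleanup sequence, sketched below (`general.architecture` is a standard GGUF metadata key, used here purely for illustration):

```gcl
var model = Model::get("my_model");
info(model.desc()); // human-readable summary
var arch = model.meta("general.architecture");
info("architecture: ${arch}");
model.free(); // release weights explicitly once no longer needed
```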
#### LLM

Static utility functions for the llama.cpp runtime.

| Method | Description |
|---|---|
| `LLM::logging(enabled)` | Enable/disable llama.cpp internal logging. |
| `LLM::system_info()` | Get runtime environment info (CPU, GPU, SIMD, etc.). |
| `LLM::supports_gpu()` | Check whether GPU acceleration is available. |
| `LLM::supports_mmap()` | Check whether memory-mapped file loading is supported. |
| `LLM::max_devices()` | Get the maximum number of usable GPUs. |
| `LLM::chat_builtin_templates()` | List all built-in chat templates. |
| `LLM::params_fit(path, mparams, cparams, margin, n_ctx_min)` | Auto-fit parameters to available device memory. |
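These utilities make it easy to choose GPU or CPU loading at startup. A sketch (the path is illustrative):

```gcl
LLM::logging(false);      // silence llama.cpp internal logs
info(LLM::system_info()); // CPU/GPU/SIMD capabilities

// Offload all layers when a GPU backend is available, else stay on CPU.
var layers = 0;
if (LLM::supports_gpu()) {
  layers = -1;
}
var model = Model::load("auto_model", "./model.gguf", ModelParams {
  n_gpu_layers: layers,
});
```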
### Parameter Types

#### ModelParams

Controls model loading behavior.

| Field | Type | Description |
|---|---|---|
| `n_gpu_layers` | `int?` | Layers to offload to GPU (`-1` = all, `0` = CPU only) |
| `split_mode` | `SplitMode?` | Multi-GPU distribution strategy |
| `main_gpu` | `int?` | Primary GPU index (default: 0) |
| `tensor_split` | `Array<float>?` | Per-GPU tensor row proportions |
| `vocab_only` | `bool?` | Load vocabulary only (no weights) |
| `use_mmap` | `bool?` | Use memory-mapped files for faster loading |
| `use_mlock` | `bool?` | Lock model in RAM (prevent swapping) |
| `check_tensors` | `bool?` | Validate tensor data during loading |
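Since every field is optional, only deviations from the defaults need to be set. For example, a vocabulary-only load is enough when a model is used purely for `tokenize`/`detokenize`; a sketch (path illustrative):

```gcl
var tok_model = Model::load("tokenizer_only", "./model.gguf", ModelParams {
  vocab_only: true, // skip loading the weights entirely
  use_mmap: true,
});
```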
#### ContextParams

Controls inference context behavior and memory usage.

| Field | Type | Description |
|---|---|---|
| `n_ctx` | `int?` | Context window size (`0` = use the model’s value) |
| `n_batch` | `int?` | Logical maximum batch size for decode |
| `n_ubatch` | `int?` | Physical maximum batch size |
| `n_threads` | `int?` | Threads for generation |
| `n_threads_batch` | `int?` | Threads for batch processing |
| `embeddings` | `bool?` | Extract embeddings together with logits |
| `normalize` | `bool?` | L2-normalize embeddings |
| `flash_attn_type` | `FlashAttnType?` | Flash Attention configuration |
| `type_k` | `GgmlType?` | KV-cache key data type |
| `type_v` | `GgmlType?` | KV-cache value data type |
| `offload_kqv` | `bool?` | Offload KQV operations to GPU |
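A common memory-saving combination is a bounded context with a quantized KV cache. A sketch (values are illustrative; note that llama.cpp may additionally require Flash Attention for some quantized V-cache types):

```gcl
var ctx = ContextParams {
  n_ctx: 4096,            // cap the context window
  n_batch: 512,
  type_k: GgmlType::q8_0, // 8-bit KV-cache keys
  type_v: GgmlType::q8_0, // 8-bit KV-cache values
  offload_kqv: true,
};
var result = model.generate("Hello", GenerationParams { max_tokens: 64 }, ctx);
```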
#### GenerationParams

High-level generation control.

| Field | Type | Description |
|---|---|---|
| `max_tokens` | `int?` | Maximum number of tokens to generate |
| `temperature` | `float?` | Sampling temperature (`0.0` = deterministic) |
| `top_p` | `float?` | Nucleus sampling threshold |
| `top_k` | `int?` | Top-K sampling |
| `grammar` | `String?` | GBNF grammar to constrain output |
| `stop_sequences` | `Array<String>?` | Stop generation on any of these strings |
| `sampler` | `SamplerParams?` | Full sampler configuration |
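Stop sequences and GBNF grammars combine well for constrained output. A sketch that forces a one-word yes/no answer:

```gcl
var result = model.generate("Is the sky blue? Answer: ", GenerationParams {
  max_tokens: 8,
  temperature: 0.0,                     // deterministic
  grammar: "root ::= \"yes\" | \"no\"", // GBNF: only these two outputs
  stop_sequences: ["\n"],
}, null);
info(result.text);
```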
#### SamplerParams

Fine-grained sampling control. All fields are optional.

| Field | Type | Description |
|---|---|---|
| `temperature` | `float?` | Temperature (`0.0` = deterministic) |
| `top_k` | `int?` | Top-K (`0` = disabled) |
| `top_p` | `float?` | Top-P / nucleus (`1.0` = disabled) |
| `min_p` | `float?` | Min-P (`0.0` = disabled) |
| `typical_p` | `float?` | Typical sampling (`1.0` = disabled) |
| `penalty` | `PenaltyParams?` | Repetition penalties |
| `dry` | `DryParams?` | DRY (Don’t Repeat Yourself) penalties |
| `mirostat` | `MirostatParams?` | Mirostat v1 parameters |
| `mirostat_v2` | `MirostatV2Params?` | Mirostat v2 parameters |
| `grammar` | `String?` | GBNF grammar constraint |
| `logit_bias` | `Array<LogitBias>?` | Per-token logit biases |
| `seed` | `int?` | Random seed for reproducibility |
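A full sampler configuration plugs into `GenerationParams.sampler`. A sketch with reproducible min-p sampling (values are illustrative):

```gcl
var params = GenerationParams {
  max_tokens: 256,
  sampler: SamplerParams {
    temperature: 0.8,
    min_p: 0.05, // drop tokens under 5% of the top token's probability
    top_k: 0,    // disabled, in favor of min-p
    seed: 42,    // reproducible runs
  },
};
var result = model.generate("Write a haiku about graphs.", params, null);
```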
### Result Types

#### GenerationResult

| Field | Type | Description |
|---|---|---|
| `text` | `String` | Generated text |
| `tokens` | `Array<int>` | Generated token IDs |
| `n_tokens` | `int` | Number of tokens generated |
| `stop_reason` | `StopReason` | Why generation stopped (`max_tokens`, `eog_token`, `aborted`, `error`) |
| `perf` | `PerfData` | Performance metrics |
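Checking `stop_reason` distinguishes a natural end of generation from a truncated one. A sketch (the `perf.context.*` fields follow the PerfData layout):

```gcl
var result = model.generate("Once upon a time", GenerationParams { max_tokens: 128 }, null);
if (result.stop_reason == StopReason::max_tokens) {
  info("Output truncated after ${result.n_tokens} tokens");
}
info("Throughput: ${result.perf.context.tokens_per_second} tok/s");
```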
#### ChatMessage

| Field | Type | Description |
|---|---|---|
| `role` | `String` | Message role: `"system"`, `"user"`, or `"assistant"` |
| `content` | `String` | Message text |
#### ModelInfo

Detailed model metadata, including architecture dimensions (`n_embd`, `n_layer`, `n_head`), vocabulary info, RoPE parameters, capabilities (`has_encoder`, `has_decoder`, `is_recurrent`), special tokens, and the full metadata map.
#### PerfData

| Field | Type | Description |
|---|---|---|
| `context.n_eval` | `int` | Tokens evaluated |
| `context.t_eval_ms` | `float` | Total evaluation time (ms) |
| `context.tokens_per_second` | `float` | Generation throughput |
| `context.prompt_tokens_per_second` | `float` | Prompt-processing throughput |
### Enums

| Enum | Values | Description |
|---|---|---|
| `SplitMode` | `none`, `layer`, `row` | Multi-GPU splitting strategy |
| `PoolingType` | `unspecified`, `none`, `mean`, `cls`, `last`, `rank` | Embedding pooling strategy |
| `AttentionType` | `unspecified`, `causal`, `non_causal` | Attention mechanism |
| `FlashAttnType` | `disabled`, `enabled_for_fa`, `enabled_for_all` | Flash Attention config |
| `GgmlType` | `f32`, `f16`, `q4_0`, `q8_0`, … | Quantization formats |
| `StopReason` | `max_tokens`, `eog_token`, `aborted`, `error` | Generation stop reason |
| `VocabType` | `none`, `spm`, `bpe`, `wpm`, `ugm`, `rwkv`, `plamo2` | Tokenizer type |
| `SamplerType` | `greedy`, `dist`, `top_k`, `top_p`, `min_p`, `temp`, … | Sampler identifiers |
## Model Downloads

Download GGUF models from Hugging Face. Use a browser; `wget`/`curl` may not work with some download links.

Recommended models for testing:

- Embeddings: Qwen3-Embedding-0.6B-GGUF (f16)
- Chat: Qwen3-1.7B-GGUF (Q8_0)