
GreyCat AI Library

@library("ai", "0.0.0");

Local AI inference library for GreyCat, powered by llama.cpp. Run GGUF quantized models (Llama, Qwen, Mistral, etc.) directly within your GreyCat application — no external API calls required.

Features

  • Local inference — text generation, chat, embeddings, and tokenization
  • GPU acceleration — CUDA, Metal, OpenCL, Vulkan
  • Quantization — 2-bit to 16-bit formats for memory efficiency
  • Chat templates — built-in conversation formatting per model
  • Streaming — token-by-token generation via callbacks
  • Embeddings — dense vector representations for semantic search
  • LoRA adapters — fine-tuned model support
  • State save/load — resumable inference sessions

Quick Start

Loading a Model

var model = Model::load("my_model", "./model.gguf", ModelParams {
    n_gpu_layers: -1,  // offload all layers to GPU
});

n_gpu_layers: -1 offloads all layers to the GPU. Use 0 for CPU-only inference, or a specific number to partially offload.

Chat Completion

var messages = [
    ChatMessage { role: "system", content: "You are a helpful assistant." },
    ChatMessage { role: "user", content: "Hello!" },
];
var result = model.chat(messages, GenerationParams { max_tokens: 256 }, null);
info("Response: ${result.text}");

Chat with Streaming

var result = model.chat_stream(messages, fn (token: String, is_final: bool) {
    print(token);
}, GenerationParams { max_tokens: 256 }, null);

Text Generation (Raw Prompt)

var result = model.generate("Once upon a time", GenerationParams {
    max_tokens: 128,
    temperature: 0.7,
}, null);
info(result.text);

Embeddings

var model = Model::load("embed_model", "./Qwen3-Embedding-0.6B-f16.gguf", ModelParams {
    n_gpu_layers: -1,
    use_mmap: true,
});
var embedding = model.embed("Hello World!", TensorType::f32, ContextParams {
    n_ctx: 1024,
    n_batch: 512,
});
info("Embedding dimension: ${embedding.size()}");
info("Sum: ${embedding.sum()}");

Batch Embeddings

var embeddings = model.embed_batch(
    ["first document", "second document", "third document"],
    TensorType::f32,
    null,
);

Tokenization

var tokens = model.tokenize("Hello world", true, false);
var text = model.detokenize(tokens, true, false);

API Reference

Core Types

Model

The main type for interacting with a loaded language model.

Static Methods:

| Method | Description |
|---|---|
| `Model::load(id, path, params?)` | Load a GGUF model from disk. Returns `null` on failure. |
| `Model::load_from_splits(id, paths, params?)` | Load a model split across multiple files. |
| `Model::get(id)` | Retrieve a previously loaded model by its ID. |
| `Model::quantize(input, output, params?)` | Convert a model to a different quantization format. |
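A minimal sketch combining the static methods above (file paths are placeholders, and quantization parameters are left at defaults):

```gcl
// Load once, then retrieve the same instance later by its ID.
var model = Model::load("my_model", "./model.gguf", null);
var same = Model::get("my_model");

// Convert an existing GGUF file to a smaller quantization format.
Model::quantize("./model-f16.gguf", "./model-q4_0.gguf", null);
```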

Instance Methods — Generation:

| Method | Description |
|---|---|
| `chat(messages, params?, ctx_params?)` | Chat completion from a list of messages. |
| `chat_stream(messages, callback, params?, ctx_params?)` | Streaming chat completion. |
| `generate(prompt, params?, ctx_params?)` | Raw text generation from a prompt. |
| `generate_stream(prompt, callback, params?, ctx_params?)` | Streaming text generation. |
| `format_chat(messages, add_assistant)` | Format messages using the model's chat template. |
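`format_chat` is useful for inspecting the exact prompt string a chat template produces. A sketch based on the signature above, assuming `add_assistant: true` appends the assistant turn header so the model continues as the assistant:

```gcl
var messages = [
    ChatMessage { role: "user", content: "Hello!" },
];
// See the raw, template-formatted prompt that chat() would feed the model.
var prompt = model.format_chat(messages, true);
info(prompt);
```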

Instance Methods — Embeddings:

| Method | Description |
|---|---|
| `embed(text, tensor_type, ctx_params?)` | Compute embedding vector for a single text. |
| `embed_batch(texts, tensor_type, ctx_params?)` | Compute embeddings for multiple texts (batched). |

Instance Methods — Tokenization:

| Method | Description |
|---|---|
| `tokenize(text, add_special, parse_special)` | Convert text to token IDs. |
| `detokenize(tokens, remove_special, unparse_special)` | Convert token IDs back to text. |
| `token_to_text(token)` | Convert a single token ID to text. |

Instance Methods — Info & Management:

| Method | Description |
|---|---|
| `info()` | Get comprehensive model metadata (`ModelInfo`). |
| `desc()` | Get a human-readable model description. |
| `meta(key)` | Get a specific metadata value by key. |
| `chat_template(name?)` | Get the model's chat template. |
| `perf()` | Get performance metrics from the last operation. |
| `save(path)` | Save model to a GGUF file. |
| `free()` | Explicitly release model resources. |
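A sketch of the info and management calls; the metadata key shown is an assumption based on the standard GGUF key naming convention:

```gcl
// Print the human-readable description, read one metadata entry,
// then release model resources explicitly once finished.
info(model.desc());
var arch = model.meta("general.architecture"); // assumed standard GGUF key
info("architecture: ${arch}");
model.free();
```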

LLM

Static utility functions for the llama.cpp runtime.

| Method | Description |
|---|---|
| `LLM::logging(enabled)` | Enable/disable llama.cpp internal logging. |
| `LLM::system_info()` | Get runtime environment info (CPU, GPU, SIMD, etc.). |
| `LLM::supports_gpu()` | Check if GPU acceleration is available. |
| `LLM::supports_mmap()` | Check if memory-mapped file loading is supported. |
| `LLM::max_devices()` | Get maximum number of usable GPUs. |
| `LLM::chat_builtin_templates()` | List all built-in chat templates. |
| `LLM::params_fit(path, mparams, cparams, margin, n_ctx_min)` | Auto-fit parameters to available device memory. |
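A sketch of probing the runtime before choosing how many layers to offload, using only the utilities listed above:

```gcl
// Decide between GPU offload and CPU-only inference at startup.
if (LLM::supports_gpu()) {
    info("GPU available, max devices: ${LLM::max_devices()}");
} else {
    info("CPU-only runtime: ${LLM::system_info()}");
}
```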

Parameter Types

ModelParams

Controls model loading behavior.

| Field | Type | Description |
|---|---|---|
| `n_gpu_layers` | `int?` | Layers to offload to GPU (`-1` = all, `0` = CPU only) |
| `split_mode` | `SplitMode?` | Multi-GPU distribution strategy |
| `main_gpu` | `int?` | Primary GPU index (default: 0) |
| `tensor_split` | `Array<float>?` | Per-GPU tensor row proportions |
| `vocab_only` | `bool?` | Load vocabulary only (no weights) |
| `use_mmap` | `bool?` | Use memory-mapped files for faster loading |
| `use_mlock` | `bool?` | Lock model in RAM (prevent swapping) |
| `check_tensors` | `bool?` | Validate tensor data during loading |

ContextParams

Controls inference context behavior and memory usage.

| Field | Type | Description |
|---|---|---|
| `n_ctx` | `int?` | Context window size (0 = from model) |
| `n_batch` | `int?` | Logical max batch size for decode |
| `n_ubatch` | `int?` | Physical max batch size |
| `n_threads` | `int?` | Threads for generation |
| `n_threads_batch` | `int?` | Threads for batch processing |
| `embeddings` | `bool?` | Extract embeddings with logits |
| `normalize` | `bool?` | L2-normalize embeddings |
| `flash_attn_type` | `FlashAttnType?` | Flash Attention configuration |
| `type_k` | `GgmlType?` | KV cache key data type |
| `type_v` | `GgmlType?` | KV cache value data type |
| `offload_kqv` | `bool?` | Offload KQV operations to GPU |
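For long contexts, the KV cache dominates memory use. A sketch of trading a little precision for memory by quantizing the cache (field names from the table above; `GgmlType::q8_0` uses roughly half the memory of f16):

```gcl
var ctx = ContextParams {
    n_ctx: 8192,            // larger context window
    type_k: GgmlType::q8_0, // quantize KV cache keys
    type_v: GgmlType::q8_0, // ...and values
    offload_kqv: true,      // keep cache operations on the GPU
};
var result = model.generate("Once upon a time", null, ctx);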

GenerationParams

High-level generation control.

| Field | Type | Description |
|---|---|---|
| `max_tokens` | `int?` | Maximum tokens to generate |
| `temperature` | `float?` | Sampling temperature (0.0 = deterministic) |
| `top_p` | `float?` | Nucleus sampling threshold |
| `top_k` | `int?` | Top-K sampling |
| `grammar` | `String?` | GBNF grammar to constrain output |
| `stop_sequences` | `Array<String>?` | Stop generation on these strings |
| `sampler` | `SamplerParams?` | Full sampler configuration |
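A sketch of constrained generation: a GBNF grammar restricting the output to a yes/no answer, plus a stop sequence. The grammar string and prompt are illustrative:

```gcl
var params = GenerationParams {
    max_tokens: 8,
    temperature: 0.0,                     // deterministic
    grammar: "root ::= \"yes\" | \"no\"", // GBNF: output must be exactly "yes" or "no"
    stop_sequences: ["\n"],
};
var result = model.generate("Is the sky blue? Answer yes or no: ", params, null);
info(result.text);
```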

SamplerParams

Fine-grained sampling control. All fields are optional.

| Field | Type | Description |
|---|---|---|
| `temperature` | `float?` | Temperature (0.0 = deterministic) |
| `top_k` | `int?` | Top-K (0 = disabled) |
| `top_p` | `float?` | Top-P / nucleus (1.0 = disabled) |
| `min_p` | `float?` | Min-P (0.0 = disabled) |
| `typical_p` | `float?` | Typical sampling (1.0 = disabled) |
| `penalty` | `PenaltyParams?` | Repetition penalties |
| `dry` | `DryParams?` | DRY (Don't Repeat Yourself) penalties |
| `mirostat` | `MirostatParams?` | Mirostat v1 parameters |
| `mirostat_v2` | `MirostatV2Params?` | Mirostat v2 parameters |
| `grammar` | `String?` | GBNF grammar constraint |
| `logit_bias` | `Array<LogitBias>?` | Per-token logit biases |
| `seed` | `int?` | Random seed for reproducibility |
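A sketch of fine-grained sampling via `GenerationParams.sampler`, using field names from the table above (values are illustrative defaults, not recommendations):

```gcl
var params = GenerationParams {
    max_tokens: 128,
    sampler: SamplerParams {
        temperature: 0.8,
        top_k: 40,    // keep only the 40 most likely tokens
        top_p: 0.9,   // then nucleus-filter to 90% cumulative probability
        min_p: 0.05,  // drop tokens below 5% of the top token's probability
        seed: 42,     // fixed seed for reproducible runs
    },
};
var result = model.generate("Once upon a time", params, null);
```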

Result Types

GenerationResult

| Field | Type | Description |
|---|---|---|
| `text` | `String` | Generated text |
| `tokens` | `Array<int>` | Generated token IDs |
| `n_tokens` | `int` | Number of tokens generated |
| `stop_reason` | `StopReason` | Why generation stopped (`max_tokens`, `eog_token`, `aborted`, `error`) |
| `perf` | `PerfData` | Performance metrics |
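Checking `stop_reason` distinguishes a natural end-of-generation from a truncated one. A sketch:

```gcl
var result = model.generate("Once upon a time", GenerationParams { max_tokens: 64 }, null);
if (result.stop_reason == StopReason::max_tokens) {
    // Output was cut off: raise max_tokens or continue from where it stopped.
    info("truncated after ${result.n_tokens} tokens");
}
```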

ChatMessage

| Field | Type | Description |
|---|---|---|
| `role` | `String` | Message role: `"system"`, `"user"`, or `"assistant"` |
| `content` | `String` | Message text |

ModelInfo

Detailed model metadata including architecture dimensions (n_embd, n_layer, n_head), vocabulary info, RoPE parameters, capabilities (has_encoder, has_decoder, is_recurrent), special tokens, and the full metadata map.

PerfData

| Field | Type | Description |
|---|---|---|
| `context.n_eval` | `int` | Tokens evaluated |
| `context.t_eval_ms` | `float` | Total evaluation time (ms) |
| `context.tokens_per_second` | `float` | Generation throughput |
| `context.prompt_tokens_per_second` | `float` | Prompt processing throughput |
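Throughput can be read directly or recomputed from the raw counters (tokens evaluated over evaluation time). A sketch, assuming `result` is a `GenerationResult`:

```gcl
var perf = result.perf;
info("throughput: ${perf.context.tokens_per_second} tok/s");
// Equivalent, from raw counters: n_eval tokens over t_eval_ms milliseconds.
var tps = perf.context.n_eval / (perf.context.t_eval_ms / 1000.0);
```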

Enums

| Enum | Values | Description |
|---|---|---|
| `SplitMode` | `none`, `layer`, `row` | Multi-GPU splitting strategy |
| `PoolingType` | `unspecified`, `none`, `mean`, `cls`, `last`, `rank` | Embedding pooling strategy |
| `AttentionType` | `unspecified`, `causal`, `non_causal` | Attention mechanism |
| `FlashAttnType` | `disabled`, `enabled_for_fa`, `enabled_for_all` | Flash Attention config |
| `GgmlType` | `f32`, `f16`, `q4_0`, `q8_0`, … | Quantization formats |
| `StopReason` | `max_tokens`, `eog_token`, `aborted`, `error` | Generation stop reason |
| `VocabType` | `none`, `spm`, `bpe`, `wpm`, `ugm`, `rwkv`, `plamo2` | Tokenizer type |
| `SamplerType` | `greedy`, `dist`, `top_k`, `top_p`, `min_p`, `temp`, … | Sampler identifiers |

Model Downloads

Download GGUF models from Hugging Face. Use a browser — wget/curl may not work with some download links.

Recommended models for testing: