# GreyCat AI Library

```gcl
@library("ai", "0.0.0");
```

Local AI inference library for GreyCat, powered by llama.cpp. Run GGUF-quantized models (Llama, Qwen, Mistral, etc.) directly within your GreyCat application; no external API calls required.
## Features

- **Local inference**: text generation, chat, embeddings, and tokenization
- **GPU acceleration**: CUDA, Metal, OpenCL, and Vulkan backends
- **Quantization**: 2-bit to 16-bit formats for memory efficiency
- **Chat templates**: built-in conversation formatting per model
- **Streaming**: token-by-token generation via callbacks
- **Embeddings**: dense vector representations for semantic search
- **LoRA adapters**: support for fine-tuned models
- **State save/load**: resumable inference sessions
## Quick Start

### Loading a Model

```gcl
var model = Model::load("my_model", "./model.gguf", ModelParams {
  n_gpu_layers: -1, // offload all layers to the GPU
});
```

`n_gpu_layers: -1` offloads all layers to the GPU. Use `0` for CPU-only inference, or a positive number to offload only that many layers.
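Since `Model::load` returns `null` on failure, the result is typically guarded before use. A minimal sketch of a guarded, CPU-only load (the path is illustrative):

```gcl
// CPU-only load; fail fast if the file is missing or not a valid GGUF.
var model = Model::load("cpu_model", "./model.gguf", ModelParams {
  n_gpu_layers: 0,
});
if (model == null) {
  info("failed to load ./model.gguf");
}
```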
### Chat Completion

```gcl
var messages = [
  ChatMessage { role: "system", content: "You are a helpful assistant." },
  ChatMessage { role: "user", content: "Hello!" },
];
var result = model.chat(messages, GenerationParams { max_tokens: 256 }, null);
info("Response: ${result.text}");
```
### Chat with Streaming

```gcl
var result = model.chat_stream(messages, fn (token: String, is_final: bool) {
  print(token);
}, GenerationParams { max_tokens: 256 }, null);
```
### Text Generation (Raw Prompt)

```gcl
var result = model.generate("Once upon a time", GenerationParams {
  max_tokens: 128,
  temperature: 0.7,
}, null);
info(result.text);
```
### Embeddings

```gcl
var model = Model::load("embed_model", "./Qwen3-Embedding-0.6B-f16.gguf", ModelParams {
  n_gpu_layers: -1,
  use_mmap: true,
});
var embedding = model.embed("Hello World!", TensorType::f32, ContextParams {
  n_ctx: 1024,
  n_batch: 512,
});
info("Embedding dimension: ${embedding.size()}");
info("Sum: ${embedding.sum()}");
```
### Batch Embeddings

```gcl
var embeddings = model.embed_batch(
  ["first document", "second document", "third document"],
  TensorType::f32,
  null,
);
```
### Tokenization

```gcl
var tokens = model.tokenize("Hello world", true, false);
var text = model.detokenize(tokens, true, false);
```
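Token counts from `tokenize` are also handy for checking that a prompt fits the context window before generating. A sketch (the `size()` accessor mirrors the one used on embeddings above):

```gcl
var prompt = "Summarize the following report: ...";
var tokens = model.tokenize(prompt, true, false); // add_special = true, parse_special = false
info("Prompt uses ${tokens.size()} tokens");
```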
## API Reference

### Core Types

#### Model

The main type for interacting with a loaded language model.

**Static Methods:**
| Method | Description |
|---|---|
| `Model::load(id, path, params?)` | Load a GGUF model from disk. Returns `null` on failure. |
| `Model::load_from_splits(id, paths, params?)` | Load a model split across multiple files. |
| `Model::get(id)` | Retrieve a previously loaded model by its ID. |
| `Model::quantize(input, output, params?)` | Convert a model to a different quantization format. |
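The static helpers compose: load a model once under an ID, then retrieve it elsewhere with `Model::get`. A sketch (both paths are illustrative, and the quantize call passes `null` to use default quantization parameters):

```gcl
// Retrieve a model that was loaded earlier in the application.
var model = Model::get("my_model");

// Re-quantize an f16 model into a smaller Q8_0 file on disk.
Model::quantize("./model-f16.gguf", "./model-q8_0.gguf", null);
```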
**Instance Methods — Generation:**

| Method | Description |
|---|---|
| `chat(messages, params?, ctx_params?)` | Chat completion from a list of messages. |
| `chat_stream(messages, callback, params?, ctx_params?)` | Streaming chat completion. |
| `generate(prompt, params?, ctx_params?)` | Raw text generation from a prompt. |
| `generate_stream(prompt, callback, params?, ctx_params?)` | Streaming text generation. |
| `format_chat(messages, add_assistant)` | Format messages using the model’s chat template. |
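`format_chat` is useful for inspecting exactly what the model will see after template expansion. A sketch reusing the message shape from the Quick Start (`add_assistant = true` appends the assistant turn prefix):

```gcl
var messages = [
  ChatMessage { role: "system", content: "You are a helpful assistant." },
  ChatMessage { role: "user", content: "Hello!" },
];
var prompt = model.format_chat(messages, true);
info(prompt); // the fully formatted prompt string
```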
**Instance Methods — Embeddings:**

| Method | Description |
|---|---|
| `embed(text, tensor_type, ctx_params?)` | Compute the embedding vector for a single text. |
| `embed_batch(texts, tensor_type, ctx_params?)` | Compute embeddings for multiple texts (batched). |
**Instance Methods — Tokenization:**

| Method | Description |
|---|---|
| `tokenize(text, add_special, parse_special)` | Convert text to token IDs. |
| `detokenize(tokens, remove_special, unparse_special)` | Convert token IDs back to text. |
| `token_to_text(token)` | Convert a single token ID to text. |
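To see how a string is segmented, each token ID can be mapped back with `token_to_text`. A sketch (the `for (index, value in ...)` iteration form over the token array is assumed here):

```gcl
var tokens = model.tokenize("Hello world", true, false);
for (i, tok in tokens) {
  info("${tok} -> ${model.token_to_text(tok)}");
}
```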
**Instance Methods — Info & Management:**

| Method | Description |
|---|---|
| `info()` | Get comprehensive model metadata (`ModelInfo`). |
| `desc()` | Get a human-readable model description. |
| `meta(key)` | Get a specific metadata value by key. |
| `chat_template(name?)` | Get the model’s chat template. |
| `perf()` | Get performance metrics from the last operation. |
| `save(path)` | Save the model to a GGUF file. |
| `free()` | Explicitly release model resources. |
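A typical inspection-and-cleanup sequence, sketched below (`general.architecture` is a standard GGUF metadata key, used here purely for illustration):

```gcl
var model = Model::get("my_model");
info(model.desc()); // human-readable summary
var arch = model.meta("general.architecture");
info("architecture: ${arch}");
model.free(); // release weights explicitly once no longer needed
```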
#### LLM

Static utility functions for the llama.cpp runtime.

| Method | Description |
|---|---|
| `LLM::logging(enabled)` | Enable/disable llama.cpp internal logging. |
| `LLM::system_info()` | Get runtime environment info (CPU, GPU, SIMD, etc.). |
| `LLM::supports_gpu()` | Check whether GPU acceleration is available. |
| `LLM::supports_mmap()` | Check whether memory-mapped file loading is supported. |
| `LLM::max_devices()` | Get the maximum number of usable GPUs. |
| `LLM::chat_builtin_templates()` | List all built-in chat templates. |
| `LLM::params_fit(path, mparams, cparams, margin, n_ctx_min)` | Auto-fit parameters to available device memory. |
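These utilities make it easy to choose GPU or CPU loading at startup. A sketch (the path is illustrative):

```gcl
LLM::logging(false);      // silence llama.cpp internal logs
info(LLM::system_info()); // CPU/GPU/SIMD capabilities

// Offload all layers when a GPU backend is available, else stay on CPU.
var layers = 0;
if (LLM::supports_gpu()) {
  layers = -1;
}
var model = Model::load("auto_model", "./model.gguf", ModelParams {
  n_gpu_layers: layers,
});
```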
### Parameter Types

#### ModelParams

Controls model loading behavior.

| Field | Type | Description |
|---|---|---|
| `n_gpu_layers` | `int?` | Layers to offload to GPU (`-1` = all, `0` = CPU only) |
| `split_mode` | `SplitMode?` | Multi-GPU distribution strategy |
| `main_gpu` | `int?` | Primary GPU index (default: 0) |
| `tensor_split` | `Array<float>?` | Per-GPU tensor row proportions |
| `vocab_only` | `bool?` | Load vocabulary only (no weights) |
| `use_mmap` | `bool?` | Use memory-mapped files for faster loading |
| `use_mlock` | `bool?` | Lock model in RAM (prevent swapping) |
| `check_tensors` | `bool?` | Validate tensor data during loading |
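Since every field is optional, only deviations from the defaults need to be set. For example, a vocabulary-only load is enough when a model is used purely for `tokenize`/`detokenize`; a sketch (path illustrative):

```gcl
var tok_model = Model::load("tokenizer_only", "./model.gguf", ModelParams {
  vocab_only: true, // skip loading the weights entirely
  use_mmap: true,
});
```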
#### ContextParams

Controls inference context behavior and memory usage.

| Field | Type | Description |
|---|---|---|
| `n_ctx` | `int?` | Context window size (`0` = use the model’s value) |
| `n_batch` | `int?` | Logical maximum batch size for decode |
| `n_ubatch` | `int?` | Physical maximum batch size |
| `n_threads` | `int?` | Threads for generation |
| `n_threads_batch` | `int?` | Threads for batch processing |
| `embeddings` | `bool?` | Extract embeddings together with logits |
| `normalize` | `bool?` | L2-normalize embeddings |
| `flash_attn_type` | `FlashAttnType?` | Flash Attention configuration |
| `type_k` | `GgmlType?` | KV-cache key data type |
| `type_v` | `GgmlType?` | KV-cache value data type |
| `offload_kqv` | `bool?` | Offload KQV operations to GPU |
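A common memory-saving combination is a bounded context with a quantized KV cache. A sketch (values are illustrative; note that llama.cpp may additionally require Flash Attention for some quantized V-cache types):

```gcl
var ctx = ContextParams {
  n_ctx: 4096,            // cap the context window
  n_batch: 512,
  type_k: GgmlType::q8_0, // 8-bit KV-cache keys
  type_v: GgmlType::q8_0, // 8-bit KV-cache values
  offload_kqv: true,
};
var result = model.generate("Hello", GenerationParams { max_tokens: 64 }, ctx);
```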
#### GenerationParams

High-level generation control.

| Field | Type | Description |
|---|---|---|
| `max_tokens` | `int?` | Maximum number of tokens to generate |
| `temperature` | `float?` | Sampling temperature (`0.0` = deterministic) |
| `top_p` | `float?` | Nucleus sampling threshold |
| `top_k` | `int?` | Top-K sampling |
| `grammar` | `String?` | GBNF grammar to constrain output |
| `stop_sequences` | `Array<String>?` | Stop generation on any of these strings |
| `sampler` | `SamplerParams?` | Full sampler configuration |
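Stop sequences and GBNF grammars combine well for constrained output. A sketch that forces a one-word yes/no answer:

```gcl
var result = model.generate("Is the sky blue? Answer: ", GenerationParams {
  max_tokens: 8,
  temperature: 0.0,                     // deterministic
  grammar: "root ::= \"yes\" | \"no\"", // GBNF: only these two outputs
  stop_sequences: ["\n"],
}, null);
info(result.text);
```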
#### SamplerParams

Fine-grained sampling control. All fields are optional.

| Field | Type | Description |
|---|---|---|
| `temperature` | `float?` | Temperature (`0.0` = deterministic) |
| `top_k` | `int?` | Top-K (`0` = disabled) |
| `top_p` | `float?` | Top-P / nucleus (`1.0` = disabled) |
| `min_p` | `float?` | Min-P (`0.0` = disabled) |
| `typical_p` | `float?` | Typical sampling (`1.0` = disabled) |
| `penalty` | `PenaltyParams?` | Repetition penalties |
| `dry` | `DryParams?` | DRY (Don’t Repeat Yourself) penalties |
| `mirostat` | `MirostatParams?` | Mirostat v1 parameters |
| `mirostat_v2` | `MirostatV2Params?` | Mirostat v2 parameters |
| `grammar` | `String?` | GBNF grammar constraint |
| `logit_bias` | `Array<LogitBias>?` | Per-token logit biases |
| `seed` | `int?` | Random seed for reproducibility |
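A full sampler configuration plugs into `GenerationParams.sampler`. A sketch with reproducible min-p sampling (values are illustrative):

```gcl
var params = GenerationParams {
  max_tokens: 256,
  sampler: SamplerParams {
    temperature: 0.8,
    min_p: 0.05, // drop tokens under 5% of the top token's probability
    top_k: 0,    // disabled, in favor of min-p
    seed: 42,    // reproducible runs
  },
};
var result = model.generate("Write a haiku about graphs.", params, null);
```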
### Result Types

#### GenerationResult

| Field | Type | Description |
|---|---|---|
| `text` | `String` | Generated text |
| `tokens` | `Array<int>` | Generated token IDs |
| `n_tokens` | `int` | Number of tokens generated |
| `stop_reason` | `StopReason` | Why generation stopped (`max_tokens`, `eog_token`, `aborted`, `error`) |
| `perf` | `PerfData` | Performance metrics |
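Checking `stop_reason` distinguishes a natural end of generation from a truncated one. A sketch (the `perf.context.*` fields follow the PerfData layout):

```gcl
var result = model.generate("Once upon a time", GenerationParams { max_tokens: 128 }, null);
if (result.stop_reason == StopReason::max_tokens) {
  info("Output truncated after ${result.n_tokens} tokens");
}
info("Throughput: ${result.perf.context.tokens_per_second} tok/s");
```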
#### ChatMessage

| Field | Type | Description |
|---|---|---|
| `role` | `String` | Message role: `"system"`, `"user"`, or `"assistant"` |
| `content` | `String` | Message text |
#### ModelInfo

Detailed model metadata, including architecture dimensions (`n_embd`, `n_layer`, `n_head`), vocabulary info, RoPE parameters, capabilities (`has_encoder`, `has_decoder`, `is_recurrent`), special tokens, and the full metadata map.
#### PerfData

| Field | Type | Description |
|---|---|---|
| `context.n_eval` | `int` | Tokens evaluated |
| `context.t_eval_ms` | `float` | Total evaluation time (ms) |
| `context.tokens_per_second` | `float` | Generation throughput |
| `context.prompt_tokens_per_second` | `float` | Prompt-processing throughput |
### Enums

| Enum | Values | Description |
|---|---|---|
| `SplitMode` | `none`, `layer`, `row` | Multi-GPU splitting strategy |
| `PoolingType` | `unspecified`, `none`, `mean`, `cls`, `last`, `rank` | Embedding pooling strategy |
| `AttentionType` | `unspecified`, `causal`, `non_causal` | Attention mechanism |
| `FlashAttnType` | `disabled`, `enabled_for_fa`, `enabled_for_all` | Flash Attention config |
| `GgmlType` | `f32`, `f16`, `q4_0`, `q8_0`, … | Quantization formats |
| `StopReason` | `max_tokens`, `eog_token`, `aborted`, `error` | Generation stop reason |
| `VocabType` | `none`, `spm`, `bpe`, `wpm`, `ugm`, `rwkv`, `plamo2` | Tokenizer type |
| `SamplerType` | `greedy`, `dist`, `top_k`, `top_p`, `min_p`, `temp`, … | Sampler identifiers |
## Model Downloads

Download GGUF models from Hugging Face. Use a browser; `wget`/`curl` may not work with some download links.

Recommended models for testing:

- Embeddings: Qwen3-Embedding-0.6B-GGUF (f16)
- Chat: Qwen3-1.7B-GGUF (Q8_0)