llamadart is a high-performance Dart and Flutter plugin for llama.cpp. It allows you to run Large Language Models (LLMs) locally using GGUF models across all major platforms with minimal setup.
- High Performance: Powered by llama.cpp's optimized C++ kernels.
- Zero Configuration: Uses the modern Pure Native Asset mechanism; no manual build scripts or platform folders required.
- Cross-Platform: Full support for Android, iOS, macOS, Linux, and Windows.
- GPU Acceleration:
  - Apple: Metal (macOS/iOS)
  - Android/Linux/Windows: Vulkan
- Multimodal Support: Run vision and audio models (LLaVA, Gemma 3, Qwen2-VL) with integrated media processing.
- Resumable Downloads: Robust background-safe model downloads with parallel chunking and partial-file resume tracking.
- LoRA Support: Apply fine-tuned adapters (GGUF) dynamically at runtime.
- Web Support: Web backend router with WebGPU bridge support and WASM fallback.
- Dart-First API: Streamlined architecture with decoupled backends.
- Split Logging Control: Configure the Dart-side logger and native backend logs independently.
- High Coverage: CI enforces >=70% coverage on maintainable core code.
llamadart uses a modern, decoupled architecture designed for flexibility and platform independence:
- `LlamaEngine`: The primary high-level orchestrator. It handles model lifecycle, tokenization, and chat templating, and manages the inference stream.
- `ChatSession`: A stateful wrapper for `LlamaEngine` that automatically manages conversation history and system prompts, and enforces context window limits (sliding window).
- `LlamaBackend`: A platform-agnostic interface with a default `LlamaBackend()` factory constructor that auto-selects the native (llama.cpp) or web (WebGPU bridge first, WASM fallback) implementation.
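At a glance, the three layers compose like this (a minimal sketch using the same calls as the examples below):

```dart
import 'package:llamadart/llamadart.dart';

void main() async {
  final backend = LlamaBackend();      // auto-selects the native or web implementation
  final engine = LlamaEngine(backend); // model lifecycle, templating, inference stream
  final session = ChatSession(engine); // conversation state on top of the engine

  // Load a model and chat via `session.create(...)` (see the examples below), then:
  await engine.dispose();
}
```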
| Platform | Architecture(s) | GPU Backend | Status |
|---|---|---|---|
| macOS | arm64, x86_64 | Metal | ✅ Tested |
| iOS | arm64 (Device), arm64/x86_64 (Sim) | Metal (Device), CPU (Sim) | ✅ Tested |
| Android | arm64-v8a, x86_64 | Vulkan | ✅ Tested |
| Linux | arm64, x86_64 | Vulkan | ✅ Tested |
| Windows | x64 | Vulkan | ✅ Tested |
| Web | WASM / WebGPU Bridge | CPU / Experimental WebGPU | ✅ Tested (WASM) |
The default web backend uses the bridge runtime (`WebGpuLlamaBackend`) for both WebGPU and CPU execution paths.
Current limitations:
- Web mode is currently experimental and depends on an external JS bridge runtime.
- Bridge API contract: WebGPU bridge contract.
- Prebuilt web bridge assets are published from `leehack/llama-web-bridge` to `leehack/llama-web-bridge-assets`.
- `example/chat_app` uses local bridge files first and falls back to jsDelivr assets when local assets are missing.
- Bridge model loading now uses browser Cache Storage when `useCache` is true (enabled by default in the llamadart web backend), so repeat loads of the same model URL can avoid full re-download.
- To self-host pinned assets at build time: `WEBGPU_BRIDGE_ASSETS_TAG=<tag> ./scripts/fetch_webgpu_bridge_assets.sh`.
- The fetch script applies a Safari compatibility patch by default for universal browser use (`WEBGPU_BRIDGE_PATCH_SAFARI_COMPAT=1`, `WEBGPU_BRIDGE_MIN_SAFARI_VERSION=170400`).
- The same patch flow also updates legacy bridge chunk assembly logic to avoid Safari stream-reader buffer reuse issues during model downloads.
- `example/chat_app/web/index.html` applies the same Safari compatibility patch at runtime for bridge core loading (including CDN fallback paths).
- Bridge wasm build/publish CI and runtime implementation are maintained in `leehack/llama-web-bridge`.
- Current bridge browser targets in this repo: Chrome >= 128, Firefox >= 129, Safari >= 17.4.
- Safari GPU execution uses a compatibility gate: legacy bridge assets are forced to CPU by default, while adaptive bridge assets can probe/cap GPU layers and auto-fallback to CPU when generation looks unstable.
- You can bypass the legacy safeguard with `window.__llamadartAllowSafariWebGpu = true` before model load.
- `loadMultimodalProjector` is available on web when using URL-based model/mmproj assets.
- `supportsVision` / `supportsAudio` reflect loaded projector capabilities on web.
- LoRA runtime adapter APIs are not supported on web in the current implementation.
- Changing log level via `setLogLevel` / `setNativeLogLevel` applies on the next model load.
If your app targets both native and web, gate feature toggles by platform/capability checks.
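For a Flutter app, a minimal sketch of such gating, assuming `kIsWeb` from `package:flutter/foundation.dart` (the `FeatureFlags` helper here is hypothetical, not part of llamadart):

```dart
import 'package:flutter/foundation.dart' show kIsWeb;

/// Hypothetical helper: decide which optional features to expose,
/// based on the platform notes above.
class FeatureFlags {
  // LoRA runtime adapters are native-only in the current implementation.
  static bool get loraEnabled => !kIsWeb;

  // On web, prefer URL-based model/mmproj assets and byte-based media content.
  static bool get preferUrlAssets => kIsWeb;
}
```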
Add llamadart to your `pubspec.yaml`:
```yaml
dependencies:
  llamadart: ^0.5.4
```

llamadart leverages the Dart Native Assets (build hooks) system. When you run your app for the first time (`dart run` or `flutter run`), the package automatically:
- Detects your target platform and architecture.
- Downloads the appropriate pre-compiled binary from GitHub.
- Bundles it seamlessly into your application.
No manual binary downloads, CMake configuration, or platform-specific project changes are needed.
If you are upgrading from 0.4.x, note the following.
High-impact changes:
- `ChatSession` now centers on `create(...)` and streams `LlamaCompletionChunk`.
- `LlamaChatMessage` named constructors were standardized:
  - `LlamaChatMessage.text(...)` -> `LlamaChatMessage.fromText(...)`
  - `LlamaChatMessage.multimodal(...)` -> `LlamaChatMessage.withContent(...)`
- `ModelParams.logLevel` was removed; logging is now controlled at engine level via:
  - `setDartLogLevel(...)`
  - `setNativeLogLevel(...)`
- Root exports changed; previously exported internals such as `ToolRegistry`, `LlamaTokenizer`, and `ChatTemplateProcessor` are no longer part of the public package surface.
- Custom backend implementations must match the updated `LlamaBackend` interface (including `getVramInfo` and the updated `applyChatTemplate`).
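For orientation, a minimal sketch of the renamed calls (old names shown in comments), using the same APIs as the examples below:

```dart
import 'package:llamadart/llamadart.dart';

// 0.4.x: LlamaChatMessage.text(...)  ->  0.5.x: LlamaChatMessage.fromText(...)
final greeting = LlamaChatMessage.fromText(
  role: LlamaChatRole.user,
  text: 'Hello!',
);

// 0.4.x: ModelParams.logLevel  ->  0.5.x: engine-level log control
Future<void> configureLogging(LlamaEngine engine) async {
  await engine.setDartLogLevel(LlamaLogLevel.info);
  await engine.setNativeLogLevel(LlamaLogLevel.warn);
}
```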
The easiest way to get started is by using the default `LlamaBackend`.
```dart
import 'package:llamadart/llamadart.dart';

void main() async {
  // Automatically selects the Native or Web backend
  final engine = LlamaEngine(LlamaBackend());

  try {
    // Initialize with a local GGUF model
    await engine.loadModel('path/to/model.gguf');

    // Generate text (streaming)
    await for (final token in engine.generate('The capital of France is')) {
      print(token);
    }
  } finally {
    // CRITICAL: Always dispose the engine to release native resources
    await engine.dispose();
  }
}
```

Use `ChatSession` for most chat applications. It automatically manages conversation history and system prompts, and handles context window limits.
```dart
import 'dart:io';

import 'package:llamadart/llamadart.dart';

void main() async {
  final engine = LlamaEngine(LlamaBackend());

  try {
    await engine.loadModel('model.gguf');

    // Create a session with a system prompt
    final session = ChatSession(
      engine,
      systemPrompt: 'You are a helpful assistant.',
    );

    // Send a message
    await for (final chunk in session.create(
      [LlamaTextContent('What is the capital of France?')],
    )) {
      stdout.write(chunk.choices.first.delta.content ?? '');
    }
  } finally {
    await engine.dispose();
  }
}
```

llamadart supports intelligent tool calling where the model can use external functions to help it answer questions.
```dart
final tools = [
  ToolDefinition(
    name: 'get_weather',
    description: 'Get the current weather',
    parameters: [
      ToolParam.string('location', description: 'City name', required: true),
    ],
    handler: (params) async {
      final location = params.getRequiredString('location');
      return 'It is 22°C and sunny in $location';
    },
  ),
];

final session = ChatSession(engine);

// Pass tools per-request
await for (final chunk in session.create(
  [LlamaTextContent("how's the weather in London?")],
  tools: tools,
)) {
  final delta = chunk.choices.first.delta;
  if (delta.content != null) stdout.write(delta.content);
}
```

Notes:
- Built-in template handlers automatically select model-specific tool-call grammar and parser behavior; you usually do not need to set `GenerationParams.grammar` manually for normal tool use.
- Some handlers use lazy grammar activation (triggered when a tool-call prefix appears) to match llama.cpp behavior.
- If you implement a custom handler grammar, prefer Dart raw strings (`r'''...'''`) for GBNF blocks to avoid escaping bugs (see the sketch below).
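For example, a GBNF fragment kept in a raw string might look like this (the grammar itself is illustrative and not tied to any particular model):

```dart
// Illustrative GBNF kept in a Dart raw string: r'''...''' preserves
// backslashes and quotes exactly as written.
const toolCallGrammar = r'''
root   ::= "{" ws "\"name\"" ws ":" ws string ws "}"
string ::= "\"" [a-zA-Z0-9_ ]* "\""
ws     ::= [ \t\n]*
''';
```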
If you need behavior for a model-specific template that is not built in yet, you can register your own handler and/or template override.
```dart
import 'package:llamadart/llamadart.dart';

class MyHandler extends ChatTemplateHandler {
  @override
  ChatFormat get format => ChatFormat.generic;

  @override
  List<String> get additionalStops => const [];

  @override
  LlamaChatTemplateResult render({
    required String templateSource,
    required List<LlamaChatMessage> messages,
    required Map<String, String> metadata,
    bool addAssistant = true,
    List<ToolDefinition>? tools,
    bool enableThinking = true,
  }) {
    final prompt = messages.map((m) => m.content).join('\n');
    return LlamaChatTemplateResult(prompt: prompt, format: format.index);
  }

  @override
  ChatParseResult parse(
    String output, {
    bool isPartial = false,
    bool parseToolCalls = true,
    bool thinkingForcedOpen = false,
  }) {
    return ChatParseResult(content: output.trim());
  }

  @override
  String? buildGrammar(List<ToolDefinition>? tools) => null;
}

void configureTemplateRouting() {
  // 1) Register a custom handler
  ChatTemplateEngine.registerHandler(
    id: 'my-handler',
    handler: MyHandler(),
    matcher: (ctx) =>
        (ctx.metadata['general.name'] ?? '').contains('MyModel'),
  );

  // 2) Register a global template override
  ChatTemplateEngine.registerTemplateOverride(
    id: 'my-template-override',
    templateSource: '{{ messages[0]["content"] }}',
    matcher: (ctx) => ctx.hasTools,
  );
}

Future<void> usePerCallOverride(LlamaEngine engine) async {
  final template = await engine.chatTemplate(
    [
      const LlamaChatMessage.fromText(
        role: LlamaChatRole.user,
        text: 'hello',
      ),
    ],
    customTemplate: '{{ "CUSTOM:" ~ messages[0]["content"] }}',
    customHandlerId: 'my-handler',
  );
  print(template.prompt);
}
```

Use separate log levels for Dart and native output when debugging:
```dart
import 'package:llamadart/llamadart.dart';

final engine = LlamaEngine(LlamaBackend());

// Dart-side logs (template routing, parser diagnostics, etc.)
await engine.setDartLogLevel(LlamaLogLevel.info);

// Native llama.cpp / ggml logs
await engine.setNativeLogLevel(LlamaLogLevel.warn);

// Convenience: set both at once
await engine.setLogLevel(LlamaLogLevel.none);
```

llamadart supports multimodal models (vision and audio) using `LlamaChatMessage.withContent`.
```dart
import 'dart:io';

import 'package:llamadart/llamadart.dart';

void main() async {
  final engine = LlamaEngine(LlamaBackend());

  try {
    await engine.loadModel('vision-model.gguf');
    await engine.loadMultimodalProjector('mmproj.gguf');

    final session = ChatSession(engine);

    // Create a multimodal message
    final messages = [
      LlamaChatMessage.withContent(
        role: LlamaChatRole.user,
        content: [
          LlamaImageContent(path: 'image.jpg'),
          LlamaTextContent('What is in this image?'),
        ],
      ),
    ];

    // Use the stateless engine.create for one-off multimodal requests
    final response = engine.create(messages);
    await for (final chunk in response) {
      stdout.write(chunk.choices.first.delta.content ?? '');
    }
  } finally {
    await engine.dispose();
  }
}
```

Web-specific note:
- Load model/mmproj with URL-based assets (`loadModelFromUrl` + a URL projector).
- For user-picked browser files, send media as bytes (`LlamaImageContent(bytes: ...)`, `LlamaAudioContent(bytes: ...)`) rather than local file paths, as sketched below.
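A minimal sketch of the bytes-based variant, assuming the picked file's bytes are available as a `Uint8List` (how you obtain them, e.g. via a file picker, is app-specific):

```dart
import 'dart:typed_data';

import 'package:llamadart/llamadart.dart';

// `imageBytes` would come from a browser file picker or an HTTP fetch.
List<LlamaChatMessage> buildImageQuestion(Uint8List imageBytes) {
  return [
    LlamaChatMessage.withContent(
      role: LlamaChatRole.user,
      content: [
        LlamaImageContent(bytes: imageBytes),
        LlamaTextContent('What is in this image?'),
      ],
    ),
  ];
}
```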
These models use a unique architecture where the Beginning-of-Sequence (BOS) and End-of-Sequence (EOS) tokens are identical. llamadart includes a specialized handler for these models that:

- Disables Auto-BOS: Prevents the model from stopping immediately upon generation.
- Manual Templates: Automatically applies the required `Question:` / `Answer:` format if the model metadata is missing a chat template.
- Stop Sequences: Injects `Question:` as a stop sequence to prevent rambling in multi-turn conversations.
Since llamadart allocates significant native memory and manages background worker Isolates/Threads, it is essential to manage its lifecycle correctly.
- Explicit Disposal: Always call `await engine.dispose()` when you are finished with an engine instance.
- Native Stability: On mobile and desktop, failing to dispose can lead to "hanging" background processes or memory pressure.
- Hot Restart Support: In Flutter, placing the engine inside a `Provider` or `State` and calling `dispose()` in the appropriate lifecycle method ensures stability across Hot Restarts.
```dart
@override
void dispose() {
  _engine.dispose();
  super.dispose();
}
```

llamadart supports applying multiple LoRA adapters dynamically at runtime.
- Dynamic Scaling: Adjust the strength (`scale`) of each adapter on the fly.
- Isolate-Safe: Native adapters are managed in a background Isolate to prevent UI jank.
- Efficient: Multiple LoRAs share the memory of a single base model.
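As a purely illustrative sketch of the flow (the adapter calls in the comments are hypothetical placeholders, not confirmed llamadart API):

```dart
// The adapter calls below are hypothetical placeholders; check the package
// API reference for the actual method names.
Future<void> useLoraAdapter(LlamaEngine engine) async {
  await engine.loadModel('base-model.gguf');
  // e.g. await engine.addLoraAdapter('style-adapter.gguf', scale: 0.8); // hypothetical
  // Adjust the adapter's `scale` at runtime to blend its influence.
}
```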
Check out our LoRA Training Notebook to learn how to train and convert your own adapters.
This project maintains a high standard of quality with >=70% line coverage on maintainable `lib/` code (auto-generated files marked with `// coverage:ignore-file` are excluded).
- Multi-Platform Testing: `dart test` runs VM and Chrome-compatible suites automatically.
- Local-Only Scenarios: Slow E2E tests are tagged `local-only` and skipped by default.
- CI/CD: Automatic analysis, linting, and cross-platform test execution on every PR.
```bash
# Run default test suite (VM + Chrome-compatible tests)
dart test

# Run local-only E2E scenarios
dart test --run-skipped -t local-only

# Run VM tests with coverage
dart test -p vm --coverage=coverage

# Format lcov for maintainable code (respects // coverage:ignore-file)
dart pub global run coverage:format_coverage --lcov --in=coverage/test --out=coverage/lcov.info --report-on=lib --check-ignore

# Enforce >=70% threshold
dart run tool/testing/check_lcov_threshold.dart coverage/lcov.info 70
```

Contributions are welcome! Please see CONTRIBUTING.md for architecture details and maintainer instructions for building native binaries.
This project is licensed under the MIT License - see the LICENSE file for details.