
llamadart


llamadart is a high-performance Dart and Flutter plugin for llama.cpp. It allows you to run Large Language Models (LLMs) locally using GGUF models across all major platforms with minimal setup.

✨ Features

  • 🚀 High Performance: Powered by llama.cpp's optimized C++ kernels.
  • 🛠️ Zero Configuration: Uses the modern Pure Native Asset mechanism; no manual build scripts or platform folders required.
  • 📱 Cross-Platform: Full support for Android, iOS, macOS, Linux, and Windows.
  • ⚡ GPU Acceleration:
    • Apple: Metal (macOS/iOS)
    • Android/Linux/Windows: Vulkan
  • 🖼️ Multimodal Support: Run vision and audio models (LLaVA, Gemma 3, Qwen2-VL) with integrated media processing.
  • ⏬ Resumable Downloads: Robust background-safe model downloads with parallel chunking and partial-file resume tracking.
  • LoRA Support: Apply fine-tuned adapters (GGUF) dynamically at runtime.
  • 🌐 Web Support: Web backend router with WebGPU bridge support and WASM fallback.
  • 💎 Dart-First API: Streamlined architecture with decoupled backends.
  • 🔇 Split Logging Control: Configure Dart-side logger and native backend logs independently.
  • 🧪 High Coverage: CI enforces >=70% coverage on maintainable core code.

πŸ—οΈ Architecture

llamadart uses a modern, decoupled architecture designed for flexibility and platform independence:

  • LlamaEngine: The primary high-level orchestrator. It handles model lifecycle, tokenization, chat templating, and manages the inference stream.
  • ChatSession: A stateful wrapper for LlamaEngine that automatically manages conversation history, system prompts, and enforces context window limits (sliding window).
  • LlamaBackend: A platform-agnostic interface with a default LlamaBackend() factory constructor that auto-selects native (llama.cpp) or web (WebGPU bridge first, WASM fallback) implementations.

🚀 Quick Start

| Platform | Architecture(s) | GPU Backend | Status |
| --- | --- | --- | --- |
| macOS | arm64, x86_64 | Metal | ✅ Tested |
| iOS | arm64 (Device), arm64/x86_64 (Sim) | Metal (Device), CPU (Sim) | ✅ Tested |
| Android | arm64-v8a, x86_64 | Vulkan | ✅ Tested |
| Linux | arm64, x86_64 | Vulkan | ✅ Tested |
| Windows | x64 | Vulkan | ✅ Tested |
| Web | WASM / WebGPU Bridge | CPU / Experimental WebGPU | ✅ Tested (WASM) |

🌐 Web Backend Notes (Router)

The default web backend uses the bridge runtime (WebGpuLlamaBackend) for both WebGPU and CPU execution paths.

Current limitations:

  • Web mode is currently experimental and depends on an external JS bridge runtime.
  • Bridge API contract: WebGPU bridge contract.
  • Prebuilt web bridge assets are published from leehack/llama-web-bridge to leehack/llama-web-bridge-assets.
  • example/chat_app uses local bridge files first and falls back to jsDelivr assets when local assets are missing.
  • Bridge model loading now uses browser Cache Storage when useCache is true (enabled by default in llamadart web backend), so repeat loads of the same model URL can avoid full re-download.
  • To self-host pinned assets at build time: WEBGPU_BRIDGE_ASSETS_TAG=<tag> ./scripts/fetch_webgpu_bridge_assets.sh.
  • The fetch script applies a Safari compatibility patch by default for universal browser use (WEBGPU_BRIDGE_PATCH_SAFARI_COMPAT=1, WEBGPU_BRIDGE_MIN_SAFARI_VERSION=170400).
  • The same patch flow also updates legacy bridge chunk assembly logic to avoid Safari stream-reader buffer reuse issues during model downloads.
  • example/chat_app/web/index.html applies the same Safari compatibility patch at runtime for bridge core loading (including CDN fallback paths).
  • Bridge wasm build/publish CI and runtime implementation are maintained in leehack/llama-web-bridge.
  • Current bridge browser targets in this repo: Chrome >= 128, Firefox >= 129, Safari >= 17.4.
  • Safari GPU execution uses a compatibility gate: legacy bridge assets are forced to CPU by default, while adaptive bridge assets can probe/cap GPU layers and auto-fallback to CPU when generation looks unstable.
  • You can bypass the legacy safeguard with window.__llamadartAllowSafariWebGpu = true before model load.
  • loadMultimodalProjector is available on web when using URL-based model/mmproj assets.
  • supportsVision / supportsAudio reflect loaded projector capabilities on web.
  • LoRA runtime adapter APIs are not supported on web in the current implementation.
  • Changing log level via setLogLevel/setNativeLogLevel applies on the next model load.

If your app targets both native and web, gate feature toggles by platform/capability checks.
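For example, an app-level feature-flag helper might look roughly like this (a minimal sketch: FeatureFlags and buildFeatureFlags are hypothetical app-side names, kIsWeb comes from Flutter's foundation library, and it assumes the supportsVision / supportsAudio capabilities mentioned above are exposed on the engine):

import 'package:flutter/foundation.dart' show kIsWeb;

import 'package:llamadart/llamadart.dart';

/// Sketch only: decides which UI features to expose on the current platform.
class FeatureFlags {
  const FeatureFlags({
    required this.loraControls,
    required this.imageAttachments,
    required this.audioAttachments,
  });

  final bool loraControls;
  final bool imageAttachments;
  final bool audioAttachments;
}

FeatureFlags buildFeatureFlags(LlamaEngine engine) {
  return FeatureFlags(
    // LoRA runtime adapter APIs are not supported on web.
    loraControls: !kIsWeb,
    // Assumed to mirror the projector capabilities described above;
    // adjust to where your llamadart version exposes these checks.
    imageAttachments: engine.supportsVision,
    audioAttachments: engine.supportsAudio,
  );
}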


📦 Installation

Add llamadart to your pubspec.yaml:

dependencies:
  llamadart: ^0.5.4

Zero Setup (Native Assets)

llamadart leverages the Dart Native Assets (build hooks) system. When you run your app for the first time (dart run or flutter run), the package automatically:

  1. Detects your target platform and architecture.
  2. Downloads the appropriate pre-compiled binary from GitHub.
  3. Bundles it seamlessly into your application.

No manual binary downloads, CMake configuration, or platform-specific project changes are needed.


⚠️ Breaking Changes in 0.5.0

If you are upgrading from 0.4.x, read the notes below.

High-impact changes (a short migration sketch follows this list):

  • ChatSession now centers on create(...) and streams LlamaCompletionChunk.
  • LlamaChatMessage named constructors were standardized:
    • LlamaChatMessage.text(...) -> LlamaChatMessage.fromText(...)
    • LlamaChatMessage.multimodal(...) -> LlamaChatMessage.withContent(...)
  • ModelParams.logLevel was removed; logging is now controlled at engine level via:
    • setDartLogLevel(...)
    • setNativeLogLevel(...)
  • Root exports changed; previously exported internals such as ToolRegistry, LlamaTokenizer, and ChatTemplateProcessor are no longer part of the public package surface.
  • Custom backend implementations must match the updated LlamaBackend interface (including getVramInfo and updated applyChatTemplate).
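A minimal migration sketch for the renamed constructor and the relocated logging controls (the commented 0.4.x forms are shown for comparison only):

import 'package:llamadart/llamadart.dart';

// 0.4.x: LlamaChatMessage.text(role: ..., text: ...)
// 0.5.x: the same message is built with the fromText constructor.
const greeting = LlamaChatMessage.fromText(
  role: LlamaChatRole.user,
  text: 'Hello!',
);

Future<void> configureLogging(LlamaEngine engine) async {
  // 0.4.x: ModelParams.logLevel (removed in 0.5.0).
  // 0.5.x: logging is configured on the engine instead.
  await engine.setDartLogLevel(LlamaLogLevel.info);
  await engine.setNativeLogLevel(LlamaLogLevel.warn);
}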

πŸ› οΈ Usage

1. Simple Usage

The easiest way to get started is by using the default LlamaBackend.

import 'package:llamadart/llamadart.dart';

void main() async {
  // Automatically selects Native or Web backend
  final engine = LlamaEngine(LlamaBackend());

  try {
    // Initialize with a local GGUF model
    await engine.loadModel('path/to/model.gguf');

    // Generate text (streaming)
    await for (final token in engine.generate('The capital of France is')) {
      print(token);
    }
  } finally {
    // CRITICAL: Always dispose the engine to release native resources
    await engine.dispose();
  }
}

2. Advanced Usage (ChatSession)

Use ChatSession for most chat applications. It automatically manages conversation history, system prompts, and handles context window limits.

import 'dart:io';

import 'package:llamadart/llamadart.dart';

void main() async {
  final engine = LlamaEngine(LlamaBackend());

  try {
    await engine.loadModel('model.gguf');

    // Create a session with a system prompt
    final session = ChatSession(
      engine, 
      systemPrompt: 'You are a helpful assistant.',
    );

    // Send a message
    await for (final chunk in session.create([LlamaTextContent('What is the capital of France?')])) {
      stdout.write(chunk.choices.first.delta.content ?? '');
    }
  } finally {
    await engine.dispose();
  }
}

3. Tool Calling

llamadart supports intelligent tool calling where the model can use external functions to help it answer questions.

final tools = [
  ToolDefinition(
    name: 'get_weather',
    description: 'Get the current weather',
    parameters: [
      ToolParam.string('location', description: 'City name', required: true),
    ],
    handler: (params) async {
      final location = params.getRequiredString('location');
      return 'It is 22°C and sunny in $location';
    },
  ),
];

final session = ChatSession(engine);

// Pass tools per-request
await for (final chunk in session.create(
  [LlamaTextContent("how's the weather in London?")],
  tools: tools,
)) {
  final delta = chunk.choices.first.delta;
  if (delta.content != null) stdout.write(delta.content);
}

Notes:

  • Built-in template handlers automatically select model-specific tool-call grammar and parser behavior; you usually do not need to set GenerationParams.grammar manually for normal tool use.
  • Some handlers use lazy grammar activation (triggered when a tool-call prefix appears) to match llama.cpp behavior.
  • If you implement a custom handler grammar, prefer Dart raw strings (r'''...''') for GBNF blocks to avoid escaping bugs (a short sketch follows these notes).
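For example, a hand-written grammar can be supplied via a raw string (a minimal sketch: it assumes grammar is settable through the GenerationParams constructor, and the trivial yes/no grammar is only illustrative):

import 'package:llamadart/llamadart.dart';

// Sketch only: most tool-calling setups rely on the built-in handlers
// and never need to set a grammar manually.
final constrainedParams = GenerationParams(
  grammar: r'''
root ::= "yes" | "no"
''',
);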

3.5 Custom Template Handlers and Overrides (Advanced)

If you need behavior for a model-specific template that is not built in yet, you can register your own handler and/or template override.

import 'package:llamadart/llamadart.dart';

class MyHandler extends ChatTemplateHandler {
  @override
  ChatFormat get format => ChatFormat.generic;

  @override
  List<String> get additionalStops => const [];

  @override
  LlamaChatTemplateResult render({
    required String templateSource,
    required List<LlamaChatMessage> messages,
    required Map<String, String> metadata,
    bool addAssistant = true,
    List<ToolDefinition>? tools,
    bool enableThinking = true,
  }) {
    final prompt = messages.map((m) => m.content).join('\n');
    return LlamaChatTemplateResult(prompt: prompt, format: format.index);
  }

  @override
  ChatParseResult parse(
    String output, {
    bool isPartial = false,
    bool parseToolCalls = true,
    bool thinkingForcedOpen = false,
  }) {
    return ChatParseResult(content: output.trim());
  }

  @override
  String? buildGrammar(List<ToolDefinition>? tools) => null;
}

void configureTemplateRouting() {
  // 1) Register a custom handler
  ChatTemplateEngine.registerHandler(
    id: 'my-handler',
    handler: MyHandler(),
    matcher: (ctx) =>
        (ctx.metadata['general.name'] ?? '').contains('MyModel'),
  );

  // 2) Register a global template override
  ChatTemplateEngine.registerTemplateOverride(
    id: 'my-template-override',
    templateSource: '{{ messages[0]["content"] }}',
    matcher: (ctx) => ctx.hasTools,
  );
}

Future<void> usePerCallOverride(LlamaEngine engine) async {
  final template = await engine.chatTemplate(
    [
      const LlamaChatMessage.fromText(
        role: LlamaChatRole.user,
        text: 'hello',
      ),
    ],
    customTemplate: '{{ "CUSTOM:" ~ messages[0]["content"] }}',
    customHandlerId: 'my-handler',
  );

  print(template.prompt);
}

3.6 Logging Control

Use separate log levels for Dart and native output when debugging:

import 'package:llamadart/llamadart.dart';

final engine = LlamaEngine(LlamaBackend());

// Dart-side logs (template routing, parser diagnostics, etc.)
await engine.setDartLogLevel(LlamaLogLevel.info);

// Native llama.cpp / ggml logs
await engine.setNativeLogLevel(LlamaLogLevel.warn);

// Convenience: set both at once
await engine.setLogLevel(LlamaLogLevel.none);

4. Multimodal Usage (Vision/Audio)

llamadart supports multimodal models (vision and audio) using LlamaChatMessage.withContent.

import 'dart:io';

import 'package:llamadart/llamadart.dart';

void main() async {
  final engine = LlamaEngine(LlamaBackend());
  
  try {
    await engine.loadModel('vision-model.gguf');
    await engine.loadMultimodalProjector('mmproj.gguf');

    final session = ChatSession(engine);

    // Create a multimodal message
    final messages = [
      LlamaChatMessage.withContent(
        role: LlamaChatRole.user,
        content: [
          LlamaImageContent(path: 'image.jpg'),
          LlamaTextContent('What is in this image?'),
        ],
      ),
    ];

    // Use stateless engine.create for one-off multimodal requests
    final response = engine.create(messages);
    await for (final chunk in response) {
      stdout.write(chunk.choices.first.delta.content ?? '');
    }
  } finally {
    await engine.dispose();
  }
}

Web-specific note:

  • Load model/mmproj with URL-based assets (loadModelFromUrl + URL projector).
  • For user-picked browser files, send media as bytes (LlamaImageContent(bytes: ...), LlamaAudioContent(bytes: ...)) rather than local file paths, as in the sketch below.
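A minimal sketch of that flow on web (the model/mmproj URLs are placeholders, and it assumes loadModelFromUrl and the URL-based projector load behave as described in the notes above):

import 'dart:typed_data';

import 'package:llamadart/llamadart.dart';

/// Sketch only: describes an image the user picked in the browser.
/// imageBytes would come from a file picker that reads the file into memory.
Future<void> describeImageOnWeb(LlamaEngine engine, Uint8List imageBytes) async {
  await engine.loadModelFromUrl('https://example.com/models/vision-model.gguf');
  await engine.loadMultimodalProjector('https://example.com/models/mmproj.gguf');

  final messages = [
    LlamaChatMessage.withContent(
      role: LlamaChatRole.user,
      content: [
        LlamaImageContent(bytes: imageBytes),
        LlamaTextContent('What is in this image?'),
      ],
    ),
  ];

  await for (final chunk in engine.create(messages)) {
    print(chunk.choices.first.delta.content ?? '');
  }
}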

💡 Model-Specific Notes

Moondream 2 & Phi-2

These models use a unique architecture where the Beginning-of-Sequence (BOS) and End-of-Sequence (EOS) tokens are identical. llamadart includes a specialized handler for these models that:

  • Disables Auto-BOS: Prevents the model from stopping immediately upon generation.
  • Manual Templates: Automatically applies the required Question: / Answer: format if the model metadata is missing a chat template.
  • Stop Sequences: Injects Question: as a stop sequence to prevent rambling in multi-turn conversations.

🧹 Resource Management

Since llamadart allocates significant native memory and manages background worker Isolates/Threads, it is essential to manage its lifecycle correctly.

  • Explicit Disposal: Always call await engine.dispose() when you are finished with an engine instance.
  • Native Stability: On mobile and desktop, failing to dispose can lead to "hanging" background processes or memory pressure.
  • Hot Restart Support: In Flutter, placing the engine inside a Provider or State and calling dispose() in the appropriate lifecycle method ensures stability across Hot Restarts:

@override
void dispose() {
  _engine.dispose();
  super.dispose();
}

🎨 Low-Rank Adaptation (LoRA)

llamadart supports applying multiple LoRA adapters dynamically at runtime.

  • Dynamic Scaling: Adjust the strength (scale) of each adapter on the fly.
  • Isolate-Safe: Native adapters are managed in a background Isolate to prevent UI jank.
  • Efficient: Multiple LoRAs share the memory of a single base model.

Check out our LoRA Training Notebook to learn how to train and convert your own adapters.
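The calls below are hypothetical placeholders for the adapter-management API (consult the API docs for the actual method names); they are only a sketch of the intended flow: load one base model, attach a GGUF adapter, and tune its scale at runtime.

import 'package:llamadart/llamadart.dart';

// Sketch only: addLoraAdapter, setLoraScale, and removeLoraAdapter are
// hypothetical names, not the actual llamadart API surface.
Future<void> loraSketch(LlamaEngine engine) async {
  await engine.loadModel('base-model.gguf');

  final adapter = await engine.addLoraAdapter('my-adapter.gguf', scale: 0.5);
  await engine.setLoraScale(adapter, 1.0); // adjust strength on the fly
  await engine.removeLoraAdapter(adapter); // detach without reloading the model
}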


🧪 Testing & Quality

This project maintains a high standard of quality with >=70% line coverage on maintainable lib/ code (auto-generated files marked with // coverage:ignore-file are excluded).

  • Multi-Platform Testing: dart test runs VM and Chrome-compatible suites automatically.
  • Local-Only Scenarios: Slow E2E tests are tagged local-only and skipped by default.
  • CI/CD: Automatic analysis, linting, and cross-platform test execution on every PR.

# Run default test suite (VM + Chrome-compatible tests)
dart test

# Run local-only E2E scenarios
dart test --run-skipped -t local-only

# Run VM tests with coverage
dart test -p vm --coverage=coverage

# Format lcov for maintainable code (respects // coverage:ignore-file)
dart pub global run coverage:format_coverage --lcov --in=coverage/test --out=coverage/lcov.info --report-on=lib --check-ignore

# Enforce >=70% threshold
dart run tool/testing/check_lcov_threshold.dart coverage/lcov.info 70

🤝 Contributing

Contributions are welcome! Please see CONTRIBUTING.md for architecture details and maintainer instructions for building native binaries.

📜 License

This project is licensed under the MIT License - see the LICENSE file for details.
