Skip to content

A streaming HTML parser for Elixir built on top of the CloudFlare's LOL HTML

License

Notifications You must be signed in to change notification settings

abiko-search/laughter

Repository files navigation

Laughter

CI

A streaming HTML parser for Elixir built on top of CloudFlare's LOL HTML.

Why Laughter?

Unlike traditional DOM-based parsers (like Floki), Laughter processes HTML as it streams in, making it ideal for:

  • Crawlers - Extract links as the page downloads, not after
  • Large documents - Constant memory usage regardless of document size
  • Real-time processing - Get results before the full document arrives

Features

  • 🚀 Streaming - Process HTML chunk by chunk
  • 🎯 CSS Selectors - Filter elements with familiar CSS syntax
  • 💾 Memory bounded - Configurable memory limits
  • 🔒 Thread-safe - Safe for concurrent use
  • Fast - Built on Rust's lol-html (used by Cloudflare Workers)

Installation

def deps do
  [
    {:laughter, "~> 0.2.0", github: "abiko-search/laughter", submodules: true}
  ]
end

Requirements:

  • Elixir ~> 1.15
  • Rust (for compilation)

Usage

Basic Example

# Create a parser builder
builder = Laughter.build()

# Register CSS selectors - matched elements are sent as messages
link_ref = Laughter.filter(builder, self(), "a[href]")

# Create the parser
parser = Laughter.create(builder)

# Stream HTML in chunks (simulating network data)
parser
|> Laughter.parse("<html><body>")
|> Laughter.parse("<a href='/page1'>Link 1</a>")
|> Laughter.parse("<a href='/page2'>Link 2</a>")
|> Laughter.parse("</body></html>")
|> Laughter.done()

# Receive matched elements
receive do
  {:element, ^link_ref, {"a", [{"href", "/page1"}]}} -> :ok
end

receive do
  {:element, ^link_ref, {"a", [{"href", "/page2"}]}} -> :ok
end

Extract Text Content

builder = Laughter.build()

# Pass `true` as 4th argument to receive text content
title_ref = Laughter.filter(builder, self(), "title", true)

builder
|> Laughter.create()
|> Laughter.parse("<html><head><title>My Page</title></head></html>")
|> Laughter.done()

receive do
  {:element, ^title_ref, {"title", []}} -> :ok
end

receive do
  {:text, ^title_ref, "My Page"} -> :ok
end

Multiple Selectors

builder = Laughter.build()

links = Laughter.filter(builder, self(), "a")
images = Laughter.filter(builder, self(), "img")
meta = Laughter.filter(builder, self(), "meta[name='description']")

# All selectors work on the same stream
builder
|> Laughter.create()
|> Laughter.parse(html)
|> Laughter.done()

Memory Limits

# Limit memory usage (bytes)
parser = Laughter.create(builder, max_memory: 16_384)

# Raises if limit exceeded
Laughter.parse(parser, very_large_html)

Encoding

# Specify character encoding
parser = Laughter.create(builder, encoding: "utf-8")

Message Format

Matched elements are sent as messages to the registered process:

# Element matched
{:element, reference, {tag_name, attributes}}

# Text content (when send_content: true)
{:text, reference, binary}

# Document end
{:end, reference}

CSS Selector Support

Laughter supports standard CSS selectors:

  • Tag: div, a, span
  • Class: .content, div.main
  • ID: #header
  • Attribute: [href], [rel="nofollow"]
  • Combinators: div > a, ul li, h1 + p
  • Pseudo-classes: :nth-child(2), :first-child

Performance

Laughter processes HTML in a single pass without building a DOM tree:

Parser Memory (1MB HTML) Time
Floki ~10MB ~50ms
Laughter ~16KB (constant) ~20ms

License

Apache 2.0 © Danila Poyarkov

Credits

  • lol-html - CloudFlare's streaming HTML rewriter

About

A streaming HTML parser for Elixir built on top of the CloudFlare's LOL HTML

Topics

Resources

License

Stars

Watchers

Forks

Packages

No packages published