A lossless neural compression framework combining a Character-Level GRU (Probability Engine) with Arithmetic Coding for compressing high-entropy IoT sensor data, URLs, error codes, and machine-generated text.
JSON-LD data and IoT/WoT sensor data contain high-entropy text and URLs that cannot be substituted using static dictionaries across devices. Traditional compression approaches fail:
- CBOR/Standard Formats: Require 8 bits per character (1 byte/char) + extra bytes for length encoding
- Static Dictionaries: Don't work across diverse deployment contexts
- GZIP/ZIP: Too heavy for embedded devices; designed for file compression, not streaming data
In IoT systems, bandwidth is expensive:
- Sending 100 bytes over LoRaWAN/NB-IoT → seconds of transmission → drains radio battery
- Running a tiny CharGRU on ESP32 → milliseconds of computation → minimal CPU battery drain
This project inverts the trade-off: Use the compute power that exists to save expensive bandwidth.
Instead of static compression tables, use a deep learning model trained on domain data to predict character probabilities. Then use Arithmetic Encoding to convert those predictions into the minimum number of bits.
- All devices deploy the same CharGRU model → They generate identical probability predictions for any input
- Sender: GRU predicts next-character probability → Arithmetic Coder encodes using that probability → Output: minimal bits
- Receiver: Reverse process with RNN state tracking → Bit-by-bit decoding → Perfect reconstruction
Unlike CBOR, we don't need to encode the string length because:
- The vocabulary includes `\n` (an end-of-line marker): the RNN predicts its probability like any other character
- The decoder stops decompression as soon as it decodes `\n`
- The entropy-encoded `\n` is part of the compressed stream itself
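The length-free termination above amounts to decoding until the sentinel appears. A minimal sketch, where `decode_next_char` is a hypothetical stand-in for one full decoder step (RNN prediction + arithmetic decode):

```python
def decode_until_newline(decode_next_char):
    """Decode characters until the entropy-coded '\\n' sentinel appears.

    No length prefix is ever read: the sentinel itself, encoded like any
    other character, marks the end of the message.
    """
    chars = []
    while True:
        ch = decode_next_char()
        if ch == "\n":  # sentinel reached: message complete
            break
        chars.append(ch)
    return "".join(chars)
```

For example, `decode_until_newline(iter("sensor #123\n").__next__)` recovers `"sensor #123"`.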
Compression achieved: ~2-3x over CBOR, BPC reduced from 8 to 2-3 bits/char
| Component | Purpose | Implementation |
|---|---|---|
| Model | Character-level GRU | Stateful, batch_size=1 for streaming prediction |
| Vocabulary | Char↔Index mapping | Learned from training data |
| Arithmetic Coder | Entropy encoding | Fixed precision with renormalization |
| Decompressor | Bit-level decoder | Mirrors encoder logic exactly |
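The vocabulary component is a plain deterministic char↔index map. A minimal sketch (the function name is illustrative, not the project's actual API):

```python
def build_vocab(corpus: str):
    """Build deterministic char<->index mappings from training text.

    Sorting makes the mapping identical on every device that sees the
    same corpus, which the shared-model scheme relies on. The '\\n'
    end-of-line sentinel is always included.
    """
    chars = sorted(set(corpus) | {"\n"})
    char2idx = {c: i for i, c in enumerate(chars)}
    idx2char = {i: c for c, i in char2idx.items()}
    return char2idx, idx2char
```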
DeepArithmeticCoding/
├── config.py # Hyperparameters and constants
├── requirements.txt # Python dependencies
├── README.md # This file
│
├── src/ # Core modules
│ ├── __init__.py # Lazy imports (TensorFlow optional)
│ ├── utils.py # Data analysis, plotting, seeding
│ ├── data_generation.py # Dataset creation (Gemini API integration)
│ ├── model.py # GRU architecture
│ └── neural_compressor.py # Compression/decompression engine
│
├── scripts/ # Executable entry points
│ ├── prepare_data.py # Generate data + analysis + bucket suggestions
│ ├── train.py # Full training pipeline
│ └── compress_test.py # Compression testing
│
├── Notebook/
│ └── DAC_Development.ipynb # Development notebook with results
```bash
pip install -r requirements.txt
```
Option A: Use .env file (Recommended)
```bash
# Edit .env and add your Gemini API key
# GEMINI_API_KEY=your-actual-api-key-here
```
Get your API key at: https://aistudio.google.com/apikey
Option B: Environment Variable
```bash
# Linux/macOS
export GEMINI_API_KEY="your-gemini-api-key"

# Windows PowerShell
$env:GEMINI_API_KEY="your-gemini-api-key"
```
Option C: CLI Argument
```bash
python scripts/prepare_data.py --api-key "your-gemini-api-key"
```
Or simply:
```bash
python scripts/prepare_data.py
```
This script:
- Loads the API key from `.env` (or an environment variable)
- Generates 150+ IoT templates via the Gemini API (cached after the first run)
- Creates 50k train + 2k validation + 2k test lines
- Analyzes line-length distributions
- Suggests optimal bucket boundaries for variable-length training and test batches
- Saves visualization plots
Review the terminal output and update config.py:
```python
BUCKET_BOUNDARIES = [25, 45, 65]  # Use script suggestions to optimize RNN state partitioning
```
Then run training:
```bash
python scripts/train.py
```
Outputs:
- `best_model.keras`: best model (checkpointed during training)
- `vocab.pkl`: character vocabulary and index mappings
- Training/validation accuracy and loss plots
- Test set metrics (BPC, Accuracy)
```bash
# Test on sample strings (shows compression savings)
python scripts/compress_test.py --mode specific \
    --test-strings "sensor #123 reading" "http://api.example.com"

# Batch test on dataset (empirical metrics)
python scripts/compress_test.py --mode batch --num-samples 100

# Test all modes
python scripts/compress_test.py
```
Traditional encoding uses 8 bits per character. Arithmetic Coding exploits probability:
- English text: Average char probability ~0.1 → Entropy ~3.3 bits/char
- Machine data: Patterns exist (URLs, numbers, errors) → Entropy ~4-5 bits/char
- Random noise: No patterns → Entropy ≈ 8 bits/char
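These entropy figures can be checked empirically. A small sketch computing zeroth-order (per-character, context-free) entropy in bits/char; the function name is illustrative:

```python
import math
from collections import Counter

def bits_per_char(text: str) -> float:
    """Zeroth-order empirical entropy of a string, in bits per character."""
    counts = Counter(text)
    n = len(text)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())
```

For instance, `bits_per_char("aaaa")` is 0.0 and `bits_per_char("abcd")` is 2.0. The RNN does better than this zeroth-order bound because it conditions each prediction on context.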
Instead of floating-point [0.0, 1.0], use fixed 32-bit integers [0, 2³²-1].
Core Variables:
- `Low`, `High`: current interval boundaries (integers)
- `Pending`: count of deferred bits (handles the middle trap)
- `Value`: the decoder's rolling bit window
Step 1: Initialize
Low = 0, High = 2³² - 1, Pending = 0
Step 2: For each character:
- Get Probability: RNN predicts P(char | context)
- Narrow Range:
  - Range = High - Low + 1
  - High = Low + Range × CumProb(char) - 1
  - Low = Low + Range × CumProb(char-1)
- Renormalize (zoom in to extract bits):
  - Case A (Top Half): Low ≥ 2³¹
    - Output: `1` plus any pending opposite bits
    - Action: shift `Low`, `High` left (discard the top bit)
  - Case B (Bottom Half): High < 2³¹
    - Output: `0` plus any pending opposite bits
    - Action: shift both left
  - Case C (Middle Trap): 2³⁰ ≤ Low and High < 3×2³⁰
    - Increment the `Pending` counter
    - Zoom into the middle (prevents range collapse)
  - Otherwise: the range is wide enough; continue to the next character
Step 3: At End of Stream
Output final bits to flush pending bits
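Steps 2 and 3 can be sketched in integer arithmetic. This is a simplified illustration of the case analysis above, where a character's cumulative probability is passed as integer counts `c_lo`/`c_hi` out of `total` (an assumption of this sketch, not necessarily the project's exact interface):

```python
HALF, QUARTER = 1 << 31, 1 << 30
TOP = (1 << 32) - 1

def encode_symbol(low, high, pending, c_lo, c_hi, total, out):
    """One encoder step: narrow the interval, then renormalize."""
    span = high - low + 1
    high = low + span * c_hi // total - 1        # upper cumulative bound
    low = low + span * c_lo // total             # lower cumulative bound
    while True:
        if high < HALF:                          # Case B: bottom half -> emit 0
            out.append(0); out.extend([1] * pending); pending = 0
        elif low >= HALF:                        # Case A: top half -> emit 1
            out.append(1); out.extend([0] * pending); pending = 0
            low -= HALF; high -= HALF
        elif low >= QUARTER and high < 3 * QUARTER:  # Case C: middle trap
            pending += 1                         # defer the bit decision
            low -= QUARTER; high -= QUARTER
        else:
            break                                # interval wide enough
        low <<= 1                                # discard the top bit
        high = (high << 1) | 1
    return low, high, pending
```

For example, encoding the second of two equiprobable symbols (`c_lo=1, c_hi=2, total=2`) from a fresh interval emits a single `1` bit and restores the full range.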
Step 1: Initialize
- Read the first 32 bits of the stream into the `Value` register
- Set `Low = 0`, `High = 2³² - 1`
Step 2: For each symbol:
- Identify Character: find where `Value` falls in the cumulative probability map
  - Position = (Value - Low) / (High - Low + 1)
  - Find the char where CumProb(char-1) ≤ Position < CumProb(char)
- Narrow Range: same math as the encoder
  - High = Low + Range × CumProb(char) - 1
  - Low = Low + Range × CumProb(char-1)
- Synchronize Renormalization (read bits):
  - The decoder performs exactly the same interval tests as the encoder
  - Whenever the encoder would output a bit, the decoder reads one and shifts it into `Value`
  - This keeps encoder and decoder perfectly synchronized
- End on `\n`: the vocabulary contains the `\n` character; the decoder stops as soon as it is decoded
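Putting encoder and decoder together, here is a self-contained round-trip sketch using a uniform static character model in place of the GRU (the real system substitutes per-step GRU probabilities; all names here are illustrative, not the project's code):

```python
HALF, QUARTER, TOP = 1 << 31, 1 << 30, (1 << 32) - 1

def compress(text, symbols):
    """Arithmetic-encode text + '\\n' sentinel under a uniform model."""
    total = len(symbols)
    low, high, pending, bits = 0, TOP, 0, []

    def emit(bit):
        nonlocal pending
        bits.append(bit)
        bits.extend([1 - bit] * pending)      # flush deferred opposite bits
        pending = 0

    for ch in text + "\n":
        i = symbols.index(ch)                 # uniform model: CumProb = i/total
        span = high - low + 1
        high = low + span * (i + 1) // total - 1
        low = low + span * i // total
        while True:                           # renormalize: cases B, A, C
            if high < HALF:
                emit(0)
            elif low >= HALF:
                emit(1); low -= HALF; high -= HALF
            elif low >= QUARTER and high < 3 * QUARTER:
                pending += 1; low -= QUARTER; high -= QUARTER
            else:
                break
            low <<= 1; high = (high << 1) | 1
    pending += 1                              # final flush pins down the interval
    emit(0 if low < QUARTER else 1)
    return bits

def decompress(bits, symbols):
    """Mirror of compress(); stops at the decoded '\\n' sentinel."""
    total = len(symbols)
    it = iter(bits)
    next_bit = lambda: next(it, 0)            # pad with zeros when bits run out
    value = 0
    for _ in range(32):                       # fill the 32-bit Value register
        value = (value << 1) | next_bit()
    low, high, out = 0, TOP, []
    while True:
        span = high - low + 1
        pos = ((value - low + 1) * total - 1) // span
        ch = symbols[pos]                     # uniform model: index == bucket
        if ch == "\n":
            return "".join(out)
        out.append(ch)
        high = low + span * (pos + 1) // total - 1
        low = low + span * pos // total
        while True:                           # same interval tests as encoder
            if high < HALF:
                pass
            elif low >= HALF:
                value -= HALF; low -= HALF; high -= HALF
            elif low >= QUARTER and high < 3 * QUARTER:
                value -= QUARTER; low -= QUARTER; high -= QUARTER
            else:
                break
            low <<= 1; high = (high << 1) | 1
            value = (value << 1) | next_bit()
```

With a skewed learned model instead of the uniform one, frequent characters get wider intervals and therefore fewer bits, which is where the compression gain comes from.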
Compression test on 100 randomly sampled lines from test dataset:
Metrics:
- Decompression Accuracy: lossless reconstruction verified on 99%+ of test strings
- Average Savings: 75-80% vs. CBOR
Sample Results:
| String (first 40 chars) | Original (bytes) | CBOR (bytes) | AC (bytes) | Savings |
|---|---|---|---|---|
| the ground displacement speed at locati... | 56 | 62 | 15 | 75.8% |
| https://api.sensor-cloud.org/v1/dev/123... | 42 | 47 | 8 | 82.9% |
| sensor #123 temperature is 25.6 C | 34 | 39 | 6 | 84.6% |
| pressure sensor reading 1012.3 hPa at... | 54 | 60 | 13 | 78.3% |
| Lidar #4829 operating normally, 12V su... | 43 | 48 | 9 | 81.2% |
Key Findings:
- URLs: 83-85% compression — high repetition, predictable patterns
- Sensor Descriptions: 75-80% compression — mix of templates and variable content
- Status Messages: 78-82% compression — trained patterns dominate
- Out-of-Distribution: 40-50% compression — model less confident on unseen patterns
```bash
# Larger dataset with custom hybrid ratio
python scripts/prepare_data.py --train-lines 100000 --hybrid-ratio 0.6

# Skip time-consuming analysis
python scripts/prepare_data.py --skip-analysis

# Custom dataset splits
python scripts/prepare_data.py --train-split 0.75 --val-split 0.15
```

```bash
# Mode 1: Single verification
python scripts/compress_test.py --mode single

# Mode 2: Batch metrics (100 samples)
python scripts/compress_test.py --mode batch --num-samples 100

# Mode 3: Custom test strings
python scripts/compress_test.py --mode specific \
    --test-strings "sensor #123" "http://example.com" "Error_0xAB"

# Mode 4: All tests
python scripts/compress_test.py --mode all --output-dir ./results
```
Training Pipeline:
prepare_data.py → generate dataset → analyze distribution → store to disk
train.py → load processed data → build model → train/validate → checkpoint
Compression Pipeline:
Text input → vocab encode → stateful GRU prediction
→ arithmetic coder → compressed bytes → output file
Decompression Pipeline:
Compressed bytes → initialize decoder with RNN
→ arithmetic decode → character recovery → text output
- DeepZip (Stanford): RNNs beating GZIP on text compression
- Arithmetic Coding: Information theory optimal prefix-free codes
- IoT Standards: SSN/SOSA ontology for semantic sensor networks
- WoT: W3C Web of Things architecture
- TFLite Quantization: Export a fully quantized 8-bit model for edge deployment and test it on ESP32
- ESP32 Deployment: Verify inference pipeline on real hardware
- Attention Mechanism: Longer context for improved predictions
- Hardware Benchmarks: Compare with gzip, CBOR, etc. on actual radio modules
- Adaptive Models: Per-domain specialized networks (URLs, sensor readings, logs)