A lossless neural compression framework combining a Character-Level GRU (Probability Engine) with Arithmetic Coding for compressing high-entropy IoT sensor data, URLs, error codes, and machine-generated text.
JSON-LD data and IoT/WoT sensor data contain high-entropy text and URLs that cannot be substituted using static dictionaries across devices. Traditional compression approaches fail:
- CBOR/Standard Formats: Require 8 bits per character (1 byte/char) + extra bytes for length encoding
- Static Dictionaries: Don't work across diverse deployment contexts
- GZIP/ZIP: Too heavy for embedded devices; designed for file compression, not streaming data
In IoT systems, bandwidth is expensive:
- Sending 100 bytes over LoRaWAN/NB-IoT → seconds of transmission → drains radio battery
- Running a tiny CharGRU on ESP32 → milliseconds of computation → minimal CPU battery drain
This project inverts the trade-off: Use the compute power that exists to save expensive bandwidth.
Instead of static compression tables, use a deep learning model trained on domain data to predict character probabilities. Then use Arithmetic Encoding to convert those predictions into the minimum number of bits.
- All devices deploy the same CharGRU model → They generate identical probability predictions for any input
- Sender: GRU predicts next-character probability → Arithmetic Coder encodes using that probability → Output: minimal bits
- Receiver: Reverse process with RNN state tracking → Bit-by-bit decoding → Perfect reconstruction
Unlike CBOR, we don't need to encode the string length because:
- The vocabulary includes `\n` (an end-of-line marker): the RNN predicts its probability like any other character
- The decoder stops decompression as soon as it decodes `\n`
- The entropy-encoded `\n` is part of the compressed stream itself
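The length-free termination above amounts to decoding until the sentinel appears. A minimal sketch, where `decode_next_char` is a hypothetical stand-in for one full decoder step (RNN prediction + arithmetic decode):

```python
def decode_until_newline(decode_next_char):
    """Decode characters until the entropy-coded '\\n' sentinel appears.

    No length prefix is ever read: the sentinel itself, encoded like any
    other character, marks the end of the message.
    """
    chars = []
    while True:
        ch = decode_next_char()
        if ch == "\n":  # sentinel reached: message complete
            break
        chars.append(ch)
    return "".join(chars)
```

For example, `decode_until_newline(iter("sensor #123\n").__next__)` recovers `"sensor #123"`.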
Compression achieved: ~2-3x over CBOR, BPC reduced from 8 to 2-3 bits/char
| Component | Purpose | Implementation |
|---|---|---|
| Model | Character-level GRU | Stateful, batch_size=1 for streaming prediction |
| Vocabulary | Char↔Index mapping | Learned from training data |
| Arithmetic Coder | Entropy encoding | Fixed precision with renormalization |
| Decompressor | Bit-level decoder | Mirrors encoder logic exactly |
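The vocabulary component is a plain deterministic char↔index map. A minimal sketch (the function name is illustrative, not the project's actual API):

```python
def build_vocab(corpus: str):
    """Build deterministic char<->index mappings from training text.

    Sorting makes the mapping identical on every device that sees the
    same corpus, which the shared-model scheme relies on. The '\\n'
    end-of-line sentinel is always included.
    """
    chars = sorted(set(corpus) | {"\n"})
    char2idx = {c: i for i, c in enumerate(chars)}
    idx2char = {i: c for c, i in char2idx.items()}
    return char2idx, idx2char
```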
DeepArithmeticCoding/
├── config.py # Hyperparameters and constants
├── requirements.txt # Python dependencies
├── README.md # This file
│
├── src/ # Core modules
│ ├── __init__.py # Lazy imports (TensorFlow optional)
│ ├── utils.py # Data analysis, plotting, seeding
│ ├── data_generation.py # Dataset creation (Gemini API integration)
│ ├── model.py # GRU architecture
│ └── neural_compressor.py # Compression/decompression engine
│
├── scripts/ # Executable entry points
│ ├── prepare_data.py # Generate data + analysis + bucket suggestions
│ ├── train.py # Full training pipeline
│ └── compress_test.py # Compression testing
│
├── Notebook/
│ └── DAC_Development.ipynb # Development notebook with results
```bash
pip install -r requirements.txt
```
Option A: Use .env file (Recommended)
```bash
# Edit .env and add your Gemini API key
# GEMINI_API_KEY=your-actual-api-key-here
```
Get your API key at: https://aistudio.google.com/apikey
Option B: Environment Variable
```bash
# Linux/macOS
export GEMINI_API_KEY="your-gemini-api-key"

# Windows PowerShell
$env:GEMINI_API_KEY="your-gemini-api-key"
```
Option C: CLI Argument
```bash
python scripts/prepare_data.py --api-key "your-gemini-api-key"
```
Or simply:
```bash
python scripts/prepare_data.py
```
This script:
- Loads the API key from `.env` (or an environment variable)
- Generates 150+ IoT templates via the Gemini API (cached after the first run)
- Creates 50k train + 2k validation + 2k test lines
- Analyzes line-length distributions
- Suggests optimal bucket boundaries for variable-length training and test batches
- Saves visualization plots
Review the terminal output and update config.py:
```python
BUCKET_BOUNDARIES = [25, 45, 65]  # Use script suggestions to optimize RNN state partitioning
```
Then run training:
```bash
python scripts/train.py
```
Outputs:
- `best_model.keras`: best model (checkpointed during training)
- `vocab.pkl`: character vocabulary and index mappings
- Training/validation accuracy and loss plots
- Test set metrics (BPC, Accuracy)
```bash
# Test on sample strings (shows compression savings)
python scripts/compress_test.py --mode specific \
    --test-strings "sensor #123 reading" "http://api.example.com"

# Batch test on dataset (empirical metrics)
python scripts/compress_test.py --mode batch --num-samples 100

# Test all modes
python scripts/compress_test.py
```
Traditional encoding uses 8 bits per character. Arithmetic Coding exploits probability:
- English text: Average char probability ~0.1 → Entropy ~3.3 bits/char
- Machine data: Patterns exist (URLs, numbers, errors) → Entropy ~4-5 bits/char
- Random noise: No patterns → Entropy ≈ 8 bits/char
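These entropy figures can be checked empirically. A small sketch computing zeroth-order (per-character, context-free) entropy in bits/char; the function name is illustrative:

```python
import math
from collections import Counter

def bits_per_char(text: str) -> float:
    """Zeroth-order empirical entropy of a string, in bits per character."""
    counts = Counter(text)
    n = len(text)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())
```

For instance, `bits_per_char("aaaa")` is 0.0 and `bits_per_char("abcd")` is 2.0. The RNN does better than this zeroth-order bound because it conditions each prediction on context.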
Instead of floating-point [0.0, 1.0], use fixed 32-bit integers [0, 2³²-1].
Core Variables:
- `Low`, `High`: current interval boundaries (integers)
- `Pending`: count of deferred bits (handles the middle trap)
- `Value`: the decoder's rolling bit window
Step 1: Initialize
Low = 0, High = 2³² - 1, Pending = 0
Step 2: For each character:
- Get Probability: RNN predicts P(char | context)
- Narrow Range:
  - Range = High - Low + 1
  - High = Low + Range × CumProb(char) - 1
  - Low = Low + Range × CumProb(char-1)
- Renormalize (zoom in to extract bits):
  - Case A (Top Half): Low ≥ 2³¹
    - Output: `1` plus any pending opposite bits
    - Action: shift `Low`, `High` left (discard the top bit)
  - Case B (Bottom Half): High < 2³¹
    - Output: `0` plus any pending opposite bits
    - Action: shift both left
  - Case C (Middle Trap): 2³⁰ ≤ Low and High < 3×2³⁰
    - Increment the `Pending` counter
    - Zoom into the middle (prevents range collapse)
  - Otherwise: the range is wide enough; continue to the next character
Step 3: At End of Stream
Output final bits to flush pending bits
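Steps 2 and 3 can be sketched in integer arithmetic. This is a simplified illustration of the case analysis above, where a character's cumulative probability is passed as integer counts `c_lo`/`c_hi` out of `total` (an assumption of this sketch, not necessarily the project's exact interface):

```python
HALF, QUARTER = 1 << 31, 1 << 30
TOP = (1 << 32) - 1

def encode_symbol(low, high, pending, c_lo, c_hi, total, out):
    """One encoder step: narrow the interval, then renormalize."""
    span = high - low + 1
    high = low + span * c_hi // total - 1        # upper cumulative bound
    low = low + span * c_lo // total             # lower cumulative bound
    while True:
        if high < HALF:                          # Case B: bottom half -> emit 0
            out.append(0); out.extend([1] * pending); pending = 0
        elif low >= HALF:                        # Case A: top half -> emit 1
            out.append(1); out.extend([0] * pending); pending = 0
            low -= HALF; high -= HALF
        elif low >= QUARTER and high < 3 * QUARTER:  # Case C: middle trap
            pending += 1                         # defer the bit decision
            low -= QUARTER; high -= QUARTER
        else:
            break                                # interval wide enough
        low <<= 1                                # discard the top bit
        high = (high << 1) | 1
    return low, high, pending
```

For example, encoding the second of two equiprobable symbols (`c_lo=1, c_hi=2, total=2`) from a fresh interval emits a single `1` bit and restores the full range.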
Step 1: Initialize
- Read the first 32 bits of the stream into the `Value` register
- Set `Low = 0`, `High = 2³² - 1`
Step 2: For each symbol:
- Identify Character: find where `Value` falls in the cumulative probability map
  - Position = (Value - Low) / (High - Low + 1)
  - Find the char where CumProb(char-1) ≤ Position < CumProb(char)
- Narrow Range: same math as the encoder
  - High = Low + Range × CumProb(char) - 1
  - Low = Low + Range × CumProb(char-1)
- Synchronize Renormalization (read bits):
  - The decoder performs exactly the same interval tests as the encoder
  - Whenever the encoder would output a bit, the decoder reads one and shifts it into `Value`
  - This keeps encoder and decoder perfectly synchronized
- End on `\n`: the vocabulary contains the `\n` character; the decoder stops as soon as it is decoded
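Putting encoder and decoder together, here is a self-contained round-trip sketch using a uniform static character model in place of the GRU (the real system substitutes per-step GRU probabilities; all names here are illustrative, not the project's code):

```python
HALF, QUARTER, TOP = 1 << 31, 1 << 30, (1 << 32) - 1

def compress(text, symbols):
    """Arithmetic-encode text + '\\n' sentinel under a uniform model."""
    total = len(symbols)
    low, high, pending, bits = 0, TOP, 0, []

    def emit(bit):
        nonlocal pending
        bits.append(bit)
        bits.extend([1 - bit] * pending)      # flush deferred opposite bits
        pending = 0

    for ch in text + "\n":
        i = symbols.index(ch)                 # uniform model: CumProb = i/total
        span = high - low + 1
        high = low + span * (i + 1) // total - 1
        low = low + span * i // total
        while True:                           # renormalize: cases B, A, C
            if high < HALF:
                emit(0)
            elif low >= HALF:
                emit(1); low -= HALF; high -= HALF
            elif low >= QUARTER and high < 3 * QUARTER:
                pending += 1; low -= QUARTER; high -= QUARTER
            else:
                break
            low <<= 1; high = (high << 1) | 1
    pending += 1                              # final flush pins down the interval
    emit(0 if low < QUARTER else 1)
    return bits

def decompress(bits, symbols):
    """Mirror of compress(); stops at the decoded '\\n' sentinel."""
    total = len(symbols)
    it = iter(bits)
    next_bit = lambda: next(it, 0)            # pad with zeros when bits run out
    value = 0
    for _ in range(32):                       # fill the 32-bit Value register
        value = (value << 1) | next_bit()
    low, high, out = 0, TOP, []
    while True:
        span = high - low + 1
        pos = ((value - low + 1) * total - 1) // span
        ch = symbols[pos]                     # uniform model: index == bucket
        if ch == "\n":
            return "".join(out)
        out.append(ch)
        high = low + span * (pos + 1) // total - 1
        low = low + span * pos // total
        while True:                           # same interval tests as encoder
            if high < HALF:
                pass
            elif low >= HALF:
                value -= HALF; low -= HALF; high -= HALF
            elif low >= QUARTER and high < 3 * QUARTER:
                value -= QUARTER; low -= QUARTER; high -= QUARTER
            else:
                break
            low <<= 1; high = (high << 1) | 1
            value = (value << 1) | next_bit()
```

With a skewed learned model instead of the uniform one, frequent characters get wider intervals and therefore fewer bits, which is where the compression gain comes from.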
Compression test on 100 randomly sampled lines from test dataset:
Metrics:
- Decompression Accuracy: lossless reconstruction verified on 99%+ of test strings
- Average Savings: 75-80% vs. CBOR
Sample Results:
| String (first 40 chars) | Original (bytes) | CBOR (bytes) | AC (bytes) | Savings |
|---|---|---|---|---|
| the ground displacement speed at locati... | 56 | 62 | 15 | 75.8% |
| https://api.sensor-cloud.org/v1/dev/123... | 42 | 47 | 8 | 82.9% |
| sensor #123 temperature is 25.6 C | 34 | 39 | 6 | 84.6% |
| pressure sensor reading 1012.3 hPa at... | 54 | 60 | 13 | 78.3% |
| Lidar #4829 operating normally, 12V su... | 43 | 48 | 9 | 81.2% |
Key Findings:
- URLs: 83-85% compression — high repetition, predictable patterns
- Sensor Descriptions: 75-80% compression — mix of templates and variable content
- Status Messages: 78-82% compression — trained patterns dominate
- Out-of-Distribution: 40-50% compression — model less confident on unseen patterns
```bash
# Larger dataset with custom hybrid ratio
python scripts/prepare_data.py --train-lines 100000 --hybrid-ratio 0.6

# Skip time-consuming analysis
python scripts/prepare_data.py --skip-analysis

# Custom dataset splits
python scripts/prepare_data.py --train-split 0.75 --val-split 0.15
```

```bash
# Mode 1: Single verification
python scripts/compress_test.py --mode single

# Mode 2: Batch metrics (100 samples)
python scripts/compress_test.py --mode batch --num-samples 100

# Mode 3: Custom test strings
python scripts/compress_test.py --mode specific \
    --test-strings "sensor #123" "http://example.com" "Error_0xAB"

# Mode 4: All tests
python scripts/compress_test.py --mode all --output-dir ./results
```
Training Pipeline:
prepare_data.py → generate dataset → analyze distribution → store to disk
train.py → load processed data → build model → train/validate → checkpoint
Compression Pipeline:
Text input → vocab encode → stateful GRU prediction
→ arithmetic coder → compressed bytes → output file
Decompression Pipeline:
Compressed bytes → initialize decoder with RNN
→ arithmetic decode → character recovery → text output
- DeepZip (Stanford): RNNs beating GZIP on text compression
- Arithmetic Coding: Information theory optimal prefix-free codes
- IoT Standards: SSN/SOSA ontology for semantic sensor networks
- WoT: W3C Web of Things architecture
- TFLite Quantization: Export a fully quantized 8-bit model for edge deployment and test it on ESP32
- ESP32 Deployment: Verify inference pipeline on real hardware
- Attention Mechanism: Longer context for improved predictions
- Hardware Benchmarks: Compare with gzip, CBOR, etc. on actual radio modules
- Adaptive Models: Per-domain specialized networks (URLs, sensor readings, logs)