@rdheekonda rdheekonda commented Jan 16, 2026

Add constitutional-classifier probing transforms based on Cunningham et al., 2025, "Constitutional Classifiers++: Efficient Production-Grade Defenses Against Universal Jailbreaks" (https://arxiv.org/abs/2601.04603).

Key Changes:

  • Add 7 new constitutional transform functions for AI red teaming
  • Support multiple transformation modes (static, LLM-powered, hybrid)
  • Add comprehensive example notebook with TAP attack integration

Added:

  • dreadnode/transforms/constitutional.py - Core constitutional transforms module
    • Reconstruction attacks: code_fragmentation, document_fragmentation, multi_turn_fragmentation
    • Obfuscation attacks: metaphor_encoding, riddle_encoding, contextual_substitution, character_separation
  • examples/airt/constitutional_attacks.ipynb - Complete example notebook demonstrating all transforms with TAP integration
  • Support for static mappings, LLM-powered generation, and hybrid modes
  • Integration with the existing TAP (Tree of Attacks with Pruning) framework

Changed:

  • Updated dreadnode/transforms/__init__.py to export constitutional module
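
To illustrate what listing a module in `__all__` changes for `from package import *`, here is a minimal sketch that simulates a package in memory. The `demo_transforms` package and its `internal_helper` attribute are hypothetical stand-ins; the actual contents of `dreadnode/transforms/__init__.py` are not reproduced here.

```python
import sys
import types

# Hypothetical in-memory stand-in for a package (not the real dreadnode package).
pkg = types.ModuleType("demo_transforms")
pkg.constitutional = types.ModuleType("demo_transforms.constitutional")
pkg.internal_helper = object()  # defined on the package but left out of __all__
pkg.__all__ = ["constitutional"]  # only these names are star-exported
sys.modules["demo_transforms"] = pkg

# Run a star import into a fresh namespace and inspect what arrived.
namespace = {}
exec("from demo_transforms import *", namespace)
exported = sorted(name for name in namespace if not name.startswith("__"))
print(exported)
```

Only names listed in `__all__` reach the importing namespace, which is why a new module has to be added there as well as imported.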

Technical Details:

  • Implements defense probing techniques from Constitutional Classifiers++ paper
  • Static mode uses predefined mappings (chemistry_to_cooking domain)
  • LLM mode uses generative models for creative transformations
  • Hybrid mode combines static mappings with LLM fallback
  • All transforms are compatible with evaluation hooks and TAP attacks
  • Notebook outputs stripped for clean commit
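
The three modes listed above can be pictured as a lookup with an optional fallback. This is only a sketch under stated assumptions: `substitute_term`, `STATIC_MAPPING`, and the `llm_generate` callback are hypothetical names chosen for illustration, the mapping entries are placeholders, and none of this is taken from `constitutional.py`.

```python
from typing import Callable, Literal, Optional

Mode = Literal["static", "llm", "hybrid"]

# Placeholder entries (assumption); the module's real mapping is not shown in this PR text.
STATIC_MAPPING = {"reagent": "ingredient", "reaction": "recipe step"}


def substitute_term(
    term: str,
    mode: Mode,
    llm_generate: Optional[Callable[[str], str]] = None,
) -> str:
    """Map a term according to the selected mode (hypothetical helper)."""
    if mode == "static":
        # Static mode: predefined mapping only; unknown terms pass through unchanged.
        return STATIC_MAPPING.get(term, term)
    if mode == "llm":
        # LLM mode: always delegate to the generator callback.
        if llm_generate is None:
            raise ValueError("llm mode requires an llm_generate callback")
        return llm_generate(term)
    if mode == "hybrid":
        # Hybrid mode: prefer the static mapping, fall back to the generator.
        if term in STATIC_MAPPING:
            return STATIC_MAPPING[term]
        return llm_generate(term) if llm_generate is not None else term
    raise ValueError(f"unknown mode: {mode!r}")
```

Passing the generator as a plain callable keeps the sketch testable without a model; a stub such as `lambda t: t.upper()` is enough to exercise the fallback path.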

Generated Summary:

  • Introduced a new module constitutional.py that implements Constitutional Classifier transforms.
  • Added metaphor encoding techniques for evading classifiers based on "Constitutional Classifiers++: Efficient Production-Grade Defenses Against Universal Jailbreaks".
  • Updated __init__.py to include constitutional in the exported modules and the __all__ list.
  • Implemented various metaphors for technical terms related to chemistry, biology, and weapons, enhancing the obfuscation capabilities.
  • Included support for static and LLM-powered (dynamic) metaphor generation modes that map harmful terms to benign alternatives.
  • Added specific transforms such as code_fragmentation and document_fragmentation that help evade input and output classifiers by fragmenting harmful queries across benign contexts.
  • Improved utility functions for encoding and hint generation to enhance contextual understanding of metaphorical substitutions.
  • Overall, these changes extend the dreadnode transforms with obfuscation and fragmentation techniques for probing classifier-based safety defenses.

This summary was generated with ❤️ by rigging

Add constitutional classifiers probing transforms based on Cunningham et al. 2025 paper:
- Reconstruction attacks: code_fragmentation, document_fragmentation, multi_turn_fragmentation
- Obfuscation attacks: metaphor_encoding, riddle_encoding, contextual_substitution, character_separation
- Supports static, LLM-powered, and hybrid transformation modes
- Add comprehensive example notebook demonstrating all transforms with TAP integration
- Strip notebook outputs for clean commit
@dreadnode-renovate-bot (bot) added the area/examples (Changes to example code and demonstrations) label Jan 16, 2026
Replace # noqa: S311 with # nosec B311 for bandit security scanner compatibility
Add both # noqa: S311 (ruff) and # nosec B311 (bandit) to suppress security warnings for non-cryptographic random usage
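
For context, the two suppression comments target the same finding in different tools: ruff reports non-cryptographic `random` usage as S311, and bandit reports it as B311. A minimal sketch of a line carrying both markers follows; `pick_separator` is a hypothetical function, and the exact comment layout used in the PR is an assumption.

```python
import random


def pick_separator(separators: list[str]) -> str:
    """Pick a separator at random; not security-sensitive, so `random` is acceptable."""
    # Both markers on one line: `noqa: S311` for ruff, `nosec B311` for bandit.
    return random.choice(separators)  # noqa: S311  # nosec B311
```

If the randomness ever became security-relevant, the usual alternative would be the `secrets` module rather than suppressing the warning.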