This repository contains the scripts used to create the models, which were trained on an opcode dataset.
- `pretrain_bert.py`: Handles the end-to-end pre-training of a BERT model for opcode sequences. Key features: custom tokenization (loads a dedicated `opcode_vocab.txt` to handle assembly syntax), a lightweight 6-layer BERT configuration, Masked Language Modeling (MLM) training with a 15% masking probability, and automated data splitting (90/10 train/test), tokenization, training, and model saving.
- `train_bert_embeddings.py`: Transforms a model previously trained with Masked Language Modeling (MLM) into an embedding model capable of representing the semantic similarity between blocks of code.
- `domain_adaptive_mlm.py`: Generic domain-adaptive pretraining (DAPT) for any MLM-capable checkpoint. Continues training from an existing model via `AutoModelForMaskedLM.from_pretrained(...)` and supports custom train/eval files, tokenizer override, masking settings, and Trainer hyperparameters.
- `tune_distil_bert.py`: DistilBERT-focused wrapper around `domain_adaptive_mlm.py`. By default it performs DAPT starting from `distilbert-base-uncased` on `opcodes_300/*.opcodes.txt`.
- `generate_opcode_test_corpus.py`: Creates a small synthetic test corpus of `.opcodes.txt` files by generating C samples, compiling them, and extracting opcode sequences from the disassembly.
- `tune_all-MiniLM-L6-v2.py`: Trains a Sentence-BERT (SBERT) model tailored for analyzing assembly opcodes, designed to create semantic embeddings of code sequences.
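For reference, the lightweight MLM setup described for `pretrain_bert.py` (6 layers, 15% masking) can be sketched roughly as follows. The vocabulary, hidden size, and head count below are illustrative placeholders, not the script's actual values, and a toy vocabulary file stands in for the repo's `opcode_vocab.txt`:

```python
import tempfile

from transformers import (
    BertConfig,
    BertForMaskedLM,
    BertTokenizerFast,
    DataCollatorForLanguageModeling,
)

# Toy opcode vocabulary standing in for the repo's opcode_vocab.txt.
vocab = ["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]",
         "mov", "push", "pop", "call", "ret", "add", "jmp"]
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as f:
    f.write("\n".join(vocab))
    vocab_path = f.name

tokenizer = BertTokenizerFast(vocab_file=vocab_path, do_lower_case=False)

# Lightweight 6-layer BERT configuration (widths here are placeholders).
config = BertConfig(
    vocab_size=tokenizer.vocab_size,
    num_hidden_layers=6,
    hidden_size=256,
    num_attention_heads=4,
    intermediate_size=512,
)
model = BertForMaskedLM(config)

# MLM collator with the 15% masking probability mentioned above.
collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)

# The collator randomly masks tokens and emits matching MLM labels.
batch = collator([tokenizer("mov push call ret")])
print(batch["input_ids"].shape, batch["labels"].shape)
```

A model and collator built this way plug directly into a `transformers.Trainer` for the actual MLM training loop.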
Use the generic script directly when you want to continue pretraining any MLM model on your domain corpus:
```
python domain_adaptive_mlm.py ^
  --model_name_or_path distilbert-base-uncased ^
  --train_files opcodes_300/*.opcodes.txt ^
  --output_dir ./opcode_DistilBERT ^
  --overwrite_output_dir ^
  --num_train_epochs 5 ^
  --per_device_train_batch_size 16
```

Example with another model checkpoint:
```
python domain_adaptive_mlm.py ^
  --model_name_or_path bert-base-uncased ^
  --train_files opcodes_300/*.opcodes.txt ^
  --output_dir ./opcode_BERT_dapt
```

To generate a small synthetic test corpus first:

```
python generate_opcode_test_corpus.py ^
  --samples 30 ^
  --output_dir opcodes_test ^
  --overwrite
```

Then use it for DAPT:
```
python domain_adaptive_mlm.py ^
  --model_name_or_path distilbert-base-uncased ^
  --train_files opcodes_test/*.opcodes.txt ^
  --output_dir ./opcode_DistilBERT_test ^
  --overwrite_output_dir
```
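Once a checkpoint has been adapted, the embedding scripts above use it to represent whole opcode sequences. A common approach, sketched here with a randomly initialized stand-in model so the example runs without downloads (the real scripts load the trained checkpoints; vocabulary and model widths below are placeholders), is to mean-pool the last hidden states and compare sequences by cosine similarity:

```python
import tempfile

import torch
from transformers import BertConfig, BertModel, BertTokenizerFast

# Toy opcode vocabulary; a real run would load the trained tokenizer.
vocab = ["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]",
         "mov", "push", "pop", "call", "ret", "add", "jmp"]
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as f:
    f.write("\n".join(vocab))
    vocab_path = f.name
tokenizer = BertTokenizerFast(vocab_file=vocab_path, do_lower_case=False)

# Randomly initialized stand-in; with a trained checkpoint you would use
# e.g. AutoModel.from_pretrained("./opcode_DistilBERT_test") instead.
model = BertModel(BertConfig(vocab_size=tokenizer.vocab_size,
                             num_hidden_layers=6, hidden_size=128,
                             num_attention_heads=4, intermediate_size=256))
model.eval()  # disable dropout so embeddings are deterministic

def embed(sequence: str) -> torch.Tensor:
    """Mean-pool the last hidden states into one vector per opcode sequence."""
    enc = tokenizer(sequence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**enc).last_hidden_state  # (1, seq_len, hidden)
    return hidden.mean(dim=1).squeeze(0)

a = embed("push mov call ret")
b = embed("push mov call ret")
sim = torch.nn.functional.cosine_similarity(a, b, dim=0)
print(round(sim.item(), 3))  # identical sequences embed identically -> 1.0
```

The SBERT-based scripts (`train_bert_embeddings.py`, `tune_all-MiniLM-L6-v2.py`) train pooling of this kind so that semantically similar code blocks land close together in the embedding space.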