💪 Scripts for pretraining BERT, for opcodes.

malwareuniverse/pretraining

Pretraining

Description

This repository contains all scripts used to create the models, which were trained on the Opcodes dataset.

Scripts

  • pretrain_bert.py: Handles end-to-end pre-training of a BERT model for opcode sequences. Key features: custom tokenization that loads a dedicated opcode_vocab.txt to handle assembly syntax; a lightweight, 6-layer BERT configuration; Masked Language Modeling (MLM) training with a 15% masking probability; and automated data splitting (90/10 train/test), tokenization, training, and model saving.

  • train_bert_embeddings.py: Transforms a model previously trained with Masked Language Modeling (MLM) into an embedding model that captures the semantic similarity between blocks of code.

  • domain_adaptive_mlm.py: Generic domain-adaptive pretraining (DAPT) for any MLM-capable checkpoint. This continues training from an existing model via AutoModelForMaskedLM.from_pretrained(...) and supports custom train/eval files, tokenizer override, masking settings, and Trainer hyperparameters.

  • tune_distil_bert.py: DistilBERT-focused wrapper around domain_adaptive_mlm.py. By default it performs DAPT starting from distilbert-base-uncased on opcodes_300/*.opcodes.txt.

  • generate_opcode_test_corpus.py: Creates a small synthetic test corpus of .opcodes.txt files by generating C samples, compiling them, and extracting opcode sequences from disassembly.

  • tune_all-MiniLM-L6-v2.py: Trains a Sentence-BERT (SBERT) model tailored for analyzing assembly opcodes, producing semantic embeddings of code sequences.
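The model built by pretrain_bert.py can be sketched roughly as below. Only the 6-layer depth and 15% masking probability are stated in this README; the hidden size, attention heads, and vocabulary size here are illustrative assumptions, not values taken from the script:

```python
from transformers import BertConfig, BertForMaskedLM

# Lightweight 6-layer BERT, as pretrain_bert.py describes.
# hidden_size, num_attention_heads and vocab_size are assumptions.
config = BertConfig(
    vocab_size=512,           # size of opcode_vocab.txt (assumed)
    num_hidden_layers=6,      # lightweight 6-layer depth (stated)
    hidden_size=384,          # must be divisible by num_attention_heads
    num_attention_heads=6,
    intermediate_size=1536,
    max_position_embeddings=512,
)
model = BertForMaskedLM(config)  # randomly initialized, ready for MLM pretraining
```

A small opcode vocabulary keeps the embedding matrix tiny, which is why a 6-layer configuration is enough for this domain.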

Domain-Adaptive Pretraining

Use the generic script directly when you want to continue pretraining any MLM model on your domain corpus:

python domain_adaptive_mlm.py ^
  --model_name_or_path distilbert-base-uncased ^
  --train_files opcodes_300/*.opcodes.txt ^
  --output_dir ./opcode_DistilBERT ^
  --overwrite_output_dir ^
  --num_train_epochs 5 ^
  --per_device_train_batch_size 16

Example with another model checkpoint:

python domain_adaptive_mlm.py ^
  --model_name_or_path bert-base-uncased ^
  --train_files opcodes_300/*.opcodes.txt ^
  --output_dir ./opcode_BERT_dapt
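The masking settings mentioned above follow the standard BERT MLM recipe (as implemented by Hugging Face's DataCollatorForLanguageModeling): roughly 15% of tokens are selected, and of those, 80% are replaced with the mask token, 10% with a random token, and 10% left unchanged. A minimal pure-Python sketch of that selection rule, with hypothetical token IDs:

```python
import random

MASK_ID = 0          # hypothetical [MASK] token id
VOCAB_SIZE = 512     # hypothetical vocabulary size

def mask_tokens(input_ids, mlm_probability=0.15, rng=random):
    """BERT-style MLM masking: pick ~15% of positions; of those,
    80% become [MASK], 10% a random token, 10% stay unchanged.
    Labels are -100 (ignored by the loss) at unmasked positions."""
    masked, labels = [], []
    for tok in input_ids:
        if rng.random() < mlm_probability:
            labels.append(tok)               # the model must predict this token
            r = rng.random()
            if r < 0.8:
                masked.append(MASK_ID)       # 80%: replace with [MASK]
            elif r < 0.9:
                masked.append(rng.randrange(VOCAB_SIZE))  # 10%: random token
            else:
                masked.append(tok)           # 10%: keep the original token
        else:
            masked.append(tok)
            labels.append(-100)              # no loss at this position
    return masked, labels

rng = random.Random(0)
inputs, labels = mask_tokens(list(range(1, 101)), rng=rng)
```

In practice the collator does this per batch on tensors; the loop above only illustrates the per-token decision.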

Create Test Opcode Data

python generate_opcode_test_corpus.py ^
  --samples 30 ^
  --output_dir opcodes_test ^
  --overwrite
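If you only need files in the expected layout without a C toolchain, the corpus shape can be mimicked directly. The opcode mnemonics and the one-space-separated-sequence-per-line layout below are assumptions about what the generator emits, not its actual output (the real script derives sequences from disassembled compiled C samples):

```python
import random
from pathlib import Path

# Hypothetical x86 mnemonics; the real corpus comes from disassembly.
OPCODES = ["push", "mov", "sub", "lea", "call", "test", "jne", "add", "pop", "ret"]

def write_fake_corpus(output_dir, samples=5, seed=0):
    """Write <samples> files named sample_XXX.opcodes.txt, each holding
    a few space-separated opcode sequences, one per line (assumed format)."""
    rng = random.Random(seed)
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)
    for i in range(samples):
        lines = [
            " ".join(rng.choices(OPCODES, k=rng.randint(20, 60)))
            for _ in range(rng.randint(3, 8))
        ]
        (out / f"sample_{i:03d}.opcodes.txt").write_text("\n".join(lines) + "\n")
    return sorted(out.glob("*.opcodes.txt"))

files = write_fake_corpus("opcodes_fake", samples=5)
```

Such files match the `*.opcodes.txt` glob used by the DAPT commands, so they can stand in for smoke tests of the training pipeline.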

Then use it for DAPT:

python domain_adaptive_mlm.py ^
  --model_name_or_path distilbert-base-uncased ^
  --train_files opcodes_test/*.opcodes.txt ^
  --output_dir ./opcode_DistilBERT_test ^
  --overwrite_output_dir
