💪 Scripts for pretraining BERT, for opcodes.

malwareuniverse/pretraining

Pretraining

Description

This repository contains all scripts used to create the models, which were trained on the Opcodes dataset.

Scripts

  • pretrain_bert.py: Handles end-to-end pre-training of a BERT model for opcode sequences. Key features: custom tokenization that loads a dedicated opcode_vocab.txt to handle assembly syntax; a lightweight, 6-layer BERT configuration; Masked Language Modeling (MLM) training with a 15% masking probability; and automated data splitting (90/10 train/test), tokenization, training, and model saving.

  • train_bert_embeddings.py: Transforms a model previously trained with Masked Language Modeling (MLM) into an embedding model that captures the semantic similarity between blocks of code.

  • domain_adaptive_mlm.py: Generic domain-adaptive pretraining (DAPT) for any MLM-capable checkpoint. This continues training from an existing model via AutoModelForMaskedLM.from_pretrained(...) and supports custom train/eval files, tokenizer override, masking settings, and Trainer hyperparameters.

  • tune_distil_bert.py: DistilBERT-focused wrapper around domain_adaptive_mlm.py. By default it performs DAPT starting from distilbert-base-uncased on opcodes_300/*.opcodes.txt.

  • generate_opcode_test_corpus.py: Creates a small synthetic test corpus of .opcodes.txt files by generating C samples, compiling them, and extracting opcode sequences from disassembly.

  • tune_all-MiniLM-L6-v2.py: Trains a Sentence-BERT (SBERT) model tailored for analyzing assembly opcodes, producing semantic embeddings of code sequences.
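The model built by pretrain_bert.py can be sketched roughly as below. Only the 6-layer depth and 15% masking probability are stated in this README; the hidden size, attention heads, and vocabulary size here are illustrative assumptions, not values taken from the script:

```python
from transformers import BertConfig, BertForMaskedLM

# Lightweight 6-layer BERT, as pretrain_bert.py describes.
# hidden_size, num_attention_heads and vocab_size are assumptions.
config = BertConfig(
    vocab_size=512,           # size of opcode_vocab.txt (assumed)
    num_hidden_layers=6,      # lightweight 6-layer depth (stated)
    hidden_size=384,          # must be divisible by num_attention_heads
    num_attention_heads=6,
    intermediate_size=1536,
    max_position_embeddings=512,
)
model = BertForMaskedLM(config)  # randomly initialized, ready for MLM pretraining
```

A small opcode vocabulary keeps the embedding matrix tiny, which is why a 6-layer configuration is enough for this domain.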

Domain-Adaptive Pretraining

Use the generic script directly when you want to continue pretraining any MLM model on your domain corpus:

python domain_adaptive_mlm.py ^
  --model_name_or_path distilbert-base-uncased ^
  --train_files opcodes_300/*.opcodes.txt ^
  --output_dir ./opcode_DistilBERT ^
  --overwrite_output_dir ^
  --num_train_epochs 5 ^
  --per_device_train_batch_size 16

Example with another model checkpoint:

python domain_adaptive_mlm.py ^
  --model_name_or_path bert-base-uncased ^
  --train_files opcodes_300/*.opcodes.txt ^
  --output_dir ./opcode_BERT_dapt
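The masking settings mentioned above follow the standard BERT MLM recipe (as implemented by Hugging Face's DataCollatorForLanguageModeling): roughly 15% of tokens are selected, and of those, 80% are replaced with the mask token, 10% with a random token, and 10% left unchanged. A minimal pure-Python sketch of that selection rule, with hypothetical token IDs:

```python
import random

MASK_ID = 0          # hypothetical [MASK] token id
VOCAB_SIZE = 512     # hypothetical vocabulary size

def mask_tokens(input_ids, mlm_probability=0.15, rng=random):
    """BERT-style MLM masking: pick ~15% of positions; of those,
    80% become [MASK], 10% a random token, 10% stay unchanged.
    Labels are -100 (ignored by the loss) at unmasked positions."""
    masked, labels = [], []
    for tok in input_ids:
        if rng.random() < mlm_probability:
            labels.append(tok)               # the model must predict this token
            r = rng.random()
            if r < 0.8:
                masked.append(MASK_ID)       # 80%: replace with [MASK]
            elif r < 0.9:
                masked.append(rng.randrange(VOCAB_SIZE))  # 10%: random token
            else:
                masked.append(tok)           # 10%: keep the original token
        else:
            masked.append(tok)
            labels.append(-100)              # no loss at this position
    return masked, labels

rng = random.Random(0)
inputs, labels = mask_tokens(list(range(1, 101)), rng=rng)
```

In practice the collator does this per batch on tensors; the loop above only illustrates the per-token decision.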

Create Test Opcode Data

python generate_opcode_test_corpus.py ^
  --samples 30 ^
  --output_dir opcodes_test ^
  --overwrite
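If you only need files in the expected layout without a C toolchain, the corpus shape can be mimicked directly. The opcode mnemonics and the one-space-separated-sequence-per-line layout below are assumptions about what the generator emits, not its actual output (the real script derives sequences from disassembled compiled C samples):

```python
import random
from pathlib import Path

# Hypothetical x86 mnemonics; the real corpus comes from disassembly.
OPCODES = ["push", "mov", "sub", "lea", "call", "test", "jne", "add", "pop", "ret"]

def write_fake_corpus(output_dir, samples=5, seed=0):
    """Write <samples> files named sample_XXX.opcodes.txt, each holding
    a few space-separated opcode sequences, one per line (assumed format)."""
    rng = random.Random(seed)
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)
    for i in range(samples):
        lines = [
            " ".join(rng.choices(OPCODES, k=rng.randint(20, 60)))
            for _ in range(rng.randint(3, 8))
        ]
        (out / f"sample_{i:03d}.opcodes.txt").write_text("\n".join(lines) + "\n")
    return sorted(out.glob("*.opcodes.txt"))

files = write_fake_corpus("opcodes_fake", samples=5)
```

Such files match the `*.opcodes.txt` glob used by the DAPT commands, so they can stand in for smoke tests of the training pipeline.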

Then use it for DAPT:

python domain_adaptive_mlm.py ^
  --model_name_or_path distilbert-base-uncased ^
  --train_files opcodes_test/*.opcodes.txt ^
  --output_dir ./opcode_DistilBERT_test ^
  --overwrite_output_dir
