Description
Hello, I have some questions regarding my fine-tuning results. I fine-tuned the Llama-3-8B-Instruct model for 18,010 steps with all-MiniLM-L6-v2 as the encoder. However, the generation results during testing are extremely poor: the model outputs in generation_result.txt are completely incoherent. In contrast, the model's outputs during the training phase looked relatively normal. Why is this happening?
My training command is as follows:
```
setsid env WANDB_MODE=offline PYTHONUNBUFFERED=1 accelerate launch --multi_gpu --num_processes 4 /data2/home/KBLaM_LLama3_8B/experiments/train.py \
    --dataset_dir datasets \
    --train_dataset "synthetic" \
    --N 120000 \
    --B 16 \
    --hf_model_spec "/data2/home/models/Meta-Llama-3-8B-Instruct" \
    --encoder_spec all-MiniLM-L6-v2 \
    --model_save_dir "/data2/home/KBLaM_LLama3_8B/output/test4" \
    --hf_token "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" \
    --sep_query_head \
    --llm_type "llama3" \
    --kb_size 500 \
    --total_steps 20010 \
    --use_cached_embd \
    --use_data_aug \
    --gradient_accm_step 32 > train4.log 2>&1 &
```
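Since --use_cached_embd is set, train.py reads precomputed key/value embeddings rather than encoding the knowledge base on the fly. For reference, I built the cached .npy files with all-MiniLM-L6-v2 roughly along these lines (a minimal sketch, not the repo's actual embedding script; the "key_string"/"description" field names are my assumption about the synthetic dataset's schema):

```python
# Rough sketch of how the cached KB embeddings were produced.
# Assumptions: datasets/synthetic.json is a JSON list whose entries carry
# "key_string" and "description" fields; the real schema may differ.
import json

import numpy as np
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # 384-dim sentence embeddings

with open("datasets/synthetic.json") as f:
    kb = json.load(f)

keys = [entry["key_string"] for entry in kb]     # e.g. "the description of Velvet Pulse"
values = [entry["description"] for entry in kb]  # the property text itself

key_embds = encoder.encode(keys, show_progress_bar=True)      # shape (N, 384)
value_embds = encoder.encode(values, show_progress_bar=True)  # shape (N, 384)

np.save("datasets/synthetic_all-MiniLM-L6-v2_embd_key.npy", key_embds)
np.save("datasets/synthetic_all-MiniLM-L6-v2_embd_value.npy", value_embds)
```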
My testing command is as follows:
```
CUDA_VISIBLE_DEVICES=1 python /data2/home/KBLaM_LLama3_8B/experiments/eval.py generation \
    --dataset_dir datasets \
    --test_dataset "synthetic.json" \
    --encoder_dir "/data2/home/KBLaM_LLama3_8B/output/test4/stage1_lr_0.0001KBTokenLayerFreq3UseOutlier1KBSize500SepQueryHeadUseDataAugKeyFromkey_all-MiniLM-L6-v2_synthetic_llama3_step_18000_encoder/encoder.pt" \
    --encoder_spec "all-MiniLM-L6-v2" \
    --model_dir "/data2/home/KBLaM_LLama3_8B/output/test4/stage1_lr_0.0001KBTokenLayerFreq3UseOutlier1KBSize500SepQueryHeadUseDataAugKeyFromkey_all-MiniLM-L6-v2_synthetic_llama3_step_18000" \
    --llm_base_dir "/data2/home/models/Meta-Llama-3-8B-Instruct" \
    --llm_type "llama3" \
    --save_dir "/data2/home/KBLaM_LLama3_8B/eval_results/generationtest4" \
    --kb_size 200 \
    --eval_mode kb \
    --kb_token_layer_frequency 3 \
    --precomputed_embed_keys_path "/data2/home/KBLaM_LLama3_8B/datasets/synthetic_all-MiniLM-L6-v2_embd_key.npy" \
    --precomputed_embed_values_path "/data2/home/KBLaM_LLama3_8B/datasets/synthetic_all-MiniLM-L6-v2_embd_value.npy"
```
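One mismatch I noticed while writing this up: training ran with --kb_size 500 while eval uses --kb_size 200, and eval loads the step-18000 checkpoint rather than one at the final step 20010; I don't know whether either of these matters. Before evaluating, I ruled out a corrupted embedding cache with a throwaway check like the one below (all-MiniLM-L6-v2 should give 384-dim vectors; this snippet is mine, not part of KBLaM):

```python
# Throwaway sanity check (not part of KBLaM): confirm the cached embeddings
# load, share a shape, and have the 384 dims that all-MiniLM-L6-v2 produces.
import numpy as np

keys = np.load("/data2/home/KBLaM_LLama3_8B/datasets/synthetic_all-MiniLM-L6-v2_embd_key.npy")
values = np.load("/data2/home/KBLaM_LLama3_8B/datasets/synthetic_all-MiniLM-L6-v2_embd_value.npy")

print(keys.shape, values.shape)  # expected: (N, 384) for both
assert keys.shape == values.shape, "key/value row counts disagree"
assert keys.shape[1] == 384, "dim mismatch: cache built with a different encoder?"
assert np.isfinite(keys).all() and np.isfinite(values).all(), "NaN/inf in cache"
```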
Here is an example of the model's output during training:
```
INFO INPUT IDs SHAPE: torch.Size([16, 48])   train.py:607
INFO KB SHAPE: torch.Size([16, 501, 45056])  train.py:636
INFO GT: <|end_header_id|> What description does Velvet Pulse have?<|eot_id|><|start_header_id|>assistant<|end_header_id|>The description of Velvet Pulse is a high-end audio equipment brand known for its superior sound quality.<|eot_id|><|eot_id|><|eot_id|><|eot_id|><|eot_id|><|eot_id|><|eot_id|><|eot_id|><|eot_id|><|eot_id|><|eot_id|><|eot_id|><|eot_id|><|eot_id|>  train.py:637
INFO PRED: def<|eot_id|><|end_header_id|> What insights does The V have?<|eot_id|><|start_header_id|>assistant<|end_header_id|>The description of Velvet Pulse is a luxury-end fashion equipment brand known for its exceptional sound quality.<|eot_id|><|eot_id|><|eot_id|><|eot_id|><|eot_id|><|eot_id|><|eot_id|><|eot_id|><|eot_id|><|eot_id|><|eot_id|><|eot_id|><|eot_id|><|eot_id|>  train.py:638
INFO step: 17820, loss: 0.6719280518591404  train.py:680
INFO step: 17821, loss: 0.6504949191585183  train.py:680
INFO step: 17822, loss: 0.6908198706805706  train.py:680
INFO step: 17823, loss: 0.670011792331934   train.py:680
INFO step: 17824, loss: 0.670219199731946   train.py:680
INFO step: 17825, loss: 0.6978795994073153  train.py:680
INFO step: 17826, loss: 0.6766274720430374  train.py:680
INFO step: 17827, loss: 0.6636326257139444  train.py:680
INFO step: 17828, loss: 0.6245416700839996  train.py:680
INFO step: 17829, loss: 0.6708870176225901  train.py:680
```
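For what it's worth, my understanding is that the training-time PRED above is teacher-forced (each token is predicted from the ground-truth prefix), while eval.py generates autoregressively, feeding the model its own previous tokens, so a model can look fine under teacher forcing and still degenerate during free-running generation. A toy illustration of the two decoding modes, using gpt2 as a stand-in (illustrative only, not KBLaM code):

```python
# Toy contrast between teacher-forced prediction (what the training log shows)
# and free-running generation (what eval.py exercises). gpt2 is a stand-in.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

text = "The description of Velvet Pulse is a high-end audio equipment brand"
ids = tok(text, return_tensors="pt").input_ids

with torch.no_grad():
    # Teacher forcing: position t is predicted from the ground-truth tokens
    # 0..t-1, so one bad prediction never contaminates the next step.
    logits = model(ids).logits
    forced = logits[0, :-1].argmax(-1)
    print("teacher-forced:", tok.decode(forced))

    # Free-running: each new token conditions on the model's own outputs,
    # so errors compound; this is where degenerate repetition shows up.
    out = model.generate(ids[:, :4], max_new_tokens=20, do_sample=False,
                         pad_token_id=tok.eos_token_id)
    print("free-running:  ", tok.decode(out[0]))
```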
Here is the model's output during the testing/generation phase:
```
Model output:
What is the most beautiful? most most beautiful most beautiful most beautiful? beautiful? beautiful most beautiful? beautiful? beautiful? beautiful most beautiful? beautiful? beautiful? beautiful? beautiful? beautiful? beautiful? beautiful? beautiful? beautiful? beautiful? beautiful? beautiful? beautiful? beautiful? beautiful? beautiful? beautiful? beautiful? beautiful? beautiful? beautiful? beautiful? beautiful? beautiful? beautiful? beautiful? beautiful? beautiful? beautiful? beautiful? beautiful? beautiful? beautiful? beautiful? beautiful? beautiful? beautiful? beautiful? beautiful? beautiful? beautiful? beautiful? beautiful? beautiful? beautiful? beautiful? beautiful? beautiful? beautiful? beautiful? beautiful? beautiful? beautiful? beautiful? beautiful? beautiful?
True answer: The objectives of HyperGlide Systems is study bee behavior, assess the impact of pesticides, and develop conservation strategies.
Model output:
the noble??
What a wonderful??
What a noble??
What a noble??
What a noble??
What a noble??
What a noble??
What a noble??
What a noble??
What a noble??
What a noble??
What a noble??
What a noble??
What a noble??
What a noble??
What a noble??
What a noble??
What a noble??
```