The judge prefers original conversations over generated #163
Description
While trying to train a model to reach a judge score above 0.45, I found a serious issue with the scoring system.
It turns out that the judge prefers original conversations over generated ones.
To prove this, I made the following changes to the code.
In the `scoring.judge_score.get_judge_score` function, I swapped the original assistant message with the generated response (`dippy-bittensor-subnet/scoring/judge_score.py`, line 334 at commit d6b3fd4):

    for idx, output in enumerate(outputs):
        generated_text = output.outputs[0].text
        # Create complete conversations
        # Swapped for this experiment: the "generated" slot gets the original reply...
        generated_conversation = original_messages_list[idx].copy()
        generated_conversation.append({"role": "assistant", "content": last_assistant_responses[idx]})
        generated_conversations.append(generated_conversation)
        # ...and the "original" slot gets the model's generated text.
        original_conversation = original_messages_list[idx].copy()
        original_conversation.append({"role": "assistant", "content": generated_text})
        original_conversations.append(original_conversation)

So the generated conversation now contains `last_assistant_responses[idx]` and the original conversation contains `generated_text`.
Because of this change, I also had to update the `scoring.judge_score.collect_judge_scores` function (`dippy-bittensor-subnet/scoring/judge_score.py`, line 262 at commit d6b3fd4). The original line is `win_rate = (total_generated) / (valid * 3) if valid > 0 else 0`. I changed it to use `total_original` to calculate `win_rate`, because the original conversations now contain the generated responses:

    win_rate = (total_original) / (valid * 3) if valid > 0 else 0

I evaluated my model from the leaderboard (itorgov/dippy-roleplay-1742956655-193003) twice. The first time I got a judge score of 0.7194, and the second time I got 0.7403, while on the leaderboard it has 0.4511.
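The win-rate formula and the size of the gap between the two setups can be sketched as follows. This is a minimal illustration, not the repository's code: `win_rate`, `points_per_sample`, and `slot_gap` are hypothetical names, and the meaning of the factor 3 is taken as-is from the formula above. The leaderboard run and my local runs may not have used identical samples, so the gap is only indicative.

```python
from statistics import mean
from typing import Sequence


def win_rate(total_wins: float, valid: int, points_per_sample: int = 3) -> float:
    """Same shape as the formula above: wins / (valid * 3), or 0 with no valid samples."""
    return total_wins / (valid * points_per_sample) if valid > 0 else 0.0


def slot_gap(normal_scores: Sequence[float], swapped_scores: Sequence[float]) -> float:
    """Mean score with the generated text in the 'original' slot minus the mean
    score with it in the 'generated' slot. A value near 0 would suggest the
    judge is indifferent to the slot."""
    return mean(swapped_scores) - mean(normal_scores)


# Scores reported in this issue: leaderboard (normal) vs. two swapped runs.
gap = slot_gap([0.4511], [0.7194, 0.7403])
print(f"score gap attributable to the slot: {gap:.4f}")
```

If the judge scored only the content of the responses, the same model should get roughly the same score in both slots.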
I think this is a serious issue that has to be fixed as soon as possible. Otherwise, it won't be possible to train a really good model.
Please try to reproduce and tell me your results/opinion.