This repository was archived by the owner on Sep 21, 2025. It is now read-only.

The judge prefers original conversations over generated #163

@itorgov


While trying to train a model to reach a judge score above 0.45, I found a serious issue with the scoring system.
It turns out that the judge prefers original conversations over generated ones.
To prove this, I made the following changes to the code.
In the scoring.judge_score.get_judge_score function, I swapped the original assistant message with the generated response, starting at this line:

generated_conversation = original_messages_list[idx].copy()

The modified loop looks like this:

for idx, output in enumerate(outputs):
    generated_text = output.outputs[0].text

    # Create complete conversations (assistant replies deliberately swapped)
    # The "generated" conversation now ends with the original assistant reply
    generated_conversation = original_messages_list[idx].copy()
    generated_conversation.append({"role": "assistant", "content": last_assistant_responses[idx]})
    generated_conversations.append(generated_conversation)

    # The "original" conversation now ends with the model's generated reply
    original_conversation = original_messages_list[idx].copy()
    original_conversation.append({"role": "assistant", "content": generated_text})
    original_conversations.append(original_conversation)

So, the generated conversation now contains last_assistant_responses[idx] and the original conversation contains generated_text.
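
For anyone who wants to reproduce the swap without editing the loop by hand, here is a minimal sketch of the same idea as a standalone helper. The function name build_pair and its swap flag are hypothetical and not part of the repository; only the message layout (a list of role/content dicts) is taken from the snippet above.

```python
from typing import Dict, List, Tuple

Message = Dict[str, str]


def build_pair(
    context: List[Message],
    original_reply: str,
    generated_reply: str,
    swap: bool = False,
) -> Tuple[List[Message], List[Message]]:
    """Return (generated_conversation, original_conversation).

    With swap=False each reply goes into its own slot.
    With swap=True the replies trade places, as in the experiment above.
    """
    if swap:
        generated_slot_reply, original_slot_reply = original_reply, generated_reply
    else:
        generated_slot_reply, original_slot_reply = generated_reply, original_reply

    # list.copy() is shallow, which is enough here: we only append a new
    # message and never mutate the shared context messages.
    generated_conversation = context.copy()
    generated_conversation.append({"role": "assistant", "content": generated_slot_reply})

    original_conversation = context.copy()
    original_conversation.append({"role": "assistant", "content": original_slot_reply})

    return generated_conversation, original_conversation
```

Calling build_pair(context, original_reply, generated_reply, swap=True) gives the same pair of conversations as the modified loop, and the input context list is left unchanged.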
Because of the swap, I also had to update the scoring.judge_score.collect_judge_scores function at this line:

win_rate = (total_generated) / (valid * 3) if valid > 0 else 0

win_rate now has to be computed from total_original, because the original conversations now contain the generated responses:

win_rate = (total_original) / (valid * 3) if valid > 0 else 0
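
The bookkeeping can be made explicit with a small sketch of the formula quoted above. The function name win_rate and the points_per_sample parameter are hypothetical; the factor of 3 comes from the snippet, and reading it as the maximum number of points one side can collect per valid sample is an assumption.

```python
def win_rate(total_points: int, valid: int, points_per_sample: int = 3) -> float:
    """Share of the available points won by one slot; 0.0 if nothing was valid."""
    # Assumption: each valid sample contributes at most points_per_sample points.
    return total_points / (valid * points_per_sample) if valid > 0 else 0.0


# Unmodified code: the generated reply sits in the "generated" slot.
#   score = win_rate(total_generated, valid)
# After the swap: the generated reply sits in the "original" slot.
#   score = win_rate(total_original, valid)
```

In both cases the score follows the slot that holds the model's reply, so the two runs measure the same model output and differ only in which slot it occupies.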

I evaluated my model from the leaderboard (itorgov/dippy-roleplay-1742956655-193003) twice. The first run gave a judge score of 0.7194 and the second gave 0.7403, while the leaderboard shows 0.4511 for the same model.
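
To put a number on the difference, the sketch below averages the two swapped runs and subtracts the leaderboard score. It uses only the three figures reported above; the variable names are my own, and the comparison assumes the leaderboard evaluation and the local runs are otherwise comparable (same judge, similar samples), which has not been verified.

```python
leaderboard_score = 0.4511         # generated reply in the "generated" slot
swapped_scores = [0.7194, 0.7403]  # generated reply in the "original" slot

swapped_mean = sum(swapped_scores) / len(swapped_scores)
score_gap = swapped_mean - leaderboard_score

print(f"mean swapped score: {swapped_mean:.3f}")
print(f"gap vs. leaderboard: {score_gap:.3f}")
```

The same model output scores roughly 0.28 higher when it is presented in the "original" slot, which is hard to explain by response quality alone.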

I think this is a serious issue that has to be fixed as soon as possible. Otherwise it won't be possible to train a really good model.
Please try to reproduce this and tell me your results or opinion.
