The judge prefers original conversations over generated ones

While trying to train a model that scores above 0.45 with the judge, I found a serious issue with the scoring system.
It turns out that the judge prefers original conversations over generated ones.
To prove this, I made the following changes to the code.
In the `scoring.judge_score.get_judge_score` function, I swapped the original assistant message with the generated response (here: https://github.com/impel-intelligence/dippy-bittensor-subnet/blob/d6b3fd41ec52334efd93785bc23e90ba3c19ae0d/scoring/judge_score.py#L334):

```python
for idx, output in enumerate(outputs):
    generated_text = output.outputs[0].text

    # Create complete conversations, with the two responses deliberately swapped.
    # The "generated" conversation gets the original assistant reply...
    generated_conversation = original_messages_list[idx].copy()
    generated_conversation.append({"role": "assistant", "content": last_assistant_responses[idx]})
    generated_conversations.append(generated_conversation)

    # ...and the "original" conversation gets the model's generated text.
    original_conversation = original_messages_list[idx].copy()
    original_conversation.append({"role": "assistant", "content": generated_text})
    original_conversations.append(original_conversation)
```

So the generated conversation now contains `last_assistant_responses[idx]` (the original reply), and the original conversation contains `generated_text`.
Because of this change, I also had to update the `scoring.judge_score.collect_judge_scores` function (here: https://github.com/impel-intelligence/dippy-bittensor-subnet/blob/d6b3fd41ec52334efd93785bc23e90ba3c19ae0d/scoring/judge_score.py#L262). `win_rate` now has to be computed from `total_original`, because the conversations labeled as original now hold the generated responses:

```python
win_rate = (total_original) / (valid * 3) if valid > 0 else 0
```
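To make the arithmetic explicit, here is a minimal standalone sketch of that computation. It assumes, based only on the `valid * 3` denominator, that each valid sample contributes up to three judge comparisons. The function name `swapped_win_rate` and the `comparisons_per_sample` parameter are hypothetical and do not exist in the repository, and the numbers in the usage line are made up for illustration.

```python
def swapped_win_rate(total_original: int, valid: int, comparisons_per_sample: int = 3) -> float:
    """Win rate of the generated text when it sits in the 'original' slot.

    total_original: wins the judge gave to the conversations labeled as original
                    (which hold the generated responses after the swap).
    valid:          number of samples the judge scored successfully.
    comparisons_per_sample: assumed number of judge comparisons per sample.
    """
    if valid <= 0:
        # Mirror the original guard: no valid samples means a win rate of 0.
        return 0.0
    return total_original / (valid * comparisons_per_sample)


# Illustrative numbers only: 216 wins over 100 valid samples.
example_rate = swapped_win_rate(216, 100)
print(f"{example_rate:.2f}")
```

If the judge were indifferent to which slot a response is in, this value should come out close to the unswapped `win_rate` for the same model.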

I evaluated my model from the leaderboard (`itorgov/dippy-roleplay-1742956655-193003`) twice. The first run gave a judge score of `0.7194` and the second gave `0.7403`, while the leaderboard shows `0.4511` for the same model.
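The size of the gap can be worked out directly from these three numbers. The snippet below is only a rough comparison: it assumes the leaderboard score and my two runs are comparable, although they may have been computed on different samples.

```python
from statistics import mean

leaderboard_score = 0.4511          # generated text scored in the "generated" slot
swapped_scores = [0.7194, 0.7403]   # generated text scored in the "original" slot

swapped_mean = mean(swapped_scores)
gap = swapped_mean - leaderboard_score

print(f"swapped mean: {swapped_mean:.4f}")
print(f"gap vs leaderboard: {gap:.4f}")
```

The same model output gains roughly 0.28 in judge score just by being placed in the slot labeled as original.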


I think this is a serious issue that needs to be fixed as soon as possible; otherwise it won't be possible to train a really good model.
Please try to reproduce this and share your results or opinion.
