Description
This was discussed and confirmed in [https://discord.com/channels/698080905209577513/702060196222205962/1468594043851374604]
Assume the grid has only one redispatchable generator with a maximum ramp up/down of +5/-5. According to the docs, the action space is then [-5, 5] [https://grid2op.readthedocs.io/en/latest/mdp.html#modeling-sequential-decisions]
If the agent outputs an action like [6.0] (which can happen when the agent is a neural network), the way the reward kernel behaves no longer matches the MDP notation.
The action is processed and the "do-nothing action" is performed instead.
However, some reward functions available in Grid2Op then give a "-1.0" reward signal because the agent asked for an illegal action.
In the notation of the reward kernel, which of these is the action "a"?
- Case A: the illegal, out-of-action-space [6.0],
- Case B: the "do-nothing action" (which replaces the [6.0]).
I see a contradiction here:
If it's Case A, the reward kernel is processing an action (or action vector) that doesn't belong to the action space of the environment (it is outside [-5, 5]). Is that acceptable?
If it's Case B, it doesn't make sense either, because the kernel now returns -1.0 for a do-nothing action. That would not happen if the agent had sent the real do-nothing action in the first place (it would not be treated as illegal, hence no -1.0). So does that mean something outside the reward kernel decides the -1.0 penalty for illegal actions?
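To make the contradiction concrete, here is a minimal toy sketch in plain Python. It does not use the Grid2Op API: `toy_env_step`, the constants, and the encoding of do-nothing as 0.0 are all hypothetical, chosen only to mirror the one-generator example above.

```python
from typing import Tuple

# Assumptions for illustration only (not Grid2Op API):
RAMP_MIN, RAMP_MAX = -5.0, 5.0   # ramp limits from the example above
DO_NOTHING = 0.0                 # toy encoding of the do-nothing action


def toy_env_step(requested: float) -> Tuple[float, float]:
    """Return (executed_action, reward) for a toy one-generator grid."""
    is_illegal = not (RAMP_MIN <= requested <= RAMP_MAX)
    # An illegal request is replaced by do-nothing before it is executed.
    executed = DO_NOTHING if is_illegal else requested
    # Toy reward: penalise only the fact that the request was illegal.
    reward = -1.0 if is_illegal else 0.0
    return executed, reward


executed_a, reward_a = toy_env_step(6.0)  # out-of-range request
executed_b, reward_b = toy_env_step(0.0)  # genuine do-nothing
```

Both calls execute the same action, yet the rewards differ, so in this toy setting the reward cannot be written as a function of the state and the executed action alone.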
Possible solution
The reward kernel should be a function that also takes some flags from the environment (like "is_ambiguous", "is_illegal", etc.), which are not part of the current formulation.
So we might have:
final_action, is_ambiguous, is_illegal = translate_action(action_vector_from_agent)
RewardKernel(s, final_action, is_ambiguous, is_illegal)
final_action is usually the do-nothing action
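A rough Python sketch of this proposal, under stated assumptions: `translate_action` and `reward_kernel` are toy functions written for this issue, not existing Grid2Op functions. Treating a non-finite value as "ambiguous" and using 0.0 as the regular reward are placeholders only.

```python
import math
from typing import Tuple

# Assumptions for illustration only (not Grid2Op API):
RAMP_MIN, RAMP_MAX = -5.0, 5.0   # ramp limits of the single generator
DO_NOTHING = 0.0                 # toy encoding of the do-nothing action


def translate_action(requested: float) -> Tuple[float, bool, bool]:
    """Map the agent's raw output to (final_action, is_ambiguous, is_illegal)."""
    # Toy stand-in for ambiguity: the value is NaN or infinite.
    is_ambiguous = not math.isfinite(requested)
    is_illegal = (not is_ambiguous) and not (RAMP_MIN <= requested <= RAMP_MAX)
    final_action = DO_NOTHING if (is_ambiguous or is_illegal) else requested
    return final_action, is_ambiguous, is_illegal


def reward_kernel(state: float, final_action: float,
                  is_ambiguous: bool, is_illegal: bool) -> float:
    """Reward that depends explicitly on the flags, not only on (s, a)."""
    if is_ambiguous or is_illegal:
        return -1.0
    # Placeholder for the regular reward; state and final_action are unused here.
    return 0.0


state = 0.0  # dummy state
penalised = reward_kernel(state, *translate_action(6.0))
regular = reward_kernel(state, *translate_action(0.0))
```

With the flags passed in explicitly, the -1.0 for the out-of-range request and the regular reward for a genuine do-nothing both come from the same kernel, even though `final_action` is identical in the two cases.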