The title might not be explicit enough, so let me explain this point.
When training an agent to play MountainCar, CartPole, CarRacing, etc., the best scores it can get are rather similar from one episode to another. With the generator we have (at least for graph coloring at the moment), the difficulty can vary a lot from one episode to another (in terms of the number of nodes visited). This can make a good policy appear bad, which we should avoid!
A rather simple solution would be to use a simple heuristic to give insight into the current episode (it is indeed very hard to control the difficulty from the generator's point of view).
This would not be supervised learning at all, as we are not trying to imitate what the heuristic does; we are just using the heuristic to give us more information about what is happening.
The time spent on the heuristic search will not make our experiments blow up, since the training process is much bigger in terms of time consumption.
This is a first basic idea which can certainly be refined later!
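Here is a small sketch of what this could look like, in Python and with hypothetical names (`heuristic_solve`, `agent_solve`, and the returned keys are placeholders, not the project's actual API): run the cheap hand-coded heuristic on the same generated instance and use its node count as a baseline to put the agent's raw score in perspective.

```python
from typing import Any, Callable, Dict

def evaluate_episode(
    instance: Any,
    heuristic_solve: Callable[[Any], int],  # nodes visited by the simple heuristic
    agent_solve: Callable[[Any], int],      # nodes visited by the learned policy
) -> Dict[str, float]:
    baseline_nodes = heuristic_solve(instance)
    agent_nodes = agent_solve(instance)
    return {
        "baseline_nodes": float(baseline_nodes),
        "agent_nodes": float(agent_nodes),
        # The difference filters out the instance-to-instance difficulty
        # swings coming from the generator.
        "delta": float(agent_nodes - baseline_nodes),
        "relative": agent_nodes / max(baseline_nodes, 1),
    }
```

The `delta` (or the ratio) is what we would report instead of the raw node count, so that a hard instance does not make the policy look worse than it actually is.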
The reward could actually be the difference between the simple heuristic and the LearnedHeuristic, i.e. what is called "Delta" in the display during training?
That would be better, I think, since what we actually track is that Delta.
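Concretely, and still with hypothetical names rather than the project's actual API, the terminal reward for an episode could simply be that Delta:

```python
# Hedged sketch: use the "Delta" as the terminal reward of the episode.
# Positive when the learned policy visits fewer nodes than the simple
# heuristic on the same instance, negative otherwise.
def delta_reward(baseline_nodes: int, agent_nodes: int) -> float:
    return float(baseline_nodes - agent_nodes)
```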
I liked the fact that it could reach approximately the same level as the minimum heuristic without having any information about it, though.