
Tackle the "stochastic" issue while training: taking the difficulty of an episode into account. #23

Closed
felixchalumeau opened this issue Jul 3, 2020 · 2 comments

@felixchalumeau
Contributor

The title might not be explicit enough, so let me explain this point.

When training an agent to play MountainCar, CartPole, CarRacing, etc., the best scores it can get are fairly similar from one episode to the next. With the generator we have (at least for graph coloring at the moment), the difficulty can vary a lot from one episode to another (in terms of the number of nodes visited). This can make a good policy appear bad, which we should avoid!

A rather simple solution would be to run a simple heuristic on each instance to get an estimate of the current episode's difficulty (it is indeed very hard to control the difficulty from the generator's side).
This would not be supervised learning at all: we are not trying to imitate the heuristic, we are just using it to give us more information about what is happening.
The time spent on the heuristic search will not make our experiments explode, since the training process dominates the overall time consumption.

This is a first basic idea, which can certainly be refined later! A rough sketch of what I have in mind is below.
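To make the idea concrete, here is a minimal sketch of how a heuristic baseline could normalize episode scores. This is an illustration only, not the project's API: `agent_nodes`, `heuristic_nodes`, and the ratio-based scaling are all assumptions.

```python
def normalized_nodes(agent_nodes: int, heuristic_nodes: int) -> float:
    """Scale the agent's node count by a cheap heuristic baseline.

    `heuristic_nodes` is the number of nodes a simple heuristic
    (e.g. a greedy branching rule) visits on the same instance,
    used purely as a difficulty estimate for the episode.
    """
    # A ratio below 1.0 means the agent did better than the baseline;
    # dividing makes scores comparable across easy and hard instances.
    return agent_nodes / max(heuristic_nodes, 1)
```

With a ratio like this, a good policy that happens to draw a hard instance would still score close to the baseline instead of looking artificially bad.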

@felixchalumeau felixchalumeau self-assigned this Jul 3, 2020
@ilancoulon
Contributor

Couldn't the reward actually be the difference between the simple heuristic and the LearnedHeuristic, i.e. what is called "Delta" in the display during training?
I think that would be better, since that Delta is what we actually track.

I did like the fact that it could reach approximately the same level as the simple heuristic without having any information about it, though.
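For concreteness, a minimal sketch of that Delta-style reward (hypothetical names; the training display computes its own Delta):

```python
def delta_reward(heuristic_nodes: int, learned_nodes: int) -> int:
    # Positive when the LearnedHeuristic visited fewer nodes than the
    # simple heuristic on the same instance, negative when it did worse.
    # Using this difference as the reward bakes the per-instance
    # difficulty baseline directly into the training signal.
    return heuristic_nodes - learned_nodes
```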

@ilancoulon ilancoulon added this to the v0.1 milestone Jul 8, 2020
@3rdCore
Collaborator

3rdCore commented May 5, 2022

@louis-gautier this is about what we discussed yesterday: how to normalize the reward across the distribution of problem instances.
