Enable imitation learning #223
Conversation
…learned heuristic.
Good job in general and well commented; I think it fits well in the solver.
It is also possible to avoid duplicating the functions common to `SimpleLearnedHeuristic` and `SupervisedLearnedHeuristic`. I will add this in a later commit.
It would be nice to add tests checking that `helpSolution` is correctly set, in addition to testing reproducibility with the `rng`.
…rch/SeaPearl.jl into imitation
@@ -32,7 +32,7 @@ Flux = "0.11 - 0.12"
 JuMP = "0.21"
 LightGraphs = "1.3"
 MathOptInterface = "0.9"
-NNlib = "0.7"
+NNlib = "0.8.4"
Useful for working with ReinforcementLearning.jl in dev mode.
""" | ||
function (valueSelection::SupervisedLearnedHeuristic)(PHASE::Type{DecisionPhase}, model::CPModel, x::Union{Nothing,AbstractIntVar}) | ||
# domain change metrics, set before reward is updated | ||
set_metrics!(PHASE, valueSelection.search_metrics, model, x) |
Are we sure we want the metrics to be totally independent from the chosen value heuristic type?
I'm not sure I understand; what would you consider here?
        return valueSelection.eta_stable
    else
        steps_left = valueSelection.warmup_steps + valueSelection.decay_steps - step
        return valueSelection.eta_stable + steps_left / valueSelection.decay_steps * (valueSelection.eta_init - valueSelection.eta_stable)
Do we want a linear decay like this, or would an exponential one be better suited?
It would indeed be nice to be able to choose, but it doesn't seem to me to be absolutely essential at the moment.
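For reference, a small standalone sketch contrasting the two schedules; the warmup clamp, the default values, and the `eta_exponential` variant with its `decay_rate` are illustrative assumptions, not SeaPearl code:

```julia
# Linear schedule, mirroring the snippet above: eta goes from eta_init to
# eta_stable over decay_steps steps, after a warmup period.
function eta_linear(step; eta_init=1.0, eta_stable=0.1, warmup_steps=100, decay_steps=400)
    step <= warmup_steps && return eta_init
    step >= warmup_steps + decay_steps && return eta_stable
    steps_left = warmup_steps + decay_steps - step
    return eta_stable + steps_left / decay_steps * (eta_init - eta_stable)
end

# Hypothetical exponential alternative: same endpoints, but the guidance
# probability drops fast at first and flattens out afterwards.
function eta_exponential(step; eta_init=1.0, eta_stable=0.1, warmup_steps=100, decay_rate=0.01)
    step <= warmup_steps && return eta_init
    return eta_stable + (eta_init - eta_stable) * exp(-decay_rate * (step - warmup_steps))
end
```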
false_x = first(values(model.variables))
env = SeaPearl.get_observation!(lh, model, false_x)
It's great to do step-by-step tests like this; thank you very much for the time spent on this. It will ensure that the behavior of `SupervisedLearnedHeuristic` is as expected.
            valueSelection.helpSolution = solutions[1]
        end
    end
    reset_model!(model_duplicate)
Why do we reset the duplicated model?
This is probably unnecessary and should indeed be removed.
if !isnothing(model_duplicate.statistics.solutions)
    solutions = model_duplicate.statistics.solutions[model_duplicate.statistics.solutions .!= nothing]
    if length(solutions) >= 1
        valueSelection.helpSolution = solutions[1]
Why do we select the first solution found instead of the last one, which is supposed to be of better quality?
This could indeed be a much better approach!
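If we go that route, here is a minimal sketch of the idea; the `last_solution` helper and the example data are hypothetical, not part of the package:

```julia
# Hypothetical helper illustrating the suggestion: keep the last non-nothing
# solution instead of the first, since later solutions are usually better.
function last_solution(solutions::AbstractVector)
    found = filter(!isnothing, solutions)
    return isempty(found) ? nothing : found[end]
end

# With a run that improved its incumbent over time, the last entry wins:
last_solution([nothing, Dict("x" => 3), Dict("x" => 2)])   # Dict("x" => 2)
```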
end

# Reset the agent, useful for things like recurrent networks
Flux.reset!(valueSelection.agent)
I'm just wondering why we added this line in the first place.
I honestly have no idea; it did not seem to cause any problems, but I don't know whether it is useful.
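For context, a minimal standalone illustration of what `Flux.reset!` does for a recurrent layer (plain Flux usage with the 0.11/0.12-style constructor, unrelated to SeaPearl internals):

```julia
using Flux

# A recurrent layer keeps hidden state across calls; reset! restores the
# initial state, so the same input produces the same output again.
rnn = Flux.RNN(2, 3)
x = rand(Float32, 2)
y1 = rnn(x)          # first call updates the hidden state
y2 = rnn(x)          # second call depends on that updated state
Flux.reset!(rnn)     # clear the hidden state
y3 = rnn(x)
y1 == y3             # true: behaves like a fresh layer again
```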
Get the current value of `eta` (η), which is the probability for the solver to calculate and provide a classic CP-generated solution to the agent.
"""
function get_eta(valueSelection::SupervisedLearnedHeuristic)
The epsilon (ε) of the ε-greedy explorer represents the propensity of the agent to make random decisions when it acts, while the eta (η) represents the propensity of the heuristic to guide the agent in a supervised way, without letting it make a decision.
To me, these are distinct quantities that should not be linked.
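To make the distinction concrete, a toy sketch; all names, defaults, and the simulation itself are illustrative assumptions, with only the per-episode role of `eta` and the per-decision role of `epsilon` taken from this discussion:

```julia
using Random

# eta is drawn once per episode and decides whether a CP solution guides the
# agent; epsilon is drawn at every decision inside the explorer.
function simulate_episode(rng; eta=0.5, epsilon=0.1, n_decisions=4)
    guided = rand(rng) < eta            # InitializingPhase: compute helpSolution?
    return map(1:n_decisions) do _      # DecisionPhase, one entry per decision
        if guided
            :cp_solution                # imitate the precomputed CP solution
        elseif rand(rng) < epsilon
            :random                     # explorer picks a random value
        else
            :greedy                     # agent follows its learned policy
        end
    end
end

simulate_episode(MersenneTwister(0))    # a vector of 4 action symbols
```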
x = variable_heuristic(model)
v = lh(SeaPearl.DecisionPhase, model, x)
@test v == lh.helpSolution[x.id]
lh(SeaPearl.EndingPhase, model, :Feasible)
Is this line useful?
We have tried to keep the heuristic calls as close as possible to how they are made during a real solve, in order to detect possible problems in the future. However, I don't think that deleting it would influence the result of the test in the current state of the project and the heuristic.
I am still in favor of keeping this line in the test.
lh(SeaPearl.InitializingPhase, model)
@test isnothing(lh.helpSolution)
SeaPearl.reset_model!(model)

lh(SeaPearl.InitializingPhase, model)
@test isnothing(lh.helpSolution)
SeaPearl.reset_model!(model)

lh(SeaPearl.InitializingPhase, model)
@test isnothing(lh.helpSolution)
SeaPearl.reset_model!(model)

lh(SeaPearl.InitializingPhase, model)
@test !isnothing(lh.helpSolution)
SeaPearl.reset_model!(model)

lh(SeaPearl.InitializingPhase, model)
@test isnothing(lh.helpSolution)
Nice check using the `rng`.
From the last comments, it seems that we have merged the branch a bit early. I will work on a new small pull request to correct the few points mentioned.
This PR adds an imitation learning functionality in a new `SupervisedLearnedHeuristic`, which is an adaptation of `LearnedHeuristic` (whose name has changed to `SimpleLearnedHeuristic`).

At the `InitializingPhase`, i.e. at the beginning of each episode, and with a probability `eta`, the `SupervisedLearnedHeuristic` solves the instance using classic CP and adds the solution to its field `helpSolution`, which is set to `nothing` by default.

At the `DecisionPhase`:
i. if `helpSolution` is `nothing`, the agent takes an action following its policy: `action = valueSelection.agent(env)`;
ii. if `helpSolution` contains a solution, the agent takes the action corresponding to that solution (see the sketch below). At the end of the episode, the CP-obtained solution is reconstituted.

This approach is inspired by Gasse et al. (https://arxiv.org/abs/1906.01629) and allows the agent to learn from "good" examples from the very beginning of its training.
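A hedged sketch of the `DecisionPhase` branching described above; the toy struct and the `decide` helper are illustrative stand-ins, not the actual SeaPearl implementation, and only `helpSolution`, `agent`, and the solution lookup by variable id come from the PR itself:

```julia
# Toy stand-ins so the sketch is self-contained; the real types live in SeaPearl.jl.
struct ToySupervisedHeuristic
    agent            # callable policy: env -> value
    helpSolution     # nothing, or a Dict mapping variable ids to values
end

function decide(valueSelection::ToySupervisedHeuristic, env, x_id)
    if isnothing(valueSelection.helpSolution)
        return valueSelection.agent(env)          # i. follow the learned policy
    else
        return valueSelection.helpSolution[x_id]  # ii. imitate the CP solution
    end
end

# Usage with dummy data:
h = ToySupervisedHeuristic(env -> 0, Dict("x1" => 3))
decide(h, nothing, "x1")   # 3, imitating the CP-found assignment
```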