
Enable imitation learning #223

Merged: 32 commits into master, May 6, 2022
Conversation

ziadelassal
Contributor

@ziadelassal ziadelassal commented May 4, 2022

This PR adds imitation learning functionality through a new SupervisedLearnedHeuristic, an adaptation of LearnedHeuristic (which has been renamed SimpleLearnedHeuristic).

  1. At each InitializingPhase, i.e. at the beginning of each episode, with probability eta the SupervisedLearnedHeuristic solves the instance using classic CP and stores the solution in its field helpSolution, which defaults to nothing.
  2. At each DecisionPhase:
    i. if helpSolution is nothing, the agent takes an action following its policy: action = valueSelection.agent(env);
    ii. if helpSolution contains a solution, the agent takes the action corresponding to that solution. At the end of the episode, the CP-obtained solution is thus reconstructed.

This approach is inspired by Gasse et al. (https://arxiv.org/abs/1906.01629) and lets the agent learn from "good" examples from the very beginning of its training.
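A minimal sketch of the DecisionPhase branching described above, purely for illustration: the function name and signature are assumptions, not the actual SeaPearl code.

```julia
# Sketch only: mirrors the two cases listed above.
function choose_value(valueSelection, model, x, env)
    if isnothing(valueSelection.helpSolution)
        # No pre-computed CP solution: the agent acts according to its policy.
        return valueSelection.agent(env)
    else
        # A CP solution was computed at InitializingPhase: replay the value it
        # assigns to the branching variable x.
        return valueSelection.helpSolution[x.id]
    end
end
```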

@ziadelassal ziadelassal requested a review from 3rdCore May 4, 2022 16:27
@ziadelassal ziadelassal marked this pull request as ready for review May 4, 2022 16:46
Collaborator

@gostreap gostreap left a comment

Good job overall, and well commented; I think it fits well in the solver.

That said, it should be possible to avoid duplicating the functions common to SimpleLearnedHeuristic and SupervisedLearnedHeuristic. I will add this in a later commit.

It would also be nice to add tests checking that helpSolution is set correctly, in addition to testing reproducibility with the rng.

src/CP/valueselection/learning/learnedheuristic.jl (outdated, resolved)
src/CP/valueselection/learning/learnedheuristic.jl (outdated, resolved)
src/experiment/launch_experiment.jl (outdated, resolved)
@@ -32,7 +32,7 @@ Flux = "0.11 - 0.12"
 JuMP = "0.21"
 LightGraphs = "1.3"
 MathOptInterface = "0.9"
-NNlib = "0.7"
+NNlib = "0.8.4"
Collaborator

Useful to work with ReinforcementLearning.jl in dev mode

"""
function (valueSelection::SupervisedLearnedHeuristic)(PHASE::Type{DecisionPhase}, model::CPModel, x::Union{Nothing,AbstractIntVar})
# domain change metrics, set before reward is updated
set_metrics!(PHASE, valueSelection.search_metrics, model, x)
Collaborator

Are we sure we want the metrics to be totally independent of the chosen value heuristic type?

Collaborator

I'm not sure I understand; what did you have in mind here?

return valueSelection.eta_stable
else
steps_left = valueSelection.warmup_steps + valueSelection.decay_steps - step
return valueSelection.eta_stable + steps_left / valueSelection.decay_steps * (valueSelection.eta_init - valueSelection.eta_stable)
Collaborator

Do we want a linear decay like this, or would an exponential one be better suited?

Collaborator

It would indeed be nice to be able to choose, but it doesn't seem to me to be absolutely essential at the moment.
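For reference, here is a minimal sketch of what an exponential decay could look like, reusing the same fields (eta_init, eta_stable, warmup_steps, decay_steps); it is an illustration only, not code from this PR.

```julia
# Hypothetical exponential alternative to the linear decay above (sketch only).
function get_eta_exponential(valueSelection, step)
    if step <= valueSelection.warmup_steps
        return valueSelection.eta_init
    end
    # Geometric decay from eta_init toward eta_stable after the warmup phase.
    decay = exp(-(step - valueSelection.warmup_steps) / valueSelection.decay_steps)
    return valueSelection.eta_stable + (valueSelection.eta_init - valueSelection.eta_stable) * decay
end
```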

false_x = first(values(model.variables))
env = SeaPearl.get_observation!(lh, model, false_x)


Collaborator

Doing step-by-step tests like this is perfect. Thank you very much for the time spent on it; it will ensure that SupervisedLearnedHeuristic behaves as expected.

valueSelection.helpSolution = solutions[1]
end
end
reset_model!(model_duplicate)
Collaborator

Why do we reset the duplicated model?

Collaborator

This is probably unnecessary and should indeed be removed.

if !isnothing(model_duplicate.statistics.solutions)
solutions = model_duplicate.statistics.solutions[model_duplicate.statistics.solutions.!=nothing]
if length(solutions) >= 1
valueSelection.helpSolution = solutions[1]
Collaborator

Why do we select the first solution found instead of the last one, which should be of better quality?

Collaborator

This could indeed be a much better approach!
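If that change is made, the assignment shown above would become something like the following sketch (not code from this PR):

```julia
if length(solutions) >= 1
    # Take the last recorded solution, which the CP search normally improves on
    # as it runs, instead of solutions[1].
    valueSelection.helpSolution = solutions[end]
end
```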

end

# Reset the agent, useful for things like recurrent networks
Flux.reset!(valueSelection.agent)
Collaborator

I'm just wondering why we previously added this line?

Collaborator

I honestly have no idea; it didn't seem to cause any problem, but I don't know whether it is actually useful.


Get the current value of `eta` (η), which is the probability for the solver to calculate and provide a classic CP-generated solution to the agent.
"""
function get_eta(valueSelection::SupervisedLearnedHeuristic)
Collaborator

To what extent does eta (η) need its own specific decay rate?

Could we have used the epsilon (ε) value of the agent.policy.explorer?

Collaborator

The ε of the greedy explorer represents the propensity of the agent to make random decisions when it acts, whereas η represents the propensity of the heuristic to guide the agent in a supervised way, without letting it make a decision.

To me these are distinct quantities that should not be tied together.
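A toy illustration of the distinction; all names and values below are hypothetical and not SeaPearl code.

```julia
using Random

rng = MersenneTwister(0)
eta = 0.5      # η: probability, drawn once per episode at InitializingPhase, that the
               # heuristic pre-solves the instance with CP and guides the agent with it
epsilon = 0.1  # ε: probability, checked inside the agent's explorer at every decision,
               # that a random action is taken instead of the greedy one

guide_with_cp_solution = rand(rng) < eta   # sampled once per episode
take_random_action = rand(rng) < epsilon   # sampled at every DecisionPhase
```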

x = variable_heuristic(model)
v = lh(SeaPearl.DecisionPhase, model, x)
@test v == lh.helpSolution[x.id]
lh(SeaPearl.EndingPhase, model, :Feasible)
Collaborator

Is this line useful?

Collaborator

@gostreap gostreap May 9, 2022

We have tried to keep the sequence of heuristic calls as close as possible to a real run, in order to catch potential problems in the future. However, I don't think deleting it would change the result of the test in the current state of the project and the heuristic.

I am still in favor of keeping this line in the test.

Comment on lines +481 to +498
lh(SeaPearl.InitializingPhase, model)
@test isnothing(lh.helpSolution)
SeaPearl.reset_model!(model)

lh(SeaPearl.InitializingPhase, model)
@test isnothing(lh.helpSolution)
SeaPearl.reset_model!(model)

lh(SeaPearl.InitializingPhase, model)
@test isnothing(lh.helpSolution)
SeaPearl.reset_model!(model)

lh(SeaPearl.InitializingPhase, model)
@test !isnothing(lh.helpSolution)
SeaPearl.reset_model!(model)

lh(SeaPearl.InitializingPhase, model)
@test isnothing(lh.helpSolution)
Collaborator

Nice check using the rng.

@ziadelassal ziadelassal merged commit 2b13bc5 into master May 6, 2022
@gostreap
Collaborator

gostreap commented May 9, 2022

Judging from the last comments, it seems we merged the branch a bit early. I will open a small follow-up pull request to address the points mentioned.
