Added features for experiment reproducibility and many other improvements: #203

Merged
46 commits, merged on May 5, 2022

Changes from 35 commits

Commits
aa2203d
added a guard for Evaluator evalFreq
Feb 18, 2022
c507cf6
added AccumulatedRewardBeforeReset to compute one path reward
Feb 18, 2022
fb2c249
added test sets for
Feb 18, 2022
63ae9e8
added TsptwRealData + dependencies
Feb 18, 2022
b7930cd
fixed typo
Feb 18, 2022
ee00993
redesign of smartreward function
Feb 18, 2022
0429f65
fixing explorer decaying during evaluation
Feb 18, 2022
a133734
added testsets for explorer decaying
Feb 18, 2022
d2a6e66
added doc for fill_with_generator! function
Feb 22, 2022
755a1b1
fix typo
Feb 22, 2022
793139d
refacto SameInstancesEvaluator to avoid useless eval on non learned h…
Feb 22, 2022
b0b6c36
added repeatlast! function
Feb 22, 2022
3442dc2
added doc
Feb 22, 2022
de761e8
added rng on evaluator instance generation
Feb 23, 2022
ce14a31
added rng for tsptw fill_with_generator!
Feb 23, 2022
5cc383c
added testset for last_episode_total_reward for empty traj
Feb 23, 2022
1aadc20
reduced ρ value for SmartReward
Feb 23, 2022
1b32928
added testsets for fill_with_generator!
Feb 24, 2022
862dbdd
modified fill_with_generator! function to fit new requirements
Feb 24, 2022
dd15ac7
changed SameInstancesEvaluator signature
Feb 24, 2022
6b88588
removed search with basic heuristics during training
Mar 16, 2022
5154287
fix launch experiment
Mar 22, 2022
028790d
added seed for NN layer initialisation
Mar 22, 2022
d238b82
fixed precompile time bias on 1st evaluation.
Mar 22, 2022
679c628
removed prints
Mar 23, 2022
5e77251
added seed for training instances generation
Mar 23, 2022
2af8044
added first "false" evaluation
Mar 23, 2022
5e011cb
added testsets for repeatlast!
Mar 23, 2022
8870ffd
WIP on basicmetrics
Mar 23, 2022
654b6f5
updated basicmetrics to be consistent.
Mar 24, 2022
bab6975
added basic metrics testsets
Mar 24, 2022
4ce7662
removed dead dependencies
Mar 24, 2022
e4add08
Revert "added TsptwRealData + dependencies"
Apr 27, 2022
05b1dd9
removed TsptwRealData dependencies
Apr 27, 2022
50e7461
refacto doc
Apr 27, 2022
6085dce
removed obsolete comments
May 4, 2022
48f337a
Merge branch 'tom/feature/new_pipeline' of github.com:corail-research…
May 4, 2022
b82854a
fixed typo
May 4, 2022
1536cb5
removed max_dist
May 4, 2022
48cf440
replaced findall by findfirst
May 4, 2022
9644dc3
Merge branch 'master' into tom/feature/new_pipeline
3rdCore May 4, 2022
2b39d54
fixed merge conflict
May 4, 2022
c100838
Merge branch 'master' into tom/feature/new_pipeline
May 4, 2022
cf94c58
Change rng to only take type AbstractRNG
gostreap May 5, 2022
043347b
Add solutionFound to repeatLast + fix return type when scores is Nothing
gostreap May 5, 2022
fec5edb
Fix solutionFound count in repeatlast!
gostreap May 5, 2022
12 changes: 8 additions & 4 deletions src/CP/core/model.jl
@@ -10,7 +10,8 @@ mutable struct Statistics
numberOfSolutionsBeforeRestart ::Int
numberOfInfeasibleSolutionsBeforeRestart::Int
numberOfNodesBeforeRestart ::Int
AccumulatedRewardBeforeReset ::Float32
AccumulatedRewardBeforeReset ::Float32 # =last_episode_total_reward(lh.agent.trajectory)
AccumulatedRewardBeforeRestart ::Float32
solutions ::Vector{Union{Nothing,Solution}}
nodevisitedpersolution ::Vector{Int}
objectives ::Union{Nothing, Vector{Union{Nothing,Int}}}
@@ -47,7 +48,7 @@ mutable struct CPModel
limit ::Limit
knownObjective ::Union{Nothing,Int64}
adhocInfo ::Any
CPModel(trailer) = new(Dict{String, AbstractVar}(), Dict{String, Bool}(), Constraint[], trailer, nothing, nothing, Statistics(Dict{String, Int}(), 0,0, 0, 0, 0, 0, 0, Solution[],Int[], nothing, nothing, nothing), Limit(nothing, nothing, nothing), nothing)
CPModel(trailer) = new(Dict{String, AbstractVar}(), Dict{String, Bool}(), Constraint[], trailer, nothing, nothing, Statistics(Dict{String, Int}(), 0,0, 0, 0, 0, 0, 0, 0, Solution[],Int[], nothing, nothing, nothing), Limit(nothing, nothing, nothing), nothing)
end

CPModel() = CPModel(Trailer())
@@ -223,6 +224,7 @@ function Base.isempty(model::CPModel)::Bool
&& model.statistics.numberOfSolutionsBeforeRestart == 0
&& model.statistics.numberOfNodesBeforeRestart == 0
&& model.statistics.AccumulatedRewardBeforeReset == 0
&& model.statistics.AccumulatedRewardBeforeRestart == 0
&& isnothing(model.limit.numberOfNodes)
&& isnothing(model.limit.numberOfSolutions)
&& isnothing(model.limit.searchingTime)
@@ -255,6 +257,7 @@ function Base.empty!(model::CPModel)
model.statistics.numberOfSolutionsBeforeRestart = 0
model.statistics.numberOfNodesBeforeRestart = 0
model.statistics.AccumulatedRewardBeforeReset = 0
model.statistics.AccumulatedRewardBeforeRestart = 0
model.limit.numberOfNodes = nothing
model.limit.numberOfSolutions = nothing
model.limit.searchingTime = nothing
@@ -291,9 +294,8 @@ function reset_model!(model::CPModel)
model.statistics.numberOfSolutionsBeforeRestart = 0
model.statistics.numberOfNodesBeforeRestart = 0
model.statistics.AccumulatedRewardBeforeReset = 0

model.statistics.AccumulatedRewardBeforeRestart = 0
end

"""
restart_search!(model::CPModel)

@@ -307,6 +309,8 @@ function restart_search!(model::CPModel)
model.statistics.numberOfInfeasibleSolutionsBeforeRestart = 0
model.statistics.numberOfSolutionsBeforeRestart = 0
model.statistics.numberOfNodesBeforeRestart = 0
model.statistics.AccumulatedRewardBeforeRestart = 0

end

"""
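The hunks above show that empty! and reset_model! clear both AccumulatedRewardBeforeReset and the new AccumulatedRewardBeforeRestart, while restart_search! clears only the restart-scoped accumulator. Below is a minimal Julia sketch of that split in isolation; ToyStatistics, toy_reset! and toy_restart! are made-up names for illustration, not SeaPearl code.

mutable struct ToyStatistics
    AccumulatedRewardBeforeReset::Float32    # per-episode accumulator, cleared on every reset
    AccumulatedRewardBeforeRestart::Float32  # also cleared whenever the search restarts
end

ToyStatistics() = ToyStatistics(0.0f0, 0.0f0)

# Mirrors reset_model! / empty!: both accumulators go back to zero.
function toy_reset!(s::ToyStatistics)
    s.AccumulatedRewardBeforeReset = 0.0f0
    s.AccumulatedRewardBeforeRestart = 0.0f0
end

# Mirrors restart_search!: only the restart-scoped accumulator is cleared.
function toy_restart!(s::ToyStatistics)
    s.AccumulatedRewardBeforeRestart = 0.0f0
end

s = ToyStatistics()
s.AccumulatedRewardBeforeReset += 1.5f0
s.AccumulatedRewardBeforeRestart += 1.5f0
toy_restart!(s)   # now (1.5, 0.0): the per-reset total survives a restart
toy_reset!(s)     # now (0.0, 0.0)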
2 changes: 1 addition & 1 deletion src/CP/core/search/ilds.jl
@@ -13,7 +13,7 @@ function initroot!(toCall::Stack{Function}, search::ILDSearch , model::CPModel,
for k in search.d:-1:1
push!(toCall, (model) -> (restart_search!(model); expandIlds!(toCall,k, model, variableHeuristic, valueSelection)))
end
return expandIlds!(toCall, 0, model, variableHeuristic, valueSelection,nothing)
return expandIlds!(toCall, 0, model, variableHeuristic, valueSelection, nothing)
end

"""
2 changes: 1 addition & 1 deletion src/CP/valueselection/learning/learnedheuristic.jl
@@ -14,7 +14,7 @@ and your functions will be called instead of the default ones.
abstract type AbstractReward end

"""
LearnedHeuristic{SR<:AbstractStateRepresentation, R<:AbstractReward, A<:ActionOutput}
mutable struct LearnedHeuristic{SR<:AbstractStateRepresentation, R<:AbstractReward, A<:ActionOutput}

The LearnedHeuristic is a value selection heuristic which is learned thanks to a training made by solving a certain amount
of problem instances from files or from an `AbstractModelGenerator`. From the RL point of view, this is an agent which is
49 changes: 37 additions & 12 deletions src/CP/valueselection/learning/rewards/smartreward.jl
@@ -2,11 +2,15 @@
struct SmartReward <: AbstractReward end

This is the smart reward, which will be used to teach the agent to prioritize paths that lead to improving solutions.
This reward is the exact reward implemented by Quentin Cappart in
his recent paper: Combining RL & CP for Combinatorial Optimization, https://arxiv.org/pdf/2006.01610.pdf.
Comment on lines +5 to +6 (Contributor):
Could be nice to explain in a few words how the smart reward works and why it's interesting to use it. If not, maybe add "section 2.2" after the paper's link to make this information easier for the user.

"""
mutable struct SmartReward <: AbstractReward
value::Float32
end

ρ = 0.001

SmartReward(model::CPModel) = SmartReward(0)

"""
@@ -19,15 +23,18 @@ function set_reward!(::Type{StepPhase}, lh::LearnedHeuristic{SR, SmartReward, A}
A <: ActionOutput
}
if symbol == :Infeasible
#println("INFEASIBLE")
#lh.reward.value -= last_episode_total_reward(lh.agent.trajectory)
lh.reward.value -= 0

elseif symbol == :FoundSolution
#println("SOLUTION FOUND, score : ",assignedValue(model.objective), " delta : ",15-assignedValue(model.objective)," accumulated reward : ", model.statistics.AccumulatedRewardBeforeReset)
lh.reward.value += isnothing(model.objective) ? 0 : 100 * (-assignedValue(model.objective))
#lh.reward.value += model.statistics.lastPruning

elseif symbol == :FoundSolution #last portion required to get the full closed path
dist = model.adhocInfo[1]
n = size(dist)[1]
max_dist = Float32(Base.maximum(dist))
if isbound(model.variables["a_"*string(n-1)])
last = assignedValue(model.variables["a_"*string(n-1)])
first = assignedValue(model.variables["a_1"])

dist_to_first_node = lh.current_state.dist[last, first] * max_dist
Comment (Contributor):
I guess it's the correct behavior, but I don't understand the * max_dist.

Reply (Collaborator, Author):
This is supposed to be copied from the original implementation of the reward provided by @qcappart :

https://github.com/qcappart/hybrid-cp-rl-solver/blob/master/src/problem/tsptw/environment/environment.py

However, as I can't find it anymore in the original repo, I removed the factor * max_dist in the computation of last_dist and dist_to_first_node.

lh.reward.value += -ρ*dist_to_first_node
end
elseif symbol == :Feasible
lh.reward.value -= 0
elseif symbol == :BackTracking
@@ -38,16 +45,34 @@ end
"""
set_reward!(::DecisionPhase, lh::LearnedHeuristic{SmartReward, O}, model::CPModel)

Change the current reward at the DecisionPhase. This is called right before making the next decision, so you know you have the very last state before the new decision
and every computation like fixPoints and backtracking has been done.
Change the current reward at the DecisionPhase. This is called right before making the next decision, so you know you have the very last state before the new decision and every computation like fixPoints and backtracking has been done.

This computes the reward: ρ*(1 + tour_upper_bound - last_dist), where ρ is a constant, tour_upper_bound is an upper bound on the tour length, and last_dist is the distance between the previous node and the target node decided by the previous decision (the reward is attributed just before taking a new decision).
Comment (Contributor):
small typo in "takng"

"""
function set_reward!(::Type{DecisionPhase}, lh::LearnedHeuristic{SR, SmartReward, A}, model::CPModel) where {
SR <: AbstractStateRepresentation,
A <: ActionOutput
}
#println("Decision, reward : ",model.statistics.lastPruning)
dist = model.adhocInfo[1]
n = size(dist)[1]

tour_upper_bound = Base.maximum(dist) * n
max_dist = Float32(Base.maximum(dist))

if !isnothing(model.statistics.lastVar)
x = model.statistics.lastVar
s = x.id
current = parse(Int,split(x.id,'_')[2])
if isbound(model.variables["a_"*string(current)])
a_i = assignedValue(model.variables["a_"*string(current)])
v_i = assignedValue(model.variables["v_"*string(current)])
last_dist = lh.current_state.dist[v_i, a_i] * max_dist
Comment (Collaborator, Author):
This corresponds to the distance between the previous node and the node that has just been selected by the heuristic one step before. (Recall that the reward is always given one step after, just before making a new decision).

#print("last_dist : ", last_dist, " // ")
lh.reward.value += ρ*( 1+ tour_upper_bound - last_dist)
end

end

#lh.reward.value += model.statistics.lastPruning

end

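To make the reward arithmetic concrete, here is a standalone Julia sketch of the two formulas used above: ρ*(1 + tour_upper_bound - last_dist) at the DecisionPhase and -ρ*dist_to_first_node when the tour is closed. The 3-node distance matrix and the names ρ_example, toy_decision_reward and toy_closing_reward are illustrative assumptions, not values or functions from the PR.

ρ_example = 0.001f0

dist = Float32[0 2 9; 2 0 6; 9 6 0]   # symmetric distance matrix for a 3-node toy instance
n = size(dist, 1)
tour_upper_bound = maximum(dist) * n  # 27.0: crude upper bound on any tour length

# DecisionPhase: reward granted just before the next decision, based on the
# length of the edge chosen by the previous decision (shorter edge => larger reward).
toy_decision_reward(last_dist) = ρ_example * (1 + tour_upper_bound - last_dist)

# StepPhase on :FoundSolution: penalty for the arc that closes the tour.
toy_closing_reward(dist_to_first_node) = -ρ_example * dist_to_first_node

toy_decision_reward(dist[1, 2])  # ≈ 0.026  (edge of length 2)
toy_decision_reward(dist[1, 3])  # ≈ 0.019  (edge of length 9)
toy_closing_reward(dist[3, 1])   # ≈ -0.009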
3 changes: 3 additions & 0 deletions src/CP/valueselection/learning/utils.jl
@@ -9,6 +9,7 @@ manually change the mode again if he wants.
"""
function Flux.testmode!(lh::LearnedHeuristic, mode = true)
Flux.testmode!(lh.agent, mode)
lh.agent.policy.explorer.is_training = !mode
Comment (Collaborator, Author):
This fixes the RL agent's exploration rate at zero during evaluation.

Reply (Collaborator):
It seems to me that by default the explorer is not called when trainMode is false.

However, this should not be a problem.

lh.trainMode = !mode
end

@@ -79,6 +80,8 @@ function get_observation!(lh::LearnedHeuristic, model::CPModel, x::AbstractIntVa

# Initialize reward for the next state: not compulsory with DefaultReward, but maybe useful in case the user forgets it
model.statistics.AccumulatedRewardBeforeReset += lh.reward.value
model.statistics.AccumulatedRewardBeforeRestart += lh.reward.value

lh.reward.value = 0

# synchronize state:
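The testmode! change above switches the agent's explorer off while evaluating. The Julia sketch below uses a simplified epsilon-greedy explorer with an is_training flag; ToyEpsilonGreedy and select_action are made-up names, not the ReinforcementLearning.jl API, but they show why freezing the flag makes evaluation greedy and therefore comparable across runs.

mutable struct ToyEpsilonGreedy
    ϵ::Float64          # exploration rate, used only while training
    is_training::Bool
end

function select_action(explorer::ToyEpsilonGreedy, q_values::Vector{Float64})
    # With is_training == false the agent is purely greedy, so evaluation
    # scores no longer depend on random exploration.
    if explorer.is_training && rand() < explorer.ϵ
        return rand(1:length(q_values))   # explore: pick a random action
    end
    return argmax(q_values)               # exploit: pick the best known action
end

explorer = ToyEpsilonGreedy(0.3, true)
q = [0.1, 0.9, 0.4]
select_action(explorer, q)     # may return any index while training
explorer.is_training = false   # what testmode!(lh, true) achieves above
select_action(explorer, q)     # always returns 2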
9 changes: 6 additions & 3 deletions src/RL/utils/totalreward.jl
@@ -1,13 +1,15 @@
"""
last_episode_total_reward(t::AbstractTrajectory)

Compute the sum of every reward of the last episode of the trajectory
Compute the sum of every reward of the last episode of the trajectory.

For example, if t[:terminal] = [0, 0, 1, 0, 1, 1, 1, 0, 0, 1], the 7-th state is a terminal state, which means that the last episode started at step 8. Hence, last_episode_total_reward corresponds to the 3 last decisions.
"""
function last_episode_total_reward(t::AbstractTrajectory)
last_index = length(t[:terminal])
last_index == 0 && return 0
Comment (Collaborator, Author):
Returns 0 in case the trajectory is empty. This was needed in order to evaluate the model before any training step without triggering a BoundsError while retrieving last_episode_total_reward from the trajectory.


#if t[:terminal][last_index] #TODO understand why they wrote this

#if t[:terminal][last_index] Do we need to consider cases where the last state is not a terminal state ?
Comment (Contributor):
Has this case been resolved?

Reply (Collaborator, Author):
Yes @marco-novaes98, this case is handled on line 10.

totalReward = t[:reward][last_index]

i = 1
@@ -18,3 +20,4 @@ function last_episode_total_reward(t::AbstractTrajectory)
end
return totalReward
end

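As a sanity check of the documented behaviour, here is a self-contained Julia sketch over plain vectors instead of an AbstractTrajectory, including the new empty-trajectory guard; toy_last_episode_total_reward is a made-up name, not the SeaPearl function.

function toy_last_episode_total_reward(rewards::Vector{<:Real}, terminal::Vector{Bool})
    last_index = length(terminal)
    last_index == 0 && return 0            # empty trajectory: return 0 instead of a BoundsError
    total = rewards[last_index]
    i = last_index - 1
    while i >= 1 && !terminal[i]           # walk back until the previous episode's terminal state
        total += rewards[i]
        i -= 1
    end
    return total
end

terminal = Bool[0, 0, 1, 0, 1, 1, 1, 0, 0, 1]
rewards  = collect(1.0:10.0)
toy_last_episode_total_reward(rewards, terminal)   # 27.0 = 8 + 9 + 10, the 3 last decisions
toy_last_episode_total_reward(Float64[], Bool[])   # 0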