Added features for experiment reproducibility and many other improvements #203
Changes from 35 commits
@@ -2,11 +2,15 @@
struct SmartReward <: AbstractReward end

This is the smart reward, which will be used to teach the agent to prioritize paths that lead to improving solutions.
This reward is the exact reward implemented by Quentin Cappart in
his recent paper: Combining RL & CP for Combinatorial Optimization, https://arxiv.org/pdf/2006.01610.pdf.
"""
mutable struct SmartReward <: AbstractReward
    value::Float32
end

ρ = 0.001

SmartReward(model::CPModel) = SmartReward(0)

"""
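For readers unfamiliar with the reward interface, the hunk above follows a recurring pattern: a mutable struct holding a running value, a constructor taking the CPModel, and set_reward! methods for the search phases. Below is a minimal sketch of a custom reward written in that style; MyReward is a hypothetical name, and the exact set_reward! signature (in particular how symbol is passed) is assumed from the surrounding hunks rather than taken from the package documentation.

```julia
# Hypothetical reward type, mirroring the SmartReward pattern shown above.
# Assumes the surrounding module (e.g. SeaPearl) provides AbstractReward, CPModel,
# LearnedHeuristic, StepPhase, AbstractStateRepresentation and ActionOutput.
mutable struct MyReward <: AbstractReward
    value::Float32
end

# Constructor from the CP model, as done for SmartReward above.
MyReward(model::CPModel) = MyReward(0)

# Signature assumed from the StepPhase method in this diff.
function set_reward!(::Type{StepPhase}, lh::LearnedHeuristic{SR, MyReward, A}, model::CPModel, symbol::Union{Nothing, Symbol}) where {
    SR <: AbstractStateRepresentation,
    A <: ActionOutput
}
    # Penalize infeasible branches, reward found solutions.
    if symbol == :Infeasible
        lh.reward.value -= 1
    elseif symbol == :FoundSolution
        lh.reward.value += 1
    end
end
```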
@@ -19,15 +23,18 @@ function set_reward!(::Type{StepPhase}, lh::LearnedHeuristic{SR, SmartReward, A}
    A <: ActionOutput
}
    if symbol == :Infeasible
        #println("INFEASIBLE")
        #lh.reward.value -= last_episode_total_reward(lh.agent.trajectory)
        lh.reward.value -= 0

    elseif symbol == :FoundSolution
        #println("SOLUTION FOUND, score : ", assignedValue(model.objective), " delta : ", 15-assignedValue(model.objective), " accumulated reward : ", model.statistics.AccumulatedRewardBeforeReset)
        lh.reward.value += isnothing(model.objective) ? 0 : 100 * (-assignedValue(model.objective))
        #lh.reward.value += model.statistics.lastPruning

    elseif symbol == :FoundSolution # last portion required to get the full closed path
        dist = model.adhocInfo[1]
        n = size(dist)[1]
        max_dist = Float32(Base.maximum(dist))
        if isbound(model.variables["a_"*string(n-1)])
            last = assignedValue(model.variables["a_"*string(n-1)])
            first = assignedValue(model.variables["a_1"])

            dist_to_first_node = lh.current_state.dist[last, first] * max_dist
Review comment: I guess it's the right behavior, but I don't understand the …

Reply: This is supposed to be copied from the original implementation of the reward provided by @qcappart. However, as I can't find it anymore in the original repo, I removed the factor.
            lh.reward.value += -ρ*dist_to_first_node
        end
    elseif symbol == :Feasible
        lh.reward.value -= 0
    elseif symbol == :BackTracking
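As a concrete illustration of the closing-edge penalty in the FoundSolution branch above, here is the same computation with made-up numbers; lh.current_state.dist is assumed to hold distances normalized by max_dist, which is why the code multiplies by max_dist to recover the real distance.

```julia
# Illustration only: closing-edge penalty with made-up numbers.
ρ = 0.001                     # same constant as defined earlier in the file
max_dist = 10.0f0             # Float32(Base.maximum(dist)): largest entry of the distance matrix
normalized_dist = 0.3f0       # assumed value of lh.current_state.dist[last, first]

dist_to_first_node = normalized_dist * max_dist   # 3.0: distance from the last node back to the start
penalty = -ρ * dist_to_first_node                 # -0.003 is added to lh.reward.value
```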
@@ -38,16 +45,34 @@ end
"""
    set_reward!(::DecisionPhase, lh::LearnedHeuristic{SmartReward, O}, model::CPModel)

Change the current reward at the DecisionPhase. This is called right before making the next decision, so you know you have the very last state before the new decision and every computation like fixPoints and backtracking has been done.

This computes the reward ρ*(1 + tour_upper_bound - last_dist), where ρ is a constant, tour_upper_bound is an upper bound on the length of the tour, and last_dist is the distance between the previous node and the target node decided by the previous decision (the reward is attributed just before taking a new decision).
Review comment: small typo in "takng".
"""
function set_reward!(::Type{DecisionPhase}, lh::LearnedHeuristic{SR, SmartReward, A}, model::CPModel) where {
    SR <: AbstractStateRepresentation,
    A <: ActionOutput
}
    #println("Decision, reward : ", model.statistics.lastPruning)
    dist = model.adhocInfo[1]
    n = size(dist)[1]

    tour_upper_bound = Base.maximum(dist) * n
    max_dist = Float32(Base.maximum(dist))

    if !isnothing(model.statistics.lastVar)
        x = model.statistics.lastVar
        s = x.id
        current = parse(Int, split(x.id, '_')[2])
        if isbound(model.variables["a_"*string(current)])
            a_i = assignedValue(model.variables["a_"*string(current)])
            v_i = assignedValue(model.variables["v_"*string(current)])
            last_dist = lh.current_state.dist[v_i, a_i] * max_dist
Review comment: This corresponds to the distance between the previous node and the node that has just been selected by the heuristic one step before. (Recall that the reward is always given one step after, just before making a new decision.)
            #print("last_dist : ", last_dist, " // ")
            lh.reward.value += ρ*(1 + tour_upper_bound - last_dist)
        end

    end

    #lh.reward.value += model.statistics.lastPruning

end
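As a sanity check on the DecisionPhase formula ρ*(1 + tour_upper_bound - last_dist), here is the same computation on made-up numbers (again assuming lh.current_state.dist stores normalized distances, so the multiplication by max_dist recovers the real edge length):

```julia
# Illustration only: DecisionPhase reward with made-up numbers.
ρ = 0.001
n = 5                                  # number of nodes in the instance
max_dist = 10.0f0                      # Float32(Base.maximum(dist))
tour_upper_bound = max_dist * n        # 50.0: n times the longest edge
last_dist = 0.3f0 * max_dist           # 3.0: length of the edge chosen by the previous decision

reward_increment = ρ * (1 + tour_upper_bound - last_dist)   # 0.001 * 48.0 = 0.048
```

Because tour_upper_bound dominates last_dist, every decision earns a positive increment, and shorter chosen edges earn slightly more, which is what pushes the agent toward shorter tours.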
@@ -9,6 +9,7 @@ manually change the mode again if he wants.
"""
function Flux.testmode!(lh::LearnedHeuristic, mode = true)
    Flux.testmode!(lh.agent, mode)
    lh.agent.policy.explorer.is_training = !mode
Review comment: fixing the RL agent's explorer value to zero during evaluation.

Reply: It seems to me that by default the explorer is not called when trainMode is false. However, this should not be a problem.
    lh.trainMode = !mode
end
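A typical usage of this switch, sketched from the signature above (the surrounding evaluation loop is illustrative only), would be:

```julia
# Sketch: temporarily put the learned heuristic lh in evaluation mode.
Flux.testmode!(lh, true)    # agent frozen and, with this change, the explorer stops exploring
# ... run evaluation episodes here ...
Flux.testmode!(lh, false)   # back to training mode: trainMode = true, explorer active again
```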
@@ -79,6 +80,8 @@ function get_observation!(lh::LearnedHeuristic, model::CPModel, x::AbstractIntVa

    # Initialize reward for the next state: not compulsory with DefaultReward, but maybe useful in case the user forgets it
    model.statistics.AccumulatedRewardBeforeReset += lh.reward.value
    model.statistics.AccumulatedRewardBeforeRestart += lh.reward.value

    lh.reward.value = 0

    # synchronize state:
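These two added lines implement an accumulate-then-reset pattern: the reward collected since the last observation is folded into the per-reset and per-restart counters just before the running value is cleared for the next state. A hedged sketch of how those counters could be inspected after a solve (only the two field names come from the diff; everything else is illustrative):

```julia
# Illustration only: reading the accumulated-reward statistics after a solve.
println("reward accumulated since last reset:   ", model.statistics.AccumulatedRewardBeforeReset)
println("reward accumulated since last restart: ", model.statistics.AccumulatedRewardBeforeRestart)
```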
@@ -1,13 +1,15 @@
"""
    last_episode_total_reward(t::AbstractTrajectory)

Compute the sum of every reward of the last episode of the trajectory.

For example, if t[:terminal] = [0, 0, 1, 0, 1, 1, 1, 0, 0, 1], the 7-th state is a terminal state, which means that the last episode started at step 8. Hence, last_episode_total_reward corresponds to the 3 last decisions.
"""
function last_episode_total_reward(t::AbstractTrajectory)
    last_index = length(t[:terminal])
    last_index == 0 && return 0
Review comment: return 0 in case the trajectory is empty. This was needed in order to evaluate the model before any training step without triggering a …
    #if t[:terminal][last_index] #TODO understand why they wrote this

    #if t[:terminal][last_index] Do we need to consider cases where the last state is not a terminal state ?
Review comment: Has this case been resolved?

Reply: Yes @marco-novaes98, this case is done in line 10.
    totalReward = t[:reward][last_index]

    i = 1

@@ -18,3 +20,4 @@ function last_episode_total_reward(t::AbstractTrajectory)
    end
    return totalReward
end
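To make the docstring example above concrete, here is a self-contained sketch of the same backward scan over plain vectors. It mirrors the behavior described in the docstring (sum the rewards from the step after the previous terminal up to the last step); the real function operates on an AbstractTrajectory and the middle of its loop is not shown in this diff, so this is a reconstruction rather than a copy.

```julia
# Illustration only: the "sum the rewards of the last episode" logic on plain vectors.
terminal = [0, 0, 1, 0, 1, 1, 1, 0, 0, 1]                       # 1 marks the end of an episode
reward   = [1.0, 2.0, 3.0, 1.0, 1.0, 1.0, 1.0, 4.0, 5.0, 6.0]   # made-up per-step rewards

function last_episode_total_reward_demo(terminal, reward)
    last_index = length(terminal)
    last_index == 0 && return 0.0          # empty trajectory, as in the added guard above
    total = reward[last_index]
    i = 1
    # walk backwards until we hit the terminal flag of the previous episode
    while last_index - i > 0 && terminal[last_index - i] != 1
        total += reward[last_index - i]
        i += 1
    end
    return total
end

last_episode_total_reward_demo(terminal, reward)   # 4.0 + 5.0 + 6.0 = 15.0: the 3 last decisions
```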
Review comment: It could be nice to explain in a few words how the smart reward works and why it's interesting to use it. If not, maybe add "section 2.2" after the paper's link to make this information easier for the user.