[RLlib] Fix ParametricRecSys observations #28358
Conversation
@@ -143,9 +143,8 @@ def step(self, action):
        reward = 0.0
        if which_clicked < self.slate_size:
            # Reward is 1.0 - regret if clicked. 0.0 if not clicked.
Why did we remove this comment?
I felt that it was very obvious and did not need a comment!
Please, let's not remove comments that are already there! A lot of things make sense when one reads the code, but might be unclear when one just skims through things.
Also, we should explain here where the magic new 100.0 value comes from.
scores = [
    np.dot(self.current_user, doc) for doc in self.currently_suggested_docs
]
scores = softmax(
Can you add a comment on why this should be softmax'd?
The scores can be > 1. If we then select a score to calculate the regret and reward, the reward may become < 0.
Sorry for not being more verbose about this.
The way we calculate the reward here, it ends up at <= 1 instead of between 0 and 100, as specified in the observation space above.
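For context, a rough sketch of the logic being discussed (a sketch only, not the actual diff: the toy values below and the 100.0 scaling to match the Box(0, 100) response space are assumptions):

```python
import numpy as np
from scipy.special import softmax

# Toy stand-ins for self.current_user / self.currently_suggested_docs /
# which_clicked from the env snippet above (values are made up).
current_user = np.array([0.9, 1.2, -0.3])
currently_suggested_docs = [np.array([1.0, 1.0, 0.0]), np.array([0.2, 0.1, 0.5])]
which_clicked = 1

# Raw affinity scores are unbounded dot products, so they can exceed 1.0
# (here, the first one is 2.1). A reward of 1.0 - regret computed from raw
# scores could therefore drop below 0.
scores = [np.dot(current_user, doc) for doc in currently_suggested_docs]

# Softmax squashes the scores into (0, 1), so the regret (best score minus
# the clicked score) stays in [0, 1) and the reward stays non-negative.
scores = softmax(scores)

regret = np.max(scores) - scores[which_clicked]
# Reward is 1.0 - regret if clicked, 0.0 if not clicked (as in the removed
# comment). Scaling by 100.0 is an assumption here, to map the reward into
# the Box(0, 100) response space mentioned in this thread.
reward = (1.0 - regret) * 100.0
print(reward)
```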
Got it, could you add this explanation here? :)
Thanks @ArturNiederfahrenhorst.
Just two nits on better comments, then we can merge it.
Signed-off-by: Artur Niederfahrenhorst <artur@anyscale.com>
Signed-off-by: ilee300a <ilee300@anyscale.com>
Signed-off-by: Artur Niederfahrenhorst artur@anyscale.com
Why are these changes needed?
The way observations are constructed right now, they don't necessarily fall into the observation space.
Furthermore, they don't make use of the complete observation space.
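As an aside, a minimal way to sanity-check this property on any env instance (a sketch, not part of the PR; the helper below is hypothetical and assumes the classic gym reset()/step() API that RLlib used at the time):

```python
import gym  # RLlib envs at the time of this PR subclass gym.Env.


def check_observations(env: gym.Env, num_steps: int = 100) -> None:
    """Roll out random actions and assert that every observation lies
    inside the declared observation space (the property this PR fixes)."""
    obs = env.reset()
    assert env.observation_space.contains(obs), f"reset() obs out of space: {obs}"
    for _ in range(num_steps):
        obs, reward, done, info = env.step(env.action_space.sample())
        assert env.observation_space.contains(obs), f"step() obs out of space: {obs}"
        if done:
            obs = env.reset()
```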
Related issue number
#28231
Checks
I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
I've run scripts/format.sh to lint the changes in this PR.