Code for experiments on policy control and evaluation in Reinforcement Learning with delayed, aggregated and anonymous feedback.
In the standard reinforcement learning setting, for each action an agent takes, the environment provides a reward.
This is encoded by the reward function R(s, a), which maps each state-action pair to a scalar reward.
In DAAF settings, the environment instead provides feedback at periodic time intervals (e.g. based on a Poisson distribution), and in aggregate, in the sense that the agent receives a combination of the rewards for several actions. Because the agent cannot discern how much each action contributed to the observed value, the feedback is anonymous.
In contrast to fully sparse reward problems, where the reward is only observed at the end, after task completion or failure, DAAF problems have intermittent feedback.
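For illustration, the sketch below shows one way such feedback can arise: an environment wrapper accumulates per-step rewards and only reveals their sum every few steps. The wrapper class, the `reward_period` parameter, and the gym-style `step` signature are assumptions made for this example, not part of this repository's API.

```python
class AggregatedFeedbackWrapper:
    """Illustrative wrapper that turns per-step rewards into DAAF-style feedback.

    Per-step rewards are accumulated internally; the agent only observes their
    sum every `reward_period` steps (and zero otherwise), so it cannot tell how
    much each individual action contributed.
    """

    def __init__(self, env, reward_period: int = 4):
        # `env` is assumed to expose gym-style reset() and step() methods,
        # with step() returning (observation, reward, done, info).
        self.env = env
        self.reward_period = reward_period
        self._accumulated = 0.0
        self._steps_since_feedback = 0

    def reset(self):
        self._accumulated = 0.0
        self._steps_since_feedback = 0
        return self.env.reset()

    def step(self, action):
        obs, reward, done, info = self.env.step(action)
        self._accumulated += reward
        self._steps_since_feedback += 1
        if done or self._steps_since_feedback >= self.reward_period:
            # Reveal the aggregate of the last few rewards, then reset the buffer.
            observed_reward = self._accumulated
            self._accumulated = 0.0
            self._steps_since_feedback = 0
        else:
            # Nothing is revealed at this step.
            observed_reward = 0.0
        return obs, observed_reward, done, info
```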
This repository contains:
- Algorithms for policy control with DAAF
- Algorithms for policy evaluation with DAAF
- Notebooks with analysis results on reward estimation or recovery (the estimation idea is sketched below)
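One way to work with DAAF, sketched below, is to first estimate the hidden per-step rewards and then apply standard evaluation or control to the estimates. With tabular states and actions, each aggregate observation is a linear combination of the unknown per-pair rewards weighted by visit counts, so the rewards can be recovered by least squares. The function name and array layout here are illustrative assumptions, not necessarily the estimator used in these notebooks.

```python
import numpy as np


def estimate_rewards(visit_counts: np.ndarray, aggregate_feedback: np.ndarray) -> np.ndarray:
    """Estimate per-(state, action) rewards from aggregated, anonymous feedback.

    Args:
        visit_counts: shape (num_segments, num_state_action_pairs); entry [i, j]
            counts how often pair j was visited during segment i.
        aggregate_feedback: shape (num_segments,); the aggregate reward observed
            at the end of each segment.

    Returns:
        Least-squares estimate of the reward for each (state, action) pair.
    """
    estimates, *_ = np.linalg.lstsq(visit_counts, aggregate_feedback, rcond=None)
    return estimates


# Tiny usage example: two state-action pairs with true rewards (1.0, -0.5).
counts = np.array([[3.0, 1.0], [1.0, 2.0], [2.0, 2.0]])
true_rewards = np.array([1.0, -0.5])
observed = counts @ true_rewards  # aggregated feedback for each segment
print(estimate_rewards(counts, observed))  # approximately [ 1.  -0.5]
```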
For specific snapshots of code submitted to conferences:
First, make sure the following Python development tools are installed:
Then, in a virtual environment, run pip-compile and install:
$ make pip-compile
$ make pip-install
These commands should install all the dependencies required for development.
For building, install tox and tox-uv:
$ pip install tox tox-uv
The dependency files serve the following purposes:
- requirements.in: packages for the experiments.
- test-requirements.in: for running tests.
- nb-requirements.in: for jupyter notebooks.
- rendering-requirements.in: for environments that can be rendered in a graphical interface, using OpenGL.
- ray-env-requirements.in: for ray in a cluster environment. During compilation with `pip-compile`, it's best to exclude the version of ray (see the Makefile).
All requirements files are compiled using `uv`.