---
documentclass: scrartcl
indent: true
lang: en-GB
link-citations: true
colorlinks: true
numbersections: true
toc: true
title: How to turn SupAmp into ReAmp?
author: Richard Möhn
date: 2019-09-12
reference-section-title: References
---
(For a project overview and a glossary, see the home page of the Farlamp repository.) This is a design document that I mostly wrote for myself in order to clarify where I have to go. I will update it on the way to implementation.
Before I can experiment with overseer failure, I need to adapt the supervised
learning system from [@CSASupAmp] to reinforcement learning and evaluate the
result on the original tasks from the article. I see three major areas of work:
overhaul the learning mechanism, find a good way to determine rewards, and
update the task evaluation.
In the following I assume that you are familiar with [@CSASupAmp]. \textsc{oq} stands for open question. I also recommend you read Overseer failures in SupAmp and ReAmp first, even though it's about the second part of the project. It's more polished and has a proper introduction to the problem. And it uses more consistent notation, distinguishing between problems, solutions, questions and answers.
In the following three points I paraphrase [@CSASupAmp{sec. 2.2}], with
adaptations for RL. Note that here we run only three processes in parallel,
while in the original there are four. This is because the original generates
training data for $X$ in a separate process and then trains $X$ on them by
supervised learning, whereas here reward generation and the training of $X$
are combined in process 3.
\zkref{Z22a1c1}
1. Sample a question $q \sim D$ and pose it to $X$. $X$ returns answer $a$.
   Use $\Amplify^H(X)$ to evaluate $a$ and calculate a reward $r$ for $X$.
   Record the whole interaction, ie.
   $(q, a, \left<\text{sub-questions and -answers}\right>, r)$. The number of
   sub-questions $H$ asks is fixed.
2. Using a recording from process 1, train $H'$ to predict the outputs of $H$.
3. Sample a question $q \sim D$ and pose it to $X$. $X$ returns answer $a$.
   Feed $q$ and $a$ to $\Amplify^{H'}(X)$ to generate a reward $r$. Train $X$
   on $r$ using reinforcement learning.
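To make the flow concrete, here is a minimal sketch of the three processes in Python. It is a toy: `ToyAgent`, `ToyOverseer`, `sample_question` and the update methods are made-up stand-ins for $X$, $H$/$H'$ and the real training code, and the processes run one after another in a single loop rather than in parallel.

```python
import random

# Toy stand-ins so the sketch runs; in ReAmp these would be learned models
# and the human overseer.  This is an illustration of the three processes,
# not the actual implementation.

class ToyAgent:
    """Stands in for X: answers questions (here: by guessing a digit)."""
    def answer(self, question):
        return random.randint(0, 9)
    def rl_update(self, question, answer, reward):
        pass  # a real agent would do a policy-gradient (or similar) step

class ToyOverseer:
    """Stands in for H (or the learned H'): decomposes and evaluates."""
    def sub_questions(self, question, answer):
        # A fixed number of sub-questions, as in the setup above.
        return [f"sub-question {i} about {question!r}" for i in range(2)]
    def reward(self, question, answer, sub_qas):
        # Placeholder: reward 1 with some probability, 0 otherwise.
        return 1.0 if random.random() < 0.5 else 0.0
    def train_on(self, recording):
        pass  # H' would be trained to imitate H's sub-questions and rewards

def sample_question():
    return f"question {random.randint(0, 99)}"   # q ~ D

def amplify(evaluator, X, q, a):
    """Amplify^evaluator(X): the evaluator asks sub-questions, X answers
    them, and the evaluator turns everything into a reward for answer a."""
    sub_qas = [(sq, X.answer(sq)) for sq in evaluator.sub_questions(q, a)]
    return evaluator.reward(q, a, sub_qas), sub_qas

X, H, H_prime = ToyAgent(), ToyOverseer(), ToyOverseer()
recordings = []

for _ in range(10):
    # Process 1: evaluate X with H and record the whole interaction.
    q = sample_question(); a = X.answer(q)
    r, sub_qas = amplify(H, X, q, a)
    recordings.append((q, a, sub_qas, r))

    # Process 2: train H' to predict H's outputs, using a recording.
    H_prime.train_on(random.choice(recordings))

    # Process 3: get a reward from Amplify^{H'}(X) and train X on it by RL.
    q = sample_question(); a = X.answer(q)
    r, _ = amplify(H_prime, X, q, a)
    X.rl_update(q, a, r)
```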
In SupAmp the learning process builds on questions that
- \OQ If the number $l$ of sub-questions is fixed, does $H$ ask blank
  sub-questions if it has fewer than $l$ actual sub-questions?
- \OQ Does $H$ always decompose a question? Or does it sometimes evaluate
  higher-level questions directly as well? Eg. ‘What is $\sigma^2(5)$?’
- \OQ How does pretraining work? How do I have to adapt it?
- \OQ In process 3, is it useful to take one question and do multiple rounds of answer-evaluation?
- \TODO Read what Paul has written about RL-based IDA.
- \OQ $X$ doesn't have a way to feed sub-questions to itself. This is because
  the authors of [@CSASupAmp] want it to come up with structures more
  efficient than recursive questions. But this limits capability, doesn't it?
  I suspect that at some point feeding questions to oneself is necessary to
  become more intelligent. Then the question becomes: when and when not to
  decompose? See [@CSASupAmp, p. 6 ‘Recursive model architectures’] for going
  deeper into this.
\OQ This is not clear to me yet. It looks like in some tasks, such as
\task{permutation powering}, we can only give reward 1 for a right answer and
reward 0 for a wrong answer. On other tasks the reward might be between 1 and
0, depending on how close the suggested answer is to the correct one.
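A quick sketch of the two reward shapes. The graded case assumes a hypothetical task with numeric answers; the function names and the distance scale are my assumptions.

```python
def binary_reward(suggested, correct):
    """Tasks like permutation powering: only an exact match can be rewarded."""
    return 1.0 if suggested == correct else 0.0

def graded_reward(suggested, correct, scale=10.0):
    """Hypothetical graded reward for tasks with a notion of closeness,
    e.g. numeric answers: 1 for an exact match, decaying towards 0 with
    distance.  The distance measure and the scale are assumptions."""
    return max(0.0, 1.0 - abs(suggested - correct) / scale)
```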
- \OQ Choosing between two answers as in [@StuhDelCog] wouldn't work well here. Or would it? \zkref{Z22a1b1} \zkref{Z22a1b1a} Why do they do it?
- \TODO Get a general idea about how rewards are determined in RL. \zkref{Z22a1b2}
- \TODO Read again about ‘reward engineering’ [@ChriREngP].
\zkref{Z22b}
This is almost the same as in SupAmp, except that each task description now
needs an evaluation prescription that tells $H$ how to turn $X$'s suggested
answer into a reward.
- \TODO Is this really so? – Sketch how to adapt all the scripts to evaluation.
To the task description in [@CSASupAmp{app. C}] add: ‘Evaluation: If the
suggested root answer equals the answer to the second sub-question, return
reward 1, otherwise return reward 0.’
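A minimal sketch of that evaluation prescription, assuming the sub-answers arrive as a list in the order they were asked (the function and parameter names are made up):

```python
def evaluate_permutation_powering(suggested_root_answer, sub_answers):
    """Evaluation prescription from above: compare the suggested root answer
    with the answer to the second sub-question and return reward 1 or 0."""
    return 1.0 if suggested_root_answer == sub_answers[1] else 0.0
```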
\OQ What is the greatest power that we expect $H$ to figure out without asking non-primitive sub-questions? This goes back to an open question in section \ref{sec:adapt-mechanism}.
\begin{example}
Given this permutation:

[permutation table over eight elements]

\end{example}
Following the evaluation prescription above,
Is it that
Or does
Maybe when
Question: Could I get a better training signal by asking for all the intermediate results and making the reward the number of correct mappings? Answer: Maybe, but then I would prevent the learning of more efficient representations.
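One possible reading of that partial-credit reward, sketched under the assumption that answers are permutations represented as tuples of images (the representation and names are my assumptions): count how many individual mappings $x \mapsto \sigma^k(x)$ in the suggested answer are correct.

```python
def mapping_reward(suggested, correct):
    """Hypothetical partial-credit reward: the fraction of positions whose
    image in the suggested permutation matches the correct sigma^k.
    Both arguments are tuples of images, e.g. (2, 3, 1) maps 1->2, 2->3, 3->1."""
    return sum(s == c for s, c in zip(suggested, correct)) / len(correct)
```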
This task would work well for early experiments: start with small
permutations and low powers, then scale both up.
In this part of the project I'm not testing a hypothesis, but trying to turn
SupAmp into an RL-based system (ReAmp) with the same performance. In order to
judge success, I need to compare ReAmp with SupAmp by measuring accuracy over
the amount of training data for both systems.
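A minimal sketch of that comparison, assuming both training runs log (amount of training data seen, accuracy) pairs at evaluation checkpoints; the logging format and names are my assumptions.

```python
import matplotlib.pyplot as plt

def plot_learning_curves(curves):
    """`curves` maps a system name ('SupAmp', 'ReAmp') to a list of
    (training_examples_seen, accuracy) pairs from evaluation checkpoints."""
    for name, points in curves.items():
        xs, ys = zip(*points)
        plt.plot(xs, ys, label=name)
    plt.xlabel("training examples seen")
    plt.ylabel("accuracy")
    plt.legend()
    plt.show()
```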
- \OQ Can I make sampling more effective by preferring questions that $H$ has
  asked as sub-questions before? – I could try this with SupAmp.
- \OQ Is it efficient to start training $X$ and $H'$ at the same time?
  Wouldn't it be better to wait until $H'$ is reasonably accurate in
  predicting the initial behaviour of $H$? And pause training of $X$ whenever
  the accuracy of $H'$ drops below a threshold. (See the sketch after this
  list.)
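A sketch of the accuracy gate from the second point, with made-up names and an arbitrary threshold; it assumes we can measure $H'$'s accuracy against recent recordings of $H$.

```python
def should_train_x(h_prime_accuracy, threshold=0.9):
    """Only update X while H' predicts H well enough.
    The threshold value is an arbitrary assumption."""
    return h_prime_accuracy >= threshold

# Inside the training loop (names as in the sketch of the three processes):
#     acc = accuracy_of_h_prime_on(recent_recordings)   # hypothetical helper
#     if should_train_x(acc):
#         X.rl_update(q, a, r)
```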
These are points that came to my mind. Turning SupAmp into ReAmp doesn't depend on them.
\TODO I will have to learn a lot in order to understand the architecture described in [@CSASupAmp{sec. 2.5}].
- autoregressive models (not sure if this is required with RL)
- Transformer architecture (encoder-decoder, self-attention)
- embeddings and linear projection
- more about neural nets in general
- pointer networks
- more about reinforcement learning