-
Notifications
You must be signed in to change notification settings - Fork 50
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
0,1, 2, or 3 address form? #489
Comments
What would be the representation of instructions with 2 args? |
As a way of answering this, could we look at the code generated by the RegCPython fork and analyse the number of places where its 3 arg instructions could be reduced to 2 arg form? I guess if they aren't already re-using a register that would require some sort of register usage tracking... |
Not sure - their compiler wasn't restricted to 2 args, so it might use 3 when it doesn't need to. I don't know whether we really need to decide this now. We could do something flexible (like I did in my branch): leave the instruction layout as it is, add opcodes to set OPARG1, OPARG2, OPARG3 to the register they need to be (merge them into superinstructions so there is only one dispatch no matter how many opargs there are). We can get pretty far with that, and then experiment. |
|
I just realized that the recipe I wrote here works with 2 address instructions as well, since binary operations write their result in the same location as the first argument. |
In my experience, 1-address is the optimal tradeoff for infinite-register machines, since the existence of the accumulator (which almost never needs to be spilled in practice) means you can use a fixed 16 bits almost all of the time. 2-address works fine for actual hardware since the number of registers is very small, so you can encode 2 arguments within a word. But bytecode VMs do not want to limit themselves to such a small number of reigsters (it's possible, but it necessarily means the bytecode is only efficient on hardware with similar assumptions), and multiple My suggested example bytecode:
The main point of flexibility is exactly how the |
But of course the important question here is: which is faster in an interpreter? 1 address because of decoding simplicity, or 2 address because it uses fewer instructions? |
32 bits most likely. No bigger than the 3 address form, and no smaller, but with fewer |
My point is: the architectural addition of the accumulator means that 1-address usually does not have many, if any, more instructions than 2-address. And always fewer bytes. Adding dedicated 4-byte (or another table?) We're agreed about 3-argument being bad though. Since the "change index meanings per BB" discussion raced me, I would note that actually changing for every BB is certainly suboptimal (as is having a new output register for every instruction - there are a lot of cases where the same local is used or nothing is stored), but if you instead change only when you hit the 256 limit, that means you can avoid The most important thing is eliminating the (in-function) stack, though. Temporaries, if they can't stay in the accumulator, are just registers with no (meaningful) name. (note that between functions, the "stack" does exist; outgoing arguments deliberately overlap with the next frame's incoming arguments (though they might instead need to be copied if we've run out of stack space in the current allocation - we do not want to copy the whole stack); they (whether used or not), along with the accumulator (... actually, I guess for Python there is always a return value to put in it), should always be considered clobbered across calls (but not all jumps, to avoid pessimizing |
@o11c You say "In my experience", care to elaborate? You state that "1-address usually does not have many, if any, more instructions than 2-address". Why not? "outgoing arguments deliberately overlap with the next frame's incoming arguments" Where do the globals, builtins, etc. go?
Published work suggests otherwise #485 |
There are a few reasons why an n+1 address architecture may be better than an n address architecture in the case of a reference counting interpreter:
While playing around with manually compiling a bit of code, I saw the incref/decrets decrease with the number of instruction addresses. The step from 2 to 3 addresses was bigger than I expected. The biggest surprise was that the calculating bytecodes (so: not jumps, etc) all take on average one So I would not immediately reject a 3 register architecture, even with more parameter decoding and longer instructions. Edit: I read the Shi, Casey, Ertl, Gregg paper. It is quite convincing, and it is obvious that a three address register machine (with decent move elimination) is the way forward. |
Extra data will be very useful - like which comparison operator to use for a COMPARE_OP, or which of the args need to be decref'ed. Looks like don't need to worry about EXTENDED_ARGS. There may be a few COPYs that we still need (if neither arg can be overwritten) but I think the most common ones will go. |
Maybe we can just close this with the conclusion that we're striving for a 3-address form (with the occasional extension to 4-address, in particular for |
I realized something that might have contributed to the perf slowdown Irit measured for her first register VM prototype. The discussion about the This is the insight: Having additional variables I'm not sure what to do about this yet, but given that the GCC discussion mentioned as much as a 10% slowdown, this might be a big effect. |
There's one particular remark in that piece that may give a hint in how to proceed:
I am a bit disappointed by GCC not doing a better job at variable lifetime analysis here, so we'll have to help it. |
So the new plan is that |
In summary, we decided on three address form. |
We currently have a 0-address (stack machine) instruction format. All inputs are taken from the stack and all outputs are pushed to the stack.
We are investigating a 3-address (register machine) instruction format.
But what about one or two address formats?
One address format
A.K.A. register-accumulator. One input and the output is the accumulator. The other input is explicit.
Two address format
x86 machine instructions are typically two-address instructions. The output goes to the same place as one of the inputs.
I don't think one-address format will be useful, but the two address form is interesting, as it consumes one of its argument references, but not both which may have advantages in terms of object re-use.
Minimizing reference count operations and instruction counts.
Which format minimizes reference count operations and instruction counts?
I think an AST level analysis should be sufficient to answer this.
Examples
f
is global, the others are local:0-address
2-address
3-address
The two address form looks promising as is it more compact. Don't know if it would work as well in general.
@iritkatriel what do you think?
The text was updated successfully, but these errors were encountered: