Faster schedule_output_reactions #747
Conversation
Ah, I was looking for a vector. Thank you!
The Python tests should still fail, but the C and C++ tests "should" pass.
Well, the C tests are passing now (yay!), and, as expected, the Python tests are still failing. However, I have not yet observed any decrease in execution times in this branch. I will keep tinkering, running the profiler with the "simulate cache" option, etc.
It might be possible to make a test case that would demonstrate the possible performance improvement. I think the test case would need to produce sparse outputs on a wide multiport that is sent to a large bank. It may be that the benchmarks don't have such sparse outputs.
The hope is that this improves cache performance by allowing us to avoid operating on multiple big data structures at once. (A slight simplification to generateRemoteTriggerTable is also included.)
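For readers skimming this thread, here is a minimal sketch of the idea, with invented names (`present_ports_t`, `record_present`, `schedule_present`); this is not the actual C runtime API. Ports are appended to one compact vector as they are set, so the scheduling pass walks that single vector instead of sweeping every port of every multiport and bank:

```c
#include <stdlib.h>

typedef struct port_t port_t;  // opaque stand-in for a runtime port

// A growable vector of the ports that are present at the current tag.
typedef struct {
    port_t** data;
    size_t size;
    size_t capacity;
} present_ports_t;

// Record a port as present at the moment it is written.
static void record_present(present_ports_t* v, port_t* p) {
    if (v->size == v->capacity) {
        size_t new_capacity = v->capacity ? 2 * v->capacity : 16;
        port_t** grown = realloc(v->data, new_capacity * sizeof(port_t*));
        if (grown == NULL) abort();  // a real runtime would report the error
        v->data = grown;
        v->capacity = new_capacity;
    }
    v->data[v->size++] = p;
}

// The scheduling pass touches only the ports that were actually set,
// rather than every port of every multiport and bank.
static void schedule_present(present_ports_t* v, void (*schedule)(port_t*)) {
    for (size_t i = 0; i < v->size; i++) schedule(v->data[i]);
    v->size = 0;  // reset for the next tag
}
```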
Update: Since commit c4ee135, I think I might be seeing the 15%-ish speedup on Chameneos that was prefigured by the 15%-ish reduction in instruction reads that I observed earlier. This was expected because I think Chameneos does have the property that @edwardalee described above. What remains at this point is to figure out why two Python tests are failing and to do a more systematic evaluation across multiple different benchmarks.
The following data were generated using a script that I added. All benchmarks were executed as written, without changing any parameters; for example, I used as many threads as were specified in the original files. Certain benchmarks were omitted due to compilation errors on my end, such as GMP not being available.
A quick look at the results suggests that benchmarks with sparse outputs did better in this branch. Benchmarks such as Philosophers, SortedLinkList, and Dictionary that do not involve sparse outputs (in particular, where many or all outputs in a large multiport are set) tended to do worse. This is not surprising. Perhaps we need to consider whether outputs are likely to be mostly present or mostly absent in real applications. It might also be worth considering that this branch should be only a constant factor slower than master in the worst case, whereas if there is a large enough sparse output, master can be slower by an arbitrarily large factor.

EDIT: Above, I seemed to suggest that the problem was that this optimization causes more instructions to be executed. In fact, that is not at all obvious: not many more instructions are executed in this branch than in master, and it seems possible that even fewer instructions could be executed if more optimization were done. The number of instructions executed could not possibly have been the problem for SortedLinkList, which is the most egregious case, because nearly all of the instructions executed are in the linear-time linked list operations, in a reaction that does not even invoke SET or any similar macro, ever. The real problem, I now believe, is that accesses to this big vector of present ports harm cache performance because they result in nodes of the list disappearing from the cache.

EDIT 2: My second guess was also probably wrong. Now I think the issue was heap fragmentation. This problem has been solved for SortedLinkList, but now another problem has arisen for NQueens, which I also think might have something to do with heap fragmentation.
There is already a quite powerful python script in |
This is very cool! Perhaps one option to try is to not use a resizing array, but rather a fixed-size array, and when that fixed size is exceeded, fall back on the old method. This could even be sticky, in the sense that if you fall back once, you fall back permanently, on the assumption that you have learned something about the application (namely, that it tends not to write outputs sparsely). A sketch of what this could look like follows.
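For concreteness, here is one way that sticky fallback could look. This is just a sketch of the suggestion above, with invented names and an arbitrary capacity; it is not the runtime's actual code:

```c
#include <stdbool.h>
#include <stddef.h>

#define PRESENT_CAPACITY 128  // fixed size; the right number would need tuning

typedef struct port_t port_t;

static port_t* present[PRESENT_CAPACITY];
static size_t present_count = 0;
static bool fell_back = false;  // sticky: once true, stays true forever

// Try to record a present port on the fast path. Returns false if the
// caller should use the old method instead.
static bool record_present(port_t* p) {
    if (fell_back) return false;
    if (present_count == PRESENT_CAPACITY) {
        // Overflow: this application apparently does not write its outputs
        // sparsely, so give up on the fast path for the rest of execution.
        fell_back = true;
        return false;
    }
    present[present_count++] = p;
    return true;
}
```

The appeal of the sticky version is that the check is a single branch, and a program that writes densely pays the cost of the overflowing fast path at most once.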
I will try it and compare the results to commit 9ba47fb. Thanks!
@petervdonovan I think most of your points could be addressed with the Python script, but they would require a bit of work. Some of your points have also annoyed me in the past, but I didn't have the time to look for a better solution. I created lf-lang/benchmarks-lingua-franca#24 to keep track of this and will try to implement some improvements soon. Let me know if there is anything else we should add to the list. I am not sure which hypothesis test you are referring to, but I think it should be implemented by a script that analyzes the data after running. While I agree that an adaptive strategy that decides how many iterations to run (e.g., until the CI is below a certain threshold) would be useful, I don't think that it is easily possible to implement in the current hydra-based script.
…plementations." This reverts commit 3095a73. The implementation was getting uncomfortably complicated. Other ideas will be tried before this is revisited.
…space-inefficient way.
The most recent version of this branch is a bit better now, and I think that the test failure might just be happening due to memory issues compiling SleepingBarber (but I'm not sure). Here are the numbers from Wessel, sorted according to how each benchmark compares with master. In this branch, Throughput takes 5% more time to execute, and SleepingBarber takes 46% less time to execute. I have not thoroughly investigated why this branch is slightly slower or faster on certain benchmarks, nor am I sure which version in this branch is best. Because of hardware dependence, it is tricky to pinpoint what is and is not important. For example, there is the caveat that on my laptop, NQueens is ~20% slower for some reason. However, the details might not be very important here because, regardless of any future micro-optimizations in this branch, it might never be a good idea to merge this PR. I have not even tried Edward's suggestion given here yet. To do so, I will make a completely separate branch that might replace this one. Additionally, depending on decisions that we might make about memory management, the approach used in this branch might be unacceptable.
This PR is rather stale by now. What should we do with it?
Let's close it. This PR might be, or might once have been, a net win, but the objectives that it was aiming for are no longer important to me, and at this point I am much more interested in correctness and code quality than in these kinds of little optimizations.
This PR follows some suggestions from @edwardalee about how reactions ought to be triggered. It is related to #736 in that both involve saving more data when the SET macro is used, to avoid comprehensively addressing each possible outcome after the fact.
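To illustrate the general pattern, here is a hedged sketch with a hypothetical macro (`MY_SET`) and a simplified port struct; the real SET macro in the C runtime does considerably more. The moment a value is assigned, the port is also remembered in a list of present ports, so nothing downstream has to sweep all possible ports to rediscover which ones were written:

```c
#include <stdbool.h>

// Simplified stand-in for a generated port struct.
typedef struct {
    int value;
    bool is_present;
} int_port_t;

// Hypothetical recording hook, as in the sketches above.
void record_present_port(void* port);

// Assign the value, mark the port present, and remember it for later.
#define MY_SET(port, val)             \
    do {                              \
        (port)->value = (val);        \
        (port)->is_present = true;    \
        record_present_port(port);    \
    } while (0)
```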