
Faster schedule_output_reactions #747

Closed
wants to merge 28 commits

Conversation

petervdonovan
Collaborator

@petervdonovan petervdonovan commented Nov 12, 2021

This PR follows some suggestions from @edwardalee about how reactions ought to be triggered. It is related to #736 in that both involve recording more data at the time the SET macro is used so that we do not have to comprehensively check every possible outcome after the fact.
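A minimal sketch of the idea in plain C (the names `port_t`, `present_vector_t`, and `set_port` are hypothetical; this is not the actual runtime code): whenever a port is set, it is appended to a vector of present ports, so the triggering logic can later visit only the ports that were actually set rather than scanning every channel.

```c
#include <stdbool.h>
#include <stdlib.h>

// Hypothetical port representation: a presence flag and a value.
typedef struct {
    bool is_present;
    int value;
} port_t;

// Hypothetical resizing vector of the ports that were set at the current tag.
typedef struct {
    port_t **ports;
    size_t size;
    size_t capacity;
} present_vector_t;

// Record a port as present, growing the vector as needed. This is the extra
// bookkeeping that a SET-like macro would perform.
static void set_port(present_vector_t *v, port_t *p, int value) {
    p->value = value;
    if (!p->is_present) {  // Record each port at most once per tag.
        p->is_present = true;
        if (v->size == v->capacity) {
            v->capacity = v->capacity ? v->capacity * 2 : 8;
            v->ports = realloc(v->ports, v->capacity * sizeof(port_t *));
        }
        v->ports[v->size++] = p;
    }
}

// Later, only the recorded ports need to be visited to trigger downstream
// reactions, instead of scanning every channel of every output port.
static void schedule_from_present(present_vector_t *v) {
    for (size_t i = 0; i < v->size; i++) {
        // ... enqueue the reactions triggered by v->ports[i] ...
    }
    v->size = 0;  // Reset for the next tag.
}

int main(void) {
    port_t bank[4] = {0};
    present_vector_t v = {0};
    set_port(&v, &bank[2], 42);  // Only one of four ports is set...
    schedule_from_present(&v);   // ...so only one entry is visited here.
    free(v.ports);
    return 0;
}
```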

@Soroosh129
Contributor

Ah, I was looking for a vector. Thank you!

@petervdonovan
Collaborator Author

Well, C tests are passing now (yay!), and as expected, schedule_output_reactions now seems to account for a smaller proportion of the instructions executed than it does on master. This claim is based on the fact that in Chameneos, 34% of the instructions executed come from schedule_output_reactions in this branch, as opposed to 51% on master.

However, I have not yet observed any decrease in execution times in this branch. I will keep tinkering, running the profiler with the "simulate cache" option, etc.

@edwardalee
Collaborator

It might be possible to make a test case that would demonstrate the possible performance improvement. I think the test case would need to produce sparse outputs on a wide multiport that is sent to a large bank. It may be that the benchmarks don't have such sparse outputs.

The hope is that this improves cache performance by allowing us to avoid operating on multiple big data structures at once.
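For illustration only, here is a rough standalone C sketch (not an existing test; `WIDTH` and `STRIDE` are made up) of the kind of write pattern described above: only a few channels of a wide multiport are set at a given tag, so an exhaustive scan over all channels mostly finds nothing.

```c
#include <stdbool.h>
#include <stdio.h>

#define WIDTH 1024   // Hypothetical multiport width.
#define STRIDE 256   // Only every 256th channel gets written.

int main(void) {
    bool present[WIDTH] = {false};

    // Stand-in for a reaction body that calls SET on only a few channels.
    for (int i = 0; i < WIDTH; i += STRIDE) {
        present[i] = true;
    }

    // An exhaustive scan over the multiport has to touch all WIDTH entries
    // just to find the handful of channels that are present.
    int found = 0;
    for (int i = 0; i < WIDTH; i++) {
        if (present[i]) found++;
    }
    printf("%d of %d channels present\n", found, WIDTH);
    return 0;
}
```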

(A slight simplification to generateRemoteTriggerTable is also included.)
@petervdonovan
Collaborator Author

Update: Since commit c4ee135, I think I might be seeing the 15%-ish speedup on Chameneos that was prefigured by the 15%-ish reduction in instruction reads that I observed earlier. This was expected because I think Chameneos does have the property that @edwardalee described above. What remains at this point is to figure out why two Python tests are failing and to do a more systematic evaluation across multiple different benchmarks.

@petervdonovan
Collaborator Author

petervdonovan commented Nov 16, 2021

The following data were generated using a script that I added in the a-b-test branch. This script is a work-in-progress, and I am not sure if it is something that we would consider merging.

All benchmarks were executed as written, without changing any parameters. For example, I used as many threads as were specified in the original files. Certain benchmarks were omitted due to compilation errors on my end, such as GMP not being available.

  • "n1" is the number of samples taken from master, and "n2" is the number of samples taken from this branch.
  • The "change" column is the proportional change in execution time in this branch relative to master. For example, SleepingBarber is apparently 49% faster in this branch, and SortedLinkList is 24% slower.

A quick look at the results suggests that benchmarks with sparse outputs did better in this branch. Benchmarks such as Philosophers, SortedLinkList, and Dictionary that do not involve sparse outputs (in particular, where many or all outputs in a large multiport are set) tended to do worse. This is not surprising: when the outputs are dense, master only has to iterate over the n entries of the output array to find the n outputs that were set, whereas this branch additionally has to build a separate resizing array with n entries before iterating over that array.

[image: table of benchmark results, with columns n1, n2, and change]
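To make the dense-case overhead concrete, here is a rough standalone sketch (hypothetical code, not the runtime itself) of the two strategies when every one of N channels is set: the single scan already finds a present port at every entry, while the vector-based approach additionally pays for growing and then re-traversing a second array.

```c
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>

#define N 1024   // Hypothetical multiport width; every channel is set (dense).

int main(void) {
    bool present[N];
    for (int i = 0; i < N; i++) present[i] = true;

    // Master-style: one pass over the N channels finds all N set outputs.
    int triggered_scan = 0;
    for (int i = 0; i < N; i++) {
        if (present[i]) triggered_scan++;   // Stand-in for enqueueing reactions.
    }

    // This branch: first build a resizing array of the present channels
    // (with occasional reallocation; error checking omitted), then iterate
    // over it. In the dense case this is strictly more memory traffic than
    // the single scan above.
    int *vec = NULL;
    size_t size = 0, cap = 0;
    for (int i = 0; i < N; i++) {
        if (present[i]) {
            if (size == cap) {
                cap = cap ? cap * 2 : 8;
                vec = realloc(vec, cap * sizeof(int));
            }
            vec[size++] = i;
        }
    }
    int triggered_vec = 0;
    for (size_t j = 0; j < size; j++) triggered_vec++;
    free(vec);

    printf("scan: %d, vector: %d\n", triggered_scan, triggered_vec);
    return 0;
}
```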

Perhaps we need to consider whether outputs are likely to be mostly true or mostly false in real applications. It might also be worth considering that this branch should be only a constant factor slower than master in the worst case, whereas if there is a large enough sparse output, master may be arbitrarily slow relative to this branch.

EDIT: Above, I seemed to suggest that the problem was that this optimization causes more instructions to be executed. In fact, that is not at all obvious: not many more instructions are executed in this branch than in master, and it seems possible that even fewer instructions could be executed if more optimization were done. The number of instructions executed could not possibly have been the problem for SortedLinkList, which is the most egregious case, because nearly all of the instructions executed are in the linear-time linked list operations, in a reaction that never even invokes SET or any similar macro. The real problem, I now believe, is that accessing this big vector of present ports harms cache performance because it causes nodes of the list to be evicted from the cache.

EDIT 2: My second guess was also probably wrong. Now I think the issue was heap fragmentation. This problem has been solved for SortedLinkList, but now another problem has arisen for NQueens, which I also think might have something to do with heap fragmentation.
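The fix is not spelled out here, so the following is only an illustrative assumption, not necessarily what this branch does: one common way to avoid this kind of fragmentation is to preallocate the present-ports vector once, with capacity for every output channel, so that it is never reallocated while the application is running.

```c
#include <stdlib.h>

// Hypothetical: if the total number of output channels is known at startup,
// the present-ports vector can be allocated once with that capacity, so it
// is never reallocated (and never freed) while reactions are running.
typedef struct {
    void **ports;     // Pointers to whatever the runtime uses for ports.
    size_t size;
    size_t capacity;
} present_vector_t;

static present_vector_t preallocate_present_vector(size_t total_channels) {
    present_vector_t v;
    v.ports = malloc(total_channels * sizeof(void *));  // Error checking omitted.
    v.size = 0;
    v.capacity = total_channels;
    return v;
}

int main(void) {
    // Hypothetical sizing: 16 reactors with 64 output channels each.
    present_vector_t v = preallocate_present_vector(16 * 64);
    // ... run the program; v.ports never moves, so its growth cannot
    // interleave with the application's own allocations ...
    free(v.ports);
    return 0;
}
```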

@cmnrd
Collaborator

cmnrd commented Nov 16, 2021

The following data were generated using a script that I added in the a-b-test branch. This script is a work-in-progress, and I am not sure if it is something that we would consider merging.

There is already a quite powerful Python script in benchmark/runner. It is documented here. Let me know if your use case is not yet supported by the script; in that case we should extend it.

@edwardalee
Collaborator

This is very cool! Perhaps one option to try is to not use a resizing array, but rather a fixed-size array, and when that fixed size is exceeded, fall back on the old method. This could even be sticky, in the sense that if you fall back once, you fall back permanently, on the assumption that you have learned something about the application (namely, that it tends not to write outputs sparsely).
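A rough C sketch of this suggestion (hypothetical names and sizes; not an implementation from this PR): record set ports in a small fixed-size buffer, and the first time that buffer overflows, set a sticky flag so that every subsequent tag uses the exhaustive scan instead.

```c
#include <stdbool.h>
#include <stddef.h>

#define SPARSE_CAPACITY 32   // Hypothetical fixed budget for "sparse" tags.

typedef struct {
    void  *recorded[SPARSE_CAPACITY]; // Ports set at the current tag.
    size_t count;
    bool   overflowed;                // Sticky: once true, stays true.
} sparse_tracker_t;

// Called from a SET-like macro. Returns false once tracking has been
// abandoned, in which case the caller relies on the exhaustive scan.
static bool record_set_port(sparse_tracker_t *t, void *port) {
    if (t->overflowed) return false;
    if (t->count == SPARSE_CAPACITY) {
        // The application has shown that it does not write sparsely;
        // fall back permanently to the old method.
        t->overflowed = true;
        return false;
    }
    t->recorded[t->count++] = port;
    return true;
}

// Called when scheduling downstream reactions after a reaction executes.
static void schedule_outputs(sparse_tracker_t *t) {
    if (t->overflowed) {
        // Old method: scan every output channel for present ports.
    } else {
        for (size_t i = 0; i < t->count; i++) {
            // Trigger only the reactions downstream of t->recorded[i].
        }
    }
    t->count = 0;   // Reset per-tag bookkeeping; overflowed stays sticky.
}

int main(void) {
    sparse_tracker_t tracker = {0};
    int dummy_ports[4];
    for (int i = 0; i < 4; i++) {
        record_set_port(&tracker, &dummy_ports[i]);
    }
    schedule_outputs(&tracker);  // Visits only the 4 recorded ports.
    return 0;
}
```

The sticky flag bounds the cost of the fixed buffer: an application that writes densely pays for the overflow check only until the first overflow, after which it behaves like the old method.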

@petervdonovan
Collaborator Author

petervdonovan commented Nov 16, 2021

Let me know if your use case is not yet supported by the script; in that case we should extend it.

I agree it would be ideal to use the benchmark runner. There are two reasons why I did not think I could:

  1. I wanted to be able to randomly alternate between running different binaries so that at any given time, either of the two binaries being compared was equally likely to be running. This is because of the concern I mentioned to you before about sequential runs being correlated with each other.
    As an illustration of why I hesitate to ignore this, the following plot shows the aggregated benchmark runs from my testing on the X axis paired with the immediately following run of the same executable on the Y axis. They are normalized by the mean execution time of the same executable. Pearson's r is 0.1965 and a rough p-value computed by Scipy is 6.52e-35.
    [image: scatter plot pairing each normalized benchmark run (X axis) with the immediately following run of the same executable (Y axis)]
    In order for the benchmark runner to do this, I assume it would have to switch branches both in the main repo and in the submodules while compiling the executables.
  2. I also wished to be able to run the benchmarks multiple times without re-compiling when errors occurred or when I realized I had set an inappropriate number of iterations. This also saves time if I wish to keep one set of binaries the same but only change the other set of binaries -- that way, they do not all have to be re-compiled.

These reasons compelled me to compile the binaries separately. However, the second reason is not very important. Maybe the benchmark runner could be changed to address the first reason -- or maybe the first reason is not very important either.

(Edit: Another extension that would be needed for this use case is the hypothesis test, but that is not so complicated. It would also be nice for the runner to automatically adjust the number of iterations according to how long each iteration takes, at least for internal use -- just not for external reporting, since that might look a little suspicious.)

@petervdonovan
Collaborator Author

Perhaps one option to try is to not use a resizing array, but rather a fixed-size array, and when that fixed size is exceeded, fall back on the old method.

I will try it and compare the results to commit 9ba47fb. Thanks!

@cmnrd
Collaborator

cmnrd commented Nov 22, 2021

@petervdonovan I think most of your points could be addressed with the Python script, but this would require a bit of work. Some of your points have also annoyed me in the past, but I didn't have the time to look for a better solution. I created lf-lang/benchmarks-lingua-franca#24 to keep track of this and will try to implement some improvements soon. Let me know if there is anything else we should add to the list.

I am not sure which hypothesis test you are referring to, but I think it should be implemented by a script that analyzes the data after the runs. While I agree that some adaptive strategy for deciding how many iterations to run (e.g., until the confidence interval is below a certain threshold) would be useful, I don't think that it is easily possible to implement in the current hydra-based script.

…plementations."

This reverts commit 3095a73. The implementation was getting uncomfortably complicated. Other ideas will be tried before this is revisited.
@petervdonovan
Collaborator Author

The most recent version of this branch is a bit better now, and I think that the test failure might just be happening due to memory issues compiling SleepingBarber (but I'm not sure).

Here are the numbers from Wessel, sorted according to how this branch compares with master on each benchmark. In this branch, Throughput takes 5% more time to execute, and SleepingBarber takes 46% less time to execute.

I have not thoroughly investigated why this branch is slightly slower or faster on certain benchmarks, nor am I sure which version in this branch is best. Because of hardware dependence, it is tricky to pinpoint what is/is not important. For example, there is a caveat that on my laptop, NQueens is ~20% slower for some reason.

However, the details might not be very important here because regardless of any future micro-optimizations in this branch, it might never be a good idea to merge this PR. I have not even tried Edward's suggestion given here yet. To do so, I will make a completely separate branch that might replace this one. Additionally, depending on decisions that we might make about memory management, the approach used in this branch might be unacceptable.

@lhstrh
Member

lhstrh commented Nov 8, 2023

This PR is rather stale by now. What should we do with it?

@petervdonovan
Collaborator Author

Let's close it. This PR might be, or might once have been, a net win, but the objectives that it was aiming for are no longer important to me and at this point I am much more interested in correctness and code quality than in these kinds of little optimizations.

@petervdonovan petervdonovan deleted the faster-schedule_output_reactions branch November 8, 2023 07:27