
Faster schedule_output_reactions #747

Closed
wants to merge 28 commits

Conversation

petervdonovan
Collaborator

@petervdonovan petervdonovan commented Nov 12, 2021

This PR follows some suggestions from @edwardalee about how reactions ought to be triggered. It is related to #736 in that both involve recording more data at the time the SET macro is used so that we do not have to comprehensively check every possible outcome after the fact.
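A minimal sketch of the idea in plain C (the names `port_t`, `present_vector_t`, and `set_port` are hypothetical; this is not the actual runtime code): whenever a port is set, it is appended to a vector of present ports, so the triggering logic can later visit only the ports that were actually set rather than scanning every channel.

```c
#include <stdbool.h>
#include <stdlib.h>

// Hypothetical port representation: a presence flag and a value.
typedef struct {
    bool is_present;
    int value;
} port_t;

// Hypothetical resizing vector of the ports that were set at the current tag.
typedef struct {
    port_t **ports;
    size_t size;
    size_t capacity;
} present_vector_t;

// Record a port as present, growing the vector as needed. This is the extra
// bookkeeping that a SET-like macro would perform.
static void set_port(present_vector_t *v, port_t *p, int value) {
    p->value = value;
    if (!p->is_present) {  // Record each port at most once per tag.
        p->is_present = true;
        if (v->size == v->capacity) {
            v->capacity = v->capacity ? v->capacity * 2 : 8;
            v->ports = realloc(v->ports, v->capacity * sizeof(port_t *));
        }
        v->ports[v->size++] = p;
    }
}

// Later, only the recorded ports need to be visited to trigger downstream
// reactions, instead of scanning every channel of every output port.
static void schedule_from_present(present_vector_t *v) {
    for (size_t i = 0; i < v->size; i++) {
        // ... enqueue the reactions triggered by v->ports[i] ...
    }
    v->size = 0;  // Reset for the next tag.
}

int main(void) {
    port_t bank[4] = {0};
    present_vector_t v = {0};
    set_port(&v, &bank[2], 42);  // Only one of four ports is set...
    schedule_from_present(&v);   // ...so only one entry is visited here.
    free(v.ports);
    return 0;
}
```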

@Soroosh129
Contributor

Ah, I was looking for a vector. Thank you!

@petervdonovan
Collaborator Author

Well, C tests are passing now (yay!), and as expected, schedule_output_reactions now seems to account for a smaller proportion of the instructions executed than it does on master. This claim is based on the fact that in Chameneos, 34% of the instructions executed come from schedule_output_reactions in this branch, as opposed to 51% on master.

However, I have not yet observed any decrease in execution times in this branch. I will keep tinkering, running the profiler with the "simulate cache" option, etc.

@edwardalee
Collaborator

It might be possible to make a test case that would demonstrate the possible performance improvement. I think the test case would need to produce sparse outputs on a wide multiport that is sent to a large bank. It may be that the benchmarks don't have such sparse outputs.

The hope is that this improves cache performance by allowing us to avoid operating on multiple big data structures at once.
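For illustration only, here is a rough standalone C sketch (not an existing test; `WIDTH` and `STRIDE` are made up) of the kind of write pattern described above: only a few channels of a wide multiport are set at a given tag, so an exhaustive scan over all channels mostly finds nothing.

```c
#include <stdbool.h>
#include <stdio.h>

#define WIDTH 1024   // Hypothetical multiport width.
#define STRIDE 256   // Only every 256th channel gets written.

int main(void) {
    bool present[WIDTH] = {false};

    // Stand-in for a reaction body that calls SET on only a few channels.
    for (int i = 0; i < WIDTH; i += STRIDE) {
        present[i] = true;
    }

    // An exhaustive scan over the multiport has to touch all WIDTH entries
    // just to find the handful of channels that are present.
    int found = 0;
    for (int i = 0; i < WIDTH; i++) {
        if (present[i]) found++;
    }
    printf("%d of %d channels present\n", found, WIDTH);
    return 0;
}
```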

(A slight simplification to generateRemoteTriggerTable is also included.)
@petervdonovan
Collaborator Author

Update: Since commit c4ee135, I think I might be seeing the 15%-ish speedup on Chameneos that was prefigured by the 15%-ish reduction in instruction reads that I observed earlier. This was expected because I think Chameneos does have the property that @edwardalee described above. What remains at this point is to figure out why two Python tests are failing and to do a more systematic evaluation across multiple different benchmarks.

@petervdonovan
Collaborator Author

petervdonovan commented Nov 16, 2021

The following data were generated using a script that I added in the a-b-test branch. This script is a work-in-progress, and I am not sure if it is something that we would consider merging.

All benchmarks were executed as written, without changing any parameters. For example, I used as many threads as were specified in the original files. Certain benchmarks were omitted due to compilation errors on my end, such as GMP not being available.

  • "n1" is the number of samples taken from master, and "n2" is the number of samples taken from this branch.
  • The "change" column is the proportional change in execution time in this branch relative to master. For example, SleepingBarber is apparently 49% faster in this branch, and SortedLinkList is 24% slower.

A quick look at the results suggests that benchmarks with sparse outputs did better in this branch. Benchmarks such as Philosophers, SortedLinkList, and Dictionary that do not involve sparse outputs (in particular, where many or all outputs in a large multiport are set) tended to do worse. This is not surprising: when the outputs are dense, master only has to iterate over the n entries of the output array to find the n outputs that were set, whereas this branch additionally has to build a separate resizing array with n entries before iterating over that array.

[image: table of benchmark results, with columns n1, n2, and change]
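To make the dense-case overhead concrete, here is a rough standalone sketch (hypothetical code, not the runtime itself) of the two strategies when every one of N channels is set: the single scan already finds a present port at every entry, while the vector-based approach additionally pays for growing and then re-traversing a second array.

```c
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>

#define N 1024   // Hypothetical multiport width; every channel is set (dense).

int main(void) {
    bool present[N];
    for (int i = 0; i < N; i++) present[i] = true;

    // Master-style: one pass over the N channels finds all N set outputs.
    int triggered_scan = 0;
    for (int i = 0; i < N; i++) {
        if (present[i]) triggered_scan++;   // Stand-in for enqueueing reactions.
    }

    // This branch: first build a resizing array of the present channels
    // (with occasional reallocation; error checking omitted), then iterate
    // over it. In the dense case this is strictly more memory traffic than
    // the single scan above.
    int *vec = NULL;
    size_t size = 0, cap = 0;
    for (int i = 0; i < N; i++) {
        if (present[i]) {
            if (size == cap) {
                cap = cap ? cap * 2 : 8;
                vec = realloc(vec, cap * sizeof(int));
            }
            vec[size++] = i;
        }
    }
    int triggered_vec = 0;
    for (size_t j = 0; j < size; j++) triggered_vec++;
    free(vec);

    printf("scan: %d, vector: %d\n", triggered_scan, triggered_vec);
    return 0;
}
```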

Perhaps we need to consider whether outputs are likely to be mostly true or mostly false in real applications. It might also be worth considering that this branch should be only a constant factor slower than master in the worst case, whereas if there is a large enough sparse output, master may be arbitrarily slow relative to this branch.

EDIT: Above, I seemed to suggest that the problem was that this optimization causes more instructions to be executed. In fact, that is not at all obvious: not many more instructions are executed in this branch than in master, and it seems possible that even fewer instructions could be executed if more optimization were done. The number of instructions executed could not possibly have been the problem for SortedLinkList, which is the most egregious case, because nearly all of the instructions executed are in the linear-time linked list operations, in a reaction that never even invokes SET or any similar macro. The real problem, I now believe, is that accessing this big vector of present ports harms cache performance because it causes nodes of the list to be evicted from the cache.

EDIT 2: My second guess was also probably wrong. Now I think the issue was heap fragmentation. This problem has been solved for SortedLinkList, but now another problem has arisen for NQueens, which I also think might have something to do with heap fragmentation.
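The fix is not spelled out here, so the following is only an illustrative assumption, not necessarily what this branch does: one common way to avoid this kind of fragmentation is to preallocate the present-ports vector once, with capacity for every output channel, so that it is never reallocated while the application is running.

```c
#include <stdlib.h>

// Hypothetical: if the total number of output channels is known at startup,
// the present-ports vector can be allocated once with that capacity, so it
// is never reallocated (and never freed) while reactions are running.
typedef struct {
    void **ports;     // Pointers to whatever the runtime uses for ports.
    size_t size;
    size_t capacity;
} present_vector_t;

static present_vector_t preallocate_present_vector(size_t total_channels) {
    present_vector_t v;
    v.ports = malloc(total_channels * sizeof(void *));  // Error checking omitted.
    v.size = 0;
    v.capacity = total_channels;
    return v;
}

int main(void) {
    // Hypothetical sizing: 16 reactors with 64 output channels each.
    present_vector_t v = preallocate_present_vector(16 * 64);
    // ... run the program; v.ports never moves, so its growth cannot
    // interleave with the application's own allocations ...
    free(v.ports);
    return 0;
}
```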

@cmnrd
Collaborator

cmnrd commented Nov 16, 2021

The following data were generated using a script that I added in the a-b-test branch. This script is a work-in-progress, and I am not sure if it is something that we would consider merging.

There is already a quite powerful Python script in benchmark/runner. It is documented here. Let me know if your use case is not yet supported by the script; in that case we should extend it.

@edwardalee
Collaborator

This is very cool! Perhaps one option to try is to not use a resizing array, but rather a fixed-size array, and when that fixed size is exceeded, fall back on the old method. This could even be sticky, in the sense that if you fall back once, you fall back permanently, on the assumption that you have learned something about the application (namely, that it tends not to write outputs sparsely).
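A rough C sketch of this suggestion (hypothetical names and sizes; not an implementation from this PR): record set ports in a small fixed-size buffer, and the first time that buffer overflows, set a sticky flag so that every subsequent tag uses the exhaustive scan instead.

```c
#include <stdbool.h>
#include <stddef.h>

#define SPARSE_CAPACITY 32   // Hypothetical fixed budget for "sparse" tags.

typedef struct {
    void  *recorded[SPARSE_CAPACITY]; // Ports set at the current tag.
    size_t count;
    bool   overflowed;                // Sticky: once true, stays true.
} sparse_tracker_t;

// Called from a SET-like macro. Returns false once tracking has been
// abandoned, in which case the caller relies on the exhaustive scan.
static bool record_set_port(sparse_tracker_t *t, void *port) {
    if (t->overflowed) return false;
    if (t->count == SPARSE_CAPACITY) {
        // The application has shown that it does not write sparsely;
        // fall back permanently to the old method.
        t->overflowed = true;
        return false;
    }
    t->recorded[t->count++] = port;
    return true;
}

// Called when scheduling downstream reactions after a reaction executes.
static void schedule_outputs(sparse_tracker_t *t) {
    if (t->overflowed) {
        // Old method: scan every output channel for present ports.
    } else {
        for (size_t i = 0; i < t->count; i++) {
            // Trigger only the reactions downstream of t->recorded[i].
        }
    }
    t->count = 0;   // Reset per-tag bookkeeping; overflowed stays sticky.
}

int main(void) {
    sparse_tracker_t tracker = {0};
    int dummy_ports[4];
    for (int i = 0; i < 4; i++) {
        record_set_port(&tracker, &dummy_ports[i]);
    }
    schedule_outputs(&tracker);  // Visits only the 4 recorded ports.
    return 0;
}
```

The sticky flag bounds the cost of the fixed buffer: an application that writes densely pays for the overflow check only until the first overflow, after which it behaves like the old method.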

@petervdonovan
Collaborator Author

petervdonovan commented Nov 16, 2021

Let me know if your use case is not yet supported by the script; in that case we should extend it.

I agree it would be ideal to use the benchmark runner. There are two reasons why I did not think I could:

  1. I wanted to be able to randomly alternate between running different binaries so that at any given time, either of the two binaries being compared was equally likely to be running. This is because of the concern I mentioned to you before about sequential runs being correlated with each other.
    As an illustration of why I hesitate to ignore this, the following plot shows the aggregated benchmark runs from my testing on the X axis paired with the immediately following run of the same executable on the Y axis. They are normalized by the mean execution time of the same executable. Pearson's r is 0.1965 and a rough p-value computed by Scipy is 6.52e-35.
    [image: scatter plot pairing each normalized benchmark run (X axis) with the immediately following run of the same executable (Y axis)]
    In order for the benchmark runner to do this, I assume it would have to switch branches both in the main repo and in the submodules while compiling the executables.
  2. I also wished to be able to run the benchmarks multiple times without re-compiling when errors occurred or when I realized I had set an inappropriate number of iterations. This also saves time if I wish to keep one set of binaries the same but only change the other set of binaries -- that way, they do not all have to be re-compiled.

These reasons compelled me to compile the binaries separately. However, the second reason is not very important. Maybe the benchmark runner could be changed to address the first reason -- or maybe the first reason is not very important either.

(Edit: Another extension that would be needed for this use case is the hypothesis test, but that is not so complicated. It would also be nice for the runner to automatically adjust the number of iterations according to how long each iteration takes, at least for internal use -- just not for external reporting, since that might look a little suspicious.)

@petervdonovan
Collaborator Author

Perhaps one option to try is to not use a resizing array, but rather a fixed-size array, and when that fixed size is exceeded, fall back on the old method.

I will try it and compare the results to commit 9ba47fb. Thanks!

@cmnrd
Collaborator

cmnrd commented Nov 22, 2021

@petervdonovan I think most of your points could be addressed with the Python script, but this would require a bit of work. Some of your points have also annoyed me in the past, but I didn't have the time to look for a better solution. I created lf-lang/benchmarks-lingua-franca#24 to keep track of this and will try to implement some improvements soon. Let me know if there is anything else we should add to the list.

I am not sure which hypothesis test you are referring to, but I think it should be implemented by a script that analyzes the data after the runs. While I agree that some adaptive strategy for deciding how many iterations to run (e.g., until the confidence interval is below a certain threshold) would be useful, I don't think that it is easily possible to implement in the current hydra-based script.

…plementations."

This reverts commit 3095a73. The implementation was getting uncomfortably complicated. Other ideas will be tried before this is revisited.
@petervdonovan
Collaborator Author

The most recent version of this branch is a bit better now, and I think that the test failure might just be happening due to memory issues compiling SleepingBarber (but I'm not sure).

Here are the numbers from Wessel, sorted according to how this branch compares with master on each benchmark. In this branch, Throughput takes 5% more time to execute, and SleepingBarber takes 46% less time to execute.

I have not thoroughly investigated why this branch is slightly slower or faster on certain benchmarks, nor am I sure which version in this branch is best. Because of hardware dependence, it is tricky to pinpoint what is/is not important. For example, there is a caveat that on my laptop, NQueens is ~20% slower for some reason.

However, the details might not be very important here because regardless of any future micro-optimizations in this branch, it might never be a good idea to merge this PR. I have not even tried Edward's suggestion given here yet. To do so, I will make a completely separate branch that might replace this one. Additionally, depending on decisions that we might make about memory management, the approach used in this branch might be unacceptable.

@lhstrh
Member

lhstrh commented Nov 8, 2023

This PR is rather stale by now. What should we do with it?

@petervdonovan
Collaborator Author

Let's close it. This PR might be, or might once have been, a net win, but the objectives that it was aiming for are no longer important to me and at this point I am much more interested in correctness and code quality than in these kinds of little optimizations.

@petervdonovan petervdonovan deleted the faster-schedule_output_reactions branch November 8, 2023 07:27