remove autogenerated code from ChannelMixer #1966
Conversation
Force-pushed from 402db72 to 6124eb2.
Thank you.
Whoops, I did not intend to close this.
I tested playing two decks for a few minutes (using the same tracks with each test): master + 324b2fb cherry-picked:
this branch:
So this branch does seem to be slightly slower. The mean difference is 6426.9 nanoseconds. I think we can afford that to get rid of this awfully overcomplicated code.
Can you fix the merge conflicts? I'd like to reduce the number of open PRs. I think this is a candidate for merging.
The benchmark that proves this is a good idea is still missing. Without it, we should not touch this code.
Huh? I posted that above.
Our engine is still single-threaded, and with keylock enabled on just one channel it is already near the buffer maximum even on my machine (T470p). I'd rather not slow down the engine just to remove some automated code.
@Be-ing Do you plan to pick this up and possibly fix the performance issues, or should we close this?
I do not know if the performance of this could be improved further. But I think this is the wrong place to be nitpicking about performance. The difference I measured before was 6426.9 nanoseconds. If we want to optimize the audio processing code, we should be talking about metering and timestretching. Other software uses assembler or Intel intrinsics to optimize metering.
0.006426 milliseconds. I think we can live with that.
Maybe we can use Intel SIMD intrinsics to work on multiple channels at once? But we'd need to rewrite that code and maintain multiple code paths for non-x86 architectures.
A more practical approach is using SIMD instructions for multiple samples within a single buffer.
If we abstract it like this, we can also extend it to other architectures that offer a similar SIMD instruction set: https://blog.molecular-matters.com/2011/10/18/simdifying-multi-platform-math/
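To make the "multiple samples within a single buffer" idea concrete, here is a minimal SSE sketch (illustrative only; the function name and signature are made up and this is not code from Mixxx's SampleUtil):

```cpp
#include <xmmintrin.h> // SSE intrinsics, x86/x86_64 only

// Apply a constant gain to a sample buffer, four floats at a time.
// Alignment and remainder handling are kept deliberately simple.
void applyGainSse(float* pBuffer, float gain, int numSamples) {
    const __m128 vGain = _mm_set1_ps(gain);
    int i = 0;
    for (; i + 4 <= numSamples; i += 4) {
        __m128 v = _mm_loadu_ps(pBuffer + i);
        _mm_storeu_ps(pBuffer + i, _mm_mul_ps(v, vGain));
    }
    for (; i < numSamples; ++i) { // scalar tail
        pBuffer[i] *= gain;
    }
}
```

The linked article's approach would hide the __m128 type and the load/multiply/store operations behind a thin abstraction so the same code could target SSE, NEON, or other SIMD instruction sets.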
Force-pushed from 324b2fb to 6f02998.
Force-pushed from 6f02998 to 0a5a203.
I'm sorry, this was not clear to me. So all that is left for this PR is that someone takes on the responsibility of approving it and then hitting merge?
I don't insist. I guess I was not clear in my post. It is not too early to vote now, but basing our votes on comments that see no regression in Uwe's test was too early. I hope my post has clarified the situation.
… oh, I see it was already considered bikeshedding. That's not fair, especially because I have just complained about such a vote.
for (int i = 0; i < activeChannels->size(); ++i) {
    EngineMaster::ChannelInfo* pChannelInfo = activeChannels->at(i);
Consider replacing this with a ranged for loop; it might even result in some performance benefit.
How could that provide a performance benefit?
Compilers may be able to optimize better when the code conveys its intent more clearly.
For example, activeChannels->size() is probably called on every iteration, while the ranged-for implementation would only compare the begin and end pointers of the iterator. I'm not saying this is a huge benefit; I just believe that we should do what compiler writers expect from us so we write better code and they can possibly produce better binaries.
Readability would benefit from a range-based for loop. That's the main advantage; everything else is just a bonus.
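For illustration, a minimal self-contained sketch of the two forms being discussed (ChannelInfo and the container here are stand-ins for the real EngineMaster types, which are not shown in this diff):

```cpp
#include <vector>

struct ChannelInfo { /* stand-in for EngineMaster::ChannelInfo */ };

void process(const std::vector<ChannelInfo*>* activeChannels) {
    // Index-based form, as in the diff above: size() may be re-evaluated
    // on every iteration unless the compiler can prove it does not change.
    for (int i = 0; i < static_cast<int>(activeChannels->size()); ++i) {
        ChannelInfo* pChannelInfo = activeChannels->at(i);
        (void)pChannelInfo; // ... process the channel ...
    }

    // Range-based form: iterates from begin() to end() directly and makes
    // the intent (visit every element exactly once) explicit.
    for (ChannelInfo* pChannelInfo : *activeChannels) {
        (void)pChannelInfo; // ... process the channel ...
    }
}
```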
Please elaborate, I don't understand what you mean.
The thumbs-down on my post and the comment: #1966 (comment)
I agree, continuously repeating statements as a means to pressure actions is not nice either. I ask @Be-ing to refrain from this pattern of behavior; it's useless and happens too often.
Just providing some context -- I can't remember when this was written, but it was a big win at the time performance-wise on my Lenovo T400 and my eeePC. This was around the time I was rewriting the engine to support more than 2 decks, and we were adding samplers and other mixing sources. I'm surprised there isn't a mixing benchmark testing this code. I was pretty sure I wrote one -- maybe I never committed it. I don't unroll loops or produce annoying-to-maintain code for fun -- just so everyone is clear :).
Of course, I don't think anyone thought you did. :P But those optimizations became irrelevant with #1254 because mixing and applying the gain of the faders cannot be done in the same step with postfader effects.
Ah.. 2013.
Right.. the original code called out to an inlined autogenerated "mix and apply ramping gain to N channels" routine, which was the key performance win over serial application of ramping gain followed by accumulation into an output buffer. Writing the code that way let auto-vectorization take over and do good things. (At the time we were writing SIMD by hand in SampleUtil too, but that was a pain to maintain.) The one additional bit of information we're giving the compiler here is telling it that the number of elements in this channels array is small (<32) -- but I don't think the loop body in its current form can be optimized anyway.
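For readers without the history, a hand-written sketch of the shape such a generated routine had (this is not the actual generated code; the name, signature, and per-sample ramping are simplifications):

```cpp
// Ramp each channel's gain from its old to its new value across the buffer
// and sum the channels into the output in the same pass. Fusing the two
// steps into one tight loop is what made auto-vectorization effective.
void copy2WithRampingGain(float* pOutput,
        const float* pSrc0, float gain0Old, float gain0New,
        const float* pSrc1, float gain1Old, float gain1New,
        int numSamples) {
    const float delta0 = (gain0New - gain0Old) / numSamples;
    const float delta1 = (gain1New - gain1Old) / numSamples;
    float gain0 = gain0Old;
    float gain1 = gain1Old;
    for (int i = 0; i < numSamples; ++i) {
        pOutput[i] = pSrc0[i] * gain0 + pSrc1[i] * gain1;
        gain0 += delta0;
        gain1 += delta1;
    }
}
```

The generator presumably emitted one such routine per supported channel count, which is where the bulk of the autogenerated code came from.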
I'm scratching my head trying to explain the performance differences measured by @uklotzde or @Be-ing. Even if EngineEffectsManager::processPostFaderAndMix can be inlined (I don't think it can), the code in there only deals with a single buffer and then adds it to the output buffer, so I believe it's no different than the serial "process then add to output" flow we had before we added generated code, or than the generated code in its current form. I suspect the timer measurements are noisy, potentially for the reasons @daschuer suggested (e.g. did you use the start/stop timing feature once the engine was already in a steady state, or was the timing from program startup? Was mixxx the only program running? cpufreq "performance" governor enabled, etc.).
I support merging this as is for the reasons in my above two comments. Generally +1 to the request for microbenchmarks of channel mixing for a range of common channel counts, I'm sorry they don't already exist.
I think it's reasonable and not bikeshedding to request benchmarks for a change that is primarily concerning a maintenance vs. performance trade-off.
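A minimal sketch of what such a microbenchmark could look like, assuming Google Benchmark as the framework; the mixing routine, buffer size, and channel setup below are stand-ins rather than Mixxx's actual code:

```cpp
#include <benchmark/benchmark.h>

#include <cstddef>
#include <vector>

// Stand-in for the real channel mixing routine; not Mixxx code.
static void mixChannels(float* pOutput,
        const std::vector<std::vector<float>>& channels,
        const std::vector<float>& gains,
        int numSamples) {
    for (std::size_t ch = 0; ch < channels.size(); ++ch) {
        for (int i = 0; i < numSamples; ++i) {
            pOutput[i] += channels[ch][i] * gains[ch];
        }
    }
}

static void BM_MixChannels(benchmark::State& state) {
    const int numSamples = 2048; // assumed engine buffer size for the sketch
    const int numChannels = static_cast<int>(state.range(0));
    std::vector<std::vector<float>> channels(
            numChannels, std::vector<float>(numSamples, 0.5f));
    std::vector<float> gains(numChannels, 0.8f);
    std::vector<float> output(numSamples, 0.0f);
    for (auto _ : state) {
        mixChannels(output.data(), channels, gains, numSamples);
        benchmark::DoNotOptimize(output.data());
    }
}
// Cover a range of common channel counts (decks, samplers, other sources).
BENCHMARK(BM_MixChannels)->Arg(2)->Arg(4)->Arg(8)->Arg(16);

BENCHMARK_MAIN();
```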
After 5 days we have 3 👍s and 0 👎s.
The benchmarks should have been provided with the auto-generated code. Requesting them as a prerequisite for removing code that is difficult to maintain and prevents innovation is unfair, especially in conjunction with a veto that puts all the burden on the one who proposes the change. In this respect I support Jan's position as outlined in #1966 (comment)
Merge?
On second thought: even if we had those benchmarks we would need to run them on a variety of platforms, including different CPU architectures, for meaningful results. We don't have the resources to do that.
For such encapsulated code units, you can execute a unit test in callgrind. Callgrind emulates an artificial CPU and returns the needed CPU cycles as a reproducible number. Due to the CPU emulation, it's independent of the real CPU and can be run in CI. Using an optimized release binary, this is a good indicator of whether compiler optimizations like SIMD vectorization work for a code unit.
The vote is already 4 in favor with 0 opposing.
@JoergAtGithub if you want to set up callgrind on CI in another PR, that seems like it could be quite helpful.
Is anyone going to merge this or will I have to merge my own PR?
@Be-ing Please be patient. I will merge it. PS: Being pressured or getting into crossfire is not fun. I don't need to deal with this and could turn away from Mixxx at any time for my own peace of mind.
What is the reason for not merging it already? I don't see that anybody has articulated one.
I have left a comment and opened a PR for this branch, as you noticed. But apparently you prefer to be impatient. I will close my PR and not merge this PR. Done. Someone else may take over.
I don't think it's fair to ask me to be patient on a 3-year-old PR.
It is unfair to pressure me after I started to pick up this PR just a few hours ago. Please rethink your behavior.
I agree! Sorry, again :).
I only sort of agree -- the engine code is complex and difficult to maintain, in part because the problem domain of real-time friendly code is sometimes at odds with maintainability. In the general case, I think some changes that have an obvious performance risk should come with benchmarks to show it's safe (and it's ok to put the burden of proof on the code author who claims the change is safe).
Speaking for myself, I voted yes to "merge as is". Since this PR has generated so many bad feelings all around, I think we should merge and move on. Thanks for the improvements PR, @uklotzde! If you retarget to main, will that work?
Previously, the autogenerated code allowed for a vectorizing
optimization by combining gain application (channel faders +
crossfader) with channel mixing. When postfader effects were
implemented in Mixxx 2.1 (PR #1254), these two operations
had to be decoupled to process effects between gain
application and channel mixing. Therefore, the
autogenerated code lost its advantage and became an
overcomplication.
Unused autogenerated functions were removed from SampleUtil
as well. Some of the autogenerated SampleUtil functions are
still in use, so they were kept.
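To illustrate the decoupling described in the commit message, here is a minimal self-contained sketch of the resulting per-channel flow (the types and function names are made-up stand-ins, not the actual EngineMaster code):

```cpp
#include <vector>

// Stand-in for a mixing channel: its own buffer plus the fader/crossfader
// gain ramp for the current engine callback.
struct Channel {
    std::vector<float> buffer;
    float oldGain;
    float newGain;
};

// Ramp the gain from oldGain to newGain across the buffer.
void applyRampingGain(float* pBuffer, float oldGain, float newGain, int numSamples) {
    const float delta = (newGain - oldGain) / numSamples;
    float gain = oldGain;
    for (int i = 0; i < numSamples; ++i) {
        pBuffer[i] *= gain;
        gain += delta;
    }
}

void mixToOutput(float* pOutput, std::vector<Channel>& channels, int numSamples) {
    for (Channel& channel : channels) {
        // 1. Apply the fader/crossfader gain to the channel's own buffer.
        applyRampingGain(channel.buffer.data(), channel.oldGain, channel.newGain, numSamples);
        // 2. Postfader effects (PR #1254) run here on the gained buffer;
        //    this step is why gain application and mixing can no longer be
        //    fused into one generated routine.
        // 3. Accumulate the result into the output buffer.
        for (int i = 0; i < numSamples; ++i) {
            pOutput[i] += channel.buffer[i];
        }
    }
}
```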