We automatically track changes in mypy performance over time (#14187). Currently we can detect changes of at least 1.5% pretty reliably, but smaller changes are hard to detect. #14187 has some relevant discussion, such as this comment: #14187 (comment)
I'd estimate that a cumulative performance regression of around 15% in 2022 was due to changes that were below the 1.5% noise floor. Getting the detection threshold down to 0.5% or below could be quite helpful in finding and fixing regressions.
I looked at individual measurements, and it seems possible that measurements slowly fluctuate over time. I'm not entirely sure what might be causing this. Just increasing the number of iterations we measure probably won't help much, since different batches of runs will cluster around different averages.
Here are some things that could help:
1. Interleave executions of current/previous builds and measure the delta. Instead of only collecting absolute performance values, interleave runs of the previous commit and the target commit and compute the average delta. If performance gradually fluctuates over time, this should cancel out most of the drift.
2. Collect samples over a long period of time (say, one sample every hour over 12 hours).
3. Collect detailed profiling data for each commit and highlight differences in the time spent in different parts of the mypy implementation. If a single function gets 2x slower, that could be easy to detect this way, even if the change in overall performance is well below the noise floor. This could be quite noisy due to functions being renamed or split, etc. (see the sketch after this list).
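As a very rough sketch of what the profile comparison in idea 3 could look like, the snippet below diffs per-function cumulative times from two cProfile dumps and prints the largest regressions. The .prof filenames and the profiled workload are made-up placeholders (not part of the existing benchmarking setup), and it reads pstats' internal stats table:

```python
# Compare per-function cumulative times from two profiling runs and report the
# functions whose time grew the most. The .prof files are assumed to have been
# produced separately, e.g. with something like
#   python -m cProfile -o old.prof -m mypy --strict some_project/
# (paths and workload are placeholders).
import pstats

def load_times(path: str) -> dict[str, float]:
    """Map 'file:line(function)' -> cumulative seconds from a cProfile dump."""
    stats = pstats.Stats(path)
    times = {}
    # stats.stats is pstats' internal table: key is (filename, lineno, funcname),
    # value is (call count, primitive calls, total time, cumulative time, callers).
    for (filename, lineno, func), (cc, nc, tt, ct, callers) in stats.stats.items():
        times[f"{filename}:{lineno}({func})"] = ct
    return times

old = load_times("old.prof")
new = load_times("new.prof")

# Functions that were renamed or split between the two commits will show up as
# spurious additions/removals -- this is the noise mentioned above.
regressions = sorted(
    ((new[k] - old.get(k, 0.0), k) for k in new),
    reverse=True,
)
for growth, name in regressions[:20]:
    print(f"{growth:+8.3f}s  {name}")
```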
I'm going to start by investigating whether idea 1 is feasible.
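For reference, here is a minimal sketch of how the interleaved delta measurement of idea 1 might work. The venv paths, the checked target, and the number of pairs are hypothetical placeholders rather than the actual benchmark configuration:

```python
# Alternate runs of the previous-commit build and the target-commit build and
# measure the relative delta per pair, so slow drift in machine performance
# affects both builds roughly equally.
import statistics
import subprocess
import time

OLD_MYPY = "venv-old/bin/mypy"            # build of the previous commit (placeholder path)
NEW_MYPY = "venv-new/bin/mypy"            # build of the target commit (placeholder path)
TARGET = ["--strict", "some_project/"]    # fixed workload to type check (placeholder)
PAIRS = 20

def run_once(mypy_bin: str) -> float:
    """Run one mypy invocation and return its wall-clock time in seconds."""
    start = time.perf_counter()
    subprocess.run([mypy_bin, *TARGET], capture_output=True, check=False)
    return time.perf_counter() - start

deltas = []
for _ in range(PAIRS):
    old = run_once(OLD_MYPY)
    new = run_once(NEW_MYPY)
    deltas.append((new - old) / old)

mean = statistics.mean(deltas)
stdev = statistics.stdev(deltas)
print(f"relative delta: {mean:+.3%} (stdev {stdev:.3%}, n={PAIRS})")
```

If the drift hypothesis is right, the per-pair deltas should cluster much more tightly than the absolute timings do.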