Change plot resolution to per release in plot over time scripts #254

Open · wants to merge 3 commits into main
Conversation

Misty-W (Collaborator) commented Feb 24, 2025

Fixes #200.

Change plot resolution to per compiler release in plot over time scripts, so the plots will look like the following:
[Two example plots pasted as images, showing per-release resolution.]

Note: I pasted (instead of pushed) these plots because they'd have to be removed before merge anyway.

Change plot res to per release in plot over time scripts
Misty-W linked an issue Feb 24, 2025 that may be closed by this pull request
bachase (Collaborator) commented Feb 25, 2025

For the absolute error plot, does each dot still represent a new version of that software package? If so, would we want it to have the label as well?

bachase (Collaborator) left a comment


My main open questions are in the inline comments on the logic that selects and filters which benchmark results end up included in the average for each earliest version release.

But stepping back, I think this may still be fragile or at least challenging given how the benchmark results are stored by the date of the run currently. What if some days have more benchmark runs than others? If so, will that be a fair comparison if some days average over many runs and others only a single run?

As I (poorly) started to sketch out in #251, I think we should consider switching to associate benchmark results with the git hash (and corresponding git commit time) to ensure a singular mapping from set of compiler versions (as reflected in the lock file in the commit) to benchmark results. Then we can more cleanly pick out results when versions change and plot accordingly.

That is clearly a larger scope than this change, but I suspect getting the plot we want will be hard using the current way we run and store benchmarks. Working around it might be possible, but the code will get more complicated and arguably worth testing to be sure all the cases are handled.
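
(For illustration only, not part of the PR: a minimal sketch of how results could be keyed by commit instead of run date, per the suggestion above. The helper names are hypothetical, and it assumes benchmarks run from a git checkout.)

    import subprocess

    def current_commit_hash() -> str:
        # Hash of the checked-out commit (hypothetical helper; assumes git)
        return subprocess.check_output(
            ["git", "rev-parse", "HEAD"], text=True
        ).strip()

    def current_commit_time() -> str:
        # Committer timestamp of HEAD in strict ISO 8601 (--format=%cI)
        return subprocess.check_output(
            ["git", "show", "-s", "--format=%cI", "HEAD"], text=True
        ).strip()

Each benchmark row would then carry the commit hash and commit time rather than the wall-clock run date, giving a single mapping from the lock file's compiler versions at that commit to the results.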


fig, ax = plt.subplots(2, 1, figsize=(8, 8), sharex=False, dpi=150)
# Rotate x labels on axes 0
plt.setp(ax[0].xaxis.get_majorticklabels(), rotation=45)


filtered_avg_compiled_ratio = avg_compiled_ratio[
bachase (Collaborator):

How are we thinking about which dates get included in the average computed earlier and in this filter?

avg_compiled_ratio is calculated earlier as:

avg_compiled_ratio = (
    df_dates.groupby(["compiler", "date", "compiler_version"])[
        "compiled_ratio"
    ]
    .mean()
    .reset_index()
    .sort_values("date")
)

If we happened to have run multiple benchmarks for that compiler on that date, where say some were for version 1.2.0 and then some for version 1.3.0, would this filter be showing results that average over both versions?

Would we also need to filter to only include/average results from that date where the latest compiler version was used?
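
(Editorial sketch, not from the PR: one way such a filter could look, keeping only rows from the latest version seen per compiler per date. It assumes the version strings parse cleanly with packaging's Version, which sorts numerically, e.g. 1.40.0 after 1.5.0.)

    from packaging.version import Version

    # Parse so that versions sort numerically rather than lexically
    df = df_dates.assign(parsed=df_dates["compiler_version"].map(Version))
    # Per (compiler, date) group, find the maximum (latest) version
    latest = df.groupby(["compiler", "date"])["parsed"].transform("max")
    # Keep only rows whose version is the latest for that compiler and date
    df_latest = df[df["parsed"] == latest].drop(columns="parsed")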

jordandsullivan (Collaborator) replied:

Because this groups by the date, compiler name, and compiler version, it would not average two different versions together. On the plot I think it would just show them as different datapoints in the same color on the same date and label them correspondingly.

Misty-W (Author) replied:

That's correct, @jordandsullivan.

bachase (Collaborator) replied:

Got it. Thank you both.

If we had multiple datapoints, would we want to only plot the "latest version" dot for that date? Just wondering if it might get cluttered otherwise.

unique_versions = sorted(df_dates["compiler_version"].unique())
dates = []
for version in unique_versions:
    df_versions = df_dates[df_dates["compiler_version"] == version]
bachase (Collaborator):

Is this compiler_version only the semver string? If so, is it possible that this might mix the same version of different packages? For example cirq 1.2.0 and qiskit 1.2.0? If so, I think that might end up taking the earliest date across the union of those two.
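
(Editorial sketch, not from the PR: keying first-appearance dates on the (compiler, compiler_version) pair so that cirq 1.2.0 and qiskit 1.2.0 are tracked independently.)

    # Earliest date each (compiler, version) pair appears in the data
    first_seen = (
        df_dates.groupby(["compiler", "compiler_version"])["date"]
        .min()
        .reset_index(name="first_date")
    )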

@@ -33,9 +39,18 @@
# Convert the 'date' column to datetime
df_all["date"] = pd.to_datetime(df_all["date"])

unique_versions = sorted(df_all["compiler_version"].unique())
bachase (Collaborator):

Just to note similar questions as above on whether this is selecting the intended dates and versions.

jordandsullivan (Collaborator) commented Feb 26, 2025

Good catch. I added a print statement to check, and these are just the version numbers, not including the compiler names:

    Here are unique_versions ['0.1.0', '0.1.1', '0.2.0', '0.4.0', '0.4.2', '1.2.0', '1.3.0', '1.3.1', '1.3.2', '1.3.3', '1.35.0', '1.36.0', '1.37.0', '1.39.0', '1.4.0', '1.4.1', '1.40.0', '1.5.0']

Misty-W (Author) commented Feb 26, 2025

I'm aware these are only the numbers, but the actual filtering is done on the date each new version appears in the dataset:

dates = []
for version in unique_versions:
    df_versions = df_dates[df_dates["compiler_version"] == version]
    earliest_date = sorted(df_versions["date"].unique())[0]
    dates.append(earliest_date)

filtered_avg_compiled_ratio = avg_compiled_ratio[avg_compiled_ratio["date"].isin(dates)]

bachase (Collaborator) commented Feb 26, 2025

Here's a toy example to consider (ignoring the actual data columns):

    date, compiler, compiler_version
    2025-01-01,cirq,1.2.0
    2025-01-02,qiskit,1.2.0
    2025-01-10,cirq,1.3.0

I think this might select only 2025-01-01 and 2025-01-10 as earliest dates, as the cirq version 1.2.0 would be earlier than the qiskit 1.2.0. Then the qiskit results for 1.2.0 wouldn't show up until much later? (In this parallel universe where their versions are close and overlapping.)
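
(Editorial sketch, not the PR's code: one possible fix for this case, filtering each compiler's rows against its own first-appearance dates so qiskit's 1.2.0 results are kept on 2025-01-02 even though cirq's 1.2.0 appeared a day earlier.)

    # Earliest date per (compiler, version) pair, not per bare version string
    first_dates = (
        df_dates.groupby(["compiler", "compiler_version"])["date"]
        .min()
        .reset_index()
    )
    # Keep a row only if its own compiler had a release on that date
    release_keys = set(zip(first_dates["compiler"], first_dates["date"]))
    mask = [
        (compiler, date) in release_keys
        for compiler, date in zip(
            avg_compiled_ratio["compiler"], avg_compiled_ratio["date"]
        )
    ]
    filtered_avg_compiled_ratio = avg_compiled_ratio[mask]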

Misty-W and others added 2 commits February 26, 2025 11:13
Successfully merging this pull request may close these issues: Update benchmark plot resolution to monthly.