
feat: add support for explain analyze #3484

Merged · 6 commits merged from wkalt/explain-analyze into main on Feb 28, 2025
Conversation

@wkalt (Contributor) commented Feb 27, 2025

This adds runtime execution metrics to all of our exec nodes. These metrics can be accessed by calling plan.analyze_plan().
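For context, a minimal usage sketch from the Rust side, assuming the method lands on the scanner as the description suggests; the dataset path, filter expression, and exact signature are illustrative:

use lance::dataset::Dataset;

// Hypothetical usage: open a dataset, build a scan with a filter, then call
// analyze_plan(), which executes the query and reports per-node runtime
// metrics (unlike explain_plan, which only renders the plan).
async fn print_analyze(uri: &str) -> Result<(), Box<dyn std::error::Error>> {
    let dataset = Dataset::open(uri).await?;
    let mut scan = dataset.scan();
    scan.filter("contains(body, 'foobar')")?;
    println!("{}", scan.analyze_plan().await?);
    Ok(())
}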


ACTION NEEDED
Lance follows the Conventional Commits specification for release automation.

The PR title and description are used as the merge commit message. Please update your PR title and description to match the specification.

For details on the error please inspect the "PR Title Check" action.

@wkalt force-pushed the wkalt/explain-analyze branch from fde124f to d50c378 on February 27, 2025 04:16
@wkalt changed the title from "Add support for explain analyze" to "feat: Add support for explain analyze" on Feb 27, 2025
@github-actions bot added the "enhancement" (New feature or request) label on Feb 27, 2025
@wkalt (Contributor, Author) commented Feb 27, 2025

Here is an example of current output:

AnalyzeExec verbose=true, metrics=[]
  ProjectionExec: expr=[author@7 as author, author_flair_css_class@8 as author_flair_css_class, author_flair_text@9 as author_flair_text, body@0 as body, can_gild@10 as can_gild, controversiality@1 as controversiality, created_utc@2 as created_utc, distinguished@11 as distinguished, gilded@3 as gilded, id@12 as id, is_submitter@13 as is_submitter, link_id@14 as link_id, parent_id@15 as parent_id, permalink@16 as permalink, retrieved_on@4 as retrieved_on, score@5 as score, stickied@17 as stickied, subreddit@18 as subreddit, subreddit_id@19 as subreddit_id, subreddit_type@20 as subreddit_type], metrics=[output_rows=48, elapsed_compute=24.256µs]
    Take: columns="body, controversiality, created_utc, gilded, retrieved_on, score, _rowid, (author), (author_flair_css_class), (author_flair_text), (can_gild), (distinguished), (id), (is_submitter), (link_id), (parent_id), (permalink), (stickied), (subreddit), (subreddit_id), (subreddit_type)", metrics=[output_rows=48, elapsed_compute=26.6µs]
      CoalesceBatchesExec: target_batch_size=8192, metrics=[output_rows=48, elapsed_compute=350.816µs]
        FilterExec: contains(body@0, foobar), metrics=[output_rows=48, elapsed_compute=11.180169927s]
          LanceScan: uri=home/wyatt/work/lance/python/reddit.lance/data, projection=[body, controversiality, created_utc, gilded, retrieved_on, score], row_id=true, row_addr=false, ordered=true, metrics=[output_rows=10000000, elapsed_compute=316.0889ms]

I think this gets us close but not quite there. I'm going to look at replacing the built-in "BaselineMetrics" with our own thing that captures some additional detail. My goal is to replicate what's available in Postgres, if possible:

  • time to first record (in milliseconds, not ISO date)
  • time to last record (i.e. elapsed_compute)
  • avg row byte width
  • number of rows emitted

I'll keep plugging on this, but I think the current state is ready for review. I need to add a test as well.
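For illustration, a rough sketch of what a custom metrics set along those lines could look like on top of DataFusion's metrics API; the struct and the extra metric names (time_to_first_record, output_bytes) are hypothetical, not necessarily what the PR ends up with:

use datafusion::physical_plan::metrics::{Count, ExecutionPlanMetricsSet, MetricBuilder, Time};

// Hypothetical replacement for BaselineMetrics that also tracks the
// Postgres-style numbers listed above. All names are illustrative.
struct AnalyzeMetrics {
    output_rows: Count,          // number of rows emitted
    elapsed_compute: Time,       // time to last record
    time_to_first_record: Time,  // latency until the first batch is produced
    output_bytes: Count,         // lets us derive average row byte width
}

impl AnalyzeMetrics {
    fn new(metrics: &ExecutionPlanMetricsSet, partition: usize) -> Self {
        Self {
            output_rows: MetricBuilder::new(metrics).output_rows(partition),
            elapsed_compute: MetricBuilder::new(metrics).elapsed_compute(partition),
            time_to_first_record: MetricBuilder::new(metrics)
                .subset_time("time_to_first_record", partition),
            output_bytes: MetricBuilder::new(metrics).counter("output_bytes", partition),
        }
    }
}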

One high-level question:
Does the API seem right? Should analyze_plan be a separate method from explain_plan? In Rust, it seems we have no option to extend explain_plan with an optional "analyze" param, but in Python we can do this. Would that be better? IMO, if we were considering Python in isolation the answer would be yes, but since we care about both Python and Rust it may be better to keep the APIs symmetric.

@wkalt changed the title from "feat: Add support for explain analyze" to "feat: add support for explain analyze" on Feb 27, 2025
@wkalt (Contributor, Author) commented Feb 27, 2025

Cache hit/miss rates would be awesome to incorporate here too. I don't know that I will get to that.

@codecov-commenter commented Feb 27, 2025

Codecov Report

Attention: Patch coverage is 71.63121% with 40 lines in your changes missing coverage. Please review.

Project coverage is 78.45%. Comparing base (9e614b1) to head (484df18).

Files with missing lines Patch % Lines
rust/lance/src/io/exec/fts.rs 18.18% 9 Missing ⚠️
rust/lance/src/io/exec/knn.rs 73.07% 7 Missing ⚠️
rust/lance/src/dataset/scanner.rs 0.00% 6 Missing ⚠️
rust/lance/src/io/exec/scalar_index.rs 70.00% 6 Missing ⚠️
rust/lance/src/io/exec/pushdown_scan.rs 62.50% 3 Missing ⚠️
rust/lance/src/io/exec/rowids.rs 70.00% 3 Missing ⚠️
rust/lance/src/io/exec/scan.rs 90.00% 3 Missing ⚠️
rust/lance/src/io/exec/take.rs 66.66% 3 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #3484      +/-   ##
==========================================
- Coverage   78.46%   78.45%   -0.02%     
==========================================
  Files         252      252              
  Lines       93800    93917     +117     
  Branches    93800    93917     +117     
==========================================
+ Hits        73604    73679      +75     
- Misses      17201    17245      +44     
+ Partials     2995     2993       -2     
Flag Coverage Δ
unittests 78.45% <71.63%> (-0.02%) ⬇️

Flags with carried forward coverage won't be shown.


@westonpace (Contributor) left a comment:

I like the framework and the start.

I like the API, and I actually kind of prefer two separate methods. In my mind they are quite different, because analyze is going to actually execute the query whereas explain does not, so I'm not sure how I feel about distinguishing them with a flag.

Glad to see num_rows. 11s for filter compute is shocking but that does kind of match my gut of what I've been seeing in some contains queries.

Just a few nits about error handling. I'd prefer to make sure we are passing through child errors into parent error messages anytime we don't know exactly what the inner error is.

E.g. if the inner error is "table x did not exist" then it's fine to remap to "grabbing table x from database y did not exist" and drop the inner error. However, when doing a generic mapping to translate from one error type to another we generally have to assume the inner error is the "root cause" and that needs to be included somehow.

These are minor things though so marking approve and I trust your judgement with what you want to do here.

Parameters
----------
verbose : bool, default False
    Use a verbose output format.
Contributor:

I wonder what the difference even is between non-verbose and verbose explain analyze and if it's worth giving users the choice? I know for explain_plan I have always passed verbose=True and have never felt it is too verbose.

Contributor Author:

The outputs look identical to me. Maybe it's some feature we're not using. I'll remove the option.

Comment on lines 2368 to 2375
if let Ok(mut stream) = analyze.execute(0, ctx) {
    while (stream.next().await).is_some() {}
} else {
    return Err(Error::Execution {
        message: "Failed to execute analyze plan".to_string(),
        location: location!(),
    });
}
Contributor:

Why not use map_err and include the error message from analyze.execute?

E.g. `format!("Failed to execute analyze plan: {}", err)`
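Applied to the snippet above, the suggested change might look roughly like this (same identifiers as the quoted code; a sketch only, not the final implementation):

let mut stream = analyze.execute(0, ctx).map_err(|e| Error::Execution {
    message: format!("Failed to execute analyze plan: {}", e),
    location: location!(),
})?;
// Drain the stream so every node's metrics get populated.
while stream.next().await.is_some() {}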

Comment on lines 237 to 239
std::task::Poll::Ready(Some(res)) => std::task::Poll::Ready(Some(
    res.map_err(|e| DataFusionError::External(e.to_string().into())),
)),
Contributor:

Why do we need map_err here? Isn't the error already a DataFusionError?

In fact, do we even need this match statement at all? Can we just do `let poll = this.stream.poll_next(cx);`?
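A sketch of the simplification being suggested, assuming the wrapped stream is a SendableRecordBatchStream (which is Unpin and already yields datafusion::error::Result items), so nothing needs remapping; WrapperStream is a made-up name standing in for the node's stream type:

use std::pin::Pin;
use std::task::{Context, Poll};

use datafusion::arrow::record_batch::RecordBatch;
use datafusion::error::Result as DFResult;
use datafusion::physical_plan::SendableRecordBatchStream;
use futures::{Stream, StreamExt};

struct WrapperStream {
    inner: SendableRecordBatchStream,
}

impl Stream for WrapperStream {
    type Item = DFResult<RecordBatch>;

    fn poll_next(mut self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<Option<Self::Item>> {
        // The inner stream already produces DataFusionError, so its items are
        // forwarded as-is with no extra match or map_err.
        self.inner.poll_next_unpin(cx)
    }
}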

@wkalt force-pushed the wkalt/explain-analyze branch from c488bd6 to a1d1abf on February 28, 2025 22:09
@wkalt merged commit 949c6e7 into main on Feb 28, 2025
26 of 30 checks passed
@wkalt deleted the wkalt/explain-analyze branch on February 28, 2025 23:22