Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

refactor!: split column expression into unresolved and resolved types #3804

Open
wants to merge 9 commits into
base: main
Choose a base branch
from

Conversation

kevinzwang
Copy link
Member

@kevinzwang kevinzwang commented Feb 12, 2025

BREAKING CHANGE: Python struct get syntactic sugar is no longer allowed (ex: col("a.b") -> col("a").struct.get("b").

I did not think it was worth continuing to maintain what is essentially a small compiler for an undocumented and somewhat unclear feature. I would have to make some significant code changes to get it to work with unresolved columns, and I figured I would just rip it out. In the future I'd like to support something like this instead: col("a")["b"].


This PR is the first step to non-equality joins in Daft. We introduce an UnresolvedColumn type so that we can resolve it in the logical plan builder into a ResolvedColumn, JoinSideColumn (for joins), or OuterReferenceColumn (for subqueries). In addition, we introduce an Alias logical op that is used to determine the sources for UnresolvedColumns with plan_id set.

Along with this PR is a refactor of our SQL planner. The Alias op is now used to encode naming and scoping information, so the Relation struct is no longer necessary.

Additionally, I discovered that we can significantly simplify SQLPlanner::plan_non_agg_query and SQLPlanner::plan_aggregate_query. I believe the prior complexity came due to the fact that in many clauses (GROUP BY, ORDER BY, etc), you can reference both columns in the input tables as well as aliases in the select clause. However, the planner's bound_columns already solves this issue.

This may yield slightly less performant plans due to redundant expressions, but that should be handled by the optimizer instead of the SQL planner. Moreover, I was able to remove a sort from plan_aggregate_query, which may yield both performance and correctness improvements, since the prior code assumed you can split a sort up into two, however that may lead to different results due to sort orders (ORDER BY a, b != ORDER BY b, a) as well as whether our sort impl is stable or unstable.

Copy link

codspeed-hq bot commented Feb 13, 2025

CodSpeed Performance Report

Merging #3804 will not alter performance

Comparing kevin/unresolved-col (539f384) with main (f9a4b70)

Summary

✅ 27 untouched benchmarks

Copy link

codecov bot commented Feb 13, 2025

Codecov Report

Attention: Patch coverage is 88.16946% with 148 lines in your changes missing coverage. Please review.

Project coverage is 77.79%. Comparing base (f9a4b70) to head (539f384).

Files with missing lines Patch % Lines
src/daft-logical-plan/src/builder/resolve_expr.rs 77.08% 22 Missing ⚠️
src/daft-dsl/src/expr/mod.rs 74.62% 17 Missing ⚠️
src/daft-sql/src/planner.rs 93.72% 17 Missing ⚠️
src/common/scan-info/src/expr_rewriter.rs 33.33% 12 Missing ⚠️
src/daft-dsl/src/python.rs 33.33% 12 Missing ⚠️
src/daft-logical-plan/src/logical_plan.rs 75.51% 12 Missing ⚠️
src/daft-local-plan/src/translate.rs 10.00% 9 Missing ⚠️
src/daft-sql/src/modules/aggs.rs 30.76% 9 Missing ⚠️
src/daft-catalog/src/lib.rs 0.00% 8 Missing ⚠️
src/daft-core/src/join.rs 0.00% 5 Missing ⚠️
... and 10 more
Additional details and impacted files

Impacted file tree graph

@@            Coverage Diff             @@
##             main    #3804      +/-   ##
==========================================
+ Coverage   75.60%   77.79%   +2.19%     
==========================================
  Files         748      748              
  Lines       99035    94588    -4447     
==========================================
- Hits        74875    73589    -1286     
+ Misses      24160    20999    -3161     
Files with missing lines Coverage Δ
daft/dataframe/dataframe.py 85.26% <100.00%> (+0.01%) ⬆️
src/daft-algebra/src/boolean.rs 100.00% <100.00%> (ø)
src/daft-algebra/src/simplify/mod.rs 100.00% <ø> (ø)
src/daft-connect/src/functions/aggregate.rs 91.66% <100.00%> (ø)
src/daft-dsl/src/arithmetic/tests.rs 70.00% <100.00%> (ø)
src/daft-dsl/src/expr/tests.rs 95.58% <100.00%> (ø)
src/daft-dsl/src/lib.rs 100.00% <100.00%> (ø)
src/daft-dsl/src/optimization.rs 100.00% <100.00%> (ø)
src/daft-functions/src/coalesce.rs 93.27% <100.00%> (ø)
src/daft-local-execution/src/pipeline.rs 89.23% <100.00%> (ø)
... and 45 more

... and 26 files with indirect coverage changes

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant