Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

refactor: port DaftContext to rust side #3767

Merged
merged 25 commits into from
Feb 6, 2025

Conversation

universalmind303
Copy link
Contributor

@universalmind303 universalmind303 commented Feb 4, 2025

ports the DaftContext to the rust side so we can more easily reuse it for connect & catalog work.

notes for reviewers:

  • after removing daft-local-execution from daft-connect, it made cargo check fail, so some of the changes to unrelated Cargo.toml's was to properly feature flag them. We missed a couple that were causing things to break.
  • @rchowell I requested review from you specifically to take a look at context.rs. This contains the "context"/"session", and the global state that was previously only available in python. I added the catalog to this state as well.

Copy link

codspeed-hq bot commented Feb 4, 2025

CodSpeed Performance Report

Merging #3767 will improve performances by 54.97%

Comparing universalmind303:rs-context (144ad60) with main (86adc44)

Summary

⚡ 1 improvements
✅ 26 untouched benchmarks

Benchmarks breakdown

Benchmark BASE HEAD Change
test_show[100 Small Files] 24.2 ms 15.6 ms +54.97%

static DAFT_CONTEXT: OnceCell<DaftContext> = OnceCell::new();

#[cfg(feature = "python")]
pub fn get_context() -> DaftContext {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Currently we only have a singleton context as this is how it was done in py, but we may want to think about of ways that a user could initialize multiple contexts/sessions.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The global singleton context is pretty useful, I wonder if there are other ways we can produce the required functionality? In my current mental model, all user interactions with the system go through a session/connection whether that is implicit or explicit. Whereas the Context I think of a global program state for the process itself. I'm curious what functionality you had in mind where multiple contexts would be necessary which cannot be achieved by pulling that necessary state into the session?

I know that you have said that currently "context" and "session" are synonymous, but perhaps this is the reason to no longer make them so?

  • context – process state
  • session – connection state

We are pushing up against the same concepts and patterns described in SQL standard Part 1 where our context represents state for the SQL-environment (singleton), whereas our session should represent state for the SQL-connection. This is not meant to be definitive, but hopefully this helps us tease out the role of context vs the role of session.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm yeah I was actually thinking a bit about this yesterday as well. I think that seems like a reasonable distinction to make.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since we're going to make a hard distinction between the context and the session, I'll update the PR remove the catalog from the context.

Copy link

codecov bot commented Feb 6, 2025

Codecov Report

Attention: Patch coverage is 84.18972% with 80 lines in your changes missing coverage. Please review.

Project coverage is 77.82%. Comparing base (62d6b97) to head (144ad60).
Report is 7 commits behind head on main.

Files with missing lines Patch % Lines
src/daft-context/src/lib.rs 80.00% 40 Missing ⚠️
src/daft-context/src/python.rs 76.36% 26 Missing ⚠️
src/daft-py-runners/src/lib.rs 90.98% 11 Missing ⚠️
daft/context.py 91.66% 3 Missing ⚠️
Additional details and impacted files

Impacted file tree graph

@@            Coverage Diff             @@
##             main    #3767      +/-   ##
==========================================
+ Coverage   75.88%   77.82%   +1.93%     
==========================================
  Files         733      737       +4     
  Lines       96091    93361    -2730     
==========================================
- Hits        72919    72656     -263     
+ Misses      23172    20705    -2467     
Files with missing lines Coverage Δ
src/daft-connect/src/execute.rs 89.33% <100.00%> (+3.42%) ⬆️
src/daft-connect/src/lib.rs 94.44% <ø> (ø)
src/daft-connect/src/session.rs 100.00% <100.00%> (ø)
src/daft-connect/src/spark_analyzer.rs 73.97% <100.00%> (+1.33%) ⬆️
src/lib.rs 96.15% <100.00%> (+0.04%) ⬆️
daft/context.py 83.33% <91.66%> (-4.33%) ⬇️
src/daft-py-runners/src/lib.rs 90.98% <90.98%> (ø)
src/daft-context/src/python.rs 76.36% <76.36%> (ø)
src/daft-context/src/lib.rs 80.00% <80.00%> (ø)

... and 47 files with indirect coverage changes

Comment on lines +42 to +46
def __init__(self, ctx: PyDaftContext | None = None):
if ctx is not None:
self._ctx = ctx
else:
self._ctx = PyDaftContext()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a reason to stop using __new__ with the lock? I see how the other methods are protected by the mutex on the Rust side, but I'm not sure I follow how the singleton is thread safe.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we don't need the __new__ because having multiple python instances doesn't matter anymore. they are all backed by the single rust based context now.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Understood, my question is about how the Rust singleton's construction is protected from multiple instantiations.

Copy link
Contributor Author

@universalmind303 universalmind303 Feb 6, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh got it. It's protected by making the constructor and the fields private so there is no way to create a DaftContext outside of using get_context.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, and get_context is protected by the OnceCell. So now it's possible to have multiple python context objects, but they are always backed by the same Rust context. I think that's fine? The original issue was in regards to having multiple runners, so this is probably ok.

static DAFT_CONTEXT: OnceCell<DaftContext> = OnceCell::new();

#[cfg(feature = "python")]
pub fn get_context() -> DaftContext {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The global singleton context is pretty useful, I wonder if there are other ways we can produce the required functionality? In my current mental model, all user interactions with the system go through a session/connection whether that is implicit or explicit. Whereas the Context I think of a global program state for the process itself. I'm curious what functionality you had in mind where multiple contexts would be necessary which cannot be achieved by pulling that necessary state into the session?

I know that you have said that currently "context" and "session" are synonymous, but perhaps this is the reason to no longer make them so?

  • context – process state
  • session – connection state

We are pushing up against the same concepts and patterns described in SQL standard Part 1 where our context represents state for the SQL-environment (singleton), whereas our session should represent state for the SQL-connection. This is not meant to be definitive, but hopefully this helps us tease out the role of context vs the role of session.

@universalmind303
Copy link
Contributor Author

universalmind303 commented Feb 6, 2025

I'm curious what functionality you had in mind where multiple contexts would be necessary which cannot be achieved by pulling that necessary state into the session?

Ideally i think the config objects should be pulled into the session though I'm not sure yet what makes the most sense for the `runner'.

I'm totally open to adapting this to our needs for session as things evolve. This PR is solely to get a 1-1 mapping with our existing python DaftContext that can be used within connect and potentially elsewhere.

Copy link
Contributor

@rchowell rchowell left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This PR is solely to get a 1-1 mapping with our existing python DaftContext

Understood, and this is a huge help for when I lower catalog things into Rust. For this switchover, I believe the proof is in the tests, no concerns beyond the thread safe singleton which you've addressed.

@universalmind303
Copy link
Contributor Author

cc @raunakab, just a heads up, this PR will cause conflicts with your PR #3756

@raunakab
Copy link
Contributor

raunakab commented Feb 6, 2025

@universalmind303 Looks like we're moving a lot of global state over to the Rust side of things? If so, I don't mind having this merge into main and then merging main into my changes!

I believe this should be a simple update. I will just need to see how to you've proposed global state in Rust and follow that pattern for daft broadcasting.

@universalmind303 universalmind303 merged commit d40aef8 into Eventual-Inc:main Feb 6, 2025
41 checks passed
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@universalmind303 What's the purpose of decorating functions with #[cfg(not(feature = "python"))] and having their bodies remain as unimplemented!()? I see this pattern has been used for all methods / functions defined in this file.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pretty much all of the context depends on python as the runners are implemented in python, so this is just needed to not break things when running with cargo test|check|clippy --no-default-features.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

honestly, i'd like to just remove the python feature flag all together, but it makes rust testing less portable as it needs to link to python.

jessie-young pushed a commit that referenced this pull request Feb 14, 2025
ports the `DaftContext` to the rust side so we can more easily reuse it
for connect & catalog work.


notes for reviewers: 
- after removing `daft-local-execution` from daft-connect, it made
`cargo check` fail, so some of the changes to unrelated `Cargo.toml`'s
was to properly feature flag them. We missed a couple that were causing
things to break.
- @rchowell I requested review from you specifically to take a look at
[context.rs](https://github.com/Eventual-Inc/Daft/pull/3767/files#diff-5b536482a8303505c0c91a561a632218591bf8f02fff21f9f8a3535a3f0ff8a5).
This contains the "context"/"session", and the global state that was
previously only available in python. I added the `catalog` to this state
as well.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants