feat: read_csv #1112

Closed
MarcoGorelli opened this issue Oct 1, 2024 · 4 comments · Fixed by #1551
Labels
enhancement (New feature or request), needs discussion

Comments

@MarcoGorelli
Member

I was initially hesitant about adding IO methods, the idea being "users provide their own dataframe, we just deal with how to process it". But we already have from_dict, and ImperialCollegeLondon/pycsvy#83 and Temporian look like good use cases for read_csv.

pandas and Polars each have dozens of read_csv methods...so we may need to be careful here about which ones we add, and perhaps only start with the most common ones.

The api would be something like

import narwhals as nw
import pandas as pd
import polars as pl

nw.read_csv(file, native_namespace=pd)
nw.read_csv(file, native_namespace=pl)

We could do:

  • nw.read_csv: this is eager-only and always returns nw.DataFrame
  • nw.scan_csv: this is the most generic one, and returns nw.LazyFrame if possible (e.g. Polars), else nw.DataFrame
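
A minimal sketch of how the two entry points could differ, assuming each backend module exposes `read_csv` and, where lazy reading exists, `scan_csv` (which holds for pandas and Polars). The function bodies are illustrative and not Narwhals' implementation; a real version would also wrap the native result in `nw.DataFrame` or `nw.LazyFrame`.

```python
from types import ModuleType
from typing import Any


def read_csv(source: str, *, native_namespace: ModuleType) -> Any:
    """Eager read: always delegate to the backend's own read_csv."""
    return native_namespace.read_csv(source)


def scan_csv(source: str, *, native_namespace: ModuleType) -> Any:
    """Lazy read when the backend offers scan_csv, eager read otherwise."""
    scan = getattr(native_namespace, "scan_csv", None)
    if scan is not None:
        return scan(source)
    # Assumption: backends without a lazy reader fall back to an eager read.
    return native_namespace.read_csv(source)
```

With this shape, `scan_csv(path, native_namespace=pl)` would hand back a lazy object, while the same call with `pd` would return an eager one.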

Alternatives

Keep the status quo: users are responsible for doing their own IO.

@lucianosrp
Member

I would generally prefer to keep narwhals's "just-pass-me-the-df" philosophy.


We could infer which namespace to use based on which module is already imported?

import pandas as pd
import narwhals as nw

df = nw.read_csv("data.csv") # < uses pandas

But then, what do we do with the already-imported pandas? If you are importing it anyway, you might as well use it for I/O.
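
For illustration, inferring the namespace from already-imported modules could look roughly like the sketch below. The name `infer_native_namespace` and the candidate list are hypothetical and not part of Narwhals; the only real mechanism used is `sys.modules`.

```python
import sys
from types import ModuleType
from typing import Mapping, Optional, Sequence

# Hypothetical list of backends to look for.
CANDIDATES: Sequence[str] = ("pandas", "polars")


def infer_native_namespace(
    candidates: Sequence[str] = CANDIDATES,
    modules: Optional[Mapping[str, ModuleType]] = None,
) -> ModuleType:
    """Return the single candidate backend that has already been imported."""
    loaded = sys.modules if modules is None else modules
    found = [name for name in candidates if name in loaded]
    if not found:
        raise RuntimeError("no supported dataframe library has been imported")
    if len(found) > 1:
        raise RuntimeError(f"ambiguous backend, several are imported: {found}")
    return loaded[found[0]]
```

The ambiguous branch is the weak spot of this approach: a backend may be imported indirectly by some other dependency, so the lookup can fail or surprise even when the user only imported one library themselves.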

The only major reason to have I/O support (that I can think of) would be if you wanted to switch the backend of an entire "narwhals workflow/script" with one setting.

Another way I can think of:

nw.set_io_backend("pandas")
df = nw.read_csv("data.csv")
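
A rough sketch of what such a global setting might involve, assuming the backend is stored by module name and imported on first use. `set_io_backend` is only the name proposed above and is not an existing Narwhals function; the module-level variable is likewise an assumption.

```python
import importlib
from typing import Any, Optional

# Assumed module-level setting; unset until the user picks a backend.
_io_backend: Optional[str] = None


def set_io_backend(name: str) -> None:
    """Record which library later read calls should delegate to."""
    global _io_backend
    _io_backend = name


def read_csv(source: str) -> Any:
    """Read with the configured backend, importing it on first use."""
    if _io_backend is None:
        raise RuntimeError("call set_io_backend(...) before read_csv(...)")
    backend = importlib.import_module(_io_backend)
    return backend.read_csv(source)
```

The trade-off is hidden global state: two libraries in the same process that both call `set_io_backend` would overwrite each other's choice.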

@benrutter
Contributor

benrutter commented Nov 14, 2024

This sounds interesting. As a library user, how would somebody use it? At the moment, "give me some-kinda-df, get back some-kinda-df" gives a neat boundary for figuring out what the end user is expecting. If I were writing a library with Narwhals' IO, would I do something like this:

import narwhals as nw

def get_a_csv_and_do_some_stuff(namespace: str) -> nw.DataFrame:
    library = get_library_from_namespace_name(namespace)
    df = nw.read_csv("data.csv", native_namespace=library)
    return df.with_columns(z=nw.col("x") * nw.col("y"))

I'm thinking that, for this to be useful, a library needs a way of figuring out which namespace an end user wants. Would Narwhals do this, or would that be the library maintainer's responsibility?

@raisadz
Contributor

raisadz commented Dec 9, 2024

I would like to work on this issue

@MarcoGorelli
Member Author

Thanks!

or would that be the library maintainer's responsibility?

Yup that's right
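
Since resolving the namespace is left to the library, a helper along these lines is one way a library could do it. This is a sketch under assumptions: the function name is borrowed from the snippet in the question above, and the allow-list and its `supported` parameter are hypothetical.

```python
import importlib
from types import ModuleType
from typing import Collection

# Hypothetical allow-list; a real library would list the backends it tests against.
SUPPORTED: Collection[str] = frozenset({"pandas", "polars"})


def get_library_from_namespace_name(
    namespace: str, supported: Collection[str] = SUPPORTED
) -> ModuleType:
    """Map a backend name chosen by the end user to the imported module."""
    if namespace not in supported:
        raise ValueError(
            f"unsupported namespace {namespace!r}; expected one of {sorted(supported)}"
        )
    # import_module raises ImportError if the backend is not installed.
    return importlib.import_module(namespace)
```

The returned module could then be passed as `native_namespace` to `nw.read_csv`, as in the question's example.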

This was referenced Dec 10, 2024