feat: read_csv #1112

MarcoGorelli · 2024-10-01T10:42:18Z

I was initially hesitant about adding IO methods, the idea being "users provide their own dataframe, we just deal with how to process it", but we already have from_dict, and ImperialCollegeLondon/pycsvy#83 and Temporian look like good use cases for read_csv

pandas and Polars each have dozens of read_csv methods...so we may need to be careful here about which ones we add, and perhaps only start with the most common ones.

The API would be something like:

import narwhals as nw
import pandas as pd
nw.read_csv(file, native_namespace=pd)

import polars as pl
nw.read_csv(file, native_namespace=pl)

We could do:

nw.read_csv: this is eager-only and always returns nw.DataFrame
nw.scan_csv: this is the most generic one, and returns nw.LazyFrame if possible (e.g. Polars), else nw.DataFrame
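The split between the two entry points could be sketched roughly as follows. This assumes each backend module exposes `read_csv` and, optionally, `scan_csv`; the `EagerFrame` and `LazyFrame` wrappers and the fake backends are hypothetical stand-ins for illustration, not Narwhals classes or its actual implementation:

```python
from dataclasses import dataclass
from types import SimpleNamespace
from typing import Any, Union


@dataclass
class EagerFrame:
    """Hypothetical stand-in for nw.DataFrame."""

    native: Any


@dataclass
class LazyFrame:
    """Hypothetical stand-in for nw.LazyFrame."""

    native: Any


def read_csv(source: str, *, native_namespace: Any, **kwargs: Any) -> EagerFrame:
    # Eager-only: delegate to the backend's own read_csv and wrap the result.
    return EagerFrame(native_namespace.read_csv(source, **kwargs))


def scan_csv(
    source: str, *, native_namespace: Any, **kwargs: Any
) -> Union[EagerFrame, LazyFrame]:
    # Prefer the backend's lazy reader when it has one (e.g. Polars) ...
    scan = getattr(native_namespace, "scan_csv", None)
    if scan is not None:
        return LazyFrame(scan(source, **kwargs))
    # ... otherwise fall back to an eager read.
    return read_csv(source, native_namespace=native_namespace, **kwargs)


# Fake backends so the sketch runs without pandas or Polars installed.
eager_only = SimpleNamespace(read_csv=lambda source, **kw: f"eager:{source}")
lazy_capable = SimpleNamespace(
    read_csv=lambda source, **kw: f"eager:{source}",
    scan_csv=lambda source, **kw: f"lazy:{source}",
)
```

With these stand-ins, `scan_csv("data.csv", native_namespace=eager_only)` comes back as an `EagerFrame`, while the same call with `lazy_capable` comes back as a `LazyFrame`.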

Alternatives

Keep the status-quo: users are responsible for doing their own IO


lucianosrp · 2024-10-26T15:02:09Z

I would generally prefer to keep narwhals's "just-pass-me-the-df" philosophy.

We could infer which namespace to use based on which module is already imported?

import pandas as pd
import narwhals as nw

df = nw.read_csv("data.csv") # < uses pandas

But then, what to do with the already imported pandas...? If you are importing it, you might as well use it for I/O
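For illustration only, inferring the namespace from what is already imported might look like the sketch below. The function name `infer_native_namespace` and the preference order in `CANDIDATES` are assumptions made up for this example, not anything Narwhals provides:

```python
import sys
from types import ModuleType
from typing import Mapping, Optional, Sequence

# Hypothetical preference order; not something Narwhals defines.
CANDIDATES = ("polars", "pandas", "pyarrow")


def infer_native_namespace(
    candidates: Sequence[str] = CANDIDATES,
    modules: Optional[Mapping[str, ModuleType]] = None,
) -> ModuleType:
    """Return the first candidate library that has already been imported."""
    # `modules` is injectable for testing; by default inspect sys.modules.
    loaded = sys.modules if modules is None else modules
    for name in candidates:
        module = loaded.get(name)
        if module is not None:
            return module
    raise RuntimeError(
        f"None of {tuple(candidates)} is imported; pass a namespace explicitly."
    )
```

Note that `sys.modules` also contains libraries imported transitively by other dependencies, so when more than one candidate is loaded the result depends on the preference order rather than on what the user intended.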

The only major reason to have I/O support (that I can think of) would be if you wanted to switch the backend of an entire "narwhals workflow/script" with one setting.

Another way I could think of:

nw.set_io_backend("pandas")
df = nw.read_csv("data.csv")
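A rough sketch of how such a global setting could work, assuming the chosen backend is an importable module that exposes `read_csv`. Both `set_io_backend` and this `read_csv` are hypothetical here and only mirror the names used above:

```python
import importlib
from typing import Any, Optional

_io_backend: Optional[str] = None  # module-level setting, unset by default


def set_io_backend(name: str) -> None:
    """Record which library later read_csv calls should delegate to."""
    global _io_backend
    _io_backend = name


def read_csv(source: str, **kwargs: Any) -> Any:
    """Read `source` with whichever backend was configured."""
    if _io_backend is None:
        raise RuntimeError("Call set_io_backend(...) before read_csv(...).")
    # Import lazily so that only the chosen backend needs to be installed.
    backend = importlib.import_module(_io_backend)
    return backend.read_csv(source, **kwargs)
```

The trade-off with this design is that the setting is process-wide state: two libraries in the same process that both call `set_io_backend` would override each other.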

benrutter · 2024-11-14T10:27:27Z

This sounds interesting. As a library user, how would somebody use it? At the moment, "give me some-kinda-df, get back some-kinda-df" gives a neat boundary for figuring out what the end user is expecting. If I were writing a library with Narwhals' IO, would I do something like this:

def get_a_csv_and_do_some_stuff(namespace: str) -> nw.DataFrame:
    library = get_library_from_namespace_name(namespace)
    return nw.read_csv("data.csv", native_namespace=library).with_columns(z=nw.col("x") * nw.col("y"))

I'm thinking that, for this to be useful, a library needs a way of figuring out which namespace an end user wants. Would Narwhals do this, or would that be the library maintainer's responsibility?
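If that lookup does fall to the library, the `get_library_from_namespace_name` helper used above could be as small as the sketch below. The allowlist and the `supported` parameter are assumptions for illustration; a real library would choose which backends it accepts:

```python
import importlib
from types import ModuleType
from typing import Sequence

# Assumed allowlist for illustration; a real library would choose its own.
SUPPORTED_NAMESPACES = ("pandas", "polars", "pyarrow")


def get_library_from_namespace_name(
    namespace: str, supported: Sequence[str] = SUPPORTED_NAMESPACES
) -> ModuleType:
    """Map a user-supplied name such as "polars" to the imported module."""
    if namespace not in supported:
        raise ValueError(
            f"Unsupported namespace {namespace!r}; expected one of {tuple(supported)}."
        )
    # Raises ImportError if the requested library is not installed.
    return importlib.import_module(namespace)
```

Restricting the lookup to an allowlist keeps an arbitrary user-supplied string from triggering the import of an unrelated module.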

raisadz · 2024-12-09T15:15:51Z

I would like to work on this issue

MarcoGorelli · 2024-12-09T15:32:17Z

Thanks!

or would that be the library maintainer's responsibility?

Yup that's right

MarcoGorelli added the enhancement (New feature or request) and needs discussion labels Oct 1, 2024

AdrianDAlessandro mentioned this issue Oct 1, 2024

Consider Narwhals to enable multiple dataframe types for free ImperialCollegeLondon/pycsvy#96 (Open)

This was referenced Dec 10, 2024

feat: read_csv #1551 (Merged)

feat: scan_csv #1555 (Merged)

MarcoGorelli closed this as completed in #1551 Dec 10, 2024
