Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG: csv.QUOTE_NOTNULL doesn't seem to be supported by DataFrame.to_csv(), but there's no error and it's not documented #60423

Open
3 tasks done
wjandrea opened this issue Nov 26, 2024 · 17 comments
Labels
Enhancement IO CSV read_csv, to_csv Needs Discussion Requires discussion from core team before further action

Comments

@wjandrea
Copy link
Contributor

wjandrea commented Nov 26, 2024

Pandas version checks

  • I have checked that this issue has not already been reported.
    • No issues mention QUOTE_NOTNULL
  • I have confirmed this bug exists on the latest version of pandas.
  • I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

import csv
import pandas as pd

df = pd.DataFrame(
    [[0, None, np.nan, pd.NA]],
    columns=['a', 'b', 'c', 'd'])

print(df.to_csv(quoting=csv.QUOTE_NOTNULL, index=False))

Issue Description

All fields are quoted, even nulls.

"a","b","c","d"
"0","","",""

Expected Behavior

All nulls should be unquoted.

"a","b","c","d"
"0",,,

Another possibility I might have expected is only unquoting None, as described in the csv docs, but this doesn't make sense for Pandas IMHO since Pandas has multiple "NULL"s.

"a","b","c","d"
"0",,"",""

Installed Versions

INSTALLED VERSIONS
------------------
commit                : b1c2ba793bb1f3d7001de41ca4c73377006b9e4d
python                : 3.12.7
python-bits           : 64
OS                    : Linux
OS-release            : 5.4.0-200-generic
Version               : #220-Ubuntu SMP Fri Sep 27 13:19:16 UTC 2024
machine               : x86_64
processor             : x86_64
byteorder             : little
LC_ALL                : None
LANG                  : fr_CA.UTF-8
LOCALE                : fr_CA.UTF-8

pandas                : 3.0.0.dev0+1729.gb1c2ba793b
numpy                 : 2.1.0.dev0+git20240403.e59c074
dateutil              : 2.9.0.post0
pip                   : 24.2
Cython                : None
sphinx                : None
IPython               : 8.22.2
adbc-driver-postgresql: None
adbc-driver-sqlite    : None
bs4                   : None
blosc                 : None
bottleneck            : None
fastparquet           : None
fsspec                : None
html5lib              : None
hypothesis            : None
gcsfs                 : None
jinja2                : None
lxml.etree            : None
matplotlib            : None
numba                 : None
numexpr               : None
odfpy                 : None
openpyxl              : None
psycopg2              : None
pymysql               : None
pyarrow               : None
pyreadstat            : None
pytest                : None
python-calamine       : None
pytz                  : 2024.1
pyxlsb                : None
s3fs                  : None
scipy                 : None
sqlalchemy            : None
tables                : None
tabulate              : None
xarray                : None
xlrd                  : None
xlsxwriter            : None
zstandard             : None
tzdata                : 2024.1
qtpy                  : None
pyqt5                 : None
@wjandrea wjandrea added Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Nov 26, 2024
@asishm
Copy link
Contributor

asishm commented Nov 26, 2024

Thanks, looks like csv.QUOTE_NOTNULL was added in python 3.12 along with csv.QUOTE_STRINGS. The user guide at https://pandas.pydata.org/docs/user_guide/io.html#quoting-compression-and-file-format also only mentions 0,1,2,3

quotingint or csv.QUOTE_* instance, default 0
Control field quoting behavior per csv.QUOTE_* constants. Use one of QUOTE_MINIMAL (0), QUOTE_ALL (1), QUOTE_NONNUMERIC (2) or QUOTE_NONE (3).

@asishm asishm added the IO CSV read_csv, to_csv label Nov 26, 2024
@rhshadrach
Copy link
Member

Contributions implementing QUOTE_NOTNULL are welcome!

@rhshadrach rhshadrach added Enhancement and removed Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Dec 26, 2024
@asishm
Copy link
Contributor

asishm commented Dec 27, 2024

@rhshadrach should a separate issue be opened for QUOTE_STRINGS ?

@rhshadrach
Copy link
Member

I think it's okay either way - fine with having both be this issue. It's okay for a PR to only implement one, it just won't fully close this issue.

@snitish
Copy link
Contributor

snitish commented Jan 11, 2025

@wjandrea it seems QUOTE_NOTNULL (and other QUOTE_*) are already implicitly supported. They are passed through as-is to the csv module's writer. The problem in your case is that there is a na_rep argument in to_csv that defaults to '' (empty string), so the NAs are being replaced with empty strings and QUOTE_NOTNULL is not taking effect. I explicitly set na_rep to None -

>>> print(df.to_csv(quoting=csv.QUOTE_NOTNULL, index=False, na_rep=None))
"a","b","c","d"
"0",,,

and it worked as expected. In light of this, is there anything we really need to fix here? @rhshadrach @asishm

@rhshadrach
Copy link
Member

I explicitly set na_rep to None -

Is this tested? The docstring type-hints na_rep as a str, so if we are supporting None it should be updated. This seems quite useful, so I'm supportive of allowing None.

@wjandrea
Copy link
Contributor Author

wjandrea commented Jan 11, 2025

@rhshadrach I think it would make more sense to have to_csv() ignore the na_rep parameter if quoting == csv.QUOTE_NOTNULL and instead use None internally, something like this:

def to_csv(
    ...,
    na_rep: str = ''
):
    ...
    _na_rep: str | None = na_rep
    if quoting == csv.QUOTE_NOTNULL:
        _na_rep = None

    formatter = DataFrameFormatter(
        ...
        na_rep=_na_rep,
    )
    ...

@rhshadrach
Copy link
Member

In general, I'm negative on us ignoring values specified by users. Isn't that what your suggested approach is doing?

@wjandrea
Copy link
Contributor Author

@rhshadrach Yes it ignores the na_rep argument, but if the user specifies csv.QUOTE_NOTNULL, what else would they expect? I mean, the csv docs say:

csv.QUOTE_NOTNULL ... quote all fields which are not None. ... if a field value is None an empty (unquoted) string is written.

@snitish
Copy link
Contributor

snitish commented Jan 11, 2025

I explicitly set na_rep to None -

Is this tested?

I did a quick grep and couldn't find any tests that set na_rep=None. Agree that supporting (officially) na_rep=None might be cleaner than ignoring na_rep when the user specifies csv.QUOTE_NOTNULL.

@rhshadrach
Copy link
Member

@rhshadrach Yes it ignores the na_rep argument, but if the user specifies csv.QUOTE_NOTNULL, what else would they expect? I mean, the csv docs say:

The pandas API often goes beyond the features of Python, and this can cause conflicts. In this particular case pandas allows specifying the NA representation while I believe Python does not (correct me if this is wrong). In such cases, I do not think it is always best to follow the Python API to the letter. At times, disagreeing with Python behavior is appropriate.

Some options:

  1. Ignore na_rep when csv.QUOTE_NOTNULL is specified.
  2. Raise when na_rep is not None when csv.QUOTE_NOTNULL is specified.
  3. Change the default of na_rep to lib.no_default, making the real default of na_rep dependent on other arguments (None when csv.QUOTE_NOTNULL is specified, "" otherwise).
  4. Keep the current default and behavior of na_rep="", making a user specify na_rep=None and csv.QUOTE_NOTNULL to get the same behavior as Python's CSV module.

cc @pandas-dev/pandas-core

@rhshadrach rhshadrach added the Needs Discussion Requires discussion from core team before further action label Jan 12, 2025
@Dr-Irv
Copy link
Contributor

Dr-Irv commented Jan 13, 2025

I'd vote for option 2:

Raise when na_rep is not None when csv.QUOTE_NOTNULL is specified.

@wjandrea
Copy link
Contributor Author

wjandrea commented Jan 13, 2025

@rhshadrach Ah, I see what you mean.

How about a combination of options 2 and 3 to allow the user to specify None explicitly or implicitly? i.e. if the user specifies csv.QUOTE_NOTNULL, then for na_rep:

  • unspecified: use None
  • None: continue as normal
  • any other value: raise

In code:

def to_csv(
    ...,
    na_rep: str | None | lib.NoDefault = lib.no_default
):
    ...
    _na_rep: str | None
    if quoting == csv.QUOTE_NOTNULL:
        if na_rep is lib.no_default or na_rep is None:
            _na_rep = None
        else:
            raise ValueError('For "quoting=csv.QUOTE_NOTNULL", "na_rep" must be "None" or unspecified')
    else:
        _na_rep = '' if na_rep is lib.no_default else na_rep

    formatter = DataFrameFormatter(
        ...
        na_rep=_na_rep,
    )
    ...

@rhshadrach
Copy link
Member

How about a combination of options 2 and 3...:

  • ...
  • any other value: raise

It seems to me this would be raising unnecessarily. Why is that preferred over adhering to the value the user specified?

@wjandrea
Copy link
Contributor Author

@rhshadrach If the function adhered to the value the user specified, what would it do with it? As far as I can figure, if the user specifies csv.QUOTE_NOTNULL, the only reasonable value for na_rep is None.

@rhshadrach
Copy link
Member

Why isn't

df = pd.DataFrame([[0, None, np.nan, pd.NA]], columns=['a', 'b', 'c', 'd'])
print(df.to_csv(na_rep="x", index=False, quoting=csv.QUOTE_ALL))
# "a","b","c","d"
# "0",x,x,x

reasonable?

@wjandrea
Copy link
Contributor Author

wjandrea commented Jan 22, 2025

That's perfectly reasonable, but how would we implement it? .to_csv() ultimately uses the csv module, which doesn't support that.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Enhancement IO CSV read_csv, to_csv Needs Discussion Requires discussion from core team before further action
Projects
None yet
Development

No branches or pull requests

5 participants