df.to_stata fails when a column of type object contains only None #23572
Comments
Thanks for the report. Investigation and PRs are always welcome! |
What should it write? Empty strings, NaNs (with which type?), integers with the missing value, or ['None', 'None']? This is an ambiguous case and the best solution is to raise an exception. TBH the case with ['a', None] should probably also raise, so that only columns containing only strings are written as string columns. |
My point was that
|
I'm not familiar with Stata, but based on the examples I agree with @bashtage here: I think it should raise rather than silently coerce for you. Raising would be an explicit sign to the user that they may want to replace or fill values before export, especially since it's not a lossless round trip. |
It seems like there are two sub-issues being discussed:
I'm also somewhat split on warning about it. On the one hand, it would be clearer for users. On the other, if you have a lot of code that relies on this, you'll get some warning fatigue. Another option is adding a parameter, but I wonder how often coercing is actually the wrong behavior. |
It is tempting to not change the current behavior vis-a-vis mixed string and The correct behavior in a column with inferred_dtype = infer_dtype(column.dropna()) This drops na-like values, including One issue that affects the ambiguity here is that both datetime and string arrays can be object and are supported. Converting all |
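To make the infer_dtype(column.dropna()) idea concrete, here is a minimal sketch. The helper name inferred_after_dropna and the two example columns are hypothetical and only illustrate the check; this is not the code pandas uses internally.

```python
import pandas as pd
from pandas.api.types import infer_dtype


def inferred_after_dropna(column: pd.Series) -> str:
    """Infer the kind of values in a column after dropping NA-like entries.

    Hypothetical helper for illustration only.
    """
    return infer_dtype(column.dropna())


# An object column mixing a string with None, and one holding only None.
mixed = pd.Series(["a", None], dtype=object)
all_none = pd.Series([None, None], dtype=object)

mixed_kind = inferred_after_dropna(mixed)
all_none_kind = inferred_after_dropna(all_none)

print("mixed:", mixed_kind)
print("all None:", all_none_kind)
```

Once the None values are dropped, the mixed column is inferred as a string column, while the all-None column has nothing left to infer from, so it does not come back as "string". That is the ambiguity under discussion: an empty object column could equally have been meant to hold strings or datetimes.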
That makes sense to me about the mixed The all Beyond that, maybe the error text could be better. I'd like to know that (a) the more descriptive reason is that it's all |
@jtkiley I looked into the Parquet details more and it turns out that a string of length 0 and a null string are two different values. So they map exactly to It appears to me that the core issue is that Stata doesn't have a missing value for strings. In Pandas, |
I linked a PR with a proposed changed error message that
|
@kylebarron Good sleuthing. So, just to be clear, you're only talking about raising in the all So, I think my (revised) preferences are:
Also, to get another look at a known source of |
Above I was referencing the all
I agree with @bashtage raising an error is the best solution. |
Are there still any remaining fixes/enhancements related to this issue? |
@jreback #23692 is the mixed |
FWIW the change that allows
If |
Improve the error message shown when an object array is empty closes pandas-dev#23572
* ENH: Improve error message for empty object array
  Improve the error message shown when an object array is empty
  closes #23572
* TST: Add tests for all None
  Test exception is hit when all values in an object column are None
  Extend the test for strl conversion to ensure this case passes (as expected)
I was trying to save the results of querying datasets across several years, where column names change, etc. I ended up adding the following "hack" to get it to automatically set
I'm sure this isn't "safe", but it might be worth adding an option to enable this in a "safe way". |
Safer would be
so that you don't try to write all missing value numeric columns. |
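One way to sketch the "safer" idea is to fill only the all-missing object columns and leave all-missing numeric columns untouched. The helper name fill_all_missing_object_columns and the column names are hypothetical, and writing empty strings is an assumption about what the user wants, not something pandas does for you.

```python
import io

import pandas as pd


def fill_all_missing_object_columns(frame: pd.DataFrame) -> pd.DataFrame:
    """Return a copy in which all-missing object columns hold empty strings.

    Hypothetical helper: numeric columns are left alone, so an all-NaN
    float column is still written as a numeric column with missing values.
    """
    out = frame.copy()
    for name in out.columns:
        col = out[name]
        if col.dtype == object and col.isna().all():
            # Keep the column as object dtype so it is exported as a string column.
            out[name] = pd.Series([""] * len(out), index=out.index, dtype=object)
    return out


df = pd.DataFrame(
    {
        "s": pd.Series([None, None], dtype=object),  # all None, object dtype
        "x": [float("nan"), float("nan")],           # all missing, numeric
    }
)
cleaned = fill_all_missing_object_columns(df)

# Write to an in-memory buffer instead of a file on disk.
buf = io.BytesIO()
cleaned.to_stata(buf, write_index=False)
```

The dtype check is what keeps this from touching numeric columns. Note that the round trip is still lossy: after reading the file back, the filled column contains empty strings, not None.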
Great point thanks @bashtage! Do you think it might be worth adding an option to |
Just want to add that there should be brackets around
|
Code Sample, a copy-pastable example if possible
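A minimal sketch of the behavior reported in this issue, assuming a frame with two hypothetical object columns, a and b. It is an illustration written for this thread, not the reporter's original sample, and try_write is a hypothetical helper.

```python
import io

import pandas as pd

df = pd.DataFrame(
    {
        "a": pd.Series(["x", None], dtype=object),   # one string, one None
        "b": pd.Series([None, None], dtype=object),  # only None
    }
)


def try_write(frame: pd.DataFrame):
    """Write the frame to an in-memory Stata file; return the error, if any."""
    buf = io.BytesIO()
    try:
        frame.to_stata(buf, write_index=False)
    except ValueError as exc:
        return exc
    return None


mixed_error = try_write(df[["a"]])     # expected to succeed
all_none_error = try_write(df[["b"]])  # expected to raise ValueError

print("mixed column error:", mixed_error)
print("all-None column error:", type(all_none_error).__name__)
```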
Problem description
The df.to_stata() method writes columns containing None without error when there is at least one string value in the column, but fails if the column contains only None. It's unclear what data type to write a column of None as, so maybe that's why this isn't supported? I would propose that a column with values of only None be written as str1 with empty strings.
I came across this error because I read in a Parquet file with pd.read_parquet() and was unable to write the file to Stata format. In the Parquet schema, the column had type BYTE_ARRAY UTF8, but since the column had only missing values, it was read into Pandas as only None.
Expected Output
Stata file written to disk with missing values for the column with None.
Output of pd.show_versions()