Dolt Python SDK write_pandas is not consistent with dolt table import #3226
Comments
For reference, the current best workaround (if you still need access from Python) is db.table_import('objects', 'objects.csv', update_table=True).
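A minimal sketch of how that workaround might be wrapped when the data starts out as a DataFrame. The helper name `import_via_csv` is hypothetical, and the only call made on `db` is the `table_import(table, filename, update_table=True)` shape quoted in the comment above; everything else is plain pandas and the standard library.

```python
import os
import tempfile

import pandas as pd


def import_via_csv(db, table: str, df: pd.DataFrame) -> None:
    """Write df to a temporary CSV and hand it to db.table_import.

    Hypothetical helper: `db` is assumed to expose the table_import call
    quoted in the thread. Missing values are written as empty CSV fields
    (the pandas default), leaving null handling to the import itself.
    """
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, f"{table}.csv")
        df.to_csv(path, index=False)  # index=False: do not add an extra column
        # The temporary file must still exist while the import runs,
        # so the call stays inside the `with` block.
        db.table_import(table, path, update_table=True)
```

Usage would look like `import_via_csv(db, 'objects', df)`, with `db` being whatever Dolt handle the script already holds.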
We appear to have a filter that removes primary key rows with NaNs. As a side-effect, if the […] I think the […] will behave more predictably: write_pandas(db, 'objects', df, import_mode='update', primary_key=['id']). Implicit filtering is perhaps not ideal, but some customers may depend on the existing behavior. Would you mind sharing more about the context of how you are using the library? For example: ETL, ad hoc data analysis, or CI scripts? We spend a tremendous amount of time testing certain interfaces and patterns of use, which admittedly have not included Doltpy's file and CLI interfaces recently. We might be able to point you towards more heavily used pathways, or shore up deficiencies in a more reliable way.
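To make the NaN filter described above concrete, here is a pandas-only sketch. It is not doltpy's code: `drop_rows_with_null_keys` is a hypothetical stand-in, and the second call assumes, purely for illustration, that every column ends up being treated as part of the key when none is named.

```python
import numpy as np
import pandas as pd


def drop_rows_with_null_keys(df: pd.DataFrame, primary_key: list) -> pd.DataFrame:
    """Hypothetical stand-in for a filter that drops rows whose key columns contain NaN."""
    return df.dropna(subset=primary_key)


df = pd.DataFrame(
    {
        "id": [1, 2, 3],
        "label": ["cat", np.nan, "dog"],  # nullable, non-key column
    }
)

# Explicit key: only `id` is checked, so the row with a missing label survives.
keyed = drop_rows_with_null_keys(df, ["id"])

# Assumed fallback: all columns treated as the key, so the nullable column
# now causes a row to be dropped silently.
all_columns = drop_rows_with_null_keys(df, list(df.columns))

print(len(keyed), len(all_columns))  # prints: 3 2
```

Under that assumption, naming the primary key explicitly is what keeps rows with NULLs in non-key columns from disappearing.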
Good to know about the […]. Our use-case is ETL + ML: specifically, loading annotations from external labeling services into Dolt and then consuming the versioned tables in downstream training scripts.
So it appears that we already implemented that feature in a backwards-compatible way sometime in late 2021, but did not update the doltpy interface accordingly. This import script is the last ETL job I wrote that uses the […]. Here are my thoughts:
Thanks for sharing that context! Feel free to continue asking follow-up questions if you have any.
This should fix the specific bug you linked: dolthub/doltpy#176. Because we rewrote the import path recently, there might be similar incompatibilities I am overlooking right now. Feel free to ping me here or in our Discord if you find other discrepancies!
The PR you linked looks good. If I understand correctly, that change would fix the bug without requiring an explicit write_pandas(db, 'objects', df, import_mode='update')
@addisonklinke That was my intention, thanks for taking a look! Will merge and release today.
@max-hoffman is this released yet? Can you close if so?
@max-hoffman This fix is noted in the v0.40.0 release notes, but after upgrading I still get the same behavior with my example above. FWIW the same goes for the latest v0.40.11.
@VinaiRachakonda Can you look into this?
@addisonklinke I am unable to reproduce. When following your Python code I'm correctly getting the following output
Do you mind checking the following with me requirements.txt
dolt version
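When comparing environments as requested above, a small script can collect both values in one place. This is only a sketch: `describe_environment` is a hypothetical helper, it assumes the Python package is installed under the distribution name `doltpy`, and it shells out to the `dolt version` command mentioned in the thread if a `dolt` binary is on the PATH.

```python
import shutil
import subprocess
from importlib import metadata


def describe_environment() -> dict:
    """Collect the installed doltpy version and the `dolt version` output.

    Either value is None when the package or the binary cannot be found.
    """
    try:
        doltpy_version = metadata.version("doltpy")
    except metadata.PackageNotFoundError:
        doltpy_version = None

    dolt_version = None
    dolt_path = shutil.which("dolt")
    if dolt_path is not None:
        result = subprocess.run(
            [dolt_path, "version"],
            capture_output=True,
            text=True,
            check=False,  # report whatever was printed instead of raising
        )
        dolt_version = result.stdout.strip() or None

    return {"doltpy": doltpy_version, "dolt": dolt_version}


if __name__ == "__main__":
    for name, value in describe_environment().items():
        print(f"{name}: {value}")
```

Pasting the printed lines into the issue gives both sides the same two data points to diff.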
@VinaiRachakonda thanks for the detailed environment. I checked my differences (noted below)
After upgrading
Specifically in the case of schemas with nullable columns. See below for minimal reproducible example