Dolt Python SDK write_pandas is not consistent with dolt table import #3226

Closed
addisonklinke opened this issue Apr 14, 2022 · 12 comments
Labels: bug Something isn't working

@addisonklinke

addisonklinke commented Apr 14, 2022

Specifically, this happens for schemas with nullable columns. See below for a minimal reproducible example.

$ dolt init
$ dolt sql
test> CREATE TABLE objects (id INT PRIMARY KEY AUTO_INCREMENT, label TEXT NOT NULL, bbox TEXT);

$ cat objects.csv  # Create this to use for importing
id,label,bbox
1,cat,
2,dog,"[1, 2, 3, 4]"
3,dog,

# APPROACH 1: Expected behavior with CLI import ------------------------------------------
$ dolt table import objects objects.csv -u 
$ dolt sql
test> SELECT * FROM objects;
+----+-------+--------------+
| id | label | bbox         |
+----+-------+--------------+
| 1  | cat   | NULL         |
| 2  | dog   | [1, 2, 3, 4] |
| 3  | dog   | NULL         |
+----+-------+--------------+
test> DELETE FROM objects;
Query OK, 3 rows affected
test> SELECT * FROM objects;  # Confirm table is empty before trying SDK approach
+----+-------+------+
| id | label | bbox |
+----+-------+------+
+----+-------+------+

# APPROACH 2: Bug with Python SDK --------------------------------------------------------
$ python
>>> from doltpy.cli.write import write_pandas
>>> import doltcli as dolt
>>> import pandas as pd
>>> db = dolt.Dolt('.')
>>> df = pd.read_csv('objects.csv')
>>> df
   id label          bbox
0   1   cat           NaN
1   2   dog  [1, 2, 3, 4]
2   3   dog           NaN
>>> write_pandas(db, 'objects', df, import_mode='update')

$ dolt sql
test> SELECT * FROM objects;
+----+-------+--------------+
| id | label | bbox         |
+----+-------+--------------+
| 2  | dog   | [1, 2, 3, 4] |  # IDs 1 and 3 are missing because their bbox is null
+----+-------+--------------+
test> DESCRIBE objects;
+-------+------+------+-----+---------+----------------+
| Field | Type | Null | Key | Default | Extra          |
+-------+------+------+-----+---------+----------------+
| id    | int  | NO   | PRI |         | auto_increment |
| label | text | NO   |     |         |                |
| bbox  | text | YES  |     |         |                |  # Even though the schema allows for this
+-------+------+------+-----+---------+----------------+
@addisonklinke (Author)

For reference, the current best workaround (if you still need access from Python) is:

db.table_import('objects', 'objects.csv', update_table=True)

fulghum added the bug Something isn't working label Apr 15, 2022
@max-hoffman (Contributor)

We appear to have a filter that removes rows with NaNs in primary key columns. As a side-effect, if the write_pandas call has not set primary_key=[...], we treat every column as a PK when considering rows for removal.

I think the following will behave more predictably:

write_pandas(db, 'objects', df, import_mode='update', primary_key=['id'])

Implicit filtering is perhaps not ideal, but some customers may depend on the existing behavior.
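
For illustration, the effect of that implicit filter can be approximated in plain pandas (a minimal sketch of the behavior described above, not doltpy's actual code):

import pandas as pd

df = pd.read_csv('objects.csv')

# With no explicit primary_key argument, every column is assumed to be part of the PK
pk_cols = list(df.columns)              # ['id', 'label', 'bbox']

# Rows with a NaN in any assumed-PK column are dropped before the import,
# which is why ids 1 and 3 (NaN bbox) never reach the table
filtered = df.dropna(subset=pk_cols)
print(filtered)                         # only the id=2 row survives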

Would you mind sharing more about the context of how you are using the library? E.g. ETL, ad hoc data analysis, or CI scripts? We spend a tremendous amount of time testing certain interfaces and patterns of use, which admittedly has not included Doltpy's file and CLI interfaces recently. We might be able to point you towards more heavily used pathways, or shore up deficiencies in a more reliable way.

@addisonklinke (Author)

Good to know about the primary_key parameter; I can confirm that's another valid workaround for the issue. Ideally, I would say the database table schema should be used to infer the primary key if it's not given explicitly. What do you think?
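
For reference, one sketch of how that inference could work is to read the Key column from DESCRIBE (which already reports PRI for id in the output above) and pass the result to write_pandas. The helper below is purely illustrative and shells out to the CLI; it is not doltpy's actual implementation:

import csv, io, subprocess

def infer_primary_keys(table):
    # DESCRIBE marks primary key columns with Key == 'PRI' (see the table above)
    out = subprocess.run(
        ['dolt', 'sql', '-q', f'DESCRIBE {table}', '-r', 'csv'],
        capture_output=True, text=True, check=True,
    ).stdout
    return [row['Field'] for row in csv.DictReader(io.StringIO(out)) if row['Key'] == 'PRI']

write_pandas(db, 'objects', df, import_mode='update', primary_key=infer_primary_keys('objects'))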

Our use-case is ETL + ML. Specifically, loading annotations from external labeling services into Dolt and then consuming the versioned tables in downstream training scripts

@max-hoffman (Contributor)

So it appears that we already implemented that feature in a backwards-compatible way sometime in late 2021, but did not update the doltpy interface accordingly. As a reference point, this import script is the last ETL job I wrote that uses the --pks flag.

Here are my thoughts:

  1. I can cut a ticket to patch up the import interface in Doltpy.

  2. Using the CLI for that last import step of ETL is often a good tradeoff for crossing the impedance mismatch between an in-memory DataFrame and a disk-backed SQL database right now (see the sample script I linked above, and the sketch after this list). Materializing Pandas transforms in a cacheable file is useful on its own, and hooking into our SQL engine's bulk import mechanism from the CLI removes the "magic" we try to automate away in client libraries like Doltpy, which you already ran into while reproducing the bug above.

  3. Adding to point (2), the CLI's interfaces, options, and error messages are actively improved, and there is a small team devoted specifically to discovering and fixing those sorts of bugs before customers find them. I am happy to respond to errors you find in the client libraries, but my active attention is focused on the SQL engine's performance, optimizer, and execution pathways, so Doltpy gets less active attention these days.
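
To make point (2) concrete, the last step of such an ETL job can stay very small. A sketch, reusing the df from the reproduction above and the same CLI flags as Approach 1:

import subprocess

# Materialize the transformed DataFrame to a cacheable CSV file...
df.to_csv('objects.csv', index=False)

# ...then hand the write to the CLI's bulk import mechanism
subprocess.run(['dolt', 'table', 'import', 'objects', 'objects.csv', '-u'], check=True)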

Thanks for sharing that context! Feel free to continue asking follow up questions if you have any.

@max-hoffman (Contributor)

This PR should fix the specific bug: dolthub/doltpy#176. Because we rewrote the import path recently, there might be similar incompatibilities I am overlooking right now. Feel free to ping me here or in our Discord if you find other discrepancies!

@addisonklinke (Author)

addisonklinke commented Apr 18, 2022

The PR you linked looks good. If I understand correctly, that change would fix the bug without requiring an explicit primary_key parameter, as long as import_mode='update'? So the following would work after merging:

write_pandas(db, 'objects', df, import_mode='update')

@max-hoffman (Contributor)

@addisonklinke That was my intention, thanks for taking a look! Will merge and release today.

@zachmu (Member)

zachmu commented May 5, 2022

@max-hoffman is this released yet? Can you close if so?

@addisonklinke (Author)

addisonklinke commented Jun 27, 2022

@max-hoffman This fix is noted in the v0.40.0 release notes, but after upgrading I still get the same behavior with my example above. FWIW, the same goes for the latest v0.40.11.

@timsehn (Contributor)

timsehn commented Jun 27, 2022

@VinaiRachakonda Can you look into this?

timsehn reopened this Jun 27, 2022
VinaiRachakonda self-assigned this Jun 27, 2022
@VinaiRachakonda (Contributor)

VinaiRachakonda commented Jun 27, 2022

@addisonklinke I am unable to reproduce. When following your Python code, I'm correctly getting the following output:

➜  ballertv dolt sql -q "select * from objects"
+----+-------+--------------+
| id | label | bbox         |
+----+-------+--------------+
| 1  | cat   | NULL         |
| 2  | dog   | [1, 2, 3, 4] |
| 3  | dog   | NULL         |
+----+-------+--------------+

Do you mind checking the following with me?

requirements.txt

attrs==19.3.0
decorator==4.4.2
doltcli==0.1.17
doltpy==2.0.13
more-itertools==8.13.0
mysql-connector-python==8.0.29
numpy==1.23.0
packaging==20.4
pandas==1.4.3
pluggy==0.13.1
protobuf==3.12.2
psutil==5.7.2
py==1.11.0
pyarrow==8.0.0
pyparsing==2.4.7
python-dateutil==2.8.1
pytz==2021.3
retry==0.9.2
six==1.15.0
SQLAlchemy==1.3.18
wcwidth==0.2.5

dolt version

dolt version 0.40.11

@addisonklinke (Author)

@VinaiRachakonda thanks for the detailed environment. I checked my differences (noted below):

$ pip freeze | egrep "^(attrs|decorator|doltcli|...)==[\.0-9]+$" > installed.txt
$ diff -u <(sort installed.txt) <(sort requirements.txt) | egrep --color=none "^\-|\+"
--- /dev/fd/63	2022-06-29 15:18:11.443609907 -0600
+++ /dev/fd/62	2022-06-29 15:18:11.443609907 -0600
@@ -1,16 +1,17 @@
-doltpy==2.0.10
-more-itertools==8.12.0
-mysql-connector-python==8.0.28
-numpy==1.19.0
+doltpy==2.0.13
+more-itertools==8.13.0
+mysql-connector-python==8.0.29
+numpy==1.23.0
-pandas==1.3.4
+pandas==1.4.3
+pyarrow==8.0.0

After upgrading doltpy from 2.0.10 to 2.0.13, the issue is resolved! So both the SDK and the CLI binary need to be updated.
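
For anyone else landing here, a quick way to confirm both pieces are current (the upgrade command is generic pip usage; the versions are the ones that worked in this thread):

$ pip install --upgrade doltpy   # 2.0.13 or newer contains the fix
$ pip show doltpy | grep Version
$ dolt version                   # 0.40.11 in the working setup above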
