Dolt Python SDK write_pandas is not consistent with dolt table import #3226

Closed
addisonklinke opened this issue Apr 14, 2022 · 12 comments
Labels: bug Something isn't working

@addisonklinke

addisonklinke commented Apr 14, 2022

Specifically, this happens for schemas with nullable columns. See below for a minimal reproducible example.

$ dolt init
$ dolt sql
test> CREATE TABLE objects (id INT PRIMARY KEY AUTO_INCREMENT, label TEXT NOT NULL, bbox TEXT);

$ cat objects.csv  # Create this to use for importing
id,label,bbox
1,cat,
2,dog,"[1, 2, 3, 4]"
3,dog,

# APPROACH 1: Expected behavior with CLI import ------------------------------------------
$ dolt table import objects objects.csv -u 
$ dolt sql
test> SELECT * FROM objects;
+----+-------+--------------+
| id | label | bbox         |
+----+-------+--------------+
| 1  | cat   | NULL         |
| 2  | dog   | [1, 2, 3, 4] |
| 3  | dog   | NULL         |
+----+-------+--------------+
test> DELETE FROM objects;
Query OK, 3 rows affected
test> SELECT * FROM objects;  # Confirm table is empty before trying SDK approach
+----+-------+------+
| id | label | bbox |
+----+-------+------+
+----+-------+------+

# APPROACH 2: Bug with Python SDK --------------------------------------------------------
$ python
>>> from doltpy.cli.write import write_pandas
>>> import doltcli as dolt
>>> import pandas as pd
>>> db = dolt.Dolt('.')
>>> df = pd.read_csv('objects.csv')
>>> df
   id label          bbox
0   1   cat           NaN
1   2   dog  [1, 2, 3, 4]
2   3   dog           NaN
>>> write_pandas(db, 'objects', df, import_mode='update')

$ dolt sql
test> SELECT * FROM objects;
+----+-------+--------------+
| id | label | bbox         |
+----+-------+--------------+
| 2  | dog   | [1, 2, 3, 4] |  # IDs 1 and 3 are missing because their bbox is null
+----+-------+--------------+
test> DESCRIBE objects;
+-------+------+------+-----+---------+----------------+
| Field | Type | Null | Key | Default | Extra          |
+-------+------+------+-----+---------+----------------+
| id    | int  | NO   | PRI |         | auto_increment |
| label | text | NO   |     |         |                |
| bbox  | text | YES  |     |         |                |  # Even though the schema allows for this
+-------+------+------+-----+---------+----------------+
@addisonklinke (Author)

For reference, the current best workaround (if you still need access from Python) is:

db.table_import('objects', 'objects.csv', update_table=True)

fulghum added the bug Something isn't working label Apr 15, 2022
@max-hoffman (Contributor)

We appear to have a filter that removes rows with NaNs in primary key columns. As a side-effect, if the write_pandas call has not set primary_key=[...], we treat every column as a PK when considering rows for removal.

I think the following will behave more predictably:

write_pandas(db, 'objects', df, import_mode='update', primary_key=['id'])

Implicit filtering is perhaps not ideal, but some customers may depend on the existing behavior.
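
For illustration, the effect of that implicit filter can be approximated in plain pandas (a minimal sketch of the behavior described above, not doltpy's actual code):

import pandas as pd

df = pd.read_csv('objects.csv')

# With no explicit primary_key argument, every column is assumed to be part of the PK
pk_cols = list(df.columns)              # ['id', 'label', 'bbox']

# Rows with a NaN in any assumed-PK column are dropped before the import,
# which is why ids 1 and 3 (NaN bbox) never reach the table
filtered = df.dropna(subset=pk_cols)
print(filtered)                         # only the id=2 row survives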

Would you mind sharing more about the context of how you are using the library? E.g. ETL, ad hoc data analysis, or CI scripts? We spend a tremendous amount of time testing certain interfaces and patterns of use, which admittedly has not included Doltpy's file and CLI interfaces recently. We might be able to point you towards more heavily used pathways, or shore up deficiencies in a more reliable way.

@addisonklinke (Author)

Good to know about the primary_key parameter; I can confirm that's another valid workaround for the issue. Ideally, I would say the database table schema should be used to infer the primary key if it's not given explicitly. What do you think?
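
For reference, one sketch of how that inference could work is to read the Key column from DESCRIBE (which already reports PRI for id in the output above) and pass the result to write_pandas. The helper below is purely illustrative and shells out to the CLI; it is not doltpy's actual implementation:

import csv, io, subprocess

def infer_primary_keys(table):
    # DESCRIBE marks primary key columns with Key == 'PRI' (see the table above)
    out = subprocess.run(
        ['dolt', 'sql', '-q', f'DESCRIBE {table}', '-r', 'csv'],
        capture_output=True, text=True, check=True,
    ).stdout
    return [row['Field'] for row in csv.DictReader(io.StringIO(out)) if row['Key'] == 'PRI']

write_pandas(db, 'objects', df, import_mode='update', primary_key=infer_primary_keys('objects'))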

Our use-case is ETL + ML. Specifically, loading annotations from external labeling services into Dolt and then consuming the versioned tables in downstream training scripts

@max-hoffman (Contributor)

So it appears that we already implemented that feature in a backwards-compatible way sometime in late 2021, but did not update the doltpy interface accordingly. As a reference point, this import script is the last ETL job I wrote that uses the --pks flag.

Here are my thoughts:

  1. I can cut a ticket to patch up the import interface in Doltpy.

  2. Using the CLI for that last import step of ETL is often a good tradeoff for crossing the impedance mismatch between an in-memory DataFrame and a disk-backed SQL database right now (see the sample script I linked above, and the sketch after this list). Materializing Pandas transforms in a cacheable file is useful on its own, and hooking into our SQL engine's bulk import mechanism from the CLI removes the "magic" we try to automate away in client libraries like Doltpy, which you already ran into while reproducing the bug above.

  3. Adding to point (2), the CLI's interfaces, options, and error messages are actively improved, and there is a small team devoted specifically to discovering and fixing those sorts of bugs before customers find them. I am happy to respond to errors you find in the client libraries, but my active attention is focused on the SQL engine's performance, optimizer, and execution pathways, so Doltpy gets less active attention these days.
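
To make point (2) concrete, the last step of such an ETL job can stay very small. A sketch, reusing the df from the reproduction above and the same CLI flags as Approach 1:

import subprocess

# Materialize the transformed DataFrame to a cacheable CSV file...
df.to_csv('objects.csv', index=False)

# ...then hand the write to the CLI's bulk import mechanism
subprocess.run(['dolt', 'table', 'import', 'objects', 'objects.csv', '-u'], check=True)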

Thanks for sharing that context! Feel free to continue asking follow up questions if you have any.

@max-hoffman (Contributor)

This PR should fix the specific bug: dolthub/doltpy#176. Because we rewrote the import path recently, there might be similar incompatibilities I am overlooking right now. Feel free to ping me here or in our Discord if you find other discrepancies!

@addisonklinke (Author)

addisonklinke commented Apr 18, 2022

The PR you linked looks good. If I understand correctly, that change would fix the bug without requiring an explicit primary_key parameter, as long as import_mode='update'? So the following would work after merging:

write_pandas(db, 'objects', df, import_mode='update')

@max-hoffman (Contributor)

@addisonklinke That was my intention, thanks for taking a look! Will merge and release today.

@zachmu (Member)

zachmu commented May 5, 2022

@max-hoffman is this released yet? Can you close if so?

@addisonklinke (Author)

addisonklinke commented Jun 27, 2022

@max-hoffman This fix is noted in the v0.40.0 release notes, but after upgrading I still get the same behavior with my example above. FWIW, the same goes for the latest v0.40.11.

@timsehn (Contributor)

timsehn commented Jun 27, 2022

@VinaiRachakonda Can you look into this?

timsehn reopened this Jun 27, 2022
VinaiRachakonda self-assigned this Jun 27, 2022
@VinaiRachakonda (Contributor)

VinaiRachakonda commented Jun 27, 2022

@addisonklinke I am unable to reproduce. When following your Python code, I'm correctly getting the following output:

➜  ballertv dolt sql -q "select * from objects"
+----+-------+--------------+
| id | label | bbox         |
+----+-------+--------------+
| 1  | cat   | NULL         |
| 2  | dog   | [1, 2, 3, 4] |
| 3  | dog   | NULL         |
+----+-------+--------------+

Do you mind checking the following with me?

requirements.txt

attrs==19.3.0
decorator==4.4.2
doltcli==0.1.17
doltpy==2.0.13
more-itertools==8.13.0
mysql-connector-python==8.0.29
numpy==1.23.0
packaging==20.4
pandas==1.4.3
pluggy==0.13.1
protobuf==3.12.2
psutil==5.7.2
py==1.11.0
pyarrow==8.0.0
pyparsing==2.4.7
python-dateutil==2.8.1
pytz==2021.3
retry==0.9.2
six==1.15.0
SQLAlchemy==1.3.18
wcwidth==0.2.5

dolt version

dolt version 0.40.11

@addisonklinke (Author)

@VinaiRachakonda thanks for the detailed environment. I checked my differences (noted below):

$ pip freeze | egrep "^(attrs|decorator|doltcli|...)==[\.0-9]+$" > installed.txt
$ diff -u <(sort installed.txt) <(sort requirements.txt) | egrep --color=none "^\-|\+"
--- /dev/fd/63	2022-06-29 15:18:11.443609907 -0600
+++ /dev/fd/62	2022-06-29 15:18:11.443609907 -0600
@@ -1,16 +1,17 @@
-doltpy==2.0.10
-more-itertools==8.12.0
-mysql-connector-python==8.0.28
-numpy==1.19.0
+doltpy==2.0.13
+more-itertools==8.13.0
+mysql-connector-python==8.0.29
+numpy==1.23.0
-pandas==1.3.4
+pandas==1.4.3
+pyarrow==8.0.0

After upgrading doltpy from 2.0.10 to 2.0.13, the issue is resolved! So both the SDK and the CLI binary need to be updated.
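
For anyone else landing here, a quick way to confirm both pieces are current (the upgrade command is generic pip usage; the versions are the ones that worked in this thread):

$ pip install --upgrade doltpy   # 2.0.13 or newer contains the fix
$ pip show doltpy | grep Version
$ dolt version                   # 0.40.11 in the working setup above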
