
Add transfer operator S3 to (generic) SQL #28964

Closed
wants to merge 71 commits

Conversation

maggesssss
Contributor

closes: #23666

This PR adds a new transfer operator that reads a CSV file from S3 storage and loads it into an existing table of a generic SQL database.

I used csv.reader to read the file and the insert_rows method of the existing DbApiHook.
Because csv.reader does not read the complete file into memory, even large files can be loaded fairly efficiently.
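As a rough illustration of the approach described above (not the operator's actual code), this sketch streams a CSV file into a table via csv.reader and DbApiHook.insert_rows; the function name, the header-skipping step, and the commit_every value are assumptions made for the example:

from __future__ import annotations

import csv

from airflow.providers.common.sql.hooks.sql import DbApiHook


def load_csv_into_table(hook: DbApiHook, filepath: str, table: str, target_fields: list[str]) -> None:
    # csv.reader is a lazy iterator, so the whole file is never held in memory at once;
    # insert_rows consumes the rows and commits in batches.
    with open(filepath, newline="") as file:
        reader = csv.reader(file)
        next(reader)  # assumption: the first row is a header and should be skipped
        hook.insert_rows(table=table, rows=reader, target_fields=target_fields, commit_every=1000)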

I am happy for any feedback.

airflow/providers/amazon/aws/transfers/s3_to_sql.py (outdated, resolved)
Comment on lines 46 to 48
This operator downloads a file from S3, reads it via `csv.reader`,
and inserts the data into a SQL database using the `insert_rows` method.
All SQL hooks are supported, as long as they are of type DbApiHook.
Contributor

Hmmm... is the operator limited to CSV sources?

Contributor

Agree with this. Either you explicitly state in the operator name that this operator reads only CSV, or you make the operator generic by adding a new parameter `parser`, a function responsible for reading the input from the source. The actual processing of the CSV in your case would then be handled by this new parser function. Between the two solutions, I prefer the second one. My 2 cents.
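For illustration only, a rough sketch of what such a generic operator with a parser parameter could look like; the class shape, the _download_from_s3 helper, and the db_hook property are hypothetical placeholders, not the PR's actual code:

from __future__ import annotations

from typing import Callable, Iterable

from airflow.models import BaseOperator


class S3ToSqlOperator(BaseOperator):
    def __init__(self, *, table: str, parser: Callable[[str], Iterable[tuple]], **kwargs) -> None:
        super().__init__(**kwargs)
        self.table = table
        self.parser = parser  # user-supplied: turns a local file path into an iterable of rows

    def execute(self, context) -> None:
        # The operator stays format-agnostic: whatever the parser yields is handed to insert_rows.
        local_path = self._download_from_s3()  # hypothetical helper
        self.db_hook.insert_rows(table=self.table, rows=self.parser(local_path))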

Contributor Author

@vincbeck I agree. I like the idea of using a parser parameter. Do you think it makes sense to use a default parser?

Contributor

@vincbeck vincbeck Jan 16, 2023

I don't think it makes sense as a default, but I do think that providing at least one parser in the code makes sense. A CSV parser would be excellent!

Contributor Author

Hi @vincbeck, I have changed the operator to use a parser instead and have documented it accordingly.
Could you please have a look and let me know what you think about it?

Contributor

It looks really good! Thanks for making the changes. I added some comments, but they are mostly nitpicks and minor remarks. Good job!

airflow/providers/amazon/aws/transfers/s3_to_sql.py (outdated, resolved)
airflow/providers/amazon/aws/transfers/s3_to_sql.py (outdated, resolved)
@@ -0,0 +1,84 @@
# Licensed to the Apache Software Foundation (ASF) under one
Contributor

Following up on the discussion here #22438, this file should be moved to tests/system/providers/amazon/aws/example_s3.py

Contributor Author

@vincbeck I have modified the example DAG according to AIP-47, but I was not able to test it yet. I will do so as soon as I have fixed my breeze environment.

Contributor

The system test looks really good! Thanks for the effort of converting the example DAG!

Contributor Author

@vincbeck I added a SQL check operator to count the rows inserted. Tests were running fine locally.
Can you give me a hint about which conn_ids I have to use?
For S3 I guess aws_default, but for generic SQL? sql_default?

airflow/providers/amazon/aws/transfers/s3_to_sql.py (outdated, resolved)
# Remove file downloaded from s3 to be idempotent.
os.remove(self._file)

def _get_hook(self) -> DbApiHook:
Contributor

nit. You can decorate this function with @cached_property
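For illustration, a minimal sketch of the suggested @cached_property pattern; the attribute names (sql_conn_id, sql_hook_params), the insert_rows check, and the assumption that Connection.get_hook accepts hook_params on a recent Airflow version are mine, not necessarily the PR's code:

from functools import cached_property

from airflow.exceptions import AirflowException
from airflow.hooks.base import BaseHook
from airflow.models import BaseOperator


class S3ToSqlOperator(BaseOperator):
    # __init__ storing sql_conn_id and sql_hook_params omitted for brevity

    @cached_property
    def db_hook(self):
        # Resolve the destination hook once; later accesses reuse the cached instance.
        conn = BaseHook.get_connection(self.sql_conn_id)
        hook = conn.get_hook(hook_params=self.sql_hook_params)
        if not callable(getattr(hook, "insert_rows", None)):
            raise AirflowException("The SQL hook must be DbApiHook-like and expose an insert_rows method.")
        return hook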

Contributor

Perhaps this doesn't need to be a property at all since it's only used in the execute() method?

Contributor

There is some work going on to standardize hook access in the Amazon provider package. See #29001. I agree with you that it is not necessary to store the hook in a property, but (and this is only my personal opinion) using @cached_property makes the code cleaner.

Contributor Author

@vincbeck I have pushed some changes, please let me know if it's fine now

airflow/providers/amazon/aws/transfers/s3_to_sql.py (outdated, resolved)
airflow/providers/amazon/aws/transfers/s3_to_sql.py (outdated, resolved)
@maggesssss maggesssss marked this pull request as draft January 18, 2023 20:56
parameter which allows the user to add a custom parser.
Example parser added to docs

Removed the following args:
- csv_reader_kwargs
- skip_first_row
- column_list "infer" option
These arguments do not work with a custom parser at the moment

Changed to NamedTemporaryFile

Added s3_hook.get_key before downloading to check if the file exists

Updated test and docs
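Roughly, the download flow those commit notes describe could look like the sketch below; this is a guess at the shape (the helper name and insert_rows callback are hypothetical), and it relies on S3Hook.get_key raising when the key does not exist:

from tempfile import NamedTemporaryFile

from airflow.providers.amazon.aws.hooks.s3 import S3Hook


def download_and_insert(s3_bucket: str, s3_key: str, parser, insert_rows) -> None:
    # Hypothetical helper: verify the key exists, download to a temporary file,
    # parse it, and hand the rows to the destination hook's insert_rows.
    s3_hook = S3Hook()
    key_obj = s3_hook.get_key(key=s3_key, bucket_name=s3_bucket)  # raises if the object is missing
    with NamedTemporaryFile(mode="wb", suffix=".csv") as temp_file:
        key_obj.download_fileobj(temp_file)
        temp_file.flush()
        insert_rows(rows=parser(temp_file.name))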
airflow/providers/amazon/aws/transfers/s3_to_sql.py (outdated, resolved)
e.g. to use a CSV parser that yields rows line-by-line, pass the following
function:

def parse_csv(filepath):
Contributor

Love it!
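The documented example is cut off in this rendering; a plausible completion, based on the csv.reader approach described earlier in the thread (the exact body in the PR may differ):

import csv


def parse_csv(filepath):
    # Yield rows one at a time so large files are never fully loaded into memory.
    with open(filepath, newline="") as file:
        yield from csv.reader(file)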

tests/system/providers/amazon/aws/example_s3_to_sql.py (outdated, resolved)
Co-authored-by: Vincent <97131062+vincbeck@users.noreply.github.com>
maggesssss and others added 7 commits January 19, 2023 18:41
Co-authored-by: Niko Oliveira <onikolas@amazon.com>
to cached property db_hook
use imported watcher task
for SqlExecuteQueryOperators
string and added

SQLTableCheckOperator to check if lines have been successfully imported
Bowrna and others added 15 commits January 21, 2023 13:48
Move the logic from __init__ to executor for FTP operator
…he#29071)

This version of the chart uses different variable names for setting usernames and passwords in the postgres database.
`postgresql.auth.enablePostgresUser` is used to determine if the "postgres" admin account will be created.
`postgresql.auth.postgresPassword` sets the password for the "postgres" user.
`postgresql.auth.username` and `postgresql.auth.password` are used to set credentials for a non-admin account if desired.
`postgresql.postgresqlUsername` and `postgresql.postgresqlPassword`, which were used in the previous version of the chart, are no longer used.
Users will need to change these variable names in their values files if they are using the helm chart.

Co-authored-by: Caleb Woofenden <caleb.woofenden@bitsighttech.com>
As long as the hook has an insert_rows method, this operator can work.
local_tempfile after download is finished
self._get_hook()
db_hook is now a cached property
@maggesssss
Contributor Author

Sorry, I think I did something wrong when rebasing...
Most of those commits are not from me.

Shall I create a new PR?

@potiuk
Member

potiuk commented Jan 21, 2023

Feel free.

@maggesssss
Contributor Author

Closed (replaced by #29085)

@maggesssss maggesssss closed this Jan 21, 2023