
Restructure Amazon provider operators files #20139

Closed
13 tasks done
eladkal opened this issue Dec 8, 2021 · 27 comments
Labels
good first issue kind:task A task that needs to be completed as part of a larger issue provider:amazon-aws AWS/Amazon - related issues

Comments

@eladkal
Contributor

eladkal commented Dec 8, 2021

Body

Spun out of #19665 (comment)

Info:
Currently, operators in the Amazon provider follow a one-operator-per-file structure (link). A single service like S3 has 7 operator files! This makes it very hard to discover classes.
The proposed structure is the one we have for GCP (link), where a single file is created per service and all operators for that service live in that file.

Proposed changes:

Operators:

files that don't require changes:

athena.py
batch.py
cloud_formation.py
datasync.py
ecs.py
eks.py
glacier.py
glue.py
glue_crawler.py
redshift.py
sns.py
sqs.py

files that require changes:

Current → New
dms_create_task.py → dms.py
dms_delete_task.py → dms.py
dms_describe_tasks.py → dms.py
dms_start_task.py → dms.py
dms_stop_task.py → dms.py
ec2_start_instance.py → ec2.py
ec2_stop_instance.py → ec2.py
emr_add_steps.py → emr.py
emr_containers.py → emr.py
emr_create_job_flow.py → emr.py
emr_modify_cluster.py → emr.py
emr_terminate_job_flow.py → emr.py
s3_bucket.py → s3.py
s3_bucket_tagging.py → s3.py
s3_copy_object.py → s3.py
s3_delete_objects.py → s3.py
s3_file_transform.py → s3.py
s3_list.py → s3.py
s3_list_prefixes.py → s3.py
sagemaker_base.py → sagemaker.py
sagemaker_endpoint.py → sagemaker.py
sagemaker_endpoint_config.py → sagemaker.py
sagemaker_model.py → sagemaker.py
sagemaker_processing.py → sagemaker.py
sagemaker_training.py → sagemaker.py
sagemaker_transform.py → sagemaker.py
sagemaker_tuning.py → sagemaker.py
step_function_get_execution_output.py → step_function.py
step_function_start_execution.py → step_function.py

Sensors:

files that don't require changes:

athena.py
batch.py
cloud_formation.py
eks.py
glacier.py
glue.py
glue_crawler.py
redshift.py
sqs.py

files that require changes:

Current → New
dms_task.py → dms.py
ec2_instance_state.py → ec2.py
emr_base.py → emr.py
emr_containers.py → emr.py
emr_job_flow.py → emr.py
emr_step.py → emr.py
glue_catalog_partition.py → glue.py (note: new file already exists)
s3_key.py → s3.py
s3_keys_unchanged.py → s3.py
s3_prefix.py → s3.py
sagemaker_base.py → sagemaker.py
sagemaker_endpoint.py → sagemaker.py
sagemaker_training.py → sagemaker.py
sagemaker_transform.py → sagemaker.py
sagemaker_tuning.py → sagemaker.py
step_function_execution.py → step_function.py

Action items:
Create new files and deprecate the old ones. This is a backward-compatible change.
Old files should raise deprecation warnings.

FYI @dstandish @ferruzzi @o-nikolas @dbarrundiag

Committer

  • I acknowledge that I am a maintainer/committer of the Apache Airflow project.
@o-nikolas
Contributor

Hey Elad,

Thanks for cutting this ticket!

We can take this work item over here at AWS; assign the task to me unless someone else is keen to do it, and we'll pick away at it 👍

@eladkal
Contributor Author

eladkal commented Dec 8, 2021

Hey Elad,

Thanks for cutting this ticket!

We can take this work item over here at AWS; assign the task to me unless someone else is keen to do it, and we'll pick away at it 👍

Multiple people can work on it together. Each action item is standalone, so together we can finish this quickly.
I started with ec2. When you actively work on a class, just comment here.

@o-nikolas
Contributor

Hey Elad,
Thanks for cutting this ticket!
We can take this work item over here at AWS; assign the task to me unless someone else is keen to do it, and we'll pick away at it +1

Multiple people can work on it together. Each action item is standalone. Together we can finish this quickly. I started with ec2. When you actively work on a class, just comment here.

Ah okay, I didn't think there was any urgency here. Fair enough. But yes, definitely agree that it is parallelizable 👍

@potiuk
Member

potiuk commented Dec 8, 2021

Nice. You just might want to make it easier for people to migrate and to keep the deprecations working, @o-nikolas @eladkal. We also have a very nice framework for deprecation warnings (used during the 1.10/2.0 migration): https://github.com/apache/airflow/blob/main/tests/deprecated_classes.py.

@kaxil
Member

kaxil commented Dec 9, 2021

Agree with the proposed changes @eladkal

@KennyRich
Contributor

KennyRich commented Dec 9, 2021

Can I work on one of the classes? I can work on S3. @eladkal and @o-nikolas
This would be my first contribution to airflow

@bhavaniravi
Contributor

bhavaniravi commented Dec 9, 2021

I can take up a few of them since @KennyRich wants to do S3. I'll start with EC2

@eladkal
Contributor Author

eladkal commented Dec 9, 2021

I can take up a few of them since @KennyRich wants to do S3. I'll start with EC2

Already started EC2, as mentioned earlier... I'll PR shortly... There is plenty to pick up...

Can I work on one of the classes? I can work on S3
@KennyRich yeah go ahead

@KennyRich
Contributor

@eladkal Please kindly link your PR here so I can easily see and learn from it.

@bhavaniravi
Contributor

bhavaniravi commented Dec 9, 2021

Let me start with dms.py then. Quick question. Should we add the deprecated warning like the following on existing modules, or is this auto-generated?

import warnings

from airflow.providers.amazon.aws.operators.ec2 import EC2StartInstanceOperator

warnings.warn(
    "This module is deprecated. Please use `airflow.providers.amazon.aws.operators.ec2`.",
    DeprecationWarning,
    stacklevel=2,
)
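
[For posterity, a stdlib-only demonstration (not from the thread) that a module-level `warnings.warn()` like the one above fires when the deprecated module is first imported; the module name `old_module` is illustrative:]

```python
import importlib
import sys
import tempfile
import warnings
from pathlib import Path

# Write a stand-in for an old shim module such as ec2_start_instance.py.
tmp = tempfile.mkdtemp()
Path(tmp, "old_module.py").write_text(
    "import warnings\n"
    'warnings.warn("This module is deprecated.", DeprecationWarning, stacklevel=2)\n'
)
sys.path.insert(0, tmp)

# Importing the shim emits the DeprecationWarning exactly once.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    importlib.import_module("old_module")

assert any(issubclass(w.category, DeprecationWarning) for w in caught)
```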

@bhavaniravi
Contributor

Found the following errors in the provider.yaml pre-commit check; how should I tackle them?

Found 2 errors in providers
Error: Incorrect content of key 'sensors/python-modules' in file: airflow/providers/amazon/provider.yaml
      -- Items in the left set but not the right:
         'airflow.providers.amazon.aws.sensors.dms_task'
Error: Incorrect content of key 'operators/python-modules' in file: airflow/providers/amazon/provider.yaml
      -- Items in the left set but not the right:
         'airflow.providers.amazon.aws.operators.dms_create_task'
         'airflow.providers.amazon.aws.operators.dms_delete_task'
         'airflow.providers.amazon.aws.operators.dms_describe_tasks'
         'airflow.providers.amazon.aws.operators.dms_start_task'
         'airflow.providers.amazon.aws.operators.dms_stop_task'

@eladkal
Contributor Author

eladkal commented Dec 9, 2021

@eladkal Please kindly link your PR here so I can easily see and learn from it.

See the PR linked in the action items.

@yeshbash
Contributor

yeshbash commented Dec 13, 2021

I see sensors/glue.py is not taken. I would be happy to send a PR. Let me know if it's already taken.

Also, noticed that glue hooks follow the multi-file pattern. Probably the only hook which follows that pattern.

  • airflow/providers/amazon/aws/hooks/glue.py
  • airflow/providers/amazon/aws/hooks/glue_crawler.py
  • airflow/providers/amazon/aws/hooks/glue_catalog.py

I can unify them as well as part of the PR. cc: @eladkal

@eladkal
Contributor Author

eladkal commented Dec 13, 2021

@yeshbash I'm not 100% sure about combining glue, glue_crawler, and glue_catalog, which is why I left them out and focused only on glue as a standalone file.
@o-nikolas WDYT?

@ferruzzi
Contributor

ferruzzi commented Dec 13, 2021

Hey cool initiative, I definitely agree with this plan! Sorry for the delay in response, I was enjoying a seaside cabin without internet last week. 🏖️ I will take a closer look at what's been claimed and add some work on this tomorrow.

For the Glue question, the docstring on glue.py:L28 says "Interact with AWS Glue - create job, trigger, crawler"; looking at the history, those were never included in the same file. Maybe that was the intent all along?

glue.py was created 2020-05-02 by abdulbasitds
glue_crawler.py was created 2021-01-24 by Marshall Mamiya

from the initial PR here it looks like there was some copy/pasta from someone else's fork so maybe the plan was to combine them all along? I don't really use Glue so maybe someone with a better working knowledge of how it's actually used in the wild would have an opinion on this, and I'd definitely defer to that, but it looks to me like maybe we can combine them.

@ferruzzi
Contributor

ferruzzi commented Dec 13, 2021

Found the following error in provider.yaml in pre-commit, how to tackle them?

[EDIT: It looks like you already sorted it out, but for posterity in case this pops up in a search later: ]

You need to edit airflow/providers/amazon/provider.yaml:L188-192. That section should contain the names of the operator files/modules for that service. So if you are deleting those five files and combining them into a single operators/dms.py, then the deleted files need to be removed from that yaml section and airflow.providers.amazon.aws.operators.dms added in their place.
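
[For posterity, the provider.yaml change described above looks roughly like this (a sketch; the `integration-name` value and surrounding layout are illustrative, only the module paths come from the thread):]

```yaml
# airflow/providers/amazon/provider.yaml (excerpt; layout illustrative)
operators:
  - integration-name: Amazon DMS   # assumed name, for illustration only
    python-modules:
      # Removed: one entry per deleted file, e.g.
      #   - airflow.providers.amazon.aws.operators.dms_create_task
      #   - airflow.providers.amazon.aws.operators.dms_delete_task
      #   - airflow.providers.amazon.aws.operators.dms_describe_tasks
      #   - airflow.providers.amazon.aws.operators.dms_start_task
      #   - airflow.providers.amazon.aws.operators.dms_stop_task
      # Added: the single combined module
      - airflow.providers.amazon.aws.operators.dms
```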

@ferruzzi
Contributor

@eladkal

Since we are making these changes and going to be deprecating old operators in favor of new ones, it may be a good time to address some standardization issues as well? Two that have bugged me a bit are acronym capitalization across the Operators, and whether to prefix them all with AWS or not.

These are incomplete lists just for example: we have SnsPublishOperator but we capitalize the acronyms in SQSPublishOperator and ECSOperator. We prefix AwsGlueCrawlerOperator and AwsBatchOperator in lowercase, AWSAthenaOperator is prefixed in uppercase, and no prefix at all on GlacierCreateJobOperator or CloudFormationCreateStackOperator

@eladkal
Contributor Author

eladkal commented Dec 14, 2021

Since we are making these changes and going to be deprecating old operators in favor of new ones, it may be a good time to address some standardization issues as well? Two that have bugged me a bit are acronym capitalization across the Operators, and whether to prefix them all with AWS or not.

These are incomplete lists just for example: we have SnsPublishOperator but we capitalize the acronyms in SQSPublishOperator and ECSOperator. We prefix AwsGlueCrawlerOperator and AwsBatchOperator in lowercase, AWSAthenaOperator is prefixed in uppercase, and no prefix at all on GlacierCreateJobOperator or CloudFormationCreateStackOperator

The instructions are listed in AIP 21 cases 3-4 and case 6.
https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-21%3A+Changes+in+import+paths#AIP21:Changesinimportpaths-Case#3{aws/azure/gcp}_*

https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-21%3A+Changes+in+import+paths#AIP21:Changesinimportpaths-Case#6Otherisolatedcases

The sensors/operators classes names should not have Aws/AWS as part of the class name.

So:
SnsPublishOperator -> No change
SQSPublishOperator -> SqsPublishOperator
AwsBatchOperator -> BatchOperator
AWSAthenaOperator -> AthenaOperator
etc...
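
[One backward-compatible way to implement such renames (a generic sketch, not the actual Airflow code; the class bodies are illustrative stubs) is to keep the old name as a thin subclass that emits a DeprecationWarning when instantiated:]

```python
import warnings


class AthenaOperator:
    """The new, AIP-21-compliant name (illustrative stub)."""

    def __init__(self, query: str):
        self.query = query


class AWSAthenaOperator(AthenaOperator):
    """Deprecated alias kept so existing DAGs keep working."""

    def __init__(self, *args, **kwargs):
        warnings.warn(
            "AWSAthenaOperator is deprecated; use AthenaOperator instead.",
            DeprecationWarning,
            stacklevel=2,  # point the warning at the caller's line
        )
        super().__init__(*args, **kwargs)
```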

@ferruzzi would you like to work on it?

@ferruzzi
Contributor

Yup, I can take that on.

@ferruzzi
Contributor

Batch has multiple Hook files; does that fall under this campaign? One is for the waiter and the other for the client, though, so maybe not?

@eladkal
Contributor Author

eladkal commented Dec 14, 2021

Batch has multiple Hook files; does that fall under this campaign? One is for the waiter and the other for the client, though, so maybe not?

Batch is not part of the action items of this issue, as here we are focused on Operators/Sensors.
Regardless, my approach would be the same: we should follow what the user expects.
If users expect to find the classes in the same file, then we should merge; if users don't expect it and the services are really distinct, then we should keep them separate. You might be interested in a (possibly) similar dilemma we had with Redshift executing SQL and Redshift managing the cluster:
https://apache-airflow.slack.com/archives/CCPRP7943/p1639424289318500
Eventually we went with having redshift_sql and redshift_cluster, as it's unlikely that the same ETL will do both cluster manipulation and SQL execution. They are two distinct usages.

@eladkal
Contributor Author

eladkal commented Dec 15, 2021

@ferruzzi what do you think about the case of glue.py / glue_catalog_partition.py / glue_crawler.py ?
Should we leave it as is or should we make changes?

@ferruzzi
Contributor

@eladkal To be honest, I'm not a Glue user and I agree with you that it should be whatever a regular user would expect, but I don't know what that would be. I only have a passing knowledge of that service so I'm not in the best position to make that call. I'll see if I can find someone who uses it "in the wild" that I can chat with tomorrow and see what a user's perspective is.

@eladkal
Contributor Author

eladkal commented Dec 15, 2021

OK, so I'm leaving Glue as is at this point. We can always revisit in the future.
Let's not make such changes if we are not 100% sure.

@ferruzzi
Contributor

Alright, FWIW, my huge sample size of two says they should be fine to combine. It's not a lot, and one of those two said they are "very new" to Glue but have used it a couple times, but they both said to combine them into one package.

@ferruzzi
Contributor

For what it's worth, a third Glue user got back to me today, also claiming they are "not an expert" but they suggested that combining them sounded logical to them. Three out of three users agree? Not exactly a huge sample size, but it's fairly consistent.

@eladkal
Contributor Author

eladkal commented Dec 21, 2021

Thank you everyone for getting this done so quickly!
@ferruzzi @KennyRich @bhavaniravi
