
AWS Glue operator fails to upload local script to S3 due to wrong argument order #16418

Closed
mmenarguezpear opened this issue Jun 12, 2021 · 0 comments · Fixed by #16216
Labels: kind:bug, provider:amazon

Comments


mmenarguezpear commented Jun 12, 2021

Apache Airflow version: 2.1.0

Kubernetes version (if you are using kubernetes) (use kubectl version): NA

Environment: bare metal k8s in AWS EC2

  • Cloud provider or hardware configuration: AWS
  • OS (e.g. from /etc/os-release):
cat /etc/os-release 
PRETTY_NAME="Debian GNU/Linux 10 (buster)"
NAME="Debian GNU/Linux"
VERSION_ID="10"
VERSION="10 (buster)"
VERSION_CODENAME=buster
ID=debian
HOME_URL="https://www.debian.org/"
SUPPORT_URL="https://www.debian.org/support"
BUG_REPORT_URL="https://bugs.debian.org/"

What happened:
Upon providing valid arguments, the following error appeared:


[2021-06-12 16:31:46,277] {base_aws.py:395} INFO - Creating session using boto3 credential strategy region_name=None
[2021-06-12 16:31:47,339] {taskinstance.py:1481} ERROR - Task failed with exception
Traceback (most recent call last):
  File "/home/airflow/.local/lib/python3.8/site-packages/airflow/models/taskinstance.py", line 1137, in _run_raw_task
    self._prepare_and_execute_task_with_callbacks(context, task)
  File "/home/airflow/.local/lib/python3.8/site-packages/airflow/models/taskinstance.py", line 1311, in _prepare_and_execute_task_with_callbacks
    result = self._execute_task(context, task_copy)
  File "/home/airflow/.local/lib/python3.8/site-packages/airflow/models/taskinstance.py", line 1341, in _execute_task
    result = task_copy.execute(context=context)
  File "/home/airflow/.local/lib/python3.8/site-packages/airflow/providers/amazon/aws/operators/glue.py", line 106, in execute
    s3_hook.load_file(self.script_location, self.s3_bucket, self.s3_artifacts_prefix + script_name)
  File "/home/airflow/.local/lib/python3.8/site-packages/airflow/providers/amazon/aws/hooks/s3.py", line 62, in wrapper
    return func(*bound_args.args, **bound_args.kwargs)
  File "/home/airflow/.local/lib/python3.8/site-packages/airflow/providers/amazon/aws/hooks/s3.py", line 91, in wrapper
    return func(*bound_args.args, **bound_args.kwargs)
  File "/home/airflow/.local/lib/python3.8/site-packages/airflow/providers/amazon/aws/hooks/s3.py", line 499, in load_file
    if not replace and self.check_for_key(key, bucket_name):
  File "/home/airflow/.local/lib/python3.8/site-packages/airflow/providers/amazon/aws/hooks/s3.py", line 62, in wrapper
    return func(*bound_args.args, **bound_args.kwargs)
  File "/home/airflow/.local/lib/python3.8/site-packages/airflow/providers/amazon/aws/hooks/s3.py", line 91, in wrapper
    return func(*bound_args.args, **bound_args.kwargs)
  File "/home/airflow/.local/lib/python3.8/site-packages/airflow/providers/amazon/aws/hooks/s3.py", line 323, in check_for_key
    self.get_conn().head_object(Bucket=bucket_name, Key=key)
  File "/home/airflow/.local/lib/python3.8/site-packages/botocore/client.py", line 357, in _api_call
    return self._make_api_call(operation_name, kwargs)
  File "/home/airflow/.local/lib/python3.8/site-packages/botocore/client.py", line 648, in _make_api_call
    request_dict = self._convert_to_request_dict(
  File "/home/airflow/.local/lib/python3.8/site-packages/botocore/client.py", line 694, in _convert_to_request_dict
    api_params = self._emit_api_params(
  File "/home/airflow/.local/lib/python3.8/site-packages/botocore/client.py", line 723, in _emit_api_params
    self.meta.events.emit(
  File "/home/airflow/.local/lib/python3.8/site-packages/botocore/hooks.py", line 356, in emit
    return self._emitter.emit(aliased_event_name, **kwargs)
  File "/home/airflow/.local/lib/python3.8/site-packages/botocore/hooks.py", line 228, in emit
    return self._emit(event_name, kwargs)
  File "/home/airflow/.local/lib/python3.8/site-packages/botocore/hooks.py", line 211, in _emit
    response = handler(**kwargs)
  File "/home/airflow/.local/lib/python3.8/site-packages/botocore/handlers.py", line 236, in validate_bucket_name
    raise ParamValidationError(report=error_msg)
botocore.exceptions.ParamValidationError: Parameter validation failed:
Invalid bucket name "artifacts/glue-scripts/example.py": Bucket name must match the regex "^[a-zA-Z0-9.\-_]{1,255}$" or be an ARN matching the regex "^arn:(aws).*:(s3|s3-object-lambda):[a-z\-0-9]+:[0-9]{12}:accesspoint[/:][a-zA-Z0-9\-]{1,63}$|^arn:(aws).*:s3-outposts:[a-z\-0-9]+:[0-9]{12}:outpost[/:][a-zA-Z0-9\-]{1,63}[/:]accesspoint[/:][a-zA-Z0-9\-]{1,63}$"
[2021-06-12 16:31:47,341] {taskinstance.py:1524} INFO - Marking task as UP_FOR_RETRY. dag_id=glue-example, task_id=example_glue_job_operator, execution_date=20210612T163143, start_date=20210612T163145, end_date=20210612T163147
[2021-06-12 16:31:47,386] {local_task_job.py:151} INFO - Task exited with return code 1

Looking at the order of the arguments, it seems the 2nd and 3rd are reversed. Furthermore, the operator does not expose the replace option, which would be very valuable.
Note that key and bucket name are passed by position rather than by keyword https://github.com/apache/airflow/blob/main/airflow/providers/amazon/aws/operators/glue.py#L104
and their order is reversed relative to the hook's signature https://github.com/apache/airflow/blob/main/airflow/providers/amazon/aws/hooks/s3.py#L466
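
To make the mismatch concrete, here is a minimal sketch against S3Hook.load_file, assuming its signature is roughly load_file(filename, key, bucket_name=None, replace=False, ...):

# What the operator does today (copied from the traceback above): the bucket
# lands on the hook's `key` parameter and the key lands on `bucket_name`.
s3_hook.load_file(self.script_location, self.s3_bucket, self.s3_artifacts_prefix + script_name)

# A possible fix: pass by keyword so the order cannot be confused, and
# optionally surface `replace` so an existing script can be overwritten.
# The replace=True shown here is an assumption, not current operator behaviour.
s3_hook.load_file(
    filename=self.script_location,
    key=self.s3_artifacts_prefix + script_name,
    bucket_name=self.s3_bucket,
    replace=True,
)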

What you expected to happen: The script upload to succeed, and the ability to replace an existing script in S3.

How to reproduce it:
Try to upload a local script file to any S3 bucket:

from airflow.providers.amazon.aws.operators.glue import AwsGlueJobOperator

t2 = AwsGlueJobOperator(
    task_id="example_glue_job_operator",
    job_desc="Example Airflow Glue job",
    # Note the operator will upload the script if it is not an s3:// reference
    # See https://github.com/apache/airflow/blob/main/airflow/providers/amazon/aws/operators/glue.py#L101
    script_location="/opt/airflow/dags_lib/example.py",
    concurrent_run_limit=1,
    script_args={},
    num_of_dpus=1,  # This parameter is deprecated (from boto3). Use MaxCapacity via kwargs instead.
    aws_conn_id="aws_default",
    region_name="aws-region",
    s3_bucket="bucket-name",
    iam_role_name="iam_role_name_here",
    create_job_kwargs={},
)

Anything else we need to know:

How often does this problem occur? Every time a local script is used.

I can take a stab at fixing it. I also noticed the operator does not allow updating a Glue job definition after its creation. boto3 offers an API to do so, but it is not exposed in this operator: https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/glue.html#Glue.Client.update_job It would be great if I could add that as well, but it might fall out of scope.
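
For illustration only, a rough sketch of what calling that API directly looks like; the job name, role, and script location below are placeholders reused from the reproduction snippet, not a proposed operator interface:

import boto3

# Rough sketch: boto3's Glue client exposes update_job(JobName=..., JobUpdate=...).
# All concrete values here are placeholder assumptions.
glue_client = boto3.client("glue", region_name="aws-region")
glue_client.update_job(
    JobName="example_glue_job_operator",
    JobUpdate={
        "Role": "iam_role_name_here",
        "Command": {
            "Name": "glueetl",
            "ScriptLocation": "s3://bucket-name/artifacts/glue-scripts/example.py",
        },
    },
)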

mmenarguezpear added the kind:bug label Jun 12, 2021
eladkal linked a pull request Jun 12, 2021 that will close this issue
eladkal added the provider:amazon label Jun 12, 2021