Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

airflow jinja template render error #33694

Closed
1 of 2 tasks
jaegwonseo opened this issue Aug 24, 2023 · 16 comments · Fixed by #35017
Closed
1 of 2 tasks

airflow jinja template render error #33694

jaegwonseo opened this issue Aug 24, 2023 · 16 comments · Fixed by #35017
Labels
area:core area:providers good first issue kind:bug This is a clearly a bug provider:google Google (including GCP) related issues

Comments

@jaegwonseo
Copy link
Contributor

jaegwonseo commented Aug 24, 2023

Apache Airflow version

Other Airflow 2 version (please specify below)

What happened

version 2.6.2

An error occurs when *.json is included in the parameters of BigQueryInsertJobOperator.

to_gcs_task = BigQueryInsertJobOperator(
        dag=dag,
        task_id='to_gcs',
        gcp_conn_id='xxxx',
        configuration={
            "extract": {
                 # The error occurred at this location.          
                "destinationUris": ['gs://xxx/yyy/*.json'],


                "sourceTable": {
                    "projectId": "abc",
                    "datasetId": "def",
                    "tableId": "ghi"
                },
                "destinationFormat": "NEWLINE_DELIMITED_JSON"
            }
        }
    )

error log

jinja2.exceptions.TemplateNotFound: gs://xxx/yyy/*.json

What you think should happen instead

According to the airflow.template.templater
source : https://github.com/apache/airflow/blob/main/airflow/template/templater.py#L152

      if isinstance(value, str):
            if any(value.endswith(ext) for ext in self.template_ext):  # A filepath.
                template = jinja_env.get_template(value)
            else:
                template = jinja_env.from_string(value)
            return self._render(template, context)

In the Jinja template source, if the value ends with .json or .sql, an attempt is made to read the resource file by calling jinja_env.get_template.

How to reproduce

just call BigQueryInsertJobOperator with configuration what i added

Operating System

m2 mac

Versions of Apache Airflow Providers

No response

Deployment

Official Apache Airflow Helm Chart

Deployment details

No response

Anything else

No response

Are you willing to submit PR?

  • Yes I am willing to submit a PR!

Code of Conduct

@jaegwonseo jaegwonseo added area:core kind:bug This is a clearly a bug needs-triage label for new issues that we didn't triage yet labels Aug 24, 2023
@nathadfield
Copy link
Collaborator

@jaegwonseo Thanks for logging this and I have been able to replicate it. Basically this means that we cannot extract a BigQuery table to JSON delimited files with the extension .json. As a workaround I think you should be able to extract the table if you don't specify the extension.

@SamWheating
Copy link
Contributor

SamWheating commented Aug 24, 2023

This is expected behaviour - the BigQueryInsertJobOperator will treat anything in one of the template fields (configuration, job_id, impersonation_chain, project_id,) ending with .sql, or .json as a reference to a local file, which it will then try to load.

This is defined here:

template_fields: Sequence[str] = (
"configuration",
"job_id",
"impersonation_chain",
"project_id",
)
template_ext: Sequence[str] = (
".json",
".sql",
)

I'm not sure if there's a preferred way around this, but I would just subclass the operator and remove the extension:

class BigQueryInsertJobOperatorNoTemplateExt(BigQueryInsertJobOperator):
  template_ext = []

@nathadfield
Copy link
Collaborator

@SamWheating I wonder if it might be possible to prevent the renderer from attempting to load a file for named keys in configuration; in this case destinationUris.

@jaegwonseo
Copy link
Contributor Author

jaegwonseo commented Aug 24, 2023

how about add no_template_fields parameter to BaseOperator constructor ?

in this case no_template_fields = (configuration.extract.destinationUris)

class BaseOperator(AbstractOperator, metaclass=BaseOperatorMeta):
    def __init__(
         xxx, 
         yyy,
         no_template_fields: Sequence[str] = ():

and skip rendering from https://github.com/apache/airflow/blob/main/airflow/template/templater.py#L152

@nathadfield
Copy link
Collaborator

nathadfield commented Aug 24, 2023

@jaegwonseo Why not give it a try? I think we would still want the filenames in destinationUris to be rendered in case they look something like gs://my-project/my-bucket/{{ ds }}/*.json

@nathadfield nathadfield added provider:google Google (including GCP) related issues area:providers good first issue labels Aug 24, 2023
@SamWheating
Copy link
Contributor

I wonder if it might be possible to prevent the renderer from attempting to load a file for named keys in configuration; in this case destinationUris.

I think that this would get pretty complicated pretty fast, as a dict argument like configuration has no enforced structure, so we'd need some jq-like way of indicating which maybe-present fields should and should not be subjected to templating.

Additionally, we'd have to be really specific to only disable this on fields in which we know a user would never ever want to apply this sort of formatting. I think in the case of destinationURI (which accepts a list of URIs) there is a possible use-case for using a .json file rather than an inlined list of strings so I wouldn't want to disable this templating feature for everyone.

Thoughts?

@potiuk
Copy link
Member

potiuk commented Aug 24, 2023

I think the simplest case would be to add a feature to disable templating engine - either for all fields or for specific fields or maybe just disable specific extensions. I think we could couple this with something that we should have for quite some time - i.e. ability to dynamically change list of templated fields and template extensions by overriding them in the instance of the task.

There is fundamentally no problem to disable or enable it selectively "per task". It could even be as simple as "template_fields", "template_ext" specified in the constructor of BaseOperator.

Ir's a little bit more complex than just overriding it in classes, it would also have to be overrideable in serialized form of dag - i.e. whe template_fields are rendered in the UI. But I do not think there is anything that would prevent us from doing it.

@SamWheating
Copy link
Contributor

SamWheating commented Aug 24, 2023

Also one more suggestion (not sure why I didn't think of this earlier today), I think that this simple fix allows you to selectively disable template fields and extensions on a task-by-task basis - since template_ext is an attribute on the operator class but not exposed in the __init__ method:

to_gcs_task = BigQueryInsertJobOperator(
        dag=dag,
        task_id='to_gcs',
        gcp_conn_id='xxxx',
        configuration={
            "extract": {
                "destinationUris": ['gs://xxx/yyy/*.json'],
                "sourceTable": {
                    "projectId": "abc",
                    "datasetId": "def",
                    "tableId": "ghi"
                },
                "destinationFormat": "NEWLINE_DELIMITED_JSON"
            }
        }
    )
    
to_gcs_task.template_ext = ()  # Don't try to load up .json or .sql files 

I tested it locally and it appeared to work just fine, both at run-time and when displaying the rendered-task-instances in the UI:

image

With this in mind, is there still need to surface overrides in the __init__ method? Or can we just recommend this as a workaround (and document it as such).

@potiuk
Copy link
Member

potiuk commented Aug 24, 2023

Yeah. Interesting approach.. I can't think of a bad side effect of it - this should not only work for parsing + execution of tasks, but it should also serialize well for the UI and appears to not have unintended side effects.

Appears to be a solution that we never thought of, but was always possible. And it's super-appealing to have a problem that can be solved by documenting it. Let me summon a few people what they think @dstandish @uranusjr @hussein-awala @eladkal - WDYT ?

I find it deceptively simple, but also - quite surprisingly - pretty workable. We had a number of discussions on whether we should make all fields templateable by default etc. but seems that this simple trick might do the job and we could make it "official".

See #33694 (comment) comment by @SamWheating

@potiuk
Copy link
Member

potiuk commented Aug 24, 2023

BTW. No wonder @SamWheating you have not thought about it. It seems no-one did for the last few years I was around. I kept on explaining the users how easy it is to just extend existing operators and modify template_fields and template_exts but it never occured to me, that we could modify it in the tasks during dag construction. Apparently - we can.

@SamWheating
Copy link
Contributor

Any objection to just adding this to the documentation as a footnote under the Template Rendering section?

@potiuk
Copy link
Member

potiuk commented Aug 24, 2023

I have absolutely No objections, even more, I would love that as an "official" way of modifying templated/template_ext fields on-the-flight - as long as we get a few more people say "well, indeed that does not seem to have some side effects" :D

@josh-fell josh-fell removed the needs-triage label for new issues that we didn't triage yet label Aug 25, 2023
@eladkal
Copy link
Contributor

eladkal commented Aug 25, 2023

Didn't we have the exact same bug when we added json as templated ext for KubernetesPodOperator ?
#16922

@uranusjr
Copy link
Member

uranusjr commented Sep 6, 2023

Maybe the easiest way out would be to add a Literally wrapper that disables rendering? So you can write

BigQueryInsertJobOperator(
    ...,
    configuration={
        "extract": {
            "destinationUris": [Literally("gs://xxx/yyy/*.json")],
            ...,
        },
    },
)

Please do bikeshed on the name. I’m intentionally avoiding Literal to avoid confusion with typing.Literal.

@potiuk
Copy link
Member

potiuk commented Sep 7, 2023

Please do bikeshed on the name. I’m intentionally avoiding Literal to avoid confusion with typing.Literal.

I like the idea.

@michalsosn
Copy link
Contributor

michalsosn commented Oct 17, 2023

How about LiteralValue?
It appears that I have successfully implemented it (link).
However, the true challenge is in ensuring its clear usage for users.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area:core area:providers good first issue kind:bug This is a clearly a bug provider:google Google (including GCP) related issues
Projects
None yet
Development

Successfully merging a pull request may close this issue.

8 participants