Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Jinja functions remain static in YAML files on RPC until modified #2330

Closed
1 of 5 tasks
dwallace0723 opened this issue Apr 15, 2020 · 7 comments
Closed
1 of 5 tasks
Labels
bug Something isn't working partial_parsing stale Issues that have gone stale

Comments

@dwallace0723
Copy link

dwallace0723 commented Apr 15, 2020

Describe the bug

We leverage various jinja functions heavily in our dbt project (e.g. run_started_at, invocation_id, etc.). In addition to using them inside of dbt models, we also use them inside of YAML files, such as schema.yaml. A real world example for us is as follows:

version: 2

sources:
  - name: dbt_meta
    database: analytics
    schema: dbt_meta

    tables:
      - name: dbt_audit_log
        identifier: dbt_audit_log at(timestamp => '{{ run_started_at }}'::timestamp_ntz)

Using {{ run_started_at }} in tandem with Snowflake's time travel in the identifier (shown above) is a workaround we have been using to circumvent the non-atomic nature of dbt snapshots (#1884). It effectively prevents the source data from changing out from under us during a snapshot being taken, which happens incredibly often, leading to snapshots duplication bugs.

We recently migrated much of our production dbt workload to RPC server and noticed that this {{ run_started_at }} jinja function was not changing between runs in our compiled source definitions. We dug deeper and noticed that restarting the server and even manually re-compiling the project on the server did not solve the issue. However, what did make the function update, was modifying the underlying schema.yaml file. What makes this even more curious is that when using jinja functions in dbt models, they do seem to update on every single invocation without needing to modify files, restart the server, or recompile. Also worth noting that this behavior is identical whether partial parsing is turned on or off.

I'm curious if this is inherent in the way RPC functions and is thus intended by design, or if this is a bug.

Steps To Reproduce

(see above)

Expected behavior

I would expect these functions to behave similarly as it would when using the CLI. Or at the very least, to be able to update these functions using a manual re-compile.

System information

Which database are you using dbt with?

  • postgres
  • redshift
  • bigquery
  • snowflake
  • other (specify: ____________)

The output of dbt --version:

installed version: 0.16.0
   latest version: 0.16.1

The operating system you're using:
macOS Mojave 10.14.6

The output of python --version:

Python 3.8.2
@dwallace0723 dwallace0723 added bug Something isn't working triage labels Apr 15, 2020
@drewbanin drewbanin added rpc Issues related to dbt's RPC server and removed triage labels Apr 16, 2020
@jtcohen6
Copy link
Contributor

jtcohen6 commented Mar 30, 2021

In order for --partial-parse to always return the same result as a full parse, it seems we'd need to capture and categorize implicit dependencies of Jinja expressions that are passed as configs/properties in YAML files. Here are the categories I'm thinking about:

  1. Re-render every time: the Jinja expression depends on a special dbt-Jinja variable, such as run_started_at or invocation_id, that changes in every dbt invocation. (I'm not actually sure if these vary for different RPC methods, though, since those technically occur during the same "parent" invocation! Something like run_started_at probably should, invocation_id I'm not sure.)
  2. Re-render if a variable has changed (--vars or env var), and the Jinja expression depends on that variable via {{var()}} or {{env_var()}}
  3. Re-render if another file has changed, e.g. a description that depends on a docs block defined elsewhere
  4. None of the above: only re-render if the YAML file has itself changed

This becomes one level more complex in the RPC server, where partial-parsing is only triggered by startup + sighup, not by submitting a 'run' command. So categories 1+2 may need to work differently in RPC, with selective re-rendering of values within the in-memory manifest produced by the last startup/sighup.

@jtcohen6
Copy link
Contributor

If a volatile Jinja variable is needed at parse time (input to ref/source/config or yaml property), and partial parsing is enabled, dbt will not re-parse affected nodes and is likely to yield incorrect results.

Reproduction case: gist

I believe our options here are:

  • Non-solution (better than nothing): Continue to document this in known partial parsing limitations; add to docs for run_started_at + invocation_id as well. Instruct users to disable partial parsing if they rely on using these variables in the parse-time context.
  • Complex solution: For each node in the graph, statically analyze calls to volatile context variables (run_started_at + invocation_id), and then store those env vars in depends_on for that node. Then, if we detect a change in env var values, we can trigger re-parse for only those nodes which depend on that volatile variables, plus (if the node is a macro) any nodes that depend on the macro.

I'm leaning toward the non-solution for this one, by documenting the limitation.

@gshank
Copy link
Contributor

gshank commented Sep 16, 2021

What we did for the 'doc' calls was to add code in the context 'doc' method to save the node referring to the doc in the doc source_file object. We could do something like that for these calls. It seems like we might need to save the current value of the variables somewhere, then check to see if the variable value has changed and mark all of the nodes for reparsing. So something like: { "var_name": { "value": something, "referenced_by": [ { file_id, yaml_key, name}, or {file_id, unique_id}...]. The variables could be both internal things like 'run_started_at', cli variables, and env vars. Statically parsing models would get more complicated, of course.

@jtcohen6
Copy link
Contributor

@gshank That feels like a fruitful way to approach this: rather than store the reference in the node itself, store the value and a child map of all nodes that will need updating when the value changes. I bet we could do something similar for env vars (#3885). Then, an additional step in partial parsing looks like: compare these values; if any value has changed, schedule dependent nodes for re-parse. Outside of exceptional circumstances (the RPC server! ugh!), run_started_at and invocation_id will always have changed.

The simple version could be: all nodes that call one of these variables anywhere should be considered dependent on that variable.

The cleverer, more precise version: all nodes that depend on one of these variables for a parse-time configuration, i.e. ref/source/config/yaml property. We don't need to re-parse nodes that include {{ run_started_at }} in their compiled SQL, because SQL gets re-compiled at execution time.

What do you think is the difficulty of an approach along those lines? Is it worth pursuing now, or punting to after v1.0 (and clearly documenting the limitations in the meantime)?

@jtcohen6
Copy link
Contributor

At its heart, this issue has more to do with partial parsing than with the RPC server. I'm going to leave it in this repo

@github-actions
Copy link
Contributor

This issue has been marked as Stale because it has been open for 180 days with no activity. If you would like the issue to remain open, please remove the stale label or comment on the issue, or it will be closed in 7 days.

@github-actions github-actions bot added the stale Issues that have gone stale label Apr 11, 2022
@github-actions
Copy link
Contributor

Although we are closing this issue as stale, it's not gone forever. Issues can be reopened if there is renewed community interest; add a comment to notify the maintainers.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working partial_parsing stale Issues that have gone stale
Projects
None yet
Development

No branches or pull requests

4 participants