Jinja functions remain static in YAML files on RPC until modified #2330

dwallace0723 · 2020-04-15T20:54:56Z

Describe the bug

We leverage various jinja functions heavily in our dbt project (e.g. run_started_at, invocation_id, etc.). In addition to using them inside of dbt models, we also use them inside of YAML files, such as schema.yaml. A real world example for us is as follows:

version: 2

sources:
  - name: dbt_meta
    database: analytics
    schema: dbt_meta

    tables:
      - name: dbt_audit_log
        identifier: dbt_audit_log at(timestamp => '{{ run_started_at }}'::timestamp_ntz)

Using {{ run_started_at }} in tandem with Snowflake's time travel in the identifier (shown above) is a workaround we have been using to circumvent the non-atomic nature of dbt snapshots (#1884). It effectively prevents the source data from changing out from under us during a snapshot being taken, which happens incredibly often, leading to snapshots duplication bugs.

We recently migrated much of our production dbt workload to RPC server and noticed that this {{ run_started_at }} jinja function was not changing between runs in our compiled source definitions. We dug deeper and noticed that restarting the server and even manually re-compiling the project on the server did not solve the issue. However, what did make the function update, was modifying the underlying schema.yaml file. What makes this even more curious is that when using jinja functions in dbt models, they do seem to update on every single invocation without needing to modify files, restart the server, or recompile. Also worth noting that this behavior is identical whether partial parsing is turned on or off.

I'm curious if this is inherent in the way RPC functions and is thus intended by design, or if this is a bug.

Steps To Reproduce

(see above)

Expected behavior

I would expect these functions to behave similarly as it would when using the CLI. Or at the very least, to be able to update these functions using a manual re-compile.

System information

Which database are you using dbt with?

The output of dbt --version:

installed version: 0.16.0
   latest version: 0.16.1

The operating system you're using:
macOS Mojave 10.14.6

The output of python --version:

Python 3.8.2

The text was updated successfully, but these errors were encountered:

jtcohen6 · 2021-03-30T17:56:41Z

In order for --partial-parse to always return the same result as a full parse, it seems we'd need to capture and categorize implicit dependencies of Jinja expressions that are passed as configs/properties in YAML files. Here are the categories I'm thinking about:

Re-render every time: the Jinja expression depends on a special dbt-Jinja variable, such as run_started_at or invocation_id, that changes in every dbt invocation. (I'm not actually sure if these vary for different RPC methods, though, since those technically occur during the same "parent" invocation! Something like run_started_at probably should, invocation_id I'm not sure.)
Re-render if a variable has changed (--vars or env var), and the Jinja expression depends on that variable via {{var()}} or {{env_var()}}
Re-render if another file has changed, e.g. a description that depends on a docs block defined elsewhere
None of the above: only re-render if the YAML file has itself changed

This becomes one level more complex in the RPC server, where partial-parsing is only triggered by startup + sighup, not by submitting a 'run' command. So categories 1+2 may need to work differently in RPC, with selective re-rendering of values within the in-memory manifest produced by the last startup/sighup.

jtcohen6 · 2021-09-15T08:10:47Z

If a volatile Jinja variable is needed at parse time (input to ref/source/config or yaml property), and partial parsing is enabled, dbt will not re-parse affected nodes and is likely to yield incorrect results.

Reproduction case: gist

I believe our options here are:

Non-solution (better than nothing): Continue to document this in known partial parsing limitations; add to docs for run_started_at + invocation_id as well. Instruct users to disable partial parsing if they rely on using these variables in the parse-time context.
Complex solution: For each node in the graph, statically analyze calls to volatile context variables (run_started_at + invocation_id), and then store those env vars in depends_on for that node. Then, if we detect a change in env var values, we can trigger re-parse for only those nodes which depend on that volatile variables, plus (if the node is a macro) any nodes that depend on the macro.

I'm leaning toward the non-solution for this one, by documenting the limitation.

gshank · 2021-09-16T19:45:17Z

What we did for the 'doc' calls was to add code in the context 'doc' method to save the node referring to the doc in the doc source_file object. We could do something like that for these calls. It seems like we might need to save the current value of the variables somewhere, then check to see if the variable value has changed and mark all of the nodes for reparsing. So something like: { "var_name": { "value": something, "referenced_by": [ { file_id, yaml_key, name}, or {file_id, unique_id}...]. The variables could be both internal things like 'run_started_at', cli variables, and env vars. Statically parsing models would get more complicated, of course.

jtcohen6 · 2021-09-17T08:50:05Z

@gshank That feels like a fruitful way to approach this: rather than store the reference in the node itself, store the value and a child map of all nodes that will need updating when the value changes. I bet we could do something similar for env vars (#3885). Then, an additional step in partial parsing looks like: compare these values; if any value has changed, schedule dependent nodes for re-parse. Outside of exceptional circumstances (the RPC server! ugh!), run_started_at and invocation_id will always have changed.

The simple version could be: all nodes that call one of these variables anywhere should be considered dependent on that variable.

The cleverer, more precise version: all nodes that depend on one of these variables for a parse-time configuration, i.e. ref/source/config/yaml property. We don't need to re-parse nodes that include {{ run_started_at }} in their compiled SQL, because SQL gets re-compiled at execution time.

What do you think is the difficulty of an approach along those lines? Is it worth pursuing now, or punting to after v1.0 (and clearly documenting the limitations in the meantime)?

jtcohen6 · 2021-10-12T13:46:01Z

At its heart, this issue has more to do with partial parsing than with the RPC server. I'm going to leave it in this repo

github-actions · 2022-04-11T02:09:34Z

This issue has been marked as Stale because it has been open for 180 days with no activity. If you would like the issue to remain open, please remove the stale label or comment on the issue, or it will be closed in 7 days.

github-actions · 2022-04-18T02:12:13Z

Although we are closing this issue as stale, it's not gone forever. Issues can be reopened if there is renewed community interest; add a comment to notify the maintainers.

dwallace0723 added bug Something isn't working triage labels Apr 15, 2020

drewbanin added rpc Issues related to dbt's RPC server and removed triage labels Apr 16, 2020

matt-winkler pinned this issue May 11, 2021

matt-winkler unpinned this issue May 11, 2021

jtcohen6 mentioned this issue May 16, 2021

Execute-only invocations #3360

Closed

jtcohen6 added the partial_parsing label Jun 5, 2021

jtcohen6 mentioned this issue Jun 5, 2021

[Tracking] Partial parsing polish #3371

Closed

1 task

jtcohen6 mentioned this issue Sep 1, 2021

Turn on performance improvements by default #3502

Closed

leahwicz mentioned this issue Sep 9, 2021

Turn on partial parser by default #3867

Closed

3 tasks

jtcohen6 removed the rpc Issues related to dbt's RPC server label Oct 12, 2021

joellabes mentioned this issue Nov 30, 2021

[CT-530] [Bug] Partial parsing should handle volatile variables like invocation_id and run_started_at #4364

Open

1 task

github-actions bot added the stale Issues that have gone stale label Apr 11, 2022

github-actions bot closed this as completed Apr 18, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Jinja functions remain static in YAML files on RPC until modified #2330

Jinja functions remain static in YAML files on RPC until modified #2330

dwallace0723 commented Apr 15, 2020 •

edited

Loading

jtcohen6 commented Mar 30, 2021 •

edited

Loading

jtcohen6 commented Sep 15, 2021

gshank commented Sep 16, 2021

jtcohen6 commented Sep 17, 2021

jtcohen6 commented Oct 12, 2021

github-actions bot commented Apr 11, 2022

github-actions bot commented Apr 18, 2022

Jinja functions remain static in YAML files on RPC until modified #2330

Jinja functions remain static in YAML files on RPC until modified #2330

Comments

dwallace0723 commented Apr 15, 2020 • edited Loading

Describe the bug

Steps To Reproduce

Expected behavior

System information

jtcohen6 commented Mar 30, 2021 • edited Loading

jtcohen6 commented Sep 15, 2021

gshank commented Sep 16, 2021

jtcohen6 commented Sep 17, 2021

jtcohen6 commented Oct 12, 2021

github-actions bot commented Apr 11, 2022

github-actions bot commented Apr 18, 2022

dwallace0723 commented Apr 15, 2020 •

edited

Loading

jtcohen6 commented Mar 30, 2021 •

edited

Loading