
Dynamic check dependent tables #447

Merged: 8 commits into ti from dynamic-check-dependent-tables on May 18, 2023

Conversation

shawncrawley
Collaborator

There are now several existing service workflows that depend on both the existence and the appropriate state of data in the tables of other services. An existing example of this is the peak_flow_arrival_time services, which depend upon (i.e. build off of) the high_water_arrival_time services. @TylerSchrag-NOAA recently implemented a good way to handle this by adding a new product configuration property called dependent_on_db_tables that can optionally be specified under the postprocess_sql section. This was beautifully implemented, but unfortunately I quickly found that it falls short for another use case: the re-developed Replace and Route services. It falls short there because the rfc_5day_max_downstream_inundation service is now built entirely off of the final state of the rfc_5day_max_downstream_streamflow service. What's different in this case, as opposed to the high water arrival time services, is that the dependency occurs earlier in the workflow - not in the "postprocess_sql" section, but in the "fim_data_prep" section.

The solution that @CoreyKrewson-NOAA and I came up with was to generalize and automate the check for these dependent tables for every service and at every stage of running any SQL. This was done by creating a required_tables_updated method on the Database class in viz_classes.py. The method is passed the SQL string or file that needs to be verified before being executed, along with an optional sql_replace dictionary, a reference_time, a flag for whether the check should stop after finding a single issue or thoroughly find all issues, and a flag for whether an issue should raise an exception or simply cause the method to return False.

The method works by using regex to pull out every table name that appears in a "FROM" clause and then checking, for each of those tables: that the table exists; if it does, whether it has a reference_time column; and if it does and a reference_time was provided, whether the two match. If an issue is found (i.e. either a table does not exist or its reference_time is not updated as expected), the method either returns False or throws a custom RequiredTableNotUpdated exception.
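For illustration, here is a rough sketch of the shape of that check. This is a simplified sketch, not the merged code: the parameter names, the cursor argument, and the catalog queries are illustrative, and details such as reading SQL from a file and the hard-coded skips described below are omitted.

```python
import re


class RequiredTableNotUpdated(Exception):
    """Raised when a dependent table is missing or its reference_time is stale."""


def required_tables_updated(cursor, sql, sql_replace=None, reference_time=None,
                            stop_on_first_issue=True, raise_if_issue=True):
    """Check that every table referenced by the given SQL exists and is current."""
    # Apply any template substitutions before inspecting the SQL.
    for key, value in (sql_replace or {}).items():
        sql = sql.replace(key, value)

    # Pull out schema-qualified table names that follow FROM or JOIN keywords.
    tables = set(re.findall(r'(?:FROM|JOIN)\s+(\w+\.\w+)', sql, re.IGNORECASE))

    issues = []
    for table in tables:
        schema, name = table.split('.')
        # Does the table exist?
        cursor.execute(
            "SELECT EXISTS (SELECT 1 FROM information_schema.tables "
            "WHERE table_schema = %s AND table_name = %s)", (schema, name))
        if not cursor.fetchone()[0]:
            issues.append(f"{table} does not exist")
        elif reference_time:
            # Does it have a reference_time column?
            cursor.execute(
                "SELECT EXISTS (SELECT 1 FROM information_schema.columns "
                "WHERE table_schema = %s AND table_name = %s "
                "AND column_name = 'reference_time')", (schema, name))
            if cursor.fetchone()[0]:
                # If so, does its reference_time match the expected one?
                # (The table identifier is interpolated directly for brevity;
                # the regex above restricts it to word characters and a dot.)
                cursor.execute(f"SELECT max(reference_time) FROM {table}")
                if cursor.fetchone()[0] != reference_time:
                    issues.append(f"{table} reference_time is not updated")
        if issues and stop_on_first_issue:
            break

    if issues and raise_if_issue:
        raise RequiredTableNotUpdated("; ".join(issues))
    return not issues
```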

A call to this new method was added to both the fim_data_prep and postprocess_sql lambda functions, and those functions, as defined in our viz_pipeline step functions, were modified to retry 20 times, once every 30 seconds, if the RequiredTableNotUpdated error is thrown - thus allowing time for any dependent data to be updated before actually failing the entire pipeline run.
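For reference, the retry policy on those states would look roughly like the following. This is a hedged sketch of a Step Functions Retry block, not the actual viz_pipeline definition; in particular, the exact error-name matching depends on how the lambda surfaces the exception.

```json
"Retry": [
  {
    "ErrorEquals": ["RequiredTableNotUpdated"],
    "IntervalSeconds": 30,
    "MaxAttempts": 20,
    "BackoffRate": 1.0
  }
]
```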

I tested this in TI for a day and in doing so found a couple of gotcha use cases that required some hard-coded logic. Namely, the SQL that produces cache.max_flows_ana also pushes that table's "about to be an hour old" data into a cache.max_flows_ana_past_hour table prior to truncating its own table and updating it with the new data. Thus, the required_tables_updated check was failing: it found cache.max_flows_ana referenced in a "FROM" clause (since the data from that table is written into the past_hour table), and it found that the reference_time in that table didn't match the expected reference time - but that's because it's supposed to not match and be an hour earlier.

The other gotcha use case was that the "rnr" service was failing because it depends upon an "ahps_metadata" table that, for whatever reason, has a "reference_time" column; I believe that column more accurately reflects an update time, and thus it never matches the expected reference time of the table being created from that data.

To hard-code around the above two gotcha use cases, I simply have any table whose name contains "past" or "ahps" skip the reference_time check.
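In code terms, the workaround amounts to a guard along these lines (the constant and function names here are made up for illustration, not the names in the merged code):

```python
# Tables whose names contain these substrings skip the reference_time
# comparison entirely (the "past hour" and ahps gotchas described above).
SKIP_REFERENCE_TIME_PATTERNS = ("past", "ahps")  # hypothetical constant name


def skip_reference_time_check(table_name):
    return any(pattern in table_name for pattern in SKIP_REFERENCE_TIME_PATTERNS)
```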

Please take a look at this all and let me know what you think!

@CoreyKrewson-NOAA
Contributor

@shawncrawley This is great! I know we had also talked about checking the tables that are joined. Is that in this PR or something we want to try to do?

The reason the AHPS metadata doesn't line up, too, is that it is run every 5 minutes. So maybe we shouldn't be joining it anyway, because the forecasts might not align in the window between when the two tables are updated. We should find a different way to do this, which would most likely be just hitting the DB directly for the data we want.

@shawncrawley
Collaborator Author

I also check the joined tables - I should have clarified. So the regex checks all tables that follow FROM and JOIN.
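For example, a pattern along these lines would pick up both (an illustrative regex and illustrative SQL with placeholder schema and column names, not the exact code in viz_classes.py):

```python
import re

# Capture schema-qualified table names that follow either FROM or JOIN.
TABLE_PATTERN = re.compile(r'(?:FROM|JOIN)\s+(\w+\.\w+)', re.IGNORECASE)

sql = """
SELECT hw.*, meta.producer
FROM publish.high_water_arrival_time AS hw
JOIN external.ahps_metadata AS meta ON meta.station_id = hw.station_id
"""

print(TABLE_PATTERN.findall(sql))
# ['publish.high_water_arrival_time', 'external.ahps_metadata']
```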

@shawncrawley
Collaborator Author

I'll also clarify that both the rnr and rfc_max_forecast use that ahps_metadata table. My updates to rework the rnr post-wrf-hydro processing into our HydroVIS framework will remove the ahps_metadata dependency from the rnr service queries. At that point, we'll only need to rework the rfc_max_forecast SQL to remove that dependency.

@CoreyKrewson-NOAA
Contributor

Gotcha, and sounds good. That table is actually created from the rfc max forecast product, so we shouldn't need to mess with that - unless we want to just hit the DB directly to create that service, which Tyler has suggested and I like.

@CoreyKrewson-NOAA
Contributor

left a comment

Looks good except for that one line that I mentioned in the comment.

@CoreyKrewson-NOAA merged commit 9c4a6c6 into ti May 18, 2023
@CoreyKrewson-NOAA deleted the dynamic-check-dependent-tables branch May 18, 2023 13:38
@TylerSchrag-NOAA
Contributor

This is a great enhancement, thanks for taking what I started to the next level! I see that Corey already merged this, but one minor suggestion from me is to use the word "check" or something somewhere in the method name, like check_required_tables_updated, just to be a little more readable.

I could also see it making sense to have check_table_reference_time as a separate method that we could use elsewhere as needed... but that can be something to parking lot for now until we actually need it.

One other important thing I just started reading about, and that I'll bring up in the next scrum: it looks like there is some confusion online about how the psycopg2 context manager works, and I guess we're actually supposed to close our connections, even when using with statements (oops, I should have known this). I also recently started putting the cursors in with statements within the connection context (or declaring both on the same with line)... but I don't think that actually does anything.

It's probably a good task for any one of us to research the best approach and implement the same connection and cursor syntax everywhere, and to update the viz classes methods with what Corey has continued iterating on in the notebook helper functions so they're universal.
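For what it's worth, the pattern the psycopg2 docs describe looks roughly like this: the connection's with block only scopes a transaction (commit on success, rollback on error) and does not close the connection, while the cursor's with block does close the cursor. The connection parameters and query below are placeholders.

```python
import psycopg2

# Placeholder connection parameters -- not real hosts or credentials.
conn = psycopg2.connect(host="db-host", dbname="vizdb", user="viz_user", password="...")
try:
    with conn:  # transaction scope only; the connection stays open afterwards
        with conn.cursor() as cur:  # the cursor IS closed when this block exits
            cur.execute("SELECT max(reference_time) FROM cache.max_flows_ana")
            print(cur.fetchone()[0])
finally:
    conn.close()  # the easy-to-forget part: explicitly close the connection
```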

@CoreyKrewson-NOAA
Contributor

@TylerSchrag-NOAA good suggestion on adding the word "check". I agree and can add that to my next PR as well. Could you create an issue for the psycopg2 stuff that you mentioned? That is good to know and something we should track and do.

Successfully merging this pull request may close these issues.

Make check for dependent tables dynamic and relevant to everywhere that SQL is executed