
Dynamic check dependent tables #447

Merged: 8 commits into ti from dynamic-check-dependent-tables on May 18, 2023

Conversation

shawncrawley
Collaborator

There are now several existing service workflows that depend on both the existence and the appropriate state of data in the tables of other services. An existing example of this is the peak_flow_arrival_time services, which depend upon (i.e. build off of) the high_water_arrival_time services. @TylerSchrag-NOAA recently implemented a good way to handle this by adding a new product configuration property called dependent_on_db_tables that can optionally be specified under the postprocess_sql section. This was beautifully implemented, but unfortunately I quickly found that it falls short for another use case: the re-developed Replace and Route services. It falls short there because the rfc_5day_max_downstream_inundation service is now built entirely off of the final state of the rfc_5day_max_downstream_streamflow service. What's different in this case, as opposed to the high water arrival time services, is that the dependency occurs earlier in the workflow - not in the "postprocess_sql" section, but in the "fim_data_prep" section.

The solution that @CoreyKrewson-NOAA and I came up with was to generalize and automate the check for these dependent tables for every service and at every stage of running any SQL. This was done by creating a required_tables_updated method on the Database class in viz_classes.py. The method is passed the SQL string or file that needs to be verified before being executed, along with an optional sql_replace dictionary, a reference_time, a flag for whether the check should stop after finding a single issue or thoroughly find all issues, and a flag for whether an issue should raise an exception or simply cause the method to return False.

The method works by using regex to pull out every table name that appears in a "FROM" clause and then checking, for each of those tables: that the table exists; if it does, whether it has a reference_time column; and if it does and a reference_time was provided, whether the two match. If an issue is found (i.e. either a table does not exist or its reference_time is not updated as expected), the method either returns False or throws a custom RequiredTableNotUpdated exception.
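For illustration, here is a rough sketch of the shape of that check. This is a simplified sketch, not the merged code: the parameter names, the cursor argument, and the catalog queries are illustrative, and details such as reading SQL from a file and the hard-coded skips described below are omitted.

```python
import re


class RequiredTableNotUpdated(Exception):
    """Raised when a dependent table is missing or its reference_time is stale."""


def required_tables_updated(cursor, sql, sql_replace=None, reference_time=None,
                            stop_on_first_issue=True, raise_if_issue=True):
    """Check that every table referenced by the given SQL exists and is current."""
    # Apply any template substitutions before inspecting the SQL.
    for key, value in (sql_replace or {}).items():
        sql = sql.replace(key, value)

    # Pull out schema-qualified table names that follow FROM or JOIN keywords.
    tables = set(re.findall(r'(?:FROM|JOIN)\s+(\w+\.\w+)', sql, re.IGNORECASE))

    issues = []
    for table in tables:
        schema, name = table.split('.')
        # Does the table exist?
        cursor.execute(
            "SELECT EXISTS (SELECT 1 FROM information_schema.tables "
            "WHERE table_schema = %s AND table_name = %s)", (schema, name))
        if not cursor.fetchone()[0]:
            issues.append(f"{table} does not exist")
        elif reference_time:
            # Does it have a reference_time column?
            cursor.execute(
                "SELECT EXISTS (SELECT 1 FROM information_schema.columns "
                "WHERE table_schema = %s AND table_name = %s "
                "AND column_name = 'reference_time')", (schema, name))
            if cursor.fetchone()[0]:
                # If so, does its reference_time match the expected one?
                # (The table identifier is interpolated directly for brevity;
                # the regex above restricts it to word characters and a dot.)
                cursor.execute(f"SELECT max(reference_time) FROM {table}")
                if cursor.fetchone()[0] != reference_time:
                    issues.append(f"{table} reference_time is not updated")
        if issues and stop_on_first_issue:
            break

    if issues and raise_if_issue:
        raise RequiredTableNotUpdated("; ".join(issues))
    return not issues
```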

A call to this new method was added to both the fim_data_prep and postprocess_sql lambda functions, and those functions, as defined in our viz_pipeline step functions, were modified to retry 20 times, once every 30 seconds, if the RequiredTableNotUpdated error is thrown - thus allowing time for any dependent data to be updated before actually failing the entire pipeline run.
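For reference, the retry policy on those states would look roughly like the following. This is a hedged sketch of a Step Functions Retry block, not the actual viz_pipeline definition; in particular, the exact error-name matching depends on how the lambda surfaces the exception.

```json
"Retry": [
  {
    "ErrorEquals": ["RequiredTableNotUpdated"],
    "IntervalSeconds": 30,
    "MaxAttempts": 20,
    "BackoffRate": 1.0
  }
]
```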

I tested this in TI for a day and in doing so found a couple of gotcha use cases that required some hard-coded logic. Namely, the SQL that produces cache.max_flows_ana also pushes that table's "about to be an hour old" data into a cache.max_flows_ana_past_hour table prior to truncating its own table and updating it with the new data. Thus, the required_tables_updated check was failing: it found cache.max_flows_ana referenced in a "FROM" clause (since the data from that table is written into the past_hour table), and it found that the reference_time in that table didn't match the expected reference time - but that's because it's supposed to not match and be an hour earlier.

The other gotcha use case was that the "rnr" service was failing because it depends upon an "ahps_metadata" table that, for whatever reason, has a "reference_time" column; I believe that column more accurately reflects an update time, and thus it never matches the expected reference time of the table being created from that data.

To hard-code around the above two gotcha use cases, I simply have any table whose name contains "past" or "ahps" skip the reference_time check.
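In code terms, the workaround amounts to a guard along these lines (the constant and function names here are made up for illustration, not the names in the merged code):

```python
# Tables whose names contain these substrings skip the reference_time
# comparison entirely (the "past hour" and ahps gotchas described above).
SKIP_REFERENCE_TIME_PATTERNS = ("past", "ahps")  # hypothetical constant name


def skip_reference_time_check(table_name):
    return any(pattern in table_name for pattern in SKIP_REFERENCE_TIME_PATTERNS)
```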

Please take a look at this all and let me know what you think!

@CoreyKrewson-NOAA
Contributor

@shawncrawley This is great! I know we had also talked about checking the tables that are joined. Is that in this PR or something we want to try to do?

The reason the AHPS metadata doesn't line up, too, is that it is run every 5 minutes. So maybe we shouldn't be joining it anyway, because the forecasts might not align in the window between when the two tables are updated. We should find a different way to do this, which would most likely be just hitting the DB directly for the data we want.

@shawncrawley
Collaborator Author

I also check the joined tables - I should have clarified. So the regex checks all tables that follow FROM and JOIN.
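For example, a pattern along these lines would pick up both (an illustrative regex and illustrative SQL with placeholder schema and column names, not the exact code in viz_classes.py):

```python
import re

# Capture schema-qualified table names that follow either FROM or JOIN.
TABLE_PATTERN = re.compile(r'(?:FROM|JOIN)\s+(\w+\.\w+)', re.IGNORECASE)

sql = """
SELECT hw.*, meta.producer
FROM publish.high_water_arrival_time AS hw
JOIN external.ahps_metadata AS meta ON meta.station_id = hw.station_id
"""

print(TABLE_PATTERN.findall(sql))
# ['publish.high_water_arrival_time', 'external.ahps_metadata']
```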

@shawncrawley
Collaborator Author

I'll also clarify that both the rnr and rfc_max_forecast use that ahps_metadata table. My updates to rework the rnr post-wrf-hydro processing into our HydroVIS framework will remove the ahps_metadata dependency from the rnr service queries. At that point, we'll only need to rework the rfc_max_forecast SQL to remove that dependency.

@CoreyKrewson-NOAA
Contributor

Gotcha, and sounds good. That table is actually created from the rfc max forecast product, so we shouldn't need to mess with that - unless we want to just hit the DB directly to create that service, which Tyler has suggested and I like.

@CoreyKrewson-NOAA
Contributor

left a comment

Looks good except for that one line that I mentioned in the comment.

@CoreyKrewson-NOAA merged commit 9c4a6c6 into ti May 18, 2023
@CoreyKrewson-NOAA deleted the dynamic-check-dependent-tables branch May 18, 2023 13:38
@TylerSchrag-NOAA
Contributor

This is a great enhancement, thanks for taking what I started to the next level! I see that Corey already merged this, but one minor suggestion from me is to use the word "check" or something somewhere in the method name, like check_required_tables_updated, just to be a little more readable.

I could also see it making sense to have check_table_reference_time as a separate method that we could use elsewhere as needed... but that can be something to parking lot for now until we actually need it.

One other important thing I just started reading about, and that I'll bring up in the next scrum: it looks like there is some confusion online about how the psycopg2 context manager works, and I guess we're actually supposed to close our connections, even when using with statements (oops, I should have known this). I also recently started putting the cursors in with statements within the connection context (or declaring both on the same with line)... but I don't think that actually does anything.

It's probably a good task for any one of us to research the best approach and implement the same connection and cursor syntax everywhere, and to update the viz classes methods with what Corey has continued iterating on in the notebook helper functions so they're universal.
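For what it's worth, the pattern the psycopg2 docs describe looks roughly like this: the connection's with block only scopes a transaction (commit on success, rollback on error) and does not close the connection, while the cursor's with block does close the cursor. The connection parameters and query below are placeholders.

```python
import psycopg2

# Placeholder connection parameters -- not real hosts or credentials.
conn = psycopg2.connect(host="db-host", dbname="vizdb", user="viz_user", password="...")
try:
    with conn:  # transaction scope only; the connection stays open afterwards
        with conn.cursor() as cur:  # the cursor IS closed when this block exits
            cur.execute("SELECT max(reference_time) FROM cache.max_flows_ana")
            print(cur.fetchone()[0])
finally:
    conn.close()  # the easy-to-forget part: explicitly close the connection
```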

@CoreyKrewson-NOAA
Contributor

@TylerSchrag-NOAA good suggestion on adding the word "check". I agree and can add that to my next PR as well. Could you create an issue for the psycopg2 stuff that you mentioned? That is good to know and something we should track and do.

Successfully merging this pull request may close these issues.

Make check for dependent tables dynamic and relevant to everywhere that SQL is executed