Dynamic check dependent tables #447
Conversation
@shawncrawley This is great! I know we had also talked about checking the tables that are joined. Is that in this PR, or something we want to try to do? The reason the AHPS metadata doesn't line up, too, is that it is run every 5 minutes. So maybe we shouldn't be joining it anyway, because the forecasts might not align in the time between when the two tables are updated. We should find a different way to do this, which would most likely be just hitting the DB directly for the data we want.
I also check the joined tables - I should have clarified. So the regex checks every table referenced in a FROM clause, including the joined tables.
I'll also clarify that both the rnr and rfc_max_forecast services use that ahps_metadata table. My updates to rework the rnr post-wrf-hydro processing into our HydroVIS framework will remove the ahps_metadata dependency from the rnr service queries. At that point, we'll only need to rework the rfc_max_forecast SQL to remove that dependency.
Gotcha, and sounds good. That table is actually created from the rfc max forecast product, so we shouldn't need to mess with that. Unless we want to just hit the DB directly to create that service, which Tyler has suggested and I like.
Looks good except for the one line that I mentioned in the comment.
This is a great enhancement - thanks for taking what I started to the next level! I see that Corey already merged this, but one minor suggestion from me is to use the word "check" (or something similar) somewhere in the method name, like check_required_tables_updated, just to be a little more readable. I could also see it making sense to have check_table_reference_time as a separate method that we could use elsewhere as needed... but that can be something to parking-lot for now until we actually need it. One other important thing I just started reading about, which I'll bring up in the next scrum: it looks like there is some confusion online about how the psycopg2 context manager works, and I guess we're actually supposed to close our connections, even when using with statements (oops, I should have known this). I also recently started putting the cursors in with statements within the connection context (or declaring both in the same with line)... but I don't think that actually does anything. Probably a good task for any one of us to research the best approach and implement the same connection and cursor syntax everywhere / update the viz classes' methods with what Corey has continued iterating on in the notebook helper functions, so it's universal.
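For reference, the pattern being discussed can be sketched roughly like this (the connection string is illustrative; the only psycopg2 behavior relied on is that the connection's `with` block manages the transaction without closing the connection, while the cursor's `with` block does close the cursor):

```python
import psycopg2

# Illustrative connection string - not a real HydroVIS connection
conn = psycopg2.connect("dbname=example user=example")
try:
    with conn:                      # commits on success, rolls back on exception
        with conn.cursor() as cur:  # the cursor IS closed when this block exits
            cur.execute("SELECT 1")
            print(cur.fetchone())
finally:
    conn.close()                    # "with" does NOT close the connection - close it explicitly
```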
@TylerSchrag-NOAA good suggestion on adding the word "check". I agree and can add that to my next PR as well. Could you create an issue for the psycopg2 stuff that you mentioned? That is good to know and something we should track and do.
There are now several existing service workflows that depend on the existence of, and the appropriate state of, data from the tables of other services. An existing example of this is the peak_flow_arrival_time services, which depend upon (i.e. build off of) the high_water_arrival_time services. @TylerSchrag-NOAA recently implemented a good way to handle this by adding a new product configuration property called `dependent_on_db_tables` that can optionally be specified under the `postprocess_sql` section. This was beautifully implemented, but unfortunately I quickly found that it fell short for another use case: that of the re-developed Replace and Route services. The reason it falls short there is that the rfc_5day_max_downstream_inundation service is now completely built off of the final state of the rfc_5day_max_downstream_streamflow service. What's different in this case, as opposed to the high water arrival time services, is that the initial dependency occurs earlier in the workflow - not in the `postprocess_sql` section, but in the `fim_data_prep` section.
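For context, a product configuration using that property might look roughly like the following; only `postprocess_sql` and `dependent_on_db_tables` are taken from the description above, and the surrounding keys and table names are illustrative:

```yaml
# Hypothetical sketch of the optional property - keys and names other than
# postprocess_sql and dependent_on_db_tables are illustrative only.
postprocess_sql:
  - sql_file: peak_flow_arrival_time
    dependent_on_db_tables:
      - publish.high_water_arrival_time
```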
The solution that @CoreyKrewson-NOAA and I came up with was to generalize and automate the check for these dependent tables for every service and at every stage of running any SQL. This was done by creating a `required_tables_updated` method on the `Database` class in `viz_classes.py`. This function is passed the SQL string or file that needs to be verified before being executed, along with an optional sql_replace dictionary, a reference_time, whether the check should stop after finding a single issue or thoroughly find all issues, and whether an exception should be raised when an issue is found or false simply returned.

This function works by using regex to pull out every table name that appears in a "FROM" clause and then checking whether each of those tables exists, whether it has a reference_time column if it does exist, and whether that column matches the provided reference_time if one was given. If an issue is found (i.e. either a table does not exist or its reference_time is not updated as expected), then either the return value will be false or a custom `RequiredTableNotUpdated` exception will be thrown.
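As a rough illustration of the approach (not the actual `viz_classes.py` code - the regex, queries, and function names below are simplified assumptions), the check boils down to something like:

```python
import re

def tables_in_from_clauses(sql):
    # Grab schema-qualified names that follow FROM or JOIN,
    # e.g. "derived.rfc_5day_max_downstream_streamflow"
    return set(re.findall(r'(?:FROM|JOIN)\s+(\w+\.\w+)', sql, flags=re.IGNORECASE))

def required_tables_look_updated(cursor, sql, reference_time=None):
    for table in tables_in_from_clauses(sql):
        schema, name = table.split('.', 1)
        # 1. Does the table exist?
        cursor.execute(
            "SELECT EXISTS (SELECT 1 FROM information_schema.tables "
            "WHERE table_schema = %s AND table_name = %s)", (schema, name))
        if not cursor.fetchone()[0]:
            return False
        # 2. Does it have a reference_time column?
        cursor.execute(
            "SELECT EXISTS (SELECT 1 FROM information_schema.columns "
            "WHERE table_schema = %s AND table_name = %s "
            "AND column_name = 'reference_time')", (schema, name))
        has_ref_time = cursor.fetchone()[0]
        # 3. If so, and a reference_time was provided, do they match?
        if has_ref_time and reference_time:
            cursor.execute(f"SELECT max(reference_time) FROM {schema}.{name}")
            if cursor.fetchone()[0] != reference_time:
                return False
    return True
```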
A call to this new function was added to both the fim_data_prep and postprocess_sql lambda functions, and those functions, as defined in our viz_pipeline step functions, were modified to retry 20 times, once every 30 seconds, if the RequiredTableNotUpdated error is thrown - thus allowing time for any dependent data to be updated before failing the entire pipeline run.
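In Step Functions terms, the retry added to those task states looks roughly like the following (the interval and attempt count come from the description above; matching on the exception class name is an assumption about how the custom error surfaces in the state machine):

```json
"Retry": [
  {
    "ErrorEquals": ["RequiredTableNotUpdated"],
    "IntervalSeconds": 30,
    "MaxAttempts": 20,
    "BackoffRate": 1.0
  }
]
```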
I tested this in TI for a day and in doing so found a couple of gotcha use cases that required some hard-coded logic. Namely, the SQL that produces the `cache.max_flows_ana` table also pushes that table's "about to be an hour old" data into a `cache.max_flows_ana_past_hour` table prior to truncating its own table and updating it with the new data. Thus, the `required_tables_updated` function was failing because it found that `cache.max_flows_ana` was being referenced in a "FROM" clause (since the data from that table is written into the past_hour table), but it also found that the reference_time in that table wasn't matching the expected reference time - which is expected, since it's supposed to not match and be an hour earlier.

The other gotcha use case was the "rnr" service, which was failing because it depends upon an "ahps_metadata" table that, for whatever reason, has a "reference_time" column, but that column I believe more accurately reflects an update time and thus never matches the expected reference time of the table being created from that data.
To hard-code around the above two gotcha use cases, I just have any table whose name contains "past" or "ahps" skip the reference_time check.
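In other words, the exclusion is just a substring test along these lines (illustrative helper, not the exact code):

```python
def skip_reference_time_check(table_name):
    # Hard-coded exceptions: "past hour" cache tables and the ahps_metadata table
    # intentionally carry a reference_time that won't match the pipeline's reference_time.
    return "past" in table_name or "ahps" in table_name
```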
Please take a look at this all and let me know what you think!