Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add variables that can replace compset regex matches in config_component and similar files #3816

Open
billsacks opened this issue Dec 24, 2020 · 8 comments

Comments

@billsacks
Copy link
Member

This idea has been a constant refrain for me over the last few years, but I have brought it up in the context of other issues – see especially #119 and #1522. It's finally time that it gets its own issue.

I find the current system for matching various component options via regex matches on the compset long name in config_component.xml and similar files to be fragile and error-prone. We've had a number of bugs over the years due to too-specific or too-general regex matches in some component's xml file, and even though I consider myself decent with regexes, I still get nervous any time I need to write a new match in config_component.xml. I would like cime to do some initial parsing of the compset long name, creating a set of variables that could then be used for matches in config_component.xml, etc., without resorting to regex matching the whole compset long name. This way, the regex parsing would be done in a single place, which could be made robust and well-tested. For each component (I'm using lnd as an example), we'd have variables like:

  • lnd_name: Any number of alphabetic characters - e.g., "CLM"

  • lnd_version: Any number of numeric characters following the lnd_name - e.g., "50" (may be empty for some components)

  • lnd_opts: Any number of options following the '%'. This is a key one. It would allow us to replace xml that currently looks like this:

<value compset="_CLM45%[^_]*SP">-bgc sp</value>
<value compset="_CLM45%[^_]*BGC">-bgc bgc</value>
<value compset="_CLM50%[^_]*SP">-bgc sp</value>
<value compset="_CLM50%[^_]*BGC">-bgc bgc</value>

with simply

<value lnd_opts="SP">-bgc sp</value>
<value lnd_opts="BGC">-bgc bgc</value>
  • lnd_type: One of 'active', 'data', 'stub' or 'coupler_test'. It's important to be able to match on lnd_type='active'. The other 3 are less important, because we're unlikely to rename DLND, SLND, XLND, but I'm including them for consistency.

For some more examples and thinking around this issue, see #1522 (comment) and #1522 (comment) and some follow-on comments in that issue.

@ekluzek
Copy link
Contributor

ekluzek commented Dec 24, 2020

I think this makes a ton of sense and I can see where often it's the way we should have been doing things all along. Separating out the "options" for each component is something that we've already setup because we use the "%" character to delineate the options. So this is a straightforward change that would make so many things much clearer.

I thought I had an exception to @billsacks case for things where there's a dependence on other components. But, it looks like that case works as well.

An example of this is LND_TUNING_MODE where you currently have this...

value compset="DATM%GSWP3.+_CLM45" >clm4_5_GSWP3v1</value>

In @billsacks formula this would become...

value atm_name="DATM" lnd_version="45" atm_opts="GSWP3" >clm4_5_GSWP3v1</value>

There's also this example of DATM_PRESAERO where a multiple component match is done..

<value compset="_DATM.*_DICE.*_DOCN.*_DROF">none</value>

which would become...

<value atm_name="DATM" ice_name="DICE" ocn_name="DOCN" rof_name="DROF" >none</value>

It could also be...

<value atm_type="data" ice_type="data" ocn_type="data" rof_type="data" >none</value>

taking advantage of the component type category. That might often be the better and more clearer thing to do.

So I think this covers everything we'd want it to do, and it makes it clearer. The xxx_opts would still often have to be a reg-ex because of multiple possible options. So the rules about exact match verses a regular expression would need to be clear everywhere.

There is also this case for CLM_CO2_TYPE that I suppose would still use compset...

  `<value compset="HIST.*_DATM" >diagnostic</value>`

I suppose it would become...

  `<value compset="HIST" atm_name="DATM">diagnostic</value>`

The things you'd still need the full compset for are for the starting period and any possible ending on the compset. If these become variables they could be queried as well in the same way.

@jedwards4b
Copy link
Contributor

A word of caution - every new attribute in the xml requires new code and logic in the python. This change simply shifts the complicated logic from the xml file in the form of regex to the python file in the form of multiple attributes. In the opening message @billsacks states that the regex is error prone, but the original issue has been out there for 5 years and I don't think that there have been that many problems with it.

@billsacks
Copy link
Member Author

every new attribute in the xml requires new code and logic in the python. This change simply shifts the complicated logic from the xml file in the form of regex to the python file in the form of multiple attributes.

The difference is that, rather than having hundreds of places where we do regex matches on the compset for this kind of logic, we would do it in one place, covered by tests.

I don't think that there have been that many problems with it.

It's true that we haven't had that many KNOWN problems. But we have a lot of regexes on compset that aren't quite right. And as we support more and more components, with more and more options, it's inevitable that we're going to have situations where current matches work incorrectly, due to being too general or too specific. I would turn this question around and ask: How confident are we that all of our regexes on compset are working right all of the time? My confidence is low on that point.

@billsacks
Copy link
Member Author

billsacks commented Nov 12, 2021

Probably a prerequisite for #4125 given that the plan for that may involve allowing order independence in the compset long name.

@billsacks
Copy link
Member Author

@jedwards4b was wondering in a discussion today whether it's important to split out each component's piece into individual pieces like lnd_name, lnd_version and lnd_opts, as opposed to just having an lnd string that contains all of it. It's true that the further separation isn't important if the only goal is to allow order independence of components in the compset long name. However, part of the goal of this issue (in my mind) is also to allow us to replace fragile regexes with something more fool-proof - for example, avoiding issues like the one I needed to fix in ESCOMP/CESM_CICE5@ac908ac . See also the example in my initial comment in this issue for motivation.

@billsacks billsacks moved this from Todo ~ weeks to Todo ~ months in CESM: infrastructure / cross-component SE priorities Feb 4, 2022
@github-actions
Copy link
Contributor

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.

@github-actions github-actions bot added the Stale label May 16, 2023
@billsacks
Copy link
Member Author

I still think this is worth doing at some point.

@billsacks billsacks removed the Stale label May 16, 2023
@github-actions
Copy link
Contributor

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants