Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

FIXME at line 676 of case.py: validate compset settings for user-compset #1522

Closed
jedwards4b opened this issue May 12, 2017 · 42 comments
Closed

Comments

@jedwards4b
Copy link
Contributor

I would like to remove this comment in case.py - can someone give me an example that
gets this far with invalid compset settings?

    self._components = self.get_compset_components()
    #FIXME - if --user-compset is True then need to determine that
    #all of the compset settings are valid
@billsacks
Copy link
Member

I'm not sure if this is what the FIXME note is referring to, but I see:

(1) Valid compset (as a "control"):

./create_newcase --compset 2000_DATM%NYF_SLND_DICE%SSMI_DOCN%DOM_DROF%NYF_SGLC_SWAV --user-compset --res f45_g37 --run-unsupported --case test_0512a --pesfile ../src/drivers/mct/cime_config/config_pes.xml

(2) Invalid compset:

./create_newcase --compset 2000_DATM%FOO_SLND_DICE%SSMI_DOCN%DOM_DROF%NYF_SGLC_SWAV --user-compset --res f45_g37 --run-unsupported --case test_0512c --pesfile ../src/drivers/mct/cime_config/config_pes.xml

Specifically, note the use of DATM%FOO. This happily moves ahead; I haven't checked what the resulting case looks like.

The only way I can think of to catch errors like this is if every component lists allowed settings in its config_components.xml file... actually, that might be a good requirement to catch typos in user_compsets.

@billsacks billsacks changed the title FIXME at line 676 of case.py FIXME at line 676 of case.py: validate compset settings for user-compset May 12, 2017
@jedwards4b
Copy link
Contributor Author

So we could have a new variable in config_component.xml

<entry id="COMPONENTNAME_MODIFIERS">
<valid_values>NYF, NXF, FOO</valid_values>

</entry>

I can make this optional so that all components do not need to supply it right away - if a component does supply it then any modifier provided for that component in the compset name will be checked against this list. If it's not supplied than no error check is done for modifiers in that component.

@billsacks
Copy link
Member

@jedwards4b - yes, I really like that solution.

@billsacks
Copy link
Member

Actually, I think (but am not sure) that datm, and maybe other components (probably mostly data models) require at least one modifier. i.e., I think it's invalid to have a compset like 2000_DATM_SLND_DICE%SSMI_DOCN%DOM_DROF%NYF_SGLC_SWAV. If I'm right about that, then is there a way we can also denote that a modifier is required for a given component?

@jedwards4b
Copy link
Contributor Author

Yes, but can you confirm that?

@billsacks
Copy link
Member

@jedwards4b - I don't know how to confirm this.

@mvertens or @ekluzek - do you know if datm and/or other data models require a modifier? e.g., is it invalid to do a compset with _DATM_ rather than _DATM%something_?

@ekluzek
Copy link
Contributor

ekluzek commented May 12, 2017

@billsacks I'm pretty sure you have to specify what DATM_MODE as a %. It's possible there is a default of CORE forcing, but I doubt we EVER rely on the default.

@mvertens
Copy link
Contributor

mvertens commented May 13, 2017 via email

@jedwards4b
Copy link
Contributor Author

jedwards4b commented Jun 7, 2017

Trying to sort out the requirements for the description matches on compset names:

The compset name consists of a date field followed by a description of each specific component in COMP_CLASSES in the order of the COMP_CLASSES list. Currently for cesm this is: CPL,ATM,LND,ICE,OCN,ROF,GLC,WAV,ESP where the ESP component may or may not be explicitly listed. The compset name may optionally include at the end a BGC modifier - we need to be careful not to confuse this with the ESP component. Components are separated by _ symbols.

Each component is described by a specific component name (cam40, clm45, etc) optionally followed by one or more modifiers separated by % symbols.

For some components at least one modifier is required. An attribute in the xml description will indicate whether the component requires modifiers.

One possible addition I see is that some components (datm, docn )? may only allow a single modifier. Perhaps the attribute should be "modifiers" with allowed values of '*','?','1','+' to indicate 0 or more, 0 or 1, exactly 1, and 1 or more respectively.

@gold2718
Copy link

gold2718 commented Jun 7, 2017

For some components at least one modifier is required. An attribute in the xml description will indicate whether the component requires modifiers.

Are you talking about the proposed COMPONENTNAME_MODIFIERS entry tag?
Since we are using the <description> section to determine valid entries, I wonder if that area is enough to provide matches. For instance, CLM has desc tags like:
compset="_CLM50%[^_]*BGC_"
and
compset="_CLM50%[^_]*BGC-CROP"
which I think allows an entry like CLM50%BGC-CROP%BGC which may or may not be valid.
Assuming for a moment that this is not a valid entry for CLM50, then just saying that CLM allows 2 modifiers will not prevent invalid entries like this. I could rewrite BGC-CROP entry as:
compset="_CLM50%[^_]*BGC-CROP_"
to specify that it is a singleton or
compset="_CLM50%[^_]*BGC-CROP(%[^_%]*)?_"
to indicate that it can be followed by another modifier or
compset="_CLM50%[^_]*BGC-CROP(%FATES)?_"
to specify that it could be combined with FATES. While the last invites combinatorial complexity, it also spells out allowed combinations while just a number of modifiers does not.

@jedwards4b
Copy link
Contributor Author

So in the latest discussion we came up with the following:
The description field will have an optional boolean attribute called modifier_required if present and true
it would say that the component has to have at least one modifier.
the desc field would also have an optional boolean attribute called additive if present and set to true this would allow this modifier to be combined with any other modifier.

@gold2718
Copy link

gold2718 commented Jun 7, 2017

How does this prevent invalid combinations?

@jedwards4b
Copy link
Contributor Author

jedwards4b commented Jun 7, 2017

Only fields with the additive attribute could be combined with other fields, although this would not absolutely prevent invalid combinations it should greatly reduce the potential for creating one.

@gold2718
Copy link

gold2718 commented Jun 7, 2017

It doesn't prevent invalid combinations and also is not usable if an option isn't 'additive' for every entry in the <description> section. In the CAM case, ramping CO2 can be added to any item:
compset="_CAM60_"
becomes
compset="_CAM60(%RCO2)?_"
and the behavior is explicit and local. This gives fine-grained control and prevents invalid entries at the earliest stage.
Flagging @cacraigucar and @brian-eaton in case they are not following this discussion.

@mvertens
Copy link
Contributor

mvertens commented Jun 7, 2017 via email

@gold2718
Copy link

gold2718 commented Jun 12, 2017

Here is a pass through the CAM description section using new attributes which are component type names (e.g., atm). Rather than regular expression syntax, each allowed value is spelled out with optional additions in brackets. Thus,
atm="CAM40[%TMOZ][%RCO2] means the atmosphere section of the compset must be CAM4 optionally followed by either %TMOZ or %RCO2. Since this is a 'set' match, this refers to the following possible matches:
CAM40
CAM40%TMOZ
CAM40%RCO2
CAM40%TMOZ%RCO2
CAM40%RCO2%TMOZ
With that idea, here is a proposed CAM description section

  <description>
    <!-- CAM4 physics -->
    <desc atm="CAM40[%RCO2]"             >CAM4 physics:
    with optional CO2 ramp [%RCO2]
    </desc>
    <desc atm="CAM40[%TMOZ][%RCO2]"      >CAM4 physics:
    with optional tropospheric chemistry with bulk aerosols [%TMOZ]
    with optional CO2 ramp [%RCO2]
    </desc>
    <desc atm="CAM40%WXIE[%RCO2]"        >CAM4 physics:
       with WACCM-X enhanced ionosphere, transport, and electrodynamics
    with optional CO2 ramp [%RCO2]
    </desc>
    <!-- CAM4 Single Column -->
    <desc atm="AR95_CAM40%SCAM[%RCO2]" >Single column CAM ARM95 IOP test case:
    with optional CO2 ramp [%RCO2]
    </desc>
    <desc atm="AR97_CAM40%SCAM[%RCO2]" >Single column CAM ARM97 IOP test case:
    with optional CO2 ramp [%RCO2]
    </desc>
    <!-- CAM5 physics -->
    <desc atm="CAM50[%RCO2]"             >CAM5 physics:
    with optional CO2 ramp [%RCO2]
    </desc>
    <desc atm="CAM50[%CLB][%RCO2]"       >CAM5 physics:
    with optional CLUBB [%CLB]
    with optional CO2 ramp [%RCO2]
    </desc>
    <desc atm="CAM50[%PM][%RCO2]"        >CAM5 physics:
    with optional prescribed modal aerosols [%PM]
    with optional CO2 ramp [%RCO2]
    </desc>
    <desc atm="CAM50[%WCCM][%RCO2]"      >CAM5 physics:
    with optional middle atmosphere chemistry (WACCM) [%WCCM]
    with optional CO2 ramp [%RCO2]
    </desc>
    <!-- CAM5 Single Column -->
    <desc atm="AR95_CAM50%SCAM[%RCO2]" >Single column CAM ARM95 IOP test case:
    with optional CO2 ramp [%RCO2]
    </desc>
    <desc atm="AR97_CAM50%SCAM[%RCO2]" >Single column CAM ARM97 IOP test case:
    with optional CO2 ramp [%RCO2]
    </desc>
    <!-- CAM6 physics -->
    <desc atm="CAM60[%RCO2]"             >CAM6 physics:
    with optional CO2 ramp [%RCO2]
    </desc>
    <desc atm="CAM60%WCCM[%RCO2]"        >CAM6 physics:
    with middle atmosphere chemistry (WACCM) [%WCCM]
    with optional CO2 ramp [%RCO2]
    </desc>
    <desc atm="CAM60%WCSC[%RCO2]"        >WACCM with specified chemistry:
    with optional CO2 ramp [%RCO2]
    </desc>
    <desc atm="CAM60%WCTS[%RCO2]"        >WACCM with tropospheric, stratospheric, mesospheric, and lower thermospheric chemistry:
    with optional CO2 ramp [%RCO2]
    </desc>
    <desc atm="CAM60%CCTS[%RCO2]"        >CAM-Chem troposphere / stratosphere chemistry with simplified volatility basis set SOA scheme and modal aersols :
    with optional CO2 ramp [%RCO2]
    </desc>
    <!-- SPCAM -->
    <desc atm="CAM%SPCAMS[%RCO2]"        >Super-parameterized CAM with one moment SAM microphysics 
    with optional CO2 ramp [%RCO2]
    </desc>
    <desc atm="CAM%SPCAMCLBS[%RCO2]"     >Super-parameterized CAM with one moment SAM microphysics using CLUBB 
    with optional CO2 ramp [%RCO2]
    </desc>
    <desc atm="CAM%SPCAMM[%RCO2]"        >Super-parameterized CAM with double moment m2005 SAM microphysics 
    with optional CO2 ramp [%RCO2]
    </desc>
    <desc atm="CAM%SPCAMCLBM[%RCO2]"     >Super-parameterized CAM with double moment m2005 SAM microphysics using CLUBB 
    with optional CO2 ramp [%RCO2]
    </desc>
    <!-- Specified dynamics -->
    <desc atm="SDYN_CAM[%RCO2]"          >CAM with winds and temperature nudged towards prescribed meteorology:
    with optional CO2 ramp [%RCO2]
    </desc>
    <!-- Simple physics compsets -->
    <desc atm="CAM%ADIAB"                >CAM with adiabatic physics:</desc>
    <desc atm="CAM%DABIP04"              >CAM with dry adiabatic baroclinic instability (Polvani 2004):</desc>
    <desc atm="CAM%HS94"                 >CAM with Held-Suarez physics:</desc>
    <desc atm="CAM%DCTBM"                >CAM dynamical core test with baroclinic wave IC and terminator chemistry:</desc>
    <desc atm="CAM%KESSLER"              >CAM dynamical core test with baroclinic wave IC and Kessler physics:</desc>
    <desc atm="CAM%PORT"                 >CAM Parallel Offline Radiation Tool:</desc>
  </description>

One thing I'm not sure what to do with is the initial-condition / forcing scenario which is the first block of the compset. An example is the AR97_CAM5%SCAM[%RCO2] entry which includes a description of what AR97 means. I feel that that information should be in the driver's config_component file with CAM using the information in in its use case section ( <entry id="CAM_NML_USE_CASE">). I don't have a good name for this section, any ideas?

Any feedback on this proposal?

@billsacks
Copy link
Member

@gold2718 - at a glance, I do agree that this is more understandable than the regex syntax.

Is it intentional that some of these say CAM40%, CAM50% or CAM60%, some say CAM5%, and some just say CAM%?

At least for CLM, it would really help moving forward if we separated lnd_name from lnd_opts. We have a lot of options that apply to both CLM45 and CLM50 - and in the future will also apply to CLM55, CLM60, etc. On a number of occasions, we've had buggy behavior because someone forgot to add a CLM50%FOO section. So it would help if we could just say lnd_opts="BGC" rather than having separate lines for "CLM45%BGC", "CLM50%BGC", "CLM55%BGC", "CLM60%BGC", etc. The land model would appear in lnd_name - so we'd be able to match lnd_name="CLM45", lnd_name="CLM50", etc. This does beg the question of why the physics version is considered part of the component name rather than an option, though....

I'm okay deferring that to post-CESM2, but I feel like it's pretty important to address at some point, because this is getting problematic for CLM.

Also, I wanted to point out that @mvertens and I thought through some of this last year and came up with this proposal: #119 (comment)

@jedwards4b
Copy link
Contributor Author

@gold2718 It seems that you diverged from the conversation that we had leading to this proposal.
I don't have a problem using the component class for the attribute name - that will remove the order dependence on the comp_classes. The time filed AR97 does not belong in this file at all and should be moved to the driver. And to be clear the [] is a regular expression - the discussion was that there would be no regular expression syntax - is this to be the only regular expression syntax allowed in this field?

@billsacks You seem to be proposing something different but you need to spell it out clearly with examples. Are you suggesting another layer such as

<desc lnd_name="CLM50">
          <option lnd_opt="BGC"> clm50 with bgc physics </option>
          <option lnd_opt="FATE"> some kind of thing </option>
</desc>

How would you specify which options can be combined and which are exclusive?

@billsacks
Copy link
Member

What I meant was something like this:

<desc lnd_name="CLM45"> clm45</desc>
<desc lnd_name="CLM50"> clm50</desc>
<desc lnd_opt="BGC"> with biogeochemistry</desc>
<desc lnd_opt="BGC-CROP"> with biogeochemistry and the crop model active</desc>

rather than what we currently have, which has separate listings for CLM45%BGC, CLM50%BGC, CLM45%BGC-CROP, CLM50%BGC-CROP, etc.

If we had an option that only applied to a certain physics option, it could be done like this:

<desc lnd_name="CLM50" lnd_opt="FOO"> some option that only applies to CLM50</desc>

(Although in the case of CLM, I think those restrictions might be easier to handle by having appropriate error checking in build-namelist rather than trying to list the allowed / disallowed combinations in the xml. But let's not get hung up on that point, because I think it's purely academic for now - since I think all options are supported with both CLM45 and CLM50.)

@jedwards4b
Copy link
Contributor Author

Ah therein lies the problem. CLM has determined that only one option at a time is allowed. Instead of CLM50%BGC%CROP you do CLM50%BGC-CROP but other components want to allow multiple options CAM50%CLUBB%RMP We either need to standardize one syntax or the other or propose something that works for both - I think that what Steve has will work for both cases, what you have will not.

@billsacks
Copy link
Member

@jedwards4b - my comment was not meant to address the combination of options. I am perfectly happy with @gold2718 's suggestion for that. I was really just trying to address the separation of the component name from the component's options. Sorry for creating confusion here. So we could have something like:

<desc lnd_name="CLM45"> clm45</desc>
<desc lnd_name="CLM50"> clm50</desc>
<desc lnd_opt="%BGC[%CROP]"> with biogeochemistry, with optional crop model</desc>

@gold2718
Copy link

@billsacks - The issue with the separation is that in CAM, not all options apply to every CAM physics option. I think the choice comes down to duplication or lots of regular expression syntax.

@gold2718
Copy link

@jedwards4b - [%CLB] is only regular expression syntax if you treat the string as a regular expression. In our discussion last week, I thought we decided to try using sets with optional members with no regular expressions. My proposal above is using that idea. Parsing the string would create an exact string match plus a set of options.

@gold2718
Copy link

@billsacks - The CAM4% and CAM5% were typos (edited above) but everything else is intentional. CAM currently supports nine separate physics packages (roughly equivalent to to CLM's CLM45 and CLM50). The allowed option sets vary widely between these different packages.

@jedwards4b
Copy link
Contributor Author

@gold2718 instead of just splitting the string on % I also have to deal with the [] now,
shouldn't the brackets be inside and not outside the % symbol? %[CLB] instead of [%CLB]
This will make parsing easier I think.

@gold2718
Copy link

@jedwards4b, I was hoping we could keep this syntax because it is clearer for users (mimics shell optional argument syntax). For parsing, something like:
[ x.rstrip("]") for x in atm.split("[%") ]
can easily handle getting rid of the brackets with the first element of the list being the required part and the rest of the list as options.

@jedwards4b
Copy link
Contributor Author

Here is a little test program that I think implements @gold2718 proposal. I'm not sure how to parse the description to separate the required from the optional part of the comment. It seems like you don't want 'with optional CO2 Ramp [%RCO2]' if your compset didn't specify it. Can we put optional argument descriptions separately? That is we might have

<desc option="RCO2">CO2 Ramp</desc> 

in addition to what you have above?

#!/usr/bin/env python

compsetname = "B1850_CAM50%WCCM%RCO2_CLM50%BGC-CROP_DICE%SSMI_POP2_DROF%NYF_SGLC_WW3"
atm = "CAM50%WCCM[%RCO2]"
opt_parts = [ x.rstrip("]") for x in atm.split("[%") ]
parts = opt_parts.pop(0).split("%")

reqset = set(parts)
fullset = set(parts+opt_parts)

print "parts %s opt_parts %s"%(parts,opt_parts)
print "set %s fullset %s"%(reqset,fullset)

comparts = compsetname.split('_')
for comp in comparts:
    cset = set(comp.split("%"))
    if cset == fullset:
        print "Here match to fullset %s %s"%(cset, fullset)
    elif cset == reqset:
        print "Here match to reqset %s %s"%(cset, reqset)
    else:
        print "no match %s %s"%(cset, fullset)

@billsacks
Copy link
Member

To record a conversation I just had with @gold2718 - regarding this point:

@billsacks - The issue with the separation is that in CAM, not all options apply to every CAM physics option. I think the choice comes down to duplication or lots of regular expression syntax.

For CAM, I think this amounts to a pretty simple change for each line of xml: Anywhere where @gold2718 's xml file would have something like

<desc atm="CAM50[%WCCM][%RCO2]"      >

my proposal simply separates the full 'atm' string into 'atm_name' and 'atm_opts' like this:

<desc atm_name="CAM50" atm_opts="[%WCCM][%RCO2]"      >

For CAM, it sounds like this just adds some verbosity without much benefit. But for CLM, this would make things simpler (with less duplication) and more robust moving forward.

@jedwards4b
Copy link
Contributor Author

@billsacks - where is the analogous clm example that shows the benefit of this additional change? Show me how it makes things simpler and more robust.

@billsacks
Copy link
Member

Let me try to summarize my thinking on this, as well as what @jedwards4b and I just discussed. @jedwards4b helped clarify for me a misunderstanding I had about exactly how this would look.

Under the combined @gold2718 & @jedwards4b proposal, I think CLM's options would look like this (partial example, but adding the rest of the entries doesn't help illuminate things any more):

<desc lnd="CLM45[%SP][%BGC][%BGC-CROP]"> CLM45 physics</desc>
<desc lnd="CLM50[%SP][%BGC][%BGC-CROP]"> CLM50 physics</desc>
<desc option="SP"> satellite phenology</desc>
<desc option="BGC"> biogeochemistry</desc>
<desc option="BGC-CROP"> biogeochemistry with crop model</desc>

I now see why my suggestion didn't get much traction, because under that scheme, the separation of lnd_name doesn't help much, if at all: Actually, @jedwards4b 's most recent suggestion went a long way towards what I was thinking.

I have been approaching this more from the perspective of what I've wanted for a long time in matches outside of the entries. I understand that we're not dealing with that right now, but I'm getting some messages that it may be worth at least keeping that long-term goal in mind here. To that end, I was thinking about the compset matches that we have in these places:

  1. components' config_component.xml files, outside of the entry

Example (CLM's LND_TUNING_MODE)

      <value compset="_CLM40"      >clm4_0_default</value>
      <value compset="_CLM45"      >clm4_5_CRUNCEP</value>
      <value compset="_CLM50"      >clm5_0_cam5.5</value>
      <value compset="DATM.+_CLM50" >clm5_0_GSW3P</value>
      <value compset="DATM%GSWP3.+_CLM50" >clm5_0_GSW3P</value>
      <value compset="CAM.+_CLM50" >clm5_0_cam5.5</value>

Example (CLM_BLDNML_OPTS, currently buggy because I think it's incomplete)

      <value compset="_CLM45%[^_]*SP"		 >-bgc sp</value>
      <value compset="_CLM45%[^_]*CN"		 >-bgc cn</value>
      <value compset="_CLM45%[^_]*BGC"		 >-bgc bgc</value>
      <value compset="_CLM45%[^_]*CN-CROP"	 >-bgc cn -crop</value>
      <value compset="_CLM45%[^_]*BGC-CROP"	 >-bgc bgc -crop</value>
      <value compset="_CLM45%[^_]*CNDV"		 >-bgc cn -dynamic_vegetation</value>
      <value compset="_CLM45%[^_]*BGCDV"	 >-bgc bgc -dynamic_vegetation</value>
      <value compset="_CLM45%[^_]*CNDV-CROP"	 >-bgc cn -dynamic_vegetation -crop</value>
      <value compset="_CLM45%[^_]*BGCDV-CROP"	 >-bgc bgc -dynamic_vegetation -crop</value>
      <value compset="_CLM45%[^_]*FATES"	 >-bgc fates -no-megan</value>

      <value compset="_CLM50%[^_]*SP"		 >-bgc sp</value>
      <value compset="_CLM50%[^_]*BGC"		 >-bgc bgc</value>
      <value compset="_CLM50%[^_]*BGCDV"	 >-bgc bgc -dynamic_vegetation</value>
      <value compset="_CLM50%[^_]*BGC-CROP"	 >-bgc bgc -crop</value>
      <value compset="HIST.*_CLM50%[^_]*BGC-CROP">-bgc bgc -crop -irrig .true.</value>
      <value compset="2000.*_CLM50%[^_]*BGC-CROP">-bgc bgc -crop -irrig .true.</value>
      <value compset="_CLM50%[^_]*BGCDV-CROP"	 >-bgc bgc -dynamic_vegetation -crop</value>
      <value compset="_CLM50%[^_]*FATES"         >-bgc fates -no-megan</value>
  1. the driver's config_component.xml files - particularly config_component_cesm.xml

Example: CCSM_BGC

      <value compset="_CAM">CO2A</value>
      <value compset="_DATM">none</value>
      <value compset="_DATM%CPLHIST.+POP\d">CO2A</value>
      <value compset="HIST.*_DATM.*_CLM">CO2A</value>
      <value compset="RCP.*_DATM.*_CLM">CO2A</value>
      <value compset="_BGC%BPRP">CO2C</value>
      <value compset="_BGC%BDRD">CO2C</value>

Example: ATM_NCPL

      <value compset="_CAM.*">48</value>
      <value compset="_CAM\d+%WCBC">144</value>
      <value compset="_CAM\d+%WCMX">288</value>
      <value compset="_CAM\d+%WCXI">288</value>
      <value compset="_CAM%ADIAB">48</value>
      <value compset="_CAM%IDEAL">48</value>
      <value compset="_DATM%COPYALL_NPS">72</value>
      <value compset="_DATM.*_CLM">48</value>
      <value compset="_DATM.*_DICE.*_POP2">4</value>
      <value compset="_DATM.*_SLND.*_CICE.*_POP2">24</value>
      <value compset="_DATM.*_CICE.*_DOCN">24</value>
      <value compset="_DATM.*_DOCN%US20">24</value>
      <value compset="_DATM%CPLHIST.+POP\d">48</value>
      <value compset="_MPAS">1</value>

Example: OCN_NCPL (note likely bugginess: it references CLM4, but not CLM5!)

      <value compset="_POP2">1</value>
      <value compset="_POP2" grid="oi%tx0.1v2">4</value>
      <value compset="_POP2" grid="oi%gx1v6">24</value>
      <value compset="_POP2" grid="oi%gx1v7">24</value>
      <value compset="_DATM.*_CLM4.*_SICE.*_SOCN">1</value>
      <value compset="_DATM%NYF.*_SLND.*_DICE.*_DOCN.*_SWAV">1</value>
      <value compset="_DATM%NYF.*_DLND.*_DICE.*_DOCN.*_DWAV">1</value>
      <value compset="_XATM.*_XLND.*_XICE.*_XOCN">1</value>
      <value compset="_DATM.*_CLM4.*_SICE.*_SOCN">1</value>
      <value compset="_SATM.*_SLND.*_SICE.*_SOCN">1</value>
      <value compset="_DLND.*_CISM\d">1</value>
  1. config_pes.xml

Example:

      <pes pesize="any" compset="CAM.+CLM.+CICE.+POP.+SWAV">
  1. config_grids.xml

Example:

    <model_grid alias="1x1_mexicocityMEX" compset="DATM.+CLM">

The way I could see to avoid buggy and hard-to-maintain regular expressions in all of these is to separate the compset into its various pieces: atm_model, atm_opts (and probably a separate atm_version (40, 50, 60, etc.)), lnd_model, lnd_opts (and probably a separate lnd_version (40, 45, 50, 55, etc.) - etc. Also, possibly having the more general variables atm_type, lnd_type, etc., which can be active, stub, data, coupler_test.

I have to run now, but if people want me to write out what each of my examples would look like with this proposed separation, I can do so.

@billsacks
Copy link
Member

billsacks commented Jun 12, 2017

@jedwards4b @gold2718 @mvertens - Actually, I'm feeling overwhelmed about everything I'm supposed to get done this week, so I'm hesitant to take more time to think through this. So I'd propose that you move ahead with what you have laid out for the <desc> entry, because I'm fine with all of that. Then we can revisit some of these other issues post-cesm2 release: let's not take more time to discuss them now. From what I've seen, the proposal will at least be moving things in the right direction for what I see in terms of long-term needs.

@jedwards4b
Copy link
Contributor Author

Cam description updated to the current proposal.

 <description>
    <!-- CAM4 physics -->
    <desc atm="CAM40[%RCO2]"             >CAM4 physics </desc>
    <desc option="RCO2">with CO2 ramp [%RCO2]</desc>
    <desc option="TMOZ">with tropospheric chemistry with bulk aerosols [%TMOZ]</desc>

    <desc atm="CAM40[%TMOZ][%RCO2]"      >CAM4 physics: </desc>
    <desc atm="CAM40%WXIE[%RCO2]"        >CAM4 physics:
       with WACCM-X enhanced ionosphere, transport, and electrodynamics
     </desc>
    <!-- CAM4 Single Column -->
    <desc atm="AR95_CAM40%SCAM[%RCO2]" >Single column CAM ARM95 IOP test case:
    </desc>
    <desc atm="AR97_CAM40%SCAM[%RCO2]" >Single column CAM ARM97 IOP test case:
    </desc>
    <!-- CAM5 physics -->
    <desc atm="CAM50[%RCO2]"             >CAM5 physics:
    </desc>
    <desc option="CLB"> CLUBB microphysics </desc>
    <desc atm="CAM50[%CLB][%RCO2]"       >CAM5 physics:
    </desc>
    <desc atm="CAM50[%PM][%RCO2]"        >CAM5 physics: </desc>
    <desc option="PM"> with prescribed modal aerosols [%PM] </desc>
    <desc atm="CAM50[%WCCM][%RCO2]"      >CAM5 physics:    </desc>
    <desc option="WCCM">with  middle atmosphere chemistry (WACCM) [%WCCM]</desc>
    <!-- CAM5 Single Column -->
    <desc atm="AR95_CAM50%SCAM[%RCO2]" >Single column CAM ARM95 IOP test case:
    </desc>
    <desc atm="AR97_CAM50%SCAM[%RCO2]" >Single column CAM ARM97 IOP test case:
    </desc>
    <!-- CAM6 physics -->
    <desc atm="CAM60[%RCO2]"             >CAM6 physics:
    </desc>
    <desc atm="CAM60%WCCM[%RCO2]"        >CAM6 physics:
    with middle atmosphere chemistry (WACCM) [%WCCM]
    </desc>
    <desc atm="CAM60%WCSC[%RCO2]"        >WACCM with specified chemistry:
    </desc>
    <desc atm="CAM60%WCTS[%RCO2]"        >WACCM with tropospheric, stratospheric, mesospheric, and lower thermospheric chemistry:
    </desc>
    <desc atm="CAM60%CCTS[%RCO2]"        >CAM-Chem troposphere / stratosphere chemistry with simplified volatility basis set SOA scheme and modal aersols :
    </desc>
    <!-- SPCAM -->
    <desc atm="CAM%SPCAMS[%RCO2]"        >Super-parameterized CAM with one moment SAM microphysics 
    </desc>
    <desc atm="CAM%SPCAMCLBS[%RCO2]"     >Super-parameterized CAM with one moment SAM microphysics using CLUBB 
    </desc>
    <desc atm="CAM%SPCAMM[%RCO2]"        >Super-parameterized CAM with double moment m2005 SAM microphysics 
    </desc>
    <desc atm="CAM%SPCAMCLBM[%RCO2]"     >Super-parameterized CAM with double moment m2005 SAM microphysics using CLUBB 
    </desc>
    <!-- Specified dynamics -->
    <desc atm="SDYN_CAM[%RCO2]"          >CAM with winds and temperature nudged towards prescribed meteorology:
    </desc>
    <!-- Simple physics compsets -->
    <desc atm="CAM%ADIAB"                >CAM with adiabatic physics:</desc>
    <desc atm="CAM%DABIP04"              >CAM with dry adiabatic baroclinic instability (Polvani 2004):</desc>
    <desc atm="CAM%HS94"                 >CAM with Held-Suarez physics:</desc>
    <desc atm="CAM%DCTBM"                >CAM dynamical core test with baroclinic wave IC and terminator chemistry:</desc>
    <desc atm="CAM%KESSLER"              >CAM dynamical core test with baroclinic wave IC and Kessler physics:</desc>
    <desc atm="CAM%PORT"                 >CAM Parallel Offline Radiation Tool:</desc>
  </description>

@gold2718
Copy link

@billsacks, without commentary or alternative proposals, it is hard to understand what issues you have with the current sections in config_componont.xml. All I can do now is move forward with the <description> section but I do not think there will be time to revisit this for quite a while given all the competing post-CESM2 priorities.

@gold2718
Copy link

@jedwards4b, Here is an attempt for a hybrid solution. I also moved the option descriptions to their own section to make them easier to find.

<description>
    <!-- CAM4 physics -->
    <desc atm_name="CAM40" atm_opts=[%RCO2]"             >CAM4 physics </desc>
    <desc atm_name="CAM40" atm_opts="[%TMOZ][%RCO2]"      >CAM4 physics: </desc>
    <desc atm_name="CAM40%WXIE" atm_opts="[%RCO2]"        >CAM4 physics: with WACCM-X enhanced ionosphere, transport, and electrodynamics</desc>
    <!-- CAM4 Single Column -->
    <desc atm_name="AR95_CAM40%SCAM" atm_opts="[%RCO2]" >Single column CAM ARM95 IOP test case</desc>
    <desc atm_name="AR97_CAM40%SCAM" atm_opts="[%RCO2]" >Single column CAM ARM97 IOP test case</desc>
    <!-- CAM5 physics -->
    <desc atm_name="CAM50" atm_opts="[%RCO2]"             >CAM5 physics:</desc>
    <desc atm_name="CAM50" atm_opts="[%CLB][%RCO2]"       >CAM5 physics:</desc>
    <desc atm_name="CAM50" atm_opts="[%PM][%RCO2]"        >CAM5 physics: </desc>
    <desc atm_name="CAM50" atm_opts="[%WCCM][%RCO2]"      >CAM5 physics:    </desc>
    <!-- CAM5 Single Column -->
    <desc atm_name="AR95_CAM50" atm_opts="%SCAM[%RCO2]" >Single column CAM ARM95 IOP test case:</desc>
    <desc atm_name="AR97_CAM50%SCAM" atm_opts="[%RCO2]" >Single column CAM ARM97 IOP test case:</desc>
    <!-- CAM6 physics -->
    <desc atm_name="CAM60" atm_opts="[%RCO2]"             >CAM6 physics:</desc>
    <desc atm_name="CAM60%WCCM" atm_opts="[%RCO2]"        >CAM6 physics with middle atmosphere chemistry (WACCM)</desc>
    <desc atm_name="CAM60%WCSC" atm_opts="[%RCO2]"        >WACCM with specified chemistry:</desc>
    <desc atm_name="CAM60%WCTS" atm_opts="[%RCO2]"        >WACCM with tropospheric, stratospheric, mesospheric, and lower thermospheric chemistry:</desc>
    <desc atm_name="CAM60%CCTS" atm_opts="[%RCO2]"        >CAM-Chem troposphere / stratosphere chemistry with simplified volatility basis set SOA scheme and modal aerosols :</desc>
    <!-- SPCAM -->
    <desc atm_name="CAM%SPCAMS" atm_opts="[%RCO2]"        >Super-parameterized CAM with one moment SAM microphysics:</desc>
    <desc atm_name="CAM%SPCAMCLBS" atm_opts="[%RCO2]"     >Super-parameterized CAM with one moment SAM microphysics using CLUBB:</desc>
    <desc atm_name="CAM%SPCAMM" atm_opts="[%RCO2]"        >Super-parameterized CAM with double moment m2005 SAM microphysics:</desc>
    <desc atm_name="CAM%SPCAMCLBM" atm_opts="[%RCO2]"     >Super-parameterized CAM with double moment m2005 SAM microphysics using CLUBB:</desc>
    <!-- Specified dynamics -->
    <desc atm_name="SDYN_CAM" atm_opts="[%RCO2]"          >CAM with winds and temperature nudged towards prescribed meteorology:</desc>
    <!-- Simple physics compsets -->
    <desc atm_name="CAM%ADIAB"                >CAM with adiabatic physics:</desc>
    <desc atm_name="CAM%DABIP04"              >CAM with dry adiabatic baroclinic instability (Polvani 2004):</desc>
    <desc atm_name="CAM%HS94"                 >CAM with Held-Suarez physics:</desc>
    <desc atm_name="CAM%DCTBM"                >CAM dynamical core test with baroclinic wave IC and terminator chemistry:</desc>
    <desc atm_name="CAM%KESSLER"              >CAM dynamical core test with baroclinic wave IC and Kessler physics:</desc>
    <desc atm_name="CAM%PORT"                 >CAM Parallel Offline Radiation Tool:</desc>
    <!-- CAM Configuration Options -->
    <desc option="CLB"> CLUBB microphysics </desc>
    <desc option="PM"> with prescribed modal aerosols [%PM] </desc>
    <desc option="RCO2">with CO2 ramp [%RCO2]</desc>
    <desc option="TMOZ">with tropospheric chemistry with bulk aerosols [%TMOZ]</desc>
    <desc option="WCCM">with  middle atmosphere chemistry (WACCM) [%WCCM]</desc>

  </description>

@jedwards4b
Copy link
Contributor Author

@gold2718 I don't understand - I posted the compromise after talking with @billsacks, did you even look at it? With this compromise we don't need to separate atm_name and atm_opts.

@gold2718
Copy link

@jedwards4b, Yes, I looked at it and it works fine for CAM. I was exploring @billsacks' suggestion to separate out the options to allow a <desc> tag with clm_options but without clm_name. For the record, my proposal above has bugs and is not particularly satisfying to me but I wanted to have a look at it first.

@billsacks
Copy link
Member

@mvertens and @gold2718 asked me to provide some more details of my thinking, and to work through OCN_NCPL as an example.

There are two basic principles that are motivating me here:

  1. We should have a design that supports change - and makes it likely that, as aspects of the system change, the xml matches Just Work, rather than silently doing the wrong thing.

    Change can take a number of forms, all of which should be supported more easily than is the current case:

    a. Alternative components - e.g., a new land ice model (ISSM rather than CISM). This means that specific component names (e.g., "CISM") should only appear if a match is truly only meant to match that specific component.

    b. New component version numbers - e.g., CISM3. This means that the component version number should only appear if a match truly depends on that version number.

    c. New component types - e.g., we have recently introduced a wav model and an esp model. This means that matches should not depend on the number or ordering of component types in the compset long name.

    d. The exact specification of the compset long name. Currently, if we wanted to change the specification (the "syntax", so to speak) of the compset long name, we would need to change thousands of lines of xml. Wouldn't it be nice if the parsing of the compset long name syntax was isolated to a single python module, so that we were free to change this specification if additional needs arise in the future? So matches should not depend on the exact form of the compset long name, but rather should use individual pieces of the already-parsed long name. See below for more details.

  2. xml match rules should specify only what truly matters for that match rule, and should be as general as possible. e.g., if what matters for a rule is that we're running with an active land, a data atmosphere and a stub ocean, then the rule should state that explicitly. This makes it easier to understand what the rule is really all about, and also makes it more likely that things will Just Work as the system changes in the future (e.g., by adding an alternative land model).

How do we get there? In my ideal world, the scripts would parse the compset long name into each of its pieces, storing each piece in a separate variable that could be matched in xml files, without resorting to error-prone regular expressions in most cases. For each component (i.e., each piece of the compset long name that is separated by '_'), we can extract the following variables; here I use lnd as an example, but we'd have similar variables for atm, ocn, etc.:

  • lnd_name: Any number of alphabetic characters - e.g., "CLM"

  • lnd_version: Any number of numeric characters following the lnd_name - e.g., "50" (may be empty for some components)

  • lnd_opts: Any number of options following the '%'. I haven't thought through exactly how these should be stored and matched, but I think this ties in with the thinking @gold2718 has been describing.

  • lnd_type: One of 'active', 'data', 'stub' or 'coupler_test'. It's important to be able to match on lnd_type='active'. The other 3 are less important, because we're unlikely to rename DLND, SLND, XLND, but I'm including them for consistency.

Let's apply these principles to OCN_NCPL. For each line, I'll give the old, the proposed new, and some commentary. I know that the new is a bit more verbose than the old in these examples, but I think it's worth it for being more explicit and robust, and as I mention in my comments, I think many of these could be shortened.

Old:

      <value compset="_POP2">1</value>
      <value compset="_POP2" grid="oi%tx0.1v2">4</value>
      <value compset="_POP2" grid="oi%gx1v6">24</value>
      <value compset="_POP2" grid="oi%gx1v7">24</value>

New:

      <value ocn_name="POP">1</value>
      <value ocn_name="POP" grid="oi%tx0.1v2">4</value>
      <value ocn_name="POP" grid="oi%gx1v6">24</value>
      <value ocn_name="POP" grid="oi%gx1v7">24</value>

Comments: It seems that these are really just meant to apply to POP, and would differ for some other ocean model (e.g., MOM). So there's not much change here. The leading underscore is no longer needed, breaking dependence on the format of the compset long name. And I've gotten rid of the version number, because if there were ever a POP v. 3, the old version would give the default value for OCN_NCPL rather than matching these lines. (One could argue that these lines are meant to apply to any active ocean model, or that these are only supposed to apply to POP v. 2; either of these could be done, and now the meaning would be more explicit.)

Old:

      <value compset="_DATM.*_CLM4.*_SICE.*_SOCN">1</value>

New:

      <value atm_type="data" lnd_type="active" ice_type="stub" ocn_type="stub">1</value>

Comments: The important part here is getting rid of CLM4 from the match, since this should apply for any active land model (CLM5 or some alternative land model). I suspect that this rule could actually be shortened; e.g., maybe it really applies any time you're running with ocn_type="stub", because OCN_NCPL doesn't matter in that case???

Old:

      <value compset="_DATM%NYF.*_SLND.*_DICE.*_DOCN.*_SWAV">1</value>
      <value compset="_DATM%NYF.*_DLND.*_DICE.*_DOCN.*_DWAV">1</value>

New: ???

Comments: I'm not sure what the important thing here is. Is the important thing that we are running DATM with atm_opts="NYF"? Or is the important thing that we're running DOCN?

Old:

      <value compset="_XATM.*_XLND.*_XICE.*_XOCN">1</value>

New:

      <value atm_type="coupler_test" lnd_type="coupler_test" ice_type="coupler_test" ocn_type="coupler_test">1</value>

Comments: Again, I think this should be shortened. I suspect the real rule is to use a value of 1 whenever ocn_type="coupler_test".

Old:

      <value compset="_SATM.*_SLND.*_SICE.*_SOCN">1</value>

New:

      <value atm_type="stub" lnd_type="stub" ice_type="stub" ocn_type="stub">1</value>

Comments: Comments: Again, I think this should be shortened. I suspect the real rule is to use a value of 1 whenever ocn_type="stub".

Old:

      <value compset="_DLND.*_CISM\d">1</value>

New:

      <value atm_type="data" glc_type="active" ocn_type="stub">1</value>

Comments: A key piece is making this more general so it matches any active glc. I've added ocn_type="stub" here because I think that's actually important in this rule. This could perhaps be combined with a general rule that we use a value of 1 whenever we have ocn_type="stub".

Another good example is CLM_BLDNML_OPTS. Currently this looks like this:

      <value compset="_CLM45%[^_]*SP"		 >-bgc sp</value>
      <value compset="_CLM45%[^_]*CN"		 >-bgc cn</value>
      <value compset="_CLM45%[^_]*BGC"		 >-bgc bgc</value>
      <value compset="_CLM45%[^_]*CN-CROP"	 >-bgc cn -crop</value>
      <value compset="_CLM45%[^_]*BGC-CROP"	 >-bgc bgc -crop</value>
      <value compset="_CLM45%[^_]*CNDV"		 >-bgc cn -dynamic_vegetation</value>
      <value compset="_CLM45%[^_]*BGCDV"	 >-bgc bgc -dynamic_vegetation</value>
      <value compset="_CLM45%[^_]*CNDV-CROP"	 >-bgc cn -dynamic_vegetation -crop</value>
      <value compset="_CLM45%[^_]*BGCDV-CROP"	 >-bgc bgc -dynamic_vegetation -crop</value>
      <value compset="_CLM45%[^_]*FATES"	 >-bgc fates -no-megan</value>

      <value compset="_CLM50%[^_]*SP"		 >-bgc sp</value>
      <value compset="_CLM50%[^_]*BGC"		 >-bgc bgc</value>
      <value compset="_CLM50%[^_]*BGCDV"	 >-bgc bgc -dynamic_vegetation</value>
      <value compset="_CLM50%[^_]*BGC-CROP"	 >-bgc bgc -crop</value>
      <value compset="HIST.*_CLM50%[^_]*BGC-CROP">-bgc bgc -crop -irrig .true.</value>
      <value compset="2000.*_CLM50%[^_]*BGC-CROP">-bgc bgc -crop -irrig .true.</value>
      <value compset="_CLM50%[^_]*BGCDV-CROP"	 >-bgc bgc -dynamic_vegetation -crop</value>
      <value compset="_CLM50%[^_]*FATES"         >-bgc fates -no-megan</value>

The problem here is that every option needs to be listed for CLM45 and CLM50. Note that, currently the CLM50 list is incomplete (buggy!). As new CLM versions are added, these lists will need to be duplicated over and over. As new compset options are added, they will need to be added to the list for each CLM version (assuming that we continue down our current path, where new options can be applied to old version packages). In contrast, under my proposal, the duplication would be removed, becoming something like this (only shown for the first few items, but the rest follow this same pattern):

      <value lnd_opts="SP"		 >-bgc sp</value>
      <value lnd_opts="CN"		 >-bgc cn</value>
      <value lnd_opts="BGC"		 >-bgc bgc</value>
      <value lnd_opts="CN-CROP"	 >-bgc cn -crop</value>

If certain options only applied to particular versions, this could be accomplished by adding a lnd_version attribute. (However, at least for CLM, it's probably most robust to do this consistency check in build-namelist, since a general rule would look like, "applies to versions >= X (and possibly <= Y)", which is currently impossible to express concisely with xml attributes, and challenging to express correctly even with regular expressions.)

@gold2718
Copy link

@billsacks, my short answer to your post is that I do not think any of this impacts the proposal for the change to the <description> section. Below are a few specific comments:

1c) This is a CESM3 issue and compset names will adapt to include new component types along with the elimination of stubs. However, the new syntax must include the ability to separate out by component type so this type of change is a non issue.

1d) I'm sure the compset attribute will be deprecated and I even propose that we consider it deprecated now so its use can be phased out over time. I do not see any impact on the proposed <description> section format change.

  1. IMHO, this type of matching should be done in the <compset> tags in config_compsets.xml à la config_grids.xml and the compset attribute should be replaced with component-type attributes (e.g., ocn) in both places (post CESM2).

I do not think your proposed <comp>_type attribute (e.g., ocn_type) improves the robustness of the system. For instance, a simple variant of your example:

 <value atm_type="data" lnd_type="active" ice_type="stub" ocn_type="active">1</value>

would apply to both POP and MOM even though it may only be valid for POP. This is analogous to your issue of falling through to to a default because you did not specify any values for a new model or model version. Regardless of syntax, overriding the default behavior of a system requires a new configuration entry and classification systems (e.g., 'active' vs. 'data' or regular expressions) are prone to 'pattern rot' (my new technical term of the day) in that a classification that works now, may break when new pattern possibilities are introduced to the system (e.g., new models, model types, or model versions). I suggest we strive for specific default overrides to avoid 'pattern rot'.

I think your examples with stubs are irrelevant since stubs will be going away post-CESM2 (the NUOPC-based system does not require stubs).

@billsacks
Copy link
Member

@gold2718 - thanks for your reply here. I think we're largely in agreement about (1) the need to move away from using the compset attribute, and (2) that all of this can be addressed post-cesm2, on the road to cesm3.

But I will continue to strongly argue for the need for a generic lnd_type (etc.), or at least being able to match something like a boolean lnd_prognostic. Currently, introducing a new component requires searching all xml files throughout the system for occurrences of the old component's name. Missing an entry can lead the system to silently misbehave in bad ways. Certainly it's still possible to write the wrong thing with my proposal. I'm not claiming that it's foolproof. I'm just claiming that, currently, it is impossible to write the right thing in some cases: In some cases (the driver in particular), behavior depends on the mix of active vs. data/stub components - not on the particular components (POP vs. MOM vs. MPAS-O, etc.) that you're running with. All I'm saying is that I want the ability to express logic like that when that's most appropriate.

@jedwards4b recently added some code along these lines in case.py: _find_primary_component – i.e., determining the mix of prognostic components. I'm just asking for this information to be brought to a higher level and made available in the xml matches so that it can be used when needed.

Again, this is a post-cesm2 request, and one that I'd be happy to help implement when the time comes. I'm just bringing it up now because I was asked :-)

@gold2718
Copy link

I'm glad we are postponing any decision on new attributes (beyond the component-type attribute we are adding for <desc> tags). For instance, making such decisions based on the driver code makes no sense given that we will be starting over with an entirely new driver architecture which has a completely different way of dealing with components. In particular, the current distinction between data and active components is likely to diminish if not disappear altogether.

If searching through XML files for places where new entries need to be made is error prone, it sounds like a good place for a test. In fact, requiring entries for every supported model (e.g., DOCN, POP, MOM, MPAS-O) and avoiding default entries makes all of this easier and less error prone.

@rljacob
Copy link
Member

rljacob commented Jun 13, 2017

Is the result of this discussion going to require changes to existing component xml files?

@mvertens
Copy link
Contributor

mvertens commented Jun 13, 2017 via email

@ghost ghost assigned rljacob Jun 26, 2017
@ghost ghost added the in progress label Jun 26, 2017
@rljacob rljacob assigned jedwards4b and unassigned rljacob Jun 27, 2017
@ghost ghost removed the in progress label Jun 28, 2017
jedwards4b added a commit that referenced this issue Jun 28, 2017
Description field error check of user compsets.
This PR introduces version 3.0 schema for entry_id files bringing the description field to the top of the file and refining it to better define valid compsetname values. Change definition of primary component as suggested in PR 1692.

Test suite: scripts_regression_tests.py, several F case user compsets with modified cam config_component.xml
Test baseline:
Test namelist changes:
Test status: bit for bit
Fixes #1522
Fixes #1683

User interface changes?:

Update gh-pages html (Y/N)?:

Code review: @gold2718 @billsacks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

6 participants