Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Check for illegal characters in codelists #418

Merged
merged 10 commits into from
Dec 16, 2024

Conversation

dc-almeida
Copy link
Collaborator

@dc-almeida dc-almeida commented Oct 16, 2024

Closes #401, updates #411

The way it's implemented, it does not check external repos for illegal characters, thus being specific to the repo where the config is defined.
This was tested previously with common-definitions, as some variables contain possibly problematic characters (see here).

By checking if each code has a populated repository attribute, we identify and skip the external repo codes from the code lists.
I think the Pydantic validation of the illegal_characters field is straightforward, but can add a test for it if deemed necessary.

Caveat is that the stray tag check was previously a Pydantic validation, and now is (must) be called explicitly. If a better suggestion comes up, happy to implement it!

@dc-almeida dc-almeida self-assigned this Oct 16, 2024
@dc-almeida dc-almeida added the enhancement New feature or request label Oct 16, 2024
@dc-almeida dc-almeida marked this pull request as ready for review October 16, 2024 17:20
Copy link
Contributor

@phackstock phackstock left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR @dc-almeida, couple of comments below.
Some more general questions from my side:

  1. Should we extend the illegal characters to all attributes or just the name of a code? The reason I'm asking is that I think for the definition of a variable, it might be useful to use punctuation such as :, ; or even quotes.
  2. Should the illegal characters be project specific, i.e. be part of the nomenclature config? In my opinion, the point of nomenclature is to establish standards for variable names, etc... so having project specific illegal characters defeats that point.
  3. I see the point of limiting the illegal character check for only the current repo. On the other hand, it would still affect the codelists so I'd vote for being more strict and search everything that's used in a project, including external repositories.

nomenclature/codelist.py Outdated Show resolved Hide resolved
nomenclature/codelist.py Outdated Show resolved Hide resolved
nomenclature/codelist.py Show resolved Hide resolved
nomenclature/codelist.py Outdated Show resolved Hide resolved
nomenclature/codelist.py Outdated Show resolved Hide resolved
nomenclature/config.py Outdated Show resolved Hide resolved
@danielhuppmann
Copy link
Member

FYI, the idea for this PR came from @orichters via IAMconsortium/common-definitions#138. And the current implementation with making the illegal-characters project/repo-specific was to avoid having to go through a dozen legacy projects and manually cleaning up stuff...

@phackstock
Copy link
Contributor

@danielhuppmann, ah thanks for the clarification. This would mean though that we'd have to copy-paste these from now on standard illegal characters into every nomenclature.yaml, which is also not ideal.

@phackstock
Copy link
Contributor

We could flip it around, fix the illegal characters and introduce a ignore illegal characters = False/True (False being the default) switch. We would set this to True for all the legacy repos. This way going forward, we'd have a consistent style and don't have to keep on copy-pasting the same illegal characters to every repository.

@danielhuppmann
Copy link
Member

We could flip it around, fix the illegal characters and introduce a ignore illegal characters = False/True (False being the default) switch.

I like that idea but I’d again flip it around so that we can also set a specific list of characters (maybe we want to make $ illegal later).

So an argument “check-illegal-characters” that is True by default, can be set to False, or take a list of characters.

@phackstock
Copy link
Contributor

phackstock commented Oct 17, 2024

So an argument “check-illegal-characters” that is True by default, can be set to False, or take a list of characters.

I'd split that into two attributes check-illegal-characters: bool and illegal-characters: list[str] = [].
Variables that can take on many different data types and functions usually make for ugly code that's hard to implement and hard to read.

@danielhuppmann
Copy link
Member

I'd split that into two attributes check-illegal-characters: bool and illegal-characters: list[str] = [].
Variables that can take on many different data types and functions usually make for ugly code that's hard to implement and hard to read.

Fair enough. Sounds ok, @dc-almeida?

@phackstock
Copy link
Contributor

In addition, I would set the default illegal-characters: list[str] = ["'", ":", ";"] so that we don't need to explicitly copy over the illegal characters to every new config file.

@phackstock phackstock self-requested a review November 14, 2024 11:31
Copy link
Contributor

@phackstock phackstock left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Two smaller changes, using ErrorCollector and using model_dump to check the Code attributes, then good to merge from my side.

Copy link
Contributor

@phackstock phackstock left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the quick changes @dc-almeida. Good to be merged from my side now.

Copy link
Member

@danielhuppmann danielhuppmann left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good to me, thanks @dc-almeida! Just please add empty lines at the end of the new yaml files for the tests.

Before merging, can you please go through all workflow-repos in production and add the line illegal_characters: [ ] If necessary.

@dc-almeida dc-almeida merged commit 4e059f8 into IAMconsortium:main Dec 16, 2024
11 checks passed
@dc-almeida dc-almeida deleted the illegal-characters branch December 17, 2024 09:31
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Check for forbidden characters in attributes
3 participants