Add table schema validator #125
base: main
Conversation
I'm not super clear on how this will be used in practice, or how well it maps onto the Table Schema you are following, but I've given a general review of the Pydantic models. Overall it looks really sensible; I've just raised a few nit-picky details that are mostly around maintainability and docs.
constraints: ConstraintsValidator = Field(
    ConstraintsValidator(), description="A dictionary of constraints for the field."
)
I think this is fine, but just be aware that setting the default like this means that a `ConstraintsValidator` with no fields is built, and all the Pydantic validation is run on it, when this module is first imported. If a null value for `constraints` must be a `ConstraintsValidator`, then keep it like this; otherwise, it's probably cleaner to just set it to `None`.
There's no reason, I think, for `constraints` to be a `ConstraintsValidator`. It can be `None` if no constraints are provided, but if they are, they must be validated by the `ConstraintsValidator`.
In that case, I think this is cleaner because it has fewer side effects:
Suggested change:

constraints: ConstraintsValidator | None = Field(
    None, description="A dictionary of constraints for the field."
)
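As a side note, here is a minimal sketch of the difference being discussed, assuming Pydantic v2 and Python ≥ 3.10; the `FieldValidator` stand-in and its fields are illustrative, not the PR's actual models:

```python
from pydantic import BaseModel, Field


class ConstraintsValidator(BaseModel):
    """Illustrative stand-in for the real constraints model."""

    required: bool | None = None
    unique: bool | None = None


class FieldValidator(BaseModel):
    """Illustrative field model with an optional constraints entry."""

    name: str
    # None by default: no ConstraintsValidator is built at import time,
    # but any value that *is* provided still gets validated.
    constraints: ConstraintsValidator | None = Field(
        None, description="A dictionary of constraints for the field."
    )


# A dict is coerced into (and validated as) a ConstraintsValidator...
field = FieldValidator(name="x", constraints={"required": True})
assert isinstance(field.constraints, ConstraintsValidator)

# ...while omitting it simply leaves None, with no validation work done.
assert FieldValidator(name="y").constraints is None
```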
kwargs["exclude_unset"] = True | ||
kwargs["by_alias"] = True |
Overwriting `kwargs` like this feels wrong. It means someone who is familiar with Pydantic might want to try `validator.model_dump(exclude_unset=False)` and not get the result they're expecting. I think it's better to either rename this method so the regular `model_dump` is still available, or include something like:

if "exclude_unset" not in kwargs:
    kwargs["exclude_unset"] = True
if "by_alias" not in kwargs:
    kwargs["by_alias"] = True
Good point. What about the following, instead, which feels more concise:

kwargs["exclude_unset"] = kwargs.get("exclude_unset", True)
kwargs["by_alias"] = kwargs.get("by_alias", True)
Seems sensible!
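For reference, a sketch of how the agreed-upon behaviour could look in the overridden `model_dump`. The class name is hypothetical, and `dict.setdefault` is used here as an equivalent, slightly shorter form of the `kwargs.get` pattern above:

```python
from typing import Any

from pydantic import BaseModel


class CompactDumpModel(BaseModel):  # hypothetical base class name
    def model_dump(self, **kwargs: Any) -> dict[str, Any]:
        """Dump the model, defaulting to a compact, spec-style output.

        Callers can still pass exclude_unset=False or by_alias=False
        explicitly to get the regular Pydantic behaviour back.
        """
        kwargs.setdefault("exclude_unset", True)
        kwargs.setdefault("by_alias", True)
        return super().model_dump(**kwargs)
```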
kwargs["exclude_unset"] = True | ||
kwargs["by_alias"] = True |
Same here
decimalChar: str | None = Field(
    None,
    description="The character used to separate the integer and fractional. "
    + "If None, '.' is used.",
)
This description claims that `"."` is the default, but I do not see where that is set; it looks to me like the default is `None`. This same pattern/comment combination is repeated multiple times in other validators, so it should be checked.
Yeah, I know. It is tricky. The point is that if this value is not provided, the default should be used - that's why it is the default! - but if I dump the model again to save it in the CSV file, I do not want all of those default values to be there, because that would make the schema huge. And they are the defaults of the Table Schema specification, so anyone implementing the schema should know them.

In summary, what I want is that if I read the csvy file and then save it again without changes, the resulting schema declared in the header is the same, without extra information.

Any suggestion on how to tackle this better is most welcome!
Hmm, I guess that means the default should be used in the validation but not saved to the object that is printed as the schema. I'm not sure how the schema is being used later on (i.e. for printing) though, so I can't really say much more than that.
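One possible way to square that circle, sketched under the assumption that the spec default only matters when the value is read: keep `None` as the stored default so dumps stay compact, and expose the resolved value through a small helper. The names here are illustrative and not part of this PR:

```python
from pydantic import BaseModel, Field


class NumberFieldValidator(BaseModel):
    """Illustrative validator with a spec-defaulted formatting option."""

    decimalChar: str | None = Field(
        None,
        description="The character used to separate the integer and fractional "
        "parts. The Table Schema default of '.' applies when not provided.",
    )

    @property
    def effective_decimal_char(self) -> str:
        """Spec default applied at read time, without being stored or dumped."""
        return self.decimalChar if self.decimalChar is not None else "."


number_field = NumberFieldValidator()
assert number_field.effective_decimal_char == "."
# The round-trip stays minimal: unset fields are not written back out.
assert number_field.model_dump(exclude_unset=True) == {}
```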
"""Validator for the Table Schema in the CSVY file. | ||
|
||
This class is used to validate the Table Schema in the CSVY file. It is based on the | ||
schema defined in the Table Schema specification. |
I think we should link to the Table Schema website somewhere, and it seems like this might be the best place for it.
Suggested change:

schema defined in the [Table Schema specification](https://specs.frictionlessdata.io/table-schema/#language).
Do markdown-style links like this work in mkdocs @AdrianDAlessandro? If so, I'm in favour 😄
I think they should. Having said that, we do not have documentation yet (see #170, in case you have free time 😆 ), so we cannot test that.
In any case, I've added it, because it is true it should be mentioned somewhere.
> Do markdown-style links like this work in mkdocs @AdrianDAlessandro? If so, I'm in favour 😄

I'm 99% sure they do. But I think including the link this way for now is best, and it can be checked later once the docs are built.
LGTM. I agree with what @AdrianDAlessandro has said and have a few small comments of my own.

There is also the frictionless framework, which can deal with table schemas. It seems like it has a very broad scope, but I'm wondering if they provide a package for validating against table schemas which we could reuse here instead? The downside is that it may involve adding a very big dependency to the project.
@@ -127,7 +41,7 @@ class CSVDialectValidator(BaseModel):

     delimiter: str = Field(default=",")
     doublequote: bool = Field(default=True)
-    escapechar: Optional[str] = Field(default=None)
+    escapechar: str | None = Field(default=None)
Might be clearer to explicitly put the default value here, e.g.:
escapechar: str | None = Field(default=None) | |
escapechar: str = Field(default="\\") |
Actually, according to the specification, the default is not set: https://specs.frictionlessdata.io/csv-dialect/#specification
And now that I check the specification, I'm missing several fields... 😢
def register_validator(
    name: str, overwrite: bool = False
) -> Callable[[type[BaseModel]], type[BaseModel]]:
I like the `decorator` package for these situations. It also fixes up the type hints for decorated functions, which can otherwise be an issue. I appreciate you may not want to add another dependency though!
I've no problem with adding new dependencies, but I'm not convinced it adds much value in this case for just one, very simple decorator that just registers something and spits out the same input.
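For context, a registry decorator of this shape is usually only a few lines. Here is a sketch of the general pattern; the registry dict name and the duplicate-name error are assumptions, not necessarily what this PR does:

```python
from collections.abc import Callable

from pydantic import BaseModel

VALIDATORS_REGISTRY: dict[str, type[BaseModel]] = {}  # assumed registry name


def register_validator(
    name: str, overwrite: bool = False
) -> Callable[[type[BaseModel]], type[BaseModel]]:
    """Register a validator class under `name` and return it unchanged."""

    def decorator(cls: type[BaseModel]) -> type[BaseModel]:
        if not overwrite and name in VALIDATORS_REGISTRY:
            raise ValueError(f"A validator named {name!r} is already registered.")
        VALIDATORS_REGISTRY[name] = cls
        return cls

    return decorator
```

Since the decorator returns the class untouched, type checkers still see the original class, so the main benefit of the `decorator` package (preserving signatures of wrapped callables) would not really apply here.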
"""

type_: Literal[
I'm guessing the underscore is just because you can't have a field named `type` in Python? If so, I suppose you could also call it `kind` instead.
I could, but I wanted to keep the names as close as possible to the names used in the field descriptors definition: https://specs.frictionlessdata.io/table-schema/#field-descriptors
"""

type_: Literal[TypeEnum.STRING] = Field(
Ditto
Ditto me too 😆
groupChar: str | None = Field(
    None, description="The character used to separate groups of thousands."
)
bareNumber: bool | None = Field(
Why the camelCase here? Are these the names used in the spec?
Exactly. Probably I should make these aliases and use standard snake case here, to be honest, but I just thought it would be best to keep the names of the descriptors as close as possible to the specification to avoid confusion. Not sure. What do you think?
bare_number = Field(None, alias="bareNumber", description="...")
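If the alias route were taken, here is a sketch of how it might look with Pydantic v2; `NumberConstraints` and its description are illustrative names, not the PR's actual model:

```python
from pydantic import BaseModel, ConfigDict, Field


class NumberConstraints(BaseModel):
    """Illustrative model using snake_case names with spec-style aliases."""

    # Accept the spec's camelCase key on input and emit it again on dump,
    # while allowing snake_case access from Python code.
    model_config = ConfigDict(populate_by_name=True)

    bare_number: bool | None = Field(
        None,
        alias="bareNumber",
        description="Whether the number is written without extra symbols.",
    )


parsed = NumberConstraints.model_validate({"bareNumber": False})
assert parsed.bare_number is False
assert parsed.model_dump(by_alias=True, exclude_unset=True) == {"bareNumber": False}
```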
"""Validator for the Table Schema in the CSVY file. | ||
|
||
This class is used to validate the Table Schema in the CSVY file. It is based on the | ||
schema defined in the Table Schema specification. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do markdown-style links like this work in mkdocs @AdrianDAlessandro? If so, I'm in favour 😄
dumped = validator.model_dump()
assert dumped["name"] == "test_column"
assert dumped["title"] == "Test Column"
assert dumped["example"] == "example_value"
assert dumped["description"] == "This is a test column."
assert dumped["constraints"]["required"] is True
assert dumped["constraints"]["unique"] is True
I think the way you've written these tests is very clear, but just to note that if `dump_model` is changed to add extra values, these tests won't catch it. Instead you could write:

assert dumped == {"name": "test_column", # etc.

Same for the others.
I'm not sure what you mean by "add extra values". In any case, what would be the problem with those extra values? As long as the ones that must be there, are there, all is good, right?
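To make the suggestion concrete, the stricter assertion might look something like this; the expected dict is illustrative and assumes `validator` is the object already built in the existing test:

```python
def test_model_dump_is_exact(validator) -> None:  # `validator` from the existing test setup
    """Comparing against the whole dict also flags unexpected extra keys."""
    assert validator.model_dump() == {
        "name": "test_column",
        "title": "Test Column",
        "example": "example_value",
        "description": "This is a test column.",
        "constraints": {"required": True, "unique": True},
    }
```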
Many thanks both for these thorough reviews! I'll address/answer your comments as soon as possible.
@AdrianDAlessandro, @alexdewar I've answered all your questions, I think, and asked for your opinion in a couple of places. Please have a look (no rush) and let me know what you think.

@AdrianDAlessandro, unless I'm missing something, this should follow the specification exactly.

@alexdewar, that's a good point. To be honest, I had forgotten about `frictionless`. Now, this brings another point: possibly, I could have pulled the specifications (both the CSV Dialect one and the Table Schema) from `frictionless`.
Adds a table schema validator for the header. This validator checks that, if there is a `schema` keyword in the header, it follows the standard defined in the Table Schema specification (mostly). This is done via Pydantic models, some of them nested, and some of them discriminated based on the field `type_`, like it is done in PyECN.

This PR only implements the validation of the schema if present in the header of the file. It does not enforce it or do anything with it. That is a job for future PRs, as it will need to be adapted to each reader/writer.
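The "discriminated based on the field `type_`" part refers to Pydantic's discriminated unions. A minimal sketch of the pattern follows; the class and field names below are illustrative, not the PR's actual models, which will differ, for example in how `type_` maps onto the spec's `type` key:

```python
from typing import Annotated, Literal, Union

from pydantic import BaseModel, Field


class StringFieldSketch(BaseModel):
    type_: Literal["string"]
    maxLength: int | None = None


class NumberFieldSketch(BaseModel):
    type_: Literal["number"]
    decimalChar: str | None = None


# Pydantic dispatches to the right model based on the discriminator value.
AnyFieldSketch = Annotated[
    Union[StringFieldSketch, NumberFieldSketch], Field(discriminator="type_")
]


class TableSchemaSketch(BaseModel):
    fields: list[AnyFieldSketch] = Field(default_factory=list)


schema = TableSchemaSketch.model_validate(
    {"fields": [{"type_": "string"}, {"type_": "number", "decimalChar": ","}]}
)
assert isinstance(schema.fields[1], NumberFieldSketch)
```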
As the `validation.py` module was becoming pretty big, I have split it into several submodules within a `validation` package, and done the same with the tests. So the changes look more massive than they actually are. Just focus on the following two files:

- `csvy/validators/table_schema.py`
- `tests/validators/tests_table_schema.py`

This PR incorporates the changes made in #169, dropping support for Python 3.9 and adding support for 3.13. Ignore those changes, or review that specific, already merged, PR.
Needless to say, there's no urgency on reviewing this.
Closes #3