Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add table schema validator #125

Open
wants to merge 20 commits into
base: main
Choose a base branch
from
Open

Add table schema validator #125

wants to merge 20 commits into from

Conversation

dalonsoa
Copy link
Collaborator

@dalonsoa dalonsoa commented Oct 23, 2024

Adds a table schema validator for the header. This validator checks that, if there is a schema keyword in the header, it follows the standard defined in the Table Schema specification (mostly). This is done via pydantic models, some of them nested, and some of them discriminated based on the field type_, like it is done in PyECN.

This PRs only implements the validation of the schema if present in the header of the file. It does not enforce it or does anything with it. That is a job for future PRs, as it will need to be adapted to each reader/writer.

As the validation.py module was becoming pretty big, I have split it into several submodules within validation package, and the same with the tests. So the changes look more massive than they actually are. Just focus on the following two files:

  • csvy/validators/table_schema.py
  • tests/validators/tests_table_schema.py

This PR incorporate the changes made in #169, dropping support for python 3.9 and adding support for 3.13. Ignore those changes or review that specific, already merged, PR.

Needless to say, there's no urgency on reviewing this.

Closes #3

Base automatically changed from registry to develop October 24, 2024 09:11
@dalonsoa dalonsoa changed the base branch from develop to main December 17, 2024 12:58
@dalonsoa dalonsoa marked this pull request as ready for review December 19, 2024 14:16
@dalonsoa dalonsoa changed the title WIP: Table validator Add table schema validator Dec 19, 2024
Copy link
Contributor

@AdrianDAlessandro AdrianDAlessandro left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not super clear on how this will be used in practise, or how well it maps onto the Table Schema you are following. But I've given a general review of the Pydantic models. Overall it looks really sensible, I just have just raise a few nit-picky details that are mostly around maintainability and docs.

Comment on lines +148 to +150
constraints: ConstraintsValidator = Field(
ConstraintsValidator(), description="A dictionary of constraints for the field."
)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is fine, but just be aware that setting the default like this means that a ConstraintsValidator with no fields is built and all the pydantic validation is run on it when this module is first imported. If a null value for constraints must be a ConstraintsValidator then keep it like this, otherwise, it's probably cleaner to just set it to None

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's no reason, I think, for constraints to be a ConstraintsValidator. It can be None if no constraints are provided, but if they are, they must be validated by the ConstraintsValidator.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In that case, I think this is cleaner because it has fewer side-effects

Suggested change
constraints: ConstraintsValidator = Field(
ConstraintsValidator(), description="A dictionary of constraints for the field."
)
constraints: ConstraintsValidator | None = Field(
None, description="A dictionary of constraints for the field."
)

Comment on lines +165 to +166
kwargs["exclude_unset"] = True
kwargs["by_alias"] = True
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overwriting kwargs like this feels wrong. It means someone who is familiar with Pydantic might want to try validator.model_dump(exclude_unset=False) and not get the result they're expecting. I think it's better to either rename this method so the regular model_dump is still available, or include something like

Suggested change
kwargs["exclude_unset"] = True
kwargs["by_alias"] = True
if "exclude_unset" not in kwargs:
kwargs["exclude_unset"] = True
if "by_alias" not in kwargs:
kwargs["by_alias"] = True

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point. What about the following, instead, which feels more concise:

Suggested change
kwargs["exclude_unset"] = True
kwargs["by_alias"] = True
kwargs["exclude_unset"] = kwargs.get("exclude_unset", True)
kwargs["by_alias"] = kwargs.get("by_alias", True)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems sensible!

Comment on lines +447 to +448
kwargs["exclude_unset"] = True
kwargs["by_alias"] = True
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same here

Comment on lines +248 to +252
decimalChar: str | None = Field(
None,
description="The character used to separate the integer and fractional. "
+ "If None, '.' is used.",
)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This description claims that "." is the default, but I do not see where that is set. It looks like the default is set to None to me.

This same pattern/comment combination is repeated multiple times in other validators. So should be checked.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I know. It is tricky. The point is that if this value is not provided, the default should be used - that's why is the default! - but if I dump the model again to save in the CSV file, I do not want all of those default values to be there because it will make the schema huge. And they are the defaults of the Table Schema specification, so anyone implementing the schema should know them.

In summary, what I want is that if I read the csvy file and then save it again without changes, the resulting schema declared in the header is the same, without extra information.

Any suggestion on how to tackle this better is most welcomed!

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, I guess that means the default should be used in the validation but not saved to the object that is printed as the schema. I'm not sure how the schema is being used later on (i.e for printing) though, so I can't really say much more than that.

csvy/validators/table_schema.py Show resolved Hide resolved
csvy/validators/table_schema.py Show resolved Hide resolved
"""Validator for the Table Schema in the CSVY file.

This class is used to validate the Table Schema in the CSVY file. It is based on the
schema defined in the Table Schema specification.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should link to the Table Schema website somewhere, and it seems like this might be the best place for it.

Suggested change
schema defined in the Table Schema specification.
schema defined in the [Table Schema specification](https://specs.frictionlessdata.io/table-schema/#language).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do markdown-style links like this work in mkdocs @AdrianDAlessandro? If so, I'm in favour 😄

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think they should. Having said that, we do not have documentation yet (see #170, in case you have free time 😆 ), so we cannot test that.

In any case, I've added it, because it is true it should be mentioned somewhere.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do markdown-style links like this work in mkdocs @AdrianDAlessandro? If so, I'm in favour 😄

I'm 99% sure they do. But I think including the link this way for now is the best and then can be checked later once docs are built

Copy link
Contributor

@alexdewar alexdewar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. I agree with what @AdrianDAlessandro's said and have a few small comments of my own.

There is also the frictionless framework which can deal with table schemas. It seems like it has a v broad scope, but I'm wondering if they provide a package for validating against table schemas which we could reuse here instead? The downside is that it may involve adding a v big dependency to the project.

@@ -127,7 +41,7 @@ class CSVDialectValidator(BaseModel):

delimiter: str = Field(default=",")
doublequote: bool = Field(default=True)
escapechar: Optional[str] = Field(default=None)
escapechar: str | None = Field(default=None)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Might be clearer to explicitly put the default value here, e.g.:

Suggested change
escapechar: str | None = Field(default=None)
escapechar: str = Field(default="\\")

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, the default accoriding to the specification is not set it https://specs.frictionlessdata.io/csv-dialect/#specification

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And now that I check the specification, I'm missing several fields... 😢


def register_validator(
name: str, overwrite: bool = False
) -> Callable[[type[BaseModel]], type[BaseModel]]:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like the decorator package for these situations. It also fixes up the type hints for decorated functions, which can otherwise be an issue.

I appreciate you may not want to add another dependency though!

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've no problem with adding new dependencies, but I'm not convinced it adds much value in this case for just one, very simple decorator that just registers something and spits the same input.


"""

type_: Literal[
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm guessing the underscore is just because you can't have a field named type in Python? If so, I suppose you could also call it kind instead

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I could, but I wanted to keep the names as close as possible to the names used in the field descriptors definition: https://specs.frictionlessdata.io/table-schema/#field-descriptors


"""

type_: Literal[TypeEnum.STRING] = Field(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ditto

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ditto me too 😆

groupChar: str | None = Field(
None, description="The character used to separate groups of thousands."
)
bareNumber: bool | None = Field(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why the camelCase here? Are these the names used in the spec?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Exactly. Probably I should make these aliases and use standard snake case here, to be honest, but I just thought it would be best to keep the name of the descriptors as close as possible to the specification to avoid confusion. Not sure. What do you think?

bare_number = Field(None, alias="bareNumber", description="...")

"""Validator for the Table Schema in the CSVY file.

This class is used to validate the Table Schema in the CSVY file. It is based on the
schema defined in the Table Schema specification.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do markdown-style links like this work in mkdocs @AdrianDAlessandro? If so, I'm in favour 😄

Comment on lines +46 to +52
dumped = validator.model_dump()
assert dumped["name"] == "test_column"
assert dumped["title"] == "Test Column"
assert dumped["example"] == "example_value"
assert dumped["description"] == "This is a test column."
assert dumped["constraints"]["required"] is True
assert dumped["constraints"]["unique"] is True
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the way you've written these tests is v clear, but just to note that if dump_model is changed to add extra values, these tests won't catch it. Instead you could write:

assert dumped == {"name": "test_column", # etc.

Same for the others.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure what you mean by "add extra values". In any case, what would be the problem with those extra values? As long as the ones that must be there, are there, all is good, right?

@dalonsoa
Copy link
Collaborator Author

Many thanks both for these thorough reviews! I'll address/answer your comments as soon as possible.


def register_validator(
name: str, overwrite: bool = False
) -> Callable[[type[BaseModel]], type[BaseModel]]:
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've no problem with adding new dependencies, but I'm not convinced it adds much value in this case for just one, very simple decorator that just registers something and spits the same input.

Comment on lines +148 to +150
constraints: ConstraintsValidator = Field(
ConstraintsValidator(), description="A dictionary of constraints for the field."
)
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's no reason, I think, for constraints to be a ConstraintsValidator. It can be None if no constraints are provided, but if they are, they must be validated by the ConstraintsValidator.

Comment on lines +165 to +166
kwargs["exclude_unset"] = True
kwargs["by_alias"] = True
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point. What about the following, instead, which feels more concise:

Suggested change
kwargs["exclude_unset"] = True
kwargs["by_alias"] = True
kwargs["exclude_unset"] = kwargs.get("exclude_unset", True)
kwargs["by_alias"] = kwargs.get("by_alias", True)


"""

type_: Literal[
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I could, but I wanted to keep the names as close as possible to the names used in the field descriptors definition: https://specs.frictionlessdata.io/table-schema/#field-descriptors

csvy/validators/table_schema.py Show resolved Hide resolved
csvy/validators/table_schema.py Show resolved Hide resolved
"""Validator for the Table Schema in the CSVY file.

This class is used to validate the Table Schema in the CSVY file. It is based on the
schema defined in the Table Schema specification.
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think they should. Having said that, we do not have documentation yet (see #170, in case you have free time 😆 ), so we cannot test that.

In any case, I've added it, because it is true it should be mentioned somewhere.

Comment on lines +447 to +448
kwargs["exclude_unset"] = True
kwargs["by_alias"] = True
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
kwargs["exclude_unset"] = True
kwargs["by_alias"] = True
kwargs["exclude_unset"] = kwargs.get("exclude_unset", True)
kwargs["by_alias"] = kwargs.get("by_alias", True)

Comment on lines +46 to +52
dumped = validator.model_dump()
assert dumped["name"] == "test_column"
assert dumped["title"] == "Test Column"
assert dumped["example"] == "example_value"
assert dumped["description"] == "This is a test column."
assert dumped["constraints"]["required"] is True
assert dumped["constraints"]["unique"] is True
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure what you mean by "add extra values". In any case, what would be the problem with those extra values? As long as the ones that must be there, are there, all is good, right?

Comment on lines +248 to +252
decimalChar: str | None = Field(
None,
description="The character used to separate the integer and fractional. "
+ "If None, '.' is used.",
)
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I know. It is tricky. The point is that if this value is not provided, the default should be used - that's why is the default! - but if I dump the model again to save in the CSV file, I do not want all of those default values to be there because it will make the schema huge. And they are the defaults of the Table Schema specification, so anyone implementing the schema should know them.

In summary, what I want is that if I read the csvy file and then save it again without changes, the resulting schema declared in the header is the same, without extra information.

Any suggestion on how to tackle this better is most welcomed!

@dalonsoa
Copy link
Collaborator Author

@AdrianDAlessandro , @alexdewar I've answered all your questions, I think, and asked for your opinion in a couple of places. Please, have a look (no rush) and let me know what you think.

@AdrianDAlessandro , unless I'm missing something, this should follow the specification exactly.

@alexdewar , that's a good point. To be honest, I had forgotten about frictionless (despite using the specifications described in the frictionless data site!!). I think that there's an opportunity there for validating the data - in response to @AdrianDAlessandro question - but it is definitely a follow up. Indeed, PyCSVY and frictionless might work very well together: we ship the description with the data and use frictionless to validate it, rather than having to guess it and then manually tweak it, as done in this example.

Now, this brings another point: possibly, I could have pulled the specifications (both the CSV Dialect one and the Table Schema) from frictionless, make pydantic models out of the JSON schemas "automatically" and then use those, rather than coding my own... I'll need to investigate, but it might be a work for the future.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Implement the Tabular Schema to declare tabular data
3 participants