Add table schema validator #125

dalonsoa · 2024-10-23T16:06:33Z

Adds a table schema validator for the header. This validator checks that, if there is a schema keyword in the header, it follows the standard defined in the Table Schema specification (mostly). This is done via pydantic models, some of them nested, and some of them discriminated based on the field type_, as is done in PyECN.

This PR only implements validation of the schema if it is present in the header of the file. It does not enforce it or do anything with it. That is a job for future PRs, as it will need to be adapted to each reader/writer.

As the validation.py module was becoming pretty big, I have split it into several submodules within the validation package, and done the same with the tests. So the changes look more massive than they actually are. Just focus on the following two files:

csvy/validators/table_schema.py
tests/validators/test_table_schema.py

This PR incorporates the changes made in #169, dropping support for Python 3.9 and adding support for 3.13. Ignore those changes or review that specific, already merged, PR.

Needless to say, there's no urgency on reviewing this.

Closes #3

AdrianDAlessandro

I'm not super clear on how this will be used in practice, or how well it maps onto the Table Schema you are following. But I've given a general review of the Pydantic models. Overall it looks really sensible; I have just raised a few nit-picky details that are mostly around maintainability and docs.

AdrianDAlessandro · 2025-01-09T14:01:20Z

csvy/validators/table_schema.py

+    constraints: ConstraintsValidator = Field(
+        ConstraintsValidator(), description="A dictionary of constraints for the field."
+    )


I think this is fine, but just be aware that setting the default like this means that a ConstraintsValidator with no fields is built and all the pydantic validation is run on it when this module is first imported. If a null value for constraints must be a ConstraintsValidator then keep it like this; otherwise, it's probably cleaner to just set it to None.

There's no reason, I think, for constraints to be a ConstraintsValidator. It can be None if no constraints are provided, but if they are, they must be validated by the ConstraintsValidator.

In that case, I think this is cleaner because it has fewer side effects:

Suggested change

-    constraints: ConstraintsValidator = Field(
-        ConstraintsValidator(), description="A dictionary of constraints for the field."
-    )
+    constraints: ConstraintsValidator | None = Field(
+        None, description="A dictionary of constraints for the field."
+    )

AdrianDAlessandro · 2025-01-09T14:08:33Z

csvy/validators/table_schema.py

+        kwargs["exclude_unset"] = True
+        kwargs["by_alias"] = True


Overwriting kwargs like this feels wrong. It means someone who is familiar with Pydantic might want to try validator.model_dump(exclude_unset=False) and not get the result they're expecting. I think it's better to either rename this method so the regular model_dump is still available, or include something like:

Suggested change

-        kwargs["exclude_unset"] = True
-        kwargs["by_alias"] = True
+        if "exclude_unset" not in kwargs:
+            kwargs["exclude_unset"] = True
+        if "by_alias" not in kwargs:
+            kwargs["by_alias"] = True

Good point. What about the following, instead, which feels more concise:

Suggested change

-        kwargs["exclude_unset"] = True
-        kwargs["by_alias"] = True
+        kwargs["exclude_unset"] = kwargs.get("exclude_unset", True)
+        kwargs["by_alias"] = kwargs.get("by_alias", True)

Seems sensible!

AdrianDAlessandro · 2025-01-09T14:09:54Z

csvy/validators/table_schema.py

+        kwargs["exclude_unset"] = True
+        kwargs["by_alias"] = True


AdrianDAlessandro · 2025-01-09T14:15:56Z

csvy/validators/table_schema.py

+    decimalChar: str | None = Field(
+        None,
+        description="The character used to separate the integer and fractional. "
+        + "If None, '.' is used.",
+    )


This description claims that "." is the default, but I do not see where that is set. It looks like the default is set to None to me.

This same pattern/comment combination is repeated multiple times in other validators, so it should be checked.

Yeah, I know. It is tricky. The point is that if this value is not provided, the default should be used - that's why it is the default! - but if I dump the model again to save in the CSV file, I do not want all of those default values to be there because it will make the schema huge. And they are the defaults of the Table Schema specification, so anyone implementing the schema should know them.

In summary, what I want is that if I read the csvy file and then save it again without changes, the resulting schema declared in the header is the same, without extra information.

Any suggestion on how to tackle this better is most welcome!

Hmm, I guess that means the default should be used in the validation but not saved to the object that is printed as the schema. I'm not sure how the schema is being used later on (e.g. for printing) though, so I can't really say much more than that.

AdrianDAlessandro · 2025-01-09T14:29:53Z

csvy/validators/table_schema.py

+    """Validator for the Table Schema in the CSVY file.
+
+    This class is used to validate the Table Schema in the CSVY file. It is based on the
+    schema defined in the Table Schema specification.


I think we should link to the Table Schema website somewhere, and it seems like this might be the best place for it.

Suggested change

-    schema defined in the Table Schema specification.
+    schema defined in the [Table Schema specification](https://specs.frictionlessdata.io/table-schema/#language).

Do markdown-style links like this work in mkdocs @AdrianDAlessandro? If so, I'm in favour 😄

I think they should. Having said that, we do not have documentation yet (see #170, in case you have free time 😆 ), so we cannot test that.

In any case, I've added it, because it is true it should be mentioned somewhere.

Do markdown-style links like this work in mkdocs @AdrianDAlessandro? If so, I'm in favour 😄

I'm 99% sure they do. But I think including the link this way for now is best, and it can be checked later once the docs are built.

alexdewar

LGTM. I agree with what @AdrianDAlessandro's said and have a few small comments of my own.

There is also the frictionless framework which can deal with table schemas. It seems like it has a v broad scope, but I'm wondering if they provide a package for validating against table schemas which we could reuse here instead? The downside is that it may involve adding a v big dependency to the project.

alexdewar · 2025-01-20T08:04:11Z

csvy/validators/csv_dialect.py

@@ -127,7 +41,7 @@ class CSVDialectValidator(BaseModel):

    delimiter: str = Field(default=",")
    doublequote: bool = Field(default=True)
-    escapechar: Optional[str] = Field(default=None)
+    escapechar: str | None = Field(default=None)


Might be clearer to explicitly put the default value here, e.g.:

Suggested change

-    escapechar: str | None = Field(default=None)
+    escapechar: str = Field(default="\\")

Actually, according to the specification, the default is not set: https://specs.frictionlessdata.io/csv-dialect/#specification

And now that I check the specification, I'm missing several fields... 😢

alexdewar · 2025-01-20T08:08:10Z

csvy/validators/registry.py

+
+def register_validator(
+    name: str, overwrite: bool = False
+) -> Callable[[type[BaseModel]], type[BaseModel]]:


I like the decorator package for these situations. It also fixes up the type hints for decorated functions, which can otherwise be an issue.

I appreciate you may not want to add another dependency though!

I've no problem with adding new dependencies, but I'm not convinced it adds much value in this case for just one, very simple decorator that just registers something and returns its input unchanged.
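
To illustrate why a plain decorator is enough here, this is a rough standard-library sketch of a registering decorator that keeps the decorated class's type. The registry name, the `TypeVar`, and the error raised on duplicates are assumptions; the real function is typed against pydantic's `BaseModel`.

```python
from typing import Callable, TypeVar

T = TypeVar("T", bound=type)

# Hypothetical registry mapping a header keyword to its validator class.
VALIDATORS_REGISTRY: dict[str, type] = {}


def register_validator(name: str, overwrite: bool = False) -> Callable[[T], T]:
    """Return a class decorator that stores the class under ``name``."""

    def decorator(cls: T) -> T:
        # Assumed behaviour: refuse duplicates unless overwrite is requested.
        if name in VALIDATORS_REGISTRY and not overwrite:
            raise ValueError(f"A validator named '{name}' is already registered.")
        VALIDATORS_REGISTRY[name] = cls
        return cls  # returned unchanged, so type checkers see the same class

    return decorator


@register_validator("schema")
class SchemaValidator:
    """Placeholder class; the real validators are pydantic models."""
```

Because the decorator returns the very object it receives and is annotated as `Callable[[T], T]`, type hints on the decorated class are preserved without an extra dependency.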

alexdewar · 2025-01-20T08:14:28Z

csvy/validators/table_schema.py

+
+    """
+
+    type_: Literal[


I'm guessing the underscore is just because you can't have a field named type in Python? If so, I suppose you could also call it kind instead.

I could, but I wanted to keep the names as close as possible to the names used in the field descriptors definition: https://specs.frictionlessdata.io/table-schema/#field-descriptors

alexdewar · 2025-01-20T08:15:12Z

csvy/validators/table_schema.py

+
+    """
+
+    type_: Literal[TypeEnum.STRING] = Field(


Ditto me too 😆

alexdewar · 2025-01-20T08:16:18Z

csvy/validators/table_schema.py

+    groupChar: str | None = Field(
+        None, description="The character used to separate groups of thousands."
+    )
+    bareNumber: bool | None = Field(


Why the camelCase here? Are these the names used in the spec?

Exactly. Probably I should make these aliases and use standard snake case here, to be honest, but I just thought it would be best to keep the name of the descriptors as close as possible to the specification to avoid confusion. Not sure. What do you think?

bare_number: bool | None = Field(None, alias="bareNumber", description="...")

alexdewar · 2025-01-20T08:17:59Z

csvy/validators/table_schema.py

+    """Validator for the Table Schema in the CSVY file.
+
+    This class is used to validate the Table Schema in the CSVY file. It is based on the
+    schema defined in the Table Schema specification.


Do markdown-style links like this work in mkdocs @AdrianDAlessandro? If so, I'm in favour 😄

alexdewar · 2025-01-20T08:22:06Z

tests/validators/test_table_schema.py

+    dumped = validator.model_dump()
+    assert dumped["name"] == "test_column"
+    assert dumped["title"] == "Test Column"
+    assert dumped["example"] == "example_value"
+    assert dumped["description"] == "This is a test column."
+    assert dumped["constraints"]["required"] is True
+    assert dumped["constraints"]["unique"] is True


I think the way you've written these tests is v clear, but just to note that if model_dump is changed to add extra values, these tests won't catch it. Instead you could write:

assert dumped == {"name": "test_column", # etc.

Same for the others.

I'm not sure what you mean by "add extra values". In any case, what would be the problem with those extra values? As long as the ones that must be there, are there, all is good, right?
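
One possible reading of "extra values", shown as a standard-library sketch: if the dump gains a key nobody asked for, key-by-key assertions keep passing while an exact comparison does not. The `format` key below is invented purely for illustration.

```python
# Hypothetical dumped output that has gained an unexpected "format" key.
dumped = {
    "name": "test_column",
    "title": "Test Column",
    "format": "default",
}

# Key-by-key assertions still pass, so the extra key goes unnoticed.
assert dumped["name"] == "test_column"
assert dumped["title"] == "Test Column"

# An exact comparison against the expected dictionary reveals the difference.
expected = {"name": "test_column", "title": "Test Column"}
matches_exactly = dumped == expected
extra_keys = set(dumped) - set(expected)
```

Whether that matters depends on the goal: for the round-trip requirement discussed elsewhere in this review, an unexpected key in the dump would end up in the saved header.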

dalonsoa · 2025-01-20T08:30:22Z

Many thanks both for these thorough reviews! I'll address/answer your comments as soon as possible.

dalonsoa · 2025-01-24T06:15:13Z

csvy/validators/registry.py

+
+def register_validator(
+    name: str, overwrite: bool = False
+) -> Callable[[type[BaseModel]], type[BaseModel]]:


I've no problem with adding new dependencies, but I'm not convinced it adds much value in this case for just one, very simple decorator that just registers something and returns its input unchanged.

dalonsoa · 2025-01-24T06:19:44Z

csvy/validators/table_schema.py

+    constraints: ConstraintsValidator = Field(
+        ConstraintsValidator(), description="A dictionary of constraints for the field."
+    )


There's no reason, I think, for constraints to be a ConstraintsValidator. It can be None if no constraints are provided, but if they are, they must be validated by the ConstraintsValidator.

dalonsoa · 2025-01-24T06:21:39Z

csvy/validators/table_schema.py

+        kwargs["exclude_unset"] = True
+        kwargs["by_alias"] = True


Good point. What about the following, instead, which feels more concise:

Suggested change

-        kwargs["exclude_unset"] = True
-        kwargs["by_alias"] = True
+        kwargs["exclude_unset"] = kwargs.get("exclude_unset", True)
+        kwargs["by_alias"] = kwargs.get("by_alias", True)

dalonsoa · 2025-01-24T06:23:15Z

csvy/validators/table_schema.py

+
+    """
+
+    type_: Literal[


I could, but I wanted to keep the names as close as possible to the names used in the field descriptors definition: https://specs.frictionlessdata.io/table-schema/#field-descriptors

csvy/validators/table_schema.py

dalonsoa · 2025-01-24T06:42:04Z

csvy/validators/table_schema.py

+    """Validator for the Table Schema in the CSVY file.
+
+    This class is used to validate the Table Schema in the CSVY file. It is based on the
+    schema defined in the Table Schema specification.


I think they should. Having said that, we do not have documentation yet (see #170, in case you have free time 😆 ), so we cannot test that.

In any case, I've added it, because it is true it should be mentioned somewhere.

dalonsoa · 2025-01-24T06:42:48Z

csvy/validators/table_schema.py

+        kwargs["exclude_unset"] = True
+        kwargs["by_alias"] = True


Suggested change

-        kwargs["exclude_unset"] = True
-        kwargs["by_alias"] = True
+        kwargs["exclude_unset"] = kwargs.get("exclude_unset", True)
+        kwargs["by_alias"] = kwargs.get("by_alias", True)

dalonsoa · 2025-01-24T06:47:07Z

tests/validators/test_table_schema.py

+    dumped = validator.model_dump()
+    assert dumped["name"] == "test_column"
+    assert dumped["title"] == "Test Column"
+    assert dumped["example"] == "example_value"
+    assert dumped["description"] == "This is a test column."
+    assert dumped["constraints"]["required"] is True
+    assert dumped["constraints"]["unique"] is True


I'm not sure what you mean by "add extra values". In any case, what would be the problem with those extra values? As long as the ones that must be there, are there, all is good, right?

dalonsoa · 2025-01-24T06:52:37Z

csvy/validators/table_schema.py

+    decimalChar: str | None = Field(
+        None,
+        description="The character used to separate the integer and fractional. "
+        + "If None, '.' is used.",
+    )


Yeah, I know. It is tricky. The point is that if this value is not provided, the default should be used - that's why it is the default! - but if I dump the model again to save in the CSV file, I do not want all of those default values to be there because it will make the schema huge. And they are the defaults of the Table Schema specification, so anyone implementing the schema should know them.

In summary, what I want is that if I read the csvy file and then save it again without changes, the resulting schema declared in the header is the same, without extra information.

Any suggestion on how to tackle this better is most welcome!

dalonsoa · 2025-01-24T07:10:24Z

@AdrianDAlessandro , @alexdewar I've answered all your questions, I think, and asked for your opinion in a couple of places. Please, have a look (no rush) and let me know what you think.

@AdrianDAlessandro , unless I'm missing something, this should follow the specification exactly.

@alexdewar , that's a good point. To be honest, I had forgotten about frictionless (despite using the specifications described in the frictionless data site!!). I think that there's an opportunity there for validating the data - in response to @AdrianDAlessandro's question - but it is definitely a follow-up. Indeed, PyCSVY and frictionless might work very well together: we ship the description with the data and use frictionless to validate it, rather than having to guess it and then manually tweak it, as done in this example.

Now, this brings up another point: possibly, I could have pulled the specifications (both the CSV Dialect one and the Table Schema) from frictionless, made pydantic models out of the JSON schemas "automatically" and then used those, rather than coding my own... I'll need to investigate, but it might be work for the future.

Base automatically changed from registry to develop October 24, 2024 09:11

dalonsoa changed the base branch from develop to main December 17, 2024 12:58

dalonsoa added 7 commits December 19, 2024 08:50

✨ Add column validator and associated classes.

73823fa

✨ Support python 3.9.

ad1c9ae

✨ Support python 3.9, again.

801fdef

Revert ":sparkles: Support python 3.9, again."

cdd61cd

This reverts commit 739792d.

➕ Add extra pydantic helper dpeendency.

7d93860

➕ Add extra dependency.

279586f

Fix messy rebasing.

aff62f2

dalonsoa force-pushed the table_validator branch from b4d3fad to aff62f2 Compare December 19, 2024 08:53

dalonsoa and others added 11 commits December 19, 2024 10:50

✨ Implement several type-specific validators.

d0bf9d2

✨ Add BooleanColumnValidator

e633ee1

✨ Add DateTimeColumnValidator.

44488aa

✨ Add the Geopoint and GeoJSON column validators.

8113d43

✨ Add SchemaValidator.

4fae509

♻️ Split code in validators.py.

051cd3e

🚨 Keep mypy happy.

2c8f968

Add pydantic plugin for mypy.

1eb2069

✅ Add tests for the table validator classes.

6fb61e1

♻️ Move file.

3a69368

Merge branch 'main' into table_validator

2f845c8

dalonsoa marked this pull request as ready for review December 19, 2024 14:16

dalonsoa changed the title ~~WIP: Table validator~~ Add table schema validator Dec 19, 2024

dalonsoa and others added 2 commits December 19, 2024 14:34

⬆️ Drop support for Python 3.9 and add for Python 3.13

51272b7

Merge pull request #169 from ImperialCollegeLondon/drop-py-39

cb148bf

Drop support for python 3.9 and add support for python 3.13

dalonsoa requested review from alexdewar, dc2917 and AdrianDAlessandro December 19, 2024 14:45

AdrianDAlessandro reviewed Jan 9, 2025

alexdewar approved these changes Jan 20, 2025

dalonsoa mentioned this pull request Jan 24, 2025

Reference Enums rather than describing the specific values. #191

Open

dalonsoa commented Jan 24, 2025
