Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix 1633 unexpected key error with primary key in schema and 1635 unexpected missing label error when ignoring header case #1641

Merged
merged 62 commits into from
Aug 30, 2024
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
Show all changes
62 commits
Select commit Hold shift + click to select a range
a8b1605
TDD: add test case for ignore header case in validating resource with…
amelie-rondot Feb 5, 2024
5423982
TDD: solve new test case
amelie-rondot Feb 5, 2024
36d3f77
Refacto: refactoring test_schema
amelie-rondot Feb 5, 2024
45bce3b
Linting: format files
amelie-rondot Feb 5, 2024
dc5d254
TDD: add test case for missing label in tabular data and ignoring cas…
amelie-rondot Feb 6, 2024
7a12203
TDD: fix missing required field with 'header_case=False' dialect and …
amelie-rondot Feb 6, 2024
7c0514f
Linting: format files and sort imports
amelie-rondot Feb 6, 2024
51bfb76
Refacto: refactoring test
amelie-rondot Feb 6, 2024
6bbc2c7
Refacto: refactoring test
amelie-rondot Feb 9, 2024
b833a0c
Refacto: refactoring removing missing required label from field info …
amelie-rondot Feb 9, 2024
3550cc3
Refacto: WIP
amelie-rondot Feb 9, 2024
e4839f1
Refacto: WIP2
amelie-rondot Feb 9, 2024
2d5a575
Refacto: refactoring creating schema fields among labels when using '…
amelie-rondot Feb 10, 2024
47072f4
Doc: docstring
amelie-rondot Feb 10, 2024
424ad23
Refacto: remve useless lines
amelie-rondot Feb 10, 2024
b08aaf2
TDD: add test case for missing label in TableResource corresponding t…
amelie-rondot Feb 8, 2024
3f3b08d
TDD: fix unexpected key-error: test passes
amelie-rondot Feb 8, 2024
03f65d0
TDD: add test case for missing label in TableRecource columns corresp…
amelie-rondot Feb 8, 2024
8870f26
TDD: append PrimaryKeyError in case of missing label corresponding to…
amelie-rondot Feb 8, 2024
1736f54
Refacto: refactoring tests
amelie-rondot Feb 8, 2024
b515783
TDD: add test case for insensitive case on label corresponding to pri…
amelie-rondot Feb 8, 2024
9eb9970
TDD: fix case insentive with label corresponding to primary key: test…
amelie-rondot Feb 8, 2024
6542f0b
TDD: edit test_cases to expected beahoviour: expect only missing-labe…
amelie-rondot Feb 8, 2024
23a5125
TDD: fix expected behaviour: test passes
amelie-rondot Feb 8, 2024
d72bab0
TDD: restore one tet case: test fails
amelie-rondot Feb 8, 2024
01f7fcc
WIP: isolate test case to fix
amelie-rondot Feb 10, 2024
a9f4a7c
Refacto: WIP
amelie-rondot Feb 10, 2024
8ca2c1c
Refacto: WIP2
amelie-rondot Feb 10, 2024
6773f9f
Refacto: introduce intermediary methods to create cells in case of pr…
amelie-rondot Feb 10, 2024
df32fda
TDD: fix test case comment ignore_case test case: test passes
amelie-rondot Feb 10, 2024
7abea12
Refacto: refactoring test
amelie-rondot Feb 10, 2024
fbad612
TDD: restore insensitive-case test case: test fails
amelie-rondot Feb 10, 2024
89fa688
TDD: fix ignore header case test case: test passes
amelie-rondot Feb 10, 2024
618725d
Refacto: refactoring tests
amelie-rondot Feb 12, 2024
b3d8bf8
Refacto: refactoring tests 2
amelie-rondot Feb 12, 2024
8de5691
Refacto: introduce intermediary 'Dialect' methods to add missing prim…
amelie-rondot Feb 10, 2024
1af0123
Refacto: refactoring 'Dialect' methods recently introduced
amelie-rondot Feb 12, 2024
e8fb44d
TDD: add new test cases: test passes
amelie-rondot Feb 12, 2024
6954753
Refacto: remove useless lines
amelie-rondot Feb 12, 2024
b03e1c4
Refacto: format file
amelie-rondot Feb 12, 2024
7b27133
Doc: docstring
amelie-rondot Feb 12, 2024
a8c7ed2
Refactoring: remove incorrect doc
amelie-rondot Feb 14, 2024
181f811
Refactoring: rename 'add_fields_to_schema_among_labels' method into '…
amelie-rondot Feb 14, 2024
0e35967
Refactoring: rename 'fields_mapped'and 'fields_map' agruments into 'f…
amelie-rondot Feb 14, 2024
d29b1e7
Refactoring: introduce 'index_name' variable to factorize 'rearrange_…
amelie-rondot Feb 14, 2024
feae99c
Refactoring: edit comment with appropriate vocabulary
amelie-rondot Feb 14, 2024
8d1fb9a
Refacto: remove useless import
amelie-rondot Feb 14, 2024
9a7d04a
Refacto: Fix signature 'primary_key_cells' method and remove useless …
amelie-rondot Feb 14, 2024
9c85a8d
Refacto: remove useless 'if match' condition
amelie-rondot Feb 14, 2024
47d5b8b
Refacto: edit comment
amelie-rondot Feb 14, 2024
9dc992d
Refacto: rename 'expected_fields_names' into 'expected_field_names' v…
amelie-rondot Feb 14, 2024
bd50358
Refacto: introduce new method field_is_required and remove 'primary_k…
amelie-rondot Feb 14, 2024
89b3f73
Refacto: format files
amelie-rondot Feb 14, 2024
2dc35ad
Refacto: fix pyright errors and remove 'type: ignore'
amelie-rondot Feb 14, 2024
cd7d7aa
Refacto: remove useless assesment lines
amelie-rondot Feb 23, 2024
7681f73
documentation
pierrecamilleri Apr 22, 2024
6cc454e
refactorings
pierrecamilleri Apr 22, 2024
0ef1395
format
pierrecamilleri Apr 22, 2024
585869e
fixup! documentation
pierrecamilleri Apr 23, 2024
37f1672
🔵 linting
pierrecamilleri Apr 29, 2024
b239182
🔵 renames
pierrecamilleri May 3, 2024
d464802
fix
pierrecamilleri May 3, 2024
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
88 changes: 76 additions & 12 deletions frictionless/detector/detector.py
Original file line number Diff line number Diff line change
Expand Up @@ -298,6 +298,7 @@ def detect_schema(
labels: Optional[List[str]] = None,
schema: Optional[Schema] = None,
field_candidates: List[Dict[str, Any]] = settings.DEFAULT_FIELD_CANDIDATES,
**options: Any,
) -> Schema:
"""Detect schema from fragment

Expand Down Expand Up @@ -405,20 +406,29 @@ def detect_schema(
# Sync schema
if self.schema_sync:
if labels:
case_sensitive = options["header_case"]

if not case_sensitive:
labels = [label.lower() for label in labels]

if len(labels) != len(set(labels)):
note = '"schema_sync" requires unique labels in the header'
raise FrictionlessException(note)
mapping = {field.name: field for field in schema.fields} # type: ignore
schema.clear_fields()
for name in labels:
field = mapping.get(name)
if not field:
field = Field.from_descriptor({"name": name, "type": "any"})
schema.add_field(field)
# For required fields that are missing
for _, field in mapping.items():
if field and field.required and field.name not in labels:
schema.add_field(field)

mapped_fields = self.mapped_schema_fields_names(
schema.fields, # type: ignore
case_sensitive,
)

self.rearrange_schema_fields_given_labels(
mapped_fields,
schema,
labels,
)

self.add_missing_required_labels_to_schema_fields(
mapped_fields, schema, labels, case_sensitive
)

# Patch schema
if self.schema_patch:
Expand All @@ -432,4 +442,58 @@ def detect_schema(
field_descriptor.update(field_patch)
schema = Schema.from_descriptor(descriptor)

return schema # type: ignore
return schema

@staticmethod
def mapped_schema_fields_names(
fields: List[Field], case_sensitive: bool
) -> Dict[str, Field]:
"""Create a dictionnary to map field names with schema fields"""
if case_sensitive:
return {field.name: field for field in fields}
else:
return {field.name.lower(): field for field in fields}

@staticmethod
def rearrange_schema_fields_given_labels(
fields_mapping: Dict[str, Field],
schema: Schema,
labels: List[str],
):
"""Rearrange fields according to the order of labels. All fields
missing from labels are dropped"""
schema.clear_fields()

for name in labels:
default_field = Field.from_descriptor({"name": name, "type": "any"})
field = fields_mapping.get(name, default_field)
schema.add_field(field)

def add_missing_required_labels_to_schema_fields(
self,
fields_mapping: Dict[str, Field],
schema: Schema,
labels: List[str],
case_sensitive: bool,
):
"""This method aims to add missing required labels and
primary key field not in labels to schema fields.
"""
for name, field in fields_mapping.items():
if (
self.field_is_required(field, schema, case_sensitive)
and name not in labels
):
schema.add_field(field)

@staticmethod
def field_is_required(
field: Field,
schema: Schema,
case_sensitive: bool,
) -> bool:
if case_sensitive:
return field.required or field.name in schema.primary_key
else:
lower_primary_key = [pk.lower() for pk in schema.primary_key]
return field.required or field.name.lower() in lower_primary_key
8 changes: 5 additions & 3 deletions frictionless/resource/__spec__/test_security.py
Original file line number Diff line number Diff line change
Expand Up @@ -27,9 +27,11 @@ def test_resource_source_path_error_bad_path_not_safe_traversing():
Resource(
{
"name": "name",
"path": "data/../data/table.csv"
if not platform.type == "windows"
else "data\\..\\table.csv",
"path": (
"data/../data/table.csv"
if not platform.type == "windows"
else "data\\..\\table.csv"
),
}
)
error = excinfo.value.error
Expand Down
94 changes: 77 additions & 17 deletions frictionless/resources/table.py
Original file line number Diff line number Diff line change
Expand Up @@ -5,6 +5,8 @@
import warnings
from typing import TYPE_CHECKING, Any, Dict, List, Optional, Tuple, Union

from frictionless.schema.field import Field

from .. import errors, helpers, settings
from ..analyzer import Analyzer
from ..dialect import Dialect
Expand Down Expand Up @@ -204,6 +206,7 @@ def __open_schema(self):
labels=self.labels,
schema=self.schema,
field_candidates=system.detect_field_candidates(),
header_case=self.dialect.header_case,
)
self.stats.fields = len(self.schema.fields)

Expand Down Expand Up @@ -331,15 +334,21 @@ def row_stream():

# Primary Key Error
if is_integrity and self.schema.primary_key:
cells = tuple(row[name] for name in self.schema.primary_key)
if set(cells) == {None}:
note = 'cells composing the primary keys are all "None"'
error = errors.PrimaryKeyError.from_row(row, note=note)
row.errors.append(error)
try:
cells = self.primary_key_cells(row, self.dialect.header_case)
except KeyError:
# Row does not have primary_key as label
# There should already be a missing-label error in
# in self.header corresponding to the schema primary key
assert not self.header.valid
else:
match = memory_primary.get(cells)
memory_primary[cells] = row.row_number
if match:
if set(cells) == {None}:
note = 'cells composing the primary keys are all "None"'
error = errors.PrimaryKeyError.from_row(row, note=note)
row.errors.append(error)
else:
match = memory_primary.get(cells)
memory_primary[cells] = row.row_number
if match:
note = "the same as in the row at position %s" % match
error = errors.PrimaryKeyError.from_row(row, note=note)
Expand Down Expand Up @@ -386,20 +395,71 @@ def row_stream():
# Yield row
yield row

# NB: missing required labels are not included in the
# field_info parameter used for row creation
if self.detector.schema_sync:
# Missing required labels are not included in the
# field_info parameter used for row creation
for field in self.schema.fields:
if field.name not in self.labels and field.name in field_info["names"]:
field_index = field_info["names"].index(field.name)
del field_info["names"][field_index]
del field_info["objects"][field_index]
del field_info["mapping"][field.name]
# # Create row stream
self.remove_missing_required_label_from_field_info(field, field_info)

# Create row stream
self.__row_stream = row_stream()

# Read
def remove_missing_required_label_from_field_info(
self, field: Field, field_info: Dict[str, Any]
):
is_case_sensitive = self.dialect.header_case
if self.label_is_missing(
field.name, field_info["names"], self.labels, is_case_sensitive
):
self.remove_field_from_field_info(field.name, field_info)

@staticmethod
def label_is_missing(
field_name: str,
expected_field_names: List[str],
table_labels: types.ILabels,
case_sensitive: bool,
) -> bool:
"""Check if a schema field name is missing from the TableResource
labels.
"""
if not case_sensitive:
field_name = field_name.lower()
table_labels = [label.lower() for label in table_labels]
expected_field_names = [
field_name.lower() for field_name in expected_field_names
]

return field_name not in table_labels and field_name in expected_field_names

@staticmethod
def remove_field_from_field_info(field_name: str, field_info: Dict[str, Any]):
field_index = field_info["names"].index(field_name)
del field_info["names"][field_index]
del field_info["objects"][field_index]
del field_info["mapping"][field_name]

def primary_key_cells(self, row: Row, case_sensitive: bool) -> Tuple[Any, ...]:
"""Create a tuple containg all cells from a given row associated to primary
keys"""
return tuple(row[label] for label in self.primary_key_labels(row, case_sensitive))

def primary_key_labels(
self,
row: Row,
case_sensitive: bool,
) -> List[str]:
"""Create a list of TableResource labels that are primary keys"""
if case_sensitive:
labels_primary_key = self.schema.primary_key
else:
lower_primary_key = [pk.lower() for pk in self.schema.primary_key]
labels_primary_key = [
label for label in row.field_names if label.lower() in lower_primary_key
]
return labels_primary_key

# Read
def read_cells(self, *, size: Optional[int] = None) -> List[List[Any]]:
"""Read lists into memory

Expand Down
44 changes: 44 additions & 0 deletions frictionless/table/__spec__/test_header.py
Original file line number Diff line number Diff line change
@@ -1,3 +1,6 @@
import pytest

import frictionless
from frictionless import Schema, fields
from frictionless.resources import TableResource

Expand Down Expand Up @@ -37,3 +40,44 @@ def test_missing_label():
assert header == ["id", "name", "extra"]
assert header.labels == ["id", "name"]
assert header.valid is False


@pytest.mark.parametrize(
"source, required, valid_report, nb_errors, types_errors_expected, header_case",
[
([["B"], ["foo"]], {"required": True}, False, 1, ["missing-label"], True),
([["B"], ["foo"]], {}, False, 1, ["missing-label"], True),
([["a"], ["foo"]], {"required": True}, False, 1, ["missing-label"], True),
([["a"], ["foo"]], {}, False, 1, ["missing-label"], True),
# Ignore header_case
([["B"], ["foo"]], {"required": True}, False, 1, ["missing-label"], False),
([["B"], ["foo"]], {}, False, 1, ["missing-label"], False),
([["a"], ["foo"]], {"required": True}, True, 0, [], False),
([["a"], ["foo"]], {}, True, 0, [], False),
],
)
def test_missing_primary_key_label_with_shema_sync_issue_1633(
source, required, valid_report, nb_errors, types_errors_expected, header_case
):
schema_descriptor = {
"$schema": "https://frictionlessdata.io/schemas/table-schema.json",
"fields": [{"name": "A", "constraints": required}],
"primaryKey": ["A"],
}

resource = TableResource(
source=source,
schema=Schema.from_descriptor(schema_descriptor),
detector=frictionless.Detector(schema_sync=True),
dialect=frictionless.Dialect(header_case=header_case),
)

report = frictionless.validate(resource)

assert report.valid == valid_report

if not report.valid:
errors = report.tasks[0].errors
assert len(errors) == nb_errors
for error, type_expected in zip(errors, types_errors_expected):
assert error.type == type_expected
Loading
Loading