Add CSV options to the CSV parser #28491

Merged
merged 73 commits into from
Aug 3, 2023
f744e2c
remove invalid legacy option
girarda Jul 19, 2023
fb5a57d
remove unused option
girarda Jul 19, 2023
3230205
the tests pass but this is quite messy
girarda Jul 19, 2023
f6a67db
very slight clean up
girarda Jul 19, 2023
d01200b
Add skip options to csv format
girarda Jul 19, 2023
b271a9e
fix some of the typing issues
girarda Jul 19, 2023
7add1c7
fixme comment
girarda Jul 19, 2023
e8c88be
remove extra log message
girarda Jul 19, 2023
9e73b51
fix typing issues
girarda Jul 19, 2023
84cabeb
merge
girarda Jul 25, 2023
79f7748
skip before header
girarda Jul 25, 2023
0ae95da
skip after header
girarda Jul 25, 2023
6324257
format
girarda Jul 25, 2023
0fd42ca
add another test
girarda Jul 25, 2023
8b54aff
Automated Commit - Formatting Changes
girarda Jul 25, 2023
b9a4a71
auto generate column names
girarda Jul 26, 2023
9982834
Merge branch 'alex/csv_options' of github.com:airbytehq/airbyte into …
girarda Jul 26, 2023
32844ce
delete dead code
girarda Jul 26, 2023
cd48738
update title and description
girarda Jul 26, 2023
43ce434
true and false values
girarda Jul 26, 2023
df47586
Update the tests
girarda Jul 26, 2023
ce9a672
Add comment
girarda Jul 26, 2023
2c03349
missing test
girarda Jul 26, 2023
c445b02
rename
girarda Jul 26, 2023
ff8f5d4
update expected spec
girarda Jul 26, 2023
87a3bcb
move to method
girarda Jul 26, 2023
9a1954f
Update comment
girarda Jul 26, 2023
72caf7d
fix typo
girarda Jul 26, 2023
ecea4e0
remove unused import
girarda Jul 26, 2023
cf298b7
Add a comment
girarda Jul 26, 2023
8cd05a4
None records do not pass the WaitForDiscoverPolicy
girarda Jul 26, 2023
d1fb6ae
format
girarda Jul 26, 2023
124cfcf
remove second branch to ensure we always go through the same processing
girarda Jul 26, 2023
a9ee16b
Raise an exception if the record is None
girarda Jul 26, 2023
a629ef0
reset
girarda Jul 26, 2023
f11a551
Update tests
girarda Jul 26, 2023
da274bc
handle unquoted newlines
girarda Jul 26, 2023
b373221
Automated Commit - Formatting Changes
girarda Jul 26, 2023
ce51b3d
Update test case so the quoting is explicit
girarda Jul 26, 2023
f8d76a1
Merge branch 'alex/csv_options' of github.com:airbytehq/airbyte into …
girarda Jul 26, 2023
b857737
Update comment
girarda Jul 26, 2023
e8609c4
Automated Commit - Formatting Changes
girarda Jul 27, 2023
59f00be
Fail validation if skipping rows before header and header is autogene…
girarda Jul 27, 2023
1cdaf60
Merge branch 'alex/csv_options' of github.com:airbytehq/airbyte into …
girarda Jul 27, 2023
d1c4036
always fail if a record cannot be parsed
girarda Aug 1, 2023
903074a
merge
girarda Aug 1, 2023
bdfccee
format
girarda Aug 1, 2023
355d596
set write line_no in error message
girarda Aug 1, 2023
0426b4c
remove none check
girarda Aug 2, 2023
8b7d519
Merge branch 'master' into alex/csv_options
girarda Aug 2, 2023
8a7bcf7
Automated Commit - Formatting Changes
girarda Aug 2, 2023
9252651
enable autogenerate test
girarda Aug 2, 2023
06157dc
Merge branch 'alex/csv_options' of github.com:airbytehq/airbyte into …
girarda Aug 2, 2023
e5a1c0e
remove duplicate test
girarda Aug 2, 2023
9c9dc72
missing unit tests
girarda Aug 2, 2023
146680a
Update
girarda Aug 2, 2023
4cfd721
remove branching
girarda Aug 2, 2023
6f10047
remove unused none check
girarda Aug 2, 2023
e4986e8
Merge branch 'master' into alex/csv_options
girarda Aug 2, 2023
1f57507
Update tests
girarda Aug 2, 2023
0441c28
remove branching
girarda Aug 2, 2023
c2b3a37
format
girarda Aug 2, 2023
d8538f9
extract to function
girarda Aug 2, 2023
16df89d
comment
girarda Aug 2, 2023
7d7f6dd
Merge branch 'master' into alex/csv_options
girarda Aug 2, 2023
cef6a41
missing type
girarda Aug 2, 2023
cec32dc
Merge branch 'alex/csv_options' of github.com:airbytehq/airbyte into …
girarda Aug 2, 2023
05067a7
type annotation
girarda Aug 2, 2023
bdbd413
use set
girarda Aug 3, 2023
bf525b4
Document that the strings are case-sensitive
girarda Aug 3, 2023
d32a94f
public -> private
girarda Aug 3, 2023
69240b0
add unit test
girarda Aug 3, 2023
bfe4d47
newline
girarda Aug 3, 2023
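
Several of the commits above add options for skipping rows around the header and for autogenerating column names of the form "f0", "f1", and so on. The sketch below illustrates the general idea with the standard `csv` module. It is not the CDK's parser: `read_rows` is a hypothetical helper, its parameter names are borrowed from the option names in this PR only for readability, and how the real options interact or are validated is not shown here.

```python
import csv
import io
from typing import Dict, Iterator


def read_rows(
    text: str,
    skip_rows_before_header: int = 0,
    skip_rows_after_header: int = 0,
    autogenerate_column_names: bool = False,
) -> Iterator[Dict[str, str]]:
    """Hypothetical helper for illustration; not the CDK implementation."""
    lines = io.StringIO(text)
    # Drop raw lines that precede the header (or the first data row).
    for _ in range(skip_rows_before_header):
        lines.readline()
    reader = csv.reader(lines)
    first = next(reader, None)
    if first is None:
        return  # nothing left to read
    if autogenerate_column_names:
        # No header row: name the columns f0, f1, ... and treat the first row as data.
        names = [f"f{i}" for i in range(len(first))]
        yield dict(zip(names, first))
    else:
        # The first remaining row is the header; optionally skip rows right after it.
        names = first
        for _ in range(skip_rows_after_header):
            next(reader, None)
    for row in reader:
        yield dict(zip(names, row))
```

For example, `list(read_rows("1,2\n3,4\n", autogenerate_column_names=True))` yields two records keyed by `f0` and `f1`.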
add unit test
girarda committed Aug 3, 2023
commit 69240b0def402538cb3065b30033e5102c4dc935
@@ -4,7 +4,7 @@
 
 import codecs
 from enum import Enum
-from typing import Any, List, Mapping, Optional, Set
+from typing import Any, Mapping, Optional, Set
 
 from pydantic import BaseModel, Field, root_validator, validator
 from typing_extensions import Literal
@@ -69,10 +69,14 @@ class CsvFormat(BaseModel):
         description="Whether to autogenerate column names if column_names is empty. If true, column names will be of the form “f0”, “f1”… If false, column names will be read from the first CSV row after skip_rows_before_header.",
     )
     true_values: Set[str] = Field(
-        title="True Values", default=DEFAULT_TRUE_VALUES, description="A set of case-sensitive strings that should be interpreted as true values."
+        title="True Values",
+        default=DEFAULT_TRUE_VALUES,
+        description="A set of case-sensitive strings that should be interpreted as true values.",
     )
     false_values: Set[str] = Field(
-        title="False Values", default=DEFAULT_FALSE_VALUES, description="A set of case-sensitive strings that should be interpreted as false values."
+        title="False Values",
+        default=DEFAULT_FALSE_VALUES,
+        description="A set of case-sensitive strings that should be interpreted as false values.",
     )
 
     @validator("delimiter")
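
The `true_values` and `false_values` descriptions say the strings are case-sensitive. A minimal sketch of what that implies for boolean casting follows. `cast_bool` is a hypothetical helper, not the CDK's `_cast_types`, and the example sets are made up rather than the contents of `DEFAULT_TRUE_VALUES` or `DEFAULT_FALSE_VALUES`.

```python
from typing import Set


def cast_bool(value: str, true_values: Set[str], false_values: Set[str]) -> bool:
    """Hypothetical helper: exact, case-sensitive membership test."""
    if value in true_values:
        return True
    if value in false_values:
        return False
    # No case folding is applied, so "Yes" does not match a configured "yes".
    raise ValueError(f"{value!r} is neither a true value nor a false value")
```

With `true_values={"yes"}` and `false_values={"no"}`, `cast_bool("yes", ...)` returns `True`, while `cast_bool("Yes", ...)` raises `ValueError`. Users who need both spellings would have to list each one explicitly.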
@@ -3,10 +3,12 @@
 #
 
 import logging
+from unittest.mock import MagicMock, Mock
 
 import pytest
 from airbyte_cdk.sources.file_based.config.csv_format import DEFAULT_FALSE_VALUES, DEFAULT_TRUE_VALUES, CsvFormat
-from airbyte_cdk.sources.file_based.file_types.csv_parser import _cast_types
+from airbyte_cdk.sources.file_based.exceptions import RecordParseError
+from airbyte_cdk.sources.file_based.file_types.csv_parser import CsvParser, _cast_types
 
 PROPERTY_TYPES = {
     "col1": "null",
@@ -70,4 +72,26 @@
 )
 def test_cast_to_python_type(row, true_values, false_values, expected_output):
     csv_format = CsvFormat(true_values=true_values, false_values=false_values)
-    assert _cast_types(row, PROPERTY_TYPES, csv_format, logger)==expected_output
+    assert _cast_types(row, PROPERTY_TYPES, csv_format, logger) == expected_output
+
+@pytest.mark.parametrize(
+    "reader_values, expected_rows", [
+        pytest.param([{"col1": "1", "col2": None}], None, id="raise_exception_if_any_value_is_none"),
+        pytest.param([{"col1": "1", "col2": "2"}], [{"col1": "1", "col2": "2"}], id="read_no_cast"),
+    ]
+)
+def test_read_and_cast_types(reader_values, expected_rows):
+    reader = MagicMock()
+    reader.__iter__.return_value = reader_values
+    schema = {}
+    config_format = CsvFormat()
+    logger = Mock()
+
+    parser = CsvParser()
+
+    expected_rows = expected_rows
+    if expected_rows is None:
+        with pytest.raises(RecordParseError):
+            list(parser._read_and_cast_types(reader, schema, config_format, logger))
+    else:
+        assert expected_rows == list(parser._read_and_cast_types(reader, schema, config_format, logger))
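
The `raise_exception_if_any_value_is_none` case feeds the parser a row whose value is `None`. Python's `csv.DictReader` produces rows of exactly this shape when a data line has fewer fields than the header, because missing fields are filled with `restval`, which defaults to `None`. Whether the CDK parser relies on `DictReader` is an assumption here. The sketch below uses a hypothetical `ShortRowError` as a stand-in for `RecordParseError` and a hypothetical `read_strict` function to show the fail-fast behavior the test describes.

```python
import csv
import io
from typing import Dict, Iterator


class ShortRowError(ValueError):
    """Hypothetical stand-in for the CDK's RecordParseError."""


def read_strict(text: str) -> Iterator[Dict[str, str]]:
    """Yield rows from CSV text, failing on any row with a missing value."""
    reader = csv.DictReader(io.StringIO(text))
    for row in reader:
        # DictReader fills fields missing from a short row with None (restval).
        if any(value is None for value in row.values()):
            raise ShortRowError(f"row ending at line {reader.line_num} has missing values")
        yield row
```

Because `read_strict` is a generator, the error surfaces only when the rows are consumed, which is why the test above wraps the call in `list(...)` inside `pytest.raises`.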