Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

PartiQL should use idiomatic CSV parsing #366

Closed
jonathanfrenz opened this issue Feb 19, 2021 · 2 comments · Fixed by #480
Closed

PartiQL should use idiomatic CSV parsing #366

jonathanfrenz opened this issue Feb 19, 2021 · 2 comments · Fixed by #480
Labels
enhancement New feature or request S Small sized task

Comments

@jonathanfrenz
Copy link

PartiQL's CSV parser today doesn't take into account quotation marks around fields. Typical CSV writers will write a field with a comma in it to be surrounded by quotations. This is mentioned in the PartiQL code as a TODO here.

CSVReader from Apache is a great example of how to implement this easily. A one-line fix could just be replacing this line with a CSVParser.parse call.

@dlurton
Copy link
Member

dlurton commented Feb 19, 2021

We should really consider utilizing an existing RFC-4180 CSV parser as a dependency in the CLI project and removing the hand-rolled one we created.

@alancai98 alancai98 added enhancement New feature or request S Small sized task labels Mar 12, 2021
@abhikuhikar abhikuhikar self-assigned this Mar 24, 2021
@dlurton
Copy link
Member

dlurton commented Mar 25, 2021

The use cases below are organized such that they can be implemented in sequence and individually have value. i.e. it would be helpful to implement use cases 1 and 2, while 3, 4 and 5 can come some time later.

Use case 1: the csv file has header row. in this case, the columns should be bound to _n fields where n is the field number. i..e.

This selects the first three columns of some_file.csv which does not contain a header row:

select _1, _2, _3 from read_csv('some_file.csv', `{ has_header_row: false }`)

Use case 2: the csv file has a header row. in this case, the columns should be bound to their names:

select first_name, last_name, city from read_csv('some_file_with_header_row.csv', `{ has_header_row: true }`)

Use case 3: the user does not know if there are column names. We should try to autodetect.

select * from read_csv('some_file_with_header_row.csv', `{ has_header_row: auto }`)
-- or -- 
select * from read_csv('some_file_with_header_row.csv')  -- has_header_row defaults to auto

Use case 4: the user would like to specify a predefined csv format:

select * from read_csv('some_file_with_header_row.csv', `{ format: postgres }`)

If format is unspecified, the default is rfc4180.

Possible values for format are a subset of those defined here and include, default, excel, mysql, rfc4180, oracle, postgres_csv, postgres_text. The other formats not included in this list seem rather obscure and we do not wish to support them. if the user tries to specify one not listed here, an error should be raised.

Use case 5: the user would like to create a custom format:

select * from read_csv('some_file_with_header_row.csv', 
    `{ record_separator: '\n', ignore_empty_lines = true, ignore_sourrounding_spaces = true }`)

For a complete list of all the different configuration options, see:
https://commons.apache.org/proper/commons-csv/apidocs/org/apache/commons/csv/CSVFormat.html

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request S Small sized task
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants