PartiQL should use idiomatic CSV parsing #366

jonathanfrenz · 2021-02-19T22:12:48Z

PartiQL's CSV parser today doesn't take into account quotation marks around fields. Typical CSV writers will write a field with a comma in it to be surrounded by quotations. This is mentioned in the PartiQL code as a TODO here.

CSVReader from Apache is a great example of how to implement this easily. A one-line fix could just be replacing this line with a CSVParser.parse call.

dlurton · 2021-02-19T22:20:17Z

We should really consider utilizing an existing RFC-4180 CSV parser as a dependency in the CLI project and removing the hand-rolled one we created.

dlurton · 2021-03-25T01:14:08Z

The use cases below are organized such that they can be implemented in sequence and individually have value. i.e. it would be helpful to implement use cases 1 and 2, while 3, 4 and 5 can come some time later.

Use case 1: the csv file has header row. in this case, the columns should be bound to _n fields where n is the field number. i..e.

This selects the first three columns of some_file.csv which does not contain a header row:

select _1, _2, _3 from read_csv('some_file.csv', `{ has_header_row: false }`)

Use case 2: the csv file has a header row. in this case, the columns should be bound to their names:

select first_name, last_name, city from read_csv('some_file_with_header_row.csv', `{ has_header_row: true }`)

Use case 3: the user does not know if there are column names. We should try to autodetect.

select * from read_csv('some_file_with_header_row.csv', `{ has_header_row: auto }`)
-- or -- 
select * from read_csv('some_file_with_header_row.csv')  -- has_header_row defaults to auto

Use case 4: the user would like to specify a predefined csv format:

select * from read_csv('some_file_with_header_row.csv', `{ format: postgres }`)

If format is unspecified, the default is rfc4180.

Possible values for format are a subset of those defined here and include, default, excel, mysql, rfc4180, oracle, postgres_csv, postgres_text. The other formats not included in this list seem rather obscure and we do not wish to support them. if the user tries to specify one not listed here, an error should be raised.

Use case 5: the user would like to create a custom format:

select * from read_csv('some_file_with_header_row.csv', 
    `{ record_separator: '\n', ignore_empty_lines = true, ignore_sourrounding_spaces = true }`)

For a complete list of all the different configuration options, see:
https://commons.apache.org/proper/commons-csv/apidocs/org/apache/commons/csv/CSVFormat.html

alancai98 added enhancement New feature or request S Small sized task labels Mar 12, 2021

abhikuhikar self-assigned this Mar 24, 2021

dlurton closed this as completed Mar 25, 2021

alancai98 reopened this Mar 25, 2021

dlurton mentioned this issue Nov 10, 2021

Replaced self-implemented CSV file reader with Apache CSVParser for CLI #474

Merged

lziq mentioned this issue Dec 13, 2021

Enabled CLI to read predefined and customized CSV files #480

Merged

lziq linked a pull request Dec 13, 2021 that will close this issue

Enabled CLI to read predefined and customized CSV files #480

Merged

abhikuhikar removed their assignment Dec 20, 2021

lziq closed this as completed in #480 Jan 3, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

PartiQL should use idiomatic CSV parsing #366

PartiQL should use idiomatic CSV parsing #366

jonathanfrenz commented Feb 19, 2021

dlurton commented Feb 19, 2021 •

edited

Loading

dlurton commented Mar 25, 2021 •

edited

Loading

PartiQL should use idiomatic CSV parsing #366

PartiQL should use idiomatic CSV parsing #366

Comments

jonathanfrenz commented Feb 19, 2021

dlurton commented Feb 19, 2021 • edited Loading

dlurton commented Mar 25, 2021 • edited Loading

dlurton commented Feb 19, 2021 •

edited

Loading

dlurton commented Mar 25, 2021 •

edited

Loading