Performance improvements #92

GjjvdBurg · 2023-04-02T21:20:58Z

This PR adds performance improvements in two ways:

Caching the is_potential_escapechar function result
Implementing merge_with_quotechar in C
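To illustrate the first point, here is a minimal sketch of result caching with `functools.lru_cache`. The function `looks_like_escapechar` and its category check are hypothetical stand-ins, not CleverCSV's actual `is_potential_escapechar` logic. The point is only that a pure per-character check is evaluated once per distinct character rather than once per occurrence.

```python
import functools
import unicodedata


@functools.lru_cache(maxsize=None)
def looks_like_escapechar(char: str) -> bool:
    # Hypothetical check (an assumption for illustration): treat any
    # punctuation or symbol character as an escape character candidate.
    return unicodedata.category(char)[0] in ("P", "S")


def count_candidates(text: str) -> int:
    # Every character goes through the cached function, so repeated
    # characters only cost a dictionary lookup after the first call.
    return sum(looks_like_escapechar(c) for c in text)


sample = "a\\b,c\\d\n" * 1000
total = count_candidates(sample)
info = looks_like_escapechar.cache_info()
print(total, info.hits, info.misses)
```

The cache only helps because the result depends on nothing but the arguments. A function that reads external state would need a different approach.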

Especially for large files, this will likely make a significant difference to the performance of CleverCSV. Some statistics(*) on our integration tests:

mean runtime: 0.629 seconds to 0.445 seconds (-29.3%)
median runtime: 18.19 ms to 16.06 ms (-11.7%)
p90 runtime: 0.951 seconds to 0.732 seconds (-23.1%)
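For reference, the relative changes can be recomputed from the rounded figures above with a short sketch (the helper name `percent_change` is illustrative). With these rounded inputs the p90 change comes out near -23.0%, so the reported -23.1% was presumably computed from the unrounded timings.

```python
def percent_change(before: float, after: float) -> float:
    # Relative change of `after` with respect to `before`, in percent.
    return (after - before) / before * 100.0


# Rounded runtimes from the list above, in seconds: (before, after).
timings = {
    "mean": (0.629, 0.445),
    "median": (0.01819, 0.01606),
    "p90": (0.951, 0.732),
}

changes = {name: percent_change(b, a) for name, (b, a) in timings.items()}
for name, change in changes.items():
    print(f"{name}: {change:+.1f}%")
```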

Also, this PR fixes the documentation error reported in #91.

*: one file (13a6c86a18f053c593feda3d98755010) was excluded from the comparison because, before these improvements, dialect detection would time out on it, so it was not included in the earlier results.

GjjvdBurg added 8 commits April 1, 2023 17:37

correct documentation for potential escapechar (fixes #91)

050bd9b

Improve efficiency of escape char

510b037

This fix reduces runtime on the integration test set by ~17%.

Implement merge_with_quotechar in C

54b4358

Gives a speed improvement of ~15%

Replace part of fill_empties with regex

16aa4a5

Rest might be feasible too, but may need some further investigation
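As a rough illustration of the kind of change this commit describes, the sketch below replaces a series of repeated `str.replace` loops with a single regex substitution. It assumes an abstraction string made of the symbols `C` (cell), `D` (delimiter) and `R` (row break), in which a `C` must be inserted between any two adjacent `D`/`R` symbols. That representation and both function names are assumptions for illustration, not the actual `fill_empties` code.

```python
import re

# Zero-width match between two adjacent D/R symbols.
_GAP = re.compile(r"(?<=[DR])(?=[DR])")


def fill_gaps_loop(abstraction: str) -> str:
    # Baseline: keep replacing each adjacent pair until none remain.
    pairs = (("DD", "DCD"), ("DR", "DCR"), ("RD", "RCD"), ("RR", "RCR"))
    for pair, filled in pairs:
        while pair in abstraction:
            abstraction = abstraction.replace(pair, filled)
    return abstraction


def fill_gaps_regex(abstraction: str) -> str:
    # Single pass: insert "C" at every gap between adjacent D/R symbols.
    return _GAP.sub("C", abstraction)


examples = ["CDDRRC", "DDD", "CDCRC", "RDRD"]
results = [(fill_gaps_loop(s), fill_gaps_regex(s)) for s in examples]
print(results)
```

The regex version scans the string once, whereas the loop version may rescan it several times per pair.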

Update integration test results

4fa9896

One file moves from error to failed: dialect detection still fails on it, but thanks to the recent speed improvements it no longer times out. The failure can be investigated at some point, but is not considered urgent at this time.

Add additional test cases for making abstractions

ad6cb54

Used to further test the C version of merge_with_quotechar

Merge branch 'master' into bugfix/escape_char

2ed8eab

fix possible invalid read

87f6cc8

GjjvdBurg merged commit fb539ee into master Apr 6, 2023

GjjvdBurg mentioned this pull request Apr 7, 2023

Reduce median dialect detection time by ~64% #96

Merged
