Performance improvements #92

Merged
merged 8 commits into from
Apr 6, 2023
Conversation

GjjvdBurg
Collaborator

This PR adds performance improvements in two ways:

  • Caching the is_potential_escapechar function result
  • Implementing merge_with_quotechar in C
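
As a rough illustration of the first point, memoizing a pure character-classification function can be sketched with `functools.lru_cache`. The function below is a simplified, hypothetical stand-in and not the actual CleverCSV implementation: its name, its single-argument signature, and the blocklist of excluded punctuation are all assumptions made for the example.

```python
import unicodedata
from functools import lru_cache

# Assumption: punctuation that should never be treated as an escape
# character. This set is illustrative, not taken from CleverCSV.
BLOCKED = frozenset("!?\"'.,;:%*&#")


@lru_cache(maxsize=None)
def is_potential_escapechar_cached(char: str) -> bool:
    """Hypothetical stand-in for an escape-character check.

    The result depends only on the argument, so it is safe to memoize:
    each distinct character is classified once, and repeated lookups
    are served from the cache.
    """
    # "Po" is the Unicode general category "Punctuation, other".
    return unicodedata.category(char) == "Po" and char not in BLOCKED
```

Because a file typically contains far fewer distinct characters than total characters, most calls become cache hits. `is_potential_escapechar_cached.cache_info()` reports the hit and miss counts if you want to check that on your own data.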

Especially for large files, this will likely make a significant difference to the performance of CleverCSV. Some statistics(*) on our integration tests:

  • mean runtime: 0.629 seconds to 0.445 seconds (-29.3%)
  • median runtime: 18.19 ms to 16.06 ms (-11.7%)
  • p90 runtime: 0.951 seconds to 0.732 seconds (-23.1%)

Also, this PR fixes the documentation error reported in #91.


*: one file (13a6c86a18f053c593feda3d98755010) was excluded from the comparison because, before these improvements, dialect detection timed out on it, so it was not included in the earlier measurements.

  • This fix reduces runtime on the integration test set by ~17%.
  • Gives a speed improvement of ~15%.
  • The rest might be feasible too, but may need some further investigation.
  • The one file moving from error to failed is because dialect detection fails, but it no longer times out due to the recent speed improvements. The failure can be investigated at some point, but is not considered urgent at this time.
  • Used to further test the C version of merge_with_quotechar.