Datasaver save speed #1187
Conversation
`insert_many_values` already uses atomic transactions, hence there is no need to commit; tests pass.
This improves performance by ~2x because there is only one commit for any number of data chunks (a data chunk is defined by the maximum number of variables that can be inserted into the SQLite database at once).
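As an illustration of the single-commit pattern, here is a minimal sketch using the standard `sqlite3` module with a hypothetical table; this is not the QCoDeS code itself:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE results (x REAL, y REAL)")

# ten chunks of a hundred rows each
chunks = [[(float(i), float(i) ** 2) for i in range(j, j + 100)]
          for j in range(0, 1000, 100)]

for chunk in chunks:
    # one INSERT per chunk, but no commit inside the loop
    conn.executemany("INSERT INTO results (x, y) VALUES (?, ?)", chunk)

conn.commit()  # a single commit, no matter how many chunks were inserted
```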
`itertools.chain.from_iterable` is faster than a simple list comprehension, as many users on StackOverflow claim (backed by data from `timeit` experiments).
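A quick way to check that claim (a sketch with arbitrary sizes; actual timings vary by machine and data shape):

```python
import timeit
from itertools import chain

values = [[i, i + 1, i + 2] for i in range(10_000)]  # a list of lists

# flattening via a nested list comprehension
t_comp = timeit.timeit(lambda: [v for row in values for v in row], number=100)

# flattening via itertools.chain.from_iterable
t_chain = timeit.timeit(lambda: list(chain.from_iterable(values)), number=100)

print(f"comprehension: {t_comp:.3f} s, chain.from_iterable: {t_chain:.3f} s")
```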
Codecov Report

```
@@            Coverage Diff             @@
##           master    #1187      +/-   ##
==========================================
- Coverage   79.76%   79.67%   -0.09%
==========================================
  Files          48       47       -1
  Lines        6662     6668       +6
==========================================
- Hits         5314     5313       -1
- Misses       1348     1355       +7
```
I additionally tested:
@Dominik-Vogel Shall we at least merge the 2x fix from this PR?
Only one comment
```diff
@@ -405,8 +405,6 @@ def add_results(self, results: List[Dict[str, VALUE]]) -> int:
         len_before_add = length(self.conn, self.table_name)
         insert_many_values(self.conn, self.table_name, list(expected_keys),
                            values)
-        # TODO: should this not be made atomic?
-        self.conn.commit()
```
So this is technically an API change for anyone that uses `add_results` directly. Do we need to worry about that? Should we rather add a new function that does not commit, and make this an atomic transaction around that?
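One shape the suggested split could take (a sketch with hypothetical names, not the existing qcodes API):

```python
from contextlib import contextmanager


@contextmanager
def atomic(conn):
    """Hypothetical: run the enclosed statements as one transaction."""
    try:
        yield conn
    except Exception:
        conn.rollback()
        raise
    else:
        conn.commit()


def add_results_no_commit(conn, sql, rows):
    """Hypothetical non-committing variant; the caller controls the commit."""
    conn.executemany(sql, rows)


# the committing entry point would keep its old behaviour:
# with atomic(conn):
#     add_results_no_commit(conn, sql, rows)
```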
`insert_many_values` already commits, so this line is just redundant. There is no API change at all, neither for `add_results` nor for `insert_many_values`.
As discussed with @WilliamHPNielsen, I will remove the benchmark code from this PR, and make a new PR with the benchmark code that uses …
(force-pushed from 2087c20 to 88e0318)
Recently, two users have reported that it takes too much time to save data to the database during an experiment. In particular, saving 10,000 points for 5 parameters (2 dependent and 3 independent) is reported to take 5-10 s.

After investigating the `DataSaver` `add_results` method with the `cProfile` profiler, I could improve the saving speed by a factor of ~2. Hence this pull request.

The fix is in the `insert_many_values` method. The data is split into chunks in order to pass as many values as possible within a single `INSERT` call. The problem was that each of these `INSERT` calls had a `commit`, which was decreasing the performance. As is suggested everywhere on the web, in this situation all the `INSERT` calls should happen within one transaction, meaning there should be a single `COMMIT` for all of them. This way the performance is improved.

Changes in this PR:

- a single `COMMIT` for all `INSERT`s in `insert_many_values`
- no `commit` in `add_results` after the `insert_many_values` call, because the commit is done within that function
- `itertools` instead of a list comprehension for flattening a list of lists (advised on StackOverflow)

Note that the users would like to have this amount of data saved within ~0.1 s, which raises the question of whether SQLite is the right backend. But let's keep that discussion aside from this pull request.
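For concreteness, a minimal sketch of the chunked-insert pattern described above, with all chunks covered by one commit. This is an illustration, not the actual QCoDeS implementation: the helper name `insert_in_chunks` is hypothetical, and 999 is SQLite's historical default for the maximum number of host parameters per statement (`SQLITE_MAX_VARIABLE_NUMBER`).

```python
from itertools import chain
from typing import Any, List, Sequence

MAX_VARIABLE_NUMBER = 999  # SQLite's historical default host-parameter limit


def insert_in_chunks(conn, table: str, columns: Sequence[str],
                     rows: List[Sequence[Any]]) -> None:
    """Hypothetical helper: insert all rows with as few INSERT calls as
    the parameter limit allows, and commit only once at the end."""
    n_cols = len(columns)
    rows_per_insert = MAX_VARIABLE_NUMBER // n_cols  # rows fitting one INSERT
    row_placeholder = "(" + ", ".join("?" * n_cols) + ")"
    for start in range(0, len(rows), rows_per_insert):
        chunk = rows[start:start + rows_per_insert]
        sql = (f"INSERT INTO {table} ({', '.join(columns)}) VALUES "
               + ", ".join([row_placeholder] * len(chunk)))
        # flatten the chunk's rows into one flat parameter tuple
        conn.execute(sql, tuple(chain.from_iterable(chunk)))
    conn.commit()  # the single COMMIT covering every chunk
```

With a `sqlite3.Connection` one would call, for example, `insert_in_chunks(conn, "results", ("x", "y"), rows)`.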