
Datasaver save speed #1187

Merged
astafan8 merged 4 commits into microsoft:master from astafan8/datasaver-save-speed on Jul 18, 2018
Conversation

astafan8 (Contributor) commented Jul 13, 2018

Recently, two users have reported that it takes too much time to save data to the database during an experiment. In particular, saving 10,000 points for 5 parameters (2 dependent and 3 independent) is reported to take 5-10 s.

After profiling the DataSaver add_results method with cProfile, I was able to improve the saving speed by a factor of ~2. Hence this pull request.

The fix is in the insert_many_values method. The data is split into chunks in order to pass as many values as possible in a single INSERT call. The problem was that each of these INSERT calls was followed by its own commit, which degraded performance. As is widely suggested, in this situation all the INSERT calls should happen within one transaction, i.e. with a single COMMIT for all of them. This improves the performance.
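For illustration, here is a minimal sketch of the pattern (not the actual qcodes code; the helper name is made up, and table/column names are assumed trusted):

```python
import itertools
import sqlite3
from typing import Any, List, Sequence

# SQLite limits the number of bound variables per statement, so rows are
# inserted in chunks; 999 is the historic default limit.
MAX_VARIABLE_NUMBER = 999


def insert_rows_single_commit(conn: sqlite3.Connection,
                              table: str,
                              columns: Sequence[str],
                              rows: List[Sequence[Any]]) -> None:
    """Insert all rows via chunked INSERTs, but with a single COMMIT."""
    rows_per_chunk = max(1, MAX_VARIABLE_NUMBER // len(columns))
    row_placeholder = "(" + ", ".join(["?"] * len(columns)) + ")"
    with conn:  # one transaction: COMMIT happens once, when the block exits
        for start in range(0, len(rows), rows_per_chunk):
            chunk = rows[start:start + rows_per_chunk]
            sql = (f"INSERT INTO {table} ({', '.join(columns)}) VALUES "
                   + ", ".join([row_placeholder] * len(chunk)))
            # flatten the chunk of rows into one flat parameter sequence
            conn.execute(sql, tuple(itertools.chain.from_iterable(chunk)))
```

Committing once per call, instead of once per chunk, is what gives the ~2x speed-up reported above.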

Changes in this PR:

  • a single COMMIT for all INSERTs
  • remove the now-unnecessary commit after the insert_many_values call, since the commit is done within that function
  • use itertools.chain.from_iterable instead of a list comprehension for flattening a list of lists, as advised on StackOverflow (see the sketch after this list)
  • add a convenient profile decorator function
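As a rough illustration of the last two items, a hedged sketch follows: the flattening comparison mirrors the StackOverflow advice, and the decorator is only an assumption of what a "convenient profile decorator" could look like (the helper actually added in this PR may differ):

```python
import cProfile
import functools
import itertools
import pstats
import timeit

# Flattening a list of lists: itertools.chain.from_iterable vs. a nested
# list comprehension (timings are machine-dependent).
data = [list(range(10)) for _ in range(10_000)]
t_chain = timeit.timeit(
    lambda: list(itertools.chain.from_iterable(data)), number=100)
t_listcomp = timeit.timeit(
    lambda: [x for sub in data for x in sub], number=100)
print(f"chain.from_iterable: {t_chain:.3f} s, list comprehension: {t_listcomp:.3f} s")


def profile(func):
    """Print cProfile statistics for every call of the decorated function.

    Only a sketch of a 'convenient profile decorator'; the real helper
    may look different.
    """
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        profiler = cProfile.Profile()
        result = profiler.runcall(func, *args, **kwargs)
        pstats.Stats(profiler).sort_stats("cumulative").print_stats(20)
        return result
    return wrapper
```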

Note that the users would like this amount of data to be saved within ~0.1 s, which raises the question of whether sqlite is the right backend. But let's keep that discussion separate from this pull request.

astafan8 added 4 commits July 13, 2018 11:02
insert_many_values uses atomic transactions already, hence no need to
commit; tests pass
This improves performance by ~2x because there is only one commit for
any number of data chunks (a data chunk is defined by the maximum number
of variables that can be inserted into the sqlite database at once).
itertools.chain.from_iterable is faster than a simple list comprehension,
as many users on StackOverflow claim (with data from timeit experiments)
codecov bot commented Jul 13, 2018

Codecov Report

Merging #1187 into master will decrease coverage by 0.08%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master    #1187      +/-   ##
==========================================
- Coverage   79.76%   79.67%   -0.09%     
==========================================
  Files          48       47       -1     
  Lines        6662     6668       +6     
==========================================
- Hits         5314     5313       -1     
- Misses       1348     1355       +7

Dominik-Vogel (Contributor) commented

I additionally tested:

  • increasing the cache -> no improvement
  • using isolation_level='EXCLUSIVE' -> no improvement
  • using a pure in-memory database -> no significant improvement (from ~3 s down to 2.19 s ± 224 ms); these settings are sketched after this list

So I suggest testing storing the data as binary blobs until we come up with a better solution.
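For reference, a minimal sketch of the settings mentioned above, using the standard-library sqlite3 module (the database file name and the cache value are placeholders, not the values used in the tests):

```python
import sqlite3

# Larger page cache (a negative cache_size is interpreted as KiB);
# no improvement was observed.
conn = sqlite3.connect("experiments.db")
conn.execute("PRAGMA cache_size = -100000")

# Coarser locking via the connection's isolation level; no improvement either.
conn_exclusive = sqlite3.connect("experiments.db",
                                 isolation_level="EXCLUSIVE")

# Pure in-memory database; only a modest improvement was observed.
conn_memory = sqlite3.connect(":memory:")
```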

astafan8 (Contributor, Author) commented

@Dominik-Vogel Shall we at least merge the 2x fix from this PR?

jenshnielsen (Collaborator) left a comment

Only one comment

@@ -405,8 +405,6 @@ def add_results(self, results: List[Dict[str, VALUE]]) -> int:
         len_before_add = length(self.conn, self.table_name)
         insert_many_values(self.conn, self.table_name, list(expected_keys),
                            values)
-        # TODO: should this not be made atomic?
-        self.conn.commit()
jenshnielsen (Collaborator) commented:
So this is technically an API change for anyone that uses add_results directly. Do we need to worry about that? Should we rather add a new function that does not commit and make this an atomic transaction around that?

astafan8 (Contributor, Author) replied Jul 18, 2018:

insert_many_values already commits, so this line is just redundant. There is no API change at all, neither for add_results nor for insert_many_values.

astafan8 (Contributor, Author) commented

As discussed with @WilliamHPNielsen, I will remove the benchmark code from this PR and open a new PR with benchmark code that uses asv (as is done in numpy).
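For context, a hedged sketch of what such an asv benchmark might look like; the file name, class, and parameter names are made up, and the actual DataSaver setup is only indicated in comments:

```python
# benchmarks/save_speed.py -- airspeed velocity (asv) benchmark sketch
class TimeAddResults:
    """Time saving 10,000 points for 5 parameters (2 dependent, 3 independent)."""

    def setup(self):
        # A real benchmark would create a temporary database, an experiment
        # and a DataSaver here; this sketch only prepares the payload.
        self.results = [
            {"x": float(i), "y": float(i), "z": float(i),
             "v1": float(i), "v2": float(i)}
            for i in range(10_000)
        ]

    def time_add_results(self):
        # Placeholder for: self.datasaver.add_results(self.results)
        # asv measures the wall-clock time of each time_* method.
        _ = [tuple(result.values()) for result in self.results]
```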

@astafan8 astafan8 force-pushed the datasaver-save-speed branch from 2087c20 to 88e0318 Compare July 18, 2018 15:22
@astafan8 astafan8 merged commit bcedfbc into microsoft:master Jul 18, 2018
giulioungaretti pushed a commit that referenced this pull request Jul 18, 2018
Merge: ed9b6fe 88e0318
Author: Mikhail Astafev <astafan8@gmail.com>

    Merge pull request #1187 from astafan8/datasaver-save-speed
@astafan8 astafan8 deleted the datasaver-save-speed branch July 19, 2018 13:05