
Datasaver save speed #1187

Merged
astafan8 merged 4 commits into microsoft:master from astafan8/datasaver-save-speed on Jul 18, 2018
Conversation

astafan8 (Contributor) commented Jul 13, 2018

Recently, two users have reported that it takes too much time to save data to the database during an experiment. In particular, saving 10,000 points for 5 parameters (2 dependent and 3 independent) is reported to take 5-10 s.

After profiling the DataSaver add_results method with cProfile, I was able to improve the saving speed by a factor of ~2. Hence this pull request.

The fix is in the insert_many_values method. The data is split into chunks in order to pass as many values as possible in a single INSERT call. The problem was that each of these INSERT calls was followed by its own commit, which degraded performance. As is widely suggested, in this situation all the INSERT calls should happen within one transaction, i.e. with a single COMMIT for all of them. This improves the performance.
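For illustration, here is a minimal sketch of the pattern (not the actual qcodes code; the helper name is made up, and table/column names are assumed trusted):

```python
import itertools
import sqlite3
from typing import Any, List, Sequence

# SQLite limits the number of bound variables per statement, so rows are
# inserted in chunks; 999 is the historic default limit.
MAX_VARIABLE_NUMBER = 999


def insert_rows_single_commit(conn: sqlite3.Connection,
                              table: str,
                              columns: Sequence[str],
                              rows: List[Sequence[Any]]) -> None:
    """Insert all rows via chunked INSERTs, but with a single COMMIT."""
    rows_per_chunk = max(1, MAX_VARIABLE_NUMBER // len(columns))
    row_placeholder = "(" + ", ".join(["?"] * len(columns)) + ")"
    with conn:  # one transaction: COMMIT happens once, when the block exits
        for start in range(0, len(rows), rows_per_chunk):
            chunk = rows[start:start + rows_per_chunk]
            sql = (f"INSERT INTO {table} ({', '.join(columns)}) VALUES "
                   + ", ".join([row_placeholder] * len(chunk)))
            # flatten the chunk of rows into one flat parameter sequence
            conn.execute(sql, tuple(itertools.chain.from_iterable(chunk)))
```

Committing once per call, instead of once per chunk, is what gives the ~2x speed-up reported above.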

Changes in this PR:

  • a single COMMIT for all INSERTs
  • remove the now-unnecessary commit after the insert_many_values call, since the commit is done within that function
  • use itertools.chain.from_iterable instead of a list comprehension for flattening a list of lists, as advised on StackOverflow (see the sketch after this list)
  • add a convenient profile decorator function
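As a rough illustration of the last two items, a hedged sketch follows: the flattening comparison mirrors the StackOverflow advice, and the decorator is only an assumption of what a "convenient profile decorator" could look like (the helper actually added in this PR may differ):

```python
import cProfile
import functools
import itertools
import pstats
import timeit

# Flattening a list of lists: itertools.chain.from_iterable vs. a nested
# list comprehension (timings are machine-dependent).
data = [list(range(10)) for _ in range(10_000)]
t_chain = timeit.timeit(
    lambda: list(itertools.chain.from_iterable(data)), number=100)
t_listcomp = timeit.timeit(
    lambda: [x for sub in data for x in sub], number=100)
print(f"chain.from_iterable: {t_chain:.3f} s, list comprehension: {t_listcomp:.3f} s")


def profile(func):
    """Print cProfile statistics for every call of the decorated function.

    Only a sketch of a 'convenient profile decorator'; the real helper
    may look different.
    """
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        profiler = cProfile.Profile()
        result = profiler.runcall(func, *args, **kwargs)
        pstats.Stats(profiler).sort_stats("cumulative").print_stats(20)
        return result
    return wrapper
```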

Note that the users would like this amount of data to be saved within ~0.1 s, which raises the question of whether sqlite is the right backend. But let's keep that discussion separate from this pull request.

astafan8 added 4 commits July 13, 2018 11:02
insert_many_values uses atomic transactions already, hence no need to
commit; tests pass
This improves performance by ~2x because there is only one commit for
any number of data chunks (a data chunk is defined by the maximum number
of variables that can be inserted into the sqlite database at once).
itertools.chain.from_iterable is faster than a simple list comprehension,
as many users on StackOverflow claim (with data from timeit experiments)
codecov bot commented Jul 13, 2018

Codecov Report

Merging #1187 into master will decrease coverage by 0.08%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master    #1187      +/-   ##
==========================================
- Coverage   79.76%   79.67%   -0.09%     
==========================================
  Files          48       47       -1     
  Lines        6662     6668       +6     
==========================================
- Hits         5314     5313       -1     
- Misses       1348     1355       +7

Dominik-Vogel (Contributor) commented

I additionally tested:

  • increasing the cache -> no improvement
  • using isolation_level='EXCLUSIVE' -> no improvement
  • using a pure in-memory database -> no significant improvement (from ~3 s down to 2.19 s ± 224 ms); these settings are sketched after this list

So I suggest testing storing the data as binary blobs until we come up with a better solution.
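For reference, a minimal sketch of the settings mentioned above, using the standard-library sqlite3 module (the database file name and the cache value are placeholders, not the values used in the tests):

```python
import sqlite3

# Larger page cache (a negative cache_size is interpreted as KiB);
# no improvement was observed.
conn = sqlite3.connect("experiments.db")
conn.execute("PRAGMA cache_size = -100000")

# Coarser locking via the connection's isolation level; no improvement either.
conn_exclusive = sqlite3.connect("experiments.db",
                                 isolation_level="EXCLUSIVE")

# Pure in-memory database; only a modest improvement was observed.
conn_memory = sqlite3.connect(":memory:")
```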

astafan8 (Contributor, Author) commented

@Dominik-Vogel Shall we at least merge the 2x fix from this PR?

jenshnielsen (Collaborator) left a comment

Only one comment

@@ -405,8 +405,6 @@ def add_results(self, results: List[Dict[str, VALUE]]) -> int:
         len_before_add = length(self.conn, self.table_name)
         insert_many_values(self.conn, self.table_name, list(expected_keys),
                            values)
-        # TODO: should this not be made atomic?
-        self.conn.commit()
jenshnielsen (Collaborator) commented:
So this is technically an API change for anyone that uses add_results directly. Do we need to worry about that? Should we rather add a new function that does not commit and make this an atomic transaction around that?

astafan8 (Contributor, Author) replied Jul 18, 2018:

insert_many_values already commits, so this line is just redundant. There is no API change at all, neither for add_results nor for insert_many_values.

astafan8 (Contributor, Author) commented

As discussed with @WilliamHPNielsen, I will remove the benchmark code from this PR and open a new PR with benchmark code that uses asv (as is done in numpy).
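For context, a hedged sketch of what such an asv benchmark might look like; the file name, class, and parameter names are made up, and the actual DataSaver setup is only indicated in comments:

```python
# benchmarks/save_speed.py -- airspeed velocity (asv) benchmark sketch
class TimeAddResults:
    """Time saving 10,000 points for 5 parameters (2 dependent, 3 independent)."""

    def setup(self):
        # A real benchmark would create a temporary database, an experiment
        # and a DataSaver here; this sketch only prepares the payload.
        self.results = [
            {"x": float(i), "y": float(i), "z": float(i),
             "v1": float(i), "v2": float(i)}
            for i in range(10_000)
        ]

    def time_add_results(self):
        # Placeholder for: self.datasaver.add_results(self.results)
        # asv measures the wall-clock time of each time_* method.
        _ = [tuple(result.values()) for result in self.results]
```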

@astafan8 astafan8 force-pushed the datasaver-save-speed branch from 2087c20 to 88e0318 Compare July 18, 2018 15:22
@astafan8 astafan8 merged commit bcedfbc into microsoft:master Jul 18, 2018
giulioungaretti pushed a commit that referenced this pull request Jul 18, 2018
Merge: ed9b6fe 88e0318
Author: Mikhail Astafev <astafan8@gmail.com>

    Merge pull request #1187 from astafan8/datasaver-save-speed
@astafan8 astafan8 deleted the datasaver-save-speed branch July 19, 2018 13:05