Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
LibraryScanner: Improve hashing of directory contents #2497
LibraryScanner: Improve hashing of directory contents #2497
Changes from 2 commits
3171c9c
686a229
2ebaede
44ee5df
160618c
5c6996d
866704d
File filter
Filter by extension
Conversations
Jump to
There are no files selected for viewing
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Did you made a performance check for toUtf8() compared to use the QString bytes?
I can Imagine that the later is faster.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Raw bytes of QString are not platform-independent when considering endianess, but UTF-8 is. Does this matter?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is no function to obtain the raw bytes of a QString. Using toLatin1() may lose characters and toLocal8Bit() is platform dependent while the database is portable. Please clarify which option you consider here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It was just the idea to check if something like
hasher.addData(reinterprete_cast<char*>(currentFile.utf16()))
is faster or not.
This way there is the double amount to hash but no utf8 conversion.
I don't care about endianes, because this would be only apply if the DB is on an external HD, accessing the same OS with different CPU architectures.
Thinking of this a independent solution is not bad.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think that the additional allocations really matter here and won't invest hours to figure out how to do a realistic performance test to prove this. The optimized solution would also produce non-portable data, in contrast to all other data in the database.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Double the data should not be an issue. sha256 runtime not increase 1:1 with the amount of input data. I'm on my phone, but you should see the same effect on desktop pcs:
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The actual number of bytes that we feed into QCryptographicHash should not be an issue. But QString might need to allocate a temporary QByteArray for the UTF-8 representation and the conversion itself also takes some time.
Yes, utf16() would be faster, but it is non-portable. If someone is able to proof that it is much faster and worth the drawback, please go ahead ;) I will not do it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just out of curious I did the test. Here are the results:
Conclusion:
As expected there is only performance gain for ASCII characters. Except in non Latin charter sets two byte utf8 characters are a minority.
For a typical file name of 30 charters, the unicode() version is way faster. If we append all strings first, the utf8 version catches up at one point.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Those numbers also need to be put in relation to the file system access. I guess that the differences on the overall performance are negligible. That's what I had in mind with "realistic performance test".
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes of cause. I was just interested if we have a low hanging fruit here.