-
Notifications
You must be signed in to change notification settings - Fork 1.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
dvcfs: optimize get()
by reducing index.info calls()
#10540
Conversation
tests/func/test_import.py
Outdated
expected_info_calls = {(name,)} | ||
if isinstance(files[name], dict): | ||
dirs = { | ||
(name, *d.relative_to(tmp_dir / "out").parts) | ||
for d in (tmp_dir / "out").rglob("*") | ||
if d.is_dir() | ||
} | ||
expected_info_calls.update(dirs) | ||
assert { | ||
call.args[1] for call in index_info_spy.call_args_list | ||
} == expected_info_calls |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This asserts that Index.info()
calls only happen for directories during fs.walk()
. If the output is a file, we'll call Index.info()
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not that strong an opinion, but would it be better to have a separate test for files and dirs? Maybe we need to leave some comments? This isn't the easiest test to understand on its own.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have simplified the tests. PTAL.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM! Really like all the dvcfs tests you added.
I was investigating why it was taking a long time (79s is still a lot of time for imports). Looking at the cprofile results, half the time was spent on hashing files. Turns out we were using 2.0 hashes in dvc-bench. So, |
2.0 hash cross-compatibility performance was the last item in #10059 (comment) that we never got to since I never saw anyone ask about it and it should becomes less of an issue over time. |
That's difficult for anyone else to say what the expected behaviour here is. Before your PR #10531, we were hashing them anyway. And imports have always been slower. |
This PR improves performance for
dvc import/get
when all files are cached by as much as 90%.This is done by using
fs.walk(detail=True)
insidedvcfs.get()
, which will gather all theinfo
from index at the same time (usingDataIndex.traverse
). Doing this will avoid callingSQLtrie.info()
millions of times.We cannot use
fs.info()
calls to gather state information at the end ofRepoDependency.download()
, so I had to createdvcfs._get
internal function that gives us the information that we need.Closes #10537.