remote: migrate: deduplicate objects between v2 and v3 #9924

Closed
12michi34 opened this issue Sep 7, 2023 · 7 comments

@12michi34

My situation is as follows:
a) dataFolderA contains fileA.bin, fileB.bin, and fileC.bin. I added it to the remote via "dvc add dataFolderA" using DVC 2.0.
b) I then changed fileB.bin and added it to the remote via "dvc add dataFolderB" using DVC 3.0.

When investigating the remote (and cache), I can see the md5-renamed files for fileA.bin and fileC.bin in both files/md5// and /.
They have exactly the same md5 hash, so the data for fileA.bin and fileC.bin is now stored twice in the remote (and cache).
(I am simplifying my case; there are many fileA, fileB, fileC's involved.)

How can I clean up the remote? I know there is a "dvc cache migrate" command (I have not tried it yet, though).
Kindest regards

@efiop
Contributor

efiop commented Sep 7, 2023

@12michi34 Is this you on the forum as well https://discuss.dvc.org/t/help-with-upgrading-imported-via-dvc2-x-dvc-data-with-dvc3-0/1750/10? If not, then just linking for the record.

@efiop
Contributor

efiop commented Sep 7, 2023

Regarding the question itself, there is no such feature right now. We've thought about it when implementing migrate, but didn't prioritize it. One obvious way to go here is to try to use hardlinks to link v2 and v3 files.
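To make the hardlink idea concrete, here is a minimal Python sketch for a remote that is a plain local directory. It is not part of DVC. It assumes the layout described in the question (v3 objects under `files/md5/<prefix>/<name>` and v2 objects under `<prefix>/<name>` at the top level), that both trees live on the same filesystem, and that the filesystem supports hardlinks. The function names `find_duplicates` and `hardlink_duplicates` are hypothetical. Cloud remotes such as S3 have no hardlinks, so this does not apply there.

```python
import filecmp
import os
from pathlib import Path


def find_duplicates(root: Path) -> list:
    """Return (v2_path, v3_path) pairs that share a name and have identical bytes.

    Assumed layout: v3 objects in root/files/md5/<prefix>/<name>,
    v2 objects in root/<prefix>/<name>.
    """
    v3_root = root / "files" / "md5"
    pairs = []
    if not v3_root.is_dir():
        return pairs
    for prefix_dir in sorted(v3_root.iterdir()):
        if not prefix_dir.is_dir():
            continue
        for v3_obj in sorted(prefix_dir.iterdir()):
            v2_obj = root / prefix_dir.name / v3_obj.name
            if not (v2_obj.is_file() and v3_obj.is_file()):
                continue
            # Do not trust the name alone: compare contents byte by byte.
            if filecmp.cmp(v2_obj, v3_obj, shallow=False):
                pairs.append((v2_obj, v3_obj))
    return pairs


def hardlink_duplicates(root: Path) -> int:
    """Replace each duplicated v2 object with a hardlink to its v3 twin.

    Returns the number of v2 objects that were replaced.
    """
    linked = 0
    for v2_obj, v3_obj in find_duplicates(root):
        if v2_obj.samefile(v3_obj):
            continue  # already hardlinked, nothing to do
        tmp = v2_obj.with_name(v2_obj.name + ".dedup-tmp")
        os.link(v3_obj, tmp)      # create a second name for the v3 data
        os.replace(tmp, v2_obj)   # swap it in place of the v2 copy
        linked += 1
    return linked
```

Linking to a temporary name first and then calling `os.replace` means the v2 path never disappears, so an interrupted run leaves at worst a stray `.dedup-tmp` file. Running this against a copy of the remote first would be prudent.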

@efiop changed the title from "how to remove dvc v2.0/3.0 duplicates in remote (and cache)" to "cache: migrate: deduplicate objects between v2 and v3 cache" on Sep 7, 2023
@0x2b3bfa0
Member

(Originated on Discord)

@efiop added the "feature request" label on Sep 8, 2023
@dberenbaum added this to DVC on Sep 8, 2023
@github-project-automation (bot) moved this to Backlog in DVC on Sep 8, 2023
@pmrowla
Contributor

pmrowla commented Sep 12, 2023

Just to clarify here, DVC already supports 2.x/3.x deduplication for the local cache (via dvc cache migrate). Deduplication is currently unsupported only for remotes.

@pmrowla changed the title from "cache: migrate: deduplicate objects between v2 and v3 cache" to "remote: migrate: deduplicate objects between v2 and v3 cache" on Sep 12, 2023
@pmrowla changed the title from "remote: migrate: deduplicate objects between v2 and v3 cache" to "remote: migrate: deduplicate objects between v2 and v3" on Sep 12, 2023
@dberenbaum
Collaborator

After #9938, let's document how best to handle this: migrate everything to 3.0 and then gc.
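A rough sketch of what "migrate everything to 3.0 and then gc" could look like when scripted, assuming the commands are `dvc cache migrate --dvc-files`, `dvc push`, and `dvc gc --workspace --cloud` run in that order. The ordering, the choice of gc scope, and the helper names `migration_commands` and `run_migration` are assumptions made for illustration, not the documented procedure; the migration guide mentioned in this thread is the authoritative source. By default the helper only prints the commands, since `dvc gc --cloud` deletes data from the remote.

```python
import subprocess
from typing import List, Optional


def migration_commands(remote: Optional[str] = None) -> List[List[str]]:
    """Build the assumed command sequence; `remote` is an optional remote name."""
    remote_args = ["--remote", remote] if remote else []
    return [
        # Rewrite the local cache and .dvc/dvc.lock files to the 3.x format.
        ["dvc", "cache", "migrate", "--dvc-files"],
        # Upload the 3.x objects so the remote has everything the workspace needs.
        ["dvc", "push", *remote_args],
        # Drop remote objects the current workspace no longer references.
        # Assumption: --workspace is the right scope; other branches, tags, or
        # commits that still use 2.x objects would lose their data.
        ["dvc", "gc", "--workspace", "--cloud", *remote_args],
    ]


def run_migration(remote: Optional[str] = None, execute: bool = False) -> None:
    """Print each command, and run it only when execute=True."""
    for cmd in migration_commands(remote):
        print(" ".join(cmd))
        if execute:
            subprocess.run(cmd, check=True)
```

Calling `run_migration()` with no arguments prints the three commands without touching anything, which makes it easy to review the plan before passing `execute=True`.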

@dberenbaum self-assigned this on Oct 3, 2023
@12michi34
Author

@efiop Sorry about the late reply. Yes, this originated on Discord. I somehow missed the notification emails about new comments on this issue.

@dberenbaum
Collaborator

Added some explanation to the migration guide in the docs. No more action is planned at the moment, so closing this one.

@github-project-automation github-project-automation bot moved this from Backlog to Done in DVC Oct 13, 2023