diff: output after partial pull wrongfully includes top dir #9507

Closed
omesser opened this issue May 24, 2023 · 4 comments · Fixed by #9980
Labels: bug (Did we break something?), diff/show (Related to the diff/show feature), p1-important (Important, aka current backlog of things to do)

Comments

@omesser
Contributor

omesser commented May 24, 2023

Bug Report

Description

After pulling a subset of a DVC-tracked dataset, dvc diff <subpath in dataset> shows wrong (or at least confusing) results.
Specifically, if only a subdirectory of a dataset is pulled and modified, the top-level directory appears as "Deleted", which is wrong.

Reproduce

Use a fresh repo with the dataset.dvc from dvc-bench/data/mnist at its root.

  1. Pull the data partially and create one new file:
$ dvc pull dataset/train/4
$ echo "new" > dataset/train/4/new.txt

Note: this uses the newly developed "virtual dir" feature.

  2. git diff shows the expected result, one file added to the dataset:
$ git --no-pager diff
diff --git a/dataset.dvc b/dataset.dvc
index ac31d91..79740af 100644
--- a/dataset.dvc
+++ b/dataset.dvc
@@ -1,6 +1,5 @@
 outs:
-- md5: e42412b82dcab425ce9c7e2d0abdfb78.dir
-  size: 19258482
-  nfiles: 70000
+- md5: 881ee47fcae5c4d9625071cfdc5c3991.dir
+  nfiles: 70001
   path: dataset
  3. Running dvc diff gives the fishy result:
$ dvc diff --targets dataset/train/4
Added:
    dataset/train/4/
    dataset/train/4/new.txt

Deleted:
    dataset/

files summary: 1 added

☝️ It shows the top-level dir as "Deleted", and also the containing dir as added (not just the file). This looks like a bug.

Another questionable behavior appears when running dvc diff --targets dataset/: it dumps all the files not present in the workspace (dataset/test/*, dataset/train/{!4}/*). I intentionally didn't pull them, but they are not missing from my local dataset.dvc.

Expected

$ dvc diff --targets dataset/train/4
Added:
    dataset/train/4/new.txt
files summary: 1 added
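
The expected output above amounts to restricting the reported entries to paths strictly below the requested target, so that neither ancestor directories (dataset/) nor the target directory itself (dataset/train/4/) are listed. A minimal sketch of that filtering idea follows. The function name filter_diff_to_target and the dict-of-lists input shape are hypothetical, chosen only for illustration; this is not DVC's implementation or API.

```python
from pathlib import PurePosixPath


def filter_diff_to_target(diff, target):
    """Keep only entries strictly below `target` (hypothetical helper).

    `diff` maps a state name ("added", "deleted", ...) to a list of
    POSIX-style paths; this shape is an assumption for the sketch,
    not the structure DVC uses internally.
    """
    target_path = PurePosixPath(target)
    filtered = {}
    for state, paths in diff.items():
        # An entry is kept only if the target is one of its ancestors.
        # This drops ancestors of the target and the target directory itself.
        filtered[state] = [
            p for p in paths if target_path in PurePosixPath(p).parents
        ]
    return filtered


# Entries taken from the "fishy" output reported in this issue.
raw = {
    "added": ["dataset/train/4/", "dataset/train/4/new.txt"],
    "deleted": ["dataset/"],
}
result = filter_diff_to_target(raw, "dataset/train/4")
```

With the entries from the report, only dataset/train/4/new.txt survives under "added" and "deleted" becomes empty, which matches the expected output.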

Questionable: should we also expect to see dataset/ as "modified", since it is the dataset (data-item?) level, the contents of dataset.dvc are modified, and dataset.dvc represents dataset/? I would argue it's not expected, but I'm not sure.

Environment information

$ dvc doctor
DVC version: 2.57.3.dev6+g3ddd4b87
----------------------------------
Platform: Python 3.8.12 on macOS-13.0-arm64-arm-64bit
Subprojects:
	dvc_data = 0.51.0
	dvc_objects = 0.22.0
	dvc_render = 0.5.3
	dvc_task = 0.2.1
	scmrepo = 1.0.3
Supports:
	http (aiohttp = 3.8.4, aiohttp-retry = 2.8.3),
	https (aiohttp = 3.8.4, aiohttp-retry = 2.8.3),
	s3 (s3fs = 2023.5.0, boto3 = 1.26.76)
Config:
	Global: /Users/oded/Library/Application Support/dvc
	System: /Library/Application Support/dvc
Cache types: reflink, hardlink, symlink
Cache directory: apfs on /dev/disk3s1s1
Caches: local
Remotes: local, s3, https
Workspace directory: apfs on /dev/disk3s1s1
Repo: dvc, git
Repo.site_cache_dir: /Library/Caches/dvc/repo/3b754cc9e7b81a867238f983ad8551e1

Additional Information:

Related:

@omesser omesser added bug Did we break something? p1-important Important, aka current backlog of things to do diff/show Related to the diff/show feature labels May 24, 2023
@omesser
Contributor Author

omesser commented May 24, 2023

Additional info: dvc data status --granular output attached.
data_status_gran.txt

@dberenbaum
Collaborator

it shows the top level dir as "Deleted", and also the containing dir as added

This may have actually got worse since 3.0. Even without adding a new file, I see that behavior, and now I also see all the other files in dataset/train/4 as modified:

git clone git@github.com:iterative/dvc-bench.git
cd dvc-bench/data/mnist
dvc pull dataset/train/4
dvc diff --targets dataset/train/4

The output looks like:

Added:
    data/mnist/dataset/train/4/

Deleted:
    data/mnist/dataset/

Modified:
    data/mnist/dataset/train/4/00003.png
    data/mnist/dataset/train/4/00010.png
    data/mnist/dataset/train/4/00021.png
    ...

@dberenbaum dberenbaum added this to DVC Aug 10, 2023
@dberenbaum dberenbaum moved this to Todo in DVC Aug 10, 2023
@efiop
Contributor

efiop commented Sep 26, 2023

The new "modified" state for the other files is related to the hash changes: md5 != md5-dos2unix, hence the hash mismatch and that diff. Taking a closer look.
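
To illustrate why a 2.x-era entry can compare as different from a 3.x one, here is a small sketch. It assumes that a hash is compared as a (name, value) pair, so "md5-dos2unix" and "md5" never match even when the digests are identical. The HashInfo class and both helper functions are hypothetical and written only for this example. Normalizing line endings unconditionally is also a simplification, not a description of DVC's actual code.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class HashInfo:
    # Hypothetical container: equality compares both name and value.
    name: str
    value: str


def md5_plain(data: bytes) -> HashInfo:
    # Plain md5 over the raw bytes.
    return HashInfo("md5", hashlib.md5(data).hexdigest())


def md5_dos2unix(data: bytes) -> HashInfo:
    # Simplified: convert CRLF to LF before hashing (assumption for the sketch).
    normalized = data.replace(b"\r\n", b"\n")
    return HashInfo("md5-dos2unix", hashlib.md5(normalized).hexdigest())


# Binary-like content without CRLF: same digest, different hash name.
png_like = b"\x89PNG pixel data without carriage returns"
old_entry = md5_dos2unix(png_like)
new_entry = md5_plain(png_like)
same_digest = old_entry.value == new_entry.value
entries_equal = old_entry == new_entry

# Text content with CRLF: the digests themselves differ as well.
crlf_text = b"line one\r\nline two\r\n"
crlf_digests_differ = md5_plain(crlf_text).value != md5_dos2unix(crlf_text).value
```

Under these assumptions, same_digest is true while entries_equal is false, which is consistent with unchanged .png files being reported as modified.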

@dberenbaum
Collaborator

You can ignore my comment then since it seems like it's just about 2.x->3.x conversion.


3 participants