
dvc repro --dry --allow-missing: fails on missing data #9818

Closed
Otterpatsch opened this issue Aug 8, 2023 · 24 comments
Labels
A: pipelines (Related to the pipelines feature) · awaiting response (we are waiting for your reply, please respond! :)) · bug (Did we break something?) · p1-important (Important, aka current backlog of things to do)

Comments

@Otterpatsch

Otterpatsch commented Aug 8, 2023

I tried to update our DVC CI pipeline.

Currently we run the following commands (among others):

dvc pull to check that everything is pushed
dvc status to check that the DVC status is clean, in other words that no stage would be run if one ran dvc repro.

But pulling takes a long time, and with the new --allow-missing feature I thought I could skip that with:

dvc data status --not-in-remote --json | grep -v not_in_remote
dvc repro --allow-missing --dry

The first works as expected: it fails if data was forgotten to be pushed and succeeds if it was.
But the latter just fails on missing data.

Reproduce

Example: the failure/success behavior on Machines Two and Three should be in sync

Machine One:

  1. dvc repro -f
  2. git add . && git commit -m "repro" && dvc push && git push
  3. dvc repro --allow-missing --dry
    --> does not fail, nothing changed (as expected)

Machine Two:
  4. dvc data status --not-in-remote --json | grep -v not_in_remote
    --> does not fail, everything is pushed and would be pulled
  5. dvc repro --allow-missing --dry
    --> fails on missing data (unexpected)

Machine Three:
  4. dvc pull
  5. dvc status
    --> succeeds

Expected

On a machine where I did not dvc pull, given a clean git state and a clean dvc data status --not-in-remote --json | grep -v not_in_remote state, I would expect dvc repro --allow-missing --dry to succeed and show me that no stage had to run.

Environment information

Linux

Output of dvc doctor:

$ dvc doctor
09:16:47  DVC version: 3.13.2 (pip)
09:16:47  -------------------------
09:16:47  Platform: Python 3.10.11 on Linux-5.9.0-0.bpo.5-amd64-x86_64-with-glibc2.35
09:16:47  Subprojects:
09:16:47  	dvc_data = 2.12.1
09:16:47  	dvc_objects = 0.24.1
09:16:47  	dvc_render = 0.5.3
09:16:47  	dvc_task = 0.3.0
09:16:47  	scmrepo = 1.1.0
09:16:47  Supports:
09:16:47  	azure (adlfs = 2023.4.0, knack = 0.11.0, azure-identity = 1.13.0),
09:16:47  	gdrive (pydrive2 = 1.16.1),
09:16:47  	gs (gcsfs = 2023.6.0),
09:16:47  	hdfs (fsspec = 2023.6.0, pyarrow = 12.0.1),
09:16:47  	http (aiohttp = 3.8.5, aiohttp-retry = 2.8.3),
09:16:47  	https (aiohttp = 3.8.5, aiohttp-retry = 2.8.3),
09:16:47  	oss (ossfs = 2021.8.0),
09:16:47  	s3 (s3fs = 2023.6.0, boto3 = 1.28.17),
09:16:47  	ssh (sshfs = 2023.7.0),
09:16:47  	webdav (webdav4 = 0.9.8),
09:16:47  	webdavs (webdav4 = 0.9.8),
09:16:47  	webhdfs (fsspec = 2023.6.0)
09:16:47  Config:
09:16:47  	Global: /home/runner/.config/dvc
09:16:47  	System: /etc/xdg/dvc
09:16:47  Cache types: <https://error.dvc.org/no-dvc-cache>
09:16:47  Caches: local
09:16:47  Remotes: ssh
09:16:47  Workspace directory: ext4 on /dev/nvme0n1p2
09:16:47  Repo: dvc, git
@Otterpatsch Otterpatsch changed the title dvc repro --dry --allow-missing: fails on missing data dvc repro --dry --allow-missing: fails on missing data Aug 8, 2023
@dberenbaum
Collaborator

@Otterpatsch I see you provided some verbose output in https://discord.com/channels/485586884165107732/1138144206473396304/1138162073705128148, but I don't see any error there. Are you able to post the full output, including the error you hit?

@dberenbaum dberenbaum added the awaiting response we are waiting for your reply, please respond! :) label Aug 8, 2023
@Otterpatsch
Author

Otterpatsch commented Aug 10, 2023

I don't hit any "error", just the notification (due to --dry) that stages would run, and further notifications that some DVC-tracked files are missing.
But maybe my assumption is wrong that dvc repro --allow-missing --dry should not fail and should report that everything is fine and up to date when I use those flags, provided the repro was done and pushed successfully from some other machine.
I'm very much confused by now.

Just to clarify: if I run dvc pull and then dvc status, everything is reported as fine.

dvc repro --allow-missing --dry
11:18:32  'datasets/benchmark-sets/customer0/2020_11_02.dvc' didn't change, skipping
...
11:18:32  'datasets/training-sets/customer/customerN/customerN_empty_consignment_field_faxified.dvc' didn't change, skipping
11:18:32  Running stage 'training':
11:18:32  > conda env export --prefix .conda-envs/training | grep -v "\(^prefix:\)\|\(^name:\)" > stages/training/exported-conda-env.yaml
11:18:32  > conda run --no-capture --prefix .conda-envs/training/ mmocr train --config_path stages/training/abinet_config_handwriting.py
11:18:32  > conda run --no-capture --prefix .conda-envs/training/ python dependencies/scripts/rename.py model/
11:18:32  > cp -r stages/training/charsets model/
11:18:32  
11:18:32  Stage 'extract@customer0/2020_11_02/Formularmerkmal_Ansprechpartner' didn't change, skipping
11:18:32  Stage 'extract@customer0/2020_11_02/Formularmerkmal_Beinstueck' didn't change, skipping
11:18:32  Stage 'extract@customer0/2020_11_02/Formularmerkmal_Kommission' didn't change, skipping
11:18:32  Stage 'extract@customer0/2020_11_02/Formularmerkmal_Kundenname' didn't change, skipping
11:18:32  'datasets/benchmark-sets/company/emails_2021-03-22.dvc' didn't change, skipping
11:18:32  ERROR: failed to reproduce 'extract@company/emails_2021-03-22': [Errno 2] No such file or directory: '/var/jenkins_home/workspace/repo_namecompany_MR-20/datasets/benchmark-sets/company/emails_2021-03-22'

@dberenbaum
Collaborator

It seems like this happens when there is a dependency on data that was tracked via dvc add. I can reproduce:

git clone https://github.com/iterative/example-get-started-experiments.git
cd example-get-started-experiments
dvc repro --allow-missing --dry

Verbose output:

$ dvc repro -v --allow-missing --dry
2023-08-10 11:15:25,325 DEBUG: v3.14.1.dev2+g04e891cef, CPython 3.11.4 on macOS-13.4.1-arm64-arm-64bit
2023-08-10 11:15:25,325 DEBUG: command: /Users/dave/micromamba/envs/dvc/bin/dvc repro -v --allow-missing --dry
2023-08-10 11:15:25,709 DEBUG: Computed stage: 'data/pool_data.dvc' md5: 'None'
'data/pool_data.dvc' didn't change, skipping
2023-08-10 11:15:25,711 DEBUG: Dependency 'data/pool_data' of stage: 'data_split' changed because it is 'modified'.
2023-08-10 11:15:25,712 DEBUG: stage: 'data_split' changed.
2023-08-10 11:15:25,714 ERROR: failed to reproduce 'data_split': [Errno 2] No such file or directory: '/private/tmp/example-get-started-experiments/data/pool_data'
Traceback (most recent call last):
  File "/Users/dave/Code/dvc/dvc/repo/reproduce.py", line 199, in _reproduce
    ret = repro_fn(stage, upstream=upstream, force=force_stage, **kwargs)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/dave/Code/dvc/dvc/repo/reproduce.py", line 129, in _reproduce_stage
    ret = stage.reproduce(**kwargs)
          ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/dave/micromamba/envs/dvc/lib/python3.11/site-packages/funcy/decorators.py", line 47, in wrapper
    return deco(call, *dargs, **dkwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/dave/Code/dvc/dvc/stage/decorators.py", line 43, in rwlocked
    return call()
           ^^^^^^
  File "/Users/dave/micromamba/envs/dvc/lib/python3.11/site-packages/funcy/decorators.py", line 68, in __call__
    return self._func(*self._args, **self._kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/dave/Code/dvc/dvc/stage/__init__.py", line 433, in reproduce
    self.run(**kwargs)
  File "/Users/dave/micromamba/envs/dvc/lib/python3.11/site-packages/funcy/decorators.py", line 47, in wrapper
    return deco(call, *dargs, **dkwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/dave/Code/dvc/dvc/stage/decorators.py", line 43, in rwlocked
    return call()
           ^^^^^^
  File "/Users/dave/micromamba/envs/dvc/lib/python3.11/site-packages/funcy/decorators.py", line 68, in __call__
    return self._func(*self._args, **self._kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/dave/Code/dvc/dvc/stage/__init__.py", line 599, in run
    self._run_stage(dry, force, allow_missing=allow_missing, **kwargs)
  File "/Users/dave/micromamba/envs/dvc/lib/python3.11/site-packages/funcy/decorators.py", line 47, in wrapper
    return deco(call, *dargs, **dkwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/dave/Code/dvc/dvc/stage/decorators.py", line 43, in rwlocked
    return call()
           ^^^^^^
  File "/Users/dave/micromamba/envs/dvc/lib/python3.11/site-packages/funcy/decorators.py", line 68, in __call__
    return self._func(*self._args, **self._kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/dave/Code/dvc/dvc/stage/__init__.py", line 630, in _run_stage
    return run_stage(self, dry, force, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/dave/Code/dvc/dvc/stage/run.py", line 134, in run_stage
    stage.repo.stage_cache.restore(stage, dry=dry, **kwargs)
  File "/Users/dave/Code/dvc/dvc/stage/cache.py", line 188, in restore
    if not _can_hash(stage):
           ^^^^^^^^^^^^^^^^
  File "/Users/dave/Code/dvc/dvc/stage/cache.py", line 43, in _can_hash
    if not (dep.protocol == "local" and dep.def_path and dep.get_hash()):
                                                         ^^^^^^^^^^^^^^
  File "/Users/dave/Code/dvc/dvc/output.py", line 553, in get_hash
    _, hash_info = self._get_hash_meta()
                   ^^^^^^^^^^^^^^^^^^^^^
  File "/Users/dave/Code/dvc/dvc/output.py", line 573, in _get_hash_meta
    _, meta, obj = self._build(
                   ^^^^^^^^^^^^
  File "/Users/dave/Code/dvc/dvc/output.py", line 566, in _build
    return build(*args, callback=pb.as_callback(), **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/dave/micromamba/envs/dvc/lib/python3.11/site-packages/dvc_data/hashfile/build.py", line 233, in build
    details = fs.info(path)
              ^^^^^^^^^^^^^
  File "/Users/dave/Code/dvc-objects/src/dvc_objects/fs/base.py", line 495, in info
    return self.fs.info(path, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/dave/Code/dvc-objects/src/dvc_objects/fs/local.py", line 42, in info
    return self.fs.info(path)
           ^^^^^^^^^^^^^^^^^^
  File "/Users/dave/micromamba/envs/dvc/lib/python3.11/site-packages/fsspec/implementations/local.py", line 87, in info
    out = os.stat(path, follow_symlinks=False)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
FileNotFoundError: [Errno 2] No such file or directory: '/private/tmp/example-get-started-experiments/data/pool_data'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/Users/dave/Code/dvc/dvc/cli/__init__.py", line 209, in main
    ret = cmd.do_run()
          ^^^^^^^^^^^^
  File "/Users/dave/Code/dvc/dvc/cli/command.py", line 26, in do_run
    return self.run()
           ^^^^^^^^^^
  File "/Users/dave/Code/dvc/dvc/commands/repro.py", line 13, in run
    stages = self.repo.reproduce(**self._common_kwargs, **self._repro_kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/dave/Code/dvc/dvc/repo/__init__.py", line 64, in wrapper
    return f(repo, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/dave/Code/dvc/dvc/repo/scm_context.py", line 151, in run
    return method(repo, *args, **kw)
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/dave/Code/dvc/dvc/repo/reproduce.py", line 260, in reproduce
    return _reproduce(steps, graph=graph, on_error=on_error or "fail", **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/dave/Code/dvc/dvc/repo/reproduce.py", line 203, in _reproduce
    _raise_error(exc, stage)
  File "/Users/dave/Code/dvc/dvc/repo/reproduce.py", line 167, in _raise_error
    raise ReproductionError(f"failed to reproduce{segment} {names}") from exc
dvc.exceptions.ReproductionError: failed to reproduce 'data_split'

2023-08-10 11:15:25,721 DEBUG: Analytics is disabled.

@dberenbaum dberenbaum added bug Did we break something? p1-important Important, aka current backlog of things to do A: pipelines Related to the pipelines feature labels Aug 10, 2023
@dberenbaum dberenbaum added this to DVC Aug 10, 2023
@github-project-automation github-project-automation bot moved this to Backlog in DVC Aug 10, 2023
@dberenbaum dberenbaum moved this from Backlog to Todo in DVC Aug 10, 2023
@dberenbaum
Collaborator

Looks like it is failing in my example because data/pool_data.dvc is in legacy 2.x format, so the hash info doesn't match the stage dep here:

dvc/dvc/stage/__init__.py

Lines 315 to 321 in 04e891c

if allow_missing and status[str(dep)] == "deleted":
    if upstream and any(
        dep.fs_path == out.fs_path and dep.hash_info != out.hash_info
        for stage in upstream
        for out in stage.outs
    ):
        status[str(dep)] = "modified"

The hash values are the same, but debugging shows that the differing hash names make it fail:

(Pdb) out.hash_info
HashInfo(name='md5-dos2unix', value='14d187e749ee5614e105741c719fa185.dir', obj_name=None)
(Pdb) dep.hash_info
HashInfo(name='md5', value='14d187e749ee5614e105741c719fa185.dir', obj_name=None)
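
(For illustration: a minimal, self-contained Python sketch of why the comparison above treats these as different. MiniHashInfo is a hypothetical stand-in, not DVC's actual HashInfo class; it only mirrors the behavior that equality covers the hash name as well as the value.)

from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class MiniHashInfo:
    # dataclass equality compares every field, so a differing name breaks equality
    name: str
    value: str
    obj_name: Optional[str] = None

out_hash = MiniHashInfo("md5-dos2unix", "14d187e749ee5614e105741c719fa185.dir")
dep_hash = MiniHashInfo("md5", "14d187e749ee5614e105741c719fa185.dir")

print(out_hash == dep_hash)              # False -> the dep is treated as 'modified'
print(out_hash.value == dep_hash.value)  # True  -> the underlying content is identical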

@Otterpatsch Does datasets/benchmark-sets/customer0/2020_11_02.dvc contain the line hash: md5 (that line is only present in 3.x files)? Also, could you try to delete the site cache dir?

@dberenbaum dberenbaum removed this from DVC Aug 10, 2023
@dberenbaum
Collaborator

Looks like it is failing in my example because data/pool_data.dvc is in legacy 2.x format, so the hash info doesn't match the stage dep here:

@iterative/dvc Thoughts on how we should treat this? Is it modified or not?

@daavoo
Contributor

daavoo commented Aug 10, 2023

Looks like it is failing in my example because data/pool_data.dvc is in legacy 2.x format, so the hash info doesn't match the stage dep here:

@iterative/dvc Thoughts on how we should treat this? Is it modified or not?

IMO, it was an oversight for this scenario.

@dberenbaum
Collaborator

@daavoo What does that mean? Do you think we should only compare the hash value and not all hash info?

@daavoo
Contributor

daavoo commented Aug 10, 2023

@daavoo What does that mean?

I mean that we should not consider it modified in the example-get-started-experiments scenario.

Do you think we should only compare the hash value and not all hash info?

Can't say off the top of my head. I would need to take a closer look to see what makes sense.

@daavoo daavoo removed the awaiting response we are waiting for your reply, please respond! :) label Aug 10, 2023
@dberenbaum dberenbaum added this to DVC Aug 10, 2023
@dberenbaum dberenbaum moved this to Todo in DVC Aug 10, 2023
@Otterpatsch
Author

Otterpatsch commented Aug 11, 2023

Does datasets/benchmark-sets/customer0/2020_11_02.dvc contain the line hash: md5 (that line is only present in 3.x files)?

outs:
- md5: f4eb1691cb23a5160a958274b9b9fb41.dir
  size: 55860614
  nfiles: 5491
  path: '2020_11_02'

seems it does

Also, could you try to delete the site cache dir?

After deleting /var/tmp/dvc (which did exist), the error persists.

@daavoo
Contributor

daavoo commented Aug 11, 2023

So, to give context, the problem appears if there is a .dvc file in 2.X format:

https://github.com/iterative/example-get-started-experiments/blob/9dba21cbffb0caad939c63db427eea7f16f3c269/data/pool_data.dvc#L1-L5

That is referenced in a dvc.lock in 3.X format as a dependency:

https://github.com/iterative/example-get-started-experiments/blob/9dba21cbffb0caad939c63db427eea7f16f3c269/dvc.lock#L6-L10

As soon as the contents associated with the .dvc file are updated, the file will be updated to 3.X format, so the problem will disappear.

Do you think we should only compare the hash value and not all hash info?
Can't say from the top of my mind. Would need to take a closer look to see what makes sense

Strictly speaking, I guess there could be a collision where we would be misidentifying 2 different things as being the same 🤷

@efiop
Contributor

efiop commented Aug 11, 2023

As soon as the contents associated with the .dvc are updated, the file will be updated to 3.X format so the problem would disappear.

@Otterpatsch Is it possible to just force-commit for you to upgrade those hashes? We can't really compare those without computing both, which is undesirable. Seems like just upgrading the old lock file should be an easy long-term fix.

@Otterpatsch
Author

How do I upgrade the hashes?

@dberenbaum
Collaborator

@Otterpatsch You can do dvc commit -f to upgrade the hashes.

@dberenbaum
Collaborator

@daavoo Are you planning a PR to fix the dvc commit -f behavior?

@Otterpatsch Are you still working through this problem? It turns out that dvc commit -f won't fix it for you currently. The best workaround for now would be to do dvc remove datasets/benchmark-sets/customer0/2020_11_02.dvc followed by dvc add datasets/benchmark-sets/customer0/2020_11_02.dvc.
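
(For reference, a hedged sketch of that workaround as a command sequence, assuming the data directory itself is what gets re-added, as done later in this thread; adjust paths to your repo:)

# drop the 2.x-format .dvc file, then re-track the data so DVC rewrites it in 3.x format
dvc remove datasets/benchmark-sets/customer0/2020_11_02.dvc
dvc add datasets/benchmark-sets/customer0/2020_11_02
# then commit the rewritten .dvc file and push the data as usual
git add datasets/benchmark-sets/customer0/2020_11_02.dvc
git commit -m "re-add dataset in DVC 3.x format"
dvc push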

@daavoo
Contributor

daavoo commented Aug 18, 2023

@daavoo Are you planning a PR to fix the dvc commit -f behavior?

yes

@Otterpatsch
Author

@daavoo Are you planning a PR to fix the dvc commit -f behavior?

@Otterpatsch Are you still working through this problem? It turns out that dvc commit -f won't fix it for you currently. The best workaround for now would be to do dvc remove datasets/benchmark-sets/customer0/2020_11_02.dvc followed by dvc add datasets/benchmark-sets/customer0/2020_11_02.dvc.

Alright, we will test that. But for now we just rolled back to using dvc pull and dvc status (close to an hour).
Yeah, dvc commit -f did change some things, but the pipeline was still failing; I wasn't sure whether we had some other issues, so I tried to track those down.
As soon as the dvc commit -f fix is implemented, should that in theory also fix this issue (once dvc commit -f is run and committed, of course)?

@dberenbaum
Collaborator

Once datasets/benchmark-sets/customer0/2020_11_02.dvc is updated to use the 3.0 cache (you should see the field hash: md5 in that file), then it should fix this issue. If you do dvc remove datasets/benchmark-sets/customer0/2020_11_02.dvc; dvc add datasets/benchmark-sets/customer0/2020_11_02.dvc, it should work now.

@Otterpatsch
Author

Otterpatsch commented Aug 21, 2023

So I fixed the issue (I think) on our side. I basically ran dvc repro --allow-missing --dry a couple of times, each run surfacing one of the datasets which were still in DVC 2.x format. Then I re-added those, and it no longer crashes.

But now the pipeline succeeds even though I get the following lines in the output, which makes sense because I changed a lot of .dvc files that are also in that path.

13:57:33  2023-08-21 11:57:24,369 DEBUG: built tree 'object 880a0f10a0350a3ed636a6a395a7cd4a.dir'
13:57:33  2023-08-21 11:57:24,370 DEBUG: Dependency 'datasets/training-sets' of stage: 'training' changed because it is 'modified'.
13:57:33  2023-08-21 11:57:24,371 DEBUG: stage: 'training' changed.
13:57:33  2023-08-21 11:57:24,384 DEBUG: built tree 'object 880a0f10a0350a3ed636a6a395a7cd4a.dir'
13:57:33  2023-08-21 11:57:24,386 DEBUG: built tree 'object 2ead35ca4cf9b96e0f4ad3cc696e78d7.dir'
13:57:33  2023-08-21 11:57:24,397 DEBUG: built tree 'object 880a0f10a0350a3ed636a6a395a7cd4a.dir'
13:57:33  2023-08-21 11:57:24,397 DEBUG: {'datasets/training-sets': 'modified'}
13:57:33  2023-08-21 11:57:24,408 DEBUG: built tree 'object 880a0f10a0350a3ed636a6a395a7cd4a.dir'
13:57:33  2023-08-21 11:57:24,409 DEBUG: built tree 'object 2ead35ca4cf9b96e0f4ad3cc696e78d7.dir'
13:57:33  Running stage 'training':
13:57:33  > conda env export --prefix .conda-envs/training | grep -v "\(^prefix:\)\|\(^name:\)" > stages/training/exported-conda-env.yaml
13:57:33  > conda run --no-capture --prefix .conda-envs/training/ mmocr train --config_path stages/training/abinet_config_handwriting.py
13:57:33  > conda run --no-capture --prefix .conda-envs/training/ python dependencies/scripts/rename.py model/
13:57:33  > cp -r stages/training/charsets model/
13:57:33  2023-08-21 11:57:24,412 DEBUG: stage: 'training' was reproduced

How can I fix this? It seems that I am not using the right command for my pipeline. I mean, the command succeeds, but in a pipeline sense it should fail, because a repro would be run if I simply used dvc repro.

I believe I am missing something similar to the dvc data status check,
dvc data status --not-in-remote --json | grep -v not_in_remote

which has the grep, but I am not sure how to do the same for dvc repro --allow-missing --dry so that it fails for all kinds of dependencies.

So I tried:
dvc repro --dry --allow-missing | grep -v "Running stage "
But it still succeeds, even though if I just use grep "Running stage " I get some output:

> dvc repro --dry --allow-missing | grep "Running stage "
Running stage 'training':
Running stage 'collect_benchmarks':
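
(Not an official DVC feature, just a shell-level sketch of one way to make the CI step fail when any stage would run. grep -v exits 0 as long as any non-matching line passes through, which is why the pipeline above still succeeds; inverting a positive match avoids that. This assumes the "Running stage" lines go to stdout:)

# capture the dry run, keep it visible in the CI log, and fail if any stage would run
out="$(dvc repro --dry --allow-missing)"
echo "$out"
if echo "$out" | grep -q "Running stage "; then
    echo "ERROR: dvc repro would run stages; workspace is not up to date" >&2
    exit 1
fi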

@daavoo daavoo moved this from In Progress to Todo in DVC Sep 6, 2023
@dberenbaum
Collaborator

dvc commit -f also seems like it would be useful after running dvc cache migrate to ensure that all dvc files reference the migrated 3.x cache. See https://discord.com/channels/485586884165107732/563406153334128681/1149449470480760982.

@Otterpatsch
Author

Otterpatsch commented Sep 11, 2023

dvc cache migrate reports that no file changed in the cache, even though dvc commit -f did something again; but I also rebased the branch which introduces those pipeline changes. Even after committing those changes, the pipeline still fails.

I also ran dvc cache migrate on the CI machine. It doesn't apply any changes. Also, everything is cleared after each run anyway (but I wanted to check).

Sadly, the pipeline still fails on dvc repro --dry --allow-missing | grep -vz "Running", whereas with a dvc pull it doesn't fail.

So the following output confuses me a lot: are there .dvc files with no md5 sum, even though I ran the commands you mentioned? So how do I get rid of the DVC 2.x .dvc files and replace them with their DVC 3.x counterparts?

13:44:20  2023-09-11 11:44:12,198 DEBUG: Computed stage: 'datasets/training-sets/customer/CustomerName/training-2023_08_08-LYD-consignments.dvc' md5: 'a9c8f1cf1840f743123f169bba789ac1'
13:44:20  'datasets/training-sets/customer/CustomerName/training-2023_08_08-LYD-consignments.dvc' didn't change, skipping
13:44:20  2023-09-11 11:44:12,201 DEBUG: Computed stage: 'datasets/training-sets/customer/CustomerName/CustomerName_trendcolor_field.dvc' md5: 'None'
13:44:20  'datasets/training-sets/customer/CustomerName/CustomerName_trendcolor_field.dvc' didn't change, skipping
13:44:20  2023-09-11 11:44:12,206 DEBUG: Computed stage: 'datasets/training-sets/customer/CustomerName/training-2020_08_22-alpha-referenceOrder.dvc' md5: 'None'
13:44:20  'datasets/training-sets/customer/CustomerName/training-2020_08_22-alpha-referenceOrder.dvc' didn't change, skipping
13:44:20  2023-09-11 11:44:12,210 DEBUG: Computed stage: 'datasets/training-sets/customer/CustomerName/CustomerName_trendcolor_field_2020-07-14.dvc' md5: 'None'
13:44:20  'datasets/training-sets/customer/CustomerName/CustomerName_trendcolor_field_2020-07-14.dvc' didn't change, skipping
13:44:20  2023-09-11 11:44:12,214 DEBUG: Computed stage: 'datasets/training-sets/customer/CustomerName/CustomerName_empty_consignment_field_faxified.dvc' md5: 'None'
13:44:20  'datasets/training-sets/customer/CustomerName/CustomerName_empty_consignment_field_faxified.dvc' didn't change, skipping
13:44:20  2023-09-11 11:44:12,255 DEBUG: built tree 'object d612b2a946d81ab74f8dfeeea7e41a8a.dir'
13:44:20  2023-09-11 11:44:12,256 DEBUG: Dependency 'datasets/training-sets' of stage: 'training' changed because it is 'modified'.
13:44:20  2023-09-11 11:44:12,257 DEBUG: stage: 'training' changed.
13:44:20  2023-09-11 11:44:12,271 DEBUG: built tree 'object d612b2a946d81ab74f8dfeeea7e41a8a.dir'
13:44:20  2023-09-11 11:44:12,273 DEBUG: built tree 'object f203ea8d0a44649090eb4d3debd6ed8d.dir'
13:44:20  2023-09-11 11:44:12,286 DEBUG: built tree 'object d612b2a946d81ab74f8dfeeea7e41a8a.dir'
13:44:20  2023-09-11 11:44:12,286 DEBUG: {'datasets/training-sets': 'modified'}
13:44:20  2023-09-11 11:44:12,298 DEBUG: built tree 'object d612b2a946d81ab74f8dfeeea7e41a8a.dir'
13:44:20  2023-09-11 11:44:12,300 DEBUG: built tree 'object f203ea8d0a44649090eb4d3debd6ed8d.dir'
13:44:20  Running stage 'training':
13:44:20  > conda env export --prefix .conda-envs/training | grep -v "\(^prefix:\)\|\(^name:\)" > stages/training/exported-conda-env.yaml
13:44:20  > conda run --no-capture --prefix .conda-envs/training/ mmocr train --config_path stages/training/abinet_config_handwriting.py
13:44:20  > conda run --no-capture --prefix .conda-envs/training/ python dependencies/scripts/rename.py model/
13:44:20  > cp -r stages/training/charsets model/
13:44:20  2023-09-11 11:44:12,302 DEBUG: stage: 'training' was reproduced

@dberenbaum
Collaborator

So how do I get rid of the DVC 2.x .dvc files and replace them with their DVC 3.x counterparts?

Yes, sorry for the confusion @Otterpatsch. I initially thought dvc commit -f would achieve that, but it doesn't do that today. We are looking into changing that, but for now you would need to do this yourself.

@Otterpatsch
Author

Otterpatsch commented Sep 20, 2023

So how do I get rid of the DVC 2.x .dvc files and replace them with their DVC 3.x counterparts?

Yes, sorry for the confusion @Otterpatsch. I initially thought dvc commit -f would achieve that, but it doesn't do that today. We are looking into changing that, but for now you would need to do this yourself.

So I just tried to do that (with version 3.22.0).

I ran dvc repro --dry --allow-missing to detect all "bad" .dvc files, e.g. datasets/benchmark-sets/SomeCompanyName/2020_11_02.dvc.
Then I ran rm datasets/benchmark-sets/SomeCompanyName/2020_11_02.dvc to get rid of the bad file and dvc add datasets/benchmark-sets/SomeCompanyName/2020_11_02 to re-add the directory. See the git diff below (at line 5, hash: md5 was inserted).

datasets/benchmark-sets/SomeCompanyName/2020_11_02.dvc
@@ -2,4 +2,5 @@ outs:
 - md5: f4eb1691cb23a5160a958274b9b9fb41.dir
   size: 55860614
   nfiles: 5491
+  hash: md5
   path: '2020_11_02'

Now I expected that, if I ran dvc repro --dry --allow-missing, I would no longer get the output md5: 'None' for that one specific file.
But I still get the same output as earlier:

> dvc repro --dry --allow-missing --verbose | grep -P "datasets/benchmark-sets/SomeCompanyName/2020_11_02.dvc"  
2023-09-20 12:34:13,083 DEBUG: Computed stage: 'datasets/benchmark-sets/SomeCompanyName/2020_11_02.dvc' md5: 'None'

@dberenbaum
Collaborator

That debug message comes from here:

dvc/dvc/stage/__init__.py

Lines 466 to 473 in b856081

def compute_md5(self) -> Optional[str]:
    # `dvc add`ed files don't need stage md5
    if self.is_data_source and not (self.is_import or self.is_repo_import):
        m = None
    else:
        m = compute_md5(self)
    logger.debug("Computed %s md5: '%s'", self, m)
    return m

It will only show a non-empty md5 for an actual stage, not a .dvc-tracked data source. The check for --allow-missing is separate and comes later, so this is expected.

Is dvc repro --dry --allow-missing skipping the stage/working as expected?

@dberenbaum dberenbaum added the awaiting response we are waiting for your reply, please respond! :) label Oct 10, 2023
@dberenbaum
Collaborator

Closing since I haven't heard back, but feel free to reopen if you still have issues @Otterpatsch.

@github-project-automation github-project-automation bot moved this from Todo to Done in DVC Oct 17, 2023