
dvc repro --dry: should fail if any stage has to run #9861

Closed
dberenbaum opened this issue Aug 21, 2023 · 11 comments
Labels: A: pipelines (Related to the pipelines feature), awaiting response (we are waiting for your reply, please respond! :))

Comments

@dberenbaum (Collaborator)

So I fixed the issue (I think) on our side. I basically ran dvc repro --allow-missing --dry a couple of times to find, each time, one of the datasets which was still in DVC 2 format. Then I re-added those and it no longer crashes.

But now the pipeline succeeds even though I get the following lines in the output. Which makes sense, because I changed a lot of .dvc files which are also in that path.

13:57:33  2023-08-21 11:57:24,369 DEBUG: built tree 'object 880a0f10a0350a3ed636a6a395a7cd4a.dir'
13:57:33  2023-08-21 11:57:24,370 DEBUG: Dependency 'datasets/training-sets' of stage: 'training' changed because it is 'modified'.
13:57:33  2023-08-21 11:57:24,371 DEBUG: stage: 'training' changed.
13:57:33  2023-08-21 11:57:24,384 DEBUG: built tree 'object 880a0f10a0350a3ed636a6a395a7cd4a.dir'
13:57:33  2023-08-21 11:57:24,386 DEBUG: built tree 'object 2ead35ca4cf9b96e0f4ad3cc696e78d7.dir'
13:57:33  2023-08-21 11:57:24,397 DEBUG: built tree 'object 880a0f10a0350a3ed636a6a395a7cd4a.dir'
13:57:33  2023-08-21 11:57:24,397 DEBUG: {'datasets/training-sets': 'modified'}
13:57:33  2023-08-21 11:57:24,408 DEBUG: built tree 'object 880a0f10a0350a3ed636a6a395a7cd4a.dir'
13:57:33  2023-08-21 11:57:24,409 DEBUG: built tree 'object 2ead35ca4cf9b96e0f4ad3cc696e78d7.dir'
13:57:33  Running stage 'training':
13:57:33  > conda env export --prefix .conda-envs/training | grep -v "\(^prefix:\)\|\(^name:\)" > stages/training/exported-conda-env.yaml
13:57:33  > conda run --no-capture --prefix .conda-envs/training/ mmocr train --config_path stages/training/abinet_config_handwriting.py
13:57:33  > conda run --no-capture --prefix .conda-envs/training/ python dependencies/scripts/rename.py model/
13:57:33  > cp -r stages/training/charsets model/
13:57:33  2023-08-21 11:57:24,412 DEBUG: stage: 'training' was reproduced

How can I fix this? It seems that I'm not using the correct command for my pipeline. I mean the command succeeds, but in a pipeline sense it should fail, because a repro would be run if I just used dvc repro.

I believe I'm missing something similar to the dvc data status one:
dvc data status --not-in-remote --json | grep -v not_in_remote

which has the grep, but I'm not sure how to do the same for dvc repro --allow-missing --dry so that it fails for all kinds of dependencies.

So I tried:
dvc repro --dry --allow-missing | grep -v "Running stage "
But it still succeeds, even though if I just use grep "Running stage " I get some output:

> dvc repro --dry --allow-missing | grep "Running stage "
Running stage 'training':
Running stage 'collect_benchmarks':
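
The reason the -v variant still exits 0: grep -v succeeds as long as at least one input line does not match, and the dry-run output contains plenty of other lines besides "Running stage ..." (the individual stage commands and the "didn't change, skipping" lines, for example), so the pipe exits 0 even when stages would run. A minimal sketch of a check that does fail when any stage would run (just a sketch of the shell logic, not an official DVC feature; it assumes the "Running stage '<name>':" lines go to stdout):

# Fail the CI step if the dry run reports any stage that would be executed.
if dvc repro --dry --allow-missing | grep -q "Running stage "; then
  echo "Pipeline is not up to date: at least one stage would run" >&2
  exit 1
fi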

Originally posted by @Otterpatsch in #9818 (comment)

@dberenbaum dberenbaum added p1-important Important, aka current backlog of things to do A: pipelines Related to the pipelines feature labels Aug 21, 2023
@dberenbaum dberenbaum added this to DVC Aug 21, 2023
@github-project-automation github-project-automation bot moved this to Backlog in DVC Aug 21, 2023
@dberenbaum (Collaborator, Author)

@Otterpatsch I opened a separate issue for this one to keep track.

@iterative/dvc I think once we hit a stage that needs to be run, we need to stop execution for downstream stages, or at least we should raise a non-zero exit code.

@dberenbaum dberenbaum changed the title So i fixed the issue (i think) on our side. I basically run dvc repro --allow-missing --dry couple of times to get each time one of the datasets which where still dvc2. Then i readd those and not anymore crashing. dvc repro --dry: should fail if any stage has to run Aug 21, 2023
@dberenbaum (Collaborator, Author)

@Otterpatsch do you hit this in CI or only when testing locally? I would expect that you would hit this when you have already pulled the data, but not in CI, since there the data won't be pulled and downstream stages will likely fail to find the necessary dependencies.

@Otterpatsch

Otterpatsch commented Aug 22, 2023

I hit this on the CI. I assume that because of --allow-missing the command does fail on the non-present dependencies (partially .dvc-tracked dependencies are fine).

I ran the following command on the machine where I dvc pushed the data (and did the repro):
> dvc repro --dry --allow-missing | grep -vz "Running stage"
There echo $? returns the expected return code of 0,
but on any other machine I get a return code of 1, which is unexpected.
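
For reference, the -z flag is what makes the exit code meaningful here: GNU grep -z reads the whole (NUL-free) output as a single record, so with -v the command prints nothing and exits 1 whenever "Running stage" occurs anywhere in the output, and exits 0 only when it does not. So the exit codes above are consistent with what the dry run prints on each machine. Spelled out (a sketch with the same behaviour):

dvc repro --dry --allow-missing | grep -vz "Running stage"
echo $?   # 0 when no stage would run, 1 when at least one stage would run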

Some more details:
On any machine (tested 2) which didn't pull the data, dvc repro --dry --allow-missing returns
something like:

2023-08-22 09:24:04,038 DEBUG: built tree 'object 2a19b4e09c2cc3cb5fd9bf314391d8f3.dir'
2023-08-22 09:24:04,053 DEBUG: Dependency 'datasets/training-sets' of stage: 'training' changed because it is 'modified'.
2023-08-22 09:24:04,054 DEBUG: stage: 'training' changed.
2023-08-22 09:24:08,281 DEBUG: built tree 'object 2a19b4e09c2cc3cb5fd9bf314391d8f3.dir'
2023-08-22 09:24:08,297 DEBUG: built tree 'object 2ead35ca4cf9b96e0f4ad3cc696e78d7.dir'
2023-08-22 09:24:12,538 DEBUG: built tree 'object 2a19b4e09c2cc3cb5fd9bf314391d8f3.dir'
2023-08-22 09:24:12,556 DEBUG: {'datasets/training-sets': 'modified'}
2023-08-22 09:24:16,933 DEBUG: built tree 'object 2a19b4e09c2cc3cb5fd9bf314391d8f3.dir'
2023-08-22 09:24:16,952 DEBUG: built tree 'object 2ead35ca4cf9b96e0f4ad3cc696e78d7.dir'
Running stage 'training':
> conda env export --prefix .conda-envs/training | grep -v "\(^prefix:\)\|\(^name:\)" > stages/training/exported-conda-env.yaml
> conda run --no-capture --prefix .conda-envs/training/ mmocr train --config_path stages/training/abinet_config_handwriting.py
> conda run --no-capture --prefix .conda-envs/training/ python dependencies/scripts/rename.py model/
> cp -r stages/training/charsets model/
2023-08-22 09:24:16,954 DEBUG: stage: 'training' was reproduced

(dvc push was done, and dvc data status --not-in-remote --json | grep -v not_in_remote succeeds)

datasets/training-sets is a path to a bunch of directories/datasets which are used for training.

ls datasets/training-sets
2023-easter-internal-dates.dvc  KeinWifi     customer             playground-alphanumeric.dvc
2023-easter-internal-something.dvc  backgrounds  example_dataset.dvc  playground.dvc

Basically, instead of adding all full paths to the dependencies (which would be ~200-300 lines), we decided to just add the parent directory. This parent folder is not dvc-tracked, but all the subdirectories are. Could this maybe cause the issue?
It kind of makes sense that datasets/training-sets changed, as its subdirs were not filled by dvc pull.

I assume it could have something to do with dependencies which are not fully dvc-tracked?

> grep "modified" log --after 5
2023-08-22 07:43:23,032 DEBUG: Dependency 'datasets/training-sets' of stage: 'training' changed because it is 'modified'.
2023-08-22 07:43:23,033 DEBUG: stage: 'training' changed.
2023-08-22 07:43:23,041 DEBUG: built tree 'object 898704477691cd70828dca497f483b3b.dir'
2023-08-22 07:43:23,042 DEBUG: built tree 'object 2ead35ca4cf9b96e0f4ad3cc696e78d7.dir'
2023-08-22 07:43:23,049 DEBUG: built tree 'object 898704477691cd70828dca497f483b3b.dir'
2023-08-22 07:43:23,049 DEBUG: {'datasets/training-sets': 'modified'}
2023-08-22 07:43:23,056 DEBUG: built tree 'object 898704477691cd70828dca497f483b3b.dir'
2023-08-22 07:43:23,057 DEBUG: built tree 'object 2ead35ca4cf9b96e0f4ad3cc696e78d7.dir'
Running stage 'training':
> conda env export --prefix .conda-envs/training | grep -v "\(^prefix:\)\|\(^name:\)" > stages/training/exported-conda-env.yaml
> conda run --no-capture --prefix .conda-envs/training/ mmocr train --config_path stages/training/abinet_config_handwriting.py
--
2023-08-22 07:43:23,262 DEBUG: Dependency 'stages/extract/outputs' of stage: 'collect_benchmarks' changed because it is 'modified'.
2023-08-22 07:43:23,263 DEBUG: stage: 'collect_benchmarks' changed.
2023-08-22 07:43:23,266 DEBUG: built tree 'object 505e2691068f58084f1f23180a42c903.dir'
2023-08-22 07:43:23,268 DEBUG: built tree 'object 505e2691068f58084f1f23180a42c903.dir'
2023-08-22 07:43:23,269 DEBUG: {'stages/extract/outputs': 'modified'}
2023-08-22 07:43:23,270 DEBUG: built tree 'object 505e2691068f58084f1f23180a42c903.dir'
Running stage 'collect_benchmarks':
> mkdir --parents stages/collect_benchmarks/outputs
> conda env export --prefix .conda-envs/collect_benchmarks | grep -v "\(^prefix:\)\|\(^name:\)" > stages/collect_benchmarks/outputs/exported-conda-env.yaml
> conda run --no-capture --prefix .conda-envs/collect_benchmarks/ python stages/collect_benchmarks/scripts/collect_benchmarks.py --predictions-folder stages/extract/outputs --groundtruth-folder datasets/benchmark-sets --output-file stages/collect_benchmarks/outputs/all_datasets.csv

For the collect_benchmarks stage, the dependency that changed is basically the defined outs of the previous stage:

outs:
  - stages/${stage_name}/outputs/${item}

and referenced as a dependency like this:

deps:
  - stages/extract/outputs

@dberenbaum (Collaborator, Author)

Have you git committed and pushed all changes as well? I'm a bit confused about whether you are trying to simulate a clean state where no pipeline stages should run, or a messy state where the pipeline should run and fail.

@Otterpatsch

Otterpatsch commented Aug 22, 2023

Everything is git pushed (git-wise I'm in a clean state). I'm trying to have a clean dvc state. The state itself is also clean, I think.
At least when I pull, everything is fine. I just can't get the CI pipeline to indicate that.

dvc repro --dry --allow-missing outputs that some stages' dependencies changed. If I pull, the command does not indicate that any stage dependency has changed (which is also true).

@dberenbaum (Collaborator, Author)

dberenbaum commented Aug 22, 2023

In that case, it may not be about --dry --allow-missing. What is the output of dvc data status on the CI machines?

Edit: you may need to run dvc pull so that they don't all show as missing, but after that I'd expect it to show a clean state.

@dberenbaum dberenbaum added awaiting response we are waiting for your reply, please respond! :) and removed p1-important Important, aka current backlog of things to do labels Aug 22, 2023
@dberenbaum dberenbaum removed this from DVC Aug 22, 2023
@Otterpatsch

Otterpatsch commented Aug 22, 2023

In that case, it may not be about --dry --allow-missing. What is the output of dvc data status on the CI machines?

> dvc data status                                           
Not in cache:                                                                                                       
  (use "dvc fetch <file>..." to download files)
        model/
....
        datasets/training-sets/customer/some_customer/2022-09-03_training_somedata_1/
        datasets/training-sets/customer/some_customer/2022-09-03_training_somedata_2/
        datasets/training-sets/customer/some_customer/2022-09-03_training_somedata_3/
        datasets/training-sets/customer/some_customer/2022-09-03_training_somedata_4/
        stages/extract/outputs/customer/some-dataset1
        stages/extract/outputs/customer/some-dataset2
        stages/extract/outputs/customer/some-dataset3
...

DVC uncommitted changes:
  (use "dvc commit <file>..." to track changes)
  (use "dvc checkout <file>..." to discard changes)
        modified: model/
...
        stages/extract/outputs/customer/some-dataset1
        stages/extract/outputs/customer/some-dataset2
        stages/extract/outputs/customer/some-dataset3

Just to clarify: the dvc data status --not-in-remote --json | grep -v not_in_remote stage succeeds (expected).

Edit: you may need to run dvc pull so that they don't all show as missing, but after that I'd expect it to show a clean state.

Yes, if I run dvc pull everything is fine. But that's expected; the idea is that I don't have to dvc pull to verify the pipeline status. So I guess I have to combine the commands? And maybe dvc data status would be enough? Or is there some other command I'm missing to achieve this? On the CI machine everything is deleted/cleared afterwards (at least that's the intended behavior).
https://dvc.org/doc/user-guide/pipelines/running-pipelines#verify-pipeline-status indicates that this is the command to use.

@dberenbaum (Collaborator, Author)

Yes, if I run dvc pull everything is fine.

Is this true even on the CI machine? Does dvc data status report a clean status in CI after dvc pull? I know you don't want to pull in the final scenario, but I'm trying to understand why DVC tries to run those stages, and why it only tries to run them in CI.

Basically, instead of adding all full paths to the dependencies (which would be ~200-300 lines), we decided to just add the parent directory. This parent folder is not dvc-tracked, but all the subdirectories are. Could this maybe cause the issue?
It kind of makes sense that datasets/training-sets changed, as its subdirs were not filled by dvc pull.

What do you mean that the subdirs were not filled by dvc pull?

@Otterpatsch

Otterpatsch commented Aug 23, 2023

Disclaimer: if I sound confusing or confuse things, it's because I am.

Is this true even on the CI machine? Does dvc data status report a clean status in CI after dvc pull? I know you don't want to pull in the final scenario, but I'm trying to understand why DVC tries to run those stages, and why it only tries to run them in CI.

The dvc pull fails to pull some data. This should have been fixed / is fixed on the other branches, so I will investigate that.
Could it maybe have to do with the site cache, which at least for us caused a lot of issues?
This is very confusing to me, because dvc data status --not-in-remote doesn't show anything as not being on the remote. So it should be pullable?

The error I'm getting:

WARNING: No file hash info found for '/var/jenkins_home/workspace/repo_MR-20/datasets/training-sets/customer/somecustomer/training-2023_08_08-LYD-consignments/annotations.jsonl

What do you mean that the subdirs were not filled by dvc pull?

As a dvc pull was not done (in the CI), I assume(d) that whether a dependency like datasets/training-sets counts as changed depends on some check. But as this dependency is neither a git-tracked file nor a .dvc file, I assume the check differs from the one for those files? Because pipelines which have only git/dvc-tracked dependencies are fine (they are shown as "not changed, skipping"); basically only dependencies which are paths to directories, which are either dvc outs or contain the .dvc files / dvc-tracked files in some subdirs, are affected.


So, here is what our CI pipeline does; maybe that's helpful, or it shows what I'm trying to achieve, as I suspect I'm doing something wrong (a rough shell sketch of these steps follows the list):

  • some linting checks
  • dvc data status, to check if a dvc push was forgotten (succeeds)
    • install newest DVC version
    • dvc doctor
    • dvc data status --not-in-remote --json | grep -v not_in_remote
  • dvc pipeline status, to check if a repro has to be done (fails)
    • install newest DVC version
    • dvc doctor
    • dvc repro --dry --allow-missing | grep -vz "Running stage"
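
A rough shell sketch of those two checks (a sketch only; it assumes DVC is already installed and the repository is checked out, and it just chains the commands already shown in this thread, so it is not an official recipe):

#!/usr/bin/env bash
# Sketch of the two CI checks above; fails the job if either check fails.
set -euo pipefail

dvc doctor

# "data status" check: fail if something was not pushed to the remote.
dvc data status --not-in-remote --json | grep -v not_in_remote

# "pipeline status" check: fail if any stage would have to be reproduced.
dvc repro --dry --allow-missing | grep -vz "Running stage"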

@Otterpatsch

Otterpatsch commented Aug 23, 2023

On the machine where the repro was run:

> dvc push
Everything is up to date.

On the CI machine:

dvc repro --allow-missing --dry
....
....
Running stage 'training':
> conda env export --prefix .conda-envs/training | grep -v "\(^prefix:\)\|\(^name:\)" > stages/training/exported-conda-env.yaml
 > conda run --no-capture --prefix .conda-envs/training/ mmocr train --config_path stages/training/abinet_config_handwriting.py
> conda run --no-capture --prefix .conda-envs/training/ python dependencies/scripts/rename.py model/
> cp -r stages/training/charsets model/
....
Running stage 'collect_benchmarks':
> mkdir --parents stages/collect_benchmarks/outputs
> conda env export --prefix .conda-envs/collect_benchmarks | grep -v "\(^prefix:\)\|\(^name:\)" > stages/collect_benchmarks/outputs/exported-conda-env.yaml
> conda run --no-capture --prefix .conda-envs/collect_benchmarks/ python stages/collect_benchmarks/scripts/collect_benchmarks.py --predictions-folder stages/extract/outputs --groundtruth-folder datasets/benchmark-sets --output-file stages/collect_benchmarks/outputs/all_datasets.csv

Stage 'plots' didn't change, skipping

If I run dvc pull afterwards and then do a dvc data status:

> dvc pull
137 files added and 149768 files fetched
>dvc data status
No changes.

After a lot of thinking about what might cause this issue: it seems that dvc repro --dry --allow-missing checks the "local" state but not the remote, as intended for a CI pipeline (correct me if I am wrong).
E.g. our first CI command (dvc data status --not-in-remote) checks just the remote state. Is there any command planned which ignores any local state and simply checks the status on the remote side? Something like a dvc repro status --remote?
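
Since dvc repro --dry essentially compares the local workspace against dvc.lock, it cannot confirm a clean pipeline when the data is not present locally. Until something like a remote-aware repro status exists, one workaround (just a sketch, and it costs a full pull in CI) is to materialize the data first and then run the check, as already observed above:

# Workaround sketch: pull the data, then verify that nothing would run.
dvc pull
dvc repro --dry | grep -vz "Running stage"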

@dberenbaum (Collaborator, Author)

Closing as stale, but feel free to reopen if you are still facing issues with this

@dberenbaum dberenbaum closed this as not planned (won't fix, can't repro, duplicate, stale) Jan 9, 2024