
dvc repro --dry: should fail if any stage has to run #9861

Closed
dberenbaum opened this issue Aug 21, 2023 · 11 comments
Labels: A: pipelines (Related to the pipelines feature), awaiting response (we are waiting for your reply, please respond! :))

Comments

@dberenbaum (Collaborator)

So I fixed the issue (I think) on our side. I basically ran dvc repro --allow-missing --dry a couple of times to find, each time, one of the datasets which was still in DVC 2 format. Then I re-added those and it no longer crashes.

But now the pipeline succeeds even though I get the following lines in the output. Which makes sense, because I changed a lot of .dvc files which are also in that path.

13:57:33  2023-08-21 11:57:24,369 DEBUG: built tree 'object 880a0f10a0350a3ed636a6a395a7cd4a.dir'
13:57:33  2023-08-21 11:57:24,370 DEBUG: Dependency 'datasets/training-sets' of stage: 'training' changed because it is 'modified'.
13:57:33  2023-08-21 11:57:24,371 DEBUG: stage: 'training' changed.
13:57:33  2023-08-21 11:57:24,384 DEBUG: built tree 'object 880a0f10a0350a3ed636a6a395a7cd4a.dir'
13:57:33  2023-08-21 11:57:24,386 DEBUG: built tree 'object 2ead35ca4cf9b96e0f4ad3cc696e78d7.dir'
13:57:33  2023-08-21 11:57:24,397 DEBUG: built tree 'object 880a0f10a0350a3ed636a6a395a7cd4a.dir'
13:57:33  2023-08-21 11:57:24,397 DEBUG: {'datasets/training-sets': 'modified'}
13:57:33  2023-08-21 11:57:24,408 DEBUG: built tree 'object 880a0f10a0350a3ed636a6a395a7cd4a.dir'
13:57:33  2023-08-21 11:57:24,409 DEBUG: built tree 'object 2ead35ca4cf9b96e0f4ad3cc696e78d7.dir'
13:57:33  Running stage 'training':
13:57:33  > conda env export --prefix .conda-envs/training | grep -v "\(^prefix:\)\|\(^name:\)" > stages/training/exported-conda-env.yaml
13:57:33  > conda run --no-capture --prefix .conda-envs/training/ mmocr train --config_path stages/training/abinet_config_handwriting.py
13:57:33  > conda run --no-capture --prefix .conda-envs/training/ python dependencies/scripts/rename.py model/
13:57:33  > cp -r stages/training/charsets model/
13:57:33  2023-08-21 11:57:24,412 DEBUG: stage: 'training' was reproduced

How can I fix this? It seems that I'm not using the correct command for my pipeline. I mean the command succeeds, but in a pipeline sense it should fail, because a repro would be run if I just used dvc repro.

I believe I'm missing something similar to the dvc data status one:
dvc data status --not-in-remote --json | grep -v not_in_remote

which has the grep, but I'm not sure how to do the same for dvc repro --allow-missing --dry so that it fails for all kinds of dependencies.

So I tried:
dvc repro --dry --allow-missing | grep -v "Running stage "
But it still succeeds, even though if I just use grep "Running stage " I get some output:

> dvc repro --dry --allow-missing | grep "Running stage "
Running stage 'training':
Running stage 'collect_benchmarks':
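
The reason the -v variant still exits 0: grep -v succeeds as long as at least one input line does not match, and the dry-run output contains plenty of other lines besides "Running stage ..." (the individual stage commands and the "didn't change, skipping" lines, for example), so the pipe exits 0 even when stages would run. A minimal sketch of a check that does fail when any stage would run (just a sketch of the shell logic, not an official DVC feature; it assumes the "Running stage '<name>':" lines go to stdout):

# Fail the CI step if the dry run reports any stage that would be executed.
if dvc repro --dry --allow-missing | grep -q "Running stage "; then
  echo "Pipeline is not up to date: at least one stage would run" >&2
  exit 1
fi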

Originally posted by @Otterpatsch in #9818 (comment)

@dberenbaum dberenbaum added p1-important Important, aka current backlog of things to do A: pipelines Related to the pipelines feature labels Aug 21, 2023
@dberenbaum dberenbaum added this to DVC Aug 21, 2023
@github-project-automation github-project-automation bot moved this to Backlog in DVC Aug 21, 2023
@dberenbaum (Collaborator, Author)

@Otterpatsch I opened a separate issue for this one to keep track.

@iterative/dvc I think once we hit a stage that needs to be run, we need to stop execution for downstream stages, or at least we should raise a non-zero exit code.

@dberenbaum dberenbaum changed the title So i fixed the issue (i think) on our side. I basically run dvc repro --allow-missing --dry couple of times to get each time one of the datasets which where still dvc2. Then i readd those and not anymore crashing. dvc repro --dry: should fail if any stage has to run Aug 21, 2023
@dberenbaum (Collaborator, Author)

@Otterpatsch do you hit this in CI or only when testing locally? I would expect that you would hit this when you have already pulled the data, but not in CI, since there the data won't be pulled and downstream stages will likely fail to find the necessary dependencies.

@Otterpatsch

Otterpatsch commented Aug 22, 2023

I hit this on the CI. I assume that because of --allow-missing the command does fail on the non-present dependencies (partially .dvc-tracked dependencies are fine).

I ran the following command on the machine where I dvc pushed the data (and did the repro):
> dvc repro --dry --allow-missing | grep -vz "Running stage"
There echo $? returns the expected return code of 0,
but on any other machine I get a return code of 1, which is unexpected.
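
For reference, the -z flag is what makes the exit code meaningful here: GNU grep -z reads the whole (NUL-free) output as a single record, so with -v the command prints nothing and exits 1 whenever "Running stage" occurs anywhere in the output, and exits 0 only when it does not. So the exit codes above are consistent with what the dry run prints on each machine. Spelled out (a sketch with the same behaviour):

dvc repro --dry --allow-missing | grep -vz "Running stage"
echo $?   # 0 when no stage would run, 1 when at least one stage would run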

Some more details:
On any machine (tested 2) which didn't pull the data, dvc repro --dry --allow-missing returns
something like:

2023-08-22 09:24:04,038 DEBUG: built tree 'object 2a19b4e09c2cc3cb5fd9bf314391d8f3.dir'
2023-08-22 09:24:04,053 DEBUG: Dependency 'datasets/training-sets' of stage: 'training' changed because it is 'modified'.
2023-08-22 09:24:04,054 DEBUG: stage: 'training' changed.
2023-08-22 09:24:08,281 DEBUG: built tree 'object 2a19b4e09c2cc3cb5fd9bf314391d8f3.dir'
2023-08-22 09:24:08,297 DEBUG: built tree 'object 2ead35ca4cf9b96e0f4ad3cc696e78d7.dir'
2023-08-22 09:24:12,538 DEBUG: built tree 'object 2a19b4e09c2cc3cb5fd9bf314391d8f3.dir'
2023-08-22 09:24:12,556 DEBUG: {'datasets/training-sets': 'modified'}
2023-08-22 09:24:16,933 DEBUG: built tree 'object 2a19b4e09c2cc3cb5fd9bf314391d8f3.dir'
2023-08-22 09:24:16,952 DEBUG: built tree 'object 2ead35ca4cf9b96e0f4ad3cc696e78d7.dir'
Running stage 'training':
> conda env export --prefix .conda-envs/training | grep -v "\(^prefix:\)\|\(^name:\)" > stages/training/exported-conda-env.yaml
> conda run --no-capture --prefix .conda-envs/training/ mmocr train --config_path stages/training/abinet_config_handwriting.py
> conda run --no-capture --prefix .conda-envs/training/ python dependencies/scripts/rename.py model/
> cp -r stages/training/charsets model/
2023-08-22 09:24:16,954 DEBUG: stage: 'training' was reproduced

(dvc push was done, and dvc data status --not-in-remote --json | grep -v not_in_remote succeeds)

datasets/training-sets is a path to a bunch of directories/datasets which are used for training.

ls datasets/training-sets
2023-easter-internal-dates.dvc  KeinWifi     customer             playground-alphanumeric.dvc
2023-easter-internal-something.dvc  backgrounds  example_dataset.dvc  playground.dvc

Basically, instead of adding all full paths to the dependencies (which would be ~200-300 lines), we decided to just add the parent directory. This parent folder is not dvc-tracked, but all the subdirectories are. Could this maybe cause the issue?
It kind of makes sense that datasets/training-sets changed, as its subdirs were not filled by dvc pull.

I assume it could have something to do with dependencies which are not fully dvc-tracked?

> grep "modified" log --after 5
2023-08-22 07:43:23,032 DEBUG: Dependency 'datasets/training-sets' of stage: 'training' changed because it is 'modified'.
2023-08-22 07:43:23,033 DEBUG: stage: 'training' changed.
2023-08-22 07:43:23,041 DEBUG: built tree 'object 898704477691cd70828dca497f483b3b.dir'
2023-08-22 07:43:23,042 DEBUG: built tree 'object 2ead35ca4cf9b96e0f4ad3cc696e78d7.dir'
2023-08-22 07:43:23,049 DEBUG: built tree 'object 898704477691cd70828dca497f483b3b.dir'
2023-08-22 07:43:23,049 DEBUG: {'datasets/training-sets': 'modified'}
2023-08-22 07:43:23,056 DEBUG: built tree 'object 898704477691cd70828dca497f483b3b.dir'
2023-08-22 07:43:23,057 DEBUG: built tree 'object 2ead35ca4cf9b96e0f4ad3cc696e78d7.dir'
Running stage 'training':
> conda env export --prefix .conda-envs/training | grep -v "\(^prefix:\)\|\(^name:\)" > stages/training/exported-conda-env.yaml
> conda run --no-capture --prefix .conda-envs/training/ mmocr train --config_path stages/training/abinet_config_handwriting.py
--
2023-08-22 07:43:23,262 DEBUG: Dependency 'stages/extract/outputs' of stage: 'collect_benchmarks' changed because it is 'modified'.
2023-08-22 07:43:23,263 DEBUG: stage: 'collect_benchmarks' changed.
2023-08-22 07:43:23,266 DEBUG: built tree 'object 505e2691068f58084f1f23180a42c903.dir'
2023-08-22 07:43:23,268 DEBUG: built tree 'object 505e2691068f58084f1f23180a42c903.dir'
2023-08-22 07:43:23,269 DEBUG: {'stages/extract/outputs': 'modified'}
2023-08-22 07:43:23,270 DEBUG: built tree 'object 505e2691068f58084f1f23180a42c903.dir'
Running stage 'collect_benchmarks':
> mkdir --parents stages/collect_benchmarks/outputs
> conda env export --prefix .conda-envs/collect_benchmarks | grep -v "\(^prefix:\)\|\(^name:\)" > stages/collect_benchmarks/outputs/exported-conda-env.yaml
> conda run --no-capture --prefix .conda-envs/collect_benchmarks/ python stages/collect_benchmarks/scripts/collect_benchmarks.py --predictions-folder stages/extract/outputs --groundtruth-folder datasets/benchmark-sets --output-file stages/collect_benchmarks/outputs/all_datasets.csv

For the collect_benchmarks stage, the dependency that changed is basically the defined outs of the previous stage:

outs:
  - stages/${stage_name}/outputs/${item}

and referenced as a dependency like this:

deps:
  - stages/extract/outputs

@dberenbaum (Collaborator, Author)

Have you git committed and pushed all changes as well? I'm a bit confused about whether you are trying to simulate a clean state where no pipeline stages should run, or a messy state where the pipeline should run and fail.

@Otterpatsch

Otterpatsch commented Aug 22, 2023

Everything is git pushed (git-wise I'm in a clean state). I'm trying to have a clean dvc state. The state itself is also clean, I think.
At least when I pull, everything is fine. I just can't get the CI pipeline to indicate that.

dvc repro --dry --allow-missing outputs that some stages' dependencies changed. If I pull, the command does not indicate that any stage dependency has changed (which is also true).

@dberenbaum (Collaborator, Author)

dberenbaum commented Aug 22, 2023

In that case, it may not be about --dry --allow-missing. What is the output of dvc data status on the CI machines?

Edit: you may need to run dvc pull so that they don't all show as missing, but after that I'd expect it to show a clean state.

@dberenbaum dberenbaum added awaiting response we are waiting for your reply, please respond! :) and removed p1-important Important, aka current backlog of things to do labels Aug 22, 2023
@dberenbaum dberenbaum removed this from DVC Aug 22, 2023
@Otterpatsch

Otterpatsch commented Aug 22, 2023

In that case, it may not be about --dry --allow-missing. What is the output of dvc data status on the CI machines?

> dvc data status                                           
Not in cache:                                                                                                       
  (use "dvc fetch <file>..." to download files)
        model/
....
        datasets/training-sets/customer/some_customer/2022-09-03_training_somedata_1/
        datasets/training-sets/customer/some_customer/2022-09-03_training_somedata_2/
        datasets/training-sets/customer/some_customer/2022-09-03_training_somedata_3/
        datasets/training-sets/customer/some_customer/2022-09-03_training_somedata_4/
        stages/extract/outputs/customer/some-dataset1
        stages/extract/outputs/customer/some-dataset2
        stages/extract/outputs/customer/some-dataset3
...

DVC uncommitted changes:
  (use "dvc commit <file>..." to track changes)
  (use "dvc checkout <file>..." to discard changes)
        modified: model/
...
        stages/extract/outputs/customer/some-dataset1
        stages/extract/outputs/customer/some-dataset2
        stages/extract/outputs/customer/some-dataset3

Just to clarify: the dvc data status --not-in-remote --json | grep -v not_in_remote stage succeeds (expected).

Edit: you may need to run dvc pull so that they don't all show as missing, but after that I'd expect it to show a clean state.

Yes, if I run dvc pull everything is fine. But that's expected; the idea is that I don't have to dvc pull to verify the pipeline status. So I guess I have to combine the commands? And maybe dvc data status would be enough? Or is there some other command I'm missing to achieve this? On the CI machine everything is deleted/cleared afterwards (at least that's the intended behavior).
https://dvc.org/doc/user-guide/pipelines/running-pipelines#verify-pipeline-status indicates that this is the command to use.

@dberenbaum (Collaborator, Author)

Yes, if I run dvc pull everything is fine.

Is this true even on the CI machine? Does dvc data status report a clean status in CI after dvc pull? I know you don't want to pull in the final scenario, but I'm trying to understand why DVC tries to run those stages, and why it only tries to run them in CI.

Basically, instead of adding all full paths to the dependencies (which would be ~200-300 lines), we decided to just add the parent directory. This parent folder is not dvc-tracked, but all the subdirectories are. Could this maybe cause the issue?
It kind of makes sense that datasets/training-sets changed, as its subdirs were not filled by dvc pull.

What do you mean that the subdirs were not filled by dvc pull?

@Otterpatsch

Otterpatsch commented Aug 23, 2023

Disclaimer: if I sound confusing or confuse things, it's because I am.

Is this true even on the CI machine? Does dvc data status report a clean status in CI after dvc pull? I know you don't want to pull in the final scenario, but I'm trying to understand why DVC tries to run those stages, and why it only tries to run them in CI.

The dvc pull fails to pull some data. This should have been fixed / is fixed on the other branches, so I will investigate that.
Could it maybe have to do with the site cache, which at least for us caused a lot of issues?
This is very confusing to me, because dvc data status --not-in-remote doesn't show anything as not being on the remote. So it should be pullable?

The error I'm getting:

WARNING: No file hash info found for '/var/jenkins_home/workspace/repo_MR-20/datasets/training-sets/customer/somecustomer/training-2023_08_08-LYD-consignments/annotations.jsonl

What do you mean that the subdirs were not filled by dvc pull?

As a dvc pull was not done (in the CI), I assume(d) that whether a dependency like datasets/training-sets counts as changed depends on some check. But as this dependency is neither a git-tracked file nor a .dvc file, I assume the check differs from the one for those files? Because pipelines which have only git/dvc-tracked dependencies are fine (they are shown as "not changed, skipping"); basically only dependencies which are paths to directories, which are either dvc outs or contain the .dvc files / dvc-tracked files in some subdirs, are affected.


So, here is what our CI pipeline does; maybe that's helpful, or it shows what I'm trying to achieve, as I suspect I'm doing something wrong (a rough shell sketch of these steps follows the list):

  • some linting checks
  • dvc data status, to check if a dvc push was forgotten (succeeds)
    • install newest DVC version
    • dvc doctor
    • dvc data status --not-in-remote --json | grep -v not_in_remote
  • dvc pipeline status, to check if a repro has to be done (fails)
    • install newest DVC version
    • dvc doctor
    • dvc repro --dry --allow-missing | grep -vz "Running stage"
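
A rough shell sketch of those two checks (a sketch only; it assumes DVC is already installed and the repository is checked out, and it just chains the commands already shown in this thread, so it is not an official recipe):

#!/usr/bin/env bash
# Sketch of the two CI checks above; fails the job if either check fails.
set -euo pipefail

dvc doctor

# "data status" check: fail if something was not pushed to the remote.
dvc data status --not-in-remote --json | grep -v not_in_remote

# "pipeline status" check: fail if any stage would have to be reproduced.
dvc repro --dry --allow-missing | grep -vz "Running stage"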

@Otterpatsch

Otterpatsch commented Aug 23, 2023

On the machine where the repro was run:

> dvc push
Everything is up to date.

On the CI machine:

dvc repro --allow-missing --dry
....
....
Running stage 'training':
> conda env export --prefix .conda-envs/training | grep -v "\(^prefix:\)\|\(^name:\)" > stages/training/exported-conda-env.yaml
 > conda run --no-capture --prefix .conda-envs/training/ mmocr train --config_path stages/training/abinet_config_handwriting.py
> conda run --no-capture --prefix .conda-envs/training/ python dependencies/scripts/rename.py model/
> cp -r stages/training/charsets model/
....
Running stage 'collect_benchmarks':
> mkdir --parents stages/collect_benchmarks/outputs
> conda env export --prefix .conda-envs/collect_benchmarks | grep -v "\(^prefix:\)\|\(^name:\)" > stages/collect_benchmarks/outputs/exported-conda-env.yaml
> conda run --no-capture --prefix .conda-envs/collect_benchmarks/ python stages/collect_benchmarks/scripts/collect_benchmarks.py --predictions-folder stages/extract/outputs --groundtruth-folder datasets/benchmark-sets --output-file stages/collect_benchmarks/outputs/all_datasets.csv

Stage 'plots' didn't change, skipping

If I run dvc pull afterwards and then do a dvc data status:

> dvc pull
137 files added and 149768 files fetched
>dvc data status
No changes.

After a lot of thinking about what might cause this issue: it seems that dvc repro --dry --allow-missing checks the "local" state but not the remote, as intended for a CI pipeline (correct me if I am wrong).
E.g. our first CI command (dvc data status --not-in-remote) checks just the remote state. Is there any command planned which ignores any local state and simply checks the status on the remote side? Something like a dvc repro status --remote?
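
Since dvc repro --dry essentially compares the local workspace against dvc.lock, it cannot confirm a clean pipeline when the data is not present locally. Until something like a remote-aware repro status exists, one workaround (just a sketch, and it costs a full pull in CI) is to materialize the data first and then run the check, as already observed above:

# Workaround sketch: pull the data, then verify that nothing would run.
dvc pull
dvc repro --dry | grep -vz "Running stage"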

@dberenbaum (Collaborator, Author)

Closing as stale, but feel free to reopen if you are still facing issues with this

@dberenbaum dberenbaum closed this as not planned (won't fix, can't repro, duplicate, stale) Jan 9, 2024