dreamerv3 trouble resuming? freeze? #187
It looks like it's faster now when I use fabric.accelerator=gpu, so I'll close this for now.
I got this error after 75k steps.
What does this mean? Also, I'm running on Windows btw. More info:
Could this be due to my VRAM maxing out? Are there any settings I can change to reduce the usage? Why does DreamerV3 take so little memory during pretraining and then spike up massively once it starts learning? Is that expected?
Also, since Stable Retro can only start one emulator per process, training works thanks to the async vector env, but when I run sheeprl_eval the env fails because it tries to open two emulators within the same process. Is there any way around this?
I figured out ways around this stuff; the only thing I'd really like to know is whether there is any way to reduce the memory usage, since it takes so much VRAM while it's training. Otherwise I'll just close this, since it's not really an issue.
I see there are different DreamerV3 model sizes defined in the original paper that reduce memory and speed up training. My question is: do transition_model.hidden_size and representation_model.hidden_size change along with dense_units or any of the other params, or should I just leave those at 1024 for all of the model sizes?
I feel like you should leave this open; I am encountering the same error with this command:
@balloch: fabric.accelerator=gpu on the command line. Also, do you happen to know if this is a problem?
To speed it up more you can lower the size of the Dreamer model by using a custom yaml in the algo config. Note that I don't actually know whether the hidden_size fields should be changed along with the model size, so it's possible they should just always be 1024. I used separate override files for the large and medium sizes, roughly along the lines of the sketch below.
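A minimal sketch of such an override, with illustrative values loosely based on the S size from the paper; the key paths are the ones I believe the dreamer_v3 config uses, so double-check them against the config that ships with sheeprl:

```yaml
# Hypothetical custom algo config (e.g. configs/algo/my_dreamer_v3_small.yaml).
# Key paths follow the fields discussed in this thread; the values are
# illustrative only, so verify them against sheeprl's own dreamer_v3.yaml.
defaults:
  - dreamer_v3
  - _self_

dense_units: 512
mlp_layers: 2
world_model:
  recurrent_model:
    recurrent_state_size: 512
  transition_model:
    hidden_size: 512        # unclear whether these should scale with model size
  representation_model:
    hidden_size: 512        # (they may just stay at 1024)
  encoder:
    cnn_channels_multiplier: 32
```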
I noticed odd behavior when resuming a checkpoint: it does the pretraining again, even if I change learning_starts=0. It also looks like when resuming it just ends after 50k steps, regardless of what I've set total_steps to.
Hi @Disastorm, thank you for reporting this! I'll have a look asap 🤟
I'm actually just using regular training rather than an experiment. Is regular training not resumable? I don't really know the difference between experiments and regular training. This is what I use for the initial training:
This is what I use for resuming:
Here is the yaml of the original training:
Here is the yaml of the resume training (in this case it didn't even do a single step of training; it just ran a "test", returned the reward value, and then stopped):
Oh, maybe I see the reason: it looks like the resume run isn't picking up my algo.total_steps param from the command line for some reason. It still has the previous total_steps, which explains why it quits as soon as it starts. Can you tell me if I should be using an experiment instead, or is this supposed to work? Also, when exactly would I use an experiment vs regular training? Thanks. Interestingly enough, the yaml printed to the console actually has the updated total_steps, but the yaml written to the folder (and presumably the yaml that is actually used) seems to be the previous yaml.
Looks like if I modify the values in the old yaml, it works: I change total_steps and learning_starts in the old yaml, and when resuming, those values are picked up properly.
If you have run it just like that, it is possible that you're running the experiment on CPU. To run it on GPU you can run the following command: python sheeprl.py exp=dreamer_v3 env=dmc env.id=walker_walk "algo.cnn_keys.encoder=[rgb]" fabric.accelerator=gpu. To reduce the memory footprint you can also try to add a lower fabric.precision.
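Putting these together, a minimal sketch of a full command; the precision value and the batch-size key below are my assumptions about the relevant knobs, so check the exact names in your config:

```bash
# Sketch only: GPU + reduced precision + smaller batch to cut VRAM usage
# (fabric.precision value and algo.per_rank_batch_size are assumptions)
python sheeprl.py exp=dreamer_v3 env=dmc env.id=walker_walk \
  "algo.cnn_keys.encoder=[rgb]" \
  fabric.accelerator=gpu \
  fabric.precision=bf16-mixed \
  algo.per_rank_batch_size=8
```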
To reduce the memory footprint you could also try to lower the model dimensions: have a look at this config, where the S version of DreamerV3 is used. You could also try to lower the
This is intended: when we resume from a checkpoint we assume that something has gone wrong, so we load the old config from the checkpoint being resumed and merge it with the one running now, meaning everything that you specify from the CLI will be discarded. Moreover, when you resume from a checkpoint you must (as explained in our how-to) specify the entire path to the checkpoint file. When the checkpoint is resumed, the
PS: I also suggest installing the latest version of sheeprl with
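To make the resume step concrete, a minimal sketch, assuming the option is checkpoint.resume_from (as in the how-to) and using a placeholder path:

```bash
# Minimal resume sketch: the path is a placeholder and the option name is
# assumed to be checkpoint.resume_from; verify it for your sheeprl version.
python sheeprl.py exp=dreamer_v3 \
  checkpoint.resume_from=/path/to/logs/runs/dreamer_v3/.../checkpoint/ckpt_50000_0.ckpt
```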
I think this is an indicator that I have done something wrong in my setup.
Great advice with the fabric precision. I did rerun it with the accelerator set, and it does resume eventually, but it takes a long time. Maybe there is some initialization that Lightning does that I'm not used to, and it will speed up over time (JAX/Flax is like this: the first time operations are run it takes much longer)?
Could you quantify how slow it is? Consider that even the smallest model (the S-sized one) trained on Atari-100k takes around 9-10 hours on a single V100, which is in line with what the DV3 authors report simply as "< 1 day". Another thing that can speed up the training is to set the
Ok, I see, thanks. So manual resuming is not really intended; it's considered like "something went wrong"? Is there no way to have the buffer read from the previous run's buffer files? Would filling up the buffer with 65k steps at the start of the resumed run make up for this, or should I actually just include the buffer in the checkpoint until I feel the model is good enough and then take it out of the checkpoint? I assume the way to do that is to set the experiment's buffer.checkpoint=True? So my main questions are:
Otherwise feel free to close this issue, at least for my stuff.
Right now no, this is done only when you're resuming a checkpoint and you have set buffer.checkpoint=True (there's a sketch of this after these answers).
Exactly: if you're resuming, we assume that you want to continue a previous experiment, picking up from where you left off.
Right now you just have to set a higher
Dreamer-v3 works only with powers of 2 because of how the CNN encoder is defined, and yes, your observation is resized to match the configured image size.
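A minimal sketch tying these answers together: enable buffer checkpointing on the first run (so a resumed run restores the buffer) and keep the observation size a power of two. The env.screen_size key name is my guess at the resize target, so verify it against your env config:

```bash
# Sketch under assumptions: buffer.checkpoint=True stores the replay buffer in
# the checkpoint so a resumed run can restore it; env.screen_size (assumed key
# name) is the power-of-two resize target for the CNN encoder.
python sheeprl.py exp=dreamer_v3 env=dmc env.id=walker_walk \
  "algo.cnn_keys.encoder=[rgb]" \
  buffer.checkpoint=True \
  env.screen_size=64
```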
Cool, thanks for the answers.
@balloch @Disastorm, is it ok for you to close the issue?
Yes, I think so, very helpful! By "slow" I mean the default model took ~24 hours to do 1.2 million steps, which isn't bad (it's about the speed of DreamerV2); I think it just seemed weird because of that long gap of time to get past 64k.
Ok with me!
Yeah, closing is fine. @belerico: so when resuming it doesn't do the pretraining buffer steps anymore; however, I noticed the buffer files don't ever get updated, the last modified date is just when the first training started. Is this a problem? The files I'm referring to are the .memmap files. I see now it doesn't keep creating them for each run when checkpoint = True, so I assumed it would be using the ones from the previous run, but their modified date isn't changing at all. Is the buffer inside the checkpoint file itself? The filesize of the checkpoint still looks pretty similar to when running with checkpoint: False, I think.
Hello, sorry, I don't actually know much about the mathematical formulas and whatnot behind RL and the algorithms; I've previously just been training stuff with SB3 PPO.
Anyway, I installed sheeprl, implemented a wrapper for Stable Retro, got it started, and I could see the envs running in parallel and the agent doing stuff. However, once it hits the point set by learning_starts, it stops logging anything, although it's still using my CPU and RAM. It was basically sitting there for over 30 minutes with no logs. No idea if it's actually doing anything or not, although I suppose I could try again with the visual retro window open so I can check.
Any ideas as to what the issue could be?
*edit: actually it finally updated; I guess it's just really slow after learning starts? Is there a way to run this on GPU?