
Can't pickle with xformers after 81bc427 #203

Closed
jramapuram opened this issue Feb 3, 2022 · 11 comments

jramapuram commented Feb 3, 2022

🐛 Bug

I'm running into pickling errors after commit 81bc427

Pickling is needed for some DDP contexts.

Pseudocode

import pickle

import torch.nn as nn
from xformers.factory import xFormer, xFormerConfig

class ViTB(nn.Module):
    def __init__(self, model_config):
        super().__init__()
        # model_config: a dict describing the stack,
        # e.g. a generic ViT-B encoder config (see the YAML further down)
        self.xformer = xFormer.from_config(xFormerConfig([model_config]))

pickle.dumps(ViTB(model_config))

I haven't bisected, but on head (093224e at this time) this results in:

AttributeError: Can't pickle local object '_init_from_params.<locals>.init_method'
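For context: pickle can only serialize a function it can re-import by its qualified name, so a function defined inside another function (here `init_method` inside `_init_from_params`) always fails this way. A minimal sketch of the failure mode and the usual workaround, binding arguments with `functools.partial` over a module-level function (the names below are illustrative, not xformers' actual fix):

```python
import pickle
from functools import partial

def _init_from_params(gain):
    # A closure like this has the qualified name
    # "_init_from_params.<locals>.init_method", which pickle cannot
    # re-import, hence the AttributeError above.
    def init_method(tensor):
        return gain * tensor
    return init_method

def _scaled_init(tensor, gain):
    # Module-level function: picklable by qualified name.
    return gain * tensor

broken = _init_from_params(0.02)
try:
    pickle.dumps(broken)
except AttributeError as e:
    print(e)  # e.g. Can't pickle local object '_init_from_params.<locals>.init_method'

# Workaround: bind arguments with functools.partial instead of closing
# over them; the partial of a module-level function pickles fine.
fixed = partial(_scaled_init, gain=0.02)
restored = pickle.loads(pickle.dumps(fixed))
```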

Expected behavior

Model pickles correctly (needed for certain DDP contexts).

Environment

  • PyTorch Version (e.g., 1.0): 1.10.2 (tried earlier versions, same issue).
  • OS (e.g., Linux): Ubuntu 20.04
  • How you installed PyTorch (conda, pip, source): conda
@blefaudeux (Contributor)

Oh, thanks for the report, this is unexpected! (And the repro will make a perfect unit test so that this does not reappear.)

blefaudeux self-assigned this Feb 3, 2022
jramapuram changed the title from "Can't pickle xformers after 81bc427" to "Can't pickle with xformers after 81bc427" Feb 3, 2022

blefaudeux commented Feb 3, 2022

Ah, this could be related to the FusedMLP, which pulls in Triton; that would at least make sense with respect to the linked commit. Just in case, is Triton installed on your machines @jramapuram? (Still writing a test and checking, but in case this unblocks you earlier.) Does this happen if you ask for "MLP" and not "FusedMLP"?

@blefaudeux (Contributor)

> _init_from_params

Replying to myself: nope, this is not because of FusedMLP, I can repro indeed. Fixing that.

@jramapuram (Author)

You beat me to it :). Let me know if you need anything else.

@jramapuram (Author)

Here is my config if helpful (it is pretty vanilla):

reversible: False
block_type: "encoder"
num_layers: 12
dim_model: 768
layer_norm_style: "pre"

multi_head_config:
  num_heads: 12
  residual_dropout: 0.1
  use_rotary_embeddings: true

  attention:
    name: "scaled_dot_product"
    dropout: 0.0
    causal: False

feedforward_config:
  name: "MLP"
  dropout: 0.0
  activation: "gelu"
  hidden_layer_multiplier: 4
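For reference, the same config expressed as a Python dict, as a sketch of how it could be handed to the factory (the xformers call is commented out so the snippet stands alone; whether `from_config` accepts exactly this shape depends on the xformers version):

```python
# Python-dict form of the YAML encoder config above.
model_config = {
    "reversible": False,
    "block_type": "encoder",
    "num_layers": 12,
    "dim_model": 768,
    "layer_norm_style": "pre",
    "multi_head_config": {
        "num_heads": 12,
        "residual_dropout": 0.1,
        "use_rotary_embeddings": True,
        "attention": {
            "name": "scaled_dot_product",
            "dropout": 0.0,
            "causal": False,
        },
    },
    "feedforward_config": {
        "name": "MLP",
        "dropout": 0.0,
        "activation": "gelu",
        "hidden_layer_multiplier": 4,
    },
}

# Building the model would then look like (requires xformers installed):
# from xformers.factory import xFormer, xFormerConfig
# model = xFormer.from_config(xFormerConfig([model_config]))
```

Note that `feedforward_config.name` is "MLP" here, not "FusedMLP", so the Triton path is not involved.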

@blefaudeux (Contributor)

@jramapuram the above should work with https://github.com/facebookresearch/xformers/tree/pickling_tests, but pickling still dies on the Triton parts, which are JIT-compiled, so if you switch MLP to FusedMLP it will crash on that :( Trying to find a global solution.

@jramapuram (Author)

Got it; thanks for the quick turnaround! This should be okay to get going in the meantime at least :)

@blefaudeux (Contributor)

> Got it; thanks for the quick turnaround! This should be okay to get going in the meantime at least :)

I just updated the branch; make sure that you force-pull, it was subtly buggy at some point (the init when k/q/v had different settings). The Triton pickling part should be fixable, but I'm a bit short on time right now; checking back in a bit later. Putting up a draft PR.

xwhan pushed a commit to xwhan/xformers that referenced this issue Feb 8, 2022
* Adding checkerboard pattern

* Adding documentation

* black reformatting

* Revert "black reformatting"

This reverts commit 032cf331495f63e1439c8334625220afae2d29ea.

* Revert changes to notebook

* Refactor implementation to handle some corner cases

Also re-uses already existing methods to simplify things

* Add tests and bugfix

* Bugfix

Co-authored-by: Marta Gazulla <martatintore@devfair0121.h2.fair>
Co-authored-by: Francisco Massa <fvsmassa@gmail.com>
@blefaudeux (Contributor)

@jramapuram late thought, I've been meaning to write back but time flew. It's a bit strange to me that you have to pickle the model; it's not typically considered good practice. Each of the processes that DDP spawns should be able to rebuild the whole context by itself (you just need to be careful about the seeds, which need to be set on each process, but that's typically a good idea anyway). Something to take care of also is that the dataloaders need to be properly initialized, but you're probably on top of that already.
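A minimal sketch of the per-process seeding described above (names are illustrative; a real DDP setup would also seed torch and numpy with the same derived seed, and launchers such as torchrun expose the rank via the RANK environment variable):

```python
import os
import random

def seed_worker(base_seed: int, rank: int) -> None:
    """Give each DDP process a deterministic but distinct seed."""
    worker_seed = base_seed + rank
    random.seed(worker_seed)
    # In a real setup you would also do, e.g.:
    # torch.manual_seed(worker_seed)
    # numpy.random.seed(worker_seed)

# Each spawned process derives its seed from its own rank.
rank = int(os.environ.get("RANK", "0"))
seed_worker(base_seed=1234, rank=rank)
```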

@jramapuram (Author)

@blefaudeux: this was with some prototype pytorch-lightning code. I'm guessing they use the old mp.spawn method instead of something like torchrun (which we typically use 😄) to spin up K processes per GPU.

Good shout on the dataloaders; already seeding those with the replica_id from the world pool 👍

@blefaudeux (Contributor)

Closing this for now; I don't think it's actually possible to pickle the Triton parts, since they are JIT-compiled and that is beyond the pickling mechanism. A fallback is to not use Triton if pickling is required.
