
Fix uploading processors/tokenizers to WandB on train end #35701

Conversation

@jack89roberts (Contributor) commented on Jan 15, 2025

What does this PR do?

Since tokenizer was renamed to processing_class in #32385, models uploaded to WandB at the end of training have not included the tokenizer/processor config. This is due to a missed update to the argument names in WandbCallback.on_train_end, which is fixed in this PR. The bug only affects the final artifacts saved at the end of training; checkpoints saved during training still upload both the model and the processor config.
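
For illustration, a minimal sketch of the shape of the fix (not the exact transformers source): the Trainer now passes the processor to callback events under the processing_class keyword, so a handler that still reads the old tokenizer keyword silently gets None and never saves the processor config. The class and save calls below are simplified stand-ins.

import tempfile

class WandbCallbackSketch:
    def on_train_end(self, args, state, control, model=None, processing_class=None, **kwargs):
        # Before this PR the handler still looked for `tokenizer`, which the Trainer
        # no longer passes, so the processor config never made it into the final artifact.
        if model is None:
            return
        with tempfile.TemporaryDirectory() as temp_dir:
            # Save the model weights/config, plus the processor config when it is provided.
            model.save_pretrained(temp_dir)
            if processing_class is not None:
                processing_class.save_pretrained(temp_dir)
            # ... upload the contents of temp_dir as the run's final model artifact ...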

If you have wandb set up and are logged in, you can see the problem by running examples/pytorch/text-classification on main and on this branch, then comparing the artifacts logged to your wandb run (you'll need to delete the local output directory with rm -rf /tmp/mrpc before starting the script on each branch):

export TASK_NAME=mrpc                            
export WANDB_LOG_MODEL=checkpoint
python run_glue.py \
  --model_name_or_path google-bert/bert-base-cased \
  --task_name $TASK_NAME \
  --do_train \
  --do_eval \
  --max_seq_length 128 \
  --per_device_train_batch_size 32 \
  --learning_rate 2e-5 \
  --num_train_epochs 2 \
  --output_dir /tmp/$TASK_NAME/ \
  --report_to wandb \
  --save_strategy epoch \
  --save_total_limit 2 \
  --load_best_model_at_end True \
  --eval_strategy epoch

The files in the logged :latest model artifact will include the tokenizer config for the run from this branch, but not for the run started on main.

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline, Pull Request section?
  • Was this discussed/approved via a GitHub issue or the forum? Please add a link to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

Previous changes: @amyeroberts

Maybe Trainer: @muellerzr, @SunMarc, though this is more of an integration/WandB issue.

@jack89roberts (Contributor, Author) commented on Jan 15, 2025

tests_torch_and_tf has failed in CircleCI, but I don't see how that can be related to these changes. I'm happy to try to fix it if there's something I can do or someone can point me in the right direction, though:

   1 failed because `AssertionError: 0.00021042596 not less than or equal to 0.0002 : outputs.logits` -> Difference between PyTorch and TF is 0.00021042596199549735 (>= 0.0002) for Data2VecVisionForSemanticSegmentation
Number of failed tests: 1

Edit: it passed after the 2nd commit, so I suspect it was some unlucky randomness in the tests?

@jack89roberts (Contributor, Author) commented on Jan 15, 2025

Looking at the rest of transformers/integrations/integration_utils.py, I suspect all instances of tokenizer should be replaced with processing_class. I have pushed that here as well; there are 6 other instances across ClearMLCallback and DVCLiveCallback, but I'm unfamiliar with those and so less certain about them than about the WandB updates.
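
To illustrate the pattern of that rename (a hypothetical before/after; the actual bodies of ClearMLCallback and DVCLiveCallback differ, this only shows the keyword change, with the enclosing class omitted):

# Before: the Trainer no longer passes this keyword, so it is always None here.
def on_train_end(self, args, state, control, model=None, tokenizer=None, **kwargs):
    ...

# After: matches the keyword actually passed by CallbackHandler.call_event.
def on_train_end(self, args, state, control, model=None, processing_class=None, **kwargs):
    ...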

@ayulockin commented
Hey @jack89roberts, I am from W&B. Thanks for opening this PR. LGTM. :)

@HuggingFaceDocBuilderDev commented

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@SunMarc (Member) left a comment

Indeed, this is something we missed in the PR you mentioned. We need this change since we have the following in CallbackHandler:

def call_event(self, event, args, state, control, **kwargs):
    for callback in self.callbacks:
        result = getattr(callback, event)(
            args,
            state,
            control,
            model=self.model,
            processing_class=self.processing_class,
            optimizer=self.optimizer,
            lr_scheduler=self.lr_scheduler,
            train_dataloader=self.train_dataloader,
            eval_dataloader=self.eval_dataloader,
            **kwargs,
        )
        # A Callback can skip the return of `control` if it doesn't change it.
        if result is not None:
            control = result
    return control
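
To spell out the consequence (with a hypothetical callback, just to illustrate): call_event passes the processor only under the processing_class keyword, so a handler that still declares the old name never receives it there; it simply lands in **kwargs.

class OldStyleCallback:
    def on_train_end(self, args, state, control, model=None, tokenizer=None, **kwargs):
        # `tokenizer` stays None; the processor arrives as kwargs["processing_class"].
        print(tokenizer)                      # None
        print("processing_class" in kwargs)   # True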

@SunMarc requested a review from ArthurZucker on January 15, 2025, 14:10
@jack89roberts (Contributor, Author) commented
Hi, sorry to prod, but is there anything I can do to help get this merged or make sure it doesn't get forgotten? It's stopping us from using transformers >= 4.46.0.

@MekkCyber (Contributor) commented
Hi @jack89roberts, sorry for the delay; we are just waiting for a review from a core maintainer.

@ArthurZucker (Collaborator) left a comment

Wow, sorry and thanks all for the update! Indeed, we changed the API, I think 2-3 months ago!

@ArthurZucker merged commit 0a950e0 into huggingface:main on Jan 23, 2025
25 checks passed
@jack89roberts (Contributor, Author) commented
Thanks!

bursteratom pushed a commit to bursteratom/transformers that referenced this pull request Jan 31, 2025
…e#35701)

* rename tokenizer to processing_class in WandbCallback.on_train_end

* rename tokenizer to processing_class in ClearMLCallback and DVCLiveCallback
elvircrn pushed a commit to elvircrn/transformers that referenced this pull request Feb 13, 2025
…e#35701)

* rename tokenizer to processing_class in WandbCallback.on_train_end

* rename tokenizer to processing_class in ClearMLCallback and DVCLiveCallback