Transformer: add update_listeners_in_predict option #342

Closed

Conversation

@danieldk (Contributor) commented Aug 4, 2022

Draft: still needs docs, but I first wanted to discuss this proposal.

The output of a transformer is passed through in two different ways:

- Prediction: the data is passed through the `Doc._.trf_data` attribute (illustrated in the snippet below).
- Training: the data is broadcast directly to the transformer's listeners.
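For illustration, the prediction path looks roughly like this (the pipeline name is only an example, and the exact type stored on the extension depends on the spacy-transformers version):

```python
import spacy

# Any pipeline with a transformer component will do; en_core_web_trf is
# only an example here.
nlp = spacy.load("en_core_web_trf")
doc = nlp("This is an example sentence.")

# During prediction, the transformer output is stored on the custom Doc
# extension instead of being pushed to listeners.
print(type(doc._.trf_data))
```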

However, the `Transformer.predict` method breaks the strict separation between training and prediction by also broadcasting transformer outputs to its listeners. This was added (I think) to support training with a frozen transformer.

This breaks down when we are training a model with an unfrozen transformer that is also in `annotating_components`. The transformer will first (as part of its update step) broadcast the tensors and backprop function to its listeners. Then, when acting as an annotating component, it immediately overwrites its own output and clears the backprop function. As a result, gradients will not flow into the transformer.

This change fixes the issue by adding the `update_listeners_in_predict` option, which is enabled by default. When this option is disabled, the tensors will not be broadcast to listeners in `predict`.
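As a rough sketch of how the option would be used (assuming it ends up exposed through the transformer factory's config as proposed here):

```python
import spacy

# Hypothetical usage of the proposed option; the rest of the factory
# config falls back to the spacy-transformers defaults.
nlp = spacy.blank("en")
nlp.add_pipe(
    "transformer",
    config={"update_listeners_in_predict": False},  # proposed setting
)
```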


Alternatives considered:

- Yanking the listener code from `predict`: breaks our current semantics and would make it harder to train with a frozen transformer.
- Checking in the listener whether the tensors we are receiving have the same batch ID as the ones we already hold, and not updating if we already have that batch with a backprop function (a rough sketch follows below). I thought this is a bit fragile, because it breaks when batching differs between training and prediction (?).
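Roughly what I mean by the second alternative (the `receive` hook mirrors the listener layer in spacy-transformers, but the attribute names and the exact condition are just for illustration):

```python
# Hypothetical guard inside the listener's receive() hook; attribute
# names are illustrative, not the actual spacy-transformers internals.
def receive(self, batch_id, outputs, backprop):
    same_batch = getattr(self, "_batch_id", None) == batch_id
    have_backprop = getattr(self, "_backprop", None) is not None
    if same_batch and have_backprop and backprop is None:
        # The same batch was already broadcast during update() with a
        # usable backprop callback; don't let predict() clear it.
        return
    self._batch_id = batch_id
    self._outputs = outputs
    self._backprop = backprop
```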

@svlandeg added the labels bug (Something isn't working) and feat / pipeline (Feature: Pipeline components) on Aug 5, 2022
@svlandeg self-requested a review on Aug 5, 2022
@adrianeboyd (Contributor)

Let me close and reopen to check the CI.

@svlandeg (Member) commented Aug 26, 2022

I agree it looks like there are some inconsistencies in how the embeddings are passed through, and the clear-cut separation between train/predict that we envisioned isn't actually reflected in the code.

Alternatives considered:

* Yanking the listener code from `predict`: breaks our current semantics, would make it harder to train with a frozen transformer.

I think I like this suggestion best. A tok2vec version of this is here: explosion/spaCy#11385

* Checking in the listener whether the tensors we are receiving have the same batch ID as the ones we already hold, and not updating if we already have that batch with a backprop function. I thought this is a bit fragile, because it breaks when batching differs between training and prediction (?).

You're right that this might be a bit too fragile.

Another alternative would be to check against `None` (for transformers) or against `_empty_backprop` (for tok2vec) within the `receive` body, and not overwrite the backprop callback when the new one is empty, but again that's a bit brittle.
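Something along these lines, with illustrative attribute names (`_empty_backprop` being the tok2vec module's no-op sentinel, assumed to be in scope):

```python
# Hypothetical check in the listener's receive() hook: keep the backprop
# callback stored during update() when the incoming one is empty.
def receive(self, batch_id, outputs, backprop):
    incoming_is_empty = backprop is None or backprop is _empty_backprop
    self._batch_id = batch_id
    self._outputs = outputs
    if incoming_is_empty and getattr(self, "_backprop", None) is not None:
        return  # don't overwrite a usable callback with an empty one
    self._backprop = backprop
```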

I have a final, more radical proposal:

For v4, remove the indirect communication with listeners entirely for the outputs (you still need it for the backprop), and always rely on `doc.tensor`, i.e. make it mandatory to put `tok2vec` and `transformer` in `annotating_components`.
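For context, this is roughly the opt-in mechanism that would become mandatory; the `annotates` argument corresponds to `[training] annotating_components` in the config (argument name as I remember it from the v3.1+ API):

```python
import spacy
from spacy.training import Example

# Placeholder pipeline and data; any transformer pipeline with
# listener-based components would do.
nlp = spacy.load("en_core_web_trf")
optimizer = nlp.resume_training()
doc = nlp.make_doc("A tiny example.")
examples = [Example.from_dict(doc, {})]

# Run the transformer as an annotating component during the update so
# downstream components can read its predicted output from the Doc.
losses = nlp.update(examples, sgd=optimizer, annotates=["transformer"])
```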

@svlandeg (Member) commented Sep 7, 2022

Closing in favour of #345

@svlandeg closed this Sep 7, 2022