
Mitigate a conflict when using sentencepiece #33327

Merged Sep 13, 2024 (3 commits)
6 changes: 5 additions & 1 deletion src/transformers/convert_slow_tokenizer.py
@@ -26,14 +26,18 @@
 from tokenizers import AddedToken, Regex, Tokenizer, decoders, normalizers, pre_tokenizers, processors
 from tokenizers.models import BPE, Unigram, WordPiece

-from .utils import is_protobuf_available, logging, requires_backends
+from .utils import is_protobuf_available, is_sentencepiece_available, logging, requires_backends
 from .utils.import_utils import PROTOBUF_IMPORT_ERROR


 logger = logging.get_logger(__name__)


 def import_protobuf(error_message=""):
+    if is_sentencepiece_available():
+        from sentencepiece import sentencepiece_model_pb2
+
+        return sentencepiece_model_pb2
     if is_protobuf_available():
         import google.protobuf
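The hunk above makes `import_protobuf` prefer the protobuf stubs bundled with sentencepiece before falling back to the library's own; the remainder of the function is collapsed in the diff. The general "first importable candidate wins" pattern it follows can be sketched generically (a minimal sketch; the helper name and the candidate module names below are stand-ins, not the real ones):

```python
import importlib


def import_first_available(candidates, error_message=""):
    """Return the first module in `candidates` that imports cleanly.

    Mirrors the fallback chain in import_protobuf: try the
    sentencepiece-bundled stubs first, then other providers.
    """
    for name in candidates:
        try:
            return importlib.import_module(name)
        except ImportError:
            continue
    raise ImportError(f"None of {candidates} could be imported. {error_message}")


# "json" is always importable, so it wins over the nonexistent first candidate.
mod = import_first_available(["definitely_not_a_module", "json"])
print(mod.__name__)  # → json
```

Trying candidates in a fixed order is what resolves the conflict: when sentencepiece is installed, everyone ends up using the same compiled protobuf stubs instead of two competing copies.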
20 changes: 19 additions & 1 deletion tests/tokenization/test_tokenization_utils.py
@@ -35,7 +35,15 @@
     is_tokenizers_available,
 )
 from transformers.models.gpt2.tokenization_gpt2 import GPT2Tokenizer
-from transformers.testing_utils import CaptureStderr, require_flax, require_tf, require_tokenizers, require_torch, slow
+from transformers.testing_utils import (
+    CaptureStderr,
+    require_flax,
+    require_sentencepiece,
+    require_tf,
+    require_tokenizers,
+    require_torch,
+    slow,
+)


if is_tokenizers_available():
@@ -296,3 +304,13 @@ def test_len_tokenizer(self):
         self.assertEqual(len(tokenizer), tokenizer.vocab_size + 1)
         self.assertEqual(len(tokenizer.added_tokens_decoder), added_tokens_size + 1)
         self.assertEqual(len(tokenizer.added_tokens_encoder), added_tokens_size + 1)
+
+    @require_sentencepiece
+    def test_sentencepiece_cohabitation(self):
+        from sentencepiece import sentencepiece_model_pb2 as _original_protobuf  # noqa: F401
Collaborator: When we do this, we are importing sentencepiece.sentencepiece_model_pb2 and it will persist for the whole (remaining) lifetime of the Python test process. I would probably avoid this, and run this test separately.

Contributor (Author): Are you suggesting moving this test to its own file?

Collaborator: Sorry, typo: I meant in a separate Python subprocess. But we can wait for a core maintainer's review.

Contributor (Author, @tengomucho, Sep 5, 2024): OK! Who is the core maintainer? Can we tag them?

Collaborator: I already pinged Arthur.


+        from transformers.convert_slow_tokenizer import import_protobuf  # noqa: F401
+
+        # Now this will try to import sentencepiece_model_pb2_new.py. This should not fail even if
+        # the protobuf was already imported.
+        import_protobuf()
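The reviewer's concern is that importing sentencepiece_model_pb2 lingers in sys.modules for the rest of the test process. Their "separate Python subprocess" suggestion can be sketched with the stdlib alone (a minimal sketch, not the helper transformers actually uses; the stdlib json module stands in for the protobuf stub):

```python
import subprocess
import sys
import textwrap

# Run the import-sensitive check in a fresh interpreter, so whatever it
# imports does not persist in the parent test process's sys.modules.
child_code = textwrap.dedent(
    """
    import sys
    import json  # stand-in for sentencepiece.sentencepiece_model_pb2
    assert "json" in sys.modules
    """
)
result = subprocess.run(
    [sys.executable, "-c", child_code], capture_output=True, text=True
)
print(result.returncode)  # → 0
```

A non-zero return code (or stderr output) would flag the failure while leaving the parent interpreter's import state untouched.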