How to use TokenizerBuilder? #1549
Comments
In the meantime, I've implemented this alternative workaround (using the buildstructor crate, per the attribute below):

struct TokenizerX;

#[buildstructor::buildstructor]
impl TokenizerX {
    #[builder]
    fn try_new<'a>(
        with_model: ModelWrapper,
        with_decoder: Option<Decoder<'a>>,
        with_normalizer: Option<Normalizer<'a>>,
    ) -> Result<Tokenizer> {
        let mut tokenizer = Tokenizer::new(with_model);

        // Handle local enum to remote enum type:
        if let Some(decoder) = with_decoder {
            let d = DecoderWrapper::try_from(decoder)?;
            tokenizer.with_decoder(d);
        }
        if let Some(normalizer) = with_normalizer {
            let n = NormalizerWrapper::try_from(normalizer)?;
            tokenizer.with_normalizer(n);
        }

        Ok(tokenizer)
    }
}

Usage:

let mut tokenizer: Tokenizer = TokenizerX::try_builder()
    .with_model(model)
    .with_decoder(decoder)
    .with_normalizer(normalizer)
    .build()?;

The local to remote enum logic above is for the related

let decoder = Decoder::Sequence(vec![
    Decoder::Replace("_", " "),
    Decoder::ByteFallback,
    Decoder::Fuse,
    Decoder::Strip(' ', 1, 0),
]);
let normalizer = Normalizer::Sequence(vec![
    Normalizer::Prepend("▁"),
    Normalizer::Replace(" ", "▁"),
]);
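The comment does not show how the local Decoder enum is converted into the library's wrapper type, so here is a minimal sketch of one way a local-to-remote TryFrom conversion could look. RemoteDecoder is a hypothetical stand-in for a wrapper enum owned by another crate (it is not the real DecoderWrapper), and the empty-pattern check is only an illustrative stand-in for whatever fallible construction the real type performs.

```rust
use std::convert::TryFrom;

// Hypothetical stand-in for a wrapper enum defined in another crate.
#[derive(Debug, PartialEq)]
enum RemoteDecoder {
    Replace { pattern: String, content: String },
    ByteFallback,
    Fuse,
    Strip { content: char, start: usize, stop: usize },
    Sequence(Vec<RemoteDecoder>),
}

// Local enum with the same shape as the `Decoder` used in the comment above.
#[derive(Debug)]
enum Decoder<'a> {
    Replace(&'a str, &'a str),
    ByteFallback,
    Fuse,
    Strip(char, usize, usize),
    Sequence(Vec<Decoder<'a>>),
}

impl<'a> TryFrom<Decoder<'a>> for RemoteDecoder {
    type Error = String;

    fn try_from(value: Decoder<'a>) -> Result<Self, Self::Error> {
        Ok(match value {
            Decoder::Replace(pattern, content) => {
                // Illustrative failure mode only (an assumption, not library behavior).
                if pattern.is_empty() {
                    return Err("replace pattern must not be empty".to_string());
                }
                RemoteDecoder::Replace {
                    pattern: pattern.to_string(),
                    content: content.to_string(),
                }
            }
            Decoder::ByteFallback => RemoteDecoder::ByteFallback,
            Decoder::Fuse => RemoteDecoder::Fuse,
            Decoder::Strip(content, start, stop) => {
                RemoteDecoder::Strip { content, start, stop }
            }
            // Convert nested decoders recursively; the first error aborts the whole conversion.
            Decoder::Sequence(items) => RemoteDecoder::Sequence(
                items
                    .into_iter()
                    .map(RemoteDecoder::try_from)
                    .collect::<Result<Vec<_>, _>>()?,
            ),
        })
    }
}
```

Because the source enum is local, an impl of this shape is permitted even when the target type lives in another crate, which is what lets the builder function above call try_from on its arguments.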
I believe the builder is mostly used for training.
@ArthurZucker perhaps you could better document that? Because by naming convention and current docs comment it implies it is the builder pattern for the
It provides an API that matches what you'd expect of a builder API, and it's
tokenizers/tokenizers/src/tokenizer/mod.rs Lines 464 to 484 in 1ff56c0
tokenizers/tokenizers/src/tokenizer/mod.rs Lines 419 to 436 in 1ff56c0
tokenizers/tokenizers/src/tokenizer/mod.rs Lines 408 to 417 in 1ff56c0
As the issue reports, though, that doesn't seem to work very well: the builder API is awkward to use. You could probably adapt it to use
Presently, due to the issue reported here, the builder offers little value over creating the tokenizer without a fluent builder API.
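As a rough illustration of what a less awkward builder could look like, the sketch below stores wrapper enums directly and accepts anything convertible into them, so unset optional components never need a type annotation. Every type here (Bpe, Nfc, WrapperBuilder, and the local ModelWrapper, NormalizerWrapper and Tokenizer) is a hypothetical stand-in defined in the snippet, not the library's own definition, and this is not a claim about how the maintainers would or should change the API.

```rust
// Hypothetical concrete components.
#[derive(Debug, PartialEq)]
struct Bpe;
#[derive(Debug, PartialEq)]
struct Nfc;

// Local stand-ins for wrapper enums that can hold any supported component.
#[derive(Debug, PartialEq)]
enum ModelWrapper {
    Bpe(Bpe),
}
#[derive(Debug, PartialEq)]
enum NormalizerWrapper {
    Nfc(Nfc),
}

impl From<Bpe> for ModelWrapper {
    fn from(model: Bpe) -> Self {
        ModelWrapper::Bpe(model)
    }
}
impl From<Nfc> for NormalizerWrapper {
    fn from(normalizer: Nfc) -> Self {
        NormalizerWrapper::Nfc(normalizer)
    }
}

#[derive(Debug, PartialEq)]
struct Tokenizer {
    model: ModelWrapper,
    normalizer: Option<NormalizerWrapper>,
}

// The builder has no generic parameters, so nothing is left for the caller to annotate.
#[derive(Default)]
struct WrapperBuilder {
    model: Option<ModelWrapper>,
    normalizer: Option<NormalizerWrapper>,
}

impl WrapperBuilder {
    fn new() -> Self {
        Self::default()
    }

    fn with_model(mut self, model: impl Into<ModelWrapper>) -> Self {
        self.model = Some(model.into());
        self
    }

    fn with_normalizer(mut self, normalizer: impl Into<NormalizerWrapper>) -> Self {
        self.normalizer = Some(normalizer.into());
        self
    }

    fn build(self) -> Result<Tokenizer, String> {
        // The model is the only required component in this sketch.
        let model = self.model.ok_or_else(|| "model is required".to_string())?;
        Ok(Tokenizer { model, normalizer: self.normalizer })
    }
}
```

With this shape, WrapperBuilder::new().with_model(Bpe).build() type-checks without a turbofish, at the cost of always going through the wrapper enums.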
This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.
I expected TokenizerBuilder to produce a Tokenizer from the build() result, but instead Tokenizer wraps TokenizerImpl.
No problem, I see that it impl From<TokenizerImpl> for Tokenizer, but it's attempting to do quite a bit more for some reason? Meanwhile I cannot use Tokenizer(unwrapped_build_result_here) as the struct is private 🤔 (while the Tokenizer::new() method won't take this in either)
Why is this an issue? Isn't the point of the builder so that you don't have to specify the optional types not explicitly set?
I had a glance over the source on github but didn't see an example or test for using this API and the docs don't really cover it either.
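To illustrate why optional types can still need naming, here is a minimal sketch with hypothetical stand-in types (GenericBuilder, TokenizerImpl<M, N>, WordModel, LowercaseNormalizer) rather than the real tokenizers definitions. It assumes a builder that is generic over each component: when a component is never set, nothing in the call chain tells the compiler which type fills that slot, so the caller has to spell it out.

```rust
// Hypothetical stand-in components.
#[derive(Debug, PartialEq)]
struct WordModel;
#[derive(Debug, PartialEq)]
struct LowercaseNormalizer;

// Stand-in for a tokenizer that is generic over its components.
#[derive(Debug, PartialEq)]
struct TokenizerImpl<M, N> {
    model: M,
    normalizer: Option<N>,
}

// Stand-in for a builder that carries the same generic parameters.
struct GenericBuilder<M, N> {
    model: Option<M>,
    normalizer: Option<N>,
}

impl<M, N> GenericBuilder<M, N> {
    fn new() -> Self {
        GenericBuilder { model: None, normalizer: None }
    }

    fn with_model(mut self, model: M) -> Self {
        self.model = Some(model);
        self
    }

    fn with_normalizer(mut self, normalizer: Option<N>) -> Self {
        self.normalizer = normalizer;
        self
    }

    fn build(self) -> Result<TokenizerImpl<M, N>, String> {
        let model = self.model.ok_or_else(|| "model is required".to_string())?;
        Ok(TokenizerImpl { model, normalizer: self.normalizer })
    }
}

fn build_without_normalizer() -> Result<bool, String> {
    // This line would be rejected with "type annotations needed", because
    // `with_model` fixes `M` but nothing ever mentions `N`:
    //
    //     let tokenizer = GenericBuilder::new().with_model(WordModel).build()?;
    //
    // Naming the unused parameter explicitly makes it compile.
    let tokenizer = GenericBuilder::<WordModel, LowercaseNormalizer>::new()
        .with_model(WordModel)
        .build()?;
    Ok(tokenizer.normalizer.is_none())
}
```

The annotation names a normalizer type even though no normalizer is ever supplied, which is the friction described above.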
Meanwhile with Tokenizer instead of TokenizerBuilder this works: