UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 #52

Open
dhananjaybhandiwad opened this issue Jun 6, 2024 · 3 comments

dhananjaybhandiwad commented Jun 6, 2024

Dear Author,
I encountered the error below when I tried to run the script token_grammar_recognizer.py:

Traceback (most recent call last):
  File "d:\transformers-CFG\transformers_cfg\token_grammar_recognizer.py", line 288, in <module>
    input_text = file.read()
  File "C:\Users\dhana\miniconda3\envs\decoding\lib\encodings\cp1252.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 163: character maps to <undefined>

This is the main file:

if __name__ == "__main__":
    from transformers import AutoTokenizer

    with open("D:/transformers-CFG/examples/grammars/japanese.ebnf", "r") as file:
        input_text = file.read()
    parsed_grammar = parse_ebnf(input_text)
    parsed_grammar.print()

    tokenizer = AutoTokenizer.from_pretrained("gpt2")

    tokenRecognizer = IncrementalTokenRecognizer(
        grammar_str=input_text,
        start_rule_name="root",
        tokenizer=tokenizer,
        unicode=True,
    )



    japanese = "トリーム"  # "こんにちは"
    token_ids = tokenizer.encode(japanese)
    # 13298, 12675, 12045, 254
    init_state = None
    state = tokenRecognizer._consume_token_ids(token_ids, init_state, as_string=False)

    if state.stacks:
        print("The Japanese input is accepted")
    else:
        print("The Japanese input is not accepted")

Could you please help me with this issue?
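
For context, the traceback passes through cp1252.py, which suggests that open() decoded the file with the Windows locale encoding rather than UTF-8. A minimal sketch of the same failure follows. It assumes the grammar file is saved as UTF-8 and contains hiragana such as こ; the actual contents of japanese.ebnf are not shown in this thread, so the sample character is an assumption.

```python
# Assumption: the grammar file is UTF-8 and contains hiragana like this one.
sample = "こ"

# In UTF-8 this character is three bytes, and the second one is 0x81.
data = sample.encode("utf-8")
print(data)

error_message = None
try:
    # cp1252 has no character assigned to byte 0x81, so decoding fails.
    data.decode("cp1252")
except UnicodeDecodeError as exc:
    error_message = str(exc)
    print(error_message)
```

The byte offset differs from the one in the report (position 163) because this sample is a single character rather than the full file.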

@Saibo-creator
Collaborator

Hello @dhananjaybhandiwad,

Thank you for raising this issue. I did not encounter any problems when running your script. It might be a versioning issue. Could you please check which version of the package you are using?

On my side, installing directly from PyPI with pip install transformers_cfg, I get:

(transformers-cfg-pypi) ➜  transformers-CFG-dev git:(main) ✗ pip show transformers_cfg
Name: transformers_cfg
Version: 0.2.1
Summary: Extension of Transformers library for Context-Free Grammar Constrained Decoding with EBNF grammars
Home-page: https://github.com/epfl-dlab/transformers-CFG
Author: EPFL-dlab
Author-email: saibo.geng@epfl.ch
License:
Location: /opt/anaconda3/envs/transformers-cfg-pypi/lib/python3.8/site-packages
Requires: line-profiler, numpy, protobuf, sentencepiece, setuptools, termcolor, tokenizers, torch, transformers
Required-by:

@dhananjaybhandiwad
Author

dhananjaybhandiwad commented Jun 11, 2024

Hello @Saibo-creator,
I cloned the latest repo to modify some code in parser.py to accommodate the SPARQL grammar. I then ran token_grammar_recognizer.py just to see how the system works, and it threw the error mentioned in my previous comment.

To check whether my changes to parser.py had affected token_grammar_recognizer.py, I reverted all of them and ran the unchanged version of parser.py, but the error persisted.

The error also occurs when I run parser.py on its own, specifically while parsing the japanese.ebnf file.

@Saibo-creator
Collaborator

Hello @dhananjaybhandiwad,
I just cloned the latest version (commit 86eccd) and was able to run your script without any problem. It may be a platform issue, though I am not sure at all. Are you using Windows? If so, could you try WSL?
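
A possible workaround that does not require WSL is to pass the encoding explicitly when opening the grammar file, since open() otherwise uses the platform's locale encoding (cp1252 in the traceback above). The sketch below is self-contained and uses a temporary file as a stand-in for examples/grammars/japanese.ebnf; the one-line grammar in it is made up for illustration and is not the real file.

```python
import os
import tempfile

# Hypothetical stand-in for the contents of japanese.ebnf.
grammar_text = 'root ::= "こんにちは"\n'

with tempfile.TemporaryDirectory() as tmp_dir:
    grammar_path = os.path.join(tmp_dir, "japanese.ebnf")

    # Write the stand-in grammar as UTF-8.
    with open(grammar_path, "w", encoding="utf-8") as file:
        file.write(grammar_text)

    # Explicit encoding: the result no longer depends on the platform locale.
    with open(grammar_path, "r", encoding="utf-8") as file:
        input_text = file.read()

print(input_text == grammar_text)
```

Applied to the script in the first comment, this would mean changing the call to open("D:/transformers-CFG/examples/grammars/japanese.ebnf", "r", encoding="utf-8"). Setting the environment variable PYTHONUTF8=1 (Python 3.7 and later) should have a similar effect without touching the code.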
