Different translations for the same sentence (same batch size, CPU, different surrounding sentences) #104

Closed
cgr71ii opened this issue Oct 5, 2023 · 2 comments
Labels
bug Something isn't working

cgr71ii commented Oct 5, 2023

Bug description

Hi!

When I translate the sentence "Nokkuð mun minna en 50% skal gisti tær.", I get different results depending on which other sentences are in the same batch; nothing else changes between runs. I've prepared a minimal example (sentences1.txt and sentences2.txt):

The files sentences1.txt and sentences2.txt contain very similar sentences, but only one sentence is common to both. That sentence should get the same translation in each file, but it does not: "Anything less than 50% should be edible." vs. "Anything less than 50% should be ed.".

I think this might be expected on a GPU, since the order of operations can differ between runs, but I don't think it should happen on a CPU.

I used the isen.student.base model from browsermt/students#74 (I'm not sure my copy is the same version as the one in the PR, since some files in the PR differ from mine, but it should be the same).

sentences2.txt
sentences1.txt

How to reproduce

Translation script (marian-translate-is2en.sh):

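#!/usr/bin/env bash

# Number of CPU threads: first argument, defaulting to 1 when empty or missing.
THREADS="${1:-1}"

# Fall back to 1 if the argument is not a non-negative integer.
if [[ ! "$THREADS" =~ ^[0-9]+$ ]]; then
  THREADS="1"
fi

# Reads source sentences from stdin and writes translations to stdout.
/home/cgarcia/Documentos/marian-dev/marian-dev/build/marian-decoder \
  -c /home/cgarcia/Documentos/marian-dev/models/isen.student.base/config.intgemm8bit.alphas.yml \
  --quiet --cpu-threads "$THREADS"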

Translate:

cat sentences1.txt |  marian-dev/scripts/marian-translate-is2en.sh 5

# The HCA percent is very important, make sure that these ranges in between 50 and 60%.
# Anything less than 50% should be edible.
# You need to keep an eye out for uncountable items, binders and fillers.
cat sentences2.txt |  marian-dev/scripts/marian-translate-is2en.sh 5

# The HCA percent is very important, make sure that this limit in between 50 and 60%.
# Anything less than 50% should be ed.
# You need to look out for synthetic components, binders as well as fillers.

If I translate both files together (cat sentences{1,2}.txt or cat sentences{2,1}.txt), the sentence is translated as "Anything less than 50% should be edible.".
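
For anyone who wants to check this kind of mismatch automatically, here is a minimal sketch of a helper that lists source sentences shared by two runs whose translations differ. The function name compare_shared is made up for illustration, and it assumes each output file has exactly one translation per source line and that no sentence contains a tab.

```bash
# Sketch: report source sentences present in both runs with different translations.
# Arguments: source file 1, output file 1, source file 2, output file 2.
# Assumes line i of each output file translates line i of its source file.
compare_shared() {
  {
    # Tag every "source<TAB>translation" pair with the run it came from.
    paste "$1" "$2" | awk '{ print "1\t" $0 }'
    paste "$3" "$4" | awk '{ print "2\t" $0 }'
  } | awk -F'\t' '
    # Remember the translation each source sentence got in run 1.
    $1 == "1" { first[$2] = $3; next }
    # In run 2, print shared sentences whose translation changed.
    ($2 in first) && first[$2] != $3 {
      printf "%s\n  run 1: %s\n  run 2: %s\n", $2, first[$2], $3
    }
  '
}
```

After saving the decoder output of each run to a file (say out1.txt and out2.txt, hypothetical names), calling compare_shared sentences1.txt out1.txt sentences2.txt out2.txt should print only the shared sentence together with its two translations.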

Context

  • Marian version: v1.9.56 a1a82ff 2021-10-18 18:17:11 +0200
  • CMake command: cannot
  • Log file: inference

Maybe this is related to huggingface/transformers#25921 (?)

@cgr71ii cgr71ii added the bug Something isn't working label Oct 5, 2023
@jelmervdl
Member

If the model uses a shortlist (and I think it does) then this is to be expected. The shortlist is applied at the batch level.

Does it still happen when you remove the shortlist?
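
To make the batch-level effect concrete, here is a toy sketch. It is not Marian's shortlist code: it assumes a hypothetical plain-text lexicon with one "source_word target_word" pair per line (the real lex.s2t.bin is a binary file), and the function name batch_shortlist is invented for illustration. It builds the set of allowed target words as the union over every source word in the batch.

```bash
# Toy model of a batch-level lexical shortlist (illustration only).
# $1: lexicon file, one "source_word target_word" pair per line (assumed format).
# $2: batch file, one whitespace-tokenized source sentence per line.
# Prints the sorted union of target words allowed for the whole batch.
batch_shortlist() {
  awk '
    # First file: collect the candidate target words for each source word.
    NR == FNR { targets[$1] = targets[$1] " " $2; next }
    # Second file: every word of every sentence adds its candidates to one shared set.
    {
      for (i = 1; i <= NF; i++) {
        n = split(targets[$i], words, " ")
        for (j = 1; j <= n; j++) allowed[words[j]] = 1
      }
    }
    END { for (w in allowed) print w }
  ' "$1" "$2" | sort
}
```

In this toy model, a sentence batched with different neighbours is decoded against a different set of allowed target words, which would be consistent with the same sentence coming out as "edible" in one batch and "ed." in another.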

@cgr71ii
Author

cgr71ii commented Oct 5, 2023

Yes, you are right:

shortlist:
    - lex.s2t.bin
    - false

Once removed, the translation is "Anything less than 50% should be left to the room." in both files.
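
For reference, one way to produce a config without the shortlist is sketched below. It assumes the shortlist entry is a top-level key followed by "-" list items, as in the snippet above; strip_shortlist is a made-up helper name, and the approach is a plain text filter rather than a real YAML parser.

```bash
# Sketch: print a copy of a Marian config with the top-level "shortlist" entry removed.
# $1: path to the original config file.
strip_shortlist() {
  awk '
    /^shortlist:/          { skipping = 1; next }   # drop the key itself
    skipping && /^[ \t]*-/ { next }                 # drop its list items
                           { skipping = 0; print }  # keep everything else
  ' "$1"
}
```

For example, strip_shortlist config.intgemm8bit.alphas.yml > config.noshortlist.yml (the output name is hypothetical) writes the modified copy. If the config refers to model files by relative path, it is probably safest to keep the copy in the same directory as the original.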

Thank you!

@cgr71ii cgr71ii closed this as completed Oct 5, 2023