the tokenized result of sentencepiece java lib and python lib are different #999

Closed
thinkzhou opened this issue Jun 8, 2021 · 5 comments
Labels
bug Something isn't working

Comments

@thinkzhou

Description

I am using your Java lib and the original Python lib to load the xlm-roberta-base model and tokenize sentences, and I find that the Java and Python results differ. It looks like the Java lib handles emoji (e.g. 👋) incorrectly. Could this be a bug?

Expected Behavior

The tokenized results from the Java lib and the Python lib should be the same.

Error Message

No Error Message

How to Reproduce?

Java code:

public static void main(String[] args) {
    Path modelPath = Paths.get("path/to/sentencepiece.model");
    try (SpTokenizer tokenizer = new SpTokenizer(modelPath)) {
        // Two waving-hand emoji (U+1F44B), each written as a UTF-16 surrogate pair
        String s = "\uD83D\uDC4B\uD83D\uDC4B";
        List<String> tokens = tokenizer.tokenize(s);
        System.out.println(tokens);
    } catch (IOException exception) {
        exception.printStackTrace();
    }
}

Result:
[▁, ������������]

Python Code:

import sentencepiece as spm

processor = spm.SentencePieceProcessor(model_file="path/to/sentencepiece.model")
print(processor.tokenize("👋👋", out_type=str))

Result:
['▁', '👋', '👋']

Steps to reproduce

  1. Run the Java code and the Python code
  2. Compare the tokenized results

What have you tried to solve it?

  1. https://stackoverflow.com/questions/32205446/getting-true-utf-8-characters-in-java-jni

Environment Info

Please run the command ./gradlew debugEnv from the root directory of DJL (if necessary, clone DJL first). It will output information about your system, environment, and installation that can help us debug your issue. Paste the output of the command below:

----------- System Properties -----------
sun.cpu.isalist:
ftp.nonProxyHosts: local|*.local|169.254/16|*.169.254/16
socksNonProxyHosts: local|*.local|169.254/16|*.169.254/16
sun.io.unicode.encoding: UnicodeBig
sun.cpu.endian: little
java.vendor.url.bug: http://bugreport.sun.com/bugreport/
file.separator: /
java.vendor: Oracle Corporation
sun.boot.class.path: /Library/Java/JavaVirtualMachines/jdk1.8.0_171.jdk/Contents/Home/jre/lib/resources.jar:/Library/Java/JavaVirtualMachines/jdk1.8.0_171.jdk/Contents/Home/jre/lib/rt.jar:/Library/Java/JavaVirtualMachines/jdk1.8.0_171.jdk/Contents/Home/jre/lib/sunrsasign.jar:/Library/Java/JavaVirtualMachines/jdk1.8.0_171.jdk/Contents/Home/jre/lib/jsse.jar:/Library/Java/JavaVirtualMachines/jdk1.8.0_171.jdk/Contents/Home/jre/lib/jce.jar:/Library/Java/JavaVirtualMachines/jdk1.8.0_171.jdk/Contents/Home/jre/lib/charsets.jar:/Library/Java/JavaVirtualMachines/jdk1.8.0_171.jdk/Contents/Home/jre/lib/jfr.jar:/Library/Java/JavaVirtualMachines/jdk1.8.0_171.jdk/Contents/Home/jre/classes
java.ext.dirs: /Users/zhouyang/Library/Java/Extensions:/Library/Java/JavaVirtualMachines/jdk1.8.0_171.jdk/Contents/Home/jre/lib/ext:/Library/Java/Extensions:/Network/Library/Java/Extensions:/System/Library/Java/Extensions:/usr/lib/java
java.version: 1.8.0_171
java.vm.info: mixed mode
awt.toolkit: sun.lwawt.macosx.LWCToolkit
user.language: zh
java.specification.vendor: Oracle Corporation
sun.java.command: ai.djl.integration.util.DebugEnvironment
java.home: /Library/Java/JavaVirtualMachines/jdk1.8.0_171.jdk/Contents/Home/jre
sun.arch.data.model: 64
java.vm.specification.version: 1.8
java.class.path: /Users/zhouyang/work/github/djl/integration/build/classes/java/main:/Users/zhouyang/work/github/djl/integration/build/resources/main:/Users/zhouyang/.gradle/caches/modules-2/files-2.1/commons-cli/commons-cli/1.4/c51c00206bb913cd8612b24abd9fa98ae89719b1/commons-cli-1.4.jar:/Users/zhouyang/.gradle/caches/modules-2/files-2.1/org.apache.logging.log4j/log4j-slf4j-impl/2.13.3/7cca27a921a18645139cf651c04b83b1a19cfd76/log4j-slf4j-impl-2.13.3.jar:/Users/zhouyang/work/github/djl/basicdataset/build/libs/basicdataset-0.12.0-SNAPSHOT.jar:/Users/zhouyang/work/github/djl/model-zoo/build/libs/model-zoo-0.12.0-SNAPSHOT.jar:/Users/zhouyang/work/github/djl/testing/build/libs/testing-0.12.0-SNAPSHOT.jar:/Users/zhouyang/.gradle/caches/modules-2/files-2.1/org.testng/testng/7.1.0/b0bcea778fb2899aeb4014c558babea8833d180a/testng-7.1.0.jar:/Users/zhouyang/work/github/djl/mxnet/mxnet-model-zoo/build/libs/mxnet-model-zoo-0.12.0-SNAPSHOT.jar:/Users/zhouyang/.gradle/caches/modules-2/files-2.1/ai.djl.mxnet/mxnet-native-auto/1.8.0/e32265c03e27e1fb18c9c0904733b00f9acffaee/mxnet-native-auto-1.8.0.jar:/Users/zhouyang/work/github/djl/pytorch/pytorch-model-zoo/build/libs/pytorch-model-zoo-0.12.0-SNAPSHOT.jar:/Users/zhouyang/.gradle/caches/modules-2/files-2.1/ai.djl.pytorch/pytorch-native-auto/1.8.1/3cbb59c8b21c24cb368d296f6c4c6ef069d4d9b/pytorch-native-auto-1.8.1.jar:/Users/zhouyang/work/github/djl/tensorflow/tensorflow-model-zoo/build/libs/tensorflow-model-zoo-0.12.0-SNAPSHOT.jar:/Users/zhouyang/.gradle/caches/modules-2/files-2.1/ai.djl.tensorflow/tensorflow-native-auto/2.4.1/20b8c7a4e6d451e782d15dd30cebd4df0ad86c74/tensorflow-native-auto-2.4.1.jar:/Users/zhouyang/work/github/djl/mxnet/mxnet-engine/build/libs/mxnet-engine-0.12.0-SNAPSHOT.jar:/Users/zhouyang/work/github/djl/pytorch/pytorch-engine/build/libs/pytorch-engine-0.12.0-SNAPSHOT.jar:/Users/zhouyang/work/github/djl/tensorflow/tensorflow-engine/build/libs/tensorflow-engine-0.12.0-SNAPSHOT.jar:/Users/zhouyang/work/github/djl/api/
build/libs/api-0.12.0-SNAPSHOT.jar:/Users/zhouyang/.gradle/caches/modules-2/files-2.1/org.slf4j/slf4j-api/1.7.30/b5a4b6d16ab13e34a88fae84c35cd5d68cac922c/slf4j-api-1.7.30.jar:/Users/zhouyang/.gradle/caches/modules-2/files-2.1/org.apache.logging.log4j/log4j-core/2.13.3/4e857439fc4fe974d212adaaaa3b118b8b50e3ec/log4j-core-2.13.3.jar:/Users/zhouyang/.gradle/caches/modules-2/files-2.1/org.apache.logging.log4j/log4j-api/2.13.3/ec1508160b93d274b1add34419b897bae84c6ca9/log4j-api-2.13.3.jar:/Users/zhouyang/.gradle/caches/modules-2/files-2.1/org.apache.commons/commons-csv/1.8/37ca9a9aa2d4be2599e55506a6d3170dd7a3df4/commons-csv-1.8.jar:/Users/zhouyang/.gradle/caches/modules-2/files-2.1/com.beust/jcommander/1.72/6375e521c1e11d6563d4f25a07ce124ccf8cd171/jcommander-1.72.jar:/Users/zhouyang/.gradle/caches/modules-2/files-2.1/com.google.inject/guice/4.1.0/faf9ee8ac09eafd1128091426dd367a8c0085d55/guice-4.1.0-no_aop.jar:/Users/zhouyang/.gradle/caches/modules-2/files-2.1/org.yaml/snakeyaml/1.21/18775fdda48574784f40b47bf478ab0593f92e4d/snakeyaml-1.21.jar:/Users/zhouyang/.gradle/caches/modules-2/files-2.1/com.google.code.gson/gson/2.8.6/9180733b7df8542621dc12e21e87557e8c99b8cb/gson-2.8.6.jar:/Users/zhouyang/.gradle/caches/modules-2/files-2.1/net.java.dev.jna/jna/5.3.0/4654d1da02e4173ba7b64f7166378847db55448a/jna-5.3.0.jar:/Users/zhouyang/.gradle/caches/modules-2/files-2.1/org.apache.commons/commons-compress/1.20/b8df472b31e1f17c232d2ad78ceb1c84e00c641b/commons-compress-1.20.jar:/Users/zhouyang/.gradle/caches/modules-2/files-2.1/javax.inject/javax.inject/1/6975da39a7040257bd51d21a231b76c915872d38/javax.inject-1.jar:/Users/zhouyang/.gradle/caches/modules-2/files-2.1/aopalliance/aopalliance/1.0/235ba8b489512805ac13a8f9ea77a1ca5ebe3e8/aopalliance-1.0.jar:/Users/zhouyang/.gradle/caches/modules-2/files-2.1/com.google.guava/guava/19.0/6ce200f6b23222af3d8abb6b6459e6c44f4bb0e9/guava-19.0.jar:/Users/zhouyang/work/github/djl/tensorflow/tensorflow-api/build/libs/tensorflow-api-0.12.0-SNAPSHOT.jar:/
Users/zhouyang/.gradle/caches/modules-2/files-2.1/org.bytedeco/javacpp/1.5.5/92e1c31aaed15a3dc12008859a37ced45fa0b730/javacpp-1.5.5.jar:/Users/zhouyang/.gradle/caches/modules-2/files-2.1/org.tensorflow/tensorflow-core-api/0.3.1/954f292e85f4d2a587ede1b2e1a525e74ef96c97/tensorflow-core-api-0.3.1.jar:/Users/zhouyang/.gradle/caches/modules-2/files-2.1/com.google.protobuf/protobuf-java/3.8.0/b5f93103d113540bb848fe9ce4e6819b1f39ee49/protobuf-java-3.8.0.jar:/Users/zhouyang/.gradle/caches/modules-2/files-2.1/org.tensorflow/ndarray/0.3.1/3cdb825411a9de908cc3dac740f18628d6512260/ndarray-0.3.1.jar
user.name: zhouyang
ai.djl.logging.level: debug
file.encoding: UTF-8
java.specification.version: 1.8
java.awt.printerjob: sun.lwawt.macosx.CPrinterJob
user.timezone: Asia/Shanghai
user.home: /Users/zhouyang
library.jansi.path: /Users/zhouyang/.gradle/native/jansi/1.18/osx
http.nonProxyHosts: local|*.local|169.254/16|*.169.254/16
os.version: 10.15.7
sun.management.compiler: HotSpot 64-Bit Tiered Compilers
java.specification.name: Java Platform API Specification
java.class.version: 52.0
org.gradle.internal.http.connectionTimeout: 60000
java.library.path: /Users/zhouyang/Library/Java/Extensions:/Library/Java/Extensions:/Network/Library/Java/Extensions:/System/Library/Java/Extensions:/usr/lib/java:.
org.gradle.internal.publish.checksums.insecure: true
sun.jnu.encoding: UTF-8
os.name: Mac OS X
user.variant:
java.vm.specification.vendor: Oracle Corporation
org.gradle.appname: gradlew
java.io.tmpdir: /var/folders/zv/gqw522z179l_5zv1k2q7tblm0000gn/T/
line.separator:

java.endorsed.dirs: /Library/Java/JavaVirtualMachines/jdk1.8.0_171.jdk/Contents/Home/jre/lib/endorsed
os.arch: x86_64
java.awt.graphicsenv: sun.awt.CGraphicsEnvironment
java.runtime.version: 1.8.0_171-b11
java.vm.specification.name: Java Virtual Machine Specification
user.dir: /Users/zhouyang/work/github/djl/integration
org.gradle.internal.http.socketTimeout: 120000
user.country: CN
sun.java.launcher: SUN_STANDARD
sun.os.patch.level: unknown
java.vm.name: Java HotSpot(TM) 64-Bit Server VM
file.encoding.pkg: sun.io
path.separator: :
java.vm.vendor: Oracle Corporation
java.vendor.url: http://java.oracle.com/
gopherProxySet: false
sun.boot.library.path: /Library/Java/JavaVirtualMachines/jdk1.8.0_171.jdk/Contents/Home/jre/lib:/Library/Java/JavaVirtualMachines/jdk1.8.0_171.jdk/Contents/Home/jre/lib
java.vm.version: 25.171-b11
java.runtime.name: Java(TM) SE Runtime Environment
@thinkzhou thinkzhou added the bug Something isn't working label Jun 8, 2021
@frankfliu
Contributor

@thinkzhou
This is very interesting. It looks like the emoji are in the UTF-16 surrogate range. We convert to native chars with GetStringUTFChars; I'm not sure whether the two are related.
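
For readers unfamiliar with surrogates, here is a minimal sketch of what "in the UTF-16 surrogate range" means for the string in the reproduction above. The class name SurrogateDemo and its helper methods are hypothetical and exist only for illustration; the calls themselves are standard java.lang APIs.

```java
final class SurrogateDemo {
    /** Number of UTF-16 code units (Java chars) in the string. */
    static int codeUnits(String s) {
        return s.length();
    }

    /** Number of Unicode code points in the string. */
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        // Same input as the reproduction: U+1F44B twice, as surrogate pairs
        String s = "\uD83D\uDC4B\uD83D\uDC4B";
        System.out.println(codeUnits(s));                           // 4
        System.out.println(codePoints(s));                          // 2
        System.out.println(Character.isHighSurrogate(s.charAt(0))); // true
        System.out.println(Character.isLowSurrogate(s.charAt(1)));  // true
        System.out.println(Integer.toHexString(s.codePointAt(0)));  // 1f44b
    }
}
```

Each emoji occupies two Java chars, so any conversion that treats chars one at a time sees two surrogates rather than one character.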

@thinkzhou
Author

Yes, GetStringUTFChars returns a pointer to an array of bytes representing the string in modified UTF-8 encoding, so emoji and special characters like 𝑊𝑒𝑙𝑐𝑜𝑚𝑒︎ run into trouble with this function. I am not familiar with JNI, but this link may offer some solutions.
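
The difference between the two encodings can be seen from plain Java, without any JNI. The sketch below is only an illustration, not the DJL fix: it assumes that DataOutputStream.writeUTF, which is documented to write modified UTF-8, is a fair stand-in for the bytes GetStringUTFChars hands to native code. The class name Utf8Demo and its helper methods are hypothetical.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

final class Utf8Demo {
    /** Standard UTF-8: a supplementary character becomes one 4-byte sequence. */
    static byte[] standardUtf8(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    /** Modified UTF-8: each surrogate is encoded separately as 3 bytes. */
    static byte[] modifiedUtf8(String s) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buffer)) {
            out.writeUTF(s);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        byte[] withPrefix = buffer.toByteArray();
        // writeUTF prepends a 2-byte length; drop it to keep only the payload
        return Arrays.copyOfRange(withPrefix, 2, withPrefix.length);
    }

    /** Formats bytes as space-separated uppercase hex. */
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02X ", b));
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        String wave = "\uD83D\uDC4B"; // U+1F44B
        System.out.println(hex(standardUtf8(wave))); // F0 9F 91 8B
        System.out.println(hex(modifiedUtf8(wave))); // ED A0 BD ED B1 8B
    }
}
```

A native library that expects standard UTF-8 cannot interpret the six-byte form as a single character, which is consistent with the garbled Java output above. One commonly suggested workaround is to pass the result of getBytes(StandardCharsets.UTF_8) across JNI as a byte array instead of relying on GetStringUTFChars.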

frankfliu added a commit to frankfliu/djl that referenced this issue Jun 9, 2021
Change-Id: I19e77cf5a8282bea901434041806eb102549ec0f
@thinkzhou
Author

@frankfliu Thanks for the quick fix. Could you publish the new version to the Maven Central repository? I will test it.

@thinkzhou
Author

I built the snapshot version locally and it passes my test. When will the release be published to the Maven Central repository?

@frankfliu
Contributor

You can use our SNAPSHOT release for now. We expect to release 0.12.0 in mid-July.

AzizZayed added a commit to AzizZayed/djl that referenced this issue Jun 15, 2021
commit 0092f8e
Author: Aziz Zayed <azayed01@gmail.com>
Date:   Tue Jun 15 08:22:51 2021 -0700

    Fixed truncated-normal bug

commit a6ded8c
Author: Aziz Zayed <azayed01@gmail.com>
Date:   Mon Jun 14 13:33:30 2021 -0700

    [pytorch] Add BigGAN demo

commit f145614
Merge: a8a1a9b ec8405b
Author: Abd-El-Aziz Zayed <48853777+AzizZayed@users.noreply.github.com>
Date:   Fri Jun 11 20:45:34 2021 -0700

    Merge branch 'deepjavalibrary:master' into master

commit ec8405b
Author: Abd-El-Aziz Zayed <48853777+AzizZayed@users.noreply.github.com>
Date:   Fri Jun 11 14:53:59 2021 -0700

    [pytorch] Add oneHot operator (deepjavalibrary#1014)

    [tensoflow] Add truncated normal operation

commit 50600fd
Author: Frank Liu <frankfliu2000@gmail.com>
Date:   Fri Jun 11 14:53:43 2021 -0700

    upgrade dependencies version (deepjavalibrary#1012)

    Change-Id: I709938f69f21096bc5cd29a24191f0f282dcbc97

commit 3379fd2
Author: Frank Liu <frankfliu2000@gmail.com>
Date:   Fri Jun 11 14:53:29 2021 -0700

    [serving] Fix flaky test (deepjavalibrary#1013)

    Change-Id: I13b89e04516c59a3d28ecafd49f4f808630b22fb

commit 23157fd
Author: Frank Liu <frankfliu2000@gmail.com>
Date:   Thu Jun 10 16:31:03 2021 -0700

    Enable spotbugs for java 11+ (deepjavalibrary#1010)

    Change-Id: I74effbf45492a5cf50e09ba8af0223d2b1bcb5a5

commit 4f38708
Author: Frank Liu <frankfliu2000@gmail.com>
Date:   Thu Jun 10 16:30:50 2021 -0700

    Fix model zoo test typo (deepjavalibrary#1009)

    Change-Id: I7c0109c6e5fc0ece16288082fd830718f20ad489

commit a8a1a9b
Merge: 77809f4 30b03f4
Author: Aziz Zayed <azayed01@gmail.com>
Date:   Thu Jun 10 15:16:05 2021 -0700

    Merge Truncated-Normal branch

commit 77809f4
Author: Frank Liu <frankfliu2000@gmail.com>
Date:   Thu Jun 10 14:07:43 2021 -0700

    Make model zoo test weekly (deepjavalibrary#1004)

    Change-Id: I1c73df17cb077b9ce8905fcc2fc8bbb37b9688d8

commit 0aec8ca
Author: Abd-El-Aziz Zayed <48853777+AzizZayed@users.noreply.github.com>
Date:   Thu Jun 10 12:46:16 2021 -0700

    [tensoflow] Add truncated normal operation (deepjavalibrary#1005)

commit 30b03f4
Author: Aziz Zayed <azayed01@gmail.com>
Date:   Wed Jun 9 01:40:33 2021 -0700

    [tensoflow] Add truncated normal operation

commit d8e7e1d
Author: Frank Liu <frankfliu2000@gmail.com>
Date:   Wed Jun 9 07:55:15 2021 -0700

    Fixes deepjavalibrary#999, hanlde UTF16 surrogate charactors properly. (deepjavalibrary#1003)

    Change-Id: I19e77cf5a8282bea901434041806eb102549ec0f

commit b0fe73a
Author: Frank Liu <frankfliu2000@gmail.com>
Date:   Tue Jun 8 17:56:19 2021 -0700

    [pytorch] Update load model jupyter notebook (deepjavalibrary#1002)

    Change-Id: I1889aa93d2002e6ce02c740d2d1d3517bf586760

commit 8286930
Author: Frank Liu <frankfliu2000@gmail.com>
Date:   Tue Jun 8 15:29:27 2021 -0700

    [tensorflow] fix optOption usage document (deepjavalibrary#1001)

    Change-Id: Ie044839cf082d63010a5c26d3f2f8833447919c6

commit a26f5b2
Author: Abd-El-Aziz Zayed <48853777+AzizZayed@users.noreply.github.com>
Date:   Tue Jun 8 15:29:10 2021 -0700

    Updated PyTorch Docs  (deepjavalibrary#1000)

    * Added auto softmax metadata for action_recognition

    * Update PyTorch Docs

commit e6890f9
Author: Lanking <qingla@amazon.com>
Date:   Mon Jun 7 18:25:19 2021 -0700

    upgrade xgboost (deepjavalibrary#993)

commit a0dcf3a
Author: Lanking <qingla@amazon.com>
Date:   Mon Jun 7 18:25:12 2021 -0700

    bump up onnx runtime version (deepjavalibrary#992)
Lokiiiiii pushed a commit to Lokiiiiii/djl that referenced this issue Oct 10, 2023