[BUG] azure-json throws an illegal state exception when running the GetEmbeddingSample code #41159

Closed
Lucky-Vince opened this issue Jul 16, 2024 · 8 comments · Fixed by #41662
Labels: bug, customer-reported, needs-author-feedback, OpenAI, question

Comments

@Lucky-Vince

Describe the bug
I ran the following sample code locally with azure-ai-openai 1.0.0-beta.10 to test the basic functions of EmbeddingModel, and an exception was thrown. After switching to azure-ai-openai 1.0.0-beta.8, I got an illegal character error instead.

Exception or Stack Trace
azure-ai-openai:1.0.0-beta.10:
[screenshot, not preserved]
azure-ai-openai:1.0.0-beta.8:
[screenshot, not preserved]

To Reproduce
Steps to reproduce the behavior:

Code Snippet

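import com.azure.ai.openai.OpenAIClient;
import com.azure.ai.openai.models.EmbeddingItem;
import com.azure.ai.openai.models.Embeddings;
import com.azure.ai.openai.models.EmbeddingsOptions;
import com.azure.ai.openai.models.EmbeddingsUsage;

import java.util.Arrays;

public class GetEmbeddingSample {
    public static void main(String[] args) {
        // AzureService is not part of the SDK; presumably the reporter's own helper that builds the client.
        OpenAIClient client = new AzureService().buildClient();

        EmbeddingsOptions embeddingsOptions = new EmbeddingsOptions(Arrays.asList("hi"));

        // The first argument is the deployment (model) name.
        Embeddings embeddings = client.getEmbeddings("TEXT-EMBEDDING-ADA-002", embeddingsOptions);

        for (EmbeddingItem item : embeddings.getData()) {
            System.out.printf("Index: %d.%n", item.getPromptIndex());
            System.out.println("Embedding as base64 encoded string: " + item.getEmbeddingAsString());
            System.out.println("Embedding as list of floats: ");
            for (Float embedding : item.getEmbedding()) {
                System.out.printf("%f;", embedding);
            }
        }

        EmbeddingsUsage usage = embeddings.getUsage();
        System.out.printf(
                "Usage: number of prompt token is %d and number of total tokens in request and response is %d.%n",
                usage.getPromptTokens(), usage.getTotalTokens());
    }
}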

Expected behavior
A clear and concise description of what you expected to happen.

Screenshots
If applicable, add screenshots to help explain your problem.

Setup (please complete the following information):

  • OS: Windows
  • IDE: IntelliJ IDEA 2024.1
  • Library/Libraries: com.azure:azure-ai-openai:1.0.0-beta.10
  • Java version: 8
  • App Server/Environment: Tomcat
  • Frameworks: Spring Boot 2.6.6

If you suspect a dependency version mismatch (e.g. you see NoClassDefFoundError, NoSuchMethodError or similar), please check out Troubleshoot dependency version conflict article first. If it doesn't provide solution for the problem, please provide:

  • verbose dependency tree (mvn dependency:tree -Dverbose)
[INFO] +- com.azure:azure-ai-openai:jar:1.0.0-beta.10:compile
[INFO] |  +- com.azure:azure-core:jar:1.49.0:compile
[INFO] |  |  +- com.azure:azure-xml:jar:1.0.0:compile
[INFO] |  |  +- com.fasterxml.jackson.core:jackson-annotations:jar:2.13.2:compile
[INFO] |  |  +- com.fasterxml.jackson.core:jackson-core:jar:2.13.2:compile
[INFO] |  |  \- io.projectreactor:reactor-core:jar:3.4.16:compile
[INFO] |  |     \- org.reactivestreams:reactive-streams:jar:1.0.3:compile
[INFO] |  \- com.azure:azure-core-http-netty:jar:1.15.0:compile
[INFO] |     +- io.netty:netty-handler:jar:4.1.75.Final:compile
[INFO] |     |  +- io.netty:netty-resolver:jar:4.1.75.Final:compile
[INFO] |     |  \- io.netty:netty-transport:jar:4.1.75.Final:compile
[INFO] |     +- io.netty:netty-handler-proxy:jar:4.1.75.Final:compile
[INFO] |     |  \- io.netty:netty-codec-socks:jar:4.1.75.Final:compile
[INFO] |     +- io.netty:netty-buffer:jar:4.1.75.Final:compile
[INFO] |     +- io.netty:netty-codec:jar:4.1.75.Final:compile
[INFO] |     +- io.netty:netty-codec-http:jar:4.1.75.Final:compile
[INFO] |     +- io.netty:netty-codec-http2:jar:4.1.75.Final:compile
[INFO] |     +- io.netty:netty-transport-native-unix-common:jar:4.1.75.Final:compile
[INFO] |     +- io.netty:netty-transport-native-epoll:jar:linux-x86_64:4.1.75.Final:compile
[INFO] |     |  \- io.netty:netty-transport-classes-epoll:jar:4.1.75.Final:compile
[INFO] |     +- io.netty:netty-transport-native-kqueue:jar:osx-x86_64:4.1.75.Final:compile
[INFO] |     |  \- io.netty:netty-transport-classes-kqueue:jar:4.1.75.Final:compile
[INFO] |     +- io.netty:netty-tcnative-boringssl-static:jar:2.0.51.Final:compile
[INFO] |     |  +- io.netty:netty-tcnative-classes:jar:2.0.51.Final:compile
[INFO] |     |  +- io.netty:netty-tcnative-boringssl-static:jar:linux-x86_64:2.0.51.Final:compile
[INFO] |     |  +- io.netty:netty-tcnative-boringssl-static:jar:linux-aarch_64:2.0.51.Final:compile
[INFO] |     |  +- io.netty:netty-tcnative-boringssl-static:jar:osx-x86_64:2.0.51.Final:compile
[INFO] |     |  +- io.netty:netty-tcnative-boringssl-static:jar:osx-aarch_64:2.0.51.Final:compile
[INFO] |     |  \- io.netty:netty-tcnative-boringssl-static:jar:windows-x86_64:2.0.51.Final:compile
[INFO] |     +- io.projectreactor.netty:reactor-netty-http:jar:1.0.17:compile
[INFO] |     |  +- io.netty:netty-resolver-dns:jar:4.1.75.Final:compile
[INFO] |     |  |  \- io.netty:netty-codec-dns:jar:4.1.75.Final:compile
[INFO] |     |  +- io.netty:netty-resolver-dns-native-macos:jar:osx-x86_64:4.1.75.Final:compile
[INFO] |     |  |  \- io.netty:netty-resolver-dns-classes-macos:jar:4.1.75.Final:compile
[INFO] |     |  \- io.projectreactor.netty:reactor-netty-core:jar:1.0.17:compile
[INFO] |     \- io.netty:netty-common:jar:4.1.75.Final:compile
[INFO] +- com.azure:azure-identity:jar:1.12.1:compile
[INFO] |  +- com.azure:azure-json:jar:1.1.0:compile
[INFO] |  +- com.microsoft.azure:msal4j:jar:1.15.0:compile
[INFO] |  |  \- com.nimbusds:oauth2-oidc-sdk:jar:11.9.1:compile
[INFO] |  |     +- com.github.stephenc.jcip:jcip-annotations:jar:1.0-1:compile
[INFO] |  |     +- com.nimbusds:content-type:jar:2.3:compile
[INFO] |  |     +- com.nimbusds:lang-tag:jar:1.7:compile
[INFO] |  |     \- com.nimbusds:nimbus-jose-jwt:jar:9.37.3:compile
[INFO] |  +- com.microsoft.azure:msal4j-persistence-extension:jar:1.3.0:compile
[INFO] |  |  \- net.java.dev.jna:jna:jar:5.13.0:compile
[INFO] |  \- net.java.dev.jna:jna-platform:jar:5.6.0:compile
[INFO] +- com.azure:azure-search-documents:jar:11.6.5:compile
[INFO] |  \- com.azure:azure-core-serializer-json-jackson:jar:1.4.12:compile
[INFO] +- com.azure:azure-storage-blob:jar:12.26.0:compile
[INFO] |  +- com.azure:azure-storage-common:jar:12.25.0:compile
[INFO] |  +- com.azure:azure-storage-internal-avro:jar:12.11.0:compile
[INFO] |  \- com.fasterxml.jackson.dataformat:jackson-dataformat-xml:jar:2.13.2:compile
[INFO] |     +- org.codehaus.woodstox:stax2-api:jar:4.2.1:compile
[INFO] |     \- com.fasterxml.woodstox:woodstox-core:jar:6.2.7:compile
  • exception message, full stack trace, and any available logs
Exception in thread "main" java.io.UncheckedIOException: java.io.IOException: java.lang.IllegalStateException: Unexpected token to begin object deserialization: END_DOCUMENT
	at com.azure.core.serializer.json.jackson.JacksonJsonSerializer.deserializeFromBytes(JacksonJsonSerializer.java:50)
	at com.azure.core.implementation.util.FluxByteBufferContent.toObject(FluxByteBufferContent.java:99)
	at com.azure.core.util.BinaryData.toObject(BinaryData.java:1116)
	at com.azure.core.util.BinaryData.toObject(BinaryData.java:927)
	at com.azure.ai.openai.OpenAIClient.getEmbeddings(OpenAIClient.java:569)
	at com.mpt.service.azurenativesdk.GetEmbeddingSample.main(GetEmbeddingSample.java:24)
Caused by: java.io.IOException: java.lang.IllegalStateException: Unexpected token to begin object deserialization: END_DOCUMENT
	at com.azure.core.serializer.json.jackson.implementation.JsonSerializableDeserializer.deserialize(JsonSerializableDeserializer.java:50)
	at com.azure.core.serializer.json.jackson.implementation.JsonSerializableDeserializer.deserialize(JsonSerializableDeserializer.java:20)
	at com.fasterxml.jackson.databind.deser.DefaultDeserializationContext.readRootValue(DefaultDeserializationContext.java:322)
	at com.fasterxml.jackson.databind.ObjectMapper._readMapAndClose(ObjectMapper.java:4674)
	at com.fasterxml.jackson.databind.ObjectMapper.readValue(ObjectMapper.java:3723)
	at com.azure.core.serializer.json.jackson.implementation.ObjectMapperShim.readValue(ObjectMapperShim.java:213)
	at com.azure.core.serializer.json.jackson.JacksonJsonSerializer.deserializeFromBytes(JacksonJsonSerializer.java:48)
	... 5 more
Caused by: java.lang.IllegalStateException: Unexpected token to begin object deserialization: END_DOCUMENT
	at com.azure.json.JsonReader.readMapOrObject(JsonReader.java:548)
	at com.azure.json.JsonReader.readObject(JsonReader.java:455)
	at com.azure.ai.openai.models.EmbeddingItem.fromJson(EmbeddingItem.java:80)
	at com.azure.ai.openai.models.Embeddings.lambda$fromJson$1(Embeddings.java:97)
	at com.azure.json.JsonReader.readArray(JsonReader.java:494)
	at com.azure.ai.openai.models.Embeddings.lambda$fromJson$2(Embeddings.java:97)
	at com.azure.json.JsonReader.readMapOrObject(JsonReader.java:551)
	at com.azure.json.JsonReader.readObject(JsonReader.java:455)
	at com.azure.ai.openai.models.Embeddings.fromJson(Embeddings.java:90)
	at java.lang.invoke.MethodHandle.invokeWithArguments(MethodHandle.java:627)
	at com.azure.core.implementation.MethodHandleReflectiveInvoker.invokeWithArguments(MethodHandleReflectiveInvoker.java:39)
	at com.azure.core.serializer.json.jackson.implementation.JsonSerializableDeserializer.deserialize(JsonSerializableDeserializer.java:45)
	... 11 more
@github-actions github-actions bot added customer-reported Issues that are reported by GitHub users external to the Azure organization. needs-triage Workflow: This is a new issue that needs to be triaged to the appropriate team. question The issue doesn't require a change to the product in order to be resolved. Most issues start as that labels Jul 16, 2024
@Lucky-Vince
Author

This is the json data returned by calling the embedding model:

{
    "object": "list",
    "model": "text-embedding-ada-002",
    "data": [
        {
            "index": 0,
            "object": "embedding",
            "embedding": Array[1536]
        }
    ],
    "usage": {
        "prompt_tokens": 1,
        "total_tokens": 1
    }
}

@joshfree joshfree assigned mssfang and alzimmermsft and unassigned mssfang Jul 16, 2024
@joshfree joshfree added the Azure.Core azure-core label Jul 16, 2024
@joshfree joshfree changed the title [BUG]The deserialization method throws an illegal state exception when running the GetEmbeddingSample code [BUG] azure-json throws an illegal state exception when running the GetEmbeddingSample code Jul 16, 2024
@joshfree joshfree added the bug This issue requires a change to an existing behavior in the product in order to be resolved. label Jul 16, 2024
@github-actions github-actions bot removed the needs-triage Workflow: This is a new issue that needs to be triaged to the appropriate team. label Jul 16, 2024
@joshfree
Member

Thanks for filing this GitHub issue, @Makato-Sino! @alzimmermsft, can you please take a look?

/cc @mssfang as fyi

@alzimmermsft
Member

Thanks for reporting this @Makato-Sino.

After a quick investigation, this appears to have been broken both before and after the change to azure-json. The logic used to deserialize and serialize EmbeddingItem treats the embedding JSON property as a String, while the actual definition is a float[].

@mssfang, does the service return this as a float[], as a Base64-encoded string, or as either? If either is possible, deserialization will need further customization to support both cases based on the JSON shape it encounters. Serialization is another story, because the format has to be decided when the EmbeddingItem is created. Right now the only constructor expects Base64 encoding, so if both are possible we'll need to add another constructor that takes the float[]-based format.
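
For illustration only, here is a minimal sketch of what "supporting both cases based on the JSON shape" could look like. It uses only the JDK rather than azure-json, and the class and method names (EmbeddingShapeSketch, parseEmbedding) are hypothetical, not part of the SDK. It also assumes the Base64 form packs little-endian 32-bit floats, which this thread does not confirm.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.util.ArrayList;
import java.util.Base64;
import java.util.List;

/** Hypothetical helper, not part of azure-ai-openai or azure-json. */
final class EmbeddingShapeSketch {

    /** Parses the raw JSON text of an "embedding" value in either shape. */
    static List<Float> parseEmbedding(String rawJsonValue) {
        String value = rawJsonValue.trim();
        if (value.length() >= 2 && value.startsWith("\"") && value.endsWith("\"")) {
            // Shape 1: a Base64 string.
            return decodeBase64(value.substring(1, value.length() - 1));
        }
        if (value.startsWith("[") && value.endsWith("]")) {
            // Shape 2: a plain JSON array of numbers.
            return parseFloatArray(value.substring(1, value.length() - 1));
        }
        // Fail on the offending value instead of leaving a parser in a bad state.
        throw new IllegalStateException("Unexpected shape for 'embedding': " + value);
    }

    private static List<Float> decodeBase64(String base64) {
        // Assumption: the bytes are packed little-endian float32 values.
        ByteBuffer buffer = ByteBuffer.wrap(Base64.getDecoder().decode(base64))
                .order(ByteOrder.LITTLE_ENDIAN);
        List<Float> floats = new ArrayList<>(buffer.remaining() / Float.BYTES);
        while (buffer.remaining() >= Float.BYTES) {
            floats.add(buffer.getFloat());
        }
        return floats;
    }

    private static List<Float> parseFloatArray(String body) {
        List<Float> floats = new ArrayList<>();
        if (body.trim().isEmpty()) {
            return floats;
        }
        for (String element : body.split(",")) {
            floats.add(Float.parseFloat(element.trim()));
        }
        return floats;
    }

    public static void main(String[] args) {
        System.out.println(parseEmbedding("[0.5, -1.25]"));     // prints [0.5, -1.25]
        System.out.println(parseEmbedding("\"AAAAPwAAoL8=\"")); // prints [0.5, -1.25]
    }
}
```

A real fix would branch on the token type reported by the JSON reader rather than on the first character of the text, but the decision being made is the same.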

@alzimmermsft
Member

Upon further investigation, the reason the exception includes END_DOCUMENT is that the attempt to read the array as a string gets the parser into a bad state.

@alzimmermsft alzimmermsft assigned mssfang and unassigned alzimmermsft Jul 16, 2024
@alzimmermsft alzimmermsft added OpenAI and removed Azure.Core azure-core labels Jul 16, 2024
@github-actions github-actions bot added the needs-team-attention Workflow: This issue needs attention from Azure service team or SDK team label Jul 16, 2024
@mkemmerz

I'm running into the same issue with beta.10 and can reproduce it easily. I wanted to create an Embeddings object for one of my test cases:

JsonReader reader = JsonProviders.createReader("""
        {
        "data" : [ {
          "promptIndex" : 0,
          "embedding" : [ ... ],
          "embeddingAsString" : "..."
        } ],
        "usage" : {
          "promptTokens" : 10,
          "totalTokens" : 10
        }
      }""");

Embeddings.fromJson(reader);

leads to:
image

@mssfang
Member

mssfang commented Aug 27, 2024

In the implementation, addEncodingFormat hardcodes encoding_format to base64. The service will therefore always return the embedding as a Base64 string, which we then convert internally to List<Float>. The JSON response contains only index and embedding (as a Base64-encoded String).

List<Float> doesn't exist until getEmbedding() is invoked.
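
As a rough sketch of that lazy conversion (not the SDK's actual code): the class below is a hypothetical stand-in for EmbeddingItem that keeps only the Base64 string and builds the list of floats on first access. The little-endian float32 layout is an assumption, and the caching shown here is not thread-safe.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.util.ArrayList;
import java.util.Base64;
import java.util.Collections;
import java.util.List;

/** Hypothetical stand-in for an embedding item; not the SDK's EmbeddingItem. */
final class LazyEmbeddingSketch {
    private final String embeddingBase64;
    private List<Float> decoded; // stays null until getEmbedding() is first called

    LazyEmbeddingSketch(String embeddingBase64) {
        this.embeddingBase64 = embeddingBase64;
    }

    String getEmbeddingAsString() {
        return embeddingBase64;
    }

    boolean isDecoded() {
        return decoded != null;
    }

    List<Float> getEmbedding() {
        if (decoded == null) {
            // Assumption: the payload is a sequence of little-endian float32 values.
            ByteBuffer buffer = ByteBuffer.wrap(Base64.getDecoder().decode(embeddingBase64))
                    .order(ByteOrder.LITTLE_ENDIAN);
            List<Float> floats = new ArrayList<>(buffer.remaining() / Float.BYTES);
            while (buffer.remaining() >= Float.BYTES) {
                floats.add(buffer.getFloat());
            }
            decoded = Collections.unmodifiableList(floats);
        }
        return decoded;
    }

    public static void main(String[] args) {
        LazyEmbeddingSketch item = new LazyEmbeddingSketch("AAAAPwAAoL8=");
        System.out.println(item.isDecoded());    // prints false
        System.out.println(item.getEmbedding()); // prints [0.5, -1.25]
        System.out.println(item.isDecoded());    // prints true
    }
}
```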

We do this because it's more performant for the API to respond with Base64. For example, Base64 is text-based and transfers easily over text protocols, whereas floats require careful handling (binary data has to be converted to a text format) to ensure compatibility with the HTTP protocol.
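
To put a rough number on the transfer-size side of this, the sketch below compares the character count of one 1536-dimension vector sent as Base64 versus as a JSON array of decimal numbers. The values are seeded random floats standing in for a real embedding, and all names are hypothetical, so treat the JSON figure as indicative only.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.util.Base64;
import java.util.Random;

/** Hypothetical size comparison; the vector is random, not a real embedding. */
final class EmbeddingPayloadSizeSketch {

    /** Packs the floats as little-endian float32 (an assumed layout) and Base64-encodes them. */
    static String asBase64(float[] values) {
        ByteBuffer buffer = ByteBuffer.allocate(values.length * Float.BYTES)
                .order(ByteOrder.LITTLE_ENDIAN);
        for (float value : values) {
            buffer.putFloat(value);
        }
        return Base64.getEncoder().encodeToString(buffer.array());
    }

    /** Renders the floats as a compact JSON array of decimal numbers. */
    static String asJsonArray(float[] values) {
        StringBuilder json = new StringBuilder("[");
        for (int i = 0; i < values.length; i++) {
            if (i > 0) {
                json.append(',');
            }
            json.append(values[i]);
        }
        return json.append(']').toString();
    }

    /** Builds a reproducible vector with values in [-1, 1). */
    static float[] randomVector(int dimensions, long seed) {
        Random random = new Random(seed);
        float[] values = new float[dimensions];
        for (int i = 0; i < dimensions; i++) {
            values[i] = random.nextFloat() * 2 - 1;
        }
        return values;
    }

    public static void main(String[] args) {
        float[] vector = randomVector(1536, 42L);
        System.out.println("Base64 characters:     " + asBase64(vector).length());
        System.out.println("JSON array characters: " + asJsonArray(vector).length());
    }
}
```

For 1536 floats the Base64 form is always 8192 characters (6144 bytes at 4 characters per 3 bytes), while the length of the JSON array depends on how many digits each value needs.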

@mssfang mssfang added the needs-author-feedback Workflow: More information is needed from author to address the issue. label Aug 27, 2024
@github-actions github-actions bot removed the needs-team-attention Workflow: This issue needs attention from Azure service team or SDK team label Aug 27, 2024

Hi @Lucky-Vince. Thank you for opening this issue and giving us the opportunity to assist. To help our team better understand your issue and the details of your scenario please provide a response to the question asked above or the information requested above. This will help us more accurately address your issue.

@mssfang
Member

mssfang commented Aug 27, 2024

As this SDK will never get a float[] from the service side, the embedding is always expected to be a Base64-encoded String. But since we provide List<Float> getEmbedding() as a convenience method, we could add support for deserializing/serializing float[] in EmbeddingItem's toJson and fromJson methods as a nice-to-have feature.
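
One possible shape for that nice-to-have, sketched with the JDK only: a second constructor accepts float[] and normalizes it to the Base64 form, so serialization keeps emitting a single wire format. DualFormatEmbeddingSketch and its members are hypothetical names, the little-endian float32 packing is an assumption, and the property names index and embedding are taken from the response shown earlier in this thread.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.util.Base64;

/** Hypothetical item type with two constructors; not the SDK's EmbeddingItem. */
final class DualFormatEmbeddingSketch {
    private final int index;
    private final String embeddingBase64;

    /** Existing style: the caller already has the Base64 string. */
    DualFormatEmbeddingSketch(int index, String embeddingBase64) {
        this.index = index;
        this.embeddingBase64 = embeddingBase64;
    }

    /** Added style: the caller has raw floats, which are normalized to Base64. */
    DualFormatEmbeddingSketch(int index, float[] embedding) {
        this(index, encode(embedding));
    }

    private static String encode(float[] values) {
        // Assumption: little-endian float32 packing.
        ByteBuffer buffer = ByteBuffer.allocate(values.length * Float.BYTES)
                .order(ByteOrder.LITTLE_ENDIAN);
        for (float value : values) {
            buffer.putFloat(value);
        }
        return Base64.getEncoder().encodeToString(buffer.array());
    }

    /** Writes the wire shape; Base64 characters need no JSON escaping. */
    String toJson() {
        return "{\"index\":" + index + ",\"embedding\":\"" + embeddingBase64 + "\"}";
    }

    public static void main(String[] args) {
        DualFormatEmbeddingSketch item =
                new DualFormatEmbeddingSketch(0, new float[] {0.5f, -1.25f});
        System.out.println(item.toJson()); // prints {"index":0,"embedding":"AAAAPwAAoL8="}
    }
}
```

A test fixture built this way would deserialize with the current Base64-only logic, unlike a fixture whose embedding is a JSON array.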

Projects
Status: Done
5 participants