[JS] Upgrade to 0.9.3 is causing significant increase in hallucinations #1360

tekkeon · 2024-11-21T10:01:24Z

Describe the bug
After upgrading to 0.9.3, we're noticing that the output Gemini (we're specifically using Gemini Pro 1.5 002) has numerous problems in our system. We're getting hallucinated text and overall very strange outputs that make our application unusable. When we downgrade back to 0.9.1, everything works properly again.

We looked at the rendered prompts and noticed the only difference was that format was now specified as json.

Snippit from 0.9.1:

"output": {
    "jsonSchema": {
      "type": "object",
      "properties": {
...

Snippet from 0.9.3

"output": {
    "format": "json",
    "jsonSchema": {
      "type": "object",
      "properties": {
...

We suspect this change to default to JSON mode may be causing the issue.

To Reproduce
Upgrade from 0.9.1 to 0.9.3 and run some prompts with complex outputs.

Expected behavior
We expected AI output to remain consistent to what our system had been producing.

Runtime (please complete the following information):

OS: MacOS 14.6.1

Node version

v18.17.1

The text was updated successfully, but these errors were encountered:

pavelgj · 2024-11-21T14:11:04Z

Thanks for the report. I'm investigating.

pavelgj · 2024-11-21T15:24:59Z

There's a diff of model request between 0.5 and 0.9

the diff in instructions is negligible, however the new (0.9) json format does specify contentType and constrained: true option, which would make a difference in generation. In theory the difference should be for the better, but not necessarily.

pavelgj · 2024-11-21T15:28:55Z

cc @mbleigh

pavelgj · 2024-11-21T15:32:49Z

I have a PR: #1365
it would allow you to define custom formats and override the default behaviour changes in the json format. Ex:

ai.defineFormat(
  {
    name: 'myJson',
    format: 'json',
  },
  (schema) => {
    let instructions: string | undefined;

    if (schema) {
      instructions = `Output should be in JSON format and conform to the following schema:

\`\`\`
${JSON.stringify(schema)}
\`\`\`
`;
    }

    return {
      parseChunk: (chunk) => {
        return extractJson(chunk.accumulatedText);
      },

      parseMessage: (message) => {
        return extractJson(message.text);
      },

      instructions,
    };
  }
);

const MenuItemSchema = z.object({
  name: z.string(),
  description: z.string(),
  calories: z.number(),
  allergens: z.array(z.string()),
});

export const menuSuggestionFlow = ai.defineFlow(
  {
    name: 'menuSuggestionFlow',
    outputSchema: MenuItemSchema.nullable(),
  },
  async () => {
    const response = await ai.generate({
      prompt: 'Invent a menu item for a pirate themed restaurant.',
      output: { format: 'myJson', schema: MenuItemSchema },
    });

    return response.output;
  }
);

The diff between 0.5 and this json formatter is negligible:

i2amsam · 2024-11-21T15:39:35Z

this is an interesting path, I think the reported difference was between 0.9.1 and 0.9.3, were there any differences there? Anecdotally I saw the same thing, after successfully conforming to the output for a long time I saw non-conforming generations in my testing yesterday

pavelgj · 2024-11-21T15:42:40Z

this is an interesting path, I think the reported difference was between 0.9.1 and 0.9.3, were there any differences there? Anecdotally I saw the same thing, after successfully conforming to the output for a long time I saw non-conforming generations in my testing yesterday

d'oh... I should pay better attention... let me diff 0.9.1->0.9.3 real quick....

pavelgj · 2024-11-21T15:50:57Z

Interesting, the diff is only in format:

however, gemini json mode condition is format==='json' OR contentType==='application/json', so it should not have made any difference between 0.9.1 and 0.9.3

genkit/js/plugins/googleai/src/gemini.ts

Line 605 in 39c86af

const jsonMode =

pavelgj · 2024-11-21T16:00:56Z

yeah, verified that there's no diff between 0.9.1->0.9.3 at the gemini model call level....

i2amsam · 2024-11-21T17:06:19Z

I'm not sure that it's the same issue, but I isolated down the changes I noticed to a repeatable case in #1368 . Filed as a separate issue so as not to derail this thread.

@tekkeon are you able to provide any more context for your snippets? I think an example with

config: {
temperature: 0
}

on the generation showing hallucination / strange output would be very helpful.

If you're using the Developer UI you can export a trace from a bad generation under the trace details tab which would also help us debug if you have a sharable trace.

tekkeon · 2024-11-21T19:47:07Z

Thanks for the quick replies here.

@pavelgj @i2amsam I've taken a screenshot of the diff of the redacted rendered prompts. I'll take a look at the traces as well, though I suspect there will at least be some IP in there I wouldn't want shared publicly. I can share it with you privately though if it will help.

tekkeon · 2024-11-21T19:51:08Z

Interesting, the diff is only in format:

however, gemini json mode condition is format==='json' OR contentType==='application/json', so it should not have made any difference between 0.9.1 and 0.9.3

genkit/js/plugins/googleai/src/gemini.ts

Line 605 in 39c86af

const jsonMode =

@pavelgj Not sure if it's relevant, but the lines you referred to were changed 2 weeks ago adding in the || contentType === 'application/json' logic: d3c0dbe#diff-321672f29eb53cc56ab2871e37a98e016269ab4a0efa7445cbc3af677ff13938R586-R587

i2amsam · 2024-11-22T02:33:19Z

Hmmmm, I'm a little out of my wheelhouse here, but @pavelgj wouldn't we expect to have a constrained:true in tekkon's example here? It's in yours, and in the working 0.9.3 example I have. @tekkeon is this a ai.generate call or a prompt() call. Could you give us the outline of how you're constructing the generate call?

i2amsam · 2024-11-22T02:34:17Z

Oh, and @tekkeon are you using the Vertex version or the Gemini API version?

tekkeon · 2024-11-22T06:25:29Z

Oh great question - we're using the Vertex version.

mbleigh · 2024-11-22T17:22:31Z

A change in 0.9 is that Gemini models now use constrained generation by default (that's the constrained: true) you're seeing. It's possible that constrained generation is affecting output in a way that increases hallucination.

Can you try adding {output: {constrained: false}} to your generate call? This should disable constrained generation and may revert the behavior back to the previous.

i2amsam · 2024-11-22T18:28:12Z

@mbleigh I think tekkon's config doesn't have the constrained key in the output block at all #1360 (comment) could that be the bug?

mbleigh · 2024-11-22T20:27:46Z

Hmm, no that shouldn't be the problem. I'm a little stumped as to what might be causing the regression...if there's any way to get a minimal reproduction that exhibits the behavior it would help a lot.

tekkeon · 2024-11-22T21:22:44Z

I'm gonna try a couple of these ideas and see if that gets us any clues. Specifically:

Reverting the || contentType === 'application/json' mentioned above
Adding {output: {constrained: false}} to the generate call

If these don't reveal any new insights, I'll see if I can come up with a minimal reproduction. It's obviously tough to do as I'm guessing hallucination is a lot less likely with less complex prompts, so I'd have to come up with a complex enough prompt to start to trigger issues.

Alternatively, @mbleigh is there a secure channel where I can send you the specific example prompt and results we're seeing between the two versions? It won't contain real data from production systems, just fake data but through our actual systems and prompts.

pavelgj · 2024-11-22T21:37:46Z

I have a PR: #1365 it would allow you to define custom formats and override the default behaviour changes in the json format. Ex:

ai.defineFormat(
  {
    name: 'myJson',
    format: 'json',
  },
  (schema) => {
    let instructions: string | undefined;

    if (schema) {
      instructions = `Output should be in JSON format and conform to the following schema:

\`\`\`
${JSON.stringify(schema)}
\`\`\`
`;
    }

    return {
      parseChunk: (chunk) => {
        return extractJson(chunk.accumulatedText);
      },

      parseMessage: (message) => {
        return extractJson(message.text);
      },

      instructions,
    };
  }
);

const MenuItemSchema = z.object({
  name: z.string(),
  description: z.string(),
  calories: z.number(),
  allergens: z.array(z.string()),
});

export const menuSuggestionFlow = ai.defineFlow(
  {
    name: 'menuSuggestionFlow',
    outputSchema: MenuItemSchema.nullable(),
  },
  async () => {
    const response = await ai.generate({
      prompt: 'Invent a menu item for a pirate themed restaurant.',
      output: { format: 'myJson', schema: MenuItemSchema },
    });

    return response.output;
  }
);

this feature is now released in 0.9.4, you should be able to define a custom JSON formatter to control and evaluate various flag combinations:

ai.defineFormat(
  { name: 'customJson' },
  (schema) => {
    let instructions: string | undefined;
    if (schema) {
      instructions = `Output should be in JSON format and conform to the following schema:
\`\`\`
${JSON.stringify(schema)}
\`\`\`
`;
    }
    return {
      parseChunk: (chunk) => extractJson(chunk.accumulatedText),
      parseMessage: (message) => extractJson(message.text),
      instructions,
    };
  }
);

const { output } = await ai.generate({
  prompt: 'Invent a menu item for a pirate themed restaurant.',
  output: { format: 'customJson', schema: MenuItemSchema },
});

mbleigh · 2024-11-22T21:40:01Z

@tekkeon you can email me bleigh (at) google.com with the specific repro if needed 🙂

tekkeon · 2024-11-23T03:56:09Z

Okay so I have some interesting results here. After upgrading to 0.9.3, if I go into the built version in node_modules/@genkit-ai/vertexai/lib/gemini.js and I revert the two changes from this commit (adding back JSON.parse and removing cleanSchema), things start working properly.

@mbleigh I'm going to send you the inputs and outputs I get from the above test before and after making the edits to gemini.js. Hopefully it can help!

pavelgj · 2025-01-26T16:32:01Z

I'm hoping #1612 will imrove this.

dongyangli1226 · 2025-02-24T03:23:15Z

I'm hoping #1612 will imrove this.

Posted some comments and questions in your PR. Thanks for your support. @pavelgj

tekkeon added bug Something isn't working js labels Nov 21, 2024

github-project-automation bot added this to Genkit Backlog Nov 21, 2024

pavelgj self-assigned this Nov 21, 2024

i2amsam mentioned this issue Nov 21, 2024

[JS] Difference between 0.9.0 and 0.9.3 in Output Instruction Following for Gemini #1368

Closed

pavelgj added this to the js-1.0.0 milestone Dec 2, 2024

pavelgj mentioned this issue Jan 26, 2025

refactor(js/ai): refactored constrained generation into middleware, simplified json format #1612

Merged

2 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[JS] Upgrade to 0.9.3 is causing significant increase in hallucinations #1360

[JS] Upgrade to 0.9.3 is causing significant increase in hallucinations #1360

tekkeon commented Nov 21, 2024 •

edited

Loading

pavelgj commented Nov 21, 2024

pavelgj commented Nov 21, 2024

pavelgj commented Nov 21, 2024

pavelgj commented Nov 21, 2024

i2amsam commented Nov 21, 2024

pavelgj commented Nov 21, 2024

pavelgj commented Nov 21, 2024

pavelgj commented Nov 21, 2024 •

edited

Loading

i2amsam commented Nov 21, 2024

tekkeon commented Nov 21, 2024

tekkeon commented Nov 21, 2024 •

edited

Loading

i2amsam commented Nov 22, 2024

i2amsam commented Nov 22, 2024 •

edited

Loading

tekkeon commented Nov 22, 2024

mbleigh commented Nov 22, 2024

i2amsam commented Nov 22, 2024

mbleigh commented Nov 22, 2024

tekkeon commented Nov 22, 2024

pavelgj commented Nov 22, 2024 •

edited

Loading

mbleigh commented Nov 22, 2024

tekkeon commented Nov 23, 2024

pavelgj commented Jan 26, 2025

dongyangli1226 commented Feb 24, 2025

[JS] Upgrade to 0.9.3 is causing significant increase in hallucinations #1360

[JS] Upgrade to 0.9.3 is causing significant increase in hallucinations #1360

Comments

tekkeon commented Nov 21, 2024 • edited Loading

pavelgj commented Nov 21, 2024

pavelgj commented Nov 21, 2024

pavelgj commented Nov 21, 2024

pavelgj commented Nov 21, 2024

i2amsam commented Nov 21, 2024

pavelgj commented Nov 21, 2024

pavelgj commented Nov 21, 2024

pavelgj commented Nov 21, 2024 • edited Loading

i2amsam commented Nov 21, 2024

tekkeon commented Nov 21, 2024

tekkeon commented Nov 21, 2024 • edited Loading

i2amsam commented Nov 22, 2024

i2amsam commented Nov 22, 2024 • edited Loading

tekkeon commented Nov 22, 2024

mbleigh commented Nov 22, 2024

i2amsam commented Nov 22, 2024

mbleigh commented Nov 22, 2024

tekkeon commented Nov 22, 2024

pavelgj commented Nov 22, 2024 • edited Loading

mbleigh commented Nov 22, 2024

tekkeon commented Nov 23, 2024

pavelgj commented Jan 26, 2025

dongyangli1226 commented Feb 24, 2025

tekkeon commented Nov 21, 2024 •

edited

Loading

pavelgj commented Nov 21, 2024 •

edited

Loading

tekkeon commented Nov 21, 2024 •

edited

Loading

i2amsam commented Nov 22, 2024 •

edited

Loading

pavelgj commented Nov 22, 2024 •

edited

Loading