Adding types for evals in Langfuse #4380
-
I'm tagging @marcklingen as I know you've probably got an opinion on this.
-
Has any consideration been given to instruction-following frameworks such as IFEval or Multi-IF, which focus on evaluating an LLM's ability to follow "verifiable instructions", i.e. instructions whose compliance can be checked objectively? Examples of such instructions are "the response should be in three paragraphs" or "the response should be more than 300 words". IFEval dataset: https://huggingface.co/datasets/google/IFEval Multi-IF dataset: https://huggingface.co/datasets/facebook/Multi-IF
-
Describe the feature or potential improvement
Hi everyone,
I’ve been thinking about how we manage evaluations (evals) in Langfuse, especially when it comes to maintaining compatibility across structural changes in metadata fields. Currently, whenever we make a structural update to a type, we often need to manually update every eval that depends on that field. This can be time-consuming and error-prone, especially as the number of evals grows.
Proposal: Mapping Layer with Typed Objects for Evals
Instead of directly binding evals to raw metadata fields, I propose introducing a mapping layer with types. Here’s the idea in detail:
Introduce a Typed Object Layer:
Create strongly-typed objects that represent the metadata structure we expect to work with in evals. These objects act as a translation layer, parsing the metadata and exposing only the fields that are relevant for evals.
Dynamic Parsing and Mapping:
Each type would define how to parse the raw metadata into its structured form. This ensures that even if we change the underlying structure of a metadata field, we only need to update the type definition, not every eval that uses it.
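To make the idea concrete, here is a minimal sketch of what such a typed layer could look like. All names here (`UserInfo`, `parseUserInfo`, the `user_info` field shape) are hypothetical and just for illustration; they are not part of Langfuse's API.

```typescript
// Hypothetical typed view over raw metadata; not part of Langfuse's API.
interface UserInfo {
  userId: string;
  email: string;
}

type RawMetadata = Record<string, unknown>;

// Each type owns its parsing logic. If the raw shape of user_info changes,
// only this function needs updating, not every eval that consumes UserInfo.
function parseUserInfo(metadata: RawMetadata): UserInfo {
  const raw = metadata["user_info"] as Record<string, unknown> | undefined;
  if (!raw || typeof raw["user_id"] !== "string" || typeof raw["email"] !== "string") {
    throw new Error("metadata.user_info is missing or malformed");
  }
  return { userId: raw["user_id"], email: raw["email"] };
}
```

A schema-validation library could replace the hand-written checks, but the principle is the same: one place that knows the raw shape.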
Eval Integration:
Evals can then bind to these typed objects rather than directly to raw metadata fields. This adds a layer of abstraction, decoupling eval logic from raw metadata structure.
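A sketch of the eval side of this contract, again with hypothetical names: the eval's signature mentions only the typed object, so it is structurally unable to depend on the raw metadata layout.

```typescript
// Hypothetical typed object the eval binds to (illustrative only).
interface UserInfo {
  userId: string;
  email: string;
}

// The eval consumes UserInfo, never raw metadata, so restructuring the
// underlying metadata field cannot break it.
function emailDomainEval(info: UserInfo): boolean {
  return info.email.endsWith("@example.com");
}
```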
Example Use Case:
Let’s say we have a metadata field user_info that currently includes user_id and email. If we later restructure this field to add profile_id or group related attributes under a nested object, we’d only need to update the parsing logic in the corresponding type. The evals using this type would automatically adapt to the changes.
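The migration described above could be sketched like this (hypothetical shapes and names): after `user_id` and `email` move under a nested `profile` object, only the parser changes, and it can even accept both shapes during a transition period.

```typescript
interface UserInfo {
  userId: string;
  email: string;
}

// Hypothetical: user_info was restructured so its fields now live under a
// nested "profile" object. Only the parser is updated; evals consuming
// UserInfo keep working unchanged.
function parseUserInfo(metadata: Record<string, unknown>): UserInfo {
  const raw = metadata["user_info"] as Record<string, any>;
  // Accept both the old flat shape and the new nested shape.
  const src = raw["profile"] ?? raw;
  return { userId: src["user_id"], email: src["email"] };
}
```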
Benefits
Decoupling and Maintainability:
Evals remain insulated from metadata structure changes, reducing the risk of breaking functionality.
Reusability:
Typed objects can be reused across multiple evals, encouraging consistency and reducing duplication.
Simplified Updates:
Updates to metadata structures become localized to the type definitions.
Challenges and Considerations
Implementation Complexity:
Introducing a mapping layer adds some upfront development effort, but I believe the long-term maintainability gains outweigh this cost.
Performance Impact:
Parsing metadata dynamically might have a slight performance overhead, but this can be mitigated with efficient parsing mechanisms.
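One such mitigation, sketched with the same hypothetical names as above: cache the typed result keyed on the raw metadata object's identity, so each trace's metadata is parsed at most once no matter how many evals read it.

```typescript
interface UserInfo {
  userId: string;
  email: string;
}

// Hypothetical parser from the earlier sketch.
function parseUserInfo(metadata: Record<string, unknown>): UserInfo {
  const raw = metadata["user_info"] as Record<string, any>;
  return { userId: raw["user_id"], email: raw["email"] };
}

// WeakMap keeps cached results from leaking: entries disappear once the
// raw metadata object is garbage-collected.
const parsedCache = new WeakMap<object, UserInfo>();

function cachedParseUserInfo(metadata: Record<string, unknown>): UserInfo {
  const hit = parsedCache.get(metadata);
  if (hit) return hit;
  const parsed = parseUserInfo(metadata);
  parsedCache.set(metadata, parsed);
  return parsed;
}
```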
Adoption and Transition:
Existing evals would need to be refactored to use the new typed layer. This could be phased in to minimize disruption.
Next Steps
Gather feedback on the idea from the community.
Prototype a simple version of the typed object layer to evaluate feasibility.
Identify common metadata patterns that would benefit from this abstraction.
I’d love to hear your thoughts on this approach. Do you see this improving our eval system’s maintainability and scalability? Are there potential challenges I might have overlooked?
Looking forward to the discussion!
Additional information
No response