LangSmith Test Runner (#4011)
* Add run_on_dataset

* Add stacktrace to tracer errors

* Add test runner

* bump

* Fixup

* Add example walkthrough

* rename

* Add doc

* Use types instead of data classes, other updates

* Lint

* Describe.each()

* Add chat

---------

Co-authored-by: jacoblee93 <jacoblee93@gmail.com>
hinthornw and jacoblee93 authored Jan 24, 2024
1 parent cb968cb commit 3dfb2eb
Showing 23 changed files with 2,534 additions and 3 deletions.
408 changes: 408 additions & 0 deletions docs/api_refs/typedoc.json

Large diffs are not rendered by default.

239 changes: 239 additions & 0 deletions docs/core_docs/docs/guides/langsmith_evaluation.mdx
@@ -0,0 +1,239 @@
# LangSmith Walkthrough

LangChain makes it easy to prototype LLM applications and agents. However, delivering an LLM application to production can be deceptively difficult: you will need to iterate on your prompts, chains, and other components to build a high-quality product.

LangSmith makes it easy to debug, test, and continuously improve your LLM applications.

When might this come in handy? LangSmith is useful when you want to:

- Quickly debug a new chain, agent, or set of tools
- Create and manage datasets for fine-tuning, few-shot prompting, and evaluation
- Run regression tests on your application so you can develop with confidence
- Capture production analytics for product insights and continuous improvements

## Prerequisites

**[Create a LangSmith account](https://smith.langchain.com/) and create an API key (see bottom left corner). Familiarize yourself with the platform by looking through the [docs](https://docs.smith.langchain.com/).**

Note: LangSmith is in closed beta; we're in the process of rolling it out to more users. However, you can fill out the form on the website for expedited access.

Now, let's get started!

## Log runs to LangSmith

First, configure your environment variables to tell LangChain to log traces. This is done by setting the `LANGCHAIN_TRACING_V2` environment variable to `true`.
You can tell LangChain which project to log to by setting the `LANGCHAIN_PROJECT` environment variable (if this isn't set, runs will be logged to the `default` project); the project will be created automatically if it doesn't exist. You must also set the `LANGCHAIN_ENDPOINT` and `LANGCHAIN_API_KEY` environment variables.

For more information on other ways to set up tracing, please reference the [LangSmith documentation](https://docs.smith.langchain.com/docs/).
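For instance, one alternative is to attach a tracer callback explicitly at call time instead of relying on environment variables. The sketch below is illustrative only; the `LangChainTracer` import path and constructor options are assumptions, so verify them against your installed `@langchain/core` version.

```typescript
// A minimal sketch of explicit tracing. The import path and constructor
// options are assumptions; verify them against your installed version.
import { LangChainTracer } from "@langchain/core/tracers/tracer_langchain";

const tracer = new LangChainTracer({ projectName: "My Project" });

// The tracer can then be passed as a callback to any runnable, e.g.:
// await chain.invoke({ input: "..." }, { callbacks: [tracer] });
```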

However, in this example, we will use environment variables.

```bash
npm install langchain @langchain/openai @langchain/community langsmith uuid
```

```typescript
import { v4 as uuidv4 } from "uuid";
const uniqueId = uuidv4().slice(0, 8);
process.env.LANGCHAIN_TRACING_V2 = "true";
process.env.LANGCHAIN_PROJECT = `JS Tracing Walkthrough - ${uniqueId}`;
process.env.LANGCHAIN_ENDPOINT = "https://api.smith.langchain.com";
process.env.LANGCHAIN_API_KEY = "<YOUR-API-KEY>"; // Replace with your API key

// For the chain in this tutorial
process.env.OPENAI_API_KEY = "<YOUR-OPENAI-API-KEY>";
// You can make an API key here: https://app.tavily.com/sign-in
process.env.TAVILY_API_KEY = "<YOUR-TAVILY-API-KEY>";
```

Create the LangSmith client to interact with the API:

```typescript
import { Client } from "langsmith";

const client = new Client();
```

Create a LangChain component and log runs to the platform. In this example, we will create an OpenAI function calling agent with access to a general search tool (Tavily). The agent's prompt can be viewed in the [Hub here](https://smith.langchain.com/hub/hwchase17/openai-functions-agent).

```typescript
import { AgentExecutor, createOpenAIFunctionsAgent } from "langchain/agents";
import { pull } from "langchain/hub";
import { TavilySearchResults } from "@langchain/community/tools/tavily_search";
import { ChatOpenAI } from "@langchain/openai";
import type { ChatPromptTemplate } from "@langchain/core/prompts";

const tools = [new TavilySearchResults()];

// Get the prompt to use - you can modify this!
// If you want to see the prompt in full, you can view it at:
// https://smith.langchain.com/hub/hwchase17/openai-functions-agent
const prompt = await pull<ChatPromptTemplate>(
  "hwchase17/openai-functions-agent"
);

const llm = new ChatOpenAI({
  modelName: "gpt-3.5-turbo-1106",
  temperature: 0,
});

const agent = await createOpenAIFunctionsAgent({
  llm,
  tools,
  prompt,
});

const agentExecutor = new AgentExecutor({
  agent,
  tools,
});
```

You can run the executor on multiple inputs concurrently, reducing latency. The runs are logged to LangSmith in the background.

```typescript
const inputs = [
  { input: "What is LangChain?" },
  { input: "What's LangSmith?" },
  { input: "When was Llama-v2 released?" },
  { input: "What is the langsmith cookbook?" },
  { input: "When did langchain first announce the hub?" },
];

const results = await agentExecutor.batch(inputs);
console.log(results.slice(0, 2));
```

```out
[
  {
    input: 'What is LangChain?',
    output: 'LangChain is a framework that allows developers to build applications with Language Model Systems (LLMs) through composability. It provides a set of tools and modules for building context-aware language model systems, including features for retrieval augmented generation, analyzing structured data, chatbots, and more. LangChain also offers the LangChain Expression Language (LCEL) to create custom chains and adapt language models to specific business contexts. It was launched as an open-source project in October 2022 and has gained popularity for its capabilities in the field of generative AI and language model integration. You can find more information about LangChain on their [official website](https://www.langchain.com/).'
  },
  {
    input: "What's LangSmith?",
    output: 'LangSmith is a unified platform designed to help developers with debugging, testing, evaluating, and monitoring chains and intelligent agents built on any LLM (Language Model) framework. It provides full visibility into model inputs and outputs, facilitates dataset creation from existing logs, and seamlessly integrates logging/debugging workflows with testing/evaluation workflows. LangSmith aims to bridge the gap between prototype and production, offering a single, fully-integrated hub for developers to work from. It also assists in tracing and evaluating complex agent prompt chains, reducing the time required for debugging and refinement. LangSmith is part of the LangChain ecosystem, which is an open-source framework for building with LLMs.'
  }
]
```

After setting up your environment, your agent traces should appear in the Projects section of the LangSmith app. Congratulations!

If the agent isn't using the tools effectively, evaluate it to establish a baseline.

## Evaluate the Chain

LangSmith allows you to test and evaluate your LLM applications. Follow these steps to benchmark your agent:

### 1. Create a LangSmith dataset

Use the LangSmith client to create a dataset with input questions and corresponding labels.

For more information on datasets, including how to create them from CSVs or other files or how to create them in the platform, please refer to the [LangSmith documentation](https://docs.smith.langchain.com/).

```typescript
const outputs = [
  "LangChain is an open-source framework for building applications using large language models. It is also the name of the company building LangSmith.",
  "LangSmith is a unified platform for debugging, testing, and monitoring language model applications and agents powered by LangChain",
  "July 18, 2023",
  "The langsmith cookbook is a github repository containing detailed examples of how to use LangSmith to debug, evaluate, and monitor large language model-powered applications.",
  "September 5, 2023",
];

const datasetName = `lcjs-qa-${uniqueId}`;
const dataset = await client.createDataset(datasetName);

await Promise.all(
  inputs.map(async (input, i) => {
    await client.createExample(
      input,
      { output: outputs[i] },
      {
        datasetId: dataset.id,
      }
    );
  })
);
```

### 2. Configure evaluation

Manually comparing the results of chains in the UI is effective, but it can be time-consuming.
It can be helpful to use automated metrics and AI-assisted feedback to evaluate your component's performance.

Below, we will create a custom run evaluator that logs a heuristic evaluation.

```typescript
import type { RunEvalConfig } from "langchain/smith";
import { Run, Example } from "langsmith";

// An illustrative custom evaluator example
const notUnsure = (props: { run: Run; example?: Example }) => ({
  key: "not_unsure",
  score: !props.run?.outputs?.output?.includes("not sure"),
});

const evaluation: RunEvalConfig = {
  // The 'evaluators' are loaded from LangChain's evaluation library.
  evaluators: [
    {
      evaluatorType: "labeled_criteria",
      criteria: "correctness",
      feedbackKey: "correctness",
      formatEvaluatorInputs: ({
        rawInput,
        rawPrediction,
        rawReferenceOutput,
      }) => {
        return {
          input: rawInput.input,
          prediction: rawPrediction.output,
          reference: rawReferenceOutput.output,
        };
      },
    },
  ],
  // Custom evaluators can be user-defined RunEvaluator instances
  // or compatible functions.
  customEvaluators: [notUnsure],
};
```

### 3. Run the Benchmark

Use the [runOnDataset](https://api.js.langchain.com/functions/langchain_smith.runOnDataset.html) function to evaluate your model. This will:

1. Fetch example rows from the specified dataset.
2. Run your chain, agent (or any custom function) on each example.
3. Apply evaluators to the resulting run traces and corresponding reference examples to generate automated feedback.

The results will be visible in the LangSmith app.

```typescript
import { runOnDataset } from "langchain/smith";

await runOnDataset(agentExecutor, datasetName, {
  evaluationConfig: evaluation,
});
```

```out
Predicting: ▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓ 100.00% | 5/5
Completed
Running Evaluators: ▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓ 100.00% | 5/5
```

### Review the test results

You can review the test results in the tracing UI by clicking the URL in the output above, or by navigating to the "Testing & Datasets" page in LangSmith and selecting the **"lcjs-qa-{uniqueId}"** dataset.

This will show the new runs and the feedback logged by the selected evaluators, along with a summary of the results in tabular format.
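If you prefer to inspect results programmatically, you can also page through the runs of a test project with the `langsmith` client. Below is a minimal sketch; it assumes the client's `listRuns` method and uses a hypothetical project name, so substitute the project name reported by `runOnDataset` for your own run.

```typescript
// Minimal sketch: list runs from a test project programmatically.
// "lcjs-qa-test-project" is a hypothetical placeholder; use the project
// name printed by runOnDataset for your own run.
for await (const run of client.listRuns({
  projectName: "lcjs-qa-test-project",
})) {
  console.log(run.name, run.outputs);
}
```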

## Conclusion

Congratulations! You have successfully traced and evaluated a chain using LangSmith!

This was a quick guide to get started, but there are many more ways to use LangSmith to speed up your developer flow and produce better results.

For more information on how you can get the most out of LangSmith, check out [LangSmith documentation](https://docs.smith.langchain.com/), and please reach out with questions, feature requests, or feedback at [support@langchain.dev](mailto:support@langchain.dev).
1 change: 1 addition & 0 deletions environment_tests/test-exports-bun/src/entrypoints.js
@@ -109,5 +109,6 @@ export * from "langchain/experimental/chains/violation_of_expectations";
export * from "langchain/experimental/masking";
export * from "langchain/experimental/prompts/custom_format";
export * from "langchain/evaluation";
export * from "langchain/smith";
export * from "langchain/runnables";
export * from "langchain/runnables/remote";
1 change: 1 addition & 0 deletions environment_tests/test-exports-cf/src/entrypoints.js
@@ -109,5 +109,6 @@ export * from "langchain/experimental/chains/violation_of_expectations";
export * from "langchain/experimental/masking";
export * from "langchain/experimental/prompts/custom_format";
export * from "langchain/evaluation";
export * from "langchain/smith";
export * from "langchain/runnables";
export * from "langchain/runnables/remote";
1 change: 1 addition & 0 deletions environment_tests/test-exports-cjs/src/entrypoints.js
@@ -109,5 +109,6 @@ const experimental_chains_violation_of_expectations = require("langchain/experim
const experimental_masking = require("langchain/experimental/masking");
const experimental_prompts_custom_format = require("langchain/experimental/prompts/custom_format");
const evaluation = require("langchain/evaluation");
const smith = require("langchain/smith");
const runnables = require("langchain/runnables");
const runnables_remote = require("langchain/runnables/remote");
1 change: 1 addition & 0 deletions environment_tests/test-exports-esbuild/src/entrypoints.js
@@ -109,5 +109,6 @@ import * as experimental_chains_violation_of_expectations from "langchain/experi
import * as experimental_masking from "langchain/experimental/masking";
import * as experimental_prompts_custom_format from "langchain/experimental/prompts/custom_format";
import * as evaluation from "langchain/evaluation";
import * as smith from "langchain/smith";
import * as runnables from "langchain/runnables";
import * as runnables_remote from "langchain/runnables/remote";
1 change: 1 addition & 0 deletions environment_tests/test-exports-esm/src/entrypoints.js
@@ -109,5 +109,6 @@ import * as experimental_chains_violation_of_expectations from "langchain/experi
import * as experimental_masking from "langchain/experimental/masking";
import * as experimental_prompts_custom_format from "langchain/experimental/prompts/custom_format";
import * as evaluation from "langchain/evaluation";
import * as smith from "langchain/smith";
import * as runnables from "langchain/runnables";
import * as runnables_remote from "langchain/runnables/remote";
1 change: 1 addition & 0 deletions environment_tests/test-exports-vercel/src/entrypoints.js
@@ -109,5 +109,6 @@ export * from "langchain/experimental/chains/violation_of_expectations";
export * from "langchain/experimental/masking";
export * from "langchain/experimental/prompts/custom_format";
export * from "langchain/evaluation";
export * from "langchain/smith";
export * from "langchain/runnables";
export * from "langchain/runnables/remote";
1 change: 1 addition & 0 deletions environment_tests/test-exports-vite/src/entrypoints.js
@@ -109,5 +109,6 @@ export * from "langchain/experimental/chains/violation_of_expectations";
export * from "langchain/experimental/masking";
export * from "langchain/experimental/prompts/custom_format";
export * from "langchain/evaluation";
export * from "langchain/smith";
export * from "langchain/runnables";
export * from "langchain/runnables/remote";
1 change: 1 addition & 0 deletions examples/package.json
@@ -60,6 +60,7 @@
"ioredis": "^5.3.2",
"js-yaml": "^4.1.0",
"langchain": "workspace:*",
"langsmith": "^0.0.59",
"ml-distance": "^4.0.0",
"mongodb": "^5.2.0",
"pg": "^8.11.0",

1 comment on commit 3dfb2eb


@vercel vercel bot commented on 3dfb2eb Jan 24, 2024


Successfully deployed to the following URLs:

langchainjs-docs – ./docs/core_docs/

langchainjs-docs-git-main-langchain.vercel.app
langchainjs-docs-langchain.vercel.app
langchainjs-docs-ruddy.vercel.app
js.langchain.com
