LangSmith Test Runner (#4011)
* Add run_on_dataset

* Add stacktrace to tracer errors

* Add test runner

* bump

* Fixup

* Add example walkthrough

* rename

* Add doc

* Use types instead of data classes, other updates

* Lint

* Describe.each()

* Add chat

---------

Co-authored-by: jacoblee93 <jacoblee93@gmail.com>
hinthornw and jacoblee93 authored Jan 24, 2024
1 parent cb968cb commit 3dfb2eb
Showing 23 changed files with 2,534 additions and 3 deletions.
408 changes: 408 additions & 0 deletions docs/api_refs/typedoc.json

Large diffs are not rendered by default.

239 changes: 239 additions & 0 deletions docs/core_docs/docs/guides/langsmith_evaluation.mdx
@@ -0,0 +1,239 @@
# LangSmith Walkthrough

LangChain makes it easy to prototype LLM applications and agents. However, delivering an LLM application to production can be deceptively difficult: you will need to iterate on your prompts, chains, and other components to build a high-quality product.

LangSmith makes it easy to debug, test, and continuously improve your LLM applications.

When might this come in handy? LangSmith is useful when you want to:

- Quickly debug a new chain, agent, or set of tools
- Create and manage datasets for fine-tuning, few-shot prompting, and evaluation
- Run regression tests on your application so you can develop with confidence
- Capture production analytics for product insights and continuous improvements

## Prerequisites

**[Create a LangSmith account](https://smith.langchain.com/) and create an API key (see bottom left corner). Familiarize yourself with the platform by looking through the [docs](https://docs.smith.langchain.com/).**

Note: LangSmith is in closed beta; we're in the process of rolling it out to more users. However, you can fill out the form on the website for expedited access.

Now, let's get started!

## Log runs to LangSmith

First, configure your environment variables to tell LangChain to log traces. This is done by setting the `LANGCHAIN_TRACING_V2` environment variable to `true`.
You can tell LangChain which project to log to by setting the `LANGCHAIN_PROJECT` environment variable (if this isn't set, runs will be logged to the `default` project); the project will be created automatically if it doesn't exist. You must also set the `LANGCHAIN_ENDPOINT` and `LANGCHAIN_API_KEY` environment variables.

For more information on other ways to set up tracing, please reference the [LangSmith documentation](https://docs.smith.langchain.com/docs/).
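For instance, one alternative is to attach a tracer callback explicitly at call time instead of relying on environment variables. The sketch below is illustrative only; the `LangChainTracer` import path and constructor options are assumptions, so verify them against your installed `@langchain/core` version.

```typescript
// A minimal sketch of explicit tracing. The import path and constructor
// options are assumptions; verify them against your installed version.
import { LangChainTracer } from "@langchain/core/tracers/tracer_langchain";

const tracer = new LangChainTracer({ projectName: "My Project" });

// The tracer can then be passed as a callback to any runnable, e.g.:
// await chain.invoke({ input: "..." }, { callbacks: [tracer] });
```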

However, in this example, we will use environment variables.

```bash
npm install langchain @langchain/openai @langchain/community langsmith uuid
```

```typescript
import { v4 as uuidv4 } from "uuid";
const uniqueId = uuidv4().slice(0, 8);
process.env.LANGCHAIN_TRACING_V2 = "true";
process.env.LANGCHAIN_PROJECT = `JS Tracing Walkthrough - ${uniqueId}`;
process.env.LANGCHAIN_ENDPOINT = "https://api.smith.langchain.com";
process.env.LANGCHAIN_API_KEY = "<YOUR-API-KEY>"; // Replace with your API key

// For the chain in this tutorial
process.env.OPENAI_API_KEY = "<YOUR-OPENAI-API-KEY>";
// You can make an API key here: https://app.tavily.com/sign-in
process.env.TAVILY_API_KEY = "<YOUR-TAVILY-API-KEY>";
```

Create the LangSmith client to interact with the API:

```typescript
import { Client } from "langsmith";

const client = new Client();
```

Create a LangChain component and log runs to the platform. In this example, we will create an OpenAI function calling agent with access to a general search tool (Tavily). The agent's prompt can be viewed in the [Hub here](https://smith.langchain.com/hub/hwchase17/openai-functions-agent).

```typescript
import { AgentExecutor, createOpenAIFunctionsAgent } from "langchain/agents";
import { pull } from "langchain/hub";
import { TavilySearchResults } from "@langchain/community/tools/tavily_search";
import { ChatOpenAI } from "@langchain/openai";
import type { ChatPromptTemplate } from "@langchain/core/prompts";

const tools = [new TavilySearchResults()];

// Get the prompt to use - you can modify this!
// If you want to see the prompt in full, you can view it at:
// https://smith.langchain.com/hub/hwchase17/openai-functions-agent
const prompt = await pull<ChatPromptTemplate>(
  "hwchase17/openai-functions-agent"
);

const llm = new ChatOpenAI({
  modelName: "gpt-3.5-turbo-1106",
  temperature: 0,
});

const agent = await createOpenAIFunctionsAgent({
  llm,
  tools,
  prompt,
});

const agentExecutor = new AgentExecutor({
  agent,
  tools,
});
```

You can run the executor on multiple inputs concurrently, reducing latency. The runs are logged to LangSmith in the background.

```typescript
const inputs = [
  { input: "What is LangChain?" },
  { input: "What's LangSmith?" },
  { input: "When was Llama-v2 released?" },
  { input: "What is the langsmith cookbook?" },
  { input: "When did langchain first announce the hub?" },
];

const results = await agentExecutor.batch(inputs);
console.log(results.slice(0, 2));
```

```out
[
  {
    input: 'What is LangChain?',
    output: 'LangChain is a framework that allows developers to build applications with Language Model Systems (LLMs) through composability. It provides a set of tools and modules for building context-aware language model systems, including features for retrieval augmented generation, analyzing structured data, chatbots, and more. LangChain also offers the LangChain Expression Language (LCEL) to create custom chains and adapt language models to specific business contexts. It was launched as an open-source project in October 2022 and has gained popularity for its capabilities in the field of generative AI and language model integration. You can find more information about LangChain on their [official website](https://www.langchain.com/).'
  },
  {
    input: "What's LangSmith?",
    output: 'LangSmith is a unified platform designed to help developers with debugging, testing, evaluating, and monitoring chains and intelligent agents built on any LLM (Language Model) framework. It provides full visibility into model inputs and outputs, facilitates dataset creation from existing logs, and seamlessly integrates logging/debugging workflows with testing/evaluation workflows. LangSmith aims to bridge the gap between prototype and production, offering a single, fully-integrated hub for developers to work from. It also assists in tracing and evaluating complex agent prompt chains, reducing the time required for debugging and refinement. LangSmith is part of the LangChain ecosystem, which is an open-source framework for building with LLMs.'
  }
]
```

After setting up your environment, your agent traces should appear in the Projects section of the LangSmith app. Congratulations!

If the agent isn't using the tools effectively, evaluate it to establish a baseline.

## Evaluate the Chain

LangSmith allows you to test and evaluate your LLM applications. Follow these steps to benchmark your agent:

### 1. Create a LangSmith dataset

Use the LangSmith client to create a dataset with input questions and corresponding labels.

For more information on datasets, including how to create them from CSVs or other files or how to create them in the platform, please refer to the [LangSmith documentation](https://docs.smith.langchain.com/).

```typescript
const outputs = [
  "LangChain is an open-source framework for building applications using large language models. It is also the name of the company building LangSmith.",
  "LangSmith is a unified platform for debugging, testing, and monitoring language model applications and agents powered by LangChain",
  "July 18, 2023",
  "The langsmith cookbook is a github repository containing detailed examples of how to use LangSmith to debug, evaluate, and monitor large language model-powered applications.",
  "September 5, 2023",
];

const datasetName = `lcjs-qa-${uniqueId}`;
const dataset = await client.createDataset(datasetName);

await Promise.all(
  inputs.map(async (input, i) => {
    await client.createExample(
      input,
      { output: outputs[i] },
      {
        datasetId: dataset.id,
      }
    );
  })
);
```

### 2. Configure evaluation

Manually comparing the results of chains in the UI is effective, but it can be time-consuming.
It can be helpful to use automated metrics and AI-assisted feedback to evaluate your component's performance.

Below, we will create a custom run evaluator that logs a heuristic evaluation.

```typescript
import type { RunEvalConfig } from "langchain/smith";
import { Run, Example } from "langsmith";

// An illustrative custom evaluator example
const notUnsure = (props: { run: Run; example?: Example }) => ({
  key: "not_unsure",
  score: !props.run?.outputs?.output?.includes("not sure"),
});

const evaluation: RunEvalConfig = {
  // The 'evaluators' are loaded from LangChain's evaluation library.
  evaluators: [
    {
      evaluatorType: "labeled_criteria",
      criteria: "correctness",
      feedbackKey: "correctness",
      formatEvaluatorInputs: ({
        rawInput,
        rawPrediction,
        rawReferenceOutput,
      }) => {
        return {
          input: rawInput.input,
          prediction: rawPrediction.output,
          reference: rawReferenceOutput.output,
        };
      },
    },
  ],
  // Custom evaluators can be user-defined RunEvaluator instances
  // or compatible functions.
  customEvaluators: [notUnsure],
};
```

### 3. Run the Benchmark

Use the [runOnDataset](https://api.js.langchain.com/functions/langchain_smith.runOnDataset.html) function to evaluate your model. This will:

1. Fetch example rows from the specified dataset.
2. Run your chain, agent (or any custom function) on each example.
3. Apply evaluators to the resulting run traces and corresponding reference examples to generate automated feedback.

The results will be visible in the LangSmith app.

```typescript
import { runOnDataset } from "langchain/smith";

await runOnDataset(agentExecutor, datasetName, {
  evaluationConfig: evaluation,
});
```

```out
Predicting: ▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓ 100.00% | 5/5
Completed
Running Evaluators: ▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓ 100.00% | 5/5
```

### Review the test results

You can review the test results in the tracing UI by clicking the URL in the output above, or by navigating to the "Testing & Datasets" page in LangSmith and selecting the **"lcjs-qa-{uniqueId}"** dataset.

This will show the new runs and the feedback logged by the selected evaluators, along with a summary of the results in tabular format.
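If you prefer to inspect results programmatically, you can also page through the runs of a test project with the `langsmith` client. Below is a minimal sketch; it assumes the client's `listRuns` method and uses a hypothetical project name, so substitute the project name reported by `runOnDataset` for your own run.

```typescript
// Minimal sketch: list runs from a test project programmatically.
// "lcjs-qa-test-project" is a hypothetical placeholder; use the project
// name printed by runOnDataset for your own run.
for await (const run of client.listRuns({
  projectName: "lcjs-qa-test-project",
})) {
  console.log(run.name, run.outputs);
}
```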

## Conclusion

Congratulations! You have successfully traced and evaluated a chain using LangSmith!

This was a quick guide to get started, but there are many more ways to use LangSmith to speed up your developer flow and produce better results.

For more information on how you can get the most out of LangSmith, check out [LangSmith documentation](https://docs.smith.langchain.com/), and please reach out with questions, feature requests, or feedback at [support@langchain.dev](mailto:support@langchain.dev).
1 change: 1 addition & 0 deletions environment_tests/test-exports-bun/src/entrypoints.js
@@ -109,5 +109,6 @@ export * from "langchain/experimental/chains/violation_of_expectations";
export * from "langchain/experimental/masking";
export * from "langchain/experimental/prompts/custom_format";
export * from "langchain/evaluation";
export * from "langchain/smith";
export * from "langchain/runnables";
export * from "langchain/runnables/remote";
1 change: 1 addition & 0 deletions environment_tests/test-exports-cf/src/entrypoints.js
@@ -109,5 +109,6 @@ export * from "langchain/experimental/chains/violation_of_expectations";
export * from "langchain/experimental/masking";
export * from "langchain/experimental/prompts/custom_format";
export * from "langchain/evaluation";
export * from "langchain/smith";
export * from "langchain/runnables";
export * from "langchain/runnables/remote";
1 change: 1 addition & 0 deletions environment_tests/test-exports-cjs/src/entrypoints.js
@@ -109,5 +109,6 @@ const experimental_chains_violation_of_expectations = require("langchain/experim
const experimental_masking = require("langchain/experimental/masking");
const experimental_prompts_custom_format = require("langchain/experimental/prompts/custom_format");
const evaluation = require("langchain/evaluation");
const smith = require("langchain/smith");
const runnables = require("langchain/runnables");
const runnables_remote = require("langchain/runnables/remote");
1 change: 1 addition & 0 deletions environment_tests/test-exports-esbuild/src/entrypoints.js
@@ -109,5 +109,6 @@ import * as experimental_chains_violation_of_expectations from "langchain/experi
import * as experimental_masking from "langchain/experimental/masking";
import * as experimental_prompts_custom_format from "langchain/experimental/prompts/custom_format";
import * as evaluation from "langchain/evaluation";
import * as smith from "langchain/smith";
import * as runnables from "langchain/runnables";
import * as runnables_remote from "langchain/runnables/remote";
1 change: 1 addition & 0 deletions environment_tests/test-exports-esm/src/entrypoints.js
@@ -109,5 +109,6 @@ import * as experimental_chains_violation_of_expectations from "langchain/experi
import * as experimental_masking from "langchain/experimental/masking";
import * as experimental_prompts_custom_format from "langchain/experimental/prompts/custom_format";
import * as evaluation from "langchain/evaluation";
import * as smith from "langchain/smith";
import * as runnables from "langchain/runnables";
import * as runnables_remote from "langchain/runnables/remote";
1 change: 1 addition & 0 deletions environment_tests/test-exports-vercel/src/entrypoints.js
@@ -109,5 +109,6 @@ export * from "langchain/experimental/chains/violation_of_expectations";
export * from "langchain/experimental/masking";
export * from "langchain/experimental/prompts/custom_format";
export * from "langchain/evaluation";
export * from "langchain/smith";
export * from "langchain/runnables";
export * from "langchain/runnables/remote";
1 change: 1 addition & 0 deletions environment_tests/test-exports-vite/src/entrypoints.js
@@ -109,5 +109,6 @@ export * from "langchain/experimental/chains/violation_of_expectations";
export * from "langchain/experimental/masking";
export * from "langchain/experimental/prompts/custom_format";
export * from "langchain/evaluation";
export * from "langchain/smith";
export * from "langchain/runnables";
export * from "langchain/runnables/remote";
1 change: 1 addition & 0 deletions examples/package.json
@@ -60,6 +60,7 @@
"ioredis": "^5.3.2",
"js-yaml": "^4.1.0",
"langchain": "workspace:*",
"langsmith": "^0.0.59",
"ml-distance": "^4.0.0",
"mongodb": "^5.2.0",
"pg": "^8.11.0",

1 comment on commit 3dfb2eb


@vercel vercel bot commented on 3dfb2eb Jan 24, 2024


Successfully deployed to the following URLs:

langchainjs-docs – ./docs/core_docs/

langchainjs-docs-git-main-langchain.vercel.app
langchainjs-docs-langchain.vercel.app
langchainjs-docs-ruddy.vercel.app
js.langchain.com
