Generate document summaries #58

Merged
merged 8 commits into develop from alkemio-1440/generate-document-summaries on Oct 23, 2024

Conversation

@valeksiev (Collaborator) commented Oct 22, 2024

Fixes alkem-io/alkemio#1440

Summary by CodeRabbit

  • New Features

    • Introduced a new environment variable SUMMARY_LENGTH to enhance document summarization.
    • Added new dependencies for improved functionality, including @langchain/langgraph and @langchain/mistralai.
    • Implemented document summarization capabilities with new functions for generating and refining summaries.
    • Enhanced error handling during the embedding process for better reliability.
  • Bug Fixes

    • Improved truthiness checks in contribution handling for clarity.
  • Documentation

    • New files added to support summarization processes, including structured prompt systems and state graph approaches.


coderabbitai bot commented Oct 22, 2024

Walkthrough

The pull request introduces several changes across multiple files. A new environment variable SUMMARY_LENGTH is added to .env.default. The package.json file sees updates in its dependencies, including the addition of new libraries and re-adding previously removed ones. Modifications in the baseHandler function enhance the handling of contributions, while the fileLoaderFactories type definition is expanded to support additional document loaders. New functionalities for document summarization are introduced in several files, including embed.ts, body.of.knowledge.ts, document.ts, and graph.ts, enhancing the overall document processing capabilities.

Changes

File — Change Summary

• .env.default — Added new environment variable: SUMMARY_LENGTH=10000.
• package.json — Version updated from 0.9.2 to 0.10.0; re-added previously removed dependencies (@graphql-codegen/typescript-graphql-request, @graphql-codegen/typescript-operations, @types/graphql-upload, graphql, graphql-upload); added new dependencies @langchain/langgraph and @langchain/mistralai.
• src/callout.handlers/base.ts — Enhanced handling of contributions in the baseHandler function by iterating over callout.contributions and generating documents.
• src/callout.handlers/link.collection.ts — Updated the fileLoaderFactories type definition to return BaseDocumentLoader or the more specific PDF/DOCX loader types.
• src/embed.ts — Enhanced document processing with new summarization features; updated the function signature and improved error handling.
• src/summarize/body.of.knowledge.ts — Introduced functionality for summarizing content with prompt templates; added method async summariseBodyOfKnowledge(chunks: Document[]).
• src/summarize/document.ts — Introduced summarization functionality for text documents; added method export const summarizeDocument = async (chunks: Document[]).
• src/summarize/graph.ts — Introduced a state graph approach to summarization; exported the summaryLength constant and the buildGraph function.
• src/space.embed/embed.space.ts — Added error handling in the embedSpace function to log errors during the embedding process.
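Taken together, a rough sketch of how these pieces could be wired (function names come from the table above; the splitDoc helper, import paths, and exact call sites are assumptions rather than the repository's code):

import { Document } from '@langchain/core/documents';
import { summarizeDocument } from './summarize/document';
import { summariseBodyOfKnowledge } from './summarize/body.of.knowledge';

const summaryLength = parseInt(process.env.SUMMARY_LENGTH ?? '10000', 10);

// Assumed flow: long documents get a per-document summary built from their chunks,
// short documents are used as-is, and the collected summaries are then condensed
// into a single body-of-knowledge summary.
async function summarizeAll(
  docs: Document[],
  splitDoc: (doc: Document) => Promise<Document[]> // stand-in for the repo's text splitter
) {
  const summaries: string[] = [];
  for (const doc of docs) {
    if (doc.pageContent.length > summaryLength) {
      summaries.push(await summarizeDocument(await splitDoc(doc)));
    } else {
      summaries.push(doc.pageContent);
    }
  }
  const combined = new Document({ pageContent: summaries.join('\n') });
  return summariseBodyOfKnowledge(await splitDoc(combined));
}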

Assessment against linked issues

• Generate document summary using LangGraph and an iterative approach (1440) — addressed.
• Generate a summary of the whole body of knowledge based on the generated summaries (1440) — addressed.
• Push chunk sizes to the limits of the model (1440) — not addressed: no specific implementation for chunk size limits is evident in the changes.
• Identify approaches how to optimize the embeddings using some user input (1440) — not addressed: no changes related to user input for embedding optimization are present.

Possibly related PRs

Suggested reviewers

  • valentinyanakiev
  • hero101

Poem

🐇 In a world of words, we hop and play,
Summaries crafted, brightening the day.
With SUMMARY_LENGTH set, we leap and bound,
New tales of knowledge, in chunks we found.
So gather 'round, let the stories unfold,
In the garden of wisdom, let our hearts be bold! 🌼


@coderabbitai bot left a comment

Actionable comments posted: 15

🧹 Outside diff range and nitpick comments (8)
.env.default (1)

27-27: Add documentation for the SUMMARY_LENGTH variable.

Consider adding a comment to explain:

  • The purpose of this variable
  • The unit of measurement (characters/tokens)
  • Any model-specific constraints

Example:

+# Maximum length of generated document summaries (in tokens)
 SUMMARY_LENGTH=10000
src/callout.handlers/base.ts (1)

Line range hint 1-85: Consider enhancing document preparation for summarization

Given that this PR focuses on document summarization, consider enhancing the document generation to include metadata that could aid in the summarization process:

  1. Add a timestamp to track document freshness
  2. Include document relationships (parent-child between callout and contributions)
  3. Add content type hints to help the summarization algorithm

Example metadata structure:

metadata: {
  documentId,
  source,
  type,
  title,
  timestamp: new Date().toISOString(),
  relationships: {
    parentId: callout.id,
    type: 'contribution'
  },
  contentHints: {
    format: 'text/markdown',
    priority: contribution.priority
  }
}

This would help the summarization process:

  • Prioritize recent content
  • Understand document relationships
  • Apply appropriate summarization strategies based on content type
🧰 Tools
🪛 Biome

[error] 55-55: Avoid redundant double-negation.

It is not necessary to use double-negation when a value will already be coerced to a boolean.
Unsafe fix: Remove redundant double-negation

(lint/complexity/noExtraBooleanCast)


[error] 57-57: Avoid redundant double-negation.

It is not necessary to use double-negation when a value will already be coerced to a boolean.
Unsafe fix: Remove redundant double-negation

(lint/complexity/noExtraBooleanCast)

src/space.embed/embed.space.ts (1)

Line range hint 73-79: Consider enhancing error handling for document summarization.

Given that this PR implements document summarization functionality, the error handling could be more specific to distinguish between embedding and summarization failures.

Consider:

  1. Adding specific error codes for summarization failures
  2. Including summary-related metadata in error messages
  3. Structuring the error object to include both embedding and summarization status

Example enhancement:

export enum ErrorCode {
  VECTOR_INSERT = 'VECTOR_INSERT',
  SUMMARY_GENERATION = 'SUMMARY_GENERATION',
  SUMMARY_INTEGRATION = 'SUMMARY_INTEGRATION'
}

// Enhanced error structure
interface IngestError {
  code?: ErrorCode;
  message: string;
  context?: {
    phase: 'embedding' | 'summarization';
    documentCount?: number;
    failedDocuments?: string[];
  };
}
src/callout.handlers/link.collection.ts (1)

59-61: Consider if explicit loader types are necessary.

While the type signature change is technically correct, it might be unnecessary since both PDFLoader and DocxLoader extend BaseDocumentLoader. The current implementation doesn't use any loader-specific features that would require this explicit typing.
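For illustration, a minimal sketch of a factory map typed against the base class only — the map keys, import paths, and overall shape are assumptions, not the repository's code:

import { BaseDocumentLoader } from '@langchain/core/document_loaders/base';
import { PDFLoader } from '@langchain/community/document_loaders/fs/pdf';
import { DocxLoader } from '@langchain/community/document_loaders/fs/docx';

// Both loaders extend BaseDocumentLoader, so the factories can share one return type
// without losing anything the calling code actually uses.
const fileLoaderFactories: Record<string, (path: string) => BaseDocumentLoader> = {
  pdf: (path) => new PDFLoader(path, { splitPages: false }),
  docx: (path) => new DocxLoader(path),
};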

src/summarize/body.of.knowledge.ts (1)

5-15: Refactor to avoid duplication of the 'system' message

The 'system' message is identical in both summarizePrompt and refinePrompt. Consider defining it once to reduce duplication.

Here's how you can refactor the code:

+const systemMessage = [
+  'system',
+  `You are tasked with concising summaries sent by a user based entirely on the user input. While doing so
+   preserve as much information as possible like names, references titles, dates, etc.`,
+];

const summarizePrompt = ChatPromptTemplate.fromMessages([
-  [
-    'system',
-    `You are tasked with concising summaries sent by a user based entirely on the user input. While doing so
-     preserve as much information as possible like names, references titles, dates, etc.`,
-  ],
+  systemMessage,
  [
    'human',
    'Write a detailed summary, no more than {summaryLength} characters of the following: {context}',
  ],
]);

const refinePrompt = ChatPromptTemplate.fromMessages([
-  [
-    'system',
-    `You are tasked with concising summaries sent by a user based entirely on the user input. While doing so
-     preserve as much information as possible like names, references titles, dates, etc.`,
-  ],
+  systemMessage,
  [
    'human',
    `Produce a final detailed summary, no more than {summaryLength} characters.
     Existing summary up to this point:

     {currentSummary}

     New context: {context}

     Given the new context, refine the original summary.`,
  ],
]);

Also applies to: 16-33

src/summarize/document.ts (1)

6-6: Refine the comment for clarity and professionalism

The comment on line 6 reads:
// the system message is the same for both prompts but putting it into a variable and typing it is too hard...
Consider rephrasing or removing this comment to maintain a professional tone and clarity in the codebase.

src/summarize/graph.ts (1)

58-64: Consider adding error handling for the state graph compilation

While compiling the state graph, it's beneficial to include error handling to catch any issues during the graph construction.

 const graph = new StateGraph(SummarizeAnnotation)
   .addNode('initialSummary', initialSummary)
   .addNode('refineSummary', refineSummary)
   .addEdge(START, 'initialSummary')
   .addConditionalEdges('initialSummary', shouldRefine, ['refineSummary', END])
   .addConditionalEdges('refineSummary', shouldRefine, ['refineSummary', END])
-  .compile();
+  .compile()
+  .catch((error) => {
+    throw new Error(`Failed to compile the state graph: ${error.message}`);
+  });
src/embed.ts (1)

105-105: Use actual space name instead of placeholder

The title field is set to the placeholder 'space name'. Replace it with the actual space name to ensure accurate metadata.

Consider updating the code as follows:

-    title: 'space name',
+    title: spaceID, // or use the actual space name if available
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Files that changed from the base of the PR and between fe4d1d0 and fea4a0a.

⛔ Files ignored due to path filters (1)
  • package-lock.json is excluded by !**/package-lock.json
📒 Files selected for processing (9)
  • .env.default (1 hunks)
  • package.json (1 hunks)
  • src/callout.handlers/base.ts (1 hunks)
  • src/callout.handlers/link.collection.ts (1 hunks)
  • src/embed.ts (3 hunks)
  • src/space.embed/embed.space.ts (1 hunks)
  • src/summarize/body.of.knowledge.ts (1 hunks)
  • src/summarize/document.ts (1 hunks)
  • src/summarize/graph.ts (1 hunks)
🧰 Additional context used
🪛 Biome
src/callout.handlers/base.ts

[error] 55-55: Avoid redundant double-negation.

It is not necessary to use double-negation when a value will already be coerced to a boolean.
Unsafe fix: Remove redundant double-negation

(lint/complexity/noExtraBooleanCast)


[error] 57-57: Avoid redundant double-negation.

It is not necessary to use double-negation when a value will already be coerced to a boolean.
Unsafe fix: Remove redundant double-negation

(lint/complexity/noExtraBooleanCast)

🔇 Additional comments (11)
.env.default (1)

27-27: Verify if 10000 is within safe limits for the model.

The summary length of 10000 is significantly larger than the chunk size (1000). While this aligns with the PR objective to maximize chunk sizes, please verify that this limit is safe for the deployed model (deploy-gpt-35-turbo).
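One way to make such a limit checkable at startup — a hedged sketch, not the repository's code — is to validate the variable as soon as it is read:

// Hypothetical sanity check; the acceptable upper bound would still need to be
// chosen against the context window of the deployed model.
const raw = process.env.SUMMARY_LENGTH ?? '10000';
const summaryLength = Number.parseInt(raw, 10);
if (!Number.isFinite(summaryLength) || summaryLength <= 0) {
  throw new Error(`SUMMARY_LENGTH must be a positive integer, got "${raw}"`);
}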

src/callout.handlers/base.ts (1)

52-52: LGTM: Improved code readability

The added empty line helps separate logical blocks of code, making it easier to read and maintain.

package.json (2)

53-54: Verify LangChain dependency compatibility.

The addition of @langchain/langgraph and @langchain/mistralai aligns with the PR objectives for document summarization. However, let's ensure version compatibility with the existing langchain package (v0.2.2).

#!/bin/bash
# Description: Check for potential version conflicts in package-lock.json

echo "Checking package-lock.json for LangChain dependencies..."
if [ -f "package-lock.json" ]; then
    jq '.dependencies | with_entries(select(.key | contains("langchain")))' package-lock.json
fi

# Look for LangChain imports to ensure correct usage
echo "Checking LangChain import patterns..."
rg -l "@langchain/(langgraph|mistralai)" --type ts

Consider adding a comment in the package.json to document the interdependency between these LangChain packages and their version constraints for future maintenance.


50-51: Verify the necessity of re-added GraphQL dependencies.

The GraphQL-related packages that were previously removed have been re-added. Let's verify if these dependencies are still actively used in the codebase.

Also applies to: 55-55, 59-60

✅ Verification successful

Let me gather more specific information about GraphQL usage in the codebase.

I'm looking for evidence of active GraphQL usage in the codebase, including schema files, operations, generated types, and build scripts. This will help determine if these dependencies are actually needed or if they're remnants that can be safely removed.


Based on the comprehensive results, I can now generate the final response:

GraphQL dependencies are actively used and required

The GraphQL-related packages are actively used throughout the codebase as evidenced by:

  • Multiple GraphQL schema files and fragments in the graphql/ directory
  • Active GraphQL client implementation in src/graphql.client/AlkemioCliClient.ts
  • Generated types being used across multiple files (9 TypeScript files importing GraphQL operations/generated types)
  • Codegen configuration (codegen.yml) and npm script for code generation
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Check if GraphQL dependencies are used in the codebase

echo "Checking for GraphQL usage..."
# Look for GraphQL imports and usage
rg -l "graphql|GraphQL" --type ts

echo "Checking for specific package imports..."
# Check for specific package imports
rg -l "@graphql-codegen" --type ts
rg -l "graphql-upload" --type ts

Length of output: 470


Script:

#!/bin/bash
# Check for GraphQL schema, operations, and codegen configuration
echo "Checking for GraphQL related files..."
fd -e graphql
fd "codegen.yml|codegen.yaml|codegen.json"

echo -e "\nChecking for GraphQL operations and types usage..."
rg -l "import.*gql from|import.*generated" --type ts

echo -e "\nChecking for GraphQL client setup..."
rg -A 5 "GraphQLClient|useQuery|useMutation" --type ts

echo -e "\nChecking package.json scripts for codegen..."
if [ -f "package.json" ]; then
    jq '.scripts | with_entries(select(.value | contains("codegen")))' package.json
fi

Length of output: 2865

src/space.embed/embed.space.ts (2)

74-74: LGTM! Error logging enhancement.

Good addition of error logging before setting the error result. This will help with debugging and monitoring.


Line range hint 38-72: Consider integrating document summarization before embedding.

Based on the PR objectives to "generate summaries for documents using LangGraph," consider adding a summarization step before the embedding process.

Let's verify if summarization is implemented elsewhere:

src/callout.handlers/link.collection.ts (1)

59-61: Review document chunking strategy for summarization.

Given the PR's objective to maximize chunk sizes for better summarization, consider reviewing the document loading configuration. For example, the PDFLoader's splitPages: false setting affects how content is chunked.

Let's verify the current chunking configurations across loaders:

✅ Verification successful

Let me gather more information about the document loading and chunking configurations by checking other loaders and related code:


Based on the gathered information, I can now provide the final response:

Document chunking strategy appears to be consistently implemented

The codebase already implements a well-defined chunking strategy using RecursiveCharacterTextSplitter with explicit chunkSize and chunkOverlap parameters in src/embed.ts. The PDFLoader's splitPages: false configuration in src/callout.handlers/link.collection.ts aligns with this strategy by ensuring the PDF content is processed as a single document before being chunked by the text splitter.

  • The chunking is centralized in src/embed.ts where documents from all loaders are processed uniformly
  • The PDFLoader's configuration complements this by preventing premature content splitting
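To make the described pattern concrete, a small sketch (parameter values and import paths are assumptions; the real configuration lives in src/embed.ts):

import { RecursiveCharacterTextSplitter } from 'langchain/text_splitter';
import { PDFLoader } from '@langchain/community/document_loaders/fs/pdf';

// Assumed values; src/embed.ts holds the actual chunkSize/chunkOverlap.
const splitter = new RecursiveCharacterTextSplitter({
  chunkSize: 1000,
  chunkOverlap: 100,
});

// splitPages: false keeps the PDF as a single document, so chunk boundaries are
// decided by the splitter rather than by page breaks.
async function loadAndChunk(path: string) {
  const docs = await new PDFLoader(path, { splitPages: false }).load();
  return splitter.splitDocuments(docs);
}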
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Search for chunk size configurations in loaders
echo "Searching for chunk-related configurations in loaders..."
rg -i "chunk|split|segment" "src/loaders"

# Check PDFLoader usage and configurations
echo "Checking PDF loader configurations..."
ast-grep --pattern 'new PDFLoader($_, $_)'

Length of output: 417


Script:

#!/bin/bash
# Check for all document loader implementations and their configurations
echo "Searching for document loader implementations..."
ast-grep --pattern 'class $_ extends BaseDocumentLoader'

# Check for chunking related configurations in the entire codebase
echo "Checking chunking configurations..."
rg -i "chunksize|chunkoverlap|splitter" -A 3

# Check for text splitter implementations
echo "Checking text splitter usage..."
ast-grep --pattern 'new TextSplitter($$$)'

Length of output: 2949

src/summarize/body.of.knowledge.ts (1)

13-13: Ensure 'summaryLength' variable is correctly provided to the prompts

The prompt templates use the {summaryLength} variable, but it's not clear if this variable is being supplied when the prompts are rendered. Ensure that summaryLength is correctly passed into the prompt context to avoid runtime errors.

Please check whether summaryLength is being provided in the context when invoking the prompts. If not, you may need to include it in the parameters or fetch it from the configuration.

Also applies to: 24-24

src/embed.ts (3)

82-92: Ensure summarizeDocument handles expected input format

Confirm that summarizeDocument accepts an array of documents (splitted) as input. If it expects a single document or a different format, adjust the input accordingly.

Consider verifying the expected input for summarizeDocument and modify if necessary.


77-77: Include chunkIndex consistently in metadata

Adding chunkIndex to metadata enhances traceability of chunks. Ensure that this inclusion is consistent throughout the codebase where chunks are processed.


10-12: Imports added for summarization functionality

The new imports for summarization are appropriate and necessary for the added features.

@coderabbitai bot left a comment

Actionable comments posted: 6

🧹 Outside diff range and nitpick comments (6)
src/summarize/body.of.knowledge.ts (2)

9-11: Consider improving clarity of the system message

The word "concising" is not commonly used and might be unclear. Consider rephrasing for better clarity.

-  'You are tasked with concising summaries based entirely on the user input. While doing so preserve as much information as possible like names, references titles, dates, etc.'
+  'You are tasked with creating concise summaries based entirely on the user input. While doing so, preserve as much information as possible such as names, reference titles, dates, etc.'

19-31: Consider normalizing whitespace in the template string

The template string has inconsistent indentation which might affect readability of the code (though not the output).

   HumanMessagePromptTemplate.fromTemplate(
     `Produce a final detailed summary, no more than {summaryLength} characters.
-     Existing summary up to this point:
+    Existing summary up to this point:

-     {currentSummary}
+    {currentSummary}

-     New context: {context}
+    New context: {context}

-     Given the new context, refine the original summary.`
+    Given the new context, refine the original summary.`
   ),
src/summarize/document.ts (1)

19-30: Improve template string formatting for clarity.

The template includes extra newlines that might affect the output format. Consider making it more concise.

   HumanMessagePromptTemplate.fromTemplate(
     `Produce a final detailed summary, no more than {summaryLength} characters.
      Existing summary up to this point:
      {currentSummary}
-
      New context: {context}
-
      Given the new context, refine the original summary.`
+      New context: {context}
+      Given the new context, refine the original summary while maintaining consistency.`
   ),
src/callout.handlers/base.ts (1)

Line range hint 61-77: Add error handling for document generation

The document generation process should handle potential errors gracefully.

Consider wrapping the document generation in a try-catch:

    if (docLike) {
+     try {
        const { pageContent, documentId, source, type, title } =
          generateDocument(docLike);
        result.push(
          new Document({
            pageContent,
            metadata: {
              documentId,
              source,
              type,
              title,
            },
          })
        );
+     } catch (error) {
+       logger.error(
+         `Failed to generate document for contribution in callout ${callout.id}:`,
+         error
+       );
+     }
    }
src/summarize/graph.ts (1)

27-39: Consider adding return type annotation for better type safety.

The function's return type can be explicitly annotated to improve type safety and documentation.

 export const buildGraph = (
   summarizePrompt: ChatPromptTemplate,
   refinePrompt: ChatPromptTemplate
-) => {
+): StateGraph<typeof SummarizeAnnotation> => {
src/embed.ts (1)

21-25: Consider breaking down this function for better maintainability.

While the type changes improve safety, this function has grown to handle multiple responsibilities:

  1. Document processing
  2. Summarization
  3. Embedding generation
  4. Database operations

Consider splitting this into smaller, focused functions:

async function processDocuments(docs: Document[], splitter: TextSplitter): Promise<ProcessedDocs>;
async function generateSummaries(docs: ProcessedDocs): Promise<Summaries>;
async function generateEmbeddings(docs: Summaries): Promise<Embeddings>;
async function persistToDatabase(embeddings: Embeddings): Promise<boolean>;

This would improve:

  • Testability
  • Maintainability
  • Error handling
  • Code reuse
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Files that changed from the base of the PR and between fea4a0a and 85ceb2f.

⛔ Files ignored due to path filters (1)
  • package-lock.json is excluded by !**/package-lock.json
📒 Files selected for processing (7)
  • package.json (2 hunks)
  • src/callout.handlers/base.ts (1 hunks)
  • src/embed.ts (3 hunks)
  • src/space.embed/embed.space.ts (2 hunks)
  • src/summarize/body.of.knowledge.ts (1 hunks)
  • src/summarize/document.ts (1 hunks)
  • src/summarize/graph.ts (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
  • package.json
  • src/space.embed/embed.space.ts
🧰 Additional context used
📓 Learnings (1)
src/summarize/graph.ts (1)
Learnt from: valeksiev
PR: alkem-io/virtual-contributor-ingest-space#58
File: src/summarize/graph.ts:36-49
Timestamp: 2024-10-23T08:15:50.839Z
Learning: In `src/summarize/graph.ts`, the `shouldRefine` function handles index bounds checking for `input.index` before invoking `refineSummary`, so additional bounds checks within `refineSummary` are unnecessary.
🔇 Additional comments (11)
src/summarize/body.of.knowledge.ts (3)

1-8: LGTM! Imports are well-structured

The imports are correctly organized, using the recommended @langchain/core package for prompt templates.


13-18: LGTM! Well-structured prompt template

The prompt template is correctly implemented using appropriate message types and includes necessary parameters.


33-37: Verify chunk size handling

The PR objectives mention maximizing chunk sizes, but there's no explicit handling of chunk sizes in this implementation.
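One hedged way to address that objective would be to derive the chunk size from an assumed model limit instead of a fixed constant; every number below is a placeholder, not something this PR configures:

import { RecursiveCharacterTextSplitter } from 'langchain/text_splitter';

const MODEL_TOKEN_LIMIT = 16_000;    // assumed context window of the deployed model
const CHARS_PER_TOKEN = 4;           // rough heuristic for English text
const PROMPT_OVERHEAD_CHARS = 2_000; // room for instructions and the running summary

const splitter = new RecursiveCharacterTextSplitter({
  chunkSize: MODEL_TOKEN_LIMIT * CHARS_PER_TOKEN - PROMPT_OVERHEAD_CHARS,
  chunkOverlap: 200,
});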

src/summarize/document.ts (2)

1-7: LGTM! Well-structured imports.

The imports are specific, well-scoped, and follow good practices by using named imports.


13-18: Verify the summaryLength parameter handling.

The prompt uses {summaryLength} but its import/definition isn't visible in this file. Ensure it's properly passed through the buildGraph function.

✅ Verification successful

Based on the results, I can now generate the final response since I have sufficient information about the summaryLength parameter handling.

The summaryLength parameter is properly defined and passed through the system

The parameter is well-handled throughout the codebase:

  • Defined in src/summarize/graph.ts as an environment variable SUMMARY_LENGTH with a default value of 10000
  • Properly passed to the summarization chain in graph.ts during invocation
  • Correctly referenced in prompt templates across document summarization files
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Verify summaryLength parameter handling
# Test: Check if summaryLength is defined in graph.ts and properly passed to this function
rg -A 5 "summaryLength" "./src/summarize/"

Length of output: 2586

src/callout.handlers/base.ts (1)

Line range hint 61-77: Consider integrating document summarization

Based on the PR objectives to generate document summaries using LangGraph, this section should integrate with the summarization functionality.

Consider these approaches:

  1. Add summarization as part of document generation
  2. Create a separate summarization step after document generation

Example integration:

if (docLike) {
  try {
    const { pageContent, documentId, source, type, title } =
      generateDocument(docLike);
    
    // Option 1: Integrate with summarization
    const summary = await summarizeDocument(pageContent);
    
    result.push(
      new Document({
        pageContent,
        metadata: {
          documentId,
          source,
          type,
          title,
          summary, // Include summary in metadata
        },
      })
    );
  } catch (error) {
    logger.error(`Failed to process document:`, error);
  }
}

Let's verify if the summarization functionality is available:

src/summarize/graph.ts (3)

1-25: LGTM! Environment setup is well-structured.

The implementation includes proper error handling for environment variables and appropriate model configuration.


61-77: LGTM! Clear and maintainable control flow implementation.

The current implementation of shouldRefine with explicit END state handling is more readable and maintainable than a boolean return value. It clearly shows the state transition logic which is crucial for understanding the graph flow.


40-44: ⚠️ Potential issue

Add validation for empty chunks array.

The initialSummary function should validate that chunks array is not empty before accessing the first element.

 const initialSummary = async (input: typeof SummarizeAnnotation.State) => {
+  if (!input.chunks?.length) {
+    throw new Error('No chunks available for summarization.');
+  }
   const context = input.chunks[0].pageContent;
   const summary = await summaryChain.invoke({ context, summaryLength });
   return { summary: summary.content, index: 1 };
 };

Likely invalid or redundant comment.

src/embed.ts (2)

10-13: LGTM! Well-organized imports.

The new imports are properly scoped and align with the added summarization functionality.


79-79: LGTM! Improved metadata tracking.

The addition of embeddingType and chunkIndex to metadata enhances traceability.

Comment on lines +33 to +37
export const summariseBodyOfKnowledge = async (chunks: Document[]) => {
const graph = buildGraph(summarizePrompt, refinePrompt);
const final = await graph.invoke({ chunks });
return final.summary;
};

🛠️ Refactor suggestion

Consider adding error handling and type definitions

The function implementation is clean but could benefit from:

  1. Error handling for potential failures in graph operations
  2. Explicit return type definition
-export const summariseBodyOfKnowledge = async (chunks: Document[]) => {
+export const summariseBodyOfKnowledge = async (chunks: Document[]): Promise<string> => {
+  try {
     const graph = buildGraph(summarizePrompt, refinePrompt);
     const final = await graph.invoke({ chunks });
     return final.summary;
+  } catch (error) {
+    throw new Error(`Failed to summarize body of knowledge: ${error.message}`);
+  }
 };
📝 Committable suggestion


Suggested change
export const summariseBodyOfKnowledge = async (chunks: Document[]) => {
const graph = buildGraph(summarizePrompt, refinePrompt);
const final = await graph.invoke({ chunks });
return final.summary;
};
export const summariseBodyOfKnowledge = async (chunks: Document[]): Promise<string> => {
try {
const graph = buildGraph(summarizePrompt, refinePrompt);
const final = await graph.invoke({ chunks });
return final.summary;
} catch (error) {
throw new Error(`Failed to summarize body of knowledge: ${error.message}`);
}
};

Comment on lines +32 to +36
export const summarizeDocument = async (chunks: Document[]) => {
const graph = buildGraph(summarizePrompt, refinePrompt);
const final = await graph.invoke({ chunks });
return final.summary;
};

⚠️ Potential issue

Add error handling and input validation.

The function should include:

  1. Input validation for empty chunks array
  2. Error handling for graph.invoke
  3. Type definition for the graph result
-export const summarizeDocument = async (chunks: Document[]) => {
+export const summarizeDocument = async (chunks: Document[]): Promise<string> => {
+  if (!chunks.length) {
+    throw new Error('No documents provided for summarization');
+  }
+
   const graph = buildGraph(summarizePrompt, refinePrompt);
-  const final = await graph.invoke({ chunks });
-  return final.summary;
+  try {
+    const final = await graph.invoke({ chunks });
+    return final.summary;
+  } catch (error) {
+    throw new Error(`Failed to generate summary: ${error.message}`);
+  }
 };
📝 Committable suggestion


Suggested change
export const summarizeDocument = async (chunks: Document[]) => {
const graph = buildGraph(summarizePrompt, refinePrompt);
const final = await graph.invoke({ chunks });
return final.summary;
};
export const summarizeDocument = async (chunks: Document[]): Promise<string> => {
if (!chunks.length) {
throw new Error('No documents provided for summarization');
}
const graph = buildGraph(summarizePrompt, refinePrompt);
try {
const final = await graph.invoke({ chunks });
return final.summary;
} catch (error) {
throw new Error(`Failed to generate summary: ${error.message}`);
}
};

Comment on lines 53 to +60
for (const contribution of callout.contributions || []) {
let docLike;
if (contribution.link) {
docLike = contribution.link;
} else if (contribution.post) {
docLike = contribution.post;
}


⚠️ Potential issue

Enhance type safety and error handling for contribution processing

The current implementation could benefit from improved type safety and error handling:

Consider applying these changes:

+ type DocLikeContent = {
+   framing?: string;
+   // add other relevant properties based on your schema
+ };

  for (const contribution of callout.contributions || []) {
-   let docLike;
+   let docLike: DocLikeContent | undefined;
    if (contribution.link) {
      docLike = contribution.link;
    } else if (contribution.post) {
      docLike = contribution.post;
    } else {
+     logger.warn(`Skipping contribution without link or post for callout ${documentId}`);
+     continue;
    }

Committable suggestion was skipped due to low confidence.

Comment on lines +82 to +95
if (doc.pageContent.length > summaryLength) {
try {
const documentSummary = await summarizeDocument(splitted);
ids.push(`${doc.metadata.documentId}-${doc.metadata.type}-summary`);
documents.push(documentSummary);
metadatas.push({ ...doc.metadata, embeddingType: 'summary' });

summaries.push(documentSummary);
} catch (err) {
logger.error(err);
}
} else {
summaries.push(doc.pageContent);
}

🛠️ Refactor suggestion

Enhance error handling and documentation.

The summarization logic is sound but could be more robust.

Consider these improvements:

+ // Threshold for document summarization
+ const SUMMARY_LENGTH_THRESHOLD = summaryLength;

- if (doc.pageContent.length > summaryLength) {
+ if (doc.pageContent.length > SUMMARY_LENGTH_THRESHOLD) {
   try {
     const documentSummary = await summarizeDocument(splitted);
+    if (!documentSummary) {
+      logger.warn(`Failed to generate summary for document ${doc.metadata.documentId}`);
+      summaries.push(doc.pageContent);
+      continue;
+    }
     ids.push(`${doc.metadata.documentId}-${doc.metadata.type}-summary`);
     documents.push(documentSummary);
     metadatas.push({ ...doc.metadata, embeddingType: 'summary' });
     summaries.push(documentSummary);
   } catch (err) {
     logger.error(err);
+    logger.warn(`Falling back to original content for document ${doc.metadata.documentId}`);
+    summaries.push(doc.pageContent);
   }
📝 Committable suggestion


Suggested change
if (doc.pageContent.length > summaryLength) {
try {
const documentSummary = await summarizeDocument(splitted);
ids.push(`${doc.metadata.documentId}-${doc.metadata.type}-summary`);
documents.push(documentSummary);
metadatas.push({ ...doc.metadata, embeddingType: 'summary' });
summaries.push(documentSummary);
} catch (err) {
logger.error(err);
}
} else {
summaries.push(doc.pageContent);
}
const SUMMARY_LENGTH_THRESHOLD = summaryLength;
if (doc.pageContent.length > SUMMARY_LENGTH_THRESHOLD) {
try {
const documentSummary = await summarizeDocument(splitted);
if (!documentSummary) {
logger.warn(`Failed to generate summary for document ${doc.metadata.documentId}`);
summaries.push(doc.pageContent);
continue;
}
ids.push(`${doc.metadata.documentId}-${doc.metadata.type}-summary`);
documents.push(documentSummary);
metadatas.push({ ...doc.metadata, embeddingType: 'summary' });
summaries.push(documentSummary);
} catch (err) {
logger.error(err);
logger.warn(`Falling back to original content for document ${doc.metadata.documentId}`);
summaries.push(doc.pageContent);
}
} else {
summaries.push(doc.pageContent);
}

src/embed.ts Outdated
Comment on lines 98 to 109
const bokDescriptions = new Document({ pageContent: summaries.join('\n') });
const bokChunks = await splitter.splitDocuments([bokDescriptions]);
const bokSummary = await summariseBodyOfKnowledge(bokChunks);
ids.push('body-of-knowledge-summary');
documents.push(bokSummary);

metadatas.push({
documentId: spaceID,
soruce: space.profile.url,
type: 'bodyOfKnowledgeSummary',
title: space.profile?.displayName,
});

🛠️ Refactor suggestion

⚠️ Potential issue

Add validation and fix the 'source' typo.

  1. Fix the typo in metadata:
  metadatas.push({
    documentId: spaceID,
-   soruce: space.profile.url,
+   source: space.profile.url,
    type: 'bodyOfKnowledgeSummary',
    title: space.profile?.displayName,
  });
  2. Add validation and error handling:
+ if (summaries.length === 0) {
+   logger.warn('No summaries generated, skipping body of knowledge generation');
+   return true;
+ }

  const bokDescriptions = new Document({ pageContent: summaries.join('\n') });
  const bokChunks = await splitter.splitDocuments([bokDescriptions]);
- const bokSummary = await summariseBodyOfKnowledge(bokChunks);
+ let bokSummary;
+ try {
+   bokSummary = await summariseBodyOfKnowledge(bokChunks);
+   if (!bokSummary) {
+     logger.error('Failed to generate body of knowledge summary');
+     return false;
+   }
+ } catch (err) {
+   logger.error('Error generating body of knowledge summary:', err);
+   return false;
+ }
📝 Committable suggestion


Suggested change
const bokDescriptions = new Document({ pageContent: summaries.join('\n') });
const bokChunks = await splitter.splitDocuments([bokDescriptions]);
const bokSummary = await summariseBodyOfKnowledge(bokChunks);
ids.push('body-of-knowledge-summary');
documents.push(bokSummary);
metadatas.push({
documentId: spaceID,
soruce: space.profile.url,
type: 'bodyOfKnowledgeSummary',
title: space.profile?.displayName,
});
if (summaries.length === 0) {
logger.warn('No summaries generated, skipping body of knowledge generation');
return true;
}
const bokDescriptions = new Document({ pageContent: summaries.join('\n') });
const bokChunks = await splitter.splitDocuments([bokDescriptions]);
let bokSummary;
try {
bokSummary = await summariseBodyOfKnowledge(bokChunks);
if (!bokSummary) {
logger.error('Failed to generate body of knowledge summary');
return false;
}
} catch (err) {
logger.error('Error generating body of knowledge summary:', err);
return false;
}
ids.push('body-of-knowledge-summary');
documents.push(bokSummary);
metadatas.push({
documentId: spaceID,
source: space.profile.url,
type: 'bodyOfKnowledgeSummary',
title: space.profile?.displayName,
});

@coderabbitai bot left a comment

Actionable comments posted: 2

🧹 Outside diff range and nitpick comments (2)
src/embed.ts (2)

104-109: Add validation for optional profile properties.

While the metadata structure is correct, consider adding validation for optional profile properties:

 metadatas.push({
   documentId: spaceID,
-  source: space.profile.url,
-  type: 'bodyOfKnowledgeSummary',
-  title: space.profile?.displayName,
+  source: space.profile?.url ?? 'unknown',
+  type: 'bodyOfKnowledgeSummary',
+  title: space.profile?.displayName ?? 'Untitled Space',
 });

Line range hint 21-195: Consider breaking down the large function into smaller, focused components.

The function currently handles multiple responsibilities:

  1. Document processing and summarization
  2. Embedding generation
  3. Collection management
  4. Data persistence

Consider splitting these into separate services or modules for better maintainability and testing:

  • DocumentProcessingService
  • EmbeddingService
  • CollectionManagementService

This refactoring would:

  • Improve testability
  • Enhance maintainability
  • Make the code more modular
  • Allow for better error handling at each layer
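As a rough sketch of what such a split could look like (service names from the list above; method shapes are assumptions):

import { Document } from '@langchain/core/documents';

interface DocumentProcessingService {
  // chunking plus per-document summarization
  process(docs: Document[]): Promise<{ chunks: Document[]; summaries: string[] }>;
}

interface EmbeddingService {
  // embedding generation for the prepared chunks and summaries
  embed(chunks: Document[]): Promise<number[][]>;
}

interface CollectionManagementService {
  // persistence of ids, documents and metadata into the vector collection
  persist(
    ids: string[],
    documents: string[],
    metadatas: Record<string, unknown>[]
  ): Promise<boolean>;
}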
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Files that changed from the base of the PR and between 85ceb2f and 3051143.

📒 Files selected for processing (2)
  • src/embed.ts (3 hunks)
  • src/summarize/document.ts (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • src/summarize/document.ts
🧰 Additional context used
🔇 Additional comments (2)
src/embed.ts (2)

10-13: LGTM: New imports align with summarization objectives.

The added imports support the document summarization functionality as outlined in the PR objectives.


21-25: LGTM: Improved type safety with Pick utility.

The function signature change enhances type safety by explicitly declaring required properties from Space and Profile objects.
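As a generic illustration of the Pick pattern noted here (the stand-in Space and Profile shapes below are assumptions; the real ones come from the generated GraphQL types):

// Stand-in types for illustration only.
interface Profile { url: string; displayName?: string; }
interface Space { id: string; profile: Profile; }

// The signature only demands the properties the function actually reads.
const describeSpace = (space: Pick<Space, 'id' | 'profile'>) =>
  `${space.id}: ${space.profile.displayName ?? 'Untitled Space'}`;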

Comment on lines +98 to +103
const bokDescriptions = new Document({ pageContent: summaries.join('\n') });
const bokChunks = await splitter.splitDocuments([bokDescriptions]);
const bokSummary = await summariseBodyOfKnowledge(bokChunks);
ids.push('body-of-knowledge-summary');
documents.push(bokSummary);


⚠️ Potential issue

Add validation and error handling for body of knowledge summary.

The current implementation needs additional safeguards:

+ if (summaries.length === 0) {
+   logger.warn('No summaries to process for body of knowledge');
+   return false;
+ }

 const bokDescriptions = new Document({ pageContent: summaries.join('\n') });
 const bokChunks = await splitter.splitDocuments([bokDescriptions]);
-const bokSummary = await summariseBodyOfKnowledge(bokChunks);
+let bokSummary;
+try {
+  bokSummary = await summariseBodyOfKnowledge(bokChunks);
+  if (!bokSummary) {
+    logger.error('Failed to generate body of knowledge summary');
+    return false;
+  }
+} catch (error) {
+  logger.error('Error generating body of knowledge summary:', error);
+  return false;
+}
 ids.push('body-of-knowledge-summary');
 documents.push(bokSummary);
📝 Committable suggestion


Suggested change
const bokDescriptions = new Document({ pageContent: summaries.join('\n') });
const bokChunks = await splitter.splitDocuments([bokDescriptions]);
const bokSummary = await summariseBodyOfKnowledge(bokChunks);
ids.push('body-of-knowledge-summary');
documents.push(bokSummary);
if (summaries.length === 0) {
logger.warn('No summaries to process for body of knowledge');
return false;
}
const bokDescriptions = new Document({ pageContent: summaries.join('\n') });
const bokChunks = await splitter.splitDocuments([bokDescriptions]);
let bokSummary;
try {
bokSummary = await summariseBodyOfKnowledge(bokChunks);
if (!bokSummary) {
logger.error('Failed to generate body of knowledge summary');
return false;
}
} catch (error) {
logger.error('Error generating body of knowledge summary:', error);
return false;
}
ids.push('body-of-knowledge-summary');
documents.push(bokSummary);

Comment on lines +79 to +95
metadatas.push({ ...chunk.metadata, embeddingType: 'chunk', chunkIndex });
});

if (doc.pageContent.length > summaryLength) {
try {
const documentSummary = await summarizeDocument(splitted);
ids.push(`${doc.metadata.documentId}-${doc.metadata.type}-summary`);
documents.push(documentSummary);
metadatas.push({ ...doc.metadata, embeddingType: 'summary' });

summaries.push(documentSummary);
} catch (err) {
logger.error(err);
}
} else {
summaries.push(doc.pageContent);
}

🛠️ Refactor suggestion

Enhance error handling and logging in document processing.

While the summarization logic is well-structured, consider these improvements:

  1. Validate the generated summary
  2. Add logging for skipped documents
  3. Handle edge cases more gracefully
 metadatas.push({ ...chunk.metadata, embeddingType: 'chunk', chunkIndex });
});

if (doc.pageContent.length > summaryLength) {
+ logger.debug(`Generating summary for document ${doc.metadata.documentId}`);
  try {
    const documentSummary = await summarizeDocument(splitted);
+   if (!documentSummary) {
+     logger.warn(`Failed to generate summary for ${doc.metadata.documentId}, using original content`);
+     summaries.push(doc.pageContent);
+     return;
+   }
    ids.push(`${doc.metadata.documentId}-${doc.metadata.type}-summary`);
    documents.push(documentSummary);
    metadatas.push({ ...doc.metadata, embeddingType: 'summary' });
    summaries.push(documentSummary);
  } catch (err) {
    logger.error(err);
+   logger.warn(`Error generating summary for ${doc.metadata.documentId}, using original content`);
+   summaries.push(doc.pageContent);
  }
} else {
+ logger.debug(`Document ${doc.metadata.documentId} is below summary threshold, using original content`);
  summaries.push(doc.pageContent);
}
📝 Committable suggestion


Suggested change
metadatas.push({ ...chunk.metadata, embeddingType: 'chunk', chunkIndex });
});
if (doc.pageContent.length > summaryLength) {
try {
const documentSummary = await summarizeDocument(splitted);
ids.push(`${doc.metadata.documentId}-${doc.metadata.type}-summary`);
documents.push(documentSummary);
metadatas.push({ ...doc.metadata, embeddingType: 'summary' });
summaries.push(documentSummary);
} catch (err) {
logger.error(err);
}
} else {
summaries.push(doc.pageContent);
}
metadatas.push({ ...chunk.metadata, embeddingType: 'chunk', chunkIndex });
});
if (doc.pageContent.length > summaryLength) {
logger.debug(`Generating summary for document ${doc.metadata.documentId}`);
try {
const documentSummary = await summarizeDocument(splitted);
if (!documentSummary) {
logger.warn(`Failed to generate summary for ${doc.metadata.documentId}, using original content`);
summaries.push(doc.pageContent);
return;
}
ids.push(`${doc.metadata.documentId}-${doc.metadata.type}-summary`);
documents.push(documentSummary);
metadatas.push({ ...doc.metadata, embeddingType: 'summary' });
summaries.push(documentSummary);
} catch (err) {
logger.error(err);
logger.warn(`Error generating summary for ${doc.metadata.documentId}, using original content`);
summaries.push(doc.pageContent);
}
} else {
logger.debug(`Document ${doc.metadata.documentId} is below summary threshold, using original content`);
summaries.push(doc.pageContent);
}

@valentinyanakiev merged commit d37879f into develop on Oct 23, 2024
2 checks passed
@valentinyanakiev deleted the alkemio-1440/generate-document-summaries branch on October 23, 2024 at 11:06
Successfully merging this pull request may close these issues.

Embed documents summary
3 participants