Generate document summaries #58

Merged
merged 8 commits into develop from alkemio-1440/generate-document-summaries on Oct 23, 2024

Conversation

@valeksiev (Collaborator) commented Oct 22, 2024

Fixes alkem-io/alkemio#1440

Summary by CodeRabbit

  • New Features

    • Introduced a new environment variable SUMMARY_LENGTH to enhance document summarization.
    • Added new dependencies for improved functionality, including @langchain/langgraph and @langchain/mistralai.
    • Implemented document summarization capabilities with new functions for generating and refining summaries.
    • Enhanced error handling during the embedding process for better reliability.
  • Bug Fixes

    • Improved truthiness checks in contribution handling for clarity.
  • Documentation

    • New files added to support summarization processes, including structured prompt systems and state graph approaches.


coderabbitai bot commented Oct 22, 2024

Walkthrough

The pull request introduces several changes across multiple files. A new environment variable SUMMARY_LENGTH is added to .env.default. The package.json file sees updates in its dependencies, including the addition of new libraries and re-adding previously removed ones. Modifications in the baseHandler function enhance the handling of contributions, while the fileLoaderFactories type definition is expanded to support additional document loaders. New functionalities for document summarization are introduced in several files, including embed.ts, body.of.knowledge.ts, document.ts, and graph.ts, enhancing the overall document processing capabilities.

Changes

File — Change Summary

• .env.default — Added new environment variable: SUMMARY_LENGTH=10000.
• package.json — Version updated from 0.9.2 to 0.10.0; re-added previously removed dependencies (@graphql-codegen/typescript-graphql-request, @graphql-codegen/typescript-operations, @types/graphql-upload, graphql, graphql-upload); added new dependencies @langchain/langgraph and @langchain/mistralai.
• src/callout.handlers/base.ts — Enhanced handling of contributions in the baseHandler function by iterating over callout.contributions and generating documents.
• src/callout.handlers/link.collection.ts — Updated the fileLoaderFactories type definition to return BaseDocumentLoader or the more specific PDF/DOCX loader types.
• src/embed.ts — Enhanced document processing with new summarization features; updated the function signature and improved error handling.
• src/summarize/body.of.knowledge.ts — Introduced functionality for summarizing content with prompt templates; added method async summariseBodyOfKnowledge(chunks: Document[]).
• src/summarize/document.ts — Introduced summarization functionality for text documents; added method export const summarizeDocument = async (chunks: Document[]).
• src/summarize/graph.ts — Introduced a state graph approach to summarization; exported the summaryLength constant and the buildGraph function.
• src/space.embed/embed.space.ts — Added error handling in the embedSpace function to log errors during the embedding process.
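Taken together, a rough sketch of how these pieces could be wired (function names come from the table above; the splitDoc helper, import paths, and exact call sites are assumptions rather than the repository's code):

import { Document } from '@langchain/core/documents';
import { summarizeDocument } from './summarize/document';
import { summariseBodyOfKnowledge } from './summarize/body.of.knowledge';

const summaryLength = parseInt(process.env.SUMMARY_LENGTH ?? '10000', 10);

// Assumed flow: long documents get a per-document summary built from their chunks,
// short documents are used as-is, and the collected summaries are then condensed
// into a single body-of-knowledge summary.
async function summarizeAll(
  docs: Document[],
  splitDoc: (doc: Document) => Promise<Document[]> // stand-in for the repo's text splitter
) {
  const summaries: string[] = [];
  for (const doc of docs) {
    if (doc.pageContent.length > summaryLength) {
      summaries.push(await summarizeDocument(await splitDoc(doc)));
    } else {
      summaries.push(doc.pageContent);
    }
  }
  const combined = new Document({ pageContent: summaries.join('\n') });
  return summariseBodyOfKnowledge(await splitDoc(combined));
}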

Assessment against linked issues

• Generate document summary using LangGraph and an iterative approach (1440) — addressed.
• Generate a summary of the whole body of knowledge based on the generated summaries (1440) — addressed.
• Push chunk sizes to the limits of the model (1440) — not addressed: no specific implementation for chunk size limits is evident in the changes.
• Identify approaches how to optimize the embeddings using some user input (1440) — not addressed: no changes related to user input for embedding optimization are present.

Possibly related PRs

Suggested reviewers

  • valentinyanakiev
  • hero101

Poem

🐇 In a world of words, we hop and play,
Summaries crafted, brightening the day.
With SUMMARY_LENGTH set, we leap and bound,
New tales of knowledge, in chunks we found.
So gather 'round, let the stories unfold,
In the garden of wisdom, let our hearts be bold! 🌼


@coderabbitai bot left a comment

Actionable comments posted: 15

🧹 Outside diff range and nitpick comments (8)
.env.default (1)

27-27: Add documentation for the SUMMARY_LENGTH variable.

Consider adding a comment to explain:

  • The purpose of this variable
  • The unit of measurement (characters/tokens)
  • Any model-specific constraints

Example:

+# Maximum length of generated document summaries (in tokens)
 SUMMARY_LENGTH=10000
src/callout.handlers/base.ts (1)

Line range hint 1-85: Consider enhancing document preparation for summarization

Given that this PR focuses on document summarization, consider enhancing the document generation to include metadata that could aid in the summarization process:

  1. Add a timestamp to track document freshness
  2. Include document relationships (parent-child between callout and contributions)
  3. Add content type hints to help the summarization algorithm

Example metadata structure:

metadata: {
  documentId,
  source,
  type,
  title,
  timestamp: new Date().toISOString(),
  relationships: {
    parentId: callout.id,
    type: 'contribution'
  },
  contentHints: {
    format: 'text/markdown',
    priority: contribution.priority
  }
}

This would help the summarization process:

  • Prioritize recent content
  • Understand document relationships
  • Apply appropriate summarization strategies based on content type
🧰 Tools
🪛 Biome

[error] 55-55: Avoid redundant double-negation.

It is not necessary to use double-negation when a value will already be coerced to a boolean.
Unsafe fix: Remove redundant double-negation

(lint/complexity/noExtraBooleanCast)


[error] 57-57: Avoid redundant double-negation.

It is not necessary to use double-negation when a value will already be coerced to a boolean.
Unsafe fix: Remove redundant double-negation

(lint/complexity/noExtraBooleanCast)

src/space.embed/embed.space.ts (1)

Line range hint 73-79: Consider enhancing error handling for document summarization.

Given that this PR implements document summarization functionality, the error handling could be more specific to distinguish between embedding and summarization failures.

Consider:

  1. Adding specific error codes for summarization failures
  2. Including summary-related metadata in error messages
  3. Structuring the error object to include both embedding and summarization status

Example enhancement:

export enum ErrorCode {
  VECTOR_INSERT = 'VECTOR_INSERT',
  SUMMARY_GENERATION = 'SUMMARY_GENERATION',
  SUMMARY_INTEGRATION = 'SUMMARY_INTEGRATION'
}

// Enhanced error structure
interface IngestError {
  code?: ErrorCode;
  message: string;
  context?: {
    phase: 'embedding' | 'summarization';
    documentCount?: number;
    failedDocuments?: string[];
  };
}
src/callout.handlers/link.collection.ts (1)

59-61: Consider if explicit loader types are necessary.

While the type signature change is technically correct, it might be unnecessary since both PDFLoader and DocxLoader extend BaseDocumentLoader. The current implementation doesn't use any loader-specific features that would require this explicit typing.
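For illustration, a minimal sketch of a factory map typed against the base class only — the map keys, import paths, and overall shape are assumptions, not the repository's code:

import { BaseDocumentLoader } from '@langchain/core/document_loaders/base';
import { PDFLoader } from '@langchain/community/document_loaders/fs/pdf';
import { DocxLoader } from '@langchain/community/document_loaders/fs/docx';

// Both loaders extend BaseDocumentLoader, so the factories can share one return type
// without losing anything the calling code actually uses.
const fileLoaderFactories: Record<string, (path: string) => BaseDocumentLoader> = {
  pdf: (path) => new PDFLoader(path, { splitPages: false }),
  docx: (path) => new DocxLoader(path),
};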

src/summarize/body.of.knowledge.ts (1)

5-15: Refactor to avoid duplication of the 'system' message

The 'system' message is identical in both summarizePrompt and refinePrompt. Consider defining it once to reduce duplication.

Here's how you can refactor the code:

+const systemMessage = [
+  'system',
+  `You are tasked with concising summaries sent by a user based entirely on the user input. While doing so
+   preserve as much information as possible like names, references titles, dates, etc.`,
+];

const summarizePrompt = ChatPromptTemplate.fromMessages([
-  [
-    'system',
-    `You are tasked with concising summaries sent by a user based entirely on the user input. While doing so
-     preserve as much information as possible like names, references titles, dates, etc.`,
-  ],
+  systemMessage,
  [
    'human',
    'Write a detailed summary, no more than {summaryLength} characters of the following: {context}',
  ],
]);

const refinePrompt = ChatPromptTemplate.fromMessages([
-  [
-    'system',
-    `You are tasked with concising summaries sent by a user based entirely on the user input. While doing so
-     preserve as much information as possible like names, references titles, dates, etc.`,
-  ],
+  systemMessage,
  [
    'human',
    `Produce a final detailed summary, no more than {summaryLength} characters.
     Existing summary up to this point:

     {currentSummary}

     New context: {context}

     Given the new context, refine the original summary.`,
  ],
]);

Also applies to: 16-33

src/summarize/document.ts (1)

6-6: Refine the comment for clarity and professionalism

The comment on line 6 reads:
// the system message is the same for both prompts but putting it into a variable and typing it is too hard...
Consider rephrasing or removing this comment to maintain a professional tone and clarity in the codebase.

src/summarize/graph.ts (1)

58-64: Consider adding error handling for the state graph compilation

While compiling the state graph, it's beneficial to include error handling to catch any issues during the graph construction.

 const graph = new StateGraph(SummarizeAnnotation)
   .addNode('initialSummary', initialSummary)
   .addNode('refineSummary', refineSummary)
   .addEdge(START, 'initialSummary')
   .addConditionalEdges('initialSummary', shouldRefine, ['refineSummary', END])
   .addConditionalEdges('refineSummary', shouldRefine, ['refineSummary', END])
-  .compile();
+  .compile()
+  .catch((error) => {
+    throw new Error(`Failed to compile the state graph: ${error.message}`);
+  });
src/embed.ts (1)

105-105: Use actual space name instead of placeholder

The title field is set to the placeholder 'space name'. Replace it with the actual space name to ensure accurate metadata.

Consider updating the code as follows:

-    title: 'space name',
+    title: spaceID, // or use the actual space name if available
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Files that changed from the base of the PR and between fe4d1d0 and fea4a0a.

⛔ Files ignored due to path filters (1)
  • package-lock.json is excluded by !**/package-lock.json
📒 Files selected for processing (9)
  • .env.default (1 hunks)
  • package.json (1 hunks)
  • src/callout.handlers/base.ts (1 hunks)
  • src/callout.handlers/link.collection.ts (1 hunks)
  • src/embed.ts (3 hunks)
  • src/space.embed/embed.space.ts (1 hunks)
  • src/summarize/body.of.knowledge.ts (1 hunks)
  • src/summarize/document.ts (1 hunks)
  • src/summarize/graph.ts (1 hunks)
🧰 Additional context used
🪛 Biome
src/callout.handlers/base.ts

[error] 55-55: Avoid redundant double-negation.

It is not necessary to use double-negation when a value will already be coerced to a boolean.
Unsafe fix: Remove redundant double-negation

(lint/complexity/noExtraBooleanCast)


[error] 57-57: Avoid redundant double-negation.

It is not necessary to use double-negation when a value will already be coerced to a boolean.
Unsafe fix: Remove redundant double-negation

(lint/complexity/noExtraBooleanCast)

🔇 Additional comments (11)
.env.default (1)

27-27: Verify if 10000 is within safe limits for the model.

The summary length of 10000 is significantly larger than the chunk size (1000). While this aligns with the PR objective to maximize chunk sizes, please verify that this limit is safe for the deployed model (deploy-gpt-35-turbo).
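One way to make such a limit checkable at startup — a hedged sketch, not the repository's code — is to validate the variable as soon as it is read:

// Hypothetical sanity check; the acceptable upper bound would still need to be
// chosen against the context window of the deployed model.
const raw = process.env.SUMMARY_LENGTH ?? '10000';
const summaryLength = Number.parseInt(raw, 10);
if (!Number.isFinite(summaryLength) || summaryLength <= 0) {
  throw new Error(`SUMMARY_LENGTH must be a positive integer, got "${raw}"`);
}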

src/callout.handlers/base.ts (1)

52-52: LGTM: Improved code readability

The added empty line helps separate logical blocks of code, making it easier to read and maintain.

package.json (2)

53-54: Verify LangChain dependency compatibility.

The addition of @langchain/langgraph and @langchain/mistralai aligns with the PR objectives for document summarization. However, let's ensure version compatibility with the existing langchain package (v0.2.2).

#!/bin/bash
# Description: Check for potential version conflicts in package-lock.json

echo "Checking package-lock.json for LangChain dependencies..."
if [ -f "package-lock.json" ]; then
    jq '.dependencies | with_entries(select(.key | contains("langchain")))' package-lock.json
fi

# Look for LangChain imports to ensure correct usage
echo "Checking LangChain import patterns..."
rg -l "@langchain/(langgraph|mistralai)" --type ts

Consider adding a comment in the package.json to document the interdependency between these LangChain packages and their version constraints for future maintenance.


50-51: Verify the necessity of re-added GraphQL dependencies.

The GraphQL-related packages that were previously removed have been re-added. Let's verify if these dependencies are still actively used in the codebase.

Also applies to: 55-55, 59-60

✅ Verification successful

Let me gather more specific information about GraphQL usage in the codebase.

I'm looking for evidence of active GraphQL usage in the codebase, including schema files, operations, generated types, and build scripts. This will help determine if these dependencies are actually needed or if they're remnants that can be safely removed.


Based on the comprehensive results, I can now generate the final response:

GraphQL dependencies are actively used and required

The GraphQL-related packages are actively used throughout the codebase as evidenced by:

  • Multiple GraphQL schema files and fragments in the graphql/ directory
  • Active GraphQL client implementation in src/graphql.client/AlkemioCliClient.ts
  • Generated types being used across multiple files (9 TypeScript files importing GraphQL operations/generated types)
  • Codegen configuration (codegen.yml) and npm script for code generation
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Check if GraphQL dependencies are used in the codebase

echo "Checking for GraphQL usage..."
# Look for GraphQL imports and usage
rg -l "graphql|GraphQL" --type ts

echo "Checking for specific package imports..."
# Check for specific package imports
rg -l "@graphql-codegen" --type ts
rg -l "graphql-upload" --type ts

Length of output: 470


Script:

#!/bin/bash
# Check for GraphQL schema, operations, and codegen configuration
echo "Checking for GraphQL related files..."
fd -e graphql
fd "codegen.yml|codegen.yaml|codegen.json"

echo -e "\nChecking for GraphQL operations and types usage..."
rg -l "import.*gql from|import.*generated" --type ts

echo -e "\nChecking for GraphQL client setup..."
rg -A 5 "GraphQLClient|useQuery|useMutation" --type ts

echo -e "\nChecking package.json scripts for codegen..."
if [ -f "package.json" ]; then
    jq '.scripts | with_entries(select(.value | contains("codegen")))' package.json
fi

Length of output: 2865

src/space.embed/embed.space.ts (2)

74-74: LGTM! Error logging enhancement.

Good addition of error logging before setting the error result. This will help with debugging and monitoring.


Line range hint 38-72: Consider integrating document summarization before embedding.

Based on the PR objectives to "generate summaries for documents using LangGraph," consider adding a summarization step before the embedding process.

Let's verify if summarization is implemented elsewhere:

src/callout.handlers/link.collection.ts (1)

59-61: Review document chunking strategy for summarization.

Given the PR's objective to maximize chunk sizes for better summarization, consider reviewing the document loading configuration. For example, the PDFLoader's splitPages: false setting affects how content is chunked.

Let's verify the current chunking configurations across loaders:

✅ Verification successful

Let me gather more information about the document loading and chunking configurations by checking other loaders and related code:


Based on the gathered information, I can now provide the final response:

Document chunking strategy appears to be consistently implemented

The codebase already implements a well-defined chunking strategy using RecursiveCharacterTextSplitter with explicit chunkSize and chunkOverlap parameters in src/embed.ts. The PDFLoader's splitPages: false configuration in src/callout.handlers/link.collection.ts aligns with this strategy by ensuring the PDF content is processed as a single document before being chunked by the text splitter.

  • The chunking is centralized in src/embed.ts where documents from all loaders are processed uniformly
  • The PDFLoader's configuration complements this by preventing premature content splitting
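To make the described pattern concrete, a small sketch (parameter values and import paths are assumptions; the real configuration lives in src/embed.ts):

import { RecursiveCharacterTextSplitter } from 'langchain/text_splitter';
import { PDFLoader } from '@langchain/community/document_loaders/fs/pdf';

// Assumed values; src/embed.ts holds the actual chunkSize/chunkOverlap.
const splitter = new RecursiveCharacterTextSplitter({
  chunkSize: 1000,
  chunkOverlap: 100,
});

// splitPages: false keeps the PDF as a single document, so chunk boundaries are
// decided by the splitter rather than by page breaks.
async function loadAndChunk(path: string) {
  const docs = await new PDFLoader(path, { splitPages: false }).load();
  return splitter.splitDocuments(docs);
}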
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Search for chunk size configurations in loaders
echo "Searching for chunk-related configurations in loaders..."
rg -i "chunk|split|segment" "src/loaders"

# Check PDFLoader usage and configurations
echo "Checking PDF loader configurations..."
ast-grep --pattern 'new PDFLoader($_, $_)'

Length of output: 417


Script:

#!/bin/bash
# Check for all document loader implementations and their configurations
echo "Searching for document loader implementations..."
ast-grep --pattern 'class $_ extends BaseDocumentLoader'

# Check for chunking related configurations in the entire codebase
echo "Checking chunking configurations..."
rg -i "chunksize|chunkoverlap|splitter" -A 3

# Check for text splitter implementations
echo "Checking text splitter usage..."
ast-grep --pattern 'new TextSplitter($$$)'

Length of output: 2949

src/summarize/body.of.knowledge.ts (1)

13-13: Ensure 'summaryLength' variable is correctly provided to the prompts

The prompt templates use the {summaryLength} variable, but it's not clear if this variable is being supplied when the prompts are rendered. Ensure that summaryLength is correctly passed into the prompt context to avoid runtime errors.

Please check whether summaryLength is being provided in the context when invoking the prompts. If not, you may need to include it in the parameters or fetch it from the configuration.

Also applies to: 24-24

src/embed.ts (3)

82-92: Ensure summarizeDocument handles expected input format

Confirm that summarizeDocument accepts an array of documents (splitted) as input. If it expects a single document or a different format, adjust the input accordingly.

Consider verifying the expected input for summarizeDocument and modify if necessary.


77-77: Include chunkIndex consistently in metadata

Adding chunkIndex to metadata enhances traceability of chunks. Ensure that this inclusion is consistent throughout the codebase where chunks are processed.


10-12: Imports added for summarization functionality

The new imports for summarization are appropriate and necessary for the added features.

@coderabbitai bot left a comment

Actionable comments posted: 6

🧹 Outside diff range and nitpick comments (6)
src/summarize/body.of.knowledge.ts (2)

9-11: Consider improving clarity of the system message

The word "concising" is not commonly used and might be unclear. Consider rephrasing for better clarity.

-  'You are tasked with concising summaries based entirely on the user input. While doing so preserve as much information as possible like names, references titles, dates, etc.'
+  'You are tasked with creating concise summaries based entirely on the user input. While doing so, preserve as much information as possible such as names, reference titles, dates, etc.'

19-31: Consider normalizing whitespace in the template string

The template string has inconsistent indentation which might affect readability of the code (though not the output).

   HumanMessagePromptTemplate.fromTemplate(
     `Produce a final detailed summary, no more than {summaryLength} characters.
-     Existing summary up to this point:
+    Existing summary up to this point:

-     {currentSummary}
+    {currentSummary}

-     New context: {context}
+    New context: {context}

-     Given the new context, refine the original summary.`
+    Given the new context, refine the original summary.`
   ),
src/summarize/document.ts (1)

19-30: Improve template string formatting for clarity.

The template includes extra newlines that might affect the output format. Consider making it more concise.

   HumanMessagePromptTemplate.fromTemplate(
     `Produce a final detailed summary, no more than {summaryLength} characters.
      Existing summary up to this point:
      {currentSummary}
-
      New context: {context}
-
      Given the new context, refine the original summary.`
+      New context: {context}
+      Given the new context, refine the original summary while maintaining consistency.`
   ),
src/callout.handlers/base.ts (1)

Line range hint 61-77: Add error handling for document generation

The document generation process should handle potential errors gracefully.

Consider wrapping the document generation in a try-catch:

    if (docLike) {
+     try {
        const { pageContent, documentId, source, type, title } =
          generateDocument(docLike);
        result.push(
          new Document({
            pageContent,
            metadata: {
              documentId,
              source,
              type,
              title,
            },
          })
        );
+     } catch (error) {
+       logger.error(
+         `Failed to generate document for contribution in callout ${callout.id}:`,
+         error
+       );
+     }
    }
src/summarize/graph.ts (1)

27-39: Consider adding return type annotation for better type safety.

The function's return type can be explicitly annotated to improve type safety and documentation.

 export const buildGraph = (
   summarizePrompt: ChatPromptTemplate,
   refinePrompt: ChatPromptTemplate
-) => {
+): StateGraph<typeof SummarizeAnnotation> => {
src/embed.ts (1)

21-25: Consider breaking down this function for better maintainability.

While the type changes improve safety, this function has grown to handle multiple responsibilities:

  1. Document processing
  2. Summarization
  3. Embedding generation
  4. Database operations

Consider splitting this into smaller, focused functions:

async function processDocuments(docs: Document[], splitter: TextSplitter): Promise<ProcessedDocs>;
async function generateSummaries(docs: ProcessedDocs): Promise<Summaries>;
async function generateEmbeddings(docs: Summaries): Promise<Embeddings>;
async function persistToDatabase(embeddings: Embeddings): Promise<boolean>;

This would improve:

  • Testability
  • Maintainability
  • Error handling
  • Code reuse
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Files that changed from the base of the PR and between fea4a0a and 85ceb2f.

⛔ Files ignored due to path filters (1)
  • package-lock.json is excluded by !**/package-lock.json
📒 Files selected for processing (7)
  • package.json (2 hunks)
  • src/callout.handlers/base.ts (1 hunks)
  • src/embed.ts (3 hunks)
  • src/space.embed/embed.space.ts (2 hunks)
  • src/summarize/body.of.knowledge.ts (1 hunks)
  • src/summarize/document.ts (1 hunks)
  • src/summarize/graph.ts (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
  • package.json
  • src/space.embed/embed.space.ts
🧰 Additional context used
📓 Learnings (1)
src/summarize/graph.ts (1)
Learnt from: valeksiev
PR: alkem-io/virtual-contributor-ingest-space#58
File: src/summarize/graph.ts:36-49
Timestamp: 2024-10-23T08:15:50.839Z
Learning: In `src/summarize/graph.ts`, the `shouldRefine` function handles index bounds checking for `input.index` before invoking `refineSummary`, so additional bounds checks within `refineSummary` are unnecessary.
🔇 Additional comments (11)
src/summarize/body.of.knowledge.ts (3)

1-8: LGTM! Imports are well-structured

The imports are correctly organized, using the recommended @langchain/core package for prompt templates.


13-18: LGTM! Well-structured prompt template

The prompt template is correctly implemented using appropriate message types and includes necessary parameters.


33-37: Verify chunk size handling

The PR objectives mention maximizing chunk sizes, but there's no explicit handling of chunk sizes in this implementation.
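One hedged way to address that objective would be to derive the chunk size from an assumed model limit instead of a fixed constant; every number below is a placeholder, not something this PR configures:

import { RecursiveCharacterTextSplitter } from 'langchain/text_splitter';

const MODEL_TOKEN_LIMIT = 16_000;    // assumed context window of the deployed model
const CHARS_PER_TOKEN = 4;           // rough heuristic for English text
const PROMPT_OVERHEAD_CHARS = 2_000; // room for instructions and the running summary

const splitter = new RecursiveCharacterTextSplitter({
  chunkSize: MODEL_TOKEN_LIMIT * CHARS_PER_TOKEN - PROMPT_OVERHEAD_CHARS,
  chunkOverlap: 200,
});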

src/summarize/document.ts (2)

1-7: LGTM! Well-structured imports.

The imports are specific, well-scoped, and follow good practices by using named imports.


13-18: Verify the summaryLength parameter handling.

The prompt uses {summaryLength} but its import/definition isn't visible in this file. Ensure it's properly passed through the buildGraph function.

✅ Verification successful

Based on the results, I can now generate the final response since I have sufficient information about the summaryLength parameter handling.

The summaryLength parameter is properly defined and passed through the system

The parameter is well-handled throughout the codebase:

  • Defined in src/summarize/graph.ts as an environment variable SUMMARY_LENGTH with a default value of 10000
  • Properly passed to the summarization chain in graph.ts during invocation
  • Correctly referenced in prompt templates across document summarization files
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Verify summaryLength parameter handling
# Test: Check if summaryLength is defined in graph.ts and properly passed to this function
rg -A 5 "summaryLength" "./src/summarize/"

Length of output: 2586

src/callout.handlers/base.ts (1)

Line range hint 61-77: Consider integrating document summarization

Based on the PR objectives to generate document summaries using LangGraph, this section should integrate with the summarization functionality.

Consider these approaches:

  1. Add summarization as part of document generation
  2. Create a separate summarization step after document generation

Example integration:

if (docLike) {
  try {
    const { pageContent, documentId, source, type, title } =
      generateDocument(docLike);
    
    // Option 1: Integrate with summarization
    const summary = await summarizeDocument(pageContent);
    
    result.push(
      new Document({
        pageContent,
        metadata: {
          documentId,
          source,
          type,
          title,
          summary, // Include summary in metadata
        },
      })
    );
  } catch (error) {
    logger.error(`Failed to process document:`, error);
  }
}

Let's verify if the summarization functionality is available:

src/summarize/graph.ts (3)

1-25: LGTM! Environment setup is well-structured.

The implementation includes proper error handling for environment variables and appropriate model configuration.


61-77: LGTM! Clear and maintainable control flow implementation.

The current implementation of shouldRefine with explicit END state handling is more readable and maintainable than a boolean return value. It clearly shows the state transition logic which is crucial for understanding the graph flow.


40-44: ⚠️ Potential issue

Add validation for empty chunks array.

The initialSummary function should validate that chunks array is not empty before accessing the first element.

 const initialSummary = async (input: typeof SummarizeAnnotation.State) => {
+  if (!input.chunks?.length) {
+    throw new Error('No chunks available for summarization.');
+  }
   const context = input.chunks[0].pageContent;
   const summary = await summaryChain.invoke({ context, summaryLength });
   return { summary: summary.content, index: 1 };
 };

Likely invalid or redundant comment.

src/embed.ts (2)

10-13: LGTM! Well-organized imports.

The new imports are properly scoped and align with the added summarization functionality.


79-79: LGTM! Improved metadata tracking.

The addition of embeddingType and chunkIndex to metadata enhances traceability.

Comment on lines +33 to +37
export const summariseBodyOfKnowledge = async (chunks: Document[]) => {
const graph = buildGraph(summarizePrompt, refinePrompt);
const final = await graph.invoke({ chunks });
return final.summary;
};

🛠️ Refactor suggestion

Consider adding error handling and type definitions

The function implementation is clean but could benefit from:

  1. Error handling for potential failures in graph operations
  2. Explicit return type definition
-export const summariseBodyOfKnowledge = async (chunks: Document[]) => {
+export const summariseBodyOfKnowledge = async (chunks: Document[]): Promise<string> => {
+  try {
     const graph = buildGraph(summarizePrompt, refinePrompt);
     const final = await graph.invoke({ chunks });
     return final.summary;
+  } catch (error) {
+    throw new Error(`Failed to summarize body of knowledge: ${error.message}`);
+  }
 };
📝 Committable suggestion


Suggested change
export const summariseBodyOfKnowledge = async (chunks: Document[]) => {
const graph = buildGraph(summarizePrompt, refinePrompt);
const final = await graph.invoke({ chunks });
return final.summary;
};
export const summariseBodyOfKnowledge = async (chunks: Document[]): Promise<string> => {
try {
const graph = buildGraph(summarizePrompt, refinePrompt);
const final = await graph.invoke({ chunks });
return final.summary;
} catch (error) {
throw new Error(`Failed to summarize body of knowledge: ${error.message}`);
}
};

Comment on lines +32 to +36
export const summarizeDocument = async (chunks: Document[]) => {
const graph = buildGraph(summarizePrompt, refinePrompt);
const final = await graph.invoke({ chunks });
return final.summary;
};

⚠️ Potential issue

Add error handling and input validation.

The function should include:

  1. Input validation for empty chunks array
  2. Error handling for graph.invoke
  3. Type definition for the graph result
-export const summarizeDocument = async (chunks: Document[]) => {
+export const summarizeDocument = async (chunks: Document[]): Promise<string> => {
+  if (!chunks.length) {
+    throw new Error('No documents provided for summarization');
+  }
+
   const graph = buildGraph(summarizePrompt, refinePrompt);
-  const final = await graph.invoke({ chunks });
-  return final.summary;
+  try {
+    const final = await graph.invoke({ chunks });
+    return final.summary;
+  } catch (error) {
+    throw new Error(`Failed to generate summary: ${error.message}`);
+  }
 };
📝 Committable suggestion


Suggested change
export const summarizeDocument = async (chunks: Document[]) => {
const graph = buildGraph(summarizePrompt, refinePrompt);
const final = await graph.invoke({ chunks });
return final.summary;
};
export const summarizeDocument = async (chunks: Document[]): Promise<string> => {
if (!chunks.length) {
throw new Error('No documents provided for summarization');
}
const graph = buildGraph(summarizePrompt, refinePrompt);
try {
const final = await graph.invoke({ chunks });
return final.summary;
} catch (error) {
throw new Error(`Failed to generate summary: ${error.message}`);
}
};

Comment on lines 53 to +60
for (const contribution of callout.contributions || []) {
let docLike;
if (contribution.link) {
docLike = contribution.link;
} else if (contribution.post) {
docLike = contribution.post;
}


⚠️ Potential issue

Enhance type safety and error handling for contribution processing

The current implementation could benefit from improved type safety and error handling:

Consider applying these changes:

+ type DocLikeContent = {
+   framing?: string;
+   // add other relevant properties based on your schema
+ };

  for (const contribution of callout.contributions || []) {
-   let docLike;
+   let docLike: DocLikeContent | undefined;
    if (contribution.link) {
      docLike = contribution.link;
    } else if (contribution.post) {
      docLike = contribution.post;
    } else {
+     logger.warn(`Skipping contribution without link or post for callout ${documentId}`);
+     continue;
    }

Committable suggestion was skipped due to low confidence.

Comment on lines +82 to +95
if (doc.pageContent.length > summaryLength) {
try {
const documentSummary = await summarizeDocument(splitted);
ids.push(`${doc.metadata.documentId}-${doc.metadata.type}-summary`);
documents.push(documentSummary);
metadatas.push({ ...doc.metadata, embeddingType: 'summary' });

summaries.push(documentSummary);
} catch (err) {
logger.error(err);
}
} else {
summaries.push(doc.pageContent);
}

🛠️ Refactor suggestion

Enhance error handling and documentation.

The summarization logic is sound but could be more robust.

Consider these improvements:

+ // Threshold for document summarization
+ const SUMMARY_LENGTH_THRESHOLD = summaryLength;

- if (doc.pageContent.length > summaryLength) {
+ if (doc.pageContent.length > SUMMARY_LENGTH_THRESHOLD) {
   try {
     const documentSummary = await summarizeDocument(splitted);
+    if (!documentSummary) {
+      logger.warn(`Failed to generate summary for document ${doc.metadata.documentId}`);
+      summaries.push(doc.pageContent);
+      continue;
+    }
     ids.push(`${doc.metadata.documentId}-${doc.metadata.type}-summary`);
     documents.push(documentSummary);
     metadatas.push({ ...doc.metadata, embeddingType: 'summary' });
     summaries.push(documentSummary);
   } catch (err) {
     logger.error(err);
+    logger.warn(`Falling back to original content for document ${doc.metadata.documentId}`);
+    summaries.push(doc.pageContent);
   }
📝 Committable suggestion


Suggested change
if (doc.pageContent.length > summaryLength) {
try {
const documentSummary = await summarizeDocument(splitted);
ids.push(`${doc.metadata.documentId}-${doc.metadata.type}-summary`);
documents.push(documentSummary);
metadatas.push({ ...doc.metadata, embeddingType: 'summary' });
summaries.push(documentSummary);
} catch (err) {
logger.error(err);
}
} else {
summaries.push(doc.pageContent);
}
const SUMMARY_LENGTH_THRESHOLD = summaryLength;
if (doc.pageContent.length > SUMMARY_LENGTH_THRESHOLD) {
try {
const documentSummary = await summarizeDocument(splitted);
if (!documentSummary) {
logger.warn(`Failed to generate summary for document ${doc.metadata.documentId}`);
summaries.push(doc.pageContent);
continue;
}
ids.push(`${doc.metadata.documentId}-${doc.metadata.type}-summary`);
documents.push(documentSummary);
metadatas.push({ ...doc.metadata, embeddingType: 'summary' });
summaries.push(documentSummary);
} catch (err) {
logger.error(err);
logger.warn(`Falling back to original content for document ${doc.metadata.documentId}`);
summaries.push(doc.pageContent);
}
} else {
summaries.push(doc.pageContent);
}

src/embed.ts Outdated
Comment on lines 98 to 109
const bokDescriptions = new Document({ pageContent: summaries.join('\n') });
const bokChunks = await splitter.splitDocuments([bokDescriptions]);
const bokSummary = await summariseBodyOfKnowledge(bokChunks);
ids.push('body-of-knowledge-summary');
documents.push(bokSummary);

metadatas.push({
documentId: spaceID,
soruce: space.profile.url,
type: 'bodyOfKnowledgeSummary',
title: space.profile?.displayName,
});

🛠️ Refactor suggestion

⚠️ Potential issue

Add validation and fix the 'source' typo.

  1. Fix the typo in metadata:
  metadatas.push({
    documentId: spaceID,
-   soruce: space.profile.url,
+   source: space.profile.url,
    type: 'bodyOfKnowledgeSummary',
    title: space.profile?.displayName,
  });
  2. Add validation and error handling:
+ if (summaries.length === 0) {
+   logger.warn('No summaries generated, skipping body of knowledge generation');
+   return true;
+ }

  const bokDescriptions = new Document({ pageContent: summaries.join('\n') });
  const bokChunks = await splitter.splitDocuments([bokDescriptions]);
- const bokSummary = await summariseBodyOfKnowledge(bokChunks);
+ let bokSummary;
+ try {
+   bokSummary = await summariseBodyOfKnowledge(bokChunks);
+   if (!bokSummary) {
+     logger.error('Failed to generate body of knowledge summary');
+     return false;
+   }
+ } catch (err) {
+   logger.error('Error generating body of knowledge summary:', err);
+   return false;
+ }
📝 Committable suggestion


Suggested change
const bokDescriptions = new Document({ pageContent: summaries.join('\n') });
const bokChunks = await splitter.splitDocuments([bokDescriptions]);
const bokSummary = await summariseBodyOfKnowledge(bokChunks);
ids.push('body-of-knowledge-summary');
documents.push(bokSummary);
metadatas.push({
documentId: spaceID,
soruce: space.profile.url,
type: 'bodyOfKnowledgeSummary',
title: space.profile?.displayName,
});
if (summaries.length === 0) {
logger.warn('No summaries generated, skipping body of knowledge generation');
return true;
}
const bokDescriptions = new Document({ pageContent: summaries.join('\n') });
const bokChunks = await splitter.splitDocuments([bokDescriptions]);
let bokSummary;
try {
bokSummary = await summariseBodyOfKnowledge(bokChunks);
if (!bokSummary) {
logger.error('Failed to generate body of knowledge summary');
return false;
}
} catch (err) {
logger.error('Error generating body of knowledge summary:', err);
return false;
}
ids.push('body-of-knowledge-summary');
documents.push(bokSummary);
metadatas.push({
documentId: spaceID,
source: space.profile.url,
type: 'bodyOfKnowledgeSummary',
title: space.profile?.displayName,
});

@coderabbitai bot left a comment

Actionable comments posted: 2

🧹 Outside diff range and nitpick comments (2)
src/embed.ts (2)

104-109: Add validation for optional profile properties.

While the metadata structure is correct, consider adding validation for optional profile properties:

 metadatas.push({
   documentId: spaceID,
-  source: space.profile.url,
-  type: 'bodyOfKnowledgeSummary',
-  title: space.profile?.displayName,
+  source: space.profile?.url ?? 'unknown',
+  type: 'bodyOfKnowledgeSummary',
+  title: space.profile?.displayName ?? 'Untitled Space',
 });

Line range hint 21-195: Consider breaking down the large function into smaller, focused components.

The function currently handles multiple responsibilities:

  1. Document processing and summarization
  2. Embedding generation
  3. Collection management
  4. Data persistence

Consider splitting these into separate services or modules for better maintainability and testing:

  • DocumentProcessingService
  • EmbeddingService
  • CollectionManagementService

This refactoring would:

  • Improve testability
  • Enhance maintainability
  • Make the code more modular
  • Allow for better error handling at each layer
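As a rough sketch of what such a split could look like (service names from the list above; method shapes are assumptions):

import { Document } from '@langchain/core/documents';

interface DocumentProcessingService {
  // chunking plus per-document summarization
  process(docs: Document[]): Promise<{ chunks: Document[]; summaries: string[] }>;
}

interface EmbeddingService {
  // embedding generation for the prepared chunks and summaries
  embed(chunks: Document[]): Promise<number[][]>;
}

interface CollectionManagementService {
  // persistence of ids, documents and metadata into the vector collection
  persist(
    ids: string[],
    documents: string[],
    metadatas: Record<string, unknown>[]
  ): Promise<boolean>;
}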
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Files that changed from the base of the PR and between 85ceb2f and 3051143.

📒 Files selected for processing (2)
  • src/embed.ts (3 hunks)
  • src/summarize/document.ts (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • src/summarize/document.ts
🧰 Additional context used
🔇 Additional comments (2)
src/embed.ts (2)

10-13: LGTM: New imports align with summarization objectives.

The added imports support the document summarization functionality as outlined in the PR objectives.


21-25: LGTM: Improved type safety with Pick utility.

The function signature change enhances type safety by explicitly declaring required properties from Space and Profile objects.
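As a generic illustration of the Pick pattern noted here (the stand-in Space and Profile shapes below are assumptions; the real ones come from the generated GraphQL types):

// Stand-in types for illustration only.
interface Profile { url: string; displayName?: string; }
interface Space { id: string; profile: Profile; }

// The signature only demands the properties the function actually reads.
const describeSpace = (space: Pick<Space, 'id' | 'profile'>) =>
  `${space.id}: ${space.profile.displayName ?? 'Untitled Space'}`;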

Comment on lines +98 to +103
const bokDescriptions = new Document({ pageContent: summaries.join('\n') });
const bokChunks = await splitter.splitDocuments([bokDescriptions]);
const bokSummary = await summariseBodyOfKnowledge(bokChunks);
ids.push('body-of-knowledge-summary');
documents.push(bokSummary);


⚠️ Potential issue

Add validation and error handling for body of knowledge summary.

The current implementation needs additional safeguards:

+ if (summaries.length === 0) {
+   logger.warn('No summaries to process for body of knowledge');
+   return false;
+ }

 const bokDescriptions = new Document({ pageContent: summaries.join('\n') });
 const bokChunks = await splitter.splitDocuments([bokDescriptions]);
-const bokSummary = await summariseBodyOfKnowledge(bokChunks);
+let bokSummary;
+try {
+  bokSummary = await summariseBodyOfKnowledge(bokChunks);
+  if (!bokSummary) {
+    logger.error('Failed to generate body of knowledge summary');
+    return false;
+  }
+} catch (error) {
+  logger.error('Error generating body of knowledge summary:', error);
+  return false;
+}
 ids.push('body-of-knowledge-summary');
 documents.push(bokSummary);
📝 Committable suggestion


Suggested change
const bokDescriptions = new Document({ pageContent: summaries.join('\n') });
const bokChunks = await splitter.splitDocuments([bokDescriptions]);
const bokSummary = await summariseBodyOfKnowledge(bokChunks);
ids.push('body-of-knowledge-summary');
documents.push(bokSummary);
if (summaries.length === 0) {
logger.warn('No summaries to process for body of knowledge');
return false;
}
const bokDescriptions = new Document({ pageContent: summaries.join('\n') });
const bokChunks = await splitter.splitDocuments([bokDescriptions]);
let bokSummary;
try {
bokSummary = await summariseBodyOfKnowledge(bokChunks);
if (!bokSummary) {
logger.error('Failed to generate body of knowledge summary');
return false;
}
} catch (error) {
logger.error('Error generating body of knowledge summary:', error);
return false;
}
ids.push('body-of-knowledge-summary');
documents.push(bokSummary);

Comment on lines +79 to +95
metadatas.push({ ...chunk.metadata, embeddingType: 'chunk', chunkIndex });
});

if (doc.pageContent.length > summaryLength) {
try {
const documentSummary = await summarizeDocument(splitted);
ids.push(`${doc.metadata.documentId}-${doc.metadata.type}-summary`);
documents.push(documentSummary);
metadatas.push({ ...doc.metadata, embeddingType: 'summary' });

summaries.push(documentSummary);
} catch (err) {
logger.error(err);
}
} else {
summaries.push(doc.pageContent);
}

🛠️ Refactor suggestion

Enhance error handling and logging in document processing.

While the summarization logic is well-structured, consider these improvements:

  1. Validate the generated summary
  2. Add logging for skipped documents
  3. Handle edge cases more gracefully
 metadatas.push({ ...chunk.metadata, embeddingType: 'chunk', chunkIndex });
});

if (doc.pageContent.length > summaryLength) {
+ logger.debug(`Generating summary for document ${doc.metadata.documentId}`);
  try {
    const documentSummary = await summarizeDocument(splitted);
+   if (!documentSummary) {
+     logger.warn(`Failed to generate summary for ${doc.metadata.documentId}, using original content`);
+     summaries.push(doc.pageContent);
+     return;
+   }
    ids.push(`${doc.metadata.documentId}-${doc.metadata.type}-summary`);
    documents.push(documentSummary);
    metadatas.push({ ...doc.metadata, embeddingType: 'summary' });
    summaries.push(documentSummary);
  } catch (err) {
    logger.error(err);
+   logger.warn(`Error generating summary for ${doc.metadata.documentId}, using original content`);
+   summaries.push(doc.pageContent);
  }
} else {
+ logger.debug(`Document ${doc.metadata.documentId} is below summary threshold, using original content`);
  summaries.push(doc.pageContent);
}
📝 Committable suggestion


Suggested change
metadatas.push({ ...chunk.metadata, embeddingType: 'chunk', chunkIndex });
});
if (doc.pageContent.length > summaryLength) {
try {
const documentSummary = await summarizeDocument(splitted);
ids.push(`${doc.metadata.documentId}-${doc.metadata.type}-summary`);
documents.push(documentSummary);
metadatas.push({ ...doc.metadata, embeddingType: 'summary' });
summaries.push(documentSummary);
} catch (err) {
logger.error(err);
}
} else {
summaries.push(doc.pageContent);
}
metadatas.push({ ...chunk.metadata, embeddingType: 'chunk', chunkIndex });
});
if (doc.pageContent.length > summaryLength) {
logger.debug(`Generating summary for document ${doc.metadata.documentId}`);
try {
const documentSummary = await summarizeDocument(splitted);
if (!documentSummary) {
logger.warn(`Failed to generate summary for ${doc.metadata.documentId}, using original content`);
summaries.push(doc.pageContent);
return;
}
ids.push(`${doc.metadata.documentId}-${doc.metadata.type}-summary`);
documents.push(documentSummary);
metadatas.push({ ...doc.metadata, embeddingType: 'summary' });
summaries.push(documentSummary);
} catch (err) {
logger.error(err);
logger.warn(`Error generating summary for ${doc.metadata.documentId}, using original content`);
summaries.push(doc.pageContent);
}
} else {
logger.debug(`Document ${doc.metadata.documentId} is below summary threshold, using original content`);
summaries.push(doc.pageContent);
}

@valentinyanakiev merged commit d37879f into develop on Oct 23, 2024
2 checks passed
@valentinyanakiev deleted the alkemio-1440/generate-document-summaries branch on October 23, 2024 at 11:06
Successfully merging this pull request may close these issues.

Embed documents summary
3 participants