-
Notifications
You must be signed in to change notification settings - Fork 0
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Generate document summaries #58
Generate document summaries #58
Conversation
WalkthroughThe pull request introduces several changes across multiple files. A new environment variable Changes
Assessment against linked issues
Possibly related PRs
Suggested reviewers
Poem
Thank you for using CodeRabbit. We offer it for free to the OSS community and would appreciate your support in helping us grow. If you find it useful, would you consider giving us a shout-out on your favorite social media? 🪧 TipsChatThere are 3 ways to chat with CodeRabbit:
Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments. CodeRabbit Commands (Invoked using PR comments)
Other keywords and placeholders
CodeRabbit Configuration File (
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 15
🧹 Outside diff range and nitpick comments (8)
.env.default (1)
27-27
: Add documentation for the SUMMARY_LENGTH variable.Consider adding a comment to explain:
- The purpose of this variable
- The unit of measurement (characters/tokens)
- Any model-specific constraints
Example:
+# Maximum length of generated document summaries (in tokens) SUMMARY_LENGTH=10000
src/callout.handlers/base.ts (1)
Line range hint
1-85
: Consider enhancing document preparation for summarizationGiven that this PR focuses on document summarization, consider enhancing the document generation to include metadata that could aid in the summarization process:
- Add a timestamp to track document freshness
- Include document relationships (parent-child between callout and contributions)
- Add content type hints to help the summarization algorithm
Example metadata structure:
metadata: { documentId, source, type, title, timestamp: new Date().toISOString(), relationships: { parentId: callout.id, type: 'contribution' }, contentHints: { format: 'text/markdown', priority: contribution.priority } }This would help the summarization process:
- Prioritize recent content
- Understand document relationships
- Apply appropriate summarization strategies based on content type
🧰 Tools
🪛 Biome
[error] 55-55: Avoid redundant double-negation.
It is not necessary to use double-negation when a value will already be coerced to a boolean.
Unsafe fix: Remove redundant double-negation(lint/complexity/noExtraBooleanCast)
[error] 57-57: Avoid redundant double-negation.
It is not necessary to use double-negation when a value will already be coerced to a boolean.
Unsafe fix: Remove redundant double-negation(lint/complexity/noExtraBooleanCast)
src/space.embed/embed.space.ts (1)
Line range hint
73-79
: Consider enhancing error handling for document summarization.Given that this PR implements document summarization functionality, the error handling could be more specific to distinguish between embedding and summarization failures.
Consider:
- Adding specific error codes for summarization failures
- Including summary-related metadata in error messages
- Structuring the error object to include both embedding and summarization status
Example enhancement:
export enum ErrorCode { VECTOR_INSERT = 'VECTOR_INSERT', SUMMARY_GENERATION = 'SUMMARY_GENERATION', SUMMARY_INTEGRATION = 'SUMMARY_INTEGRATION' } // Enhanced error structure interface IngestError { code?: ErrorCode; message: string; context?: { phase: 'embedding' | 'summarization'; documentCount?: number; failedDocuments?: string[]; }; }src/callout.handlers/link.collection.ts (1)
59-61
: Consider if explicit loader types are necessary.While the type signature change is technically correct, it might be unnecessary since both
PDFLoader
andDocxLoader
extendBaseDocumentLoader
. The current implementation doesn't use any loader-specific features that would require this explicit typing.src/summarize/body.of.knowledge.ts (1)
5-15
: Refactor to avoid duplication of the 'system' messageThe 'system' message is identical in both
summarizePrompt
andrefinePrompt
. Consider defining it once to reduce duplication.Here's how you can refactor the code:
+const systemMessage = [ + 'system', + `You are tasked with concising summaries sent by a user based entirely on the user input. While doing so + preserve as much information as possible like names, references titles, dates, etc.`, +]; const summarizePrompt = ChatPromptTemplate.fromMessages([ - [ - 'system', - `You are tasked with concising summaries sent by a user based entirely on the user input. While doing so - preserve as much information as possible like names, references titles, dates, etc.`, - ], + systemMessage, [ 'human', 'Write a detailed summary, no more than {summaryLength} characters of the following: {context}', ], ]); const refinePrompt = ChatPromptTemplate.fromMessages([ - [ - 'system', - `You are tasked with concising summaries sent by a user based entirely on the user input. While doing so - preserve as much information as possible like names, references titles, dates, etc.`, - ], + systemMessage, [ 'human', `Produce a final detailed summary, no more than {summaryLength} characters. Existing summary up to this point: {currentSummary} New context: {context} Given the new context, refine the original summary.`, ], ]);Also applies to: 16-33
src/summarize/document.ts (1)
6-6
: Refine the comment for clarity and professionalismThe comment on line 6 reads:
// the system message is the same for both prompts but putting it into a variable and typing it is too hard...
Consider rephrasing or removing this comment to maintain a professional tone and clarity in the codebase.src/summarize/graph.ts (1)
58-64
: Consider adding error handling for the state graph compilationWhile compiling the state graph, it's beneficial to include error handling to catch any issues during the graph construction.
const graph = new StateGraph(SummarizeAnnotation) .addNode('initialSummary', initialSummary) .addNode('refineSummary', refineSummary) .addEdge(START, 'initialSummary') .addConditionalEdges('initialSummary', shouldRefine, ['refineSummary', END]) .addConditionalEdges('refineSummary', shouldRefine, ['refineSummary', END]) - .compile(); + .compile() + .catch((error) => { + throw new Error(`Failed to compile the state graph: ${error.message}`); + });src/embed.ts (1)
105-105
: Use actual space name instead of placeholderThe
title
field is set to the placeholder'space name'
. Replace it with the actual space name to ensure accurate metadata.Consider updating the code as follows:
- title: 'space name', + title: spaceID, // or use the actual space name if available
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
⛔ Files ignored due to path filters (1)
package-lock.json
is excluded by!**/package-lock.json
📒 Files selected for processing (9)
- .env.default (1 hunks)
- package.json (1 hunks)
- src/callout.handlers/base.ts (1 hunks)
- src/callout.handlers/link.collection.ts (1 hunks)
- src/embed.ts (3 hunks)
- src/space.embed/embed.space.ts (1 hunks)
- src/summarize/body.of.knowledge.ts (1 hunks)
- src/summarize/document.ts (1 hunks)
- src/summarize/graph.ts (1 hunks)
🧰 Additional context used
🪛 Biome
src/callout.handlers/base.ts
[error] 55-55: Avoid redundant double-negation.
It is not necessary to use double-negation when a value will already be coerced to a boolean.
Unsafe fix: Remove redundant double-negation(lint/complexity/noExtraBooleanCast)
[error] 57-57: Avoid redundant double-negation.
It is not necessary to use double-negation when a value will already be coerced to a boolean.
Unsafe fix: Remove redundant double-negation(lint/complexity/noExtraBooleanCast)
🔇 Additional comments (11)
.env.default (1)
27-27
: Verify if 10000 is within safe limits for the model.The summary length of 10000 is significantly larger than the chunk size (1000). While this aligns with the PR objective to maximize chunk sizes, please verify that this limit is safe for the deployed model (
deploy-gpt-35-turbo
).src/callout.handlers/base.ts (1)
52-52
: LGTM: Improved code readabilityThe added empty line helps separate logical blocks of code, making it easier to read and maintain.
package.json (2)
53-54
: Verify LangChain dependency compatibility.The addition of
@langchain/langgraph
and@langchain/mistralai
aligns with the PR objectives for document summarization. However, let's ensure version compatibility with the existinglangchain
package (v0.2.2).#!/bin/bash # Description: Check for potential version conflicts in package-lock.json echo "Checking package-lock.json for LangChain dependencies..." if [ -f "package-lock.json" ]; then jq '.dependencies | with_entries(select(.key | contains("langchain")))' package-lock.json fi # Look for LangChain imports to ensure correct usage echo "Checking LangChain import patterns..." rg -l "@langchain/(langgraph|mistralai)" --type tsConsider adding a comment in the package.json to document the interdependency between these LangChain packages and their version constraints for future maintenance.
50-51
: Verify the necessity of re-added GraphQL dependencies.The GraphQL-related packages that were previously removed have been re-added. Let's verify if these dependencies are still actively used in the codebase.
Also applies to: 55-55, 59-60
✅ Verification successful
Let me gather more specific information about GraphQL usage in the codebase.
I'm looking for evidence of active GraphQL usage in the codebase, including schema files, operations, generated types, and build scripts. This will help determine if these dependencies are actually needed or if they're remnants that can be safely removed.
Based on the comprehensive results, I can now generate the final response:
GraphQL dependencies are actively used and required
The GraphQL-related packages are actively used throughout the codebase as evidenced by:
- Multiple GraphQL schema files and fragments in the
graphql/
directory- Active GraphQL client implementation in
src/graphql.client/AlkemioCliClient.ts
- Generated types being used across multiple files (9 TypeScript files importing GraphQL operations/generated types)
- Codegen configuration (
codegen.yml
) and npm script for code generation🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash # Description: Check if GraphQL dependencies are used in the codebase echo "Checking for GraphQL usage..." # Look for GraphQL imports and usage rg -l "graphql|GraphQL" --type ts echo "Checking for specific package imports..." # Check for specific package imports rg -l "@graphql-codegen" --type ts rg -l "graphql-upload" --type tsLength of output: 470
Script:
#!/bin/bash # Check for GraphQL schema, operations, and codegen configuration echo "Checking for GraphQL related files..." fd -e graphql fd "codegen.yml|codegen.yaml|codegen.json" echo -e "\nChecking for GraphQL operations and types usage..." rg -l "import.*gql from|import.*generated" --type ts echo -e "\nChecking for GraphQL client setup..." rg -A 5 "GraphQLClient|useQuery|useMutation" --type ts echo -e "\nChecking package.json scripts for codegen..." if [ -f "package.json" ]; then jq '.scripts | with_entries(select(.value | contains("codegen")))' package.json fiLength of output: 2865
src/space.embed/embed.space.ts (2)
74-74
: LGTM! Error logging enhancement.Good addition of error logging before setting the error result. This will help with debugging and monitoring.
Line range hint
38-72
: Consider integrating document summarization before embedding.Based on the PR objectives to "generate summaries for documents using LangGraph," consider adding a summarization step before the embedding process.
Let's verify if summarization is implemented elsewhere:
src/callout.handlers/link.collection.ts (1)
59-61
: Review document chunking strategy for summarization.Given the PR's objective to maximize chunk sizes for better summarization, consider reviewing the document loading configuration. For example, the PDFLoader's
splitPages: false
setting affects how content is chunked.Let's verify the current chunking configurations across loaders:
✅ Verification successful
Let me gather more information about the document loading and chunking configurations by checking other loaders and related code:
Based on the gathered information, I can now provide the final response:
Document chunking strategy appears to be consistently implemented
The codebase already implements a well-defined chunking strategy using
RecursiveCharacterTextSplitter
with explicitchunkSize
andchunkOverlap
parameters insrc/embed.ts
. The PDFLoader'ssplitPages: false
configuration insrc/callout.handlers/link.collection.ts
aligns with this strategy by ensuring the PDF content is processed as a single document before being chunked by the text splitter.
- The chunking is centralized in
src/embed.ts
where documents from all loaders are processed uniformly- The PDFLoader's configuration complements this by preventing premature content splitting
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash # Search for chunk size configurations in loaders echo "Searching for chunk-related configurations in loaders..." rg -i "chunk|split|segment" "src/loaders" # Check PDFLoader usage and configurations echo "Checking PDF loader configurations..." ast-grep --pattern 'new PDFLoader($_, $_)'Length of output: 417
Script:
#!/bin/bash # Check for all document loader implementations and their configurations echo "Searching for document loader implementations..." ast-grep --pattern 'class $_ extends BaseDocumentLoader' # Check for chunking related configurations in the entire codebase echo "Checking chunking configurations..." rg -i "chunksize|chunkoverlap|splitter" -A 3 # Check for text splitter implementations echo "Checking text splitter usage..." ast-grep --pattern 'new TextSplitter($$$)'Length of output: 2949
src/summarize/body.of.knowledge.ts (1)
13-13
: Ensure 'summaryLength' variable is correctly provided to the promptsThe prompt templates use the
{summaryLength}
variable, but it's not clear if this variable is being supplied when the prompts are rendered. Ensure thatsummaryLength
is correctly passed into the prompt context to avoid runtime errors.Please check whether
summaryLength
is being provided in the context when invoking the prompts. If not, you may need to include it in the parameters or fetch it from the configuration.Also applies to: 24-24
src/embed.ts (3)
82-92
: EnsuresummarizeDocument
handles expected input formatConfirm that
summarizeDocument
accepts an array of documents (splitted
) as input. If it expects a single document or a different format, adjust the input accordingly.Consider verifying the expected input for
summarizeDocument
and modify if necessary.
77-77
: IncludechunkIndex
consistently in metadataAdding
chunkIndex
to metadata enhances traceability of chunks. Ensure that this inclusion is consistent throughout the codebase where chunks are processed.
10-12
: Imports added for summarization functionalityThe new imports for summarization are appropriate and necessary for the added features.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 6
🧹 Outside diff range and nitpick comments (6)
src/summarize/body.of.knowledge.ts (2)
9-11
: Consider improving clarity of the system messageThe word "concising" is not commonly used and might be unclear. Consider rephrasing for better clarity.
- 'You are tasked with concising summaries based entirely on the user input. While doing so preserve as much information as possible like names, references titles, dates, etc.' + 'You are tasked with creating concise summaries based entirely on the user input. While doing so, preserve as much information as possible such as names, reference titles, dates, etc.'
19-31
: Consider normalizing whitespace in the template stringThe template string has inconsistent indentation which might affect readability of the code (though not the output).
HumanMessagePromptTemplate.fromTemplate( `Produce a final detailed summary, no more than {summaryLength} characters. - Existing summary up to this point: + Existing summary up to this point: - {currentSummary} + {currentSummary} - New context: {context} + New context: {context} - Given the new context, refine the original summary.` + Given the new context, refine the original summary.` ),src/summarize/document.ts (1)
19-30
: Improve template string formatting for clarity.The template includes extra newlines that might affect the output format. Consider making it more concise.
HumanMessagePromptTemplate.fromTemplate( `Produce a final detailed summary, no more than {summaryLength} characters. Existing summary up to this point: {currentSummary} - New context: {context} - Given the new context, refine the original summary.` + New context: {context} + Given the new context, refine the original summary while maintaining consistency.` ),src/callout.handlers/base.ts (1)
Line range hint
61-77
: Add error handling for document generationThe document generation process should handle potential errors gracefully.
Consider wrapping the document generation in a try-catch:
if (docLike) { + try { const { pageContent, documentId, source, type, title } = generateDocument(docLike); result.push( new Document({ pageContent, metadata: { documentId, source, type, title, }, }) ); + } catch (error) { + logger.error( + `Failed to generate document for contribution in callout ${callout.id}:`, + error + ); + } }src/summarize/graph.ts (1)
27-39
: Consider adding return type annotation for better type safety.The function's return type can be explicitly annotated to improve type safety and documentation.
export const buildGraph = ( summarizePrompt: ChatPromptTemplate, refinePrompt: ChatPromptTemplate -) => { +): StateGraph<typeof SummarizeAnnotation> => {src/embed.ts (1)
21-25
: Consider breaking down this function for better maintainability.While the type changes improve safety, this function has grown to handle multiple responsibilities:
- Document processing
- Summarization
- Embedding generation
- Database operations
Consider splitting this into smaller, focused functions:
async function processDocuments(docs: Document[], splitter: TextSplitter): Promise<ProcessedDocs>; async function generateSummaries(docs: ProcessedDocs): Promise<Summaries>; async function generateEmbeddings(docs: Summaries): Promise<Embeddings>; async function persistToDatabase(embeddings: Embeddings): Promise<boolean>;This would improve:
- Testability
- Maintainability
- Error handling
- Code reuse
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
⛔ Files ignored due to path filters (1)
package-lock.json
is excluded by!**/package-lock.json
📒 Files selected for processing (7)
- package.json (2 hunks)
- src/callout.handlers/base.ts (1 hunks)
- src/embed.ts (3 hunks)
- src/space.embed/embed.space.ts (2 hunks)
- src/summarize/body.of.knowledge.ts (1 hunks)
- src/summarize/document.ts (1 hunks)
- src/summarize/graph.ts (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
- package.json
- src/space.embed/embed.space.ts
🧰 Additional context used
📓 Learnings (1)
src/summarize/graph.ts (1)
Learnt from: valeksiev PR: alkem-io/virtual-contributor-ingest-space#58 File: src/summarize/graph.ts:36-49 Timestamp: 2024-10-23T08:15:50.839Z Learning: In `src/summarize/graph.ts`, the `shouldRefine` function handles index bounds checking for `input.index` before invoking `refineSummary`, so additional bounds checks within `refineSummary` are unnecessary.
🔇 Additional comments (11)
src/summarize/body.of.knowledge.ts (3)
1-8
: LGTM! Imports are well-structuredThe imports are correctly organized, using the recommended
@langchain/core
package for prompt templates.
13-18
: LGTM! Well-structured prompt templateThe prompt template is correctly implemented using appropriate message types and includes necessary parameters.
33-37
: Verify chunk size handlingThe PR objectives mention maximizing chunk sizes, but there's no explicit handling of chunk sizes in this implementation.
src/summarize/document.ts (2)
1-7
: LGTM! Well-structured imports.The imports are specific, well-scoped, and follow good practices by using named imports.
13-18
: Verify the summaryLength parameter handling.The prompt uses {summaryLength} but its import/definition isn't visible in this file. Ensure it's properly passed through the buildGraph function.
✅ Verification successful
Based on the results, I can now generate the final response since I have sufficient information about the
summaryLength
parameter handling.The summaryLength parameter is properly defined and passed through the system
The parameter is well-handled throughout the codebase:
- Defined in
src/summarize/graph.ts
as an environment variableSUMMARY_LENGTH
with a default value of 10000- Properly passed to the summarization chain in
graph.ts
during invocation- Correctly referenced in prompt templates across document summarization files
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash # Description: Verify summaryLength parameter handling # Test: Check if summaryLength is defined in graph.ts and properly passed to this function rg -A 5 "summaryLength" "./src/summarize/"Length of output: 2586
src/callout.handlers/base.ts (1)
Line range hint
61-77
: Consider integrating document summarizationBased on the PR objectives to generate document summaries using LangGraph, this section should integrate with the summarization functionality.
Consider these approaches:
- Add summarization as part of document generation
- Create a separate summarization step after document generation
Example integration:
if (docLike) { try { const { pageContent, documentId, source, type, title } = generateDocument(docLike); // Option 1: Integrate with summarization const summary = await summarizeDocument(pageContent); result.push( new Document({ pageContent, metadata: { documentId, source, type, title, summary, // Include summary in metadata }, }) ); } catch (error) { logger.error(`Failed to process document:`, error); } }Let's verify if the summarization functionality is available:
src/summarize/graph.ts (3)
1-25
: LGTM! Environment setup is well-structured.The implementation includes proper error handling for environment variables and appropriate model configuration.
61-77
: LGTM! Clear and maintainable control flow implementation.The current implementation of
shouldRefine
with explicit END state handling is more readable and maintainable than a boolean return value. It clearly shows the state transition logic which is crucial for understanding the graph flow.
40-44
:⚠️ Potential issueAdd validation for empty chunks array.
The initialSummary function should validate that chunks array is not empty before accessing the first element.
const initialSummary = async (input: typeof SummarizeAnnotation.State) => { + if (!input.chunks?.length) { + throw new Error('No chunks available for summarization.'); + } const context = input.chunks[0].pageContent; const summary = await summaryChain.invoke({ context, summaryLength }); return { summary: summary.content, index: 1 }; };Likely invalid or redundant comment.
src/embed.ts (2)
10-13
: LGTM! Well-organized imports.The new imports are properly scoped and align with the added summarization functionality.
79-79
: LGTM! Improved metadata tracking.The addition of
embeddingType
andchunkIndex
to metadata enhances traceability.
export const summariseBodyOfKnowledge = async (chunks: Document[]) => { | ||
const graph = buildGraph(summarizePrompt, refinePrompt); | ||
const final = await graph.invoke({ chunks }); | ||
return final.summary; | ||
}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🛠️ Refactor suggestion
Consider adding error handling and type definitions
The function implementation is clean but could benefit from:
- Error handling for potential failures in graph operations
- Explicit return type definition
-export const summariseBodyOfKnowledge = async (chunks: Document[]) => {
+export const summariseBodyOfKnowledge = async (chunks: Document[]): Promise<string> => {
+ try {
const graph = buildGraph(summarizePrompt, refinePrompt);
const final = await graph.invoke({ chunks });
return final.summary;
+ } catch (error) {
+ throw new Error(`Failed to summarize body of knowledge: ${error.message}`);
+ }
};
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
export const summariseBodyOfKnowledge = async (chunks: Document[]) => { | |
const graph = buildGraph(summarizePrompt, refinePrompt); | |
const final = await graph.invoke({ chunks }); | |
return final.summary; | |
}; | |
export const summariseBodyOfKnowledge = async (chunks: Document[]): Promise<string> => { | |
try { | |
const graph = buildGraph(summarizePrompt, refinePrompt); | |
const final = await graph.invoke({ chunks }); | |
return final.summary; | |
} catch (error) { | |
throw new Error(`Failed to summarize body of knowledge: ${error.message}`); | |
} | |
}; |
export const summarizeDocument = async (chunks: Document[]) => { | ||
const graph = buildGraph(summarizePrompt, refinePrompt); | ||
const final = await graph.invoke({ chunks }); | ||
return final.summary; | ||
}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add error handling and input validation.
The function should include:
- Input validation for empty chunks array
- Error handling for graph.invoke
- Type definition for the graph result
-export const summarizeDocument = async (chunks: Document[]) => {
+export const summarizeDocument = async (chunks: Document[]): Promise<string> => {
+ if (!chunks.length) {
+ throw new Error('No documents provided for summarization');
+ }
+
const graph = buildGraph(summarizePrompt, refinePrompt);
- const final = await graph.invoke({ chunks });
- return final.summary;
+ try {
+ const final = await graph.invoke({ chunks });
+ return final.summary;
+ } catch (error) {
+ throw new Error(`Failed to generate summary: ${error.message}`);
+ }
};
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
export const summarizeDocument = async (chunks: Document[]) => { | |
const graph = buildGraph(summarizePrompt, refinePrompt); | |
const final = await graph.invoke({ chunks }); | |
return final.summary; | |
}; | |
export const summarizeDocument = async (chunks: Document[]): Promise<string> => { | |
if (!chunks.length) { | |
throw new Error('No documents provided for summarization'); | |
} | |
const graph = buildGraph(summarizePrompt, refinePrompt); | |
try { | |
const final = await graph.invoke({ chunks }); | |
return final.summary; | |
} catch (error) { | |
throw new Error(`Failed to generate summary: ${error.message}`); | |
} | |
}; |
for (const contribution of callout.contributions || []) { | ||
let docLike; | ||
if (contribution.link) { | ||
docLike = contribution.link; | ||
} else if (contribution.post) { | ||
docLike = contribution.post; | ||
} | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Enhance type safety and error handling for contribution processing
The current implementation could benefit from improved type safety and error handling:
Consider applying these changes:
+ type DocLikeContent = {
+ framing?: string;
+ // add other relevant properties based on your schema
+ };
for (const contribution of callout.contributions || []) {
- let docLike;
+ let docLike: DocLikeContent | undefined;
if (contribution.link) {
docLike = contribution.link;
} else if (contribution.post) {
docLike = contribution.post;
} else {
+ logger.warn(`Skipping contribution without link or post for callout ${documentId}`);
+ continue;
}
Committable suggestion was skipped due to low confidence.
if (doc.pageContent.length > summaryLength) { | ||
try { | ||
const documentSummary = await summarizeDocument(splitted); | ||
ids.push(`${doc.metadata.documentId}-${doc.metadata.type}-summary`); | ||
documents.push(documentSummary); | ||
metadatas.push({ ...doc.metadata, embeddingType: 'summary' }); | ||
|
||
summaries.push(documentSummary); | ||
} catch (err) { | ||
logger.error(err); | ||
} | ||
} else { | ||
summaries.push(doc.pageContent); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🛠️ Refactor suggestion
Enhance error handling and documentation.
The summarization logic is sound but could be more robust.
Consider these improvements:
+ // Threshold for document summarization
+ const SUMMARY_LENGTH_THRESHOLD = summaryLength;
- if (doc.pageContent.length > summaryLength) {
+ if (doc.pageContent.length > SUMMARY_LENGTH_THRESHOLD) {
try {
const documentSummary = await summarizeDocument(splitted);
+ if (!documentSummary) {
+ logger.warn(`Failed to generate summary for document ${doc.metadata.documentId}`);
+ summaries.push(doc.pageContent);
+ continue;
+ }
ids.push(`${doc.metadata.documentId}-${doc.metadata.type}-summary`);
documents.push(documentSummary);
metadatas.push({ ...doc.metadata, embeddingType: 'summary' });
summaries.push(documentSummary);
} catch (err) {
logger.error(err);
+ logger.warn(`Falling back to original content for document ${doc.metadata.documentId}`);
+ summaries.push(doc.pageContent);
}
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
if (doc.pageContent.length > summaryLength) { | |
try { | |
const documentSummary = await summarizeDocument(splitted); | |
ids.push(`${doc.metadata.documentId}-${doc.metadata.type}-summary`); | |
documents.push(documentSummary); | |
metadatas.push({ ...doc.metadata, embeddingType: 'summary' }); | |
summaries.push(documentSummary); | |
} catch (err) { | |
logger.error(err); | |
} | |
} else { | |
summaries.push(doc.pageContent); | |
} | |
const SUMMARY_LENGTH_THRESHOLD = summaryLength; | |
if (doc.pageContent.length > SUMMARY_LENGTH_THRESHOLD) { | |
try { | |
const documentSummary = await summarizeDocument(splitted); | |
if (!documentSummary) { | |
logger.warn(`Failed to generate summary for document ${doc.metadata.documentId}`); | |
summaries.push(doc.pageContent); | |
continue; | |
} | |
ids.push(`${doc.metadata.documentId}-${doc.metadata.type}-summary`); | |
documents.push(documentSummary); | |
metadatas.push({ ...doc.metadata, embeddingType: 'summary' }); | |
summaries.push(documentSummary); | |
} catch (err) { | |
logger.error(err); | |
logger.warn(`Falling back to original content for document ${doc.metadata.documentId}`); | |
summaries.push(doc.pageContent); | |
} | |
} else { | |
summaries.push(doc.pageContent); | |
} |
src/embed.ts
Outdated
const bokDescriptions = new Document({ pageContent: summaries.join('\n') }); | ||
const bokChunks = await splitter.splitDocuments([bokDescriptions]); | ||
const bokSummary = await summariseBodyOfKnowledge(bokChunks); | ||
ids.push('body-of-knowledge-summary'); | ||
documents.push(bokSummary); | ||
|
||
metadatas.push({ | ||
documentId: spaceID, | ||
soruce: space.profile.url, | ||
type: 'bodyOfKnowledgeSummary', | ||
title: space.profile?.displayName, | ||
}); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🛠️ Refactor suggestion
Add validation and fix the 'source' typo.
- Fix the typo in metadata:
metadatas.push({
documentId: spaceID,
- soruce: space.profile.url,
+ source: space.profile.url,
type: 'bodyOfKnowledgeSummary',
title: space.profile?.displayName,
});
- Add validation and error handling:
+ if (summaries.length === 0) {
+ logger.warn('No summaries generated, skipping body of knowledge generation');
+ return true;
+ }
const bokDescriptions = new Document({ pageContent: summaries.join('\n') });
const bokChunks = await splitter.splitDocuments([bokDescriptions]);
- const bokSummary = await summariseBodyOfKnowledge(bokChunks);
+ let bokSummary;
+ try {
+ bokSummary = await summariseBodyOfKnowledge(bokChunks);
+ if (!bokSummary) {
+ logger.error('Failed to generate body of knowledge summary');
+ return false;
+ }
+ } catch (err) {
+ logger.error('Error generating body of knowledge summary:', err);
+ return false;
+ }
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
const bokDescriptions = new Document({ pageContent: summaries.join('\n') }); | |
const bokChunks = await splitter.splitDocuments([bokDescriptions]); | |
const bokSummary = await summariseBodyOfKnowledge(bokChunks); | |
ids.push('body-of-knowledge-summary'); | |
documents.push(bokSummary); | |
metadatas.push({ | |
documentId: spaceID, | |
soruce: space.profile.url, | |
type: 'bodyOfKnowledgeSummary', | |
title: space.profile?.displayName, | |
}); | |
if (summaries.length === 0) { | |
logger.warn('No summaries generated, skipping body of knowledge generation'); | |
return true; | |
} | |
const bokDescriptions = new Document({ pageContent: summaries.join('\n') }); | |
const bokChunks = await splitter.splitDocuments([bokDescriptions]); | |
let bokSummary; | |
try { | |
bokSummary = await summariseBodyOfKnowledge(bokChunks); | |
if (!bokSummary) { | |
logger.error('Failed to generate body of knowledge summary'); | |
return false; | |
} | |
} catch (err) { | |
logger.error('Error generating body of knowledge summary:', err); | |
return false; | |
} | |
ids.push('body-of-knowledge-summary'); | |
documents.push(bokSummary); | |
metadatas.push({ | |
documentId: spaceID, | |
source: space.profile.url, | |
type: 'bodyOfKnowledgeSummary', | |
title: space.profile?.displayName, | |
}); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 2
🧹 Outside diff range and nitpick comments (2)
src/embed.ts (2)
104-109
: Add validation for optional profile properties.While the metadata structure is correct, consider adding validation for optional profile properties:
metadatas.push({ documentId: spaceID, - source: space.profile.url, - type: 'bodyOfKnowledgeSummary', - title: space.profile?.displayName, + source: space.profile?.url ?? 'unknown', + type: 'bodyOfKnowledgeSummary', + title: space.profile?.displayName ?? 'Untitled Space', });
Line range hint
21-195
: Consider breaking down the large function into smaller, focused components.The function currently handles multiple responsibilities:
- Document processing and summarization
- Embedding generation
- Collection management
- Data persistence
Consider splitting these into separate services or modules for better maintainability and testing:
- DocumentProcessingService
- EmbeddingService
- CollectionManagementService
This refactoring would:
- Improve testability
- Enhance maintainability
- Make the code more modular
- Allow for better error handling at each layer
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (2)
- src/embed.ts (3 hunks)
- src/summarize/document.ts (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- src/summarize/document.ts
🧰 Additional context used
🔇 Additional comments (2)
src/embed.ts (2)
10-13
: LGTM: New imports align with summarization objectives.The added imports support the document summarization functionality as outlined in the PR objectives.
21-25
: LGTM: Improved type safety with Pick utility.The function signature change enhances type safety by explicitly declaring required properties from Space and Profile objects.
const bokDescriptions = new Document({ pageContent: summaries.join('\n') }); | ||
const bokChunks = await splitter.splitDocuments([bokDescriptions]); | ||
const bokSummary = await summariseBodyOfKnowledge(bokChunks); | ||
ids.push('body-of-knowledge-summary'); | ||
documents.push(bokSummary); | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add validation and error handling for body of knowledge summary.
The current implementation needs additional safeguards:
+ if (summaries.length === 0) {
+ logger.warn('No summaries to process for body of knowledge');
+ return false;
+ }
const bokDescriptions = new Document({ pageContent: summaries.join('\n') });
const bokChunks = await splitter.splitDocuments([bokDescriptions]);
-const bokSummary = await summariseBodyOfKnowledge(bokChunks);
+let bokSummary;
+try {
+ bokSummary = await summariseBodyOfKnowledge(bokChunks);
+ if (!bokSummary) {
+ logger.error('Failed to generate body of knowledge summary');
+ return false;
+ }
+} catch (error) {
+ logger.error('Error generating body of knowledge summary:', error);
+ return false;
+}
ids.push('body-of-knowledge-summary');
documents.push(bokSummary);
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
const bokDescriptions = new Document({ pageContent: summaries.join('\n') }); | |
const bokChunks = await splitter.splitDocuments([bokDescriptions]); | |
const bokSummary = await summariseBodyOfKnowledge(bokChunks); | |
ids.push('body-of-knowledge-summary'); | |
documents.push(bokSummary); | |
if (summaries.length === 0) { | |
logger.warn('No summaries to process for body of knowledge'); | |
return false; | |
} | |
const bokDescriptions = new Document({ pageContent: summaries.join('\n') }); | |
const bokChunks = await splitter.splitDocuments([bokDescriptions]); | |
let bokSummary; | |
try { | |
bokSummary = await summariseBodyOfKnowledge(bokChunks); | |
if (!bokSummary) { | |
logger.error('Failed to generate body of knowledge summary'); | |
return false; | |
} | |
} catch (error) { | |
logger.error('Error generating body of knowledge summary:', error); | |
return false; | |
} | |
ids.push('body-of-knowledge-summary'); | |
documents.push(bokSummary); |
metadatas.push({ ...chunk.metadata, embeddingType: 'chunk', chunkIndex }); | ||
}); | ||
|
||
if (doc.pageContent.length > summaryLength) { | ||
try { | ||
const documentSummary = await summarizeDocument(splitted); | ||
ids.push(`${doc.metadata.documentId}-${doc.metadata.type}-summary`); | ||
documents.push(documentSummary); | ||
metadatas.push({ ...doc.metadata, embeddingType: 'summary' }); | ||
|
||
summaries.push(documentSummary); | ||
} catch (err) { | ||
logger.error(err); | ||
} | ||
} else { | ||
summaries.push(doc.pageContent); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🛠️ Refactor suggestion
Enhance error handling and logging in document processing.
While the summarization logic is well-structured, consider these improvements:
- Validate the generated summary
- Add logging for skipped documents
- Handle edge cases more gracefully
metadatas.push({ ...chunk.metadata, embeddingType: 'chunk', chunkIndex });
});
if (doc.pageContent.length > summaryLength) {
+ logger.debug(`Generating summary for document ${doc.metadata.documentId}`);
try {
const documentSummary = await summarizeDocument(splitted);
+ if (!documentSummary) {
+ logger.warn(`Failed to generate summary for ${doc.metadata.documentId}, using original content`);
+ summaries.push(doc.pageContent);
+ return;
+ }
ids.push(`${doc.metadata.documentId}-${doc.metadata.type}-summary`);
documents.push(documentSummary);
metadatas.push({ ...doc.metadata, embeddingType: 'summary' });
summaries.push(documentSummary);
} catch (err) {
logger.error(err);
+ logger.warn(`Error generating summary for ${doc.metadata.documentId}, using original content`);
+ summaries.push(doc.pageContent);
}
} else {
+ logger.debug(`Document ${doc.metadata.documentId} is below summary threshold, using original content`);
summaries.push(doc.pageContent);
}
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
metadatas.push({ ...chunk.metadata, embeddingType: 'chunk', chunkIndex }); | |
}); | |
if (doc.pageContent.length > summaryLength) { | |
try { | |
const documentSummary = await summarizeDocument(splitted); | |
ids.push(`${doc.metadata.documentId}-${doc.metadata.type}-summary`); | |
documents.push(documentSummary); | |
metadatas.push({ ...doc.metadata, embeddingType: 'summary' }); | |
summaries.push(documentSummary); | |
} catch (err) { | |
logger.error(err); | |
} | |
} else { | |
summaries.push(doc.pageContent); | |
} | |
metadatas.push({ ...chunk.metadata, embeddingType: 'chunk', chunkIndex }); | |
}); | |
if (doc.pageContent.length > summaryLength) { | |
logger.debug(`Generating summary for document ${doc.metadata.documentId}`); | |
try { | |
const documentSummary = await summarizeDocument(splitted); | |
if (!documentSummary) { | |
logger.warn(`Failed to generate summary for ${doc.metadata.documentId}, using original content`); | |
summaries.push(doc.pageContent); | |
return; | |
} | |
ids.push(`${doc.metadata.documentId}-${doc.metadata.type}-summary`); | |
documents.push(documentSummary); | |
metadatas.push({ ...doc.metadata, embeddingType: 'summary' }); | |
summaries.push(documentSummary); | |
} catch (err) { | |
logger.error(err); | |
logger.warn(`Error generating summary for ${doc.metadata.documentId}, using original content`); | |
summaries.push(doc.pageContent); | |
} | |
} else { | |
logger.debug(`Document ${doc.metadata.documentId} is below summary threshold, using original content`); | |
summaries.push(doc.pageContent); | |
} |
Fixes alkem-io/alkemio#1440
Summary by CodeRabbit
New Features
SUMMARY_LENGTH
to enhance document summarization.@langchain/langgraph
and@langchain/mistralai
.Bug Fixes
Documentation