
Ingest space 63/ingest knowledge base #64

Merged
merged 4 commits on Jan 23, 2025

Conversation


valeksiev (Collaborator) commented Jan 20, 2025

Fixes: #63

Summary by CodeRabbit

Release Notes

  • Configuration

    • Updated chunk processing parameters for improved data segmentation.
  • GraphQL

    • Enhanced GraphQL fragments with new location and profile fields.
    • Added new fragments for callouts and knowledge base ingestion.
    • Introduced a new query for knowledge base ingestion.
  • Dependency Management

    • Updated package name to @alkemio/ingest-alkemio-data.
    • Added langsmith dependency.
  • Feature Enhancements

    • Introduced new functions for embedding knowledge bases and processing callouts.
    • Improved error handling and logging mechanisms.
  • Event Handling

    • Replaced space-specific events with more generic body of knowledge events.
    • Updated event bus and client to support new ingestion types.


coderabbitai bot commented Jan 20, 2025

Walkthrough

This pull request introduces a comprehensive refactoring of the ingestion service, transitioning from a space-centric to a more generalized body of knowledge ingestion approach. The changes include modifications to GraphQL fragments, event handling, client interactions, and data processing mechanisms. Key updates involve creating new types for body of knowledge ingestion, adding support for knowledge base ingestion, and restructuring the event bus and client interactions to support a more flexible data ingestion model.

Changes

File Change Summary
.env.default Chunk processing parameters modified: CHUNK_SIZE increased from 1000 to 9000, CHUNK_OVERLAP changed from 100 to 500
package.json Package renamed to @alkemio/ingest-alkemio-data, added langsmith dependency
graphql/fragments/* Added new fragments like CalloutFields, ProfileFields with expanded data structures
src/event.bus/events/* Replaced space-specific events with more generic body of knowledge events
src/data.readers/* Added new methods like embedSpace, embedKnowledgeBase to support flexible ingestion
src/graphql.client/AlkemioCliClient.ts Added ingestKnowledgeBase method

Assessment against linked issues

Objective Addressed Explanation
Ingest KnowledgeBase New methods and event handling added to support knowledge base ingestion
Multiple context types Partial implementation with introduction of BodyOfKnowledgeType enum

Possibly related PRs

  • #4533: Whiteboard content handling modifications
  • #4763: CreateWhiteboardInput enhancements for profile visuals
  • #7388: Dialog for managing callouts in VC creation process

Poem

🐰 Ingestion Rabbit's Delight

From spaces to knowledge, we take flight,
Fragments dancing, events so bright,
Chunks of wisdom, overlapping tight,
A new ingestion model shines with might!

🌟 Code hopping with pure delight! 🚀

🪧 Tips

Chat

There are 3 ways to chat with CodeRabbit:

  • Review comments: Directly reply to a review comment made by CodeRabbit. Example:
    • I pushed a fix in commit <commit_id>, please review it.
    • Generate unit testing code for this file.
    • Open a follow-up GitHub issue for this discussion.
  • Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query. Examples:
    • @coderabbitai generate unit testing code for this file.
    • @coderabbitai modularize this function.
  • PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
    • @coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.
    • @coderabbitai read src/utils.ts and generate unit testing code.
    • @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.
    • @coderabbitai help me debug CodeRabbit configuration file.

Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments.

CodeRabbit Commands (Invoked using PR comments)

  • @coderabbitai pause to pause the reviews on a PR.
  • @coderabbitai resume to resume the paused reviews.
  • @coderabbitai review to trigger an incremental review. This is useful when automatic reviews are disabled for the repository.
  • @coderabbitai full review to do a full review from scratch and review all the files again.
  • @coderabbitai summary to regenerate the summary of the PR.
  • @coderabbitai generate docstrings to generate docstrings for this PR. (Beta)
  • @coderabbitai resolve resolve all the CodeRabbit review comments.
  • @coderabbitai configuration to show the current CodeRabbit configuration for the repository.
  • @coderabbitai help to get help.

Other keywords and placeholders

  • Add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.
  • Add @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
  • Add @coderabbitai anywhere in the PR title to generate the title automatically.

CodeRabbit Configuration File (.coderabbit.yaml)

  • You can programmatically configure CodeRabbit by adding a .coderabbit.yaml file to the root of your repository.
  • Please see the configuration documentation for more information.
  • If your editor has YAML language server enabled, you can add the path at the top of this file to enable auto-completion and validation: # yaml-language-server: $schema=https://coderabbit.ai/integrations/schema.v2.json
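A minimal example of such a file is sketched below. The schema comment is the one quoted above; the keys shown (reviews, profile, auto_review) are assumptions based on commonly documented CodeRabbit options, so consult the configuration documentation for the authoritative schema:

```yaml
# yaml-language-server: $schema=https://coderabbit.ai/integrations/schema.v2.json
reviews:
  profile: chill        # matches the "CHILL" review profile used in this PR
  auto_review:
    enabled: true       # assumption: toggles automatic incremental reviews
```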

Documentation and Community

  • Visit our Documentation for detailed information on how to use CodeRabbit.
  • Join our Discord Community to get help, request features, and share feedback.
  • Follow us on X/Twitter for updates and announcements.


coderabbitai bot left a comment


Actionable comments posted: 9

🔭 Outside diff range comments (2)
src/event.bus/connection.ts (2)

Line range hint 47-53: Update queue configuration to reflect new event types.

The queue names in the configuration still reference "space" (RABBITMQ_INGEST_SPACE_QUEUE, RABBITMQ_INGEST_SPACE_RESULT_QUEUE) despite the transition to body of knowledge events.

Consider updating the environment variables to use more generic names or specific body of knowledge terminology:

-      incomingQueue: this.getEnvValue('RABBITMQ_INGEST_SPACE_QUEUE'),
-      outgoingQueue: this.getEnvValue('RABBITMQ_INGEST_SPACE_RESULT_QUEUE'),
+      incomingQueue: this.getEnvValue('RABBITMQ_INGEST_BOK_QUEUE'),
+      outgoingQueue: this.getEnvValue('RABBITMQ_INGEST_BOK_RESULT_QUEUE'),

Line range hint 89-94: Update event binding to match new event types.

The queue binding still uses 'IngestSpaceResult' as the routing key despite the transition to body of knowledge events.

Update the binding to use the new event type:

       await this.channel.bindQueue(
         this.config.outgoingQueue,
         this.config.exchange,
-        'IngestSpaceResult'
+        'IngestBodyOfKnowledgeResult'
       );
🧹 Nitpick comments (15)
package.json (1)

62-62: Consider pinning the langsmith version.

The langsmith package is at version 0.1.x which indicates it's not yet stable. Using the caret (^) version range could lead to breaking changes.

-    "langsmith": "^0.1.66",
+    "langsmith": "0.1.66",
src/embed.body.of.knowledge.ts (2)

22-25: Simplify the UTC timestamp conversion and improve the comment.

The current method for obtaining the UTC timestamp is unnecessarily complex, and the comment could be more professional.

Consider using Date.now() to get the UTC timestamp directly. Apply this diff to simplify the code and update the comment:

- // this shenanigan is here to ensure the Timestamp is in UTC timezone
- result.timestamp = new Date(
-   new Date().toLocaleString('en', { timeZone: 'UTC' })
- ).getTime();
+ // Epoch milliseconds are timezone-independent (UTC by definition)
+ result.timestamp = Date.now();

Repeat this change at lines 104-106.

Also applies to: 104-106
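For context on why Date.now() suffices: epoch milliseconds are timezone-independent by definition, while the locale-string round-trip in the original actually shifts the value by the machine's UTC offset. A small illustrative sketch (not code from the PR):

```typescript
// Date.now() returns milliseconds since the Unix epoch, which is UTC by definition.
const direct: number = Date.now();

// The original pattern formats "now" as a UTC wall-clock string and re-parses it
// in the LOCAL timezone, which shifts the result by the local UTC offset and
// drops millisecond precision.
const roundabout: number = new Date(
  new Date().toLocaleString('en', { timeZone: 'UTC' })
).getTime();

// On a machine not running in UTC, the difference approximates the zone offset.
console.log('difference in minutes:', Math.round((direct - roundabout) / 60000));
```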


100-101: Correct the spelling of 'occurred' in the error message.

There's a typographical error: 'occured' should be spelled 'occurred'.

Apply this diff to fix the typo:

        resultEvent.error = {
-         message: 'An error occured while embedding.',
+         message: 'An error occurred while embedding.',
        };
src/index.ts (1)

15-15: Update log message to reflect body of knowledge terminology.

The log message still references "space" which is inconsistent with the transition to body of knowledge terminology.

-    logger.info(`Ingest completed for space: ${event.bodyOfKnowledgeId}`);
+    logger.info(`Ingest completed for body of knowledge: ${event.bodyOfKnowledgeId}`);
src/event.bus/events/ingest.body.of.knowledge.result.ts (2)

11-13: Expand error codes to cover more failure scenarios.

The ErrorCode enum currently only handles vector insertion failures. Consider adding more error codes for other potential failure scenarios.

 export enum ErrorCode {
   VECTOR_INSERT = 'vector_insert',
+  GRAPHQL_CONNECTION = 'graphql_connection',
+  NOT_FOUND = 'not_found',
+  INVALID_DATA = 'invalid_data',
+  PROCESSING_ERROR = 'processing_error',
 }

26-26: Initialize timestamp in constructor.

The optional timestamp should be initialized with the current timestamp by default.

-    public timestamp?: number,
+    public timestamp: number = Date.now(),
src/logger.ts (1)

41-44: Consider enhancing error object handling.

The implementation is solid and follows TypeScript best practices. Consider adding support for objects with message properties that aren't Error instances.

 export const getErrorMessage = (error: unknown) => {
   if (error instanceof Error) return error.message;
+  if (typeof error === 'object' && error !== null && 'message' in error) {
+    return String((error as { message: unknown }).message);
+  }
   return String(error);
 };
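Applied, the enhanced helper behaves as follows. This is a standalone sketch of the suggestion above, not the file's final code:

```typescript
// Sketch of the helper with the suggested object-message fallback applied.
const getErrorMessage = (error: unknown): string => {
  if (error instanceof Error) return error.message;
  if (typeof error === 'object' && error !== null && 'message' in error) {
    return String((error as { message: unknown }).message);
  }
  return String(error);
};

console.log(getErrorMessage(new Error('boom')));           // → "boom"
console.log(getErrorMessage({ message: 'not an Error' })); // → "not an Error"
console.log(getErrorMessage(42));                          // → "42"
```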
src/data.readers/konwledge.base.ts (2)

28-28: Remove commented out code.

The commented portion of the line should be removed to maintain clean code.

-  const documents: Document[] = []; //= await processknowledgeBaseTree(
+  const documents: Document[] = [];

13-26: Consider extracting knowledge base fetching logic.

The knowledge base fetching and validation logic could be extracted into a separate function for better modularity.

+async function fetchKnowledgeBase(knowledgeBaseId: string, alkemioClient: AlkemioCliClient) {
+  try {
+    const knowledgeBase = await alkemioClient.ingestKnowledgeBase(knowledgeBaseId);
+    if (!knowledgeBase) {
+      throw new Error(`Knowledge base ${knowledgeBaseId} not found.`);
+    }
+    return knowledgeBase;
+  } catch (error) {
+    logger.error(error);
+    throw new Error(`Failed to fetch knowledge base ${knowledgeBaseId}: ${error}`);
+  }
+}

 export const embedKnowledgeBase = async (
   event: IngestBodyOfKnowledge,
   alkemioClient: AlkemioCliClient
 ) => {
   const knowledgeBaseId = event.bodyOfKnowledgeId;
-  let knowledgeBase;
-  try {
-    knowledgeBase = await alkemioClient.ingestKnowledgeBase(knowledgeBaseId);
-  } catch (error) {
-    logger.error(error);
-    throw new Error('GraphQL connection failed.');
-  }
-
-  if (!knowledgeBase) {
-    logger.error(`knowledgeBase ${knowledgeBaseId} not found.`);
-    throw new Error(`knowledgeBase ${knowledgeBaseId} not found.`);
-  }
+  const knowledgeBase = await fetchKnowledgeBase(knowledgeBaseId, alkemioClient);
src/callout.handlers/base.ts (1)

Line range hint 54-76: Consider enhancing contribution document generation.

The current implementation could benefit from improved type safety and logging for contribution processing.

+  interface ContributionDocument {
+    link?: { [key: string]: unknown };
+    post?: { [key: string]: unknown };
+  }

+  const processContribution = (contribution: ContributionDocument) => {
+    const docLike = contribution.link || contribution.post;
+    if (!docLike) return null;
+
+    const { pageContent, documentId, source, type, title } = generateDocument(docLike);
+    logger.debug(`Processing contribution document: ${documentId}`);
+    return new Document({
+      id: documentId,
+      pageContent,
+      metadata: { documentId, source, type, title },
+    });
+  };

   for (const contribution of callout.contributions || []) {
-    let docLike;
-    if (contribution.link) {
-      docLike = contribution.link;
-    } else if (contribution.post) {
-      docLike = contribution.post;
-    }
-
-    if (docLike) {
-      const { pageContent, documentId, source, type, title } =
-        generateDocument(docLike);
-      result.push(
-        new Document({
-          id: documentId,
-          pageContent,
-          metadata: {
-            documentId,
-            source,
-            type,
-            title,
-          },
-        })
-      );
+    const doc = processContribution(contribution);
+    if (doc) {
+      result.push(doc);
     }
   }
src/embed.documents.ts (1)

56-56: Update logging message to reflect body of knowledge terminology.

The logging message still references "space" despite the transition to body of knowledge.

-  logger.info(`Splitting documents for space: ${bokID}`);
+  logger.info(`Splitting documents for body of knowledge: ${bokID}`);
graphql/fragments/space-ingest.graphql (1)

Line range hint 1-1: Consider renaming the fragment to match new terminology.

The fragment name still uses "SpaceIngest" despite the transition to body of knowledge approach.

Consider renaming to BodyOfKnowledgeIngest for consistency with the new terminology.

graphql/fragments/profile.graphql (1)

8-12: Consider adding field validation constraints.

The new location fields enhance the profile structure. Consider specifying validation constraints (e.g., @length, @pattern) for postalCode and potentially making essential fields non-nullable where appropriate.

graphql/fragments/callout.graphql (2)

47-47: Remove unnecessary empty line.

For consistency, remove the empty line between profile and closing brace.


1-50: Consider pagination for comments and contributions.

The current structure might return large datasets for comments and contributions. Consider implementing pagination using first/after parameters to manage response size and improve performance.

Example modification:

comments(first: Int, after: String) {
  pageInfo {
    hasNextPage
    endCursor
  }
  edges {
    node {
      # existing fields
    }
  }
}
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between d37879f and 236f145.

⛔ Files ignored due to path filters (2)
  • package-lock.json is excluded by !**/package-lock.json
  • src/generated/graphql.ts is excluded by !**/generated/**
📒 Files selected for processing (25)
  • .env.default (1 hunks)
  • graphql/fragments/callout.graphql (1 hunks)
  • graphql/fragments/profile.graphql (1 hunks)
  • graphql/fragments/space-ingest.graphql (2 hunks)
  • graphql/queries/knowledge-base-ingest.graphql (1 hunks)
  • package.json (2 hunks)
  • src/callout.handlers/base.ts (2 hunks)
  • src/callout.handlers/index.ts (1 hunks)
  • src/data.readers/konwledge.base.ts (1 hunks)
  • src/data.readers/process.space.tree.ts (3 hunks)
  • src/data.readers/space.ts (1 hunks)
  • src/data.readers/types.d.ts (1 hunks)
  • src/embed.body.of.knowledge.ts (1 hunks)
  • src/embed.documents.ts (3 hunks)
  • src/event.bus/connection.ts (3 hunks)
  • src/event.bus/events/ingest.body.of.knowledge.result.ts (1 hunks)
  • src/event.bus/events/ingest.body.of.knowledge.ts (1 hunks)
  • src/event.bus/events/ingest.space.result.ts (0 hunks)
  • src/event.bus/events/ingest.space.ts (0 hunks)
  • src/graphql.client/AlkemioCliClient.ts (1 hunks)
  • src/index.ts (1 hunks)
  • src/logger.ts (1 hunks)
  • src/process.callouts.ts (1 hunks)
  • src/space.embed/embed.space.ts (0 hunks)
  • src/summarize/graph.ts (2 hunks)
💤 Files with no reviewable changes (3)
  • src/space.embed/embed.space.ts
  • src/event.bus/events/ingest.space.ts
  • src/event.bus/events/ingest.space.result.ts
✅ Files skipped from review due to trivial changes (1)
  • src/callout.handlers/index.ts
🧰 Additional context used
🪛 eslint
src/summarize/graph.ts

[error] 6-6: 'traceable' is defined but never used.

(@typescript-eslint/no-unused-vars)

🔇 Additional comments (13)
src/summarize/graph.ts (1)

24-30: Verify LangSmith configuration.

The model is now wrapped with LangSmith's SDK wrapper, which enables tracing and monitoring capabilities. However, this requires proper configuration to function correctly.

Please ensure the following environment variables are properly configured:

  • LANGCHAIN_TRACING_V2
  • LANGCHAIN_ENDPOINT
  • LANGCHAIN_API_KEY

Run this script to verify the configuration:
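The script itself was collapsed in the rendered page; an illustrative stand-in (not the original script) might look like:

```shell
#!/bin/bash
# Illustrative stand-in for the collapsed script: report whether each
# LangSmith-related variable is set in the current environment.
for var in LANGCHAIN_TRACING_V2 LANGCHAIN_ENDPOINT LANGCHAIN_API_KEY; do
  if printenv "$var" > /dev/null 2>&1; then
    echo "$var is set"
  else
    echo "$var is NOT set"
  fi
done
```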

package.json (1)

2-2: Package name change aligns with project objectives.

The package name change from @alkemio/space-ingest to @alkemio/ingest-alkemio-data better reflects the transition from a space-centric to a generalized body of knowledge approach.

src/data.readers/types.d.ts (1)

1-9: Type definitions are well-structured.

The added type declarations for BodyOfKnowledgeReadResult and ReadResult are correct and enhance type safety.

src/event.bus/events/ingest.body.of.knowledge.ts (1)

1-17: Enums and class definitions are appropriate.

The new enums IngestionPurpose and BodyOfKnowledgeType, along with the IngestBodyOfKnowledge class, are well-defined and align with the overall event handling structure.
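A hedged reconstruction of that event shape, from the names visible elsewhere in this review (IngestBodyOfKnowledge, bodyOfKnowledgeId, BodyOfKnowledgeType with ALKEMIO_SPACE and KNOWLEDGE_BASE members); the enum string values and the IngestionPurpose members are assumptions:

```typescript
// Assumed string values; the repository's generated types are authoritative.
export enum IngestionPurpose {
  KNOWLEDGE = 'knowledge',
  CONTEXT = 'context',
}

export enum BodyOfKnowledgeType {
  ALKEMIO_SPACE = 'alkemio_space',
  KNOWLEDGE_BASE = 'knowledge_base',
}

export class IngestBodyOfKnowledge {
  constructor(
    public bodyOfKnowledgeId: string,
    public type: BodyOfKnowledgeType,
    public purpose: IngestionPurpose
  ) {}
}

// A handler can then branch on the type, as the reviewed code does:
const sampleEvent = new IngestBodyOfKnowledge(
  'bok-123',
  BodyOfKnowledgeType.KNOWLEDGE_BASE,
  IngestionPurpose.KNOWLEDGE
);
console.log(
  sampleEvent.type === BodyOfKnowledgeType.ALKEMIO_SPACE ? 'space' : 'knowledge base'
);
```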

src/data.readers/process.space.tree.ts (1)

25-25: LGTM! Good addition of document ID.

Adding the document ID to the Document constructor ensures consistent identification across the system.

src/callout.handlers/base.ts (1)

41-41: LGTM! Consistent document ID implementation.

The addition of document IDs to both main callout and contribution documents maintains consistency across the system.

Also applies to: 67-67

src/graphql.client/AlkemioCliClient.ts (1)

76-81: LGTM! Clean implementation of knowledge base ingestion.

The new method follows existing patterns and maintains consistency with the class's error handling approach.

src/event.bus/connection.ts (1)

129-135: LGTM! Event parsing is correctly updated.

The message parsing logic correctly extracts the new body of knowledge fields.

src/embed.documents.ts (1)

20-26: LGTM! Clean transition to body of knowledge model.

The function signature and parameter handling have been properly updated to use the new BodyOfKnowledgeReadResult type.

graphql/fragments/space-ingest.graphql (1)

15-19: LGTM! Clean implementation of callouts structure.

The new calloutsSet structure with fragment spread improves code reusability and maintainability.

graphql/queries/knowledge-base-ingest.graphql (1)

1-15: Well-structured query implementation!

The query is well-organized and properly utilizes fragments for code reuse. The structure aligns well with the space ingestion pattern, promoting consistency.

Let's verify the fragment dependencies are properly defined:

✅ Verification successful

Fragment dependencies verified successfully

All required fragments (ProfileFields, CalloutFields, ProfileNoTagsetFields) are properly defined and available in the codebase. The dependency chain is complete and free of circular dependencies.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Check if all required fragments are properly imported
rg -l "fragment (ProfileFields|CalloutFields|ProfileNoTagsetFields)" graphql/fragments/

Length of output: 201


Script:

#!/bin/bash
# Check the content of fragment files
echo "=== profile.graphql ==="
cat graphql/fragments/profile.graphql
echo -e "\n=== callout.graphql ==="
cat graphql/fragments/callout.graphql

Length of output: 1130


Script:

#!/bin/bash
# Check the content of ProfileNoTagsetFields fragment
echo "=== profile.no.tagset.graphql ==="
cat graphql/fragments/profile.no.tagset.graphql

Length of output: 314

.env.default (1)

24-25: Verify resource implications of increased chunk parameters.

The significant increases in CHUNK_SIZE (9x) and CHUNK_OVERLAP (5x) could impact memory usage and processing performance. Please ensure:

  1. The system has sufficient memory to handle larger chunks
  2. The increased overlap doesn't cause significant performance degradation
  3. These values have been tested with representative data volumes
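To make the tradeoff concrete, here is a minimal, dependency-free sketch of how chunk size and overlap interact (the service itself presumably uses a LangChain text splitter; splitText and the window arithmetic below are purely illustrative):

```typescript
// Naive character splitter: each chunk starts (chunkSize - chunkOverlap)
// characters after the previous one, so consecutive chunks share
// chunkOverlap characters.
function splitText(text: string, chunkSize: number, chunkOverlap: number): string[] {
  const chunks: string[] = [];
  const step = chunkSize - chunkOverlap;
  for (let start = 0; start < text.length; start += step) {
    chunks.push(text.slice(start, start + chunkSize));
    if (start + chunkSize >= text.length) break;
  }
  return chunks;
}

// Under this naive splitter, a 100,000-character document yields 12 chunks
// with the new 9000/500 settings versus 111 with the old 1000/100 settings:
// far fewer (but much larger) embedding inputs.
console.log(splitText('a'.repeat(100000), 9000, 500).length);  // → 12
console.log(splitText('a'.repeat(100000), 1000, 100).length);  // → 111
```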
graphql/fragments/callout.graphql (1)

9-22: Well-implemented polymorphic sender type!

The inline fragments for User and VirtualContributor types are properly implemented, maintaining type safety while sharing common profile fields.


coderabbitai bot left a comment


Actionable comments posted: 2

🧹 Nitpick comments (8)
src/data.readers/knowledge.base.ts (2)

14-14: Typo in comment: 'priviliges' should be 'privileges'.

The comment on line 14 contains a typographical error.

Apply this diff to correct the typo:

-  // make sure the service user has sufficient priviliges
+  // make sure the service user has sufficient privileges

30-30: Remove commented-out code to improve readability.

There is a commented-out code snippet that appears to be leftover from previous implementations. Removing it will clean up the code and enhance readability.

Apply this diff to remove the commented-out code:

-  const documents: Document[] = []; //= await processknowledgeBaseTree(
+  const documents: Document[] = [];
src/data.readers/space.ts (1)

16-16: Typo in comment: 'priviliges' should be 'privileges'.

The comment on line 16 contains a typographical error.

Apply this diff to correct the typo:

-  // make sure the service user has sufficient priviliges
+  // make sure the service user has sufficient privileges
src/embed.body.of.knowledge.ts (5)

17-29: Simplify timestamp handling and improve comment clarity.

The timestamp conversion can be simplified, and the comment should be more professional.

Apply this diff to improve the code:

-  // this shenanigan is here to ensure the Timestamp is in UTC timezone
-  result.timestamp = new Date(
-    new Date().toLocaleString('en', { timeZone: 'UTC' })
-  ).getTime();
+  // Epoch milliseconds are timezone-independent (UTC by definition)
+  result.timestamp = Date.now();

39-39: Remove redundant variable assignment.

The purpose variable is redundant as it's already available in event.purpose and is only used once in the code.

Apply this diff to remove the redundant assignment:

-  const purpose = event.purpose;

Then update line 84 to use event.purpose directly.


50-55: Include the error message in client initialization failure.

Consider including the actual error message to aid in debugging.

Apply this diff to improve error handling:

   } catch (error) {
     logger.error(error);
-    return setResultError(resultEvent, 'AlkemioClient can not be initialised.');
+    return setResultError(
+      resultEvent,
+      `AlkemioClient can not be initialised: ${getErrorMessage(error)}`
+    );
   }

79-85: Consider using a more explicit return type for embedDocuments.

Using a boolean return type for embedDocuments makes the success/failure conditions implicit. Consider returning a more detailed result type.

Consider creating and using an EmbeddingResult type that provides more context about the embedding operation's outcome.


102-102: Fix typo in error message.

There's a typo in the error message: "occured" should be "occurred".

Apply this diff to fix the typo:

-      message: 'An error occured while embedding.',
+      message: 'An error occurred while embedding.',
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 236f145 and 3448c26.

📒 Files selected for processing (6)
  • src/data.readers/knowledge.base.ts (1 hunks)
  • src/data.readers/process.space.tree.ts (3 hunks)
  • src/data.readers/space.ts (1 hunks)
  • src/embed.body.of.knowledge.ts (1 hunks)
  • src/process.callouts.ts (1 hunks)
  • src/summarize/graph.ts (2 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
  • src/summarize/graph.ts
  • src/data.readers/process.space.tree.ts
🔇 Additional comments (5)
src/process.callouts.ts (1)

21-23: Ensure safe handling of 'document' when pushing into 'documents'.

Using the spread operator with an unknown array size could cause a stack overflow. Also, type checking for 'document' would improve safety.

Consider modifying the code as follows:

       if (document) {
-        documents.push(...document);
+        if (Array.isArray(document)) {
+          documents.push(...document);
+        } else {
+          documents.push(document);
+        }
       }
src/data.readers/space.ts (1)

34-37: Add type validation before casting to ensure safety.

Unsafe casting of 'space' to Partial<Space> could lead to runtime errors if 'space' does not match the expected structure. Consider adding a type guard function to validate 'space' before casting.

Consider modifying the code as follows:

+  // Type guard function
+  function isPartialSpace(obj: unknown): obj is Partial<Space> {
+    return obj !== null && typeof obj === 'object' && 'id' in obj;
+  }
+
+  if (!isPartialSpace(space)) {
+    throw new Error(`Invalid space data received for ${spaceId}`);
+  }
+
   const documents: Document[] = await processSpaceTree(
-    [space as Partial<Space>],
+    [space],
     alkemioClient
   );
src/embed.body.of.knowledge.ts (3)

1-16: LGTM! Well-organized imports.

The imports are properly organized and reflect the architectural transition to a generalized body of knowledge ingestion approach.


105-108: Simplify timestamp handling and improve comment clarity.

The timestamp conversion can be simplified with Date.now(), which already returns UTC epoch milliseconds and keeps the field's number type.

Apply this diff to improve the code:

-  // this shenanigan is here to ensure the Timestamp is in UTC timezone
-  resultEvent.timestamp = new Date(
-    new Date().toLocaleString('en', { timeZone: 'UTC' })
-  ).getTime();
+  // Epoch milliseconds are timezone-independent (UTC by definition)
+  resultEvent.timestamp = Date.now();

59-67: Add explicit handling for unknown body of knowledge types.

The current implementation assumes any non-ALKEMIO_SPACE type is a knowledge base. Consider adding explicit type validation.

Apply this diff to improve type handling:

     if (event.type === BodyOfKnowledgeType.ALKEMIO_SPACE) {
       logger.info(`Ingest invoked for Space: ${event.bodyOfKnowledgeId}`);
       result = await embedSpace(event, alkemioClient);
+    } else if (event.type === BodyOfKnowledgeType.KNOWLEDGE_BASE) {
+      logger.info(
+        `Ingest invoked for KnowledgeBase: ${event.bodyOfKnowledgeId}`
+      );
+      result = await embedKnowledgeBase(event, alkemioClient);
     } else {
-      logger.info(
-        `Ingest invoked for KnowledgeBase: ${event.bodyOfKnowledgeId}`
-      );
-      result = await embedKnowledgeBase(event, alkemioClient);
+      throw new Error(`Unsupported body of knowledge type: ${event.type}`);
     }

Let's verify the available types:
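The verification itself was collapsed in the rendered page; an illustrative stand-in (paths assumed from the file list above) might look like:

```shell
#!/bin/bash
# Illustrative stand-in for the collapsed script: print the members of
# BodyOfKnowledgeType so the handler's branches can be checked against them.
grep -rn -A 5 'export enum BodyOfKnowledgeType' src/event.bus/events/
```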

valeksiev requested a review from hero101 on January 23, 2025 at 09:49

coderabbitai bot left a comment


Actionable comments posted: 4

🔭 Outside diff range comments (1)
package.json (1)

Update engines field to match dependency requirements

The current engines field (>=16.15.0) is incompatible with core dependencies that require Node.js >= 18. This could lead to installation and runtime issues. Please update the engines field to:

"engines": {
  "node": ">=18.0.0",
  "npm": ">=8.5.5"
}
🔗 Analysis chain

Line range hint 71-79: Review Node.js version requirements.

There's a discrepancy between the engine requirements (>=16.15.0) and the Volta configuration (21.7.3). This could lead to compatibility issues. Consider:

  1. Updating the engines field to match the development environment
  2. Documenting the recommended Node.js version in the README
  3. Adding CI checks to ensure compatibility across supported versions

Run this script to check for Node.js version-specific syntax:

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Search for potential Node.js version compatibility issues
rg -g '*.{js,ts}' -l 'import\s+.*\s+from|export\s+.*\s+from|class\s+.*\s+extends|async|await|\?\?|#private|#\w+|\?\.'

Length of output: 356


Script:

#!/bin/bash
# Check for modern JavaScript syntax usage
echo "=== Checking for ESM imports/exports ==="
rg -g '*.{js,ts}' 'import .* from|export .* from'

echo -e "\n=== Checking for modern class features and operators ==="
rg -g '*.{js,ts}' '#private|#[a-zA-Z_]|class .* extends|\?\?|\?\.|\.\.\.'

echo -e "\n=== Checking for async/await ==="
rg -g '*.{js,ts}' 'async|await'

echo -e "\n=== Checking CI configs for Node.js version ==="
fd -g '*workflow*.{yml,yaml}' .github/workflows/

echo -e "\n=== Checking for version requirements in docs ==="
rg -g '*.md' -i 'node.*version|prerequisites|requirements'

echo -e "\n=== Checking package-lock.json for engine requirements ==="
fd -g 'package-lock.json' -x cat {} | grep -A 5 '"engines":'

Length of output: 72608

♻️ Duplicate comments (2)
src/data.readers/space.ts (1)

17-20: ⚠️ Potential issue

Add type validation before casting.

The unsafe type casting to Partial<Space> could lead to runtime errors if the space object doesn't match the expected structure.

+  // Type guard function
+  function isPartialSpace(obj: unknown): obj is Partial<Space> {
+    return obj !== null && typeof obj === 'object' && 'id' in obj;
+  }
+
+  if (!isPartialSpace(space)) {
+    throw new Error(`Invalid space data received for ${spaceId}`);
+  }
+
   const documents: Document[] = await processSpaceTree(
-    [space as Partial<Space>],
+    [space],
     alkemioClient
   );
src/process.callouts.ts (1)

21-23: ⚠️ Potential issue

Improve document array handling.

Spreading an array of unknown size passes every element as a separate function argument, which can exceed the engine's argument limit and trigger a stack overflow.

-      if (document) {
-        documents.push(...document);
+      if (document && Array.isArray(document)) {
+        // concat avoids spreading an unbounded array onto the call stack
+        documents = documents.concat(document);
       }

Note that documents must be declared with let for the reassignment to compile.
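The stack-safety point can be illustrated with a standalone chunked-append helper (the function name and chunk size are illustrative, not part of the PR):

```typescript
// Hypothetical helper: appends `source` to `target` in fixed-size chunks,
// so each spread passes at most `chunkSize` arguments to push().
function pushInChunks<T>(target: T[], source: T[], chunkSize = 10000): void {
  for (let i = 0; i < source.length; i += chunkSize) {
    // Each spread is bounded by chunkSize, avoiding argument-limit overflows.
    target.push(...source.slice(i, i + chunkSize));
  }
}

const docs: number[] = [];
pushInChunks(docs, Array.from({ length: 25000 }, (_, i) => i));
```

This keeps every element while still bounding each individual spread.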
🧹 Nitpick comments (6)
src/process.callouts.ts (2)

7-10: Consider strengthening type safety.

Using Partial<Callout> suggests incomplete type information. Consider creating a more specific type that includes only the required fields for this operation (e.g., visibility).

type ProcessableCallout = Pick<Callout, 'visibility' /* add other required fields */>;

export const processCallouts = async (
  callouts: ProcessableCallout[],
  alkemioClient: AlkemioCliClient
): Promise<Document[]> => {
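To show how the narrowed type behaves, here is a self-contained sketch with stand-in types (this Callout is a placeholder for illustration, not the real Alkemio type):

```typescript
// Stand-in for the real Callout type, for illustration only.
interface Callout {
  id: string;
  visibility: 'PUBLISHED' | 'DRAFT';
  framing: object;
}

// Only the fields the processing step actually reads.
type ProcessableCallout = Pick<Callout, 'visibility'>;

// Callers can pass objects without id or framing; the compiler
// rejects any access to fields outside the picked set.
const accepted: ProcessableCallout[] = [
  { visibility: 'PUBLISHED' },
  { visibility: 'DRAFT' },
];

const published = accepted.filter(c => c.visibility === 'PUBLISHED');
```

Narrowing the parameter type this way documents exactly which fields the function depends on and keeps callers from silently relying on data the function never uses.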

13-25: Consider adding progress tracking for large callout arrays.

For better observability, especially with large arrays of callouts, consider adding progress logging.

+  const total = callouts.length;
+  let processed = 0;
   for (const callout of callouts) {
+    processed++;
+    logger.debug(`Processing callout ${processed}/${total}`);
src/event.bus/connection.ts (1)

Line range hint 105-117: Enhance error handling and connection validation.

While the basic error handling is present, consider these improvements:

  1. Validate connection state before sending
  2. Add specific error types for different failure scenarios
 async send(message: IngestBodyOfKnowledgeResult) {
   try {
     if (!this.channel) {
       await this.connect();
+      if (!this.channel) {
+        throw new Error('Failed to establish channel after connection attempt');
+      }
     }
+    if (!this.connected) {
+      throw new Error('Connection is not in ready state');
+    }

     this.channel.sendToQueue(
       this.config.outgoingQueue,
       Buffer.from(JSON.stringify(message))
     );
   } catch (error) {
+    const errorMessage = error instanceof Error ? error.message : 'Unknown error';
+    logger.error(`Failed to send message: ${errorMessage}`);
     logger.error(error);
     throw error;
   }
 }
src/embed.documents.ts (2)

25-26: Consider enhancing logging configuration.

While the logging changes are correct, consider using a structured logging approach:

  1. Use a logging context instead of modifying defaultMeta directly
  2. Add log correlation ID for better traceability

Example enhancement:

-  logger.defaultMeta.bodyOfKnowledgeId = bokID;
+  const logContext = logger.child({ 
+    bodyOfKnowledgeId: bokID,
+    correlationId: crypto.randomUUID()
+  });

Also applies to: 48-48, 56-56


Line range hint 20-191: Consider improving error handling and resource management.

The function handles a complex workflow with multiple external services (Azure OpenAI, ChromaDB). Consider these improvements:

  1. Implement proper cleanup in case of partial failures
  2. Add retry logic for transient failures in embedding generation
  3. Consider implementing a circuit breaker for external service calls

Would you like me to provide an example implementation of these patterns?
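A minimal sketch of the retry-with-backoff pattern suggested above (the function name, attempt count, and delays are illustrative assumptions, not the service's actual API):

```typescript
// Hypothetical retry wrapper for transient failures (e.g. embedding calls).
async function withRetry<T>(
  operation: () => Promise<T>,
  maxAttempts = 3,
  baseDelayMs = 500
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      return await operation();
    } catch (error) {
      lastError = error;
      if (attempt < maxAttempts) {
        // Exponential backoff: baseDelayMs, 2x, 4x, ...
        await new Promise(resolve =>
          setTimeout(resolve, baseDelayMs * 2 ** (attempt - 1))
        );
      }
    }
  }
  throw lastError;
}
```

A circuit breaker would wrap the same call site, short-circuiting once consecutive failures cross a threshold instead of retrying indefinitely.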

package.json (1)

62-62: Lock the langsmith dependency version.

The caret (^) in the version constraint allows automatic updates to newer minor/patch versions, which could introduce compatibility issues. Consider locking to a specific version:

-    "langsmith": "^0.1.66",
+    "langsmith": "0.1.66",
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 3448c26 and 1f0a3a6.

⛔ Files ignored due to path filters (1)
  • package-lock.json is excluded by !**/package-lock.json
📒 Files selected for processing (7)
  • package.json (2 hunks)
  • src/data.readers/knowledge.base.ts (1 hunks)
  • src/data.readers/space.ts (1 hunks)
  • src/embed.documents.ts (3 hunks)
  • src/event.bus/connection.ts (3 hunks)
  • src/event.bus/events/ingest.body.of.knowledge.result.ts (1 hunks)
  • src/process.callouts.ts (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
  • src/event.bus/events/ingest.body.of.knowledge.result.ts
  • src/data.readers/knowledge.base.ts
🔇 Additional comments (7)
src/data.readers/space.ts (2)

1-9: LGTM! Imports are well-organized and necessary.

All imports are being used and properly structured.


10-13: LGTM! Function signature is well-defined.

The types are explicit and the signature aligns with the new body of knowledge ingestion model.

src/process.callouts.ts (1)

1-6: LGTM! Imports are well-organized.

All imports are necessary and properly used in the implementation.

src/event.bus/connection.ts (1)

3-6: LGTM: Import and type changes align with the new ingestion model.

The transition from space-centric types to body of knowledge types is well-structured.

src/embed.documents.ts (2)

12-13: LGTM!

The import changes correctly reflect the transition from space-centric to body of knowledge types.


20-23: Verify callers of embedDocuments function.

The function signature has changed significantly:

  1. Changed from default to named export
  2. First parameter type changed to BodyOfKnowledgeReadResult
  3. Purpose type changed to IngestionPurpose

Let's verify all callers have been updated:

✅ Verification successful

All callers of embedDocuments have been properly updated

The only caller in src/embed.body.of.knowledge.ts correctly uses:

  • Named import syntax
  • Updated parameter types and order
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Search for any remaining usages of the old import style and function signature
rg -l "import embedDocuments from.*embed\.documents" || echo "No default imports found"
rg -A 5 "embedDocuments\(" || echo "No direct calls found"

Length of output: 467


Script:

#!/bin/bash
# Check import statement and full context of the call
echo "=== Import statement ==="
rg "import.*embedDocuments" src/embed.body.of.knowledge.ts || echo "No import found"
echo -e "\n=== Function call context ==="
rg -B 10 -A 5 "embedDocuments\(" src/embed.body.of.knowledge.ts || echo "No context found"

Length of output: 656

package.json (1)

2-3: Consider bumping the major version.

The package rename from @alkemio/space-ingest to @alkemio/ingest-alkemio-data is a breaking change that affects package imports. According to semver, this should trigger a major version bump (1.0.0) rather than a minor version bump (0.11.0).

Actionable comments were posted on src/data.readers/space.ts, src/process.callouts.ts, src/event.bus/connection.ts, and src/embed.documents.ts.
@valeksiev valeksiev merged commit 1a6c2fa into develop Jan 23, 2025
2 checks passed
@valeksiev valeksiev deleted the ingest-space-63/ingest-knowledge-base branch January 23, 2025 14:10

Successfully merging this pull request may close these issues.

Ingest KnowledgeBase