community: Fix ConfluenceLoader load() failure caused by deleted pages #29232

Merged

@zenoengine zenoengine commented Jan 15, 2025

Description

This PR modifies the is_public_page function in ConfluenceLoader to prevent exceptions caused by deleted pages during the execution of ConfluenceLoader.process_pages().

Example scenario:
Consider the following usage of ConfluenceLoader:

import os
from langchain_community.document_loaders import ConfluenceLoader

loader = ConfluenceLoader(
    url=os.getenv("BASE_URL"),
    token=os.getenv("TOKEN"),
    max_pages=1000,
    # Plain string: the query has no placeholders, so no f-string is needed.
    cql='type=page and lastmodified >= "2020-01-01 00:00"',
    include_restricted_content=False,
)

# Before this fix, the call below raised:
# HTTPError: Outdated version/old_draft/trashed? Cannot find content Please provide valid ContentId.
documents = loader.load()

Previously, if the query result included a deleted page, is_public_page raised an exception when it called get_all_restrictions_for_content for that page, and the uncaught exception made loader.load() fail for all pages.

By adding a pre-check for the page's "current" status, unnecessary API calls to get_all_restrictions_for_content for non-current pages are avoided.

This fix ensures that such pages are skipped without affecting the rest of the loading process.
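
A minimal sketch of the pre-check described above, not the exact diff from this PR. FakeConfluence is a hypothetical stand-in for the real Confluence client, the page dicts and the "trashed" status value are illustrative, and the shape of the restrictions response is an assumption made for the example. is_public_page is written here as a free function so the snippet runs without a Confluence server.

```python
from typing import Any, Dict, List


class FakeConfluence:
    """Hypothetical stand-in for the Confluence client (illustration only)."""

    def __init__(self) -> None:
        self.calls: List[str] = []  # records every restrictions lookup

    def get_all_restrictions_for_content(self, content_id: str) -> Dict[str, Any]:
        self.calls.append(content_id)
        if content_id == "trashed-1":
            # Mimics the failure reported for deleted content.
            raise RuntimeError("Cannot find content")
        # Assumed response shape: no user or group read restrictions.
        return {
            "read": {
                "restrictions": {
                    "user": {"results": []},
                    "group": {"results": []},
                }
            }
        }


def is_public_page(confluence: FakeConfluence, page: Dict[str, Any]) -> bool:
    # Pre-check: return early for non-current pages, before any API call.
    if page.get("status") != "current":
        return False
    restrictions = confluence.get_all_restrictions_for_content(page["id"])
    read = restrictions["read"]["restrictions"]
    return not read["user"]["results"] and not read["group"]["results"]


client = FakeConfluence()
pages = [
    {"id": "live-1", "status": "current"},
    {"id": "trashed-1", "status": "trashed"},
]
public_ids = [p["id"] for p in pages if is_public_page(client, p)]
```

In this sketch the trashed page is filtered out before the restrictions lookup, so the stub is only ever called for the current page and the loop over the remaining pages completes.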

Issue

N/A (No specific issue number)

Dependencies

No new dependencies are introduced with this change.

Twitter handle

@zenoengine

…in is_public_page

- Added a pre-check for page status in is_public_page to ensure the page is "current" before calling get_all_restrictions_for_content.
- Prevents access exceptions when get_all_restrictions_for_content is called for deleted pages.
- Resolves an issue where such exceptions caused the entire loader.load() process to fail during process_pages execution.
- Optimizes by skipping unnecessary API calls for non-current pages.

@zenoengine changed the title from "community: Prevent ConfluenceLoader load() failure caused by deleted pages" to "community: Prevent(fix) ConfluenceLoader load() failure caused by deleted pages" Jan 15, 2025
@zenoengine changed the title from "community: Prevent(fix) ConfluenceLoader load() failure caused by deleted pages" to "community: Fix ConfluenceLoader load() failure caused by deleted pages" Jan 15, 2025
@dosubot bot added the size:XS, community, Ɑ: doc loader, and 🤖:bug labels Jan 15, 2025
@ccurme ccurme merged commit 0555426 into langchain-ai:master Jan 15, 2025
19 checks passed
@dosubot bot added the lgtm label Jan 15, 2025
@zenoengine zenoengine deleted the fix/is_public_page-skip-invalid-status branch January 15, 2025 16:18