text-splitters: fix state persistence issue in ExperimentalMarkdownSyntaxTextSplitter (#28373)

- **Description:**
This PR resolves an issue in which the
`ExperimentalMarkdownSyntaxTextSplitter` class retained internal state
across multiple calls to the `split_text` method. Chunks from earlier
calls accumulated in instance attributes such as `self.chunks`, so
processing multiple Markdown files sequentially with the same splitter
produced incorrect output.

- Modified `libs/text-splitters/langchain_text_splitters/markdown.py` to
reset the relevant internal attributes at the start of each `split_text`
invocation, so that each call processes its input independently.
- Added unit tests in
`libs/text-splitters/tests/unit_tests/test_text_splitters.py` to verify
the fix and ensure that state does not persist across calls.
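
The failure mode and the fix described above can be illustrated with a minimal sketch. `ToySplitter` and its `reset_on_call` flag are hypothetical stand-ins invented for this example; they are not part of `langchain_text_splitters` and do not reproduce the real splitter's parsing logic. The sketch only shows how per-call state kept on `self` leaks between calls, and how resetting it at the top of `split_text` avoids that.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ToySplitter:
    """Hypothetical stand-in for a splitter that keeps per-call state on `self`."""

    reset_on_call: bool
    chunks: List[str] = field(default_factory=list)

    def split_text(self, text: str) -> List[str]:
        if self.reset_on_call:
            # Same idea as the fix: drop leftovers from any previous call.
            self.chunks.clear()
        for block in text.split("\n\n"):
            if block.strip():
                self.chunks.append(block.strip())
        # Toy design choice: return a copy so a later reset cannot
        # mutate a result handed out by an earlier call.
        return list(self.chunks)


# Without a reset, the second call still carries chunks from the first.
buggy = ToySplitter(reset_on_call=False)
buggy.split_text("# A\n\nfirst file")
leaked = buggy.split_text("# B\n\nsecond file")

# With a reset, each call sees only its own input.
fixed = ToySplitter(reset_on_call=True)
fixed.split_text("# A\n\nfirst file")
clean = fixed.split_text("# B\n\nsecond file")

print(leaked)  # ['# A', 'first file', '# B', 'second file']
print(clean)   # ['# B', 'second file']
```

A regression test for this kind of bug typically calls `split_text` twice on the same instance and asserts that the second result contains only the second input's chunks.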

- **Issue:**  
Fixes [#26440](#26440).

- **Dependencies:**
No additional dependencies are introduced with this change.


- [x] Unit tests were added to verify the changes.
- [x] Updated documentation where necessary.  
- [x] Ran `make format`, `make lint`, and `make test` to ensure
compliance with project standards.

---------

Co-authored-by: Angel Chen <angelchen396@gmail.com>
Co-authored-by: Chester Curme <chester.curme@gmail.com>
3 people authored Dec 18, 2024
1 parent 7c8f977 commit 3256b5d
Showing 2 changed files with 406 additions and 0 deletions.
5 changes: 5 additions & 0 deletions libs/text-splitters/langchain_text_splitters/markdown.py
@@ -324,6 +324,11 @@ def split_text(self, text: str) -> List[Document]:
             chunks of the input text. If `return_each_line` is enabled, each line
             is returned as a separate `Document`.
         """
+        # Reset the state for each new file processed
+        self.chunks.clear()
+        self.current_chunk = Document(page_content="")
+        self.current_header_stack.clear()
+
         raw_lines = text.splitlines(keepends=True)
 
         while raw_lines: