Make json output parser handle newlines inside markdown code blocks #8682

Merged
Changes from 2 commits
27 changes: 27 additions & 0 deletions libs/langchain/langchain/output_parsers/json.py
@@ -8,6 +8,30 @@
from langchain.schema import BaseOutputParser, OutputParserException


def replace_new_line(match: re.Match[str]) -> str:
    value = match.group(2)
    value = re.sub(r"\n", r"\\n", value)
    value = re.sub(r"\r", r"\\r", value)
    value = re.sub(r"\t", r"\\t", value)
    value = re.sub('"', r"\"", value)

    return match.group(1) + value + match.group(3)


def custom_parser(multiline_string: str) -> str:
Collaborator

Do you mind marking these as private functions?
Could you add a bit more documentation to explain what it means to "handle newlines and other special characters" correctly (i.e., explain that special characters need to be escaped properly, so this searches for instances of special characters that aren't escaped inside the action input, etc.)?

Collaborator

From a glance at the code it's hard to tell whether the parser is exhaustive or whether there are more scenarios that will not be handled properly
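
Two scenarios can be checked directly against the code in this diff. The sketch below copies the two functions as they appear above (minus the bytes check) and feeds them inputs chosen for illustration; the input strings and variable names are assumptions, not cases taken from the PR's tests.

```python
import json
import re


def replace_new_line(match: "re.Match[str]") -> str:
    # Same escaping as in the diff: raw newline, carriage return, tab, quote.
    value = match.group(2)
    value = re.sub(r"\n", r"\\n", value)
    value = re.sub(r"\r", r"\\r", value)
    value = re.sub(r"\t", r"\\t", value)
    value = re.sub('"', r"\"", value)
    return match.group(1) + value + match.group(3)


def custom_parser(multiline_string: str) -> str:
    return re.sub(
        r'("action_input"\:\s*")(.*)(")',
        replace_new_line,
        multiline_string,
        flags=re.DOTALL,
    )


# Case 1: "action_input" is not the last key. The greedy (.*) group, with
# re.DOTALL, runs up to the final double quote in the whole string.
reordered = '{"action_input": "hello", "action": "Final Answer"}'
swallowed = json.loads(custom_parser(reordered))
print(swallowed)

# Case 2: the quotes are already escaped, so the input is valid JSON.
# Each quote gets a second backslash-quote escape and parsing then fails.
already_escaped = r'{"action_input": "say \"hi\""}'
try:
    json.loads(custom_parser(already_escaped))
    double_escape_ok = True
except json.JSONDecodeError:
    double_escape_ok = False
print(double_escape_ok)
```

In the first case the `"action"` key ends up inside the `"action_input"` value; in the second, input that parsed before the rewrite no longer parses after it. Both suggest the regex assumes `"action_input"` is the last key and that its quotes arrive unescaped.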

Contributor Author

@eyurtsev sure - updated.

Contributor Author

Re: exhaustivity - I'm not sure. There may be other scenarios but I'm not aware of them. The main thing I was running into was single escaped \n and unescaped " breaking JSON parsing.
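
The two failures mentioned here can be reproduced with the standard library alone. This is a small sketch with made-up input strings; the helper name `parses` is hypothetical and only used for the demonstration.

```python
import json


def parses(text: str) -> bool:
    """Return True if json.loads accepts the text, False otherwise."""
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False


# A real newline character inside a JSON string is an invalid control character.
raw_newline = '{"action_input": "line one\nline two"}'

# An unescaped double quote ends the JSON string early.
raw_quote = '{"action_input": "<div id="1">"}'

# The same values with the special characters escaped in the JSON text.
escaped_newline = '{"action_input": "line one\\nline two"}'
escaped_quote = '{"action_input": "<div id=\\"1\\">"}'

for sample in (raw_newline, raw_quote, escaped_newline, escaped_quote):
    print(parses(sample))
```

The first two samples are rejected and the last two are accepted, which is the gap the escaping step in this PR is meant to close.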

Contributor Author

@eyurtsev ping.

    if isinstance(multiline_string, (bytes, bytearray)):
        multiline_string = multiline_string.decode()

    multiline_string = re.sub(
        r'("action_input"\:\s*")(.*)(")',
        replace_new_line,
        multiline_string,
        flags=re.DOTALL,
    )

    return multiline_string


def parse_json_markdown(json_string: str) -> dict:
    """
    Parse a JSON string from a Markdown string.
@@ -31,6 +55,9 @@ def parse_json_markdown(json_string: str) -> dict:
    # Strip whitespace and newlines from the start and end
    json_str = json_str.strip()

    # handle newlines and other special characters inside the returned value
    json_str = custom_parser(json_str)

    # Parse the JSON string into a Python dictionary
    parsed = json.loads(json_str)

16 changes: 15 additions & 1 deletion libs/langchain/tests/unit_tests/output_parsers/test_json.py
@@ -60,6 +60,13 @@
}
```"""

JSON_WITH_MARKDOWN_CODE_BLOCK_AND_NEWLINES = """```json
{
    "action": "Final Answer",
    "action_input": "```bar\n<div id="1" class=\"value\">\n\ttext\n</div>```"
}
```"""

NO_TICKS = """{
    "foo": "bar"
}"""
@@ -114,6 +121,13 @@ def test_parse_json(json_string: str) -> None:
    assert parsed == {"foo": "bar"}


-def test_parse_json_with_code_block() -> None:
+def test_parse_json_with_code_blocks() -> None:
    parsed = parse_json_markdown(JSON_WITH_MARKDOWN_CODE_BLOCK)
    assert parsed == {"foo": "```bar```"}

    parsed = parse_json_markdown(JSON_WITH_MARKDOWN_CODE_BLOCK_AND_NEWLINES)

    assert parsed == {
        "action": "Final Answer",
        "action_input": '```bar\n<div id="1" class="value">\n\ttext\n</div>```',
    }