Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[JSONSelection] Track location information for parsed structures using Ranged<T> trait and WithRange<T> wrapper #5987

Merged
merged 39 commits into from
Sep 12, 2024

Conversation

benjamn
Copy link
Member

@benjamn benjamn commented Sep 10, 2024

In order to provide more helpful parsing and error messages for JSONSelection, and to enable IDE/LSP features like syntax highlighting and tooltip hovering in the future, the JSONSelection parser needs to track where each parsed substructure came from within the original source string.

The nom_locate crate was helpful for replacing raw &str inputs with the type Span<'a> = LocatedSpan<&'a str> type, which nom (mostly) seamlessly accepts everywhere &str was previously expected. However, while the LocatedSpan type makes it easy to determine the starting offset of a given AST structure, the ending offset typically needs to be derived from parsing the structure's children. To bootstrap this parsing from the bottom up, I found it helpful to add a parsed_span combinator to match specific concrete tokens and capture their range information.

This would all have been considerably easier if every AST structure was a struct, because then I could simply have added a range: Option<(usize, usize)> (aka range: OffsetRange) field to each struct, and then make sure that field gets populated during parsing. Where possible, that's what I did, but there are a number of AST structures that make more sense as enum types, and we also need to keep track of location information for primitive/string types that cannot store their own ranges.

To work with this variety of AST structures, I introduced a Ranged<T> trait, which is implemented by all the structures that can provide (or cheaply derive) their own ranges. For those that cannot, there's a WithRange<T> wrapper struct that implements Ranged<T>. I considered using WithRange<T> (note: fka Parsed<T>) for everything, but that proved cumbersome and redundant, so the later commits in this PR walk back the use of WithRange<T> to just the types that need it, leaving the rest to implement Ranged<T> directly, without a wrapper.

To make this range information immediately useful, I am now including it as one of the fields of the ApplyToError struct, making runtime apply_to_path errors easier to debug. I would also like to include range information (along with more descriptive error messages) in parse errors generated by nom, but I haven't found a good way to replace the opaque nom::error::Error objects (which only store the input string and an error code) with errors that allow custom messages and range information, so that will have to wait for a subsequent PR.

You can review this PR commit-by-commit if you like, but I changed my mind a few times (e.g. renaming Parsed<T> to WithRange<T> after adding Range<T>), so it might be easier to review the whole set of changes together.

@router-perf
Copy link

router-perf bot commented Sep 10, 2024

CI performance tests

  • connectors-const - Connectors stress test that runs with a constant number of users
  • const - Basic stress test that runs with a constant number of users
  • demand-control-instrumented - A copy of the step test, but with demand control monitoring and metrics enabled
  • demand-control-uninstrumented - A copy of the step test, but with demand control monitoring enabled
  • enhanced-signature - Enhanced signature enabled
  • events - Stress test for events with a lot of users and deduplication ENABLED
  • events_big_cap_high_rate - Stress test for events with a lot of users, deduplication enabled and high rate event with a big queue capacity
  • events_big_cap_high_rate_callback - Stress test for events with a lot of users, deduplication enabled and high rate event with a big queue capacity using callback mode
  • events_callback - Stress test for events with a lot of users and deduplication ENABLED in callback mode
  • events_without_dedup - Stress test for events with a lot of users and deduplication DISABLED
  • events_without_dedup_callback - Stress test for events with a lot of users and deduplication DISABLED using callback mode
  • extended-reference-mode - Extended reference mode enabled
  • large-request - Stress test with a 1 MB request payload
  • no-tracing - Basic stress test, no tracing
  • reload - Reload test over a long period of time at a constant rate of users
  • step-jemalloc-tuning - Clone of the basic stress test for jemalloc tuning
  • step-local-metrics - Field stats that are generated from the router rather than FTV1
  • step-with-prometheus - A copy of the step test with the Prometheus metrics exporter enabled
  • step - Basic stress test that steps up the number of users over time
  • xlarge-request - Stress test with 10 MB request payload
  • xxlarge-request - Stress test with 100 MB request payload

@benjamn benjamn marked this pull request as draft September 10, 2024 21:56
@benjamn

This comment was marked as resolved.

@benjamn benjamn force-pushed the benjamn/JSONSelection-parsing-and-error-ranges branch from 18dffb2 to ce36e85 Compare September 10, 2024 21:59
@benjamn benjamn marked this pull request as ready for review September 10, 2024 22:47
Copy link
Contributor

@lennyburdette lennyburdette left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i didn't go over this with a fine-toothed comb, but the overall structure looks sound and the tests look good. i'm inclined to merge this early, probably right after our next release, so we can iterate on diagnostics for the language server. thank you for this!

@benjamn benjamn force-pushed the benjamn/JSONSelection-parsing-and-error-ranges branch from 13d41b3 to 8a35874 Compare September 11, 2024 15:46
@benjamn

This comment was marked as resolved.

@benjamn benjamn force-pushed the benjamn/JSONSelection-parsing-and-error-ranges branch from 351f0a6 to 845e2c7 Compare September 11, 2024 16:18
@benjamn benjamn force-pushed the benjamn/JSONSelection-parsing-and-error-ranges branch from 845e2c7 to b57decb Compare September 11, 2024 18:29
Now that not every T::parse method returns an IResult<Span, Parsed<T>>,
it makes sense to name the wrapper struct WithRange<T> to reflect why
only some AST structures need to be wrapped (that is, they would have
difficulty storing or otherwise deriving their own ranges).
Note that this commit also changes the hash to include error.range
(along with error.message and error.path), which practically guarantees
no two ApplyToError objects will ever be deduplicated, so we might not
even need to implement Hash if we move away from using IndexSet to
accumulate the errors.

#5987 (comment)
…ath.

This method isn't actually used anywhere yet, but changing the return
type seems like a good idea:
#5987 (comment)
Previously, the parser would always consume both leading and trailing
spaces and comments when parsing (most) AST structures. This was a
convenient convention because it meant the caller of a given ::parse
method typically did not have to worry about consuming trailing spaces
immediately after calling the child ::parse method, since they would
already have been consumed.

Now that we are tracking location information with the help of
nom_locate, it's relatively easy to determine the start offset of a
given AST structure (typically input.location_offset()), but it's
trickier to determine the end offset, because the input text represents
the entire remaining suffix of the original input, with no helpful
indication where the end of the current AST node might be.

If we shift our conventions to stop consuming trailing spaces and
comments, then the caller of a given ::parse method can rely on the end
offset returned by the child ::parse method to be the end of the child
AST node, and can typically use that information to determine the end
offset of the parent AST node, with help from merge_ranges.

This change also allows us to store useful range information for the
PathList::Empty variant, which previously always had `range: None`. Even
though these ranges are empty (start == end), we can simplify the
full_range calculation in PathList::parse_with_depth if we don't have to
check whether matches!(rest.node(), PathList::Empty) and handle that
case specially. Instead, we can typically compute full_range from
merge_ranges(first_child.range(), rest.range()) without having to fall
back to other ranges when rest is PathList::Empty (which previously
meant rest.range() would be None). This empty range now captures the
location where parsing of the previous AST nodes actually stopped, so
rest.range() is always usable (not None) and accurate (no trailing
spaces_or_comments included).
@benjamn benjamn force-pushed the benjamn/JSONSelection-parsing-and-error-ranges branch from 1454e59 to 725afd8 Compare September 12, 2024 20:13
@nicholascioli
Copy link
Contributor

Thanks for working through all of the comments! The only thing left is if you want to add a quick comment about the trailing space. Otherwise, feel free to squash and merge!

The inclusion of "array" and "string" in the error message felt
confusing to me, since it's not obvious that the array/string in
question is the input value that came before the `->`.

#5987 (comment)
@benjamn benjamn merged commit 9a34ca3 into next Sep 12, 2024
10 of 12 checks passed
@benjamn benjamn deleted the benjamn/JSONSelection-parsing-and-error-ranges branch September 12, 2024 21:45
@benjamn benjamn restored the benjamn/JSONSelection-parsing-and-error-ranges branch September 12, 2024 21:45
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants