-
Notifications
You must be signed in to change notification settings - Fork 1.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[ai-form-recognizer] Lazy iterator for words of a line #18444
[ai-form-recognizer] Lazy iterator for words of a line #18444
Conversation
API changes have been detected in API changes + words?: () => IterableIterator<DocumentWord>; |
API changes have been detected in API changes + words: () => IterableIterator<DocumentWord>; |
/azp run js - ai-form-recognizer - tests |
Azure Pipelines successfully started running 1 pipeline(s). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good overall but I left a few comments that I believe worth considering before merging.
Also, could you please run the release script?
* @returns a convenience layer DocumentLine | ||
*/ | ||
function toDocumentLineFromGenerated( | ||
generated: GeneratedDocumentLine, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
NIT: perhaps the generated
param can be typed as DocumentLine
instead of being cast to it twice in the body.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It doesn't satisfy DocumentLine
until after the words
method is added, so I'm reluctant to do this.
it("search finds correct index", () => { | ||
for (const datum of TEST_DATA) { | ||
const testSpan = { offset: datum.span.offset, length: 1 }; | ||
assert.strictEqual( | ||
(iteratorFromFirstMatchBinarySearch(testSpan, TEST_DATA).next().value as TestData)?.id, | ||
naiveFindFirst(testSpan, TEST_DATA)?.id | ||
); | ||
} | ||
}); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
NIT: did you consider a quickcheck-style tests? I am not sure if JS has a good lib for it. Found https://github.com/gampleman/quick_check.js/ but not sure how good it is.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I didn't consider this, but it's an interesting idea. When we support more generic kinds of element relationships, I'd like to add some fuzzing to this test matrix, but I will explore that during MQ.
sdk/formrecognizer/ai-form-recognizer/test/private/getChildren.spec.ts
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks great!
* [ai-form-recognizer] Lazy iterator for words of a line * Use method instead of property * Regenerate API * Polished, wrote changelog, added some more tests, samples * Updated API MD * Improved docs * Apply changes from review
This PR adds support for accessing the
words
of a DocumentLine. The computation is lazy, supported by a fast algorithm that assumes the words and spans are sorted. When the first element is iterated, a binary search is performed to find the first possible child element n the words array. Each subsequent iteration is nearly constant-time, as the elements are sorted and the algorithm optimizes the iteration by only comparing spans/words for as long as they are possible candidates for inclusion. The tests compare the results of this algorithm with a naive implementation.It also adds samples and a CHANGELOG entry.