Network and Hybrid parsers #153
base: master
Drop EOL Python 2 support. Resolve unit test discrepancies. Update unit tests to pass in Travis across all supported Py. Linting.
Move common code to base class to reduce duplication
Stream plots display pdf background for better context
Refactor parsers by moving common code to the base class
Maintain Python 3.5 compatibility by removing f"{}"
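For context on the Python 3.5 point: f-strings only exist from Python 3.6 onward, so a module containing one fails to parse on 3.5. A minimal sketch of the kind of replacement involved, using made-up variable names rather than anything from the PR:

```python
# Hypothetical values, for illustration only (not taken from the PR).
flavor = "network"
table_count = 2

# An f-string is a syntax error on Python 3.5:
#     message = f"{flavor} found {table_count} tables"

# str.format() builds the same string and works on Python 3.5.
message = "{} found {} tables".format(flavor, table_count)
print(message)
```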
Move common parse error stats computation to base parser Move copy_spanning_text logic to the table
* plot info passed through debug_info
* display each text edge
* Display regions and areas rectangles
Accept cells if they're at least 50% within the table's bounds.
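The commit message does not say how the 50% is measured. A rough sketch of one plausible reading, assuming it means the share of the cell's area that falls inside the table's bounding box; `overlap_ratio` and `accept_cell` are illustrative names, not functions from the PR:

```python
def overlap_ratio(cell, table):
    """Fraction of the cell's area that lies inside the table bbox.

    Both boxes are (x0, y0, x1, y1) tuples with x0 <= x1 and y0 <= y1.
    """
    cx0, cy0, cx1, cy1 = cell
    tx0, ty0, tx1, ty1 = table
    cell_area = (cx1 - cx0) * (cy1 - cy0)
    if cell_area <= 0:
        return 0.0
    # Width and height of the intersection, clamped at zero when disjoint.
    width = max(0.0, min(cx1, tx1) - max(cx0, tx0))
    height = max(0.0, min(cy1, ty1) - max(cy0, ty0))
    return (width * height) / cell_area


def accept_cell(cell, table, threshold=0.5):
    """Accept the cell when at least `threshold` of its area is in the table."""
    return overlap_ratio(cell, table) >= threshold
```

With a table of `(0, 0, 100, 100)`, a cell at `(90, 0, 110, 10)` is exactly half inside and is accepted, while `(95, 0, 115, 10)` is only a quarter inside and is rejected.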
Create hybrid parser leveraging both lattice and network techniques. Simplify plotting of pdf in lattice case. Rename "parser.table_bbox" to "parser.table_bbox_parses", since it represents not a bbox but a dict mapping each bbox to its parsing data. Still missing: more unit tests, plotting of steps.
Fix first split merge issue
Improve parser comparison notebook to flag identical parses, display multiple tables correctly
Fix tolerance parameter inclusion for hybrid.
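To illustrate the "flag identical parses" idea from the notebook commit: a minimal sketch, assuming each parse is available as a list of rows of strings. The notebook itself may compare DataFrames instead; the helper names here are hypothetical.

```python
from itertools import combinations


def same_parse(table_a, table_b):
    """True when both parses hold the same cells, ignoring surrounding whitespace."""
    def normalize(table):
        return [[cell.strip() for cell in row] for row in table]
    return normalize(table_a) == normalize(table_b)


def flag_identical(parses):
    """Map each pair of flavor names to whether their parses match.

    `parses` is a dict like {"stream": rows, "network": rows, ...}.
    """
    return {
        (a, b): same_parse(parses[a], parses[b])
        for a, b in combinations(sorted(parses), 2)
    }
```

Pairs flagged `True` need not be inspected side by side, which keeps the comparison focused on the PDFs where the flavors disagree.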
Trim empty cols and lines
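A minimal sketch of what trimming might look like, assuming "lines" means table rows and that a cell counts as empty when it holds only whitespace. `trim_empty` is an illustrative helper, not the function added in this PR:

```python
def trim_empty(rows):
    """Drop rows and columns whose cells are all empty or whitespace."""
    # Keep only rows that contain at least one non-blank cell.
    kept_rows = [row for row in rows if any(cell.strip() for cell in row)]
    if not kept_rows:
        return []
    n_cols = max(len(row) for row in kept_rows)
    # Keep only columns that have a non-blank cell in some remaining row.
    kept_cols = [
        j for j in range(n_cols)
        if any(j < len(row) and row[j].strip() for row in kept_rows)
    ]
    # Ragged rows are padded with "" so every output row has the same width.
    return [
        [row[j] if j < len(row) else "" for j in kept_cols]
        for row in kept_rows
    ]
```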
… into hybrid-parser
* If Travis uses pytest-cov >= 2.10, it also needs pytest >= 4.6
* Clean up the parser comparison notebook
* Address issue where hybrid didn't honor the columns parameter
* Fix dropping of empty rows/columns in hybrid
* Hybrid learns table y-dimensions from lattice
Thanks for submitting this PR! It's a large change so give me some time to go through it. I'll start by trying out the new flavors, using the Jupyter notebooks. I see that there are some other changes to config files for Travis / Deepsource etc. I propose that we do those in a separate PR, and keep this PR just for the network and hybrid flavor code.
Thanks Vinayak! Certainly, I can remove the changes you think don't belong from this PR and create a separate one for the rest if needed. I will wait for your overall feedback to get a sense of what to split.
That would be awesome! I'm sending you an email about this.
If it works better than stream, then it might make sense to update stream with the enhancements from network, as introducing a whole new text-based parser might make things confusing for the user. More choices, more confusion. I'm still going through the code for network and hybrid. I'll also compare network and stream outputs on the stream tests.
I think the changes in the following files can be removed from this PR, so that it contains changes only for the new parsers.
I included the last 3 files as I mostly saw code style changes, please correct me if they also include changes required for the new parsers.
* Improve explanations of network, hybrid, and lattice parsers
* Remove dead code from parser comparison notebook
* Clean-up notebook variables to reduce size and make diffs cleaner
* Revert changes that were peripheral to the core changes
Thank you for the review Vinayak, and for the good chat this morning. Based on both, I have made the following changes: I've maintained the changes in I suggest that at least b) and c) are re-added in a future separate commit.
@FrancoisHuet Thanks for explaining how the parsers work on the call and spending time on this!
Thanks! Over the weekend, I'll run the notebooks and all the test suite PDFs and compare stream / network / hybrid outputs. I'll also go through the code to understand the implementation, and come up with a plan to add that stream deprecation warning that we discussed.
Got it. I've opened #158 and #159 to track these future additions.
Codecov Report
@@            Coverage Diff             @@
##           master     #153      +/-   ##
==========================================
- Coverage   88.12%   86.51%   -1.61%
==========================================
  Files          13       16       +3
  Lines        1524     2180     +656
  Branches      347      500     +153
==========================================
+ Hits         1343     1886     +543
- Misses        128      226      +98
- Partials       53       68      +15
Now that I've finished the large change I was working on, I can finally get back to this. Sorry for the delay here.
@vinayak-mehta any chance to merge this?
Introduce two new parsers, network and hybrid
This pull request also introduces a Jupyter notebook to visualize the different parser results side-by-side.