Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Network and Hybrid parsers #153

Draft
wants to merge 129 commits into
base: master
Choose a base branch
from

Conversation

FrancoisHuet
Copy link

Introduce two new parsers, network and hybrid

  • Network is text-based like stream, but differs by:
    • relying on connected alignments x-/y- rather than text edges.
    • handling headers with a separate heuristic as opposed to hard coded expansion upwards
  • Hybrid performs both a layout- and a network-based analysis, and merges the results. It is different from the "guess" strategy discussed in issue 19 since it's not just an "or"

This pull request also introduce a Jupyter notebook to visualize the different parser results side-by-side.

Frh added 30 commits April 18, 2020 17:25
Drop EOL Python 2 support. Resolve unit test discrepancies.
Update unit tests to pass in Travis across all supported Py.
Linting.
Move common code to base class to reduce duplication
Stream plots display pdf background for better context
Refactor parsers by moving common code to the base class
Maintain Python 3.5 compatibility by removing f"{}"
Move common parse error stats computation to base parser
Move copy_spanning_text logic to the table
* plot info passed through debug_info
* display each text edge
* Display regions and areas rectangles
Accept cells if they're at least 50% within the table's bounds.
Frh added 14 commits June 11, 2020 17:20
Create hybrid parser leverage both lattice and network techniques.
Simplify plotting of pdf in lattice case.
Rename "parser.table_bbox" into "parser.table_bbox_parses", since it
represents not a bbox but a dict of bbox to corresponding parsing data.

Still missing: more unit tests, plotting of steps.
Fix first split merge issue
Improve parser comparison notebook to flag identical parses, display
multiple tables correctly
Fix tolerance parameter inclusion for hybrid.
* If Travis uses pytest-cov >= 2.10, it also needs pytest >= 4.6
* Clean up the parser comparison notebook
* Address issue where hybrid didn't honor the columns parameter
* Fix dropping of empty rows/columns in hybrid
* Hybrid learns table y-dimensions from lattice
@vinayak-mehta vinayak-mehta self-requested a review June 18, 2020 09:04
@vinayak-mehta
Copy link
Member

Thanks for submitting this PR!

It's a large change so give me some time to go through it. I'll start by trying out the new flavors, using the Jupyter notebooks.

I see that there are some other changes to config files for Travis / Deepsource etc. I propose that we do those in a separate PR, and keep this PR just for the network and hybrid flavor code.

@FrancoisHuet
Copy link
Author

Thanks Vinayak!  Certainly, I can remove the changes you think don't belong from this PR and create a separate one for the rest if needed.  I will wait for your overall feedback to get a sense of what to split.
Feel free to reach out directly if discussing the change over video / screen sharing helps.
Best,
François

@vinayak-mehta
Copy link
Member

Feel free to reach out directly if discussing the change over video / screen sharing helps.

That would be awesome! I'm sending you an email about this.

@vinayak-mehta
Copy link
Member

  • Network is text-based like stream, but differs by:
    • relying on connected alignments x-/y- rather than text edges.
    • handling headers with a separate heuristic as opposed to hard coded expansion upwards

If it works better than stream, then it might make sense to update stream with the enhancements from network, as introducing a whole new text-based parser might make things confusing for the user. More choices, more confusion.

I'm still going through the code for network and hybrid. I'll also compare network and stream outputs on the stream tests.

@vinayak-mehta
Copy link
Member

Certainly, I can remove the changes you think don't belong from this PR and create a separate one for the rest if needed. I will wait for your overall feedback to get a sense of what to split.

I think the changes in the following files can be removed from this PR, so that it contains changes only for the new parsers.

  • .deepsource.toml
  • .gitignore
  • .travis.yml
  • camelot/handlers.py
  • camelot/image_processing.py
  • camelot/io.py

I included the last 3 files as I mostly saw code style changes, please correct me if they also include changes required for the new parsers.

* Improve explanations of network, hybrid, and lattice parsers
* Remove dead code from parser comparison notebook
* Clean-up notebook variables to reduce size and make diffs cleaner
* Revert changes that were peripheral to the core changes
@FrancoisHuet
Copy link
Author

Thank you for the review Vinayak, and for the good chat this morning. Based on both, I have made the following changes:
a) cleaned-up the two notebooks (comparison, and algorithms descriptions).
b) removed the warning suppressions on test files from deepsource
c) restored the deprecated "sudo: true" in the travis config files
d) removed changes from .gitignore
e) removed formatting fixes in image_plotting.py

I've maintained the changes in handlers.py (necessary to allow the png generation from pdf to be performed only once for speed), and io.py (introduces a new debug parameter that helps trace through the algorithm).

I suggest that at least b) and c) are re-added in a future separate commit.

@vinayak-mehta
Copy link
Member

@FrancoisHuet Thanks for explaining how the parsers work on the call and spending time on this!

Thank you for the review Vinayak, and for the good chat this morning. Based on both, I have made the following changes:
a) cleaned-up the two notebooks (comparison, and algorithms descriptions).
b) removed the warning suppressions on test files from deepsource
c) restored the deprecated "sudo: true" in the travis config files
d) removed changes from .gitignore
e) removed formatting fixes in image_plotting.py

I've maintained the changes in handlers.py (necessary to allow the png generation from pdf to be performed only once for speed), and io.py (introduces a new debug parameter that helps trace through the algorithm).

Thanks! Over the weekend, I'll run the notebooks and all the test suite PDFs and compare stream / network / hybrid outputs. I'll also go through the code to understand the implementation, and come up with a plan to add that stream deprecation warning that we discussed.

I suggest that at least b) and c) are re-added in a future separate commit.

Got it. I've opened #158 and #159 to track these future additions.

@codecov-io
Copy link

Codecov Report

Merging #153 (42f8321) into master (5efbcdc) will decrease coverage by 1.60%.
The diff coverage is 86.89%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #153      +/-   ##
==========================================
- Coverage   88.12%   86.51%   -1.61%     
==========================================
  Files          13       16       +3     
  Lines        1524     2180     +656     
  Branches      347      500     +153     
==========================================
+ Hits         1343     1886     +543     
- Misses        128      226      +98     
- Partials       53       68      +15     
Impacted Files Coverage Δ
camelot/utils.py 74.67% <64.51%> (-6.60%) ⬇️
camelot/plotting.py 77.12% <77.04%> (-14.96%) ⬇️
camelot/cli.py 85.62% <84.78%> (-1.05%) ⬇️
camelot/core.py 86.20% <88.13%> (-1.33%) ⬇️
camelot/parsers/base.py 92.96% <92.74%> (-7.04%) ⬇️
camelot/parsers/stream.py 88.52% <92.85%> (-1.48%) ⬇️
camelot/parsers/hybrid.py 93.18% <93.18%> (ø)
camelot/parsers/network.py 95.48% <95.48%> (ø)
camelot/handlers.py 91.30% <100.00%> (+1.06%) ⬆️
camelot/image_processing.py 94.52% <100.00%> (ø)
... and 7 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 1b8ce1d...42f8321. Read the comment docs.

@vinayak-mehta
Copy link
Member

vinayak-mehta commented Jul 14, 2021

Now that I've finished the large change I was working on, I can finally get back to this. Sorry for the delay here.

@varunlalan
Copy link

@vinayak-mehta any chance to merge this?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants