Updated documentations and refractored codes. #126
Merged
…um into benchmark_2024_04_26 Merged remote changes and local changes. Should have ran git pull constantly.
ML-Herbarium Spring 2024 Summary Report
1. Overview
The Spring 2024 team continues the machine-learning-based effort to digitize and mobilize Asian herbarium collections. Our work is guided by our clients Professor Charles Davis, Professor Thomas Gardos, and Solution Engineer Michelle Voong (via an NSF grant with Prof. Charles Davis and with BU Spark!).
We built a pipeline that combines commercial OCR with an LLM, achieving accuracies of over 62.3% on taxon names, 98.5% on collection locations (province/country), 89.2% on collectors, and 80.4% on collection dates. The work also opened the door to a potential collaboration with the Chinese Virtual Herbarium (CVH).
2. Achievements
2.1 Pipeline and Performance
We built a highly accurate pipeline and benchmarked it on 1,000 samples randomly selected from the 15,000 samples we scraped from the CVH dataset. The performance results below were obtained with Document AI and GPT-4-Turbo.
The accuracy metrics are calculated this way:
Taxon name preprocessing example: Lysimachia fortunei Maxim. --> Lysimachia fortunei
A demo is available at: https://huggingface.co/spaces/spark-ds549/TrOCR
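The preprocessing step in the example above can be sketched as follows. This is a minimal illustration, not the code used in the benchmark notebook: the function name `strip_authority` is hypothetical, and the rule it applies (keep the genus, plus the next token only when it looks like a lowercase epithet) is an assumption inferred from the single example given.

```python
def strip_authority(taxon_name: str) -> str:
    """Drop the author citation from a taxon name (assumed rule, see lead-in)."""
    tokens = taxon_name.split()
    if not tokens:
        return ""
    kept = [tokens[0]]  # the genus is always kept
    if len(tokens) > 1:
        candidate = tokens[1]
        # Treat the second token as a specific epithet only if it is
        # all-lowercase letters (hyphens allowed); otherwise assume it
        # already belongs to the author citation, e.g. "Jacq." or "(L.)".
        if candidate.replace("-", "").isalpha() and candidate.islower():
            kept.append(candidate)
    return " ".join(kept)


print(strip_authority("Lysimachia fortunei Maxim."))
```

Under this assumed rule, a genus-only label such as "Symplocos Jacq." reduces to "Symplocos". Names with infraspecific ranks (subspecies, varieties) would lose that rank and would need extra handling.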
2.2 Benchmark
On the accuracy side, last semester's work mainly focused on two approaches, but both showed significant drawbacks. Approach 1's CV models were not fine-tuned for botanical tasks, and its first step (DETR) pruned 30% of the 1,000 labeled samples, which significantly hurt downstream tasks; in addition, TaxoNERD (an NER model for herbaria) only works on English text. Approach 2 showed significantly lower accuracy on Chinese and Cyrillic text.
On cost and time, our benchmark results:
The time figures were measured on a single sequential thread for Document AI and GPT-4-Turbo (input $10.00 / 1M tokens, output $30.00 / 1M tokens). By comparison, one manual labeler takes around 8 to 16 hours and costs roughly $50 to $150 through an outsourcing service provider (source1, source2, source3), with no guarantee of accuracy.
Furthermore, if a future team seeks to recreate Approach 1, please refer to README.md under /trocr for detailed instructions. If problems arise (which is likely), please refer to the GitHub issue or the Hugging Face discussions section [https://huggingface.co/spark-ds549/detr-label-detection/discussions/3].
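The GPT-4-Turbo share of the cost follows directly from the per-token prices quoted above. The sketch below shows only that arithmetic. The function name and the token counts in the usage line are made up for illustration and are not measurements from the benchmark; Document AI charges are not included.

```python
# Prices quoted in the report for GPT-4-Turbo, in USD per one million tokens.
INPUT_USD_PER_MILLION = 10.00
OUTPUT_USD_PER_MILLION = 30.00


def gpt4_turbo_cost(input_tokens: int, output_tokens: int) -> float:
    """Return the LLM cost in USD for the given token counts."""
    total = (input_tokens * INPUT_USD_PER_MILLION
             + output_tokens * OUTPUT_USD_PER_MILLION)
    return total / 1_000_000


# Hypothetical workload: 1,000 samples at 800 input and 200 output tokens each.
print(gpt4_turbo_cost(1_000 * 800, 1_000 * 200))  # 14.0
```

Replacing the hypothetical counts with the token usage reported by the API for a real run gives the actual LLM cost per batch.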
Benchmark pipeline: /ml-herbarium/Spring2024/benchmark_spring2024.ipynb
2.3 CVH Scraper
During our search for training and validation datasets, we located the dataset of the Chinese Virtual Herbarium (CVH), the largest herbarium in China, which has collected around 10 million samples, 2.8 million of them hand-labeled by identifiers over 20 years. A typical record they host often contains:
It is worth noting that most modern samples carry printed Chinese and English labels created by both the identifier and the collector, with a high-contrast white background and black font.
Example 1: Anaphalis margaritacea (L.) Benth. & Hook. f. https://www.cvh.ac.cn/spms/detail.php?id=e6e73365
Older samples, however, may carry handwritten Chinese and English labels written by the collector on a darker, harder-to-read background, often alongside a printed label from the identifier. Thus, when measuring OCR precision, it is extremely important to know which label (the older handwritten label by the collector or the newer printed label by the identifier) we are extracting from.
Example 1: Symplocos Jacq. https://www.cvh.ac.cn/spms/detail.php?id=e82ce487
The webpage is rendered dynamically with PHP, so we wrote a Selenium automation script to scrape the results.
Please refer to ml-herbarium/Spring2024/scraper/README_scraper.md for instructions.
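As a rough illustration of the Selenium approach (not the project's actual scraper, which is documented in the README above), the sketch below loads one CVH detail page and returns its rendered HTML. The URL pattern is taken from the example links in this report. The function names, the choice of the Chrome driver, and the wait on the `body` tag are assumptions; a real scraper would wait on whichever element holds the specimen data.

```python
from urllib.parse import urlencode

# Detail-page pattern as seen in the example links in this report.
CVH_DETAIL_URL = "https://www.cvh.ac.cn/spms/detail.php"


def detail_url(specimen_id: str) -> str:
    """Build the detail-page URL for one specimen id."""
    return f"{CVH_DETAIL_URL}?{urlencode({'id': specimen_id})}"


def fetch_rendered_html(specimen_id: str, timeout_s: float = 15.0) -> str:
    """Open the page in a browser and return the HTML after it has loaded."""
    # Imported lazily so detail_url() works without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    driver = webdriver.Chrome()  # assumption: a local Chrome install
    try:
        driver.get(detail_url(specimen_id))
        # Placeholder wait condition: the real selector is not known here.
        WebDriverWait(driver, timeout_s).until(
            EC.presence_of_element_located((By.TAG_NAME, "body"))
        )
        return driver.page_source
    finally:
        driver.quit()  # always release the browser


print(detail_url("e6e73365"))
```

Calling `fetch_rendered_html("e6e73365")` requires Selenium and a browser driver to be installed; the URL helper runs without either.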
2.4 Collaboration with Chinese Virtual Herbarium
I (Handi Xie, @PalmPalm7) have established communication with CVH. CVH hopes to collaborate with BU Spark!, and our work is mutually beneficial.
The detailed correspondence can be found in DS 549 - SP24 - Harvard Herberia - Communication with CVH.
Summary of the communication:
Note: CVH's 8 million samples could be highly beneficial for a multimodal model with a herbaria domain focus.
3. Words to the future team
All the past developers are more than happy to guide future teams and discuss the future of this amazing project! You can reach us at: