Updated documentation and refactored code. #126

PalmPalm7 · 2024-05-08T03:43:01Z

ML-Herbarium Spring 2024 Summary Report

1. Overview

The Spring 2024 team continues the machine-learning-based approach to digitizing and mobilizing Asian herbarium collections. Our work is guided by our clients, Professor Charles Davis, Professor Thomas Gardos, and Solution Engineer Michelle Voong (via an NSF grant with Prof. Charles Davis and with BU Spark!).

We built a pipeline that combines commercial OCR with an LLM. It achieves 62.3% accuracy on taxon names, 98.5% on collection locations (province/country), 89.2% on collectors, and 80.4% on collection dates. The work also opened the door to a potential collaboration with the Chinese Virtual Herbarium (CVH).

2. Achievements

2.1 Pipeline and Performance

We built a highly accurate pipeline and benchmarked it on 1000 samples randomly selected from the 15,000 samples we scraped from the CVH dataset. The performance results below were obtained with Document AI and GPT-4-Turbo.

Highly accurate:
- Taxon name: 62.3%
- Collection locations (Province/Country): 98.5%
- Collector name: 89.2%
- Collection Date: 80.4%
Cost effective:
- 351 Min. / 1000 Samples
- $ 66.5 / 1000 Samples
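
The 1000-sample benchmark subset mentioned above could be drawn reproducibly along the following lines. This is only a minimal sketch: the function name, the seed value, and the placeholder sample IDs are assumptions for illustration, not the team's actual selection code.

```python
import random


def draw_benchmark_subset(sample_ids, k=1000, seed=2024):
    """Return k distinct IDs chosen uniformly at random.

    The fixed seed (an arbitrary, assumed value) makes the draw repeatable,
    so the same subset can be benchmarked again later.
    """
    rng = random.Random(seed)
    return rng.sample(list(sample_ids), k)


# Placeholder IDs standing in for the 15,000 scraped CVH samples.
all_ids = [f"sample_{i:05d}" for i in range(15_000)]
subset = draw_benchmark_subset(all_ids)
print(len(subset), "samples selected")
```

Using a dedicated `random.Random` instance keeps the draw independent of any other code that touches the global random state.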

The accuracy metrics are calculated as follows:

Taxon accuracy metric definition: Exact match after both the ground truth and the extracted taxon name are preprocessed, mainly by removing the scholar (author) name.
Taxon name preprocessing example: Lysimachia fortunei Maxim. --> Lysimachia fortunei
Taxon accuracy metric explanation: There are even discrepancies between groundtruth (scraped from website) and groundtruth.
Location accuracy metric definition: Exact match of the province / municipality name (required by Charles). The ground truth holds a similar geographical granularity, so the metric does not consider finer granularity (e.g. city, village, road).
Collector accuracy metric definition: Exact match of collectors. The ground truth often hides second authors (et al.).
Collection date accuracy metric definition: Exact match of the YYYYMMDD timestamp.
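
The exact-match metrics above can be sketched as follows. This is an illustration under stated assumptions, not the code in the benchmark notebook: `strip_authorship`, `normalize_date`, and `exact_match_accuracy` are hypothetical names, the author-stripping rule (keep the genus plus any lowercase epithets that follow it) is a simple heuristic, and the accepted input date formats are guesses.

```python
from datetime import datetime


def strip_authorship(name: str) -> str:
    """Heuristic: keep the genus and the lowercase epithets right after it.

    Stops at the first token that does not start with a lowercase letter,
    which is where the author citation usually begins.
    """
    tokens = name.split()
    if not tokens:
        return ""
    kept = [tokens[0]]
    for tok in tokens[1:]:
        if not tok[0].islower():
            break
        kept.append(tok)
    return " ".join(kept)


def normalize_date(text: str) -> str:
    """Convert a date string to YYYYMMDD; the input formats are assumptions."""
    for fmt in ("%Y%m%d", "%Y-%m-%d", "%Y/%m/%d"):
        try:
            return datetime.strptime(text.strip(), fmt).strftime("%Y%m%d")
        except ValueError:
            continue
    return text.strip()  # leave unparseable values unchanged


def exact_match_accuracy(truths, preds, normalize=str.strip):
    """Fraction of pairs that are identical after normalization."""
    if len(truths) != len(preds):
        raise ValueError("truths and preds must have the same length")
    if not truths:
        return 0.0
    hits = sum(normalize(t) == normalize(p) for t, p in zip(truths, preds))
    return hits / len(truths)


print(strip_authorship("Lysimachia fortunei Maxim."))
```

The heuristic would drop an infraspecific rank that follows an author citation, so a production version would likely need a proper name parser.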

A demo can be found at: https://huggingface.co/spaces/spark-ds549/TrOCR

2.2 Benchmark

On the accuracy side, last semesters' work mainly focused on two approaches:

Approach 1: open-source models (DETR, CRAFT, TrOCR, TaxoNERD) with GBIF datasets (SU23 and prior)
Approach 2: commercial OCR / ViT + LLM (FA23)

Both have shown significant drawbacks. Approach 1's CV models were not fine-tuned for botanical tasks, and the first step (DETR) pruned 30% of the 1,000 labeled samples, which significantly hurt downstream tasks; TaxoNERD (an NER model for herbaria) also only works on English text. Approach 2 showed significantly lower accuracy on Chinese and Cyrillic text.

On cost and time, our benchmark results:

351 Min. / 1000 Samples
$ 66.5 / 1000 Samples

The time was measured on a single sequential thread for Document AI and GPT-4-Turbo (input $10.00 / 1M tokens, output $30.00 / 1M tokens). By comparison, one manual labeler takes around 8 ~ 16 hours and costs roughly $50 ~ $150 through an outsourcing service provider (source1, source2, source3), with no guarantee of accuracy.
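
As a rough worked example of the figures above, the snippet below derives per-sample time and cost and extrapolates them linearly. The function names are hypothetical, linear scaling is an assumption, and the split of the $66.5 between Document AI and GPT-4-Turbo is not stated in this report, so `llm_cost_usd` only shows how the quoted token prices would be applied to token counts you supply.

```python
# Benchmark figures reported above (1000 samples, single sequential thread).
BENCHMARK_SAMPLES = 1000
BENCHMARK_MINUTES = 351
BENCHMARK_USD = 66.5

# GPT-4-Turbo prices quoted above, in USD per 1M tokens.
INPUT_USD_PER_M = 10.00
OUTPUT_USD_PER_M = 30.00


def per_sample():
    """Return (seconds, USD) per sample from the benchmark totals."""
    seconds = BENCHMARK_MINUTES * 60 / BENCHMARK_SAMPLES
    usd = BENCHMARK_USD / BENCHMARK_SAMPLES
    return seconds, usd


def extrapolate(n_samples):
    """Return (hours, USD) for n_samples, assuming linear scaling."""
    seconds, usd = per_sample()
    return n_samples * seconds / 3600, n_samples * usd


def llm_cost_usd(input_tokens, output_tokens):
    """LLM cost for the given token counts at the quoted prices."""
    total = input_tokens * INPUT_USD_PER_M + output_tokens * OUTPUT_USD_PER_M
    return total / 1_000_000


seconds, usd = per_sample()
print(f"{seconds:.2f} s and ${usd:.4f} per sample")
hours, total_usd = extrapolate(15_000)
print(f"{hours:.2f} h and ${total_usd:.2f} for 15,000 samples")
```

Under these assumptions, one sample takes about 21 seconds and costs about $0.07, and all 15,000 collected samples would take roughly 88 hours and cost roughly $1,000 on a single thread.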

Furthermore, if a future team seeks to recreate Approach 1, please refer to README.md under /trocr for detailed instructions. If problems arise (likely), please refer to the GitHub issue or the Hugging Face discussions section: https://huggingface.co/spark-ds549/detr-label-detection/discussions/3

Benchmark pipeline: /ml-herbarium/Spring2024/benchmark_spring2024.ipynb

2.3 CVH Scraper

During our search for training and validation datasets, we found the dataset of the Chinese Virtual Herbarium (CVH), the largest herbarium in China, which has collected around 10 million samples, 2.8 million of them hand-labeled by identifiers over 20 years. A typical record they host contains:

A high-resolution image, containing:
- Image of dry plant collection
- Label created by collector documenting:
  - Taxon name
  - Collector
  - Collection date
  - Collection locality
  - Habitat
- Label created by identifier documenting:
  - Identified taxon name
  - Identifier name
  - Identified date
CVH's digitized documentation, containing:
- Taxonomy
- Taxon name
- Scientific Name
- Chinese Name
- Identified By
- Identification Date
- Collector
- Collector's No.
- Collection Date
- Locality
- Elevation
- Habitat
- Life Form
- Reproductive Condition

It is worth noting that most modern samples contain printed Chinese and English labels created by both the identifier and the collector, with a high-contrast white background and black font.

Example 1: Anaphalis margaritacea (L.) Benth. & Hook. f. https://www.cvh.ac.cn/spms/detail.php?id=e6e73365

However, older samples may contain handwritten Chinese and English labels by the collector on a darker, harder-to-read background, while also likely containing a printed label by the identifier. Thus, when performing OCR, it is extremely important to identify which label (the older handwritten label by the collector or the newer printed label by the identifier) we are extracting from.

Example 2: Symplocos Jacq. https://www.cvh.ac.cn/spms/detail.php?id=e82ce487

The webpage has a dynamic layout built with PHP, so a Selenium automation script was written to scrape the results.

Please refer to ml-herbarium/Spring2024/scraper/README_scraper.md for instructions.
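
For orientation only, a Selenium-based fetch of one CVH detail page might look like the sketch below. It is not the project's scraper (see README_scraper.md for that). The helper names are hypothetical, the URL pattern is taken from the example links above, and waiting for the `body` element is a placeholder, since the page-specific element the real script should wait for is not documented here. It assumes Selenium 4 and a local Chrome installation.

```python
from urllib.parse import urlencode

# Endpoint taken from the example links above.
CVH_DETAIL_ENDPOINT = "https://www.cvh.ac.cn/spms/detail.php"


def detail_url(specimen_id: str) -> str:
    """Build the detail-page URL for one specimen ID."""
    return f"{CVH_DETAIL_ENDPOINT}?{urlencode({'id': specimen_id})}"


def fetch_rendered_html(url: str, timeout_s: float = 15.0) -> str:
    """Load a page in headless Chrome and return the rendered HTML."""
    # Imported lazily so that detail_url() works without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    options = webdriver.ChromeOptions()
    options.add_argument("--headless=new")
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        # Placeholder wait: a real scraper would wait for a page-specific element.
        WebDriverWait(driver, timeout_s).until(
            EC.presence_of_element_located((By.TAG_NAME, "body"))
        )
        return driver.page_source
    finally:
        driver.quit()  # always release the browser, even on errors


print(detail_url("e6e73365"))
```

Calling `fetch_rendered_html(detail_url("e6e73365"))` would return the rendered page, which could then be parsed for the fields listed in section 2.3.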

2.4 Collaboration with Chinese Virtual Herbarium.

I (Handi Xie, @PalmPalm7) have successfully established communication with CVH. CVH hopes to collaborate with BU Spark!, and our work is mutually beneficial.
The detailed correspondence can be found in DS 549 - SP24 - Harvard Herberia - Communication with CVH.

Summary of the communication:

CVH is willing to provide us with the necessary datasets in exchange for authorship in the final academic output.
CVH could provide expert labelers, but this is demanding on their resources.
Edge cases CVH has discovered:
- Localities are prone to many errors due to
  - Transcribers' manual errors (in Chinese, a different word with the same tone can differ vastly in meaning)
  - Vague descriptions (e.g. 300 meters from village A, turn right onto road B for 50 meters, collections were found under the bridge)
Detailed collaboration methods are yet to be discussed.

Note: CVH's 8 million datasets could be highly beneficial for a multimodal model with a herbaria domain focus.

3. Words to future team

All the past developers are more than happy to guide and discuss the future of this amazing project! You can reach us at:

(SP24) Andy Xie handi.xie.beintouch@gmail.com
(SP24) George Trammell gtram@bu.edu
(SP24) Max Karambelas mkaramb@bu.edu
(FA23) Smriti Suresh smritis@bu.edu
(SP23 and SU23) Kabilan Mohanraj kabilanm@bu.edu

review-notebook-app · 2024-05-08T03:43:13Z

Check out this pull request on ReviewNB. See visual diffs & provide feedback on Jupyter Notebooks. Powered by ReviewNB.

Handi Xie and others added 20 commits April 26, 2024 07:17

checkout git repo on SCC

41ba10e

checkout git repo on SCC

077d2db

fixed directory errors

9117830

Fixing duplicates

a87d620

1000 randomly selected samples for further benchmark testings

73556e2

Merge branch 'benchmark_2024_04_26' of github.com:BU-Spark/ml-herbari…

66dad52

…um into benchmark_2024_04_26 Merged remote changes and local changes. Should have ran git pull constantly.

added datasets samples

555be2e

temp commit, unfinished

f738a2e

demo folder

aa1bf54

Add files via upload

e70b430

Delete Spring2024/demo/dpcumentai_batch_processing_app directory

0835530

Add files via upload

d868b01

Update README.md

a13dc10

Update README.md

f2a3c54

Update README.md

4f47f0b

Update README.md

9b41232

Delete Spring2024/demo/documentai_batch_processing_app/herbaria-ai-3c…

8a67df5

…860bcb0f44.json

Update README_scraper.md

461fefe

Updates on README.md

2f3f20c

Final updates

a56c0d5

PalmPalm7 requested review from trgardos, WilliamLee101 and funkyvoong May 8, 2024 03:43

PalmPalm7 changed the base branch from main to dev May 13, 2024 17:40

funkyvoong merged commit cecf8e0 into dev Sep 24, 2024

funkyvoong deleted the benchmark_2024_04_26 branch February 5, 2025 17:46
