Relevant Documentation
Commands
Manually Annotating Data
Preparing Data For Model 2
Run data.py
Run train.py
Train dhSegment (or use existing model)
Run grad_search.py
Evaluate Accuracy of Predictions
Training dhSegment: https://hackmd.io/@ericke/B1G1fEJ9L#Training
Read The Docs: https://dhsegment.readthedocs.io/en/latest/
GitHub: https://github.com/dhlab-epfl/dhSegment
What is dhSegment: https://dhlab-epfl.github.io/dhSegment/
Check if the GPUs are available:
nvidia-smi
Running dhSegment
View current containers + status:
docker container ls -a
If the Docker container doesn’t exist, create it first.
To run the container:
docker start dhSegment
docker exec -it dhSegment bash
tensorboard --logdir .
Exit the container with Ctrl-D
and stop it after use:
docker stop dhSegment
Manual annotation was done with the Transkribus labeling tool.
- Collect all good y’s (XML files) into a single folder and the corresponding images into another folder
a. For EEBO, the images are at jason/datasets/EEBO and the XML files are at jason/datasets/EEBO/team_page
b. For READ_BAD, the images are in the “input” folder at jason/datasets/READ_BAD_MASTER/Train&Test Simple and the XML files are in the page-gt folder
c. Note: these folders can be found on the Sakura server
- Run Aneesha & Lynn’s perturbation script
a. Found in src/bad_baseline_generator.py
b. Usage:
python bad_baseline_generator.py [input directory] [output directory]
c. In this case, the input directory is the folder of good y’s from the dataset, and the output directory will contain the bad y’s with perturbations applied
- Label both folders separately (this step can take a while)
a. Found in src/label_updated.py
b. Usage:
python label_updated.py [good y’s folder] [original .jpg images] [labeled_good] BASELINE
python label_updated.py [bad y’s folder] [original .jpg images] [labeled_bad] BASELINE
c. Should now have two folders called labeled_good and labeled_bad containing the labeled .png images
- Now, split the folders into train/val/test
a. Download the split_folders package (e.g. pip install split-folders)
b. Place labeled_good and labeled_bad as subdirectories of a new folder called input
c. Run the split_folders command:
splitfolders --ratio .7 .2 .1 -- input
d. You should now have a folder called output that contains the train, val, and test data
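If the split_folders CLI isn’t available, the same 70/20/10 split can be sketched with the standard library alone. This is a sketch under assumptions, not the package’s implementation: `split_folder` is a hypothetical name, and it simply shuffles and copies each class subfolder into train/val/test.

```python
import random
import shutil
from pathlib import Path

def split_folder(src, dst, ratios=(0.7, 0.2, 0.1), seed=0):
    """Copy each class subfolder of `src` into dst/{train,val,test}/<class>."""
    random.seed(seed)
    for cls in (p for p in Path(src).iterdir() if p.is_dir()):
        files = sorted(cls.iterdir())
        random.shuffle(files)
        n = len(files)
        # cumulative cut points for the 70/20/10 split
        cuts = [0, round(n * ratios[0]), round(n * (ratios[0] + ratios[1])), n]
        for name, a, b in zip(("train", "val", "test"), cuts, cuts[1:]):
            out = Path(dst) / name / cls.name
            out.mkdir(parents=True, exist_ok=True)
            for f in files[a:b]:
                shutil.copy(f, out / f.name)

# split_folder("input", "output")  # expects input/labeled_good and input/labeled_bad
```

Copying (rather than moving) keeps the original labeled folders intact in case the split needs to be redone with a different seed.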
- data.py turns the images into numpy array representations that are labeled as good or bad y’s
- Usage:
python data.py
- The input directory is the output directory from the last section, which contains the train, val, and test data
- The input directory is hard-coded in the script (currently “model_data”), so edit it there before running
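The conversion that data.py performs can be sketched as follows. This is a sketch, not the actual script: `resize_nn` and `build_arrays` are hypothetical names, and plain nearest-neighbour resizing stands in for whatever resizing data.py actually uses.

```python
import numpy as np

def resize_nn(img, size=(1000, 1000)):
    # nearest-neighbour resize to the 1000x1000 shape train.py expects
    rows = np.arange(size[0]) * img.shape[0] // size[0]
    cols = np.arange(size[1]) * img.shape[1] // size[1]
    return img[np.ix_(rows, cols)]

def build_arrays(good_imgs, bad_imgs):
    # stack the labeled images and build a parallel label vector:
    # 1 = good y, 0 = bad y
    X = np.stack([resize_nn(im) for im in list(good_imgs) + list(bad_imgs)])
    y = np.array([1] * len(good_imgs) + [0] * len(bad_imgs))
    return X, y
```

The important part is the contract: one image array and one label array per split, kept in the same order so index i of X always matches index i of y.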
- This step trains Model 2
- Usage:
python train.py
- The input directory is the output directory from the last section that contains test, train, and val numpy arrays
- The /model_data directory should contain 6 numpy arrays (2 Train, 2 Val, 2 Test), where each pair consists of the training labels and the binarized baseline-prediction images, resized to 1000x1000 pixels
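A quick sanity check of the /model_data contract can be sketched like this. The `X_*/y_*` file names are hypothetical; substitute whatever names data.py actually writes.

```python
import numpy as np
from pathlib import Path

def check_model_data(folder):
    # verify the six-array contract: a paired image array and label array
    # per split, with images resized to 1000x1000
    for split in ("train", "val", "test"):
        X = np.load(Path(folder) / f"X_{split}.npy")
        y = np.load(Path(folder) / f"y_{split}.npy")
        assert len(X) == len(y), f"{split}: images/labels out of sync"
        assert X.shape[1:3] == (1000, 1000), f"{split}: images not 1000x1000"
```

Running a check like this before train.py catches a mismatched or stale split early, instead of partway through training.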
- Existing dhSegment models can be found at projects/jason/dhSegment/READ_BAD/page_modelX on the Sakura server
a. The best-performing pre-trained dhSegment model for our purposes was arbitrarily named ‘page_model4’
- Details on training a new dhSegment model can be found under the “Training dhSegment” link in Relevant Documentation
- This step gets the baseline predictions from the trained dhSegment and BASEDNet models
- Usage:
python grad_search.py
- The input directory is the directory that contains test numpy arrays
- Like our other scripts, grad_search.py takes in the original .jpg images from our historical document archive. The script first generates baseline predictions for each document, then has BASEDNet produce a holistic score for the overall quality of those predictions. If a set of baselines is scored as ‘bad’ by BASEDNet, the baselines are refined over several iterations of gradient-based optimization.
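The optimization loop can be sketched generically. This is a sketch only: `refine_baselines` and `score_fn` are hypothetical names, and a central finite-difference gradient stands in for backpropagating through BASEDNet.

```python
import numpy as np

def refine_baselines(baselines, score_fn, steps=10, lr=0.5, eps=1.0):
    # gradient ascent on a holistic quality score (higher = better),
    # with the gradient estimated by central finite differences
    b = np.asarray(baselines, dtype=float)
    for _ in range(steps):
        grad = np.zeros_like(b)
        for idx in np.ndindex(b.shape):
            bp = b.copy(); bp[idx] += eps
            bm = b.copy(); bm[idx] -= eps
            grad[idx] = (score_fn(bp) - score_fn(bm)) / (2 * eps)
        b = b + lr * grad  # move the baseline coordinates uphill in score
    return b
```

In the real script the scorer is a neural network, so the gradient comes from backpropagation rather than perturbing each coordinate; the update rule is the same idea.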
- Download the following evaluation tool: https://github.com/Transkribus/TranskribusBaseLineEvaluationScheme
- Usage (from inside the target folder):
java -jar TranskribusBaseLineEvaluationScheme-0.1.3-jar-with-dependencies.jar [truth.lst] [reco.lst]
- The .lst files should contain the lists of file names for the ground truth and the predicted output, respectively
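The .lst files can be generated with a short sketch. `write_lst` is a hypothetical helper; whether the jar wants bare names or full paths depends on where you run it from, so adjust the entries accordingly.

```python
from pathlib import Path

def write_lst(folder, out_file, ext=".xml"):
    # one PAGE XML file per line, in a stable sorted order so the
    # entries in truth.lst and reco.lst stay aligned
    names = sorted(str(p) for p in Path(folder).glob(f"*{ext}"))
    Path(out_file).write_text("\n".join(names) + "\n")

# write_lst("ground_truth", "truth.lst")
# write_lst("predictions", "reco.lst")
```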