-
Michelle Weidling authoredMichelle Weidling authored
QuiVer Benchmarks
This repository holds everything you need for automatically executing different OCR-D workflows on images and evaluating the outcomes. It creates benchmarks for (your) OCR-D data in a containerized environment. You can run QuiVer Benchmarks either locally on your machine or in an automized workflow, e.g. in a CI/CD environment.
QuiVer Benchmarks is based on ocrd/all:maximum
and has all OCR-D processors at hand that a workflow might use.
Prerequisites
- Docker >= 23.0.0
To speed up QuiVer Benchmarks you can mount already downloaded text recognition models to /usr/local/share/ocrd-resources/
in docker-compose.yml
by adding
- path/to/your/models:/usr/local/share/ocrd-resources/
to the volumes
section.
Otherwise the tool will download all ocrd-tesserocr-recognize
models as well as ocrd-calamari-recognize qurator-gt4histocr-1.0
on each run.
Usage
- clone this repository
- customize QuiVer Benchmarks according to your needs
- run
docker compose build && docker compose up
- the benchmarks and the evaluation results will be available at
data/workflows.json
on your host system
Benchmarks Considered
The relevant benchmarks gathed by QuiVer Benchmarks are defined in OCR-D's Quality Assurance specification and comprise
- CER (per page and document wide), incl.
- median
- minimum and maximum CER
- standard deviation
- WER (per page and document wide)
- CPU time
- wall time
- processed pages per minute
Custom Workflows and Data
The default behaviour of QuiVer Benchmarks is to collect OCR-D's sample Ground Truth workspaces (currently stored in quiver-data), executing the recommended standard workflows on these and obtaining the relevant benchmarks for each workflow.
You can, however, customize QuiVer Benchmarks to run your own workflows on the sample workspaces or your own OCR-D workspaces.
Adding New OCR-D Workflows
Add new OCR-D workflows to the directory workflows/ocrd_worflows
according to the following conventions:
- OCR workflows have to end with
_ocr.txt
, evaluation workflows with_eval.txt
. The files will be converted by OtoN to Nextflow files after the container has started. - workflows have to be TXT files
- all workflows have to use
ocrd process
You can then either rebuild the Docker image via docker compose build
or mount the directory to the container via
- ./workflows/ocrd_workflows:/app/workflows/ocrd_workflows
in the volumes
section and spin up a new run with docker compose up
.
Removing OCR-D Workflows
Delete the respective TXT files from workflows/ocrd_workflows
and either rebuild the image or mount the directory as volume as described above.
Using Custom Data
+++ TODO +++
Development
Outlook
License
See LICENSE