From afd301df27ebb0fae5b2bde818ab32d499ccecd2 Mon Sep 17 00:00:00 2001 From: Konstantin Baierer <unixprog@gmail.com> Date: Tue, 22 Oct 2019 10:59:14 +0200 Subject: [PATCH] update keraslm --- repos.json | 85 +++++++++++++++++++++++++++++++++++++++++++++++------- 1 file changed, 75 insertions(+), 10 deletions(-) diff --git a/repos.json b/repos.json index 6f0b921..1771325 100644 --- a/repos.json +++ b/repos.json @@ -107,16 +107,81 @@ "files": { "Dockerfile": null, "README.md": "# ocrd_keraslm\n character-level language modelling using Keras\n\n\n## Introduction\n\nThis is a tool for statistical _language modelling_ (predicting text from context) with recurrent neural networks. It models probabilities not on the word level but the _character level_ so as to allow open vocabulary processing (avoiding morphology, historic orthography and word segmentation problems). It manages a vocabulary of mapped characters, which can easily be extended by training on more text. Beyond that, unmapped characters are handled by underspecification.\n\nIn addition to character sequences, (meta-data) context variables can be configured as extra input.\n\n### Architecture\n\nThe model consists of:\n\n0. an input layer: characters are represented as indexes from the vocabulary mapping, in windows of a number `length` of characters,\n1. a character embedding layer: window sequences are converted into dense vectors by looking up the indexes in an embedding weight matrix,\n2. a context embedding layer: context variables are converted into dense vectors by looking up the indexes in an embedding weight matrix,\n3. character and context vector sequences are concatenated,\n4. a number `depth` of hidden layers: each with a number `width` of hidden recurrent units of _LSTM cells_ (Long Short-term Memory) connected on top of each other,\n5. an output layer derived from the transposed character embedding matrix (weight tying): hidden activations are projected linearly to vectors of dimensionality equal to the character vocabulary size, then softmax is applied, returning a probability for each possible value of the next character.\n\n\n\nThe model is trained by feeding windows of text in index representation to the input layer, calculating the output and comparing it to the same text shifted backward by 1 character, represented as unit vectors (\"one-hot coding\"), as target. The loss is calculated as the (unweighted) cross-entropy between target and output. Backpropagation yields error gradients for each layer, which are used to iteratively update the weights (stochastic gradient descent).\n\nThis is implemented in [Keras](https://keras.io) with [Tensorflow](https://www.tensorflow.org/) as backend. It automatically uses a fast CUDA-optimized LSTM implementation (Nvidia GPU and Tensorflow installation with GPU support, see below), both in the learning and the prediction phase, if available.\n\n\n### Modes of operation\n\nNotably, this model (by default) runs _statefully_, i.e. by implicitly passing hidden state from one window (batch of samples) to the next. That way, the context available for predictions can be arbitrarily long (above `length`, e.g. the complete document up to that point), or short (below `length`, e.g. at the start of a text). (However, this is a passive perspective above `length`, because errors are never back-propagated any further in time during gradient-descent training.) 
This is favourable to stateless mode because all characters can be output in parallel, and no partial windows need to be presented during training (which would slow down training).\n\nBesides stateful mode, the model can also be run _incrementally_, i.e. by explicitly passing hidden state from the caller. That way, multiple alternative hypotheses can be processed together. This is used for generation (sampling from the model) and alternative decoding (finding the best path through a sequence of alternatives).\n\n### Context conditioning\n\nEvery text has meta-data like time, author, text type, genre, production features (e.g. print vs typewriter vs digital-born rich text, OCR version), language, structural element (e.g. title vs heading vs paragraph vs footer vs marginalia), font family (e.g. Antiqua vs Fraktur) and font shape (e.g. bold vs letter-spaced vs italic vs normal) etc.\n\nThis information (however noisy) can be very useful to facilitate stochastic modelling, since language is extremely diverse and complex. To that end, models can be conditioned on extra inputs here, termed _context variables_. The model learns to represent these high-dimensional discrete values as low-dimensional continuous vectors (embeddings), which also enter the recurrent hidden layers (as a form of simple additive adaptation).\n\n### Underspecification\n\nIndex zero is reserved for unmapped characters (unseen contexts). During training, its embedding vector is regularised to occupy a central position among all mapped characters (all other contexts), and the hidden layers get to see it every now and then by random degradation. At runtime, therefore, some unknown character (some unknown context) represented as zero does not disturb follow-up predictions too much.\n\n\n## Installation\n\nRequired Ubuntu packages:\n\n* Python (``python`` or ``python3``)\n* pip (``python-pip`` or ``python3-pip``)\n* virtualenv (``python-virtualenv`` or ``python3-virtualenv``)\n\nCreate and activate a virtualenv as usual.\n\nIf you need a custom version of ``keras`` or ``tensorflow`` (like [GPU support](https://www.tensorflow.org/install/install_sources)), install them via `pip` now.\n\nTo install Python dependencies and this module, do:\n```shell\nmake deps install\n```\nWhich is the equivalent of:\n```shell\npip install -r requirements.txt\npip install -e .\n```\n\nUseful environment variables are:\n- ``TF_CPP_MIN_LOG_LEVEL`` (set to `1` to suppress most of Tensorflow's messages)\n- ``CUDA_VISIBLE_DEVICES`` (set empty to force CPU even in a GPU installation)\n\n\n## Usage\n\nThis package has two user interfaces:\n\n### command line interface `keraslm-rate`\n\nTo be used with string arguments and plain-text files.\n\n```shell\nUsage: keraslm-rate [OPTIONS] COMMAND [ARGS]...\n\nOptions:\n --help Show this message and exit.\n\nCommands:\n train train a language model\n test get overall perplexity from language model\n apply get individual probabilities from language model\n generate sample characters from language model\n print-charset Print the mapped characters\n prune-charset Delete one character from mapping\n plot-char-embeddings-similarity\n Paint a heat map of character embeddings\n plot-context-embeddings-similarity\n Paint a heat map of context embeddings\n plot-context-embeddings-projection\n Paint a 2-d PCA projection of context embeddings\n```\n\nExamples:\n```shell\nkeraslm-rate train --width 64 --depth 4 --length 256 --model model_dta_64_4_256.h5 dta_komplett_2017-09-01/txt/*.tcf.txt\nkeraslm-rate generate -m 
model_dta_64_4_256.h5 --number 6 \"f\u00fcr die Wi\u017f\u017fen\"\nkeraslm-rate apply -m model_dta_64_4_256.h5 \"so sch\u00e4dlich ist es Borkickheile zu pflanzen\"\nkeraslm-rate test -m model_dta_64_4_256.h5 dta_komplett_2017-09-01/txt/grimm_*.tcf.txt\n```\n\n### [OCR-D processor](https://github.com/OCR-D/core) interface `ocrd-keraslm-rate`\n\nTo be used with [PageXML](https://www.primaresearch.org/tools/PAGELibraries) documents in an [OCR-D](https://github.com/OCR-D/spec/) annotation workflow. Input could be anything with a textual annotation (`TextEquiv` on the given `textequiv_level`). The LM rater could be used both for quality control (without alternative decoding, using only each first index `TextEquiv`) and as part of post-correction (with `alternative_decoding=True`, finding the best path among `TextEquiv` indexes).\n\n```json\n \"tools\": {\n \"ocrd-keraslm-rate\": {\n \"executable\": \"ocrd-keraslm-rate\",\n \"categories\": [\n \"Text recognition and optimization\"\n ],\n \"steps\": [\n \"recognition/text-recognition\"\n ],\n \"description\": \"Rate elements of the text with a character-level LSTM language model in Keras\",\n \"input_file_grp\": [\n \"OCR-D-OCR-TESS\",\n \"OCR-D-OCR-KRAK\",\n \"OCR-D-OCR-OCRO\",\n \"OCR-D-OCR-CALA\",\n \"OCR-D-OCR-ANY\",\n \"OCR-D-COR-CIS\",\n \"OCR-D-COR-ASV\"\n ],\n \"output_file_grp\": [\n \"OCR-D-COR-LM\"\n ],\n \"parameters\": {\n \"model_file\": {\n \"type\": \"string\",\n \"format\": \"uri\",\n \"content-type\": \"application/x-hdf;subtype=bag\",\n \"description\": \"path of h5py weight/config file for model trained with keraslm\",\n \"required\": true,\n \"cacheable\": true\n },\n \"textequiv_level\": {\n \"type\": \"string\",\n \"enum\": [\"region\", \"line\", \"word\", \"glyph\"],\n \"default\": \"glyph\",\n \"description\": \"PAGE XML hierarchy level to evaluate TextEquiv sequences on\"\n },\n \"alternative_decoding\": {\n \"type\": \"boolean\",\n \"description\": \"whether to process all TextEquiv alternatives, finding the best path via beam search, and delete each non-best alternative\",\n \"default\": true\n },\n \"beam_width\": {\n \"type\": \"number\",\n \"format\": \"integer\",\n \"description\": \"maximum number of best partial paths to consider during search with alternative_decoding\",\n \"default\": 100\n }\n }\n }\n }\n```\n\nExamples:\n```shell\nmake deps-test # installs ocrd_tesserocr\nmake test/assets # downloads GT, imports PageXML, builds workspaces\nocrd workspace clone -a test/assets/kant_aufklaerung_1784/mets.xml ws1\ncd ws1\nocrd-tesserocr-segment-region -I OCR-D-IMG -O OCR-D-SEG-BLOCK\nocrd-tesserocr-segment-line -I OCR-D-SEG-BLOCK -O OCR-D-SEG-LINE\nocrd-tesserocr-recognize -I OCR-D-SEG-LINE -O OCR-D-OCR-TESS-WORD -p '{ \"textequiv_level\" : \"word\", \"model\" : \"Fraktur\" }'\nocrd-tesserocr-recognize -I OCR-D-SEG-LINE -O OCR-D-OCR-TESS-GLYPH -p '{ \"textequiv_level\" : \"glyph\", \"model\" : \"deu-frak\" }'\n# get confidences and perplexity:\nocrd-keraslm-rate -I OCR-D-OCR-TESS-WORD -O OCR-D-OCR-LM-WORD -p '{ \"model_file\": \"model_dta_64_4_256.h5\", \"textequiv_level\": \"word\", \"alternative_decoding\": false }'\n# also get best path:\nocrd-keraslm-rate -I OCR-D-OCR-TESS-GLYPH -O OCR-D-OCR-LM-GLYPH -p '{ \"model_file\": \"model_dta_64_4_256.h5\", \"textequiv_level\": \"glyph\", \"alternative_decoding\": true, \"beam_width\": 10 }'\n```\n\n## Testing\n\n```shell\nmake deps-test test\n```\nWhich is the equivalent of:\n```shell\npip install -r requirements_test.txt\ntest -e test/assets || test/prepare_gt.bash 
test/assets\ntest -f model_dta_test.h5 || keraslm-rate train -m model_dta_test.h5 test/assets/*.txt\nkeraslm-rate test -m model_dta_test.h5 test/assets/*.txt\npython -m pytest test $(PYTEST_ARGS)\n```\n\nSet `PYTEST_ARGS=\"-s --verbose\"` to see log output (`-s`) and individual test results (`--verbose`).\n", - "ocrd-tool.json": null, - "setup.py": "# -*- coding: utf-8 -*-\n\"\"\"\nInstalls:\n - keraslm-rate\n\"\"\"\nimport codecs\n\nfrom setuptools import setup, find_packages\n\nwith codecs.open('README.md', encoding='utf-8') as f:\n README = f.read()\n\nsetup(\n name='ocrd_keraslm',\n version='0.3.1',\n description='character-level language modelling in Keras',\n long_description=README,\n author='Konstantin Baierer, Kay-Michael W\u00fcrzner',\n author_email='unixprog@gmail.com, wuerzner@gmail.com',\n url='https://github.com/OCR-D/ocrd_keraslm',\n license='Apache License 2.0',\n packages=find_packages(exclude=('tests', 'docs')),\n install_requires=[\n 'ocrd >= 0.15.2',\n 'keras',\n 'click',\n 'numpy',\n 'tensorflow',\n 'h5py',\n 'networkx',\n ],\n extras_require={\n 'plotting': [\n 'sklearn',\n 'matplotlib',\n ]\n },\n package_data={\n '': ['*.json', '*.yml', '*.yaml'],\n },\n entry_points={\n 'console_scripts': [\n 'keraslm-rate=ocrd_keraslm.scripts.run:cli',\n 'ocrd-keraslm-rate=ocrd_keraslm.wrapper.cli:ocrd_keraslm_rate',\n ]\n },\n)\n" + "ocrd-tool.json": "{\n \"git_url\": \"https://github.com/OCR-D/ocrd_keraslm\",\n \"version\": \"0.3.1\",\n \"tools\": {\n \"ocrd-keraslm-rate\": {\n \"executable\": \"ocrd-keraslm-rate\",\n \"categories\": [\n \"Text recognition and optimization\"\n ],\n \"steps\": [\n \"recognition/text-recognition\"\n ],\n \"description\": \"Rate elements of the text with a character-level LSTM language model in Keras\",\n \"input_file_grp\": [\n \"OCR-D-OCR-TESS\",\n \"OCR-D-OCR-KRAK\",\n \"OCR-D-OCR-OCRO\",\n \"OCR-D-OCR-CALA\",\n \"OCR-D-OCR-ANY\",\n \"OCR-D-COR-CIS\",\n \"OCR-D-COR-ASV\"\n ],\n \"output_file_grp\": [\n \"OCR-D-COR-LM\"\n ],\n \"parameters\": {\n \"model_file\": {\n \"type\": \"string\",\n \"format\": \"uri\",\n \"content-type\": \"application/x-hdf;subtype=bag\",\n \"description\": \"path of h5py weight/config file for model trained with keraslm\",\n \"required\": true,\n \"cacheable\": true\n },\n \"textequiv_level\": {\n \"type\": \"string\",\n \"enum\": [\"region\", \"line\", \"word\", \"glyph\"],\n \"default\": \"glyph\",\n \"description\": \"PAGE XML hierarchy level to evaluate TextEquiv sequences on\"\n },\n \"alternative_decoding\": {\n \"type\": \"boolean\",\n \"description\": \"whether to process all TextEquiv alternatives, finding the best path via beam search, and delete each non-best alternative\",\n \"default\": true\n },\n \"beam_width\": {\n \"type\": \"number\",\n \"format\": \"integer\",\n \"description\": \"maximum number of best partial paths to consider during search with alternative_decoding\",\n \"default\": 10\n },\n \"lm_weight\": {\n \"type\": \"number\",\n \"format\": \"float\",\n \"description\": \"share of the LM scores over the input confidences\",\n \"default\": 0.5\n }\n }\n }\n }\n}\n", + "setup.py": "# -*- coding: utf-8 -*-\n\"\"\"\nInstalls:\n - keraslm-rate\n - ocrd-keraslm-rate\n\"\"\"\nimport codecs\n\nfrom setuptools import setup, find_packages\n\nwith codecs.open('README.md', encoding='utf-8') as f:\n README = f.read()\n\nsetup(\n name='ocrd_keraslm',\n version='0.3.1',\n description='character-level language modelling in Keras',\n long_description=README,\n author='Konstantin Baierer, Kay-Michael 
W\u00fcrzner',\n author_email='unixprog@gmail.com, wuerzner@gmail.com',\n url='https://github.com/OCR-D/ocrd_keraslm',\n license='Apache License 2.0',\n packages=find_packages(exclude=('tests', 'docs')),\n install_requires=open('requirements.txt').read().split('\\n'),\n extras_require={\n 'plotting': [\n 'sklearn',\n 'matplotlib',\n ]\n },\n package_data={\n '': ['*.json', '*.yml', '*.yaml'],\n },\n entry_points={\n 'console_scripts': [\n 'keraslm-rate=ocrd_keraslm.scripts.run:cli',\n 'ocrd-keraslm-rate=ocrd_keraslm.wrapper.cli:ocrd_keraslm_rate',\n ]\n },\n)\n" }, "git": { - "last_commit": "Fri Jul 19 13:01:17 2019 +0200", - "number_of_commits": "75" + "last_commit": "Tue Oct 22 10:57:38 2019 +0200", + "number_of_commits": "81" }, "name": "ocrd_keraslm", - "ocrd_tool": "", - "ocrd_tool_validate": "NO ocrd-tool.json", + "ocrd_tool": { + "git_url": "https://github.com/OCR-D/ocrd_keraslm", + "tools": { + "ocrd-keraslm-rate": { + "categories": [ + "Text recognition and optimization" + ], + "description": "Rate elements of the text with a character-level LSTM language model in Keras", + "executable": "ocrd-keraslm-rate", + "input_file_grp": [ + "OCR-D-OCR-TESS", + "OCR-D-OCR-KRAK", + "OCR-D-OCR-OCRO", + "OCR-D-OCR-CALA", + "OCR-D-OCR-ANY", + "OCR-D-COR-CIS", + "OCR-D-COR-ASV" + ], + "output_file_grp": [ + "OCR-D-COR-LM" + ], + "parameters": { + "alternative_decoding": { + "default": true, + "description": "whether to process all TextEquiv alternatives, finding the best path via beam search, and delete each non-best alternative", + "type": "boolean" + }, + "beam_width": { + "default": 10, + "description": "maximum number of best partial paths to consider during search with alternative_decoding", + "format": "integer", + "type": "number" + }, + "lm_weight": { + "default": 0.5, + "description": "share of the LM scores over the input confidences", + "format": "float", + "type": "number" + }, + "model_file": { + "cacheable": true, + "content-type": "application/x-hdf;subtype=bag", + "description": "path of h5py weight/config file for model trained with keraslm", + "format": "uri", + "required": true, + "type": "string" + }, + "textequiv_level": { + "default": "glyph", + "description": "PAGE XML hierarchy level to evaluate TextEquiv sequences on", + "enum": [ + "region", + "line", + "word", + "glyph" + ], + "type": "string" + } + }, + "steps": [ + "recognition/text-recognition" + ] + } + }, + "version": "0.3.1" + }, + "ocrd_tool_validate": "<report valid=\"false\">\n <error>[tools.ocrd-keraslm-rate.parameters.model_file.content-type] 'application/x-hdf;subtype=bag' does not match '^[a-z0-9\\\\._-]+/[A-Za-z0-9\\\\._\\\\+-]+$'</error>\n</report>", "org_plus_name": "OCR-D/ocrd_keraslm", "python": { "author": "Konstantin Baierer, Kay-Michael W\u00fcrzner", @@ -130,12 +195,12 @@ "files": { "Dockerfile": "FROM ocrd/core\nMAINTAINER OCR-D\nENV DEBIAN_FRONTEND noninteractive\nENV PYTHONIOENCODING utf8\nENV LC_ALL C.UTF-8\nENV LANG C.UTF-8\n\nWORKDIR /build-ocrd\nCOPY setup.py .\nCOPY requirements.txt .\nRUN apt-get update && \\\n apt-get -y install --no-install-recommends \\\n ca-certificates \\\n make \\\n git\nCOPY ocrd_kraken ./ocrd_kraken\nRUN pip3 install --upgrade pip\nRUN pip3 install .\n\nENTRYPOINT [\"/bin/sh\", \"-c\"]\n", "README.md": "# ocrd_kraken\n\n> Wrapper for the kraken OCR engine\n\n[](https://travis-ci.org/OCR-D/ocrd_kraken)\n[](https://hub.docker.com/r/ocrd/kraken/tags/)\n[](https://circleci.com/gh/OCR-D/ocrd_kraken)\n", - "ocrd-tool.json": "{\n \"git_url\": 
\"https://github.com/OCR-D/ocrd_kraken\",\n \"version\": \"0.0.2\",\n \"tools\": {\n \"ocrd-kraken-binarize\": {\n \"executable\": \"ocrd-kraken-binarize\",\n \"input_file_grp\": \"OCR-D-IMG\",\n \"output_file_grp\": \"OCR-D-IMG-BIN\",\n \"categories\": [\n \"Image preprocessing\"\n ],\n \"steps\": [\n \"preprocessing/optimization/binarization\"\n ],\n \"description\": \"Binarize images with kraken\",\n \"parameters\": {\n \"level-of-operation\": {\n \"type\": \"string\",\n \"default\": \"page\",\n \"enum\": [\"page\", \"block\", \"line\"]\n }\n }\n },\n \"ocrd-kraken-segment\": {\n \"executable\": \"ocrd-kraken-segment\",\n \"categories\": [\n \"Layout analysis\"\n ],\n \"steps\": [\n \"layout/segmentation/region\"\n ],\n \"description\": \"Block segmentation with kraken\",\n \"parameters\": {\n \"text_direction\": {\n \"type\": \"string\",\n \"description\": \"Sets principal text direction\",\n \"enum\": [\"horizontal-lr\", \"horizontal-rl\", \"vertical-lr\", \"vertical-rl\"],\n \"default\": \"horizontal-lr\"\n },\n \"script_detect\": {\n \"type\": \"boolean\",\n \"description\": \"Enable script detection on segmenter output\",\n \"default\": false\n },\n \"maxcolseps\": {\"type\": \"number\", \"format\": \"integer\", \"default\": 2},\n \"scale\": {\"type\": \"number\", \"format\": \"float\", \"default\": null},\n \"black_colseps\": {\"type\": \"boolean\", \"default\": false},\n \"white_colseps\": {\"type\": \"boolean\", \"default\": false}\n }\n },\n \"ocrd-kraken-ocr\": {\n \"executable\": \"ocrd-kraken-ocr\",\n \"categories\": [\"Text recognition and optimization\"],\n \"steps\": [\n \"recognition/text-recognition\"\n ],\n \"description\": \"OCR with kraken\",\n \"parameters\": {\n \"lines-json\": {\n \"type\": \"string\",\n \"format\": \"url\",\n \"required\": \"true\",\n \"description\": \"URL to line segmentation in JSON\"\n }\n }\n }\n\n }\n}\n", + "ocrd-tool.json": "{\n \"git_url\": \"https://github.com/OCR-D/ocrd_kraken\",\n \"version\": \"0.0.2\",\n \"tools\": {\n \"ocrd-kraken-binarize\": {\n \"executable\": \"ocrd-kraken-binarize\",\n \"input_file_grp\": \"OCR-D-IMG\",\n \"output_file_grp\": \"OCR-D-IMG-BIN\",\n \"categories\": [\n \"Image preprocessing\"\n ],\n \"steps\": [\n \"preprocessing/optimization/binarization\"\n ],\n \"description\": \"Binarize images with kraken\",\n \"parameters\": {\n \"level-of-operation\": {\n \"type\": \"string\",\n \"default\": \"page\",\n \"enum\": [\"page\", \"block\", \"line\"]\n }\n }\n },\n \"ocrd-kraken-segment\": {\n \"executable\": \"ocrd-kraken-segment\",\n \"categories\": [\n \"Layout analysis\"\n ],\n \"steps\": [\n \"layout/segmentation/region\"\n ],\n \"description\": \"Block segmentation with kraken\",\n \"parameters\": {\n \"text_direction\": {\n \"type\": \"string\",\n \"description\": \"Sets principal text direction\",\n \"enum\": [\"horizontal-lr\", \"horizontal-rl\", \"vertical-lr\", \"vertical-rl\"],\n \"default\": \"horizontal-lr\"\n },\n \"script_detect\": {\n \"type\": \"boolean\",\n \"description\": \"Enable script detection on segmenter output\",\n \"default\": false\n },\n \"maxcolseps\": {\"type\": \"number\", \"format\": \"integer\", \"default\": 2},\n \"scale\": {\"type\": \"number\", \"format\": \"float\", \"default\": 0},\n \"black_colseps\": {\"type\": \"boolean\", \"default\": false},\n \"white_colseps\": {\"type\": \"boolean\", \"default\": false}\n }\n },\n \"ocrd-kraken-ocr\": {\n \"executable\": \"ocrd-kraken-ocr\",\n \"categories\": [\"Text recognition and optimization\"],\n \"steps\": [\n 
\"recognition/text-recognition\"\n ],\n \"description\": \"OCR with kraken\",\n \"parameters\": {\n \"lines-json\": {\n \"type\": \"string\",\n \"format\": \"url\",\n \"required\": \"true\",\n \"description\": \"URL to line segmentation in JSON\"\n }\n }\n }\n\n }\n}\n", "setup.py": "# -*- coding: utf-8 -*-\n\"\"\"\nInstalls two binaries:\n\n - ocrd-kraken-binarize\n - ocrd-kraken-segment\n\"\"\"\nimport codecs\n\nfrom setuptools import setup, find_packages\n\nsetup(\n name='ocrd_kraken',\n version='0.1.1',\n description='kraken bindings',\n long_description=codecs.open('README.md', encoding='utf-8').read(),\n long_description_content_type='text/markdown',\n author='Konstantin Baierer, Kay-Michael W\u00fcrzner',\n author_email='unixprog@gmail.com, wuerzner@gmail.com',\n url='https://github.com/OCR-D/ocrd_kraken',\n license='Apache License 2.0',\n packages=find_packages(exclude=('tests', 'docs')),\n install_requires=[\n 'ocrd >= 1.0.0a4',\n 'kraken == 0.9.16',\n 'click >= 7',\n ],\n package_data={\n '': ['*.json', '*.yml', '*.yaml'],\n },\n entry_points={\n 'console_scripts': [\n 'ocrd-kraken-binarize=ocrd_kraken.cli:ocrd_kraken_binarize',\n 'ocrd-kraken-segment=ocrd_kraken.cli:ocrd_kraken_segment',\n ]\n },\n)\n" }, "git": { - "last_commit": "Mon Oct 21 20:19:00 2019 +0200", - "number_of_commits": "84" + "last_commit": "Mon Oct 21 20:52:26 2019 +0200", + "number_of_commits": "85" }, "name": "ocrd_kraken", "ocrd_tool": { @@ -199,7 +264,7 @@ "type": "number" }, "scale": { - "default": null, + "default": 0, "format": "float", "type": "number" }, -- GitLab