Implement (flexible mechanism for) automatic checks
We envisage the curation process to be a combination of manual AND automatic tasks which inform each other.
The SSHOCMP Spec v2.0 DRAFT-Curation Component contains some ideas on automatic checks (though these are starting points at best). They should be fleshed out further in the Editorial guidelines.
Besides the question of which specific checks can be automated, there is the question of the general mechanism for implementing the checks.
Points to consider:
- these checks should be flexible, i.e. they shouldn't be hard-coded as part of the backend, because then every change to the checks would require rebuilding and redeploying the whole backend. Rather, they should be considered a separate "component".
- at the same time, the checks need to be systematic and their results need to be an integral part of the MP data, so that the human moderators can operate on a consistent information set.
The first point is well satisfied by scripts (e.g. a Python notebook) that work against the API, as e.g. Cesare already did when checking the ingest of TAPoR #7: https://gitlab.gwdg.de/sshoc/data-ingestion/-/blob/master/repositories/tapor/tools/TAPoRCheck.ipynb While this approach doesn't by itself satisfy the second point, it could be extended so that the scripts also write to the API, similarly to wiki bots. This would give us full flexibility to "play around" with the data, run all kinds of analyses (with Python libraries) and then act upon the results.
"Write operations" or "editing of the data" don't necessarily mean that individual property values will be changed (though this would be possible too and would still be subject to versioning); they could also just mean setting certain flags that inform moderators of issues to look at. (Admittedly, "setting flags" will boil down to changing values of some dedicated "technical" properties, but you know what I mean.) To give an example:
If a "check-script" identifies a broken link in `accessibleAt`, it will not (be able to) correct the link, but it can add a dynamic property: `link-status=404`, or so.
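A minimal sketch of such a check-script, to make the idea concrete. Everything about the data shape is an assumption for illustration: the item is taken to be a dict with an `id` and a list of URLs under `accessibleAt`, and the flag is a plain dict whose keys (`item`, `property`, `url`, `value`) are hypothetical, not the actual MP API format. The check only proposes flags; writing them back is left to whatever client talks to the API.

```python
import urllib.error
import urllib.request


def http_status(url, timeout=10):
    """Return the HTTP status code for url, or None if no response was received."""
    # HEAD avoids downloading the body; some servers reject it, so a real
    # check might need to fall back to GET.
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        return err.code  # the server answered with an error status, e.g. 404
    except (urllib.error.URLError, ValueError, OSError):
        return None  # DNS failure, timeout, malformed URL, ...


def link_status_flags(item, fetch_status=http_status):
    """Propose one 'link-status' flag per broken link in item['accessibleAt'].

    The item layout and the flag keys are assumptions for illustration only.
    """
    flags = []
    for url in item.get("accessibleAt", []):
        status = fetch_status(url)
        if status is not None and 200 <= status < 400:
            continue  # link works, nothing to flag
        flags.append({
            "item": item["id"],
            "property": "link-status",
            "url": url,
            "value": str(status) if status is not None else "unreachable",
        })
    return flags
```

Passing `fetch_status` as a parameter keeps the check itself testable without network access; a moderator-facing run would simply use the default.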
We would need to ensure a) security, i.e. that only authorized personnel can perform write operations against the API, and b) that the checks run continuously (not sure how easily one can run a Python notebook via a cron job), but that is not an urgent issue, because at least in the initial period the scripts should be running in a "semi-supervised" mode anyhow.
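One possible shape for the "semi-supervised" mode, as a rough sketch only: checks run in a dry-run mode by default and merely return the flags they would set, and nothing is written unless the run is explicitly switched to write mode with credentials present. The names `run_checks`, `make_writer` and the environment variable `MP_API_TOKEN` are hypothetical, and `send` stands in for whatever function actually calls the API.

```python
import os


def run_checks(items, checks, write_flag, dry_run=True):
    """Apply every check to every item and return the proposed flags.

    In dry-run mode nothing is written; the flags are only returned so that
    a human can review them first.
    """
    proposed = []
    for item in items:
        for check in checks:
            proposed.extend(check(item))
    if not dry_run:
        for flag in proposed:
            write_flag(flag)
    return proposed


def make_writer(send, token_env="MP_API_TOKEN"):
    """Wrap send(flag, token) so that writing fails early without credentials.

    The variable name is an assumption; the token is read from the
    environment so that it never has to be stored in the script or notebook.
    """
    token = os.environ.get(token_env)
    if not token:
        raise PermissionError(f"{token_env} is not set; refusing to write")
    return lambda flag: send(flag, token)
```

Since the checks are ordinary functions, the same code can be called from a notebook while experimenting and from a plain script once it is scheduled.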
Comments, critique, or any other ideas on how to solve the automatic checks are welcome.
(notify: @mkozak, @klaus.illmayer, @ymoranv, @stefan.probst, @lbarbot, @frank.fischer01, @tparkola, @swolarz, @sotiris.karampatakis )