
Commit 442a8247 authored by Andreas Wagner

Update documentation in README.md.

parent 96c8824c
......@@ -4,16 +4,167 @@
[![Go Doc](https://img.shields.io/badge/godoc-reference-blue.svg?style=flat-square)](http://godoc.org/gitlab.gwdg.de/rg-mpg-de/tei2zenodo)
[![Release](https://img.shields.io/gitlab.gwdg.de/rg-mpg-de/tei2zenodo.svg?style=flat-square)](https://gitlab.gwdg.de/rg-mpg-de/tei2zenodo/releases/latest)
This is the TEI to Zenodo service developed at the [Max Planck Institute for European Legal History](http://www.rg.mpg.de/). It is meant to provide a means to quickly push TEI XML files to zenodo deposits, thereby assigning them a DOI identifier and committing them to long-term archival. Files can be uploaded with a REST POST command or by calling a webhook that will retrieve the file(s).
This is the "TEI to Zenodo service" developed at the [Max Planck Institute for European Legal History](http://www.rg.mpg.de/). It is meant to provide a means to quickly push [TEI-encoded XML files](https://tei-c.org/guidelines/p5/) to [Zenodo](https://about.zenodo.org/) deposits, thereby assigning them a DOI identifier and committing them to long-term archival. The software's REST API accepts direct file uploads with POST requests, or webhook calls via POST requests to another one of its endpoints. The idea is that you configure a webhook in your git repository that calls the software, which then looks up and retrieves the TEI files from the repository and creates individual zenodo deposits for each of them.
This software:
- accepts and processes [github webhooks](https://developer.github.com/webhooks/) (currently of the `push` type only)
- identifies all files that have been modified in the action that triggered the webhook
- is capable of filtering by user-defined phrases that must appear in the commit message for the commit's files to be eligible
- retrieves all relevant files
- parses all TEI files and assigns values to the various [metadata fields that zenodo accepts/requires](https://developers.zenodo.org/#entities). It does this using a user-specified configuration based on (simple) [XPath](https://www.w3.org/TR/1999/REC-xpath-19991116/) expressions.
- creates a new [zenodo](https://about.zenodo.org/) deposit with a new DOI, adds the DOI to the TEI file and uploads the file to the deposit
- is capable of uploading the deposit to zenodo and *not* publishing it yet if another user-defined phrase is present; publishes the deposit otherwise
- is capable of looking up an existing deposit (if the TEI file mentions its own zenodo DOI entry) and creating a new version of it. This new version will have a new DOI, so the software deletes the old DOI and adds the new one before uploading the file to zenodo
<img align="left" style="margin-right:10px;" src="https://upload.wikimedia.org/wikipedia/commons/d/d1/Emblem-notice.svg"/>
Note that, since the XPath library that this software uses only supports basic XPath functions, you cannot really parse or manipulate the values via configuration settings. This means that tei2zenodo presupposes that you use some of zenodo's controlled vocabulary in your TEI markup. For example, the license name or the editor roles in your TEI files are expected to be compatible with zenodo. You could, for example, use the `@n`-attribute of TEI's `<editor>` element to hold the required string like "cc-by" or use zenodo's controlled vocabulary for contributor types to specify the TEI `<editor>`'s `@role`-attribute...
## Installation & Setup
There are several ways of obtaining the software. It does not need to be installed in a particular place; it only needs to find a configuration file (see below) and to have sufficient privileges to bind to the port specified in the config. So, once you have it installed and configured, just call `t2zd` and that's it.
1. The default way of getting the software is downloading an asset on the [releases page](-/releases). There are precompiled binaries zipped together with a configuration template on that page.
2. The package is maintained as a [Go](https://golang.org/) repository, so if you have Go installed, you can use it to compile and install the software in a single step with the command `go get -u gitlab.gwdg.de/rg-mpg-de/tei2zenodo`. This will put the executable in the `$GOPATH/bin` directory, so that, on a standard Go installation, it can be found automatically from whatever directory you're in. (I recommend putting the configuration file in a `.t2z` subdirectory of your home directory. *Note: I will have to figure out how this recommendation should be formulated on windows/mac systems.*)
3. If you want to compile the source code yourself, you need the [Go compiler](https://golang.org/) as well. Then, after retrieving the source code, either by cloning the git repository or by using one of the "download source code" options, you can compile it in its main directory with the command `go build -ldflags "-s -w" -o t2zd cmd/t2zd/main.go`. You can even skip the optimization switches (`-ldflags ...`) or the naming part (`-o t2zd`) and just say `go build cmd/t2zd/main.go` (and use `main` as the command to launch the server), but the longer command is the one I am using most of the time.
It is trivial to cross-compile code with Go. In other words, you can compile the software for many different target systems on your own system. For instance, you could compile on your Windows desktop, copy the executable over to your Linux server and run the software there. There is a good description of the process and of the various possible combinations of platform and architecture over at [Digitalocean](https://www.digitalocean.com/community/tutorials/how-to-build-go-executables-for-multiple-platforms-on-ubuntu-16-04#step-4-%E2%80%94-building-executables-for-different-architectures).
## Configuration
This service is configured via a `config.json` file residing either in the current directory, in the `configs` directory below the current directory or the `.t2z` directory below the current user's $home directory.
This service is configured via a `config.json` file residing either in the current working directory, in the `configs` directory below the current directory or in the `.t2z` directory below the current user's home directory (`$HOME`).
In this file, you can specify the listening port for this service, the API endpoints you want to be active (if you want to disable one, just set the value to the empty string `""`), the zenodo connection, the git context allowed to post to the webhook, and how to parse the XML files that this service processes into zenodo metadata fields. For an example, have a look at the [./configs/config.json.tpl](./configs/config.json.tpl) template file.
For both the github and the zenodo connections, you need personal tokens that you can create at the respective site. Github has a [description of how to create such a token](https://help.github.com/en/github/authenticating-to-github/creating-a-personal-access-token-for-the-command-line) and the [zenodo page for doing so](https://zenodo.org/account/settings/applications/tokens/new/) is really simple to understand, too. (In zenodo, your token needs "deposit:actions" and "deposit:write" privileges, in github your token needs "repo" scope.)
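The three lookup locations for `config.json` named at the start of this section can be sketched in a few lines of Python. This is only an illustration and not part of tei2zenodo: the function names are made up, and the order in which the real daemon tries the locations is an assumption taken from the order of the sentence above.

```python
from pathlib import Path
from typing import List, Optional


def candidate_config_paths(cwd: Path, home: Path) -> List[Path]:
    """Return the three documented locations of config.json.

    The list order follows the README text; that tei2zenodo gives
    precedence in exactly this order is an assumption.
    """
    return [
        cwd / "config.json",
        cwd / "configs" / "config.json",
        home / ".t2z" / "config.json",
    ]


def find_config(cwd: Path, home: Path) -> Optional[Path]:
    """Return the first candidate that exists as a file, or None."""
    for path in candidate_config_paths(cwd, home):
        if path.is_file():
            return path
    return None
```

Calling `find_config(Path.cwd(), Path.home())` before starting the daemon is a quick way to check which file, if any, is present in the documented places.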
The configuration file is a json file that has several sections:
### Root elements
- `ListenSpec` *numeric*: This allows you to specify which port the t2z daemon should be listening on. It defaults to 8081.
- `Verbose` *boolean*: This switches between terse and verbose mode for the console output. It defaults to false.
- `APIRoot` *string*: This allows you to specify a path below which the various api endpoints can be reached. It defaults to `/api/v1`.
- `FileAPI` *string*: Subfolder of `APIRoot` where the API endpoint that receives direct file upload POST requests can be reached. Setting this to an empty string disables the file upload API endpoint. It defaults to `/file`.
- `WebhookAPI` *string*: Subfolder of `APIRoot` where the API endpoint that receives git webhook event notifications can be reached. Setting this to an empty string disables the webhook API endpoint. Currently only POST requests are accepted. It defaults to `/hooks/receivers/github/events/`.
Then, there are subsections at these keys: `Log`, `Zenodo`, `Git` and `Metadata`. Except for the `ListenSpec` and `Verbose` settings mentioned above, and some object fields described below in the metadata section, all fields and values are of *string* type.
### Log configuration
The `Log` section contains a single `file` key-value pair, specifying the file that tei2zenodo should write its log entries to:
```json
"Log": {
    "file": "t2z.log"
}
```
### Zenodo service configuration
For various checks, you need to specify zenodo's DOI pattern - the DOI up to and including the full stop - in the `prefix` key.
Note that zenodo offers a dedicated server for testing at `https://sandbox.zenodo.org`, and that this server issues DOIs different from the "real" zenodo DOIs at `https://zenodo.org`: the "fake" DOIs have a prefix of `10.5072/zenodo.`, the "true" DOIs have `10.5281/zenodo.`.
Here is a complete zenodo config section:
```json
"Zenodo": {
    "prefix": "10.5072/zenodo.",
    "host": "https://sandbox.zenodo.org",
    "token": "aBcDeFgHiJkLmNoPqRsTuVwXyZ"
}
```
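To illustrate what such a prefix check might look like, here is a small Python sketch. It is not taken from the tei2zenodo source: the function name is hypothetical, and the assumption is simply that a DOI counts as "one of ours" when it starts with the configured `prefix` and continues with a numeric identifier.

```python
from typing import Optional

SANDBOX_PREFIX = "10.5072/zenodo."      # DOIs issued by sandbox.zenodo.org
PRODUCTION_PREFIX = "10.5281/zenodo."   # DOIs issued by zenodo.org


def deposit_id_from_doi(doi: str, prefix: str) -> Optional[str]:
    """Return the numeric part after the configured prefix.

    Returns None if the DOI does not start with the prefix or if the
    remainder is not purely numeric.
    """
    if not doi.startswith(prefix):
        return None
    suffix = doi[len(prefix):]
    return suffix if suffix.isdigit() else None
```

With the sandbox prefix configured, a production DOI such as `10.5281/zenodo.123456` is rejected, which shows why `prefix` and `host` have to be switched together when moving from testing to production.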
In this file, you can specify the listening port for this service, the API endpoints you want to be alive (if you want to disable one, just set the value to the empty string ""), the zenodo connection (host, port, token, DOI prefix), the git context allowed to post to the webhook, and how to parse the XML files that this service processes into zenodo metadata fields. Each of these XML parsing entries consists of the name of zenodo's "receiving" field, and either an xpath, xpath expression, or the combination of xpath and subfields (that consist of fieldnames, xpath/xexpression fields in turn). For details, have a look at the [./configs/config.json.tpl](./configs/config.json.tpl) template file.
### Git repository configuration
In the `Git` config section, besides the obvious `host` and `token` keys, there are several keys that allow you to control what is being handled and how:
Since the software can receive hooks from anywhere, and relies on information contained in the hook in order to "follow its nose" and retrieve the files that it then uploads to zenodo, you can restrict processing to a single repository by naming it in the `repo` key. Hooks coming in from other repositories will then be ignored. If you leave the `repo` key empty, on the other hand, hooks from all repositories will be processed.
The same holds for the `user` key: If you specify a value here, only hooks initiated by this github user will be processed. (In the case of "push" webhooks, the payload's `pusher.name` field is compared to this config value.)
The processing is triggered by a hook that can comprise several commits, and each of the commits can affect several files. If you specify something in the `commit_phrase` key, only files from commits whose messages contain the specified phrase will be processed. Leave it empty to process all commits.
On the other hand, if you specify something in the `commit_dontpublish_phrase` key, processing happens as normal, but the deposits of all the files affected by commits whose messages contain this phrase will not be finally published. When you log in to zenodo and click on the "Upload" button, you will see your unpublished uploads, waiting for you to inspect and finally publish them.
Here is a complete github config section:
```json
"Git": {
    "host": "https://api.github.com",
    "token": "aBcDeFgHiJkLmNoPqRsTuVwXyZ",
    "user": "foobar",
    "repo": "octocat/hello-world",
    "commit_phrase": "",
    "commit_dontpublish_phrase": "test"
}
```
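The interplay of `repo`, `user`, `commit_phrase` and `commit_dontpublish_phrase` can be sketched as a Python function that works on a github `push` payload. This is a reading of the rules described in this section, not the actual tei2zenodo code. Several points are assumptions: that `repo` is compared to the payload's `repository.full_name`, that both added and modified files count as eligible, and that a file touched by several commits stays unpublished if any of those commits carries the "don't publish" phrase.

```python
from typing import Dict


def select_files(payload: dict, git_cfg: dict) -> Dict[str, bool]:
    """Map each eligible file path to True (publish) or False (keep as draft)."""
    repo = git_cfg.get("repo", "")
    user = git_cfg.get("user", "")
    phrase = git_cfg.get("commit_phrase", "")
    dont_publish = git_cfg.get("commit_dontpublish_phrase", "")

    # An empty "repo" or "user" value means: no restriction.
    if repo and payload["repository"]["full_name"] != repo:
        return {}
    if user and payload["pusher"]["name"] != user:
        return {}

    result: Dict[str, bool] = {}
    for commit in payload.get("commits", []):
        message = commit.get("message", "")
        if phrase and phrase not in message:
            continue  # commit message lacks the required phrase
        publish = not (dont_publish and dont_publish in message)
        for path in commit.get("added", []) + commit.get("modified", []):
            # Assumption: one "don't publish" commit is enough to hold a file back.
            result[path] = result.get(path, True) and publish
    return result
```

With the config section shown above, a push by `foobar` containing one commit "Fix header" that modifies `a.xml` and one commit "test upload" that adds `b.xml` would publish the deposit for `a.xml` and leave the one for `b.xml` as a draft.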
### Metadata configuration
The metadata section is where you specify how to fill the various metadata fields that zenodo expects. You can specify pairs of metadata fields and simple XPaths at which to retrieve the text going in the field like this:
```json
{
    "field": "title",
    "xpath": "//titleStmt//title[@type='main']"
},
{
    "field": "keywords",
    "xpath": "//teiHeader/profileDesc/textClass/keywords/term"
}
```
As an alternative to an XPath, you can specify an XPath expression if you want to use XPath functions. [Here](https://github.com/antchfx/xpath#supported-features) you can see which XPath patterns and functions are supported.
```json
{
    "field": "description",
    "xexpression": "string('One of the seminal works published in the context of the project XYZ.')"
}
```
Finally, some fields are not plain text fields but rather (lists of) nested objects. This applies to the following zenodo metadata fields: `creators`, `contributors`, `thesis_supervisors`, `subjects`, `related_identifiers`, `communities`, `grants`, `locations`, `dates`.
For these, you specify the object name in the `field` value, the XPath pattern that identifies all instances of the object in your TEI file in the `xpath` value, and then, under the key `subfields`, a list of `field`/`xpath` or `field`/`xexpression` pairs. The latter are interpreted relative to the main XPath pattern:
```json
{
    "field": "contributors",
    "xpath": "//titleStmt/editor",
    "subfields": [
        {
            "field": "name",
            "xpath": "."
        },
        {
            "field": "type",
            "xpath": "@role"
        },
        {
            "field": "orcid",
            "xpath": "@ref"
        }
    ]
}
```
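To make the effect of such a `subfields` entry more tangible, the following Python sketch applies the `contributors` entry from above to a tiny TEI fragment. It uses Python's `xml.etree.ElementTree` instead of the Go XPath library that tei2zenodo relies on, so it only handles the two relative forms used in the example (`.` and `@attribute`). The sample document is invented and omits the TEI namespace for brevity, and dropping subfields whose value is missing is an assumption of this sketch.

```python
import xml.etree.ElementTree as ET

# Invented sample; real TEI files declare the TEI namespace.
TEI_SAMPLE = """
<TEI>
  <teiHeader>
    <fileDesc>
      <titleStmt>
        <title type="main">Example Edition</title>
        <editor role="Editor" ref="0000-0002-1825-0097">Doe, Jane</editor>
        <editor role="DataCurator">Roe, Richard</editor>
      </titleStmt>
    </fileDesc>
  </teiHeader>
</TEI>
"""

CONTRIBUTORS_ENTRY = {
    "field": "contributors",
    "xpath": "//titleStmt/editor",
    "subfields": [
        {"field": "name", "xpath": "."},
        {"field": "type", "xpath": "@role"},
        {"field": "orcid", "xpath": "@ref"},
    ],
}


def eval_relative(element, xpath):
    """Evaluate a subfield XPath relative to one matched element."""
    if xpath == ".":
        return "".join(element.itertext()).strip()
    if xpath.startswith("@"):
        return element.get(xpath[1:])
    raise ValueError(f"unsupported xpath in this sketch: {xpath}")


def extract_objects(root, entry):
    """Build one dict per element matched by the entry's main XPath."""
    # ElementTree expects './/' where the config writes '//'.
    path = entry["xpath"]
    if path.startswith("//"):
        path = "." + path
    objects = []
    for element in root.findall(path):
        obj = {}
        for sub in entry["subfields"]:
            value = eval_relative(element, sub["xpath"])
            if value:  # assumption: absent attributes are simply left out
                obj[sub["field"]] = value
        objects.append(obj)
    return objects


contributors = extract_objects(ET.fromstring(TEI_SAMPLE.strip()), CONTRIBUTORS_ENTRY)
```

Here `contributors` holds one object per `<editor>` element, with `name` taken from the element text and `type` and `orcid` taken from the `@role` and `@ref` attributes. Note that the `@role` values in the sample already use zenodo's contributor types, which is the kind of alignment between markup and vocabulary that this README asks for.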
For documentation of all of zenodo's metadata fields, see [zenodo's developer page](https://developers.zenodo.org/?python#entities). For any zenodo upload, the following fields are mandatory: `title`, `creators`, `upload_type`, `publication_type` *(if `upload_type` is `publication`)*, `publication_date` *(in ISO 8601 format, i.e. YYYY-MM-DD)*, `description`, `access_right`, `license` *(if `access_right` is `open` or `embargoed`)*, `embargo_date` *(if `access_right` is `embargoed`)*, `access_conditions` *(if `access_right` is `restricted`)*. Some of the fields (e.g. `contributors.type`, `access_right`, or `license`) use a controlled vocabulary that you can look up over at [zenodo](https://developers.zenodo.org/?python#representation).
Again, remember that you probably need to meet tei2zenodo half-way by using some of zenodo's controlled vocabulary in your TEI markup.
For a full example, have a look at the [template file](./configs/config.json.tpl).
## Set up github webhook
To set up a [github webhook](https://developer.github.com/webhooks/) that triggers a zenodo upload automatically with every push, go to your repository's settings in github, click on the "Webhooks" option in the left menu, and then on the "Add Webhook" button. In the "Payload URL" field, specify your tei2zenodo daemon URL, e.g. `http://123.45.123.45:8081/api/v1/hooks/receivers/github/events/`. For "Content type", select `application/json`, and choose to have "Just the *push* event" trigger the webhook. As soon as you save this, a ping event is sent to your server, so if your tei2zenodo daemon is listening, it should respond and give some output on its console.
## API endpoints
By default, this software listens at the following API endpoints:
- `/api/v1/file` (POST, content-type: `application/xml`; receives a TEI file and a `?doPublish=(False|True)` URL query parameter)
- `/api/v1/hooks/receivers/github/events/` (POST)
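As an illustration of a direct file upload, the following Python sketch builds a POST request for the file endpoint from the documented default values of `ListenSpec`, `APIRoot` and `FileAPI`. The helper name, the `localhost` host and the file name in the usage note are assumptions made for the example; only the path, the content type and the `doPublish` parameter come from the list above.

```python
import urllib.parse
import urllib.request

# Documented defaults of the root configuration keys.
DEFAULTS = {"ListenSpec": 8081, "APIRoot": "/api/v1", "FileAPI": "/file"}


def build_upload_request(tei_bytes, host="localhost", do_publish=False, cfg=DEFAULTS):
    """Build (but do not send) a POST request for the file upload endpoint."""
    query = urllib.parse.urlencode({"doPublish": "True" if do_publish else "False"})
    url = f"http://{host}:{cfg['ListenSpec']}{cfg['APIRoot']}{cfg['FileAPI']}?{query}"
    return urllib.request.Request(
        url,
        data=tei_bytes,
        method="POST",
        headers={"Content-Type": "application/xml"},
    )
```

To actually send a file, read it as bytes and pass the request to `urllib.request.urlopen`, e.g. `urllib.request.urlopen(build_upload_request(open("edition.xml", "rb").read()))`. Keeping `do_publish` at `False` while testing, ideally against the zenodo sandbox, avoids publishing deposits by accident.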
......@@ -21,7 +172,9 @@ In this file, you can specify the listening port for this service, the API endpo
## Development
This service has been written in [Go](https://golang.org/) by [Andreas Wagner](https://orcid.org/0000-0003-1835-1653).
This service has been written in [Go](https://golang.org/) by [Andreas Wagner](https://orcid.org/0000-0003-1835-1653) (twitter: [anwagnerdreas](https://twitter.com/anwagnerdreas)). Suggestions, error reports and other feedback are welcome. You are invited to open an issue at <https://gitlab.gwdg.de/rg-mpg-de/tei2zenodo/-/issues>. There are some [issues that need discussion](https://gitlab.gwdg.de/rg-mpg-de/tei2zenodo/-/issues?label_name%5B%5D=discussion), where I would be particularly thankful for feedback.
Enjoy!
## Licence
......