This document describes the requirements and the implementation plan of a web-based service to harvest metadata records of submissions to the European Nucleotide Archive (ENA) (https://bit.ly/3pPEdv0). The harvested records are gathered on a regular basis. The service allows a conversion of the collected records into other formats using XSLT transformations. The service stores the results in a database to decouple metadata provision from the availability of the original repository (e.g. ENA) index and to speed up response times. Finally the service provides access to collected resources via an OAI-PMH compatible API for client applications to consume the records.
Use Case
Target audience
Repositories and data providers whose APIs are not yet OAI-PMH compatible.
As a repository, I want external services to be able to consume my content. I can use the service described here to standardize metadata provision.
Search providers
As a search provider or broker, I am interested in harvesting data and metadata from different repositories. To streamline the ingestion of items, I can use the service described here as a proxy or middleman that takes care of the heavy lifting.
Example
An example is the PANGEA search index, which implements an OAI-PMH harvester. The ENA index, on the other hand, is not OAI-PMH compliant. We take ENA as the first instance for setting up this service before generalizing it for other repositories.
Requirements
Dockerized (cloud ready)
Configurable (e.g. query to API, transformations)
Query metadata APIs (configurable)
Asynchronous (Cache the items)
Deliver records on request
OAI-PMH compliant protocol (minimal working implementation first)
Email on errors (to admins, e.g. on HTTP 40x, HTTP 50x)
Authorization and authentication (API token)
Implementation
Baseline
We use Python as the language for the backend and use Django together with the Django REST Framework to provide the service endpoints.
Test strategy
We follow a test-driven strategy with unit and integration tests. The regular endpoints of our API can be tested with Django's test tools (e.g. django.test.TestCase). For OAI-PMH, we can validate the XML responses of our service against the official OAI-PMH.xsd schema.
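As a first sanity check before full schema validation, we can assert that a response is well-formed and uses the OAI-PMH root element. A minimal sketch using only the standard library (the function name and the sample document are illustrative; real validation would use a schema-aware library against OAI-PMH.xsd):

```python
import xml.etree.ElementTree as ET

OAI_NS = "http://www.openarchives.org/OAI/2.0/"


def check_oai_response(xml_text):
    """Return True if the document is well-formed XML, has the
    <OAI-PMH> root in the OAI namespace and a <responseDate> child.
    Raises ET.ParseError on malformed XML."""
    root = ET.fromstring(xml_text)
    return (
        root.tag == f"{{{OAI_NS}}}OAI-PMH"
        and root.find(f"{{{OAI_NS}}}responseDate") is not None
    )


# Illustrative sample document, not a full OAI-PMH response.
sample = (
    '<?xml version="1.0"?>'
    f'<OAI-PMH xmlns="{OAI_NS}">'
    "<responseDate>2021-02-01T12:00:00Z</responseDate>"
    "</OAI-PMH>"
)
```

Such checks fit naturally into the unit tests mentioned above; full conformance still requires validating against the schema.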
Scheduling
We want to regularly fetch items (e.g. every night) from an external API (our use case here is ENA). To get this done we could use a Django job-scheduling package. Celery, on the other hand, seems more flexible overall; however, it brings additional dependencies, since Celery requires a message broker such as Redis.
https://bit.ly/3qUmFz8 (Django Jobs)
https://bit.ly/37Jky9y (Django Celery)
https://bit.ly/3qTvlpb (Django + Redis + Celery)
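If we go with Celery, the nightly harvest could be declared as a beat schedule in the Django settings. A sketch only: the task path, broker URL, and schedule time are placeholders, not decisions from this plan.

```python
# settings.py fragment (sketch): nightly harvest via Celery beat.
# "harvester.tasks.fetch_ena_records" and the Redis URL are assumed names.
from celery.schedules import crontab

CELERY_BROKER_URL = "redis://localhost:6379/0"

CELERY_BEAT_SCHEDULE = {
    "nightly-ena-harvest": {
        "task": "harvester.tasks.fetch_ena_records",
        "schedule": crontab(hour=2, minute=0),  # every night at 02:00
    },
}
```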
Storage
We handle XML-based metadata, so we could store XML directly in our database. PostgreSQL offers a dedicated xml column type for this. This functionality, however, is only available if PostgreSQL is built with libxml support.
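The caching idea itself — store harvested records locally so responses do not depend on the upstream (e.g. ENA) being available — can be illustrated with a small, portable sketch. It uses the standard library's sqlite3 purely for illustration; in production this would be a PostgreSQL table, and the table and column names are assumptions:

```python
import sqlite3

# Cache table: one row per harvested record, keyed by its identifier.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE records ("
    "  identifier TEXT PRIMARY KEY,"
    "  harvested_at TEXT NOT NULL,"
    "  metadata_xml TEXT NOT NULL)"
)


def store_record(identifier, harvested_at, xml_text):
    # Upsert so re-harvesting the same item refreshes the cached copy.
    conn.execute(
        "INSERT INTO records VALUES (?, ?, ?) "
        "ON CONFLICT(identifier) DO UPDATE SET "
        "harvested_at=excluded.harvested_at, "
        "metadata_xml=excluded.metadata_xml",
        (identifier, harvested_at, xml_text),
    )


store_record("oai:ena:ERP000001", "2021-02-01T02:00:00Z", "<record/>")
row = conn.execute(
    "SELECT metadata_xml FROM records WHERE identifier=?",
    ("oai:ena:ERP000001",),
).fetchone()
```

Serving OAI-PMH responses then reads only from this local table, which is what decouples metadata provision from the original repository and speeds up response times.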
API
For now we build a headless system that only provides an API for interaction.
Django REST Framework (implementing the REST API)
Handle users, authorization and authentication
https://bit.ly/37MjW39
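For the token-based authorization and authentication requirement, the Django REST Framework settings could look roughly like this. A sketch under assumptions: the choice of TokenAuthentication and the permission class are illustrative, not decisions from this plan.

```python
# settings.py fragment (sketch): API-token auth with Django REST Framework.
INSTALLED_APPS = [
    # ... our own apps ...
    "rest_framework",
    "rest_framework.authtoken",  # provides per-user API tokens
]

REST_FRAMEWORK = {
    "DEFAULT_AUTHENTICATION_CLASSES": [
        "rest_framework.authentication.TokenAuthentication",
    ],
    "DEFAULT_PERMISSION_CLASSES": [
        "rest_framework.permissions.IsAuthenticated",
    ],
}
```

Note that the OAI-PMH endpoint itself may need to stay publicly readable for harvesters, so authentication would apply to the management API rather than everywhere.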
OAI-PMH (XML over HTTP query standard)
First get a minimal implementation (https://bit.ly/2NChVja)
Then extend it if needed...
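To make the "minimal implementation first" concrete, a partial Identify response can be built with the standard library alone. This is a sketch: the field values are placeholders, and a compliant response needs further elements (protocolVersion, adminEmail, earliestDatestamp, ...).

```python
import xml.etree.ElementTree as ET
from datetime import datetime, timezone

OAI_NS = "http://www.openarchives.org/OAI/2.0/"


def identify_response(base_url, repository_name):
    """Build a partial OAI-PMH Identify response as an XML string."""
    ET.register_namespace("", OAI_NS)  # serialize with a default namespace
    root = ET.Element(f"{{{OAI_NS}}}OAI-PMH")
    date = ET.SubElement(root, f"{{{OAI_NS}}}responseDate")
    date.text = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    request = ET.SubElement(root, f"{{{OAI_NS}}}request", verb="Identify")
    request.text = base_url
    identify = ET.SubElement(root, f"{{{OAI_NS}}}Identify")
    name = ET.SubElement(identify, f"{{{OAI_NS}}}repositoryName")
    name.text = repository_name
    return ET.tostring(root, encoding="unicode")


xml_out = identify_response("https://example.org/oai", "ENA harvest cache")
```

In the Django setup, a view would return such a string with content type text/xml; the remaining verbs (ListRecords, GetRecord, ...) can be added one by one.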
Documentation
We can use Sphinx for the general documentation. Since we are using the Django REST Framework, we can also use its self-describing API documentation. If further requirements for API documentation come up, we could switch to Swagger/OpenAPI.
https://bit.ly/3sqbRcq (Sphinx)
https://bit.ly/2NAYpU4 (API documentation)
Templating
We might benefit from templating with cookiecutter: select an appropriate recipe, fork it, and adapt it to our needs. That way the whole application can be created and recreated in a transparent way. This might also be interesting for other projects that want to get involved, since they could simply set up a local test instance.
Capacity building
This section describes what I need to learn and dig through before I can build this up. I have to work through the resources listed below, focusing on quick-start guides first and then approaching the rest piecemeal in a practical context: learning by doing.
Understand Django
In progress
Getting started with the Django REST Framework
Pending
Check out testing in Django
Pending
Read the OAI-PMH documentation
In progress
Next Steps
Organize code in “our” GitLab instance
Project …
Issues/Tickets
Wiki (transfer this document)
Organize code reviews
Resources:
There are already generic implementations of an OAI-PMH client and server. We may be able to reuse them.
https://github.com/infrae/pyoai
Requirements
The endpoint accepts a "last_harvest_date" parameter to determine when the metadata was last harvested.
Basic metadata in the beginning (see code example of Ivo from 2018). Later, also fetch sample and experiment metadata and aggregate important parameters, e.g. number of samples, type of sequencing, MIxS compliance.
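The last_harvest_date requirement amounts to building an incremental query against the upstream API. A minimal sketch: the endpoint URL and parameter names below are placeholders, and the real ENA parameters must be taken from its API documentation.

```python
from urllib.parse import urlencode


def build_harvest_url(base_url, last_harvest_date):
    """Build the query URL for fetching records updated since the
    last harvest. Parameter names are assumed, not ENA's real ones."""
    params = {
        "result": "read_study",  # assumed result type
        "query": f"last_updated>={last_harvest_date}",  # assumed filter
        "format": "xml",
    }
    return f"{base_url}?{urlencode(params)}"


url = build_harvest_url("https://example.org/ena/search", "2021-01-31")
```

The harvester would store the timestamp of each successful run and pass it as last_harvest_date on the next scheduled fetch, so only new or changed records are transferred.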