This document describes the requirements and the implementation plan of a web-based service to harvest metadata records of submissions to the European Nucleotide Archive (ENA) (https://bit.ly/3pPEdv0). The harvested records are gathered on a regular basis. The service allows a conversion of the collected records into other formats using XSLT transformations. The service stores the results in a database to decouple metadata provision from the availability of the original repository (e.g. ENA) index and to speed up response times. Finally the service provides access to collected resources via an OAI-PMH compatible API for client applications to consume the records.
Use Case
Target audience
Repositories and data providers whose APIs are not yet OAI-PMH compatible.
As a repository, I want external services to be able to consume my content. I can use the service described here to standardize metadata provision.
Search providers
As a search provider or broker, I am interested in harvesting data and metadata from different repositories. To streamline the ingestion of items, I can use the service described here as a proxy or middleman that takes care of the heavy lifting.
Example
An example is the PANGEA search index, which implements an OAI-PMH harvester. The ENA index, on the other hand, is not OAI-PMH compliant. We take ENA as the first instance for setting up this service before generalizing it for other repositories.
Requirements
Dockerized (cloud ready)
Configurable (e.g. query to API, transformations)
Query metadata APIs (configurable)
Asynchronous (Cache the items)
Deliver records on request
OAI-PMH compliant protocol (minimal working implementation first)
Email on errors (to admins, e.g. on HTTP 40x, HTTP 50x)
Authorization and authentication (API token)
Implementation
Baseline
We use Python as the language for the backend and use Django together with the Django REST Framework to provide the service endpoints.
Test strategy
We follow a test-driven strategy with unit and integration tests. The regular endpoints of our API can be tested with Django's test tools (e.g. django.test.TestCase). For OAI-PMH, we can validate the XML responses of our service against the official OAI-PMH.xsd schema.
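As a first sanity check before full schema validation, we can assert that a response is well-formed and uses the OAI-PMH root element. A minimal sketch using only the standard library (the function name and the sample document are illustrative; real validation would use a schema-aware library against OAI-PMH.xsd):

```python
import xml.etree.ElementTree as ET

OAI_NS = "http://www.openarchives.org/OAI/2.0/"


def check_oai_response(xml_text):
    """Return True if the document is well-formed XML, has the
    <OAI-PMH> root in the OAI namespace and a <responseDate> child.
    Raises ET.ParseError on malformed XML."""
    root = ET.fromstring(xml_text)
    return (
        root.tag == f"{{{OAI_NS}}}OAI-PMH"
        and root.find(f"{{{OAI_NS}}}responseDate") is not None
    )


# Illustrative sample document, not a full OAI-PMH response.
sample = (
    '<?xml version="1.0"?>'
    f'<OAI-PMH xmlns="{OAI_NS}">'
    "<responseDate>2021-02-01T12:00:00Z</responseDate>"
    "</OAI-PMH>"
)
```

Such checks fit naturally into the unit tests mentioned above; full conformance still requires validating against the schema.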
Scheduling
We want to regularly fetch items (e.g. every night) from an external API (our use case here is ENA). To get this done we could use a Django job-scheduling package. Celery, on the other hand, seems more flexible overall; however, it brings additional dependencies, since Celery requires a message broker such as Redis.
https://bit.ly/3qUmFz8 (Django Jobs)
https://bit.ly/37Jky9y (Django Celery)
https://bit.ly/3qTvlpb (Django + Redis + Celery)
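If we go with Celery, the nightly harvest could be declared as a beat schedule in the Django settings. A sketch only: the task path, broker URL, and schedule time are placeholders, not decisions from this plan.

```python
# settings.py fragment (sketch): nightly harvest via Celery beat.
# "harvester.tasks.fetch_ena_records" and the Redis URL are assumed names.
from celery.schedules import crontab

CELERY_BROKER_URL = "redis://localhost:6379/0"

CELERY_BEAT_SCHEDULE = {
    "nightly-ena-harvest": {
        "task": "harvester.tasks.fetch_ena_records",
        "schedule": crontab(hour=2, minute=0),  # every night at 02:00
    },
}
```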
Storage
We handle XML-based metadata, so we could store XML directly in our database. PostgreSQL offers a dedicated xml column type for this. This functionality, however, is only available if PostgreSQL is built with libxml support.
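The caching idea itself — store harvested records locally so responses do not depend on the upstream (e.g. ENA) being available — can be illustrated with a small, portable sketch. It uses the standard library's sqlite3 purely for illustration; in production this would be a PostgreSQL table, and the table and column names are assumptions:

```python
import sqlite3

# Cache table: one row per harvested record, keyed by its identifier.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE records ("
    "  identifier TEXT PRIMARY KEY,"
    "  harvested_at TEXT NOT NULL,"
    "  metadata_xml TEXT NOT NULL)"
)


def store_record(identifier, harvested_at, xml_text):
    # Upsert so re-harvesting the same item refreshes the cached copy.
    conn.execute(
        "INSERT INTO records VALUES (?, ?, ?) "
        "ON CONFLICT(identifier) DO UPDATE SET "
        "harvested_at=excluded.harvested_at, "
        "metadata_xml=excluded.metadata_xml",
        (identifier, harvested_at, xml_text),
    )


store_record("oai:ena:ERP000001", "2021-02-01T02:00:00Z", "<record/>")
row = conn.execute(
    "SELECT metadata_xml FROM records WHERE identifier=?",
    ("oai:ena:ERP000001",),
).fetchone()
```

Serving OAI-PMH responses then reads only from this local table, which is what decouples metadata provision from the original repository and speeds up response times.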
API
For now we build a headless system that only provides an API for interaction.
Django REST Framework (implementing the REST API)
Handle users, authorization and authentication
https://bit.ly/37MjW39
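For the token-based authorization and authentication requirement, the Django REST Framework settings could look roughly like this. A sketch under assumptions: the choice of TokenAuthentication and the permission class are illustrative, not decisions from this plan.

```python
# settings.py fragment (sketch): API-token auth with Django REST Framework.
INSTALLED_APPS = [
    # ... our own apps ...
    "rest_framework",
    "rest_framework.authtoken",  # provides per-user API tokens
]

REST_FRAMEWORK = {
    "DEFAULT_AUTHENTICATION_CLASSES": [
        "rest_framework.authentication.TokenAuthentication",
    ],
    "DEFAULT_PERMISSION_CLASSES": [
        "rest_framework.permissions.IsAuthenticated",
    ],
}
```

Note that the OAI-PMH endpoint itself may need to stay publicly readable for harvesters, so authentication would apply to the management API rather than everywhere.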
OAI-PMH (XML over HTTP query standard)
First get a minimal implementation (https://bit.ly/2NChVja)
Then extend it if needed...
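To make the "minimal implementation first" concrete, a partial Identify response can be built with the standard library alone. This is a sketch: the field values are placeholders, and a compliant response needs further elements (protocolVersion, adminEmail, earliestDatestamp, ...).

```python
import xml.etree.ElementTree as ET
from datetime import datetime, timezone

OAI_NS = "http://www.openarchives.org/OAI/2.0/"


def identify_response(base_url, repository_name):
    """Build a partial OAI-PMH Identify response as an XML string."""
    ET.register_namespace("", OAI_NS)  # serialize with a default namespace
    root = ET.Element(f"{{{OAI_NS}}}OAI-PMH")
    date = ET.SubElement(root, f"{{{OAI_NS}}}responseDate")
    date.text = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    request = ET.SubElement(root, f"{{{OAI_NS}}}request", verb="Identify")
    request.text = base_url
    identify = ET.SubElement(root, f"{{{OAI_NS}}}Identify")
    name = ET.SubElement(identify, f"{{{OAI_NS}}}repositoryName")
    name.text = repository_name
    return ET.tostring(root, encoding="unicode")


xml_out = identify_response("https://example.org/oai", "ENA harvest cache")
```

In the Django setup, a view would return such a string with content type text/xml; the remaining verbs (ListRecords, GetRecord, ...) can be added one by one.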
Documentation
We can use Sphinx for the general documentation. Since we are using the Django REST Framework, we can also use its self-describing API documentation. If further requirements for API documentation come up, we could switch to Swagger/OpenAPI.
https://bit.ly/3sqbRcq (Sphinx)
https://bit.ly/2NAYpU4 (API documentation)
Templating
We might benefit from templating with cookiecutter: select an appropriate recipe, fork it, and adapt it to our needs. That way the whole application can be created and recreated in a transparent way. This might also be interesting for other projects that want to get involved, since they could simply set up a local test instance.
Capacity building
This section describes what I need to learn and dig through before I can build this up. I have to work through the resources listed below, focusing on quick-start guides first and then approaching the rest piecemeal in a practical context: learning by doing.
Understand Django
In progress
Getting started with the Django REST Framework
Pending
Check out testing in Django
Pending
Read the OAI-PMH documentation
In progress
Next Steps
Organize code in “our” GitLab instance
Project …
Issues/Tickets
Wiki (transfer this document)
Organize code reviews
Resources:
There are already generic implementations of an OAI-PMH client and server. We may be able to reuse them.
https://github.com/infrae/pyoai
Requirements
The endpoint accepts a "last_harvest_date" parameter to determine when the metadata was last harvested.
Basic metadata in the beginning (see code example of Ivo from 2018). Later, also fetch sample and experiment metadata and aggregate important parameters, e.g. number of samples, type of sequencing, MIxS compliance.
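The last_harvest_date requirement amounts to building an incremental query against the upstream API. A minimal sketch: the endpoint URL and parameter names below are placeholders, and the real ENA parameters must be taken from its API documentation.

```python
from urllib.parse import urlencode


def build_harvest_url(base_url, last_harvest_date):
    """Build the query URL for fetching records updated since the
    last harvest. Parameter names are assumed, not ENA's real ones."""
    params = {
        "result": "read_study",  # assumed result type
        "query": f"last_updated>={last_harvest_date}",  # assumed filter
        "format": "xml",
    }
    return f"{base_url}?{urlencode(params)}"


url = build_harvest_url("https://example.org/ena/search", "2021-01-31")
```

The harvester would store the timestamp of each successful run and pass it as last_harvest_date on the next scheduled fetch, so only new or changed records are transferred.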