Ingest data from CLARIN resource families
This issue is meant to track progress and record technical issues for the ingestion of the CLARIN Resource families.
The data can be taken from GitHub. Each CSV file has to be parsed separately, but all files share the same structure.
The mapping can be found here.
A new harvester should be created in DACE. It is still to be decided whether it should be general and support any CSV given as input, or be specific to the CLARIN resource families source, in which case it would first go over the file tree on GitHub to extract the file links and then start harvesting the records. (It would be good to at least design it so that plain CSV harvesting can be extracted at a later stage.)
For the CLARIN resource families source, processing has to consist of the following steps:
- go over the file tree
- for each leaf, read the file and extract records
- convert each record to JSON and save it in the DB
- send message to Kafka topic (sshoc-records-to-process)
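The steps above could be sketched roughly as follows. This is only an illustration in Python, not the DACE implementation: the function names (iter_csv_paths, parse_records, harvest), the nested-dict representation of the file tree, and the source_file field are all hypothetical. Reading a file, saving to the DB, and publishing to Kafka are passed in as callables, so the CSV parsing part stays independent of the CLARIN-specific tree walk and could be reused for plain CSV harvesting later. Only the topic name comes from this issue.

```python
import csv
import io
import json
from typing import Callable, Dict, Iterator, List

# Topic name taken from the issue; everything else here is an assumption.
TOPIC = "sshoc-records-to-process"


def iter_csv_paths(tree: dict, prefix: str = "") -> Iterator[str]:
    """Yield the path of every CSV leaf in a nested {name: subtree-or-None} dict."""
    for name, child in sorted(tree.items()):
        path = f"{prefix}/{name}" if prefix else name
        if isinstance(child, dict):
            # Directory: descend into it.
            yield from iter_csv_paths(child, path)
        elif name.lower().endswith(".csv"):
            # Leaf: only CSV files are harvested.
            yield path


def parse_records(csv_text: str) -> List[Dict[str, str]]:
    """Generic part: turn one CSV document into dicts keyed by the header row."""
    return list(csv.DictReader(io.StringIO(csv_text)))


def harvest(
    tree: dict,
    read_file: Callable[[str], str],
    save: Callable[[str], None],
    publish: Callable[[str, str], None],
) -> int:
    """Walk the tree, convert each record to JSON, save it, and notify the topic."""
    count = 0
    for path in iter_csv_paths(tree):
        for record in parse_records(read_file(path)):
            # source_file is a hypothetical provenance field, not part of the mapping.
            payload = json.dumps({"source_file": path, **record}, ensure_ascii=False)
            save(payload)
            publish(TOPIC, payload)
            count += 1
    return count
```

In a real harvester, read_file would fetch the raw file content from GitHub, save would write to the DACE database, and publish would wrap whichever Kafka producer the project uses. Whether the Kafka message carries the whole record or only a reference to it is not specified in this issue; the sketch sends the whole JSON payload for simplicity.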
Edited by Aleksandra Nowak