Rule: Candidate concepts in ingestion pipeline
Problem description
Sources often do not fully cover the vocabulary fields on the MP side. For example, we use the TaDiRAH2 vocabulary for activities, but sources use this vocabulary only sometimes. We therefore take the values from the source that fit our vocabulary. We currently have no general rule for the values that do not fit into the vocabulary.
Workflow
In the mapping process we try to determine how much coverage we get from a source field when mapping it to an MP vocabulary-driven field (type CONCEPT). We may introduce a value mapping table to support an ingestion pipeline (currently this is done in the mapping spreadsheet, but it needs to move to a machine-readable format that curators can also edit and evolve; there will be a dedicated issue on this).
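As a rough illustration of the coverage check described above, the following Python sketch resolves source values against a vocabulary and a value mapping table and reports what remains uncovered. Everything in it is an assumption made for illustration: the names `VALUE_MAPPING`, `MP_VOCABULARY`, `resolve` and `coverage` are hypothetical, the entries are placeholders rather than actual vocabulary codes, and the case-insensitive matching is a simplification that the real pipeline may not share.

```python
from collections import Counter

# Hypothetical value mapping table: source value -> MP concept code.
# Placeholder entries, not taken from the real mapping spreadsheet.
VALUE_MAPPING = {
    "text analysis": "analyzing",
    "ocr": "capturing",
}

# Hypothetical subset of concept codes of an MP vocabulary.
MP_VOCABULARY = {"analyzing", "capturing", "creating"}


def resolve(source_value, vocabulary, mapping):
    """Return the MP concept code for a source value, or None if uncovered."""
    key = source_value.strip().lower()
    if key in vocabulary:
        return key
    mapped = mapping.get(key)
    # Accept a mapping target only if it exists in the vocabulary.
    return mapped if mapped in vocabulary else None


def coverage(source_values, vocabulary, mapping):
    """Return (fraction of covered values, Counter of uncovered values)."""
    uncovered = Counter()
    covered = 0
    for value in source_values:
        if resolve(value, vocabulary, mapping) is None:
            uncovered[value.strip().lower()] += 1
        else:
            covered += 1
    total = covered + sum(uncovered.values())
    return (covered / total if total else 0.0), uncovered
```

Calling `coverage(["Analyzing", "OCR", "text analysis", "gamification"], MP_VOCABULARY, VALUE_MAPPING)` would report three of four values as covered and list `gamification` as uncovered, which is the kind of value the rule below is concerned with.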
Based on the source values and the mapping table, we hope to achieve high coverage when ingesting data from a source. But there will probably be values that are not covered, or that are so far from the MP vocabulary that they qualify as a new concept value (or might be better covered by a different vocabulary). To complicate things further: there are vocabularies to which we do not want to add new concepts (usually controlled vocabularies from an external authority, where new concepts should be introduced by that authority and not by us, e.g. TaDiRAH).
The backend has vocabulary management implemented (sshoc-marketplace-backend#86 (closed)) that can handle candidate concepts. It allows new concepts to be added to a vocabulary. This mechanism should be used for concepts newly discovered at the source, while taking the aforementioned closed vocabularies into account.
General rule to apply
If a concept is discovered at the source that cannot be found in the MP vocabulary of the MP field concerned (usually a dynamic property), and its value is also not defined in the mapping table, then this concept becomes a candidate concept. Such candidate concepts should be added to the MP sshoc-keyword vocabulary (API call api/vocabularies/sshoc-keyword) unless the mapping document says otherwise. For example: if a new concept that qualifies as a candidate concept is discovered at the source for the MP activity property, then this candidate concept should be added to the MP keyword property (and not to the activity property). Additionally, a logfile should be created that reports these newly found candidate concepts (for curation we want to know where candidate concepts come from, and we hope to solve this with such logfiles).
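The rule could be sketched in Python roughly as follows. This is only a sketch under stated assumptions, not the pipeline's actual implementation: `Decision`, `apply_rule` and the `overrides` parameter (standing in for whatever the mapping document specifies) are hypothetical names, the matching is case-insensitive for simplicity, and the call to the vocabulary API is deliberately left out because its request format is not described here.

```python
import logging
from dataclasses import dataclass

# Logger that records where each candidate concept came from.
logger = logging.getLogger("candidate_concepts")

# Default target for candidate concepts, as stated in the rule.
CANDIDATE_PROPERTY = "keyword"
CANDIDATE_VOCABULARY = "sshoc-keyword"


@dataclass(frozen=True)
class Decision:
    property_code: str    # MP property the value is attached to
    vocabulary_code: str  # vocabulary the concept belongs to
    concept: str          # concept code, or raw label for a candidate
    candidate: bool       # True if the concept is a new candidate


def apply_rule(source, property_code, vocabulary_code, value,
               vocabulary, mapping, overrides=None):
    """Decide where a source value goes (hypothetical helper).

    overrides stands in for the mapping document: it maps a property
    code to a (target property, target vocabulary) pair.
    """
    key = value.strip().lower()
    concept = key if key in vocabulary else mapping.get(key)
    if concept in vocabulary:
        # Covered by the vocabulary or by the mapping table.
        return Decision(property_code, vocabulary_code, concept, False)

    # Not covered: the value becomes a candidate concept.
    target_property, target_vocabulary = (overrides or {}).get(
        property_code, (CANDIDATE_PROPERTY, CANDIDATE_VOCABULARY))
    logger.info(
        "candidate concept %r from source %s, field %s -> %s (%s)",
        value.strip(), source, property_code,
        target_property, target_vocabulary)
    return Decision(target_property, target_vocabulary, value.strip(), True)
```

With a vocabulary `{"analyzing"}` and no override, a source value such as `"Gamification"` found for the activity property would end up as a candidate in sshoc-keyword on the keyword property, and one log record would be emitted. Attaching a `logging.FileHandler` to the `candidate_concepts` logger is one way to obtain the logfile mentioned above.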
Questions @sotiris.karampatakis @anowak ?
Informing @matej.durco @lbarbot @edward.gray about this rule