Rule: Disambiguation of actors in ingest pipelines
Problem description
We started by trying to disambiguate actors as early as the ingestion pipeline, using the API endpoint GET /api/actor-search?q={actor name}. A review of the results showed that this approach does not work as hoped. Because the endpoint returns fuzzy matches and the actor names in many sources are of poor quality, it generates many false positives. This leads to wrong data, and we have no way of detecting that a disambiguation was wrong. Some examples are discussed in DARIAH Campus (#19 (comment 384054)) and TaDiRAH (see #7 (comment 371233) and #7 (comment 371672)).
New rule
Based on this experience, it was decided in the Curation Task Force meeting on 10.09.2021 that the ingest pipeline should not disambiguate actors against already ingested data. Instead, this kind of disambiguation will be done by a dedicated curation workflow (for one part of this workflow, see https://gitlab.gwdg.de/sshoc/curation/-/issues/4). For the ingest pipelines this means:
- Still identify identical actors within one source by their identifiers, and create a single actor in the MP from them.
- If identical actors cannot be identified within the source because identifiers are missing, create one actor in the MP for every actor from the source.
- Do not split and compare strings (looking for identical person name strings) to identify identical actors, unless the source is guaranteed to follow a strict actor naming convention (which is usually not the case). The general rule: use identifiers; if there are none, do not try to identify actors (unless a rule mechanism advises otherwise).
- Do not look for the same actor in the list of actors already ingested into the MP (those coming from other sources). But always attach all available identifiers to a newly created actor in the MP. External identifiers such as ORCID will help the curation module find identical actors across sources.
- Cross-source actor disambiguation is handled by curation, not by ingestion.
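
The rules above can be illustrated with a minimal Python sketch. It is not taken from any existing ingest pipeline: the types ExternalId, SourceActor and MpActor and the function actors_for_source are hypothetical names introduced only for this example, and the ORCID value is a made-up placeholder. The sketch merges records of one source only when they share an identifier, never compares names, never queries actors already in the MP, and carries all identifiers over to the created actor.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass(frozen=True)
class ExternalId:
    """Hypothetical identifier type, e.g. scheme="ORCID"."""
    scheme: str
    value: str


@dataclass
class SourceActor:
    """One actor record as delivered by a single source (hypothetical shape)."""
    name: str
    external_ids: list[ExternalId] = field(default_factory=list)


@dataclass
class MpActor:
    """Actor to be created in the MP (hypothetical shape)."""
    name: str
    external_ids: list[ExternalId] = field(default_factory=list)


def actors_for_source(records: list[SourceActor]) -> list[MpActor]:
    """Build the MP actors for ONE source.

    Records are merged only if they share an identifier with an earlier
    record of the same source. Names are never compared, and actors that
    are already in the MP are never looked up.
    Simplification: two actors that were already created are not merged
    afterwards, even if a later record carries identifiers of both.
    """
    result: list[MpActor] = []
    by_id: dict[ExternalId, MpActor] = {}
    for rec in records:
        # Reuse an actor from this source only on an identifier match.
        target = next((by_id[i] for i in rec.external_ids if i in by_id), None)
        if target is None:
            # No identifier match (or no identifiers at all): new actor.
            target = MpActor(name=rec.name)
            result.append(target)
        # Always keep every available identifier on the actor.
        for i in rec.external_ids:
            if i not in target.external_ids:
                target.external_ids.append(i)
            by_id.setdefault(i, target)
    return result


# Placeholder identifier, not a real ORCID.
orcid = ExternalId("ORCID", "0000-0000-0000-0001")
records = [
    SourceActor("J. Doe", [orcid]),
    SourceActor("Jane Doe", [orcid]),  # same identifier: merged with the first
    SourceActor("Jane Doe"),           # no identifier: separate actor
    SourceActor("Jane Doe"),           # identical name, still a separate actor
]
actors = actors_for_source(records)
print(len(actors))
```

In this example the two identifier-less "Jane Doe" records stay separate actors even though their names are identical; deciding whether they are the same person is left to the curation workflow.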
Notify @sotiris.karampatakis and @anowak about this change for the ingestion pipelines.
Notify @lbarbot @matej.durco @edward.gray @cesare.concordia about this change for additional clarification and curation tasks.