# data-ingestion issues
https://gitlab.gwdg.de/sshoc/data-ingestion/-/issues

## item labels should trim leading whitespace (#100)
https://gitlab.gwdg.de/sshoc/data-ingestion/-/issues/100 · Stefan Probst · 2022-10-26T13:40:36Z · Assignee: Aleksandra Nowak

item `label`s should trim leading whitespace (so they are sorted correctly). Not sure if this should be handled by the backend or by ingestion.
example (note the leading space):
```
curl "https://marketplace-api.sshopencloud.eu/api/tools-services/qKSdjk" | jq '.label'
" Automatic Verification Tool (AVT)"
```
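A minimal sketch of the proposed fix in Python; the `clean_label` helper is hypothetical and could live in either the backend or the ingestion code:

```python
def clean_label(label: str) -> str:
    """Trim surrounding whitespace so labels sort correctly."""
    return label.strip() if label else ""

# The record from the curl example above would be normalized like this:
assert clean_label(" Automatic Verification Tool (AVT)") == "Automatic Verification Tool (AVT)"
```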

## Deploy DACE on ACDH-CH servers (#98)
https://gitlab.gwdg.de/sshoc/data-ingestion/-/issues/98 · Matej Durco · 2022-11-04T08:53:19Z · Assignees: Dalibor Pancic, Aleksandra Nowak

We want to be able to run DACE ingestion pipelines on the ACDH-CH servers (too).
@anowak and @p.dpancic, please coordinate regarding the deployment.
Basically it should work using docker-compose configurations.
(notify: @matej.durco, @lbarbot, @klaus.illmayer)

## Package tool extraction components as Helm chart (#97)
https://gitlab.gwdg.de/sshoc/data-ingestion/-/issues/97 · Seung-Bin Yim · 2022-03-29T21:27:31Z

## NER Training: A/B Testing for old/new models (#96)
https://gitlab.gwdg.de/sshoc/data-ingestion/-/issues/96 · Seung-Bin Yim · 2022-04-05T14:59:59Z

A/B Testing capability should be integrated into the NER model training pipeline.
The better version should be exported with a new version number.
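A minimal sketch of that selection step, with hypothetical names and versioning; the real pipeline would obtain the F1 scores from evaluation on the shared explicit test set:

```python
def select_model(old_f1: float, new_f1: float, old_version: int) -> tuple[str, int]:
    """Keep whichever model scores better; bump the version only on export."""
    if new_f1 > old_f1:
        return "new", old_version + 1  # export the new model under a new version
    return "old", old_version          # keep the currently deployed model

print(select_model(old_f1=0.88, new_f1=0.91, old_version=3))  # ('new', 4)
```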

## NER Training: Automate loading of new datasets (#95)
https://gitlab.gwdg.de/sshoc/data-ingestion/-/issues/95 · Seung-Bin Yim · 2022-04-05T14:59:52Z

Manually evaluated annotations will be stored in a PostgreSQL database and will be exported as a new version of the training set.
This should be integrated into the NER model training pipeline:
* The dataset should be split into a train set and a test set; an explicit test set is needed for A/B Testing (see the sketch below).
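A minimal sketch of such a split, assuming the annotations are exported as JSONL (file name hypothetical); a fixed seed keeps the split reproducible across trainset versions:

```python
import json
import random

def train_test_split(records: list[dict], test_ratio: float = 0.2, seed: int = 42):
    """Shuffle deterministically and cut off an explicit test set."""
    rng = random.Random(seed)
    shuffled = records[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * (1 - test_ratio))
    return shuffled[:cut], shuffled[cut:]

with open("annotations_export.jsonl") as fh:  # hypothetical export file
    records = [json.loads(line) for line in fh]
train_set, test_set = train_test_split(records)
```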

## Model Monitoring: Calculate precision based on human evaluation via prodigy (#94)
https://gitlab.gwdg.de/sshoc/data-ingestion/-/issues/94 · Seung-Bin Yim · 2022-07-21T14:37:43Z

* Annotations should be loaded from the database.
* Precision should be calculated (see the sketch below).
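A minimal sketch of the calculation, assuming each reviewed suggestion carries a Prodigy-style `answer` field (`accept`/`reject`/`ignore`); the loading step from the database is omitted:

```python
def precision(evaluations: list[dict]) -> float:
    """Accepted predictions over all human-judged predictions."""
    accepted = sum(1 for e in evaluations if e["answer"] == "accept")
    judged = sum(1 for e in evaluations if e["answer"] in ("accept", "reject"))
    return accepted / judged if judged else 0.0

print(precision([{"answer": "accept"}, {"answer": "reject"}, {"answer": "accept"}]))  # ~0.67
```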

## Deploy Extraction Pipeline on ACDH Infrastructure (#93)
https://gitlab.gwdg.de/sshoc/data-ingestion/-/issues/93 · Seung-Bin Yim · 2022-04-19T09:52:05Z

## Setup Prodigy with the sentences to be analysed (#92)
https://gitlab.gwdg.de/sshoc/data-ingestion/-/issues/92 · Seung-Bin Yim · 2022-03-29T21:18:31Z

## (extraction pipeline) Store all sentences with tool suggestions in jsonl format (#91)
https://gitlab.gwdg.de/sshoc/data-ingestion/-/issues/91 · Seung-Bin Yim · 2022-03-29T21:11:55Z

## Test run Tool Extraction docker (#90)
https://gitlab.gwdg.de/sshoc/data-ingestion/-/issues/90 · Seung-Bin Yim · 2022-03-18T10:39:16Z

## Handle records updates in DACE properly (#89)
https://gitlab.gwdg.de/sshoc/data-ingestion/-/issues/89 · Aleksandra Nowak · 2022-07-26T14:00:13Z · Assignee: Aleksandra Nowak

Continuous ingest should work fine in the Marketplace: even if the data is harvested by DACE twice, it should not create duplicates in the MP (thanks to the MP's internal mechanisms).
However, it does create duplicate records in DACE's database.
In this issue, the SSHOC-related record harvesters should be changed so that they do not create duplicate records (if possible).
Duplicates are not created when DACE updates records, which happens when we upload a record whose `internalId` already exists for this source in DACE.
If possible, harvesters should know the `internalId`s of records, so that when a record is harvested a second time it gets the same `internalId` as in the previous harvest. So the `internalId` should probably be based on an identifier exposed by the data source.
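A minimal sketch of such a stable `internalId`, assuming the harvester can see a persistent record identifier at the source (helper name and example values hypothetical):

```python
import hashlib

def internal_id(source_code: str, source_record_id: str) -> str:
    """Derive the same internalId on every harvest of the same source record."""
    raw = f"{source_code}:{source_record_id}"
    return hashlib.sha1(raw.encode("utf-8")).hexdigest()

# A second harvest of the same record maps to the same internalId,
# so DACE updates instead of duplicating:
assert internal_id("tapor", "9463") == internal_id("tapor", "9463")
```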

## Move training/testset data to data registries (#88)
https://gitlab.gwdg.de/sshoc/data-ingestion/-/issues/88 · Seung-Bin Yim · 2022-03-09T09:00:56Z

Training/test set data should be zipped and stored in a separate repository, using the data registries pattern.

## Convert Tool List Google Spreadsheet to Prodigy compatible Jsonl file (#87)
https://gitlab.gwdg.de/sshoc/data-ingestion/-/issues/87 · Seung-Bin Yim · 2022-03-02T15:09:11Z

## Dockerize Tool Extraction Pipeline (#86)
https://gitlab.gwdg.de/sshoc/data-ingestion/-/issues/86 · Seung-Bin Yim · 2022-03-18T10:06:02Z

## Ingest TAPOR using DACE (#84)
https://gitlab.gwdg.de/sshoc/data-ingestion/-/issues/84 · Aleksandra Nowak · 2022-06-13T08:36:03Z · Assignee: Eliza Kalata

Details in the mapping table: https://docs.google.com/spreadsheets/d/17E6oJ_mXZFXNhQWhvOMmPLu5reJKImwI5OP44yudfe8/edit#gid=1327093263 and in the original task: #7

## Test continuous ingest (#101)
https://gitlab.gwdg.de/sshoc/data-ingestion/-/issues/101 · Matej Durco · 2022-10-26T13:41:49Z · Milestone: Now or never · Assignee: Aleksandra Nowak

The problem of resolving differences between versions should be solved server-side with the update conflict mechanism: https://gitlab.gwdg.de/sshoc/sshoc-marketplace-backend/-/issues/85
Now we need to test whether it does what we want it to.
For this we decided on the following procedure:
1. make a dummy source `S` for one of our ingest pipelines
2. run first ingest of `S` into dev
3. do manual changes `x` on some items `I` from `S` in the marketplace
4. do different changes `y` to the same items `I` directly in `S`.
5. do a reingest of source `S` using methods described in https://gitlab.gwdg.de/sshoc/sshoc-marketplace-backend/-/issues/85
6. evaluate the responses of the methods
7. evaluate what happens to items `I`
(join the evaluation party: @anowak, @klaus.illmayer, @lbarbot, @matej.durco)

## Inspect model & pipeline (#82)
https://gitlab.gwdg.de/sshoc/data-ingestion/-/issues/82 · Seung-Bin Yim · 2022-07-21T14:54:22Z · Milestone: Extraction of MP data from publications · Assignee: Seung-Bin Yim

The trained NER model doesn't perform well on unseen 'real world' data.
While the model performance on evaluation data is > 0.90 (F1 score), the performance on real-world data is significantly lower.
This might indicate some issues in the dataset and should be analyzed.
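A minimal sketch of how the gap could be measured with spaCy 3, assuming the trained model and gold-annotated `.spacy` files exist at the given (hypothetical) paths:

```python
import spacy
from spacy.tokens import DocBin
from spacy.training import Example

nlp = spacy.load("training/model-best")  # hypothetical path to the trained NER model

def ner_f1(path: str) -> float:
    """Entity-level F1 of the model on a DocBin of gold-annotated docs."""
    docs = DocBin().from_disk(path).get_docs(nlp.vocab)
    examples = [Example(nlp(doc.text), doc) for doc in docs]
    return nlp.evaluate(examples)["ents_f"]

print("held-out test set:", ner_f1("corpus/test.spacy"))
print("real-world sample:", ner_f1("corpus/real_world_sample.spacy"))
```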

## Implement custom value mapping (#80)
https://gitlab.gwdg.de/sshoc/data-ingestion/-/issues/80 · Matej Durco · 2022-07-22T13:52:32Z · Milestone: Now or never · Assignee: Aleksandra Nowak

Sometimes we would like to change a value encountered at the source to a different value to be stored in the MP.
Ideally, the mapping from source value to target value should be easily injectable into the ingestion pipeline.
If this concerns only a small number of values, it may still be expressed in the JOLT configuration file; however, that mechanism is quite cumbersome and would make it difficult to adjust mappings dynamically.
Thus, there is a suggestion to implement a `custom value mapping` mechanism, which would take an external CSV file (either next to the JOLT configuration or in a different git repository) in the form:
| `source_value` | `target_value` |
| ------ | ------ |
| DH | Digital Humanities |
| Digit. Scholarsh. Humanit. | Digital Scholarship Humanities |
Ideally, this could be implemented as a custom API method that can be called from the JOLT configuration.
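A minimal sketch of the mapping step in Python (rather than as the suggested JOLT-callable API method), assuming a CSV file with the two columns from the table above (file name hypothetical):

```python
import csv

def load_mapping(path: str) -> dict[str, str]:
    """Read source_value -> target_value pairs from the external CSV file."""
    with open(path, newline="", encoding="utf-8") as fh:
        return {row["source_value"]: row["target_value"] for row in csv.DictReader(fh)}

mapping = load_mapping("value_mapping.csv")  # hypothetical file next to the JOLT config

def map_value(value: str) -> str:
    """Unmapped values pass through unchanged."""
    return mapping.get(value, value)

print(map_value("DH"))               # -> "Digital Humanities"
print(map_value("unknown keyword"))  # -> "unknown keyword"
```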
(notify: @lbarbot, @klaus.illmayer, @anowak)

## Rule: Candidate concepts in ingestion pipeline (#78)
https://gitlab.gwdg.de/sshoc/data-ingestion/-/issues/78 · Klaus Illmayer · 2022-10-17T13:13:23Z · Milestone: Now or never

### Problem description
Sources often do not fully cover the vocabulary fields on the MP side. To give an example: we use the TaDiRAH2 vocabulary for activities, but this vocabulary is only sometimes used by sources. Therefore we take the values from the source that fit our vocabulary. We currently don't have a general rule for the values that do not fit into the vocabulary.
### Workflow
In the mapping process we try to identify how much coverage we get from a source field when mapping it to an MP vocabulary-driven field (type CONCEPT). We may introduce a value mapping table to support an ingestion pipeline (currently this is done in the mapping spreadsheet, but it needs to be moved to a machine-readable format which can also be edited and evolved by curators; there will be a dedicated issue on this).
Based on the source values and on the mapping table, we hopefully get high coverage when ingesting data from a source. But there will probably be values that are not covered, or that are so far from the MP vocabulary that they qualify as a new concept value (or may be covered better by a different vocabulary). To make the whole thing even more complicated: we have vocabularies where we don't want to add new concepts (usually controlled vocabularies coming from an external authority, where new concepts should be introduced by this authority and not by us, e.g. TaDiRAH).
In the backend we have vocabulary management implemented (https://gitlab.gwdg.de/sshoc/sshoc-marketplace-backend/-/issues/86) that is able to handle candidate concepts. It allows adding new concepts to a vocabulary. This mechanism should be used for newly discovered concepts at the source, but with consideration of the aforementioned closed vocabularies.
### General rule to apply
If a concept is discovered at the source that can't be found in the MP vocabulary of the concerned MP field (usually a dynamic property), and the value of the concept is also not defined in the mapping table, then this concept becomes a candidate concept. Such candidate concepts should be added to the MP `sshoc-keyword` vocabulary (API call `api/vocabularies/sshoc-keyword`), as long as the mapping document does not say something different. To give an example: if for the MP `activity` property a new concept is discovered at the source which qualifies as a candidate concept, then this candidate concept should be added to the MP `keyword` property (and not to the `activity` property). Additionally, a logfile should be created that reports these newly found candidate concepts (for curation we want to know where candidate concepts come from, and we hope to solve this with such logfiles).
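A minimal sketch of that rule; the helper and its inputs are hypothetical, and the actual POST to `api/vocabularies/sshoc-keyword` is left out:

```python
import logging

log = logging.getLogger("candidate-concepts")

def resolve_concept(value: str, vocabulary: set[str], mapping: dict[str, str],
                    source: str) -> tuple[str, str]:
    """Return (vocabulary, concept value) for a value found at the source."""
    if value in vocabulary:
        return "target-vocabulary", value           # direct hit in the MP vocabulary
    if value in mapping:
        return "target-vocabulary", mapping[value]  # covered by the mapping table
    # Neither covered nor mapped: candidate concept for sshoc-keyword,
    # never for a closed vocabulary such as TaDiRAH.
    log.info("candidate concept %r discovered at source %s", value, source)
    return "sshoc-keyword", value
```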
Questions? @sotiris.karampatakis @anowak
Informing @matej.durco @lbarbot @edward.gray about this rule.

## Rule: Disambiguation of actors in ingest pipelines (#77)
https://gitlab.gwdg.de/sshoc/data-ingestion/-/issues/77 · Klaus Illmayer · 2022-10-17T13:13:33Z · Milestone: Ongoing curation · Assignee: Aleksandra Nowak

### Problem description
We started by trying to disambiguate actors already in the ingestion pipeline, using the API endpoint `GET /api/actor-search?q={actor name}`. After reviewing the results, this approach does not work as hoped. Because the endpoint returns fuzzy results, and because of bad data quality regarding actor names in many of the sources, a lot of false positives are generated. This leads to wrong data, with no chance for us to find out that the disambiguation was wrong. Some examples are discussed in DARIAH Campus (https://gitlab.gwdg.de/sshoc/data-ingestion/-/issues/19#note_384054) and TaDiRAH (see https://gitlab.gwdg.de/sshoc/data-ingestion/-/issues/7#note_371233 and https://gitlab.gwdg.de/sshoc/data-ingestion/-/issues/7#note_371672).
### New rule
Based on this experience, it was decided in the Curation Task Force meeting on 10.09.2021 that disambiguation of actors in the ingest pipeline across the already ingested data should not be done. Instead, this kind of disambiguation will be done by a dedicated curation workflow (see one part of this workflow: https://gitlab.gwdg.de/sshoc/curation/-/issues/4). For the ingest pipelines this means (see the sketch after the list):
* Still identify the same actors within one source by identifiers, and create a single actor in the MP out of them.
* If there is no way to identify the same actors in the source due to a lack of identifiers, create one actor in the MP for every actor from the source.
* Don't do string separation and string comparison (looking for identical person-name strings) to identify the same actors, except when it is guaranteed that the source follows a strict actor naming convention (which is usually not the case). The general rule: use identifiers; if there are none, don't try to identify actors (unless advised differently by a rule mechanism).
* Do not look for the same actor in the list of already ingested actors in the MP (coming from other sources). But always attach all available identifiers to a newly created actor in the MP. External identifiers like ORCID will help the curation module find the same actors across sources.
* Cross-source actor disambiguation is handled by curation, not by ingestion.
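A minimal sketch of these rules; the data structures and the `create_actor` callback are hypothetical:

```python
def ingest_actors(source_actors: list[dict], create_actor) -> dict[str, dict]:
    """Create MP actors, deduplicating only within this source and only by identifier."""
    seen: dict[str, dict] = {}  # source identifier -> already created MP actor
    for actor in source_actors:
        identifiers = actor.get("identifiers", [])  # e.g. ORCID or source-local ids
        key = identifiers[0] if identifiers else None
        if key is not None and key in seen:
            continue  # same actor within this source: do not create a duplicate
        # No string comparison against names, no lookup across other sources;
        # all available identifiers are attached for later curation.
        created = create_actor(name=actor["name"], external_ids=identifiers)
        if key is not None:
            seen[key] = created
    return seen
```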
Notify @sotiris.karampatakis and @anowak about this change for the ingestion pipelines.
Notify @lbarbot @matej.durco @edward.gray @cesare.concordia about this change for additional clarification and curation tasks.