Run a re-ingest on development
@sotiris.karampatakis We would like to run a complete re-ingest on development because of adaptations to the data model regarding dynamic properties and some other issues.
I will try to summarize the changes to be aware of when doing the re-ingest:
- There are some naming changes in the API, see sshoc-marketplace-backend#65 (comment 297037) (items POST: relatedItems now uses persistentId instead of objectId; not sure if this is relevant for you) and sshoc-marketplace-backend#65 (comment 297423) (also not sure if this is relevant for you). You may also spot some other changes, as there has been a lot of activity in recent weeks. If you can't find a relevant issue, please use this issue for inquiries.
- We would like the label field of items to be sanitized: there should be no linebreaks in the label, see #57 (closed). Can you adapt the ingest script so that linebreaks are removed?
- Then we have a long list of changes to the properties, which I will try to sum up:
- I already deleted some properties without values in the database. Could you check whether you have already stopped using them: sshoc-marketplace#66 (see my last two comments)
- We would like to change all of the url fields to type URL instead of STRING (this includes: accesspolicy-url, helpdesk-url, privacypolicy-url, service-level-url, termsofuse-url, usermanual-url). I propose to delete the current properties with the STRING type and then re-create them with the URL type. This means we lose the current values and get them back after the re-ingest. I hope that there are no values with invalid URLs. Do you agree with this proposal? If yes, I need to check whether any of the manually created items use one of the url properties, so that I can move their values manually.
- termsofuse-url needs special treatment, as I would like to rename it to terms-of-use-url to be in line with the naming of terms-of-use (as they rely on each other). I would create the new property and delete the old one.
- We would like to move the property "doi" to external-id (having "doi" as identifierService code and "DOI" as identifierService label). It is currently used in DBLP and Zotero-427927.
- The property "methodica-link" should be moved to the property "see-also" (used in TaPoR). Unfortunately I have at least one example where this has an invalid URL: https://sshoc-marketplace.acdh-dev.oeaw.ac.at/tool-or-service/HB2RAR. Do you see a chance to extract only the URL from there? There are only 4 items that use this property, so we could also leave the handling of invalid values to curation.
- The property "repository-url" should also be moved to external-id, where depending on the repository the identifierService is either GitLab (code: gitlab) or GitHub (code: github). I only found GitHub values in the database (sources: Programming Historian, SSK, TAPoR).
- For "language" we only want to use the vocabulary "iso-639-3-v2". Currently 702 items are using "iso-639-3". Can you change this to the v2 vocabulary? In the long term we would like to get rid of "iso-639-3" and only use the comprehensive "iso-639-3-v2", which we will then rename to "iso-639-3" (as the v2 suffix makes no sense anymore at that point). Maybe we first make sure that only iso-639-3-v2 is in use, and in a follow-up step delete iso-639-3 and rename iso-639-3-v2. What do you think?
- Quite similar is "activity", where we would only like to use "tadirah2". But as I see it, no action is necessary from your side, as you already mapped everything to tadirah2 (we do have some manually created items that use tadirah_activity; I will change these).
- Then there is the issue with the already curated items, some of which come from ingestion sources: sshoc-marketplace#77 (closed), e.g. https://sshoc-marketplace.acdh-dev.oeaw.ac.at/tool-or-service/iQdKk2 comes from TaPoR and we manually enriched it. Is there a way to handle this in the ingestion so that the curated changes are left untouched?
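To make the label, URL and repository-url points in the list above more concrete, here is a minimal Python sketch of how the ingest side might handle them. All function names and the shape of the returned external-id dict are assumptions made for illustration, not the actual ingest script or API payload; only the identifierService codes "github" and "gitlab" are taken from this issue.

```python
import re
from urllib.parse import urlparse

# Loose pattern for the first http(s) URL inside a free-text value.
_URL_RE = re.compile(r"https?://[^\s<>\"']+")


def sanitize_label(label: str) -> str:
    """Collapse linebreaks and runs of whitespace into single spaces."""
    return " ".join(label.split())


def is_valid_url(value: str) -> bool:
    """Accept only absolute http(s) URLs that have a host part."""
    try:
        parsed = urlparse(value.strip())
    except ValueError:  # e.g. malformed IPv6 host
        return False
    return parsed.scheme in ("http", "https") and bool(parsed.netloc)


def extract_url(value: str):
    """Return the first URL found in a text value, or None if there is none."""
    match = _URL_RE.search(value)
    if match is None:
        return None
    candidate = match.group(0).rstrip(".,;)")  # drop trailing punctuation
    return candidate if is_valid_url(candidate) else None


def repository_external_id(url: str):
    """Map a repository URL to a hypothetical external-id dict.

    Only the public github.com and gitlab.com hosts are recognised here;
    self-hosted GitLab instances would need an explicit host list.
    """
    try:
        host = (urlparse(url.strip()).hostname or "").lower()
    except ValueError:
        return None
    if host in ("github.com", "www.github.com"):
        code = "github"
    elif host in ("gitlab.com", "www.gitlab.com"):
        code = "gitlab"
    else:
        return None  # leave unknown hosts to curation
    return {"identifierService": {"code": code}, "identifier": url.strip()}
```

Values for which `extract_url` or `repository_external_id` return `None` could be logged and left to curation instead of being written to the new properties.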
@sotiris.karampatakis could you give short feedback on whether all these changes can be done from your side? I guess that, apart from the last point, it should hopefully be no problem. Regarding the last point: what do you think about it? Is it feasible to have a mechanism in your ingest pipeline that identifies curation changes and leaves them as they are? Otherwise, I guess we can manually enrich these items once again, as there are currently not that many. We just need to agree and then prepare this manual work. For some of the points on the dynamic properties I still need to look into the manually created items to see whether values have to be moved manually. Therefore, I would appreciate it if you could first give short feedback on when you are able to make the changes to the ingest. Before starting the re-ingest, please contact me to check whether I have finished my manual homework.
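One possible mechanism for the curation question, as a rough Python sketch only: it assumes the ingest pipeline keeps a snapshot of what it last ingested for each item, so that any field where the marketplace differs from that snapshot can be treated as curated. The function names and the flat dict representation of an item are hypothetical.

```python
import hashlib
import json


def fingerprint(item: dict) -> str:
    """Stable hash of an item, independent of key order."""
    canonical = json.dumps(item, sort_keys=True, ensure_ascii=False)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def curated_fields(last_ingested: dict, current: dict) -> set:
    """Fields whose current value differs from what was last ingested."""
    keys = set(last_ingested) | set(current)
    return {k for k in keys if last_ingested.get(k) != current.get(k)}


def merge_preserving_curation(last_ingested: dict, current: dict, fresh: dict) -> dict:
    """Take the fresh source data, but keep every curated field as it is now."""
    merged = dict(fresh)
    for key in curated_fields(last_ingested, current):
        if key in current:
            merged[key] = current[key]  # curated value wins
        else:
            merged.pop(key, None)       # curators removed this field
    return merged


# Example: the description was curated, the label was not.
last = {"label": "Tool", "description": "old"}
current = {"label": "Tool", "description": "curated text"}
fresh = {"label": "Tool v2", "description": "source text"}
merged = merge_preserving_curation(last, current, fresh)
```

Comparing `fingerprint(last)` with `fingerprint(current)` is a cheap way to skip items that were never curated. Whether this is feasible depends on whether such snapshots exist or can be reconstructed.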
Notifying @matej.durco and @lbarbot about this issue.