Commit fdc72e13 authored by Christian Boulanger

Updates to the way references are modelled

parent 81d6220d
%% Cell type:markdown id:6f9eb711429fb6cd tags:
# Extract information from a Wikipedia page and upload to Wikidata
This notebook takes an excerpt from a Wikipedia page about a scholar, extracts biographical information from it, and uploads that information to the Wikidata entry on that person. The steps are as follows:
1. Send the excerpt to the OpenAI API (GPT-4), using a custom prompt that instructs the model to extract CSV data that can easily be arranged into statements and qualifiers.
2. Manually edit the data by correcting wrongly inferred information and adding missing triple data.
3. Upload the data using pywikibot.
%% Cell type:markdown id:f36d8c1c925d1e0e tags:
## Defining the prompt
%% Cell type:code id:27d869b6191fa004 tags:
``` python
prompt = '''
Your task is to extract data from the text and to output it in a format that is suitable as a data source for adding triples to Wikidata.
The text is about "{fullName}" with the QID {qid}. It consists of one or more sections separated by "-----". The sections begin with a standalone URL followed by an excerpt of the content that can be found at this URL.
Arrange the extracted information into a table with the following columns: subject-label, subject-qid, predicate, pid, object, object-qid, start_time, end_time, reference_url.
Insert data into the columns as per the following rules:
- subject-label/subject-qid: In general, the subject is "{fullName}" with the QID {qid}. However, refining/qualifying statements can also be made about other entities, as with the academic degree (P512) item below. Also, in the case of P112, subject and object must be reversed.
- predicate/pid:
    - educated at (P69): Institutions at which the person studied
    - student of (P1066): If supervisors of doctoral theses and habilitations are specified
    - employer (P108): the organization that pays the salary of a person (this can be a company, an institution or the university)
    - academic appointment (P8413): usually the department of a university; if the department or its QID is not known, use the same value as for P108
    - student (P802): persons contained in Wikidata who were educated by the subject
    - member of (P463): Organizations and associations to which the person belongs (excluding P108)
    - affiliation (P1416): Organization that the subject is affiliated with (not member of or employed by)
    - academic degree (P512): some instance of academic degree (Q189533). After making this claim, add further triples to refine the P512 statement with triples on "conferred by" (P1027) and on "point in time" (P585).
    - editor (P98): add information on memberships in editorial boards of academic journals
    - founded by (P112): add information on journals, associations or other organizations that the subject helped to establish. When adding this claim, YOU MUST switch subject and object to express the reverse relationship
- object/object-qid: here the English labels and, if known, the QIDs for the institutions and persons who are the objects of the triple. If you are not absolutely sure, leave blank
- start_time: the date/year from which the triple statement is true. Leave blank if the date is not specified or cannot be inferred, or the triple involves P585
- end_time: the date/year up to which the triple statement is true. If it is an event, identical to start_time
- reference_url: this is the source URL of the text from which the information was extracted.
Return the information as comma-separated values (CSV). Include the column headers. Surround the values with quotes. If values contain quotes, properly escape them.
DO NOT, UNDER ANY CIRCUMSTANCES, provide any commentary or explanations, just return the raw data. Do not make anything up that is not in the source material.
-----
{website_text}
'''
```
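To make the requested layout concrete, the following is a minimal sketch of what a well-formed response could look like. The sample is hand-written from values that appear elsewhere in this notebook, not actual model output; it shows how a P512 claim is followed by a refining row and how P112 reverses subject and object.

```python
import csv
import io

# Hand-written sample in the requested layout (an illustration, not model output).
sample_csv = '''"subject-label","subject-qid","predicate","pid","object","object-qid","start_time","end_time","reference_url"
"Erhard Blankenburg","Q51595283","academic degree","P512","Master of Arts","","1965","1965","https://de.wikipedia.org/wiki/Erhard_Blankenburg"
"Master of Arts","","conferred by","P1027","University of Basel","","1965","1965","https://de.wikipedia.org/wiki/Erhard_Blankenburg"
"International Institute for the Sociology of Law","Q1570309","founded by","P112","Erhard Blankenburg","Q51595283","","","https://de.wikipedia.org/wiki/Erhard_Blankenburg"
'''

# Parse the sample into one dict per row, keyed by column header.
rows = list(csv.DictReader(io.StringIO(sample_csv)))
for row in rows:
    print(f"{row['subject-label']} -[{row['pid']}]-> {row['object']}")
```

In the second row the subject repeats the object of the first row, which is how a refining statement is tied to the claim it qualifies.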
%% Cell type:markdown id:19d4b0c25a7a8a89 tags:
## Data from Wikipedia (or any other website)
%% Cell type:code id:37687f2fd256a439 tags:
``` python
website_text = '''
https://de.wikipedia.org/wiki/Erhard_Blankenburg
Blankenburg belegte ein Studium der Philosophie, Soziologie und Germanistik an der Universität Freiburg und FU Berlin. Es folgten Graduate Studies und eine Tätigkeit als Forschungsassistent am Department of Sociology der University of Oregon. Ein Studium der Soziologie und Wirtschaftswissenschaft an der Universität Basel beendete er mit dem Abschluss Master of Arts 1965.
Seine Promotion zum Dr. phil. erfolgte an der Universität Basel 1966. Als Assistent am Institut für Soziologie der Universität Freiburg arbeitete er von 1966 bis 1968. Von 1969 bis 1971 war er Organisationsberater beim Quickborner Team, Hamburg. Danach arbeitete Blankenburg in Basel als Senior Projektleiter bei der Prognos in Basel. 1973/1974 war er wissenschaftlicher Mitarbeiter am Max-Planck-Institut für ausländisches und internationales Strafrecht in Freiburg. Die Habilitation für das Fach Soziologie erwarb er 1974 an der Universität Freiburg. Blankenburg war von 1975 bis 1980 Mitglied des Wissenschaftszentrums Berlin, Internationales Institut für Management und Verwaltung.
1980 bekam er einen Ruf auf den Lehrstuhl für Rechtssoziologie der Vrije Universiteit Amsterdam. Gemeinsam mit Wolfgang Kaupen spielte er eine wichtige Rolle bei der Neubegründung der Deutschen Rechtssoziologie in den 70er-Jahren (Raiser 1998), ebenso, mit Volkmar Gessner, bei der Gründung des International Institute for the Sociology of Law. Er gehörte auch zu den Initiatoren und zu den Gründungsherausgebern der Zeitschrift für Rechtssoziologie.
Gemeinsam mit Bill Felstiner organisierte er 1991 in Amsterdam das erste gemeinsame Treffen der beiden bedeutenden Vereinigungen der Rechtssoziologie (LSA und RCSL). Seine Beschäftigung mit rechtssoziologischen Themen war ungewöhnlich breit, reichte von der Soziologie der Kriminalität über die des Staatsapparates bis zu der des Zivilrechts. Blankenburg war primär Empiriker und Methodiker (vgl. seine Empirische Rechtssoziologie). Seine wichtigsten Beiträge zur rechtssoziologischen Theorie betreffen die Begriffe der "Mobilisierung des Rechts" und der "Rechtskultur(en)". Vor allem aber wirkte er als Koordinator, Organisator und als Vermittler zwischen Wissenschaft und Praxis: "Er bemühte sich nicht, eine 'Schule' zu gründen, ihm fiel es leicht, in stets wechselnden Teams mit wechselnden Wissenschaftlern zusammenzuarbeiten. Wie kein anderer Rechtssoziologe vermochte er, erfolgreich Tagungen zu organisieren, kompetente Referenten zu gewinnen und die Veranstaltungen mit Autorität und zugleich locker zu leiten" (Theo Rasehorn 1998, 23).
'''
```
%% Cell type:markdown id:2362d9d97adbcbbf tags:
%% Cell type:markdown id:a5800fe8919e19c4 tags:
## Query the OpenAI API (GPT-4)
%% Cell type:code id:b276d407b1a723fb tags:
``` python
import io
from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI
from langchain_core.output_parsers import StrOutputParser
import pandas as pd
from dotenv import load_dotenv

# Load the OpenAI API key from a local .env file
load_dotenv()

def use_model(model, template, debug=False, **params):
    # Fill the template, send it to the model, and return the reply as plain text
    prompt = ChatPromptTemplate.from_template(template)
    parser = StrOutputParser()
    chain = ( prompt | model | parser )
    response = chain.invoke(params)
    if debug:
        print(response)
    # Parse the CSV reply; keep the time columns as strings
    data = io.StringIO(response)
    return pd.read_csv(data, dtype={'start_time': str, 'end_time': str})
```
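Because the model's reply is parsed as CSV without further checks, a row with a missing field silently shifts its values into the wrong columns. A minimal sketch of a pre-parse check follows, using only the standard library; `find_malformed_rows` and `EXPECTED_COLUMNS` are hypothetical names introduced for illustration and are not part of the notebook.

```python
import csv
import io

# Column names requested in the prompt.
EXPECTED_COLUMNS = [
    "subject-label", "subject-qid", "predicate", "pid", "object",
    "object-qid", "start_time", "end_time", "reference_url",
]

def find_malformed_rows(response_text):
    """Return (line_number, fields) for each data row whose field count differs from the header."""
    reader = csv.reader(io.StringIO(response_text.strip()))
    header = next(reader)
    if header != EXPECTED_COLUMNS:
        raise ValueError(f"Unexpected header: {header}")
    return [
        (reader.line_num, fields)
        for fields in reader
        if len(fields) != len(header)
    ]

# Hand-written example: the first data row lacks one empty time field.
example = (
    '"subject-label","subject-qid","predicate","pid","object","object-qid","start_time","end_time","reference_url"\n'
    '"Erhard Blankenburg","Q51595283","educated at","P69","University of Basel","","","https://de.wikipedia.org/wiki/Erhard_Blankenburg"\n'
    '"Erhard Blankenburg","Q51595283","employer","P108","Quickborner Team","","1969","1971","https://de.wikipedia.org/wiki/Erhard_Blankenburg"\n'
)
bad = find_malformed_rows(example)
print(bad)
```

Calling such a check on `response` before `pd.read_csv` would flag rows to fix by hand instead of letting them load with shifted columns.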
%% Cell type:markdown id:9442427185ae2a72 tags:
## Run example
%% Cell type:code id:717d713e38598c57 tags:
``` python
fullName = "Erhard Blankenburg"
qid = "Q51595283"
model = ChatOpenAI(model_name="gpt-4")
df = use_model(model, prompt, fullName=fullName, qid=qid, website_text=website_text)
df.to_csv(f'data/{fullName}-chatgpt.csv', index=False)
df
```
%% Output %% Output
                                       subject-label  subject-qid  \
0                                 Erhard Blankenburg    Q51595283
1                                 Erhard Blankenburg    Q51595283
2                                 Erhard Blankenburg    Q51595283
3                                 Erhard Blankenburg    Q51595283
4                                 Erhard Blankenburg    Q51595283
5                                     Master of Arts          NaN
6                                 Erhard Blankenburg    Q51595283
7                                 Erhard Blankenburg    Q51595283
8                                 Erhard Blankenburg    Q51595283
9                                 Erhard Blankenburg    Q51595283
10                                Erhard Blankenburg    Q51595283
11                                      Habilitation          NaN
12                                Erhard Blankenburg    Q51595283
13                                Erhard Blankenburg    Q51595283
14  International Institute for the Sociology of Law     Q1570309
15                  Zeitschrift für Rechtssoziologie          NaN

               predicate    pid  \
0            educated at    P69
1            educated at    P69
2            educated at    P69
3            educated at    P69
4        academic degree   P512
5           conferred by  P1027
6               employer   P108
7               employer   P108
8               employer   P108
9               employer   P108
10       academic degree   P512
11          conferred by  P1027
12             member of   P463
13  academic appointment  P8413
14            founded by   P112
15            founded by   P112

                                               object  object-qid  start_time  \
0                              University of Freiburg         NaN         NaN
1                           Free University of Berlin         NaN         NaN
2                                University of Oregon         NaN         NaN
3                                 University of Basel         NaN         NaN
4                                      Master of Arts         NaN        1965
5                                 University of Basel         NaN        1965
6                              University of Freiburg         NaN        1966
7                                    Quickborner Team         NaN        1969
8                                             Prognos         NaN         NaN
9   Max-Planck-Institut für ausländisches und inte...         NaN        1973
10                                       Habilitation         NaN        1974
11                             University of Freiburg         NaN        1974
12  Wissenschaftszentrums Berlin, Internationales ...         NaN        1975
13                       Vrije Universiteit Amsterdam         NaN        1980
14                                 Erhard Blankenburg   Q51595283         NaN
15                                 Erhard Blankenburg   Q51595283         NaN

                                            end_time  \
0   https://de.wikipedia.org/wiki/Erhard_Blankenburg
1   https://de.wikipedia.org/wiki/Erhard_Blankenburg
2   https://de.wikipedia.org/wiki/Erhard_Blankenburg
3   https://de.wikipedia.org/wiki/Erhard_Blankenburg
4                                               1965
5                                               1965
6                                               1968
7                                               1971
8   https://de.wikipedia.org/wiki/Erhard_Blankenburg
9                                               1974
10                                              1974
11                                              1974
12                                              1980
13                                               NaN
14                                               NaN
15                                               NaN

                                       reference_url
0                                                NaN
1                                                NaN
2                                                NaN
3                                                NaN
4   https://de.wikipedia.org/wiki/Erhard_Blankenburg
5   https://de.wikipedia.org/wiki/Erhard_Blankenburg
6   https://de.wikipedia.org/wiki/Erhard_Blankenburg
7   https://de.wikipedia.org/wiki/Erhard_Blankenburg
8                                                NaN
9   https://de.wikipedia.org/wiki/Erhard_Blankenburg
10  https://de.wikipedia.org/wiki/Erhard_Blankenburg
11  https://de.wikipedia.org/wiki/Erhard_Blankenburg
12  https://de.wikipedia.org/wiki/Erhard_Blankenburg
13  https://de.wikipedia.org/wiki/Erhard_Blankenburg
14  https://de.wikipedia.org/wiki/Erhard_Blankenburg
15  https://de.wikipedia.org/wiki/Erhard_Blankenburg
%% Cell type:markdown id:38be8467270ebc58 tags:
## Manual correction
The data has now been saved to `data/<name>-chatgpt.csv`. It needs to be cleaned and augmented before upload, for example by loading it into OpenRefine and reconciling the `object` column via the Wikidata reconciliation service. Afterward, remove the `object-qid` column and recreate it via the "Add column based on this column" function using the GREL expression `cell.recon.match.id`.
Alternatively, you can look up the terms and fill in the `object-qid` column manually.
When done, rename the CSV file by removing "-chatgpt" from its name.
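Before uploading, it can help to check the corrected file mechanically. The sketch below is one possible check, not part of the notebook: `check_row` is a hypothetical helper, and the four-digit year pattern is an assumption based on the upload code converting the time columns with `int()`.

```python
import re

QID_PATTERN = re.compile(r"^Q[1-9]\d*$")
PID_PATTERN = re.compile(r"^P[1-9]\d*$")
YEAR_PATTERN = re.compile(r"^\d{4}$")  # assumption: times are plain four-digit years

def check_row(row):
    """Return a list of human-readable problems for one CSV row (a dict)."""
    problems = []
    # Both ends of the triple need a QID before the claim can be created.
    for column in ("subject-qid", "object-qid"):
        value = row.get(column) or ""
        if not QID_PATTERN.match(value):
            problems.append(f"{column} is missing or not a QID: {value!r}")
    pid = row.get("pid") or ""
    if not PID_PATTERN.match(pid):
        problems.append(f"pid is missing or not a property ID: {pid!r}")
    # Time columns may be empty, but if filled they must be years.
    for column in ("start_time", "end_time"):
        value = row.get(column) or ""
        if value and not YEAR_PATTERN.match(value):
            problems.append(f"{column} is not a year: {value!r}")
    url = row.get("reference_url") or ""
    if url and not url.startswith(("http://", "https://")):
        problems.append(f"reference_url is not a URL: {url!r}")
    return problems

# A row shaped like the shifted rows in the example output above.
shifted = {
    "subject-qid": "Q51595283", "pid": "P69", "object-qid": "",
    "start_time": "", "end_time": "https://de.wikipedia.org/wiki/Erhard_Blankenburg",
    "reference_url": "",
}
for problem in check_row(shifted):
    print(problem)
```

Running `check_row` over every row read with `csv.DictReader` gives a quick list of what still needs manual attention.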
%% Cell type:markdown id:b110b2b14114ad05 tags:
## Upload data to Wikidata
%% Cell type:code id:bdb602fb42b562df tags:
``` python
# based on code written by GPT-4
import csv
from pywikibot import Claim, WbTime, ItemPage, PropertyPage, Site
from datetime import datetime

def claim_to_string(claim):
    subject_qid = claim.on_item.id
    predicate_pid = claim.getID()
    # Object QID, assuming the target is a Wikidata item
    # Note: This simplification assumes the claim's target is an item.
    # For other target types (e.g., quantities, strings), additional handling is needed.
    if isinstance(claim.getTarget(), ItemPage):
        object_qid = claim.getTarget().id
    else:
        # Placeholder or additional logic for non-ItemPage targets
        object_qid = 'N/A'  # This could be expanded to handle other types of targets
    return f"({subject_qid})-[{predicate_pid}]-({object_qid})"

# Function to check if a specific time qualifier exists
def time_qualifier_exists(claim, qualifier_pid, year_value):
    for qualifier in claim.qualifiers.get(qualifier_pid, []):
        qualifier_date = qualifier.getTarget()
        # Skip qualifiers without a concrete value (target is None)
        if qualifier_date is not None and qualifier_date.year == year_value:
            print(f'Time qualifier {qualifier_pid} with value {year_value} already exists on {claim_to_string(claim)}.')
            return True
    return False

def add_time_qualifiers(repo, claim, start_time, end_time):
    qualifiers = []
    if (start_time and end_time) and (start_time == end_time):
        # Identical start and end: model the statement as a point in time
        if not time_qualifier_exists(claim, 'P585', int(start_time)):
            point_in_time_qualifier = Claim(repo, 'P585')
            point_in_time_qualifier.setTarget(WbTime(year=int(start_time)))
            claim.addQualifier(point_in_time_qualifier, summary='Adding point in time')
            print(f'Added point_in_time qualifier to {claim_to_string(claim)}')
            qualifiers.append(point_in_time_qualifier)
    else:
        if start_time and not time_qualifier_exists(claim, 'P580', int(start_time)):
            start_time_qualifier = Claim(repo, 'P580')
            start_time_qualifier.setTarget(WbTime(year=int(start_time)))
            claim.addQualifier(start_time_qualifier, summary='Adding start time')
            print(f'Added start_time qualifier to {claim_to_string(claim)}')
            qualifiers.append(start_time_qualifier)
        if end_time and not time_qualifier_exists(claim, 'P582', int(end_time)):
            end_time_qualifier = Claim(repo, 'P582')
            end_time_qualifier.setTarget(WbTime(year=int(end_time)))
            claim.addQualifier(end_time_qualifier, summary='Adding end time')
            print(f'Added end_time qualifier to {claim_to_string(claim)}')
            qualifiers.append(end_time_qualifier)
    return qualifiers

# Function to check if a reference with the given URL already exists on the claim
def reference_url_exists(claim, url):
    for source in claim.getSources():
        if 'P4656' in source or 'P854' in source:  # Check both Wikimedia import URL and reference URL
            for prop in source.get('P4656', []) + source.get('P854', []):
                if prop.getTarget() == url:
                    print(f'Source URL {url} already exists on {claim_to_string(claim)}.')
                    return True
    return False

def qualifier_exists(claim, qualifier_property_id, target):
    for existing_qualifier in claim.qualifiers.get(qualifier_property_id, []):
        if existing_qualifier.getTarget() == target:
            print(f'Qualifier {qualifier_property_id} with value {target.getID()} already exists on {claim_to_string(claim)}.')
            return True
    return False

def add_reference(repo, claim, reference_url, retrieved_at_time, qualifiers=None):
    sources = []
    if reference_url and not reference_url_exists(claim, reference_url):
        # Determine whether the URL is a Wikipedia URL or another type of URL
        property_id = 'P4656' if 'wikipedia.org' in reference_url else 'P854'
        # Create the reference claim
        source_claim = Claim(repo, property_id)
        source_claim.setTarget(reference_url)
        sources.append(source_claim)
        # Create the 'retrieved at' claim
        retrieved_at_claim = Claim(repo, 'P813')
        retrieved_at_target = WbTime(year=retrieved_at_time.year, month=retrieved_at_time.month, day=retrieved_at_time.day)
        retrieved_at_claim.setTarget(retrieved_at_target)
        sources.append(retrieved_at_claim)
        # If a qualifier has been passed for which this reference is the source, add it
        if qualifiers:
            for qualifier in qualifiers:
                supports_qualifier_claim = Claim(repo, 'P10551')  # "supports qualifier"
                site = Site("wikidata", "wikidata")
                property_page = PropertyPage(site, qualifier.getID())
                if not qualifier_exists(claim, 'P10551', property_page):
                    supports_qualifier_claim.setTarget(property_page)
                    sources.append(supports_qualifier_claim)
    # Add the references to the claim
    if len(sources) > 0:
        claim.addSources(sources, summary='Adding reference and retrieved at date')
        print(f'Added references to {claim_to_string(claim)}')
    return sources

# main function
def update_wikidata(file_path):
    site = Site("wikidata", "wikidata")
    repo = site.data_repository()
    previous_object_qid = None
    previous_claim = None
    with open(file_path, newline='', encoding='utf-8') as csvfile:
        reader = csv.DictReader(csvfile)
        for row in reader:
            print("----------")
            subject_qid = row['subject-qid']
            pid = row['pid']
            object_qid = row['object-qid']
            start_time = row['start_time']
            end_time = row['end_time']
            reference_url = row['reference_url']
            # If the new subject is identical to the old object, refine the previous claim
            if subject_qid == previous_object_qid and previous_claim:
                claim = previous_claim
                print(f'Refining {claim_to_string(claim)}')
            else:
                item = ItemPage(repo, subject_qid)
                item.get()
                # Check if the claim already exists
                claim_exists = False
                for claim in item.claims.get(pid, []):
                    # Only item targets have an id; skip other or empty targets
                    existing_target = claim.getTarget()
                    if isinstance(existing_target, ItemPage) and existing_target.id == object_qid:
                        claim_exists = True
                        print(f'{claim_to_string(claim)} exists.')
                        break
                if not claim_exists:
                    claim = Claim(repo, pid)
                    target = ItemPage(repo, object_qid)
                    claim.setTarget(target)
                    item.addClaim(claim)
                    print(f'Created {claim_to_string(claim)}')
            # start_time and end_time
            qualifiers = add_time_qualifiers(repo, claim, start_time, end_time)
            # references
            retrieved_at_time = datetime.utcnow()
            add_reference(repo, claim, reference_url, retrieved_at_time, qualifiers)
            # Remember the object and claim for the next iteration
            previous_object_qid = object_qid
            previous_claim = claim

update_wikidata('data/Erhard Blankenburg.csv')
```
%% Output
----------
(Q51595283)-[P69]-(Q153987) exists.
Source URL https://de.wikipedia.org/wiki/Erhard_Blankenburg already exists on (Q51595283)-[P69]-(Q153987).
----------
(Q51595283)-[P69]-(Q153006) exists.
Source URL https://de.wikipedia.org/wiki/Erhard_Blankenburg already exists on (Q51595283)-[P69]-(Q153006).
----------
(Q51595283)-[P69]-(Q766145) exists.
Source URL https://de.wikipedia.org/wiki/Erhard_Blankenburg already exists on (Q51595283)-[P69]-(Q766145).
----------
(Q51595283)-[P69]-(Q372608) exists.
Time qualifier P582 with value 1965 already exists on (Q51595283)-[P69]-(Q372608).
Source URL https://de.wikipedia.org/wiki/Erhard_Blankenburg already exists on (Q51595283)-[P69]-(Q372608).
----------
(Q51595283)-[P512]-(Q2091008) exists.
Time qualifier P585 with value 1965 already exists on (Q51595283)-[P512]-(Q2091008).
Source URL https://de.wikipedia.org/wiki/Erhard_Blankenburg already exists on (Q51595283)-[P512]-(Q2091008).
----------
(Q51595283)-[P512]-(Q752297) exists.
Time qualifier P585 with value 1966 already exists on (Q51595283)-[P512]-(Q752297).
Source URL https://de.wikipedia.org/wiki/Erhard_Blankenburg already exists on (Q51595283)-[P512]-(Q752297).
----------
(Q51595283)-[P108]-(Q153987) exists.
Time qualifier P580 with value 1966 already exists on (Q51595283)-[P108]-(Q153987).
Time qualifier P582 with value 1968 already exists on (Q51595283)-[P108]-(Q153987).
Source URL https://de.wikipedia.org/wiki/Erhard_Blankenburg already exists on (Q51595283)-[P108]-(Q153987).
----------
(Q51595283)-[P108]-(Q124866772) exists.
Time qualifier P580 with value 1969 already exists on (Q51595283)-[P108]-(Q124866772).
Time qualifier P582 with value 1971 already exists on (Q51595283)-[P108]-(Q124866772).
Source URL https://de.wikipedia.org/wiki/Erhard_Blankenburg already exists on (Q51595283)-[P108]-(Q124866772).
----------
(Q51595283)-[P108]-(Q2112115) exists.
Time qualifier P582 with value 1973 already exists on (Q51595283)-[P108]-(Q2112115).
Source URL https://de.wikipedia.org/wiki/Erhard_Blankenburg already exists on (Q51595283)-[P108]-(Q2112115).
----------
(Q51595283)-[P108]-(Q832780) exists.
Time qualifier P580 with value 1973 already exists on (Q51595283)-[P108]-(Q832780).
Time qualifier P582 with value 1974 already exists on (Q51595283)-[P108]-(Q832780).
Source URL https://de.wikipedia.org/wiki/Erhard_Blankenburg already exists on (Q51595283)-[P108]-(Q832780).
----------
(Q51595283)-[P512]-(Q308678) exists.
Time qualifier P585 with value 1974 already exists on (Q51595283)-[P512]-(Q308678).
Source URL https://de.wikipedia.org/wiki/Erhard_Blankenburg already exists on (Q51595283)-[P512]-(Q308678).
----------
- (Q51595283)-[P108]-(Q475602) exists.
- Time qualifier P580 with value 1975 already exists on (Q51595283)-[P108]-(Q475602).
- Time qualifier P582 with value 1980 already exists on (Q51595283)-[P108]-(Q475602).
- Source URL https://de.wikipedia.org/wiki/Erhard_Blankenburg already exists on (Q51595283)-[P108]-(Q475602).
+ Created (Q51595283)-[P108]-(Q475602)
+ Sleeping for 9.5 seconds, 2024-03-16 20:09:39
+ Added start_time qualifier to (Q51595283)-[P108]-(Q475602)
+ Sleeping for 9.5 seconds, 2024-03-16 20:09:49
+ Added end_time qualifier to (Q51595283)-[P108]-(Q475602)
+ Sleeping for 9.5 seconds, 2024-03-16 20:09:59
+ Added references to (Q51595283)-[P108]-(Q475602)
----------
(Q51595283)-[P8413]-(Q1065414) exists.
Time qualifier P580 with value 1980 already exists on (Q51595283)-[P8413]-(Q1065414).
Source URL https://de.wikipedia.org/wiki/Erhard_Blankenburg already exists on (Q51595283)-[P8413]-(Q1065414).
----------
(Q51595283)-[P1416]-(Q1459361) exists.
Source URL https://de.wikipedia.org/wiki/Erhard_Blankenburg already exists on (Q51595283)-[P1416]-(Q1459361).
----------
(Q51595283)-[P98]-(Q96335163) exists.
Source URL https://de.wikipedia.org/wiki/Erhard_Blankenburg already exists on (Q51595283)-[P98]-(Q96335163).
----------
(Q65972149)-[P112]-(Q51595283) exists.
Source URL https://de.wikipedia.org/wiki/Erhard_Blankenburg already exists on (Q65972149)-[P112]-(Q51595283).
----------
Refining (Q65972149)-[P112]-(Q51595283)
Time qualifier P582 with value 2003 already exists on (Q65972149)-[P112]-(Q51595283).
Source URL https://www.linkedin.com/in/erhard-blankenburg-63938058/ already exists on (Q65972149)-[P112]-(Q51595283).
%% Cell type:markdown id:b7757fcf6ea66320 tags:
The result can be seen at https://www.wikidata.org/wiki/Q51595283
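
The refinement rule in `update_wikidata` (a row whose subject equals the previous row's object is applied to the previous claim instead of creating a new one) can be previewed offline before anything is written to Wikidata. The following is a minimal sketch under stated assumptions, not part of the notebook: `plan_rows` and `SAMPLE_CSV` are hypothetical names, the sample rows are illustrative, and the layout of the refining row (empty `pid` and `object-qid`) is a guess. Only the column names and the comparison logic are taken from the code above.

```python
import csv
import io

# Illustrative data only; the real file is 'data/Erhard Blankenburg.csv'.
# The shape of the third (refining) row is an assumption.
SAMPLE_CSV = "\n".join([
    "subject-qid,pid,object-qid,start_time,end_time,reference_url",
    "Q51595283,P69,Q153987,,,https://de.wikipedia.org/wiki/Erhard_Blankenburg",
    "Q65972149,P112,Q51595283,,,https://de.wikipedia.org/wiki/Erhard_Blankenburg",
    "Q51595283,,,,2003,https://www.linkedin.com/in/erhard-blankenburg-63938058/",
])


def plan_rows(csv_text):
    """Return (action, triple) per row without contacting Wikidata.

    action is 'refine' when the row's subject equals the previous row's
    object, otherwise 'lookup-or-create'. triple is the claim the row's
    qualifiers and references would be attached to.
    """
    previous_object_qid = None
    previous_triple = None
    plan = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        subject_qid = row['subject-qid']
        if subject_qid == previous_object_qid and previous_triple:
            # Same test as in update_wikidata: reuse the previous claim
            action, triple = 'refine', previous_triple
        else:
            action = 'lookup-or-create'
            triple = (subject_qid, row['pid'], row['object-qid'])
        plan.append((action, triple))
        # Remember the object and claim for the next iteration
        previous_object_qid = row['object-qid']
        previous_triple = triple
    return plan


for action, (s, p, o) in plan_rows(SAMPLE_CSV):
    print(f'{action}: ({s})-[{p}]-({o})')
```

Running such a check on the edited CSV first makes it easier to spot a row that would unintentionally refine the preceding claim because its subject happens to match the previous object.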