Skip to content
Snippets Groups Projects
data-extraction.ipynb 38 KiB
Newer Older
  • Learn to ignore specific revisions
  • {
     "cells": [
      {
       "cell_type": "markdown",
       "source": [
        "# Extract information from a Wikipedia page and upload to Wikidata\n",
    
        "\n",
        "This notebook takes an excerpt from a Wikipedia page about a scholar and extracts biographical information from it to upload the infromation to the WikiData enrty on that person. The steps are as follows:\n",
        " \n",
        "1. send the excerpt to the OpenAi API (GPT-4), using a custom prompt that instructs the model to extract CSV data that can easily be arranged into statements and qualifiers\n",
        "2. manually edit the data by correcting wrongly inferred information and adding missing triple data\n",
        "3. upload the data using pywikibot  "
    
       ],
       "metadata": {
        "collapsed": false
       },
       "id": "6f9eb711429fb6cd"
      },
    
      {
       "cell_type": "markdown",
       "source": [
        "## Definining the prompt"
       ],
       "metadata": {
        "collapsed": false
       },
       "id": "f36d8c1c925d1e0e"
      },
    
       "execution_count": 1,
    
       "outputs": [],
       "source": [
        "prompt = '''\n",
        "Your task is to extract data from the text and to output it in a format that is suitable as a data source for adding triples to Wikidata.\n",
        "\n",
        "The text is about \"{fullName}\" with the QID {qid}. It consists of one or more sections separated by \"-----\". The sections begin with a standalone URL followed by an excerpt of the content that can be found at this URL. \n",
        "\n",
        "Arrange the extracted information into a table with the following columns: subject-label, subject-qid, predicate, pid, object, object-qid, start_time, end_time, reference_url.  \n",
        "\n",
        "Insert data into the columns as per the following rules:\n",
        "- subject-label/subject-qid: In general, the subject is \"{fullName}\" with the QID {qid}. However, refining/qualifying statements can also be made about other entities, as with the academic degree (P512) item below. Also, in the case of P112, subject and object must be reversed\n",
        "- predicate/pid:\n",
        "    - educated at (P69): Institutions at which the person studied \n",
        "    - student of (P1066): If supervisors of doctoral theses and habilitations are specified\n",
        "    - employer (P108): is the organization that pays the salary of a person (this can be a company, and institution or the university)\n",
        "    - academic appointment (P8413): usually the department of a university, if this or its QID are not known, like P108\n",
        "    - student (P802): persons contained in WikiData who were educated by the subject\n",
        "    - member of (P463): Organizations and associations to which the person belongs (excluding P108)\n",
        "    - affiliation (P1416): Organization that the subject is affiliated with (not member of or employed by)\n",
        "    - academic degree (P512): some instance of academic degree (Q189533). After making this claim, add further triples to refine the P512 statement with triples on \"conferred by\" (P1027) and on \"point in time\" (P585).\n",
    
        "    - field of work (P101): extract the main topics and themes the subject has worked and published on \n",
    
        "    - editor (P98): add information on memberships in editorial boards of academic journals\n",
        "    - founded by (P112): add information on journals, associations or other organizations that the subject helped to establish. When adding this claim, YOU MUST switch subject and object to express the reverse relationship\n",
        "- object-label/object-qid: here the English labels and, if known, the QIDs for the institutions and persons who are the objects of the triple. If you are not absolutely sure, leave blank\n",
        "- start_time: the date/year from which the triple statement is true. Leave blank if the date is not specified or cannot be inferred, or the triple involves P585 \n",
        "- end_time: the date/year up to which the triple statement is true. If it is an event, identical to start_time\n",
        "- reference_url: this is the source URL of the text from which the information was extracted.\n",
        "\n",
        "Return information as a comma-separated values (CSV). Include the column headers. Surround the values with quotes. If values contain quotes, properly escape them.\n",
        " \n",
        "DO NOT, UNDER ANY CIRCUMSTANCES, provide any commentary or explanations, just return the raw data. Do not make anything up that is not in the source material.\n",
        "-----\n",
        "{website_text}\n",
        "'''\n"
       ],
       "metadata": {
        "collapsed": false,
        "ExecuteTime": {
    
         "end_time": "2024-03-17T09:29:44.073226100Z",
         "start_time": "2024-03-17T09:29:44.073226100Z"
    
       "cell_type": "markdown",
    
       "source": [
        "## Data from Wikipedia (or any other website)"
       ],
       "metadata": {
        "collapsed": false
       },
    
       "id": "19d4b0c25a7a8a89"
    
       "execution_count": 2,
    
       "outputs": [],
       "source": [
        "website_text = '''\n",
        "https://de.wikipedia.org/wiki/Erhard_Blankenburg\n",
        "\n",
        "Blankenburg belegte ein Studium der Philosophie, Soziologie und Germanistik an der Universität Freiburg und FU Berlin. Es folgten Graduate Studies und eine Tätigkeit als Forschungsassistent am Department of Sociology der University of Oregon. Ein Studium der Soziologie und Wirtschaftswissenschaft an der Universität Basel beendete er mit dem Abschluss Master of Arts 1965.\n",
        "\n",
        "Seine Promotion zum Dr. phil. erfolgte an der Universität Basel 1966. Als Assistent am Institut für Soziologie der Universität Freiburg arbeitete er von 1966 bis 1968. Von 1969 bis 1971 war er Organisationsberater beim Quickborner Team, Hamburg. Danach arbeitete Blankenburg in Basel als Senior Projektleiter bei der Prognos in Basel. 1973/1974 war er wissenschaftlicher Mitarbeiter am Max-Planck-Institut für ausländisches und internationales Strafrecht in Freiburg. Die Habilitation für das Fach Soziologie erwarb er 1974 an der Universität Freiburg. Blankenburg war von 1975 bis 1980 Mitglied des Wissenschaftszentrums Berlin, Internationales Institut für Management und Verwaltung.\n",
        "\n",
        "1980 bekam er einen Ruf auf den Lehrstuhl für Rechtssoziologie der Vrije Universiteit Amsterdam. Gemeinsam mit Wolfgang Kaupen spielte er eine wichtige Rolle bei der Neubegründung der Deutschen Rechtssoziologie in den 70er-Jahren (Raiser 1998), ebenso, mit Volkmar Gessner, bei der Gründung des International Institute for the Sociology of Law. Er gehörte auch zu den Initiatoren und zu den Gründungsherausgebern der Zeitschrift für Rechtssoziologie. Gemeinsam mit Bill Felstiner organisierte er 1991 in Amsterdam das erste gemeinsame Treffen der beiden bedeutenden Vereinigungen der Rechtssoziologie (LSA und RCSL). Seine Beschäftigung mit rechtssoziologischen Themen war ungewöhnlich breit, reichte von der Soziologie der Kriminalität über die des Staatsapparates bis zu der des Zivilrechts. Blankenburg war primär Empiriker und Methodiker (vgl. seine Empirische Rechtssoziologie). Seine wichtigsten Beiträge zur rechtssoziologischen Theorie betreffen die Begriffe der \"Mobilisierung des Rechts\" und der \"Rechtskultur(en)\". Vor allem aber wirkte er als Koordinator, Organisator und als Vermittler zwischen Wissenschaft und Praxis: \"Er bemühte sich nicht, eine 'Schule' zu gründen, ihm fiel es leicht, in stets wechselnden Teams mit wechselnden Wissenschaftlern zusammenzuarbeiten. Wie kein anderer Rechtssoziologe vermochte er, erfolgreich Tagungen zu organisieren, kompetente Referenten zu gewinnen und die Veranstaltungen mit Autorität und zugleich locker zu leiten\" (Theo Rasehorn 1998, 23). \n",
    
        "\n",
        "https://www.linkedin.com/in/erhard-blankenburg-63938058/\n",
        "\n",
        "Erhard Blankenburg has been teaching sociology of law at the Vrije Universiteit Amsterdam from 1980 to 2003. \n",
        "He got a Master of Arts at the University of Oregon, a Doctors degree from Basel (Switzerland) and a Dr. habil. at Freiburg (Germany). \n",
        "After teaching sociology and sociology of law at Freiburg University 1965 -1970, he served as consultant with the QuickbornTeam Hamburg until 1972, as senior research fellow at the PrognosAG Basel until 1974, at the Max Planck Institut Freiburg 1974/75 and at the Science Centre Berlin until 1980.\n",
        "Since 1990 evaluating system renovation in EastGermany, South Africa, post communist countries in Central Europe. Book publications on comparative legal cultures, police, public prosecutors, civil courts, labour courts, legal aid and mobilization of law .\n",
    
        "'''"
       ],
       "metadata": {
        "collapsed": false,
        "ExecuteTime": {
    
         "end_time": "2024-03-17T09:29:51.010290900Z",
         "start_time": "2024-03-17T09:29:51.006986200Z"
    
      {
       "cell_type": "markdown",
       "source": [],
       "metadata": {
        "collapsed": false
       },
       "id": "2362d9d97adbcbbf"
      },
      {
       "cell_type": "markdown",
       "source": [
        "## Query the OpenAI API (GPT-4)\n"
       ],
       "metadata": {
        "collapsed": false
       },
       "id": "a5800fe8919e19c4"
      },
    
       "execution_count": 3,
    
       "outputs": [],
       "source": [
        "import io\n",
        "from langchain_core.prompts import ChatPromptTemplate\n",
        "from langchain_openai import ChatOpenAI\n",
        "from langchain_core.output_parsers import StrOutputParser\n",
        "import pandas as pd\n",
        "from dotenv import load_dotenv\n",
        "\n",
        "load_dotenv()\n",
        "\n",
        "def use_model(model, template, debug=False, **params):\n",
        "    prompt = ChatPromptTemplate.from_template(template)\n",
        "    parser = StrOutputParser()\n",
        "    chain = ( prompt | model | parser )\n",
        "    response = chain.invoke(params)\n",
        "    if debug:\n",
        "        print(response)\n",
        "    data = io.StringIO(response)\n",
        "    return pd.read_csv(data, dtype={'start_time': str, 'end_time': str})"
       ],
       "metadata": {
        "collapsed": false,
        "ExecuteTime": {
    
         "end_time": "2024-03-17T09:30:03.319060900Z",
         "start_time": "2024-03-17T09:29:54.835405100Z"
    
      {
       "cell_type": "markdown",
       "source": [
        "## Run example"
       ],
       "metadata": {
        "collapsed": false
       },
       "id": "9442427185ae2a72"
      },
    
       "execution_count": 4,
    
          "text/plain": "                  Erhard Blankenburg  Q51595283           educated at    P69  \\\n0                 Erhard Blankenburg  Q51595283           educated at    P69   \n1                 Erhard Blankenburg  Q51595283           educated at    P69   \n2                 Erhard Blankenburg  Q51595283           educated at    P69   \n3                 Erhard Blankenburg  Q51595283       academic degree   P512   \n4                 Erhard Blankenburg  Q51595283       academic degree   P512   \n5                 Erhard Blankenburg  Q51595283              employer   P108   \n6                 Erhard Blankenburg  Q51595283              employer   P108   \n7                 Erhard Blankenburg  Q51595283              employer   P108   \n8                 Erhard Blankenburg  Q51595283  academic appointment  P8413   \n9                 Erhard Blankenburg  Q51595283  academic appointment  P8413   \n10                Erhard Blankenburg  Q51595283            founded by   P112   \n11  Zeitschrift für Rechtssoziologie        NaN            founded by   P112   \n12                Erhard Blankenburg  Q51595283         field of work   P101   \n13                Erhard Blankenburg  Q51595283         field of work   P101   \n14                Erhard Blankenburg  Q51595283         field of work   P101   \n15                Erhard Blankenburg  Q51595283         field of work   P101   \n16                Erhard Blankenburg  Q51595283              employer   P108   \n17                Erhard Blankenburg  Q51595283  academic appointment  P8413   \n\n                               University of Freiburg Unnamed: 5  Unnamed: 6  \\\n0                                           FU Berlin        NaN         NaN   \n1                                University of Oregon        NaN         NaN   \n2                                 University of Basel        NaN         NaN   \n3                                      Master of Arts        NaN      1965.0   \n4                                           Dr. phil.        NaN      1966.0   \n5                           Quickborner Team, Hamburg        NaN      1969.0   \n6                                    Prognos in Basel        NaN         NaN   \n7   Max-Planck-Institut für ausländisches und inte...        NaN      1973.0   \n8                              University of Freiburg        NaN      1975.0   \n9                        Vrije Universiteit Amsterdam        NaN      1980.0   \n10   International Institute for the Sociology of Law        NaN         NaN   \n11                                 Erhard Blankenburg  Q51595283         NaN   \n12                                   Sociology of law        NaN         NaN   \n13                                 Criminal sociology        NaN         NaN   \n14                   Sociology of the state apparatus        NaN         NaN   \n15                                Civil law sociology        NaN         NaN   \n16                              Science Centre Berlin        NaN      1975.0   \n17                       Vrije Universiteit Amsterdam        NaN      1980.0   \n\n    Unnamed: 7   https://de.wikipedia.org/wiki/Erhard_Blankenburg  \n0          NaN   https://de.wikipedia.org/wiki/Erhard_Blankenburg  \n1          NaN   https://de.wikipedia.org/wiki/Erhard_Blankenburg  \n2          NaN   https://de.wikipedia.org/wiki/Erhard_Blankenburg  \n3       1965.0   https://de.wikipedia.org/wiki/Erhard_Blankenburg  \n4       1966.0   https://de.wikipedia.org/wiki/Erhard_Blankenburg  \n5       1971.0   https://de.wikipedia.org/wiki/Erhard_Blankenburg  \n6       1974.0   https://de.wikipedia.org/wiki/Erhard_Blankenburg  \n7       1974.0   https://de.wikipedia.org/wiki/Erhard_Blankenburg  \n8       1980.0   https://de.wikipedia.org/wiki/Erhard_Blankenburg  \n9          NaN   https://de.wikipedia.org/wiki/Erhard_Blankenburg  \n10         NaN   https://de.wikipedia.org/wiki/Erhard_Blankenburg  \n11         NaN   https://de.wikipedia.org/wiki/Erhard_Blankenburg  \n12         NaN   https://de.wikipedia.org/wiki/Erhard_Blankenburg  \n13         NaN   https://de.wikipedia.org/wiki/Erhard_Blankenburg  \n14         NaN   https://de.wikipedia.org/wiki/Erhard_Blankenburg  \n15         NaN   https://de.wikipedia.org/wiki/Erhard_Blankenburg  \n16      1980.0  https://www.linkedin.com/in/erhard-blankenburg...  \n17      2003.0  https://www.linkedin.com/in/erhard-blankenburg...  ",
          "text/html": "<div>\n<style scoped>\n    .dataframe tbody tr th:only-of-type {\n        vertical-align: middle;\n    }\n\n    .dataframe tbody tr th {\n        vertical-align: top;\n    }\n\n    .dataframe thead th {\n        text-align: right;\n    }\n</style>\n<table border=\"1\" class=\"dataframe\">\n  <thead>\n    <tr style=\"text-align: right;\">\n      <th></th>\n      <th>Erhard Blankenburg</th>\n      <th>Q51595283</th>\n      <th>educated at</th>\n      <th>P69</th>\n      <th>University of Freiburg</th>\n      <th>Unnamed: 5</th>\n      <th>Unnamed: 6</th>\n      <th>Unnamed: 7</th>\n      <th>https://de.wikipedia.org/wiki/Erhard_Blankenburg</th>\n    </tr>\n  </thead>\n  <tbody>\n    <tr>\n      <th>0</th>\n      <td>Erhard Blankenburg</td>\n      <td>Q51595283</td>\n      <td>educated at</td>\n      <td>P69</td>\n      <td>FU Berlin</td>\n      <td>NaN</td>\n      <td>NaN</td>\n      <td>NaN</td>\n      <td>https://de.wikipedia.org/wiki/Erhard_Blankenburg</td>\n    </tr>\n    <tr>\n      <th>1</th>\n      <td>Erhard Blankenburg</td>\n      <td>Q51595283</td>\n      <td>educated at</td>\n      <td>P69</td>\n      <td>University of Oregon</td>\n      <td>NaN</td>\n      <td>NaN</td>\n      <td>NaN</td>\n      <td>https://de.wikipedia.org/wiki/Erhard_Blankenburg</td>\n    </tr>\n    <tr>\n      <th>2</th>\n      <td>Erhard Blankenburg</td>\n      <td>Q51595283</td>\n      <td>educated at</td>\n      <td>P69</td>\n      <td>University of Basel</td>\n      <td>NaN</td>\n      <td>NaN</td>\n      <td>NaN</td>\n      <td>https://de.wikipedia.org/wiki/Erhard_Blankenburg</td>\n    </tr>\n    <tr>\n      <th>3</th>\n      <td>Erhard Blankenburg</td>\n      <td>Q51595283</td>\n      <td>academic degree</td>\n      <td>P512</td>\n      <td>Master of Arts</td>\n      <td>NaN</td>\n      <td>1965.0</td>\n      <td>1965.0</td>\n      <td>https://de.wikipedia.org/wiki/Erhard_Blankenburg</td>\n    </tr>\n    <tr>\n      <th>4</th>\n      <td>Erhard Blankenburg</td>\n      <td>Q51595283</td>\n      <td>academic degree</td>\n      <td>P512</td>\n      <td>Dr. phil.</td>\n      <td>NaN</td>\n      <td>1966.0</td>\n      <td>1966.0</td>\n      <td>https://de.wikipedia.org/wiki/Erhard_Blankenburg</td>\n    </tr>\n    <tr>\n      <th>5</th>\n      <td>Erhard Blankenburg</td>\n      <td>Q51595283</td>\n      <td>employer</td>\n      <td>P108</td>\n      <td>Quickborner Team, Hamburg</td>\n      <td>NaN</td>\n      <td>1969.0</td>\n      <td>1971.0</td>\n      <td>https://de.wikipedia.org/wiki/Erhard_Blankenburg</td>\n    </tr>\n    <tr>\n      <th>6</th>\n      <td>Erhard Blankenburg</td>\n      <td>Q51595283</td>\n      <td>employer</td>\n      <td>P108</td>\n      <td>Prognos in Basel</td>\n      <td>NaN</td>\n      <td>NaN</td>\n      <td>1974.0</td>\n      <td>https://de.wikipedia.org/wiki/Erhard_Blankenburg</td>\n    </tr>\n    <tr>\n      <th>7</th>\n      <td>Erhard Blankenburg</td>\n      <td>Q51595283</td>\n      <td>employer</td>\n      <td>P108</td>\n      <td>Max-Planck-Institut für ausländisches und inte...</td>\n      <td>NaN</td>\n      <td>1973.0</td>\n      <td>1974.0</td>\n      <td>https://de.wikipedia.org/wiki/Erhard_Blankenburg</td>\n    </tr>\n    <tr>\n      <th>8</th>\n      <td>Erhard Blankenburg</td>\n      <td>Q51595283</td>\n      <td>academic appointment</td>\n      <td>P8413</td>\n      <td>University of Freiburg</td>\n      <td>NaN</td>\n      <td>1975.0</td>\n      <td>1980.0</td>\n      <td>https://de.wikipedia.org/wiki/Erhard_Blankenburg</td>\n    </tr>\n    <tr>\n      <th>9</th>\n      <td>Erhard Blankenburg</td>\n      <td>Q51595283</td>\n      <td>academic appointment</td>\n      <td>P8413</td>\n      <td>Vrije Universiteit Amsterdam</td>\n      <td>NaN</td>\n      <td>1980.0</td>\n      <td>NaN</td>\n      <td>https://de.wikipedia.org/wiki/Erhard_Blankenburg</td>\n    </tr>\n    <tr>\n      <th>10</th>\n      <td>Erhard Blankenburg</td>\n      <td>Q51595283</td>\n      <td>founded by</td>\n      <td>P112</td>\n      <td>International Institute for the Sociology of Law</td>\n      <td>NaN</td>\n      <td>NaN</td>\n      <td>NaN</td>\n      <td>https://de.wikipedia.org/wiki/Erhard_Blankenburg</td>\n    </tr>\n    <tr>\n      <th>11</th>\n      <td>Zeitschrift für Rechtssoziologie</td>\n      <td>NaN</td>\n      <td>founded by</td>\n      <td>P112</td>\n      <td>Erhard Blankenburg</td>\n      <td>Q51595283</td>\n      <td>NaN</td>\n      <td>NaN</td>\n      <td>https://de.wikipedia.org/wiki/Erhard_Blankenburg</td>\n    </tr>\n    <tr>\n      <th>12</th>\n      <td>Erhard Blankenburg</td>\n      <td>Q51595283</td>\n      <td>field of work</td>\n      <td>P101</td>\n      <td>Sociology of law</td>\n      <td>NaN</td>\n      <td>NaN</td>\n      <td>NaN</td>\n      <td>https://de.wikipedia.org/wiki/Erhard_Blankenburg</td>\n    </tr>\n    <tr>\n      <th>13</th>\n      <td>Erhard Blankenburg</td>\n      <td>Q51595283</td>\n      <td>field of work</td>\n      <td>P101</td>\n      <td>Criminal sociology</td>\n      <td>NaN</td>\n      <td>NaN</td>\n      <td>NaN</td>\n      <td>https://de.wikipedia.org/wiki/Erhard_Blankenburg</td>\n    </tr>\n    <tr>\n      <th>14</th>\n      <td>Erhard Blankenburg</td>\n      <td>Q51595283</td>\n      <td>field of work</td>\n      <td>P101</td>\n      <td>Sociology of the state apparatus</td>\n      <td>NaN</td>\n      <td>NaN</td>\n      <td>NaN</td>\n      <td>https://de.wikipedia.org/wiki/Erhard_Blankenburg</td>\n    </tr>\n    <tr>\n      <th>15</th>\n      <td>Erhard Blankenburg</td>\n      <td>Q51595283</td>\n      <td>field of work</td>\n      <td>P101</td>\n      <td>Civil law sociology</td>\n      <td>NaN</td>\n      <td>NaN</td>\n      <td>NaN</td>\n      <td>https://de.wikipedia.org/wiki/Erhard_Blankenburg</td>\n    </tr>\n    <tr>\n      <th>16</th>\n      <td>Erhard Blankenburg</td>\n      <td>Q51595283</td>\n      <td>employer</td>\n      <td>P108</td>\n      <td>Science Centre Berlin</td>\n      <td>NaN</td>\n      <td>1975.0</td>\n      <td>1980.0</td>\n      <td>https://www.linkedin.com/in/erhard-blankenburg...</td>\n    </tr>\n    <tr>\n      <th>17</th>\n      <td>Erhard Blankenburg</td>\n      <td>Q51595283</td>\n      <td>academic appointment</td>\n      <td>P8413</td>\n      <td>Vrije Universiteit Amsterdam</td>\n      <td>NaN</td>\n      <td>1980.0</td>\n      <td>2003.0</td>\n      <td>https://www.linkedin.com/in/erhard-blankenburg...</td>\n    </tr>\n  </tbody>\n</table>\n</div>"
    
         "execution_count": 4,
    
         "metadata": {},
         "output_type": "execute_result"
        }
       ],
       "source": [
        "fullName = \"Erhard Blankenburg\"\n",
        "qid=\"Q51595283\"\n",
        "model = ChatOpenAI(model_name=\"gpt-4\")\n",
        "df = use_model(model, prompt, fullName=fullName, qid=qid, website_text=website_text)\n",
        "df.to_csv(f'data/{fullName}-chatgpt.csv', index=False)\n",
        "df"
       ],
       "metadata": {
        "collapsed": false,
        "ExecuteTime": {
    
         "end_time": "2024-03-17T09:30:28.719715800Z",
         "start_time": "2024-03-17T09:30:06.483740300Z"
    
      {
       "cell_type": "markdown",
       "source": [
    
        "## Manual correction\n",
    
        "\n",
    
    Christian Boulanger's avatar
    Christian Boulanger committed
        "The data has now been downloaded to `data/<name>-chatgpt.csv`. It needs to be cleaned and augmented before upload, for example by loading it into OpenRefine and reconciling the `object` column via the  WikiData Reconciliation service. Afterward, remove the object-qid column and recreate it via the \"add column based on this column\" function using `cell.recon.match.id` GREL expression. \n",
    
        "\n",
        "Otherwise, you can also just look up the terms and fill out the object-qid column manually. \n",
        "\n",
        "When done, rename the CSV file by removing the \"-chatgpt\" infix. "
       ],
       "metadata": {
        "collapsed": false
       },
       "id": "38be8467270ebc58"
      },
      {
       "cell_type": "markdown",
       "source": [
        "## Upload data to WikiData\n"
    
       ],
       "metadata": {
        "collapsed": false
       },
       "id": "b110b2b14114ad05"
      },
    
       "execution_count": 5,
    
       "outputs": [
        {
         "name": "stdout",
         "output_type": "stream",
         "text": [
    
          "----------\n",
          "(Q51595283)-[P69]-(Q153987) exists.\n",
          "Source URL https://de.wikipedia.org/wiki/Erhard_Blankenburg already exists on (Q51595283)-[P69]-(Q153987).\n",
          "----------\n",
          "(Q51595283)-[P69]-(Q153006) exists.\n",
          "Source URL https://de.wikipedia.org/wiki/Erhard_Blankenburg already exists on (Q51595283)-[P69]-(Q153006).\n",
          "----------\n",
          "(Q51595283)-[P69]-(Q766145) exists.\n",
          "Source URL https://de.wikipedia.org/wiki/Erhard_Blankenburg already exists on (Q51595283)-[P69]-(Q766145).\n",
          "----------\n",
          "(Q51595283)-[P69]-(Q372608) exists.\n",
          "Time qualifier P582 with value 1965 already exists on (Q51595283)-[P69]-(Q372608).\n",
          "Source URL https://de.wikipedia.org/wiki/Erhard_Blankenburg already exists on (Q51595283)-[P69]-(Q372608).\n",
          "----------\n",
          "(Q51595283)-[P512]-(Q2091008) exists.\n",
          "Time qualifier P585 with value 1965 already exists on (Q51595283)-[P512]-(Q2091008).\n",
          "Source URL https://de.wikipedia.org/wiki/Erhard_Blankenburg already exists on (Q51595283)-[P512]-(Q2091008).\n",
          "----------\n",
          "(Q51595283)-[P512]-(Q752297) exists.\n",
          "Time qualifier P585 with value 1966 already exists on (Q51595283)-[P512]-(Q752297).\n",
          "Source URL https://de.wikipedia.org/wiki/Erhard_Blankenburg already exists on (Q51595283)-[P512]-(Q752297).\n",
          "----------\n",
          "(Q51595283)-[P108]-(Q153987) exists.\n",
          "Time qualifier P580 with value 1966 already exists on (Q51595283)-[P108]-(Q153987).\n",
          "Time qualifier P582 with value 1968 already exists on (Q51595283)-[P108]-(Q153987).\n",
          "Source URL https://de.wikipedia.org/wiki/Erhard_Blankenburg already exists on (Q51595283)-[P108]-(Q153987).\n",
          "----------\n",
          "(Q51595283)-[P108]-(Q124866772) exists.\n",
          "Time qualifier P580 with value 1969 already exists on (Q51595283)-[P108]-(Q124866772).\n",
          "Time qualifier P582 with value 1971 already exists on (Q51595283)-[P108]-(Q124866772).\n",
          "Source URL https://de.wikipedia.org/wiki/Erhard_Blankenburg already exists on (Q51595283)-[P108]-(Q124866772).\n",
          "----------\n",
          "(Q51595283)-[P108]-(Q2112115) exists.\n",
          "Time qualifier P582 with value 1973 already exists on (Q51595283)-[P108]-(Q2112115).\n",
          "Source URL https://de.wikipedia.org/wiki/Erhard_Blankenburg already exists on (Q51595283)-[P108]-(Q2112115).\n",
          "----------\n",
          "(Q51595283)-[P108]-(Q832780) exists.\n",
          "Time qualifier P580 with value 1973 already exists on (Q51595283)-[P108]-(Q832780).\n",
          "Time qualifier P582 with value 1974 already exists on (Q51595283)-[P108]-(Q832780).\n",
          "Source URL https://de.wikipedia.org/wiki/Erhard_Blankenburg already exists on (Q51595283)-[P108]-(Q832780).\n",
          "----------\n",
          "(Q51595283)-[P512]-(Q308678) exists.\n",
          "Time qualifier P585 with value 1974 already exists on (Q51595283)-[P512]-(Q308678).\n",
          "Source URL https://de.wikipedia.org/wiki/Erhard_Blankenburg already exists on (Q51595283)-[P512]-(Q308678).\n",
          "----------\n",
    
          "(Q51595283)-[P108]-(Q475602) exists.\n",
          "Time qualifier P580 with value 1975 already exists on (Q51595283)-[P108]-(Q475602).\n",
          "Time qualifier P582 with value 1980 already exists on (Q51595283)-[P108]-(Q475602).\n",
          "Source URL https://de.wikipedia.org/wiki/Erhard_Blankenburg already exists on (Q51595283)-[P108]-(Q475602).\n",
          "----------\n",
          "(Q51595283)-[P8413]-(Q1065414) exists.\n",
          "Time qualifier P580 with value 1980 already exists on (Q51595283)-[P8413]-(Q1065414).\n",
          "Source URL https://de.wikipedia.org/wiki/Erhard_Blankenburg already exists on (Q51595283)-[P8413]-(Q1065414).\n",
          "----------\n",
          "(Q51595283)-[P1416]-(Q1459361) exists.\n",
          "Source URL https://de.wikipedia.org/wiki/Erhard_Blankenburg already exists on (Q51595283)-[P1416]-(Q1459361).\n",
          "----------\n",
          "(Q51595283)-[P98]-(Q96335163) exists.\n",
          "Source URL https://de.wikipedia.org/wiki/Erhard_Blankenburg already exists on (Q51595283)-[P98]-(Q96335163).\n",
          "----------\n",
          "Created (Q65972149)-[P112]-(Q51595283)\n"
    
         ]
        },
        {
         "name": "stderr",
         "output_type": "stream",
         "text": [
    
          "Sleeping for 9.5 seconds, 2024-03-17 10:36:18\n"
    
         ]
        },
        {
         "name": "stdout",
         "output_type": "stream",
         "text": [
    
          "Added references to (Q65972149)-[P112]-(Q51595283)\n",
          "----------\n",
          "Refining (Q65972149)-[P112]-(Q51595283)\n"
    
         ]
        },
        {
         "name": "stderr",
         "output_type": "stream",
         "text": [
    
          "Sleeping for 9.5 seconds, 2024-03-17 10:36:28\n"
    
         ]
        },
        {
         "name": "stdout",
         "output_type": "stream",
         "text": [
    
          "Added end_time qualifier to (Q65972149)-[P112]-(Q51595283)\n"
    
         ]
        },
        {
         "name": "stderr",
         "output_type": "stream",
         "text": [
    
          "Sleeping for 9.1 seconds, 2024-03-17 10:36:38\n"
    
         ]
        },
        {
         "name": "stdout",
         "output_type": "stream",
         "text": [
    
          "Added references to (Q65972149)-[P112]-(Q51595283)\n",
    
          "(Q51595283)-[P101]-(Q847034) exists.\n"
         ]
        },
        {
         "name": "stderr",
         "output_type": "stream",
         "text": [
          "Sleeping for 8.8 seconds, 2024-03-17 10:36:48\n"
         ]
        },
        {
         "name": "stdout",
         "output_type": "stream",
         "text": [
          "Added references to (Q51595283)-[P101]-(Q847034)\n",
          "----------\n"
         ]
        },
        {
         "name": "stderr",
         "output_type": "stream",
         "text": [
          "Sleeping for 8.4 seconds, 2024-03-17 10:36:59\n"
         ]
        },
        {
         "name": "stdout",
         "output_type": "stream",
         "text": [
          "Created (Q51595283)-[P101]-(Q161733)\n"
         ]
        },
        {
         "name": "stderr",
         "output_type": "stream",
         "text": [
          "Sleeping for 9.3 seconds, 2024-03-17 10:37:08\n"
         ]
        },
        {
         "name": "stdout",
         "output_type": "stream",
         "text": [
          "Added references to (Q51595283)-[P101]-(Q161733)\n"
    
        "# based on code written by GPT-4\n",
    
        "from pywikibot import Claim, WbTime, ItemPage, PropertyPage, Site\n",
    
        "from datetime import datetime\n",
        "\n",
        "def claim_to_string(claim):\n",
        "    subject_qid = claim.on_item.id\n",
        "    predicate_pid = claim.getID()\n",
    
        "    # Object QID, assuming the target is a Wikidata item\n",
        "    # Note: This simplification assumes the claim's target is an item. \n",
        "    # For other target types (e.g., quantities, strings), additional handling is needed.\n",
    
        "    if isinstance(claim.getTarget(), ItemPage):\n",
    
        "        object_qid = claim.getTarget().id\n",
        "    else:\n",
        "        # Placeholder or additional logic for non-ItemPage targets\n",
        "        object_qid = 'N/A'  # This could be expanded to handle other types of targets\n",
        "\n",
        "    return f\"({subject_qid})-[{predicate_pid}]-({object_qid})\"\n",
        "\n",
        "\n",
    
        "# Function to check if a specific time qualifier exists\n",
        "def time_qualifier_exists(claim, qualifier_pid, year_value):\n",
        "    for qualifier in claim.qualifiers.get(qualifier_pid, []):\n",
        "        qualifier_date = qualifier.getTarget()\n",
        "        if qualifier_date.year == year_value:\n",
        "            print(f'Time qualifier {qualifier_pid} with value {year_value} already exists on {claim_to_string(claim)}.')\n",
        "            return True\n",
        "    return False\n",
        "\n",
        "\n",
        "def add_time_qualifiers(repo, claim, start_time, end_time):\n",
        "    qualifiers = []\n",
        "    \n",
        "    if (start_time and end_time) and (start_time == end_time):\n",
        "        if not time_qualifier_exists(claim, 'P585', int(start_time)):\n",
        "            point_in_time_qualifier = Claim(repo, 'P585')\n",
        "            point_in_time_qualifier.setTarget(WbTime(year=int(start_time)))\n",
        "            claim.addQualifier(point_in_time_qualifier, summary='Adding point in time')\n",
        "            print(f'Added point_in_time qualifier to {claim_to_string(claim)}')\n",
        "            qualifiers.append(point_in_time_qualifier)\n",
        "        \n",
        "    else:    \n",
        "        if start_time and not time_qualifier_exists(claim, 'P580', int(start_time)):\n",
        "            start_time_qualifier = Claim(repo, 'P580')\n",
        "            start_time_qualifier.setTarget(WbTime(year=int(start_time)))\n",
        "            claim.addQualifier(start_time_qualifier, summary='Adding start time')\n",
        "            print(f'Added start_time qualifier to {claim_to_string(claim)}')\n",
        "            qualifiers.append(start_time_qualifier)\n",
        "            \n",
        "        if end_time and not time_qualifier_exists(claim, 'P582', int(end_time)):\n",
        "            end_time_qualifier = Claim(repo, 'P582')\n",
        "            end_time_qualifier.setTarget(WbTime(year=int(end_time)))\n",
        "            claim.addQualifier(end_time_qualifier, summary='Adding end time')\n",
        "            print(f'Added end_time qualifier to {claim_to_string(claim)}')\n",
        "            qualifiers.append(end_time_qualifier)\n",
        "\n",
        "    return qualifiers\n",
        "\n",
        "# Function to check if a reference with the given URL already exists on the claim\n",
        "def reference_url_exists(claim, url):\n",
        "    for source in claim.getSources():\n",
        "        if 'P4656' in source or 'P854' in source:  # Check both Wikimedia import URL and reference URL\n",
        "            for prop in source.get('P4656', []) + source.get('P854', []):\n",
        "                if prop.getTarget() == url:\n",
        "                    print(f'Source URL {url} already exists on {claim_to_string(claim)}.')\n",
        "                    return True\n",
        "    return False\n",
        "\n",
        "def qualifier_exists(claim, qualifier_property_id, target):\n",
        "    for existing_qualifier in claim.qualifiers.get(qualifier_property_id, []):\n",
        "        if existing_qualifier.getTarget() == target:\n",
        "            print(f'Qualifier {qualifier_property_id} with value {target.getID()} already exists on {claim_to_string(claim)}.')\n",
        "            return True\n",
        "    return False\n",
        "\n",
        "        \n",
        "def add_reference(repo, claim, reference_url, retrieved_at_time, qualifiers = None):\n",
        "    sources=[]\n",
        "    if reference_url and not reference_url_exists(claim, reference_url):\n",
        "        # Determine whether the URL is a Wikipedia URL or another type of URL\n",
        "        property_id = 'P4656' if 'wikipedia.org' in reference_url else 'P854'\n",
        "    \n",
        "        # Create the reference claim\n",
        "        source_claim = Claim(repo, property_id)\n",
        "        source_claim.setTarget(reference_url)\n",
        "        sources.append(source_claim)\n",
        "    \n",
        "        # Create the 'retrieved at' claim\n",
        "        retrieved_at_claim = Claim(repo, 'P813')\n",
        "        retrieved_at_target = WbTime(year=retrieved_at_time.year, month=retrieved_at_time.month, day=retrieved_at_time.day)\n",
        "        retrieved_at_claim.setTarget(retrieved_at_target)\n",
        "        sources.append(retrieved_at_claim)\n",
        "    \n",
        "    # If a qualifier has been passed for which this reference is the source, add it\n",
        "    if qualifiers:\n",
        "        for qualifier in qualifiers:\n",
        "            supports_qualifier_claim = Claim(repo, 'P10551') # \"supports qualifier\"\n",
        "            site = Site(\"wikidata\", \"wikidata\")\n",
        "            property_page = PropertyPage(site, qualifier.getID())\n",
        "            if not qualifier_exists(claim, 'P10551', property_page):\n",
        "                supports_qualifier_claim.setTarget(property_page)\n",
        "                sources.append(supports_qualifier_claim)\n",
        "        \n",
        "    # Add the references to the claim\n",
        "    if len(sources) > 0:\n",
        "        claim.addSources(sources, summary='Adding reference and retrieved at date')\n",
        "        print(f'Added references to {claim_to_string(claim)}')\n",
        "    \n",
        "    return sources\n",
        "\n",
        "\n",
        "# main function\n",
    
    Christian Boulanger's avatar
    Christian Boulanger committed
        "def update_wikidata_from_csv(file_path):\n",
    
        "    site = Site(\"wikidata\", \"wikidata\")\n",
    
        "    repo = site.data_repository()\n",
        "\n",
        "    previous_object_qid = None\n",
        "    previous_claim = None\n",
        "\n",
        "    with open(file_path, newline='', encoding='utf-8') as csvfile:\n",
        "        reader = csv.DictReader(csvfile)\n",
        "        for row in reader:\n",
    
        "            print(\"----------\")\n",
    
        "            subject_qid = row['subject-qid']\n",
        "            pid = row['pid']\n",
        "            object_qid = row['object-qid']\n",
        "            start_time = row['start_time']\n",
        "            end_time = row['end_time']\n",
        "            reference_url = row['reference_url']\n",
        "\n",
        "            # If the new subject is identical to the old object, refine the previous claim\n",
        "            if subject_qid == previous_object_qid and previous_claim:\n",
        "                claim = previous_claim\n",
        "                print(f'Refining {claim_to_string(claim)}')\n",
        "            else:\n",
    
        "                item = ItemPage(repo, subject_qid)\n",
    
        "                item.get()\n",
        "\n",
        "                # Check if the claim already exists\n",
        "                claim_exists = False\n",
        "                for claim in item.claims.get(pid, []):\n",
        "                    if claim.getTarget().id == object_qid:\n",
        "                        claim_exists = True\n",
        "                        print(f'{claim_to_string(claim)} exists.')\n",
        "                        break\n",
        "\n",
        "                if not claim_exists:\n",
        "                    claim = Claim(repo, pid)\n",
    
        "                    target = ItemPage(repo, object_qid)\n",
    
        "                    claim.setTarget(target)\n",
        "                    item.addClaim(claim)\n",
        "                    print(f'Created {claim_to_string(claim)}')\n",
        "\n",
    
        "            # start_time and end_time\n",
        "            qualifiers = add_time_qualifiers(repo, claim, start_time, end_time)\n",
    
        "            # references\n",
        "            retrieved_at_time = datetime.utcnow()\n",
        "            add_reference(repo, claim, reference_url, retrieved_at_time, qualifiers)\n",
    
        "\n",
        "            # Remember the object and claim for the next iteration\n",
        "            previous_object_qid = object_qid\n",
        "            previous_claim = claim\n",
        "\n",
    
    Christian Boulanger's avatar
    Christian Boulanger committed
        "update_wikidata_from_csv('data/Erhard Blankenburg.csv')"
    
       ],
       "metadata": {
        "collapsed": false,
        "ExecuteTime": {
    
         "end_time": "2024-03-17T09:37:18.357712600Z",
         "start_time": "2024-03-17T09:36:05.736657800Z"
    
       "cell_type": "markdown",
       "source": [
        "The result can be seen at https://www.wikidata.org/wiki/Q51595283"
       ],
    
       "id": "b7757fcf6ea66320"
    
      }
     ],
     "metadata": {
      "kernelspec": {
       "display_name": "Python 3",
       "language": "python",
       "name": "python3"
      },
      "language_info": {
       "codemirror_mode": {
        "name": "ipython",
        "version": 2
       },
       "file_extension": ".py",
       "mimetype": "text/x-python",
       "name": "python",
       "nbconvert_exporter": "python",
       "pygments_lexer": "ipython2",
       "version": "2.7.6"
      }
     },
     "nbformat": 4,
     "nbformat_minor": 5
    }