diff --git a/wikidata/data-extraction.ipynb b/wikidata/data-extraction.ipynb index b925963c678bd9de6e144a53aaccdb5becdda719..ffb9283d7f0b50cc8d3a2fae76344aa9bb87cd6b 100644 --- a/wikidata/data-extraction.ipynb +++ b/wikidata/data-extraction.ipynb @@ -4,13 +4,28 @@ "cell_type": "markdown", "source": [ "# Extract information from a Wikipedia page and upload to Wikidata\n", - "\n" + "\n", + "This notebook takes an excerpt from a Wikipedia page about a scholar, extracts biographical information from it, and uploads that information to the Wikidata entry on that person. The steps are as follows:\n", + " \n", + "1. send the excerpt to the OpenAI API (GPT-4), using a custom prompt that instructs the model to extract CSV data that can easily be arranged into statements and qualifiers\n", + "2. manually edit the data by correcting wrongly inferred information and adding missing triple data\n", + "3. upload the data using pywikibot" ], "metadata": { "collapsed": false }, "id": "6f9eb711429fb6cd" }, + { + "cell_type": "markdown", + "source": [ + "## Defining the prompt" + ], + "metadata": { + "collapsed": false + }, + "id": "f36d8c1c925d1e0e" + }, { "cell_type": "code", "execution_count": 67, @@ -57,6 +72,18 @@ }, "id": "27d869b6191fa004" }, + { + "cell_type": "code", + "execution_count": null, + "outputs": [], + "source": [ + "## Data from Wikipedia (or any other website)" + ], + "metadata": { + "collapsed": false + }, + "id": "2e13909d3eba95cb" + }, { "cell_type": "code", "execution_count": 68, @@ -81,6 +108,24 @@ }, "id": "37687f2fd256a439" }, + { + "cell_type": "markdown", + "source": [], + "metadata": { + "collapsed": false + }, + "id": "2362d9d97adbcbbf" + }, + { + "cell_type": "markdown", + "source": [ + "## Query the OpenAI API (GPT-4)\n" + ], + "metadata": { + "collapsed": false + }, + "id": "a5800fe8919e19c4" + }, { "cell_type": "code", "execution_count": 69, @@ -145,9 +190,21 @@ }, "id": "717d713e38598c57" }, + { + "cell_type": "markdown", + "source": [ + "## 
Upload data to Wikidata\n", + "\n", + "The result can be seen at https://www.wikidata.org/wiki/Q51595283" + ], + "metadata": { + "collapsed": false + }, + "id": "b110b2b14114ad05" + }, { "cell_type": "code", - "execution_count": null, + "execution_count": 73, "outputs": [ { "name": "stdout", @@ -764,6 +821,98 @@ "text": [ "Sleeping for 9.5 seconds, 2024-03-15 18:27:04\n" ] + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Added reference https://de.wikipedia.org/wiki/Erhard_Blankenburg with access date\n" + ] + }, + { + "name": "stderr", + "output_type": "stream", + "text": [ + "Sleeping for 9.0 seconds, 2024-03-15 18:27:14\n" + ] + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Created (Q51595283)-[P98]-(Q96335163)\n" + ] + }, + { + "name": "stderr", + "output_type": "stream", + "text": [ + "Sleeping for 9.5 seconds, 2024-03-15 18:27:24\n" + ] + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Added reference https://de.wikipedia.org/wiki/Erhard_Blankenburg with access date\n" + ] + }, + { + "name": "stderr", + "output_type": "stream", + "text": [ + "Sleeping for 9.1 seconds, 2024-03-15 18:27:34\n" + ] + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Created (Q65972149)-[P112]-(Q51595283)\n" + ] + }, + { + "name": "stderr", + "output_type": "stream", + "text": [ + "Sleeping for 9.5 seconds, 2024-03-15 18:27:44\n" + ] + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Added reference https://de.wikipedia.org/wiki/Erhard_Blankenburg with access date\n", + "Refining (Q65972149)-[P112]-(Q51595283)\n" + ] + }, + { + "name": "stderr", + "output_type": "stream", + "text": [ + "Sleeping for 9.5 seconds, 2024-03-15 18:27:54\n" + ] + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Added end time\n" + ] + }, + { + "name": "stderr", + "output_type": "stream", + "text": [ + "Sleeping for 9.6 seconds, 2024-03-15 18:28:04\n" + ] + }, + { + "name": 
"stdout", + "output_type": "stream", + "text": [ + "Added reference https://www.linkedin.com/in/erhard-blankenburg-63938058/ with access date\n" + ] } ], "source": [ @@ -870,8 +1019,8 @@ ], "metadata": { "collapsed": false, - "is_executing": true, "ExecuteTime": { + "end_time": "2024-03-15T17:28:14.769604100Z", "start_time": "2024-03-15T17:19:52.058378600Z" } },