From ab263aa0b54885b04813b6026bf5cd8680a26df2 Mon Sep 17 00:00:00 2001
From: Christian Boulanger <boulanger@lhlt.mpg.de>
Date: Mon, 30 Sep 2024 11:15:58 +0200
Subject: [PATCH] Update documentation

---
 convert-anystyle-data/anystyle-to-tei.ipynb | 25 +++++++++++++--------
 1 file changed, 16 insertions(+), 9 deletions(-)

diff --git a/convert-anystyle-data/anystyle-to-tei.ipynb b/convert-anystyle-data/anystyle-to-tei.ipynb
index bc825fc..5297f32 100644
--- a/convert-anystyle-data/anystyle-to-tei.ipynb
+++ b/convert-anystyle-data/anystyle-to-tei.ipynb
@@ -3,7 +3,7 @@
   {
    "cell_type": "markdown",
    "source": [
-    "# Convert AnyStyle GS to TEI (`<bibl>`/`<biblStruct>`) GS \n",
+    "# Convert AnyStyle to TEI-bibl data \n",
     "\n",
     "References: \n",
     "- https://www.tei-c.org/release/doc/tei-p5-doc/en/html/CO.html#COBI (Overview)\n",
@@ -14,19 +14,28 @@
     "- https://grobid.readthedocs.io/en/latest/training/Bibliographical-references/ (Grobid examples using `<bibl>`)\n",
     "\n",
     "\n",
-    "We use `<bibl>` here instead of `<biblStruct>` because it is more loosely-structured and allows for a more flat datastructure. \n",
+    "We use `<bibl>` here for marking up the citation data. These annotations can then be further processed:\n",
+    "- [to Gold Standard based on `<biblStruct>`](tei-to-biblstruct-gs.ipynb)\n",
+    "- [to bibliographic data formats](tei-to-bibformats.ipynb)\n",
+    "- [to the prodigy annotation format](tei-to-prodigy.ipynb)\n",
     "\n",
-    "Todo:\n",
-    "- BiblStruct mit der übergeordneten <listBibl n=\"fußnote\" src=\"Input\">\n",
-    "\n",
-    "\n",
-    "## Collect metadata on TEI `<bibl>` tags"
+    "Code was written with assistance by ChatGPT 4. "
    ],
    "metadata": {
     "collapsed": false
    },
    "id": "4c77ab592c98dfd"
   },
+  {
+   "cell_type": "markdown",
+   "source": [
+    "## Collect metadata on TEI `<bibl>` tags"
+   ],
+   "metadata": {
+    "collapsed": false
+   },
+   "id": "dd3645db958007fe"
+  },
   {
    "cell_type": "markdown",
    "source": [
@@ -79,8 +88,6 @@
     "import re\n",
     "from tqdm.notebook import tqdm\n",
     "\n",
-    "\n",
-    "# written by GPT-4\n",
     "def extract_headings_and_links(tag, doc_heading, doc_base_url):\n",
     "    # Extract heading numbers from the document\n",
     "    heading_numbers = re.findall(r'\\d+(?:\\.\\d+)*', doc_heading)\n",
-- 
GitLab