# Conversion to TEI (`<bibl>`)

References: 
- https://www.tei-c.org/release/doc/tei-p5-doc/en/html/CO.html#COBI (Overview)
- https://www.tei-c.org/release/doc/tei-p5-doc/en/html/CO.html#COBIOT (Mapping to other bibliographic formats)
- https://www.tei-c.org/release/doc/tei-p5-doc/en/html/ref-bibl.html (`<bibl>`)
- https://www.tei-c.org/release/doc/tei-p5-doc/en/html/ref-biblStruct.html (`<biblStruct>`)
- https://epidoc.stoa.org/gl/latest/supp-bibliography.html (Examples)
- https://grobid.readthedocs.io/en/latest/training/Bibliographical-references/ (Grobid examples using `<bibl>`)
- http://www.jsonml.org/ (JsonML, a JSON notation for lossless conversion from and to XML)

We use `<bibl>` here instead of `<biblStruct>` because it is more loosely structured and allows a flatter data structure.
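
To illustrate the difference, here is a minimal sketch that builds the same reference both ways with the standard library. The reference data comes from the first footnote shown further below; the exact element layout is only one plausible encoding and an assumption of this example, not a schema-validated record.

```python
import xml.etree.ElementTree as ET

# Loosely structured <bibl>: all components are direct children,
# and punctuation can stay between them as tail text.
bibl = ET.Element("bibl")
author = ET.SubElement(bibl, "author")
author.text = "A. Phillips"
author.tail = ", "
title = ET.SubElement(bibl, "title", {"level": "a"})
title.text = "Citizenship and Feminist Politics"
title.tail = " ("
date = ET.SubElement(bibl, "date")
date.text = "1991"
date.tail = ")"

# Structured <biblStruct>: components are grouped by bibliographic level
# (<analytic> for the article, <monogr> for the containing work,
# <imprint> for the publication details).
bibl_struct = ET.Element("biblStruct")
analytic = ET.SubElement(bibl_struct, "analytic")
ET.SubElement(analytic, "author").text = "A. Phillips"
ET.SubElement(analytic, "title", {"level": "a"}).text = "Citizenship and Feminist Politics"
monogr = ET.SubElement(bibl_struct, "monogr")
ET.SubElement(monogr, "title", {"level": "m"}).text = "Citizenship"
imprint = ET.SubElement(monogr, "imprint")
ET.SubElement(imprint, "date").text = "1991"

flat_xml = ET.tostring(bibl, encoding="unicode")
nested_xml = ET.tostring(bibl_struct, encoding="unicode")
print(flat_xml)
print(nested_xml)
```

In the flat variant the running text of the reference can be read back by concatenating text and tail content, which is what the conversion below relies on.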

## Collect metadata on TEI `<bibl>` tags

Cache the TEI XML schema locally for offline use:

In [2]:
import xmlschema
import os
if not os.path.isdir("schema/tei"):
    schema = xmlschema.XMLSchema("https://www.tei-c.org/release/xml/tei/custom/schema/xsd/tei_all.xsd")
    schema.export(target='schema/tei', save_remote=True)

The following cell generates JSON data describing the tags used, combining the descriptions extracted from the schema with links scraped from the TEI documentation pages.

In [42]:
import xml.etree.ElementTree as ET
import pandas as pd
import requests
from bs4 import BeautifulSoup
import re
from tqdm.notebook import tqdm


# written by GPT-4
def extract_headings_and_links(tag, doc_heading, doc_base_url):
    # Extract heading numbers from the document
    heading_numbers = re.findall(r'\d+(?:\.\d+)*', doc_heading)

    # Download the HTML page
    url = f"{doc_base_url}/ref-{tag}.html"
    response = requests.get(url)
    soup = BeautifulSoup(response.content, 'html.parser')

    # Extract the links associated with each heading number
    links = {}
    for link in soup.find_all('a', class_='link_ptr'):
        number_span = link.find('span', class_='headingNumber')
        if number_span is None:
            continue  # link without a heading number
        heading_value = number_span.text.strip().rstrip('.')
        links[heading_value] = f"{doc_base_url}/{link.get('href')}"

    # Look up each heading number by key instead of pairing headings and links by position
    return {heading: links[heading] for heading in heading_numbers if heading in links}


def generate_tag_docs(xsd_path):
    namespaces = {'xs': 'http://www.w3.org/2001/XMLSchema'}
    doc_base_url = "https://www.tei-c.org/release/doc/tei-p5-doc/en/html"

    tree = ET.parse(xsd_path)
    root = tree.getroot()
    schema = xmlschema.XMLSchema(xsd_path)
    bibl_schema = schema.find("tei:bibl")
    data_list = []
    #names = [child_element.local_name for child_element in bibl_schema.iterchildren()]
    names = ['author', 'biblScope', 'citedRange', 'date', 'edition', 'editor', 'idno', 'location', 'note', 'orgName', 
             'publisher', 'pubPlace', 'ptr', 'series', 'title', 'volume', 'issue']
    for name in tqdm(names, desc="Processing TEI tags"):
        doc_node = root.find(f".//xs:element[@name='{name}']/xs:annotation/xs:documentation", namespaces=namespaces)
        if doc_node is not None:
            # the annotation has the form "description [documentation headings]"
            matches = re.search(r'^(.*)\[(.*)]$', doc_node.text or '')
            if matches is None:
                continue
            description = matches.group(1)
            doc_heading = matches.group(2)
            doc_urls = extract_headings_and_links(name, doc_heading, doc_base_url)
            data_list.append({'name': name, 'description': description, 'documentation': doc_heading, 'urls': doc_urls})

    return pd.DataFrame(data_list)


cache_file = "schema/tei/tei-tags-documentation.json"
if not os.path.isfile(cache_file):
    df = generate_tag_docs("schema/tei/tei_all.xsd")
    # orient='records' never writes the index, and older pandas versions reject index=False here
    json_str = df.to_json(orient='records', indent=4).replace(r"\/", "/")
    with open(cache_file, "w", encoding='utf-8') as f:
        f.write(json_str)
else:
    df = pd.read_json(cache_file)
df



| | name | description | documentation | urls |
|---|---|---|---|---|
| 0 | author | (author) in a bibliographic reference, contain... | 3.12.2.2. Titles, Authors, and Editors 2.2.1. ... | {'3.12.2.2': 'https://www.tei-c.org/release/do... |
| 1 | biblScope | (scope of bibliographic reference) defines the... | 3.12.2.5. Scopes and Ranges in Bibliographic C... | {'3.12.2.5': 'https://www.tei-c.org/release/do... |
| 2 | citedRange | (cited range) defines the range of cited conte... | 3.12.2.5. Scopes and Ranges in Bibliographic C... | {'3.12.2.5': 'https://www.tei-c.org/release/do... |
| 3 | date | (date) contains a date in any format. | 3.6.4. Dates and Times 2.2.4. Publication, Dis... | {'3.6.4': 'https://www.tei-c.org/release/doc/t... |
| 4 | edition | (edition) describes the particularities of one... | 2.2.2. The Edition Statement | {'2.2.2': 'https://www.tei-c.org/release/doc/t... |
| 5 | editor | contains a secondary statement of responsibili... | 3.12.2.2. Titles, Authors, and Editors | {'3.12.2.2': 'https://www.tei-c.org/release/do... |
| 6 | idno | (identifier) supplies any form of identifier u... | 14.3.1. Basic Principles 2.2.4. Publication, D... | {'14.3.1': 'https://www.tei-c.org/release/doc/... |
| 7 | location | (location) defines the location of a place as ... | 14.3.4. Places | {'14.3.4': 'https://www.tei-c.org/release/doc/... |
| 8 | note | (note) contains a note or annotation. | 3.9.1. Notes and Simple Annotation 2.2.6. The ... | {'3.9.1': 'https://www.tei-c.org/release/doc/t... |
| 9 | orgName | (organization name) contains an organizational... | 14.2.2. Organizational Names | {'14.2.2': 'https://www.tei-c.org/release/doc/... |
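
The `description` and `documentation` columns come from splitting each schema annotation of the form "description [headings]". The sketch below isolates that step using the same two regular expressions as `generate_tag_docs` and `extract_headings_and_links`. The helper name `split_documentation` and the sample string are made up for illustration and are not taken from the schema.

```python
import re

def split_documentation(doc_text):
    """Split 'description [headings]' into the description and a list of heading numbers."""
    match = re.search(r'^(.*)\[(.*)]$', doc_text.strip(), flags=re.DOTALL)
    if match is None:
        return None
    description = match.group(1).strip()
    # heading numbers look like 3.6.4 or 2.2.4; the trailing dot is not captured
    heading_numbers = re.findall(r'\d+(?:\.\d+)*', match.group(2))
    return description, heading_numbers

# Hypothetical sample, shaped like the schema annotations but not copied from the schema
sample = "(example) describes an example element. [3.6.4. Dates and Times 2.2.4. Publication]"
result = split_documentation(sample)
print(result)
```

The heading numbers are what gets matched against the links on the per-element documentation page.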


## Convert Ground Truth to TEI

This converts the AnyStyle XML data to TEI, translating from the flat schema to the nested TEI `<bibl>` structure.
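
Before the full converter, here is a heavily reduced sketch of the flat-to-nested translation. It assumes an input file with one `<sequence>` per footnote; the sample input, the `TAG_MAP` subset, and the function name `sequence_to_note` are illustrative assumptions, and the real code below additionally splits creators, starts new `<bibl>` elements, and preserves punctuation.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily reduced AnyStyle input: one <sequence> per footnote
anystyle_xml = """
<dataset>
  <sequence>
    <citation-number>5</citation-number>
    <author>S. Walby,</author>
    <title>Is Citizenship Gendered?</title>
    <date>(1994)</date>
    <volume>28</volume>
    <journal>Sociology</journal>
    <pages>379</pages>
  </sequence>
</dataset>
"""

# AnyStyle tag -> (TEI tag, attributes); a small subset of the mapping used in this notebook
TAG_MAP = {
    "author": ("author", {}),
    "title": ("title", {"level": "a"}),
    "journal": ("title", {"level": "j"}),
    "date": ("date", {}),
    "volume": ("biblScope", {"unit": "vol"}),
    "pages": ("biblScope", {"unit": "pp"}),
}

def sequence_to_note(sequence):
    """Wrap one flat sequence into <note><bibl>...</bibl></note>."""
    number = sequence.findtext("citation-number", default="")
    note = ET.Element("note", {"n": number, "type": "footnote"})
    bibl = ET.SubElement(note, "bibl")
    for child in sequence:
        if child.tag not in TAG_MAP:
            continue  # citation-number and unmapped tags are skipped in this sketch
        tei_tag, attributes = TAG_MAP[child.tag]
        # crude cleanup: strip spaces, commas and round brackets at both ends
        ET.SubElement(bibl, tei_tag, attributes).text = (child.text or "").strip(" ,()")
    return note

root = ET.fromstring(anystyle_xml)
notes = [sequence_to_note(seq) for seq in root.findall("sequence")]
print(ET.tostring(notes[0], encoding="unicode"))
```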


In [80]:
import xml.etree.ElementTree as ET
import regex as re
import glob
import os
import xml.dom.minidom
import json
import xmlschema
from nameparser import HumanName

def even_num_brackets(string: str):
    """
    Simple heuristic to determine if string contains an even number of round and square brackets,
    so that if not, trailing or leading brackets will be removed.
    """
    return ((string.endswith(")") and string.count(")") == string.count("("))
            or (string.endswith("]") and string.count("]") == string.count("[")))

def remove_punctuation(text, keep_trailing_chars="?!"):
    """This removes leading and trailing punctuation using very simple rules for German and English"""
    start, end = 0, len(text)
    # skip leading punctuation (raw strings avoid the invalid "\p" escape warning)
    while start < len(text) and re.match(r"\p{P}", text[start]):
        start += 1
    # drop trailing punctuation, except balanced closing brackets and the characters in keep_trailing_chars
    while end > start and re.match(r"\p{P}", text[end - 1]) and not even_num_brackets(text[start:end]) and text[end - 1] not in keep_trailing_chars:
        end -= 1
    return text[start:end].strip()

def remove_punctuation2(text):
    """same as remove_punctuation, but keep trailing periods."""
    return remove_punctuation(text, "?!.")

def clean_editor(text): 
    text = re.sub(r'^in(:| )', '', remove_punctuation(text), flags=re.IGNORECASE)
    text = re.sub(r'\(?(hrsg\. v\.|hg\. v\.|hrsg\.|ed\.|eds\.)\)?', '', text, flags=re.IGNORECASE)
    return text.strip()

def clean_container(text):
    return remove_punctuation(re.sub(r'^(in|aus|from)(:| )', '', text.strip(), flags=re.IGNORECASE))

def clean_pages(text):
    return remove_punctuation(re.sub(r'^(S\.|p\.|pp\.|ff?\.|seqq?\.)', '', text.strip(), flags=re.IGNORECASE))

def extract_year(text):
    m = re.search( r'[12][0-9]{3}', text)
    return m.group(0) if m else None

def find_string(string, container):
    start = container.find(string)
    if start > -1:
        end = start + len(string)
        return start, end
    raise ValueError(f"Could not find '{string}' in '{container}'")

def add_node(parent, tag, text="", attributes=None, clean_func=None, preserve=False):
    """
    Adds a child node to the parent, optionally adding text and attributes.
    If a clean_func is passed, the text is set after applying the function to it.
    If the `preserve` flag is True, the removed preceding or trailing text is preserved in the xml,
    outside of the node content
    """
    node = ET.SubElement(parent, tag, (attributes or {}))
    text = text or ""
    if clean_func is None:
        node.text = text
        return node
    # clean_func may return None (e.g. extract_year without a match)
    cleaned_text = clean_func(text) or ""
    # set the cleaned text regardless of the `preserve` flag
    node.text = cleaned_text
    if preserve:
        start, end = find_string(cleaned_text, text)
        prefix, suffix = text[:start], text[end:]
        if prefix != "" and len(parent) > 1:
            # append the removed prefix to the tail of the previous sibling
            # (a prefix on the first child has no previous sibling and is dropped)
            prev_sibling = parent[-2]
            prev_tail = (prev_sibling.tail or '')
            prev_sibling.tail = f'{prev_tail} {prefix}'.strip()
        if suffix != "":
            node.tail = suffix
    return node

def create_tei_root():
    return ET.Element('TEI', {
        'xmlns': "http://www.tei-c.org/ns/1.0"
    })

def create_tei_header(tei_root, title):
    tei_header = add_node(tei_root, 'teiHeader')
    file_desc = add_node(tei_header, 'fileDesc')
    title_stmt = add_node(file_desc, 'titleStmt')
    add_node(title_stmt, 'title', title)
    publication_stmt = add_node(file_desc, 'publicationStmt')
    add_node(publication_stmt, 'publisher', 'mpilhlt')
    source_desc = add_node(file_desc, 'sourceDesc')
    add_node(source_desc, 'p', title)
    return tei_header

def create_body(text_root):
    body = ET.SubElement(text_root, 'body')
    add_node(body, 'p', 'The article text is not part of this document')
    return body

def prettify(xml_string, indentation="  "):
    """Return a pretty-printed XML string"""
    return xml.dom.minidom.parseString(xml_string).toprettyxml(indent=indentation)

def split_creators(text: str, bibl, tag, clean_func, preserve):
    sep_regex = r'[;&/]| and | und '
    creators = re.split(sep_regex, text)
    separators = re.findall(sep_regex, text)
    for creator in creators:
        # <author>/<editor>
        creator_node = add_node(bibl, tag, creator, clean_func=clean_func, preserve=preserve)
        # <persName>
        name = HumanName(creator_node.text)
        creator_node.text = ''
        pers_name = add_node(creator_node, 'persName')
        # maps each name part back to its role ('first', 'middle', 'last', ...)
        inv_map = {v: k for k, v in name.as_dict(False).items()}
        if len(name) == 1:
            add_node(pers_name, 'surname', list(name)[0])
        else:
            for elem in list(name):
                match inv_map[elem]:
                    case 'last':
                        # <surname>
                        add_node(pers_name, 'surname', elem)
                    case 'first' | 'middle':
                        # <forename>
                        add_node(pers_name, 'forename', elem)
        # consume one separator per creator (also for single-part names),
        # appending it to any suffix that add_node already stored in the tail
        if separators:
            separator = separators.pop(0).strip()
            creator_node.tail = f"{creator_node.tail or ''} {separator}".strip()
        
def anystyle_to_tei(input_xml_path, id, preserve=False):
    anystyle_root = ET.parse(input_xml_path).getroot()
    tei_root = create_tei_root()
    create_tei_header(tei_root, title=id)
    text_root = add_node(tei_root, 'text')
    body = create_body(text_root)
    # <listBibl> element for <bibl> elements that are not in footnotes, such as a bibliography
    listBibl = add_node(body, 'listBibl')
    # iterate over all sequences (=footnotes) and translate into TEI equivalents
    for sequence in anystyle_root.findall('sequence'):
        # if the sequence contains a citation-number, create a new <note> to add <bibl> elements to 
        if (cn:= sequence.findall('citation-number')):
            footnote_number = cn[0].text
            attributes = {
                'n': footnote_number,
                'type': 'footnote',
                'place': 'bottom'
            }
            node = add_node(text_root, 'note', attributes=attributes, clean_func=remove_punctuation, preserve=preserve)
        else:
            # otherwise add to <listBibl> element
            node = listBibl
        bibl = None
        for child in sequence:
            tag = child.tag
            text = child.text
            if tag == "citation-number": continue # this has already been taken care of
            # note: an Element without children is falsy, so test the result of find() against None
            if (bibl is None # if we do not have a bibl element yet
                or (bibl.find(tag) is not None and tag != "note") # or tag already exists in the current element
                or tag in ['signal', 'legal-ref'] # or tag belongs to a group that signals a separate reference
                or (tag in ["author", "editor", "authority"] and bibl.find('date') is not None)): # or specific tags follow a date field
                # then create a new bibl element
                bibl = ET.SubElement(node, 'bibl')
            match tag:
                case 'author':
                    split_creators(text, bibl, 'author', clean_func=remove_punctuation, preserve=preserve)
                case 'authority':
                    split_creators(text, bibl, 'publisher', clean_func=remove_punctuation, preserve=preserve)                
                case 'backref':
                    add_node(bibl, 'ref', text, clean_func=remove_punctuation2, preserve=preserve)
                case 'container-title':
                    add_node(bibl, 'title', text, {'level': 'm'}, clean_func= clean_container, preserve=preserve)
                case 'collection-title':
                    add_node(bibl, 'title', text, {'level': 's'}, clean_func= clean_container, preserve=preserve)
                case 'date':
                    add_node(bibl, 'date', text, clean_func= extract_year, preserve=preserve)
                case 'edition':
                    add_node(bibl, 'edition', text, clean_func=remove_punctuation2, preserve=preserve)
                case 'editor':
                    split_creators(text, bibl, 'editor', clean_func=clean_editor, preserve=preserve)
                case 'location':
                    add_node(bibl, 'pubPlace', text, clean_func=remove_punctuation, preserve=preserve)
                case 'note':
                    add_node(bibl, 'note', text, clean_func=remove_punctuation, preserve=preserve)                    
                case 'journal':
                    add_node(bibl, 'title', text, {'level': 'j'}, clean_func= clean_container, preserve=preserve)
                case 'legal-ref':
                    add_node(bibl, 'ref', text, {'type': 'legal'}, clean_func = remove_punctuation, preserve=preserve)
                case 'pages':
                    if len(bibl) > 0 and bibl[-1].tag == "ref":
                        add_node(bibl, 'citedRange', text, {'unit': 'pp'}, clean_func= clean_pages, preserve=preserve)
                    else:
                        add_node(bibl, 'biblScope', text, {'unit': 'pp'}, clean_func= clean_pages, preserve=preserve)
                case 'signal':
                    add_node(bibl, 'note', text, {'type': 'signal'}, clean_func=remove_punctuation, preserve=preserve)
                case 'title':
                    add_node(bibl, 'title', text, {'level': 'a'}, clean_func=remove_punctuation2, preserve=preserve)
                case 'url':
                    add_node(bibl, 'ptr', text, {'type':'web'}, clean_func=remove_punctuation, preserve=preserve)    
                case 'volume':
                    add_node(bibl, 'biblScope', text, {'unit': 'vol'}, clean_func = remove_punctuation, preserve=preserve)
            if len(bibl) == 0:
                # drop the empty <bibl> and start a fresh one for the next child
                node.remove(bibl)
                bibl = None
    if len(listBibl) == 0:
        body.remove(listBibl)
    return ET.tostring(tei_root, 'unicode')

def tei_to_json(tei_xml, schema):
    dict_obj = xmlschema.to_dict(tei_xml, schema=schema, converter=xmlschema.JsonMLConverter)
    return json.dumps(dict_obj, default=str)

# main

# XML->JSON-Conversion doesn't provide anything useful 
# tei_xsd_path = "schema/tei/tei_all.xsd"
# if 'schema' not in locals():
#     print("Parsing schema file, please wait...")
#     schema = xmlschema.XMLSchema(tei_xsd_path)

os.makedirs('tei', exist_ok=True)  # make sure the output directory exists
for input_path in glob.glob('anystyle/*.xml'):
    base_name = os.path.basename(input_path)
    id = os.path.splitext(base_name)[0]
    print(f'Converting {base_name} into TEI-XML ...')
    output_xml = anystyle_to_tei(input_path, id, preserve=True)
    # output_json = tei_to_json(output_xml, schema)
    with open(f'tei/{id}.xml', 'w', encoding='utf-8') as f:
        f.write(prettify(output_xml))
    # with open(f'tei/{id}.json', 'w', encoding='utf-8') as f:
    #     f.write(output_json)
    


Converting 10.1111_1467-6478.00057.xml into TEI-XML ...
Converting 10.1111_1467-6478.00080.xml into TEI-XML ...
Converting 10.1515_zfrs-1980-0103.xml into TEI-XML ...
Converting 10.1515_zfrs-1980-0104.xml into TEI-XML ...
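
The `preserve=True` option used above depends on ElementTree's distinction between an element's `text` (inside the tag) and its `tail` (the text that follows the closing tag). The following sketch shows the idea in isolation. It is a simplified variant and not the notebook's `add_node`: the helper `add_preserving` is hypothetical, takes the cleaned string as an argument, and also keeps a prefix on the first child by writing it to the parent's text.

```python
import xml.etree.ElementTree as ET

def add_preserving(parent, tag, raw, cleaned):
    """Append <tag>cleaned</tag> and keep whatever surrounds `cleaned` in `raw` outside the element."""
    start = raw.find(cleaned)
    if start < 0:
        raise ValueError(f"{cleaned!r} not found in {raw!r}")
    prefix, suffix = raw[:start], raw[start + len(cleaned):]
    node = ET.SubElement(parent, tag)
    node.text = cleaned
    node.tail = suffix
    if prefix:
        if len(parent) > 1:
            # text before the element belongs to the tail of the previous sibling
            previous = parent[-2]
            previous.tail = (previous.tail or "") + prefix
        else:
            # first child: text before it belongs to the parent's text
            parent.text = (parent.text or "") + prefix
    return node

bibl = ET.Element("bibl")
add_preserving(bibl, "author", "S. Walby, ", "S. Walby")
add_preserving(bibl, "title", "‘Is Citizenship Gendered?’ ", "Is Citizenship Gendered?")
add_preserving(bibl, "date", "(1994)", "1994")

serialized = ET.tostring(bibl, encoding="unicode")
restored = "".join(bibl.itertext())
print(serialized)
print(restored)
```

Because every removed character ends up in some `text` or `tail`, concatenating all text nodes gives back the input string, which is what makes it possible to recreate the ground truth from the TEI output.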


## Recreate GT from TEI

This recreates the ground-truth input strings (the original footnote text) from the `<note>` elements of the generated TEI files.

In [73]:
from lxml import etree
import glob
import os
import json
import regex as re

def tei_to_ground_truth_input(tei_xml_doc):
    """
    Extract the original footnote strings from the <note> elements in a given TEI document and return a list of strings
    """
    root = etree.fromstring(tei_xml_doc)
    ground_truth_list = []
    ns = {"tei": "http://www.tei-c.org/ns/1.0"}
    # iterate over the <note type="footnote"> elements
    for note in root.findall('.//tei:note[@type="footnote"]', ns):
        footnote_elements = [note.attrib['n']]
        # iterate over the <bibl> elements
        for bibl in note.findall('tei:bibl', ns):
            text = etree.tostring(bibl, method="text", encoding='utf-8').decode()
            # collapse line breaks and runs of whitespace into single spaces
            clean_text = re.sub(r'\s+', ' ', text)
            clean_text = re.sub(r' ,', ',', clean_text)
            clean_text = re.sub(r' \.', '.', clean_text)
            clean_text = re.sub(r'\( ', '(', clean_text)
            clean_text = re.sub(r' \)', ')', clean_text)
            footnote_elements.append(clean_text.strip())
        ground_truth_list.append(" ".join(footnote_elements))
    return ground_truth_list

for input_path in glob.glob('tei/*.xml'):
    base_name = os.path.basename(input_path)
    id = os.path.splitext(base_name)[0]
    print(f'Extracting GT input data from {base_name}  ...')
    with open(input_path, 'r', encoding='utf-8') as f:
        result = tei_to_ground_truth_input(f.read())
    print(json.dumps(result))

Extracting GT input data from 10.1111_1467-6478.00057.xml  ...
["1 A. Phillips, \u2018 Citizenship and Feminist Politics \u2019 in Citizenship, ed. G. Andrews (1991) 77.", "2 T. Brennan and C. Pateman, \u2018\u201c Mere Auxiliaries to the Commonwealth\u201d: Women and the Origins of Liberalism \u2019 (1979) 27 Political Studies 183.", "3 M. Sawer and M. Simms, A Woman\u2019s Place: Women and Politics in Australia (2nd ed., 1993).", "4 I have explored the gendered nature of citizenship at greater length in two complementary papers : \u2018 Embodying the Citizen \u2019 in Public and Private: Feminist Legal Debates, ed. M. Thornton (1995) and \u2018 Historicising Citizenship: Remembering Broken Promises \u2019 (1996) 20 Melbourne University Law Rev. 1072.", "5 S. Walby, \u2018 Is Citizenship Gendered? \u2019 (1994) 28 Sociology 379", "6 I. Kant, \u2018 Metaphysical First Principles of the Doctrine of Right \u2019 in The Metaphysics of Morals (trans. M. Gregor, 1991) 125\u20136, s. 146.", 
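
The extraction depends on two details: the TEI namespace has to be passed to `findall`, and the whitespace introduced by pretty-printing has to be collapsed again. Below is a minimal sketch of both, using the standard library's ElementTree in place of lxml. The TEI fragment is a hypothetical, shortened example and `normalize` is an illustrative helper, not a function from the notebook.

```python
import re
import xml.etree.ElementTree as ET

TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Hypothetical pretty-printed TEI fragment with indentation inside <bibl>
tei_xml = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <text>
    <note n="3" type="footnote">
      <bibl>
        <author>M. Sawer</author>
        ,
        <title level="a">A Woman’s Place</title>
        (
        <date>1993</date>
        ).
      </bibl>
    </note>
  </text>
</TEI>"""

def normalize(text):
    """Collapse whitespace runs and remove the spaces that indentation puts around punctuation."""
    text = re.sub(r"\s+", " ", text).strip()
    text = re.sub(r" ([,.)])", r"\1", text)  # no space before , . )
    return re.sub(r"\( ", "(", text)         # no space after (

root = ET.fromstring(tei_xml)
footnotes = []
for note in root.findall('.//tei:note[@type="footnote"]', TEI_NS):
    parts = [note.get("n")]
    for bibl in note.findall("tei:bibl", TEI_NS):
        parts.append(normalize("".join(bibl.itertext())))
    footnotes.append(" ".join(parts))
print(footnotes)
```

Without the namespace map, `findall('.//note')` would return nothing, because every element in the document lives in the TEI namespace.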

## Extract bibliographic data from TEI files using XSLT

This uses the XSLT stylesheets from https://github.com/OpenArabicPE/convert_tei-to-bibliographic-data.

In [33]:
from lxml import etree
import glob
import os
from urllib.request import urlopen
import requests

# keep downloaded stylesheets across cell re-runs
if 'cache' not in locals():
    cache = {}

class HttpsResolver(etree.Resolver):
    def resolve(self, url, id, context):
        if url in cache:
            xml_str = cache[url]
        else:
            r = requests.get(url)
            r.raise_for_status()  # fail with a descriptive HTTPError instead of a bare assert
            xml_str = cache[url] = r.content
        return self.resolve_string(xml_str, context, base_url=url)

xml_parser = etree.XMLParser(no_network=False)
xml_parser.resolvers.add(HttpsResolver())

def apply_xslt(xslt_path, xml_input_path, xml_output_path):
    try:
        if xslt_path.startswith('http'):
            with urlopen(xslt_path) as f:
                # pass base_url so that relative xsl:import/xsl:include hrefs can be resolved
                xslt_doc = etree.parse(f, parser=xml_parser, base_url=xslt_path)
        else:
            xslt_doc = etree.parse(xslt_path)
        xml_doc = etree.parse(xml_input_path)
        transformer = etree.XSLT(xslt_doc)
        new_xml = transformer(xml_doc)
        with open(xml_output_path, 'w', encoding='utf-8') as f:
            # the XSLT result is a tree object and has to be serialized before writing
            f.write(str(new_xml))
    except etree.XSLTParseError as e:
        print(f"Error parsing XSLT file at {xslt_path}: {e}")

xslt_url = 'https://openarabicpe.github.io/convert_tei-to-bibliographic-data/xslt/convert_tei-to-biblstruct_bibl.xsl'

os.makedirs('tei-biblstruct', exist_ok=True)  # make sure the output directory exists
for input_path in glob.glob('tei/*.xml'):
    print(f'Converting {input_path}')
    base_name = os.path.basename(input_path)
    output_path = f'tei-biblstruct/{base_name}'
    apply_xslt(xslt_url, input_path, output_path )


Converting tei/10.1515_zfrs-1980-0103.xml
Error parsing XSLT file at https://openarabicpe.github.io/convert_tei-to-bibliographic-data/xslt/convert_tei-to-biblstruct_bibl.xsl: Failed to compile predicate
Converting tei/10.1515_zfrs-1980-0104.xml
Error parsing XSLT file at https://openarabicpe.github.io/convert_tei-to-bibliographic-data/xslt/convert_tei-to-biblstruct_bibl.xsl: Failed to compile predicate
Converting tei/10.1111_1467-6478.00080.xml
Error parsing XSLT file at https://openarabicpe.github.io/convert_tei-to-bibliographic-data/xslt/convert_tei-to-biblstruct_bibl.xsl: Failed to compile predicate
Converting tei/10.1111_1467-6478.00057.xml
Error parsing XSLT file at https://openarabicpe.github.io/convert_tei-to-bibliographic-data/xslt/convert_tei-to-biblstruct_bibl.xsl: Failed to compile predicate
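
One plausible cause of the "Failed to compile predicate" error is an XSLT version mismatch: lxml builds on libxslt, which implements XSLT 1.0 only, so stylesheets that use XSLT 2.0 or 3.0 features cannot be compiled with it. Whether that applies to the stylesheet above has not been verified here. The sketch below shows a quick check of the declared version with the standard library. The function names and the inline stylesheet are hypothetical, and a declared version above 1.0 is a hint, not proof, that a different processor is needed.

```python
import xml.etree.ElementTree as ET

XSL_NS = "http://www.w3.org/1999/XSL/Transform"

def declared_xslt_version(stylesheet_source):
    """Return the `version` attribute of the stylesheet's root element, or None if it is not XSLT."""
    root = ET.fromstring(stylesheet_source)
    if root.tag not in (f"{{{XSL_NS}}}stylesheet", f"{{{XSL_NS}}}transform"):
        return None
    return root.get("version")

def needs_xslt2_processor(version):
    """Heuristic: True if the declared version is above 1.0."""
    try:
        return float(version) > 1.0
    except (TypeError, ValueError):
        return False

# Hypothetical stylesheet, not the OpenArabicPE one
sample_xsl = """<xsl:stylesheet version="3.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:template match="/"/>
</xsl:stylesheet>"""

version = declared_xslt_version(sample_xsl)
print(version, needs_xslt2_processor(version))
```

If the check indicates a later XSLT version, an XSLT 2.0/3.0 processor such as Saxon, tried in the next cell, is the usual alternative.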


In [29]:
!saxon -s:"tei/10.1111_1467-6478.00057.xml" -xsl:"https://openarabicpe.github.io/convert_tei-to-bibliographic-data/xslt/convert_tei-to-biblstruct_bibl.xsl"

Error on line 6 column 88 of functions.xsl:
  XTSE0165  I/O error reported by XML parser processing
  https://openarabicpe.github.io/../xslt-calendar-conversion/functions/date-functions.xsl.
  Caused by java.io.IOException: Server returned HTTP response code: 400 for URL:
  https://openarabicpe.github.io/../xslt-calendar-conversion/functions/date-functions.xsl
I/O error reported by XML parser processing https://openarabicpe.github.io/../xslt-calendar-conversion/functions/date-functions.xsl
