# Finetuning experiment: Extract structured data for German law journal editors from website text
Based on https://github.com/ml-explore/mlx-examples/tree/main/lora
Hardware: Mac mini 2023 (M2, 16 GB RAM)
%% Cell type:markdown id:1135fbc8a6ced279 tags:
## Preparation
### Download website data
This downloads content only if the list of journals has changed or previously downloaded files have been deleted. To overwrite existing files, use `overwrite=True`.
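The skip-unless-overwritten behaviour described above can be sketched as follows. This is a minimal illustration, not the notebook's actual download code: the function names `fetch_url` and `download_pages`, the `journals` mapping of journal name to URL, and the one-file-per-journal layout are all assumptions made for the example.

```python
from pathlib import Path
from typing import Callable, Dict, List
from urllib.request import urlopen


def fetch_url(url: str) -> str:
    """Fetch a page and return its body as text (hypothetical helper)."""
    with urlopen(url, timeout=30) as response:
        return response.read().decode("utf-8", errors="replace")


def download_pages(
    journals: Dict[str, str],
    target_dir: Path,
    overwrite: bool = False,
    fetch: Callable[[str], str] = fetch_url,
) -> List[str]:
    """Download one file per journal; return the names actually fetched.

    Assumption: `journals` maps a journal name to its website URL.
    """
    target_dir.mkdir(parents=True, exist_ok=True)
    downloaded = []
    for name, url in journals.items():
        path = target_dir / f"{name}.txt"
        # Existing files are kept unless the caller asks to overwrite them.
        if path.exists() and not overwrite:
            continue
        path.write_text(fetch(url), encoding="utf-8")
        downloaded.append(name)
    return downloaded
```

With this shape, a second call with the same journal list is a no-op, a deleted file is fetched again, and `overwrite=True` refreshes everything.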
system_message="""
You are a text processing agent. As instructed below, extract information from the provided content in a structured format without discussing reasoning or providing commentary. Only use source text given as input for data extraction unless specifically asked for inference.
"""
instruction="""
Analyze content from a German law journal's website. Your task is to identify members of the editorial board (terms to look for: 'Herausgeber', 'Redakteur', 'Schriftleitung') and the advisory board ('Beirat'). For each identified member, extract and organize their information into the following categories: lastname, firstname, title (including academic titles like 'Dr.' or 'Prof. Dr.' and suffixes such as 'LL.M.'), position (their job title, if provided), affiliation, and role. For 'role', infer the role within the journal from the context (options 'Herausgeber', 'Redaktion', 'Schriftleitung', 'Beirat', or an empty string if the role is unknown).
- Format the output as a YAML list of dictionaries.
- Exclude any dictionary entries for which information is not available or relevant fields are empty.
- Ensure the YAML output is strictly valid. It must be a list of dictionaries.
"""
example="""
Here is an example:
```yaml
- lastname: Mustermann
  firstname: Martina
  title: Dr.
  position: Vorsitzender Richterin
  affiliation: Oberlandesgericht Buxtehude
  role: Herausgeber
```
"""
epilog="""
Adhere to these guidelines to efficiently and accurately process the following content:"
"""
test_data = """
Herausgeber:
Prof. Dr. Stefan Knesebeck, Universität Wuppertal
Prof. Dr. Dr. h.c. Fritz M. Müller LL.M.(Yale), Universität Wanne-Eickel
RA Prof. Dr. Vera Valentin, Hochschule für Recht und Sport Edingen
Prof. Dr. Dr. h.c. Rita Rosenbaum, Universität Tupfingen
Dr. Ingo Gonzalo de Sanchez, Vorsitzender Richter am Oberlandesgericht Rostock
"""
You are a text processing agent. As instructed below, extract information from the provided content in a structured format without discussing reasoning or providing commentary. Only use source text given as input for data extraction unless specifically asked for inference.
# user
Analyze content from a German law journal's website. Your task is to identify members of the editorial board (terms to look for: 'Herausgeber', 'Redakteur', 'Schriftleitung') and the advisory board ('Beirat'). For each identified member, extract and organize their information into the following categories: lastname, firstname, title (including academic titles like 'Dr.' or 'Prof. Dr.' and suffixes such as 'LL.M.'), position (their job title, if provided), affiliation, and role. For 'role', infer the role within the journal from the context (options 'Herausgeber', 'Redaktion', 'Schriftleitung', 'Beirat', or an empty string if the role is unknown).
- Format the output as a YAML list of dictionaries.
- Exclude any dictionary entries for which information is not available or relevant fields are empty.
- Ensure the YAML output is strictly valid. It must be a list of dictionaries.
Adhere to these guidelines to efficiently and accurately process the following content:"
Testing
Test loss 0.800, Test ppl 2.226.
%% Cell type:markdown id:3bab8168bd116d38 tags:
Result after 600 iterations: test loss 0.800, test perplexity 2.226
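The two numbers are consistent with each other if the perplexity is the exponential of the mean test loss. The short check below assumes the loss is the average per-token cross-entropy in nats, which is how the mlx-examples LoRA script appears to report it. The helper name `perplexity` is introduced here for illustration only.

```python
import math


def perplexity(mean_loss: float) -> float:
    """Convert a mean cross-entropy loss (in nats) to perplexity."""
    return math.exp(mean_loss)


test_loss = 0.800  # value reported after 600 iterations
print(f"Test ppl {perplexity(test_loss):.3f}")  # prints "Test ppl 2.226"
```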
%% Cell type:markdown id:c7e42a5574068ba9 tags:
### Manual test prompt
%% Cell type:code id:8d316a1e7570f1d4 tags:
``` python
prompt=f"""
### SYSTEM
{system_message}
### USER
{instruction}
{example}
### CONTENT
{test_data}
### END OF CONTENT
""".strip()
```
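The instruction in the prompt asks for a list of dictionaries, a fixed set of role values, and no empty fields. A response can be checked against those rules after it has been parsed. The sketch below is not part of the original notebook: `clean_entries`, `ALLOWED_ROLES`, and `FIELDS` are hypothetical names, the input is assumed to be already parsed (for example with PyYAML's `yaml.safe_load`), and dropping entries without a `lastname` is an assumption about what counts as a usable record. The sample data is hand-written, not actual model output.

```python
# Hypothetical post-processing helper for a parsed model response.
ALLOWED_ROLES = {"Herausgeber", "Redaktion", "Schriftleitung", "Beirat", ""}
FIELDS = ("lastname", "firstname", "title", "position", "affiliation", "role")


def clean_entries(parsed):
    """Validate the parsed response and drop empty fields and entries."""
    if not isinstance(parsed, list):
        raise ValueError("expected a list of dictionaries")
    cleaned = []
    for entry in parsed:
        if not isinstance(entry, dict):
            raise ValueError(f"expected a dictionary, got {type(entry).__name__}")
        record = {}
        for key in FIELDS:
            value = entry.get(key)
            text = "" if value is None else str(value).strip()
            if text:  # keep only fields that carry information
                record[key] = text
        if record.get("role", "") not in ALLOWED_ROLES:
            raise ValueError(f"unexpected role: {record['role']!r}")
        # Assumption: an entry without a last name is not usable.
        if "lastname" in record:
            cleaned.append(record)
    return cleaned


# Hand-written sample, not actual model output.
sample = [
    {"lastname": "Knesebeck", "firstname": "Stefan", "title": "Prof. Dr.",
     "position": "", "affiliation": "Universität Wuppertal", "role": "Herausgeber"},
    {"lastname": "", "role": "Beirat"},
]
print(clean_entries(sample))
```

In this sample the empty `position` field is dropped from the first entry, and the second entry is discarded because it has no last name.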
%% Cell type:code id:3e7a823a9f4a35d9 tags:
``` python
print(prompt)
```
%% Output
### SYSTEM
You are a text processing agent. As instructed below, extract information from the provided content in a structured format without discussing reasoning or providing commentary. Only use source text given as input for data extraction unless specifically asked for inference.
### USER
Analyze content from a German law journal's website. Your task is to identify members of the editorial board (terms to look for: 'Herausgeber', 'Redakteur', 'Schriftleitung') and the advisory board ('Beirat'). For each identified member, extract and organize their information into the following categories: lastname, firstname, title (including academic titles like 'Dr.' or 'Prof. Dr.' and suffixes such as 'LL.M.'), position (their job title, if provided), affiliation, and role. For 'role', infer the role within the journal from the context (options 'Herausgeber', 'Redaktion', 'Schriftleitung', 'Beirat', or an empty string if the role is unknown).
- Format the output as a YAML list of dictionaries.
- Exclude any dictionary entries for which information is not available or relevant fields are empty.
- Ensure the YAML output is strictly valid. It must be a list of dictionaries.
Here is an example:
```yaml
- lastname: Mustermann
  firstname: Martina
  title: Dr.
  position: Vorsitzender Richterin
  affiliation: Oberlandesgericht Buxtehude
  role: Herausgeber
```
### CONTENT
Herausgeber:
Prof. Dr. Stefan Knesebeck, Universität Wuppertal
Prof. Dr. Dr. h.c. Fritz M. Müller LL.M.(Yale), Universität Wanne-Eickel
RA Prof. Dr. Vera Valentin, Hochschule für Recht und Sport Edingen
Prof. Dr. Dr. h.c. Rita Rosenbaum, Universität Tupfingen
Dr. Ingo Gonzalo de Sanchez, Vorsitzender Richter am Oberlandesgericht Rostock