# Finetuning experiment: Extract structured data for German law journal editors from website text
based on https://github.com/ml-explore/mlx-examples/tree/main/lora
Hardware: Mac mini 2023 (M2, 16 GB RAM)
%% Cell type:markdown id:1135fbc8a6ced279 tags:
## Preparation
### Download website data
New content is downloaded only if the list of journals has changed or previously downloaded files have been deleted. To overwrite existing files, use `overwrite=True`.
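
The skip-unless-missing behaviour described above could look roughly like the following sketch. Only the `overwrite` flag comes from this notebook; the function names `download_pages` and `fetch_url`, the `journals` mapping of name to URL, and the `<name>.txt` file layout are assumptions made for illustration, not the notebook's actual helper.

```python
from pathlib import Path
from typing import Callable, Dict, List
from urllib.request import urlopen


def fetch_url(url: str) -> str:
    """Fetch a page and decode it as UTF-8 (undecodable bytes are replaced)."""
    with urlopen(url, timeout=30) as response:
        return response.read().decode("utf-8", errors="replace")


def download_pages(
    journals: Dict[str, str],
    target_dir: Path,
    overwrite: bool = False,
    fetch: Callable[[str], str] = fetch_url,
) -> List[Path]:
    """Save one text file per journal and return the files written in this call.

    Hypothetical helper: names and file layout are assumptions, not the
    notebook's own implementation.
    """
    target_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for name, url in journals.items():
        path = target_dir / f"{name}.txt"
        if path.exists() and not overwrite:
            continue  # already downloaded: skip unless overwrite is requested
        path.write_text(fetch(url), encoding="utf-8")
        written.append(path)
    return written
```

With this shape, adding a journal to the mapping or deleting a saved file triggers a download only for that entry, while `overwrite=True` refreshes everything.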
You are a text processing agent. As instructed below, extract information from the provided content in a structured format without discussing reasoning or providing commentary. Only use source text given as input for data extraction unless specifically asked for inference.
"""
instruction="""
Analyze content from a German law journal's website. Your task is to identify members of the editorial board (terms to look for: 'Herausgeber', 'Redakteur', 'Schriftleitung') and the advisory board ('Beirat'). For each identified member, extract and organize their information into the following categories: lastname, firstname, title (including academic titles like 'Dr.' or 'Prof. Dr.' and suffixes such as 'LL.M.'), position (their job title, if provided), affiliation, and role. For 'role', infer the role within the journal from the context (options 'Herausgeber', 'Redaktion', 'Schriftleitung', 'Beirat', or an empty string if the role is unknown).
- Format the output as a YAML list of dictionaries.
- Exclude any dictionary entries for which information is not available or relevant fields are empty.
- Ensure the YAML output is strictly valid. It must be a list of dictionaries.
"""
example="""
Here is an example:
```yaml
- lastname: Mustermann
  firstname: Martina
  title: Dr.
  position: Vorsitzender Richterin
  affiliation: Oberlandesgericht Buxtehude
  role: Herausgeber
```
"""
epilog="""
Adhere to these guidelines to efficiently and accurately process the following content:"
"""
test_data = """
Herausgeber:
Prof. Dr. Stefan Knesebeck, Universität Wuppertal
Prof. Dr. Dr. h.c. Fritz M. Müller LL.M.(Yale), Universität Wanne-Eickel
RA Prof. Dr. Vera Valentin, Hochschule für Recht und Sport Edingen
Prof. Dr. Dr. h.c. Rita Rosenbaum, Universität Tupfingen
Dr. Ingo Gonzalo de Sanchez, Vorsitzender Richter am Oberlandesgericht Rostock
You are a text processing agent. As instructed below, extract information from the provided content in a structured format without discussing reasoning or providing commentary. Only use source text given as input for data extraction unless specifically asked for inference.
# user
Analyze content from a German law journal's website. Your task is to identify members of the editorial board (terms to look for: 'Herausgeber', 'Redakteur', 'Schriftleitung') and the advisory board ('Beirat'). For each identified member, extract and organize their information into the following categories: lastname, firstname, title (including academic titles like 'Dr.' or 'Prof. Dr.' and suffixes such as 'LL.M.'), position (their job title, if provided), affiliation, and role. For 'role', infer the role within the journal from the context (options 'Herausgeber', 'Redaktion', 'Schriftleitung', 'Beirat', or an empty string if the role is unknown).
- Format the output as a YAML list of dictionaries.
- Exclude any dictionary entries for which information is not available or relevant fields are empty.
- Ensure the YAML output is strictly valid. It must be a list of dictionaries.
Adhere to these guidelines to efficiently and accurately process the following content:"
python(39031) MallocStackLogging: can't turn off malloc stack logging because it was not enabled.
Testing
Test loss 0.800, Test ppl 2.226.
%% Cell type:markdown id:3bab8168bd116d38 tags:
Result:
600 iters: Test loss 0.800, Test ppl 2.226
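
The two reported numbers are consistent with perplexity being the exponential of the test loss. The short check below assumes the loss is a mean per-token cross-entropy measured in nats; the function name `perplexity` is chosen here for illustration.

```python
import math


def perplexity(loss: float) -> float:
    """Perplexity as the exponential of a mean per-token cross-entropy (nats)."""
    return math.exp(loss)


# Reproduces the reported value from the reported loss.
print(f"{perplexity(0.800):.3f}")  # 2.226
```

A perplexity of about 2.2 means that, on average, the model is roughly as uncertain about each test token as if it were choosing uniformly between two or three candidates.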
%% Cell type:markdown id:c7e42a5574068ba9 tags:
### Manual test prompt
%% Cell type:code id:8d316a1e7570f1d4 tags:
``` python
prompt=f"""
### SYSTEM
{system_message}
### USER
{instruction}
{example}
### CONTENT
{test_data}
### END OF CONTENT
""".strip()
```
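
Since the instruction requires a YAML list of dictionaries with a fixed set of fields and a closed set of roles, a model response can be checked mechanically once it has been parsed. The sketch below assumes the response was already parsed into Python objects (for example with PyYAML's `yaml.safe_load`); `validate_records`, `FIELDS` and `ROLES` are hypothetical names, with the field and role values taken from the instruction text.

```python
from typing import Any, List

# Field names and role options as listed in the instruction text.
FIELDS = {"lastname", "firstname", "title", "position", "affiliation", "role"}
ROLES = {"Herausgeber", "Redaktion", "Schriftleitung", "Beirat", ""}


def validate_records(data: Any) -> List[str]:
    """Return a list of problems; an empty list means the structure is acceptable."""
    if not isinstance(data, list):
        return ["top level is not a list"]
    problems = []
    for i, record in enumerate(data):
        if not isinstance(record, dict):
            problems.append(f"item {i} is not a dictionary")
            continue
        unknown = sorted(str(key) for key in record if key not in FIELDS)
        if unknown:
            problems.append(f"item {i} has unknown fields: {unknown}")
        role = record.get("role", "")
        if not isinstance(role, str) or role not in ROLES:
            problems.append(f"item {i} has unexpected role: {role!r}")
    return problems
```

Returning a list of problems rather than raising on the first one makes it easier to see at a glance how far a response deviates from the requested format.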
%% Cell type:code id:3e7a823a9f4a35d9 tags:
``` python
print(prompt)
```
%% Output
### SYSTEM
You are a text processing agent. As instructed below, extract information from the provided content in a structured format without discussing reasoning or providing commentary. Only use source text given as input for data extraction unless specifically asked for inference.
### USER
Analyze content from a German law journal's website. Your task is to identify members of the editorial board (terms to look for: 'Herausgeber', 'Redakteur', 'Schriftleitung') and the advisory board ('Beirat'). For each identified member, extract and organize their information into the following categories: lastname, firstname, title (including academic titles like 'Dr.' or 'Prof. Dr.' and suffixes such as 'LL.M.'), position (their job title, if provided), affiliation, and role. For 'role', infer the role within the journal from the context (options 'Herausgeber', 'Redaktion', 'Schriftleitung', 'Beirat', or an empty string if the role is unknown).
- Format the output as a YAML list of dictionaries.
- Exclude any dictionary entries for which information is not available or relevant fields are empty.
- Ensure the YAML output is strictly valid. It must be a list of dictionaries.
Here is an example:
```yaml
- lastname: Mustermann
  firstname: Martina
  title: Dr.
  position: Vorsitzender Richterin
  affiliation: Oberlandesgericht Buxtehude
  role: Herausgeber
```
### CONTENT
Herausgeber:
Prof. Dr. Stefan Knesebeck, Universität Wuppertal
Prof. Dr. Dr. h.c. Fritz M. Müller LL.M.(Yale), Universität Wanne-Eickel
RA Prof. Dr. Vera Valentin, Hochschule für Recht und Sport Edingen
Prof. Dr. Dr. h.c. Rita Rosenbaum, Universität Tupfingen
Dr. Ingo Gonzalo de Sanchez, Vorsitzender Richter am Oberlandesgericht Rostock