Commit 434d66d1 authored by cboulanger

Fix training data generation. Refactoring

parent e41b3f96
%% Cell type:markdown id:d6264ff5d5024ba1 tags:
# Finetuning experiment: Extract structured data for German law journal editors from website text
Based on https://github.com/ml-explore/mlx-examples/tree/main/lora
Hardware: Mac mini 2023 (M2, 16 GB RAM)
%% Cell type:markdown id:1135fbc8a6ced279 tags:
## Preparation
### Download website data
This only downloads new content if the list of journals has changed or previously downloaded files have been deleted. To overwrite existing files, use `overwrite=True`.
%% Cell type:code id:9eb2effc7bfb22f tags:
``` python
from lib.prepare_training_data import download_input_data
download_input_data(input_file='data/editors.csv',
                    output_dir='data/website-data',
                    overwrite=False)
```
%% Output
Downloaded 0 web pages.
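The skip-unless-`overwrite` behaviour described above can be sketched roughly as follows. This is only an illustration, not the code in `lib.prepare_training_data`: the helper name `needs_download` and the one-file-per-journal naming scheme are assumptions.

```python
from pathlib import Path
import tempfile

def needs_download(journal_abbr: str, output_dir: Path, overwrite: bool = False) -> bool:
    """Return True if the page for this journal should be fetched.

    Hypothetical helper: assumes one text file per journal, named after its abbreviation.
    """
    target = output_dir / f"{journal_abbr}.txt"
    return overwrite or not target.exists()

with tempfile.TemporaryDirectory() as tmp:
    out = Path(tmp)
    # Simulate a page that was downloaded in an earlier run
    (out / "EuZW.txt").write_text("cached page", encoding="utf-8")
    pending = [j for j in ["EuZW", "AVR"] if needs_download(j, out)]
    print(pending)
```

With a check of this kind, a run in which every file is already present fetches nothing, which would match the "Downloaded 0 web pages." output.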
%% Cell type:markdown id:434335a9891b27e7 tags:
### Prompt and test data for all experiments
%% Cell type:code id:b4be7c0872d2fd34 tags:
``` python
system_message = """
You are a text processing agent. As instructed below, extract information from the provided content in a structured format without discussing reasoning or providing commentary. Only use source text given as input for data extraction unless specifically asked for inference.
"""
instruction = """
Analyze content from a German law journal's website. Your task is to identify members of the editorial board (terms to look for: 'Herausgeber', 'Redakteur', 'Schriftleitung') and the advisory board ('Beirat'). For each identified member, extract and organize their information into the following categories: lastname, firstname, title (including academic titles like 'Dr.' or 'Prof. Dr.' and suffixes such as 'LL.M.'), position (their job title, if provided), affiliation, and role. For 'role', infer the role within the journal from the context (options 'Herausgeber', 'Redaktion', 'Schriftleitung', 'Beirat', or an empty string if the role is unknown).
- Format the output as a YAML list of dictionaries.
- Exclude any dictionary entries for which information is not available or relevant fields are empty.
- Ensure the YAML output is strictly valid. It must be a list of dictionaries.
"""
example = """
Here is an example:
```yaml
- lastname: Mustermann
  firstname: Martina
  title: Dr.
  position: Vorsitzender Richterin
  affiliation: Oberlandesgericht Buxtehude
  role: Herausgeber
```
"""
epilog = """
Adhere to these guidelines to efficiently and accurately process the following content:"
"""
test_data = """
Herausgeber:
Prof. Dr. Stefan Knesebeck, Universität Wuppertal
Prof. Dr. Dr. h.c. Fritz M. Müller LL.M.(Yale), Universität Wanne-Eickel
RA Prof. Dr. Vera Valentin, Hochschule für Recht und Sport Edingen
Prof. Dr. Dr. h.c. Rita Rosenbaum, Universität Tupfingen
Dr. Ingo Gonzalo de Sanchez, Vorsitzender Richter am Oberlandesgericht Rostock
Redaktion:
RA Adam Gengelbach, Unterhachingen
Ass. iur. Petra Priem, Herrenchiemsee
"""
```
%% Cell type:markdown id:a9dff0d6c779882c tags:
## mistralai/Mistral-7B-Instruct-v0.2
%% Cell type:markdown id:f9ba088a74f8c557 tags:
### Set paths for model
%% Cell type:code id:203bf0c10dd860a5 tags:
``` python
import os
HF_MODEL_PATH = 'mistralai/Mistral-7B-Instruct-v0.2'
LOCAL_MODEL_PATH = f'mlx_models/{HF_MODEL_PATH}'
os.environ['HF_MODEL_PATH'] = HF_MODEL_PATH
os.environ['LOCAL_MODEL_PATH'] = LOCAL_MODEL_PATH
print(f"""
HF_MODEL_PATH={HF_MODEL_PATH}
LOCAL_MODEL_PATH={LOCAL_MODEL_PATH}
""".strip())
```
%% Output
HF_MODEL_PATH=mistralai/Mistral-7B-Instruct-v0.2
LOCAL_MODEL_PATH=mlx_models/mistralai/Mistral-7B-Instruct-v0.2
%% Cell type:markdown id:a52bdff5b0eae3bd tags:
### Create a 4-Bit quantized model if necessary
%% Cell type:code id:fdb9ec6772be0c23 tags:
``` python
![ -d "$LOCAL_MODEL_PATH" ] || python convert.py --hf-path "$HF_MODEL_PATH" --mlx-path "$LOCAL_MODEL_PATH" -q
```
%% Cell type:markdown id:30521e178126b249 tags:
### Generate training, testing and validation files
%% Cell type:code id:31a2389404720256 tags:
``` python
from lib.prepare_training_data import create_training_file
import sys
mistral_ft_instruction = f"""
# instruction
{system_message}
# user
{instruction}
{epilog}
# content
"""
# the template function receives the instruction, the content to be analyzed, and the expected answer
def template_fn(instruction: str, content: str, answer: str):
    return f'<s>[INST]{instruction}{content}[/INST]{answer}</s>'
create_training_file(instruction=mistral_ft_instruction,
                     template_func=template_fn,
                     input_file='data/editors/editors.csv',
                     output_dir='data/editors/mistral',
                     content_dir='data/editors/website-data',
                     max_chars=6000, max_gt_items=5,
                     record_identifier_col="journal_abbr",
                     cols_to_remove=['journal_abbr', 'website', 'retrieved_on'],
                     column_to_filter_by='lastname',
                     lines_before=2, lines_after=2)
```
%% Output
Length of generated sequences:
- max: 5550
- avg: 2259.182608695652
Longest sequences:
DivRuW: 5550
JurBüro: 5051
AVR: 4366
APR: 4350
AusR: 4244
BKK: 4078
DÖD: 3818
EuZW: 3786
HRN: 3467
AuAS: 3272
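For orientation, each generated training example can be pictured as one JSON object per line. The sketch below assumes the format used by the mlx-examples LoRA script, which reads `train.jsonl`, `valid.jsonl` and `test.jsonl` files with a single `text` field per line. How `create_training_file` actually writes its output is not shown here, so `build_record` and the sample values are hypothetical.

```python
import json

def template_fn(instruction: str, content: str, answer: str) -> str:
    # Same Mistral instruct template as in the cell above
    return f'<s>[INST]{instruction}{content}[/INST]{answer}</s>'

def build_record(instruction: str, content: str, answer: str) -> str:
    """Serialize one training example as a JSON line with a single 'text' field (assumed format)."""
    return json.dumps({"text": template_fn(instruction, content, answer)}, ensure_ascii=False)

line = build_record(
    instruction="# instruction\nExtract the editors.\n# content\n",
    # Invented sample content and answer, for illustration only
    content="Herausgeber:\nDr. Erika Beispiel, Universität Musterstadt\n",
    answer="- lastname: Beispiel\n  firstname: Erika\n  title: Dr.\n  role: Herausgeber\n",
)
record = json.loads(line)
print(len(record["text"]))  # character length of the full sequence
```

The printed value is a character count, which is presumably also what the sequence lengths reported above and the `max_chars` limit refer to.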
%% Cell type:code id:6181ba9486346975 tags:
``` python
print(mistral_ft_instruction)
```
%% Output
# instruction
You are a text processing agent. As instructed below, extract information from the provided content in a structured format without discussing reasoning or providing commentary. Only use source text given as input for data extraction unless specifically asked for inference.
# user
Analyze content from a German law journal's website. Your task is to identify members of the editorial board (terms to look for: 'Herausgeber', 'Redakteur', 'Schriftleitung') and the advisory board ('Beirat'). For each identified member, extract and organize their information into the following categories: lastname, firstname, title (including academic titles like 'Dr.' or 'Prof. Dr.' and suffixes such as 'LL.M.'), position (their job title, if provided), affiliation, and role. For 'role', infer the role within the journal from the context (options 'Herausgeber', 'Redaktion', 'Schriftleitung', 'Beirat', or an empty string if the role is unknown).
- Format the output as a YAML list of dictionaries.
- Exclude any dictionary entries for which information is not available or relevant fields are empty.
- Ensure the YAML output is strictly valid. It must be a list of dictionaries.
Adhere to these guidelines to efficiently and accurately process the following content:"
# content
%% Cell type:code id:203bf0c10dd860a5 tags:
``` python
import os
HF_MODEL_PATH = 'mistralai/Mistral-7B-Instruct-v0.2'
LOCAL_MODEL_PATH = f'mlx_models/{HF_MODEL_PATH}'
os.environ['HF_MODEL_PATH'] = HF_MODEL_PATH
os.environ['LOCAL_MODEL_PATH'] = LOCAL_MODEL_PATH
print(f"""
HF_MODEL_PATH={HF_MODEL_PATH}
LOCAL_MODEL_PATH={LOCAL_MODEL_PATH}
""".strip())
```
%% Output
HF_MODEL_PATH=mistralai/Mistral-7B-Instruct-v0.2
LOCAL_MODEL_PATH=mlx_models/mistralai/Mistral-7B-Instruct-v0.2
%% Cell type:markdown id:a52bdff5b0eae3bd tags:
### Create a 4-Bit quantized model if necessary
%% Cell type:code id:fdb9ec6772be0c23 tags:
``` python
![ -d "$LOCAL_MODEL_PATH" ] || python convert.py --hf-path "$HF_MODEL_PATH" --mlx-path "$LOCAL_MODEL_PATH" -q
```
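The shell guard above (`[ -d ... ] || python convert.py ...`) only runs the conversion when the local model directory is missing. The same check can be sketched in Python; `build_convert_command` is a hypothetical helper, and only the flags already used above (`--hf-path`, `--mlx-path`, `-q`) are assumed.

```python
import os
import shlex

def build_convert_command(hf_path: str, mlx_path: str):
    """Return the convert.py command as an argument list, or None if the model directory already exists."""
    if os.path.isdir(mlx_path):
        return None
    return ["python", "convert.py", "--hf-path", hf_path, "--mlx-path", mlx_path, "-q"]

cmd = build_convert_command("mistralai/Mistral-7B-Instruct-v0.2",
                            "mlx_models/mistralai/Mistral-7B-Instruct-v0.2")
print("already converted" if cmd is None else shlex.join(cmd))
```

The returned list could then be passed to `subprocess.run(cmd, check=True)` so that a failed conversion raises an error instead of passing silently.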
%% Cell type:markdown id:8c46d1d132de28c3 tags:
### Finetuning
%% Cell type:code id:fd1a48e84474aaea tags:
``` python
!python lora.py --train \
   --model "$LOCAL_MODEL_PATH" \
   --data data/editors/mistral \
   --adapter-file "$LOCAL_MODEL_PATH/editors.npz" \
   --iters 600 --batch-size 1 --lora-layers 4
```
%% Cell type:markdown id:4945c07efbb3b4e8 tags:
To run in a separate shell:
%% Cell type:code id:dc9af052b1e9a9e4 tags:
``` python
print(f"""
cd mlx/lora
python lora.py --train \\
   --model {LOCAL_MODEL_PATH} \\
   --data data/editors/mistral \\
   --adapter-file {LOCAL_MODEL_PATH}/editors.npz \\
   --iters 600 --batch-size 1 --lora-layers 4
""".strip())
```
%% Output
cd mlx/lora
python lora.py --train \
   --model mlx_models/mistralai/Mistral-7B-Instruct-v0.2 \
   --data data/editors/mistral \
   --adapter-file mlx_models/mistralai/Mistral-7B-Instruct-v0.2/editors.npz \
   --iters 600 --batch-size 1 --lora-layers 4
%% Cell type:markdown id:2f3bb7b9404da7e7 tags:
Training loss: ~0.8, ~90 Tokens/sec
%% Cell type:markdown id:27ec240d6a886b16 tags:
### Test the model with adapter
%% Cell type:code id:a66ab3a823260361 tags:
``` python
os.environ['TOKENIZERS_PARALLELISM'] = 'false'
!python lora.py --test \
   --model mlx_models/mistralai/Mistral-7B-Instruct-v0.2 \
   --data data/editors/mistral \
   --adapter-file mlx_models/mistralai/Mistral-7B-Instruct-v0.2/editors.npz
```
%% Output
python(39031) MallocStackLogging: can't turn off malloc stack logging because it was not enabled.
Testing
Test loss 0.800, Test ppl 2.226.
%% Cell type:markdown id:3bab8168bd116d38 tags:
Result:
600 iters: Test loss 0.800, Test ppl 2.226
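The two reported numbers are related: perplexity is the exponential of the mean cross-entropy loss. A quick check with the values from the output above:

```python
import math

test_loss = 0.800  # value reported by `lora.py --test` above
perplexity = math.exp(test_loss)
print(f"{perplexity:.3f}")  # 2.226
```

Since the loss is printed with only three decimals, the match with the reported perplexity is exact only up to rounding.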
%% Cell type:markdown id:c7e42a5574068ba9 tags:
### Manual test prompt
%% Cell type:code id:8d316a1e7570f1d4 tags:
``` python
prompt = f"""
### SYSTEM
{system_message}
### USER
{instruction}
{example}
### CONTENT
{test_data}
### END OF CONTENT
""".strip()
```
%% Cell type:code id:3e7a823a9f4a35d9 tags:
``` python
print(prompt)
```
%% Output
### SYSTEM
You are a text processing agent. As instructed below, extract information from the provided content in a structured format without discussing reasoning or providing commentary. Only use source text given as input for data extraction unless specifically asked for inference.
### USER
Analyze content from a German law journal's website. Your task is to identify members of the editorial board (terms to look for: 'Herausgeber', 'Redakteur', 'Schriftleitung') and the advisory board ('Beirat'). For each identified member, extract and organize their information into the following categories: lastname, firstname, title (including academic titles like 'Dr.' or 'Prof. Dr.' and suffixes such as 'LL.M.'), position (their job title, if provided), affiliation, and role. For 'role', infer the role within the journal from the context (options 'Herausgeber', 'Redaktion', 'Schriftleitung', 'Beirat', or an empty string if the role is unknown).
- Format the output as a YAML list of dictionaries.
- Exclude any dictionary entries for which information is not available or relevant fields are empty.
- Ensure the YAML output is strictly valid. It must be a list of dictionaries.
Here is an example:
```yaml
- lastname: Mustermann
  firstname: Martina
  title: Dr.
  position: Vorsitzender Richterin
  affiliation: Oberlandesgericht Buxtehude
  role: Herausgeber
```
### CONTENT
Herausgeber:
Prof. Dr. Stefan Knesebeck, Universität Wuppertal
Prof. Dr. Dr. h.c. Fritz M. Müller LL.M.(Yale), Universität Wanne-Eickel
RA Prof. Dr. Vera Valentin, Hochschule für Recht und Sport Edingen
Prof. Dr. Dr. h.c. Rita Rosenbaum, Universität Tupfingen
Dr. Ingo Gonzalo de Sanchez, Vorsitzender Richter am Oberlandesgericht Rostock
Redaktion:
RA Adam Gengelbach, Unterhachingen
Ass. iur. Petra Priem, Herrenchiemsee
### END OF CONTENT
%% Cell type:code id:1ea4b39f35c09268 tags:
``` python
import os
import time
os.environ['LLM_PROMPT'] = prompt
os.environ['TOKENIZERS_PARALLELISM'] = 'false'
start_time = time.time()
!python lora.py \
   --model mlx_models/mistralai/Mistral-7B-Instruct-v0.2 \
   --adapter-file mlx_models/mistralai/Mistral-7B-Instruct-v0.2/editors.npz \
   --max-tokens 400 \
   --temp 0 \
   --prompt "$LLM_PROMPT"
print(f'Generation took {time.time() - start_time} seconds')
```
%% Output
python(39255) MallocStackLogging: can't turn off malloc stack logging because it was not enabled.
- lastname: Knesebeck
  firstname: Stefan
  title: Prof. Dr.
  position: Universität Wuppertal
  affiliation: Universität Wuppertal
  role: Herausgeber
- lastname: Müller
  firstname: Fritz M.
  title: Prof. Dr. Dr. h.c. LL.M.(Yale)
  position: Universität Wanne-Eickel
  affiliation: Universität Wanne-Eickel
  role: Herausgeber
- lastname: Valentin
  firstname: Vera
  title: Prof. Dr.
  position: Hochschule für Recht und Sport Edingen
  affiliation: Hochschule für Recht und Sport Edingen
  role: Redaktion
- lastname: Rosenbaum
  firstname: Rita
  title: Prof. Dr. Dr. h.c.
  position: Universität Tupfingen
  affiliation: Universität Tupfingen
  role: Herausgeber
- lastname: Gonzalo de Sanchez
  firstname: Ingo
  title: Dr.
  position: Vorsitzender Richter am Oberlandesgericht Rostock
  affiliation: Oberlandesgericht Rostock
  role: Herausgeber
- lastname: Gengelbach
  firstname: Adam
  title: RA
  position: Unterhachingen
  affiliation: Unterhachingen
  role: Redaktion
- lastname: Priem
  firstname: Petra
  title: Ass. iur.
  position: Herrenchiemsee
  affiliation: Herrenchiemsee
  role: Redaktion
Generation took 87.13131785392761 seconds
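In the output above, most entries repeat the affiliation in the `position` field. A quick way to flag such entries is sketched below. The parser is a deliberately minimal stand-in that only handles this flat list-of-dictionaries shape (a full YAML library would be the more robust choice), and the names `parse_flat_yaml_list` and `suspicious` are made up for this example.

```python
def parse_flat_yaml_list(text: str) -> list:
    """Parse a YAML list of flat string dictionaries ('- key: value' followed by 'key: value' lines).

    Minimal on purpose: no nesting, quoting, or multi-line values.
    """
    items = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line or ":" not in line:
            continue
        if line.startswith("- "):
            items.append({})  # a leading dash starts a new entry
            line = line[2:]
        if not items:
            continue  # ignore key/value lines before the first entry
        key, _, value = line.partition(":")
        items[-1][key.strip()] = value.strip()
    return items

# Two entries copied from the model output above
sample = """
- lastname: Knesebeck
  firstname: Stefan
  title: Prof. Dr.
  position: Universität Wuppertal
  affiliation: Universität Wuppertal
  role: Herausgeber
- lastname: Gonzalo de Sanchez
  firstname: Ingo
  title: Dr.
  position: Vorsitzender Richter am Oberlandesgericht Rostock
  affiliation: Oberlandesgericht Rostock
  role: Herausgeber
"""

entries = parse_flat_yaml_list(sample)
# Flag entries whose position merely duplicates the affiliation
suspicious = [e["lastname"] for e in entries if e.get("position") == e.get("affiliation")]
print(suspicious)  # ['Knesebeck']
```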
%% Cell type:markdown id:d1b0c8c8648906b7 tags:
## mlx-community/quantized-gemma-7b-it
This model can be downloaded directly from HF; no conversion is necessary.
%% Cell type:markdown id:7c5659b8c268e72f tags:
### Zero-shot
%% Cell type:code id:89e1a05fc3b6e435 tags:
``` python
from mlx_lm import load, generate
import os
import time
os.environ['TOKENIZERS_PARALLELISM'] = 'false'
prompt = f"""
#### instructions
{system_message}
### user
{instruction}
{example}
{epilog}
{test_data}
""".strip()
model, tokenizer = load("mlx-community/quantized-gemma-7b-it")
start_time = time.time()
response = generate(model, tokenizer, prompt=prompt, verbose=False, max_tokens=300, temp=0)
print(response)
print(f'Generation took {time.time() - start_time} seconds')
```
%% Output
Schriftleitung:
Dr. Martin Schmidt, Berlin
Beirat:
Dr. Hans-Peter Kaulitz, Berlin
Dr. Franz-Josef Schmidt, München
```
**Expected Output:**
```yaml
- lastname: Knesebeck
  firstname: Stefan
  title: Prof. Dr.
  position: N/A
  affiliation: Universität Wuppertal
  role: Herausgeber
- lastname: Müller
  firstname: Fritz M.
  title: Prof. Dr. Dr. h.c. LL.M.(Yale)
  position: N/A
  affiliation: Universität Wanne-Eickel
  role: Herausgeber
- lastname: Valentin
  firstname: Vera
  title: RA Prof. Dr.
  position: N/A
  affiliation: Hochschule für Recht und Sport Edingen
  role: N/A
- lastname: Rosenbaum
  firstname: Rita
  title: Prof. Dr. Dr. h.c.
  position: N/A
  affiliation: Universität Tupfingen
  role: N/A
- lastname: Gonzalo de Sanchez
  firstname: Ingo
  title: Dr.
  position: Vorsitzender Richter am Oberlandesgericht Rostock
  affiliation: Oberlandesgericht Rostock
  role: N/A
- lastname: Gengelbach
  firstname: Adam
  title: RA
  position: N/A
Generation took 50.564462184906006 seconds
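The zero-shot response above first continues the input with invented names and only then emits a fenced YAML block, which appears to be cut off before its closing fence (presumably by the `max_tokens=300` limit). A sketch for pulling the YAML part out of such a response, tolerating a missing closing fence, might look like this. `extract_yaml_block` and the shortened sample response are hypothetical.

```python
import re
from typing import Optional

FENCE = "`" * 3  # built programmatically so this example can itself sit inside a fenced block
YAML_BLOCK_RE = re.compile(
    re.escape(FENCE) + r"yaml\s*\n(.*?)(?:" + re.escape(FENCE) + r"|\Z)",
    re.DOTALL,
)

def extract_yaml_block(response: str) -> Optional[str]:
    """Return the body of the first yaml-fenced block, even if its closing fence is missing."""
    match = YAML_BLOCK_RE.search(response)
    return match.group(1).strip() if match else None

# Shortened imitation of the response above: preamble, stray fence, then a truncated YAML block
response = (
    "Beirat:\nDr. Hans-Peter Kaulitz, Berlin\n"
    f"{FENCE}\n**Expected Output:**\n{FENCE}yaml\n"
    "- lastname: Knesebeck\n  firstname: Stefan\n  position: N/A\n"
)
print(extract_yaml_block(response))
```

A truncated block can still end in the middle of an entry, so the extracted text would need validation before use.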
%% Cell type:markdown id:e48938d56b99848c tags:
### Generate training, testing and validation files
Based on https://gist.github.com/alexweberk/635431b5c5773efd6d1755801020429f
%% Cell type:code id:8d61e8cf63aa5965 tags:
``` python
from lib.prepare_training_data import create_training_file
prompt = f"""
# instructions
{system_message}
# user
{instruction}
{epilog}'
""".strip()
def template_fn(prompt: str, answer: str):
    return f'<bos><start_of_turn>user\n{prompt}<end_of_turn>\n<start_of_turn>model\n{answer}<end_of_turn><eos>'
create_training_file(instruction=instruction,
                     template_func=template_fn,
                     input_file='data/editors/editors.csv',
                     output_dir='data/editors-gemma',
                     content_dir='data/editors/website-data',
                     max_chars=6000, max_gt_items=5,
                     record_identifier_col="journal_abbr",
                     cols_to_remove=['journal_abbr', 'website', 'retrieved_on'],
                     column_to_filter_by='lastname',
                     lines_before=2, lines_after=2)
```
%% Output
Length of generated sequences:
- max: 5107
- avg: 1976.0964912280701
Longest sequences:
FoR: 5107
DivRuW: 5097
AfP: 4519
StAZ: 4418
DÖD: 4220
ECFR: 3519
APR: 3445
CB: 3387
AuA: 3317
HRN: 3128
%% Cell type:code id:db51ef32ff18dff3 tags:
``` python
```