# Comparing OpenAI and open LLMs

Using the [text-only content of the website of the journal AUR - Agrar- und Umweltrecht](langchain-experiments/data/input/journal-website.txt),
we compare the performance of GPT-4, GPT-3.5-turbo, and models available on Hugging Face.

## Preparation

Import dependencies, define shorthand functions, and prepare the test data.

In [1]:
import io
from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI
from langchain_core.output_parsers import StrOutputParser
import pandas as pd
from dotenv import load_dotenv

load_dotenv()

def response_to_df(response):
    data = io.StringIO(response)
    try:
        return pd.read_csv(data)
    except Exception as e:
        # Include the raw response so that malformed model output is easy to inspect
        raise RuntimeError(f"Error while parsing response:\n{response}") from e

def use_model(model, template, **params):
    prompt = ChatPromptTemplate.from_template(template)
    chain = (
            prompt
            | model
            | StrOutputParser()
    )
    return response_to_df(chain.invoke(params))

with open('data/input/journal-website.txt', encoding='utf-8') as f:
    website_text = f.read()
journal_name = "AUR - Agrar- und Umweltrecht"

## Prompt

OpenAI's GPT-4 works perfectly with a minimal, German-language prompt and infers the meaning of the columns
well enough to return the data we need:

```
Finde im folgenden Text die Herausgeber, Redaktion/Schriftleitung und Beirat der Zeitschrift '{journal_name}' und gebe sie im CSV-Format zurück mit den Spalten 'lastname', 'firstname', 'title', 'position', 'affiliation','role'. Die Spalte 'role' enthält entweder 'Herausgeber', 'Redaktion', 'Beirat', 'Schriftleitung' oder ist leer wenn nicht bestimmbar. Wenn keine passenden Informationen verfügbar sind, gebe nur den CSV-Header zurück. Setze alle Werte in den CSV-Spalten in Anführungszeichen.
```


In contrast, the open models performed miserably with such a prompt. We therefore use English and provide very detailed instructions.  

In [5]:
template = """
In the following German text, which was scraped from a website, find the members of the editorial board or the advisory board of the journal '{journal_name}' as per the following rules:
- In German, typical labels for these roles are "Herausgeber", "Redaktion/Redakteur/Schriftleitung" and "Beirat".  
- Return the data as comma-separated values, which can be saved to a `.csv` file. Put all values in the CSV rows in quotes. 
- The CSV data must have the columns 'lastname', 'firstname', 'title', 'position', 'affiliation','role'. 
- The column 'role' must contain either 'Herausgeber', 'Redaktion', 'Beirat' or is empty. Leave the column empty if you cannot determine the role. Use 'Redaktion' for the "Schriftleitung" role.
- The column 'title' should contain academic titles such as "Dr." or "Prof. Dr."
- The column 'position' should contain the job title
- The column 'affiliation' contains the institution or organization the person belongs to, or the city if one is mentioned
- If the journal is published ("herausgegeben von") by an association, institute or other organization, put its name in the column 'lastname'. 
- If you cannot find any information, simply return the CSV header. 
- You must not output any introduction, commentary or explanation such as 'Here is the CSV data for the members of the editorial board or the advisory board of the journal'. Only return the data.

{website_text}
"""
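Both prompts ask for every value to be quoted. The reason is that affiliations such as "Hochschule Weihenstephan-Triesdorf, Freising" contain commas, which would otherwise split one value into two columns. The following minimal sketch illustrates this with Python's standard `csv` module, which uses the same double-quote convention that `pd.read_csv` applies by default. The helper `parse_row` is hypothetical and only serves this illustration.

```python
import csv
import io

# The same row, once unquoted and once fully quoted as requested in the prompt.
header = "lastname,firstname,title,position,affiliation,role"
unquoted = "Endres,Ewald,Prof. Dr.,,Hochschule Weihenstephan-Triesdorf, Freising,Redaktion"
quoted = '"Endres","Ewald","Prof. Dr.","","Hochschule Weihenstephan-Triesdorf, Freising","Redaktion"'


def parse_row(line):
    """Parse a single CSV line using the standard double-quote convention."""
    return next(csv.reader(io.StringIO(line)))


n_columns = len(parse_row(header))
unquoted_fields = parse_row(unquoted)  # the comma inside the affiliation splits the value
quoted_fields = parse_row(quoted)      # the affiliation stays a single field

print(n_columns, len(unquoted_fields), len(quoted_fields))  # prints: 6 7 6
```

The unquoted row has one field more than the header, so it can no longer be mapped onto the six columns reliably.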

## GPT-4

GPT-4 delivers an almost perfect [result](data/output/editors-openai-gpt-4.csv). A few problems remain, for example the full name of the publishing association ends up in the 'position' column. These could probably be resolved by adding more instructions to the prompt.



In [3]:
model = ChatOpenAI(model_name="gpt-4")
df = use_model(model, template, journal_name=journal_name, website_text=website_text)
df.to_csv('data/output/editors-openai-gpt-4.csv')
df

| | lastname | firstname | title | position | affiliation | role |
|---|---|---|---|---|---|---|
| 0 | DGAR | | | Deutsche Gesellschaft für Agrarrecht | | Herausgeber |
| 1 | Busse | Christian | Dr. | Regierungsdirektor | Bundesministerium für Ernährung und Landwirtsc... | Redaktion |
| 2 | Endres | Ewald | Prof. Dr. | | Hochschule Weihenstephan-Triesdorf, Freising | Redaktion |
| 3 | Francois | Matthias | Dr. | Rechtsanwalt | Bitburg | Redaktion |
| 4 | von Garmissen | Bernd | Dr. | Rechtsanwalt | Göttingen | Redaktion |
| 5 | Glas | Ingo | | Rechtsanwalt | Rostock | Redaktion |
| 6 | Graß | Christiane | | Rechtsanwältin | Bonn | Redaktion |
| 7 | Haarstrich | Jens | | Rechtsanwalt | Peine | Redaktion |
| 8 | Koch | Erich | Dr. | Ltd. Verwaltungsdirektor | Sozialversicherung für Landwirtschaft, Forsten... | Redaktion |
| 9 | Köpl | Christian | Dr. | Ministerialrat | Bayerisches Staatsministerium für Ernährung, L... | Redaktion |


## GPT-3.5-turbo

GPT-3.5 [performs less well](data/output/editors-openai-gpt-3.5-turbo.csv), but the result is still acceptable. It confuses some of the 'title' and 'position'
column data and does not recognize the institutional publisher (Herausgeber) of the journal. 


In [4]:
model = ChatOpenAI(model_name="gpt-3.5-turbo")
df = use_model(model, template, journal_name=journal_name, website_text=website_text)
df.to_csv('data/output/editors-openai-gpt-3.5-turbo.csv')
df

| | lastname | firstname | title | position | affiliation | role |
|---|---|---|---|---|---|---|
| 0 | Busse | Christian | Dr. | Regierungsdirektor | Bundesministerium für Ernährung und Landwirtsc... | Redaktion |
| 1 | Endres | Ewald | Prof. Dr. | | Hochschule Weihenstephan-Triesdorf, Freising | Redaktion |
| 2 | Francois | Matthias | Dr. | Rechtsanwalt | Bitburg | Redaktion |
| 3 | von Garmissen | Bernd | Dr. | Rechtsanwalt | Göttingen | Redaktion |
| 4 | Glas | Ingo | | Rechtsanwalt | Rostock | Redaktion |
| 5 | Graß | Christiane | Rechtsanwältin | | Bonn | Redaktion |
| 6 | Haarstrich | Jens | Rechtsanwalt | | Peine | Redaktion |
| 7 | Koch | Erich | Dr. | Ltd. Verwaltungsdirektor | Sozialversicherung für Landwirtschaft, Forsten... | Redaktion |
| 8 | Köpl | Christian | Dr. | Ministerialrat | Bayerisches Staatsministerium für Ernährung, L... | Redaktion |
| 9 | Martinez | Jose | Prof. Dr. | | Institut für Landwirtschaftsrecht, Georg-Augus... | Redaktion |
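
In rows 5 and 6 of the output above, the job title landed in 'title' while 'position' stayed empty. One possible way to repair this after the fact is sketched below. It is not part of the notebook: the helper names and the list of academic title tokens are assumptions, and the rows are expected as dictionaries of strings, for example from `df.fillna("").to_dict("records")`.

```python
# Assumption: an academic title consists only of tokens from this small list.
ACADEMIC_TOKENS = {"Prof.", "Dr.", "PD", "h.c.", "jur.", "iur.", "habil.", "mult."}


def is_academic_title(value):
    """Return True if every whitespace-separated token is a known academic title token."""
    tokens = value.split()
    return bool(tokens) and all(token in ACADEMIC_TOKENS for token in tokens)


def fix_title_position(row):
    """Move a non-academic 'title' into 'position' if 'position' is empty."""
    fixed = dict(row)  # work on a copy, leave the input row untouched
    title = fixed.get("title", "").strip()
    position = fixed.get("position", "").strip()
    if title and not position and not is_academic_title(title):
        fixed["title"], fixed["position"] = "", title
    return fixed


rows = [
    {"lastname": "Graß", "firstname": "Christiane", "title": "Rechtsanwältin", "position": ""},
    {"lastname": "Endres", "firstname": "Ewald", "title": "Prof. Dr.", "position": ""},
]
fixed_rows = [fix_title_position(row) for row in rows]
print(fixed_rows)
```

A heuristic like this only covers the simple case seen here. Mixed values such as an academic title followed by a job title would need to be split rather than moved.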


Now, let's try the open models via Hugging Face Inference Endpoints. For this to work, you need to deploy
endpoints via https://ui.endpoints.huggingface.co/ and update the value of `endpoint_url` below.
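
The cells below import a `query` helper from `lib.hf_llama2_chat_gptq`, whose source is not shown here. As a rough idea of what such a helper might do, the following sketch fills the template, wraps it in the Llama 2 chat markers, and posts it to a deployed endpoint. All function names, the `HF_TOKEN` environment variable, and the generation parameters are assumptions. The request and response shapes follow the text-generation format that such endpoints commonly use, so check them against your own deployment.

```python
import json
import os
import urllib.request

SYSTEM_PROMPT = "You are a helpful assistant. No comments or explanation, just answer the question."


def build_llama2_prompt(template, **params):
    """Fill the template placeholders, then wrap the text in the Llama 2 chat markers."""
    user_message = template.format(**params)
    return f"<s>[INST] <<SYS>>{SYSTEM_PROMPT}<</SYS>>{user_message}[/INST]"


def build_payload(prompt, max_new_tokens=1024):
    """Build the JSON body; the parameter values are illustrative assumptions."""
    return {
        "inputs": prompt,
        "parameters": {"max_new_tokens": max_new_tokens, "return_full_text": False},
    }


def query_endpoint(endpoint_url, template, token=None, **params):
    """Hypothetical stand-in for the notebook's `query` helper."""
    token = token or os.environ["HF_TOKEN"]  # assumed variable name
    payload = build_payload(build_llama2_prompt(template, **params))
    request = urllib.request.Request(
        endpoint_url,
        data=json.dumps(payload).encode("utf-8"),  # passing data makes this a POST request
        headers={"Authorization": f"Bearer {token}", "Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        result = json.load(response)
    # Assumed response shape: [{"generated_text": "..."}]
    return result[0]["generated_text"]


# Building the prompt needs no network access:
demo_prompt = build_llama2_prompt(
    "Journal: {journal_name}\n{website_text}", journal_name="AUR", website_text="..."
)
print(demo_prompt)
```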

## TheBloke/Llama-2-13B-chat-GPTQ 

The [Llama 2 13-billion-parameter model](https://huggingface.co/TheBloke/Llama-2-13B-chat-GPTQ) produces [unusable output](data/output/editors-llama-2-13b-chat-gptq.txt).

In [4]:
from lib.hf_llama2_chat_gptq import query

# Wrap the instructions in the Llama 2 chat format (system prompt plus [INST] block)
llama2_template = f"<s>[INST] <<SYS>>You are a helpful assistant. No comments or explanation, just answer the question.<</SYS>>{template}[/INST]"

endpoint_url = "https://z8afrqamxvaaitmf.us-east-1.aws.endpoints.huggingface.cloud"
# Pass the wrapped template, which was previously defined but never used
query(endpoint_url, llama2_template, journal_name=journal_name, website_text=website_text).split("\n")


['Martinez, Dr. Christian Busse, Bundesministerium für Ernährung und Landwirtschaft, Bonn Agrarprodukt Recht',
 'Prof. Dr. Ewald Endres, Hochschule Weihenstephan-Triesdorf Freising Forsting Forsting, Jagd, Fischerei, Fischerei',
 'Lawyeranwalt Ingo Glas, Bitburg Boden Recht',
 'Christiane Grass, Bonn Agrarzivil Recht',
 'Jens Haarstrich, Peine Redaktionär, Rostock',
 'Prof. Dr. Bernd von Garmissen, Göttingen Erb, Redaktion, Umwelt',
 'Ltdr. Jose Martinez, Georg-August-Universität Göttingen, Göttingen',
 '',
 '',
 "Note: The column 'Role' contains the following values: 'Herausgeber', 'Redaktion', 'Beirat'"]

## TheBloke/Llama-2-70B-chat-GPTQ via Huggingface Inference Endpoint

The 70-billion-parameter variant [does a bit better](data/output/editors-llama-2-70b-chat-gptq.csv) but, among other things, omits the 'position' column and doesn't get the academic titles right. It also cannot be persuaded to [not comment on the CSV output](data/output/editors-llama-2-70b-chat-gptq.txt). Given that the model costs $13/h to run, that's not very impressive.
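
Since the model cannot be talked out of commenting, the commentary could instead be stripped before parsing. The following sketch is one possible approach under simple assumptions: the CSV starts at the first line that mentions 'lastname' and ends at the next blank line. The function `extract_csv_rows` is hypothetical, and the sample response is an abbreviated imitation of the model's output.

```python
import csv
import io

EXPECTED_COLUMNS = ["lastname", "firstname", "title", "position", "affiliation", "role"]


def extract_csv_rows(response):
    """Return (header, rows) from a response that may wrap the CSV in commentary."""
    lines = response.splitlines()
    # Assumption: the CSV header is the first line that mentions 'lastname'.
    start = next((i for i, line in enumerate(lines) if "lastname" in line.lower()), None)
    if start is None:
        return [], []
    body = []
    for line in lines[start:]:
        if not line.strip():  # a blank line ends the CSV block, trailing notes are dropped
            break
        body.append(line)
    # skipinitialspace=True tolerates a space between a comma and an opening quote.
    parsed = list(csv.reader(io.StringIO("\n".join(body)), skipinitialspace=True))
    return parsed[0], parsed[1:]


response = (
    "  Here is the CSV data for the members of the editorial board:\n\n"
    '"lastname","firstname","title","affiliation","role"\n'
    '"Endres","Ewald", "Prof. Dr.", "Hochschule Weihenstephan-Triesdorf, Freising", "Redaktion"\n'
    "\n"
    "Note: some roles could not be determined."
)
header, rows = extract_csv_rows(response)
missing = [column for column in EXPECTED_COLUMNS if column not in header]
print(header, rows, missing)
```

Comparing the header against the expected columns also reveals when the model silently drops a column, as it does with 'position' here.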

In [3]:
endpoint_url = "https://gp8iviqlqee101a0.us-east-1.aws.endpoints.huggingface.cloud"
query(endpoint_url, llama2_template, journal_name=journal_name, website_text=website_text).split("\n")

'  Here is the CSV data for the members of the editorial board or the advisory board of the journal \'AUR - Agrar- und Umweltrecht\':\n\n"lastname","firstname","title","affiliation","role"\n"Busse","Christian", "Regierungsdirektor", "Bundesministerium für Ernährung und Landwirtschaft, Bonn", "Herausgeber"\n"Endres","Ewald", "Prof. Dr.", "Hochschule Weihenstephan-Triesdorf, Freising", "Redaktion"\n"Francois","Matthias", "Rechtsanwalt", "Bitburg", "Redaktion"\n"Garmissen","Bernd", "Rechtsanwalt", "Göttingen", "Redaktion"\n"Graß","Christiane", "Rechtsanwältin", "Bonn", "Redaktion"\n"Haarstrich","Jens", "Rechtsanwalt", "Peine", "Redaktion"\n"Köpl","Christian", "Ministerialrat", "Bayerisches Staatsministerium für Ernährung, Landwirtschaft und Forsten, München", ""\n"Martinez","Jose", "Prof. Dr.", "Institut für Landwirtschaftsrecht, Georg-August-Universität Göttingen, Göttingen", "Herausgeber"\n"Nies","Volkmar", "Ltd. Landwirtschaftsdirektor", "Landwirtschaftskammer NRW, Bonn", "Redaktion"\n