<h1 align="center">Artefact for automated extraction of sustainability data from CSR</h1>
<p align="center">
<img alt="Build" src="">
<img alt="License" src="">
<p align="center">
This is a prototype developed for the project seminar of the <strong>Master's programme</strong> in Information Systems at the <strong>Georg-August University in Göttingen</strong>.
<p align="center">
The prototype is designed to extract environment-related information from corporate sustainability reports with the help of an <strong>LLM</strong>.
To install this application on the HPC, follow these steps:
1. Connect with SSH (see [GWDG documentation](
2. Log into the HPC (see [GWDG documentation](
1. Connect to HPC via VSCode (see [GWDG documentation](
3. Clone this repository to the HPC by running the following command in your terminal (or use your IDE):
git clone
4. Navigate to the project directory and create a virtual environment by running the following command:
python3 -m venv env
5. Activate the virtual environment by running the following command:
source env/bin/activate
6. Install the required dependencies by running the following command:
pip install -r requirements.txt
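As a quick sanity check after steps 4 to 6, the following minimal sketch reports whether the interpreter is running inside a virtual environment. It uses only the standard library. The helper name `in_virtualenv` is hypothetical and introduced here for illustration; it is not part of this repository.

```python
import sys


def in_virtualenv() -> bool:
    """Return True if the current interpreter runs inside a venv.

    Inside an environment created with `python3 -m venv`, sys.prefix points
    at the environment, while sys.base_prefix points at the base installation.
    """
    return sys.prefix != sys.base_prefix


if __name__ == "__main__":
    state = "active" if in_virtualenv() else "not active"
    print(f"Virtual environment is {state}: {sys.prefix}")
```

If the check reports that the environment is not active, run `source env/bin/activate` again before installing the dependencies.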
## Usage on HPC
1. Create a Bash script by creating a new file with the `.sh` extension
2. Copy and paste the content of `llmgpu.txt` into the new script
#SBATCH --job-name=<insert_job_name>
#SBATCH --output=job-%J.out
#SBATCH --mail-type=ALL
#SBATCH --mail-user=<insert_mail_address>
#SBATCH --ntasks=1
#SBATCH -p gpu <modify_if_needed>
#SBATCH -G v100:5 <modify_if_needed>
#SBATCH --mem-per-gpu=26G <modify_if_needed>
#SBATCH --time=00:00:00 <insert_estimated_time_that_the_script_needs_to_execute>
#SBATCH -C scratch
# Load any necessary modules or activate the desired Python environment
# Execute the Python script with the input file from the array
module load python
rm -rf /scratch/users/<insert_username>/
python -m venv /scratch/users/<insert_username>/env
source /scratch/users/<insert_username>/env/bin/activate
pip install transformers -U
pip install frontend
pip install pymupdf
pip install -r requirements.txt
pip install pytesseract opencv-python langchain tiktoken requests pathlib Pillow
pip install torch torchvision torchaudio transformers einops accelerate bitsandbytes sentence_transformers xformers chromadb
python <>
6. Add your username to the ``.env`` file
# .env file
7. Change username in ````
import os
user_name = os.getenv('add_username')
8. Submit the ``slurm job`` and check its status
sbatch <script name>
squeue -u <user name>
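Steps 6 and 7 can be sketched as follows: read the username from the environment (or from the text of a `.env` file) and derive the scratch environment path used in the batch script. This is a minimal sketch under stated assumptions. The variable name `HPC_USER_NAME` and the helper names are hypothetical, since the actual key used in this project's `.env` file is not shown here. Only the `/scratch/users/<username>/env` layout is taken from the script above.

```python
import os
from pathlib import PurePosixPath

# Hypothetical variable name; use whichever key your .env file defines.
USER_ENV_VAR = "HPC_USER_NAME"


def parse_env_text(text: str) -> dict:
    """Parse simple KEY=VALUE lines, skipping blank lines and # comments."""
    values = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop surrounding whitespace and optional quotes around the value.
        values[key.strip()] = value.strip().strip("'\"")
    return values


def resolve_user_name(env_text: str = "") -> str:
    """Prefer the process environment, then fall back to .env content."""
    name = os.getenv(USER_ENV_VAR) or parse_env_text(env_text).get(USER_ENV_VAR, "")
    if not name:
        raise RuntimeError(f"{USER_ENV_VAR} is not set")
    return name


def scratch_env_dir(user_name: str) -> PurePosixPath:
    """Build the scratch venv path that the batch script creates."""
    if not user_name:
        raise ValueError("user name must not be empty")
    return PurePosixPath("/scratch/users") / user_name / "env"
```

Failing early with a clear error when the username is missing avoids building paths such as `/scratch/users/None/env`, which is what an unchecked `os.getenv` result would produce.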
### After the first execution of the Bash script
All dependencies are now installed, so `comment out` the cleanup, environment creation, and installation lines:
module load python
#rm -rf /scratch/users/<insert_username>/
#python -m venv /scratch/users/<insert_username>/env
source /scratch/users/<insert_username>/env/bin/activate
#pip install transformers -U
#pip install frontend
#pip install pymupdf
#pip install -r requirements.txt
#pip install pytesseract opencv-python langchain tiktoken requests pathlib Pillow
#pip install torch torchvision torchaudio transformers einops accelerate bitsandbytes sentence_transformers xformers chromadb
python <>
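Before commenting out the installation lines, it can help to confirm that the packages are actually importable in the scratch environment. The sketch below is one possible check using only the standard library. The helper `missing_modules` and the list `REQUIRED_MODULES` are hypothetical and cover only a subset of the packages installed above. Note that import names can differ from pip names (`pymupdf` is imported as `fitz`, `opencv-python` as `cv2`, `Pillow` as `PIL`).

```python
import importlib.util

# Import names (not pip names) for some packages from the batch script.
# This subset is an assumption for illustration; extend it as needed.
REQUIRED_MODULES = ["transformers", "fitz", "cv2", "PIL", "torch", "chromadb"]


def missing_modules(names):
    """Return the top-level module names that cannot be found."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules(REQUIRED_MODULES)
    if missing:
        print("Missing modules, keep the pip install lines:", ", ".join(missing))
    else:
        print("All listed modules were found; the install lines can be commented out.")
```

Run the check with the scratch environment activated, for example from a short test job, so that it inspects the same interpreter the batch script uses.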
* ``Submit`` slurm job
sbatch <script name> # e.g.
* ``Check`` job status
squeue -u <user name>
* ``Get`` estimated ``start time`` of the job
squeue -u <user name> --start
* ``Cancel`` job
scancel <job id>
See [Using the GWDG Scientific Compute Cluster]( for more details.
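If you prefer to drive these Slurm commands from Python, the following sketch builds the argument lists for `sbatch`, `squeue`, and `scancel` and extracts the job ID from the `Submitted batch job <id>` line that `sbatch` prints. The helper names and the example script name `job.sh` are hypothetical. The sketch only constructs commands and does not contact the scheduler itself.

```python
import re
import shlex


def sbatch_command(script_name):
    """Command to submit a batch script."""
    return ["sbatch", script_name]


def squeue_command(user_name, show_start=False):
    """Command to list a user's jobs, optionally with estimated start times."""
    command = ["squeue", "-u", user_name]
    if show_start:
        command.append("--start")
    return command


def scancel_command(job_id):
    """Command to cancel a job by its ID."""
    return ["scancel", str(job_id)]


def parse_job_id(sbatch_output):
    """Extract the numeric job ID from sbatch's confirmation message."""
    match = re.search(r"Submitted batch job (\d+)", sbatch_output)
    if match is None:
        raise ValueError(f"no job ID found in: {sbatch_output!r}")
    return int(match.group(1))


if __name__ == "__main__":
    # Print the commands instead of running them (job.sh is a placeholder).
    print(shlex.join(sbatch_command("job.sh")))
    print(shlex.join(squeue_command("jdoe", show_start=True)))
    print(shlex.join(scancel_command(parse_job_id("Submitted batch job 123456"))))
```

On a login node, each list can be passed to `subprocess.run(command, capture_output=True, text=True, check=True)`. Keeping the commands as lists rather than strings avoids shell quoting problems with unusual file names.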
This project is licensed under the MIT License.