<h1 align="center">Artefact for automated extraction of sustainability data from CSR</h1>
<p align="center">
<a>
<img alt="Build" src="https://img.shields.io/badge/Made%20with-Python-1f425f.svg?color=purple">
</a>
<a>
<img alt="License" src="https://img.shields.io/badge/MIT-MIT?label=license">
</a>
</p>
<p align="center">
This is a prototype developed for the project seminar of the <strong>Master's programme</strong> in Information Systems at the <strong>Georg-August University in Göttingen</strong>.
</p>
<p align="center">
The prototype is designed to extract environment-related information from corporate sustainability reports with the help of an <strong>LLM</strong>.
</p>
To install this application on the HPC, follow these steps:
1. Connect with SSH (see [GWDG documentation](https://docs.gwdg.de/doku.php?id=en:services:application_services:high_performance_computing:connect_with_ssh))
2. Log into the HPC (see [GWDG documentation](https://docs.gwdg.de/doku.php?id=en:services:application_services:high_performance_computing:connect_with_ssh))
   1. Optionally, connect to the HPC via VSCode (see [GWDG documentation](https://info.gwdg.de/news/en/configuring-vscode-to-access-gwdgs-hpc-cluster/))
3. Clone this repository to the HPC by running the following command in your terminal (or use your IDE):
```bash
git clone https://gitlab.gwdg.de/t.reichard/projektstudium.git
```
4. Navigate to the project directory and create a virtual environment by running the following command:
```bash
python3 -m venv env
```
5. Activate the virtual environment by running the following command:
```bash
source env/bin/activate
```
6. Install the required dependencies by running the following command:
```bash
pip install -r requirements.txt
```
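To confirm that the dependencies were installed into the virtual environment and not into the system Python, a quick check can help. The following is a minimal sketch; the helper name `in_virtualenv` is hypothetical and not part of this repository.

```python
import sys


def in_virtualenv() -> bool:
    """Return True when the interpreter runs inside a virtual environment."""
    # Inside a venv, sys.prefix points at the environment directory while
    # sys.base_prefix still points at the base Python installation.
    return sys.prefix != sys.base_prefix


if __name__ == "__main__":
    if in_virtualenv():
        print(f"Virtual environment active: {sys.prefix}")
    else:
        print("No virtual environment active; run 'source env/bin/activate' first.")
```

Run it with the same `python` that you use for `pip install`, so that both refer to the same interpreter.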
## Usage on HPC
1. Create a bash script by creating a new file with the `.sh` extension
2. Copy and paste the content of `llmgpu.txt` into the new script and fill in the placeholders in angle brackets
```bash
#!/bin/bash
#SBATCH --job-name=<insert_job_name>
#SBATCH --output=job-%J.out
#SBATCH --mail-type=ALL
#SBATCH --mail-user=<insert_mail_address>
#SBATCH --ntasks=1
#SBATCH -p gpu <modify_if_needed>
#SBATCH -G v100:5 <modify_if_needed>
#SBATCH --mem-per-gpu=26G <modify_if_needed>
#SBATCH --time=00:00:00 <insert_estimated_time_that_the_script_needs_to_execute>
#SBATCH -C scratch
```
```bash
# Load any necessary modules or activate the desired Python environment
# Execute the Python script with the input file from the array
module load python
rm -rf /scratch/users/<insert_username>/
python -m venv /scratch/users/<insert_username>/env
source /scratch/users/<insert_username>/env/bin/activate
pip install transformers -U
pip install frontend
pip install pymupdf
pip install -r requirements.txt
pip install pytesseract opencv-python langchain tiktoken requests pathlib Pillow
pip install torch torchvision torchaudio transformers einops accelerate bitsandbytes sentence_transformers xformers chromadb
python <insert_path_to_main.py>
```
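The `--time` placeholder in the script header has to be replaced with a concrete time limit. As a minimal sketch, the hypothetical helper `slurm_time` below (not part of this repository) formats a duration as `D-HH:MM:SS`, which is one of the forms Slurm accepts for `--time`.

```python
from datetime import timedelta


def slurm_time(limit: timedelta) -> str:
    """Format a time limit as D-HH:MM:SS for the sbatch --time option."""
    total_seconds = int(limit.total_seconds())
    if total_seconds < 0:
        raise ValueError("time limit must not be negative")
    # Split the total into days, hours, minutes, and seconds.
    days, remainder = divmod(total_seconds, 86400)
    hours, remainder = divmod(remainder, 3600)
    minutes, seconds = divmod(remainder, 60)
    return f"{days}-{hours:02d}:{minutes:02d}:{seconds:02d}"


print(slurm_time(timedelta(hours=2, minutes=30)))  # 0-02:30:00
```

The printed value can be pasted after `--time=` in the script header. How long the job actually needs depends on the model and the number of reports, so the duration here is only an example.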
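3. Add your username to ``.env``
```bash
# .env file
USERS_MUSTERMANN=mustermann
```
4. Change the username in ``main.py``
```python
import os
# Replace 'add_username' with the variable name defined in .env, e.g. 'USERS_MUSTERMANN'
user_name = os.getenv('add_username')
```
5. Submit the ``slurm job``
```bash
sbatch <script name> # e.g. llmgpu.sh
```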
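Note that `os.getenv` only sees variables that are present in the process environment; it does not read a `.env` file by itself. How `main.py` loads the file is not shown here, so the following is only a sketch using the standard library. The function `load_env_file` is hypothetical, and `USERS_MUSTERMANN` is the example variable name from the `.env` step.

```python
import os
from pathlib import Path


def load_env_file(path: str = ".env") -> dict:
    """Parse KEY=VALUE lines from a .env file and export them to os.environ."""
    loaded = {}
    env_path = Path(path)
    if not env_path.is_file():
        return loaded
    for raw_line in env_path.read_text(encoding="utf-8").splitlines():
        line = raw_line.strip()
        # Skip blank lines, comments, and lines without an assignment.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        key, value = key.strip(), value.strip().strip("'\"")
        loaded[key] = value
        # Values already set in the shell take precedence over the file.
        os.environ.setdefault(key, value)
    return loaded


if __name__ == "__main__":
    load_env_file()
    user_name = os.getenv("USERS_MUSTERMANN", "<unset>")
    print(f"/scratch/users/{user_name}/")
```

If the project already uses a package for this purpose, prefer that over the hand-written parser above.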
### After the first execution of the bash script
All dependencies are now installed, so `comment out` the environment setup and `pip install` lines:
```bash
module load python
#rm -rf /scratch/users/<insert_username>/
#python -m venv /scratch/users/<insert_username>/env
source /scratch/users/<insert_username>/env/bin/activate
#pip install transformers -U
#pip install frontend
#pip install pymupdf
#pip install -r requirements.txt
#pip install pytesseract opencv-python langchain tiktoken requests pathlib Pillow
#pip install torch torchvision torchaudio transformers einops accelerate bitsandbytes sentence_transformers xformers chromadb
python <insert_path_to_main.py>
```
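Before commenting out the install lines, it may be worth checking that the packages really ended up in the environment. The sketch below uses `importlib.metadata` from the standard library (Python 3.8+); the function `missing_packages` and the `REQUIRED` list are hypothetical, and the list covers only some of the names from the `pip install` lines above.

```python
from importlib import metadata

# Distribution names taken from the pip install lines of the job script.
REQUIRED = [
    "transformers", "pymupdf", "pytesseract", "opencv-python", "langchain",
    "tiktoken", "torch", "accelerate", "bitsandbytes", "chromadb",
]


def missing_packages(names):
    """Return the distributions from `names` that are not installed."""
    missing = []
    for name in names:
        try:
            metadata.version(name)
        except metadata.PackageNotFoundError:
            missing.append(name)
    return missing


if __name__ == "__main__":
    missing = missing_packages(REQUIRED)
    if missing:
        print("Still missing:", ", ".join(missing))
    else:
        print("All listed packages are installed.")
```

Run it with the interpreter of the environment in `/scratch/users/<insert_username>/env`, since the check only reports on the environment it runs in.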
* ``Submit`` slurm job
```bash
sbatch <script name> # e.g. llmgpu.sh
```
* ``Check`` job status
```bash
squeue -u <user name>
```
* ``Get`` estimated ``start time`` of the job
```bash
squeue -u <user name> --start
```
* ``Cancel`` job
```bash
scancel <job id>
```
See [Using the GWDG Scientific Compute Cluster](https://docs.gwdg.de/lib/exe/fetch.php?media=en:services:application_services:high_performance_computing:courses:parallelkurs.pdf) for more details.
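The job commands above can also be driven from Python, for example to keep the job ID for a later `scancel`. This is a minimal sketch and not part of the repository: `parse_job_id` and `submit` are hypothetical helpers, and `submit` only works on a machine where `sbatch` is available.

```python
import re
import subprocess


def parse_job_id(sbatch_output: str) -> int:
    """Extract the job id from the 'Submitted batch job <id>' line of sbatch."""
    match = re.search(r"Submitted batch job (\d+)", sbatch_output)
    if match is None:
        raise ValueError(f"no job id found in: {sbatch_output!r}")
    return int(match.group(1))


def submit(script: str) -> int:
    """Submit `script` with sbatch and return the job id (needs Slurm on PATH)."""
    result = subprocess.run(
        ["sbatch", script], capture_output=True, text=True, check=True
    )
    return parse_job_id(result.stdout)


print(parse_job_id("Submitted batch job 123456"))  # 123456
```

On the cluster, `submit("llmgpu.sh")` would return the ID that `scancel <job id>` expects.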
This project is licensed under the MIT License.