<h1 align="center">Artefact for automated extraction of sustainability data from CSR</h1>

<p align="center">
    <a>
        <img alt="Made with Python" src="https://img.shields.io/badge/Made%20with-Python-1f425f.svg?color=purple">
    </a>
    <a>
        <img alt="License" src="https://img.shields.io/badge/MIT-MIT?label=license">
    </a>
</p>
<p align="center">
    This is a prototype developed for the project seminar of the <strong>Master's programme</strong> in Information Systems at the <strong>Georg-August University in Göttingen</strong>.
</p>
<p align="center">
    The prototype is designed to extract environment-related information from corporate sustainability reports with the help of an <strong>LLM</strong>.
</p>

## Installation

To install this application on the HPC, follow these steps:

1. Connect with SSH (see [GWDG documentation](https://docs.gwdg.de/doku.php?id=en:services:application_services:high_performance_computing:connect_with_ssh))

2. Log into the HPC (see [GWDG documentation](https://docs.gwdg.de/doku.php?id=en:services:application_services:high_performance_computing:connect_with_ssh))
   1. Connect to the HPC via VS Code (see [GWDG documentation](https://info.gwdg.de/news/en/configuring-vscode-to-access-gwdgs-hpc-cluster/))

3. Clone this repository to the HPC by running the following command in your terminal (or use your IDE):
    ```bash
    git clone https://gitlab.gwdg.de/t.reichard/projektstudium.git
    ```
   
4. Navigate to the project directory and create a virtual environment by running the following command:
    ```bash
    python3 -m venv env
    ```
   
5. Activate the virtual environment by running the following command:
    ```bash
    source env/bin/activate
    ```
   
6. Install the required dependencies by running the following command:
    ```bash
    pip install -r requirements.txt
    ```
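
Before installing or running anything, it can help to confirm that the virtual environment from steps 4 and 5 is actually active. The following is a minimal sketch, not part of the repository; the function name `in_virtual_environment` is hypothetical. It relies on the fact that inside an environment created with `python3 -m venv`, `sys.prefix` differs from `sys.base_prefix`.

```python
import sys


def in_virtual_environment() -> bool:
    """Return True when the interpreter runs inside a venv."""
    # Inside a venv, sys.prefix points at the environment directory,
    # while sys.base_prefix still points at the base Python installation.
    return sys.prefix != sys.base_prefix


if __name__ == "__main__":
    if in_virtual_environment():
        print(f"Virtual environment active: {sys.prefix}")
    else:
        print("No virtual environment active; run 'source env/bin/activate' first.")
```

Running this with the environment activated should report the path of the `env` directory.
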

## Usage on HPC

1. Create a Bash script by creating a new file with the `.sh` extension.

2. Copy the content of `llmgpu.txt` into the new script.

3. Modify the ``Slurm parameters`` in the Bash script:
   ```bash
   #!/bin/bash
   #SBATCH --job-name=<insert_job_name>
   #SBATCH --output=job-%J.out
   #SBATCH --mail-type=ALL
   #SBATCH --mail-user=<insert_mail_address>
   #SBATCH --ntasks=1
   # Modify the partition, the GPU type and count, and the memory per GPU if needed
   #SBATCH -p gpu
   #SBATCH -G v100:5
   #SBATCH --mem-per-gpu=26G
   # Replace 00:00:00 with the estimated time the script needs to execute (HH:MM:SS)
   #SBATCH --time=00:00:00
   #SBATCH -C scratch
   ```
   
4. Modify the ``Python commands`` in the Bash script:
   ```bash
   # Load the Python module
   module load python
   # Caution: this deletes everything in your scratch directory
   rm -rf /scratch/users/<insert_username>/
   # Create and activate a fresh virtual environment on scratch
   python -m venv /scratch/users/<insert_username>/env
   source /scratch/users/<insert_username>/env/bin/activate
   # Install the dependencies
   pip install transformers -U
   pip install frontend
   pip install pymupdf
   pip install -r requirements.txt
   pip install python-dotenv
   pip install pytesseract opencv-python langchain tiktoken requests pathlib Pillow
   pip install torch torchvision torchaudio transformers einops accelerate bitsandbytes sentence_transformers xformers chromadb
   # Execute the Python script
   python <insert_path_to_main.py>
   ```
  
5. Create a ``.env`` file from ``.exampleEnv``.

6. Add your username to ``.env``:
   ```bash
   # .env file
   USERS_MUSTERMANN=mustermann
   ```
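7. Change the name of the username variable in ``main.py`` so that it matches the entry in ``.env``:
   ```python
   import os

   # Use the variable name defined in .env (here: USERS_MUSTERMANN)
   user_name = os.getenv('USERS_MUSTERMANN')
   ```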

8. Submit the ``Slurm job``
   ```bash
   sbatch <path to script>
   ```

9. Check ``job status``
   ```bash
   squeue -u <user name>
   ```
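
Steps 6 and 7 connect the `.env` entry to the scratch paths used in the job script. The sketch below illustrates that link under stated assumptions: `parse_env` is a simplified, hypothetical stand-in for what a `.env` loader such as `python-dotenv` does (it ignores quoting and inline comments), and `scratch_env_dir` is a hypothetical helper. Neither is taken from the project's `main.py`.

```python
from pathlib import PurePosixPath


def parse_env(text: str) -> dict:
    """Parse simple KEY=VALUE lines (simplified stand-in for a .env loader)."""
    values = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        # Skip blank lines, comments, and lines without an assignment.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip()
    return values


def scratch_env_dir(user_name: str) -> PurePosixPath:
    """Build the virtual environment path used in the job script."""
    return PurePosixPath("/scratch/users") / user_name / "env"


if __name__ == "__main__":
    example = "# .env file\nUSERS_MUSTERMANN=mustermann\n"
    settings = parse_env(example)
    user_name = settings["USERS_MUSTERMANN"]
    print(scratch_env_dir(user_name))  # prints /scratch/users/mustermann/env
```

The resulting path should match the `<insert_username>` placeholders in the Bash script.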

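### After the first execution of the Bash script

Once the first run has installed all dependencies, `comment out` the setup and installation commands so that later runs reuse the existing environment: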
```bash
module load python
#rm -rf /scratch/users/<insert_username>/
#python -m venv /scratch/users/<insert_username>/env
source /scratch/users/<insert_username>/env/bin/activate
#pip install transformers -U
#pip install frontend
#pip install pymupdf
#pip install -r requirements.txt
#pip install python-dotenv
#pip install pytesseract opencv-python langchain tiktoken requests pathlib Pillow
#pip install torch torchvision torchaudio transformers einops accelerate bitsandbytes sentence_transformers xformers chromadb
python <insert_path_to_main.py>
```
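
Commenting out these lines by hand is easy to get wrong, so the edit could also be scripted. The following is a minimal sketch, not part of the repository: the helper `comment_out_setup` and the prefix list `SETUP_PREFIXES` are hypothetical and simply mirror the commands that are commented out above.

```python
# Commands that are only needed on the first run (assumed from the script above).
SETUP_PREFIXES = ("rm -rf", "python -m venv", "pip install")


def comment_out_setup(script_text: str) -> str:
    """Prefix one-time setup commands with '#', leaving other lines untouched."""
    result = []
    for line in script_text.splitlines():
        stripped = line.lstrip()
        if stripped.startswith(SETUP_PREFIXES):
            # Keep the original indentation in front of the '#'.
            indent = line[: len(line) - len(stripped)]
            result.append(f"{indent}#{stripped}")
        else:
            result.append(line)
    # Preserve a trailing newline if the input had one.
    trailing = "\n" if script_text.endswith("\n") else ""
    return "\n".join(result) + trailing
```

Lines that are already commented start with `#` and are therefore left unchanged, so applying the function twice gives the same result as applying it once.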

## Useful Slurm commands

* ``Submit`` a Slurm job
   ```bash
   sbatch <script name> # e.g. llmgpu.sh
   ```
* ``Check`` the job status
   ```bash
   squeue -u <user name>
   ```
* ``Get`` the estimated ``start time`` of the job
   ```bash
   squeue -u <user name> --start
   ```
* ``Cancel`` a job
   ```bash
   scancel <job id>
   ```
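
The commands above can also be called from Python, for example to record the job ID after submission. This is a hedged sketch rather than project code: the function names are hypothetical, `submit_job` only works on a machine where `sbatch` is available, and the parsing assumes the usual `Submitted batch job <id>` message that `sbatch` prints on success.

```python
import re
import subprocess
from typing import List, Optional


def parse_job_id(sbatch_output: str) -> Optional[int]:
    """Extract the job ID from sbatch's 'Submitted batch job <id>' message."""
    match = re.search(r"Submitted batch job (\d+)", sbatch_output)
    return int(match.group(1)) if match else None


def submit_job(script_path: str) -> Optional[int]:
    """Submit a job script with sbatch and return its job ID (requires Slurm)."""
    completed = subprocess.run(
        ["sbatch", script_path], capture_output=True, text=True, check=True
    )
    return parse_job_id(completed.stdout)


def queue_command(user_name: str, show_start: bool = False) -> List[str]:
    """Build the squeue argument list for a user, optionally with --start."""
    command = ["squeue", "-u", user_name]
    if show_start:
        command.append("--start")
    return command
```

For instance, `subprocess.run(queue_command("<user name>", show_start=True))` would run the same query as the third command in the list.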

## Further information

See [Using the GWDG Scientific Compute Cluster](https://docs.gwdg.de/lib/exe/fetch.php?media=en:services:application_services:high_performance_computing:courses:parallelkurs.pdf) for more information.

## License

This project is licensed under the MIT License.