ISMB-2024-Tutorial-Federated-Ensemble-Learning-for-Biomedical-Data

Federated Ensemble Learning for Biomedical Data

Download

You can download the data, presentation, notebooks, and code of the tutorial here.

The materials contain the lecture slides, the exercises, and the necessary data and code to run the exercises. The solutions to the exercises will be uploaded after the tutorial.

Background

Federated Random Forest

Federated Random Forest (FRF) is a federated ensemble learning algorithm that addresses the challenge of limited data access in precision medicine. It allows multiple parties to collaborate and build predictive models without sharing their actual data. FRF adapts the ensemble learning principles of Random Forests to federated learning settings, allowing for decentralized model training and improved predictive performance while respecting data privacy and regulatory constraints.

In FRF, each party trains a local model on its own dataset. These local models consist of decision trees. A subset of decision trees is then randomly sampled from each local model and pooled into a combined model, known as the federated combined model.
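The procedure described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the tutorial's implementation: it uses scikit-learn's RandomForestClassifier on synthetic data, simulates three parties in one process, and the number of trees sampled per party (trees_per_party) and the probability-averaging rule are illustrative choices.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

# Synthetic stand-in for data held by three parties (sizes are hypothetical).
X, y = make_classification(n_samples=600, n_features=10, random_state=0)
X_test, y_test = X[450:], y[450:]
party_slices = [slice(0, 150), slice(150, 300), slice(300, 450)]

# Step 1: each party trains a local random forest on its own data only.
local_forests = [
    RandomForestClassifier(n_estimators=50, random_state=i).fit(X[s], y[s])
    for i, s in enumerate(party_slices)
]

# Step 2: randomly sample a subset of trees from each local forest and pool
# them into one combined ensemble. Only trees leave a party, never the data.
trees_per_party = 20  # illustrative choice
combined_trees = []
for forest in local_forests:
    picked = rng.choice(len(forest.estimators_), size=trees_per_party, replace=False)
    combined_trees.extend(forest.estimators_[i] for i in picked)


def predict_combined(trees, X_new):
    """Average the class probabilities of all pooled trees, then take the argmax.

    The labels here are 0 and 1, so the argmax index equals the class label.
    """
    proba = np.mean([tree.predict_proba(X_new) for tree in trees], axis=0)
    return proba.argmax(axis=1)


federated_acc = (predict_combined(combined_trees, X_test) == y_test).mean()
local_accs = [forest.score(X_test, y_test) for forest in local_forests]
print(f"combined: {federated_acc:.3f}, local: {np.round(local_accs, 3)}")
```

In a real deployment the sampled trees would be serialized and sent to a coordinator instead of being collected in a local list.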

The communication between parties in FRF is efficient, requiring only two communication steps per model update. This is in contrast to other federated learning algorithms that require frequent communication for exchanging model parameters.

The performance of FRF has been evaluated in various biomedical datasets, and it has been found to outperform average local models and perform comparably to data-centralized models trained on the entire dataset. FRF enables collaborations across institutes, allowing clinicians to benefit from a vast collection of unbiased data from different geographic locations, demographics, and other varying factors. This can lead to the development of more generalizable models for better clinical decisions, especially for patients in rural areas and those with rare or geographically uncommon diseases.

Furthermore, FRF can be extended to cases where the features of different institutes only partially overlap. This extension ensures that the federated learning framework remains robust and efficient even when data inconsistencies arise due to diverse recording practices. By leveraging a federated approach, each participating institute can contribute to the global model without exposing its dataset, thus maintaining privacy and compliance with data protection regulations. The extended FRF accommodates variability in feature sets by integrating only the available features from each site, enabling collaborative model building despite incomplete or non-uniform data.
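One simple way to picture partially overlapping features is to let every pooled tree remember which features it was trained on, and to score a new record only with the trees whose features are all present. The sketch below follows that idea on synthetic data. It is an assumption made for illustration and not necessarily the method of Park et al. 2024; the feature names, site names, and the helper predict_available are hypothetical.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

# Hypothetical global feature list; the label depends on the two shared features.
all_names = ["albumin", "bilirubin", "age", "alt"]
X_all = rng.normal(size=(400, len(all_names)))
y_all = (X_all[:, 1] + X_all[:, 2] > 0).astype(int)

# Each site holds its own rows and records only three of the four features.
sites = {
    "site_A": (slice(0, 200), ["albumin", "bilirubin", "age"]),
    "site_B": (slice(200, 400), ["bilirubin", "age", "alt"]),
}

# Every pooled tree is stored together with the feature names it expects.
pooled = []
for seed, (rows, names) in enumerate(sites.values()):
    cols = [all_names.index(name) for name in names]
    forest = RandomForestClassifier(n_estimators=10, random_state=seed)
    forest.fit(X_all[rows][:, cols], y_all[rows])
    pooled.extend((tree, names) for tree in forest.estimators_)


def predict_available(pooled_trees, record):
    """Return (mean class-1 probability, number of trees used) for one record.

    Only trees whose features are all present in `record` contribute.
    """
    votes = []
    for tree, names in pooled_trees:
        if all(name in record for name in names):
            row = np.array([[record[name] for name in names]])
            votes.append(tree.predict_proba(row)[0, 1])
    if not votes:
        raise ValueError("no pooled tree can be applied to this record")
    return float(np.mean(votes)), len(votes)


# A record with all four features is scored by every tree ...
full_record = {"albumin": 0.4, "bilirubin": 1.2, "age": -0.3, "alt": 0.8}
# ... while a record without "alt" is scored only by the site_A trees.
partial_record = {"albumin": 0.4, "bilirubin": 1.2, "age": -0.3}

print(predict_available(pooled, full_record))
print(predict_available(pooled, partial_record))
```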

Overall, FRF, in combination with secure multi-party computation, has the potential to revolutionize clinical practice by increasing the accuracy and robustness of healthcare AI, paving the way for precision medicine.

Federated Graph Neural Networks

Graph neural networks (GNNs) are a type of machine learning model that can handle graph data, which is very common in biomedicine. For example, we can use a graph to represent a patient, where the nodes are proteins that interact with each other and have patient-specific features from different omics data sources. We have developed Ensemble-GNN, a Python software package that can build federated, ensemble-based GNNs for graph data. Ensemble-GNN can easily create predictive models using graphs with different node features, such as gene expression or DNA methylation.
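To make the patient-as-graph representation concrete, here is a minimal sketch in plain Python. It does not use the Ensemble-GNN API; the PatientGraph class, the build_patient_graph helper, the small interaction network, and all numeric values are hypothetical and serve only to show how one shared topology is combined with patient-specific node features from two omics layers.

```python
from dataclasses import dataclass

# Illustrative protein-protein interaction edges, shared by all patients.
ppi_edges = [("TP53", "MDM2"), ("TP53", "BRCA1"), ("BRCA1", "BARD1")]


@dataclass
class PatientGraph:
    nodes: list      # protein names in a fixed order
    edges: list      # undirected edges as pairs of node indices
    features: list   # one feature vector per node, one entry per omics layer
    label: int       # class label of the patient


def build_patient_graph(edges, omics_layers, label):
    """Combine the shared topology with one patient's omics measurements."""
    nodes = sorted({protein for edge in edges for protein in edge})
    index = {protein: i for i, protein in enumerate(nodes)}
    edge_idx = [(index[a], index[b]) for a, b in edges]
    features = [[layer[protein] for layer in omics_layers] for protein in nodes]
    return PatientGraph(nodes, edge_idx, features, label)


# Made-up measurements for one patient: gene expression and DNA methylation.
expression = {"TP53": 2.1, "MDM2": 0.3, "BRCA1": 1.7, "BARD1": 0.9}
methylation = {"TP53": 0.12, "MDM2": 0.80, "BRCA1": 0.45, "BARD1": 0.33}

graph = build_patient_graph(ppi_edges, [expression, methylation], label=1)
print(graph.nodes)     # ['BARD1', 'BRCA1', 'MDM2', 'TP53']
print(graph.edges)     # [(3, 2), (3, 1), (1, 0)]
print(graph.features)  # [[0.9, 0.33], [1.7, 0.45], [0.3, 0.8], [2.1, 0.12]]
```

Every patient shares the same nodes and edges; only the feature vectors and the label change from patient to patient.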

Setup

Federated GNNs and RF

The participants of the tutorial need a laptop (or PC) and internet access. There is no need to install anything in advance, since we provide each participant with a computational environment at Jupyterhub-HPC of the Gesellschaft für Wissenschaftliche Datenverarbeitung mbH Göttingen (GWDG). On the day of the workshop, you will get user credentials to log in to Jupyterhub-HPC. After you log in, you only need to select the job profile from the options below to spawn your Jupyter server with GPU access; leave all other settings untouched:

Job profile: GPU-workshop: Bioinformatics (July 2024)

Set the duration (in hours): leave untouched

Set the number of cores: leave untouched

Set the amount of memory (in GB): leave untouched

Jupyter Notebook's Home directory: leave untouched

After the server is spawned, you will see two folders, one for the Federated Random Forest and one for the Federated GNNs. Inside each folder, you will find the corresponding .ipynb notebooks. Before you open a notebook to work in it, please open New --> Terminal. Inside the terminal, you can track the GPU load with the commands "nvtop" and "nvitop", and the load and resources of the server with the "top" command. While "top" runs, you can press "E" (several times) to change the memory units to MB, GB, or TB. Please keep track of how much RAM you are already using (column RES). You can pass your_username (ukeb23..) to the "top" command to show only the resources consumed by you:

top -u your_username

To sort processes by memory usage, press Shift+M. If your Jupyter Notebook is getting close to 20 GB of RAM usage, please restart the kernel and run only the cell (or the few cells) you are currently working on.
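As a complement to "top", memory can also be checked from inside a notebook cell. The following is a small sketch using Python's standard resource module, assuming a Linux server where ru_maxrss is reported in kilobytes. Note that it reports the peak memory of the current kernel, not the current value shown in the RES column; the helper names and the 80% warning threshold are illustrative.

```python
import resource

RAM_LIMIT_GB = 20  # the limit mentioned above


def kb_to_gb(kilobytes):
    """Convert kilobytes to gigabytes (binary units)."""
    return kilobytes / (1024 ** 2)


def peak_ram_gb():
    """Peak resident memory of the current process in GB (Linux: ru_maxrss is in KB)."""
    return kb_to_gb(resource.getrusage(resource.RUSAGE_SELF).ru_maxrss)


used = peak_ram_gb()
print(f"peak RAM of this kernel: {used:.2f} GB")
if used > 0.8 * RAM_LIMIT_GB:
    print("Close to the limit: consider restarting the kernel.")
```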

Data sets

Federated Random Forest dataset

The dataset employed in our analysis is ILPD (Indian Liver Patient Dataset), which was used in the study by Park et al. 2024. The dataset comprises 584 patient records collected from the northeast of Andhra Pradesh, India. The prediction task is to determine whether a patient suffers from liver disease based on several biochemical markers, including albumin and enzymes required for metabolism.

This cross-silo dataset allows us to evaluate the performance of our Federated Random Forest model in a diverse multi-institute setting. The data consist of clinical attributes, and our focus is on disease classification. For our analysis, we have split the data into 10 silos. You can access this data set inside the Code_and_data_for_exercises/FRF_exercise.zip archive located here.
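How the 10 silos were formed is not described here. As a hedged illustration, a plain random split of 584 record indices into 10 silos of nearly equal size could look as follows; the function name split_into_silos and the fixed seed are assumptions, and the tutorial materials may use a different partitioning.

```python
import random


def split_into_silos(n_records, n_silos, seed=0):
    """Shuffle record indices and deal them round-robin into n_silos lists."""
    indices = list(range(n_records))
    random.Random(seed).shuffle(indices)
    # Striding through the shuffled list keeps silo sizes within one of each other.
    return [indices[i::n_silos] for i in range(n_silos)]


silos = split_into_silos(n_records=584, n_silos=10)
print([len(silo) for silo in silos])  # [59, 59, 59, 59, 58, 58, 58, 58, 58, 58]
```

Each silo then plays the role of one institute that trains its local model on its own records only.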

Federated GNN dataset

Here we performed an analysis, classifying patients' gene expression profiles into breast cancer subtypes. We utilized the Breast Cancer Dataset from The Cancer Genome Atlas (TCGA). This dataset provides expression profiles of breast tumors from patients across 19 different institutes. For our federated learning application, we focus on classification.

The dataset contains five subtypes: luminal A (499 samples), luminal B (197 samples), basal-like (171 samples), HER2-enriched (78 samples), and normal-like (36 samples). We performed classification with a binary label: 1 == LumA subtype (499 patients), 0 == all the other subtypes (482 patients). For a detailed description of the data, we refer to section 2.2 of Chereda et al. 2024.
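The binary labelling can be checked with a few lines of Python. This sketch rebuilds only the subtype counts given above; the subtype strings ("LumA", "LumB", and so on) are hypothetical placeholders for whatever coding the actual data files use.

```python
from collections import Counter

# Subtype counts as stated above; the string names are placeholders.
subtype_counts = {"LumA": 499, "LumB": 197, "Basal": 171, "Her2": 78, "Normal": 36}
subtypes = [name for name, count in subtype_counts.items() for _ in range(count)]


def to_binary_label(subtype):
    """Return 1 for the luminal A subtype and 0 for every other subtype."""
    return 1 if subtype == "LumA" else 0


labels = [to_binary_label(s) for s in subtypes]
print(len(labels))                # 981
print(sorted(Counter(labels).items()))  # [(0, 482), (1, 499)]
```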

Feedback

Participants can provide feedback on the tutorial.

References

[Hauschild et al. 2022] Hauschild, Anne-Christin, et al. "Federated Random Forests can improve local performance of predictive models for various healthcare applications." Bioinformatics 38.8 (2022): 2278-2286.

[Markovic et al. 2022] Markovic, Tijana, et al. "Random forest based on federated learning for intrusion detection." IFIP International Conference on Artificial Intelligence Applications and Innovations. Cham: Springer International Publishing, 2022.

[Pfeifer et al. 2023] Pfeifer, B., Chereda, H., Martin, R., Saranti, A., Angerschmid, A., Clemens, S., Hauschild, A.C., Beißbarth, T., Holzinger, A., Heider, D. "Ensemble-GNN: federated ensemble learning with graph neural networks for disease module discovery and classification." Bioinformatics 39.11 (2023): btad703.

[Chereda et al. 2024] Chereda, H., Leha, A., and Beissbarth, T. "Stable feature selection utilizing Graph Convolutional Neural Network and Layer-wise Relevance Propagation for biomarker discovery in breast cancer." Artificial Intelligence in Medicine 151 (2024): 102840.

[Park et al. 2024] Park, Youngjun, et al. "Federated Random Forest for Partially Overlapping Clinical Data." (2024).

Contact

Clinical Decision Support Systems Group

Team:

Prof. Dr. Anne-Christin Hauschild
Dr. Youngjun Park
Maryam Moradpour

Details: https://medizininformatik.umg.eu/ueber-uns/wissenschaftliche-arbeitsgruppen/klinische-entscheidungsunterstuetzung/

Medical Bioinformatics

Team:

Prof. Dr. Tim Beißbarth
Dr. Hryhorii Chereda

Details: https://bioinformatics.umg.eu/