|
|
# Synopsis
|
|
|
The grmsem package is a quantitative genetics tool supporting the modelling of multivariate genetic variance structures in quantitative data. grmsem allows fitting different models through multivariate genetic-relationship-matrix (GRM) structural equation modelling (SEM) in unrelated individuals, using a maximum likelihood approach. Specifically, it combines genome-wide genotyping information, as captured by GRMs, with twin-research-based SEM techniques. `grmsem` uses a maximum likelihood approach setting fixed effect means to zero by use of **z-standardised** phenotypes.
|
|
|
The `grmsem` package is an open-source quantitative genetics tool that supports the modelling of genetic and residual covariance structures in samples of unrelated individuals with genome-wide genotyping information. `grmsem` allows fitting different models describing the underlying multivariate genetic architecture of quantitative traits, as captured by a genetic-relationship-matrix (GRM), using structural equation modelling (SEM) techniques and a maximum likelihood approach. Analogous to twin models, the `grmsem` package includes multiple models, such as a `Cholesky decomposition` model, an `Independent Pathway` model and the `Direct Symmetric` model, but also novel models such as a combined `Independent Pathway / Cholesky` model. A general form of these models can be automatically fitted. The user can adapt each model by changing the pre-fitted parameters. All estimates can be obtained in standardised form. Follow-up analyses include estimations of genetic correlations, bivariate heritabilities and factorial co-heritabilities. `grmsem` replaces the package `gsem`, presented in [@StPourcain2018], because an unrelated package with the same name had been released simultaneously.
|
|
|
|
|
|
The user can select different pre-specified model structures, including
|
|
|
The user can select $\Lambda_A$ and $\Lambda_E$ according to pre-specified model structures, including the models
|
|
|
|
|
|
* [Cholesky decomposition model](Cholesky model)
|
|
|
* [Independent pathway model](Independent pathway model)
|
|
|
* [Combined independent pathway and Cholesky model](IPC model)
|
|
|
* [Common pathway model](Common pathway model)
|
|
|
- Cholesky decomposition
|
|
|
- Independent Pathway (IP)
|
|
|
- Combined Independent Pathway / Cholesky (IPC)
|
|
|
- Common Pathway (CP)
|
|
|
|
|
|
by setting the `model` parameter of `grmsem.fit()` to `Cholesky`, `Independent`, `IPC` or `Common` respectively. grmsem fits, like GREML, all available data to the model. Each model can be adapted by the user by setting free parameters and starting values. Note that the likelihood for ill-specified models may not necessarily reach the global maximum and the model fit should be confirmed using different starting values. Although k, the number of different phenotypes is not restricted, in principle, computational demands will typically set a limit based on k x n \~ 30,000 for Cholesky decomposition models, where n is the number of observations per trait; models using less parameters can handle larger k x n.
|
|
|
by setting the `model` option of `grmsem.fit()` to `Cholesky`, `IP`, `IPC` or `CP` respectively. Note that the Common Pathway model is for likelihood comparisons only. In addition, the Cholesky model can be re-parametrised as
|
|
|
|
|
|
- Direct Symmetric (DS)
|
|
|
|
|
|
model, estimating genetic and residual covariances directly, using the `model` option `DS`. `grmsem` fits, like GREML, all available data to the model. Each model can be adapted by the user by setting free parameters and starting values. Although $k$, the number of different phenotypes is not restricted, in principle, computational demands will typically set a limit based on $k\cdot n\approx30,000$ for Cholesky decomposition models; models using fewer parameters can handle larger $k\cdot n$.
|
|
|
|
|
|
# Download and installation
|
|
|
## Package
|
|
|
The calculations performed by the package grmsem are computationally very demanding. Hence it is highly recommended to optimise the R software prior to installing the package. This can be done within a Linux environment using OpenBLAS (an optimised Basic Linear Algebra Subsystem library), ATLAS (Automatically Tuned Linear Algebra Software) or the Intel MKL (Math Kernel Library) to improve the performance of basic vector and matrix operations (see [here](https://csantill.github.io/RPerformanceWBLAS/) for further information).
|
|
|
|
|
|
The package itself can be installed by standard commands, using either of the two options:
|
|
|
The calculations performed by the package `grmsem` are computationally demanding. Hence it is highly recommended to optimise the R software prior to installing the package. This can be done within a Linux environment using OpenBLAS (an optimised Basic Linear Algebra Subsystem library), ATLAS (Automatically Tuned Linear Algebra Software) or the Intel MKL (Math Kernel Library) to improve the performance of basic vector and matrix operations (see [here](https://csantill.github.io/RPerformanceWBLAS/) for further information). The package itself can be installed by standard commands, using either of the two options:
|
|
|
|
|
|
* `install.packages("grmsem")` (latest CRAN release)
|
|
|
* `devtools::install_git('https://gitlab.gwdg.de/beate.stpourcain/grmsem')` (development version)
|
|
|
|
|
|
grmsem estimations should be run in parallel by setting the `cores` parameter of `grmsem.fit()`. Empirically, good performance can be achieved by setting `cores=4`, while sharing memory across as many cores as possible. For this, the entire node should be blocked so that memory across all cores is available for the job (depending on the system ranging usually between 8-24 cores). For example, the fit of a Cholesky model (12 parameters) to simulated trivariate data with 5000 observations per trait and 20,000 SNPs per genetic factor, requires 1h40min using 4 cores, sharing memory across 24 cores with vmem max of 6.9 Gb, using R MKL 3.6.3.
|
|
|
`grmsem` should be run in parallel by setting the `cores` option of `grmsem.fit()`. Empirically, good performance was achieved with `cores=4`, while sharing memory across as many cores as possible. For this, the entire node should be blocked so that memory across all cores is available for the job (depending on the system ranging usually between 8-24 cores).
|
|
|
Run times and memory requirements for different examples are detailed below.
|
|
|
|
|
|
## Data sets
|
|
|
All data sets used in the vignette and this wiki can be downloaded from here:
|
|
|
|
|
|
* `https://gitlab.gwdg.de/beate.stpourcain/grmsem_external`.
|
|
|
|
|
|
Note that small data sets are already included in the package.
|
|
|
The `small` data set is already included in the package. All data sets used in the vignette can be downloaded [here](https://gitlab.gwdg.de/beate.stpourcain/grmsem_external).
|
|
|
|
|
|
# Input files
|
|
|
|
|
|
## Genetic relationship matrix (GRM)
|
|
|
A GRM consisting of pairs of unrelated individuals (with a 0.05 relatedness cut-off or lower) with genome-wide genotyping information can be estimated using [PLINK](https://www.cog-genomics.org/plink2) or [GCTA](https://cnsgenomics.com/software/gcta/#Overview) software. The lower triangle elements of the GRM can be saved in two different forms:
|
|
|
## Genetic relationship matrix
|
|
|
A Genetic Relationship Matrix (GRM) is a symmetric matrix with entries representing the (standardized) number of mutually shared alleles among individuals of a sample. A GRM, consisting of pairs of unrelated individuals (relatedness cut-off $\leq$ 0.05) with genome-wide information, can be estimated using [PLINK](https://www.cog-genomics.org/plink2) or [GCTA](https://cnsgenomics.com/software/gcta/#Overview) software, of which the lower triangle elements are saved in two different forms:
|
|
|
|
|
|
* **grm.gz files:**
|
|
|
These files have no header line and contain three columns, which are indices of pairs of individuals, number of non-missing SNPs and the estimate of genetic relatedness. The files are generated with the [GCTA](https://cnsgenomics.com/software/gcta/#Overview) `--make-grm-gz` command. The unzipped GRM file can be imported into grmsem using the `grm.input.R()` function.
|
|
|
The lower triangle GRM elements are generated and saved with the [GCTA](https://cnsgenomics.com/software/gcta/#Overview) `--make-grm-gz` command. `grm.gz` files have no header and contain four columns: indices of pairs of individuals (column 1,2; corresponding to row numbers of grm.id files), number of non-missing SNPs (column 3) and the estimate of genetic relatedness (column 4). The compressed GRM file (`grm.gz`) can be imported using the `grm.input()` function, specifying the file name in the working directory, and will be returned as symmetric matrix. An example is shown below:
|
|
|
|
|
|
* **grm-bin files:**
|
|
|
The GRM is saved in binary form and generated with the [GCTA](https://cnsgenomics.com/software/gcta/#Overview) `--make-grm-bin` command. The binary `grm-bin` file can be imported into grmsem using the `grm.bin.input.R()` function.
|
|
|
```{r eval = FALSE}
|
|
|
> G <- grm.input("large.gcta.grm.gz")
|
|
|
> G[1:3,1:3] #Rekationships among the first three individuals
|
|
|
[,1] [,2] [,3]
|
|
|
[1,] 0.99354762 0.02328514 0.01644197
|
|
|
[2,] 0.02328514 0.99406837 0.01021175
|
|
|
[3,] 0.01644197 0.01021175 1.02751472
|
|
|
```
|
|
|
|
|
|
## Phenotype files
|
|
|
**Z-standardised** phenotype files are created in wide-format and need to be imported with individual observations **in the order** of the constructed GRM matrix, as given, for example, in [GCTA](https://cnsgenomics.com/software/gcta/#Overview) grm.id files (2-column file: family ID, individual ID).
|
|
|
* **grm-bin files:**
|
|
|
The lower triangle GRM elements are saved in binary form using the [GCTA](https://cnsgenomics.com/software/gcta/#Overview) `--make-grm-bin` command. The binary `grm.bin` file can be imported using the `grm.bin.input()` function, specifying the file name in the working directory, and will be returned as symmetric matrix.
|
|
|
|
|
|
Phenotype files have a header, but no IDs. The number of phenotypes in the file determines the number of phenotypes **k** in the model. An example of a tri-variate phenotype is shown below:
|
|
|
## Phenotype file
|
|
|
**Z-standardised** scores are required in form of tables (data frames) with each column representing a different phenotype. The number of columns determines the number of phenotypes **k** in the model. The observations must be **in the same order** as the individuals in the columns/rows of the GRM matrix. This order is shown, for example, in the [GCTA](https://cnsgenomics.com/software/gcta/#Overview) .id file (2-column file: family ID, individual ID). An example of a quad-variate phenotype is shown below:
|
|
|
|
|
|
```{r eval = FALSE}
|
|
|
> myph[1:10,]
|
|
|
Y1 Y2 Y3
|
|
|
[1,] -0.37839860 -0.83805596 -0.48236280
|
|
|
[2,] 0.01784131 0.71329485 1.36606815
|
|
|
[3,] 0.41885151 0.69768545 -0.09581534
|
|
|
[4,] -1.06905942 0.07335632 -0.17890886
|
|
|
[5,] -0.69229145 -1.01271372 -2.12097871
|
|
|
[6,] -0.05680556 -0.13606959 -1.09739314
|
|
|
> load("ph.large.RData")
|
|
|
> ph.large[1:2,]
|
|
|
Y1 Y2 Y3 Y4
|
|
|
[1,] -0.7640819 -0.6016908 -0.3981901 -0.3169821
|
|
|
[2,] -0.5099606 0.6671311 -1.3119328 -0.5601261
|
|
|
```
|
|
|
|
|
|
|
|
|
# Examples
|
|
|
To illustrate the functionality of `grmsem`, we carried out several analyses using a range of different [data sets](https://gitlab.gwdg.de/beate.stpourcain/grmsem_external), as described in detail in the vignette.
|
|
|
|
... | ... | |