... | ... | @@ -10,7 +10,8 @@ The user can select different pre-specified model structures, including |
|
|
|
|
|
by setting the `model` parameter of `gsem.fit()` to `Cholesky`, `Independent`, `IPC` or `Common` respectively. `grmsem` fits, like GREML, all available data to the model. Each model can be adapted by the user by setting free parameters and starting values. Note that the likelihood for ill-specified models may not necessarily reach the global maximum and the model fit should be confirmed using different starting values. Although k, the number of different phenotypes is not restricted, in principle, computational demands will typically set a limit based on k x n \~ 30,000 for Cholesky decomposition models, where n is the number of observations per trait; models using less parameters can handle larger k x n.
|
|
|
|
|
|
# Installation instructions
|
|
|
# Download and installation
|
|
|
## Package
|
|
|
The calculations performed by the package `grmsem` are computationally very demanding. Hence it is highly recommended to optimise the R software prior to installing the package. This can be done within a Linux environment using OpenBLAS (an optimised Basic Linear Algebra Subsystem library), ATLAS (Automatically Tuned Linear Algebra Software) or the Intel MKL (Math Kernel Library) to improve the performance of basic vector and matrix operations (see [here](https://csantill.github.io/RPerformanceWBLAS/) for further information).
|
|
|
|
|
|
The package itself can be installed by standard commands, using either of the two options:
|
... | ... | @@ -18,13 +19,45 @@ The package itself can be installed by standard commands, using either of the tw |
|
|
* `install.packages("grmsem")` to obtain the newest release from CRAN
|
|
|
* `devtools::install_git('https://gitlab.gwdg.de/beate.stpourcain/grmsem')` for the current development version.
|
|
|
|
|
|
All data sets can be downloaded from here: `https://gitlab.gwdg.de/beate.stpourcain/grmsem_external`.
|
|
|
`grmsem` should be run in parallel by setting the `cores` parameter of `gsem.fit()`. Empirically, good performance was achieved by setting `cores=4`, while sharing memory across as many cores as possible. For this, the entire node should be blocked so that memory across all cores is available for the job (depending on the system ranging usually between 8-24 cores). For example, the fit of a Cholesky model (12 parameters) to simulated trivariate data with 5000 observations per trait and 20,000 SNPs per genetic factor, requires 1h40min using 4 cores, sharing memory across 24 cores with vmem max of 6.9 Gb, using R MKL 3.6.3.
|
|
|
|
|
|
## Data sets
|
|
|
All data sets used in the vignette and this wiki can be downloaded from here: `https://gitlab.gwdg.de/beate.stpourcain/grmsem_external`.
|
|
|
Note that small data sets are already included in the package.
|
|
|
|
|
|
# Input files
|
|
|
|
|
|
## Genetic relationship matrix (GRM)
|
|
|
A GRM consisting of pairs of unrelated individuals (with a 0.05 relatedness cut-off or lower) with genome-wide genotyping information can be estimated using [PLINK](https://www.cog-genomics.org/plink2) or [GCTA](https://cnsgenomics.com/software/gcta/#Overview) software. The lower triangle elements of the GRM can be saved in two different forms:
|
|
|
|
|
|
* **grm.gz files:**
|
|
|
These files have no header line and contain three columns, which are indices of pairs of individuals, number of non-missing SNPs and the estimate of genetic relatedness. The files are generated with the [GCTA](https://cnsgenomics.com/software/gcta/#Overview) `--make-grm-gz` command. The unzipped GRM file can be imported into `grmsem` using the `grm.input.R()` function.
|
|
|
|
|
|
* **grm-bin files:**
|
|
|
The GRM is saved in binary form and generated with the [GCTA](https://cnsgenomics.com/software/gcta/#Overview) `--make-grm-bin` command. The binary `grm-bin` file can be imported into `grmsem` using the `grm.bin.input.R()` function.
|
|
|
|
|
|
## Phenotype files
|
|
|
**Z-standardised** phenotype files are created in wide-format and need to be imported with individual observations **in the order** of the constructed GRM matrix, as given, for example, in [GCTA](https://cnsgenomics.com/software/gcta/#Overview) (2-column file: family ID, individual ID).
|
|
|
|
|
|
Phenotype files have a header, but no IDs. The number of phenotypes in the file determines the number of phenotypes **k** in the model. An example of a tri-variate phenotype is shown below:
|
|
|
|
|
|
```{r eval = FALSE}
|
|
|
> myph[1:10,]
|
|
|
Y1 Y2 Y3
|
|
|
[1,] -0.37839860 -0.83805596 -0.48236280
|
|
|
[2,] 0.01784131 0.71329485 1.36606815
|
|
|
[3,] 0.41885151 0.69768545 -0.09581534
|
|
|
[4,] -1.06905942 0.07335632 -0.17890886
|
|
|
[5,] -0.69229145 -1.01271372 -2.12097871
|
|
|
[6,] -0.05680556 -0.13606959 -1.09739314
|
|
|
```
|
|
|
|
|
|
|
|
|
# Examples
|
|
|
To illustrate the functionality of `grmsem`, we carried out several analyses using a range of different [data sets](https://gitlab.gwdg.de/beate.stpourcain/grmsem_external), as described in detail in the vignette. An example of a large data set, with a defined genetic architecture but high run-time, is shown here.
|
|
|
To illustrate the functionality of `grmsem`, we carried out several analyses using a range of different [data sets](https://gitlab.gwdg.de/beate.stpourcain/grmsem_external), as described in detail in the vignette. An example of a large data set, with a defined genetic architecture but high run-time, is shown below.
|
|
|
|
|
|
**Quad-variate Cholesky decomposition model**
|
|
|
## Quad-variate Cholesky decomposition model
|
|
|
An example of a large data set, with a defined genetic architecture but high run-time, is shown [here](https://gitlab.gwdg.de/beate.stpourcain/grmsem_external), base on the files, `G.large.RData`, `ph.large.RData` and a pre-fitted output model `fit.large.RData`.
|
|
|
|
|
|
* [Data simulation](Data simulation)
|
|
|
* [Model fit and output](Model fit and output)
|
... | ... | |