SKAT analysis¶
SKAT is a very popular test for the
association of sets of rare and common variants. In genipe
, we
provide straightforward bindings over SKAT to make it easy to perform parallel
SNP-set analysis of imputed variants.
This tutorial will walk through a full example of an analysis while describing the expected input file formats and the resulting output files at every step.
Input files generated by genipe¶
Impute2¶
After running the pipeline, you should have an Impute2 file containing genotype probabilities for all the imputed variants. The general structure of the Impute2 file contains information on the variant: the chromosome, name, position, first and second alleles. The subsequent columns correspond to the probabilities for the AA, AB and BB genotypes for all the variants. Hence, a single variant is represented by three columns in Impute2 files.
The following example shows a single line of the Impute2 file where column identifiers were added to help explain the file structure.
A B C D E 1 2 3 1 2 3 1 2 3 1 2 3
3 3:20109120:C 20109120 A C 1 0 0 1 0 0 0 0 1 0.998 0.002 0
3 rs79511213:T 20202955 C T 1 0 0 0 1 0 1 0 0 0.989 0.011 0
...
The columns for the previous examples go as follows:
A
The chromosome.B
The variant name.C
The genomic position of the variant.D
The “A1” allele. This is not necessarily the reference or alternative allele.E
The “A2” allele. This is not necessarily the reference or alternative allele.1
All those columns represent the probability of being a A1-A1 homozygous. It is noticeable that the columns 1, 2 and 3 alternate. This is because probabilities are given for all the variants in order. In other words, the probabilities are read as triplets.2
Columns representing the A1-A2 heterozygous.3
Columns representing the A2-A2 heterozygous.
Also note that identifying the minor and major alleles is done internally by computing the minor allele frequency (MAF) and assigning the rarest allele to be the minor allele.
Warning
Real Impute2 files do not have the A, B, C, … row. It was only added to facilitate the description of the format.
Samples file¶
This file is generated by genipe
and has the .sample
extension.
It greatly resembles the PLINK fam
file for those who are familiar with
this format. Specifically, it contains the samples that are included in the
Impute2
file in the correct order. It is needed to correctly interpret
the sample described by the probability triplets described earlier.
Specifically, the format is as follow:
ID_1 ID_2 missing father mother sex plink_pheno
0 0 0 D D D B
SAMPLE1 SAMPLE1 0 0 0 1 -9
SAMPLE2 SAMPLE2 0 0 0 2 -9
SAMPLE3 SAMPLE3 0 0 0 2 -9
SAMPLE4 SAMPLE4 0 0 0 1 -9
...
This time, the first rows are part of the format. The name of the first two columns is important for the analysis to work, and omitting the second row will result in missing samples in the analysis.
The attentive reader will notice that this sample file, when compared to the previously described Impute2 file yields the following dosage matrix (if we assume that the A1 is the major allele, which is not true in practice):
3:20109120:C |
rs79511213:T |
|
---|---|---|
SAMPLE1 |
0 (A1-A1) |
0 (A1-A1) |
SAMPLE2 |
0 (A1-A1) |
1 (A1-A2) |
SAMPLE3 |
2 (A2-A2) |
0 (A1-A1) |
SAMPLE4 |
0.002 (A1-A1) |
0.011 (A1-A1) |
Warning
Once again, note that for simplicity this example assumes that the A1 is the major allele, which is often not true in practice.
Creating a SNP sets file¶
SKAT is based on the analysis of a “SNP set”. This is simply an arbitrary set of variants that is created by the user. All of the variants in a same SNP set will be analyzed together by SKAT.
It is also noteworthy that adding neutral variants to a SNP set is generally avoided as much as possible, as it is known to decrease SKAT’s power, even if true deleterious variants are present in the SNP set. Describing the current best practices to create SNP sets is out of the scope of this tutorial, but possible approaches consist of testing variants in a same gene together, using variant prioritization tools to select interesting candidates or to group variants according to a biological pathway of interest.
The expected format for SNP sets is very straightforward. A tab separated file
with a single header row containing the following fields: variant
,
snp_set
is required. An optional third field identified by the weight
header can also be provided. Concretely, the expected file looks like this:
variant snp_set
3:20109120:C SET1
exm295293 SET1
3:20225460:T SET1
rs143768051:T SET1
rs79511213:T SET2
rs183736430:T SET2
rs375511922:G SET2
rs369840848:G SET2
Here, two gene sets are defined named SET1 and SET2. They both contain 4 variants, which is a small number for a real life application, especially if the variants are rare.
Note
Don’t forget that this file has to be tab (t) delimited!
Another example using weights is presented here:
variant snp_set weight
3:20109120:C SET1 1.4075226244343892
exm295293 SET1 1.8227675243926944
3:20225460:T SET1 1.7138009049773757
rs143768051:T SET1 1.5197343711318112
rs79511213:T SET2 0.6643099547511312
rs183736430:T SET2 0.1363741358618047
rs375511922:G SET2 0.4431278280542987
rs369840848:G SET2 0.1171662895927601
rs202028804:C SET2 0.5328902714932127
The only difference is the added weight field that is used by SKAT. By default, SKAT uses a weighting scheme that assigns larger prior weights to rare variants. If you want to use custom weights based on something la deleteriousness prediction, you need to generate the weights manually and provide them in the SNP set file as shown here.
Format for the phenotypes file¶
The last required file is the phenotype file. The latter contains information
on the phenotype of interest for all the samples included in the analysis. The
expected format is a tab delimited file containing a user-specified identifier
column (through the --sample-column
argument) and a phenotype column (
represented by the --pheno-name
argument).
The order of the samples is not important, as long as the identifier are
consistent with the .sample
file.
The following file is a valid example:
Sample Hypertension SystolicBP Age
SAMPLE1 1 163.21 63
SAMPLE2 0 122.13 21
SAMPLE3 0 119.01 42
SAMPLE4 1 158.84 79
Note
This file has to be tab (t) delimited!
As you can see, both continuous an discrete traits can be included in this
file, as well as covariates. The covariates are included in the analysis using
the --covar
argument and the variable type of the outcome is given using
the --outcome-type
argument which can be set to discrete or continuous.
Running the script¶
Once all of this is ready, you can finally run the script. A sample command for the analysis described throughout this tutorial is the following:
1 2 3 4 5 6 7 8 9 10 11 12 | imputed-stats skat \
--impute2 data/genipe_is_great.impute2 \
--sample data/genipe_is_great.sample \
--pheno data/phenotypes.txt \
--sample-column Sample \
--pheno-name SystolicBP \
--covar Age \
--extract-sites data/genipe_is_great.good_sites \
--nb-process 4 \
--out my_skat_analysis \
--snp-sets carefully_crafted_snp_sets.txt \
--outcome-type continuous \
|
The line by line explanation of this command is as follows:
We are calling the program with the
skat
option to specify the type of analysis. Alternative analysis types are presented in other tutorials, but they include linear and logistic regression as well as Cox survival analysis.The Impute2 file as previously described. This contains the probability for every allele of every variant. It is generated by
genipe
.The sample contained the sample identifiers with an ordering consistent with the Impute2 file. It was generated by
genipe
.The phenotype is a tab delimited file containing phenotypic information for all the samples. Covariates are also included in this file.
The sample column is the identifier for the column containing sample IDs in the phenotype file.
The phenotype column is the identifier for the outcome for this analysis. In our case, we are using the quantitative trait SystolicBP representing the systolic blood pressure in this example.
The covariates are provided as a comma separated list of column names from the phenotype file. In our case, we’re only using the Age column, but adding covariates is easy. As an example –covar Age,Hypertension would add the Hypertension column as a covariate.
A list of variants to keep for the analysis is provided here. A good choice would be to simply use the .good_sites file generated by py:mod:genipe, but additional QC steps could be used to further narrow this list.
The
--nb-process
argument determines how many simultaneous SKAT analyses will be started in parallel. This value should be a multiple of the number of the number of SNP sets or of the number of CPUs on your machine as parallelism only occurs at the gene set level.The output prefix for the analysis results.
The SNP set file that was created by the user.
The outcome type. This should be set to either ‘continuous’ or ‘discrete’ depending on the type of dependent variable under study.
Note
Also note that the SKAT-O algorithm can be used by using the --skat-o
flag.
Results¶
After running this analysis, a file named my_skat_analysis.skat.dosage should appear. This file will contain an association p-value for every one of the specified SNP sets.
You can also verify the my_skat_analysis.log file to see if genipe
or SKAT generated any warnings.
Finally, a directory containing .csv files and an R script should also be visible. This directory contains everything that is needed to run the analysis in R. You can verify that the generated R script is consistent with your expectations and that the analysis is correct. If the file sizes are manageable, you can also archive them to insure reproducibility of the analysis.
If you are unsatisfied about the generated R script, you can download the
genipe
source and modify the
run_skat.R
file. This is a template rendered and launched by genipe
to conduct
the analysis by using the regular R SKAT package. It is easy to extend this
file to customize your analysis or to write new ones.
If you have any questions or problems, don’t hesitate to contact us through our Github page or by creating a new issue in the Issue tracker.
We are also very open to suggestions and improvements. If you have developed a new interesting feature, please send us a push request from Github.
Usage¶
The following command will display the documentation for the SKAT analysis in the console:
$ imputed-stats skat --help
usage: imputed-stats skat [-h] [-v] [--debug] --impute2 FILE --sample FILE
--pheno FILE [--extract-sites FILE] [--out FILE]
[--nb-process INT] [--nb-lines INT] [--chrx]
[--gender-column NAME] [--scale INT] [--prob FLOAT]
[--maf FLOAT] [--covar NAME] [--categorical NAME]
[--missing-value NAME] [--sample-column NAME]
[--interaction NAME] --snp-sets FILE --outcome-type
{continuous,discrete} [--skat-o] --pheno-name NAME
Uses the SKAT R package to analyze user defined gene sets. This script is part
of the 'genipe' package, version 1.4.2.
optional arguments:
-h, --help show this help message and exit
-v, --version show program's version number and exit
--debug set the logging level to debug
Input Files:
--impute2 FILE The output from IMPUTE2.
--sample FILE The sample file (the order should be the same as in
the IMPUTE2 files).
--pheno FILE The file containing phenotypes and co variables.
--extract-sites FILE A list of sites to extract for analysis (optional).
Output Options:
--out FILE The prefix for the output files. [imputed_stats]
General Options:
--nb-process INT The number of process to use. [1]
--nb-lines INT The number of line to read at a time. [1000]
--chrx The analysis is performed for the non pseudo-autosomal
region of the chromosome X (male dosage will be
divided by 2 to get values [0, 0.5] instead of [0, 1])
(males are coded as 1 and option '--gender-column'
should be used).
--gender-column NAME The name of the gender column (use to exclude samples
with unknown gender (i.e. not 1, male, or 2, female).
If gender not available, use 'None'. [Gender]
Dosage Options:
--scale INT Scale dosage so that values are in [0, n] (possible
values are 1 (no scaling) or 2). [2]
--prob FLOAT The minimal probability for which a genotype should be
considered. [>=0.9]
--maf FLOAT Minor allele frequency threshold for which marker will
be skipped. [<0.01]
Phenotype Options:
--covar NAME The co variable names (in the phenotype file),
separated by coma.
--categorical NAME The name of the variables that are categorical (note
that the gender is always categorical). The variables
are separated by coma.
--missing-value NAME The missing value in the phenotype file.
--sample-column NAME The name of the sample ID column (in the phenotype
file). [sample_id]
--interaction NAME Add an interaction between the genotype and this
variable.
SKAT Options:
--snp-sets FILE A file indicating a snp_set and an optional weight for
every variant.
--outcome-type {continuous,discrete}
The variable type for the outcome. This will be passed
to SKAT.
--skat-o By default, the regular SKAT is used. Setting this
flag will use the SKAT-O algorithm instead.
--pheno-name NAME The phenotype.