SKAT analysis

SKAT is a very popular test for the association of sets of rare and common variants. In genipe, we provide straightforward bindings over SKAT to make it easy to perform parallel SNP-set analysis of imputed variants.

This tutorial will walk through a full example of an analysis while describing the expected input file formats and the resulting output files at every step.

Input files generated by genipe

Impute2

After running the pipeline, you should have an Impute2 file containing genotype probabilities for all the imputed variants. The general structure of the Impute2 file contains information on the variant: the chromosome, name, position, first and second alleles. The subsequent columns correspond to the probabilities for the AA, AB and BB genotypes for all the variants. Hence, a single variant is represented by three columns in Impute2 files.

The following example shows a single line of the Impute2 file where column identifiers were added to help explain the file structure.

A B            C        D E 1 2 3 1 2 3 1 2 3 1     2     3
3 3:20109120:C 20109120 A C 1 0 0 1 0 0 0 0 1 0.998 0.002 0
3 rs79511213:T 20202955 C T 1 0 0 0 1 0 1 0 0 0.989 0.011 0
...

The columns for the previous examples go as follows:

  • A The chromosome.

  • B The variant name.

  • C The genomic position of the variant.

  • D The “A1” allele. This is not necessarily the reference or alternative allele.

  • E The “A2” allele. This is not necessarily the reference or alternative allele.

  • 1 All those columns represent the probability of being a A1-A1 homozygous. It is noticeable that the columns 1, 2 and 3 alternate. This is because probabilities are given for all the variants in order. In other words, the probabilities are read as triplets.

  • 2 Columns representing the A1-A2 heterozygous.

  • 3 Columns representing the A2-A2 heterozygous.

Also note that identifying the minor and major alleles is done internally by computing the minor allele frequency (MAF) and assigning the rarest allele to be the minor allele.

Warning

Real Impute2 files do not have the A, B, C, … row. It was only added to facilitate the description of the format.

Samples file

This file is generated by genipe and has the .sample extension. It greatly resembles the PLINK fam file for those who are familiar with this format. Specifically, it contains the samples that are included in the Impute2 file in the correct order. It is needed to correctly interpret the sample described by the probability triplets described earlier.

Specifically, the format is as follow:

ID_1 ID_2 missing father mother sex plink_pheno
0 0 0 D D D B
SAMPLE1 SAMPLE1 0 0 0 1 -9
SAMPLE2 SAMPLE2 0 0 0 2 -9
SAMPLE3 SAMPLE3 0 0 0 2 -9
SAMPLE4 SAMPLE4 0 0 0 1 -9
...

This time, the first rows are part of the format. The name of the first two columns is important for the analysis to work, and omitting the second row will result in missing samples in the analysis.

The attentive reader will notice that this sample file, when compared to the previously described Impute2 file yields the following dosage matrix (if we assume that the A1 is the major allele, which is not true in practice):

Dosage and hard-calls derived from the Impute2 and .sample files described in this tutorial.

3:20109120:C

rs79511213:T

SAMPLE1

0 (A1-A1)

0 (A1-A1)

SAMPLE2

0 (A1-A1)

1 (A1-A2)

SAMPLE3

2 (A2-A2)

0 (A1-A1)

SAMPLE4

0.002 (A1-A1)

0.011 (A1-A1)

Warning

Once again, note that for simplicity this example assumes that the A1 is the major allele, which is often not true in practice.

Good sites file

The .good_sites file is also generated by genipe. It contains a single column representing the name of the variants that passed the quality thresholds.

Only the variants in this file will be considered by the imputed-stats tool.

Creating a SNP sets file

SKAT is based on the analysis of a “SNP set”. This is simply an arbitrary set of variants that is created by the user. All of the variants in a same SNP set will be analyzed together by SKAT.

It is also noteworthy that adding neutral variants to a SNP set is generally avoided as much as possible, as it is known to decrease SKAT’s power, even if true deleterious variants are present in the SNP set. Describing the current best practices to create SNP sets is out of the scope of this tutorial, but possible approaches consist of testing variants in a same gene together, using variant prioritization tools to select interesting candidates or to group variants according to a biological pathway of interest.

The expected format for SNP sets is very straightforward. A tab separated file with a single header row containing the following fields: variant, snp_set is required. An optional third field identified by the weight header can also be provided. Concretely, the expected file looks like this:

variant         snp_set
3:20109120:C    SET1
exm295293       SET1
3:20225460:T    SET1
rs143768051:T   SET1
rs79511213:T    SET2
rs183736430:T   SET2
rs375511922:G   SET2
rs369840848:G   SET2

Here, two gene sets are defined named SET1 and SET2. They both contain 4 variants, which is a small number for a real life application, especially if the variants are rare.

Note

Don’t forget that this file has to be tab (t) delimited!

Another example using weights is presented here:

variant         snp_set  weight
3:20109120:C    SET1     1.4075226244343892
exm295293       SET1     1.8227675243926944
3:20225460:T    SET1     1.7138009049773757
rs143768051:T   SET1     1.5197343711318112
rs79511213:T    SET2     0.6643099547511312
rs183736430:T   SET2     0.1363741358618047
rs375511922:G   SET2     0.4431278280542987
rs369840848:G   SET2     0.1171662895927601
rs202028804:C   SET2     0.5328902714932127

The only difference is the added weight field that is used by SKAT. By default, SKAT uses a weighting scheme that assigns larger prior weights to rare variants. If you want to use custom weights based on something la deleteriousness prediction, you need to generate the weights manually and provide them in the SNP set file as shown here.

Format for the phenotypes file

The last required file is the phenotype file. The latter contains information on the phenotype of interest for all the samples included in the analysis. The expected format is a tab delimited file containing a user-specified identifier column (through the --sample-column argument) and a phenotype column ( represented by the --pheno-name argument).

The order of the samples is not important, as long as the identifier are consistent with the .sample file.

The following file is a valid example:

Sample  Hypertension SystolicBP Age
SAMPLE1 1            163.21     63
SAMPLE2 0            122.13     21
SAMPLE3 0            119.01     42
SAMPLE4 1            158.84     79

Note

This file has to be tab (t) delimited!

As you can see, both continuous an discrete traits can be included in this file, as well as covariates. The covariates are included in the analysis using the --covar argument and the variable type of the outcome is given using the --outcome-type argument which can be set to discrete or continuous.

Running the script

Once all of this is ready, you can finally run the script. A sample command for the analysis described throughout this tutorial is the following:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
imputed-stats skat \
    --impute2 data/genipe_is_great.impute2 \
    --sample data/genipe_is_great.sample \
    --pheno data/phenotypes.txt \
    --sample-column Sample \
    --pheno-name SystolicBP \
    --covar Age \
    --extract-sites data/genipe_is_great.good_sites \
    --nb-process 4 \
    --out my_skat_analysis \
    --snp-sets carefully_crafted_snp_sets.txt \
    --outcome-type continuous \

The line by line explanation of this command is as follows:

  1. We are calling the program with the skat option to specify the type of analysis. Alternative analysis types are presented in other tutorials, but they include linear and logistic regression as well as Cox survival analysis.

  2. The Impute2 file as previously described. This contains the probability for every allele of every variant. It is generated by genipe.

  3. The sample contained the sample identifiers with an ordering consistent with the Impute2 file. It was generated by genipe.

  4. The phenotype is a tab delimited file containing phenotypic information for all the samples. Covariates are also included in this file.

  5. The sample column is the identifier for the column containing sample IDs in the phenotype file.

  6. The phenotype column is the identifier for the outcome for this analysis. In our case, we are using the quantitative trait SystolicBP representing the systolic blood pressure in this example.

  7. The covariates are provided as a comma separated list of column names from the phenotype file. In our case, we’re only using the Age column, but adding covariates is easy. As an example –covar Age,Hypertension would add the Hypertension column as a covariate.

  8. A list of variants to keep for the analysis is provided here. A good choice would be to simply use the .good_sites file generated by py:mod:genipe, but additional QC steps could be used to further narrow this list.

  9. The --nb-process argument determines how many simultaneous SKAT analyses will be started in parallel. This value should be a multiple of the number of the number of SNP sets or of the number of CPUs on your machine as parallelism only occurs at the gene set level.

  10. The output prefix for the analysis results.

  11. The SNP set file that was created by the user.

  12. The outcome type. This should be set to either ‘continuous’ or ‘discrete’ depending on the type of dependent variable under study.

Note

Also note that the SKAT-O algorithm can be used by using the --skat-o flag.

Results

After running this analysis, a file named my_skat_analysis.skat.dosage should appear. This file will contain an association p-value for every one of the specified SNP sets.

You can also verify the my_skat_analysis.log file to see if genipe or SKAT generated any warnings.

Finally, a directory containing .csv files and an R script should also be visible. This directory contains everything that is needed to run the analysis in R. You can verify that the generated R script is consistent with your expectations and that the analysis is correct. If the file sizes are manageable, you can also archive them to insure reproducibility of the analysis.

If you are unsatisfied about the generated R script, you can download the genipe source and modify the run_skat.R file. This is a template rendered and launched by genipe to conduct the analysis by using the regular R SKAT package. It is easy to extend this file to customize your analysis or to write new ones.

If you have any questions or problems, don’t hesitate to contact us through our Github page or by creating a new issue in the Issue tracker.

We are also very open to suggestions and improvements. If you have developed a new interesting feature, please send us a push request from Github.

Usage

The following command will display the documentation for the SKAT analysis in the console:

$ imputed-stats skat --help
usage: imputed-stats skat [-h] [-v] [--debug] --impute2 FILE --sample FILE
                          --pheno FILE [--extract-sites FILE] [--out FILE]
                          [--nb-process INT] [--nb-lines INT] [--chrx]
                          [--gender-column NAME] [--scale INT] [--prob FLOAT]
                          [--maf FLOAT] [--covar NAME] [--categorical NAME]
                          [--missing-value NAME] [--sample-column NAME]
                          [--interaction NAME] --snp-sets FILE --outcome-type
                          {continuous,discrete} [--skat-o] --pheno-name NAME

Uses the SKAT R package to analyze user defined gene sets. This script is part
of the 'genipe' package, version 1.4.2.

optional arguments:
  -h, --help            show this help message and exit
  -v, --version         show program's version number and exit
  --debug               set the logging level to debug

Input Files:
  --impute2 FILE        The output from IMPUTE2.
  --sample FILE         The sample file (the order should be the same as in
                        the IMPUTE2 files).
  --pheno FILE          The file containing phenotypes and co variables.
  --extract-sites FILE  A list of sites to extract for analysis (optional).

Output Options:
  --out FILE            The prefix for the output files. [imputed_stats]

General Options:
  --nb-process INT      The number of process to use. [1]
  --nb-lines INT        The number of line to read at a time. [1000]
  --chrx                The analysis is performed for the non pseudo-autosomal
                        region of the chromosome X (male dosage will be
                        divided by 2 to get values [0, 0.5] instead of [0, 1])
                        (males are coded as 1 and option '--gender-column'
                        should be used).
  --gender-column NAME  The name of the gender column (use to exclude samples
                        with unknown gender (i.e. not 1, male, or 2, female).
                        If gender not available, use 'None'. [Gender]

Dosage Options:
  --scale INT           Scale dosage so that values are in [0, n] (possible
                        values are 1 (no scaling) or 2). [2]
  --prob FLOAT          The minimal probability for which a genotype should be
                        considered. [>=0.9]
  --maf FLOAT           Minor allele frequency threshold for which marker will
                        be skipped. [<0.01]

Phenotype Options:
  --covar NAME          The co variable names (in the phenotype file),
                        separated by coma.
  --categorical NAME    The name of the variables that are categorical (note
                        that the gender is always categorical). The variables
                        are separated by coma.
  --missing-value NAME  The missing value in the phenotype file.
  --sample-column NAME  The name of the sample ID column (in the phenotype
                        file). [sample_id]
  --interaction NAME    Add an interaction between the genotype and this
                        variable.

SKAT Options:
  --snp-sets FILE       A file indicating a snp_set and an optional weight for
                        every variant.
  --outcome-type {continuous,discrete}
                        The variable type for the outcome. This will be passed
                        to SKAT.
  --skat-o              By default, the regular SKAT is used. Setting this
                        flag will use the SKAT-O algorithm instead.
  --pheno-name NAME     The phenotype.