.. contents:: Quick navigation
   :depth: 2


SKAT analysis
==============

`SKAT <http://www.hsph.harvard.edu/skat/>`_ is a very popular test for the
association of sets of rare and common variants. In :py:mod:`genipe`, we
provide straightforward bindings over SKAT to make it easy to perform parallel
SNP-set analysis of imputed variants.

This tutorial will walk through a full example of an analysis while describing
the expected input file formats and the resulting output files at every
step.


.. _skat-tut-1:

Input files generated by genipe
--------------------------------

Impute2
^^^^^^^^

After running the pipeline, you should have an `Impute2` file containing
genotype probabilities for all the imputed variants. The general structure of
the `Impute2` file contains information on the variant: the chromosome, name,
position, first and second alleles. The subsequent columns correspond to the
probabilities for the AA, AB and BB genotypes for all the variants. Hence, a
single variant is represented by three columns in `Impute2` files.

The following example shows a single line of the `Impute2` file where column
identifiers were added to help explain the file structure. ::

    A B            C        D E 1 2 3 1 2 3 1 2 3 1     2     3
    3 3:20109120:C 20109120 A C 1 0 0 1 0 0 0 0 1 0.998 0.002 0
    3 rs79511213:T 20202955 C T 1 0 0 0 1 0 1 0 0 0.989 0.011 0
    ...

The columns for the previous examples go as follows:

- ``A`` The chromosome.
- ``B`` The variant name.
- ``C`` The genomic position of the variant.
- ``D`` The "A1" allele. This is not necessarily the reference or alternative
  allele.
- ``E`` The "A2" allele. This is not necessarily the reference or alternative
  allele.
- ``1`` All those columns represent the probability of being a A1-A1
  homozygous. It is noticeable that the columns 1, 2 and 3 alternate. This is
  because probabilities are given for all the variants in order. In other
  words, the probabilities are read as triplets.
- ``2`` Columns representing the A1-A2 heterozygous.
- ``3`` Columns representing the A2-A2 heterozygous.

Also note that identifying the minor and major alleles is done internally by
computing the minor allele frequency (MAF) and assigning the rarest allele to
be the minor allele.

.. warning::

    Real `Impute2` files do not have the `A, B, C, ...` row. It was only added
    to facilitate the description of the format.

Samples file
^^^^^^^^^^^^^

This file is generated by :py:mod:`genipe` and has the ``.sample`` extension.
It greatly resembles the PLINK ``fam`` file for those who are familiar with
this format. Specifically, it contains the samples that are included in the
``Impute2`` file **in the correct order**. It is needed to correctly interpret
the sample described by the probability triplets described earlier.

Specifically, the format is as follow: ::

    ID_1 ID_2 missing father mother sex plink_pheno
    0 0 0 D D D B
    SAMPLE1 SAMPLE1 0 0 0 1 -9
    SAMPLE2 SAMPLE2 0 0 0 2 -9
    SAMPLE3 SAMPLE3 0 0 0 2 -9
    SAMPLE4 SAMPLE4 0 0 0 1 -9
    ...

This time, the **first rows are part of the format**. The name of the first two
columns is important for the analysis to work, and omitting the second row will
result in missing samples in the analysis.

The attentive reader will notice that this `sample` file, when compared to the
previously described `Impute2` file yields the following dosage matrix (if we
assume that the A1 is the major allele, which is not true in practice):


.. table:: Dosage and hard-calls derived from the `Impute2` and `.sample` files
           described in this tutorial.

    +-------------------+------------------------+----------------------------+
    |                   | 3:20109120:C           | rs79511213:T               |
    +===================+========================+============================+
    | SAMPLE1           | 0 (A1-A1)              | 0 (A1-A1)                  |
    +-------------------+------------------------+----------------------------+
    | SAMPLE2           | 0 (A1-A1)              | 1 (A1-A2)                  |
    +-------------------+------------------------+----------------------------+
    | SAMPLE3           | 2 (A2-A2)              | 0 (A1-A1)                  |
    +-------------------+------------------------+----------------------------+
    | SAMPLE4           | 0.002 (A1-A1)          | 0.011 (A1-A1)              |
    +-------------------+------------------------+----------------------------+


.. warning::

    Once again, note that for simplicity this example assumes that the A1 is
    the major allele, which is often not true in practice.

Good sites file
^^^^^^^^^^^^^^^^

The `.good_sites` file is also generated by :py:mod:`genipe`. It contains a
single column representing the name of the variants that passed the quality
thresholds.

Only the variants in this file will be considered by the ``imputed-stats``
tool.


.. _skat-tut-2:

Creating a SNP sets file
-------------------------

SKAT is based on the analysis of a "SNP set". This is simply an arbitrary set
of variants that is created by the user. All of the variants in a same SNP set
will be analyzed together by SKAT.

It is also noteworthy that adding neutral variants to a SNP set is generally
avoided as much as possible, as it is known to decrease SKAT's power, even if
true deleterious variants are present in the SNP set. Describing the current
best practices to create SNP sets is out of the scope of this tutorial, but
possible approaches consist of testing variants in a same gene together, using
variant prioritization tools to select interesting candidates or to group
variants according to a biological pathway of interest.

The expected format for SNP sets is very straightforward. A tab separated file
with a single header row containing the following fields: ``variant``,
``snp_set`` is required. An optional third field identified by the ``weight``
header can also be provided. Concretely, the expected file looks like this: ::

    variant         snp_set
    3:20109120:C    SET1
    exm295293       SET1
    3:20225460:T    SET1
    rs143768051:T   SET1
    rs79511213:T    SET2
    rs183736430:T   SET2
    rs375511922:G   SET2
    rs369840848:G   SET2

Here, two gene sets are defined named `SET1` and `SET2`. They both contain 4
variants, which is a small number for a real life application, especially if
the variants are rare.

.. note::

    Don't forget that this file has to be tab (\t) delimited!

Another example using weights is presented here: ::

    variant         snp_set  weight
    3:20109120:C    SET1     1.4075226244343892
    exm295293       SET1     1.8227675243926944
    3:20225460:T    SET1     1.7138009049773757
    rs143768051:T   SET1     1.5197343711318112
    rs79511213:T    SET2     0.6643099547511312
    rs183736430:T   SET2     0.1363741358618047
    rs375511922:G   SET2     0.4431278280542987
    rs369840848:G   SET2     0.1171662895927601
    rs202028804:C   SET2     0.5328902714932127

The only difference is the added `weight` field that is used by SKAT. By
default, SKAT uses a weighting scheme that assigns larger prior weights to rare
variants. If you want to use custom weights based on something la
deleteriousness prediction, you need to generate the weights manually and
provide them in the SNP set file as shown here.


.. _skat-tut-3:

Format for the phenotypes file
-------------------------------

The last required file is the phenotype file. The latter contains information
on the phenotype of interest for all the samples included in the analysis. The
expected format is a tab delimited file containing a user-specified identifier
column (through the ``--sample-column`` argument) and a phenotype column (
represented by the ``--pheno-name`` argument).

The order of the samples is not important, as long as the identifier are
consistent with the ``.sample`` file.

The following file is a valid example: ::

    Sample  Hypertension SystolicBP Age
    SAMPLE1 1            163.21     63
    SAMPLE2 0            122.13     21
    SAMPLE3 0            119.01     42
    SAMPLE4 1            158.84     79

.. note::

    This file has to be tab (\t) delimited!

As you can see, both continuous an discrete traits can be included in this
file, as well as covariates. The covariates are included in the analysis using
the ``--covar`` argument and the variable type of the outcome is given using
the ``--outcome-type`` argument which can be set to `discrete` or `continuous`.


.. _skat-tut-4:

Running the script
-------------------

Once all of this is ready, you can finally run the script. A sample command
for the analysis described throughout this tutorial is the following:

.. code-block:: console
    :linenos:

    imputed-stats skat \
        --impute2 data/genipe_is_great.impute2 \
        --sample data/genipe_is_great.sample \
        --pheno data/phenotypes.txt \
        --sample-column Sample \
        --pheno-name SystolicBP \
        --covar Age \
        --extract-sites data/genipe_is_great.good_sites \
        --nb-process 4 \
        --out my_skat_analysis \
        --snp-sets carefully_crafted_snp_sets.txt \
        --outcome-type continuous \

The line by line explanation of this command is as follows:

1. We are calling the program with the ``skat`` option to specify the type of
   analysis. Alternative analysis types are presented in other tutorials, but
   they include linear and logistic regression as well as Cox survival
   analysis.
2. The `Impute2` file as previously described. This contains the probability
   for every allele of every variant. It is generated by :py:mod:`genipe`.
3. The `sample` contained the sample identifiers with an ordering consistent
   with the `Impute2` file. It was generated by :py:mod:`genipe`.
4. The `phenotype` is a tab delimited file containing phenotypic information
   for all the samples. Covariates are also included in this file.
5. The sample column is the identifier for the column containing sample IDs in
   the phenotype file.
6. The phenotype column is the identifier for the outcome for this analysis. In
   our case, we are using the quantitative trait `SystolicBP` representing the
   systolic blood pressure in this example.
7. The covariates are provided as a comma separated list of column names from
   the phenotype file. In our case, we're only using the `Age` column, but
   adding covariates is easy. As an example `--covar Age,Hypertension` would
   add the `Hypertension` column as a covariate.
8. A list of variants to keep for the analysis is provided here. A good choice
   would be to simply use the `.good_sites` file generated by py:mod:`genipe`,
   but additional QC steps could be used to further narrow this list.
9. The ``--nb-process`` argument determines how many simultaneous SKAT analyses
   will be started in parallel. This value should be a multiple of the number
   of the number of SNP sets or of the number of CPUs on your machine as
   parallelism only occurs at the gene set level.
10. The output prefix for the analysis results.
11. The SNP set file that was created by the user.
12. The outcome type. This should be set to either 'continuous' or 'discrete'
    depending on the type of dependent variable under study.

.. note::

    Also note that the SKAT-O algorithm can be used by using the ``--skat-o``
    flag.


.. _skat-tut-5:

Results
--------

After running this analysis, a file named `my_skat_analysis.skat.dosage` should
appear. This file will contain an association p-value for every one of the
specified SNP sets.

You can also verify the `my_skat_analysis.log` file to see if :py:mod:`genipe`
or `SKAT` generated any warnings.

Finally, a directory containing `.csv` files and an `R` script should also be
visible. This directory contains everything that is needed to run the analysis
in R. You can verify that the generated R script is consistent with your
expectations and that the analysis is correct. If the file sizes are
manageable, you can also archive them to insure reproducibility of the
analysis.

If you are unsatisfied about the generated R script, you can download the
:py:mod:`genipe` source and modify the
`run_skat.R <https://github.com/pgxcentre/genipe/blob/master/genipe/script_templates/run_skat.R>`_
file. This is a template rendered and launched by :py:mod:`genipe` to conduct
the analysis by using the regular R SKAT package. It is easy to extend this
file to customize your analysis or to write new ones.

If you have any questions or problems, don't hesitate to contact us through our
`Github page <https://github.com/pgxcentre>`_ or by creating a new issue in the
`Issue tracker <https://github.com/pgxcentre/genipe/issues>`_.

We are also very open to suggestions and improvements. If you have developed a
new interesting feature, please send us a push request from Github.


.. _skat-tut-6:

Usage
------

The following command will display the documentation for the SKAT analysis in
the console:

.. code-block:: console

   $ imputed-stats skat --help
   usage: imputed-stats skat [-h] [-v] [--debug] --impute2 FILE --sample FILE
                             --pheno FILE [--extract-sites FILE] [--out FILE]
                             [--nb-process INT] [--nb-lines INT] [--chrx]
                             [--gender-column NAME] [--scale INT] [--prob FLOAT]
                             [--maf FLOAT] [--covar NAME] [--categorical NAME]
                             [--missing-value NAME] [--sample-column NAME]
                             [--interaction NAME] --snp-sets FILE --outcome-type
                             {continuous,discrete} [--skat-o] --pheno-name NAME

   Uses the SKAT R package to analyze user defined gene sets. This script is part
   of the 'genipe' package, version 1.4.2.

   optional arguments:
     -h, --help            show this help message and exit
     -v, --version         show program's version number and exit
     --debug               set the logging level to debug

   Input Files:
     --impute2 FILE        The output from IMPUTE2.
     --sample FILE         The sample file (the order should be the same as in
                           the IMPUTE2 files).
     --pheno FILE          The file containing phenotypes and co variables.
     --extract-sites FILE  A list of sites to extract for analysis (optional).

   Output Options:
     --out FILE            The prefix for the output files. [imputed_stats]

   General Options:
     --nb-process INT      The number of process to use. [1]
     --nb-lines INT        The number of line to read at a time. [1000]
     --chrx                The analysis is performed for the non pseudo-autosomal
                           region of the chromosome X (male dosage will be
                           divided by 2 to get values [0, 0.5] instead of [0, 1])
                           (males are coded as 1 and option '--gender-column'
                           should be used).
     --gender-column NAME  The name of the gender column (use to exclude samples
                           with unknown gender (i.e. not 1, male, or 2, female).
                           If gender not available, use 'None'. [Gender]

   Dosage Options:
     --scale INT           Scale dosage so that values are in [0, n] (possible
                           values are 1 (no scaling) or 2). [2]
     --prob FLOAT          The minimal probability for which a genotype should be
                           considered. [>=0.9]
     --maf FLOAT           Minor allele frequency threshold for which marker will
                           be skipped. [<0.01]

   Phenotype Options:
     --covar NAME          The co variable names (in the phenotype file),
                           separated by coma.
     --categorical NAME    The name of the variables that are categorical (note
                           that the gender is always categorical). The variables
                           are separated by coma.
     --missing-value NAME  The missing value in the phenotype file.
     --sample-column NAME  The name of the sample ID column (in the phenotype
                           file). [sample_id]
     --interaction NAME    Add an interaction between the genotype and this
                           variable.

   SKAT Options:
     --snp-sets FILE       A file indicating a snp_set and an optional weight for
                           every variant.
     --outcome-type {continuous,discrete}
                           The variable type for the outcome. This will be passed
                           to SKAT.
     --skat-o              By default, the regular SKAT is used. Setting this
                           flag will use the SKAT-O algorithm instead.
     --pheno-name NAME     The phenotype.