.. contents:: Quick navigation :depth: 2 SKAT analysis ============== `SKAT `_ is a very popular test for the association of sets of rare and common variants. In :py:mod:`genipe`, we provide straightforward bindings over SKAT to make it easy to perform parallel SNP-set analysis of imputed variants. This tutorial will walk through a full example of an analysis while describing the expected input file formats and the resulting output files at every step. .. _skat-tut-1: Input files generated by genipe -------------------------------- Impute2 ^^^^^^^^ After running the pipeline, you should have an `Impute2` file containing genotype probabilities for all the imputed variants. The general structure of the `Impute2` file contains information on the variant: the chromosome, name, position, first and second alleles. The subsequent columns correspond to the probabilities for the AA, AB and BB genotypes for all the variants. Hence, a single variant is represented by three columns in `Impute2` files. The following example shows a single line of the `Impute2` file where column identifiers were added to help explain the file structure. :: A B C D E 1 2 3 1 2 3 1 2 3 1 2 3 3 3:20109120:C 20109120 A C 1 0 0 1 0 0 0 0 1 0.998 0.002 0 3 rs79511213:T 20202955 C T 1 0 0 0 1 0 1 0 0 0.989 0.011 0 ... The columns for the previous examples go as follows: - ``A`` The chromosome. - ``B`` The variant name. - ``C`` The genomic position of the variant. - ``D`` The "A1" allele. This is not necessarily the reference or alternative allele. - ``E`` The "A2" allele. This is not necessarily the reference or alternative allele. - ``1`` All those columns represent the probability of being a A1-A1 homozygous. It is noticeable that the columns 1, 2 and 3 alternate. This is because probabilities are given for all the variants in order. In other words, the probabilities are read as triplets. - ``2`` Columns representing the A1-A2 heterozygous. - ``3`` Columns representing the A2-A2 heterozygous. Also note that identifying the minor and major alleles is done internally by computing the minor allele frequency (MAF) and assigning the rarest allele to be the minor allele. .. warning:: Real `Impute2` files do not have the `A, B, C, ...` row. It was only added to facilitate the description of the format. Samples file ^^^^^^^^^^^^^ This file is generated by :py:mod:`genipe` and has the ``.sample`` extension. It greatly resembles the PLINK ``fam`` file for those who are familiar with this format. Specifically, it contains the samples that are included in the ``Impute2`` file **in the correct order**. It is needed to correctly interpret the sample described by the probability triplets described earlier. Specifically, the format is as follow: :: ID_1 ID_2 missing father mother sex plink_pheno 0 0 0 D D D B SAMPLE1 SAMPLE1 0 0 0 1 -9 SAMPLE2 SAMPLE2 0 0 0 2 -9 SAMPLE3 SAMPLE3 0 0 0 2 -9 SAMPLE4 SAMPLE4 0 0 0 1 -9 ... This time, the **first rows are part of the format**. The name of the first two columns is important for the analysis to work, and omitting the second row will result in missing samples in the analysis. The attentive reader will notice that this `sample` file, when compared to the previously described `Impute2` file yields the following dosage matrix (if we assume that the A1 is the major allele, which is not true in practice): .. table:: Dosage and hard-calls derived from the `Impute2` and `.sample` files described in this tutorial. +-------------------+------------------------+----------------------------+ | | 3:20109120:C | rs79511213:T | +===================+========================+============================+ | SAMPLE1 | 0 (A1-A1) | 0 (A1-A1) | +-------------------+------------------------+----------------------------+ | SAMPLE2 | 0 (A1-A1) | 1 (A1-A2) | +-------------------+------------------------+----------------------------+ | SAMPLE3 | 2 (A2-A2) | 0 (A1-A1) | +-------------------+------------------------+----------------------------+ | SAMPLE4 | 0.002 (A1-A1) | 0.011 (A1-A1) | +-------------------+------------------------+----------------------------+ .. warning:: Once again, note that for simplicity this example assumes that the A1 is the major allele, which is often not true in practice. Good sites file ^^^^^^^^^^^^^^^^ The `.good_sites` file is also generated by :py:mod:`genipe`. It contains a single column representing the name of the variants that passed the quality thresholds. Only the variants in this file will be considered by the ``imputed-stats`` tool. .. _skat-tut-2: Creating a SNP sets file ------------------------- SKAT is based on the analysis of a "SNP set". This is simply an arbitrary set of variants that is created by the user. All of the variants in a same SNP set will be analyzed together by SKAT. It is also noteworthy that adding neutral variants to a SNP set is generally avoided as much as possible, as it is known to decrease SKAT's power, even if true deleterious variants are present in the SNP set. Describing the current best practices to create SNP sets is out of the scope of this tutorial, but possible approaches consist of testing variants in a same gene together, using variant prioritization tools to select interesting candidates or to group variants according to a biological pathway of interest. The expected format for SNP sets is very straightforward. A tab separated file with a single header row containing the following fields: ``variant``, ``snp_set`` is required. An optional third field identified by the ``weight`` header can also be provided. Concretely, the expected file looks like this: :: variant snp_set 3:20109120:C SET1 exm295293 SET1 3:20225460:T SET1 rs143768051:T SET1 rs79511213:T SET2 rs183736430:T SET2 rs375511922:G SET2 rs369840848:G SET2 Here, two gene sets are defined named `SET1` and `SET2`. They both contain 4 variants, which is a small number for a real life application, especially if the variants are rare. .. note:: Don't forget that this file has to be tab (\t) delimited! Another example using weights is presented here: :: variant snp_set weight 3:20109120:C SET1 1.4075226244343892 exm295293 SET1 1.8227675243926944 3:20225460:T SET1 1.7138009049773757 rs143768051:T SET1 1.5197343711318112 rs79511213:T SET2 0.6643099547511312 rs183736430:T SET2 0.1363741358618047 rs375511922:G SET2 0.4431278280542987 rs369840848:G SET2 0.1171662895927601 rs202028804:C SET2 0.5328902714932127 The only difference is the added `weight` field that is used by SKAT. By default, SKAT uses a weighting scheme that assigns larger prior weights to rare variants. If you want to use custom weights based on something la deleteriousness prediction, you need to generate the weights manually and provide them in the SNP set file as shown here. .. _skat-tut-3: Format for the phenotypes file ------------------------------- The last required file is the phenotype file. The latter contains information on the phenotype of interest for all the samples included in the analysis. The expected format is a tab delimited file containing a user-specified identifier column (through the ``--sample-column`` argument) and a phenotype column ( represented by the ``--pheno-name`` argument). The order of the samples is not important, as long as the identifier are consistent with the ``.sample`` file. The following file is a valid example: :: Sample Hypertension SystolicBP Age SAMPLE1 1 163.21 63 SAMPLE2 0 122.13 21 SAMPLE3 0 119.01 42 SAMPLE4 1 158.84 79 .. note:: This file has to be tab (\t) delimited! As you can see, both continuous an discrete traits can be included in this file, as well as covariates. The covariates are included in the analysis using the ``--covar`` argument and the variable type of the outcome is given using the ``--outcome-type`` argument which can be set to `discrete` or `continuous`. .. _skat-tut-4: Running the script ------------------- Once all of this is ready, you can finally run the script. A sample command for the analysis described throughout this tutorial is the following: .. code-block:: console :linenos: imputed-stats skat \ --impute2 data/genipe_is_great.impute2 \ --sample data/genipe_is_great.sample \ --pheno data/phenotypes.txt \ --sample-column Sample \ --pheno-name SystolicBP \ --covar Age \ --extract-sites data/genipe_is_great.good_sites \ --nb-process 4 \ --out my_skat_analysis \ --snp-sets carefully_crafted_snp_sets.txt \ --outcome-type continuous \ The line by line explanation of this command is as follows: 1. We are calling the program with the ``skat`` option to specify the type of analysis. Alternative analysis types are presented in other tutorials, but they include linear and logistic regression as well as Cox survival analysis. 2. The `Impute2` file as previously described. This contains the probability for every allele of every variant. It is generated by :py:mod:`genipe`. 3. The `sample` contained the sample identifiers with an ordering consistent with the `Impute2` file. It was generated by :py:mod:`genipe`. 4. The `phenotype` is a tab delimited file containing phenotypic information for all the samples. Covariates are also included in this file. 5. The sample column is the identifier for the column containing sample IDs in the phenotype file. 6. The phenotype column is the identifier for the outcome for this analysis. In our case, we are using the quantitative trait `SystolicBP` representing the systolic blood pressure in this example. 7. The covariates are provided as a comma separated list of column names from the phenotype file. In our case, we're only using the `Age` column, but adding covariates is easy. As an example `--covar Age,Hypertension` would add the `Hypertension` column as a covariate. 8. A list of variants to keep for the analysis is provided here. A good choice would be to simply use the `.good_sites` file generated by py:mod:`genipe`, but additional QC steps could be used to further narrow this list. 9. The ``--nb-process`` argument determines how many simultaneous SKAT analyses will be started in parallel. This value should be a multiple of the number of the number of SNP sets or of the number of CPUs on your machine as parallelism only occurs at the gene set level. 10. The output prefix for the analysis results. 11. The SNP set file that was created by the user. 12. The outcome type. This should be set to either 'continuous' or 'discrete' depending on the type of dependent variable under study. .. note:: Also note that the SKAT-O algorithm can be used by using the ``--skat-o`` flag. .. _skat-tut-5: Results -------- After running this analysis, a file named `my_skat_analysis.skat.dosage` should appear. This file will contain an association p-value for every one of the specified SNP sets. You can also verify the `my_skat_analysis.log` file to see if :py:mod:`genipe` or `SKAT` generated any warnings. Finally, a directory containing `.csv` files and an `R` script should also be visible. This directory contains everything that is needed to run the analysis in R. You can verify that the generated R script is consistent with your expectations and that the analysis is correct. If the file sizes are manageable, you can also archive them to insure reproducibility of the analysis. If you are unsatisfied about the generated R script, you can download the :py:mod:`genipe` source and modify the `run_skat.R `_ file. This is a template rendered and launched by :py:mod:`genipe` to conduct the analysis by using the regular R SKAT package. It is easy to extend this file to customize your analysis or to write new ones. If you have any questions or problems, don't hesitate to contact us through our `Github page `_ or by creating a new issue in the `Issue tracker `_. We are also very open to suggestions and improvements. If you have developed a new interesting feature, please send us a push request from Github. .. _skat-tut-6: Usage ------ The following command will display the documentation for the SKAT analysis in the console: .. code-block:: console $ imputed-stats skat --help usage: imputed-stats skat [-h] [-v] [--debug] --impute2 FILE --sample FILE --pheno FILE [--extract-sites FILE] [--out FILE] [--nb-process INT] [--nb-lines INT] [--chrx] [--gender-column NAME] [--scale INT] [--prob FLOAT] [--maf FLOAT] [--covar NAME] [--categorical NAME] [--missing-value NAME] [--sample-column NAME] [--interaction NAME] --snp-sets FILE --outcome-type {continuous,discrete} [--skat-o] --pheno-name NAME Uses the SKAT R package to analyze user defined gene sets. This script is part of the 'genipe' package, version 1.4.2. optional arguments: -h, --help show this help message and exit -v, --version show program's version number and exit --debug set the logging level to debug Input Files: --impute2 FILE The output from IMPUTE2. --sample FILE The sample file (the order should be the same as in the IMPUTE2 files). --pheno FILE The file containing phenotypes and co variables. --extract-sites FILE A list of sites to extract for analysis (optional). Output Options: --out FILE The prefix for the output files. [imputed_stats] General Options: --nb-process INT The number of process to use. [1] --nb-lines INT The number of line to read at a time. [1000] --chrx The analysis is performed for the non pseudo-autosomal region of the chromosome X (male dosage will be divided by 2 to get values [0, 0.5] instead of [0, 1]) (males are coded as 1 and option '--gender-column' should be used). --gender-column NAME The name of the gender column (use to exclude samples with unknown gender (i.e. not 1, male, or 2, female). If gender not available, use 'None'. [Gender] Dosage Options: --scale INT Scale dosage so that values are in [0, n] (possible values are 1 (no scaling) or 2). [2] --prob FLOAT The minimal probability for which a genotype should be considered. [>=0.9] --maf FLOAT Minor allele frequency threshold for which marker will be skipped. [<0.01] Phenotype Options: --covar NAME The co variable names (in the phenotype file), separated by coma. --categorical NAME The name of the variables that are categorical (note that the gender is always categorical). The variables are separated by coma. --missing-value NAME The missing value in the phenotype file. --sample-column NAME The name of the sample ID column (in the phenotype file). [sample_id] --interaction NAME Add an interaction between the genotype and this variable. SKAT Options: --snp-sets FILE A file indicating a snp_set and an optional weight for every variant. --outcome-type {continuous,discrete} The variable type for the outcome. This will be passed to SKAT. --skat-o By default, the regular SKAT is used. Setting this flag will use the SKAT-O algorithm instead. --pheno-name NAME The phenotype.