.. contents:: Quick navigation :depth: 2 .. _genipe-tut-page: Genome-wide imputation pipeline ================================ High-throughput genotyping platforms can assess up to five million variations in thousands of samples. Depending on the genotyped data and the reference panel used, genome-wide imputation tools can infer genotypes for more than 38 million variants (single nucleotides, insertions and deletions). Unfortunately, genome-wide imputation requires high computation power and multiple data processing steps. The :py:mod:`genipe` pipeline automates the different steps for pre-phasing and imputation for genome-wide data. The pipeline follows the guideline described by IMPUTE2's best practices when analyzing genome-wide data (described `by IMPUTE2 `_ and `by SHAPEIT `_). Preparing the data ------------------- The :py:mod:`genipe` module provides a script to prepare all the required files for this tutorial. It will download the appropriate binary files for Plink, IMPUTE2 and SHAPETIT, the genotypes of the study cohort to impute, the IMPUTE2's reference panels, and the indexed human reference (in *fasta* format). .. warning:: You should review `IMPUTE2 licence `_ and `SHAPEIT licence `_ to make sure that you have the right to use these tools. To following command will execute the script (once the virtual Python environment is activated, see :ref:`genipe-pyvenv-activation` for more details). .. code-block:: bash genipe-tutorial .. note:: The script will create a directory called ``genipe_tutorial`` in the home directory. If another directory is required, the default path can be modify using the ``--tutorial-path`` option. To have more information about the input data, have a look at the :ref:`genipe-tut-more-details` section below. If the imputation is to be run on a server using a job scheduler, have a look at the :ref:`genipe-tut-drmaa` section. Input file summary ------------------- You should have the following directory structure: .. code-block:: text $HOME/genipe_tutorial/ │ ├── 1000GP_Phase3/ │ ├── 1000GP_Phase3_chr1.hap.gz │ ├── 1000GP_Phase3_chr2.hap.gz │ ├── ... │ ├── 1000GP_Phase3_chr1.legend.gz │ ├── 1000GP_Phase3_chr2.legend.gz │ ├── ... │ ├── 1000GP_Phase3.sample │ ├── genetic_map_chr1_combined_b37.txt │ ├── genetic_map_chr2_combined_b37.txt │ └── ... │ ├── bin/ │ ├── impute2 │ ├── plink │ └── shapeit │ ├── data/ │ ├── hapmap_CEU_r23a_hg19.bed │ ├── hapmap_CEU_r23a_hg19.bim │ └── hapmap_CEU_r23a_hg19.fam │ ├── genipe_config.ini # OPTIONAL (--use-drmaa, --drmaa-config) │ ├── hg19/ │ ├── hg19.fasta │ └── hg19.fasta.fai │ └── preamble.txt # OPTIONAL (--use-drmaa, --preamble) .. _genipe-tut-execute: Executing the pipeline ----------------------- Once all the input files are ready for analysis, you can finally execute the pipeline. Make sure that the virtual Python environment was properly activated (see :ref:`genipe-pyvenv-activation` for more details). When automatically preparing the data using the ``genipe-tutorial`` script, a new script named ``execute.sh`` will be created to execute the analysis. Here is the content of the ``execute.sh`` script. .. code-block:: bash #!/usr/bin/env bash # Changing directory cd $HOME/genipe_tutorial # Launching the imputation with genipe genipe-launcher \ --chrom autosomes \ --bfile $HOME/genipe_tutorial/data/hapmap_CEU_r23a_hg19 \ --shapeit-bin $HOME/genipe_tutorial/bin/shapeit \ --impute2-bin $HOME/genipe_tutorial/bin/impute2 \ --plink-bin $HOME/genipe_tutorial/bin/plink \ --reference $HOME/genipe_tutorial/hg19/hg19.fasta \ --hap-template $HOME/genipe_tutorial/1000GP_Phase3/1000GP_Phase3_chr{chrom}.hap.gz \ --legend-template $HOME/genipe_tutorial/1000GP_Phase3/1000GP_Phase3_chr{chrom}.legend.gz \ --map-template $HOME/genipe_tutorial/1000GP_Phase3/genetic_map_chr{chrom}_combined_b37.txt \ --sample-file $HOME/genipe_tutorial/1000GP_Phase3/1000GP_Phase3.sample \ --filtering-rules 'ALL<0.01' 'ALL>0.99' \ --thread 4 \ --report-title "Tutorial" \ --report-number "Test Report" When in the correct working directory, the following command should execute the genome-wide imputation of the *HapMap* CEU dataset. .. code-block:: bash cd $HOME/genipe_tutorial ./execute.sh .. note:: In the script ``execute.sh``, the ``--reference`` option is optional and might be removed. .. note:: It is possible to compress IMPUTE2's output files by adding the ``--bgzip`` option (the ``bgzip`` software should be in the path). .. note:: It is possible to remove the ``--shapeit-bin``, the ``--impute2-bin`` and/or the ``--plink-bin`` options if the SHAPEIT, IMPUTE2 and/or the Plink binaries are in the ``PATH`` variable. .. note:: The ``--chrom autosomes`` option will run the imputation only on the autosomes. If all chromosomes are required (including chromosomes 23 and 25 (pseudo-autosomal region), then the ``--chrom`` option can be removed, and the imputation reference files for chromosome 23 PAR and nonPAR are required (see :ref:`this section ` for more information). .. note:: It is possible to add extra options to SHAPEIT and IMPUTE2 using the ``--shapeit-extra`` and ``--impute2-extra`` options, respectively. The following table describes the options **used by** :py:mod:`genipe` **in the previous command** (see the :ref:`genipe-usage` section for a full list): .. table:: +-----------------------+-------------------------------------------------+ | Option | Description | +=======================+=================================================+ | ``--chrom`` | The chromosome to impute (here, the autosomes). | +-----------------------+-------------------------------------------------+ | ``--bfile`` | The genotypes of the study cohort to be imputed.| +-----------------------+-------------------------------------------------+ | ``--shapeit-bin`` | The ``shapeit`` binary. | +-----------------------+-------------------------------------------------+ | ``--impute2-bin`` | The ``impute2`` binary. | +-----------------------+-------------------------------------------------+ | ``--plink-bin`` | The ``plink`` binary. | +-----------------------+-------------------------------------------------+ | ``--reference`` | The *fasta* file containing the reference genome| | | for initial strand verification (*optional*). | +-----------------------+-------------------------------------------------+ | ``--hap-template`` | The template for *IMPUTE2*'s reference haplotype| | | files (``{chrom}`` will be replaced by the | | | chromosome number). | +-----------------------+-------------------------------------------------+ | ``--legend-template`` | The template for *IMPUTE2*'s reference legend | | | files (``{chrom}`` will be replaced by the | | | chromosome number). | +-----------------------+-------------------------------------------------+ | ``--map-template`` | The template for *IMPUTE2*'s reference map | | | files (``{chrom}`` will be replaced by the | | | chromosome number). | +-----------------------+-------------------------------------------------+ | ``--sample-file`` | The name of *IMPUTE2*'s reference sample file. | +-----------------------+-------------------------------------------------+ | ``--filtering-rules`` | Rules used by *IMPUTE2* to exclude sites from | | | its reference files (using the legend files). | | | Each terms are joined using a logical *OR*. | | | Here, loci with an alternative allele frequency | | | lower than 0.01 and higher than 0.99 will be | | | excluded from the imputation reference. | +-----------------------+-------------------------------------------------+ | ``--thread`` | The number of thread to use for the analysis. | | | When using *DRMAA*, this will be the number of | | | simultaneous tasks. Four threads will be used. | +-----------------------+-------------------------------------------------+ | ``--report-title`` | The title of the automatic report. | +-----------------------+-------------------------------------------------+ | ``--report-number`` | The number of the report (will appear as | | | sub-title and in the footer of the automatic | | | report). | +-----------------------+-------------------------------------------------+ .. note:: If the pipeline fails (*e.g.* not enough memory or the walltime exceeded on a computing cluster), re-running the pipeline (with different number of thread or different walltime) will only launch the task that were not completed. The pipeline checks if output files are missing. If an output file is deleted, the step producing this file will be run again (but not the subsequent steps). .. note:: Four options will modify the report content: ``--report-number``, ``--report-title``, ``--report-author`` and ``--report-background``. .. _genipe-tut-compile-report: Compiling the report --------------------- A report containing useful information (such as quality metrics and execution time, among others) is automatically generated once the imputation process is completed. To compile the report, perform the following commands: .. code-block:: bash cd $HOME/genipe_tutorial/genipe/report make && make clean This will generate the following `PDF report `_ (which is named ``report.pdf``). It is always possible to modify the original ``report.tex`` file to include analysis specific details (*e.g.* cohort description). .. _genipe-tut-output-files: Output files ------------- All results will be located in the ``genipe`` directory (or whatever ``--output-dir`` links to). Here is the directory tree summarizing the output files. .. code-block:: text genipe/ │ ├── chr1/ │ ├── chr1.1_5000000.impute2 │ ├── chr1.1_5000000.impute2_info │ ├── chr1.1_5000000.impute2_info_by_sample │ ├── chr1.1_5000000.impute2_summary │ ├── chr1.1_5000000.impute2_warnings │ ├── ... │ ├── chr1.final.bed │ ├── chr1.final.bim │ ├── chr1.final.fam │ ├── chr1.final.log │ ├── chr1.final.phased.haps │ ├── chr1.final.phased.ind.me │ ├── chr1.final.phased.ind.mm │ ├── chr1.final.phased.log │ ├── chr1.final.phased.sample │ ├── chr1.final.phased.snp.me │ ├── chr1.final.phased.snp.mm │ ├── ... │ │ │ └── final_impute2/ │ ├── chr1.imputed.alleles │ ├── chr1.imputed.completion_rates │ ├── chr1.imputed.good_sites │ ├── chr1.imputed.impute2.gz │ ├── chr1.imputed.impute2_info │ ├── chr1.imputed.imputed_sites │ ├── chr1.imputed.log │ ├── chr1.imputed.maf │ ├── chr1.imputed.map │ └── chr1.imputed.sample │ ├── .../ │ ├── chromosome_lengths.txt ├── exclusion_summary.txt ├── frequency_pie.pdf ├── genipe.log ├── markers_to_exclude.txt ├── markers_to_flip.txt │ ├── missing │ ├── missing.imiss │ ├── missing.lmiss │ └── missing.log │ ├── report │ ├── frequency_pie.pdf │ ├── Makefile │ ├── references.bib │ ├── references.bst │ └── report.tex │ └── tasks.db ``genipe`` directory ^^^^^^^^^^^^^^^^^^^^^ This directory contains all the chromosome specific analysis. The specific directory content is describe below. The following files are created inside the ``genipe`` directory: .. table:: +----------------------------+--------------------------------------------+ | File | Description | +============================+============================================+ | ``chromosome_lengths.txt`` | The length of each chromosome (this | | | information is fetched from Ensembl using | | | its REST API and saved to file). | +----------------------------+--------------------------------------------+ | ``exclusion_summary.txt`` | A file containing the exclusion summary | | | prior to phasing (*e.g* the number of | | | ambiguous markers, the number of flipped | | | markers, etc.). | +----------------------------+--------------------------------------------+ | ``frequency_pie.pdf`` | This file contains a pie chart describing | | | the minor allele frequency distribution of | | | the imputed markers. This file is generated| | | only if the :py:mod:`matplotlib` module is | | | installed. | +----------------------------+--------------------------------------------+ | ``genipe.log`` | The log file of the main pipeline. | +----------------------------+--------------------------------------------+ | ``markers_to_exclude.txt`` | The list of markers to exclude prior to | | | phasing. | +----------------------------+--------------------------------------------+ | ``markers_to_flip.txt`` | The list of markers to flip prior to | | | phasing. | +----------------------------+--------------------------------------------+ | ``tasks.db`` | The *sqlite* database containing | | | information of all tasks (if it's | | | completed, execution time, etc). | +----------------------------+--------------------------------------------+ ``genipe/chrN`` directories ^^^^^^^^^^^^^^^^^^^^^^^^^^^^ The ``chrN`` directories contain the intermediate files, created throughout the pipeline. The most important files in these directories are the log files (for errors and summary statistics). There will be one directory per autosomal chromosomes. .. _genipe-tut-output-files-final_impute2: ``genipe/chrN/final_impute2`` directories ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ These ``final_impute2`` directories (located in the ``genipe/chrN`` directories) contain the final output files from the pipeline for each autosomal chromosomes. They will contain the following files: .. table:: +-------------------------------+-----------------------------------------+ | Extension | Description | +===============================+=========================================+ | ``.imputed.alleles`` | Description of the reference and | | | alternative allele at each sites. | +-------------------------------+-----------------------------------------+ | ``.imputed.completion_rates`` | Number of missing values and completion | | | rate for all sites (using the | | | probability threshold set by the user, | | | where the default is higher and equal | | | to 0.9). | +-------------------------------+-----------------------------------------+ | ``.imputed.good_sites`` | List of sites which pass the completion | | | rate threshold (set by the user, where | | | the default is higher and equal to 0.98)| | | using the probability threshold (set by | | | the user, where the default is higher | | | and equal to 0.9). The sites also pass | | | the INFO threshold (set by the user, | | | where the default is higher and equal to| | | 0). | +-------------------------------+-----------------------------------------+ | ``.imputed.impute2`` or | Imputation results (merged from the | | ``.imputed.impute2.gz`` | individual segment files. This file | | | might be compress (with the ``.gz`` | | | extension) if the ``--bgzip`` option was| | | used when launching the pipeline. | +-------------------------------+-----------------------------------------+ | ``.imputed.impute2_info`` | Marker-wise information file with one | | | line per marker and a single header line| | | at the begening. It contains, among | | | others, the information value which is a| | | measure of the observed statistical | | | information associated with the allele | | | frequency estimate. | +-------------------------------+-----------------------------------------+ | ``.imputed.imputed_sites`` | List of imputed sites (excluding sites | | | that were previously genotyped in the | | | study cohort). | +-------------------------------+-----------------------------------------+ | ``.imputed.log`` | The log file of the merging step. | +-------------------------------+-----------------------------------------+ | ``.imputed.maf`` | File containing the minor allele | | | frequency (along with minor allele | | | identification) for all sites using the | | | probabilitty threshold of 0.9. When no | | | genotypes are available (because they | | | are all below the threshold), the MAF is| | | ``NA``. | +-------------------------------+-----------------------------------------+ | ``.imputed.map`` | A *map* file describing the genomic | | | location of all sites. | +-------------------------------+-----------------------------------------+ | ``.imputed.sample`` | The sample file generated by the phasing| | | step, which describe the sample ordering| | | in the IMPUTE2 files. | +-------------------------------+-----------------------------------------+ ``genipe/missing`` directory ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ The ``missing`` directory contains the missing rates for both samples (``missing.imiss``) and genotypes markers (``missing.lmiss``). Those files are generated by Plink. ``genipe/report`` directory ^^^^^^^^^^^^^^^^^^^^^^^^^^^^ This ``report`` directory contains the automatically generated report, which provides valuable information about the imputation analysis. Such information contains cross-validation statistics (as provided by IMPUTE2), frequency statistics and completion rates according to user defined parameters. The automatic report is generated in the ``LaTeX`` language (file ``report.tex``), and can be compile using the following command (as long as ``LaTeX`` is installed). .. code-block:: bash cd $HOME/genipe_tutorial/genipe/report make && make clean This will generate the following `PDF report `_ (which is named ``report.pdf``). It is always possible to modify the original ``report.tex`` file to include analysis specific details (*e.g.* cohort description). .. _genipe-tut-drmaa: Using genipe on a server ------------------------- DRMAA configuration ^^^^^^^^^^^^^^^^^^^^ If the pipeline is to be launch on a computing server, the ``--use-drmaa`` option should be used. This will launch each step on the server using the DRMAA api. On some cluster, supplemental information is required for each task (*i.e.* execution time, number of nodes/processes to reserve). This parametrization is done using a configuration (*ini*) file, describing these parameters for each step. When providing an empty *ini* file, the default walltime and number of nodes/processes will be 15 minutes and 1/1, respectively. Otherwise, different parameters can be used for each step. For example, the following configuration will increase the walltime for all phasing tasks from 15 minutes to 3 hours. It will also run each phasing tasks on one node using 12 processes. .. code-block:: ini [shapeit_phase] walltime = 03:00:00 nodes = 1 ppn = 12 The following example has the same configuration as the previous example, but will increase the walltime for chromosome 2 to 4 hours, with 1 node and 24 processes. .. code-block:: ini [shapeit_phase] walltime = 03:00:00 nodes = 1 ppn = 12 chr2_walltime = 04:00:00 chr2_nodes = 1 chr2_ppn = 24 Since imputation is performed on segments for each chromosome, it is possible to modify the parameters for a single segment. This is usefull when a segment doesn't have time to finish and its imputation requires a rerun. For example, the following parameters will increase the walltime from 15 minutes to 3.5 hours for segment 10,000,001-15,000,000 on chromosome 1. Also, all segments located on chromosome 2 will have a walltime of 4 hours. .. code-block:: ini [impute2] chr1_10000001_15000000_walltime = 03:30:00 chr2_walltime = 04:00:00 We provide a `configuration example `_ including all possible section. Also, here is a list of all possible section (*i.e* pipeline step) that can be parametrized. - ``plink_exclude`` - ``plink_missing_rate`` - ``shapeit_check_1`` - ``plink_flip`` - ``shapeit_check_2`` - ``plink_final_exclude`` - ``shapeit_phase`` - ``impute2`` - ``merge_impute2`` - ``bgzip`` Some cluster doesn't require any configuration at all. To skip configuration, use the ``main`` section of the *ini* file as such: .. code-block:: ini [main] skip_drmaa_config = yes .. note:: Keep in mind that lines starting with a ``#`` are comments and are not used in the DRMAA configuration. This is useful to describe what parameters are used for each step. DRMAA preamble ^^^^^^^^^^^^^^^ When using the ``--use-drmaa`` option, the pipeline creates *bash* script that are launched on the computing cluster. Some clusters require module to be loaded and the python virtual environment to be loaded before executing a script. This is done using the preamble file (the ``--preamble`` option). The content of the file will be added between the first line of the temporary *bash* script (the *shebang*) and the actual command. For example, the following file will load the gcc module (version 4.8.2) and the python virtual environment before launching the task. .. code-block:: bash # Loading the required module module load gcc/4.8.2 # The python virtual environment source $HOME/softwares/python_env/bin/activate .. note:: The preamble file is system dependent, but you should always at least activate the virtual python environment so that the tools provided by :py:mod:`genipe` are automatically in the system path. .. warning:: The preamble will be added **as-is** in the *bash* script that will be executed. Hence, always be careful of what is included in the preamble. .. _genipe-tut-more-details: Data in more details --------------------- First, we will create a project directory where all the analysis will be performed: .. code-block:: bash mkdir -p $HOME/genipe_tutorial cd $HOME/genipe_tutorial .. _genipe-tut-softwares: Required softwares ^^^^^^^^^^^^^^^^^^^ The main :py:mod:`genipe` pipeline requires three external tools: `Plink `_, `IMPUTE2 `_ and `SHAPEIT `_. These tools are not required to be located in your ``PATH`` variable, since you can specify each of their location at runtime. If some of the binaries are missing in you ``PATH`` variable, we will create a directory where the missing binaries will go: .. code-block:: bash mkdir -p $HOME/genipe_tutorial/bin cd $HOME/genipe_tutorial/bin Plink is freely available. Go to `Plink download page `_, read the license, and download the most recent version. Extract the binary and copy it to the ``bin`` directory previously created. IMPUTE2 is freely available for academic use. Go to `IMPUTE2 download page `_, read the license, and download the most recent version. Extract the binary and copy it to the ``bin`` directory previously created. SHAPEIT is freely available for academic use. Go to `SHAPEIT download page `_, read the license, and download the most recent version. Extract the binary and copy it to the ``bin`` directory previously created. The ``bin`` directory should now contain the missing binaries: ``impute2``, ``plink`` and/or ``shapeit``. .. _genipe-tut-input-files: Input files ^^^^^^^^^^^^ Genotypes """""""""" The genotypes of the study cohort have to be in Plink's binary pedfile format. This format allows for time and disk space optimization. Each dataset is comprised of three files: the ``.bed`` (binary file, genotype information), ``.fam`` (text file, sample information) and ``.bim`` (text file, marker information) files. Those file are generated by Plink. Genotype data are provided for this tutorial. They are a subset from the files provided by Plink on its `resources page `_. The build was lifted over from *GRCh36* to *GRCh37* using `USCS liftOver `_. Only CEU samples were kept, along with markers having a completion rate higher or equal to 0.98. To download the data for this tutorial, execute the following command: .. code-block:: bash mkdir -p $HOME/genipe_tutorial/data cd $HOME/genipe_tutorial/data wget http://pgxcentre.github.io/genipe/_static/tutorial/hapmap_CEU_r23a_hg19.tar.bz2 tar -jxf hapmap_CEU_r23a_hg19.tar.bz2 rm hapmap_CEU_r23a_hg19.tar.bz2 Reference panels """"""""""""""""" IMPUTE2 can use publicly available reference datasets. They provide such dataset on their website. Go to IMPUTE2's `reference page `_, and download the most recent reference data (which is over 12Gb). Once the reference is downloaded, extract it in the working directory (``$HOME/genipe_tutorial``). The following commands should download the reference files (1000 Genomes phase 3) and extract them in the required directory. Note that the specified URL might change. .. code-block:: bash cd $HOME/genipe_tutorial wget https://mathgen.stats.ox.ac.uk/impute/1000GP_Phase3.tgz tar -zxf 1000GP_Phase3.tgz rm 1000GP_Phase3.tgz Human reference (optional) """"""""""""""""""""""""""" The pipeline include an optional step to check for strand alignment with the reference panel (using *SHAPEIT*). The drawback of this method is that it is impossible to verify the strand of markers which are absent from the *IMPUTE2*'s reference. We have introduce a way to check the strand using the reference genome (in *fasta* format, indexed using *faidx*). We have created such reference using the `UCSC's human reference `_. .. code-block:: bash mkdir -p $HOME/genipe_tutorial/hg19 cd $HOME/genipe_tutorial/hg19 wget http://statgen.org/wp-content/uploads/Softwares/genipe/supp_files/hg19.tar.bz2 tar -jxf hg19.tar.bz2 rm hg19.tar.bz2