Genome-wide imputation pipeline

High-throughput genotyping platforms can assess up to five million variations in thousands of samples. Depending on the genotyped data and the reference panel used, genome-wide imputation tools can infer genotypes for more than 38 million variants (single nucleotides, insertions and deletions). Unfortunately, genome-wide imputation requires high computation power and multiple data processing steps.

The genipe pipeline automates the different steps for pre-phasing and imputation for genome-wide data. The pipeline follows the guideline described by IMPUTE2’s best practices when analyzing genome-wide data (described by IMPUTE2 and by SHAPEIT).

Preparing the data

The genipe module provides a script to prepare all the required files for this tutorial. It will download the appropriate binary files for Plink, IMPUTE2 and SHAPETIT, the genotypes of the study cohort to impute, the IMPUTE2’s reference panels, and the indexed human reference (in fasta format).

Warning

You should review IMPUTE2 licence and SHAPEIT licence to make sure that you have the right to use these tools.

To following command will execute the script (once the virtual Python environment is activated, see Virtual environment activation for more details).

genipe-tutorial

Note

The script will create a directory called genipe_tutorial in the home directory. If another directory is required, the default path can be modify using the --tutorial-path option.

To have more information about the input data, have a look at the Data in more details section below. If the imputation is to be run on a server using a job scheduler, have a look at the Using genipe on a server section.

Input file summary

You should have the following directory structure:

$HOME/genipe_tutorial/
│
├── 1000GP_Phase3/
│   ├── 1000GP_Phase3_chr1.hap.gz
│   ├── 1000GP_Phase3_chr2.hap.gz
│   ├── ...
│   ├── 1000GP_Phase3_chr1.legend.gz
│   ├── 1000GP_Phase3_chr2.legend.gz
│   ├── ...
│   ├── 1000GP_Phase3.sample
│   ├── genetic_map_chr1_combined_b37.txt
│   ├── genetic_map_chr2_combined_b37.txt
│   └── ...
│
├── bin/
│   ├── impute2
│   ├── plink
│   └── shapeit
│
├── data/
│   ├── hapmap_CEU_r23a_hg19.bed
│   ├── hapmap_CEU_r23a_hg19.bim
│   └── hapmap_CEU_r23a_hg19.fam
│
├── genipe_config.ini  # OPTIONAL (--use-drmaa, --drmaa-config)
│
├── hg19/
│   ├── hg19.fasta
│   └── hg19.fasta.fai
│
└── preamble.txt     # OPTIONAL (--use-drmaa, --preamble)

Executing the pipeline

Once all the input files are ready for analysis, you can finally execute the pipeline. Make sure that the virtual Python environment was properly activated (see Virtual environment activation for more details).

When automatically preparing the data using the genipe-tutorial script, a new script named execute.sh will be created to execute the analysis. Here is the content of the execute.sh script.

#!/usr/bin/env bash
# Changing directory
cd $HOME/genipe_tutorial

# Launching the imputation with genipe
genipe-launcher \
    --chrom autosomes \
    --bfile $HOME/genipe_tutorial/data/hapmap_CEU_r23a_hg19 \
    --shapeit-bin $HOME/genipe_tutorial/bin/shapeit \
    --impute2-bin $HOME/genipe_tutorial/bin/impute2 \
    --plink-bin $HOME/genipe_tutorial/bin/plink \
    --reference $HOME/genipe_tutorial/hg19/hg19.fasta \
    --hap-template $HOME/genipe_tutorial/1000GP_Phase3/1000GP_Phase3_chr{chrom}.hap.gz \
    --legend-template $HOME/genipe_tutorial/1000GP_Phase3/1000GP_Phase3_chr{chrom}.legend.gz \
    --map-template $HOME/genipe_tutorial/1000GP_Phase3/genetic_map_chr{chrom}_combined_b37.txt \
    --sample-file $HOME/genipe_tutorial/1000GP_Phase3/1000GP_Phase3.sample \
    --filtering-rules 'ALL<0.01' 'ALL>0.99' \
    --thread 4 \
    --report-title "Tutorial" \
    --report-number "Test Report"

When in the correct working directory, the following command should execute the genome-wide imputation of the HapMap CEU dataset.

cd $HOME/genipe_tutorial

./execute.sh

Note

In the script execute.sh, the --reference option is optional and might be removed.

Note

It is possible to compress IMPUTE2’s output files by adding the --bgzip option (the bgzip software should be in the path).

Note

It is possible to remove the --shapeit-bin, the --impute2-bin and/or the --plink-bin options if the SHAPEIT, IMPUTE2 and/or the Plink binaries are in the PATH variable.

Note

The --chrom autosomes option will run the imputation only on the autosomes. If all chromosomes are required (including chromosomes 23 and 25 (pseudo-autosomal region), then the --chrom option can be removed, and the imputation reference files for chromosome 23 PAR and nonPAR are required (see this section for more information).

Note

It is possible to add extra options to SHAPEIT and IMPUTE2 using the --shapeit-extra and --impute2-extra options, respectively.

The following table describes the options used by genipe in the previous command (see the Usage section for a full list):

Option Description
--chrom The chromosome to impute (here, the autosomes).
--bfile The genotypes of the study cohort to be imputed.
--shapeit-bin The shapeit binary.
--impute2-bin The impute2 binary.
--plink-bin The plink binary.
--reference The fasta file containing the reference genome for initial strand verification (optional).
--hap-template The template for IMPUTE2‘s reference haplotype files ({chrom} will be replaced by the chromosome number).
--legend-template The template for IMPUTE2‘s reference legend files ({chrom} will be replaced by the chromosome number).
--map-template The template for IMPUTE2‘s reference map files ({chrom} will be replaced by the chromosome number).
--sample-file The name of IMPUTE2‘s reference sample file.
--filtering-rules Rules used by IMPUTE2 to exclude sites from its reference files (using the legend files). Each terms are joined using a logical OR. Here, loci with an alternative allele frequency lower than 0.01 and higher than 0.99 will be excluded from the imputation reference.
--thread The number of thread to use for the analysis. When using DRMAA, this will be the number of simultaneous tasks. Four threads will be used.
--report-title The title of the automatic report.
--report-number The number of the report (will appear as sub-title and in the footer of the automatic report).

Note

If the pipeline fails (e.g. not enough memory or the walltime exceeded on a computing cluster), re-running the pipeline (with different number of thread or different walltime) will only launch the task that were not completed.

The pipeline checks if output files are missing. If an output file is deleted, the step producing this file will be run again (but not the subsequent steps).

Note

Four options will modify the report content: --report-number, --report-title, --report-author and --report-background.

Compiling the report

A report containing useful information (such as quality metrics and execution time, among others) is automatically generated once the imputation process is completed. To compile the report, perform the following commands:

cd $HOME/genipe_tutorial/genipe/report

make && make clean

This will generate the following PDF report (which is named report.pdf). It is always possible to modify the original report.tex file to include analysis specific details (e.g. cohort description).

Output files

All results will be located in the genipe directory (or whatever --output-dir links to). Here is the directory tree summarizing the output files.

genipe/
│
├── chr1/
│   ├── chr1.1_5000000.impute2
│   ├── chr1.1_5000000.impute2_info
│   ├── chr1.1_5000000.impute2_info_by_sample
│   ├── chr1.1_5000000.impute2_summary
│   ├── chr1.1_5000000.impute2_warnings
│   ├── ...
│   ├── chr1.final.bed
│   ├── chr1.final.bim
│   ├── chr1.final.fam
│   ├── chr1.final.log
│   ├── chr1.final.phased.haps
│   ├── chr1.final.phased.ind.me
│   ├── chr1.final.phased.ind.mm
│   ├── chr1.final.phased.log
│   ├── chr1.final.phased.sample
│   ├── chr1.final.phased.snp.me
│   ├── chr1.final.phased.snp.mm
│   ├── ...
│   │
│   └── final_impute2/
│       ├── chr1.imputed.alleles
│       ├── chr1.imputed.completion_rates
│       ├── chr1.imputed.good_sites
│       ├── chr1.imputed.impute2.gz
│       ├── chr1.imputed.impute2_info
│       ├── chr1.imputed.imputed_sites
│       ├── chr1.imputed.log
│       ├── chr1.imputed.maf
│       ├── chr1.imputed.map
│       └── chr1.imputed.sample
│
├── .../
│
├── chromosome_lengths.txt
├── exclusion_summary.txt
├── frequency_pie.pdf
├── genipe.log
├── markers_to_exclude.txt
├── markers_to_flip.txt
│
├── missing
│   ├── missing.imiss
│   ├── missing.lmiss
│   └── missing.log
│
├── report
│   ├── frequency_pie.pdf
│   ├── Makefile
│   ├── references.bib
│   ├── references.bst
│   └── report.tex
│
└── tasks.db

genipe directory

This directory contains all the chromosome specific analysis. The specific directory content is describe below. The following files are created inside the genipe directory:

File Description
chromosome_lengths.txt The length of each chromosome (this information is fetched from Ensembl using its REST API and saved to file).
exclusion_summary.txt A file containing the exclusion summary prior to phasing (e.g the number of ambiguous markers, the number of flipped markers, etc.).
frequency_pie.pdf This file contains a pie chart describing the minor allele frequency distribution of the imputed markers. This file is generated only if the matplotlib module is installed.
genipe.log | The log file of the main pipeline.
markers_to_exclude.txt The list of markers to exclude prior to phasing.
markers_to_flip.txt The list of markers to flip prior to phasing.
tasks.db The sqlite database containing information of all tasks (if it’s completed, execution time, etc).

genipe/chrN directories

The chrN directories contain the intermediate files, created throughout the pipeline. The most important files in these directories are the log files (for errors and summary statistics). There will be one directory per autosomal chromosomes.

genipe/chrN/final_impute2 directories

These final_impute2 directories (located in the genipe/chrN directories) contain the final output files from the pipeline for each autosomal chromosomes. They will contain the following files:

Extension Description
.imputed.alleles Description of the reference and alternative allele at each sites.
.imputed.completion_rates Number of missing values and completion rate for all sites (using the probability threshold set by the user, where the default is higher and equal to 0.9).
.imputed.good_sites List of sites which pass the completion rate threshold (set by the user, where the default is higher and equal to 0.98) using the probability threshold (set by the user, where the default is higher and equal to 0.9). The sites also pass the INFO threshold (set by the user, where the default is higher and equal to 0).
.imputed.impute2 or .imputed.impute2.gz Imputation results (merged from the individual segment files. This file might be compress (with the .gz extension) if the --bgzip option was used when launching the pipeline.
.imputed.impute2_info Marker-wise information file with one line per marker and a single header line at the begening. It contains, among others, the information value which is a measure of the observed statistical information associated with the allele frequency estimate.
.imputed.imputed_sites List of imputed sites (excluding sites that were previously genotyped in the study cohort).
.imputed.log The log file of the merging step.
.imputed.maf File containing the minor allele frequency (along with minor allele identification) for all sites using the probabilitty threshold of 0.9. When no genotypes are available (because they are all below the threshold), the MAF is NA.
.imputed.map A map file describing the genomic location of all sites.
.imputed.sample The sample file generated by the phasing step, which describe the sample ordering in the IMPUTE2 files.

genipe/missing directory

The missing directory contains the missing rates for both samples (missing.imiss) and genotypes markers (missing.lmiss). Those files are generated by Plink.

genipe/report directory

This report directory contains the automatically generated report, which provides valuable information about the imputation analysis. Such information contains cross-validation statistics (as provided by IMPUTE2), frequency statistics and completion rates according to user defined parameters.

The automatic report is generated in the LaTeX language (file report.tex), and can be compile using the following command (as long as LaTeX is installed).

cd $HOME/genipe_tutorial/genipe/report
make && make clean

This will generate the following PDF report (which is named report.pdf). It is always possible to modify the original report.tex file to include analysis specific details (e.g. cohort description).

Using genipe on a server

DRMAA configuration

If the pipeline is to be launch on a computing server, the --use-drmaa option should be used. This will launch each step on the server using the DRMAA api. On some cluster, supplemental information is required for each task (i.e. execution time, number of nodes/processes to reserve). This parametrization is done using a configuration (ini) file, describing these parameters for each step.

When providing an empty ini file, the default walltime and number of nodes/processes will be 15 minutes and 1/1, respectively. Otherwise, different parameters can be used for each step. For example, the following configuration will increase the walltime for all phasing tasks from 15 minutes to 3 hours. It will also run each phasing tasks on one node using 12 processes.

[shapeit_phase]
walltime = 03:00:00
nodes    = 1
ppn      = 12

The following example has the same configuration as the previous example, but will increase the walltime for chromosome 2 to 4 hours, with 1 node and 24 processes.

[shapeit_phase]
walltime = 03:00:00
nodes    = 1
ppn      = 12

chr2_walltime = 04:00:00
chr2_nodes    = 1
chr2_ppn      = 24

Since imputation is performed on segments for each chromosome, it is possible to modify the parameters for a single segment. This is usefull when a segment doesn’t have time to finish and its imputation requires a rerun. For example, the following parameters will increase the walltime from 15 minutes to 3.5 hours for segment 10,000,001-15,000,000 on chromosome 1. Also, all segments located on chromosome 2 will have a walltime of 4 hours.

[impute2]
chr1_10000001_15000000_walltime = 03:30:00

chr2_walltime = 04:00:00

We provide a configuration example including all possible section. Also, here is a list of all possible section (i.e pipeline step) that can be parametrized.

  • plink_exclude
  • plink_missing_rate
  • shapeit_check_1
  • plink_flip
  • shapeit_check_2
  • plink_final_exclude
  • shapeit_phase
  • impute2
  • merge_impute2
  • bgzip

Some cluster doesn’t require any configuration at all. To skip configuration, use the main section of the ini file as such:

[main]
skip_drmaa_config = yes

Note

Keep in mind that lines starting with a # are comments and are not used in the DRMAA configuration. This is useful to describe what parameters are used for each step.

DRMAA preamble

When using the --use-drmaa option, the pipeline creates bash script that are launched on the computing cluster. Some clusters require module to be loaded and the python virtual environment to be loaded before executing a script. This is done using the preamble file (the --preamble option).

The content of the file will be added between the first line of the temporary bash script (the shebang) and the actual command. For example, the following file will load the gcc module (version 4.8.2) and the python virtual environment before launching the task.

# Loading the required module
module load gcc/4.8.2

# The python virtual environment
source $HOME/softwares/python_env/bin/activate

Note

The preamble file is system dependent, but you should always at least activate the virtual python environment so that the tools provided by genipe are automatically in the system path.

Warning

The preamble will be added as-is in the bash script that will be executed. Hence, always be careful of what is included in the preamble.

Data in more details

First, we will create a project directory where all the analysis will be performed:

mkdir -p $HOME/genipe_tutorial
cd $HOME/genipe_tutorial

Required softwares

The main genipe pipeline requires three external tools: Plink, IMPUTE2 and SHAPEIT. These tools are not required to be located in your PATH variable, since you can specify each of their location at runtime.

If some of the binaries are missing in you PATH variable, we will create a directory where the missing binaries will go:

mkdir -p $HOME/genipe_tutorial/bin
cd $HOME/genipe_tutorial/bin

Plink is freely available. Go to Plink download page, read the license, and download the most recent version. Extract the binary and copy it to the bin directory previously created.

IMPUTE2 is freely available for academic use. Go to IMPUTE2 download page, read the license, and download the most recent version. Extract the binary and copy it to the bin directory previously created.

SHAPEIT is freely available for academic use. Go to SHAPEIT download page, read the license, and download the most recent version. Extract the binary and copy it to the bin directory previously created.

The bin directory should now contain the missing binaries: impute2, plink and/or shapeit.

Input files

Genotypes

The genotypes of the study cohort have to be in Plink’s binary pedfile format. This format allows for time and disk space optimization. Each dataset is comprised of three files: the .bed (binary file, genotype information), .fam (text file, sample information) and .bim (text file, marker information) files. Those file are generated by Plink.

Genotype data are provided for this tutorial. They are a subset from the files provided by Plink on its resources page. The build was lifted over from GRCh36 to GRCh37 using USCS liftOver. Only CEU samples were kept, along with markers having a completion rate higher or equal to 0.98. To download the data for this tutorial, execute the following command:

mkdir -p $HOME/genipe_tutorial/data
cd $HOME/genipe_tutorial/data

wget http://pgxcentre.github.io/genipe/_static/tutorial/hapmap_CEU_r23a_hg19.tar.bz2
tar -jxf hapmap_CEU_r23a_hg19.tar.bz2
rm hapmap_CEU_r23a_hg19.tar.bz2

Reference panels

IMPUTE2 can use publicly available reference datasets. They provide such dataset on their website. Go to IMPUTE2’s reference page, and download the most recent reference data (which is over 12Gb). Once the reference is downloaded, extract it in the working directory ($HOME/genipe_tutorial).

The following commands should download the reference files (1000 Genomes phase 3) and extract them in the required directory. Note that the specified URL might change.

cd $HOME/genipe_tutorial

wget https://mathgen.stats.ox.ac.uk/impute/1000GP_Phase3.tgz
tar -zxf 1000GP_Phase3.tgz
rm 1000GP_Phase3.tgz

Human reference (optional)

The pipeline include an optional step to check for strand alignment with the reference panel (using SHAPEIT). The drawback of this method is that it is impossible to verify the strand of markers which are absent from the IMPUTE2‘s reference. We have introduce a way to check the strand using the reference genome (in fasta format, indexed using faidx).

We have created such reference using the UCSC’s human reference.

mkdir -p $HOME/genipe_tutorial/hg19
cd $HOME/genipe_tutorial/hg19

wget http://statgen.org/wp-content/uploads/Softwares/genipe/supp_files/hg19.tar.bz2
tar -jxf hg19.tar.bz2
rm hg19.tar.bz2