Genome-wide imputation pipeline¶
High-throughput genotyping platforms can assess up to five million variations in thousands of samples. Depending on the genotyped data and the reference panel used, genome-wide imputation tools can infer genotypes for more than 38 million variants (single nucleotides, insertions and deletions). Unfortunately, genome-wide imputation requires high computation power and multiple data processing steps.
The genipe
pipeline automates the different steps for pre-phasing and
imputation for genome-wide data. The pipeline follows the guideline described
by IMPUTE2’s best practices when analyzing genome-wide data (described
by IMPUTE2
and by SHAPEIT).
Preparing the data¶
The genipe
module provides a script to prepare all the required files
for this tutorial. It will download the appropriate binary files for Plink,
IMPUTE2 and SHAPETIT, the genotypes of the study cohort to impute, the
IMPUTE2’s reference panels, and the indexed human reference (in fasta
format).
Warning
You should review IMPUTE2 licence and SHAPEIT licence to make sure that you have the right to use these tools.
To following command will execute the script (once the virtual Python environment is activated, see Virtual environment activation for more details).
genipe-tutorial
Note
The script will create a directory called genipe_tutorial
in the home
directory. If another directory is required, the default path can be modify
using the --tutorial-path
option.
To have more information about the input data, have a look at the Data in more details section below. If the imputation is to be run on a server using a job scheduler, have a look at the Using genipe on a server section.
Input file summary¶
You should have the following directory structure:
$HOME/genipe_tutorial/
│
├── 1000GP_Phase3/
│ ├── 1000GP_Phase3_chr1.hap.gz
│ ├── 1000GP_Phase3_chr2.hap.gz
│ ├── ...
│ ├── 1000GP_Phase3_chr1.legend.gz
│ ├── 1000GP_Phase3_chr2.legend.gz
│ ├── ...
│ ├── 1000GP_Phase3.sample
│ ├── genetic_map_chr1_combined_b37.txt
│ ├── genetic_map_chr2_combined_b37.txt
│ └── ...
│
├── bin/
│ ├── impute2
│ ├── plink
│ └── shapeit
│
├── data/
│ ├── hapmap_CEU_r23a_hg19.bed
│ ├── hapmap_CEU_r23a_hg19.bim
│ └── hapmap_CEU_r23a_hg19.fam
│
├── genipe_config.ini # OPTIONAL (--use-drmaa, --drmaa-config)
│
├── hg19/
│ ├── hg19.fasta
│ └── hg19.fasta.fai
│
└── preamble.txt # OPTIONAL (--use-drmaa, --preamble)
Executing the pipeline¶
Once all the input files are ready for analysis, you can finally execute the pipeline. Make sure that the virtual Python environment was properly activated (see Virtual environment activation for more details).
When automatically preparing the data using the genipe-tutorial
script, a
new script named execute.sh
will be created to execute the analysis. Here
is the content of the execute.sh
script.
#!/usr/bin/env bash
# Changing directory
cd $HOME/genipe_tutorial
# Launching the imputation with genipe
genipe-launcher \
--chrom autosomes \
--bfile $HOME/genipe_tutorial/data/hapmap_CEU_r23a_hg19 \
--shapeit-bin $HOME/genipe_tutorial/bin/shapeit \
--impute2-bin $HOME/genipe_tutorial/bin/impute2 \
--plink-bin $HOME/genipe_tutorial/bin/plink \
--reference $HOME/genipe_tutorial/hg19/hg19.fasta \
--hap-template $HOME/genipe_tutorial/1000GP_Phase3/1000GP_Phase3_chr{chrom}.hap.gz \
--legend-template $HOME/genipe_tutorial/1000GP_Phase3/1000GP_Phase3_chr{chrom}.legend.gz \
--map-template $HOME/genipe_tutorial/1000GP_Phase3/genetic_map_chr{chrom}_combined_b37.txt \
--sample-file $HOME/genipe_tutorial/1000GP_Phase3/1000GP_Phase3.sample \
--filtering-rules 'ALL<0.01' 'ALL>0.99' \
--thread 4 \
--report-title "Tutorial" \
--report-number "Test Report"
When in the correct working directory, the following command should execute the genome-wide imputation of the HapMap CEU dataset.
cd $HOME/genipe_tutorial
./execute.sh
Note
In the script execute.sh
, the --reference
option is optional and
might be removed.
Note
It is possible to compress IMPUTE2’s output files by adding the --bgzip
option (the bgzip
software should be in the path).
Note
It is possible to remove the --shapeit-bin
, the --impute2-bin
and/or
the --plink-bin
options if the SHAPEIT, IMPUTE2 and/or the Plink
binaries are in the PATH
variable.
Note
The --chrom autosomes
option will run the imputation only on the
autosomes. If all chromosomes are required (including chromosomes 23 and 25
(pseudo-autosomal region), then the --chrom
option can be removed, and
the imputation reference files for chromosome 23 PAR and nonPAR are required
(see this section for more information).
Note
It is possible to add extra options to SHAPEIT and IMPUTE2 using the
--shapeit-extra
and --impute2-extra
options, respectively.
The following table describes the options used by genipe
in the
previous command (see the Usage section for a full list):
Option |
Description |
---|---|
|
The chromosome to impute (here, the autosomes). |
|
The genotypes of the study cohort to be imputed. |
|
The |
|
The |
|
The |
|
The fasta file containing the reference genome for initial strand verification (optional). |
|
The template for IMPUTE2’s reference haplotype
files ( |
|
The template for IMPUTE2’s reference legend
files ( |
|
The template for IMPUTE2’s reference map
files ( |
|
The name of IMPUTE2’s reference sample file. |
|
Rules used by IMPUTE2 to exclude sites from its reference files (using the legend files). Each terms are joined using a logical OR. Here, loci with an alternative allele frequency lower than 0.01 and higher than 0.99 will be excluded from the imputation reference. |
|
The number of thread to use for the analysis. When using DRMAA, this will be the number of simultaneous tasks. Four threads will be used. |
|
The title of the automatic report. |
|
The number of the report (will appear as sub-title and in the footer of the automatic report). |
Note
If the pipeline fails (e.g. not enough memory or the walltime exceeded on a computing cluster), re-running the pipeline (with different number of thread or different walltime) will only launch the task that were not completed.
The pipeline checks if output files are missing. If an output file is deleted, the step producing this file will be run again (but not the subsequent steps).
Note
Four options will modify the report content: --report-number
,
--report-title
, --report-author
and --report-background
.
Compiling the report¶
A report containing useful information (such as quality metrics and execution time, among others) is automatically generated once the imputation process is completed. To compile the report, perform the following commands:
cd $HOME/genipe_tutorial/genipe/report
make && make clean
This will generate the following
PDF report
(which is named report.pdf
). It is always possible to modify the original
report.tex
file to include analysis specific details (e.g. cohort
description).
Output files¶
All results will be located in the genipe
directory (or whatever
--output-dir
links to). Here is the directory tree summarizing the output
files.
genipe/
│
├── chr1/
│ ├── chr1.1_5000000.impute2
│ ├── chr1.1_5000000.impute2_info
│ ├── chr1.1_5000000.impute2_info_by_sample
│ ├── chr1.1_5000000.impute2_summary
│ ├── chr1.1_5000000.impute2_warnings
│ ├── ...
│ ├── chr1.final.bed
│ ├── chr1.final.bim
│ ├── chr1.final.fam
│ ├── chr1.final.log
│ ├── chr1.final.phased.haps
│ ├── chr1.final.phased.ind.me
│ ├── chr1.final.phased.ind.mm
│ ├── chr1.final.phased.log
│ ├── chr1.final.phased.sample
│ ├── chr1.final.phased.snp.me
│ ├── chr1.final.phased.snp.mm
│ ├── ...
│ │
│ └── final_impute2/
│ ├── chr1.imputed.alleles
│ ├── chr1.imputed.completion_rates
│ ├── chr1.imputed.good_sites
│ ├── chr1.imputed.impute2.gz
│ ├── chr1.imputed.impute2_info
│ ├── chr1.imputed.imputed_sites
│ ├── chr1.imputed.log
│ ├── chr1.imputed.maf
│ ├── chr1.imputed.map
│ └── chr1.imputed.sample
│
├── .../
│
├── chromosome_lengths.txt
├── exclusion_summary.txt
├── frequency_pie.pdf
├── genipe.log
├── markers_to_exclude.txt
├── markers_to_flip.txt
│
├── missing
│ ├── missing.imiss
│ ├── missing.lmiss
│ └── missing.log
│
├── report
│ ├── frequency_pie.pdf
│ ├── Makefile
│ ├── references.bib
│ ├── references.bst
│ └── report.tex
│
└── tasks.db
genipe
directory¶
This directory contains all the chromosome specific analysis. The specific
directory content is describe below. The following files are created inside the
genipe
directory:
File |
Description |
---|---|
|
The length of each chromosome (this information is fetched from Ensembl using its REST API and saved to file). |
|
A file containing the exclusion summary prior to phasing (e.g the number of ambiguous markers, the number of flipped markers, etc.). |
|
This file contains a pie chart describing
the minor allele frequency distribution of
the imputed markers. This file is generated
only if the |
|
|
|
The list of markers to exclude prior to phasing. |
|
The list of markers to flip prior to phasing. |
|
The sqlite database containing information of all tasks (if it’s completed, execution time, etc). |
genipe/chrN
directories¶
The chrN
directories contain the intermediate files, created throughout the
pipeline. The most important files in these directories are the log files (for
errors and summary statistics). There will be one directory per autosomal
chromosomes.
genipe/chrN/final_impute2
directories¶
These final_impute2
directories (located in the genipe/chrN
directories) contain the final output files from the pipeline for each
autosomal chromosomes. They will contain the following files:
Extension |
Description |
---|---|
|
Description of the reference and alternative allele at each sites. |
|
Number of missing values and completion rate for all sites (using the probability threshold set by the user, where the default is higher and equal to 0.9). |
|
List of sites which pass the completion rate threshold (set by the user, where the default is higher and equal to 0.98) using the probability threshold (set by the user, where the default is higher and equal to 0.9). The sites also pass the INFO threshold (set by the user, where the default is higher and equal to 0). |
|
Imputation results (merged from the
individual segment files. This file
might be compress (with the |
|
Marker-wise information file with one line per marker and a single header line at the begening. It contains, among others, the information value which is a measure of the observed statistical information associated with the allele frequency estimate. |
|
List of imputed sites (excluding sites that were previously genotyped in the study cohort). |
|
The log file of the merging step. |
|
File containing the minor allele
frequency (along with minor allele
identification) for all sites using the
probabilitty threshold of 0.9. When no
genotypes are available (because they
are all below the threshold), the MAF is
|
|
A map file describing the genomic location of all sites. |
|
The sample file generated by the phasing step, which describe the sample ordering in the IMPUTE2 files. |
genipe/missing
directory¶
The missing
directory contains the missing rates for both samples
(missing.imiss
) and genotypes markers (missing.lmiss
). Those files are
generated by Plink.
genipe/report
directory¶
This report
directory contains the automatically generated report, which
provides valuable information about the imputation analysis. Such information
contains cross-validation statistics (as provided by IMPUTE2), frequency
statistics and completion rates according to user defined parameters.
The automatic report is generated in the LaTeX
language (file
report.tex
), and can be compile using the following command (as long as
LaTeX
is installed).
cd $HOME/genipe_tutorial/genipe/report
make && make clean
This will generate the following
PDF report
(which is named report.pdf
). It is always possible to modify the original
report.tex
file to include analysis specific details (e.g. cohort
description).
Using genipe on a server¶
DRMAA configuration¶
If the pipeline is to be launch on a computing server, the --use-drmaa
option should be used. This will launch each step on the server using the DRMAA
api. On some cluster, supplemental information is required for each task
(i.e. execution time, number of nodes/processes to reserve). This
parametrization is done using a configuration (ini) file, describing these
parameters for each step.
When providing an empty ini file, the default walltime and number of nodes/processes will be 15 minutes and 1/1, respectively. Otherwise, different parameters can be used for each step. For example, the following configuration will increase the walltime for all phasing tasks from 15 minutes to 3 hours. It will also run each phasing tasks on one node using 12 processes.
[shapeit_phase]
walltime = 03:00:00
nodes = 1
ppn = 12
The following example has the same configuration as the previous example, but will increase the walltime for chromosome 2 to 4 hours, with 1 node and 24 processes.
[shapeit_phase]
walltime = 03:00:00
nodes = 1
ppn = 12
chr2_walltime = 04:00:00
chr2_nodes = 1
chr2_ppn = 24
Since imputation is performed on segments for each chromosome, it is possible to modify the parameters for a single segment. This is usefull when a segment doesn’t have time to finish and its imputation requires a rerun. For example, the following parameters will increase the walltime from 15 minutes to 3.5 hours for segment 10,000,001-15,000,000 on chromosome 1. Also, all segments located on chromosome 2 will have a walltime of 4 hours.
[impute2]
chr1_10000001_15000000_walltime = 03:30:00
chr2_walltime = 04:00:00
We provide a configuration example including all possible section. Also, here is a list of all possible section (i.e pipeline step) that can be parametrized.
plink_exclude
plink_missing_rate
shapeit_check_1
plink_flip
shapeit_check_2
plink_final_exclude
shapeit_phase
impute2
merge_impute2
bgzip
Some cluster doesn’t require any configuration at all. To skip configuration,
use the main
section of the ini file as such:
[main]
skip_drmaa_config = yes
Note
Keep in mind that lines starting with a #
are comments and are not used
in the DRMAA configuration. This is useful to describe what parameters are
used for each step.
DRMAA preamble¶
When using the --use-drmaa
option, the pipeline creates bash script that
are launched on the computing cluster. Some clusters require module to be
loaded and the python virtual environment to be loaded before executing a
script. This is done using the preamble file (the --preamble
option).
The content of the file will be added between the first line of the temporary bash script (the shebang) and the actual command. For example, the following file will load the gcc module (version 4.8.2) and the python virtual environment before launching the task.
# Loading the required module
module load gcc/4.8.2
# The python virtual environment
source $HOME/softwares/python_env/bin/activate
Note
The preamble file is system dependent, but you should always at least
activate the virtual python environment so that the tools provided by
genipe
are automatically in the system path.
Warning
The preamble will be added as-is in the bash script that will be executed. Hence, always be careful of what is included in the preamble.
Data in more details¶
First, we will create a project directory where all the analysis will be performed:
mkdir -p $HOME/genipe_tutorial
cd $HOME/genipe_tutorial
Required softwares¶
The main genipe
pipeline requires three external tools:
Plink,
IMPUTE2 and
SHAPEIT.
These tools are not required to be located in your PATH
variable, since you
can specify each of their location at runtime.
If some of the binaries are missing in you PATH
variable, we will create a
directory where the missing binaries will go:
mkdir -p $HOME/genipe_tutorial/bin
cd $HOME/genipe_tutorial/bin
Plink is freely available. Go to
Plink download page,
read the license, and download the most recent version. Extract the binary and
copy it to the bin
directory previously created.
IMPUTE2 is freely available for academic use. Go to
IMPUTE2 download page,
read the license, and download the most recent version. Extract the binary and
copy it to the bin
directory previously created.
SHAPEIT is freely available for academic use. Go to
SHAPEIT download page,
read the license, and download the most recent version. Extract the binary and
copy it to the bin
directory previously created.
The bin
directory should now contain the missing binaries: impute2
,
plink
and/or shapeit
.
Input files¶
Genotypes¶
The genotypes of the study cohort have to be in Plink’s binary pedfile format.
This format allows for time and disk space optimization. Each dataset is
comprised of three files: the .bed
(binary file, genotype information),
.fam
(text file, sample information) and .bim
(text file, marker
information) files. Those file are generated by Plink.
Genotype data are provided for this tutorial. They are a subset from the files provided by Plink on its resources page. The build was lifted over from GRCh36 to GRCh37 using USCS liftOver. Only CEU samples were kept, along with markers having a completion rate higher or equal to 0.98. To download the data for this tutorial, execute the following command:
mkdir -p $HOME/genipe_tutorial/data
cd $HOME/genipe_tutorial/data
wget http://pgxcentre.github.io/genipe/_static/tutorial/hapmap_CEU_r23a_hg19.tar.bz2
tar -jxf hapmap_CEU_r23a_hg19.tar.bz2
rm hapmap_CEU_r23a_hg19.tar.bz2
Reference panels¶
IMPUTE2 can use publicly available reference datasets. They provide such
dataset on their website. Go to IMPUTE2’s
reference page,
and download the most recent reference data (which is over 12Gb). Once the
reference is downloaded, extract it in the working directory
($HOME/genipe_tutorial
).
The following commands should download the reference files (1000 Genomes phase 3) and extract them in the required directory. Note that the specified URL might change.
cd $HOME/genipe_tutorial
wget https://mathgen.stats.ox.ac.uk/impute/1000GP_Phase3.tgz
tar -zxf 1000GP_Phase3.tgz
rm 1000GP_Phase3.tgz
Human reference (optional)¶
The pipeline include an optional step to check for strand alignment with the reference panel (using SHAPEIT). The drawback of this method is that it is impossible to verify the strand of markers which are absent from the IMPUTE2’s reference. We have introduce a way to check the strand using the reference genome (in fasta format, indexed using faidx).
We have created such reference using the UCSC’s human reference.
mkdir -p $HOME/genipe_tutorial/hg19
cd $HOME/genipe_tutorial/hg19
wget http://statgen.org/wp-content/uploads/Softwares/genipe/supp_files/hg19.tar.bz2
tar -jxf hg19.tar.bz2
rm hg19.tar.bz2