.. contents:: Quick navigation
   :depth: 2


Site extraction
================

Genome-wide imputation datasets can be huge. It is often necessary to extract
a subset of imputed sites (*e.g.* specific markers, a genomic region, or
markers with a specific minor allele frequency, information value or completion
rate). A different output format might also be required, depending on the
downstream analysis (*e.g.* hard calls or dosage values). We provide a tool
that extracts sites from multiple *impute2* files using either marker
identification numbers, or genomic location and/or minor allele frequency
and/or call rate and/or information value.

We assume that you have followed the main :ref:`genipe-tut-page`. The
following command creates the working directory for this tutorial.

.. code-block:: bash

   mkdir -p $HOME/genipe_tutorial/extraction


.. _extract-tut-input-files:

Input files
------------

After running the :py:mod:`genipe` pipeline, all the required files for the
extraction tools are automatically created in the ``final_impute2`` directories
(see the :ref:`genipe-tut-output-files-final_impute2` section in the main
:ref:`genipe-tut-page`).

The files required in these directories depend on the kind of extraction
performed (by name, or by genomic location and/or by minor allele frequency
and/or by call rate and/or by information value).

Once the *impute2* files are provided to the tool, any other file it needs is
fetched automatically.


.. _extract-tut-execute:

Executing the extraction
-------------------------

The first time the tool is used on a set of *impute2* files, the files are
indexed automatically (to speed up future extractions). There are two ways to
extract markers: using their identification number (``--extract``), or using
their properties (``--genomic``, ``--maf``, ``--rate`` and/or ``--info``).

.. note::

   It is possible to extract from multiple *impute2* files at the same time (by
   specifying multiple input files).


Extraction by ID
^^^^^^^^^^^^^^^^^

To extract markers using their identification number, you need a file
containing the list of markers to extract (one marker per line).

.. code-block:: bash

   cd $HOME/genipe_tutorial/extraction

   echo "rs76139713:51137523:C:T" > marker_list.txt
   echo "rs372879164:17037188:A:G" >> marker_list.txt

This ``marker_list.txt`` file will contain the following:

.. code-block:: text

   rs76139713:51137523:C:T
   rs372879164:17037188:A:G

Then, the following command (using the ``--extract`` option) will extract those
two markers from the *impute2* file.

.. code-block:: bash

   impute2-extractor \
       --impute2 ../genipe/chr22/final_impute2/chr22.imputed.impute2.gz \
       --extract marker_list.txt

.. note::

   To gather a list of marker identification numbers, refer to the file
   ``chr22.imputed.map``, which contains the list of all sites in the *impute2*
   file.
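
As an illustration of how such a list could be built, here is a minimal
sketch. It assumes that the ``.map`` file follows a *Plink*-like layout
(chromosome, marker name, genetic position and physical position, separated by
whitespace), which should be checked against your own file. The
``marker_names_in_range`` function is a hypothetical helper and is not part of
:py:mod:`genipe`.

.. code-block:: python

   import io


   def marker_names_in_range(map_lines, start, end):
       """Return the names of the markers located in [start, end].

       Assumed layout (not confirmed by this tutorial): chromosome, name,
       genetic position, physical position.
       """
       names = []
       for line in map_lines:
           fields = line.split()
           if len(fields) < 4:
               continue  # skip blank or malformed lines
           name, position = fields[1], int(fields[3])
           if start <= position <= end:
               names.append(name)
       return names


   # Illustrative lines built from the two markers used above; a real run
   # would iterate over the opened ``chr22.imputed.map`` file instead.
   example_map = io.StringIO(
       "22\trs76139713:51137523:C:T\t0\t51137523\n"
       "22\trs372879164:17037188:A:G\t0\t17037188\n"
   )
   selected = marker_names_in_range(example_map, 17000000, 18000000)
   print("\n".join(selected))

The printed names can be redirected to a file and passed to the ``--extract``
option.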


Extraction by characteristics
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

There are four ways to extract markers according to their characteristics. The
first is to specify the genomic location of the markers to extract (the
``--genomic`` option). The second is to specify a minor allele frequency
threshold (the ``--maf`` option). The third is to specify a call rate
threshold (the ``--rate`` option). The fourth is to specify an information
value threshold (the ``--info`` option). These four options can be combined
(*e.g.* to get markers in a specific genomic range that also meet a call rate
threshold).

For example, to extract markers with a MAF :math:`\geq` 0.05 located in the
*CYP2D6* gene, run the following command:

.. code-block:: bash

   cd $HOME/genipe_tutorial/extraction

   impute2-extractor \
       --impute2 ../genipe/chr22/final_impute2/chr22.imputed.impute2.gz \
       --genomic chr22:42522501-42526883 \
       --maf 0.05 \
       --out cyp2d6_common

To gather all markers with a MAF :math:`\geq` 0.05 and a call rate :math:`\geq`
0.99, run the following command:

.. code-block:: bash

   impute2-extractor \
       --impute2 ../genipe/chr22/final_impute2/chr22.imputed.impute2.gz \
       --maf 0.05 \
       --rate 0.99 \
       --out common_complete
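
The way these criteria combine can be pictured with the short sketch below. It
is only an illustration, not the tool's implementation: ``parse_region`` and
``keep_marker`` are hypothetical helpers, the marker is represented by a plain
dictionary, the region bounds are assumed to be inclusive, and the chromosome
name is compared verbatim.

.. code-block:: python

   def parse_region(region):
       """Split a 'CHR:START-END' string into (chrom, start, end)."""
       chrom, _, span = region.partition(":")
       start, _, end = span.partition("-")
       return chrom, int(start), int(end)


   def keep_marker(marker, region=None, maf=None, rate=None, info=None):
       """Return True if the marker passes every criterion that was set."""
       if region is not None:
           chrom, start, end = region
           # Assumption: both ends of the range are inclusive.
           if marker["chrom"] != chrom or not start <= marker["pos"] <= end:
               return False
       for key, threshold in (("maf", maf), ("rate", rate), ("info", info)):
           # A value equal to the threshold is kept (>= semantics).
           if threshold is not None and marker[key] < threshold:
               return False
       return True


   # Hypothetical marker statistics, for illustration only.
   marker = {"chrom": "chr22", "pos": 42524000,
             "maf": 0.12, "rate": 0.995, "info": 0.8}
   cyp2d6 = parse_region("chr22:42522501-42526883")
   print(keep_marker(marker, region=cyp2d6, maf=0.05))
   print(keep_marker(marker, maf=0.05, rate=0.999))

A criterion left unset is simply ignored, which mirrors the fact that each
option can be used alone or together with the others.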


.. _extract-tut-output-files:

Output files
-------------

The output files depend on the output format selected with the ``--format``
option. You can specify one or more of ``impute2``, ``dosage``, ``calls`` and
``bed``, which respectively produce the *impute2* format (*i.e.* three
probabilities per sample), the *dosage* format (*i.e.* one value between 0 and
2 per sample), hard calls, and binary *Plink* files.

``.impute2`` file
^^^^^^^^^^^^^^^^^^

This file is generated when the ``impute2`` format is used. It has the same
format as the original *impute2* file.

The file contains the following space-delimited columns: the chromosome, the
name of the marker, its position and its two alleles. The subsequent columns
correspond to the probabilities of each genotype (hence, there are three
columns per sample). The first value corresponds to the probability of being
homozygous for the first allele. The second value corresponds to the
probability of being heterozygous. Finally, the third value corresponds to the
probability of being homozygous for the second allele.

The following example shows two lines of the ``.impute2`` file.

.. code-block:: text

   22 rs7289830 16058758 C A 0 0 1 0 0 1 0 1 0 ...
   22 rs6423472 16087621 A G 0 1 0 1 0 0 0 1 0 ...

.. note::

   When extracting using the ``impute2`` format, all the existing companion
   files (``.maf``, ``.map``, etc.) will also be extracted and included in the
   same directory (using the same output prefix).
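
To make the column layout concrete, the following sketch splits one line of
the example above into its marker information and one probability triplet per
sample. ``parse_impute2_line`` is a hypothetical helper written for this
tutorial; it is not part of :py:mod:`genipe`.

.. code-block:: python

   def parse_impute2_line(line):
       """Return (chrom, name, pos, a1, a2, probabilities) for one line."""
       fields = line.split()
       chrom, name, pos, a1, a2 = fields[:5]
       values = [float(value) for value in fields[5:]]
       if len(values) % 3 != 0:
           raise ValueError("expected three probabilities per sample")
       # Group the values three by three: (hom. a1, het., hom. a2).
       probs = [tuple(values[i:i + 3]) for i in range(0, len(values), 3)]
       return chrom, name, int(pos), a1, a2, probs


   # First example line, truncated to the three samples shown.
   example = "22 rs7289830 16058758 C A 0 0 1 0 0 1 0 1 0"
   chrom, name, pos, a1, a2, probs = parse_impute2_line(example)
   print(name, a1, a2, probs)

Here the first two samples are homozygous for the second allele (``A``) and
the third sample is heterozygous.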


``.dosage`` file
^^^^^^^^^^^^^^^^^

This file contains the dosage computed from the *impute2* probabilities. The
file contains the following tab-separated columns: the chromosome, the
position on the chromosome, the marker name, its minor and major alleles, and
the dosage values (one per sample). The dosage values vary between 0 and 2
(inclusive), where values close to 0 represent a higher chance of being
homozygous for the major allele, values close to 1 represent a higher chance
of being heterozygous, and values close to 2 represent a higher chance of
being homozygous for the minor allele.

The following example shows two lines of the ``.dosage`` file.

.. code-block:: text

   22	16058758	rs7289830	C	A	0.0	0.0	1.0	...
   22	16087621	rs6423472	A	G	1.0	2.0	1.0	...

.. note::

   Dosage values computed from probabilities that are below the quality
   threshold (specified by the ``--prob`` option) will have a missing value of
   ``nan``.
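
The relation between the probabilities and the dosage can be sketched as
follows. This is an illustration under stated assumptions rather than the
tool's actual code: the dosage is taken as the expected number of copies of
the minor allele, and a sample is set to ``nan`` when its highest genotype
probability is below the threshold. The ``dosage`` function and its
``minor_is_first`` parameter are hypothetical names.

.. code-block:: python

   import math


   def dosage(probs, minor_is_first, prob_threshold=0.9):
       """Expected number of minor alleles for one sample.

       probs is (P(hom. first allele), P(het.), P(hom. second allele)).
       """
       p_first, p_het, p_second = probs
       # Assumption: the best genotype probability is what gets compared
       # to the threshold.
       if max(probs) < prob_threshold:
           return math.nan
       p_hom_minor = p_first if minor_is_first else p_second
       return p_het + 2 * p_hom_minor


   # Probabilities of rs7289830 (alleles C and A, with C as minor allele).
   samples = [(0, 0, 1), (0, 0, 1), (0, 1, 0)]
   print([dosage(p, minor_is_first=True) for p in samples])

With these inputs the sketch gives 0, 0 and 1 for the three samples, which
agrees with the first line of the ``.dosage`` example above.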


``.calls`` file
^^^^^^^^^^^^^^^^

This file contains the hard calls computed from the *impute2* probabilities. It
has the same format as a transposed pedfile (from *Plink*). The file contains
the following tab-separated columns: the chromosome, the marker name, the
genetic position, the genomic location, and the hard calls (one per sample).

The following example shows two lines of the ``.calls`` file.

.. code-block:: text

   22	rs7289830	0	16058758	A A	A A	C A	...
   22	rs6423472	0	16087621	A G	A A	A G	...

.. note::

   Hard calls computed from probabilities that are below the quality threshold
   (specified by the ``--prob`` option) will have a missing value of ``0 0``.
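
A hard call can be pictured as the most probable genotype, as in the sketch
below. It assumes that the call is set to missing when the highest genotype
probability is below the threshold; ``hard_call`` is a hypothetical helper,
not a function of :py:mod:`genipe`.

.. code-block:: python

   def hard_call(probs, a1, a2, prob_threshold=0.9):
       """Return the most probable genotype, or '0 0' if it is uncertain."""
       genotypes = (f"{a1} {a1}", f"{a1} {a2}", f"{a2} {a2}")
       best = max(range(3), key=lambda i: probs[i])
       # Assumption: a best probability below the threshold is missing.
       if probs[best] < prob_threshold:
           return "0 0"
       return genotypes[best]


   # Probabilities of rs7289830 (first allele C, second allele A).
   samples = [(0, 0, 1), (0, 0, 1), (0, 1, 0)]
   print("\t".join(hard_call(p, "C", "A") for p in samples))

For these three samples the sketch yields ``A A``, ``A A`` and ``C A``, the
same calls as the first line of the ``.calls`` example above.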


Binary *Plink* files
^^^^^^^^^^^^^^^^^^^^^

A set of three files is created (*i.e.* the ``.bed``, ``.bim`` and ``.fam``
files). These are binary *Plink* files containing hard calls.


.. _extract-tut-usage:

Usage
------

The following command will display the documentation for the extraction
analysis in the console:

.. code-block:: console

   $ impute2-extractor --help
   usage: impute2-extractor [-h] [-v] [--debug] --impute2 FILE [--index]
                            [--out PREFIX] [--format FORMAT [FORMAT ...]]
                            [--long] [--prob FLOAT] [--extract FILE]
                            [--genomic CHR:START-END] [--maf FLOAT]
                            [--rate FLOAT] [--info FLOAT]

   Extract imputed markers located in a specific genomic region. This script is
   part of the 'genipe' package, version 1.4.2.

   optional arguments:
     -h, --help            show this help message and exit
     -v, --version         show program's version number and exit
     --debug               set the logging level to debug

   Input Files:
     --impute2 FILE        The output from IMPUTE2.

   Indexation Options:
     --index               Only perform the indexation.

   Output Options:
     --out PREFIX          The prefix of the output files. [impute2_extractor]
     --format FORMAT [FORMAT ...]
                           The output format. Can specify either 'impute2' for
                           probabilities (same as impute2 format, i.e. 3 values
                           per sample), 'dosage' for dosage values (one value
                           between 0 and 2 by sample), 'calls' for hard calls, or
                           'bed' for Plink binary format (with hard calls).
                           ['impute2']
     --long                Write the output file in the long format (one line per
                           sample per marker). This option is only compatible
                           with the 'calls' and 'dosage' format (option '--
                           format').
     --prob FLOAT          The probability threshold used when creating a file in
                           the dosage or call format. [0.9]

   Extraction Options:
     --extract FILE        File containing marker names to extract.
     --genomic CHR:START-END
                           The range to extract (e.g. 22 1000000 1500000). Can be
                           use in combination with '--rate', '--maf' and '--
                           info'.
     --maf FLOAT           Extract markers with a minor allele frequency equal or
                           higher than the specified threshold. Can be use in
                           combination with '--rate', '--info' and '--genomic'.
     --rate FLOAT          Extract markers with a completion rate equal or higher
                           to the specified threshold. Can be use in combination
                           with '--maf', '--info' and '--genomic'.
     --info FLOAT          Extract markers with an information equal or higher to
                           the specified threshold. Can be use in combination
                           with '--maf', '--rate' and '--genomic'.


.. note::

   When using the ``--index`` option, only the indexation (of files without an
   index) will be performed.