Analysis¶
Multiple statistic models exist. To execute a genome-wide study, you can use
the geneparse
tool.
$ genetest --help
usage: genetest [-h] [-v] [--test] [--nb-cpus NB] --configuration YAML
[--output FILE] [--extract FILE] [--keep FILE] [--maf MAF]
[--sexual-chromosome]
Performs statistical analysis on genotypic data (version 0.4.0).
optional arguments:
-h, --help show this help message and exit
-v, --version show program's version number and exit
--test Execute the test suite and exit.
--nb-cpus NB The number of processes to use for the analysis. [1]
Input Options:
--configuration YAML The configuration file that describe the phenotypes,
genotypes, and model.
Output Options:
--output FILE The output file prefix that will contain the results
and other information. [genetest_results]
Other Options:
--extract FILE A file containing a list of markers to extract prior
to the statistical analysis (one marker per line).
--keep FILE A file containing a list of samples to keep prior to
the statistical analysis (one sample per line).
--maf MAF The MAF threshold to include a marker in the analysis.
[0.01]
--sexual-chromosome Analysis is performed on a sexual chromosome. This
will impact the MAF computation (as males are
hemizygotes on sexual chromosomes). This has an effect
only on a GWAS analysis.
A single configuration file (using the YAML format) describes the genotypes and phenotypes files, and the statistical model to perform. The following describe the different sections.
In all the cases, when values are optional, it needs to be inserted into a
options
subsection, otherwise, the default values will be used (see below
for examples).
Warning
By default, the MAF computed during a GWAS uses the formula for autosomes
(i.e. there is no check for sexual chromosomes). If the analysis is
performed on a sexual chromosome, make sure to use the
--sexual-chromosome
flag and to use the YAML option sex_column
(in
the Phenotype section) to specify the
gender column in the phenotype file.
Also note that when using the --sexual-chromosome
option, all markers
in the genotype file will be treated as being on a sexual chromosome. If
the genotype file contains a mixture of autosomes and sexual chromosomes,
make sure use a combination of the --extract
(extracting only markers
located on a sexual chromosome) and --sexual-chromosome
in order to
properly compute the MAF.
Finally, the column containing the sex in the phenotype file (the
sex_column
option of the Phenotype
section in the YAML file) should have encoding of male=1
and
female=0
.
Genotypes¶
The genotypes
section describes the genetic part of the analysis. Multiple
file formats are available (see below). The required keyword format
describes the file format for the genotypes.
Each of the formats have their own required arguments and options.
impute2¶
The following arguments and options are available for this format.
Argument | Description | Required |
---|---|---|
filename |
The name of the impute2 file. | Yes |
sample_filename |
The name of the sample file. | Yes |
probability_threshold |
The probability threshold. Genotypes with the maximal probability lower to this value will be set as missing. [Default: 0.9] |
Below is an example of a genotypes
section of the YAML configuration
file for an IMPUTE2 format.
genotypes:
format: impute2
filename: cohort.impute2
sample_filename: cohort.sample
options:
probability_threshold: 0.9
bed/bim/fam¶
Only one argument is required.
Argument | Description | Required |
---|---|---|
prefix |
The prefix of the BED/BIM/FAM files | Yes |
Below is an example of a genotypes
section of the YAML configuration file
for a binary Plink format.
genotypes:
format: plink
prefix: cohort
bgen¶
The following arguments and options are available for this format.
Argument | Description | Required |
---|---|---|
filename |
The name of the bgen file. | Yes |
sample_filename |
The name of the sample file. | Yes |
probability_threshold |
The probability threshold. Genotypes with the maximal probability lower to this value will be set as missing. [Default: 0.9] | |
cpus |
The number of CPUs to use while reading the bgen file. [Default: 1] |
Below is an example of a genotypes
section of the YAML configuration file
for a bgen file.
genotypes:
format: bgen
filename: cohort.bgen
sample_filename: cohort.sample
options:
probability_threshold: 0.9
cpus: 1
vcf¶
Only one argument is required.
Argument | Description | Required |
---|---|---|
filename |
The name of the VCF file. | Yes |
Below is an example of a genotypes
section of the YAML configuration file
for the VCF format.
genotypes:
format: vcf
filename: cohort.vcf
Phenotypes¶
The phenotypes
section describes the phenotypes and variables that will be
used in the statistical model. At the moment, only one format is available.
text¶
The following arguments and options are available for this format.
Argument | Description | Required |
---|---|---|
filename |
The name of the bgen file. | Yes |
sample_column |
The name of the column containing the sample ID. This column will be used to match the phenotypes with the genotypes. [Default: sample] | |
field_separator |
The character that separate a field in the file. [Default: ‘\t’] | |
missing_values |
A string (using quotes) that
represents missing values. An empty
field, NA , nan or NaN are
always considered as missing. |
|
repeated_measurements |
Enter ‘Yes’ if the file contains repeated measurements. | |
keep_sample_column |
For now, if repeated measurements are used (i.e. Yes at the previous option), enter ‘Yes’ to tell the parser to keep the sample column for the statistical analysis (will be used for groups in the MixedLM analysis). | |
sex_column |
The name of the column containing the
sex information. Note that males need
to be coded as 1 and females, as
0. The choice of this encoding is
to speed up the MAF computation for
sexual chromosomes. This column will
be used only if the analysis is
performed using the
--sexual-chromosome option. |
Below is an example of a phenotypes
section of the YAML configuration file
for a text file containing repeated measurements. The string -99999
is
considered as a missing value.
phenotypes:
format: text
filename: phenotypes.txt
options:
sample_column: sample_id
missing_values: "-99999"
repeated_measurements: Yes
keep_sample_column: Yes
sex_column: sex
Statistical model¶
For now, a total of 4 different analysis is possible: linear and logistic regressions, repeated measurements analysis using a mixed linear model, and survival analysis using the Cox proportional hazard regression. Each of those models (with their configuration) are described below.
The model is described in the model
section of the YAML configuration
file and using the test
argument.
Linear regression¶
The following arguments and options are available for the linear regression.
Argument | Description | Required |
---|---|---|
formula |
The formula describing the analysis
to be performed. Note that the
formula is similar to the one used in
R. The names of the variables need to
be the same as the columns in the
phenotype file. The keyword SNPs
is used to perform a GWAS. |
Yes |
condition_value_t |
The condition value threshold (for multicollinearity). Usually, values higher than 1000 indicate strong multicollinearity or other numerical problems. [Default: 1000] | |
eigenvals_t |
The Eigen value threshold (for multicollinearity). Usually, values lower than 1e-10 might indicate strong multicollinearity or singular design matrix. [Default: 1e-10] |
Below is an example of a model
section of the YAML configuration file
for a linear regression analysis of the phenotype Pheno over the variables
SNPs (meaning a GWAS), Age and Sex. It also increases the conditional
value threshold from the default value of 1000 to 5000.
model:
test: linear
formula: "Pheno ~ SNPs + Age + factor(Sex)"
options:
condition_value_t: 5000
See genetest.statistics.models.linear.StatsLinear
for more
information about the class.
Logistic regression¶
The logistic regression only requires the formula describing the model.
Argument | Description | Required |
---|---|---|
formula |
The formula describing the analysis
to be performed. Note that the
formula is similar to the one used in
R. The names of the variables need to
be the same as the columns in the
phenotype file. The keyword SNPs
is used to perform a GWAS. |
Yes |
Below is an example of a model
section of the YAML configuration file
for a logistic regression analysis of the phenotype Status over the variables
SNPs (meaning a GWAS), Age and Sex.
model:
test: logistic
formula: "Status ~ SNPs + Age + factor(Sex)"
See genetest.statistics.models.logistic.StatsLogistic
for more
information about the class.
Repeated measurements¶
The repeated measurements analysis requires the following arguments and options.
Argument | Description | Required |
---|---|---|
formula |
The formula describing the analysis
to be performed. Note that the
formula is similar to the one used in
R. The names of the variables need to
be the same as the columns in the
phenotype file. The keyword SNPs
is used to perform a GWAS. |
Yes |
optimize |
Should an optimization be performed by using a two-step approach by fitting one LMM in the first step without the genetic component and, in the second step, fitting a simple regression model, for each SNP at a time. Then, if the p-value is lower than a user defined threshold, a complete LMM is fitted for this marker. Note that this optimization is invalid when using an genetic/environment interaction. [Default: True] | |
p_threshold |
The p-value threshold used for the MixedLM optimization (see above). [Default: 1e-4] |
Below is an example of a model
section of the YAML configuration file
for a repeated measurements analysis of the phenotype Pheno over the
variables SNPs (meaning a GWAS), Age, Sex and Visit using the sample
IDs (SampleID) as the grouping variable.
model:
test: mixedlm
formula: "[outcome=Pheno, groups=SampleID] ~ SNPs + Age + factor(Sex) + factor(Visit)"
options:
optimize: Yes
See genetest.statistics.models.mixedlm.StatsMixedLM
for more
information about the class.
Survival analysis¶
The Cox proportional hazard regression only requires the formula describing the model.
Argument | Description | Required |
---|---|---|
formula |
The formula describing the analysis
to be performed. Note that the
formula is similar to the one used in
R. The names of the variables need to
be the same as the columns in the
phenotype file. The keyword SNPs
is used to perform a GWAS. |
Yes |
Below is an example of a model
section of the YAML configuration file
for a survival analysis (Cox proportional hazard regression) of the event
Event and time to event TTE over the variables SNPs (meaning a GWAS),
Age and Sex.
model:
test: coxph
formula: "[tte=TTE, event=Event] ~ SNPs + Age + factor(Sex)"
See genetest.statistics.models.survival.StatsCoxPH
for more
information about the class.
Execution¶
Assuming the name of the configuration file analysis.yaml
, and that the
list of variant to extract for the analysis is in variants_to_extract.txt
(on variant ID per line), the following command will launch the analysis using
6 CPUs. The resulting files will have the prefix results
.
Note that the --extract
option should be used to extract only the variants
that pass quality control. Since genotypes file might be really big, extracting
only the variants suited for analysis will dramatically decrease the execution
time.
genetest \
--configuration analysis.yaml \
--extract variants_to_extract.txt \
--nb-cpus 6 \
--output results
Output files¶
Using the previous command, three files will be generated (with the results
prefix).
File name | Description |
---|---|
results.log |
File containing the LOG of the analysis. |
results.txt |
File containing the results of the analysis. The file is tab-separated and contain summary information about each variant, along with the statistics specific to the statistical model. |
results_failed_snps.txt |
File containing the list of variants that failed the analysis. Failure can be attributed to low minor allele frequency or convergence issues, for example. A small description is added to describe the failure. |