genipe.tools package¶
Module contents¶
genipe.tools.impute2_extractor module¶
-
genipe.tools.impute2_extractor.
check_args
(args)[source]¶ Checks the arguments and options.
- Parameters
args (argparse.Namespace) – the options to verify
Note
If there is a problem, a genipe.error.GenipeError is raised.
Note
Nothing is checked (apart from the impute2 files) if indexation is requested (--index option).
-
genipe.tools.impute2_extractor.
extract_companion_files
(i_prefix, o_prefix, to_extract)[source]¶ Extracts markers from companion files (if they exist).
-
genipe.tools.impute2_extractor.
extract_markers
(fn, to_extract, out_prefix, out_format, prob_t, is_long)[source]¶ Extracts according to names.
-
genipe.tools.impute2_extractor.
gather_extraction
(fn, maf, rate, info, extract_filename, genomic_range)[source]¶ Gather positions that are required.
- Parameters
fn (str) – the impute2 filename
maf (float) – the minor allele frequency threshold (might be None)
rate (float) – the call rate threshold (might be None)
info (float) – the marker information value threshold (might be None)
extract_filename (str) – the name of the file containing marker names to extract (might be None)
genomic_range (str) – the genomic range for extraction
- Returns
the set of markers to extract
- Return type
If extraction by marker name is required, only those markers will be extracted. Otherwise, maf, rate, info or genomic_range can be specified (alone or together) to extract markers according to minor allele frequency, call rate, marker information value and genomic location.
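The selection logic described above can be sketched as follows. This is only an illustration, not genipe's implementation: the function name `select_markers`, the in-memory `stats` dictionary and the direction of the comparisons (values at or above a threshold are kept) are assumptions, and the genomic range filter is left out for brevity.

```python
def select_markers(stats, names=None, maf=None, rate=None, info=None):
    """Return the set of marker names to extract.

    `stats` maps a marker name to a dict with "maf", "rate" and "info"
    values (a hypothetical in-memory stand-in for the companion files).
    """
    # Extraction by name takes precedence over every threshold.
    if names is not None:
        return {marker for marker in stats if marker in names}

    thresholds = {"maf": maf, "rate": rate, "info": info}
    selected = set()
    for marker, values in stats.items():
        # A threshold set to None is simply not applied.
        if all(t is None or values[key] >= t for key, t in thresholds.items()):
            selected.add(marker)
    return selected


stats = {
    "rs1": {"maf": 0.20, "rate": 0.99, "info": 0.95},
    "rs2": {"maf": 0.01, "rate": 0.99, "info": 0.40},
}
by_maf = select_markers(stats, maf=0.05)        # thresholds only
by_name = select_markers(stats, names={"rs2"})  # names win over thresholds
```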
-
genipe.tools.impute2_extractor.
get_file_prefix
(fn)[source]¶ Gets the filename prefix.
- Parameters
fn (str) – the name of the file from which the prefix is required
- Returns
the prefix of the file
- Return type
This function removes the extension from the file name and returns its prefix (e.g. test.impute2 returns test, and ../test.impute2.gz returns ../test).
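A minimal sketch of this behaviour, using only the standard library, could look like the following. The name `file_prefix` is hypothetical and the handling of the `.gz` suffix is an assumption based on the two examples above, so the actual function may differ.

```python
import os


def file_prefix(fn):
    """Drop an optional trailing ".gz", then one more extension."""
    prefix = fn
    if prefix.endswith(".gz"):
        prefix = prefix[:-len(".gz")]
    # splitext only looks at the last path component, so "../" is kept.
    prefix, _ = os.path.splitext(prefix)
    return prefix


plain = file_prefix("test.impute2")
gzipped = file_prefix("../test.impute2.gz")
```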
-
genipe.tools.impute2_extractor.
get_samples
(fn)[source]¶ Reads the sample file and extracts the information.
- Parameters
fn (str) – the name of the sample file
- Returns
the sample information
- Return type
-
genipe.tools.impute2_extractor.
index_file
(fn)[source]¶ Indexes the impute2 file.
- Parameters
fn (str) – the name of the impute2 file
This function uses genipe.formats.index.get_index() to create the index file if it is missing.
Note
We do not catch the genipe.error.GenipeError exception if it is raised, since the message will be relevant to the user.
-
genipe.tools.impute2_extractor.
main
(args=None)[source]¶ The main function.
- Parameters
args (argparse.Namespace) – the arguments to be parsed (if main() is called by another module)
-
genipe.tools.impute2_extractor.
parse_args
(parser, args=None)[source]¶ Parses the command line options and arguments.
- Parameters
parser (argparse.ArgumentParser) – the argument parser
args (list) – the list of arguments (if not taken from sys.argv)
- Returns
the list of options and arguments
- Return type
Note
The only check that is done here is by the parser itself. Values are verified later by the check_args() function.
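The split between parsing and value checking can be illustrated with plain argparse. In this sketch the options `--impute2` and `--maf`, the MAF bounds and the `ToolError` class (standing in for genipe.error.GenipeError) are all hypothetical; only the `parse_args(parser, args=None)` and `check_args(args)` signatures come from the documentation above.

```python
import argparse


class ToolError(Exception):
    """Hypothetical stand-in for genipe.error.GenipeError."""


def parse_args(parser, args=None):
    # Hypothetical options, for illustration only.
    parser.add_argument("--impute2", required=True, metavar="FILE")
    parser.add_argument("--maf", type=float, default=None)
    # With args=None, argparse falls back to sys.argv[1:].
    return parser.parse_args(args)


def check_args(args):
    # Value checks happen here, after the parser has done its own checks.
    if args.maf is not None and not 0 <= args.maf <= 0.5:
        raise ToolError("{}: invalid MAF".format(args.maf))
    return True


options = parse_args(
    argparse.ArgumentParser(),
    ["--impute2", "chr1.impute2", "--maf", "0.05"],
)
check_args(options)
```

Passing an explicit list, as done here, is what makes it possible for another module to call the tool without touching sys.argv.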
genipe.tools.impute2_merger module¶
-
genipe.tools.impute2_merger.
check_args
(args)[source]¶ Checks the arguments and options.
- Parameters
args (argparse.Namespace) – the options to verify
Note
If there is a problem, a genipe.error.GenipeError is raised.
-
genipe.tools.impute2_merger.
concatenate_files
(i_filenames, out_prefix, real_chrom, options)[source]¶ Concatenates and extracts information from IMPUTE2 GEN file(s).
- Parameters
i_filenames (list) – the list of input filenames (to concatenate)
out_prefix (str) – the output prefix for the output files
real_chrom (str) – the chromosome contained in all the input files
options (argparse.Namespace) – the options
This function will create the following eight files:
File name – Description
.impute2 – Imputation results (merged from all the input files).
.alleles – Description of the reference and alternative alleles at each site.
.imputed_sites – List of imputed sites (excluding sites that were previously genotyped in the study cohort).
.impute2_info – SNP-wise information file with one line per SNP and a single header line at the beginning.
.completion_rates – Number of missing values and completion rate for all sites (using the probability threshold set by the user; by default, probabilities greater than or equal to 0.9).
.good_sites – List of sites which pass the completion rate threshold (set by the user; by default, greater than or equal to 0.98) using the probability threshold (set by the user; by default, greater than or equal to 0.9).
.map – A map file describing the genomic location of all sites.
.maf – File containing the minor allele frequency (along with minor allele identification) for all sites using the probability threshold of 0.9. When no genotypes are available (because they are all below the threshold), the MAF is NA.
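To make the role of the probability threshold concrete, the sketch below derives the number of missing values, the completion rate and the MAF for one site. It assumes that each sample carries three genotype probabilities (AA, AB, BB) and that a genotype is called when its highest probability reaches the threshold. The function name `site_stats` and the use of `None` for the NA case are hypothetical, and this is not the code used by the merger.

```python
def site_stats(probs, prob_t=0.9):
    """Return (n_missing, completion_rate, maf) for one site.

    `probs` is a list of (P(AA), P(AB), P(BB)) triples, one per sample.
    """
    calls = []      # number of B alleles (0, 1 or 2) for each called sample
    n_missing = 0
    for triple in probs:
        best = max(triple)
        if best >= prob_t:
            calls.append(list(triple).index(best))
        else:
            n_missing += 1

    completion = len(calls) / len(probs)
    if not calls:
        # Every genotype is below the threshold: the MAF is not available.
        return n_missing, completion, None

    freq_b = sum(calls) / (2 * len(calls))
    return n_missing, completion, min(freq_b, 1 - freq_b)


site = [
    (0.95, 0.05, 0.00),  # called AA
    (0.10, 0.90, 0.00),  # called AB
    (0.50, 0.50, 0.00),  # missing (no probability reaches 0.9)
    (0.99, 0.01, 0.00),  # called AA
]
n_missing, completion, maf = site_stats(site)
```

Here one of the four samples is missing, so the completion rate is 0.75, which would fail a 0.98 completion rate threshold.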
-
genipe.tools.impute2_merger.
main
(args=None)[source]¶ The main function.
- Parameters
args (argparse.Namespace) – the arguments to be parsed (if main() is called by another module)
-
genipe.tools.impute2_merger.
parse_args
(parser, args=None)[source]¶ Parses the command line options and arguments.
- Parameters
parser (argparse.ArgumentParser) – the argument parser
args (list) – the list of arguments (if not taken from sys.argv)
- Returns
the list of options and arguments
- Return type
Note
The only check that is done here is by the parser itself. Values are verified later by the check_args() function.
genipe.tools.imputed_stats module¶
-
genipe.tools.imputed_stats.
check_args
(args)[source]¶ Checks the arguments and options.
- Parameters
args (argparse.Namespace) – the options to verify
Note
If there is a problem, a genipe.error.GenipeError is raised.
-
genipe.tools.imputed_stats.
compute_statistics
(impute2_filename, samples, markers_to_extract, phenotypes, remove_gender, out_prefix, options)[source]¶ Parses IMPUTE2 file while computing statistics.
- Parameters
impute2_filename (str) – the name of the input file
samples (pandas.DataFrame) – the list of samples
markers_to_extract (set) – the set of markers to extract
phenotypes (pandas.DataFrame) – the phenotypes
remove_gender (bool) – whether or not to remove the gender column
out_prefix (str) – the output prefix
options (argparse.Namespace) – the options
This function takes care of parallelism. It reads the Impute2 file and fills a queue that will trigger the analysis when full.
If the number of processes to launch is 1, the rows are analyzed as they come.
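The dispatching idea (analyze rows directly with one process, otherwise accumulate rows and hand a full batch to a pool of workers) can be sketched as below. The names `analyze` and `compute`, the batch size and the use of `multiprocessing.Pool` are assumptions made for this illustration; the actual function may organize its queue differently.

```python
from multiprocessing import Pool


def analyze(line):
    """Hypothetical per-site analysis: return the marker name (2nd field)."""
    return line.split(" ")[1]


def compute(lines, nb_process=1, batch_size=3):
    # With a single process, rows are analyzed as they come.
    if nb_process == 1:
        return [analyze(line) for line in lines]

    results, batch = [], []
    with Pool(processes=nb_process) as pool:
        for line in lines:
            batch.append(line)
            # A full batch triggers the analysis on the worker processes.
            if len(batch) >= batch_size:
                results.extend(pool.map(analyze, batch))
                batch = []
        # Analyze whatever is left once the file is exhausted.
        if batch:
            results.extend(pool.map(analyze, batch))
    return results


rows = ["1 rs1 100 A G", "1 rs2 200 C T", "1 rs3 300 G A", "1 rs4 400 T C"]
serial = compute(rows, nb_process=1)
```

When more than one process is requested, the call should be made from a script protected by `if __name__ == "__main__":`, as usual with multiprocessing.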
-
genipe.tools.imputed_stats.
fit_cox
(data, time_to_event, event, formula, result_col, **kwargs)[source]¶ Fit a Cox proportional hazards model to the data.
- Parameters
data (pandas.DataFrame) – the data to analyse
time_to_event (str) – the time to event column for the survival analysis
event (str) – the event column for the survival analysis
formula (str) – the formula for the data preparation
result_col (str) – the column that will contain the results
- Returns
the results from the survival analysis
- Return type
numpy.array
Note
Using alpha of 0.95, and default parameters.
-
genipe.tools.imputed_stats.
fit_linear
(data, formula, result_col, **kwargs)[source]¶ Fit a linear regression to the data.
- Parameters
data (pandas.DataFrame) – the data to analyse
formula (str) – the formula for the linear regression
result_col (str) – the column that will contain the results
- Returns
the results from the linear regression
- Return type
-
genipe.tools.imputed_stats.
fit_logistic
(data, formula, result_col, **kwargs)[source]¶ Fit a logistic regression to the data.
- Parameters
data (pandas.DataFrame) – the data to analyse
formula (str) – the formula for the logistic regression
result_col (str) – the column that will contain the results
- Returns
the results from the logistic regression
- Return type
-
genipe.tools.imputed_stats.
fit_mixedlm
(data, formula, use_ml, groups, result_col, random_effects, mixedlm_p, interaction, **kwargs)[source]¶ Fit a linear mixed effects model to the data.
- Parameters
data (pandas.DataFrame) – the data to analyse
formula (str) – the formula for the linear mixed effects model
use_ml (bool) – whether to use ML instead of REML
groups (str) – the column containing the groups
result_col (str) – the column that will contain the results
random_effects (pandas.Series) – the random effects
mixedlm_p (float) – the p-value threshold for which loci will be computed with the real MixedLM analysis
interaction (bool) – Whether there is an interaction or not
- Returns
the results from the linear mixed effects model
- Return type
-
genipe.tools.imputed_stats.
get_formula
(phenotype, covars, interaction, gender_c, categorical)[source]¶ Creates the linear/logistic regression formula (for statsmodels).
- Parameters
- Returns
the formula for the statistical analysis
- Return type
Note
The phenotype column needs to be specified. The list of covariates might be empty (if no covariates are necessary). The interaction column can be set to None if there is no interaction.
Note
The gender column should be categorical (hence, the formula requires the gender to be wrapped in C(), e.g. C(Gender)).
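As a rough illustration of how such a formula string can be assembled, consider the sketch below. The function name `build_formula` and the genotype term name `geno` are hypothetical, and the way the interaction term is written is an assumption; only the argument list mirrors get_formula().

```python
def build_formula(phenotype, covars, interaction, gender_c, categorical):
    """Assemble a formula string of the form "y ~ x1 + x2 + ..."."""

    def term(name):
        # Gender and any other categorical variable are wrapped in C().
        if name == gender_c or name in categorical:
            return "C({})".format(name)
        return name

    terms = ["geno"] + [term(covar) for covar in covars]
    if interaction is not None:
        terms.append("geno*{}".format(term(interaction)))
    return "{} ~ {}".format(phenotype, " + ".join(terms))


simple = build_formula("Pheno", ["Age", "Gender"], None, "Gender", [])
with_inter = build_formula("Pheno", ["Age", "Gender"], "Gender", "Gender", [])
no_covars = build_formula("Pheno", [], None, "Gender", [])
```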
-
genipe.tools.imputed_stats.
is_file_like
(fn)[source]¶ Checks if the path is like a file (it might be a named pipe).
-
genipe.tools.imputed_stats.
main
(args=None)[source]¶ The main function.
- Parameters
args (argparse.Namespace) – the arguments to be parsed (if main() is called by another module)
-
genipe.tools.imputed_stats.
parse_args
(parser, args=None)[source]¶ Parses the command line options and arguments.
- Parameters
parser (argparse.ArgumentParser) – the argument parser
args (list) – the list of arguments (if not taken from sys.argv)
- Returns
the list of options and arguments
- Return type
Note
The only check that is done here is by the parser itself. Values are verified later by the check_args() function.
-
genipe.tools.imputed_stats.
process_impute2_site
(site_info)[source]¶ Process an IMPUTE2 site (a line in an IMPUTE2 file).
-
genipe.tools.imputed_stats.
read_phenotype
(i_filename, opts, check_duplicated=True)[source]¶ Reads the phenotype file.
- Parameters
i_filename (str) – the name of the input file
opts (argparse.Namespace) – the options
check_duplicated (bool) – whether or not to check for duplicated samples
- Returns
the phenotypes
- Return type
This file is expected to be a tab separated file of phenotypes and covariates. The columns to use will be determined by the --sample-column and the --covar options.
For analysis including the X chromosome, the gender is automatically added as a covariate. The results are not shown to the user unless asked for.
-
genipe.tools.imputed_stats.
read_samples
(i_filename)[source]¶ Reads the sample file (produced by SHAPEIT).
- Parameters
i_filename (str) – the name of the input file
- Returns
the list of samples
- Return type
This file contains the list of samples that are contained in the impute2 file (in the same order). The expected format for this file is a tab separated file with a first row containing the following columns:
ID_1 ID_2 missing father mother sex plink_pheno
The subsequent row will be discarded and should contain:
0 0 0 D D D B
Notes
We are mostly interested in the sample IDs corresponding to the ID_2 column. Their uniqueness is verified by pandas.
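The layout described above (one header row, one row to discard, then one row per sample) can be read with a few lines of standard-library code. This sketch is not the pandas-based reader documented here: the name `read_sample_ids` is hypothetical, fields are split on any whitespace as an assumption, and duplicated IDs are reported with a plain ValueError.

```python
import io


def read_sample_ids(handle):
    """Return the ID_2 values of a sample file, in file order."""
    header = handle.readline().split()
    handle.readline()  # the second row ("0 0 0 D D D B") is discarded
    id_col = header.index("ID_2")

    ids = [line.split()[id_col] for line in handle if line.strip()]
    if len(ids) != len(set(ids)):
        raise ValueError("duplicated sample IDs in the ID_2 column")
    return ids


content = (
    "ID_1\tID_2\tmissing\tfather\tmother\tsex\tplink_pheno\n"
    "0\t0\t0\tD\tD\tD\tB\n"
    "fam1\tsample1\t0\t0\t0\t1\t-9\n"
    "fam2\tsample2\t0\t0\t0\t2\t-9\n"
)
sample_ids = read_sample_ids(io.StringIO(content))
```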
-
genipe.tools.imputed_stats.
read_sites_to_extract
(i_filename)[source]¶ Reads the list of sites to extract.
- Parameters
i_filename (str) – The input filename containing the IDs of the variants to consider for the analysis.
- Returns
A set containing the variants.
- Return type
The expected file format is simply a list of variants. Every row should correspond to a single variant identifier.
3:60069:t
rs123456:A
3:60322:A
Typically, this is used to analyze only variants that passed some QC threshold. The genipe pipeline generates this file at the ‘merge_impute2’ step.
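Reading such a file amounts to collecting one identifier per row into a set. The short sketch below shows this with the three identifiers from the example; the name `read_sites` is hypothetical and skipping blank rows is an assumption.

```python
import io


def read_sites(handle):
    """Return the set of variant identifiers, one per non-empty row."""
    return {line.strip() for line in handle if line.strip()}


sites = read_sites(io.StringIO("3:60069:t\nrs123456:A\n3:60322:A\n"))
```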
-
genipe.tools.imputed_stats.
samples_with_hetero_calls
(data, hetero_c)[source]¶ Gets male and heterozygous calls.
- Parameters
data (pandas.DataFrame) – the probability matrix
hetero_c (str) – the name of the heterozygous column
- Returns
samples where call is heterozygous
- Return type
Note
If there are no data (i.e. no males), an empty list is returned.
-
genipe.tools.imputed_stats.
skat_parse_impute2
(impute2_filename, samples, markers_to_extract, phenotypes, remove_gender, out_prefix, args)[source]¶ Read the impute2 file and run the SKAT analysis.
- Parameters
impute2_filename (str) – the name of the input file
samples (pandas.DataFrame) – the samples
markers_to_extract (set) – the set of markers to analyse
phenotypes (pandas.DataFrame) – the phenotypes
remove_gender (bool) – whether or not to remove the gender column
out_prefix (str) – the output prefix
args (argparse.Namespace) – the options
This function does most of the “dispatching” to run SKAT. It writes the input files to the disk, runs the generated R scripts to do the actual analysis and then writes the results to disk.
-
genipe.tools.imputed_stats.
skat_read_snp_set
(i_filename)[source]¶ Reads the SKAT SNP set file.
- Parameters
i_filename (str) – the name of the input file
- Returns
the SNP set for the SKAT analysis
- Return type
This file has to be supplied by the user. The recognized columns are: variant, snp_set and weight. The weight column is optional and can be used to specify a custom weighting scheme for SKAT. If nothing is specified, the default Beta weights are used.
The file has to be tab delimited.
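A tab delimited file with these columns can be grouped by SNP set as sketched below. This uses the standard csv module rather than whatever the function uses internally; the name `read_snp_sets`, the returned structure and the use of `None` for a missing weight are assumptions made for the example.

```python
import csv
import io
from collections import defaultdict


def read_snp_sets(handle):
    """Map each SNP set name to a list of (variant, weight) pairs."""
    reader = csv.DictReader(handle, delimiter="\t")
    missing = {"variant", "snp_set"} - set(reader.fieldnames or [])
    if missing:
        raise ValueError("missing columns: {}".format(sorted(missing)))

    snp_sets = defaultdict(list)
    for row in reader:
        # The weight column is optional; None means "use default weights".
        weight = float(row["weight"]) if row.get("weight") else None
        snp_sets[row["snp_set"]].append((row["variant"], weight))
    return dict(snp_sets)


content = (
    "variant\tsnp_set\tweight\n"
    "rs1\tset1\t0.5\n"
    "rs2\tset1\t1.5\n"
    "rs3\tset2\t2.0\n"
)
snp_sets = read_snp_sets(io.StringIO(content))
```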
genipe.tools.genipe_tutorial¶
-
genipe.tools.genipe_tutorial.
download_file
(url, path)[source]¶ Downloads a file from a URL to a path.
-
genipe.tools.genipe_tutorial.
generate_bash
(path)[source]¶ Generates a bash script to launch the imputation pipeline.
- Parameters
path (str) – the path to write the bash script
-
genipe.tools.genipe_tutorial.
get_genotypes
(path)[source]¶ Gets the genotypes files.
- Parameters
path (str) – the path where to put the genotypes
-
genipe.tools.genipe_tutorial.
get_hg19
(path)[source]¶ Gets the hg19 reference file.
- Parameters
path (str) – the path where to put the reference
-
genipe.tools.genipe_tutorial.
get_impute2
(os_name, arch, path)[source]¶ Gets impute2 depending on the system, and puts it in ‘path’.
- Parameters
Note
If the binary is in the system path, it is copied to the destination path. Otherwise, we download it.
-
genipe.tools.genipe_tutorial.
get_impute2_ref
(path)[source]¶ Gets the impute2’s reference files.
- Parameters
path (str) – the path where to put the reference files
-
genipe.tools.genipe_tutorial.
get_os_info
()[source]¶ Gets the OS information.
- Returns
the first element is the name of the OS, and the second is the system’s architecture
- Return type
Note
The tutorial does not work on the Windows operating system. The script will quit unless the operating system is Linux or Darwin (MacOSX).
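A check of this kind can be written with the standard platform module, as in the sketch below. The function name `os_info`, its optional arguments (added only to make the example easy to try on any machine) and the exact form of the architecture value are assumptions.

```python
import platform


def os_info(system=None, machine=None):
    """Return (os_name, architecture), quitting on unsupported systems."""
    os_name = platform.system() if system is None else system
    arch = platform.machine() if machine is None else machine
    if os_name not in ("Linux", "Darwin"):
        # Windows (or anything else) is not supported by the tutorial.
        raise SystemExit("{}: unsupported operating system".format(os_name))
    return os_name, arch


linux = os_info("Linux", "x86_64")
```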
-
genipe.tools.genipe_tutorial.
get_plink
(os_name, arch, path)[source]¶ Gets Plink depending on the system, and puts it in ‘path’.
- Parameters
Note
If the binary is in the system path, it is copied to the destination path. Otherwise, we download it.
-
genipe.tools.genipe_tutorial.
get_shapeit
(os_name, arch, path)[source]¶ Gets shapeit depending on the system, and puts it in ‘path’.
- Parameters
Note
If the binary is in the system path, it is copied to the destination path. Otherwise, we download it.
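The copy-or-download behaviour shared by get_impute2(), get_plink() and get_shapeit() can be sketched with `shutil.which`. The helper name `fetch_binary` and the `download` callback are hypothetical; the callback stands in for the real download step, whose URLs are not described here.

```python
import os
import shutil
import tempfile


def fetch_binary(name, dest_dir, download):
    """Copy `name` from the system path, or download it into `dest_dir`."""
    dest = os.path.join(dest_dir, name)
    found = shutil.which(name)
    if found is not None:
        # The binary is already installed: copy it next to the tutorial.
        shutil.copy(found, dest)
    else:
        # Not installed: delegate to the (hypothetical) download callback.
        download(dest)
    return dest


downloaded = []


def fake_download(dest):
    # Stand-in for a real download; it only creates an empty file.
    downloaded.append(dest)
    open(dest, "w").close()


with tempfile.TemporaryDirectory() as tmp_dir:
    result = fetch_binary("no-such-binary-for-this-sketch", tmp_dir, fake_download)
    exists = os.path.isfile(result)
```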
-
genipe.tools.genipe_tutorial.
main
(args=None)[source]¶ The main function.
- Parameters
args (argparse.Namespace) – the arguments to be parsed (if main() is called by another module)
-
genipe.tools.genipe_tutorial.
parse_args
(parser, args=None)[source]¶ Parses the command line options and arguments.
- Parameters
parser (argparse.ArgumentParser) – the argument parser
args (list) – the list of arguments (if not taken from sys.argv)
- Returns
the list of options and arguments
- Return type
Note
The only check that is done here is by the parser itself. Values are verified later by the check_args() function.