genipe.tools package¶

Module contents¶

genipe.tools.impute2_extractor module¶

genipe.tools.impute2_extractor.check_args(args)[source]¶

Checks the arguments and options.

Parameters: args (argparse.Namespace) – the options to verify

Note

If there is a problem, a genipe.error.GenipeError is raised.

Note

Nothing is checked (apart from the impute2 files) if indexation is requested (--index option).

genipe.tools.impute2_extractor.extract_companion_files(i_prefix, o_prefix, to_extract)[source]¶

Extracts markers from companion files (if they exist).

Parameters

i_prefix (str) – the prefix of the input file
o_prefix (str) – the prefix of the output file
to_extract (set) – the set of markers to extract

genipe.tools.impute2_extractor.extract_markers(fn, to_extract, out_prefix, out_format, prob_t, is_long)[source]¶

Extracts according to names.

Parameters

fn (str) – the name of the input file
to_extract (set) – the set of markers to extract for each input file
out_prefix (str) – the output prefix
out_format (list) – the output format(s)
prob_t (float) – the probability threshold
is_long (bool) – True if format needs to be long

genipe.tools.impute2_extractor.gather_extraction(fn, maf, rate, info, extract_filename, genomic_range)[source]¶

Gather positions that are required.

Parameters

fn (str) – the impute2 filename
maf (float) – the minor allele frequency threshold (might be None)
rate (float) – the call rate threshold (might be None)
info (float) – the marker information value threshold (might be None)
extract_filename (str) – the name of the file containing marker names to extract (might be None)
genomic_range (str) – the genomic range for extraction

Returns

the set of markers to extract

Return type

set

If extraction by marker name is required, only those markers will be extracted. Otherwise, maf, rate, info or genomic_range can be specified (alone or together) to extract markers according to minor allele frequency, call rate, information value and genomic location.

genipe.tools.impute2_extractor.get_file_prefix(fn)[source]¶

Gets the filename prefix.

Parameters: fn (str) – the name of the file from which the prefix is required
Returns: the prefix of the file
Return type: str

This function removes the extension from the file name and returns its prefix (e.g. test.impute2 returns test, and ../test.impute2.gz returns ../test).
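The behaviour described above can be sketched as follows. This is a minimal illustration and not the genipe implementation: file_prefix is a hypothetical name, and handling only a .gz compression suffix is an assumption.

```python
import os


def file_prefix(fn):
    """Hypothetical stand-in for get_file_prefix() (illustration only)."""
    # Assumption: only a trailing ".gz" compression suffix is dropped first.
    if fn.endswith(".gz"):
        fn = fn[:-len(".gz")]
    # Then remove the last remaining extension (e.g. ".impute2").
    return os.path.splitext(fn)[0]


print(file_prefix("test.impute2"))        # test
print(file_prefix("../test.impute2.gz"))  # ../test
```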

genipe.tools.impute2_extractor.get_samples(fn)[source]¶

Reads the sample file and extracts the information.

Parameters: fn (str) – the name of the sample file
Returns: the sample information
Return type: pandas.DataFrame

genipe.tools.impute2_extractor.index_file(fn)[source]¶

Indexes the impute2 file.

Parameters: fn (str) – the name of the impute2 file

This function uses genipe.formats.index.get_index() to create the index file if it is missing.

Note

We won’t catch the genipe.error.GenipeError exception if it’s raised, since the message will be relevant to the user.

genipe.tools.impute2_extractor.main(args=None)[source]¶

The main function.

Parameters: args (argparse.Namespace) – the arguments to be parsed (if main() is called by another module)

genipe.tools.impute2_extractor.parse_args(parser, args=None)[source]¶

Parses the command line options and arguments.

Parameters

parser (argparse.ArgumentParser) – the argument parser
args (list) – the list of arguments (if not taken from sys.argv)

Returns

the list of options and arguments

Return type

argparse.Namespace

Note

The only check that is done here is by the parser itself. Values are verified later by the check_args() function.
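The split between parsing and validation described in the note can be sketched as below. The --prob option, its range check and the ToolError class are assumptions made for the illustration (ToolError merely stands in for genipe.error.GenipeError); they are not taken from the genipe source.

```python
import argparse


class ToolError(Exception):
    """Hypothetical stand-in for genipe.error.GenipeError."""


def parse_args(parser, args=None):
    # Only the parser's own checks (types, required options) happen here.
    # "--prob" is a placeholder option used for the illustration.
    parser.add_argument("--prob", type=float, default=0.9)
    return parser.parse_args(args)  # args=None falls back to sys.argv


def check_args(args):
    # Values are verified in a second step, after parsing.
    if not 0 <= args.prob <= 1:
        raise ToolError("--prob: {} is not in [0, 1]".format(args.prob))


parser = argparse.ArgumentParser(prog="example")
options = parse_args(parser, ["--prob", "0.95"])
check_args(options)
```

Keeping the two steps apart lets another module build an argument list, call the parser, and still get the same value checks.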

genipe.tools.impute2_extractor.print_data(o_files, prob_t, fid, iid, is_long, *, line=None, row=None)[source]¶

Prints an impute2 line.

Parameters

o_files (dict) – the output files
prob_t (float) – the probability threshold
fid (list) – the list of family IDs
iid (list) – the list of sample IDs
is_long (bool) – True if the format is long (dosage, calls)
line (str) – the impute2 line
row (list) – the impute2 line, split by spaces

genipe.tools.impute2_merger module¶

genipe.tools.impute2_merger.check_args(args)[source]¶

Checks the arguments and options.

Parameters: args (argparse.Namespace) – the options to verify

Note

If there is a problem, a genipe.error.GenipeError is raised.

genipe.tools.impute2_merger.concatenate_files(i_filenames, out_prefix, real_chrom, options)[source]¶

Concatenates and extracts information from IMPUTE2 GEN file(s).

Parameters

i_filenames (list) – the list of input filenames (to concatenate)
out_prefix (str) – the output prefix for the output files
real_chrom (str) – the chromosome contained in all the input files
options (argparse.Namespace) – the options

This function will create the following eight files:

File name	Description
`.impute2`	Imputation results (merged from all the input files).
`.alleles`	Description of the reference and alternative alleles at each site.
`.imputed_sites`	List of imputed sites (excluding sites that were previously genotyped in the study cohort).
`.impute2_info`	SNP-wise information file with one line per SNP and a single header line at the beginning.
`.completion_rates`	Number of missing values and completion rate for all sites (using the probability threshold set by the user, which defaults to greater than or equal to 0.9).
`.good_sites`	List of sites which pass the completion rate threshold (set by the user, which defaults to greater than or equal to 0.98) using the probability threshold (set by the user, which defaults to greater than or equal to 0.9).
`.map`	A map file describing the genomic location of all sites.
`.maf`	File containing the minor allele frequency (along with minor allele identification) for all sites using the probability threshold of 0.9. When no genotypes are available (because they are all below the threshold), the MAF is `NA`.
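To make the completion rate and MAF columns concrete, here is a minimal sketch of how one site could be summarised from genotype probabilities. It is an illustration under stated assumptions, not the genipe code: summarize_site is a hypothetical name, each sample is assumed to be a (P(AA), P(AB), P(BB)) triplet, a genotype is called when its best probability reaches the threshold, and None plays the role of `NA`.

```python
def summarize_site(probabilities, prob_t=0.9):
    """Return (n_missing, completion_rate, maf) for one non-empty site."""
    n_called = 0
    n_b = 0  # number of copies of allele B among called genotypes
    for triplet in probabilities:
        best = max(triplet)
        if best < prob_t:
            continue  # below the threshold: counted as missing
        n_called += 1
        n_b += triplet.index(best)  # AA -> 0, AB -> 1, BB -> 2 copies of B
    n_missing = len(probabilities) - n_called
    rate = n_called / len(probabilities)
    if n_called == 0:
        return n_missing, rate, None  # no genotype available: MAF is NA
    freq_b = n_b / (2 * n_called)
    return n_missing, rate, min(freq_b, 1 - freq_b)


site = [
    (0.95, 0.03, 0.02),  # called AA
    (0.02, 0.96, 0.02),  # called AB
    (0.50, 0.40, 0.10),  # missing (best probability below 0.9)
    (0.97, 0.02, 0.01),  # called AA
]
n_missing, rate, maf = summarize_site(site)
print(n_missing, rate)  # 1 0.75
```

With these assumptions, one B allele is seen among six called alleles, so the MAF is 1/6, and the site would fail a 0.98 completion rate threshold.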

genipe.tools.impute2_merger.main(args=None)[source]¶

The main function.

Parameters: args (argparse.Namespace) – the arguments to be parsed (if main() is called by another module)

genipe.tools.impute2_merger.parse_args(parser, args=None)[source]¶

Parses the command line options and arguments.

Parameters

parser (argparse.ArgumentParser) – the argument parser
args (list) – the list of arguments (if not taken from sys.argv)

Returns

the list of options and arguments

Return type

argparse.Namespace

Note

The only check that is done here is by the parser itself. Values are verified later by the check_args() function.

genipe.tools.imputed_stats module¶

genipe.tools.imputed_stats.check_args(args)[source]¶

Checks the arguments and options.

Parameters: args (argparse.Namespace) – the options to verify

Note

If there is a problem, a genipe.error.GenipeError is raised.

genipe.tools.imputed_stats.compute_statistics(impute2_filename, samples, markers_to_extract, phenotypes, remove_gender, out_prefix, options)[source]¶

Parses IMPUTE2 file while computing statistics.

Parameters

impute2_filename (str) – the name of the input file
samples (pandas.DataFrame) – the list of samples
markers_to_extract (set) – the set of markers to extract
phenotypes (pandas.DataFrame) – the phenotypes
remove_gender (bool) – whether or not to remove the gender column
out_prefix (str) – the output prefix
options (argparse.Namespace) – the options

This function takes care of parallelism. It reads the Impute2 file and fills a queue that will trigger the analysis when full.

If only one process is launched, the rows are analyzed as they come.
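The fill-then-dispatch pattern described above can be sketched as follows. This is a simplified illustration, not the genipe implementation: analyze_in_batches, count_fields and the batch size are hypothetical, and the built-in map is used by default so that the example stays self-contained. The map method of a multiprocessing.Pool could presumably be passed as map_func to spread each full batch over several processes.

```python
def count_fields(line):
    """Placeholder analysis: number of space-separated fields in a line."""
    return len(line.split(" "))


def analyze_in_batches(lines, analyze, batch_size=2, map_func=map):
    """Fill a batch and trigger the analysis each time it is full."""
    results = []
    batch = []
    for line in lines:
        batch.append(line)
        if len(batch) == batch_size:
            results.extend(map_func(analyze, batch))
            batch = []
    if batch:  # analyze whatever is left once the input is exhausted
        results.extend(map_func(analyze, batch))
    return results


lines = ["1 a", "2 b c", "3", "4 d e f", "5 g"]
print(analyze_in_batches(lines, count_fields))  # [2, 3, 1, 4, 2]
```

Setting batch_size to 1 reproduces the single-process behaviour, where each row is analyzed as soon as it is read.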

genipe.tools.imputed_stats.fit_cox(data, time_to_event, event, formula, result_col, **kwargs)[source]¶

Fits a Cox proportional hazards model to the data.

Parameters

data (pandas.DataFrame) – the data to analyse
time_to_event (str) – the time to event column for the survival analysis
event (str) – the event column for the survival analysis
formula (str) – the formula for the data preparation
result_col (str) – the column that will contain the results

Returns

the results from the survival analysis

Return type

numpy.array

Note

Uses an alpha of 0.95 and default parameters.

genipe.tools.imputed_stats.fit_linear(data, formula, result_col, **kwargs)[source]¶

Fit a linear regression to the data.

Parameters

data (pandas.DataFrame) – the data to analyse
formula (str) – the formula for the linear regression
result_col (str) – the column that will contain the results

Returns

the results from the linear regression

Return type

list

genipe.tools.imputed_stats.fit_logistic(data, formula, result_col, **kwargs)[source]¶

Fit a logistic regression to the data.

Parameters

data (pandas.DataFrame) – the data to analyse
formula (str) – the formula for the logistic regression
result_col (str) – the column that will contain the results

Returns

the results from the logistic regression

Return type

list

genipe.tools.imputed_stats.fit_mixedlm(data, formula, use_ml, groups, result_col, random_effects, mixedlm_p, interaction, **kwargs)[source]¶

Fit a linear mixed effects model to the data.

Parameters

data (pandas.DataFrame) – the data to analyse
formula (str) – the formula for the linear mixed effects model
use_ml (bool) – whether to use ML instead of REML
groups (str) – the column containing the groups
result_col (str) – the column that will contain the results
random_effects (pandas.Series) – the random effects
mixedlm_p (float) – the p-value threshold for which loci will be computed with the real MixedLM analysis
interaction (bool) – Whether there is an interaction or not

Returns

the results from the linear mixed effects model

Return type

list

genipe.tools.imputed_stats.get_formula(phenotype, covars, interaction, gender_c, categorical)[source]¶

Creates the linear/logistic regression formula (for statsmodels).

Parameters

phenotype (str) – the phenotype column
covars (list) – the list of covariate columns
interaction (str) – the interaction column

Returns

the formula for the statistical analysis

Return type

str

Note

The phenotype column needs to be specified. The list of covariates might be empty (if no covariates are necessary). The interaction column can be set to None if there is no interaction.

Note

The gender column should be categorical (hence, the formula requires the gender to be included into C(), e.g. C(Gender)).
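A formula of this kind could be assembled as in the sketch below. It is only an illustration: build_formula is a hypothetical name, the genotype term is called geno here (the name used by genipe is not given in this page), and treating categorical as a collection of covariate names to wrap in C() is an assumption.

```python
def build_formula(phenotype, covars, interaction=None, gender_c="Gender",
                  categorical=(), geno="geno"):
    """Build a statsmodels-style formula string (illustration only)."""

    def term(name):
        # Gender and any other categorical column is wrapped in C().
        if name == gender_c or name in categorical:
            return "C({})".format(name)
        return name

    terms = [geno] + [term(name) for name in covars]
    if interaction is not None:
        # "a*b" expands to both main effects and their interaction.
        terms.append("{}*{}".format(geno, term(interaction)))
    return "{} ~ {}".format(phenotype, " + ".join(terms))


print(build_formula("Pheno", ["Age", "Gender"]))
# Pheno ~ geno + Age + C(Gender)
print(build_formula("Pheno", ["Age", "Gender"], interaction="Age"))
# Pheno ~ geno + Age + C(Gender) + geno*Age
```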

genipe.tools.imputed_stats.is_file_like(fn)[source]¶

Checks if the path is like a file (it might be a named pipe).

Parameters: fn (str) – the path to check
Returns: True if path is like a file, False otherwise.
Return type: bool
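One way to accept both regular files and named pipes is to inspect the file mode, as in this sketch. The name looks_like_file is hypothetical and the check is an assumption about what "like a file" means here, not the genipe implementation.

```python
import os
import stat
import tempfile


def looks_like_file(path):
    """True for a regular file or a named pipe (FIFO), False otherwise."""
    try:
        mode = os.stat(path).st_mode
    except OSError:
        return False  # the path does not exist or cannot be read
    return stat.S_ISREG(mode) or stat.S_ISFIFO(mode)


print(looks_like_file(tempfile.gettempdir()))  # False (a directory)
```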

genipe.tools.imputed_stats.main(args=None)[source]¶

The main function.

Parameters: args (argparse.Namespace) – the arguments to be parsed (if main() is called by another module)

genipe.tools.imputed_stats.parse_args(parser, args=None)[source]¶

Parses the command line options and arguments.

Parameters

parser (argparse.ArgumentParser) – the argument parser
args (list) – the list of arguments (if not taken from sys.argv)

Returns

the list of options and arguments

Return type

argparse.Namespace

Note

The only check that is done here is by the parser itself. Values are verified later by the check_args() function.

genipe.tools.imputed_stats.process_impute2_site(site_info)[source]¶

Process an IMPUTE2 site (a line in an IMPUTE2 file).

Parameters: site_info (list) – the impute2 line (split by space)
Returns: the results of the analysis
Return type: list

genipe.tools.imputed_stats.read_phenotype(i_filename, opts, check_duplicated=True)[source]¶

Reads the phenotype file.

Parameters

i_filename (str) – the name of the input file
opts (argparse.Namespace) – the options
check_duplicated (bool) – whether or not to check for duplicated samples

Returns

the phenotypes

Return type

pandas.DataFrame

This file is expected to be a tab separated file of phenotypes and covariates. The columns to use will be determined by the --sample-column and the --covar options.

For analysis including the X chromosome, the gender is automatically added as a covariate. The results are not shown to the user unless asked for.

genipe.tools.imputed_stats.read_samples(i_filename)[source]¶

Reads the sample file (produced by SHAPEIT).

Parameters: i_filename (str) – the name of the input file
Returns: the list of samples
Return type: pandas.DataFrame

This file contains the list of samples that are contained in the impute2 file (with same order). The expected format for this file is a tab separated file with a first row containing the following columns:

ID_1    ID_2    missing father  mother  sex     plink_pheno

The subsequent row will be discarded and should contain:

0       0       0       D       D       D       B

Note

We are mostly interested in the sample IDs corresponding to the ID_2 column. Their uniqueness is verified by pandas.
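The layout described above can be read as in the following sketch. It uses the standard csv module rather than pandas to stay self-contained, so it returns a plain list instead of a DataFrame. The name read_sample_ids and the sample values are made up for the illustration.

```python
import csv
import io


def read_sample_ids(handle):
    """Return the ID_2 values of a SHAPEIT-style sample file."""
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)
    next(reader)  # the second row (0 0 0 D D D B) is discarded
    id_col = header.index("ID_2")
    ids = [row[id_col] for row in reader if row]
    if len(ids) != len(set(ids)):
        raise ValueError("duplicated sample IDs")
    return ids


content = (
    "ID_1\tID_2\tmissing\tfather\tmother\tsex\tplink_pheno\n"
    "0\t0\t0\tD\tD\tD\tB\n"
    "fam1\tsample_1\t0\t0\t0\t1\t-9\n"
    "fam2\tsample_2\t0\t0\t0\t2\t-9\n"
)
print(read_sample_ids(io.StringIO(content)))  # ['sample_1', 'sample_2']
```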

genipe.tools.imputed_stats.read_sites_to_extract(i_filename)[source]¶

Reads the list of sites to extract.

Parameters: i_filename (str) – The input filename containing the IDs of the variants to consider for the analysis.
Returns: A set containing the variants.
Return type: set

The expected file format is simply a list of variants. Every row should correspond to a single variant identifier.

3:60069:t
rs123456:A
3:60322:A

Typically, this is used to analyze only variants that passed some QC threshold. The genipe pipeline generates this file at the ‘merge_impute2’ step.
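Reading such a file into a set takes only a few lines, as sketched below. The name read_sites is hypothetical, and skipping blank lines is an assumption.

```python
import io


def read_sites(handle):
    """Return the set of variant identifiers, one per non-empty line."""
    return {line.strip() for line in handle if line.strip()}


sites = read_sites(io.StringIO("3:60069:t\nrs123456:A\n3:60322:A\n"))
print("rs123456:A" in sites)  # True
```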

genipe.tools.imputed_stats.samples_with_hetero_calls(data, hetero_c)[source]¶

Gets male and heterozygous calls.

Parameters

data (pandas.DataFrame) – the probability matrix
hetero_c (str) – the name of the heterozygous column

Returns

samples where call is heterozygous

Return type

pandas.Index

Note

If there are no data (i.e. no males), an empty list is returned.

genipe.tools.imputed_stats.skat_parse_impute2(impute2_filename, samples, markers_to_extract, phenotypes, remove_gender, out_prefix, args)[source]¶

Read the impute2 file and run the SKAT analysis.

Parameters

impute2_filename (str) – the name of the input file
samples (pandas.DataFrame) – the samples
markers_to_extract (set) – the set of markers to analyse
phenotypes (pandas.DataFrame) – the phenotypes
remove_gender (bool) – whether or not to remove the gender column
out_prefix (str) – the output prefix
args (argparse.Namespace) – the options

This function does most of the “dispatching” to run SKAT. It writes the input files to the disk, runs the generated R scripts to do the actual analysis and then writes the results to disk.

genipe.tools.imputed_stats.skat_read_snp_set(i_filename)[source]¶

Reads the SKAT SNP set file.

Parameters: i_filename (str) – the name of the input file
Returns: the SNP set for the SKAT analysis
Return type: pandas.DataFrame

This file has to be supplied by the user. The recognized columns are: variant, snp_set and weight. The weight column is optional and can be used to specify a custom weighting scheme for SKAT. If nothing is specified, the default Beta weights are used.

The file has to be tab delimited.
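A file with these columns could be parsed as in the sketch below, which relies on the standard csv module instead of pandas. The name read_snp_sets, the example variants and the decision to treat variant and snp_set as required are assumptions made for the illustration.

```python
import csv
import io


def read_snp_sets(handle):
    """Return (variant, snp_set, weight) tuples; weight may be None."""
    reader = csv.DictReader(handle, delimiter="\t")
    missing = {"variant", "snp_set"} - set(reader.fieldnames or [])
    if missing:
        raise ValueError("missing columns: {}".format(sorted(missing)))
    rows = []
    for row in reader:
        weight = row.get("weight")  # optional column
        rows.append((
            row["variant"],
            row["snp_set"],
            float(weight) if weight else None,
        ))
    return rows


content = "variant\tsnp_set\tweight\nrs1\tset1\t0.5\nrs2\tset1\t1.5\n"
print(read_snp_sets(io.StringIO(content)))
# [('rs1', 'set1', 0.5), ('rs2', 'set1', 1.5)]
```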

genipe.tools.genipe_tutorial¶

genipe.tools.genipe_tutorial.check_files(*filenames)[source]¶

Checks that all files exist.

Parameters: filenames (list) – the list of files to check
Returns: True if all files exist, False otherwise
Return type: bool

genipe.tools.genipe_tutorial.download_file(url, path)[source]¶

Downloads a file from a URL to a path.

Parameters

url (str) – the url to download
path (str) – the path where to save the file
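A download of this kind can be written with the standard library alone, as in the sketch below. The name download is hypothetical and this is not necessarily how genipe fetches its files.

```python
import shutil
import urllib.request


def download(url, path):
    """Stream the content found at url into the file located at path."""
    with urllib.request.urlopen(url) as response, open(path, "wb") as out:
        # Copy in chunks so that large files are never fully held in memory.
        shutil.copyfileobj(response, out)
```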

genipe.tools.genipe_tutorial.generate_bash(path)[source]¶

Generates a bash script to launch the imputation pipeline.

Parameters: path (str) – the path to write the bash script

genipe.tools.genipe_tutorial.get_genotypes(path)[source]¶

Gets the genotypes files.

Parameters: path (str) – the path where to put the genotypes

genipe.tools.genipe_tutorial.get_hg19(path)[source]¶

Gets the hg19 reference file.

Parameters: path (str) – the path where to put the reference

genipe.tools.genipe_tutorial.get_impute2(os_name, arch, path)[source]¶

Gets impute2 depending on the system, and puts it in ‘path’.

Parameters

os_name (str) – the name of the OS
arch (str) – the architecture of the system
path (str) – the path where to put impute2

Note

If the binary is in the system path, it is copied to the destination path. Otherwise, we download it.
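The copy-or-download logic in the note can be sketched as follows. The name fetch_binary and the download callback are hypothetical and only illustrate the decision. The actual download locations are not shown on this page.

```python
import os
import shutil


def fetch_binary(name, dest_dir, download):
    """Copy name from the system path if found, otherwise download it."""
    target = os.path.join(dest_dir, name)
    found = shutil.which(name)  # None when the binary is not in the PATH
    if found is not None:
        shutil.copy(found, target)
    else:
        download(target)  # hypothetical callback writing the binary
    return target
```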

genipe.tools.genipe_tutorial.get_impute2_ref(path)[source]¶

Gets the impute2’s reference files.

Parameters: path (str) – the path where to put the reference files

genipe.tools.genipe_tutorial.get_os_info()[source]¶

Getting the OS information.

Returns

the first element is the name of the OS, and the second is the system’s architecture

Return type

tuple

Note

The tutorial does not work on the Windows operating system. The script will quit unless the operating system is Linux or Darwin (MacOSX).
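The operating system check can be sketched with the platform module. The name os_info is hypothetical, and using platform.machine() for the architecture is an assumption, since this page does not say how the architecture is determined.

```python
import platform


def os_info(system=None, machine=None):
    """Return (OS name, architecture), quitting on unsupported systems."""
    # The optional arguments only exist to make the sketch easy to try out.
    system = system or platform.system()
    machine = machine or platform.machine()
    if system not in ("Linux", "Darwin"):
        raise SystemExit("{}: unsupported operating system".format(system))
    return system, machine


print(os_info("Darwin", "x86_64"))  # ('Darwin', 'x86_64')
```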

genipe.tools.genipe_tutorial.get_plink(os_name, arch, path)[source]¶

Gets Plink depending on the system, and puts it in ‘path’.

Parameters

os_name (str) – the name of the OS
arch (str) – the architecture of the system
path (str) – the path where to put Plink

Note

If the binary is in the system path, it is copied to the destination path. Otherwise, we download it.

genipe.tools.genipe_tutorial.get_shapeit(os_name, arch, path)[source]¶

Gets shapeit depending on the system, and puts it in ‘path’.

Parameters

os_name (str) – the name of the OS
arch (str) – the architecture of the system
path (str) – the path where to put shapeit

Note

If the binary is in the system path, it is copied to the destination path. Otherwise, we download it.

genipe.tools.genipe_tutorial.main(args=None)[source]¶

The main function.

Parameters: args (argparse.Namespace) – the arguments to be parsed (if main() is called by another module)

genipe.tools.genipe_tutorial.parse_args(parser, args=None)[source]¶

Parses the command line options and arguments.

Parameters

parser (argparse.ArgumentParser) – the argument parser
args (list) – the list of arguments (if not taken from sys.argv)

Returns

the list of options and arguments

Return type

argparse.Namespace

Note

The only check that is done here is by the parser itself. Values are verified later by the check_args() function.

genipe.tools.genipe_tutorial.untar_file(path, fn)[source]¶

Extracts a tar archive.

Parameters

path (str) – the path to where the file will be extracted
fn (str) – the name of the tar archive
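Extraction with the standard tarfile module looks roughly like the sketch below. The name untar is hypothetical, and the archive is assumed to come from a trusted source, since extractall() writes whatever member paths the archive contains.

```python
import tarfile


def untar(path, fn):
    """Extract the tar archive fn into the directory path."""
    # Mode "r" lets tarfile detect the compression (none, gzip, bz2, xz).
    with tarfile.open(fn, "r") as archive:
        archive.extractall(path)
```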
