# genipe.tools package

## genipe.tools.impute2_extractor module

genipe.tools.impute2_extractor.check_args(args)[source]

Checks the arguments and options.

Parameters: args (argparse.Namespace) – the options to verify

Note

If there is a problem, a genipe.error.GenipeError is raised.

Note

Nothing is checked (apart from the impute2 files) if indexing is requested (--index option).

genipe.tools.impute2_extractor.extract_companion_files(i_prefix, o_prefix, to_extract)[source]

Extracts markers from companion files (if they exist).

Parameters:

- i_prefix (str) – the prefix of the input file
- o_prefix (str) – the prefix of the output file
- to_extract (set) – the set of markers to extract

genipe.tools.impute2_extractor.extract_markers(fn, to_extract, out_prefix, out_format, prob_t, is_long)[source]

Extracts according to names.

Parameters:

- fn (str) – the name of the input file
- to_extract (set) – the list of markers to extract for each input file
- out_prefix (str) – the output prefix
- out_format (list) – the output format(s)
- prob_t (float) – the probability threshold
- is_long (bool) – True if format needs to be long

genipe.tools.impute2_extractor.gather_extraction(fn, maf, rate, info, extract_filename, genomic_range)[source]

Gather positions that are required.

Parameters:

- fn (str) – the impute2 filename
- maf (float) – the minor allele frequency threshold (might be None)
- rate (float) – the call rate threshold (might be None)
- info (float) – the marker information value threshold (might be None)
- extract_filename (str) – the name of the file containing marker names to extract (might be None)
- genomic_range (str) – the genomic range for extraction

Returns: the set of markers to extract

Return type: set

If extraction by marker name is required, only those markers will be extracted. Otherwise, maf, rate, info or genomic_range can be specified (alone or together) to extract markers according to minor allele frequency, call rate, marker information value and genomic location.
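The precedence described above (names first, thresholds otherwise) can be sketched as follows. This is only an illustration: `select_markers`, the per-marker dictionary keys and the use of `>=` comparisons are assumptions, not the genipe implementation, and the genomic range filter is left out for brevity.

```python
def select_markers(markers, names=None, maf=None, rate=None, info=None):
    """Return the set of marker names passing the requested filters.

    `markers` maps a marker name to a dict with the (hypothetical) keys
    "maf", "rate" and "info".
    """
    if names is not None:
        # Extraction by name takes precedence over every threshold.
        return {name for name in markers if name in names}

    selected = set()
    for name, stats in markers.items():
        # Each threshold is optional (None) and they can be combined.
        if maf is not None and stats["maf"] < maf:
            continue
        if rate is not None and stats["rate"] < rate:
            continue
        if info is not None and stats["info"] < info:
            continue
        selected.add(name)
    return selected


markers = {
    "rs1": {"maf": 0.20, "rate": 0.99, "info": 0.95},
    "rs2": {"maf": 0.01, "rate": 0.99, "info": 0.95},
    "rs3": {"maf": 0.30, "rate": 0.90, "info": 0.40},
}
by_name = select_markers(markers, names={"rs2"}, maf=0.05)
by_threshold = select_markers(markers, maf=0.05, info=0.5)
```

Here `by_name` ignores the MAF threshold because a name list was given, while `by_threshold` keeps only the markers passing both thresholds.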

genipe.tools.impute2_extractor.get_file_prefix(fn)[source]

Gets the filename prefix.

Parameters: fn (str) – the name of the file from which the prefix is required

Returns: the prefix of the file

Return type: str

This function removes the extension from the file name and returns its prefix (e.g. test.impute2 returns test, and ../test.impute2.gz returns ../test).
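A minimal sketch of this behaviour, assuming that a trailing `.gz` is dropped before the remaining extension is removed (the helper name `file_prefix` is hypothetical and the real function may differ):

```python
import os


def file_prefix(fn):
    """Return the file name without its extension (and without '.gz')."""
    prefix, ext = os.path.splitext(fn)
    if ext == ".gz":
        # Compressed file: remove the extension hidden behind '.gz' too.
        prefix = os.path.splitext(prefix)[0]
    return prefix


plain = file_prefix("test.impute2")
compressed = file_prefix("../test.impute2.gz")
```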

genipe.tools.impute2_extractor.get_samples(fn)[source]

Reads the sample file and extracts the information.

Parameters: fn (str) – the name of the sample file

Returns: the sample information

Return type: pandas.DataFrame

genipe.tools.impute2_extractor.index_file(fn)[source]

Indexes the impute2 file.

Parameters: fn (str) – the name of the impute2 file

This function uses genipe.formats.index.get_index() to create the index file if it’s missing.

Note

We won’t catch the genipe.error.GenipeError exception if it’s raised, since the message will be relevant to the user.

genipe.tools.impute2_extractor.main(args=None)[source]

The main function.

Parameters: args (argparse.Namespace) – the arguments to be parsed (if main() is called by another module)

genipe.tools.impute2_extractor.parse_args(parser, args=None)[source]

Parses the command line options and arguments.

Parameters:

- parser (argparse.ArgumentParser) – the argument parser
- args (list) – the list of arguments (if not taken from sys.argv)

Returns: the list of options and arguments

Return type: argparse.Namespace

Note

The only check that is done here is by the parser itself. Values are verified later by the check_args() function.
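The split between parsing and value checking can be illustrated with the sketch below. The option names `--input` and `--prob`, as well as the `ToolError` class (a stand-in for genipe.error.GenipeError), are assumptions made for the example and are not taken from the tool itself.

```python
import argparse


class ToolError(Exception):
    """Stand-in for genipe.error.GenipeError in this sketch."""


def parse_args(parser, args=None):
    """Only the parser's own checks (types, required options) happen here."""
    parser.add_argument("--input", required=True)            # hypothetical
    parser.add_argument("--prob", type=float, default=0.9)   # hypothetical
    # When args is None, argparse falls back to sys.argv[1:].
    return parser.parse_args(args)


def check_args(args):
    """Verify the parsed values and raise on the first problem."""
    if not 0 <= args.prob <= 1:
        raise ToolError("--prob must be between 0 and 1")


parser = argparse.ArgumentParser(prog="example-tool")
options = parse_args(parser, ["--input", "chr1.impute2", "--prob", "0.95"])
check_args(options)
```

Passing an explicit list to `parse_args` is what allows `main()` to be called from another module instead of from the command line.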

genipe.tools.impute2_extractor.print_data(o_files, prob_t, fid, iid, is_long, *, line=None, row=None)[source]

Prints an impute2 line.

Parameters:

- o_files (dict) – the output files
- prob_t (float) – the probability threshold
- fid (list) – the list of family IDs
- iid (list) – the list of sample IDs
- is_long (bool) – True if the format is long (dosage, calls)
- line (str) – the impute2 line
- row (list) – the impute2 line, split by spaces
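To give an idea of how a probability threshold such as prob_t can be applied to an impute2 line, here is a hedged sketch. It assumes the usual IMPUTE2 layout (five leading columns followed by one probability triplet per sample), computes the dosage relative to the second allele, and treats a call as missing when its best probability is below the threshold. The helper `best_call` is hypothetical; genipe's own conventions (for example which allele the dosage refers to) may differ.

```python
def best_call(probs, prob_t):
    """Return (genotype index or None, dosage) for one probability triplet."""
    p_aa, p_ab, p_bb = probs
    # Expected number of copies of the second allele (an assumption here).
    dosage = p_ab + 2 * p_bb
    best = max(range(3), key=lambda i: probs[i])
    # The call is considered missing below the probability threshold.
    call = best if probs[best] >= prob_t else None
    return call, dosage


row = "1 rs1 1000 A G 0.95 0.05 0 0.4 0.5 0.1".split(" ")
probs = [float(value) for value in row[5:]]
triplets = [tuple(probs[i:i + 3]) for i in range(0, len(probs), 3)]
results = [best_call(triplet, 0.9) for triplet in triplets]
```

With a threshold of 0.9, the first sample gets a confident call while the second one is left missing, although a dosage is still available for both.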

## genipe.tools.impute2_merger module

genipe.tools.impute2_merger.check_args(args)[source]

Checks the arguments and options.

Parameters: args (argparse.Namespace) – the options to verify

Note

If there is a problem, a genipe.error.GenipeError is raised.

genipe.tools.impute2_merger.concatenate_files(i_filenames, out_prefix, real_chrom, options)[source]

Concatenates and extracts information from IMPUTE2 GEN file(s).

Parameters:

- i_filenames (list) – the list of input filenames (to concatenate)
- out_prefix (str) – the output prefix for the output files
- real_chrom (str) – the chromosome contained in all the input files
- options (argparse.Namespace) – the options

This function will create the following eight files:

| File name | Description |
| --- | --- |
| .impute2 | Imputation results (merged from all the input files). |
| .alleles | Description of the reference and alternative alleles at each site. |
| .imputed_sites | List of imputed sites (excluding sites that were previously genotyped in the study cohort). |
| .impute2_info | SNP-wise information file with one line per SNP and a single header line at the beginning. |
| .completion_rates | Number of missing values and completion rate for all sites (using the probability threshold set by the user, where the default is greater than or equal to 0.9). |
| .good_sites | List of sites that pass the completion rate threshold (set by the user, where the default is greater than or equal to 0.98) using the probability threshold (set by the user, where the default is greater than or equal to 0.9). |
| .map | A map file describing the genomic location of all sites. |
| .maf | File containing the minor allele frequency (along with minor allele identification) for all sites using the probability threshold of 0.9. When no genotypes are available (because they are all below the threshold), the MAF is NA. |

genipe.tools.impute2_merger.main(args=None)[source]

The main function.

Parameters: args (argparse.Namespace) – the arguments to be parsed (if main() is called by another module)

genipe.tools.impute2_merger.parse_args(parser, args=None)[source]

Parses the command line options and arguments.

Parameters:

- parser (argparse.ArgumentParser) – the argument parser
- args (list) – the list of arguments (if not taken from sys.argv)

Returns: the list of options and arguments

Return type: argparse.Namespace

Note

The only check that is done here is by the parser itself. Values are verified later by the check_args() function.

## genipe.tools.imputed_stats module

genipe.tools.imputed_stats.check_args(args)[source]

Checks the arguments and options.

Parameters: args (argparse.Namespace) – the options to verify

Note

If there is a problem, a genipe.error.GenipeError is raised.

genipe.tools.imputed_stats.compute_statistics(impute2_filename, samples, markers_to_extract, phenotypes, remove_gender, out_prefix, options)[source]

Parses IMPUTE2 file while computing statistics.

Parameters:

- impute2_filename (str) – the name of the input file
- samples (pandas.DataFrame) – the list of samples
- markers_to_extract (set) – the set of markers to extract
- phenotypes (pandas.DataFrame) – the phenotypes
- remove_gender (bool) – whether or not to remove the gender column
- out_prefix (str) – the output prefix
- options (argparse.Namespace) – the options

This function takes care of parallelism. It reads the Impute2 file and fills a queue that will trigger the analysis when full.

If only one process is launched, the rows are analyzed as they are read.
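The batching idea can be sketched as below: rows are accumulated until the batch is full, at which point the batch is handed to a pool of workers. This is a simplified illustration only. It uses a thread pool from the standard library as a stand-in for worker processes, and the names `analyze_rows`, `batch_size` and `count_fields` are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def analyze_rows(rows, analyze, nb_process=1, batch_size=3):
    """Analyze rows one by one, or in batches when several workers are used."""
    if nb_process == 1:
        # Single worker: each row is analyzed as soon as it is read.
        return [analyze(row) for row in rows]

    results = []
    batch = []
    with ThreadPoolExecutor(max_workers=nb_process) as pool:
        for row in rows:
            batch.append(row)
            if len(batch) == batch_size:
                # The batch is full, which triggers the analysis.
                results.extend(pool.map(analyze, batch))
                batch = []
        if batch:
            # Remaining rows that did not fill a complete batch.
            results.extend(pool.map(analyze, batch))
    return results


def count_fields(row):
    """Toy analysis: number of space-separated fields in a line."""
    return len(row.split(" "))


lines = ["1 rs1 10 A G", "1 rs2 20 C T 0 1 0", "1 rs3 30 G A", "1 rs4 40 T C 1 0 0"]
serial = analyze_rows(lines, count_fields, nb_process=1)
parallel = analyze_rows(lines, count_fields, nb_process=2)
```

Both code paths return the results in input order, so the output does not depend on the number of workers.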

genipe.tools.imputed_stats.fit_cox(data, time_to_event, event, formula, result_col, **kwargs)[source]

Fit a Cox proportional hazards model to the data.

Parameters:

- data (pandas.DataFrame) – the data to analyse
- time_to_event (str) – the time to event column for the survival analysis
- event (str) – the event column for the survival analysis
- formula (str) – the formula for the data preparation
- result_col (str) – the column that will contain the results

Returns: the results from the survival analysis

Return type: numpy.array

Note

The tie method used is Efron. Normalization is set to False.

genipe.tools.imputed_stats.fit_linear(data, formula, result_col, **kwargs)[source]

Fit a linear regression to the data.

Parameters:

- data (pandas.DataFrame) – the data to analyse
- formula (str) – the formula for the linear regression
- result_col (str) – the column that will contain the results

Returns: the results from the linear regression

Return type: list

genipe.tools.imputed_stats.fit_logistic(data, formula, result_col, **kwargs)[source]

Fit a logistic regression to the data.

Parameters:

- data (pandas.DataFrame) – the data to analyse
- formula (str) – the formula for the logistic regression
- result_col (str) – the column that will contain the results

Returns: the results from the logistic regression

Return type: list

genipe.tools.imputed_stats.fit_mixedlm(data, formula, use_ml, groups, result_col, random_effects, mixedlm_p, interaction, **kwargs)[source]

Fit a linear mixed effects model to the data.

Parameters:

- data (pandas.DataFrame) – the data to analyse
- formula (str) – the formula for the linear mixed effects model
- use_ml (bool) – whether to use ML instead of REML
- groups (str) – the column containing the groups
- result_col (str) – the column that will contain the results
- random_effects (pandas.Series) – the random effects
- mixedlm_p (float) – the p-value threshold for which loci will be computed with the real MixedLM analysis
- interaction (bool) – whether there is an interaction or not

Returns: the results from the linear mixed effects model

Return type: list

genipe.tools.imputed_stats.get_formula(phenotype, covars, interaction, gender_c, categorical)[source]

Creates the linear/logistic regression formula (for statsmodels).

Parameters:

- phenotype (str) – the phenotype column
- covars (list) – the list of covariate columns
- interaction (str) – the interaction column

Returns: the formula for the statistical analysis

Return type: str

Note

The phenotype column needs to be specified. The list of covariates might be empty (if no covariates are necessary). The interaction column can be set to None if there is no interaction.

Note

The gender column should be categorical (hence, the formula requires the gender to be wrapped in C(), e.g. C(Gender)).
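As an illustration of the two notes above, the following sketch builds a formula string in the style accepted by statsmodels formulas. The helper `build_formula`, the genotype term name `geno` and the exact ordering of the terms are assumptions for this example, and the undocumented categorical parameter is not modelled.

```python
def build_formula(phenotype, covars, interaction=None, gender_c=None,
                  geno="geno"):
    """Build a 'phenotype ~ terms' formula string."""
    terms = [geno]
    for covar in covars:
        # The gender column is categorical, so it is wrapped in C().
        terms.append("C({})".format(covar) if covar == gender_c else covar)
    if interaction is not None:
        # Interaction between the genotype and the requested column.
        terms.append("{}*{}".format(geno, interaction))
    return "{} ~ {}".format(phenotype, " + ".join(terms))


simple = build_formula("bmi", [])
full = build_formula("bmi", ["age", "Gender"], interaction="age",
                     gender_c="Gender")
```

An empty covariate list and a missing interaction both reduce the formula to the phenotype and the genotype term.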

genipe.tools.imputed_stats.is_file_like(fn)[source]

Checks if the path is like a file (it might be a named pipe).

Parameters: fn (str) – the path to check

Returns: True if path is like a file, False otherwise.

Return type: bool

genipe.tools.imputed_stats.main(args=None)[source]

The main function.

Parameters: args (argparse.Namespace) – the arguments to be parsed (if main() is called by another module)

genipe.tools.imputed_stats.parse_args(parser, args=None)[source]

Parses the command line options and arguments.

Parameters:

- parser (argparse.ArgumentParser) – the argument parser
- args (list) – the list of arguments (if not taken from sys.argv)

Returns: the list of options and arguments

Return type: argparse.Namespace

Note

The only check that is done here is by the parser itself. Values are verified later by the check_args() function.

genipe.tools.imputed_stats.process_impute2_site(site_info)[source]

Process an IMPUTE2 site (a line in an IMPUTE2 file).

Parameters: site_info (list) – the impute2 line (split by space)

Returns: the results of the analysis

Return type: list

genipe.tools.imputed_stats.read_phenotype(i_filename, opts, check_duplicated=True)[source]

Parameters:

- i_filename (str) – the name of the input file
- opts (argparse.Namespace) – the options
- check_duplicated (bool) – whether or not to check for duplicated samples

Returns: the phenotypes

Return type: pandas.DataFrame

This file is expected to be a tab separated file of phenotypes and covariates. The columns to use will be determined by the --sample-column and the --covar options.

For analysis including the X chromosome, the gender is automatically added as a covariate. The results are not shown to the user unless asked for.

genipe.tools.imputed_stats.read_samples(i_filename)[source]

Reads the sample file (produced by SHAPEIT).

Parameters: i_filename (str) – the name of the input file

Returns: the list of samples

Return type: pandas.DataFrame

This file contains the list of samples that are contained in the impute2 file (with same order). The expected format for this file is a tab separated file with a first row containing the following columns:

ID_1    ID_2    missing father  mother  sex     plink_pheno


The subsequent row will be discarded and should contain:

0       0       0       D       D       D       B


Notes

We are mostly interested in the sample IDs corresponding to the ID_2 column. Their uniqueness is verified by pandas.
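The expected layout can be illustrated with a small reader. This sketch relies only on the standard library (the documented function returns a pandas.DataFrame instead), and the helper name `read_sample_ids` is hypothetical.

```python
import csv
import io


def read_sample_ids(handle):
    """Return the sample IDs (ID_2 column) of a tab separated sample file."""
    reader = csv.DictReader(handle, delimiter="\t")
    # The row following the header only describes the columns: discard it.
    next(reader)
    ids = [row["ID_2"] for row in reader]
    if len(ids) != len(set(ids)):
        raise ValueError("duplicated sample IDs in the ID_2 column")
    return ids


content = (
    "ID_1\tID_2\tmissing\tfather\tmother\tsex\tplink_pheno\n"
    "0\t0\t0\tD\tD\tD\tB\n"
    "fam1\tsample1\t0\t0\t0\t1\t-9\n"
    "fam2\tsample2\t0\t0\t0\t2\t-9\n"
)
sample_ids = read_sample_ids(io.StringIO(content))
```

The order of `sample_ids` follows the file, which matters because the samples are expected in the same order as in the impute2 file.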

genipe.tools.imputed_stats.read_sites_to_extract(i_filename)[source]

Reads the list of sites to extract.

Parameters: i_filename (str) – The input filename containing the IDs of the variants to consider for the analysis.

Returns: A set containing the variants.

Return type: set

The expected file format is simply a list of variants. Every row should correspond to a single variant identifier.

3:60069:t
rs123456:A
3:60322:A


Typically, this is used to analyze only variants that passed some QC threshold. The genipe pipeline generates this file at the ‘merge_impute2’ step.
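Reading such a file amounts to collecting one identifier per line into a set. A minimal sketch, assuming that blank lines should be ignored (the helper name `read_sites` is hypothetical):

```python
import io


def read_sites(handle):
    """Return the set of variant identifiers, one per non-empty line."""
    return {line.strip() for line in handle if line.strip()}


sites = read_sites(io.StringIO("3:60069:t\nrs123456:A\n3:60322:A\n"))
```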

genipe.tools.imputed_stats.samples_with_hetero_calls(data, hetero_c)[source]

Gets the male samples with heterozygous calls.

Parameters:

- data (pandas.DataFrame) – the probability matrix
- hetero_c (str) – the name of the heterozygous column

Returns: samples where call is heterozygous

Return type: pandas.Index

Note

If there are no data (i.e. no males), an empty list is returned.

genipe.tools.imputed_stats.skat_parse_impute2(impute2_filename, samples, markers_to_extract, phenotypes, remove_gender, out_prefix, args)[source]

Read the impute2 file and run the SKAT analysis.

Parameters:

- impute2_filename (str) – the name of the input file
- samples (pandas.DataFrame) – the samples
- markers_to_extract (set) – the set of markers to analyse
- phenotypes (pandas.DataFrame) – the phenotypes
- remove_gender (bool) – whether or not to remove the gender column
- out_prefix (str) – the output prefix
- args (argparse.Namespace) – the options

This function does most of the “dispatching” to run SKAT. It writes the input files to the disk, runs the generated R scripts to do the actual analysis and then writes the results to disk.

genipe.tools.imputed_stats.skat_read_snp_set(i_filename)[source]

Reads the SKAT SNP set file.

Parameters: i_filename (str) – the name of the input file

Returns: the SNP set for the SKAT analysis

Return type: pandas.DataFrame

This file has to be supplied by the user. The recognized columns are: variant, snp_set and weight. The weight column is optional and can be used to specify a custom weighting scheme for SKAT. If nothing is specified, the default Beta weights are used.

The file has to be tab delimited.
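The file layout can be illustrated with the sketch below. It uses the standard library only (the documented function returns a pandas.DataFrame), treats variant and snp_set as required columns, which is an assumption, and represents a missing weight as None. The helper name `read_snp_set` is hypothetical.

```python
import csv
import io


def read_snp_set(handle):
    """Read a tab delimited SNP set file into a list of dictionaries."""
    reader = csv.DictReader(handle, delimiter="\t")
    missing = {"variant", "snp_set"} - set(reader.fieldnames or [])
    if missing:
        raise ValueError("missing column(s): " + ", ".join(sorted(missing)))

    rows = []
    for row in reader:
        weight = row.get("weight")
        rows.append({
            "variant": row["variant"],
            "snp_set": row["snp_set"],
            # No weight means the default weighting scheme would apply.
            "weight": float(weight) if weight else None,
        })
    return rows


weighted = read_snp_set(io.StringIO(
    "variant\tsnp_set\tweight\nrs1:A\tset1\t0.5\nrs2:G\tset1\t1.5\n"
))
unweighted = read_snp_set(io.StringIO("variant\tsnp_set\nrs1:A\tset1\n"))
```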

## genipe.tools.genipe_tutorial module

genipe.tools.genipe_tutorial.check_files(*filenames)[source]

Checks that all files exist.

Parameters: filenames (list) – the list of files to check

Returns: True if all files exist, False otherwise

Return type: bool

genipe.tools.genipe_tutorial.download_file(url, path)[source]

Parameters:

- url (str) – the url to download
- path (str) – the path where to save the file

genipe.tools.genipe_tutorial.generate_bash(path)[source]

Generates a bash script to launch the imputation pipeline.

Parameters: path (str) – the path to write the bash script

genipe.tools.genipe_tutorial.get_genotypes(path)[source]

Gets the genotypes files.

Parameters: path (str) – the path where to put the genotypes

genipe.tools.genipe_tutorial.get_hg19(path)[source]

Gets the hg19 reference file.

Parameters: path (str) – the path where to put the reference

genipe.tools.genipe_tutorial.get_impute2(os_name, arch, path)[source]

Gets impute2 depending on the system, and puts it in ‘path’.

Parameters:

- os_name (str) – the name of the OS
- arch (str) – the architecture of the system
- path (str) – the path where to put impute2

Note

If the binary is in the system path, it is copied to the destination path. Otherwise, we download it.
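The "copy if available, otherwise download" logic can be sketched as follows. The helper `fetch_binary`, its `download` callback and the `search_path` argument are hypothetical and only serve the illustration. The real functions know which URL to download for each operating system and architecture.

```python
import os
import shutil
import tempfile


def fetch_binary(name, dest_dir, download, search_path=None):
    """Copy `name` from the system path if found, otherwise download it."""
    found = shutil.which(name, path=search_path)
    dest = os.path.join(dest_dir, name)
    if found is not None:
        # The binary is already installed: copy it to the destination.
        shutil.copy(found, dest)
    else:
        # Not installed: delegate to the (hypothetical) download callback.
        download(dest)
    return dest


with tempfile.TemporaryDirectory() as bin_dir, \
        tempfile.TemporaryDirectory() as out_dir:
    # Create a fake executable so that the "found" branch can be shown.
    tool = os.path.join(bin_dir, "mytool")
    with open(tool, "w") as handle:
        handle.write("#!/bin/sh\n")
    os.chmod(tool, 0o755)

    downloaded = []
    copied = fetch_binary("mytool", out_dir, downloaded.append,
                          search_path=bin_dir)
    copied_ok = os.path.isfile(copied) and not downloaded

    fetch_binary("missing_tool", out_dir, downloaded.append,
                 search_path=bin_dir)
    fetched_ok = downloaded == [os.path.join(out_dir, "missing_tool")]
```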

genipe.tools.genipe_tutorial.get_impute2_ref(path)[source]

Gets the impute2 reference files.

Parameters: path (str) – the path where to put the reference files

genipe.tools.genipe_tutorial.get_os_info()[source]

Gets the OS information.

Returns: first element is the name of the OS, and the second is the system’s architecture

Return type: tuple

Note

The tutorial does not work on the Windows operating system. The script will quit unless the operating system is Linux or Darwin (MacOSX).
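A sketch of such a check using the standard `platform` module is shown below. How the tutorial actually determines the architecture is not documented here, so the use of `platform.architecture()` is an assumption, and the optional arguments of the hypothetical `os_info` helper exist only to make the example easy to try with other values.

```python
import platform


def os_info(system=None, arch=None):
    """Return (OS name, architecture), quitting on unsupported systems."""
    system = system or platform.system()
    arch = arch or platform.architecture()[0]
    if system not in ("Linux", "Darwin"):
        # Windows (or anything else) is not supported by the tutorial.
        raise SystemExit("{}: unsupported operating system".format(system))
    return system, arch


linux = os_info("Linux", "64bit")
```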

genipe.tools.genipe_tutorial.get_plink(os_name, arch, path)[source]

Gets Plink depending on the system, and puts it in ‘path’.

Parameters:

- os_name (str) – the name of the OS
- arch (str) – the architecture of the system
- path (str) – the path where to put Plink

Note

If the binary is in the system path, it is copied to the destination path. Otherwise, we download it.

genipe.tools.genipe_tutorial.get_shapeit(os_name, arch, path)[source]

Gets shapeit depending on the system, and puts it in ‘path’.

Parameters:

- os_name (str) – the name of the OS
- arch (str) – the architecture of the system
- path (str) – the path where to put shapeit

Note

If the binary is in the system path, it is copied to the destination path. Otherwise, we download it.

genipe.tools.genipe_tutorial.main(args=None)[source]

The main function.

Parameters: args (argparse.Namespace) – the arguments to be parsed (if main() is called by another module)

genipe.tools.genipe_tutorial.parse_args(parser, args=None)[source]

Parses the command line options and arguments.

Parameters:

- parser (argparse.ArgumentParser) – the argument parser
- args (list) – the list of arguments (if not taken from sys.argv)

Returns: the list of options and arguments

Return type: argparse.Namespace

Note

The only check that is done here is by the parser itself. Values are verified later by the check_args() function.

genipe.tools.genipe_tutorial.untar_file(path, fn)[source]

Extracts a tar archive.

Parameters:

- path (str) – the path to where the file will be extracted
- fn (str) – the name of the tar archive
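Extracting an archive with the standard `tarfile` module could look like the sketch below. It mirrors the documented argument order (destination first, archive name second), but the helper name `untar` and the rest of the code are only an assumption about how such a function may be written.

```python
import os
import tarfile
import tempfile


def untar(path, fn):
    """Extract the tar archive `fn` into the directory `path`."""
    # Mode "r" lets tarfile detect the compression (gzip, bz2, ...) itself.
    with tarfile.open(fn, "r") as archive:
        archive.extractall(path)


with tempfile.TemporaryDirectory() as tmp:
    # Build a small archive to extract.
    source = os.path.join(tmp, "hello.txt")
    with open(source, "w") as handle:
        handle.write("hello\n")
    archive_fn = os.path.join(tmp, "data.tar.gz")
    with tarfile.open(archive_fn, "w:gz") as archive:
        archive.add(source, arcname="hello.txt")

    out_dir = os.path.join(tmp, "out")
    os.mkdir(out_dir)
    untar(out_dir, archive_fn)
    extracted = os.path.isfile(os.path.join(out_dir, "hello.txt"))
```

Only archives from a trusted source should be extracted this way, since `extractall` writes whatever paths the archive contains.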