genipe.tools package

Module contents

genipe.tools.impute2_extractor module

genipe.tools.impute2_extractor.check_args(args)[source]

Checks the arguments and options.

Parameters:args (argparse.Namespace) – the options to verify

Note

If there is a problem, a genipe.error.GenipeError is raised.

Note

Nothing is checked (apart from the impute2 files) if indexation is requested (--index option).

genipe.tools.impute2_extractor.extract_companion_files(i_prefix, o_prefix, to_extract)[source]

Extracts markers from companion files (if they exist).

Parameters:
  • i_prefix (str) – the prefix of the input file
  • o_prefix (str) – the prefix of the output file
  • to_extract (set) – the set of markers to extract
genipe.tools.impute2_extractor.extract_markers(fn, to_extract, out_prefix, out_format, prob_t, is_long)[source]

Extracts markers according to their names.

Parameters:
  • fn (str) – the name of the input file
  • to_extract (set) – the set of markers to extract for each input file
  • out_prefix (str) – the output prefix
  • out_format (list) – the output format(s)
  • prob_t (float) – the probability threshold
  • is_long (bool) – True if format needs to be long
genipe.tools.impute2_extractor.gather_extraction(fn, maf, rate, info, extract_filename, genomic_range)[source]

Gather positions that are required.

Parameters:
  • fn (str) – the impute2 filename
  • maf (float) – the minor allele frequency threshold (might be None)
  • rate (float) – the call rate threshold (might be None)
  • info (float) – the marker information value threshold (might be None)
  • extract_filename (str) – the name of the file containing marker names to extract (might be None)
  • genomic_range (str) – the genomic range for extraction
Returns:

the set of markers to extract

Return type:

set

If extraction by marker name is required, only those markers will be extracted. Otherwise, maf, rate, info or genomic_range can be specified (alone or together) to extract markers according to minor allele frequency, call rate, information value and genomic location.
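
The selection logic described above can be sketched as follows. This is only an illustration, not genipe's implementation: the select_markers helper, the per-marker dictionaries and the inclusive (>=) comparisons are assumptions made for the example, and the genomic range is passed as a hypothetical (chrom, start, end) tuple instead of a string.

```python
def select_markers(markers, names=None, maf=None, rate=None, info=None,
                   genomic_range=None):
    """Return the set of marker names to extract (illustrative only)."""
    # Extraction by name takes precedence over every other criterion.
    if names is not None:
        return {m["name"] for m in markers if m["name"] in names}

    selected = set()
    for m in markers:
        # Each threshold is optional (None means "do not filter on it").
        if maf is not None and m["maf"] < maf:
            continue
        if rate is not None and m["rate"] < rate:
            continue
        if info is not None and m["info"] < info:
            continue
        if genomic_range is not None:
            chrom, start, end = genomic_range
            if m["chrom"] != chrom or not start <= m["pos"] <= end:
                continue
        selected.add(m["name"])
    return selected


# Made-up markers, for demonstration purposes only.
markers = [
    {"name": "rs1", "chrom": "3", "pos": 100, "maf": 0.20, "rate": 0.99, "info": 0.9},
    {"name": "rs2", "chrom": "3", "pos": 500, "maf": 0.01, "rate": 0.99, "info": 0.9},
    {"name": "rs3", "chrom": "4", "pos": 100, "maf": 0.30, "rate": 0.90, "info": 0.4},
]
by_name = select_markers(markers, names={"rs2"})
by_stats = select_markers(markers, maf=0.05, rate=0.95)
by_range = select_markers(markers, genomic_range=("3", 1, 200))
```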

genipe.tools.impute2_extractor.get_file_prefix(fn)[source]

Gets the filename prefix.

Parameters:fn (str) – the name of the file from which the prefix is required
Returns:the prefix of the file
Return type:str

This function removes the extension from the file name and returns its prefix (e.g. test.impute2 returns test, and ../test.impute2.gz returns ../test).
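
A minimal sketch that reproduces the two examples above. The file_prefix name and the rule used here (drop a trailing .gz, then drop one more extension) are assumptions; the actual function may handle other cases differently.

```python
import os


def file_prefix(fn):
    """Drop a trailing .gz (if any), then drop the remaining extension."""
    root, ext = os.path.splitext(fn)
    if ext == ".gz":
        root, ext = os.path.splitext(root)
    return root


plain = file_prefix("test.impute2")
gzipped = file_prefix("../test.impute2.gz")
```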

genipe.tools.impute2_extractor.get_samples(fn)[source]

Reads the sample file and extracts the information.

Parameters:fn (str) – the name of the sample file
Returns:the sample information
Return type:pandas.DataFrame
genipe.tools.impute2_extractor.index_file(fn)[source]

Indexes the impute2 file.

Parameters:fn (str) – the name of the impute2 file

This function uses genipe.formats.index.get_index() to create the index file if it is missing.

Note

We won’t catch the genipe.error.GenipeError exception if it’s raised, since the message will be relevant to the user.

genipe.tools.impute2_extractor.main(args=None)[source]

The main function.

Parameters:args (argparse.Namespace) – the arguments to be parsed (if main() is called by another module)
genipe.tools.impute2_extractor.parse_args(parser, args=None)[source]

Parses the command line options and arguments.

Parameters:
Returns:

the list of options and arguments

Return type:

argparse.Namespace

Note

The only check that is done here is by the parser itself. Values are verified later by the check_args() function.

genipe.tools.impute2_extractor.print_data(o_files, prob_t, fid, iid, is_long, *, line=None, row=None)[source]

Prints an impute2 line.

Parameters:
  • o_files (dict) – the output files
  • prob_t (float) – the probability threshold
  • fid (list) – the list of family IDs
  • iid (list) – the list of sample IDs
  • is_long (bool) – True if the format is long (dosage, calls)
  • line (str) – the impute2 line
  • row (list) – the impute2 line, split by spaces
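
To make the line and row parameters more concrete, here is a hedged sketch of how a space-separated impute2 line can be split into marker fields and per-sample probability triplets. It assumes the usual IMPUTE2 layout (five marker fields, then P(AA), P(AB) and P(BB) for each sample). The parse_impute2_line helper and the dosage definition (expected number of copies of the second allele) are illustrative choices, not necessarily what print_data does.

```python
def parse_impute2_line(line, prob_t=0.9):
    """Return (marker name, genotype calls, dosages) for one impute2 line."""
    row = line.rstrip("\r\n").split(" ")
    chrom, name, pos, a1, a2 = row[:5]
    probs = [float(p) for p in row[5:]]
    if len(probs) % 3 != 0:
        raise ValueError("expected three probabilities per sample")

    genotypes = ((a1, a1), (a1, a2), (a2, a2))
    calls, dosages = [], []
    for i in range(0, len(probs), 3):
        triplet = probs[i:i + 3]
        best = max(triplet)
        # Below the probability threshold, the genotype is left missing.
        calls.append(genotypes[triplet.index(best)] if best >= prob_t else None)
        # Expected number of copies of the second allele.
        dosages.append(triplet[1] + 2 * triplet[2])
    return name, calls, dosages


# A made-up line containing three samples.
line = "3 rs1 100 A G 0.95 0.05 0 0.1 0.85 0.05 0 0.02 0.98"
name, calls, dosages = parse_impute2_line(line)
```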

genipe.tools.impute2_merger module

genipe.tools.impute2_merger.check_args(args)[source]

Checks the arguments and options.

Parameters:args (argparse.Namespace) – the options to verify

Note

If there is a problem, a genipe.error.GenipeError is raised.

genipe.tools.impute2_merger.concatenate_files(i_filenames, out_prefix, real_chrom, options)[source]

Concatenates and extracts information from IMPUTE2 GEN file(s).

Parameters:
  • i_filenames (list) – the list of input filenames (to concatenate)
  • out_prefix (str) – the output prefix for the output files
  • real_chrom (str) – the chromosome contained in all the input files
  • options (argparse.Namespace) – the options

This function will create the following eight files:

File name            Description
.impute2             Imputation results (merged from all the input files).
.alleles             Description of the reference and alternative alleles at each site.
.imputed_sites       List of imputed sites (excluding sites that were previously genotyped in the study cohort).
.impute2_info        SNP-wise information file with one line per SNP and a single header line at the beginning.
.completion_rates    Number of missing values and completion rate for all sites (using the probability threshold set by the user; the default is greater than or equal to 0.9).
.good_sites          List of sites that pass the completion rate threshold (set by the user; the default is greater than or equal to 0.98) using the probability threshold (set by the user; the default is greater than or equal to 0.9).
.map                 A map file describing the genomic location of all sites.
.maf                 File containing the minor allele frequency (along with minor allele identification) for all sites, using the probability threshold of 0.9. When no genotypes are available (because they are all below the threshold), the MAF is NA.
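
The relationship between the probability threshold, the completion rate and the list of good sites can be illustrated with a short sketch. It assumes that a genotype counts as called when its highest probability is greater than or equal to the threshold; the completion_rate helper and the example values are hypothetical.

```python
def completion_rate(probs, prob_t=0.9):
    """Return (number of missing genotypes, completion rate) for one site."""
    triplets = [probs[i:i + 3] for i in range(0, len(probs), 3)]
    # A genotype is missing when no probability reaches the threshold.
    nb_missing = sum(1 for t in triplets if max(t) < prob_t)
    return nb_missing, (len(triplets) - nb_missing) / len(triplets)


# Probabilities for four samples at a single (made-up) site.
site = [0.95, 0.05, 0.0, 0.10, 0.85, 0.05, 0.0, 0.02, 0.98, 1.0, 0.0, 0.0]
nb_missing, rate = completion_rate(site)

# With the default completion threshold, this site would not be "good".
is_good_site = rate >= 0.98
```
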
genipe.tools.impute2_merger.main(args=None)[source]

The main function.

Parameters:args (argparse.Namespace) – the arguments to be parsed (if main() is called by another module)
genipe.tools.impute2_merger.parse_args(parser, args=None)[source]

Parses the command line options and arguments.

Parameters:
Returns:

the list of options and arguments

Return type:

argparse.Namespace

Note

The only check that is done here is by the parser itself. Values are verified later by the check_args() function.

genipe.tools.imputed_stats module

genipe.tools.imputed_stats.check_args(args)[source]

Checks the arguments and options.

Parameters:args (argparse.Namespace) – the options to verify

Note

If there is a problem, a genipe.error.GenipeError is raised.

genipe.tools.imputed_stats.compute_statistics(impute2_filename, samples, markers_to_extract, phenotypes, remove_gender, out_prefix, options)[source]

Parses IMPUTE2 file while computing statistics.

Parameters:
  • impute2_filename (str) – the name of the input file
  • samples (pandas.DataFrame) – the list of samples
  • markers_to_extract (set) – the set of markers to extract
  • phenotypes (pandas.DataFrame) – the phenotypes
  • remove_gender (bool) – whether or not to remove the gender column
  • out_prefix (str) – the output prefix
  • options (argparse.Namespace) – the options

This function takes care of parallelism. It reads the Impute2 file and fills a queue that will trigger the analysis when full.

If the number of processes to launch is 1, the rows are analyzed as they come.
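
The dispatching pattern described above can be sketched as follows. This is a simplified stand-in, not the actual implementation: threads replace processes to keep the example short, and the analyze function, the queue_size value and the compute helper are all hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def analyze(row):
    """Hypothetical per-site analysis: here, just the number of fields."""
    return len(row)


def compute(lines, nb_process=1, queue_size=3):
    rows = (line.split(" ") for line in lines)
    if nb_process == 1:
        # Single worker: analyze every row as soon as it is read.
        return [analyze(row) for row in rows]

    results, queue = [], []
    with ThreadPoolExecutor(max_workers=nb_process) as pool:
        for row in rows:
            queue.append(row)
            if len(queue) == queue_size:
                # The queue is full: analyze its content in parallel.
                results.extend(pool.map(analyze, queue))
                queue = []
        # Analyze whatever remains once the input is exhausted.
        results.extend(pool.map(analyze, queue))
    return results


lines = ["a b c", "d e", "f", "g h i j", "k l"]
sequential = compute(lines, nb_process=1)
parallel = compute(lines, nb_process=2)
```

Both calls return the results in input order, because map() preserves the order of its input.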

genipe.tools.imputed_stats.fit_cox(data, time_to_event, event, formula, result_col, **kwargs)[source]

Fits a Cox proportional hazards model to the data.

Parameters:
  • data (pandas.DataFrame) – the data to analyse
  • time_to_event (str) – the time to event column for the survival analysis
  • event (str) – the event column for the survival analysis
  • formula (str) – the formula for the data preparation
  • result_col (str) – the column that will contain the results
Returns:

the results from the survival analysis

Return type:

numpy.array

Note

The tie method used is Efron. Normalization is set to False.

genipe.tools.imputed_stats.fit_linear(data, formula, result_col, **kwargs)[source]

Fit a linear regression to the data.

Parameters:
  • data (pandas.DataFrame) – the data to analyse
  • formula (str) – the formula for the linear regression
  • result_col (str) – the column that will contain the results
Returns:

the results from the linear regression

Return type:

list

genipe.tools.imputed_stats.fit_logistic(data, formula, result_col, **kwargs)[source]

Fit a logistic regression to the data.

Parameters:
  • data (pandas.DataFrame) – the data to analyse
  • formula (str) – the formula for the logistic regression
  • result_col (str) – the column that will contain the results
Returns:

the results from the logistic regression

Return type:

list

genipe.tools.imputed_stats.fit_mixedlm(data, formula, use_ml, groups, result_col, random_effects, mixedlm_p, interaction, **kwargs)[source]

Fit a linear mixed effects model to the data.

Parameters:
  • data (pandas.DataFrame) – the data to analyse
  • formula (str) – the formula for the linear mixed effects model
  • use_ml (bool) – whether to use ML instead of REML
  • groups (str) – the column containing the groups
  • result_col (str) – the column that will contain the results
  • random_effects (pandas.Series) – the random effects
  • mixedlm_p (float) – the p-value threshold for which loci will be computed with the real MixedLM analysis
  • interaction (bool) – whether there is an interaction or not
Returns:

the results from the linear mixed effects model

Return type:

list

genipe.tools.imputed_stats.get_formula(phenotype, covars, interaction, gender_c, categorical)[source]

Creates the linear/logistic regression formula (for statsmodels).

Parameters:
  • phenotype (str) – the phenotype column
  • covars (list) – the list of covariate columns
  • interaction (str) – the interaction column
Returns:

the formula for the statistical analysis

Return type:

str

Note

The phenotype column needs to be specified. The list of covariates might be empty (if no covariates are necessary). The interaction column can be set to None if there is no interaction.

Note

The gender column should be categorical (hence, the formula requires the gender to be included into C(), e.g. C(Gender)).
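
As an illustration of the two notes above, the sketch below assembles a formula string in the style accepted by the statsmodels formula interface. The build_formula helper and the name of the genotype term (geno) are assumptions; the actual function may name and order its terms differently.

```python
def build_formula(phenotype, covars, interaction=None, gender_c="Gender",
                  geno="geno"):
    """Build a regression formula string (illustrative only)."""

    def term(column):
        # The gender column is wrapped in C() so it is treated as categorical.
        return "C({})".format(column) if column == gender_c else column

    terms = [geno] + [term(covar) for covar in covars]
    if interaction is not None:
        terms.append("{}*{}".format(geno, term(interaction)))
    return "{} ~ {}".format(phenotype, " + ".join(terms))


no_covars = build_formula("Pheno", [])
with_gender = build_formula("Pheno", ["Age", "Gender"])
with_inter = build_formula("Pheno", ["Age"], interaction="Age")
```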

genipe.tools.imputed_stats.is_file_like(fn)[source]

Checks if the path is like a file (it might be a named pipe).

Parameters:fn (str) – the path to check
Returns:True if path is like a file, False otherwise.
Return type:bool
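
One possible way to implement such a check with the standard library is shown below. The looks_like_file name is hypothetical, and the actual function may rely on different tests.

```python
import os
import stat
import tempfile


def looks_like_file(fn):
    """True for a regular file or a named pipe (FIFO), False otherwise."""
    if os.path.isfile(fn):
        return True
    try:
        return stat.S_ISFIFO(os.stat(fn).st_mode)
    except OSError:
        # The path does not exist or cannot be inspected.
        return False


with tempfile.NamedTemporaryFile() as tmp:
    regular = looks_like_file(tmp.name)
directory = looks_like_file(tempfile.gettempdir())
missing = looks_like_file(os.path.join(tempfile.gettempdir(), "no_such_file_for_demo"))
```
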
genipe.tools.imputed_stats.main(args=None)[source]

The main function.

Parameters:args (argparse.Namespace) – the arguments to be parsed (if main() is called by another module)
genipe.tools.imputed_stats.parse_args(parser, args=None)[source]

Parses the command line options and arguments.

Parameters:
Returns:

the list of options and arguments

Return type:

argparse.Namespace

Note

The only check that is done here is by the parser itself. Values are verified later by the check_args() function.

genipe.tools.imputed_stats.process_impute2_site(site_info)[source]

Process an IMPUTE2 site (a line in an IMPUTE2 file).

Parameters:site_info (list) – the impute2 line (split by space)
Returns:the results of the analysis
Return type:list
genipe.tools.imputed_stats.read_phenotype(i_filename, opts, check_duplicated=True)[source]

Reads the phenotype file.

Parameters:
  • i_filename (str) – the name of the input file
  • opts (argparse.Namespace) – the options
  • check_duplicated (bool) – whether or not to check for duplicated samples
Returns:

the phenotypes

Return type:

pandas.DataFrame

This file is expected to be a tab separated file of phenotypes and covariates. The columns to use will be determined by the --sample-column and the --covar options.

For analysis including the X chromosome, the gender is automatically added as a covariate. The results are not shown to the user unless asked for.

genipe.tools.imputed_stats.read_samples(i_filename)[source]

Reads the sample file (produced by SHAPEIT).

Parameters:i_filename (str) – the name of the input file
Returns:the list of samples
Return type:pandas.DataFrame

This file contains the list of samples that are contained in the impute2 file (in the same order). The expected format is a tab separated file whose first row contains the following columns:

ID_1    ID_2    missing father  mother  sex     plink_pheno

The second row will be discarded and should contain:

0       0       0       D       D       D       B

Notes

We are mostly interested in the sample IDs corresponding to the ID_2 column. Their uniqueness is verified by pandas.
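
The expected layout can be illustrated with a dependency-free sketch. The documented function returns a pandas.DataFrame and relies on pandas to verify uniqueness. The version below uses the csv module instead, and the read_sample_ids helper and the sample values are hypothetical.

```python
import csv
import io

# A made-up sample file with two samples.
sample_text = (
    "ID_1\tID_2\tmissing\tfather\tmother\tsex\tplink_pheno\n"
    "0\t0\t0\tD\tD\tD\tB\n"
    "fam1\tsample_1\t0\t0\t0\t1\t-9\n"
    "fam2\tsample_2\t0\t0\t0\t2\t-9\n"
)


def read_sample_ids(handle):
    """Return the sample IDs (ID_2 column), checking their uniqueness."""
    reader = csv.DictReader(handle, delimiter="\t")
    next(reader)  # discard the second row of the file (column types)
    ids = [row["ID_2"] for row in reader]
    if len(ids) != len(set(ids)):
        raise ValueError("duplicated sample IDs")
    return ids


sample_ids = read_sample_ids(io.StringIO(sample_text))
```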

genipe.tools.imputed_stats.read_sites_to_extract(i_filename)[source]

Reads the list of sites to extract.

Parameters:i_filename (str) – The input filename containing the IDs of the variants to consider for the analysis.
Returns:A set containing the variants.
Return type:set

The expected file format is simply a list of variants. Every row should correspond to a single variant identifier.

3:60069:t
rs123456:A
3:60322:A

Typically, this is used to analyze only variants that passed some QC threshold. The genipe pipeline generates this file at the ‘merge_impute2’ step.
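
Reading such a file into a set takes only a few lines. The sketch below is one way to do it; the read_sites helper is hypothetical, and skipping blank lines is an assumption.

```python
import io


def read_sites(handle):
    """Return the set of variant identifiers, one per non-empty line."""
    return {line.strip() for line in handle if line.strip()}


sites = read_sites(io.StringIO("3:60069:t\nrs123456:A\n3:60322:A\n"))
```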

genipe.tools.imputed_stats.samples_with_hetero_calls(data, hetero_c)[source]

Gets male and heterozygous calls.

Parameters:
  • data (pandas.DataFrame) – the probability matrix
  • hetero_c (str) – the name of the heterozygous column
Returns:

samples where call is heterozygous

Return type:

pandas.Index

Note

If there are no data (i.e. no males), an empty list is returned.

genipe.tools.imputed_stats.skat_parse_impute2(impute2_filename, samples, markers_to_extract, phenotypes, remove_gender, out_prefix, args)[source]

Read the impute2 file and run the SKAT analysis.

Parameters:
  • impute2_filename (str) – the name of the input file
  • samples (pandas.DataFrame) – the samples
  • markers_to_extract (set) – the set of markers to analyse
  • phenotypes (pandas.DataFrame) – the phenotypes
  • remove_gender (bool) – whether or not to remove the gender column
  • out_prefix (str) – the output prefix
  • args (argparse.Namespace) – the options

This function does most of the “dispatching” to run SKAT. It writes the input files to the disk, runs the generated R scripts to do the actual analysis and then writes the results to disk.

genipe.tools.imputed_stats.skat_read_snp_set(i_filename)[source]

Reads the SKAT SNP set file.

Parameters:i_filename (str) – the name of the input file
Returns:the SNP set for the SKAT analysis
Return type:pandas.DataFrame

This file has to be supplied by the user. The recognized columns are: variant, snp_set and weight. The weight column is optional and can be used to specify a custom weighting scheme for SKAT. If nothing is specified, the default Beta weights are used.

The file has to be tab delimited.
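
For illustration, the sketch below parses a small tab delimited SNP set file with an optional weight column. The documented function returns a pandas.DataFrame; this version groups the variants in a dictionary to stay dependency-free. The read_snp_set helper, the variant names and the weights are hypothetical.

```python
import csv
import io

# A made-up SNP set file (the weight of rs3 is left empty).
snp_set_text = (
    "variant\tsnp_set\tweight\n"
    "rs1\tset1\t0.5\n"
    "rs2\tset1\t1.5\n"
    "rs3\tset2\t\n"
)


def read_snp_set(handle):
    """Group (variant, weight) pairs by SNP set."""
    reader = csv.DictReader(handle, delimiter="\t")
    missing = {"variant", "snp_set"} - set(reader.fieldnames or [])
    if missing:
        raise ValueError("missing columns: {}".format(sorted(missing)))

    snp_sets = {}
    for row in reader:
        # The weight is optional: None stands for "use the default weights".
        weight = float(row["weight"]) if row.get("weight") else None
        snp_sets.setdefault(row["snp_set"], []).append((row["variant"], weight))
    return snp_sets


snp_sets = read_snp_set(io.StringIO(snp_set_text))
```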

genipe.tools.genipe_tutorial module

genipe.tools.genipe_tutorial.check_files(*filenames)[source]

Checks that all files exist.

Parameters:filenames (list) – the list of files to check
Returns:True if all files exist, False otherwise
Return type:bool
genipe.tools.genipe_tutorial.download_file(url, path)[source]

Downloads a file from a URL to a path.

Parameters:
  • url (str) – the url to download
  • path (str) – the path where to save the file
genipe.tools.genipe_tutorial.generate_bash(path)[source]

Generates a bash script to launch the imputation pipeline.

Parameters:path (str) – the path to write the bash script
genipe.tools.genipe_tutorial.get_genotypes(path)[source]

Gets the genotype files.

Parameters:path (str) – the path where to put the genotypes
genipe.tools.genipe_tutorial.get_hg19(path)[source]

Gets the hg19 reference file.

Parameters:path (str) – the path where to put the reference
genipe.tools.genipe_tutorial.get_impute2(os_name, arch, path)[source]

Gets impute2 depending on the system, and puts it in ‘path’.

Parameters:
  • os_name (str) – the name of the OS
  • arch (str) – the architecture of the system
  • path (str) – the path where to put impute2

Note

If the binary is in the system path, it is copied to the destination path. Otherwise, we download it.
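
The copy-or-download behaviour described in the note can be sketched with the standard library. This is only an outline under stated assumptions: the fetch_binary helper and its download callback are hypothetical, and the demonstration uses a fake download so that no network access is needed.

```python
import os
import shutil
import tempfile


def fetch_binary(name, dest_dir, download):
    """Copy the binary from the system path if found, else download it."""
    dest = os.path.join(dest_dir, name)
    found = shutil.which(name)
    if found is not None:
        shutil.copy(found, dest)
    else:
        download(dest)
    return dest


def fake_download(dest):
    # Stand-in for a real download: write a placeholder file.
    with open(dest, "w") as handle:
        handle.write("placeholder\n")


with tempfile.TemporaryDirectory() as tmp_dir:
    path = fetch_binary("no_such_binary_for_demo", tmp_dir, fake_download)
    was_created = os.path.isfile(path)
```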

genipe.tools.genipe_tutorial.get_impute2_ref(path)[source]

Gets the impute2 reference files.

Parameters:path (str) – the path where to put the reference files
genipe.tools.genipe_tutorial.get_os_info()[source]

Gets the OS information.

Returns:the name of the OS (first element) and the system’s architecture (second element)
Return type:tuple

Note

The tutorial does not work on the Windows operating system. The script will quit unless the operating system is Linux or Darwin (MacOSX).
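
A hedged sketch of how this information can be gathered with the standard platform module. The os_info and is_supported helpers are hypothetical, and the actual function may report the architecture in a different form.

```python
import platform

# Systems on which the tutorial is documented to work.
SUPPORTED_SYSTEMS = ("Linux", "Darwin")


def os_info():
    """Return (OS name, architecture), e.g. ('Linux', '64bit')."""
    return platform.system(), platform.architecture()[0]


def is_supported(os_name):
    return os_name in SUPPORTED_SYSTEMS


os_name, arch = os_info()
```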

genipe.tools.genipe_tutorial.get_plink(os_name, arch, path)[source]

Gets Plink depending on the system, and puts it in ‘path’.

Parameters:
  • os_name (str) – the name of the OS
  • arch (str) – the architecture of the system
  • path (str) – the path where to put Plink

Note

If the binary is in the system path, it is copied to the destination path. Otherwise, we download it.

genipe.tools.genipe_tutorial.get_shapeit(os_name, arch, path)[source]

Gets shapeit depending on the system, and puts it in ‘path’.

Parameters:
  • os_name (str) – the name of the OS
  • arch (str) – the architecture of the system
  • path (str) – the path where to put shapeit

Note

If the binary is in the system path, it is copied to the destination path. Otherwise, we download it.

genipe.tools.genipe_tutorial.main(args=None)[source]

The main function.

Parameters:args (argparse.Namespace) – the arguments to be parsed (if main() is called by another module)
genipe.tools.genipe_tutorial.parse_args(parser, args=None)[source]

Parses the command line options and arguments.

Parameters:
Returns:

the list of options and arguments

Return type:

argparse.Namespace

Note

The only check that is done here is by the parser itself. Values are verified later by the check_args() function.

genipe.tools.genipe_tutorial.untar_file(path, fn)[source]

Extracts a tar archive.

Parameters:
  • path (str) – the path to where the file will be extracted
  • fn (str) – the name of the tar archive