genipe.tools package

Module contents

genipe.tools.impute2_extractor module

genipe.tools.impute2_extractor.check_args(args)[source]

Checks the arguments and options.

Parameters

args (argparse.Namespace) – the options to verify

Note

If there is a problem, a genipe.error.GenipeError is raised.

Note

Nothing is checked (apart from the impute2 files) if indexing is requested (--index option).

genipe.tools.impute2_extractor.extract_companion_files(i_prefix, o_prefix, to_extract)[source]

Extracts markers from companion files (if they exist).

Parameters
  • i_prefix (str) – the prefix of the input file

  • o_prefix (str) – the prefix of the output file

  • to_extract (set) – the set of markers to extract

genipe.tools.impute2_extractor.extract_markers(fn, to_extract, out_prefix, out_format, prob_t, is_long)[source]

Extracts according to names.

Parameters
  • fn (str) – the name of the input file

  • to_extract (set) – the set of markers to extract for each input file

  • out_prefix (str) – the output prefix

  • out_format (list) – the output format(s)

  • prob_t (float) – the probability threshold

  • is_long (bool) – True if format needs to be long

genipe.tools.impute2_extractor.gather_extraction(fn, maf, rate, info, extract_filename, genomic_range)[source]

Gather positions that are required.

Parameters
  • fn (str) – the impute2 filename

  • maf (float) – the minor allele frequency threshold (might be None)

  • rate (float) – the call rate threshold (might be None)

  • info (float) – the marker information value threshold (might be None)

  • extract_filename (str) – the name of the file containing marker names to extract (might be None)

  • genomic_range (str) – the genomic range for extraction

Returns

the set of markers to extract

Return type

set

If extraction by marker name is required, only those markers will be extracted. Otherwise, maf, rate, info or genomic_range can be specified (alone or together) to extract markers according to minor allele frequency, call rate, information value and genomic location.

genipe.tools.impute2_extractor.get_file_prefix(fn)[source]

Gets the filename prefix.

Parameters

fn (str) – the name of the file from which the prefix is required

Returns

the prefix of the file

Return type

str

This function removes the extension from the file name and returns its prefix (e.g. test.impute2 returns test, and ../test.impute2.gz returns ../test).
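The two examples above can be reproduced with a short sketch. The helper name file_prefix and its logic are assumptions made for illustration, not the actual genipe implementation: a trailing .gz is dropped first, then a single extension.

```python
import os


def file_prefix(fn):
    """Hypothetical helper: strip an optional '.gz', then one extension."""
    if fn.endswith(".gz"):
        fn = fn[:-len(".gz")]
    # os.path.splitext only splits on the last dot of the final path component
    prefix, _extension = os.path.splitext(fn)
    return prefix


print(file_prefix("test.impute2"))        # test
print(file_prefix("../test.impute2.gz"))  # ../test
```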

genipe.tools.impute2_extractor.get_samples(fn)[source]

Reads the sample file and extracts the information.

Parameters

fn (str) – the name of the sample file

Returns

the sample information

Return type

pandas.DataFrame

genipe.tools.impute2_extractor.index_file(fn)[source]

Indexes the impute2 file.

Parameters

fn (str) – the name of the impute2 file

This function uses the genipe.formats.index.get_index() to create the index file if it’s missing.

Note

We won’t catch the genipe.error.GenipeError exception if it’s raised, since the message will be relevant to the user.

genipe.tools.impute2_extractor.main(args=None)[source]

The main function.

Parameters

args (argparse.Namespace) – the arguments to be parsed (if main() is called by another module)

genipe.tools.impute2_extractor.parse_args(parser, args=None)[source]

Parses the command line options and arguments.

Parameters
Returns

the list of options and arguments

Return type

argparse.Namespace

Note

The only check that is done here is by the parser itself. Values are verified later by the check_args() function.
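The split between parsing and value checking can be sketched as follows. The option names (--input, --threshold) and the ToolError class are placeholders invented for this example; the real tool defines its own options and raises genipe.error.GenipeError.

```python
import argparse


class ToolError(Exception):
    """Stand-in for genipe.error.GenipeError (assumption for this sketch)."""


def parse_args(parser, args=None):
    # Hypothetical options; only the parser's own checks (types, required) run here.
    parser.add_argument("--input", required=True)
    parser.add_argument("--threshold", type=float, default=0.9)
    # With args=None, argparse reads sys.argv[1:]; a list lets another module call us.
    return parser.parse_args(args)


def check_args(args):
    # Value-level verification happens after parsing.
    if not 0 <= args.threshold <= 1:
        raise ToolError("--threshold must be between 0 and 1")


options = parse_args(argparse.ArgumentParser(), ["--input", "chr1.impute2"])
check_args(options)
print(options.input, options.threshold)  # chr1.impute2 0.9
```

Passing an explicit list keeps the functions testable without touching the process arguments.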

genipe.tools.impute2_extractor.print_data(o_files, prob_t, fid, iid, is_long, *, line=None, row=None)[source]

Prints an impute2 line.

Parameters
  • o_files (dict) – the output files

  • prob_t (float) – the probability threshold

  • fid (list) – the list of family IDs

  • iid (list) – the list of sample IDs

  • is_long (bool) – True if the format is long (dosage, calls)

  • line (str) – the impute2 line

  • row (list) – the impute2 line, split by spaces
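As a rough illustration of what a split impute2 line contains, the sketch below assumes the usual IMPUTE2 genotype layout (five marker columns followed by three probabilities per sample). The function name, the choice of counting the second allele for the dosage, and the treatment of calls below the threshold as missing are assumptions of this example, not the exact behaviour of print_data.

```python
def parse_impute2_row(row, prob_t=0.9):
    """Return the marker name plus one dosage and one hard call per sample."""
    chrom, name, pos, a1, a2 = row[:5]
    probs = [float(p) for p in row[5:]]
    if len(probs) % 3 != 0:
        raise ValueError("expected three probabilities per sample")

    genotypes = (f"{a1} {a1}", f"{a1} {a2}", f"{a2} {a2}")
    dosages, calls = [], []
    for i in range(0, len(probs), 3):
        triple = probs[i:i + 3]
        best = max(triple)
        if best < prob_t:
            # Below the probability threshold: treated as missing here
            dosages.append(None)
            calls.append(None)
        else:
            # Expected number of copies of the second allele (assumption)
            dosages.append(triple[1] + 2 * triple[2])
            calls.append(genotypes[triple.index(best)])
    return name, dosages, calls


line = "1 rs1 1000 A G 0.95 0.05 0 0.1 0.85 0.05 0 0.02 0.98"
name, dosages, calls = parse_impute2_row(line.split(" "))
print(name, calls)  # rs1 ['A A', None, 'G G']
```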

genipe.tools.impute2_merger module

genipe.tools.impute2_merger.check_args(args)[source]

Checks the arguments and options.

Parameters

args (argparse.Namespace) – the options to verify

Note

If there is a problem, a genipe.error.GenipeError is raised.

genipe.tools.impute2_merger.concatenate_files(i_filenames, out_prefix, real_chrom, options)[source]

Concatenates and extracts information from IMPUTE2 GEN file(s).

Parameters
  • i_filenames (list) – the list of input filenames (to concatenate)

  • out_prefix (str) – the output prefix for the output files

  • real_chrom (str) – the chromosome contained in all the input files

  • options (argparse.Namespace) – the options

This function will create the following eight files:

  • .impute2 – Imputation results (merged from all the input files).

  • .alleles – Description of the reference and alternative alleles at each site.

  • .imputed_sites – List of imputed sites (excluding sites that were previously genotyped in the study cohort).

  • .impute2_info – SNP-wise information file with one line per SNP and a single header line at the beginning.

  • .completion_rates – Number of missing values and completion rate for all sites (using the probability threshold set by the user, where the default is greater than or equal to 0.9).

  • .good_sites – List of sites which pass the completion rate threshold (set by the user, where the default is greater than or equal to 0.98) using the probability threshold (set by the user, where the default is greater than or equal to 0.9).

  • .map – A map file describing the genomic location of all sites.

  • .maf – File containing the minor allele frequency (along with minor allele identification) for all sites using the probability threshold of 0.9. When no genotypes are available (because they are all below the threshold), the MAF is NA.
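The completion rate and the “good site” decision described for the .completion_rates and .good_sites files can be sketched as below. The function names are hypothetical, and the input is assumed to be the highest genotype probability of each sample at one site; the thresholds mirror the documented defaults (0.9 and 0.98).

```python
def completion_stats(best_probs, prob_t=0.9):
    """Return (number of missing calls, completion rate) for one site."""
    if not best_probs:
        raise ValueError("at least one sample is required")
    # A call counts as missing when its best probability is below the threshold
    n_missing = sum(1 for p in best_probs if p < prob_t)
    return n_missing, 1 - n_missing / len(best_probs)


def is_good_site(rate, rate_t=0.98):
    """A site passes when its completion rate reaches the threshold."""
    return rate >= rate_t


n_missing, rate = completion_stats([0.95, 0.85, 0.98, 0.99])
print(n_missing, rate)     # 1 0.75
print(is_good_site(rate))  # False
```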

genipe.tools.impute2_merger.main(args=None)[source]

The main function.

Parameters

args (argparse.Namespace) – the arguments to be parsed (if main() is called by another module)

genipe.tools.impute2_merger.parse_args(parser, args=None)[source]

Parses the command line options and arguments.

Parameters
Returns

the list of options and arguments

Return type

argparse.Namespace

Note

The only check that is done here is by the parser itself. Values are verified later by the check_args() function.

genipe.tools.imputed_stats module

genipe.tools.imputed_stats.check_args(args)[source]

Checks the arguments and options.

Parameters

args (argparse.Namespace) – the options to verify

Note

If there is a problem, a genipe.error.GenipeError is raised.

genipe.tools.imputed_stats.compute_statistics(impute2_filename, samples, markers_to_extract, phenotypes, remove_gender, out_prefix, options)[source]

Parses IMPUTE2 file while computing statistics.

Parameters
  • impute2_filename (str) – the name of the input file

  • samples (pandas.DataFrame) – the list of samples

  • markers_to_extract (set) – the set of markers to extract

  • phenotypes (pandas.DataFrame) – the phenotypes

  • remove_gender (bool) – whether or not to remove the gender column

  • out_prefix (str) – the output prefix

  • options (argparse.Namespace) – the options

This function takes care of parallelism. It reads the Impute2 file and fills a queue that will trigger the analysis when full.

If the number of processes to launch is 1, the rows are analyzed as they come.

genipe.tools.imputed_stats.fit_cox(data, time_to_event, event, formula, result_col, **kwargs)[source]

Fit a Cox proportional hazards model to the data.

Parameters
  • data (pandas.DataFrame) – the data to analyse

  • time_to_event (str) – the time to event column for the survival analysis

  • event (str) – the event column for the survival analysis

  • formula (str) – the formula for the data preparation

  • result_col (str) – the column that will contain the results

Returns

the results from the survival analysis

Return type

numpy.array

Note

Using alpha of 0.95, and default parameters.

genipe.tools.imputed_stats.fit_linear(data, formula, result_col, **kwargs)[source]

Fit a linear regression to the data.

Parameters
  • data (pandas.DataFrame) – the data to analyse

  • formula (str) – the formula for the linear regression

  • result_col (str) – the column that will contain the results

Returns

the results from the linear regression

Return type

list

genipe.tools.imputed_stats.fit_logistic(data, formula, result_col, **kwargs)[source]

Fit a logistic regression to the data.

Parameters
  • data (pandas.DataFrame) – the data to analyse

  • formula (str) – the formula for the logistic regression

  • result_col (str) – the column that will contain the results

Returns

the results from the logistic regression

Return type

list

genipe.tools.imputed_stats.fit_mixedlm(data, formula, use_ml, groups, result_col, random_effects, mixedlm_p, interaction, **kwargs)[source]

Fit a linear mixed effects model to the data.

Parameters
  • data (pandas.DataFrame) – the data to analyse

  • formula (str) – the formula for the linear mixed effects model

  • use_ml (bool) – whether to use ML instead of REML

  • groups (str) – the column containing the groups

  • result_col (str) – the column that will contain the results

  • random_effects (pandas.Series) – the random effects

  • mixedlm_p (float) – the p-value threshold for which loci will be computed with the real MixedLM analysis

  • interaction (bool) – Whether there is an interaction or not

Returns

the results from the linear mixed effects model

Return type

list

genipe.tools.imputed_stats.get_formula(phenotype, covars, interaction, gender_c, categorical)[source]

Creates the linear/logistic regression formula (for statsmodels).

Parameters
  • phenotype (str) – the phenotype column

  • covars (list) – the list of covariate columns

  • interaction (str) – the interaction column

Returns

the formula for the statistical analysis

Return type

str

Note

The phenotype column needs to be specified. The list of covariates might be empty (if no covariates are necessary). The interaction column can be set to None if there is no interaction.

Note

The gender column should be categorical (hence, the formula requires the gender to be included into C(), e.g. C(Gender)).
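A formula of this kind could be assembled as in the sketch below. This is not the genipe implementation: the helper name, the default values and the genotype term name geno are placeholders chosen for illustration. Only the C() wrapping of the gender column follows the note above.

```python
def build_formula(phenotype, covars, interaction=None, gender_c="Gender",
                  categorical=()):
    """Build a patsy-style formula string (illustrative sketch)."""
    def term(column):
        # Gender and any other categorical column are wrapped in C()
        if column == gender_c or column in categorical:
            return f"C({column})"
        return column

    # "geno" is a hypothetical name for the genotype (dosage) column
    terms = ["geno"] + [term(c) for c in covars]
    if interaction is not None:
        terms.append(f"geno*{term(interaction)}")
    return f"{phenotype} ~ " + " + ".join(terms)


print(build_formula("Pheno", ["Age", "Gender"]))
# Pheno ~ geno + Age + C(Gender)
```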

genipe.tools.imputed_stats.is_file_like(fn)[source]

Checks if the path is like a file (it might be a named pipe).

Parameters

fn (str) – the path to check

Returns

True if path is like a file, False otherwise.

Return type

bool
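One way to accept both regular files and named pipes is to inspect the file mode, as in this sketch. The name looks_like_file and the exact test are assumptions for illustration; the actual genipe check may differ.

```python
import os
import stat
import tempfile


def looks_like_file(fn):
    """True for a regular file or a named pipe (FIFO), False otherwise."""
    try:
        mode = os.stat(fn).st_mode
    except OSError:
        # Missing path (or no permission to stat it)
        return False
    return stat.S_ISREG(mode) or stat.S_ISFIFO(mode)


with tempfile.TemporaryDirectory() as tmp_dir:
    regular = os.path.join(tmp_dir, "data.txt")
    with open(regular, "w") as handle:
        handle.write("content\n")

    print(looks_like_file(regular))   # True
    print(looks_like_file(tmp_dir))   # False (a directory)
```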

genipe.tools.imputed_stats.main(args=None)[source]

The main function.

Parameters

args (argparse.Namespace) – the arguments to be parsed (if main() is called by another module)

genipe.tools.imputed_stats.parse_args(parser, args=None)[source]

Parses the command line options and arguments.

Parameters
Returns

the list of options and arguments

Return type

argparse.Namespace

Note

The only check that is done here is by the parser itself. Values are verified later by the check_args() function.

genipe.tools.imputed_stats.process_impute2_site(site_info)[source]

Process an IMPUTE2 site (a line in an IMPUTE2 file).

Parameters

site_info (list) – the impute2 line (split by space)

Returns

the results of the analysis

Return type

list

genipe.tools.imputed_stats.read_phenotype(i_filename, opts, check_duplicated=True)[source]

Reads the phenotype file.

Parameters
  • i_filename (str) – the name of the input file

  • opts (argparse.Namespace) – the options

  • check_duplicated (bool) – whether or not to check for duplicated samples

Returns

the phenotypes

Return type

pandas.DataFrame

This file is expected to be a tab separated file of phenotypes and covariates. The columns to use will be determined by the --sample-column and the --covar options.

For analysis including the X chromosome, the gender is automatically added as a covariate. The results are not shown to the user unless asked for.

genipe.tools.imputed_stats.read_samples(i_filename)[source]

Reads the sample file (produced by SHAPEIT).

Parameters

i_filename (str) – the name of the input file

Returns

the list of samples

Return type

pandas.DataFrame

This file contains the list of samples that are contained in the impute2 file (with same order). The expected format for this file is a tab separated file with a first row containing the following columns:

ID_1    ID_2    missing father  mother  sex     plink_pheno

The subsequent row will be discarded and should contain:

0       0       0       D       D       D       B

Notes

We are mostly interested in the sample IDs corresponding to the ID_2 column. Their uniqueness is verified by pandas.
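A file in this layout can be loaded with pandas roughly as follows. This is a sketch under the assumptions stated in the comments (the helper name and the in-memory sample data are invented for the example); it skips the second row and lets pandas verify that the ID_2 values are unique.

```python
import io

import pandas as pd


def read_sample_table(handle):
    """Read a tab separated sample file, discarding the row of column types."""
    # skiprows=[1] drops the second line (0 0 0 D D D B), keeping the header
    samples = pd.read_csv(handle, sep="\t", skiprows=[1])
    # verify_integrity=True raises ValueError on duplicated sample IDs
    return samples.set_index("ID_2", verify_integrity=True)


# Invented example content (two samples)
content = (
    "ID_1\tID_2\tmissing\tfather\tmother\tsex\tplink_pheno\n"
    "0\t0\t0\tD\tD\tD\tB\n"
    "f1\ts1\t0\t0\t0\t1\t-9\n"
    "f2\ts2\t0\t0\t0\t2\t-9\n"
)
samples = read_sample_table(io.StringIO(content))
print(list(samples.index))  # ['s1', 's2']
```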

genipe.tools.imputed_stats.read_sites_to_extract(i_filename)[source]

Reads the list of sites to extract.

Parameters

i_filename (str) – The input filename containing the IDs of the variants to consider for the analysis.

Returns

A set containing the variants.

Return type

set

The expected file format is simply a list of variants. Every row should correspond to a single variant identifier.

3:60069:t
rs123456:A
3:60322:A

Typically, this is used to analyze only variants that passed some QC threshold. The genipe pipeline generates this file at the ‘merge_impute2’ step.
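Reading such a file into a set takes only a few lines. The sketch below uses a hypothetical helper name and an in-memory file holding the three identifiers shown above; skipping blank lines is an assumption of this example.

```python
import io


def read_variant_set(handle):
    """Return the set of variant identifiers, one per non-empty line."""
    return {line.strip() for line in handle if line.strip()}


handle = io.StringIO("3:60069:t\nrs123456:A\n3:60322:A\n")
variants = read_variant_set(handle)
print(sorted(variants))  # ['3:60069:t', '3:60322:A', 'rs123456:A']
```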

genipe.tools.imputed_stats.samples_with_hetero_calls(data, hetero_c)[source]

Gets male and heterozygous calls.

Parameters
  • data (pandas.DataFrame) – the probability matrix

  • hetero_c (str) – the name of the heterozygous column

Returns

samples where call is heterozygous

Return type

pandas.Index

Note

If there are no data (i.e. no males), an empty list is returned.

genipe.tools.imputed_stats.skat_parse_impute2(impute2_filename, samples, markers_to_extract, phenotypes, remove_gender, out_prefix, args)[source]

Read the impute2 file and run the SKAT analysis.

Parameters
  • impute2_filename (str) – the name of the input file

  • samples (pandas.DataFrame) – the samples

  • markers_to_extract (set) – the set of markers to analyse

  • phenotypes (pandas.DataFrame) – the phenotypes

  • remove_gender (bool) – whether or not to remove the gender column

  • out_prefix (str) – the output prefix

  • args (argparse.Namespace) – the options

This function does most of the “dispatching” to run SKAT. It writes the input files to the disk, runs the generated R scripts to do the actual analysis and then writes the results to disk.

genipe.tools.imputed_stats.skat_read_snp_set(i_filename)[source]

Reads the SKAT SNP set file.

Parameters

i_filename (str) – the name of the input file

Returns

the SNP set for the SKAT analysis

Return type

pandas.DataFrame

This file has to be supplied by the user. The recognized columns are: variant, snp_set and weight. The weight column is optional and can be used to specify a custom weighting scheme for SKAT. If nothing is specified, the default Beta weights are used.

The file has to be tab delimited.
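The expected layout can be illustrated with the standard csv module. This sketch is not the genipe reader (which returns a pandas.DataFrame): the helper name is hypothetical, the example rows are invented, and treating variant and snp_set as mandatory is an assumption drawn from the description above.

```python
import csv
import io


def read_snp_set(handle):
    """Read a tab delimited SNP set file into a list of dictionaries."""
    reader = csv.DictReader(handle, delimiter="\t")
    missing = {"variant", "snp_set"} - set(reader.fieldnames or [])
    if missing:
        raise ValueError("missing column(s): " + ", ".join(sorted(missing)))

    rows = []
    for row in reader:
        weight = row.get("weight")  # optional column
        rows.append({
            "variant": row["variant"],
            "snp_set": row["snp_set"],
            "weight": float(weight) if weight else None,
        })
    return rows


content = "variant\tsnp_set\tweight\nrs123456:A\tset1\t0.5\n3:60322:A\tset1\t1.5\n"
snp_set = read_snp_set(io.StringIO(content))
print(snp_set[0])  # {'variant': 'rs123456:A', 'snp_set': 'set1', 'weight': 0.5}
```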

genipe.tools.genipe_tutorial module

genipe.tools.genipe_tutorial.check_files(*filenames)[source]

Checks that all files exist.

Parameters

filenames (list) – the list of files to check

Returns

True if all files exist, False otherwise

Return type

bool
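A check of this kind is essentially a one-liner. The sketch below uses a hypothetical name and is only meant to show the variadic call style implied by the signature.

```python
import os
import tempfile


def all_files_exist(*filenames):
    """True only if every given path is an existing file."""
    return all(os.path.isfile(fn) for fn in filenames)


with tempfile.TemporaryDirectory() as tmp_dir:
    present = os.path.join(tmp_dir, "present.txt")
    open(present, "w").close()
    absent = os.path.join(tmp_dir, "absent.txt")

    print(all_files_exist(present))          # True
    print(all_files_exist(present, absent))  # False
```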

genipe.tools.genipe_tutorial.download_file(url, path)[source]

Downloads a file from a URL to a path.

Parameters
  • url (str) – the url to download

  • path (str) – the path where to save the file

genipe.tools.genipe_tutorial.generate_bash(path)[source]

Generates a bash script to launch the imputation pipeline.

Parameters

path (str) – the path to write the bash script

genipe.tools.genipe_tutorial.get_genotypes(path)[source]

Gets the genotypes files.

Parameters

path (str) – the path where to put the genotypes

genipe.tools.genipe_tutorial.get_hg19(path)[source]

Gets the hg19 reference file.

Parameters

path (str) – the path where to put the reference

genipe.tools.genipe_tutorial.get_impute2(os_name, arch, path)[source]

Gets impute2 depending on the system, and puts it in ‘path’.

Parameters
  • os_name (str) – the name of the OS

  • arch (str) – the architecture of the system

  • path (str) – the path where to put impute2

Note

If the binary is in the system path, it is copied to the destination path. Otherwise, we download it.
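The “copy if installed, otherwise download” logic in the note can be sketched with shutil.which. The function name and the download callback are assumptions for this example; no real download is performed.

```python
import os
import shutil
import tempfile


def fetch_binary(name, dest_dir, download):
    """Copy 'name' from the system path if found, else call 'download'."""
    found = shutil.which(name)
    if found is not None:
        # shutil.copy returns the path of the newly created file
        return shutil.copy(found, dest_dir)
    return download(name, dest_dir)


def fake_download(name, dest_dir):
    """Placeholder downloader: creates an empty file instead of fetching one."""
    target = os.path.join(dest_dir, name)
    open(target, "w").close()
    return target


with tempfile.TemporaryDirectory() as tmp_dir:
    # This name is very unlikely to be on the system path, so the fallback runs
    result = fetch_binary("no-such-binary-for-this-sketch", tmp_dir, fake_download)
    print(os.path.basename(result))  # no-such-binary-for-this-sketch
```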

genipe.tools.genipe_tutorial.get_impute2_ref(path)[source]

Gets the impute2’s reference files.

Parameters

path (str) – the path where to put the reference files

genipe.tools.genipe_tutorial.get_os_info()[source]

Gets the OS information.

Returns

first element is the name of the OS, and the second is the system’s architecture

Return type

tuple

Note

The tutorial does not work on the Windows operating system. The script will quit unless the operating system is Linux or Darwin (MacOSX).
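The OS name and architecture can be obtained from the standard platform module, as in this sketch. The helper names and the shape of the architecture string are assumptions; the tutorial script may derive them differently.

```python
import platform

SUPPORTED_SYSTEMS = ("Linux", "Darwin")


def os_info():
    """Return (OS name, architecture), e.g. ('Linux', '64bit')."""
    return platform.system(), platform.architecture()[0]


def is_supported(os_name):
    """Only Linux and Darwin (MacOSX) are supported by the tutorial."""
    return os_name in SUPPORTED_SYSTEMS


os_name, arch = os_info()
print(os_name, arch, is_supported(os_name))
```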

genipe.tools.genipe_tutorial.get_plink(os_name, arch, path)[source]

Gets Plink depending on the system, and puts it in ‘path’.

Parameters
  • os_name (str) – the name of the OS

  • arch (str) – the architecture of the system

  • path (str) – the path where to put Plink

Note

If the binary is in the system path, it is copied to the destination path. Otherwise, we download it.

genipe.tools.genipe_tutorial.get_shapeit(os_name, arch, path)[source]

Gets shapeit depending on the system, and puts it in ‘path’.

Parameters
  • os_name (str) – the name of the OS

  • arch (str) – the architecture of the system

  • path (str) – the path where to put shapeit

Note

If the binary is in the system path, it is copied to the destination path. Otherwise, we download it.

genipe.tools.genipe_tutorial.main(args=None)[source]

The main function.

Parameters

args (argparse.Namespace) – the arguments to be parsed (if main() is called by another module)

genipe.tools.genipe_tutorial.parse_args(parser, args=None)[source]

Parses the command line options and arguments.

Parameters
Returns

the list of options and arguments

Return type

argparse.Namespace

Note

The only check that is done here is by the parser itself. Values are verified later by the check_args() function.

genipe.tools.genipe_tutorial.untar_file(path, fn)[source]

Extracts a tar archive.

Parameters
  • path (str) – the path to where the file will be extracted

  • fn (str) – the name of the tar archive