genipe.formats package

Module contents

Submodules

genipe.formats.impute2 module

genipe.formats.impute2.additive_from_probs(a1, a2, probs)

Computes the additive format from a probability matrix.

Parameters
  • a1 (str) – the a1 allele

  • a2 (str) – the a2 allele

  • probs (numpy.array) – the probability matrix

Returns

the additive format computed from the probabilities, the minor allele, and the major allele.

Return type

tuple

The encoding is as follows: 0 when homozygous for the major allele, 1 when heterozygous, and 2 when homozygous for the minor allele.

The minor and major alleles are inferred from the MAF. By default, a2 is assumed to be the minor allele; the alleles are flipped if required.
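The encoding and the flipping rule described above can be sketched as follows. This is an illustration under stated assumptions, not genipe's own code: the hypothetical additive_sketch assumes the columns are ordered AA, AB, BB (with a1 = A and a2 = B), and that the allele frequency is estimated from each sample's most probable genotype.

```python
import numpy as np


def additive_sketch(a1, a2, probs):
    """Illustrative sketch only (hypothetical helper, not part of genipe)."""
    # Most probable genotype per sample: 0 = a1/a1, 1 = a1/a2, 2 = a2/a2.
    # With this ordering, the value is also the number of copies of a2.
    calls = np.argmax(probs, axis=1)
    minor, major = a2, a1

    # Frequency of a2 among the 2 * n alleles. Above 0.5, a1 is the minor
    # allele, so the encoding and the allele labels are flipped.
    if calls.sum() / (2 * len(calls)) > 0.5:
        calls = 2 - calls
        minor, major = a1, a2

    return calls, minor, major


probs = np.array([
    [0.9, 0.1, 0.0],  # most likely a1/a1
    [0.1, 0.8, 0.1],  # most likely a1/a2
    [0.0, 0.1, 0.9],  # most likely a2/a2
    [0.0, 0.0, 1.0],  # most likely a2/a2
])
calls, minor, major = additive_sketch("A", "G", probs)
```

In this made-up example a2 ("G") has a frequency of 5/8, so the sketch flips the encoding and reports "A" as the minor allele.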

genipe.formats.impute2.dosage_from_probs(homo_probs, hetero_probs, scale=2)

Computes dosage from probability matrix (for the minor allele).

Parameters
  • homo_probs (numpy.array) – the probabilities for the homozygous genotype

  • hetero_probs (numpy.array) – the probabilities for the heterozygous genotype

  • scale (int) – the scale value

Returns

the dosage computed from the probabilities

Return type

numpy.array
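One plausible reading of this computation, shown as a minimal sketch: the dosage is the expected proportion of minor alleles per sample, multiplied by the scale. The formula and the name dosage_sketch are assumptions made for illustration; the documentation above does not spell out the exact arithmetic.

```python
import numpy as np


def dosage_sketch(homo_probs, hetero_probs, scale=2):
    """Illustrative sketch only (hypothetical helper, not part of genipe)."""
    # A homozygous sample carries the minor allele on both chromosomes, a
    # heterozygous sample on one of two. The sum is the expected proportion
    # of minor alleles (between 0 and 1), rescaled to lie in [0, scale].
    return (homo_probs + hetero_probs / 2) * scale


homo = np.array([0.0, 0.1, 0.9])    # P(homozygous for the minor allele)
hetero = np.array([0.1, 0.8, 0.1])  # P(heterozygous)
dosage = dosage_sketch(homo, hetero)
```

With scale=2, the result is the expected number of minor alleles per sample.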

genipe.formats.impute2.get_good_probs(prob_matrix, min_prob=0.9)

Gathers good imputed genotypes (>= probability threshold).

Parameters
  • prob_matrix (numpy.array) – the probability matrix

  • min_prob (float) – the probability threshold

Returns

a mask array marking the positions where the probabilities are greater than or equal to the threshold

Return type

numpy.array
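A minimal sketch of such a mask, assuming the threshold is compared against the highest of the three genotype probabilities of each sample (the name good_probs_sketch is hypothetical):

```python
import numpy as np


def good_probs_sketch(prob_matrix, min_prob=0.9):
    """Illustrative sketch only (hypothetical helper, not part of genipe)."""
    # One boolean per sample: True when the most probable genotype
    # reaches the threshold.
    return np.amax(prob_matrix, axis=1) >= min_prob


probs = np.array([
    [0.95, 0.05, 0.0],  # confident call
    [0.5, 0.4, 0.1],    # ambiguous call
    [0.0, 0.1, 0.9],    # exactly at the threshold
])
mask = good_probs_sketch(probs)
```

The mask can then be used to index the rows of the probability matrix, for instance probs[mask].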

genipe.formats.impute2.hard_calls_from_probs(a1, a2, probs)

Computes hard calls from probability matrix.

Parameters
  • a1 (str) – the first allele

  • a2 (str) – the second allele

  • probs (numpy.array) – the probability matrix

Returns

the hard calls computed from the probabilities

Return type

numpy.array
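Hard calls can be pictured as picking the most probable genotype for each sample. The sketch below assumes the AA, AB, BB column order and a space-separated genotype string; both the output format and the name hard_calls_sketch are assumptions made for illustration.

```python
import numpy as np


def hard_calls_sketch(a1, a2, probs):
    """Illustrative sketch only (hypothetical helper, not part of genipe)."""
    # The three possible genotypes, in the same order as the columns.
    genotypes = np.array([f"{a1} {a1}", f"{a1} {a2}", f"{a2} {a2}"])

    # Select, for each sample, the genotype with the highest probability.
    return genotypes[np.argmax(probs, axis=1)]


probs = np.array([
    [0.9, 0.1, 0.0],
    [0.1, 0.8, 0.1],
    [0.0, 0.1, 0.9],
])
hard_calls = hard_calls_sketch("A", "G", probs)
```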

genipe.formats.impute2.maf_dosage_from_probs(prob_matrix, a1, a2, scale=2, gender=None, site_name=None)

Computes MAF and dosage vector from probs matrix.

Parameters
  • prob_matrix (numpy.array) – the probability matrix

  • a1 (str) – the first allele

  • a2 (str) – the second allele

  • scale (int) – the scale value

  • gender (numpy.array) – the gender of the samples

  • site_name (str) – the name for this site

Returns

a tuple containing four values: the dosage vector, the minor allele frequency, the minor allele, and the major allele.

Return type

tuple

When ‘gender’ is not None, the MAF is assumed to be required for chromosome X (hence, males count as 1 allele and females as 2). An exception is also raised if any male is heterozygous.
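For the autosomal case (gender left as None), the four returned values might be derived as in the sketch below. It assumes the AA, AB, BB column order, a MAF estimated from the most probable genotypes, and the dosage formula (homozygous + heterozygous / 2) * scale. None of these details are stated in this documentation, and maf_dosage_sketch is a hypothetical name.

```python
import numpy as np


def maf_dosage_sketch(prob_matrix, a1, a2, scale=2):
    """Illustrative autosomal-only sketch (hypothetical, not genipe's code)."""
    # Most probable genotype per sample, i.e. the number of copies of a2.
    calls = np.argmax(prob_matrix, axis=1)

    # Frequency of a2, assumed to be the minor allele until proven otherwise.
    maf = calls.sum() / (2 * len(calls))
    minor, major = a2, a1
    homo_col = 2  # column holding P(a2/a2)

    # If a2 is actually the most frequent allele, flip everything.
    if maf > 0.5:
        maf = 1 - maf
        minor, major = a1, a2
        homo_col = 0  # column holding P(a1/a1)

    # Dosage of the minor allele, rescaled to lie in [0, scale].
    dosage = (prob_matrix[:, homo_col] + prob_matrix[:, 1] / 2) * scale
    return dosage, maf, minor, major


probs = np.array([
    [0.9, 0.1, 0.0],
    [0.8, 0.2, 0.0],
    [0.1, 0.8, 0.1],
    [0.9, 0.1, 0.0],
])
dosage, maf, minor, major = maf_dosage_sketch(probs, "A", "G")
```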

genipe.formats.impute2.maf_from_probs(prob_matrix, a1, a2, gender=None, site_name=None)

Computes MAF from a probability matrix (and gender if chromosome X).

Parameters
  • prob_matrix (numpy.array) – the probability matrix

  • a1 (str) – the first allele

  • a2 (str) – the second allele

  • gender (numpy.array) – the gender of the samples

  • site_name (str) – the name for this site

Returns

a tuple containing three values: the minor allele frequency, the minor allele, and the major allele.

Return type

tuple

When ‘gender’ is not None, the MAF is assumed to be required for chromosome X (hence, males count as 1 allele and females as 2). An exception is also raised if any male is heterozygous.
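The chromosome X rule can be illustrated with the sketch below. The gender coding (1 for males, 2 for females), the use of ValueError, and the name maf_chrx_sketch are all assumptions made for the example; the documentation above only states that males count as one allele, females as two, and that heterozygous males raise an exception.

```python
import numpy as np

MALE, FEMALE = 1, 2  # hypothetical coding, chosen for this example


def maf_chrx_sketch(prob_matrix, a1, a2, gender):
    """Illustrative sketch only (hypothetical helper, not part of genipe)."""
    # Most probable genotype per sample: 0 = a1/a1, 1 = a1/a2, 2 = a2/a2.
    calls = np.argmax(prob_matrix, axis=1)
    males = gender == MALE
    females = gender == FEMALE

    # A male has a single X chromosome, so he cannot be heterozygous.
    if np.any(calls[males] == 1):
        raise ValueError("heterozygous male on chromosome X")

    # Females contribute 0, 1 or 2 copies of a2; a male called a2/a2
    # contributes a single copy. Samples of unknown gender are ignored.
    a2_count = calls[females].sum() + (calls[males] == 2).sum()
    nb_alleles = 2 * females.sum() + males.sum()

    maf = a2_count / nb_alleles
    minor, major = a2, a1
    if maf > 0.5:
        maf, minor, major = 1 - maf, a1, a2
    return maf, minor, major


probs = np.array([
    [1.0, 0.0, 0.0],  # male carrying a1
    [0.0, 0.0, 1.0],  # male carrying a2
    [0.1, 0.8, 0.1],  # heterozygous female
    [0.9, 0.1, 0.0],  # female homozygous for a1
])
gender = np.array([MALE, MALE, FEMALE, FEMALE])
maf, minor, major = maf_chrx_sketch(probs, "A", "G", gender)
```

Here two males and two females carry six alleles in total, two of which are a2, giving a frequency of one third.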

genipe.formats.impute2.matrix_from_line(impute2_line)

Generates the probability matrix from an IMPUTE2 line.

Parameters

impute2_line (list) – a single line from IMPUTE2’s result (split by space)

Returns

a tuple containing the marker’s information (first five values of the line) and the probability matrix (numpy array, float)

Return type

tuple

The shape of the matrix is n x 3 where n is the number of samples. The columns represent the probability for AA, AB and BB.

Note

The impute2_line variable is a list of str, corresponding to a line from IMPUTE2’s results, split by space.
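The layout described above (five leading values, then three probabilities per sample) can be parsed as in this minimal sketch. The name matrix_from_line_sketch and the marker values in the example line are made up for illustration.

```python
import numpy as np


def matrix_from_line_sketch(impute2_line):
    """Illustrative sketch only (hypothetical helper, not part of genipe)."""
    # The first five values describe the marker.
    marker_info = impute2_line[:5]

    # The remaining values are read three at a time (AA, AB, BB), giving
    # one row per sample.
    probs = np.array(impute2_line[5:], dtype=float).reshape(-1, 3)
    return marker_info, probs


# A made-up line for two samples, already split by space.
line = "1 rs123 1000 A G 1 0 0 0.1 0.8 0.1".split(" ")
marker_info, probs = matrix_from_line_sketch(line)
```

The resulting matrix has shape (2, 3), one row for each of the two samples.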

genipe.formats.index module

genipe.formats.index.get_index(fn, cols, names, sep)

Restores the index for a given file.

Parameters
  • fn (str) – the name of the file

  • cols (list) – a list of the columns to keep (as int)

  • names (list) – the names corresponding to the columns to keep (as str)

  • sep (str) – the field separator

Returns

the index

Return type

pandas.DataFrame

If the index doesn’t exist for the file, it is first created.

genipe.formats.index.get_open_func(fn, return_fmt=False)

Get the opening function.

Parameters
  • fn (str) – the name of the file

  • return_fmt (bool) – if the file format needs to be returned

Returns

either a tuple containing two elements (a boolean telling whether the format is bgzip, and the opening function) or, when the format is not requested, the opening function alone.

Return type

tuple
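The general idea of choosing an opening function from the file's content can be sketched with the standard library alone. This sketch swaps in plain gzip detection (via the two-byte gzip magic number) for bgzip handling, so it does not tell a bgzip file apart from an ordinary gzip file. It also assumes that only the opening function is returned when return_fmt is false. The name open_func_sketch is hypothetical.

```python
import gzip
import os
import tempfile

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of any gzip stream


def open_func_sketch(fn, return_fmt=False):
    """Illustrative sketch only (hypothetical helper, not part of genipe)."""
    # Peek at the first bytes to decide how the file must be opened.
    with open(fn, "rb") as f:
        compressed = f.read(2) == GZIP_MAGIC

    open_func = gzip.open if compressed else open

    # Assumption: the format flag is only returned when asked for.
    if return_fmt:
        return compressed, open_func
    return open_func


with tempfile.TemporaryDirectory() as tmp_dir:
    plain_fn = os.path.join(tmp_dir, "plain.txt")
    packed_fn = os.path.join(tmp_dir, "packed.txt.gz")

    with open(plain_fn, "w") as f:
        f.write("hello\n")
    with gzip.open(packed_fn, "wt") as f:
        f.write("hello\n")

    plain_result = open_func_sketch(plain_fn, return_fmt=True)
    packed_result = open_func_sketch(packed_fn, return_fmt=True)
    only_func = open_func_sketch(plain_fn)
```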