tools

To analyze the deep sequencing data in pandas.

set_wd

 set_wd ()

source

Dseq

 Dseq (pandas_obj)

Class for operating on the dataframe created from the .bam file.

source

mask_wt

 mask_wt (df3:pandas.core.frame.DataFrame, values:str,
          columns:list=['reference', 'read'])

Mask the values for WT i.e. reference==read.

Parameters: values (str): column with values. columns (list): the columns to check the WT in. defaults to [‘reference’,‘read’].

Returns: pd.DataFrame

source

to_df

 to_df (bam_path:str, start:int, end:int, contig:str, verbose:bool=False,
        **kws_fetch)

Make a dataframe from a bam file.

Parameters: bam_path (str): path to the bam file. start (int): start position. end (int): end position. contig (str): contig name. verbose (bool): verbose.

Keyword parameters: kws_fetch: parameters provided to aligned.fetch function of pysam.

Returns: pd.DataFrame

source

get_reads_per_contig

 get_reads_per_contig (aligned, plot:bool=False)

Cumulative read depth per contigs.

Parameters: aliigned: aligned object of pysam. plot (bool): make a plot.

Returns: pd.DataFrame

source

read_bam

 read_bam (bam_path:str, verbose:bool=False, force:bool=False)

Read the aligned file

Parameters: bam_path (str): path to the .bam file.

Returns: pysam’s aligned object.

Tests

Read the bam file as a pandas dataframe

df=to_df(
    bam_path='inputs/test.bam', # path to the bam file
    contig='14',              # i.e. reference name/ target
    start=74846395,           # start position
    end=74846545,             # end position
    verbose=True,             # verbose
)
df.head(1)

CPU times: user 1e+03 ns, sys: 1 µs, total: 2 µs
Wall time: 4.77 µs

	query id	CIGAR	read position	reference position	reference	read
0	M04377:211:000000000-D637H:1:1101:7912:5949::u...	151M	0.0	74846395.0	C	C

print(df.shape)
assert df.shape == (56424, 6)

(56424, 6)

Coverage per position (`QC`)

data=df.dseq.get_coverage()
data.head(1)

CPU times: user 2 µs, sys: 1 µs, total: 3 µs
Wall time: 4.53 µs

	reference position	reference	depth
0	74846395.0	C	377

print(data.shape)
assert data.shape == (151, 3)

(151, 3)

Number of reads per mutations (`QC`)

data=df.dseq.get_counts_per_mutation()
data.head(1)

CPU times: user 9 µs, sys: 0 ns, total: 9 µs
Wall time: 15.7 µs

4    1
dtype: int64

print(data.shape)
assert data.shape == (5,)

(5,)

Read depth per nucleotide mutations

data=df.dseq.get_counts_per_nt_mutation()
data.head(1)

CPU times: user 2 µs, sys: 1 µs, total: 3 µs
Wall time: 4.29 µs

	reference position	reference	read	depth
0	74846395.0	C	C	NaN

print(data.shape)
assert data.shape == (212, 4)

(212, 4)

Read depth per codons

data=df.dseq.get_counts_per_codon_mutation()
data.head(1)

CPU times: user 2 µs, sys: 0 ns, total: 2 µs
Wall time: 4.05 µs

	reference position	codon reference	codon read	depth
0	74846395.0	CCT	CCC	2.0

print(data.shape)
assert data.shape == (330, 4)

(330, 4)

Read depth per amino acids

data=df.dseq.get_counts_per_aa_mutation()
data.head(1)

CPU times: user 7 µs, sys: 1 µs, total: 8 µs
Wall time: 13.6 µs

	reference position	codon reference	codon read	depth	aa reference	aa read
0	74846395.0	CCT	CCC	2.0	P	P

print(data.shape)
assert data.shape == (330, 6)

(330, 6)

Normalize read depth of a mutations by the cumulative read depth, to calculate mutation freqency

data=df.dseq.get_frequencies_per_nt_mutation()
data.head(1)

CPU times: user 2 µs, sys: 0 ns, total: 2 µs
Wall time: 4.05 µs

	reference position	reference	read	depth mutation	depth total	frequency
0	74846395.0	C	C	NaN	377	NaN

print(data.shape)
assert data.shape == (212, 6)

(212, 6)

Extract indels

data=(df
    .query("`CIGAR`.str.contains('I') or `CIGAR`.str.contains('D')") ## keep indels
    )
data.head(1)

CPU times: user 7 µs, sys: 0 ns, total: 7 µs
Wall time: 14.8 µs

	query id	CIGAR	read position	reference position	reference	read
37	M04377:211:000000000-D637H:1:1101:10676:7283::...	22M1D129M	36.0	74846395.0	C	C

print(data.shape)
assert data.shape == (2699, 6)

(2699, 6)

set_wd

Dseq

mask_wt

to_df

get_reads_per_contig

read_bam

Tests

Read the bam file as a pandas dataframe

Coverage per position (QC)

Number of reads per mutations (QC)

Read depth per nucleotide mutations

Read depth per codons

Read depth per amino acids

Normalize read depth of a mutations by the cumulative read depth, to calculate mutation freqency

Extract indels

Coverage per position (`QC`)

Number of reads per mutations (`QC`)