tools

To analyze the deep sequencing data in pandas.

source

set_wd

 set_wd ()

source

Dseq

 Dseq (pandas_obj)

Class for operating on the dataframe created from the .bam file.


source

mask_wt

 mask_wt (df3:pandas.core.frame.DataFrame, values:str,
          columns:list=['reference', 'read'])

Mask the values for WT i.e. reference==read.

Parameters: values (str): column with values. columns (list): the columns to check the WT in. defaults to [‘reference’,‘read’].

Returns: pd.DataFrame


source

to_df

 to_df (bam_path:str, start:int, end:int, contig:str, verbose:bool=False,
        **kws_fetch)

Make a dataframe from a bam file.

Parameters: bam_path (str): path to the bam file. start (int): start position. end (int): end position. contig (str): contig name. verbose (bool): verbose.

Keyword parameters: kws_fetch: parameters provided to aligned.fetch function of pysam.

Returns: pd.DataFrame


source

get_reads_per_contig

 get_reads_per_contig (aligned, plot:bool=False)

Cumulative read depth per contigs.

Parameters: aliigned: aligned object of pysam. plot (bool): make a plot.

Returns: pd.DataFrame


source

read_bam

 read_bam (bam_path:str, verbose:bool=False, force:bool=False)

Read the aligned file

Parameters: bam_path (str): path to the .bam file.

Returns: pysam’s aligned object.

Tests

Read the bam file as a pandas dataframe

df=to_df(
    bam_path='inputs/test.bam', # path to the bam file
    contig='14',              # i.e. reference name/ target
    start=74846395,           # start position
    end=74846545,             # end position
    verbose=True,             # verbose
)
df.head(1)
CPU times: user 1e+03 ns, sys: 1 µs, total: 2 µs
Wall time: 4.77 µs
query id CIGAR read position reference position reference read
0 M04377:211:000000000-D637H:1:1101:7912:5949::u... 151M 0.0 74846395.0 C C
print(df.shape)
assert df.shape == (56424, 6)
(56424, 6)

Coverage per position (QC)

data=df.dseq.get_coverage()
data.head(1)
CPU times: user 2 µs, sys: 1 µs, total: 3 µs
Wall time: 4.53 µs
reference position reference depth
0 74846395.0 C 377
print(data.shape)
assert data.shape == (151, 3)
(151, 3)

Number of reads per mutations (QC)

data=df.dseq.get_counts_per_mutation()
data.head(1)
CPU times: user 9 µs, sys: 0 ns, total: 9 µs
Wall time: 15.7 µs
4    1
dtype: int64
print(data.shape)
assert data.shape == (5,)
(5,)

Read depth per nucleotide mutations

data=df.dseq.get_counts_per_nt_mutation()
data.head(1)
CPU times: user 2 µs, sys: 1 µs, total: 3 µs
Wall time: 4.29 µs
reference position reference read depth
0 74846395.0 C C NaN
print(data.shape)
assert data.shape == (212, 4)
(212, 4)

Read depth per codons

data=df.dseq.get_counts_per_codon_mutation()
data.head(1)
CPU times: user 2 µs, sys: 0 ns, total: 2 µs
Wall time: 4.05 µs
reference position codon reference codon read depth
0 74846395.0 CCT CCC 2.0
print(data.shape)
assert data.shape == (330, 4)
(330, 4)

Read depth per amino acids

data=df.dseq.get_counts_per_aa_mutation()
data.head(1)
CPU times: user 7 µs, sys: 1 µs, total: 8 µs
Wall time: 13.6 µs
reference position codon reference codon read depth aa reference aa read
0 74846395.0 CCT CCC 2.0 P P
print(data.shape)
assert data.shape == (330, 6)
(330, 6)

Normalize read depth of a mutations by the cumulative read depth, to calculate mutation freqency

data=df.dseq.get_frequencies_per_nt_mutation()
data.head(1)
CPU times: user 2 µs, sys: 0 ns, total: 2 µs
Wall time: 4.05 µs
reference position reference read depth mutation depth total frequency
0 74846395.0 C C NaN 377 NaN
print(data.shape)
assert data.shape == (212, 6)
(212, 6)

Extract indels

data=(df
    .query("`CIGAR`.str.contains('I') or `CIGAR`.str.contains('D')") ## keep indels
    )
data.head(1)
CPU times: user 7 µs, sys: 0 ns, total: 7 µs
Wall time: 14.8 µs
query id CIGAR read position reference position reference read
37 M04377:211:000000000-D637H:1:1101:10676:7283::... 22M1D129M 36.0 74846395.0 C C
print(data.shape)
assert data.shape == (2699, 6)
(2699, 6)