source
Dseq
Dseq (pandas_obj)
Class for operating on the dataframe created from the .bam file.
source
mask_wt
mask_wt (df3:pandas.core.frame.DataFrame, values:str,
columns:list=['reference', 'read'])
Mask the values for WT i.e. reference==read.
Parameters: values (str): column with values. columns (list): the columns to check the WT in. defaults to [‘reference’,‘read’].
Returns: pd.DataFrame
source
to_df
to_df (bam_path:str, start:int, end:int, contig:str, verbose:bool=False,
**kws_fetch)
Make a dataframe from a bam file.
Parameters: bam_path (str): path to the bam file. start (int): start position. end (int): end position. contig (str): contig name. verbose (bool): verbose.
Keyword parameters: kws_fetch: parameters provided to aligned.fetch
function of pysam.
Returns: pd.DataFrame
source
get_reads_per_contig
get_reads_per_contig (aligned, plot:bool=False)
Cumulative read depth per contigs.
Parameters: aliigned: aligned object of pysam. plot (bool): make a plot.
Returns: pd.DataFrame
source
read_bam
read_bam (bam_path:str, verbose:bool=False, force:bool=False)
Read the aligned file
Parameters: bam_path (str): path to the .bam file.
Returns: pysam’s aligned object.
Tests
Read the bam file as a pandas dataframe
df= to_df(
bam_path= 'inputs/test.bam' , # path to the bam file
contig= '14' , # i.e. reference name/ target
start= 74846395 , # start position
end= 74846545 , # end position
verbose= True , # verbose
)
df.head(1 )
CPU times: user 1e+03 ns, sys: 1 µs, total: 2 µs
Wall time: 4.77 µs
0
M04377:211:000000000-D637H:1:1101:7912:5949::u...
151M
0.0
74846395.0
C
C
print (df.shape)
assert df.shape == (56424 , 6 )
Coverage per position (QC
)
data= df.dseq.get_coverage()
data.head(1 )
CPU times: user 2 µs, sys: 1 µs, total: 3 µs
Wall time: 4.53 µs
print (data.shape)
assert data.shape == (151 , 3 )
Number of reads per mutations (QC
)
data= df.dseq.get_counts_per_mutation()
data.head(1 )
CPU times: user 9 µs, sys: 0 ns, total: 9 µs
Wall time: 15.7 µs
print (data.shape)
assert data.shape == (5 ,)
Read depth per nucleotide mutations
data= df.dseq.get_counts_per_nt_mutation()
data.head(1 )
CPU times: user 2 µs, sys: 1 µs, total: 3 µs
Wall time: 4.29 µs
print (data.shape)
assert data.shape == (212 , 4 )
Read depth per codons
data= df.dseq.get_counts_per_codon_mutation()
data.head(1 )
CPU times: user 2 µs, sys: 0 ns, total: 2 µs
Wall time: 4.05 µs
print (data.shape)
assert data.shape == (330 , 4 )
Read depth per amino acids
data= df.dseq.get_counts_per_aa_mutation()
data.head(1 )
CPU times: user 7 µs, sys: 1 µs, total: 8 µs
Wall time: 13.6 µs
0
74846395.0
CCT
CCC
2.0
P
P
print (data.shape)
assert data.shape == (330 , 6 )
Normalize read depth of a mutations by the cumulative read depth, to calculate mutation freqency
data= df.dseq.get_frequencies_per_nt_mutation()
data.head(1 )
CPU times: user 2 µs, sys: 0 ns, total: 2 µs
Wall time: 4.05 µs
0
74846395.0
C
C
NaN
377
NaN
print (data.shape)
assert data.shape == (212 , 6 )