implemented parallel option for splitting bam files

Hendrik Schultheis requested to merge parallel-split-bam into bam_splitting

This adds a parallel version of the `split_bam_clusters` function that reads and writes multiple bams at once. For testing, I prepared 16 input bams with 1 million reads each, split into 42 output files. Sadly, this ran about 4 times slower than the serial version (~2 min vs. ~30 s). The reason seems to be that pysam objects are not easily serializable (see here), so sending them between processes creates a lot of overhead. If we could somehow avoid this, parallelization might become viable, but until then I wouldn't use it. Logging also becomes much harder with multiple processes; progress bars should be possible, but given the runtime I didn't go beyond basic print messages.
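One possible way around the serialization overhead might be to ship reads between processes as plain SAM text (e.g. via `AlignedSegment.to_string()` on the reader side and `AlignedSegment.fromstring()` on the writer side) instead of pickling pysam objects. Below is a stdlib-only sketch of that pattern, with one writer process per cluster owning its own output file; the `writer`/`split_reads` helpers are hypothetical, and plain text files stand in for pysam's bam handles:

```python
import multiprocessing as mp

def writer(cluster, queue, out_path):
    # Each worker owns one output file and receives plain SAM-text
    # lines (cheap to pickle), not pysam objects. In the real function
    # this would open a pysam.AlignmentFile and reconstruct reads with
    # AlignedSegment.fromstring().
    with open(out_path, "w") as out:
        while True:
            line = queue.get()
            if line is None:  # sentinel: no more reads for this cluster
                break
            out.write(line + "\n")

def split_reads(reads, clusters, out_dir):
    # reads: iterable of (cluster, sam_line) pairs produced by the
    # reader, e.g. (cluster, read.to_string()) in the pysam version.
    queues = {c: mp.Queue() for c in clusters}
    procs = [
        mp.Process(target=writer, args=(c, queues[c], f"{out_dir}/{c}.sam"))
        for c in clusters
    ]
    for p in procs:
        p.start()
    for cluster, line in reads:
        queues[cluster].put(line)
    for q in queues.values():
        q.put(None)  # signal end of input to each writer
    for p in procs:
        p.join()
```

Whether the string round-trip is cheaper than pickling in practice would need benchmarking, but SAM text is a flat string, so the per-read IPC cost should at least be predictable.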
