ATAC/03 normalization and batch correction notebook restructure
The ATAC 03 notebook has some limitations we should address. The general idea is to first normalize the data and then remove potential batch effects.
As of now, every normalization is run and the user has to select "the best", but we have neither a metric nor any other criterion for deciding which normalization to choose. Because multiple normalizations are run, it is also not possible to manually select a subset of LSI/PCA dimensions. To remedy this, I propose the following workflow:
The issues described above can be addressed by running only one of the normalizations (the user can still choose which one). Since this is ATAC data, TF-IDF should be the default.
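For reference, TF-IDF as commonly used for scATAC (the TF = per-cell scaling, IDF = log inverse peak frequency variant) could be sketched like this; the function name and the optional `log1p` flag are illustrative assumptions, not existing notebook code:

```python
import numpy as np
import scipy.sparse as sp

def tfidf(X, log1p=True):
    """TF-IDF normalization sketch for a cells-by-peaks count matrix.

    TF scales each cell by its total counts; IDF up-weights rare peaks.
    `log1p` optionally log-transforms the result (the open question above).
    """
    X = sp.csr_matrix(X, dtype=np.float64)
    # term frequency: divide each row (cell) by its total counts
    row_sums = np.asarray(X.sum(axis=1)).ravel()
    row_sums[row_sums == 0] = 1.0  # guard against empty cells
    tf = sp.diags(1.0 / row_sums) @ X
    # inverse document frequency: rare peaks get larger weights
    n_cells = X.shape[0]
    peak_counts = np.asarray((X > 0).sum(axis=0)).ravel()
    peak_counts[peak_counts == 0] = 1.0  # guard against all-zero peaks
    idf = np.log1p(n_cells / peak_counts)
    out = tf @ sp.diags(idf)
    if log1p:
        out = out.log1p()  # scipy sparse matrices support log1p elementwise
    return out.tocsr()

# illustrative usage on a random sparse "count" matrix
X = sp.random(100, 500, density=0.05, format="csr", random_state=0)
X.data = np.ceil(X.data * 10)  # pseudo-counts
X_norm = tfidf(X)
```

In practice the notebook would call the chosen library implementation; this sketch only fixes what "TF-IDF as default" would mean concretely.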
Additional notes:
- the neighbor graph and UMAP should be added to the batch-correction part if necessary.
- should we enable log1p for TF-IDF?
- the normalization should be compared to the raw data.
- we should think about adding a probplot (`scipy.stats.probplot`). It compares the data distribution to a normal distribution: the closer the blue points are to the red line, the closer the data are to a normal distribution. For example:
import scipy.stats as stats
import matplotlib.pyplot as plt

fig, ax = plt.subplots(ncols=3, figsize=(15, 4.5))
for i, (a, t) in enumerate([
    (adata, "raw"),
    (normalizations["tfidf"], "tfidf"),
    (normalizations["total"], "total"),
]):
    # flatten to 1D and drop zeros, otherwise the sparse zeros dominate the plot
    arr = a.X.toarray().flatten()
    arr = arr[arr > 0]
    # Q-Q plot against a normal distribution
    stats.probplot(arr, plot=ax[i])
    ax[i].set_title(t)