scRNA-Seq binarization with scBoolSeq


scBoolSeq binarizes scRNA-Seq datasets using gene-wise pseudocount distributions learnt from a reference scRNA-Seq dataset.

from scboolseq import scBoolSeq
import pandas as pd

import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

1. Learning of reference pseudocount distributions

scBoolSeq learns pseudocount distributions from the highly variable genes of a reference scRNA-Seq dataset.

!test -f GSE81682_Hematopoiesis.csv || curl -fOL \
    https://github.com/bnediction/scBoolSeq-supplementary/raw/main/data_filtered_vargenes/GSE81682_Hematopoiesis.csv
ref_data = pd.read_csv("GSE81682_Hematopoiesis.csv", index_col=0)
# for the sake of the demo, keep only 200 genes (columns 42 to 241)
ref_data = ref_data.iloc[:,42:242]
ref_data
2900041M22Rik Klk8 Gm37637 Gp9 Idh3a Akr1c13 2810408A11Rik Npr2 Ephx1 Pik3ip1 ... Ctsl Iigp1 P2ry14 Cd82 Slc18a2 Cd302 Parp12 Isyna1 S100a8 B130034C11Rik
HSPC_025 0.000000 0.000000 0.000000 1.832751 8.774285 1.189716 1.189716 1.832751 0.0 3.118770 ... 1.832751 1.189716 7.790156 7.421084 0.000000 0.000000 6.779938 2.275971 6.079295 0.000000
HSPC_031 0.686872 0.686872 0.000000 0.000000 2.827390 8.409573 0.686872 1.782055 0.0 8.765434 ... 0.000000 0.000000 8.900430 2.397399 1.500480 0.000000 0.686872 3.582099 0.000000 0.000000
HSPC_037 0.000000 7.275944 0.000000 0.000000 8.862937 0.000000 0.000000 1.218731 0.0 8.954567 ... 7.160213 1.869807 10.783395 8.057904 0.000000 0.000000 6.235258 3.164226 0.000000 0.000000
LT-HSC_001 0.000000 7.520353 0.000000 0.000000 2.364517 0.000000 0.000000 3.217169 0.0 7.232592 ... 8.058464 4.909226 3.749470 3.749470 0.000000 2.364517 8.700371 0.000000 0.000000 2.364517
HSPC_001 0.000000 4.414804 0.377367 7.107768 8.010188 5.867946 0.000000 9.303379 0.0 8.824449 ... 8.121239 0.676211 8.514683 1.762030 0.377367 0.000000 0.000000 1.628905 0.377367 0.000000
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
Prog_834 5.354924 5.205256 5.661187 6.153235 7.324615 5.824095 0.000000 0.488810 0.0 5.464396 ... 7.791800 6.454615 6.605636 6.581538 7.992470 2.642176 5.020429 6.042844 0.000000 4.965622
Prog_840 0.000000 5.009981 5.770268 0.470364 7.716333 5.780421 0.000000 0.470364 0.0 5.790503 ... 6.429388 6.605327 6.486320 6.152863 6.390145 1.345870 0.470364 7.106828 0.000000 0.000000
Prog_846 1.024409 1.024409 0.601281 0.601281 7.614214 5.350986 0.000000 0.000000 0.0 4.133070 ... 6.809524 0.601281 4.215689 3.042448 2.499161 0.000000 0.601281 7.156591 4.293831 0.000000
Prog_852 0.000000 0.000000 0.000000 0.000000 7.665835 6.581736 1.624428 0.000000 0.0 2.857900 ... 2.369158 0.000000 4.448371 5.227124 1.029699 0.000000 2.044323 7.334552 4.963769 0.000000
Prog_810 0.329207 7.215698 0.597074 0.000000 7.464188 6.855898 0.000000 0.000000 0.0 7.404325 ... 1.190055 0.000000 3.556499 4.748924 1.190055 0.000000 0.329207 1.725482 2.838296 4.374125

1656 rows × 200 columns
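The reference file loaded above is already restricted to variable genes, with cells as rows and genes as columns. For a dataset that has not been filtered yet, a simple variance-based selection could look like the sketch below. The helper `top_variable_genes` and the toy data are hypothetical and are not part of scBoolSeq; dedicated tools use more refined selection criteria than raw variance.

```python
import numpy as np
import pandas as pd


def top_variable_genes(expr: pd.DataFrame, n_genes: int) -> pd.DataFrame:
    """Keep the n_genes columns (genes) with the largest variance across rows (cells).

    Hypothetical helper for illustration; not part of scBoolSeq.
    """
    variances = expr.var(axis=0)
    keep = variances.nlargest(n_genes).index
    # Boolean mask on the columns preserves the original gene order
    return expr.loc[:, expr.columns.isin(keep)]


# Toy expression matrix: 50 cells x 4 genes, same orientation as ref_data
rng = np.random.default_rng(0)
toy = pd.DataFrame(
    {
        "flat": np.full(50, 3.0),            # constant, zero variance
        "noisy": rng.normal(5.0, 2.0, 50),   # moderate variance
        "switch": np.repeat([0.0, 8.0], 25), # two well-separated levels
        "quiet": rng.normal(5.0, 0.1, 50),   # very low variance
    },
    index=[f"cell_{i}" for i in range(50)],
)

selected = top_variable_genes(toy, n_genes=2)
print(list(selected.columns))
```

The resulting DataFrame keeps the cells-by-genes layout, so it can be used in the same way as `ref_data`.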

%time scbool = scBoolSeq().fit(ref_data)
Computing bimodality index for 89/200 genes
Computing bimodality index for 139/200 genes
CPU times: user 5.99 s, sys: 582 ms, total: 6.57 s
Wall time: 1.89 s
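The log above reports a bimodality index being computed for each gene. This page does not show how scBoolSeq computes it. As an illustration only, the sketch below uses a commonly cited definition for a two-component Gaussian mixture with shared standard deviation (Wang et al., 2009); that scBoolSeq relies on this exact formula is an assumption, and the function name and parameters are hypothetical.

```python
import math


def bimodality_index(mu1: float, mu2: float, sigma: float, pi: float) -> float:
    """Bimodality index of a two-component Gaussian mixture.

    mu1, mu2: component means; sigma: shared standard deviation;
    pi: proportion of observations in the first component.
    Illustrative definition, assumed here; not taken from scBoolSeq's code.
    """
    if not 0.0 <= pi <= 1.0:
        raise ValueError("pi must lie in [0, 1]")
    if sigma <= 0.0:
        raise ValueError("sigma must be positive")
    return math.sqrt(pi * (1.0 - pi)) * abs(mu1 - mu2) / sigma


# Same separation between modes, different balance between the two groups
bi_balanced = bimodality_index(mu1=0.0, mu2=6.0, sigma=1.5, pi=0.5)
bi_skewed = bimodality_index(mu1=0.0, mu2=6.0, sigma=1.5, pi=0.1)
print(bi_balanced, bi_skewed)
```

Under this definition the index grows with the distance between the two modes and shrinks when one of the two groups contains only a few cells.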

2. (optional) Access to the learnt pseudocount distributions

scbool.criteria_[['Category', *scbool.criteria_]]
Category Mean MeanNZ Median MedianNZ GeometricMean HarmonicMean Variance VarianceNZ DropOutRate Amplitude Dip Kurtosis Skewness DenPeak BI Category
2900041M22Rik ZeroInf 0.336663 2.844458 0.000000 1.434885 1.799128 1.053994 1.538907 5.868876 0.881643 9.051052 0.586866 17.547098 4.240360 0.004969 0.000000 ZeroInf
Klk8 Bimodal 3.284505 4.368787 1.742776 5.107305 2.969196 1.640595 10.048549 8.628783 0.248188 9.538356 0.000000 -1.530713 0.390004 0.373447 2.919056 Bimodal
Gm37637 ZeroInf 0.260708 2.585227 0.000000 1.232339 1.499722 0.910336 1.248158 6.367534 0.899155 8.323065 0.955873 26.252765 5.111231 0.004177 0.000000 ZeroInf
Gp9 Bimodal 1.674165 3.376878 0.000000 1.967766 2.150308 1.288318 6.875951 8.119299 0.504227 10.033402 0.000000 0.878241 1.530903 0.030183 2.745009 Bimodal
Idh3a Bimodal 6.243458 6.370404 7.497524 7.541152 5.551172 4.379675 7.536478 6.881015 0.019928 9.882836 0.000000 -0.897883 -0.709034 8.245655 2.228428 Bimodal
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
Cd302 ZeroInf 0.527272 2.359899 0.000000 1.281987 1.474975 0.904786 2.130893 5.212369 0.776570 8.945604 0.199354 11.506878 3.443353 0.015439 0.000000 ZeroInf
Parp12 Bimodal 1.831223 3.434322 0.405306 1.998044 2.149494 1.244920 7.447516 8.461700 0.466787 10.432446 0.000900 0.585548 1.424771 0.056362 2.519183 Bimodal
Isyna1 Bimodal 5.861969 6.014511 7.093528 7.152238 5.082111 3.756957 8.006213 7.297088 0.025362 10.269083 0.000000 -1.108027 -0.601258 7.928887 2.419874 Bimodal
S100a8 ZeroInf 0.710715 3.155347 0.000000 1.983719 2.039669 1.233042 3.424959 7.492051 0.774758 16.763306 0.735221 11.336264 3.220410 0.021852 0.000000 ZeroInf
B130034C11Rik ZeroInf 0.544910 2.570856 0.000000 1.591329 1.661072 1.029217 2.138895 4.882779 0.788043 8.442426 0.696348 9.344361 3.137202 0.011575 0.000000 ZeroInf

200 rows × 17 columns
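Several of the columns above appear to be plain summary statistics of each gene. Reading `DropOutRate` as the fraction of zero entries and the `NZ` suffix as "non-zero entries only" is an assumption, but the displayed values are consistent with it: for 2900041M22Rik, MeanNZ × (1 − DropOutRate) = 2.844458 × (1 − 0.881643) ≈ 0.3367, which matches the Mean column. The sketch below recomputes these three quantities with pandas on toy data; `basic_criteria` is a hypothetical helper, not scBoolSeq's implementation.

```python
import pandas as pd


def basic_criteria(expr: pd.DataFrame) -> pd.DataFrame:
    """Per-gene mean, mean over non-zero entries, and fraction of zeros.

    Hypothetical helper; assumes cells as rows and genes as columns.
    """
    nonzero = expr.where(expr != 0)  # zeros become NaN, which mean() skips
    return pd.DataFrame(
        {
            "Mean": expr.mean(axis=0),
            "MeanNZ": nonzero.mean(axis=0),
            "DropOutRate": (expr == 0).mean(axis=0),
        }
    )


toy = pd.DataFrame(
    {"geneA": [0.0, 0.0, 0.0, 4.0], "geneB": [2.0, 6.0, 0.0, 4.0]},
    index=["c1", "c2", "c3", "c4"],
)

stats = basic_criteria(toy)
print(stats)

# The three columns are linked by Mean = MeanNZ * (1 - DropOutRate)
reconstructed = stats["MeanNZ"] * (1 - stats["DropOutRate"])
```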

3. Binarize data

Here, we binarize the reference dataset:

%time bindata = scbool.binarize(ref_data)
bindata
CPU times: user 94.6 ms, sys: 1.05 ms, total: 95.6 ms
Wall time: 94.9 ms
2900041M22Rik Klk8 Gm37637 Gp9 Idh3a Akr1c13 2810408A11Rik Npr2 Ephx1 Pik3ip1 ... Ctsl Iigp1 P2ry14 Cd82 Slc18a2 Cd302 Parp12 Isyna1 S100a8 B130034C11Rik
HSPC_025 NaN 0.0 NaN 1.0 1.0 0.0 1.0 1.0 NaN 0.0 ... 0.0 1.0 1.0 1.0 0.0 NaN 1.0 0.0 1.0 NaN
HSPC_031 1.0 0.0 NaN 0.0 0.0 1.0 1.0 1.0 NaN 1.0 ... 0.0 NaN 1.0 0.0 0.0 NaN 1.0 0.0 NaN NaN
HSPC_037 NaN 1.0 NaN 0.0 1.0 0.0 NaN 1.0 NaN 1.0 ... 1.0 1.0 1.0 1.0 0.0 NaN 1.0 0.0 NaN NaN
LT-HSC_001 NaN 1.0 NaN 0.0 0.0 0.0 NaN 1.0 NaN 1.0 ... 1.0 1.0 NaN 0.0 0.0 1.0 1.0 0.0 NaN 1.0
HSPC_001 NaN 1.0 1.0 1.0 1.0 1.0 NaN 1.0 NaN 1.0 ... 1.0 1.0 1.0 0.0 0.0 NaN 0.0 0.0 1.0 NaN
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
Prog_834 1.0 1.0 1.0 1.0 1.0 1.0 NaN 1.0 NaN NaN ... 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 NaN 1.0
Prog_840 NaN 1.0 1.0 1.0 1.0 1.0 NaN 1.0 NaN 1.0 ... 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 NaN NaN
Prog_846 1.0 0.0 1.0 1.0 1.0 1.0 NaN NaN NaN NaN ... 1.0 1.0 NaN 0.0 0.0 NaN 1.0 1.0 1.0 NaN
Prog_852 NaN 0.0 NaN 0.0 1.0 1.0 1.0 NaN NaN 0.0 ... 0.0 NaN NaN NaN 0.0 NaN 1.0 1.0 1.0 NaN
Prog_810 1.0 1.0 1.0 0.0 1.0 1.0 NaN NaN NaN 1.0 ... 0.0 NaN NaN NaN 0.0 NaN 1.0 0.0 1.0 1.0

1656 rows × 200 columns
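The binarized table holds three kinds of entries: 0.0, 1.0, and NaN, the latter presumably marking values that were not assigned to either state. A quick way to gauge how informative the result is uses plain pandas, as sketched below on a toy frame that mimics the layout of `bindata`. The toy values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy stand-in for a binarized matrix: cells as rows, genes as columns
toy_bin = pd.DataFrame(
    {
        "geneA": [1.0, 0.0, np.nan, 1.0],
        "geneB": [np.nan, np.nan, 1.0, 0.0],
        "geneC": [0.0, 0.0, 0.0, 1.0],
    },
    index=["c1", "c2", "c3", "c4"],
)

# Fraction of cells without an assigned value, per gene
undetermined = toy_bin.isna().mean(axis=0)

# Fraction of determined cells set to 1, per gene (mean() skips NaN)
active = toy_bin.mean(axis=0)

# NaN forces a float dtype; a nullable integer dtype keeps 0/1 as integers
nullable = toy_bin.astype("Int8")

print(pd.DataFrame({"undetermined": undetermined, "active": active}))
```

The same three expressions can be applied to `bindata` directly, since it is an ordinary DataFrame.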