scRNA-Seq binarization with scBoolSeq
The tool scBoolSeq can be used to binarize scRNA-Seq datasets according to gene-wise pseudocount distributions learnt from a reference scRNA-Seq dataset.
from scboolseq import scBoolSeq
import pandas as pd
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)
1. Learning of reference pseudocount distributions
scBoolSeq learns pseudocount distributions from the highly variable genes of a reference scRNA-Seq dataset.
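The reference file used in this demo appears to be pre-filtered to variable genes (its path contains `data_filtered_vargenes`), so no filtering is needed here. For a dataset that is not pre-filtered, the following is a minimal sketch of one simple way to keep the most variable genes, ranking them by sample variance. The toy matrix, the gene names, and the `n_top` cutoff are hypothetical, and this is not necessarily the selection procedure used by the scBoolSeq authors.

```python
import pandas as pd

# Hypothetical toy expression matrix: rows are cells, columns are genes.
toy = pd.DataFrame(
    {
        "geneA": [0.0, 0.0, 8.0, 9.0],  # switches between low and high
        "geneB": [5.0, 5.1, 4.9, 5.0],  # nearly constant
        "geneC": [0.0, 3.0, 0.0, 6.0],  # moderately variable
    },
    index=["cell1", "cell2", "cell3", "cell4"],
)

n_top = 2  # hypothetical cutoff: number of genes to keep

# Sample variance of each gene across cells, then the n_top largest.
gene_variance = toy.var(axis=0)
top_genes = gene_variance.nlargest(n_top).index.tolist()

# Restrict the matrix to the selected genes.
filtered = toy[top_genes]
print(top_genes)
```

Dedicated single-cell toolkits normally account for the mean-variance relationship when ranking genes; plain variance is used here only to keep the sketch short.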
!test -f GSE81682_Hematopoiesis.csv || curl -fOL \
https://github.com/bnediction/scBoolSeq-supplementary/raw/main/data_filtered_vargenes/GSE81682_Hematopoiesis.csv
ref_data = pd.read_csv("GSE81682_Hematopoiesis.csv", index_col=0)
# for the sake of the demo, keep only 200 genes (column positions 42 to 241)
ref_data = ref_data.iloc[:,42:242]
ref_data
| | 2900041M22Rik | Klk8 | Gm37637 | Gp9 | Idh3a | Akr1c13 | 2810408A11Rik | Npr2 | Ephx1 | Pik3ip1 | ... | Ctsl | Iigp1 | P2ry14 | Cd82 | Slc18a2 | Cd302 | Parp12 | Isyna1 | S100a8 | B130034C11Rik |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
HSPC_025 | 0.000000 | 0.000000 | 0.000000 | 1.832751 | 8.774285 | 1.189716 | 1.189716 | 1.832751 | 0.0 | 3.118770 | ... | 1.832751 | 1.189716 | 7.790156 | 7.421084 | 0.000000 | 0.000000 | 6.779938 | 2.275971 | 6.079295 | 0.000000 |
HSPC_031 | 0.686872 | 0.686872 | 0.000000 | 0.000000 | 2.827390 | 8.409573 | 0.686872 | 1.782055 | 0.0 | 8.765434 | ... | 0.000000 | 0.000000 | 8.900430 | 2.397399 | 1.500480 | 0.000000 | 0.686872 | 3.582099 | 0.000000 | 0.000000 |
HSPC_037 | 0.000000 | 7.275944 | 0.000000 | 0.000000 | 8.862937 | 0.000000 | 0.000000 | 1.218731 | 0.0 | 8.954567 | ... | 7.160213 | 1.869807 | 10.783395 | 8.057904 | 0.000000 | 0.000000 | 6.235258 | 3.164226 | 0.000000 | 0.000000 |
LT-HSC_001 | 0.000000 | 7.520353 | 0.000000 | 0.000000 | 2.364517 | 0.000000 | 0.000000 | 3.217169 | 0.0 | 7.232592 | ... | 8.058464 | 4.909226 | 3.749470 | 3.749470 | 0.000000 | 2.364517 | 8.700371 | 0.000000 | 0.000000 | 2.364517 |
HSPC_001 | 0.000000 | 4.414804 | 0.377367 | 7.107768 | 8.010188 | 5.867946 | 0.000000 | 9.303379 | 0.0 | 8.824449 | ... | 8.121239 | 0.676211 | 8.514683 | 1.762030 | 0.377367 | 0.000000 | 0.000000 | 1.628905 | 0.377367 | 0.000000 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
Prog_834 | 5.354924 | 5.205256 | 5.661187 | 6.153235 | 7.324615 | 5.824095 | 0.000000 | 0.488810 | 0.0 | 5.464396 | ... | 7.791800 | 6.454615 | 6.605636 | 6.581538 | 7.992470 | 2.642176 | 5.020429 | 6.042844 | 0.000000 | 4.965622 |
Prog_840 | 0.000000 | 5.009981 | 5.770268 | 0.470364 | 7.716333 | 5.780421 | 0.000000 | 0.470364 | 0.0 | 5.790503 | ... | 6.429388 | 6.605327 | 6.486320 | 6.152863 | 6.390145 | 1.345870 | 0.470364 | 7.106828 | 0.000000 | 0.000000 |
Prog_846 | 1.024409 | 1.024409 | 0.601281 | 0.601281 | 7.614214 | 5.350986 | 0.000000 | 0.000000 | 0.0 | 4.133070 | ... | 6.809524 | 0.601281 | 4.215689 | 3.042448 | 2.499161 | 0.000000 | 0.601281 | 7.156591 | 4.293831 | 0.000000 |
Prog_852 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 7.665835 | 6.581736 | 1.624428 | 0.000000 | 0.0 | 2.857900 | ... | 2.369158 | 0.000000 | 4.448371 | 5.227124 | 1.029699 | 0.000000 | 2.044323 | 7.334552 | 4.963769 | 0.000000 |
Prog_810 | 0.329207 | 7.215698 | 0.597074 | 0.000000 | 7.464188 | 6.855898 | 0.000000 | 0.000000 | 0.0 | 7.404325 | ... | 1.190055 | 0.000000 | 3.556499 | 4.748924 | 1.190055 | 0.000000 | 0.329207 | 1.725482 | 2.838296 | 4.374125 |
1656 rows × 200 columns
%time scbool = scBoolSeq().fit(ref_data)
Computing bimodality index for 89/200 genes
Computing bimodality index for 139/200 genes
CPU times: user 5.99 s, sys: 582 ms, total: 6.57 s
Wall time: 1.89 s
2. (optional) Access to the learnt pseudocount distributions
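# Show 'Category' first; unpacking the DataFrame yields all of its column names,
# so 'Category' also appears again as the last column of the output
scbool.criteria_[['Category', *scbool.criteria_]]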
| | Category | Mean | MeanNZ | Median | MedianNZ | GeometricMean | HarmonicMean | Variance | VarianceNZ | DropOutRate | Amplitude | Dip | Kurtosis | Skewness | DenPeak | BI | Category |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
2900041M22Rik | ZeroInf | 0.336663 | 2.844458 | 0.000000 | 1.434885 | 1.799128 | 1.053994 | 1.538907 | 5.868876 | 0.881643 | 9.051052 | 0.586866 | 17.547098 | 4.240360 | 0.004969 | 0.000000 | ZeroInf |
Klk8 | Bimodal | 3.284505 | 4.368787 | 1.742776 | 5.107305 | 2.969196 | 1.640595 | 10.048549 | 8.628783 | 0.248188 | 9.538356 | 0.000000 | -1.530713 | 0.390004 | 0.373447 | 2.919056 | Bimodal |
Gm37637 | ZeroInf | 0.260708 | 2.585227 | 0.000000 | 1.232339 | 1.499722 | 0.910336 | 1.248158 | 6.367534 | 0.899155 | 8.323065 | 0.955873 | 26.252765 | 5.111231 | 0.004177 | 0.000000 | ZeroInf |
Gp9 | Bimodal | 1.674165 | 3.376878 | 0.000000 | 1.967766 | 2.150308 | 1.288318 | 6.875951 | 8.119299 | 0.504227 | 10.033402 | 0.000000 | 0.878241 | 1.530903 | 0.030183 | 2.745009 | Bimodal |
Idh3a | Bimodal | 6.243458 | 6.370404 | 7.497524 | 7.541152 | 5.551172 | 4.379675 | 7.536478 | 6.881015 | 0.019928 | 9.882836 | 0.000000 | -0.897883 | -0.709034 | 8.245655 | 2.228428 | Bimodal |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
Cd302 | ZeroInf | 0.527272 | 2.359899 | 0.000000 | 1.281987 | 1.474975 | 0.904786 | 2.130893 | 5.212369 | 0.776570 | 8.945604 | 0.199354 | 11.506878 | 3.443353 | 0.015439 | 0.000000 | ZeroInf |
Parp12 | Bimodal | 1.831223 | 3.434322 | 0.405306 | 1.998044 | 2.149494 | 1.244920 | 7.447516 | 8.461700 | 0.466787 | 10.432446 | 0.000900 | 0.585548 | 1.424771 | 0.056362 | 2.519183 | Bimodal |
Isyna1 | Bimodal | 5.861969 | 6.014511 | 7.093528 | 7.152238 | 5.082111 | 3.756957 | 8.006213 | 7.297088 | 0.025362 | 10.269083 | 0.000000 | -1.108027 | -0.601258 | 7.928887 | 2.419874 | Bimodal |
S100a8 | ZeroInf | 0.710715 | 3.155347 | 0.000000 | 1.983719 | 2.039669 | 1.233042 | 3.424959 | 7.492051 | 0.774758 | 16.763306 | 0.735221 | 11.336264 | 3.220410 | 0.021852 | 0.000000 | ZeroInf |
B130034C11Rik | ZeroInf | 0.544910 | 2.570856 | 0.000000 | 1.591329 | 1.661072 | 1.029217 | 2.138895 | 4.882779 | 0.788043 | 8.442426 | 0.696348 | 9.344361 | 3.137202 | 0.011575 | 0.000000 | ZeroInf |
200 rows × 17 columns
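The column names are not defined in this notebook. A reasonable reading, stated here as an assumption, is that `MeanNZ` is the mean over non-zero values and `DropOutRate` is the fraction of zero values for a gene. If so, the overall mean should equal `MeanNZ * (1 - DropOutRate)`. The sketch below checks this identity on three rows copied from the table above; `mean_from_nonzero` is a helper introduced only for this illustration.

```python
# (Mean, MeanNZ, DropOutRate) copied from the table above.
rows = {
    "2900041M22Rik": (0.336663, 2.844458, 0.881643),
    "Klk8": (3.284505, 4.368787, 0.248188),
    "Idh3a": (6.243458, 6.370404, 0.019928),
}


def mean_from_nonzero(mean_nz, dropout_rate):
    """Overall mean implied by the non-zero mean and the fraction of zeros."""
    return mean_nz * (1.0 - dropout_rate)


# Absolute gap between the reported mean and the implied mean, per gene.
residuals = {}
for gene, (mean, mean_nz, dropout) in rows.items():
    implied = mean_from_nonzero(mean_nz, dropout)
    residuals[gene] = abs(implied - mean)
    print(f"{gene}: reported={mean:.6f} implied={implied:.6f}")
```

For these three genes the reported and implied means agree to within the rounding of the displayed values, which supports the assumed reading of the two columns.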
3. Binarize data
Here, we binarize the reference dataset:
%time bindata = scbool.binarize(ref_data)
bindata
CPU times: user 94.6 ms, sys: 1.05 ms, total: 95.6 ms
Wall time: 94.9 ms
| | 2900041M22Rik | Klk8 | Gm37637 | Gp9 | Idh3a | Akr1c13 | 2810408A11Rik | Npr2 | Ephx1 | Pik3ip1 | ... | Ctsl | Iigp1 | P2ry14 | Cd82 | Slc18a2 | Cd302 | Parp12 | Isyna1 | S100a8 | B130034C11Rik |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
HSPC_025 | NaN | 0.0 | NaN | 1.0 | 1.0 | 0.0 | 1.0 | 1.0 | NaN | 0.0 | ... | 0.0 | 1.0 | 1.0 | 1.0 | 0.0 | NaN | 1.0 | 0.0 | 1.0 | NaN |
HSPC_031 | 1.0 | 0.0 | NaN | 0.0 | 0.0 | 1.0 | 1.0 | 1.0 | NaN | 1.0 | ... | 0.0 | NaN | 1.0 | 0.0 | 0.0 | NaN | 1.0 | 0.0 | NaN | NaN |
HSPC_037 | NaN | 1.0 | NaN | 0.0 | 1.0 | 0.0 | NaN | 1.0 | NaN | 1.0 | ... | 1.0 | 1.0 | 1.0 | 1.0 | 0.0 | NaN | 1.0 | 0.0 | NaN | NaN |
LT-HSC_001 | NaN | 1.0 | NaN | 0.0 | 0.0 | 0.0 | NaN | 1.0 | NaN | 1.0 | ... | 1.0 | 1.0 | NaN | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | NaN | 1.0 |
HSPC_001 | NaN | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | NaN | 1.0 | NaN | 1.0 | ... | 1.0 | 1.0 | 1.0 | 0.0 | 0.0 | NaN | 0.0 | 0.0 | 1.0 | NaN |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
Prog_834 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | NaN | 1.0 | NaN | NaN | ... | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | NaN | 1.0 |
Prog_840 | NaN | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | NaN | 1.0 | NaN | 1.0 | ... | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | NaN | NaN |
Prog_846 | 1.0 | 0.0 | 1.0 | 1.0 | 1.0 | 1.0 | NaN | NaN | NaN | NaN | ... | 1.0 | 1.0 | NaN | 0.0 | 0.0 | NaN | 1.0 | 1.0 | 1.0 | NaN |
Prog_852 | NaN | 0.0 | NaN | 0.0 | 1.0 | 1.0 | 1.0 | NaN | NaN | 0.0 | ... | 0.0 | NaN | NaN | NaN | 0.0 | NaN | 1.0 | 1.0 | 1.0 | NaN |
Prog_810 | 1.0 | 1.0 | 1.0 | 0.0 | 1.0 | 1.0 | NaN | NaN | NaN | 1.0 | ... | 0.0 | NaN | NaN | NaN | 0.0 | NaN | 1.0 | 0.0 | 1.0 | 1.0 |
1656 rows × 200 columns
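The binarized matrix contains three kinds of entries: `0.0`, `1.0` and `NaN`. The `NaN` entries presumably mark cells whose state was left undetermined for that gene; the notebook does not say so explicitly, so treat that reading as an assumption. Before using such a matrix downstream, it can help to see how often each gene falls into each class. The sketch below does this on a toy frame whose values are copied from the first five rows of the output above for three genes; the column names of `summary` are chosen for this illustration.

```python
import numpy as np
import pandas as pd

# Values copied from the first five rows of the binarized output above.
toy = pd.DataFrame(
    {
        "2900041M22Rik": [np.nan, 1.0, np.nan, np.nan, np.nan],
        "Klk8": [0.0, 0.0, 1.0, 1.0, 1.0],
        "Gp9": [1.0, 0.0, 0.0, 0.0, 1.0],
    },
    index=["HSPC_025", "HSPC_031", "HSPC_037", "LT-HSC_001", "HSPC_001"],
)

# Fraction of cells in each class, per gene.
# Comparisons with NaN evaluate to False, so NaN is counted only by isna().
summary = pd.DataFrame(
    {
        "frac_0": (toy == 0).mean(),
        "frac_1": (toy == 1).mean(),
        "frac_nan": toy.isna().mean(),
    }
)
print(summary)
```

The same three expressions can be applied directly to `bindata` to obtain the per-gene breakdown for the full 1656 × 200 matrix.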