The scVI integration with all pediatric samples worked really well:
/lustre/scratch127/cellgen/cellgeni/aljes/integration/data/integrated_umap_annotated.h5ad
We will repeat the same strategy, adding three new samples (from adult ovaries, so we can compare pediatric vs adult). The samples are:
M23s20 = /nfs/t292_imaging/0XeniumExports/JA_POV/20240801__130116__SGP180_Hsa_RPT_Run2/output-XETG00335__0027061__M23-OVR-0-FO-1-S20-ii__20240801__130136/
M23s25 = /nfs/t292_imaging/0XeniumExports/JA_POV/20240815__114153__SGP180_RPT_run2r/output-XETG00155__0027062__M23-OVR-0-FO-S25-iii__20240815__114214
M23s5 = /nfs/t292_imaging/0XeniumExports/JA_POV/20240801__130116__SGP180_Hsa_RPT_Run2/output-XETG00335__0027061__M23-OVR-0-FO-1-S5-iii__20240801__130136/
NOTE: the dataset is ~ 6 million cells.
Copy samples to lustre
mkdir -p data/adult
cp -r /nfs/t292_imaging/0XeniumExports/JA_POV/20240801__130116__SGP180_Hsa_RPT_Run2/output-XETG00335__0027061__M23-OVR-0-FO-1-S20-ii__20240801__130136/ data/adult/
cp -r /nfs/t292_imaging/0XeniumExports/JA_POV/20240815__114153__SGP180_RPT_run2r/output-XETG00155__0027062__M23-OVR-0-FO-S25-iii__20240815__114214/ data/adult/
cp -r /nfs/t292_imaging/0XeniumExports/JA_POV/20240801__130116__SGP180_Hsa_RPT_Run2/output-XETG00335__0027061__M23-OVR-0-FO-1-S5-iii__20240801__130136/ data/adult/
cp /nfs/team292/ct27/ovarian/all_donors.h5ad data/all_donors.h5ad
Copy annotation
cp /nfs/team292/lg18/paediatric_gonads/annotation/xenium/all_donor_scVI_integration_annotation.csv data/annotation.csv
Run interactive job
bsub -Is -G cellgeni -q "cpu-interactive" -n 4 -M "32GB" -R "span[hosts=1] select[mem>32GB] rusage[mem=32GB]" /bin/bash
>>>
Job <377679> is submitted to queue <normal>.
<<Waiting for dispatch ...>>
<<Starting on node-14-13>>
singularity exec \
--bind /lustre,/nfs \
/nfs/cellgeni/singularity/images/toh5ad.sif \
jupyter notebook \
--no-browser --port=7777 --ip=0.0.0.0 \
--IdentityProvider.hashed_password='' \
--IdentityProvider.token='lolkek'
Open an SSH tunnel in a separate terminal
ssh -L 7777:localhost:7777 node-14-13
Concatenate and process all data: notebooks/process_xenium.ipynb
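The notebook is not reproduced here; as a rough sketch, the concat step amounts to reading each Xenium bundle's cell_feature_matrix.h5, labelling the samples, and concatenating them with the pediatric object (obs keys and the age label below are illustrative, not necessarily what the notebook uses):
import anndata as ad
import scanpy as sc

# Illustrative mapping of the adult samples to their copied Xenium output folders
samples = {
    "M23s20": "data/adult/output-XETG00335__0027061__M23-OVR-0-FO-1-S20-ii__20240801__130136",
    "M23s25": "data/adult/output-XETG00155__0027062__M23-OVR-0-FO-S25-iii__20240815__114214",
    "M23s5": "data/adult/output-XETG00335__0027061__M23-OVR-0-FO-1-S5-iii__20240801__130136",
}

adatas = []
for sample_id, path in samples.items():
    a = sc.read_10x_h5(f"{path}/cell_feature_matrix.h5")  # Xenium ships a 10x-style count matrix
    a.var_names_make_unique()
    a.obs["sample"] = sample_id
    a.obs["age_group"] = "adult"
    adatas.append(a)

# Add the already-processed pediatric object and concatenate everything
pediatric = ad.read_h5ad("data/all_donors.h5ad")
pediatric.obs["age_group"] = "pediatric"
adatas.append(pediatric)

combined = ad.concat(adatas, join="outer", index_unique="-")
combined.write_h5ad("data/adult_pediatric_processed.h5ad")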
Submit integration
scripts/submit_grid_integration.sh "$PWD/data/adult_pediatric_processed.h5ad" "$PWD/results"
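submit_grid_integration.sh itself is not shown here; each grid job presumably trains one scVI model on the processed object and stores its latent space, roughly along these lines (the batch key, hyperparameter values and output path are assumptions, not the actual grid):
import scanpy as sc
import scvi

adata = sc.read_h5ad("data/adult_pediatric_processed.h5ad")
scvi.model.SCVI.setup_anndata(adata, batch_key="sample")  # batch key is an assumption
model = scvi.model.SCVI(
    adata,
    n_hidden=128,
    n_latent=30,
    n_layers=2,
    dropout_rate=0.1,
    gene_likelihood="nb",
)
model.train(max_epochs=100, batch_size=512, plan_kwargs={"lr": 1e-4})
adata.obsm["X_scVI"] = model.get_latent_representation()  # latent space used downstream
adata.write_h5ad("results/example_run/adata_scvi.h5ad")  # illustrative output path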
Count the number of successful jobs
success=$(cat *Output*log | grep "Successfully completed." | wc -l)
total=$(ls -1 *Output*log | wc -l)
echo "Successfully completed ${success}/${total}"
>>>
Create UMAP list
scripts/create_viz_list.sh "$PWD/data/adult_pediatric_processed.h5ad" "$PWD/results" umap_list_pediatric_adult.tsv
Run UMAP creation
integration_list="$PWD/umap_list_pediatric_adult.tsv"
celltype_col="lineage"
scripts/submit_umap_creation.sh "$integration_list" "$celltype_col"
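submit_umap_creation.sh is not reproduced here; for each entry in the list it conceptually builds a neighbour graph on the integrated latent space and computes a UMAP coloured by the chosen cell-type column, roughly like this (the obsm key and file paths are assumptions):
import scanpy as sc

adata = sc.read_h5ad("results/example_run/adata_scvi.h5ad")  # one entry from the UMAP list (illustrative)
sc.pp.neighbors(adata, use_rep="X_scVI")  # neighbours on the latent space, not raw counts
sc.tl.umap(adata)
sc.pl.umap(adata, color="lineage", save="_example_run_lineage.png")
adata.write_h5ad("results/example_run/adata_umap.h5ad")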
Get a list of all .h5ad files for clustering
ls -1 $PWD/results/removeM23S25*/*/*.h5ad > clustering.list
Run clustering
bsub -J "clustering-tic-3970[1-8]" < scripts/clustering.bsub
scVI did not work well, so we will try ResolVI with the following set of hyperparameters:
n_hidden = [64, 128, 256]
n_latent = [15, 25, 35, 40]
n_layers = [2, 3]
dropout = [0.1]
lr = [0.0001]
batch_size = [512]
dispersion = ['gene', 'gene-batch']
likelihood = ['poisson', 'nb']
n_epochs = [100]
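For reference, this grid expands to 3 x 4 x 2 x 2 x 2 = 96 configurations; a quick way to enumerate them (parameter names here are illustrative, not necessarily the pipeline's own):
from itertools import product

grid = {
    "n_hidden": [64, 128, 256],
    "n_latent": [15, 25, 35, 40],
    "n_layers": [2, 3],
    "dropout": [0.1],
    "lr": [0.0001],
    "batch_size": [512],
    "dispersion": ["gene", "gene-batch"],
    "likelihood": ["poisson", "nb"],
    "n_epochs": [100],
}

# One dict per hyperparameter combination
configs = [dict(zip(grid, values)) for values in product(*grid.values())]
print(len(configs))  # 96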
Run the pipeline
bsub < scripts/run_integration.bsub
Results can be found in the results/resolvi folder.
We will run ResolVI for separate populations. I need to create separate objects for that first:
cp /warehouse/team292_wh01/reproductive_atlas/ovary/annotations/xenium_directannotation_scvi-v1.csv data/xenium_directannotation_scvi-v1.csv
notebooks/split_populations.ipynb
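split_populations.ipynb does the actual splitting; schematically it subsets the processed object by the annotation column and writes one h5ad per population (the column choice and numbering below are assumptions):
import pandas as pd
import scanpy as sc

adata = sc.read_h5ad("data/adult_pediatric_processed.h5ad")
annot = pd.read_csv("data/xenium_directannotation_scvi-v1.csv", index_col=0)
adata.obs["population"] = annot.reindex(adata.obs_names).iloc[:, 0]  # column choice is an assumption

# Write one object per population, matching the adata_population*.h5ad naming used below
for i, pop in enumerate(adata.obs["population"].dropna().unique(), start=1):
    subset = adata[adata.obs["population"] == pop].copy()
    subset.write_h5ad(f"data/adata_population{i}.h5ad")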
Then run ResolVI for each population:
bsub -env "all, ADATA=data/adata_population1.h5ad, SAMPLE_ID=population1, OUTPUT=results/resolvi_population1" < scripts/run_integration.bsub
bsub -env "all, ADATA=data/adata_population2.h5ad, SAMPLE_ID=population2, OUTPUT=results/resolvi_population2" < scripts/run_integration.bsub
bsub -env "all, ADATA=data/adata_population3.h5ad, SAMPLE_ID=population3, OUTPUT=results/resolvi_population3" < scripts/run_integration.bsub
bsub -env "all, ADATA=data/adata_population4.h5ad, SAMPLE_ID=population4, OUTPUT=results/resolvi_population4" < scripts/run_integration.bsub
The following set of hyperparameters was used:
batch_key = ['donor_batch']
n_hidden = [128]
n_latent = [40]
n_layers = [2]
dropout = [0.1]
lr = [0.0001]
batch_size = [512]
dispersion = ['gene']
likelihood = ['nb']
n_epochs = [100]