Concatenate datasets to a single array store¶
In the previous notebooks, we’ve seen how to incrementally create a collection of scRNA-seq datasets and train models on it.
Sometimes we want to concatenate all datasets into one big array to speed up ad-hoc queries for slices for arbitrary metadata.
This is also what CELLxGENE does to create Census: a number of .h5ad
files are concatenated to give rise to a single tiledbsoma
array store (CELLxGENE: scRNA-seq).
import lamindb as ln
import pandas as pd
import scanpy as sc
import tiledbsoma.io
from functools import reduce
ln.context.uid = "oJN8WmVrxI8m0000"
ln.context.track()
Show code cell output
→ connected lamindb: testuser1/test-scrna
→ notebook imports: lamindb==0.76.8 pandas==2.2.3 scanpy==1.10.3 tiledbsoma==1.14.1
→ created Transform(uid='oJN8WmVrxI8m0000') & created Run(started_at='2024-09-25 19:57:13 UTC')
Query the collection of h5ad
files that we’d like to concatenate into a single array.
collection = ln.Collection.get(name="My versioned scRNA-seq collection", version="2")
collection.describe()
Show code cell output
Collection(uid='tRuWMElJ7wedgUqM0001', version='2', is_latest=True, name='My versioned scRNA-seq collection', hash='nrEhls7ejeCFKPtKSr09Wg', visibility=1, updated_at='2024-09-25 19:56:38 UTC')
Provenance
.created_by = 'testuser1'
.transform = 'Standardize and append a batch of data'
.run = '2024-09-25 19:56:09 UTC'
Usage
.input_of_runs = '2024-09-25 19:56:49 UTC', '2024-09-25 19:57:06 UTC'
Prepare the AnnData objects¶
To concatenate the AnnData
objects into a single tiledbsoma.Experiment
, they need to have the same .var
and .obs
columns.
# load a number of AnnData objects that's small enough to fit into memory
adatas = [artifact.load() for artifact in collection.ordered_artifacts]
# compute the intersection of columns for these objects
var_columns = reduce(
pd.Index.intersection, [adata.var.columns for adata in adatas]
) # this only affects metadata columns of features (say, gene annotations)
var_raw_columns = reduce(
pd.Index.intersection, [adata.raw.var.columns for adata in adatas]
)
obs_columns = reduce(
pd.Index.intersection, [adata.obs.columns for adata in adatas]
) # this actually subsets features (dataset dimensions)
Prepare the AnnData
objects for concatenation. Prepare id fields, sanitize index
names, intersect columns, drop .obsp
, .uns
and columns that aren’t part of the intersection.
for i, adata in enumerate(adatas):
del adata.obsp # not supported by tiledbsoma
del adata.uns # not supported by tiledbsoma
adata.obs = adata.obs.filter(obs_columns) # filter columns to intersection
adata.obs["obs_id"] = (
adata.obs.index
) # prepare a column for tiledbsoma to use as an index
adata.obs["dataset"] = i
adata.obs.index.name = None
adata.var = adata.var.filter(var_columns) # filter columns to intersection
adata.var["var_id"] = adata.var.index
adata.var.index.name = None
drop_raw_var_columns = adata.raw.var.columns.difference(var_raw_columns)
adata.raw.var.drop(columns=drop_raw_var_columns, inplace=True)
adata.raw.var["var_id"] = adata.raw.var.index
adata.raw.var.index.name = None
Create the array store¶
Save the AnnData
objects in one array store referenced by an Artifact
.
soma_artifact = ln.integrations.save_tiledbsoma_experiment(
adatas,
description="tiledbsoma experiment",
measurement_name="RNA",
obs_id_name="obs_id",
var_id_name="var_id",
append_obsm_varm=True,
)
Show code cell output
<frozen abc>:119: FutureWarning: SparseDataset is deprecated and will be removed in late 2024. It has been replaced by the public classes CSRDataset and CSCDataset.
For instance checks, use `isinstance(X, (anndata.experimental.CSRDataset, anndata.experimental.CSCDataset))` instead.
For creation, use `anndata.experimental.sparse_dataset(X)` instead.
<frozen abc>:119: FutureWarning: SparseDataset is deprecated and will be removed in late 2024. It has been replaced by the public classes CSRDataset and CSCDataset.
For instance checks, use `isinstance(X, (anndata.experimental.CSRDataset, anndata.experimental.CSCDataset))` instead.
For creation, use `anndata.experimental.sparse_dataset(X)` instead.
Note
Provenance is tracked by writing the current run.uid
to tiledbsoma.Experiment.obs
as lamin_run_uid
.
If you know tiledbsoma
API, then note that save_tiledbsoma_experiment()
abstracts over both tiledbsoma.io.register_anndatas
and tiledbsoma.io.from_anndata
.
Query the array store¶
Here we query the obs
from the array store.
with soma_artifact.open() as soma_store:
obs = soma_store["obs"]
var = soma_store["ms"]["RNA"]["var"]
obs_columns_store = obs.schema.names
var_columns_store = var.schema.names
obs_store_df = obs.read().concat().to_pandas()
display(obs_store_df)
Show code cell output
soma_joinid | cell_type | obs_id | dataset | lamin_run_uid | |
---|---|---|---|---|---|
0 | 0 | dendritic cell | GCAGGGCTGGATTC-1 | 0 | oB4ca6lumZMEtU0hBUQI |
1 | 1 | B cell, CD19-positive | CTTTAGTGGTTACG-6 | 0 | oB4ca6lumZMEtU0hBUQI |
2 | 2 | dendritic cell | TGACTGGAACCATG-7 | 0 | oB4ca6lumZMEtU0hBUQI |
3 | 3 | B cell, CD19-positive | TCAATCACCCTTCG-8 | 0 | oB4ca6lumZMEtU0hBUQI |
4 | 4 | effector memory CD4-positive, alpha-beta T cel... | CGTTATACAGTACC-8 | 0 | oB4ca6lumZMEtU0hBUQI |
... | ... | ... | ... | ... | ... |
1713 | 1713 | naive thymus-derived CD4-positive, alpha-beta ... | Pan_T7991594_CTCACACTCCAGGGCT | 1 | oB4ca6lumZMEtU0hBUQI |
1714 | 1714 | naive thymus-derived CD4-positive, alpha-beta ... | Pan_T7980358_CGAGCACAGAAGATTC | 1 | oB4ca6lumZMEtU0hBUQI |
1715 | 1715 | naive thymus-derived CD4-positive, alpha-beta ... | CZINY-0064_AGACCATCACGCTGCA | 1 | oB4ca6lumZMEtU0hBUQI |
1716 | 1716 | CD8-positive, alpha-beta memory T cell | CZINY-0050_TCGATTTAGATGTTGA | 1 | oB4ca6lumZMEtU0hBUQI |
1717 | 1717 | naive thymus-derived CD4-positive, alpha-beta ... | CZINY-0064_AGTGTTGTCCGAGCTG | 1 | oB4ca6lumZMEtU0hBUQI |
1718 rows × 5 columns
Append to the array store¶
Prepare a new AnnData
object to be appended to the store.
ln.core.datasets.anndata_with_obs().write_h5ad("adata_to_append.h5ad")
!lamin save adata_to_append.h5ad --description "adata to append"
Show code cell output
→ connected lamindb: testuser1/test-scrna
! no run & transform got linked, call `ln.context.track()` & re-run
→ saved: Artifact(uid='Qhfq07ijDFXWNFnp0000', is_latest=True, description='adata to append', suffix='.h5ad', size=46992, hash='IJORtcQUSS11QBqD-nTD0A', _hash_type='md5', _accessor='AnnData', visibility=1, _key_is_virtual=True, storage_id=1, created_by_id=1, updated_at='2024-09-25 19:57:24 UTC')
→ storage path: /home/runner/work/lamin-usecases/lamin-usecases/docs/test-scrna/.lamindb/Qhfq07ijDFXWNFnp0000.h5ad
adata = ln.Artifact.filter(description="adata to append").one().load()
adata.obs_names_make_unique()
adata.var_names_make_unique()
adata.obs["obs_id"] = adata.obs.index
adata.var["var_id"] = adata.var.index
adata.obs["dataset"] = obs_store_df["dataset"].max()
obs_columns_same = [
obs_col for obs_col in adata.obs.columns if obs_col in obs_columns_store
]
adata.obs = adata.obs[obs_columns_same]
var_columns_same = [
var_col for var_col in adata.var.columns if var_col in var_columns_store
]
adata.var = adata.var[var_columns_same]
adata.write_h5ad("adata_to_append.h5ad")
Show code cell output
/opt/hostedtoolcache/Python/3.11.10/x64/lib/python3.11/site-packages/anndata/_core/anndata.py:1756: UserWarning: Variable names are not unique. To make them unique, call `.var_names_make_unique`.
utils.warn_names_duplicates("var")
Append the AnnData
object from disk by revising soma_artifact
.
soma_artifact = ln.integrations.save_tiledbsoma_experiment(
["adata_to_append.h5ad"],
revises=soma_artifact,
measurement_name="RNA",
obs_id_name="obs_id",
var_id_name="var_id",
)
Show code cell output
<frozen abc>:119: FutureWarning: SparseDataset is deprecated and will be removed in late 2024. It has been replaced by the public classes CSRDataset and CSCDataset.
For instance checks, use `isinstance(X, (anndata.experimental.CSRDataset, anndata.experimental.CSCDataset))` instead.
For creation, use `anndata.experimental.sparse_dataset(X)` instead.
Update the array store¶
Add a new embedding to the existing array store.
# read the data matrix
with soma_artifact.open() as soma_store:
ms_rna = soma_store["ms"]["RNA"]
n_obs = len(soma_store["obs"])
n_var = len(ms_rna["var"])
X = ms_rna["X"]["data"].read().coos((n_obs, n_var)).concat().to_scipy()
# calculate PCA embedding from the queried `X`
pca_array = sc.pp.pca(X, n_comps=2)
Open the array store in write mode and add PCA.
with soma_artifact.open(mode="w") as soma_store:
tiledbsoma.io.add_matrix_to_collection(
exp=soma_store,
measurement_name="RNA",
collection_name="obsm",
matrix_name="pca",
matrix_data=pca_array,
)
Show code cell output
<frozen abc>:119: FutureWarning: SparseDataset is deprecated and will be removed in late 2024. It has been replaced by the public classes CSRDataset and CSCDataset.
For instance checks, use `isinstance(X, (anndata.experimental.CSRDataset, anndata.experimental.CSCDataset))` instead.
For creation, use `anndata.experimental.sparse_dataset(X)` instead.
See array store mutations¶
During the append-to and update operations, the data in the array store was changed. LaminDB automatically tracks these revisions recording the number of objects, hashes, and provenance.
soma_artifact.versions.df()
Show code cell output
uid | version | is_latest | description | key | suffix | type | size | hash | n_objects | n_observations | _hash_type | _accessor | visibility | _key_is_virtual | storage_id | transform_id | run_id | created_by_id | updated_at | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
id | ||||||||||||||||||||
4 | CHHL3PfDui9ycvi30000 | None | False | tiledbsoma experiment | None | .tiledbsoma | None | 15056696 | P52V2JJF0FmEOZdtf6n_Qw | 149 | 1718.0 | md5-d | tiledbsoma | 1 | True | 1 | 6 | 6 | 1 | 2024-09-25 19:57:25.023843+00:00 |
6 | CHHL3PfDui9ycvi30001 | None | False | tiledbsoma experiment | None | .tiledbsoma | None | 15068506 | WWlRTRsdp9zeGupTqaO_0A | 173 | 1758.0 | md5-d | tiledbsoma | 1 | True | 1 | 6 | 6 | 1 | 2024-09-25 19:57:26.076378+00:00 |
7 | CHHL3PfDui9ycvi30002 | None | True | tiledbsoma experiment | None | .tiledbsoma | None | 15089310 | ZsaZKYbfU3gqn1FPe_YQZw | 182 | NaN | md5-d | None | 1 | True | 1 | 6 | 6 | 1 | 2024-09-25 19:57:26.077182+00:00 |
View lineage of the array store¶
Check the generating flow of the array store.
soma_artifact.view_lineage()
Note
For the underlying API, see the tiledbsoma documentation.