scrna6/6

Concatenate datasets to a single array store

In the previous notebooks, we’ve seen how to incrementally create a collection of scRNA-seq datasets and train models on it.

Sometimes we want to concatenate all datasets into one big array to speed up ad-hoc queries for slices for arbitrary metadata.

This is also what CELLxGENE does to create Census: a number of .h5ad files are concatenated to give rise to a single tiledbsoma array store (CELLxGENE: scRNA-seq).

import lamindb as ln
import pandas as pd
import scanpy as sc
import tiledbsoma.io
from functools import reduce

ln.context.uid = "oJN8WmVrxI8m0000"
ln.context.track()
Hide code cell output
→ connected lamindb: testuser1/test-scrna
→ notebook imports: lamindb==0.76.8 pandas==2.2.3 scanpy==1.10.3 tiledbsoma==1.14.1
→ created Transform(uid='oJN8WmVrxI8m0000') & created Run(started_at='2024-09-25 19:57:13 UTC')

Query the collection of h5ad files that we’d like to concatenate into a single array.

collection = ln.Collection.get(name="My versioned scRNA-seq collection", version="2")
collection.describe()
Hide code cell output
Collection(uid='tRuWMElJ7wedgUqM0001', version='2', is_latest=True, name='My versioned scRNA-seq collection', hash='nrEhls7ejeCFKPtKSr09Wg', visibility=1, updated_at='2024-09-25 19:56:38 UTC')
  Provenance
    .created_by = 'testuser1'
    .transform = 'Standardize and append a batch of data'
    .run = '2024-09-25 19:56:09 UTC'
  Usage
    .input_of_runs = '2024-09-25 19:56:49 UTC', '2024-09-25 19:57:06 UTC'

Prepare the AnnData objects

To concatenate the AnnData objects into a single tiledbsoma.Experiment, they need to have the same .var and .obs columns.

# load a number of AnnData objects that's small enough to fit into memory
adatas = [artifact.load() for artifact in collection.ordered_artifacts]

# compute the intersection of columns for these objects
var_columns = reduce(
    pd.Index.intersection, [adata.var.columns for adata in adatas]
)  # this only affects metadata columns of features (say, gene annotations)
var_raw_columns = reduce(
    pd.Index.intersection, [adata.raw.var.columns for adata in adatas]
)
obs_columns = reduce(
    pd.Index.intersection, [adata.obs.columns for adata in adatas]
)  # this actually subsets features (dataset dimensions)

Prepare the AnnData objects for concatenation. Prepare id fields, sanitize index names, intersect columns, drop .obsp, .uns and columns that aren’t part of the intersection.

for i, adata in enumerate(adatas):
    del adata.obsp  # not supported by tiledbsoma
    del adata.uns  # not supported by tiledbsoma

    adata.obs = adata.obs.filter(obs_columns)  # filter columns to intersection
    adata.obs["obs_id"] = (
        adata.obs.index
    )  # prepare a column for tiledbsoma to use as an index
    adata.obs["dataset"] = i
    adata.obs.index.name = None

    adata.var = adata.var.filter(var_columns)  # filter columns to intersection
    adata.var["var_id"] = adata.var.index
    adata.var.index.name = None

    drop_raw_var_columns = adata.raw.var.columns.difference(var_raw_columns)
    adata.raw.var.drop(columns=drop_raw_var_columns, inplace=True)
    adata.raw.var["var_id"] = adata.raw.var.index
    adata.raw.var.index.name = None

Create the array store

Save the AnnData objects in one array store referenced by an Artifact.

soma_artifact = ln.integrations.save_tiledbsoma_experiment(
    adatas,
    description="tiledbsoma experiment",
    measurement_name="RNA",
    obs_id_name="obs_id",
    var_id_name="var_id",
    append_obsm_varm=True,
)
Hide code cell output
<frozen abc>:119: FutureWarning: SparseDataset is deprecated and will be removed in late 2024. It has been replaced by the public classes CSRDataset and CSCDataset.

For instance checks, use `isinstance(X, (anndata.experimental.CSRDataset, anndata.experimental.CSCDataset))` instead.

For creation, use `anndata.experimental.sparse_dataset(X)` instead.

<frozen abc>:119: FutureWarning: SparseDataset is deprecated and will be removed in late 2024. It has been replaced by the public classes CSRDataset and CSCDataset.

For instance checks, use `isinstance(X, (anndata.experimental.CSRDataset, anndata.experimental.CSCDataset))` instead.

For creation, use `anndata.experimental.sparse_dataset(X)` instead.

Note

Provenance is tracked by writing the current run.uid to tiledbsoma.Experiment.obs as lamin_run_uid.

If you know tiledbsoma API, then note that save_tiledbsoma_experiment() abstracts over both tiledbsoma.io.register_anndatas and tiledbsoma.io.from_anndata.

Query the array store

Here we query the obs from the array store.

with soma_artifact.open() as soma_store:
    obs = soma_store["obs"]
    var = soma_store["ms"]["RNA"]["var"]

    obs_columns_store = obs.schema.names
    var_columns_store = var.schema.names

    obs_store_df = obs.read().concat().to_pandas()

    display(obs_store_df)
Hide code cell output
soma_joinid cell_type obs_id dataset lamin_run_uid
0 0 dendritic cell GCAGGGCTGGATTC-1 0 oB4ca6lumZMEtU0hBUQI
1 1 B cell, CD19-positive CTTTAGTGGTTACG-6 0 oB4ca6lumZMEtU0hBUQI
2 2 dendritic cell TGACTGGAACCATG-7 0 oB4ca6lumZMEtU0hBUQI
3 3 B cell, CD19-positive TCAATCACCCTTCG-8 0 oB4ca6lumZMEtU0hBUQI
4 4 effector memory CD4-positive, alpha-beta T cel... CGTTATACAGTACC-8 0 oB4ca6lumZMEtU0hBUQI
... ... ... ... ... ...
1713 1713 naive thymus-derived CD4-positive, alpha-beta ... Pan_T7991594_CTCACACTCCAGGGCT 1 oB4ca6lumZMEtU0hBUQI
1714 1714 naive thymus-derived CD4-positive, alpha-beta ... Pan_T7980358_CGAGCACAGAAGATTC 1 oB4ca6lumZMEtU0hBUQI
1715 1715 naive thymus-derived CD4-positive, alpha-beta ... CZINY-0064_AGACCATCACGCTGCA 1 oB4ca6lumZMEtU0hBUQI
1716 1716 CD8-positive, alpha-beta memory T cell CZINY-0050_TCGATTTAGATGTTGA 1 oB4ca6lumZMEtU0hBUQI
1717 1717 naive thymus-derived CD4-positive, alpha-beta ... CZINY-0064_AGTGTTGTCCGAGCTG 1 oB4ca6lumZMEtU0hBUQI

1718 rows × 5 columns

Append to the array store

Prepare a new AnnData object to be appended to the store.

ln.core.datasets.anndata_with_obs().write_h5ad("adata_to_append.h5ad")
!lamin save adata_to_append.h5ad --description "adata to append"
Hide code cell output
→ connected lamindb: testuser1/test-scrna
! no run & transform got linked, call `ln.context.track()` & re-run
→ saved: Artifact(uid='Qhfq07ijDFXWNFnp0000', is_latest=True, description='adata to append', suffix='.h5ad', size=46992, hash='IJORtcQUSS11QBqD-nTD0A', _hash_type='md5', _accessor='AnnData', visibility=1, _key_is_virtual=True, storage_id=1, created_by_id=1, updated_at='2024-09-25 19:57:24 UTC')
→ storage path: /home/runner/work/lamin-usecases/lamin-usecases/docs/test-scrna/.lamindb/Qhfq07ijDFXWNFnp0000.h5ad
adata = ln.Artifact.filter(description="adata to append").one().load()

adata.obs_names_make_unique()
adata.var_names_make_unique()

adata.obs["obs_id"] = adata.obs.index
adata.var["var_id"] = adata.var.index

adata.obs["dataset"] = obs_store_df["dataset"].max()

obs_columns_same = [
    obs_col for obs_col in adata.obs.columns if obs_col in obs_columns_store
]
adata.obs = adata.obs[obs_columns_same]

var_columns_same = [
    var_col for var_col in adata.var.columns if var_col in var_columns_store
]
adata.var = adata.var[var_columns_same]

adata.write_h5ad("adata_to_append.h5ad")
Hide code cell output
/opt/hostedtoolcache/Python/3.11.10/x64/lib/python3.11/site-packages/anndata/_core/anndata.py:1756: UserWarning: Variable names are not unique. To make them unique, call `.var_names_make_unique`.
  utils.warn_names_duplicates("var")

Append the AnnData object from disk by revising soma_artifact.

soma_artifact = ln.integrations.save_tiledbsoma_experiment(
    ["adata_to_append.h5ad"],
    revises=soma_artifact,
    measurement_name="RNA",
    obs_id_name="obs_id",
    var_id_name="var_id",
)
Hide code cell output
<frozen abc>:119: FutureWarning: SparseDataset is deprecated and will be removed in late 2024. It has been replaced by the public classes CSRDataset and CSCDataset.

For instance checks, use `isinstance(X, (anndata.experimental.CSRDataset, anndata.experimental.CSCDataset))` instead.

For creation, use `anndata.experimental.sparse_dataset(X)` instead.

Update the array store

Add a new embedding to the existing array store.

# read the data matrix
with soma_artifact.open() as soma_store:
    ms_rna = soma_store["ms"]["RNA"]
    n_obs = len(soma_store["obs"])
    n_var = len(ms_rna["var"])
    X = ms_rna["X"]["data"].read().coos((n_obs, n_var)).concat().to_scipy()

# calculate PCA embedding from the queried `X`
pca_array = sc.pp.pca(X, n_comps=2)

Open the array store in write mode and add PCA.

with soma_artifact.open(mode="w") as soma_store:
    tiledbsoma.io.add_matrix_to_collection(
        exp=soma_store,
        measurement_name="RNA",
        collection_name="obsm",
        matrix_name="pca",
        matrix_data=pca_array,
    )
Hide code cell output
<frozen abc>:119: FutureWarning: SparseDataset is deprecated and will be removed in late 2024. It has been replaced by the public classes CSRDataset and CSCDataset.

For instance checks, use `isinstance(X, (anndata.experimental.CSRDataset, anndata.experimental.CSCDataset))` instead.

For creation, use `anndata.experimental.sparse_dataset(X)` instead.

See array store mutations

During the append-to and update operations, the data in the array store was changed. LaminDB automatically tracks these revisions recording the number of objects, hashes, and provenance.

soma_artifact.versions.df()
Hide code cell output
uid version is_latest description key suffix type size hash n_objects n_observations _hash_type _accessor visibility _key_is_virtual storage_id transform_id run_id created_by_id updated_at
id
4 CHHL3PfDui9ycvi30000 None False tiledbsoma experiment None .tiledbsoma None 15056696 P52V2JJF0FmEOZdtf6n_Qw 149 1718.0 md5-d tiledbsoma 1 True 1 6 6 1 2024-09-25 19:57:25.023843+00:00
6 CHHL3PfDui9ycvi30001 None False tiledbsoma experiment None .tiledbsoma None 15068506 WWlRTRsdp9zeGupTqaO_0A 173 1758.0 md5-d tiledbsoma 1 True 1 6 6 1 2024-09-25 19:57:26.076378+00:00
7 CHHL3PfDui9ycvi30002 None True tiledbsoma experiment None .tiledbsoma None 15089310 ZsaZKYbfU3gqn1FPe_YQZw 182 NaN md5-d None 1 True 1 6 6 1 2024-09-25 19:57:26.077182+00:00

View lineage of the array store

Check the generating flow of the array store.

soma_artifact.view_lineage()
_images/e54bf74e6e1ddb2b3f5d37f94784b180bd58b241073f5340a11433f21f8c9d79.svg

Note

For the underlying API, see the tiledbsoma documentation.