Bulk RNA-seq¶

Note

More comprehensive examples are provided for these data types:

# !pip install 'lamindb[jupyter,bionty]'
!lamin init --storage test-bulkrna --schema bionty

import lamindb as ln
import bionty as bt
import pandas as pd
import anndata as ad
from pathlib import Path

Ingest data¶

Access ¶

We start by simulating an nf-core RNA-seq run, which yields a count matrix artifact.

(See Nextflow for running this with Nextflow.)

# pretend we're running a bulk RNA-seq pipeline
ln.track(
    transform=ln.Transform(name="nf-core RNA-seq", reference="https://nf-co.re/rnaseq")
)
# create a directory for its output
Path("./test-bulkrna/output_dir").mkdir(exist_ok=True)
# get the count matrix
path = ln.core.datasets.file_tsv_rnaseq_nfcore_salmon_merged_gene_counts(
    populate_registries=True
)
# move it into the output directory
path = path.rename(f"./test-bulkrna/output_dir/{path.name}")
# register it
ln.Artifact(path, description="Merged Bulk RNA counts").save()

Transform ¶

ln.context.uid = "s5V0dNMVwL9i0000"
ln.context.track()

Let’s query the artifact:

artifact = ln.Artifact.get(description="Merged Bulk RNA counts")

df = artifact.load()

Looking at it, we see that it deviates from the tidy data standard [Wickham14], from the conventions of statistics & machine learning [Hastie09, Murphy12], and from the major Python & R data packages.

Variables are not in columns and observations are not in rows:

df


	gene_id	gene_name	RAP1_IAA_30M_REP1	RAP1_UNINDUCED_REP1	RAP1_UNINDUCED_REP2	WT_REP1	WT_REP2
0	Gfp_transgene_gene	Gfp_transgene_gene	0.0	0.000	0.0	0.0	0.0
1	HRA1	HRA1	0.0	8.572	0.0	0.0	0.0
2	snR18	snR18	3.0	8.000	4.0	8.0	8.0
3	tA(UGC)A	TGA1	0.0	0.000	0.0	0.0	0.0
4	tL(CAA)A	SUP56	0.0	0.000	0.0	0.0	0.0
...	...	...	...	...	...	...	...
120	YAR064W	YAR064W	0.0	2.000	0.0	0.0	0.0
121	YAR066W	YAR066W	3.0	13.000	8.0	5.0	11.0
122	YAR068W	YAR068W	9.0	28.000	24.0	5.0	7.0
123	YAR069C	YAR069C	0.0	0.000	0.0	0.0	1.0
124	YAR070C	YAR070C	0.0	0.000	0.0	0.0	0.0

125 rows × 7 columns

Let’s change that and move observations into rows:

df = df.T
df


	0	1	2	3	4	5	6	7	8	9	...	115	116	117	118	119	120	121	122	123	124
gene_id	Gfp_transgene_gene	HRA1	snR18	tA(UGC)A	tL(CAA)A	tP(UGG)A	tS(AGA)A	YAL001C	YAL002W	YAL003W	...	YAR050W	YAR053W	YAR060C	YAR061W	YAR062W	YAR064W	YAR066W	YAR068W	YAR069C	YAR070C
gene_name	Gfp_transgene_gene	HRA1	snR18	TGA1	SUP56	TRN1	tS(AGA)A	TFC3	VPS8	EFB1	...	FLO1	YAR053W	YAR060C	YAR061W	YAR062W	YAR064W	YAR066W	YAR068W	YAR069C	YAR070C
RAP1_IAA_30M_REP1	0.0	0.0	3.0	0.0	0.0	0.0	1.0	55.0	36.0	632.0	...	4.357	0.0	1.0	0.0	1.0	0.0	3.0	9.0	0.0	0.0
RAP1_UNINDUCED_REP1	0.0	8.572	8.0	0.0	0.0	0.0	0.0	72.0	33.0	810.0	...	15.72	0.0	0.0	0.0	3.0	2.0	13.0	28.0	0.0	0.0
RAP1_UNINDUCED_REP2	0.0	0.0	4.0	0.0	0.0	0.0	0.0	115.0	82.0	1693.0	...	13.772	0.0	4.0	0.0	2.0	0.0	8.0	24.0	0.0	0.0
WT_REP1	0.0	0.0	8.0	0.0	0.0	1.0	0.0	60.0	63.0	1115.0	...	13.465	0.0	0.0	0.0	1.0	0.0	5.0	5.0	0.0	0.0
WT_REP2	0.0	0.0	8.0	0.0	0.0	0.0	0.0	30.0	25.0	704.0	...	6.891	0.0	1.0	0.0	0.0	0.0	11.0	7.0	1.0	0.0

7 rows × 125 columns

Now it's clear that the first two rows are not observations but descriptions of the variables (or features) themselves.
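To see why the transposed table needs further work, here is a minimal sketch on a made-up three-gene table. The gene IDs, gene names, and the sample labels `S1`/`S2` are hypothetical and not part of the dataset above. Transposing a frame that mixes text and numeric columns produces rows that mix annotations with measurements, and none of the resulting columns is numeric anymore:

```python
import pandas as pd

# Toy stand-in for the merged count table: one row per gene (hypothetical values)
toy = pd.DataFrame(
    {
        "gene_id": ["g1", "g2", "g3"],
        "gene_name": ["A", "B", "C"],
        "S1": [0.0, 1.5, 3.0],
        "S2": [2.0, 0.0, 8.0],
    }
)

toy_t = toy.T  # samples move into rows, but so do the two annotation columns

# The first two rows describe the genes; only the remaining rows are observations
annotation_rows = list(toy_t.index[:2])
observation_rows = list(toy_t.index[2:])

# Each transposed column now mixes text and numbers, so none of them is numeric
all_non_numeric = all(not pd.api.types.is_numeric_dtype(dt) for dt in toy_t.dtypes)

# Dropping the annotation rows allows casting the measurements back to floats
toy_counts = toy_t.iloc[2:].astype("float32")
print(annotation_rows, observation_rows, all_non_numeric)
print(toy_counts.dtypes.unique())
```

This is why the annotation rows have to be separated from the measurements, and the measurements cast back to a numeric type, before the matrix can be used for analysis.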

Let’s create an AnnData object to model this. First, create a dataframe for the variables:

var = pd.DataFrame({"gene_name": df.loc["gene_name"].values}, index=df.loc["gene_id"])

var.head()


	gene_name
gene_id
Gfp_transgene_gene	Gfp_transgene_gene
HRA1	HRA1
snR18	snR18
tA(UGC)A	TGA1
tL(CAA)A	SUP56

Now, let’s create an AnnData object:

# we're also fixing the datatype here, which was string in the tsv
adata = ad.AnnData(df.iloc[2:].astype("float32"), var=var)
adata

The AnnData object is in tidy form and complies with conventions of statistics and machine learning:

adata.to_df()


gene_id	HRA1	snR18	tP(UGG)A	tS(AGA)A	YAL001C	YAL002W	YAL003W	...	YAR050W	YAR060C	YAR062W	YAR064W	YAR066W	YAR068W	YAR069C
RAP1_IAA_30M_REP1	0.000	3.0	0.0	1.0	55.0	36.0	632.0	...	4.357	1.0	1.0	0.0	3.0	9.0	0.0
RAP1_UNINDUCED_REP1	8.572	8.0	0.0	0.0	72.0	33.0	810.0	...	15.720	0.0	3.0	2.0	13.0	28.0	0.0
RAP1_UNINDUCED_REP2	0.000	4.0	0.0	0.0	115.0	82.0	1693.0	...	13.772	4.0	2.0	0.0	8.0	24.0	0.0
WT_REP1	0.000	8.0	1.0	0.0	60.0	63.0	1115.0	...	13.465	0.0	1.0	0.0	5.0	5.0	0.0
WT_REP2	0.000	8.0	0.0	0.0	30.0	25.0	704.0	...	6.891	1.0	0.0	0.0	11.0	7.0	1.0

5 rows × 125 columns
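A few sanity checks can make the "tidy" claim more concrete. The sketch below uses plain pandas on a hypothetical two-sample, three-gene matrix (not the data above). The helper `looks_tidy` is an illustrative name, not part of any library, and the checks are necessary rather than sufficient conditions: numeric values, unique row and column labels, and a variable table whose index matches the matrix columns.

```python
import pandas as pd

# Hypothetical tidy count matrix: samples in rows, genes in columns
counts = pd.DataFrame(
    [[0.0, 1.5, 3.0], [2.0, 0.0, 8.0]],
    index=["S1", "S2"],
    columns=["g1", "g2", "g3"],
    dtype="float32",
)
# Hypothetical variable annotations, indexed by the same gene IDs
var_toy = pd.DataFrame({"gene_name": ["A", "B", "C"]}, index=["g1", "g2", "g3"])


def looks_tidy(matrix: pd.DataFrame, var: pd.DataFrame) -> bool:
    """Return True if the matrix is numeric, uniquely labeled, and aligned with var."""
    numeric = all(pd.api.types.is_numeric_dtype(dt) for dt in matrix.dtypes)
    unique_labels = matrix.index.is_unique and matrix.columns.is_unique
    aligned = matrix.columns.equals(var.index)
    return bool(numeric and unique_labels and aligned)


print(looks_tidy(counts, var_toy))
```

A variable table in a different order, or a matrix that still holds text values, would fail these checks.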

Validate ¶

Let’s create an Artifact object from this AnnData.

Almost all gene IDs are validated:

genes = bt.Gene.from_values(
    adata.var.index,
    bt.Gene.stable_id,
    organism="saccharomyces cerevisiae",  # or set globally with bt.settings.organism
)

# also register the 2 non-validated genes obtained from Bionty
ln.save(genes)
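Conceptually, validating identifiers means checking them against a reference vocabulary. The following is a rough, pandas-only illustration of that idea, not how Bionty implements it. The `reference_ids` set and the identifier `my_custom_gene` are made up for the example.

```python
import pandas as pd

# Hypothetical reference vocabulary; a real gene registry would hold many more IDs
reference_ids = {"YAL001C", "YAL002W", "YAL003W"}

# Identifiers as they might appear in a variable index (the last one is unknown)
var_index = pd.Index(["YAL001C", "YAL003W", "my_custom_gene"])

# Boolean mask: True where the identifier is found in the reference
is_validated = var_index.isin(list(reference_ids))

validated = var_index[is_validated].tolist()
not_validated = var_index[~is_validated].tolist()
print(validated, not_validated)
```

Identifiers that fail such a check are the ones that need a decision: correct them, or register them as new records.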

Register ¶

efs = bt.ExperimentalFactor.lookup()
organism = bt.Organism.lookup()
features = ln.Feature.lookup()

curated_file = ln.Artifact.from_anndata(adata, description="Curated bulk RNA counts")

Let’s save this artifact:

curated_file.save()

Link to validated metadata records:

curated_file.features._add_set_from_anndata(
    var_field=bt.Gene.stable_id, organism="saccharomyces cerevisiae"
)

curated_file.labels.add(efs.rna_seq, features.assay)
curated_file.labels.add(organism.saccharomyces_cerevisiae, features.organism)

curated_file.describe()

Query data¶

We now have two artifacts in the artifact registry:

ln.Artifact.df()


	uid	version	is_latest	description	key	suffix	type	size	hash	n_objects	n_observations	_hash_type	_accessor	visibility	_key_is_virtual	storage_id	transform_id	run_id	created_by_id	updated_at
id
2	mjb4nVHgZSwhSPwA0000	None	True	Curated bulk RNA counts	None	.h5ad	dataset	28180	6bieh8XjOCCz6bJToN4u1g	None	None	md5	AnnData	1	True	1	2	2	1	2024-09-25 19:58:05.202526+00:00
1	vg4zgeWnRyIqz9UJ0000	None	True	Merged Bulk RNA counts	output_dir/salmon.merged.gene_counts.tsv	.tsv	None	3787	xxw0k3au3KtxFcgtbEr4eQ	None	None	md5	None	1	False	1	1	1	1	2024-09-25 19:58:01.746343+00:00

curated_file.view_lineage()

[Image: lineage graph produced by curated_file.view_lineage()]

Let’s query by gene:

genes = bt.Gene.lookup()

genes.spo7

# a gene set containing SPO7
feature_set = ln.FeatureSet.filter(genes=genes.spo7).first()

# artifacts that link to this feature set
ln.Artifact.filter(feature_sets=feature_set).df()


	uid	version	is_latest	description	key	suffix	type	size	hash	n_objects	n_observations	_hash_type	_accessor	visibility	_key_is_virtual	storage_id	transform_id	run_id	created_by_id	updated_at
id
2	mjb4nVHgZSwhSPwA0000	None	True	Curated bulk RNA counts	None	.h5ad	dataset	28180	6bieh8XjOCCz6bJToN4u1g	None	None	md5	AnnData	1	True	1	2	2	1	2024-09-25 19:58:05.202526+00:00
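The query above goes through two steps: from a gene to the feature sets that contain it, and from those feature sets to the artifacts that link to them. As a rough mental model only, the same indirection can be sketched with plain dictionaries. All names below (`feature_sets`, `artifact_to_feature_sets`, `artifacts_measuring`, and the keys) are hypothetical and say nothing about how the registries are stored.

```python
# Hypothetical in-memory stand-ins for the registries (names and keys are made up)
feature_sets = {
    "fs1": {"SPO7", "TFC3", "VPS8"},
    "fs2": {"EFB1"},
}
artifact_to_feature_sets = {
    "curated counts": ["fs1"],
    "other artifact": ["fs2"],
}


def artifacts_measuring(gene: str) -> list[str]:
    """Return artifacts linked to at least one feature set containing the gene."""
    # Step 1: find the feature sets that contain the gene
    matching_sets = {name for name, genes in feature_sets.items() if gene in genes}
    # Step 2: find the artifacts linked to any of those feature sets
    return [
        artifact
        for artifact, linked in artifact_to_feature_sets.items()
        if matching_sets.intersection(linked)
    ]


print(artifacts_measuring("SPO7"))
```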

# clean up test instance
!rm -r test-bulkrna
!lamin delete --force test-bulkrna