Cohorts

A Cohort is the single source of truth for a multi-slide experiment. You build it from a directory of whole-slide images (WSIs) or a manifest CSV and run a model over it; the outputs are organized into a reproducible workspace. Cohorts can also hold bulk RNA for multimodal prediction (see Spatial transcriptomics).

Runnable notebook

m-jumpstart includes a cohort batch-processing example.

1. Build a cohort

From a directory:

from bioptimus.data.cohort import Cohort

cohort = Cohort.from_directories(wsi_dir="/data/wsi/tcga_mini_coad")

From a CSV manifest:

from bioptimus.data.cohort import Cohort

# One row per slide. Required column: `wsi_id` (filename stem or full name).
# Optional: `patient_id`, `bulk_rna_id`, `timepoint`, `wsi_path`, `bulk_rna_path`.
# Pass `wsi_dir=` / `bulk_rna_dir=` to resolve paths from IDs, or
# `columns={...}` to map custom column names.
cohort = Cohort.from_csv("/data/cohort.csv", wsi_dir="/data/wsi/tcga_mini_coad")

In either case, inspect the result:

print(cohort.summary())
print(cohort.wsi_ids)
print(cohort[0].available_modalities)   # e.g. ['image']
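A manifest with the columns described above can be generated from a slide directory using only the standard library. The sketch below is an illustration and not part of the SDK: the `write_manifest` helper and the list of slide extensions are assumptions to adapt to your own archive, and it fills only the `wsi_id` and `wsi_path` columns.

```python
import csv
from pathlib import Path

# Assumed slide extensions; adjust to the formats in your archive.
SLIDE_SUFFIXES = {".svs", ".tif", ".tiff", ".ndpi"}


def write_manifest(wsi_dir, manifest_path):
    """Write one row per slide (`wsi_id`, `wsi_path`) and return the row count."""
    # Collect slide files first, sorted so the manifest is reproducible.
    slides = sorted(
        p for p in Path(wsi_dir).iterdir()
        if p.is_file() and p.suffix.lower() in SLIDE_SUFFIXES
    )
    with open(manifest_path, "w", newline="") as fh:
        writer = csv.DictWriter(fh, fieldnames=["wsi_id", "wsi_path"])
        writer.writeheader()
        for slide in slides:
            # `wsi_id` is the filename stem, as the manifest format allows.
            writer.writerow({"wsi_id": slide.stem, "wsi_path": str(slide)})
    return len(slides)
```

The resulting file has the required `wsi_id` column, so it should be usable with `Cohort.from_csv`, assuming the SDK accepts absolute paths in `wsi_path`.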

2. Run a model over the cohort

Create one Inference for the cohort, then run it. Tissue masks are cached and runs resume automatically, so re-running only processes what’s missing. Pick your backend below (only the common dict differs), then your model.

On-premise:

common = dict(
    api_url="http://localhost:8080",
    tissue=True, mask_threshold=0.5,
    output_path="/data/output", experiment="tcga_coad", run=1,
    workers=5,
)

AWS SageMaker:

common = dict(
    backend="aws", endpoint_name="bioptimus-prod", region_name="us-east-1",
    tissue=True, mask_threshold=0.5,
    output_path="/data/output", experiment="tcga_coad", run=1,
    workers=1,   # ml.g5.xlarge has a single GPU; concurrent requests can OOM
)
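Because only a few keys differ between the two backends, both dicts can be derived from one shared base. The sketch below uses only the keys shown above; the `make_common` helper and its `"onprem"` / `"aws"` labels are names invented for illustration, not SDK options.

```python
# Settings shared by both backends, copied from the examples above.
SHARED = dict(
    tissue=True, mask_threshold=0.5,
    output_path="/data/output", experiment="tcga_coad", run=1,
)


def make_common(backend):
    """Return the `common` kwargs for "onprem" or "aws" (hypothetical labels)."""
    if backend == "onprem":
        specific = dict(api_url="http://localhost:8080", workers=5)
    elif backend == "aws":
        # One worker avoids concurrent requests on a single-GPU endpoint.
        specific = dict(
            backend="aws", endpoint_name="bioptimus-prod",
            region_name="us-east-1", workers=1,
        )
    else:
        raise ValueError(f"unknown backend: {backend!r}")
    # Backend-specific keys are merged over the shared ones.
    return {**SHARED, **specific}


common = make_common("onprem")
```

Keeping the shared keys in one place makes it harder for the experiment name or run number to drift between backends.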

With Models.H1 (H-Optimus):

from bioptimus.inference import Inference
from bioptimus.models.types import Models

infer = Inference(model_name=Models.H1, cohort=cohort, variant="mini", **common)
infer.tissue()              # shared masks; only computes what's missing
infer.run(mode="embed")     # H-Optimus produces embeddings
infer.report()              # status summary

With Models.M_OPTIMUS:

from bioptimus.inference import Inference
from bioptimus.models.types import Models

infer = Inference(model_name=Models.M_OPTIMUS, cohort=cohort, variant="mini", **common)
infer.tissue()              # shared masks; only computes what's missing
infer.run(mode="embed")     # embeddings
infer.run(mode="predict")   # image-only (see Spatial transcriptomics for bulk RNA)
infer.report()              # status summary

Each output is tagged with the modalities used (e.g. ["image"]).
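The resume behaviour mentioned in this section follows a common pattern: check which outputs already exist and process only the rest. The sketch below illustrates that pattern in plain Python. It is not the SDK's implementation, and the `pending_slides` helper, the one-file-per-slide layout, and the `.npy` suffix are assumptions made for the example.

```python
from pathlib import Path


def pending_slides(wsi_ids, output_dir, suffix=".npy"):
    """Return the slide IDs with no output file yet, preserving input order."""
    output_dir = Path(output_dir)
    return [
        wsi_id for wsi_id in wsi_ids
        if not (output_dir / f"{wsi_id}{suffix}").exists()
    ]
```

A check like this is cheap to run before a long job and gives a quick view of how much work a re-run would actually do.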

3. Extract & save tiles (optional)

Tile extraction is independent of inference and is useful for QC or external pipelines:

from bioptimus.extraction.wsi.tile_extraction import TileExtractor
from bioptimus.io.wsi.factory import WSI
from bioptimus.models.backbones import Backbone
from bioptimus.models.types import Models

# Placeholder paths (hypothetical); replace with your own slide and output locations.
wsi_path = "/data/wsi/tcga_mini_coad/example.svs"
tile_dir = "/data/output/tiles"
csv_dir = "/data/output/tiles_csv"

# Reuse the model's tile geometry (224×224 @ 0.5 µm/px) so tiles match inference.
tile_spec = Backbone(Models.H1, base_url="http://localhost:8080").model_spec.tile_spec
extractor = TileExtractor(tile_spec=tile_spec, mask_threshold=0.5)
with WSI(wsi_path) as reader:
    extractor.fit_extract(reader)                              # select tissue tiles
    extractor.save(tile_dir, image_format="png", workers=4)    # write tile images
    extractor.to_csv(csv_dir)                                  # write the tile table
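The tile geometry noted in the comment above implies a fixed physical footprint: 224 px at 0.5 µm/px covers 112 µm per side. The helper below works through that arithmetic, including how many native-resolution pixels the same tile spans on a slide scanned at a different resolution. `tile_footprint` is a hypothetical name, and the 0.25 µm/px scan resolution in the usage line is only an example value.

```python
def tile_footprint(tile_px, target_mpp, slide_mpp):
    """Return (side in µm, side in native-resolution pixels) for a square tile."""
    side_um = tile_px * target_mpp          # physical size covered by the tile
    side_native_px = side_um / slide_mpp    # pixels to read at the scan resolution
    return side_um, side_native_px


side_um, side_px = tile_footprint(tile_px=224, target_mpp=0.5, slide_mpp=0.25)
print(side_um, side_px)  # 112.0 448.0
```

This is a handy sanity check when comparing saved tiles against tiles produced by an external pipeline that reads at native resolution.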
