Cohort manifest for pairing WSIs with bulk RNA and clinical metadata. Provides Cohort, a typed registry that tracks the mapping between patients, WSIs, bulk RNA samples, timepoints, labels, and arbitrary clinical metadata. WSIRecord is the single source of truth for each WSI, including its per-model outputs and processing status. Construction paths:
  1. From a user-provided CSV via from_csv.
  2. Auto-matching from directories via from_directories.
  3. From a YAML manifest via load (for resume).
The cohort can be serialised to YAML via save and loaded back for full reproducibility and resume support. Example:
from bioptimus.data.cohort import Cohort

# Auto-match from directories:
cohort = Cohort.from_directories(
    wsi_dir="/data/wsis",
    bulk_rna_dir="/data/rna",
)

# Or load a saved manifest (resume):
cohort = Cohort.load(".output/brca/run_1/manifest.yaml")

# Add labels post-hoc:
cohort.add_labels({"patient_001": {"mutation": "BRAF"}})

# Iterate:
for record in cohort:
    print(record.patient_id, record.wsi_path)

ModelOutput

@dataclass
class ModelOutput()
Per-model output record for a single WSI.
model_name
Name of the model that produced this output.
embedding_path
Path to the embedding output file.
prediction_path
Path to the prediction output file.
tiles_csv_path
Path to the tile coordinates CSV.
tiles_dir
Directory containing exported tile images.
modalities
Input modalities used for this output (e.g. ["image"] or ["image", "bulk_rna"]).
timestamp
ISO 8601 string of when the output was produced.
status
Processing status ("pending", "done", "error").
error
Error message if status is "error".

is_done

@property
def is_done() -> bool
Returns True if status is "done".

has_embedding

@property
def has_embedding() -> bool
Returns True if an embedding path is set and exists.

has_prediction

@property
def has_prediction() -> bool
Returns True if a prediction path is set and exists.

set_stage_output

def set_stage_output(stage: str,
                     modalities: list[str],
                     path: Path,
                     timestamp: str | None = None) -> None
Records an output for a specific stage and modality combo. Updates both the per-modality tracking dict and the top-level convenience fields (embedding_path / prediction_path).
stage
str
required
"embed" or "predict".
modalities
list[str]
required
Modalities used for this output.
path
Path
required
Output file path.
timestamp
str | None
ISO 8601 timestamp.

get_stage_path

def get_stage_path(stage: str,
                   modalities: list[str] | None = None) -> Path | None
Returns the output path for a stage and modality combo.
stage
str
required
"embed" or "predict".
modalities
list[str] | None
Modality combination to look up. Falls back to the top-level path when None.
returns
Path | None
Resolved path, or None if not recorded.
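The interplay between set_stage_output and get_stage_path can be sketched with a small stand-in class. This is a minimal illustration, not the library's implementation: the class name StageTracker, the modality_outputs layout, and the "+"-joined key format are all assumptions made for the example.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from pathlib import Path


@dataclass
class StageTracker:
    """Hypothetical stand-in for the stage-tracking part of ModelOutput."""

    embedding_path: Path | None = None
    prediction_path: Path | None = None
    # Assumed layout: {stage: {"bulk_rna+image": path}}
    modality_outputs: dict[str, dict[str, Path]] = field(default_factory=dict)

    @staticmethod
    def _key(modalities: list[str]) -> str:
        # Sort so that ["image", "bulk_rna"] and ["bulk_rna", "image"] match.
        return "+".join(sorted(modalities))

    def set_stage_output(self, stage: str, modalities: list[str], path: Path) -> None:
        if stage not in ("embed", "predict"):
            raise ValueError(f"unknown stage: {stage!r}")
        # Per-modality tracking dict.
        self.modality_outputs.setdefault(stage, {})[self._key(modalities)] = Path(path)
        # Top-level convenience field always points at the latest output.
        if stage == "embed":
            self.embedding_path = Path(path)
        else:
            self.prediction_path = Path(path)

    def get_stage_path(self, stage: str, modalities: list[str] | None = None) -> Path | None:
        if modalities is None:
            # Fall back to the top-level path.
            return self.embedding_path if stage == "embed" else self.prediction_path
        return self.modality_outputs.get(stage, {}).get(self._key(modalities))
```

In this sketch, recording an image-only embedding and then an image-plus-RNA embedding leaves both retrievable by modality combination, while the lookup without modalities returns the most recent one.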

to_dict

def to_dict() -> dict[str, Any]
Serialises to a plain dict for YAML output.
returns
dict[str, Any]
A dictionary representation of the model output.

from_dict

@classmethod
def from_dict(cls, data: dict[str, Any]) -> ModelOutput
Constructs from a plain dict (YAML round-trip).
data
dict[str, Any]
required
Dictionary with serialised model output fields.
returns
ModelOutput
A new ModelOutput instance.
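A round-trip through to_dict and from_dict might look like the following sketch. OutputSketch is a hypothetical, reduced stand-in with only three of the documented fields; converting Path values to strings is an assumption about how the record is made YAML-friendly.

```python
from __future__ import annotations

from dataclasses import dataclass, fields
from pathlib import Path
from typing import Any


@dataclass
class OutputSketch:
    """Hypothetical, reduced stand-in for ModelOutput."""

    model_name: str
    embedding_path: Path | None = None
    status: str = "pending"

    def to_dict(self) -> dict[str, Any]:
        # Convert Path objects to strings so the dict holds only plain types.
        out: dict[str, Any] = {}
        for f in fields(self):
            value = getattr(self, f.name)
            out[f.name] = str(value) if isinstance(value, Path) else value
        return out

    @classmethod
    def from_dict(cls, data: dict[str, Any]) -> "OutputSketch":
        data = dict(data)  # do not mutate the caller's dict
        if data.get("embedding_path") is not None:
            data["embedding_path"] = Path(data["embedding_path"])
        return cls(**data)


original = OutputSketch("model_a", Path("out/slide_1.h5"), "done")
restored = OutputSketch.from_dict(original.to_dict())
```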

WSIRecord

@dataclass
class WSIRecord()
Single WSI entry — the single source of truth. Tracks inputs (WSI, RNA, mask), per-model outputs, and arbitrary annotations for one WSI within a cohort.
patient_id
Unique patient identifier.
wsi_id
Unique WSI identifier (typically the filename stem).
wsi_path
Path to the WSI file.
bulk_rna_id
Identifier for the paired bulk RNA sample.
bulk_rna_path
Path to the bulk RNA CSV/TSV file.
mask_path
Path to a tissue mask (pre-computed or cached).
mask_timestamp
When the mask was computed/discovered.
timepoint
Temporal ordering label (e.g. "t0", "t1").
outputs
Per-model output records keyed by model name.
labels
Arbitrary categorical annotations.
metadata
Arbitrary clinical/treatment metadata.

get_output

def get_output(model_name: str) -> ModelOutput
Returns the output for model_name, creating it if absent.
model_name
str
required
Model identifier string.
returns
ModelOutput
The ModelOutput for the given model.

is_stage_done

def is_stage_done(model_name: str, stage: str) -> bool
Checks whether a stage output exists on disk. When modality_outputs are tracked, the check is scoped to the record’s current available_modalities so that linking new data (e.g. bulk RNA) automatically surfaces pending work without requiring force=True.
model_name
str
required
Model identifier.
stage
str
required
"embed" or "predict".
returns
bool
True if the output path exists on disk.
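The modality scoping described above can be illustrated with a standalone sketch. The free functions below are hypothetical stand-ins for the record's property and method, and the "+"-joined key format for modality_outputs is an assumption.

```python
from __future__ import annotations

import tempfile
from pathlib import Path


def available_modalities(bulk_rna_path: Path | None) -> list[str]:
    # "image" is always present; "bulk_rna" only once an RNA path is set.
    return ["image", "bulk_rna"] if bulk_rna_path is not None else ["image"]


def is_stage_done(modality_outputs: dict[str, dict[str, Path]],
                  stage: str,
                  bulk_rna_path: Path | None) -> bool:
    # Assumed key format: sorted modality names joined with "+".
    key = "+".join(sorted(available_modalities(bulk_rna_path)))
    path = modality_outputs.get(stage, {}).get(key)
    return path is not None and path.exists()


with tempfile.TemporaryDirectory() as tmp:
    image_only = Path(tmp) / "slide_1.image.h5"
    image_only.touch()
    outputs = {"embed": {"image": image_only}}

    # Image-only embedding exists, so the stage counts as done.
    done_before_linking = is_stage_done(outputs, "embed", None)
    # After RNA is linked, the image+RNA output is missing, so work is pending.
    done_after_linking = is_stage_done(outputs, "embed", Path(tmp) / "slide_1.csv")
```

Under these assumptions, linking RNA changes the lookup key, which is why the stage reports as not done without any explicit force flag.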

has_mask

@property
def has_mask() -> bool
Returns True if a mask path is set and exists on disk.

available_modalities

@property
def available_modalities() -> list[str]
Returns the input modalities available for this record. Always includes "image". Includes "bulk_rna" when bulk_rna_path is set.

to_dict

def to_dict() -> dict[str, Any]
Serialises to a plain dict for YAML output.
returns
dict[str, Any]
A dictionary representation of the WSI record.

from_dict

@classmethod
def from_dict(cls, data: dict[str, Any]) -> WSIRecord
Constructs from a plain dict (YAML round-trip).
data
dict[str, Any]
required
Dictionary with serialised WSI record fields.
returns
WSIRecord
A new WSIRecord instance.

PatientRecord

@dataclass
class PatientRecord()
Groups all WSIs belonging to a single patient.
patient_id
Unique patient identifier.
wsis
Ordered list of WSI records (by timepoint).
labels
Patient-level categorical annotations.
metadata
Patient-level clinical metadata.

Cohort

class Cohort()
Typed cohort manifest — single source of truth for an experiment. A cohort is an ordered collection of WSIRecord entries grouped by patient. It stores all information needed to reproduce and resume an inference run: file locations, pairing logic, timepoints, per-model outputs, labels, and clinical metadata. Construction: Use from_csv, from_directories, or load rather than calling the constructor directly.
records
Initial list of WSIRecord entries.

from_csv

@classmethod
def from_csv(cls,
             path: str | Path,
             *,
             wsi_dir: str | Path | None = None,
             bulk_rna_dir: str | Path | None = None,
             columns: dict[str, str] | None = None) -> Cohort
Creates a cohort from a user-provided CSV manifest. Required columns:
  • wsi_id: WSI identifier (filename stem or full name).
Optional columns:
  • patient_id: if absent, derived from wsi_id.
  • bulk_rna_id: paired RNA sample identifier.
  • timepoint: temporal label.
  • wsi_path: explicit path override.
  • bulk_rna_path: explicit path override.
  • Any other columns are treated as labels if prefixed with label_ or as metadata otherwise.
When wsi_dir or bulk_rna_dir is provided, paths are resolved by joining the directory with the corresponding ID (plus the matching extension found on disk). The columns dict maps canonical field names to the actual CSV column names. Only the keys that differ from the defaults need to be specified. For example:
Cohort.from_csv(
    "cohort.csv",
    columns={
        "wsi_id": "imageid",
        "patient_id": "subject_id",
    },
)
Recognised canonical keys: wsi_id, patient_id, bulk_rna_id, timepoint, wsi_path, bulk_rna_path.
path
str | Path
required
Path to the CSV manifest.
wsi_dir
str | Path | None
Directory to resolve WSI paths from.
bulk_rna_dir
str | Path | None
Directory to resolve bulk RNA paths from.
columns
dict[str, str] | None
Mapping of canonical field names to actual CSV column names. Unmapped fields fall back to their canonical name.
returns
Cohort
A populated Cohort.
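The column handling described for from_csv can be sketched for a single row as follows. The helper parse_row is hypothetical and not part of the library. Two behaviours are assumptions for illustration: the label_ prefix is stripped from label names, and a missing patient_id is set equal to wsi_id.

```python
from __future__ import annotations

import csv
import io

CANONICAL = ("wsi_id", "patient_id", "bulk_rna_id",
             "timepoint", "wsi_path", "bulk_rna_path")


def parse_row(row: dict[str, str], columns: dict[str, str] | None = None):
    """Split one CSV row into canonical fields, labels, and metadata."""
    # Start from the defaults and override only the keys the user mapped.
    mapping = {name: name for name in CANONICAL}
    mapping.update(columns or {})
    actual_to_canonical = {actual: canon for canon, actual in mapping.items()}

    fields: dict[str, str] = {}
    labels: dict[str, str] = {}
    metadata: dict[str, str] = {}
    for col, value in row.items():
        if col in actual_to_canonical:
            fields[actual_to_canonical[col]] = value
        elif col.startswith("label_"):
            labels[col[len("label_"):]] = value  # assumption: prefix is stripped
        else:
            metadata[col] = value

    if "wsi_id" not in fields:
        raise KeyError("required column 'wsi_id' is missing")
    fields.setdefault("patient_id", fields["wsi_id"])  # assumption: identity
    return fields, labels, metadata


text = "imageid,subject_id,label_grade,age\nslide_1,p1,high,61\n"
row = next(csv.DictReader(io.StringIO(text)))
fields, labels, metadata = parse_row(
    row, columns={"wsi_id": "imageid", "patient_id": "subject_id"}
)
```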

from_directories

@classmethod
def from_directories(
        cls,
        wsi_dir: str | Path,
        bulk_rna_dir: str | Path | None = None,
        *,
        mask_dir: str | Path | None = None,
        patient_id_fn: Callable[[str], str] | None = None) -> Cohort
Creates a cohort by auto-matching files from directories. WSI and bulk RNA files are paired by exact filename stem equality (case-sensitive). Patient IDs default to the shared stem. When multiple WSIs share the same patient ID, they are assigned sequential timepoints (t0, t1, …) in alphabetical order.
wsi_dir
str | Path
required
Directory containing WSI files.
bulk_rna_dir
str | Path | None
Directory containing bulk RNA files. When None, WSIs are registered without RNA.
mask_dir
str | Path | None
Directory containing pre-computed masks (PNG files whose stem matches the WSI stem).
patient_id_fn
Callable[[str], str] | None
Optional callable that maps a filename stem to a patient ID.
returns
Cohort
A populated Cohort.
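The matching rules above can be sketched on plain lists of paths. The function match_by_stem is a hypothetical illustration, not the library's code. It assumes that single-WSI patients get no timepoint and that "alphabetical order" means sorting by filename stem; the file names and the .svs extension are made up for the example.

```python
from __future__ import annotations

from collections import defaultdict
from pathlib import Path
from typing import Callable


def match_by_stem(wsi_files: list[str],
                  rna_files: list[str],
                  patient_id_fn: Callable[[str], str] | None = None) -> list[dict]:
    # Index RNA files by exact (case-sensitive) stem.
    rna_by_stem = {Path(p).stem: Path(p) for p in rna_files}

    # Group WSIs by patient, visiting them in alphabetical stem order.
    grouped: dict[str, list[Path]] = defaultdict(list)
    for wsi in sorted(map(Path, wsi_files), key=lambda p: p.stem):
        patient_id = patient_id_fn(wsi.stem) if patient_id_fn else wsi.stem
        grouped[patient_id].append(wsi)

    records = []
    for patient_id, slides in grouped.items():
        for index, wsi in enumerate(slides):
            records.append({
                "patient_id": patient_id,
                "wsi_id": wsi.stem,
                "wsi_path": wsi,
                "bulk_rna_path": rna_by_stem.get(wsi.stem),
                # Assumption: timepoints only when a patient has several WSIs.
                "timepoint": f"t{index}" if len(slides) > 1 else None,
            })
    return records


records = match_by_stem(
    ["/d/P1_b.svs", "/d/P1_a.svs", "/d/P2.svs"],
    ["/r/P2.csv", "/r/p1_a.csv"],  # "p1_a" differs in case, so it does not match
    patient_id_fn=lambda stem: stem.split("_")[0],
)
```

Because matching is case-sensitive, an RNA file that differs from the WSI stem only in case is left unpaired in this sketch.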

save

def save(path: str | Path) -> Path
Persists the cohort manifest to a YAML file. The YAML includes all inputs, per-model outputs, status, labels, and metadata — everything needed to resume.
path
str | Path
required
Destination file path.
returns
Path
The resolved output path.

load

@classmethod
def load(cls, path: str | Path) -> Cohort
Loads a cohort from a previously saved YAML manifest.
path
str | Path
required
Path to the YAML manifest file.
returns
Cohort
A fully-populated Cohort.

upsert

def upsert(wsi_id: str,
           *,
           patient_id: str | None = None,
           wsi_path: Path | None = None,
           bulk_rna_path: Path | None = None,
           mask_path: Path | None = None,
           timepoint: str | None = None) -> WSIRecord
Inserts or updates a WSI record. If a record with wsi_id exists, updates the non-None fields. Otherwise creates a new record.
wsi_id
str
required
WSI identifier.
patient_id
str | None
Patient identifier (defaults to wsi_id).
wsi_path
Path | None
Path to WSI file.
bulk_rna_path
Path | None
Path to bulk RNA file.
mask_path
Path | None
Path to tissue mask.
timepoint
str | None
Timepoint label.
returns
WSIRecord
The inserted or updated WSIRecord.
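The insert-or-update rule can be sketched with a dict-based registry. This is a hypothetical illustration in which records are plain dicts rather than WSIRecord instances.

```python
from __future__ import annotations

from typing import Any


def upsert(registry: dict[str, dict[str, Any]],
           wsi_id: str,
           **fields: Any) -> dict[str, Any]:
    record = registry.get(wsi_id)
    if record is None:
        # New record: patient_id defaults to wsi_id.
        record = {"wsi_id": wsi_id, "patient_id": wsi_id}
        registry[wsi_id] = record
    # Only non-None values overwrite what is already stored.
    record.update({k: v for k, v in fields.items() if v is not None})
    return record


registry: dict[str, dict[str, Any]] = {}
upsert(registry, "slide_1", wsi_path="/data/wsis/slide_1.svs")
# wsi_path=None leaves the stored path untouched; timepoint is added.
upsert(registry, "slide_1", timepoint="t0", wsi_path=None)
```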

get_wsi

def get_wsi(wsi_id: str) -> WSIRecord | None
Returns the record for wsi_id, or None.
wsi_id
str
required
Unique WSI identifier (typically the file stem).
returns
WSIRecord | None
The matching WSIRecord, or None if not found.

link_bulk_rna

def link_bulk_rna(bulk_rna_dir: str | Path,
                  *,
                  extensions: set[str] | None = None,
                  separator: str | None = None,
                  gene_column: str | None = None,
                  value_column: str | None = None,
                  strip_version: bool = False) -> int
Links bulk RNA files to existing WSI records by stem. Scans bulk_rna_dir for recognised RNA files and pairs them with existing records by exact filename-stem equality. Records that already have a bulk_rna_path are skipped. This enables late-binding of bulk RNA data: build a cohort from WSIs first, then call this method to attach RNA when it becomes available.
bulk_rna_dir
str | Path
required
Directory containing bulk RNA files.
extensions
set[str] | None
File extensions to match. Defaults to {".csv", ".tsv"}.
separator
str | None
Column delimiter for parsing. When None the separator is inferred from the file extension.
gene_column
str | None
Column name containing gene identifiers. Required (with value_column) for long-format files (e.g. GDC/TCGA gene quantification TSVs).
value_column
str | None
Column name containing expression values to read (e.g. "tpm_unstranded").
strip_version
bool
When True, strips version suffixes from gene identifiers (e.g. ENSG…00003.15 becomes ENSG…00003).
returns
int
Number of records that were linked.
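How the parsing options might interact for a long-format file is sketched below. Both helpers are hypothetical and not part of the library. The gene identifiers are made up, the extension-to-separator table is an assumption, and real GDC files may contain extra header or comment lines that this sketch does not handle.

```python
from __future__ import annotations

import csv
import io
from pathlib import Path


def infer_separator(path: str) -> str:
    # Assumed mapping from file extension to delimiter.
    return {".csv": ",", ".tsv": "\t"}[Path(path).suffix.lower()]


def read_long_format(text: str,
                     gene_column: str,
                     value_column: str,
                     separator: str = "\t",
                     strip_version: bool = False) -> dict[str, float]:
    """Read one (gene, value) pair per row from a long-format table."""
    expression: dict[str, float] = {}
    for row in csv.DictReader(io.StringIO(text), delimiter=separator):
        gene = row[gene_column]
        if strip_version:
            # Drop everything after the first dot, e.g. "GENE0001.15" -> "GENE0001".
            gene = gene.split(".", 1)[0]
        expression[gene] = float(row[value_column])
    return expression


table = "gene_id\ttpm_unstranded\nGENE0001.15\t1.5\nGENE0002.3\t0.0\n"
expression = read_long_format(
    table, "gene_id", "tpm_unstranded",
    separator=infer_separator("sample.tsv"),
    strip_version=True,
)
```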

add_labels

def add_labels(labels: dict[str, dict[str, Any]],
               *,
               by: str = "patient_id") -> None
Adds categorical labels to records.
labels
dict[str, dict[str, Any]]
required
Mapping of {identifier: {label_name: value}}.
by
str
Key to match on — "patient_id" or "wsi_id".
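The effect of the by argument can be sketched on dict-based records. This is a hypothetical illustration; it assumes that matching on "patient_id" applies the labels to every WSI of that patient.

```python
from __future__ import annotations

from typing import Any


def add_labels(records: list[dict[str, Any]],
               labels: dict[str, dict[str, Any]],
               *,
               by: str = "patient_id") -> None:
    if by not in ("patient_id", "wsi_id"):
        raise ValueError(f"by must be 'patient_id' or 'wsi_id', got {by!r}")
    for record in records:
        update = labels.get(record[by])
        if update:
            # Merge into existing labels instead of replacing them.
            record.setdefault("labels", {}).update(update)


records = [
    {"patient_id": "patient_001", "wsi_id": "slide_a"},
    {"patient_id": "patient_001", "wsi_id": "slide_b"},
    {"patient_id": "patient_002", "wsi_id": "slide_c"},
]
add_labels(records, {"patient_001": {"mutation": "BRAF"}})
add_labels(records, {"slide_c": {"grade": "high"}}, by="wsi_id")
```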

add_metadata

def add_metadata(metadata: dict[str, dict[str, Any]],
                 *,
                 by: str = "patient_id") -> None
Adds clinical/treatment metadata to records.
metadata
dict[str, dict[str, Any]]
required
Mapping of {identifier: {field: value}}.
by
str
Key to match on — "patient_id" or "wsi_id".

add_labels_from_csv

def add_labels_from_csv(path: str | Path, *, by: str = "patient_id") -> None
Adds labels and metadata from a CSV file. The CSV must have a column matching by (default patient_id). Columns prefixed with label_ are treated as labels; remaining columns as metadata.
path
str | Path
required
Path to the labels CSV.
by
str
Join key column name.

num_patients

@property
def num_patients() -> int
Number of unique patients in the cohort.

patients

@property
def patients() -> dict[str, PatientRecord]
Patient-indexed view of the cohort.

get_patient

def get_patient(patient_id: str) -> PatientRecord
Returns the PatientRecord for a given patient.
patient_id
str
required
Patient identifier string.
returns
PatientRecord
The matching PatientRecord.
Raises:
  • KeyError — If patient_id is not found.

wsi_ids

@property
def wsi_ids() -> list[str]
Ordered list of all WSI IDs.

patient_ids

@property
def patient_ids() -> list[str]
Ordered list of unique patient IDs.

wsi_paths

@property
def wsi_paths() -> list[Path | None]
Ordered list of WSI paths (may contain None).

pending

def pending(model_name: str, stage: str) -> list[WSIRecord]
Returns WSIs that have not completed a stage for a model. Completion is determined by checking whether the output file exists on disk. Records with missing WSI paths are excluded.
model_name
str
required
Model identifier.
stage
str
required
"embed" or "predict".
returns
list[WSIRecord]
List of WSIRecord entries still needing processing.
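The filtering rule can be sketched as follows. This is a hypothetical illustration: records are plain dicts, and the on-disk check is replaced by a caller-supplied stage_done predicate so the example stays self-contained.

```python
from __future__ import annotations

from pathlib import Path
from typing import Any, Callable


def pending(records: list[dict[str, Any]],
            stage_done: Callable[[dict[str, Any]], bool]) -> list[dict[str, Any]]:
    # Skip records without a WSI path, then keep those whose stage is not done.
    return [r for r in records
            if r["wsi_path"] is not None and not stage_done(r)]


records = [
    {"wsi_id": "slide_1", "wsi_path": Path("slide_1.svs"), "done": True},
    {"wsi_id": "slide_2", "wsi_path": Path("slide_2.svs"), "done": False},
    {"wsi_id": "slide_3", "wsi_path": None, "done": False},
]
todo = pending(records, stage_done=lambda r: r["done"])
```

Here only slide_2 remains: slide_1 is already done and slide_3 has no WSI path to process.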

to_csv

def to_csv(path: str | Path) -> Path
Writes the input columns, plus labels and metadata, to a CSV file. For full persistence including outputs, use save.
path
str | Path
required
Destination file path.
returns
Path
The resolved output path.

summary

def summary() -> str
Returns a human-readable summary of the cohort.
returns
str
Multi-line summary string.