cohort

Cohort manifest for pairing WSIs with bulk RNA and clinical metadata. Provides Cohort, a typed registry that tracks the mapping between patients, WSIs, bulk RNA samples, timepoints, labels, and arbitrary clinical metadata. WSIRecord is the single source of truth for each WSI, including per-model outputs and processing status.

Construction paths:

From a user-provided CSV via from_csv.
Auto-matching from directories via from_directories.
From a YAML manifest via load (for resume).

The cohort can be serialised to YAML via save and loaded back for full reproducibility and resume support. Example:

from bioptimus.data.cohort import Cohort

# Auto-match from directories:
cohort = Cohort.from_directories(
    wsi_dir="/data/wsis",
    bulk_rna_dir="/data/rna",
)

# Or load a saved manifest (resume):
cohort = Cohort.load(".output/brca/run_1/manifest.yaml")

# Add labels post-hoc:
cohort.add_labels({"patient_001": {"mutation": "BRAF"}})

# Iterate:
for record in cohort:
    print(record.patient_id, record.wsi_path)

ModelOutput

@dataclass
class ModelOutput()

Per-model output record for a single WSI.

Name of the model that produced this output.

Path to the embedding output file.

Path to the prediction output file.

Path to the tile coordinates CSV.

Directory containing exported tile images.

Input modalities used for this output (e.g. ["image"] or ["image", "bulk_rna"]).

ISO 8601 string of when the output was produced.

Processing status ("pending", "done", "error").

Error message if status is "error".

is_done

@property
def is_done() -> bool

Returns True if status is "done".

has_embedding

@property
def has_embedding() -> bool

Returns True if an embedding path is set and exists.

has_prediction

@property
def has_prediction() -> bool

Returns True if a prediction path is set and exists.

set_stage_output

def set_stage_output(stage: str,
                     modalities: list[str],
                     path: Path,
                     timestamp: str | None = None) -> None

Records an output for a specific stage and modality combination. Updates both the per-modality tracking dict and the top-level convenience fields (embedding_path and prediction_path).

stage (str, required): "embed" or "predict".
modalities (list[str], required): Modalities used for this output.
path (Path, required): Output file path.
timestamp (str | None): ISO 8601 timestamp.

get_stage_path

def get_stage_path(stage: str,
                   modalities: list[str] | None = None) -> Path | None

Returns the output path for a stage and modality combo.

stage (str, required): "embed" or "predict".
modalities (list[str] | None): Modality combination to look up. Falls back to the top-level path when None.

Returns (Path | None): Resolved path, or None if not recorded.
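
The interplay between set_stage_output and get_stage_path can be pictured with a small stand-alone sketch. It is not the library's implementation: the StageTracker class, the modality_paths dict, the "+"-joined sorted key, and the example file names are all assumptions made for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from pathlib import Path


def modality_key(modalities: list[str]) -> str:
    """Order-independent key for a modality combination (assumed convention)."""
    return "+".join(sorted(modalities))


@dataclass
class StageTracker:
    """Hypothetical stand-in for the stage bookkeeping described above."""

    embedding_path: Path | None = None
    prediction_path: Path | None = None
    # {stage: {modality key: output path}}
    modality_paths: dict[str, dict[str, Path]] = field(default_factory=dict)

    def set_stage_output(self, stage: str, modalities: list[str], path: Path) -> None:
        if stage not in ("embed", "predict"):
            raise ValueError(f"unknown stage: {stage!r}")
        self.modality_paths.setdefault(stage, {})[modality_key(modalities)] = path
        # Keep the top-level convenience field in sync with the latest write.
        if stage == "embed":
            self.embedding_path = path
        else:
            self.prediction_path = path

    def get_stage_path(self, stage: str, modalities: list[str] | None = None) -> Path | None:
        if modalities is None:
            # No combination requested: fall back to the top-level path.
            return self.embedding_path if stage == "embed" else self.prediction_path
        return self.modality_paths.get(stage, {}).get(modality_key(modalities))


tracker = StageTracker()
tracker.set_stage_output("embed", ["image"], Path("out/slide_1_image.npy"))
tracker.set_stage_output("embed", ["image", "bulk_rna"], Path("out/slide_1_multi.npy"))
```

In this sketch the top-level field always mirrors the most recent write, while each modality combination keeps its own entry.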

to_dict

def to_dict() -> dict[str, Any]

Serialises to a plain dict for YAML output.

Returns (dict[str, Any]): A dictionary representation of the model output.

from_dict

@classmethod
def from_dict(cls, data: dict[str, Any]) -> ModelOutput

Constructs from a plain dict (YAML round-trip).

data (dict[str, Any], required): Dictionary with serialised model output fields.

Returns (ModelOutput): A new ModelOutput instance.
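
As a rough illustration of the to_dict / from_dict round trip, the sketch below uses a reduced, hypothetical dataclass (OutputSketch) with only three fields. The field name model_name and the choice to store paths as strings are assumptions, not details confirmed by the library.

```python
from __future__ import annotations

from dataclasses import asdict, dataclass
from pathlib import Path
from typing import Any


@dataclass
class OutputSketch:
    """Hypothetical, reduced stand-in for ModelOutput."""

    model_name: str
    embedding_path: Path | None = None
    status: str = "pending"

    def to_dict(self) -> dict[str, Any]:
        data = asdict(self)
        # Paths become strings so the dict holds only YAML-friendly scalars.
        if self.embedding_path is not None:
            data["embedding_path"] = str(self.embedding_path)
        return data

    @classmethod
    def from_dict(cls, data: dict[str, Any]) -> OutputSketch:
        raw = data.get("embedding_path")
        return cls(
            model_name=data["model_name"],
            embedding_path=Path(raw) if raw else None,
            status=data.get("status", "pending"),
        )


original = OutputSketch("model_a", Path("out/slide_1.npy"), "done")
restored = OutputSketch.from_dict(original.to_dict())
```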

WSIRecord

@dataclass
class WSIRecord()

Single WSI entry — the single source of truth. Tracks inputs (WSI, RNA, mask), per-model outputs, and arbitrary annotations for one WSI within a cohort.

Unique patient identifier.

Unique WSI identifier (typically the filename stem).

Path to the WSI file.

Identifier for the paired bulk RNA sample.

Path to the bulk RNA CSV/TSV file.

Path to a tissue mask (pre-computed or cached).

When the mask was computed/discovered.

Temporal ordering label (e.g. "t0", "t1").

Per-model output records keyed by model name.

Arbitrary categorical annotations.

Arbitrary clinical/treatment metadata.

get_output

def get_output(model_name: str) -> ModelOutput

Returns the output for model_name, creating it if absent.

model_name (str, required): Model identifier string.

Returns (ModelOutput): The ModelOutput for the given model.

is_stage_done

def is_stage_done(model_name: str, stage: str) -> bool

Checks whether a stage output exists on disk. When modality_outputs are tracked, the check is scoped to the record’s current available_modalities so that linking new data (e.g. bulk RNA) automatically surfaces pending work without requiring force=True.

model_name (str, required): Model identifier.
stage (str, required): "embed" or "predict".

Returns (bool): True if the output path exists on disk.
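
The modality-scoped check can be illustrated with a stand-alone sketch. It assumes outputs are keyed by the set of modalities that produced them; the helper names and the frozenset key are hypothetical, not the library's internals.

```python
from __future__ import annotations

import tempfile
from pathlib import Path


def available_modalities(bulk_rna_path: Path | None) -> list[str]:
    """The image is always available; bulk RNA only once a path is linked."""
    return ["image"] + (["bulk_rna"] if bulk_rna_path is not None else [])


def stage_done(outputs: dict[frozenset[str], Path], modalities: list[str]) -> bool:
    """True only if an output for exactly these modalities exists on disk."""
    path = outputs.get(frozenset(modalities))
    return path is not None and path.exists()


with tempfile.TemporaryDirectory() as tmp:
    image_only = Path(tmp) / "slide_1_image.npy"
    image_only.touch()
    outputs = {frozenset({"image"}): image_only}

    done_before = stage_done(outputs, available_modalities(None))
    # Linking RNA changes the available modalities, so the stage is pending again.
    done_after = stage_done(outputs, available_modalities(Path("rna/slide_1.csv")))
```

Under these assumptions, the image-only output counts as done until RNA is linked, at which point the stage shows up as pending without any force flag.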

has_mask

@property
def has_mask() -> bool

Returns True if a mask path is set and exists on disk.

available_modalities

@property
def available_modalities() -> list[str]

Returns the input modalities available for this record. Always includes "image". Includes "bulk_rna" when bulk_rna_path is set.

to_dict

def to_dict() -> dict[str, Any]

Serialises to a plain dict for YAML output.

Returns (dict[str, Any]): A dictionary representation of the WSI record.

from_dict

@classmethod
def from_dict(cls, data: dict[str, Any]) -> WSIRecord

Constructs from a plain dict (YAML round-trip).

data (dict[str, Any], required): Dictionary with serialised WSI record fields.

Returns (WSIRecord): A new WSIRecord instance.

PatientRecord

@dataclass
class PatientRecord()

Groups all WSIs belonging to a single patient.

Unique patient identifier.

Ordered list of WSI records (by timepoint).

Patient-level categorical annotations.

Patient-level clinical metadata.

Cohort

class Cohort()

Typed cohort manifest: the single source of truth for an experiment. A cohort is an ordered collection of WSIRecord entries grouped by patient. It stores all information needed to reproduce and resume an inference run: file locations, pairing logic, timepoints, per-model outputs, labels, and clinical metadata.

Construction: use from_csv, from_directories, or load rather than calling the constructor directly.

Initial list of WSIRecord entries.

from_csv

@classmethod
def from_csv(cls,
             path: str | Path,
             *,
             wsi_dir: str | Path | None = None,
             bulk_rna_dir: str | Path | None = None,
             columns: dict[str, str] | None = None) -> Cohort

Creates a cohort from a user-provided CSV manifest. Required columns:

wsi_id: WSI identifier (filename stem or full name).

Optional columns:

patient_id: if absent, derived from wsi_id.
bulk_rna_id: paired RNA sample identifier.
timepoint: temporal label.
wsi_path: explicit path override.
bulk_rna_path: explicit path override.
Any other columns are treated as labels if prefixed with label_ or as metadata otherwise.

When wsi_dir or bulk_rna_dir is provided, paths are resolved by joining the directory with the corresponding ID (plus the matching extension found on disk).

The columns dict maps canonical field names to the actual CSV column names. Only the keys that differ from the defaults need to be specified. For example:

Cohort.from_csv(
    "cohort.csv",
    columns={
        "wsi_id": "imageid",
        "patient_id": "subject_id",
    },
)

Recognised canonical keys: wsi_id, patient_id, bulk_rna_id, timepoint, wsi_path, bulk_rna_path.

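path (str | Path, required): Path to the CSV manifest.
wsi_dir (str | Path | None): Directory to resolve WSI paths from.
bulk_rna_dir (str | Path | None): Directory to resolve bulk RNA paths from.
columns (dict[str, str] | None): Mapping of canonical field names to actual CSV column names. Unmapped fields fall back to their canonical name.

Returns (Cohort): A populated Cohort.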
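
The column handling described above can be sketched with the standard csv module. This is an illustration only: split_row is a hypothetical helper, falling back to wsi_id for a missing patient_id is a guess at how the ID is "derived", and stripping the label_ prefix from label names is likewise an assumption.

```python
from __future__ import annotations

import csv
import io

CANONICAL = ("wsi_id", "patient_id", "bulk_rna_id", "timepoint", "wsi_path", "bulk_rna_path")


def split_row(row: dict[str, str], columns: dict[str, str] | None = None):
    """Split one CSV row into canonical fields, labels, and metadata."""
    # Canonical names map to themselves unless overridden by `columns`.
    mapping = {name: name for name in CANONICAL}
    mapping.update(columns or {})

    fields = {name: row[col] for name, col in mapping.items() if col in row}
    if "wsi_id" not in fields:
        raise KeyError("the manifest needs a wsi_id column")
    # Assumption: a missing patient_id falls back to the WSI ID itself.
    fields.setdefault("patient_id", fields["wsi_id"])

    used = set(mapping.values())
    labels, metadata = {}, {}
    for col, value in row.items():
        if col in used:
            continue
        if col.startswith("label_"):
            labels[col[len("label_"):]] = value  # assumed: prefix is stripped
        else:
            metadata[col] = value
    return fields, labels, metadata


text = "imageid,subject_id,timepoint,label_mutation,age\nslide_1,patient_001,t0,BRAF,61\n"
row = next(csv.DictReader(io.StringIO(text)))
fields, labels, metadata = split_row(row, {"wsi_id": "imageid", "patient_id": "subject_id"})
```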

from_directories

@classmethod
def from_directories(
        cls,
        wsi_dir: str | Path,
        bulk_rna_dir: str | Path | None = None,
        *,
        mask_dir: str | Path | None = None,
        patient_id_fn: Callable[[str], str] | None = None) -> Cohort

Creates a cohort by auto-matching files from directories. WSI and bulk RNA files are paired by exact filename stem equality (case-sensitive). Patient IDs default to the shared stem. When multiple WSIs share the same patient ID, they are assigned sequential timepoints (t0, t1, …) in alphabetical order.

wsi_dir (str | Path, required): Directory containing WSI files.
bulk_rna_dir (str | Path | None): Directory containing bulk RNA files. When None, WSIs are registered without RNA.
mask_dir (str | Path | None): Directory containing pre-computed masks (PNG files whose stem matches the WSI stem).
patient_id_fn (Callable[[str], str] | None): Optional callable that maps a filename stem to a patient ID.

Returns (Cohort): A populated Cohort.
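
The matching rule can be pictured as follows. The sketch is a simplified stand-in rather than the library's code: match_by_stem is a hypothetical helper, it accepts any file regardless of extension, and leaving the timepoint unset for a patient with a single slide is an assumption.

```python
from __future__ import annotations

import tempfile
from collections import defaultdict
from pathlib import Path
from typing import Callable


def match_by_stem(
    wsi_dir: Path,
    rna_dir: Path,
    patient_id_fn: Callable[[str], str] | None = None,
) -> list[dict]:
    """Pair slides with RNA files by exact stem and assign timepoints."""
    rna_by_stem = {p.stem: p for p in rna_dir.iterdir() if p.is_file()}

    # Group slides per patient; sorting makes the timepoint order alphabetical.
    by_patient: dict[str, list[Path]] = defaultdict(list)
    for wsi in sorted(p for p in wsi_dir.iterdir() if p.is_file()):
        patient = patient_id_fn(wsi.stem) if patient_id_fn else wsi.stem
        by_patient[patient].append(wsi)

    records = []
    for patient, slides in by_patient.items():
        for index, wsi in enumerate(slides):
            records.append({
                "patient_id": patient,
                "wsi_id": wsi.stem,
                # Assumption: a lone slide gets no timepoint.
                "timepoint": f"t{index}" if len(slides) > 1 else None,
                "bulk_rna_path": rna_by_stem.get(wsi.stem),
            })
    return records


with tempfile.TemporaryDirectory() as tmp:
    wsi_dir, rna_dir = Path(tmp, "wsis"), Path(tmp, "rna")
    wsi_dir.mkdir()
    rna_dir.mkdir()
    for name in ("P1_a.svs", "P1_b.svs", "P2_a.svs"):
        (wsi_dir / name).touch()
    (rna_dir / "P1_a.csv").touch()

    # Hypothetical naming scheme: the patient ID is the part before "_".
    records = match_by_stem(wsi_dir, rna_dir, lambda stem: stem.split("_")[0])
```

Here the two P1 slides receive t0 and t1 in alphabetical order, and only P1_a is paired with RNA because the stems must match exactly.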

save

def save(path: str | Path) -> Path

Persists the cohort manifest to a YAML file. The YAML includes all inputs, per-model outputs, status, labels, and metadata — everything needed to resume.

path (str | Path, required): Destination file path.

Returns (Path): The resolved output path.

load

@classmethod
def load(cls, path: str | Path) -> Cohort

Loads a cohort from a previously saved YAML manifest.

path (str | Path, required): Path to the YAML manifest file.

Returns (Cohort): A fully-populated Cohort.

upsert

def upsert(wsi_id: str,
           *,
           patient_id: str | None = None,
           wsi_path: Path | None = None,
           bulk_rna_path: Path | None = None,
           mask_path: Path | None = None,
           timepoint: str | None = None) -> WSIRecord

Inserts or updates a WSI record. If a record with wsi_id exists, updates the non-None fields. Otherwise creates a new record.

wsi_id (str, required): WSI identifier.
patient_id (str | None): Patient identifier (defaults to wsi_id).
wsi_path (Path | None): Path to WSI file.
bulk_rna_path (Path | None): Path to bulk RNA file.
mask_path (Path | None): Path to tissue mask.
timepoint (str | None): Timepoint label.

Returns (WSIRecord): The inserted or updated WSIRecord.
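
The insert-or-update semantics can be sketched on a plain dict registry. RecordSketch and the free function upsert below are hypothetical stand-ins with a reduced field set; they only illustrate the rule that None means "leave unchanged".

```python
from __future__ import annotations

from dataclasses import dataclass
from pathlib import Path


@dataclass
class RecordSketch:
    """Hypothetical, reduced stand-in for WSIRecord."""

    wsi_id: str
    patient_id: str
    wsi_path: Path | None = None
    timepoint: str | None = None


def upsert(
    registry: dict[str, RecordSketch],
    wsi_id: str,
    *,
    patient_id: str | None = None,
    wsi_path: Path | None = None,
    timepoint: str | None = None,
) -> RecordSketch:
    record = registry.get(wsi_id)
    if record is None:
        # New entry: the patient ID falls back to the WSI ID.
        record = RecordSketch(wsi_id=wsi_id, patient_id=patient_id or wsi_id)
        registry[wsi_id] = record
    updates = {"patient_id": patient_id, "wsi_path": wsi_path, "timepoint": timepoint}
    for name, value in updates.items():
        if value is not None:  # None means "leave unchanged"
            setattr(record, name, value)
    return record


registry: dict[str, RecordSketch] = {}
upsert(registry, "slide_1", wsi_path=Path("wsis/slide_1.svs"))
upsert(registry, "slide_1", timepoint="t0")  # updates the existing record
```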

get_wsi

def get_wsi(wsi_id: str) -> WSIRecord | None

Returns the record for wsi_id, or None.

wsi_id (str, required): Unique WSI identifier (typically the file stem).

Returns (WSIRecord | None): The matching WSIRecord, or None if not found.

link_bulk_rna

def link_bulk_rna(bulk_rna_dir: str | Path,
                  *,
                  extensions: set[str] | None = None,
                  separator: str | None = None,
                  gene_column: str | None = None,
                  value_column: str | None = None,
                  strip_version: bool = False) -> int

Links bulk RNA files to existing WSI records by stem. Scans bulk_rna_dir for recognised RNA files and pairs them with existing records by exact filename-stem equality. Records that already have a bulk_rna_path are skipped. This enables late-binding of bulk RNA data: build a cohort from WSIs first, then call this method to attach RNA when it becomes available.

bulk_rna_dir (str | Path, required): Directory containing bulk RNA files.
extensions (set[str] | None): File extensions to match. Defaults to {".csv", ".tsv"}.
separator (str | None): Column delimiter for parsing. When None the separator is inferred from the file extension.
gene_column (str | None): Column name containing gene identifiers. Required (with value_column) for long-format files (e.g. GDC/TCGA gene quantification TSVs).
value_column (str | None): Column name containing expression values to read (e.g. "tpm_unstranded").
strip_version (bool): When True, strips version suffixes from gene identifiers (e.g. ENSG…00003.15 → ENSG…00003).

Returns (int): Number of records that were linked.
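
To make the parsing options concrete, the sketch below reads a long-format table with the standard csv module. It is an illustration under stated assumptions: the helper names are hypothetical, the separator rule (tab for .tsv, comma otherwise) is a guess at the inference described above, the gene_id column name is assumed, and the gene identifiers are made-up placeholders.

```python
from __future__ import annotations

import csv
import io
import re
from pathlib import Path

# A trailing ".<digits>" is treated as the version suffix.
_VERSION_SUFFIX = re.compile(r"\.\d+$")


def infer_separator(path: Path) -> str:
    """Tab for .tsv files, comma otherwise (assumed inference rule)."""
    return "\t" if path.suffix.lower() == ".tsv" else ","


def read_long_format(
    text: str,
    separator: str,
    gene_column: str,
    value_column: str,
    strip_version: bool = False,
) -> dict[str, float]:
    """Read one gene/value pair per row into a {gene: value} dict."""
    expression = {}
    for row in csv.DictReader(io.StringIO(text), delimiter=separator):
        gene = row[gene_column]
        if strip_version:
            gene = _VERSION_SUFFIX.sub("", gene)
        expression[gene] = float(row[value_column])
    return expression


# Placeholder identifiers, not real gene IDs.
sample = "gene_id\ttpm_unstranded\nGENE0001.15\t12.5\nGENE0002.3\t0.0\n"
expression = read_long_format(
    sample,
    infer_separator(Path("sample.tsv")),
    gene_column="gene_id",
    value_column="tpm_unstranded",
    strip_version=True,
)
```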

add_labels

def add_labels(labels: dict[str, dict[str, Any]],
               *,
               by: str = "patient_id") -> None

Adds categorical labels to records.

labels (dict[str, dict[str, Any]], required): Mapping of {identifier: {label_name: value}}.
by (str): Key to match on, either "patient_id" or "wsi_id".
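
The effect of the by argument can be sketched with plain dicts standing in for records. The free function below is hypothetical, and raising ValueError for an unknown key is an assumption rather than documented behaviour.

```python
from __future__ import annotations

from typing import Any


def add_labels(
    records: list[dict[str, Any]],
    labels: dict[str, dict[str, Any]],
    *,
    by: str = "patient_id",
) -> None:
    """Merge labels into every record whose `by` field matches an identifier."""
    if by not in ("patient_id", "wsi_id"):
        raise ValueError(f"by must be 'patient_id' or 'wsi_id', got {by!r}")
    for record in records:
        record["labels"].update(labels.get(record[by], {}))


records = [
    {"patient_id": "patient_001", "wsi_id": "slide_a", "labels": {}},
    {"patient_id": "patient_001", "wsi_id": "slide_b", "labels": {}},
]
# Patient-level labels reach every slide of that patient.
add_labels(records, {"patient_001": {"mutation": "BRAF"}})
# WSI-level labels reach only the named slide.
add_labels(records, {"slide_b": {"grade": 2}}, by="wsi_id")
```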

add_metadata

def add_metadata(metadata: dict[str, dict[str, Any]],
                 *,
                 by: str = "patient_id") -> None

Adds clinical/treatment metadata to records.

metadata (dict[str, dict[str, Any]], required): Mapping of {identifier: {field: value}}.
by (str): Key to match on, either "patient_id" or "wsi_id".

add_labels_from_csv

def add_labels_from_csv(path: str | Path, *, by: str = "patient_id") -> None

Adds labels and metadata from a CSV file. The CSV must have a column matching by (default patient_id). Columns prefixed with label_ are treated as labels; remaining columns as metadata.

path (str | Path, required): Path to the labels CSV.
by (str): Join key column name.

num_patients

@property
def num_patients() -> int

Number of unique patients in the cohort.

patients

@property
def patients() -> dict[str, PatientRecord]

Patient-indexed view of the cohort.

get_patient

def get_patient(patient_id: str) -> PatientRecord

Returns the PatientRecord for a given patient.

patient_id (str, required): Patient identifier string.

Returns (PatientRecord): The matching PatientRecord.

Raises (KeyError): If patient_id is not found.

wsi_ids

@property
def wsi_ids() -> list[str]

Ordered list of all WSI IDs.

patient_ids

@property
def patient_ids() -> list[str]

Ordered list of unique patient IDs.

wsi_paths

@property
def wsi_paths() -> list[Path | None]

Ordered list of WSI paths (may contain None).

pending

def pending(model_name: str, stage: str) -> list[WSIRecord]

Returns WSIs that have not completed a stage for a model. Checks whether the output file exists on disk. Records with missing WSI paths are excluded.

model_name (str, required): Model identifier.
stage (str, required): "embed" or "predict".

Returns (list[WSIRecord]): List of WSIRecord entries still needing processing.
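
The selection rule can be sketched with plain dicts in place of WSIRecord objects. The record layout (an "outputs" dict keyed by stage) and the file names are hypothetical; only the two rules stated above are modelled: a missing output file means pending, and a record without a WSI path is skipped.

```python
from __future__ import annotations

import tempfile
from pathlib import Path


def pending(records: list[dict], stage: str) -> list[dict]:
    """Return records whose output for `stage` is missing on disk."""
    todo = []
    for record in records:
        if record["wsi_path"] is None:
            continue  # nothing to process without a slide
        output = record["outputs"].get(stage)
        if output is None or not Path(output).exists():
            todo.append(record)
    return todo


with tempfile.TemporaryDirectory() as tmp:
    finished = Path(tmp) / "a.npy"
    finished.touch()
    records = [
        {"wsi_id": "a", "wsi_path": Path("a.svs"), "outputs": {"embed": finished}},
        {"wsi_id": "b", "wsi_path": Path("b.svs"), "outputs": {}},
        {"wsi_id": "c", "wsi_path": None, "outputs": {}},
        # Output was recorded but the file is no longer on disk.
        {"wsi_id": "d", "wsi_path": Path("d.svs"), "outputs": {"embed": Path(tmp) / "gone.npy"}},
    ]
    todo_ids = [r["wsi_id"] for r in pending(records, "embed")]
```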

to_csv

def to_csv(path: str | Path) -> Path

Writes the input columns, together with labels and metadata, to a CSV file. For full persistence including outputs, use save.

path (str | Path, required): Destination file path.

Returns (Path): The resolved output path.

summary

def summary() -> str

Returns a human-readable summary of the cohort.

Returns (str): Multi-line summary string.
