ModelSpec encapsulates a model’s identity, input tile requirements, normalisation parameters, output contract, and compute hints.
It reuses TileSpec for tile geometry, so the same spec drives both the extraction pipeline and the model’s expectations.
Example:
Models
model_name field in each YAML config under bioptimus/models/configs/.
ModelSpec
TileSpec) with preprocessing, architecture, and output metadata so that data pipelines can prepare inputs and interpret outputs correctly without manual configuration.
The spec is framework-agnostic: it contains only plain Python types and can be consumed by PyTorch, TensorFlow, JAX, or any other framework.
Groups:
Identity & provenance: who is this model?
Input / preprocessing: what does the model expect?
Architecture: structural hints for generic code.
Output: how to interpret the raw model output.
Compute: precision & hardware hints.
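As a rough illustration of the grouping above, the following sketch lays the fields out as a frozen dataclass built only from plain Python types. It is not the library's actual definition: the TileSpec stand-in and its field names (tile_size, stride, resolution, unit), every default value, and the values passed for "h0-mini" are assumptions chosen for the example.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Optional, Tuple


@dataclass(frozen=True)
class TileSpec:
    """Hypothetical stand-in for the real TileSpec (field names are assumed)."""
    tile_size: int = 224      # tile edge length in pixels
    stride: int = 224         # step between neighbouring tiles
    resolution: float = 0.5   # target resolution, expressed in `unit`
    unit: str = "mpp"         # measurement unit, e.g. microns per pixel


@dataclass(frozen=True)
class ModelSpec:
    """Sketch only: field names follow the documented list, defaults are illustrative."""
    # Identity & provenance
    model_name: str
    version: str
    weights_source: Optional[str] = None
    license: Optional[str] = None
    # Input / preprocessing
    tile_spec: TileSpec = field(default_factory=TileSpec)
    num_channels: int = 3
    mean: Tuple[float, ...] = (0.485, 0.456, 0.406)  # ImageNet statistics, as a placeholder
    std: Tuple[float, ...] = (0.229, 0.224, 0.225)
    interpolation: str = "bicubic"
    antialias: bool = True
    color_space: str = "RGB"
    stain_normalization: Optional[str] = None
    # Architecture
    architecture: str = "vit"
    patch_size: Optional[int] = 16
    num_prefix_tokens: int = 1
    # Output
    embedding_dim: int = 768
    output_type: str = "cls"
    output_spatial_dims: Optional[Tuple[int, int]] = None
    # Compute
    precision: str = "fp32"


# Illustrative values only; these are not the real configuration of any model.
spec = ModelSpec(
    model_name="h0-mini",
    version="1.0.0",
    weights_source="bioptimus/H0-mini",
)

# Because every field is a plain Python type, the spec serialises without
# any framework-specific code (tuples become JSON lists).
as_json = json.dumps(asdict(spec))
restored = json.loads(as_json)
```

Freezing the dataclass keeps a spec hashable and prevents accidental mutation once a pipeline has been configured from it; whether the real class does the same is not stated in this reference.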
model_name
Unique identifier / registry key (e.g. "h0-mini").
version
Model version string (e.g. "1.0.0"). Ensures embeddings extracted with v1 are not mixed with v2.
weights_source
HuggingFace repo, URL, or local path to weights (e.g. "bioptimus/H0-mini"). None = no auto-download.
license
SPDX identifier or short description (e.g. "proprietary", "apache-2.0").
tile_spec
Tile geometry the model expects (size, stride, resolution, measurement unit). Directly reusable by TileExtractor.
num_channels
Number of input image channels (default 3 for RGB).
mean
Per-channel normalisation mean (channel order matches input).
std
Per-channel normalisation std.
interpolation
Resize interpolation mode string (e.g. "bicubic", "bilinear").
antialias
Whether the resize operation should use antialiasing.
color_space
Expected colour space of the input image ("RGB", "BGR", "HED").
stain_normalization
Stain normalisation method applied before the model (e.g. "macenko", "reinhard", "vahadane"), or None for no stain normalisation.
architecture
Architecture family ("vit", "swin", "resnet", "convnext", …).
patch_size
ViT patch size (e.g. 14, 16). Determines the number of patch tokens in the output, (tile_size / patch_size)², not counting prefix tokens. None for non-ViT architectures.
num_prefix_tokens
Number of non-spatial prefix tokens before patch tokens in the output sequence (CLS, register tokens, …). E.g. DINOv2-reg = 5, most ViTs = 1.
embedding_dim
Dimensionality of the final output feature vector (after any post-processing like CLS + mean pooling).
output_type
How the raw model output is consumed: "cls" | "patch_mean" | "cls+patch_mean" | "dense" | "token_sequence".
output_spatial_dims
For dense / segmentation models that output a spatial feature map, e.g. (16, 16). None for pooled-output models.
precision
Recommended inference precision ("fp32", "fp16", "bf16").

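To show how patch_size, num_prefix_tokens, and output_type interact, the sketch below computes the expected token count for a ViT and pools a token sequence in pure Python. The helper names, the list-of-lists token layout with the CLS token at index 0, and the reading of "cls+patch_mean" as a concatenation are assumptions made for illustration; real code would typically operate on framework tensors instead.

```python
from typing import List, Sequence


def expected_num_tokens(tile_size: int, patch_size: int, num_prefix_tokens: int) -> int:
    """Prefix tokens plus (tile_size / patch_size)^2 patch tokens."""
    if tile_size % patch_size != 0:
        raise ValueError("tile_size must be a multiple of patch_size")
    per_side = tile_size // patch_size
    return num_prefix_tokens + per_side * per_side


def _mean(rows: Sequence[Sequence[float]]) -> List[float]:
    """Column-wise mean of equally sized vectors."""
    return [sum(column) / len(rows) for column in zip(*rows)]


def pool_tokens(
    tokens: Sequence[Sequence[float]],
    output_type: str,
    num_prefix_tokens: int = 1,
) -> List[float]:
    """Reduce a (num_tokens, dim) sequence to one vector.

    Assumes prefix tokens come first and the CLS token sits at index 0.
    Only the pooled output types are handled in this sketch.
    """
    cls_token = list(tokens[0])
    patch_tokens = tokens[num_prefix_tokens:]  # skip CLS and register tokens
    if output_type == "cls":
        return cls_token
    if output_type == "patch_mean":
        return _mean(patch_tokens)
    if output_type == "cls+patch_mean":
        # Assumed to mean concatenation, which doubles the vector length.
        return cls_token + _mean(patch_tokens)
    raise ValueError(f"output_type {output_type!r} is not pooled in this sketch")


# A 224-pixel tile with patch size 14 and 5 prefix tokens: 5 + 16 * 16 tokens.
n = expected_num_tokens(tile_size=224, patch_size=14, num_prefix_tokens=5)

# Dummy 2-dimensional tokens standing in for real model output.
tokens = [[float(i), 1.0] for i in range(n)]
pooled = pool_tokens(tokens, "cls+patch_mean", num_prefix_tokens=5)
```

Under the concatenation assumption, the pooled vector is twice as wide as a single token, which is consistent with embedding_dim being defined as the dimensionality after post-processing rather than the raw token width.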