Batch Processing Multiple Slides with LazySlide

This notebook demonstrates how to use the lazyslide Python package to process whole-slide images (WSIs) in batches using local hardware. We will perform a multi-step workflow consisting of:

  1. Pre-processing: Tissue segmentation and tiling.
  2. Feature Extraction: Running a pre-trained model on the tiles to generate features.
  3. Embedding Export: Saving the generated features as NumPy arrays for downstream analysis.

Setup

Pre-requisites

This tutorial assumes you have a Hugging Face account with access to the relevant model. We also assume you have installed the lazyslide package; for more information, please visit the project page: https://github.com/rendeirolab/LazySlide

First, log in to Hugging Face, entering your HF token when prompted.

# Login to HF
from huggingface_hub import login, hf_hub_download
login()

Next, let’s import all the necessary libraries and define the paths to our data and where we’ll save the output embeddings.

import lazyslide as zs
import numpy as np
import os
import glob
from tqdm.notebook import tqdm

# Please ensure the following two directories exist in the folder where you run this notebook
DATA_DIR = "data/"  # Copy your slide images here
EMBEDDINGS_DIR = "embeddings/"  # This is where we will store the final result

# Ensure the output directory exists
os.makedirs(EMBEDDINGS_DIR, exist_ok=True)

# Find all slide files.
# We assume the slide images are in *.svs format; please change the
# pattern if you are using a different format.
slide_paths = glob.glob(os.path.join(DATA_DIR, "*.svs"))

print(f"Found {len(slide_paths)} slides to process:")
for path in slide_paths:
    print(os.path.basename(path))
Found 2 slides to process:
GTEX-1117F-1026.svs
GTEX-111FC-0426.svs
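If your dataset mixes formats, the single `*.svs` pattern above will silently miss files. Below is a minimal sketch of a multi-extension search; the `find_slides` helper and the extension list are our own assumptions, not part of lazyslide, so adjust them to whatever your scanner produces. The demo at the bottom runs on a throwaway temp directory with dummy files.

```python
import glob
import os
import tempfile

# Hypothetical: common WSI extensions (adjust to your data)
SLIDE_EXTENSIONS = ("*.svs", "*.tiff", "*.ndpi", "*.mrxs")

def find_slides(data_dir):
    """Collect slide paths for every extension, sorted for reproducibility."""
    paths = []
    for pattern in SLIDE_EXTENSIONS:
        paths.extend(glob.glob(os.path.join(data_dir, pattern)))
    return sorted(paths)

# Tiny self-contained demo: dummy files in a temp directory
with tempfile.TemporaryDirectory() as tmp:
    for name in ("a.svs", "b.ndpi", "notes.txt"):
        open(os.path.join(tmp, name), "w").close()
    found = [os.path.basename(p) for p in find_slides(tmp)]
print(found)  # ['a.svs', 'b.ndpi']
```

Sorting the result keeps run-to-run ordering stable, which matters if you later pair each slide with its saved embedding file.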

Step 1: Pre-processing (Segmentation and Tiling)

In this step, we will iterate through each slide, identify the tissue regions, and then generate tiles from those regions. The results of these operations are stored within the .zarr directory created for each slide.

for slide_path in tqdm(slide_paths, desc="Preprocessing Slides"):
    print(f'\nProcessing {os.path.basename(slide_path)}...')

    # Open the whole-slide image
    wsi = zs.open_wsi(slide_path)

    # 1. Find tissues
    print("Finding tissues...")
    zs.pp.find_tissues(wsi)

    # 2. Tile tissues
    print("Tiling tissues...")
    zs.pp.tile_tissues(wsi, 224, mpp=0.5)

    # Save the slide data
    wsi.write()
    print(f"Finished preprocessing for {os.path.basename(slide_path)}.")
Processing GTEX-1117F-1026.svs...
Finding tissues...
Tiling tissues...
Finished preprocessing for GTEX-1117F-1026.svs.

Processing GTEX-111FC-0426.svs...
Finding tissues...
Tiling tissues...
Finished preprocessing for GTEX-111FC-0426.svs.

Step 2: Feature Extraction

Now that the slides are pre-processed, we can extract features from the tiles. We will use the h-optimus-1 model. This step can be time-consuming, and performance will depend heavily on your available hardware (a GPU is recommended).

Note: Ensure you have a CUDA-compatible GPU and the necessary drivers installed to use device="cuda".

Warning: This step can take several minutes per slide on a local machine; if you are testing on a single-GPU machine, we recommend using smaller slides.
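Rather than hard-coding `device="cuda"`, you can select the device at runtime. The `pick_device` helper below is our own sketch, not part of lazyslide; it assumes lazyslide's PyTorch backend is installed, but guards the import so the snippet also runs without torch.

```python
def pick_device():
    """Return "cuda" when a CUDA-capable GPU is visible, else "cpu"."""
    try:
        import torch
        return "cuda" if torch.cuda.is_available() else "cpu"
    except ImportError:
        # torch not installed; CPU is the only option
        return "cpu"

device = pick_device()
print(device)
```

You could then pass `device=device` to the feature-extraction call, so the same notebook works on both GPU and CPU-only machines.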

for slide_path in tqdm(slide_paths, desc="Extracting Features"):
    print(f'\nExtracting features for {os.path.basename(slide_path)}...')

    # Re-open the WSI to access the pre-processed data
    wsi = zs.open_wsi(slide_path)

    # Extract features.
    # If you do not have a CUDA-compatible GPU, you can use device="cpu"
    zs.tl.feature_extraction(wsi, "h-optimus-1", device="cuda")
    
    # Save the embeddings    
    wsi.write()
    
    print(f"Finished feature extraction for {os.path.basename(slide_path)}.")
Extracting features for GTEX-1117F-1026.svs...
Finished feature extraction for GTEX-1117F-1026.svs.
Extracting features for GTEX-111FC-0426.svs...
Finished feature extraction for GTEX-111FC-0426.svs.

Step 3: Export Embeddings

With the features extracted and stored in the .zarr directories, we can now load them and save them as individual .npy files. This format is convenient for loading into machine learning frameworks like PyTorch or TensorFlow, or for general analysis with NumPy and Scikit-learn.

for slide_path in tqdm(slide_paths, desc="Exporting Embeddings"):
    slide_basename = os.path.basename(slide_path)
    slide_name, _ = os.path.splitext(slide_basename)
    
    print(f'\nExporting embeddings for {slide_basename}...')
    
    # Re-open the WSI to access all data    
    wsi = zs.open_wsi(slide_path)
    
    # Fetch features as an AnnData object    
    adata = wsi.fetch.features_anndata("h-optimus-1")
    
    # The embeddings are stored in adata.X    
    embeddings = adata.X
    
    # Define the output path and save the embeddings    
    output_path = os.path.join(EMBEDDINGS_DIR, f'{slide_name}.npy')
    np.save(output_path, embeddings)
    print(f'Saved embeddings to {output_path} ({embeddings.shape})')
Exporting embeddings for GTEX-1117F-1026.svs...
Saved embeddings to embeddings/GTEX-1117F-1026.npy ((3254, 1536))

Exporting embeddings for GTEX-111FC-0426.svs...
Saved embeddings to embeddings/GTEX-111FC-0426.npy ((655, 1536))
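For large slides the saved matrices can be sizeable, and NumPy can map them lazily instead of reading everything into RAM at once. A small self-contained sketch, with a dummy array standing in for a real embedding file:

```python
import os
import tempfile
import numpy as np

# Dummy embeddings standing in for a real (n_tiles, 1536) feature matrix
dummy = np.random.rand(32, 1536).astype(np.float32)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "slide.npy")
    np.save(path, dummy)

    # mmap_mode="r" maps the file read-only instead of loading it fully
    emb = np.load(path, mmap_mode="r")
    shape, dtype = emb.shape, emb.dtype
    first_row = np.asarray(emb[0])  # only this slice is actually read

print(shape, dtype)  # (32, 1536) float32
```

Memory mapping is particularly useful when you only need a subset of tiles, e.g. for spot-checking a few rows before a full downstream run.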

Verification

Finally, let’s check the embeddings/ directory to confirm that our .npy files have been created successfully.

print(f"Contents of the '{EMBEDDINGS_DIR}' directory:")
for file in os.listdir(EMBEDDINGS_DIR):
    if file.endswith(".npy"):
        print(file)
Contents of the 'embeddings/' directory:
GTEX-111FC-0426.npy
GTEX-1117F-1026.npy
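For downstream analysis you often want all tiles from all slides in one matrix, plus a record of which slide each row came from. A sketch using dummy arrays in place of the saved files (in practice you would build `per_slide` with `np.load` over the `.npy` files); the slide names are taken from the outputs above:

```python
import numpy as np

# Dummy per-slide embeddings standing in for np.load(path) results
per_slide = {
    "GTEX-1117F-1026": np.zeros((3, 1536), dtype=np.float32),
    "GTEX-111FC-0426": np.ones((2, 1536), dtype=np.float32),
}

# Stack into one matrix and keep a parallel array of slide labels per row
all_embeddings = np.vstack(list(per_slide.values()))
slide_labels = np.concatenate(
    [np.full(arr.shape[0], name) for name, arr in per_slide.items()]
)

print(all_embeddings.shape)  # (5, 1536)
print(slide_labels[:3])
```

The parallel `slide_labels` array makes it easy to group, filter, or color tiles by slide in later clustering or visualization steps.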

Resources

Generating_embeddings_for_multiple_slides_locally.ipynb



Latest version: December 16, 2025

Support: [email protected]