Organizing Histology Data

This tutorial explains how to organize locally attached histology data for use with the PICSL Histology Annotation Service (PHAS). It is also possible to organize data on the Google Cloud Storage service (see Histology in Google Cloud Storage).

Prerequisites

A machine with Docker installed

Data Organization Overview

Histology data in PHAS is organized in a hierarchical manner. The hierarchy is as follows:

project -> task -> specimen -> block -> section -> slide

project: A collection of histology slides that is accessible to a set of users. The project is the top level of the hierarchy.
task: An activity that is performed by the members of a project. Common tasks include browsing, annotation, or training of machine learning classifiers. A task may include all slides in the project or a subset of slides.
specimen: Refers to tissue from an individual tissue donor. Each specimen has a unique identifier.
block: A tissue block extracted from a specimen. Each block has a name, e.g., “left amygdala”. Typically a block will have multiple histology slides with different stains, organized into one or more sections.
section: The concept of a section arises in projects where tissue is serially sectioned. For example, in one block, we may obtain a Nissl slide every 500 microns, immediately followed by other stains (e.g., myelin, iron). Each repeating sequence of Nissl-myelin-iron is considered a “section”. In diagnostic pathology, we would typically have just one section per block.
slide: Individual slides within a section. These will typically have different stains.
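
The hierarchy above can be pictured as a nested mapping. The following is a minimal sketch, not part of PHAS: it assumes each slide is described by a record with specimen, block, section and stain keys, and it files slides without a section under section 0, which is an arbitrary choice made for this illustration.

```python
from collections import defaultdict

def group_slides(records):
    """Group slide records as specimen -> block -> section -> [stains]."""
    tree = defaultdict(lambda: defaultdict(lambda: defaultdict(list)))
    for rec in records:
        # Section 0 is a placeholder for slides with no section (assumption).
        section = rec.get("section", 0)
        tree[rec["specimen"]][rec["block"]][section].append(rec["stain"])
    return tree

records = [
    {"specimen": "S1", "block": "B06_left", "section": 20, "stain": "Nissl"},
    {"specimen": "S1", "block": "B06_left", "section": 20, "stain": "myelin"},
    {"specimen": "S1", "block": "B06_left", "section": 21, "stain": "Nissl"},
]
tree = group_slides(records)
print(tree["S1"]["B06_left"][20])
```

Here the two slides that share section 20 end up in the same list, while the slide from section 21 is stored separately.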

Archive Organization

In this tutorial, we assume that the histology images in your archive are in a format that is readable by OpenSlide, such as Aperio SVS.

We will assume that your histology images for a particular project are stored in the directory /data/archive/someproject. Substitute this with the actual path to your data archive.
Inside of this directory, there is a separate sub-directory for each specimen (e.g., /data/archive/someproject/S1, /data/archive/someproject/S2, etc.)
For each specimen, there is a directory histo_raw that contains histology images and associated metadata files in .json format.

For example, if we were to list the contents of the archive, we would see the following listing:

> ls /data/archive/someproject/*/histo_raw/*
/data/archive/someproject/S1/histo_raw/S1_slide_001.json
/data/archive/someproject/S1/histo_raw/S1_slide_001.svs
/data/archive/someproject/S1/histo_raw/S1_slide_002.json
/data/archive/someproject/S1/histo_raw/S1_slide_002.svs
/data/archive/someproject/S2/histo_raw/S2_slide_001.json
/data/archive/someproject/S2/histo_raw/S2_slide_001.svs
...
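
Before going further, it can help to confirm that every image has a metadata file next to it. The sketch below is one possible check, assuming the layout shown above. The function name find_unpaired_slides is made up for this example and is not part of PHAS.

```python
from pathlib import Path

def find_unpaired_slides(project_dir, slide_ext="svs"):
    """Return raw slides under <specimen>/histo_raw that have no .json file."""
    missing = []
    for slide in sorted(Path(project_dir).glob(f"*/histo_raw/*.{slide_ext}")):
        # The metadata file shares the slide's base name, e.g. S1_slide_001.json
        if not slide.with_suffix(".json").is_file():
            missing.append(slide)
    return missing

for slide in find_unpaired_slides("/data/archive/someproject"):
    print("no metadata file for", slide)
```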

The accompanying .json files must include at least three fields: specimen (string), block (string) and stain (string). The files may also include the optional fields section (integer), slide (integer), cert (string, with the special values duplicate and exclude used to prevent slides from being shown to the user), and tags (list of strings). Here is an example of a valid .json file:

{
    "specimen": "S1",
    "block": "B06_left",
    "section": 20,
    "stain": "Nissl",
    "tags": [ "diag", "hippocampus" ]
}
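
The field rules above can be checked with a few lines of Python before the data is imported. This is a minimal sketch rather than the validation PHAS itself performs. It assumes that tags is a JSON array, as in the example, and the name check_slide_metadata is hypothetical.

```python
import json

REQUIRED = {"specimen": str, "block": str, "stain": str}
OPTIONAL = {"section": int, "slide": int, "cert": str, "tags": list}

def _has_type(value, expected):
    # bool is a subclass of int in Python, so reject it for integer fields.
    if expected is int and isinstance(value, bool):
        return False
    return isinstance(value, expected)

def check_slide_metadata(meta):
    """Return a list of problems found in a parsed metadata dictionary."""
    problems = []
    for key, expected in REQUIRED.items():
        if key not in meta:
            problems.append(f"missing required field '{key}'")
        elif not _has_type(meta[key], expected):
            problems.append(f"field '{key}' should be {expected.__name__}")
    for key, expected in OPTIONAL.items():
        if key in meta and not _has_type(meta[key], expected):
            problems.append(f"field '{key}' should be {expected.__name__}")
    tags = meta.get("tags")
    if isinstance(tags, list) and not all(isinstance(t, str) for t in tags):
        problems.append("every entry in 'tags' should be a string")
    return problems

example = json.loads("""
{
    "specimen": "S1",
    "block": "B06_left",
    "section": 20,
    "stain": "Nissl",
    "tags": [ "diag", "hippocampus" ]
}
""")
print(check_slide_metadata(example))  # prints []
```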

Preprocessed Data

In addition to the raw data, PHAS requires that each slide have a set of derived files. These files include metadata extracted from the image header and a thumbnail. We recommend placing these files in a separate directory. For example, for the slide S1_slide_001.svs above, the derived files could go in /data/archive/someproject/S1/histo_proc/S1_slide_001/preproc/.

Generating Preprocessed Data using Docker

To preprocess a slide, run the following Docker command:

# Create the directory for the preprocessed files
mkdir -p /data/archive/someproject/S1/histo_proc/S1_slide_001/preproc

# Generate preprocessed files
docker run \
    -v /data/archive:/data/archive \
    pyushkevich/histo-preproc:latest \
    python3 process_raw_slide.py -i /data/archive/someproject/S1/histo_raw/S1_slide_001.svs \
                                 -s /data/archive/someproject/S1/histo_proc/S1_slide_001/preproc/S1_slide_001

If successful, the directory /data/archive/someproject/S1/histo_proc/S1_slide_001/preproc will contain files S1_slide_001_thumbnail.tiff, S1_slide_001_metadata.json, and some other files.
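
For an archive with many slides, the same two steps can be scripted. The sketch below builds the command shown above for each raw slide, assuming the directory layout from this tutorial. The helper names preproc_command and preprocess_project are made up for this example, and by default the script only prints the commands instead of running them.

```python
import subprocess
from pathlib import Path, PurePosixPath

IMAGE = "pyushkevich/histo-preproc:latest"

def preproc_command(slide, archive_root="/data/archive"):
    """Build the output directory and docker command for one raw slide."""
    slide = PurePosixPath(slide)
    name = slide.stem                    # e.g. S1_slide_001
    specimen_dir = slide.parent.parent   # e.g. /data/archive/someproject/S1
    out_dir = specimen_dir / "histo_proc" / name / "preproc"
    cmd = [
        "docker", "run",
        "-v", f"{archive_root}:{archive_root}",
        IMAGE,
        "python3", "process_raw_slide.py",
        "-i", str(slide),
        "-s", str(out_dir / name),
    ]
    return out_dir, cmd

def preprocess_project(project_dir, archive_root="/data/archive",
                       slide_ext="svs", dry_run=True):
    """Preprocess every raw slide in a project; print commands if dry_run."""
    for slide in sorted(Path(project_dir).glob(f"*/histo_raw/*.{slide_ext}")):
        out_dir, cmd = preproc_command(slide.as_posix(), archive_root)
        if dry_run:
            print(" ".join(cmd))
        else:
            Path(out_dir).mkdir(parents=True, exist_ok=True)
            subprocess.run(cmd, check=True)
```

Calling preprocess_project("/data/archive/someproject") lists the commands that would be run; passing dry_run=False creates the output directories and runs Docker for each slide.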

Generating Preprocessed Data Manually

If you do not want to use the Docker container, you can download the script process_raw_slide.py from here instead.

Project Descriptor JSON

Once you have organized the data in this manner, you can create a PHAS project as described in PICSL Histology Annotation Service Quick Start. For the organization used in the example above, the project descriptor JSON file would look like this:

{
    "base_url": "/data/archive/someproject",
    "disp_name": "Some Project",
    "desc": "Project demonstrating data organization in PHAS",
    "manifest_mode": "individual_json",
    "url_schema": {
        "pattern": {
            "raw": "{specimen}/histo_raw/{slide_name}.{slide_ext}",
            "thumb": "{specimen}/histo_proc/{slide_name}/preproc/{slide_name}_thumbnail.tiff",
            "metadata": "{specimen}/histo_proc/{slide_name}/preproc/{slide_name}_metadata.json"
        },
        "raw_slide_ext": [ "svs" ]
    }
}
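
To see how the patterns in url_schema relate to the files on disk, the placeholders can be expanded by hand. The sketch below does this with Python string formatting. It is only an illustration of the naming scheme, not the code PHAS uses; it assumes base_url is a local path, uses a trimmed copy of the descriptor, and the name resolve_slide_paths is hypothetical.

```python
from pathlib import PurePosixPath

def resolve_slide_paths(descriptor, specimen, slide_name, slide_ext):
    """Expand each url_schema pattern into a full path for one slide."""
    base = PurePosixPath(descriptor["base_url"])
    fields = {
        "specimen": specimen,
        "slide_name": slide_name,
        "slide_ext": slide_ext,
    }
    patterns = descriptor["url_schema"]["pattern"]
    return {kind: str(base / pattern.format(**fields))
            for kind, pattern in patterns.items()}

# Trimmed copy of the descriptor above, keeping two of the patterns.
descriptor = {
    "base_url": "/data/archive/someproject",
    "url_schema": {
        "pattern": {
            "raw": "{specimen}/histo_raw/{slide_name}.{slide_ext}",
            "metadata": "{specimen}/histo_proc/{slide_name}/preproc/{slide_name}_metadata.json",
        },
        "raw_slide_ext": ["svs"],
    },
}

paths = resolve_slide_paths(descriptor, "S1", "S1_slide_001", "svs")
print(paths["raw"])
# prints /data/archive/someproject/S1/histo_raw/S1_slide_001.svs
```

If an expanded path does not point to an existing file, the pattern and the directory layout are out of sync.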