Organizing Histology Data
This tutorial explains how to organize locally attached histology data for use with the PICSL Histology Annotation Service (PHAS). It is also possible to organize data on the Google Cloud Storage service (see Histology in Google Cloud Storage).
Prerequisites
A machine with Docker installed
Data Organization Overview
Histology data in PHAS is organized in a hierarchical manner. The hierarchy is as follows:
project -> task -> specimen -> block -> section -> slide
- project
A collection of histology slides that is accessible to a set of users. The project is the top level of the hierarchy.
- task
An activity that is performed by the members of a project. Common tasks include browsing, annotation, or training of machine learning classifiers. A task may include all slides in the project or a subset of slides.
- specimen
Refers to tissue from an individual tissue donor. Each specimen has a unique identifier.
- block
A tissue block extracted from a specimen. Each block has a name, e.g., “left amygdala”. Typically a block will have multiple histology slides with different stains, organized into one or more sections.
- section
The concept of a section arises in projects where tissue is serially sectioned. For example, in one block, we may obtain a Nissl slide every 500 microns, immediately followed by other stains (e.g., myelin, iron). Each repeating sequence of Nissl-myelin-iron is considered a “section”. In diagnostic pathology, we would typically have just one section per block.
- slide
Individual slides within a section. These will typically have different stains.
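The lower levels of the hierarchy can be pictured as a nested mapping. The following is only an illustrative sketch, not how PHAS stores data internally: the record layout, the "name" key, and the sample values for the second and third slides are hypothetical.

```python
from collections import defaultdict


def group_slides(records):
    """Nest slide records as specimen -> block -> section -> [slide names]."""
    tree = defaultdict(lambda: defaultdict(lambda: defaultdict(list)))
    for rec in records:
        tree[rec["specimen"]][rec["block"]][rec["section"]].append(rec["name"])
    return tree


# Hypothetical records: two stains from the same section of one block,
# plus a single slide from a second specimen.
records = [
    {"name": "S1_slide_001", "specimen": "S1", "block": "B06_left", "section": 20, "stain": "Nissl"},
    {"name": "S1_slide_002", "specimen": "S1", "block": "B06_left", "section": 20, "stain": "myelin"},
    {"name": "S2_slide_001", "specimen": "S2", "block": "B01", "section": 1, "stain": "Nissl"},
]

tree = group_slides(records)
for specimen, blocks in tree.items():
    for block, sections in blocks.items():
        for section, slides in sections.items():
            print(specimen, block, section, slides)
```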
Archive Organization
In this tutorial, we assume that the histology images in your archive are in a format that is readable by OpenSlide, for example Aperio SVS.
We will assume that your histology images for a particular project are stored in the directory /data/archive/someproject. Substitute this with the actual path to your data archive.
- Inside of this directory, there is a separate sub-directory for each specimen (e.g., /data/archive/someproject/S1, /data/archive/someproject/S2, etc.).
- For each specimen, there is a directory histo_raw that contains histology images and associated metadata files in .json format.
For example, if we were to list the contents of the archive, we would see the following listing:
> ls /data/archive/someproject/*/histo_raw/*
/data/archive/someproject/S1/histo_raw/S1_slide_001.json
/data/archive/someproject/S1/histo_raw/S1_slide_001.svs
/data/archive/someproject/S1/histo_raw/S1_slide_002.json
/data/archive/someproject/S1/histo_raw/S1_slide_002.svs
/data/archive/someproject/S2/histo_raw/S2_slide_001.json
/data/archive/someproject/S2/histo_raw/S2_slide_001.svs
...
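Before going further, it can help to confirm that every image has its accompanying .json file. The sketch below assumes the layout described above (specimen/histo_raw/ under the project directory) and the svs extension; the function name find_slides is made up for this example.

```python
from pathlib import Path


def find_slides(project_dir, ext="svs"):
    """Return (slide_path, json_path_or_None) pairs found under */histo_raw/."""
    pairs = []
    for slide in sorted(Path(project_dir).glob(f"*/histo_raw/*.{ext}")):
        sidecar = slide.with_suffix(".json")
        pairs.append((slide, sidecar if sidecar.is_file() else None))
    return pairs


if __name__ == "__main__":
    # Substitute the actual path to your data archive.
    for slide, sidecar in find_slides("/data/archive/someproject"):
        status = "ok" if sidecar else "MISSING .json"
        print(f"{slide}: {status}")
```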
The accompanying .json files must include at least three fields: specimen (string), block (string) and stain (string). The files may also include the optional fields section (integer), slide (integer), cert (string, with the special values duplicate and exclude used to prevent slides from being shown to the user), and tags (list of strings). Here is an example of a valid .json file:
{
  "specimen": "S1",
  "block": "B06_left",
  "section": 20,
  "stain": "Nissl",
  "tags": [ "diag", "hippocampus" ]
}
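A quick way to catch malformed metadata files before importing them is to check the field names and types listed above. This is a minimal sketch and not part of PHAS; the function name is hypothetical, and only the scalar fields are checked (tags and the special cert values are left alone).

```python
import json

# Field names and types as listed in the text above.
REQUIRED = {"specimen": str, "block": str, "stain": str}
OPTIONAL = {"section": int, "slide": int, "cert": str}


def check_slide_metadata(text):
    """Return a list of problems in one slide's .json text (empty if none)."""
    meta = json.loads(text)
    problems = []
    for key, typ in REQUIRED.items():
        if key not in meta:
            problems.append(f"missing required field: {key}")
        elif not isinstance(meta[key], typ):
            problems.append(f"field {key} should be {typ.__name__}")
    for key, typ in OPTIONAL.items():
        if key in meta and not isinstance(meta[key], typ):
            problems.append(f"field {key} should be {typ.__name__}")
    return problems


example = """
{
  "specimen": "S1",
  "block": "B06_left",
  "section": 20,
  "stain": "Nissl",
  "tags": [ "diag", "hippocampus" ]
}
"""
print(check_slide_metadata(example))
```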
Preprocessed Data
In addition to the raw data, PHAS requires each slide to have a set of derived files. These files include metadata extracted from the image header and a thumbnail. We recommend placing these files into a separate directory. For example, for slide S1_slide_001.svs above, we may place the derived files into /data/archive/someproject/S1/histo_proc/S1_slide_001/preproc/.
Generating Preprocessed Data using Docker
To run preprocessing for a slide, run the following commands:
# Create the directory for the preprocessed files
mkdir -p /data/archive/someproject/S1/histo_proc/S1_slide_001/preproc
# Generate preprocessed files
docker run \
  -v /data/archive:/data/archive \
  pyushkevich/histo-preproc:latest \
  python3 process_raw_slide.py -i /data/archive/someproject/S1/histo_raw/S1_slide_001.svs \
    -s /data/archive/someproject/S1/histo_proc/S1_slide_001/preproc/S1_slide_001
If successful, the directory /data/archive/someproject/S1/histo_proc/S1_slide_001/preproc will contain files S1_slide_001_thumbnail.tiff, S1_slide_001_metadata.json, and some other files.
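To preprocess many slides, the same command can be built programmatically for each image. The sketch below only constructs the output directory and the argument list, mirroring the command shown above; the function name preproc_command is hypothetical, and the /data/archive mount point is taken from the example and should be adjusted to your archive.

```python
from pathlib import PurePosixPath

ARCHIVE_ROOT = "/data/archive"  # mounted at the same path inside the container


def preproc_command(slide_path):
    """Return (output_dir, docker argv) for one raw slide."""
    slide = PurePosixPath(slide_path)
    specimen_dir = slide.parent.parent  # .../<specimen>
    out_dir = specimen_dir / "histo_proc" / slide.stem / "preproc"
    cmd = [
        "docker", "run",
        "-v", f"{ARCHIVE_ROOT}:{ARCHIVE_ROOT}",
        "pyushkevich/histo-preproc:latest",
        "python3", "process_raw_slide.py",
        "-i", str(slide),
        "-s", str(out_dir / slide.stem),  # output prefix, as in the command above
    ]
    return out_dir, cmd


out_dir, cmd = preproc_command(
    "/data/archive/someproject/S1/histo_raw/S1_slide_001.svs")
print(out_dir)
print(" ".join(cmd))
```

To actually run it, create the directory first with os.makedirs(out_dir, exist_ok=True) and then pass the list to subprocess.run(cmd, check=True).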
Generating Preprocessed Data Manually
If you do not want to use the Docker container, you can download the script process_raw_slide.py from here instead.
Project Descriptor JSON
Once you have organized the data in this manner, you can create a PHAS project as described in PICSL Histology Annotation Service Quick Start. For the organization used in the example above, the project descriptor JSON file would look like this:
{
  "base_url": "/data/archive/someproject",
  "disp_name": "Some Project",
  "desc": "Project demonstrating data organization in PHAS",
  "manifest_mode": "individual_json",
  "url_schema": {
    "pattern": {
      "raw": "{specimen}/histo_raw/{slide_name}.{slide_ext}",
      "thumb": "{specimen}/histo_proc/{slide_name}/preproc/{slide_name}_thumb.tiff",
      "metadata": "{specimen}/histo_proc/{slide_name}/preproc/{slide_name}_metadata.json"
    },
    "raw_slide_ext": [ "svs" ]
  }
}
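To check that the patterns in a descriptor point at the files you expect, you can expand them by hand. The sketch below assumes that each pattern is filled in with the slide's specimen, name, and extension and then appended to base_url; this is an inference from the example and not a description of PHAS internals, and the function name resolve_urls is hypothetical.

```python
descriptor = {
    "base_url": "/data/archive/someproject",
    "url_schema": {
        "pattern": {
            "raw": "{specimen}/histo_raw/{slide_name}.{slide_ext}",
            "thumb": "{specimen}/histo_proc/{slide_name}/preproc/{slide_name}_thumb.tiff",
            "metadata": "{specimen}/histo_proc/{slide_name}/preproc/{slide_name}_metadata.json",
        },
        "raw_slide_ext": ["svs"],
    },
}


def resolve_urls(descriptor, specimen, slide_name, slide_ext):
    """Expand every pattern for one slide (assumed substitution rule)."""
    base = descriptor["base_url"].rstrip("/")
    fields = {"specimen": specimen, "slide_name": slide_name, "slide_ext": slide_ext}
    patterns = descriptor["url_schema"]["pattern"]
    return {kind: f"{base}/{pat.format(**fields)}" for kind, pat in patterns.items()}


urls = resolve_urls(descriptor, "S1", "S1_slide_001", "svs")
for kind, path in urls.items():
    print(f"{kind}: {path}")
```

Comparing the printed paths with the files actually present on disk is a simple way to spot a mismatch between a pattern and the preprocessing output.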