===================================================================
Documentation of Public PAMONO Datasets
===================================================================


==================================
version: 	1.0
date: 		2014-05-13
author: 	Dominic Siedhoff
==================================



0. TERMINOLOGY

The terms "virus", "particle" and "nano-object" are used synonymously in the file names, directory names and data described in this documentation. All of these terms denote the same objects of interest, that is, the true positive nano-objects that are sought in the data.


1. DIRECTORY/FILE STRUCTURE

Each directory (or .zip file) in this directory contains one dataset. A dataset comprises real data (as measured by the PAMONO sensor) and synthetic data (based on the real data) with ground truth segmentation/classification.


1.1 DATASETS

Directory names (or .zip file names) describe the most important properties of the contained dataset. Each name consists of the particle size in nanometers (nm), the date of the experiment and an optional identifier in case there are multiple experiments from one day.


1.2 REAL DATA

The "real" directory inside the base directory of each dataset contains the data measured by the PAMONO sensor. It is divided into the directories "background" and "particles". The "background" directory contains measurements taken before particles were injected into the liquid and hence shows only background and noise. This is the data used as the particle-free background in the synthesis according to [1]. The "particles" directory contains measurements taken after particles were injected into the liquid and hence shows particle adhesions. This is the real data that is ultimately to be analyzed.


1.3 SYNTHETIC DATA

The "synth" directory inside the base directory of each dataset contains synthetic data. Particles from the "real/particles" directory were synthesized on the "real/background" data. The directories containing synthetic data are named "pcountNNN", where NNN is an integer indicating the number of synthetic particles occurring in the data.
On the first level, the "pcountNNN" directory contains a directory called "nstt". This directory indicates the kind of data: "nstt" is shorthand for "Nano-Synthesis: Training and Testing".
In the "nstt" directory there are a "training" and a "testing" directory. The "training" dataset is constructed from the first half of the "real/background" data, the "testing" dataset from the second half. The "testing" dataset is intended for performance estimation: parameters and models can be tuned on the "training" dataset and evaluated on the "testing" dataset to avoid bias.
The directories and files occurring in the "training" and "testing" directories have the same structure and meaning and are explained in detail in the following section.
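
The layout described in 1.3 can be traversed programmatically. The following Python snippet is a minimal sketch, not part of the dataset: the function name "find_synthetic_splits" is hypothetical, and the snippet assumes that NNN consists of decimal digits only and that the dataset has been unpacked from its .zip file.

```python
import re
from pathlib import Path

# Matches directory names of the form "pcountNNN" (NNN assumed to be digits).
PCOUNT_PATTERN = re.compile(r"^pcount(\d+)$")


def find_synthetic_splits(dataset_dir):
    """Map each particle count to its "training" and "testing" directories.

    Follows the layout from section 1.3:
    <dataset>/synth/pcountNNN/nstt/{training,testing}
    """
    splits = {}
    synth_dir = Path(dataset_dir) / "synth"
    if not synth_dir.is_dir():
        return splits
    for entry in sorted(synth_dir.iterdir()):
        match = PCOUNT_PATTERN.match(entry.name)
        if match is None or not entry.is_dir():
            continue  # ignore anything that is not a pcountNNN directory
        nstt_dir = entry / "nstt"
        splits[int(match.group(1))] = {
            "training": nstt_dir / "training",
            "testing": nstt_dir / "testing",
        }
    return splits
```

A call such as find_synthetic_splits("path/to/dataset") (placeholder path) returns a dictionary keyed by particle count, so that models can be tuned on the "training" path and evaluated on the "testing" path of the same entry.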



2. DIRECTORIES AND FILES IN THE SYNTHETIC TRAINING AND TESTING FOLDERS


2.1 ".png" FILES IN DIRECTORY "training" OR "testing"

These ".png" files are the synthesized data, i.e. virus templates taken from real data and synthesized on real background and noise, according to the sensor model proposed in [1].
No particles appear in the first 128 or the last 64 images, and no particle intersects the image boundary. This avoids boundary issues in the training/testing data: a recall of 1 should be attainable even if large spatio-temporal kernels are used.
The total number of particles can be inferred either from the directory name (pcountNNN, where NNN is the number) or from the number of examples in the file described in the next section.
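
The particle-free margins at the start and end of each sequence can be skipped when loading the data. The following Python snippet is a minimal sketch: the function name is hypothetical, and it assumes that sorting the ".png" file names alphabetically reproduces the temporal order of the images, which this documentation does not state.

```python
from pathlib import Path

LEADING_EMPTY = 128   # the first 128 images contain no particles
TRAILING_EMPTY = 64   # the last 64 images contain no particles


def frames_that_may_contain_particles(split_dir):
    """Return the image paths outside the particle-free margins.

    Assumption: alphabetical file name order equals temporal order.
    """
    frames = sorted(Path(split_dir).glob("*.png"))
    stop = max(len(frames) - TRAILING_EMPTY, 0)
    return frames[LEADING_EMPTY:stop]
```

For sequences with fewer than 192 images the function returns an empty list.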


2.2 "NanoSynthMLPolygonFormFactors.csv" FILE

This file contains the synthetic ground truth segmentation and classification for the particles appearing in the files from 2.1. The column separator is ";". The first line contains the column names, the remaining lines contain the data. The most important columns are (in the order in which they appear in the file):

label: class label of the example in the current line. All labels are "virus" because synthetic ground truth only contains viruses and no false positive detections. As stated in 0., "virus" is synonymous with "particle", "nano-object" or "true positive detection".

fileName: name of the ".png" file on which the particle was found. Note that this name is an absolute path and may need to be adapted to the local directory layout.

x*, y*: x and y coordinates of the manually segmented polygon delineating the particle adhesion. This polygon is typically larger than machine-detected polygons because it is intended to cover the entire region affected by the particle signal and its immediate surroundings. For this reason, these polygons (and features derived from them) should not be used to learn a model for classifying machine-detected data. Instead, the polygons in this file serve as a reference against which machine-detected polygons can be matched.
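
The columns listed above can be read with Python's standard "csv" module. The following snippet is a minimal sketch under stated assumptions: the function names are hypothetical, the coordinate columns are assumed to be named "x" or "y" followed by a vertex index (e.g. "x0", "y0", "x1", "y1"), which is one possible reading of "x*, y*", and the file is assumed to be UTF-8 encoded. The absolute "fileName" entries are adapted by keeping only the final path component.

```python
import csv
import re
from pathlib import Path, PureWindowsPath

# Assumption: coordinate columns look like "x0", "y0", "x1", "y1", ...
COORD_PATTERN = re.compile(r"^([xy])(\d+)$")


def local_image_path(recorded_name, split_dir):
    """Replace the recorded absolute directory by the local one.

    PureWindowsPath splits on both "\\" and "/", so the final path
    component is found regardless of where the file was written.
    """
    return Path(split_dir) / PureWindowsPath(recorded_name).name


def read_ground_truth(csv_path, split_dir):
    """Read label, image path and polygon for every ground truth example."""
    examples = []
    # Assumption: the file is UTF-8 encoded.
    with open(csv_path, newline="", encoding="utf-8") as handle:
        for row in csv.DictReader(handle, delimiter=";"):
            coords = {"x": {}, "y": {}}
            for column, value in row.items():
                match = COORD_PATTERN.match(column or "")
                if match and value not in (None, ""):
                    coords[match.group(1)][int(match.group(2))] = float(value)
            # Keep only vertices for which both coordinates are present.
            indices = sorted(set(coords["x"]) & set(coords["y"]))
            examples.append({
                "label": row["label"],
                "image": local_image_path(row["fileName"], split_dir),
                "polygon": [(coords["x"][i], coords["y"][i]) for i in indices],
            })
    return examples
```

The number of returned examples should equal NNN from the enclosing "pcountNNN" directory, which provides a simple consistency check after loading.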


2.3 DIRECTORY "nanoSynthInjector_nosg"

"nosg" is short for Nano-Objects-SiGnal. These images map 1:1 to the images described in 2.1. They contain solely the signal caused by the particles, i.e. the pure signal of interest.


2.4 DIRECTORY "nanoSynthInjector_subs"

This directory contains the data complementary to the directory "nanoSynthInjector_nosg": solely the irrelevant background and noise signal, without the particles. The images described in 2.1 consist of the images from 2.4 as the noise component and the images from 2.3 as the signal component.
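
To work with the decomposition described in 2.3 and 2.4, each synthesized image can be paired with its signal and background components. The following Python snippet is a minimal sketch: the function name is hypothetical, and it assumes that the component images are also stored as ".png" files and that alphabetical file name order yields the 1:1 correspondence. How the two components are combined into the synthesized image is defined by the sensor model in [1] and is deliberately not reproduced here.

```python
from pathlib import Path


def pair_components(split_dir):
    """Return (synthesized, signal, background) path triples.

    Assumptions: all three image sets are ".png" files and correspond
    1:1 when sorted alphabetically by file name.
    """
    split_dir = Path(split_dir)
    combined = sorted(split_dir.glob("*.png"))
    signal = sorted((split_dir / "nanoSynthInjector_nosg").glob("*.png"))
    background = sorted((split_dir / "nanoSynthInjector_subs").glob("*.png"))
    if not (len(combined) == len(signal) == len(background)):
        raise ValueError(
            "image counts differ: %d synthesized, %d signal, %d background"
            % (len(combined), len(signal), len(background))
        )
    return list(zip(combined, signal, background))
```

The length check guards against incomplete extraction of a dataset, since a missing file would otherwise silently shift the pairing.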





===================================================================
References:
===================================================================

[1] Siedhoff, D., Libuschewski, P., Weichert, F., Zybin, A., Marwedel, P., & Müller, H. (2014). Modellierung und Optimierung eines Biosensors zur Detektion viraler Strukturen. In Bildverarbeitung für die Medizin 2014 (pp. 108-113). Springer Berlin Heidelberg.