===================================================================
Documentation of Public PAMONO Datasets
===================================================================


==================================
Version: 	2.0
Date: 		2017-08-07
Author: 	Dominic Siedhoff
==================================



0. TERMINOLOGY

The terms "virus", "particle" and "nano-object" are used synonymously in the file names, directory names and data described in this documentation. All of these terms denote the same objects of interest, namely the true positive nano-objects that are sought in the data.


1. DIRECTORY/FILE STRUCTURE

Each .zip file contains one dataset, i.e. data recorded during one PAMONO experiment. This encompasses real data, as measured by the PAMONO sensor, and synthetic data with a ground truth segmentation/classification. Synthetic data is based on real data in combination with the signal model described in [1] (German) and in Chapter 4 of [2] (English).


1.1 DATASETS

Names of the .zip files describe the most important properties of the contained dataset. The information in these names consists of particle size in nanometers (nm), the date when the experiment took place, and an optional identifier in case of multiple experiments from one day.

The mapping from the names used in Table 7.1 of [2] to the .zip file names used here is:
200nm HQ  ---> 200nm_10Apr13
200nm MQ  ---> 200nm_11Apr13_2
200nm LQ  ---> 200nm_11Apr13_1
200nm Gpy ---> 200nm_9May14
100nm HQ  ---> 100nm_27Sep13_exp3
100nm LQ  ---> 100nm_27Sep13_exp2
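
The naming scheme described above can be split into its parts programmatically. The following is a minimal Python sketch, not part of the dataset tooling: the helper name parse_dataset_name is hypothetical, and the code assumes that two-digit years refer to 20xx and that month names are abbreviated in English with three letters, as in the names listed above.

```python
import re

# English three-letter month abbreviations, as seen in names such as "10Apr13".
MONTHS = {m: i + 1 for i, m in enumerate(
    ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
     "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"])}

# Pattern: <size>nm_<day><month><2-digit year>[_<identifier>]
NAME_RE = re.compile(r"^(\d+)nm_(\d{1,2})([A-Za-z]{3})(\d{2})(?:_(.+))?$")


def parse_dataset_name(name):
    """Split a dataset name into size, date and optional identifier.

    Hypothetical helper for illustration; accepts names with or
    without the ".zip" extension.
    """
    stem = name[:-4] if name.lower().endswith(".zip") else name
    match = NAME_RE.match(stem)
    if match is None:
        raise ValueError(f"unexpected dataset name: {name!r}")
    size, day, month, year, identifier = match.groups()
    return {
        "size_nm": int(size),
        # Assumption: two-digit years refer to 20xx.
        "date": (2000 + int(year), MONTHS[month], int(day)),
        # None if the name carries no identifier.
        "identifier": identifier,
    }
```

For example, parse_dataset_name("100nm_27Sep13_exp3") yields a particle size of 100 nm, the date (2013, 9, 27) and the identifier "exp3".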


1.2 REAL DATA

The directory named "real" inside the base directory of each dataset contains data recorded by the PAMONO sensor. It is divided into the directories "background" and "particles". The "background" directory contains measurements taken before particles were injected into the liquid and hence shows only background, artifacts and noise. This is the data used as the particle-less background in the synthesis according to [1,2]. The "particles" directory contains measurements taken after particles were injected into the liquid and hence shows particle adhesions. This is the real data that is ultimately to be analyzed.


1.3 SYNTHETIC DATA

The directory named "synthetic" inside the base directory of each dataset contains synthetic images (.png) annotated with ground truth (.csv). A small set of template particles extracted from the "real/particles" images was synthesized on the "real/background" images. The background images are split into three temporally coherent batches of equal length, stored in the folders "1", "2" and "3". These are intended to be used as training, validation and test data, respectively. These roles can be permuted, hence the names are numbers instead of "training", "validation" and "test" to avoid confusion.
All three parts use different background images and different template particles, so they are completely disjoint.
A typical workflow (as done in [2]) is to optimize parameters and train a classifying model on "1", perform parameter and model selection on "2", and obtain performance estimates by applying the results to "3". The roles can then optionally be permuted, allowing cross-validation.
The directories and files occurring in each of the directories "1", "2" and "3" share the same structure and meaning and are explained in detail in the following sections.
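
The permutation of roles mentioned above can be enumerated as follows. This is a minimal Python sketch under the assumption that every one of the six possible assignments is of interest; the function name role_assignments is hypothetical, and [2] may use only a subset of these assignments.

```python
from itertools import permutations

# Folder names of the three batches inside each "synthetic" directory.
BATCHES = ("1", "2", "3")


def role_assignments():
    """Yield every assignment of the three batches to the three roles."""
    for training, validation, test in permutations(BATCHES):
        yield {"training": training, "validation": validation, "test": test}


assignments = list(role_assignments())
```

The first assignment produced is the default one described above (training on "1", validation on "2", test on "3"). Across all six assignments, each batch serves as test data exactly twice.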



2. DIRECTORIES AND FILES IN THE "synthetic" FOLDERS

2.1 ".png" FILES IN DIRECTORIES "1", "2" and "3"

These ".png" files are the synthesized images, i.e. virus templates taken from real data and synthesized on real background, artifacts and noise, according to the sensor model proposed in [1] and in Chapter 4 of [2].
No particles appear in the first 128 and the last 64 images. Furthermore, no particle intersects the image boundary. The reason for this is to avoid boundary issues: a recall of 1 should be attainable even if large spatio-temporal kernels are used.
The number of particles appearing in total can be inferred from the number of examples in the .csv file described in the next section.
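
The particle-free margins described above determine which images can contain particles at all. The following Python sketch computes that index range; it assumes that the images are ordered temporally and indexed from 0, and the name particle_frame_range is hypothetical.

```python
LEAD_IN = 128   # particle-free images at the start of each sequence
LEAD_OUT = 64   # particle-free images at the end of each sequence


def particle_frame_range(num_images):
    """Return the 0-based index range of images that may contain particles.

    Assumption: images are indexed in temporal order, starting at 0.
    """
    if num_images < LEAD_IN + LEAD_OUT:
        raise ValueError("sequence is shorter than the particle-free margins")
    return range(LEAD_IN, num_images - LEAD_OUT)
```

For a sequence of 1000 images, this yields the indices 128 to 935, i.e. 808 images in which particles may appear.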


2.2 "NanoSynthMLPolygonFormFactors.csv" FILE

This file contains the synthetic ground truth segmentation of all particles appearing in the image files from 2.1. This segmentation contains only particle positives, so a classification is implicitly given as well.

The semicolon ";" is used as the column separator in "NanoSynthMLPolygonFormFactors.csv". The first line contains the column names; the remaining lines contain the data. The most important columns are (in the order in which they appear in the file):

label: class label of the example in the current line. All labels are "virus" here because synthetic ground truth only contains viruses and no false positive detections. As stated in 0., the term 'virus' is synonymous with the terms 'particle', 'nano-object' or 'true positive detection'.

fileName: name of the ".png" file on which the particle was found. Note that this name is an absolute path and might need to be adapted to your local directory structure.

x*, y*: x and y coordinates of the manually segmented polygon delineating the particle adhesion. This polygon is typically larger than machine-detected polygons because it is intended to cover the entire region affected by the particle signal and its nearest surroundings. For this reason, these polygons (and features derived from them) should not be used in learning a model to classify machine-detected data. The polygons in this file rather serve as a basis to which the machine-detected polygons can be matched in order to measure detection quality.
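
Reading the file and adapting the absolute file names can be sketched in Python as follows. The function name load_ground_truth and its parameters are hypothetical. Only the columns "label" and "fileName" are accessed by name; because the exact names of the polygon coordinate columns are not spelled out here, each row is kept unparsed under the key "raw".

```python
import csv
import io
from pathlib import Path, PureWindowsPath


def load_ground_truth(csv_text, local_dir):
    """Parse the ground truth CSV text and remap image paths to local_dir.

    Hypothetical helper: csv_text is the full content of
    "NanoSynthMLPolygonFormFactors.csv", local_dir is the local
    directory that holds the corresponding ".png" files.
    """
    examples = []
    reader = csv.DictReader(io.StringIO(csv_text), delimiter=";")
    for row in reader:
        # PureWindowsPath splits on both "\" and "/", so the base name is
        # recovered regardless of the system on which the file was written.
        base_name = PureWindowsPath(row["fileName"]).name
        examples.append({
            "label": row["label"],
            "image": Path(local_dir) / base_name,
            "raw": row,  # all columns, including the polygon coordinates
        })
    return examples
```

A call could look like load_ground_truth(Path("synthetic/1/NanoSynthMLPolygonFormFactors.csv").read_text(), "synthetic/1"), assuming the archive was unpacked into the current directory. The number of returned examples equals the total number of particles mentioned in 2.1.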


2.3 DIRECTORY "particles_component"

For synthetic data, a perfectly separated particle component is available, which is stored in the subdirectory "particles_component". It contains only the target particles to be detected without any background/artifacts/noise or other effects impeding analysis. Hence this is the pure signal of interest. In Equation 4.1 of [2], this signal corresponds to the T component, i.e. the target signal.
The images in "particles_component" map 1:1 to the images described in 2.1 and 2.4.


2.4 DIRECTORY "background_component"

This directory contains the data complementary to the directory "particles_component": solely the irrelevant background, artifacts and noise, without the particles. The images described in 2.1 are composed according to Equation 4.1 in [2], using the images from 2.4 as the B(ackground), A(rtifacts) and N(oise) components, modulated with the images from 2.3 as the T(arget particles) component.
Note that the directory "background_component" is named merely for the visually most prominent component it contains. However, artifacts and noise are also present because these images were recorded with a real sensor.
The images in "background_component" map 1:1 to the images described in 2.1 and 2.3.
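
The 1:1 mapping between the synthesized images and their two components can be used to iterate over matching triples. The following Python sketch assumes that the mapping follows the sorted order of the ".png" file names in each directory, which is not stated explicitly above and should be verified on the data; the name image_triples is hypothetical.

```python
from pathlib import Path


def image_triples(batch_dir):
    """Pair each synthesized image with its particle and background component.

    Assumption: corresponding images occupy the same position when the
    ".png" files of each directory are sorted by name.
    """
    batch_dir = Path(batch_dir)

    def sorted_pngs(directory):
        return sorted(directory.glob("*.png"))

    synthesized = sorted_pngs(batch_dir)
    particles = sorted_pngs(batch_dir / "particles_component")
    background = sorted_pngs(batch_dir / "background_component")
    if not (len(synthesized) == len(particles) == len(background)):
        raise ValueError("directories do not contain the same number of images")
    return list(zip(synthesized, particles, background))
```

Here batch_dir is one of the directories "1", "2" or "3". The length check guards against incomplete extractions, because a missing file would silently shift the pairing.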





===================================================================
References:
===================================================================

[1] Siedhoff, D., Libuschewski, P., Weichert, F., Zybin, A., Marwedel, P., Müller, H. (2014). "Modellierung und Optimierung eines Biosensors zur Detektion viraler Strukturen" In: Bildverarbeitung für die Medizin 2014 (pp. 108-113). Springer Berlin Heidelberg.

[2] Siedhoff, D. (2016). "A parameter-optimizing model-based approach to the analysis of low-SNR image sequences for biological virus detection" PhD thesis. TU Dortmund University. DOI: http://dx.doi.org/10.17877/DE290R-17272

[3] Siedhoff, D., Fichtenberger, H., Libuschewski, P., Weichert, F., Sohler, C., Müller, H. (2014). "Signal/Background Classification of Time Series for Biological Virus Detection" In: Pattern Recognition. Ed. by X. Jiang, J. Hornegger, and R. Koch. Vol. 8753. Lecture Notes in Computer Science. Springer Berlin Heidelberg. URL: https://link.springer.com/chapter/10.1007/978-3-319-11752-2_31