Overview
=========================

A machine learning pipeline includes many operations that should be kept separate from the underlying engine (such as TensorFlow
or PyTorch) in order to guarantee compatibility, reproducibility, flexibility and, in some cases, efficiency.
These operations include pre- and post-processing, sampling and data loading, which we provide as Python bindings to our
underlying C++ framework.

Training Only
-------------

Some concepts like data-loading and sampling are specifically designed for training purposes.
This chapter concentrates on the data processing which is only available during the training phase.


Dataset
^^^^^^^^^^

The :class:`~imfusion.machinelearning.Dataset` class represents a pipeline for lazily loading and processing data.
Besides loading data from disk, it can, among other things, batch, shuffle, filter and prefetch data, and apply :class:`~imfusion.machinelearning.Operation`\s to it.
It implements Python's :class:`typing.Iterable` protocol, which allows using it in loops & comprehensions and with the :func:`iter` & :func:`next` built-ins.
Performing iterations on a :class:`~imfusion.machinelearning.Dataset` instance yields :class:`~imfusion.machinelearning.DataItem` instances, which are similar to Python dictionaries.

A :class:`~imfusion.machinelearning.Dataset` instance can be created by passing it a list of file specifications.
Each file specification must be a tuple with two entries:

1. A mapping from an unsigned integer index to the name of the field in the resulting :class:`~imfusion.machinelearning.DataItem`. This is needed since a single file might contain more than one image and the integer specifies which one should be picked. The field name specifies the key under which the loaded :class:`~imfusion.machinelearning.DataElement` will be stored in the resulting :class:`~imfusion.machinelearning.DataItem`.
2. A list of filenames.

.. code-block:: python

    import imfusion.machinelearning as ml
    path_list = ['/path/to/file1.imf', '/path/to/file2.imf']
    index_to_field = {0: "input"}
    file_spec = (index_to_field, path_list)
    dataset = ml.Dataset([file_spec])
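
The structure of a file specification can be checked before it is handed to the dataset.
The helper below is a hypothetical, pure-Python sketch (``validate_file_spec`` is not part of ``imfusion.machinelearning``) that only encodes the two rules listed above; the field names ``"input"`` and ``"label"`` are chosen for illustration.

.. code-block:: python

    def validate_file_spec(file_spec):
        """Check the (index -> field name, filenames) structure described above.

        Hypothetical helper for illustration; not part of imfusion.machinelearning.
        """
        if not isinstance(file_spec, tuple) or len(file_spec) != 2:
            raise TypeError("a file specification must be a tuple with two entries")
        index_to_field, path_list = file_spec
        for index, field in index_to_field.items():
            # bool is a subclass of int, so it is excluded explicitly
            if not isinstance(index, int) or isinstance(index, bool) or index < 0:
                raise ValueError(f"index {index!r} must be an unsigned integer")
            if not isinstance(field, str) or not field:
                raise ValueError(f"field name {field!r} must be a non-empty string")
        if not isinstance(path_list, (list, tuple)):
            raise TypeError("filenames must be given as a list")
        if not all(isinstance(path, str) for path in path_list):
            raise TypeError("filenames must be strings")
        return file_spec


    # One file may hold several images: here index 0 is stored under "input"
    # and index 1 under "label".
    spec = validate_file_spec(({0: "input", 1: "label"}, ["/path/to/file1.imf"]))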

The dataset can then be further tuned for e.g. shuffling, batching and repetition:

.. code-block:: python

    batch_size = 16
    num_epochs = 10
    dataset.shuffle()           # shuffle the entire dataset
    dataset.batch(batch_size)   # set the batch size to 16
    dataset.repeat(num_epochs)  # iterate over the data for 10 epochs
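
To illustrate how shuffling, batching and repetition interact, the following is a minimal pure-Python sketch using generators.
It is not the ImFusion implementation: ``lazy_pipeline`` is a hypothetical name, and the choices to reshuffle every epoch and to keep the last, incomplete batch are assumptions made for the example.

.. code-block:: python

    import itertools
    import random


    def lazy_pipeline(samples, batch_size, num_epochs, seed=0):
        """Yield shuffled batches one at a time; batches are built only on demand."""
        rng = random.Random(seed)  # a fixed seed keeps the order reproducible
        for _ in range(num_epochs):
            order = list(samples)
            rng.shuffle(order)  # assumption: a new order for every epoch
            iterator = iter(order)
            while True:
                batch = list(itertools.islice(iterator, batch_size))
                if not batch:
                    break
                yield batch


    batches = list(lazy_pipeline(range(10), batch_size=4, num_epochs=2))
    print(len(batches))  # 6: each epoch yields batches of 4, 4 and 2 samples

Because the pipeline is a generator, no batch is assembled until the consumer asks for it, which mirrors the lazy behaviour described for the dataset.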

Once set up, the dataset can be iterated over and will load and process the data in a lazy fashion.

.. code-block:: python

    for i, item in enumerate(dataset):
        # here the input and label data can be used
        print('Processed: {:.1f}%'.format((i + 1) * 100 / len(dataset)))

If you only want an individual sample, you can also use :func:`next`:

.. code-block:: python

    try:
        item = next(dataset)  # advances the dataset to the next sample
        dataset.reset()       # the iteration can be reset afterwards
    except StopIteration:
        # calling `next` directly raises StopIteration once the dataset is exhausted
        pass
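
As an alternative to catching the exception, Python's built-in :func:`next` accepts a default value as second argument, which is returned once the iterator is exhausted.
The sketch below shows this on a plain list iterator standing in for a dataset; it only helps if the default (here ``None``) cannot be confused with a valid item.

.. code-block:: python

    samples = iter(["sample_0", "sample_1"])  # stand-in for an iterable dataset

    first = next(samples, None)   # "sample_0"
    second = next(samples, None)  # "sample_1"
    third = next(samples, None)   # None: exhausted, but no StopIteration is raised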


ImageROISampler
^^^^^^^^^^^^^^^

There are different situations in which it is necessary to extract regions of interest (ROIs) from images or volumes.
Either the dataset contains many dispensable parts, or the volumes are simply too big to train a network architecture
on them as a whole. In the latter case one could also consider using the :class:`~imfusion.machinelearning.ResampleDims` operation; however, it is sometimes
desirable to sample patches from the image so that a big volume can be predicted patchwise later on.
Furthermore, sampling can constitute a form of data augmentation, which is useful for training scenarios with small datasets.

All sampler classes are a special kind of :class:`~imfusion.machinelearning.Operation` and their names end in ``ROISampler``.
Each sampler class implements a different sampling strategy.
An overview of the available samplers is shown in :doc:`ml_samplers_bindings`.
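
As an illustration of what a sampling strategy does, the following pure-Python sketch crops a randomly placed ROI from a small 2D image stored as nested lists.
It is not one of the ImFusion samplers; ``random_roi`` and its arguments are hypothetical names used only for this example.

.. code-block:: python

    import random


    def random_roi(image, roi_size, rng):
        """Crop a randomly placed ROI of shape roi_size from a 2D nested list."""
        height, width = len(image), len(image[0])
        roi_h, roi_w = roi_size
        if roi_h > height or roi_w > width:
            raise ValueError("ROI must fit inside the image")
        top = rng.randint(0, height - roi_h)   # randint includes both bounds
        left = rng.randint(0, width - roi_w)
        return [row[left:left + roi_w] for row in image[top:top + roi_h]]


    image = [[10 * r + c for c in range(6)] for r in range(5)]  # 5 x 6 test image
    roi = random_roi(image, roi_size=(2, 3), rng=random.Random(42))
    print(len(roi), len(roi[0]))  # 2 3

Passing the random number generator explicitly keeps the sampling reproducible, which is convenient when testing a pipeline.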

If you want to use samplers in your training pipeline as a form of data augmentation, you can combine multiple samplers in a :class:`~imfusion.machinelearning.RandomChoiceOperation` or directly use :meth:`~imfusion.machinelearning.Dataset.sample`.
In both cases the samplers are passed as a list, with each element defined as:
``(name : str, properties : dict)``

This can look like the following:

.. code-block:: python

    sampler_settings = [
        ('RandomROISampler', {'roi_size': [64, 64, 64]}),
        ('LabelROISampler', {'roi_size': [64, 64, 64], 'labels_values': [1, 2]}),
    ]

The class :class:`~imfusion.machinelearning.RandomChoiceOperation` can then be instantiated with the settings:

.. code-block:: python

    sampler_set = ml.RandomChoiceOperation(sampler_settings)

Afterwards, ROIs can be extracted from an image or an image-label-pair (depending on the samplers) by:

.. code-block:: python

    sampler_set.extract_roi(image, label)

In order to register a sampling configuration to a data loader, execute:

.. code-block:: python

    dataset.sample(sampler_settings) # set sampler up
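
The idea of choosing one sampler per call from a list of ``(name, properties)`` settings can be sketched in pure Python as follows.
The two crop functions, the ``SAMPLERS`` table and the uniform random choice are assumptions made for illustration and do not reflect how :class:`~imfusion.machinelearning.RandomChoiceOperation` is implemented.

.. code-block:: python

    import random


    # Hypothetical stand-ins; the real samplers live in imfusion.machinelearning.
    def center_crop(values, roi_size):
        start = (len(values) - roi_size) // 2
        return values[start:start + roi_size]


    def left_crop(values, roi_size):
        return values[:roi_size]


    SAMPLERS = {"CenterCrop": center_crop, "LeftCrop": left_crop}


    def sample_with_random_choice(values, sampler_settings, rng):
        """Pick one (name, properties) entry at random and apply that sampler."""
        name, properties = rng.choice(sampler_settings)
        return name, SAMPLERS[name](values, **properties)


    settings = [("CenterCrop", {"roi_size": 2}), ("LeftCrop", {"roi_size": 3})]
    name, roi = sample_with_random_choice(list(range(8)), settings, random.Random(0))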


Training and Inference
----------------------

This chapter describes the functionality that is available during training and/or inference.

Operation
^^^^^^^^^

Pre- and post-processing steps are operations which are applied successively to images or volumes.
In the machine learning pipeline, some operations make sense as pre- and/or post-processing, and some should be applied
only to the network input, only to the label, or to both.
A list of all available operations can be found in the section :doc:`ml_op_bindings`.
For instance, a :class:`~imfusion.machinelearning.ThresholdOperation` can be defined by:

.. code-block:: python

    op = ml.ThresholdOperation(value=0.5, to_ubyte=False)

Its properties can also be redefined (e.g. when using the default constructor ``ThresholdOperation()``) by:

.. code-block:: python

    op.configure({'value' : 1, 'to_ubyte' : True})

And the defined operation can finally be run on a :class:`~imfusion.SharedImageSet` by:

.. code-block:: python

    op.process(x) # x : SharedImageSet

The ``Operation`` class provides the most common methods needed to define a custom pre- and post-processing pipeline.
In order to set up an operations pipeline, a list of tuples has to be defined as the configuration, with each entry of the form:
``(name : str, properties : dict, phase : imfusion.machinelearning.Phase)``

.. code-block:: python

    from imfusion.machinelearning import Phase
    # Phase: some operations should only be applied during training or validation, others always.
    operation_settings = [('MakeFloat', {'apply_to_label': True}, Phase.Always),
                          ('NormalizeUniform', {'apply_to_label': True}, Phase.Always),
                          ('RandomLinearIntensityMapping', {'probability': 0.5, 'random_range': 0.1}, Phase.Train)]

The operations settings can be used to define a :class:`~imfusion.machinelearning.OperationsSequence`, which can execute the pipeline on a SharedImageSet or two SharedImageSets as input & label pair:

.. code-block:: python

    op_seq = ml.OperationsSequence(operation_settings)
    op_seq.execute(image, label)  # image : imf.SharedImageSet, label : imf.SharedImageSet
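
How the phase entry of each tuple selects the operations to run can be sketched in pure Python.
Everything in this block is a stand-in: the local ``Phase`` enum only mimics :class:`imfusion.machinelearning.Phase` (the member ``Validation`` is assumed), and ``Scale`` and ``Offset`` are hypothetical operations on lists of numbers.

.. code-block:: python

    from enum import Enum


    class Phase(Enum):  # stand-in for imfusion.machinelearning.Phase
        Train = "train"
        Validation = "validation"
        Always = "always"


    OPERATIONS = {  # hypothetical operations acting on a list of numbers
        "Scale": lambda data, factor: [factor * x for x in data],
        "Offset": lambda data, shift: [x + shift for x in data],
    }


    def run_pipeline(data, settings, current_phase):
        """Apply every operation whose phase is Always or matches current_phase."""
        for name, properties, phase in settings:
            if phase in (Phase.Always, current_phase):
                data = OPERATIONS[name](data, **properties)
        return data


    settings = [("Scale", {"factor": 2}, Phase.Always),
                ("Offset", {"shift": 1}, Phase.Train)]
    print(run_pipeline([1, 2], settings, Phase.Train))       # [3, 5]
    print(run_pipeline([1, 2], settings, Phase.Validation))  # [2, 4]

The same settings list therefore yields an augmented pipeline during training and a purely deterministic one otherwise.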

Furthermore, :class:`~imfusion.machinelearning.OperationsSequence` can list all available operations:

.. code-block:: python

    ml.OperationsSequence.available_cpp_operations()
    ml.OperationsSequence.available_py_operations()

In order to register the operations with a :class:`~imfusion.machinelearning.Dataset`, call its ``preprocess`` method with the settings:

.. code-block:: python

    dataset.preprocess(operation_settings)

After registering the preprocessing, a new iterator can be created which performs the preprocessing:

.. code-block:: python

    iterator = iter(dataset)

Writing a custom operation
^^^^^^^^^^^^^^^^^^^^^^^^^^

It is also possible to write custom C++ operations, which can then be used for pre- and post-processing.

.. note::
   The ``Operation`` interface has been generalized to accept not only images but also point clouds and bounding boxes.
   The documentation for this new interface will be available once it becomes more mature.
   In the meantime, the old interface is still available under the name ``OperationV1``.
   All existing operations have also been adapted to avoid breaking changes. You can easily adapt a custom ``OperationV1`` into a new ``Operation`` by using ``ML::ImageOperationAdapter<MyOperationV1>``.


An example of a new class :class:`MyFunctionOperation` would look like this:

.. code-block:: cpp

    class IMFUSION_ML_API MyFunctionOperation : public OperationV1
    {
        public:
            // Default constructor, to be used in combination with configure
            MyFunctionOperation() : OperationV1("MyFunctionOperation", true /* specifies whether to process labels too, in case they are there */) {}

            // Direct constructor
            explicit MyFunctionOperation(int myParam1, std::vector<float> myParam2)
                : MyFunctionOperation()
            {
                m_myParam1 = myParam1;
                m_myParam2 = std::move(myParam2);
            }

            // This function must be overridden, as it specifies the Operation logic
            std::unique_ptr<SharedImageSet> process(std::unique_ptr<SharedImageSet> input) const override;
            // This function needs to be implemented only if you need special treatment of image and label, the default implementation
            // applies the function above to both image and label (depending on the flag).
            std::pair<std::unique_ptr<SharedImageSet>, std::unique_ptr<SharedImageSet>> process(std::unique_ptr<SharedImageSet> input, std::unique_ptr<SharedImageSet> label) const;

        private:
            // Declare the class parameters that you want to configure as OpParam, they'll be auto-configured.
            OpParam<int> m_myParam1 = {"param1", 0 /* default value */, this, ParamRequired::Yes /* if "param1" is not in the configuring Properties, returns an error message */};
            OpParam<std::vector<float>> m_myParam2 = {"param2", std::vector<float>(), this, ParamRequired::No /* this parameter is optional */};
    };
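
The behaviour of required and optional parameters during configuration can be pictured with a small Python analogue.
This is only a sketch of the idea: ``Param`` and ``ConfigurableOperation`` are hypothetical classes, and raising a ``KeyError`` is an assumption standing in for the error message mentioned in the C++ comments.

.. code-block:: python

    class Param:
        """Hypothetical Python analogue of OpParam: a named, optionally required value."""

        def __init__(self, name, default, required):
            self.name, self.value, self.required = name, default, required


    class ConfigurableOperation:
        def __init__(self):
            self.params = [Param("param1", 0, required=True),
                           Param("param2", [], required=False)]

        def configure(self, properties):
            """Copy matching entries from properties; fail on missing required ones."""
            missing = [p.name for p in self.params
                       if p.required and p.name not in properties]
            if missing:
                raise KeyError(f"missing required parameters: {missing}")
            for param in self.params:
                if param.name in properties:
                    param.value = properties[param.name]


    op = ConfigurableOperation()
    op.configure({"param1": 3})  # param2 is optional and keeps its default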

Afterwards, the class has to be registered in :file:`MLPlugin.cpp` by adding the following lines:

.. code-block:: cpp

    auto* factory = ML::getCppOperationFactory();
    factory->registerType<MyFunctionOperation>(std::string("MyFunction"), ML::makeUniqueOperation<MyFunctionOperation>);

Note: On the Python side, the operation is then called :class:`MyFunction`.
The next step is to write proper documentation for the Python side in :file:`MachineLearningBindings.cpp`, describing what the operation and all of its parameters do:

.. code-block:: cpp

    py::class_<MyFunctionOperation, std::shared_ptr<MyFunctionOperation>, Operation>(sm, "MyFunctionOperation", R"DOC(
            What the custom operation does.

            Args:
                my_param (int): What this parameter of the operation does.
                my_param2 (list): What this parameter of the operation does.)DOC")
        .def(py::init<>())
        .def(py::init<int, std::vector<float>>(), "my_param"_a, "my_param2"_a);


Writing a custom random operation
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Operations meant to be used as data augmentation must be randomized. In order to make testing easier, we always write two versions of such operations: a deterministic one and a randomized one (see :class:`~imfusion.machinelearning.RotationOperation` and :class:`~imfusion.machinelearning.RandomRotationOperation` for instance). The :class:`~imfusion.machinelearning.RandomOperation` interface provides an easy way to write such operations.

Let's assume that we want to create a random operation out of :class:`MyFunctionOperation` (defined above), where ``param1`` should be drawn randomly but ``param2`` should still be configurable. Here is what it would look like:

.. code-block:: cpp

    #include <ImFusion/ML/Operation.h>

    class IMFUSION_ML_API MyRandomOperation : public RandomOperationV1<MyFunctionOperation> /* note the template base class */
    {
        public:
            // The constructor still takes ``myParam2``, but instead of ``myParam1`` we now take a range of potential values.
            // The last parameter ``probability`` is also new and comes from ``RandomOperation``. It defines the probability of actually running this random operation (otherwise we just propagate the data as it is).
            explicit MyRandomOperation(vec2i param1Range, std::vector<float> myParam2, double probability = 1.0)
                : RandomOperationV1<MyFunctionOperation>("MyRandomOperation", std::make_unique<MyFunctionOperation>(0, myParam2), probability)
            {
                // We need to mark the parameters to be randomized for internal checks
                setRandomizedParameters(p_myParam1);
            }

            // This function is responsible for randomly setting all necessary parameters
            // We provide the input in case the parameters depend on it (e.g. is it 2D or 3D?)
            // Note that this is the only function that we need to implement
            void randomizeOperation(SharedImageSet* input) const override
            {
                // All random values must be drawn from m_randGenerator (ML::RandomNumberGenerator) - that way, we properly control the seeding
                int randomValue = m_randGenerator.getUniformInt(p_myParam1Range.value()[0], p_myParam1Range.value()[1]);
                // m_baseOperation is a pointer automatically created to the deterministic operation
                m_baseOperation->p_param1 = randomValue;
            }

        private:
            // The only new parameter is the range of potential values of the param1
            OpParam<vec2i> p_myParam1Range = {"param1Range", vec2i(0,2) /* default value */, this, ParamRequired::Yes /* if "param1Range" is not in the configuring Properties, returns an error message */};
    };
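
The pattern of wrapping a deterministic operation in a randomized one can also be sketched in a few lines of Python.
This is an analogue for illustration only, not the ``RandomOperation`` interface: ``ScaleOperation`` and ``RandomScaleOperation`` are hypothetical classes, and the way the probability check is performed is an assumption.

.. code-block:: python

    import random


    class ScaleOperation:
        """Deterministic operation: multiplies every value by ``factor``."""

        def __init__(self, factor=1):
            self.factor = factor

        def process(self, data):
            return [self.factor * x for x in data]


    class RandomScaleOperation:
        """Randomized wrapper: redraws ``factor`` before delegating to the base."""

        def __init__(self, factor_range, probability=1.0, seed=0):
            self.base = ScaleOperation()
            self.factor_range = factor_range
            self.probability = probability
            self.rng = random.Random(seed)  # single seeded source of randomness

        def process(self, data):
            if self.rng.random() >= self.probability:
                return data  # skipped: the data is propagated unchanged
            self.base.factor = self.rng.randint(*self.factor_range)
            return self.base.process(data)


    op = RandomScaleOperation(factor_range=(2, 4), probability=1.0, seed=1)
    result = op.process([1, 2, 3])  # [f, 2 * f, 3 * f] for some f in {2, 3, 4}

Keeping the deterministic operation separate means its logic can be unit-tested with fixed parameters, while the wrapper only has to be tested for how it draws them.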

Don't forget to register it in :file:`MLPlugin.cpp` and to include it in the Python bindings as described above.