Overview
A machine learning pipeline includes many operations that should be kept separate from the underlying engine (such as TensorFlow or PyTorch) in order to ensure compatibility, reproducibility, flexibility and, in some cases, efficiency. These operations include pre- and postprocessing, sampling and data loading, which we provide as Python bindings to our underlying C++ framework.
Training Only
Some concepts like data-loading and sampling are specifically designed for training purposes. This chapter concentrates on the data processing which is only available during the training phase.
Dataset
The Dataset class represents a pipeline for lazily loading and processing data.
Besides loading data from disk, it can batch, shuffle, filter and prefetch data, and apply imfusion.machinelearning.Operation instances to the data, among other things.
It implements Python’s typing.Iterable protocol, which allows using it in loops and comprehensions and with the iter() and next() built-ins.
Iterating over a Dataset instance yields DataItem instances, which are similar to Python dictionaries.
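The lazy iteration described above can be pictured with a small stand-in class. This is only a sketch and does not use imfusion: LazyDataset, fake_load and load_count are hypothetical names, and a plain dict plays the role of a DataItem.

```python
from typing import Callable, Dict, Iterator, List


class LazyDataset:
    """Hypothetical stand-in for a lazily loading dataset (not the imfusion class)."""

    def __init__(self, paths: List[str], load: Callable[[str], Dict[str, str]]):
        self._paths = paths
        self._load = load
        self.load_count = 0  # counts how many files were actually loaded

    def __iter__(self) -> Iterator[Dict[str, str]]:
        # Nothing is loaded until the consumer asks for the next item.
        for path in self._paths:
            self.load_count += 1
            yield self._load(path)


def fake_load(path: str) -> Dict[str, str]:
    # Pretend to read a file; a dict plays the role of a DataItem.
    return {"input": "contents of " + path}


dataset = LazyDataset(["a.imf", "b.imf", "c.imf"], fake_load)

iterator = iter(dataset)               # creating the iterator loads nothing
first = next(iterator)                 # loads only the first file
loads_after_first = dataset.load_count

# A comprehension starts a fresh pass and loads all three files.
fields = [item["input"] for item in dataset]
```

The point of the sketch is that work happens per item, on demand, so a consumer that stops early never pays for the remaining files.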
A Dataset instance can be created by passing it a list of file specifications.
Each file specification must be a tuple with two entries:
1. A mapping from an unsigned integer index to the name of the field in the resulting DataItem. This is needed since a single file might contain more than one image, and the integer specifies which one should be picked. The field name specifies the key under which the loaded DataElement will be stored in the resulting DataItem.
2. A list of filenames.
import imfusion.machinelearning as ml
path_list = ['/path/to/file1.imf', '/path/to/file2.imf']
index_to_field = {0: "input"}  # image 0 of each file is stored under the key "input"
file_spec = (index_to_field, path_list)
dataset = ml.Dataset([file_spec])
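To make the role of the two tuple entries more concrete, the following sketch lists which image of which file would end up under which field name. It is purely illustrative and does not load anything or call imfusion: describe_items is a hypothetical helper, and the second field name "label" is made up for the example.

```python
from typing import Dict, List, Tuple


def describe_items(
    index_to_field: Dict[int, str], paths: List[str]
) -> List[Dict[str, Tuple[str, int]]]:
    """For every file, map each field name to (filename, image index)."""
    return [
        {field: (path, image_index) for image_index, field in index_to_field.items()}
        for path in paths
    ]


# Hypothetical case: every file holds two images, an input and a label.
index_to_field = {0: "input", 1: "label"}
path_list = ["/path/to/file1.imf", "/path/to/file2.imf"]

items = describe_items(index_to_field, path_list)
```

Each entry of items corresponds to one file and has one key per mapped index, which mirrors how a field name selects a stored element in a dictionary-like item.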
The dataset can then be configured further, e.g. for shuffling, batching and repetition:
batch_size = 16
num_epochs = 10
dataset.shuffle() # shuffle the entire dataset
dataset.batch(batch_size) # yield batches of 16 samples
dataset.repeat(num_epochs) # repeat the dataset for 10 epochs
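For readers unfamiliar with these three concepts, the sketch below shows one common way they interact, using only the standard library. It is not the imfusion implementation: epochs_of_batches is a hypothetical function, and reshuffling on every epoch as well as keeping the last, smaller batch are assumptions made for the example.

```python
import random
from typing import Iterator, List


def epochs_of_batches(
    samples: List[int], batch_size: int, num_epochs: int, seed: int = 0
) -> Iterator[List[int]]:
    """Yield shuffled batches, repeating the whole sample list num_epochs times."""
    rng = random.Random(seed)
    for _ in range(num_epochs):
        order = samples[:]        # copy so the caller's list stays untouched
        rng.shuffle(order)        # assumption: a new order is drawn every epoch
        for start in range(0, len(order), batch_size):
            # assumption: the last, possibly smaller batch is kept
            yield order[start:start + batch_size]


batches = list(epochs_of_batches(list(range(10)), batch_size=4, num_epochs=2))
```

With 10 samples and a batch size of 4, each epoch produces three batches (4, 4 and 2 samples), so two epochs yield six batches in total.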
Once set up, the dataset can be iterated over and will load and process the data in a lazy fashion.
for i, item in enumerate(dataset):
    # here input and labels data can be used
    print('Processed: {:.1f}%'.format((i + 1) * 100 / len(dataset)))
If you only want an individual sample, you can also use next():
try:
    item = next(dataset)  # this will advance the dataset to the next sample
    dataset.reset()  # we can reset the iteration of the dataset afterwards
except StopIteration:  # beware that calling `next` directly might raise StopIteration if the dataset has been exhausted
    pass
ImageROISampler
There are different situations where it is necessary to extract Regions-Of-Interest (ROIs) from images/volumes.
Either the dataset contains many dispensable parts, or the images are simply too big to train a network architecture on an entire volume. In the latter case one could also consider using the ResampleDims operation; however, it is sometimes desirable to sample patches from the image instead, so that a big volume can be predicted patchwise later on.
Furthermore, sampling can constitute a form of data augmentation, which is useful for training scenarios with small datasets.
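The basic idea of random ROI sampling can be sketched without any imaging library: pick a start index per axis so that a patch of the requested size lies fully inside the volume. The function random_roi_bounds and the example shapes are hypothetical and only illustrate the principle, not how the imfusion samplers are implemented.

```python
import random
from typing import List, Sequence, Tuple


def random_roi_bounds(
    volume_shape: Sequence[int], roi_size: Sequence[int], rng: random.Random
) -> List[Tuple[int, int]]:
    """Pick a random axis-aligned ROI that lies fully inside the volume."""
    bounds = []
    for extent, size in zip(volume_shape, roi_size):
        if size > extent:
            raise ValueError("ROI is larger than the volume along one axis")
        start = rng.randint(0, extent - size)  # randint is inclusive on both ends
        bounds.append((start, start + size))   # half-open interval [start, stop)
    return bounds


rng = random.Random(42)
volume_shape = (256, 256, 128)
rois = [random_roi_bounds(volume_shape, (64, 64, 64), rng) for _ in range(5)]
```

Because every call draws a different location, repeated passes over the same volume see different patches, which is why sampling also acts as a form of augmentation.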
All sampler classes are a special kind of Operation and their names end in ROISampler.
Each sampler class implements a different sampling strategy.
An overview of the available samplers is shown in List of Samplers.
If you want to use samplers in your training pipeline as a form of data augmentation, you can combine multiple samplers in a RandomChoiceOperation or directly use sample().
In both cases, the samplers are specified as a list, with each element defined as:
(name : str, properties : dict)
This can look like the following:
sampler_settings = [
    ('RandomROISampler', {'roi_size': [64, 64, 64]}),
    ('LabelROISampler', {'roi_size': [64, 64, 64], 'labels_values': [1, 2]}),
]
The class RandomChoiceOperation can then be instantiated with the settings:
sampler_set = ml.RandomChoiceOperation(sampler_settings)
Afterwards, ROIs can be extracted from an image or an image-label-pair (depending on the samplers) by:
sampler_set.extract_roi(image, label)
In order to register a sampling configuration to a data loader, execute:
dataset.sample(sampler_settings) # set sampler up
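The (name, properties) configuration pattern used above can be illustrated with a small registry that turns each entry into a callable and then picks one of them at random per call. This is a sketch of the general idea only: REGISTRY, build_samplers and the string-returning stand-in samplers are hypothetical and do not reflect how RandomChoiceOperation is implemented.

```python
import random
from typing import Callable, Dict, List, Tuple

Sampler = Callable[[str], str]
Settings = List[Tuple[str, dict]]


def make_random_sampler(properties: dict) -> Sampler:
    # Stand-in: returns a description instead of cropping real image data.
    return lambda image: "random ROI {} from {}".format(properties["roi_size"], image)


def make_label_sampler(properties: dict) -> Sampler:
    return lambda image: "ROI {} around labels {} from {}".format(
        properties["roi_size"], properties["labels_values"], image
    )


# Hypothetical registry mapping sampler names to factories.
REGISTRY: Dict[str, Callable[[dict], Sampler]] = {
    "RandomROISampler": make_random_sampler,
    "LabelROISampler": make_label_sampler,
}


def build_samplers(settings: Settings) -> List[Sampler]:
    """Instantiate one sampler per (name, properties) entry."""
    return [REGISTRY[name](properties) for name, properties in settings]


sampler_settings = [
    ("RandomROISampler", {"roi_size": [64, 64, 64]}),
    ("LabelROISampler", {"roi_size": [64, 64, 64], "labels_values": [1, 2]}),
]

samplers = build_samplers(sampler_settings)
rng = random.Random(0)
# Each call picks one of the configured samplers at random.
results = [rng.choice(samplers)("volume") for _ in range(20)]
```

Describing samplers as plain data keeps the configuration serializable and lets the same list be reused in several places.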
Training and Inference
This chapter describes the additional functionality that can be applied during training and/or inference.
Operation
Pre- and post-processing steps are operations that are applied successively to images/volumes.
In the machine learning pipeline, some operations make sense as pre-processing, as post-processing or as both, and some should be applied only to the network input, only to the label, or to both.
A list of all available operations can be extracted from the documentation: Operations documentation.
For instance, a ThresholdOperation can be defined by:
op = ml.ThresholdOperation(value=0.5, to_ubyte=False)
Its properties can also be redefined (e.g. when using the default constructor ThresholdOperation()) by:
op.configure({'value' : 1, 'to_ubyte' : True})
And the defined operation can finally be run on a SharedImageSet by:
op.process(x) # x : SharedImageSet
The Operation class comprises the most common methods to define a custom pre- and post-processing pipeline.
In order to set up an operations pipeline, a list of tuples has to be defined as configuration, with each entry of the form:
(name : str, properties : dict, phase : imfusion.machinelearning.Phase)
from imfusion.machinelearning import OperationsSequence, Phase
operation_settings = [
    ('MakeFloat', {'apply_to_label': True}, Phase.Always),
    ('NormalizeUniform', {'apply_to_label': True}, Phase.Always),
    ('RandomLinearIntensityMapping', {'probability': 0.5, 'random_range': 0.1}, Phase.Train),
]
# Phase: some operations should only be applied during training or validation, others always.
The operation settings can be used to define an OperationsSequence, which can execute the pipeline on a single SharedImageSet or on two SharedImageSets as an input and label pair:
op_seq = OperationsSequence(operation_settings)
op_seq.execute(image, label)  # image : imf.SharedImageSet, label : imf.SharedImageSet
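The effect of the phase entry can be sketched with a stand-in pipeline that applies its steps in order and skips those that do not belong to the current phase. StandInPhase, run_pipeline and the two toy steps are hypothetical; in particular, the enum is not imfusion.machinelearning.Phase and its members are chosen only for illustration.

```python
from enum import Enum
from typing import Callable, List, Tuple


class StandInPhase(Enum):
    """Hypothetical stand-in for a phase flag."""
    ALWAYS = "always"
    TRAIN = "train"
    VALIDATION = "validation"


Step = Tuple[str, Callable[[float], float], StandInPhase]


def run_pipeline(
    steps: List[Step], value: float, current: StandInPhase
) -> Tuple[float, List[str]]:
    """Apply the steps in order, skipping those not meant for the current phase."""
    applied = []
    for name, func, phase in steps:
        if phase is StandInPhase.ALWAYS or phase is current:
            value = func(value)
            applied.append(name)
    return value, applied


steps = [
    ("scale", lambda v: v / 255.0, StandInPhase.ALWAYS),   # deterministic normalization
    ("augment", lambda v: v + 0.1, StandInPhase.TRAIN),    # training-only perturbation
]

train_value, train_applied = run_pipeline(steps, 255.0, StandInPhase.TRAIN)
val_value, val_applied = run_pipeline(steps, 255.0, StandInPhase.VALIDATION)
```

Keeping augmentation steps out of validation in this way means that validation results stay comparable between runs.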
Furthermore, OperationsSequence allows listing all available operations:
OperationsSequence.available_cpp_operations()
OperationsSequence.available_py_operations()
In order to register the operations with a Dataset, call its preprocess method with the settings:
dataset.preprocess(operation_settings)
After registering the preprocessing, a new iterator can be created which performs the preprocessing:
iterator = iter(dataset)
Writing a custom operation
It is also possible to write custom C++ operations, which can then be used for pre- and postprocessing as well.
Note
The Operation interface has been generalized to accept not only images but also point clouds and bounding boxes.
The documentation for this new interface will be available once it becomes more mature.
In the meantime, the old interface is still available under the name OperationV1.
All existing operations have also been adapted to avoid breaking changes. You can also easily adapt a custom OperationV1 into a new Operation by using ML::ImageOperationAdapter<MyOperationV1>.
An example for a new class MyFunctionOperation would look like this:
class IMFUSION_ML_API MyFunctionOperation : public OperationV1
{
public:
    // Default constructor, to be used in combination with configure
    MyFunctionOperation() : OperationV1("MyFunctionOperation", true /* specifies whether to process labels too, in case they are there */) {}

    // Direct constructor
    explicit MyFunctionOperation(int myParam1, std::vector<float> myParam2)
        : MyFunctionOperation()
    {
        m_myParam1 = myParam1;
        m_myParam2 = std::move(myParam2);
    }

    // This function must be overridden, as it specifies the Operation logic
    std::unique_ptr<SharedImageSet> process(std::unique_ptr<SharedImageSet> input) const override;

    // This function needs to be implemented only if you need special treatment of image and label, the default implementation
    // applies the function above to both image and label (depending on the flag).
    std::pair<std::unique_ptr<SharedImageSet>, std::unique_ptr<SharedImageSet>> process(std::unique_ptr<SharedImageSet> input, std::unique_ptr<SharedImageSet> label) const;

private:
    // Declare the class parameters that you want to configure as OpParam, they'll be auto-configured.
    OpParam<int> m_myParam1 = {"param1", 0 /* default value */, this, ParamRequired::Yes /* if "param1" is not in the configuring Properties, returns an error message */};
    OpParam<std::vector<float>> m_myParam2 = {"param2", std::vector<float>(), this, ParamRequired::No /* this parameter is optional */};
};
Afterwards, the class has to be registered in MLPlugin.cpp by adding the following lines:
auto* factory = ML::getCppOperationFactory();
factory->registerType<MyFunctionOperation>(std::string("MyFunction"), ML::makeUniqueOperation<MyFunctionOperation>);
Note: On the Python side the operation is then called MyFunction.
The next step is to write proper documentation for the Python side in MachineLearningBindings.cpp, which describes what the operation and all of its parameters do:
py::class_<MyFunctionOperation, std::shared_ptr<MyFunctionOperation>, Operation>(sm, "MyFunctionOperation", R"DOC(
    What the custom operation does.

    Args:
        my_param (int): What this parameter of the operation does.
        my_param2 (list): What this parameter of the operation does.)DOC")
    .def(py::init<>())
    .def(py::init<int, std::vector<float>>(), "my_param"_a, "my_param2"_a);
Writing a custom random operation
Operations meant to be used as data augmentation must be randomized. In order to make testing easier, we always write two versions of such operations: a deterministic one and a randomized one (see RotationOperation and RandomRotationOperation for instance). The RandomOperation interface provides an easy way to write such operations.
Let’s assume that we want to create a random operation out of MyFunctionOperation (defined above), where param1 should be drawn randomly but param2 should still be configurable. Here is what it would look like:
#include <ImFusion/ML/Operation.h>

class IMFUSION_ML_API MyRandomOperation : public RandomOperationV1<MyFunctionOperation> /* note the template base class */
{
public:
    // The constructor still takes ``myParam2`` but instead of ``myParam1`` we now take a range of potential values.
    // The last parameter ``probability`` is also new and comes from ``RandomOperation``. It defines the probability
    // of actually running this random operation (otherwise we just propagate the data as it is).
    explicit MyRandomOperation(vec2i param1Range, std::vector<float> myParam2, double probability = 1.0)
        : RandomOperationV1<MyFunctionOperation>("MyRandomOperation", std::make_unique<MyFunctionOperation>(0, myParam2), probability)
    {
        // Store the range given to the constructor so that it is used when drawing values
        p_myParam1Range = param1Range;
        // We need to mark the parameters to be randomized for internal checks
        setRandomizedParameters(p_myParam1);
    }

    // This function is responsible for randomly setting all necessary parameters.
    // We provide the input in case the parameters depend on it (e.g. is it 2D or 3D?)
    // Note that this is the only function that we need to implement
    void randomizeOperation(SharedImageSet* input) const override
    {
        // All random values must be drawn from m_randGenerator (ML::RandomNumberGenerator) - that way, we properly control the seeding
        int randomValue = m_randGenerator.getUniformInt(p_myParam1Range.value()[0], p_myParam1Range.value()[1]);
        // m_baseOperation is a pointer automatically created to the deterministic operation
        m_baseOperation->p_param1 = randomValue;
    }

private:
    // The only new parameter is the range of potential values of param1
    OpParam<vec2i> p_myParam1Range = {"param1Range", vec2i(0,2) /* default value */, this, ParamRequired::Yes /* if "param1Range" is not in the configuring Properties, returns an error message */};
};
Don’t forget to register it in MLPlugin.cpp, and to include it in the Python bindings as described above.
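The split into a deterministic operation and a randomized wrapper is a general pattern that can also be sketched in a few lines of Python. The classes AddOffset and RandomAddOffset below are hypothetical toy examples and not part of the framework; they only mirror the three ingredients described above: a wrapped deterministic operation, a single seeded random generator, and a probability of running at all.

```python
import random
from typing import List, Tuple


class AddOffset:
    """Deterministic stand-in operation: adds a fixed offset to every value."""

    def __init__(self, offset: int):
        self.offset = offset

    def process(self, values: List[int]) -> List[int]:
        return [v + self.offset for v in values]


class RandomAddOffset:
    """Randomized wrapper: draws the offset from a range, then delegates."""

    def __init__(self, offset_range: Tuple[int, int], probability: float = 1.0, seed: int = 0):
        self.offset_range = offset_range
        self.probability = probability
        self._rng = random.Random(seed)  # single seeded source of randomness
        self._base = AddOffset(0)        # deterministic operation being wrapped

    def process(self, values: List[int]) -> List[int]:
        if self._rng.random() >= self.probability:
            return list(values)          # skipped: pass the data through unchanged
        # Randomize the parameter, then run the deterministic logic.
        self._base.offset = self._rng.randint(*self.offset_range)
        return self._base.process(values)


skipped = RandomAddOffset((1, 3), probability=0.0).process([1, 2])
first_run = RandomAddOffset((1, 3), seed=7).process([10, 20])
second_run = RandomAddOffset((1, 3), seed=7).process([10, 20])
```

Because the wrapper only chooses parameters, the processing logic can be tested once on the deterministic class, and the randomized variant stays reproducible for a fixed seed.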