ImFusion SDK 4.3
Dataset Class Reference

#include <ImFusion/ML/Dataset.h>

Class for creating an iterable dataset by chaining data loading and transforming operations executed in a lazy fashion.

Detailed Description

Class for creating an iterable dataset by chaining data loading and transforming operations executed in a lazy fashion.

Example Usage:

Dataset ds(dataLists);
// define pipeline
ds.preprocess(operations);
ds.memoryCache();
ds.shuffle();
// iterate over the dataset
while (auto item = ds.next())
{
    // do something with item
}

Public Member Functions

 Dataset (bool verbose=false)
 Constructs an empty dataset with no starting loader.
 
 Dataset (const std::vector< FileReaderColumn > &dataLists, bool shuffle=false, bool verbose=false)
 Constructs a dataset from a list of h5 filenames to be loaded.
 
 Dataset (std::unique_ptr< DataReader > reader, bool verbose=false)
 Constructs a dataset from an existing DataReader.
 
 Dataset (const std::string &readerType, const Properties &readerProperties, bool verbose=false)
 Create a Dataset by specifying the DataReader type and properties.
 
std::optional< DataItem > next ()
 Returns the next item in the dataset. When the dataset is exhausted, no element is returned.
 
std::optional< size_t > size () const
 Determines the overall size of the dataset.
 
void reset ()
 Reset the dataset to the beginning of the data loading pipeline.
 
void reinit ()
 Reinitializes the dataset to a state equivalent to its state after construction, clearing all the state surviving reset() (i.e. data caches).
 
Cardinality cardinality () const
 Get cardinality of the data set.
 
Dataset & read (const std::string &readerType, Properties readerProperties, bool verbose=false)
 
Dataset & batch (int batchSize, bool pad=false, int overlap=0)
 Batches the next batchSize items into a single item before returning it.
 
Dataset & split (int numItems=-1)
 Splits the content of the SharedImageSets into SIS containing a single image each.
 
Dataset & repeat (int numEpochRepetitions, int numItemRepetitions=1)
 Repeats the epoch numEpochRepetitions times and each individual item numItemRepetitions times. For both parameters, a value of -1 indicates infinite repetition.
 
Dataset & shuffle (int howMany=-1, int seed=-1)
 Shuffles the dataset.
 
Dataset & map (std::function< void(DataItem &)> func, int numParallelCalls=1)
 Applies a mapping to each item of the dataset.
 
Dataset & map (const std::string &funcKey, int numParallelCalls=1)
 Applies a mapping to each item of the dataset. This overload is useful when configuring the loading pipeline from properties: the user can register a mapping function and use its registry key to configure the map decorator.
 
Dataset & filter (std::function< bool(const DataItem &)> func)
 Filters the dataset according to a user defined criterion.
 
Dataset & filter (const std::string &funcKey)
 Filters the dataset according to a user-defined criterion. This overload is useful when configuring the loading pipeline from properties: the user can register a filtering function and use its registry key to configure the filter decorator.
 
Dataset & prefetch (size_t prefetchSize, bool syncToGl=true)
 Prefetches datasets in a separate thread, independently of the regular pipeline.
 
Dataset & preprocess (const std::vector< Operation::Specs > &preprocPipeline, Phase execPhase=Phase::Always, int numParallelCalls=1)
 Applies a preprocessing pipeline to each item of the dataset.
 
Dataset & preprocess (const std::vector< std::shared_ptr< Operation > > &operations, int numParallelCalls=1)
 Applies a preprocessing pipeline to each item of the dataset.
 
Dataset & sample (const std::vector< Operation::Specs > &samplingPipeline, int samplerSelectionSeed=-1, int numParallelCalls=1)
 Samples from each item of the dataset.
 
Dataset & sample (const std::shared_ptr< ImageROISampler > &sampler, int numParallelCalls)
 Samples from each item of the dataset.
 
Dataset & sample (const std::vector< std::shared_ptr< ImageROISampler > > &samplers, const std::optional< std::vector< float > > &weights, int samplerSelectionSeed=-1, int numParallelCalls=1)
 Samples from each item of the dataset.
 
Dataset & memoryCache (bool makeExclusiveCPU=false, bool lazy=true, int compressionLevel=0, bool shuffle=false, int numThreads=2)
 Caches the dataset already loaded.
 
Dataset & diskCache (const std::string &location, bool lazy=true, bool reload=false, bool compression=false, bool shuffle=false)
 Caches the dataset loaded in a persistent manner (on a disk location).
 
template<typename Loader, typename... Params>
Dataset & chain (Params &&... params)
 Chains the list of loaders with a custom defined one on top.
 
void buildPipeline (const std::vector< DataLoaderSpecs > &specsList, Phase configPhase=Phase::Always)
 Configures decorator calls via a list of Properties.
 
void setRandomSeed (unsigned int seed)
 Seeds the data loading pipeline.
 
bool verbose () const
 
void setVerbose (bool verbose)
 

Constructor & Destructor Documentation

◆ Dataset()

Dataset ( const std::vector< FileReaderColumn > & dataLists,
bool shuffle = false,
bool verbose = false )
explicit

Constructs a dataset from a list of h5 filenames to be loaded.

Note
It is assumed that two SIS are loaded from each file: the first is mapped to "data" and the second to "label".

Member Function Documentation

◆ size()

std::optional< size_t > size ( ) const

Determines the overall size of the dataset.

Note
If the dataset does not have a fixed cardinality, its full size cannot be determined. In that case the returned optional is empty.
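
Because the result is a std::optional, callers need to handle the unknown-size case explicitly. The following is a minimal standalone sketch of one way to do that; progressLabel is a hypothetical helper for illustration and is not part of the SDK.

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <string>

// Hypothetical helper, not part of the SDK: formats "done/total" and falls
// back to "done/?" when the total is unknown (empty optional).
std::string progressLabel(std::size_t done, const std::optional<std::size_t>& total)
{
    // Toy precondition: a known total is never exceeded.
    assert(!total || done <= *total);
    const std::string suffix = total ? std::to_string(*total) : std::string("?");
    return std::to_string(done) + "/" + suffix;
}
```

With a real dataset, the value returned by size() would presumably be passed as the second argument.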

◆ reset()

void reset ( )

Reset the dataset to the beginning of the data loading pipeline.

This is useful, for instance, when the dataset has been completely consumed, i.e. at the end of a training epoch. Note: reset() does not necessarily restore the state of the dataset completely; state such as data caches or seeds survives this call.

◆ reinit()

void reinit ( )

Reinitializes the dataset to a state equivalent to its state after construction, clearing all the state surviving reset() (i.e. data caches).
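
The difference between reset() and reinit() can be illustrated with a toy model. The sketch below does not use the SDK: ToyCachedDataset and runEpoch are hypothetical names, and the cache is a plain std::map. It only mirrors the documented behaviour that reset() rewinds the pipeline while cached data survives, whereas reinit() also clears the cache.

```cpp
#include <cstddef>
#include <map>
#include <optional>
#include <utility>
#include <vector>

// Toy stand-in for a cached dataset (not the SDK class). It counts how often
// the "expensive" source is actually read.
class ToyCachedDataset
{
public:
    explicit ToyCachedDataset(std::vector<int> source) : m_source(std::move(source)) {}

    // Returns the next item, or an empty optional when the epoch is over.
    std::optional<int> next()
    {
        if (m_pos >= m_source.size())
            return std::nullopt;
        const std::size_t i = m_pos++;
        auto it = m_cache.find(i);
        if (it == m_cache.end())
        {
            ++m_loads; // cache miss: read from the source
            it = m_cache.emplace(i, m_source[i]).first;
        }
        return it->second;
    }

    void reset() { m_pos = 0; }                   // rewind only, the cache survives
    void reinit() { m_pos = 0; m_cache.clear(); } // rewind and drop the cache

    int loads() const { return m_loads; }

private:
    std::vector<int> m_source;
    std::map<std::size_t, int> m_cache;
    std::size_t m_pos = 0;
    int m_loads = 0;
};

// Consumes one full epoch and returns the number of items seen.
int runEpoch(ToyCachedDataset& ds)
{
    int count = 0;
    while (auto item = ds.next())
        ++count;
    return count;
}
```

In this model, a second epoch after reset() is served entirely from the cache, while an epoch after reinit() reads every item from the source again.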

◆ batch()

Dataset & batch ( int batchSize,
bool pad = false,
int overlap = 0 )

Batches the next batchSize items into a single item before returning it.

Parameters
batchSize  Number of consecutive items to batch together.
pad  Whether to ensure that the last batch has the specified size. If true, the last item is repeatedly added to the batch until its size equals batchSize.
overlap  Number of overlapping items shared by consecutive batches. Example: batch(3, false, 1): [1, 2, 3, 4, 5, 6, 7, 8] -> [{1, 2, 3}, {3, 4, 5}, {5, 6, 7}, {7, 8}]
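
The interplay of batchSize, pad and overlap can be sketched on plain integers. The function below is a standalone model, not SDK code: toyBatch is a hypothetical name, and stopping once the tail has been emitted is an assumption chosen so that the example above is reproduced.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Standalone model of the batching rule (not SDK code). Consecutive batches
// share `overlap` items, so the start index advances by batchSize - overlap.
// With pad == true a short last batch is filled by repeating its last item.
std::vector<std::vector<int>> toyBatch(const std::vector<int>& items, int batchSize,
                                       bool pad = false, int overlap = 0)
{
    assert(batchSize > 0 && overlap >= 0 && overlap < batchSize);
    const std::size_t size = static_cast<std::size_t>(batchSize);
    const std::size_t stride = static_cast<std::size_t>(batchSize - overlap);

    std::vector<std::vector<int>> batches;
    for (std::size_t start = 0; start < items.size(); start += stride)
    {
        const std::size_t end = std::min(start + size, items.size());
        std::vector<int> batch(items.begin() + static_cast<std::ptrdiff_t>(start),
                               items.begin() + static_cast<std::ptrdiff_t>(end));
        if (pad)
        {
            const int last = batch.back();
            batch.resize(size, last); // repeat the last item up to batchSize
        }
        batches.push_back(std::move(batch));
        if (end == items.size())
            break; // ASSUMPTION: stop once the tail has been emitted
    }
    return batches;
}
```

Calling toyBatch on the values 1 to 8 with batchSize 3 and overlap 1 yields the four batches from the example; with pad set to true the last batch becomes {7, 8, 8}.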

◆ split()

Dataset & split ( int numItems = -1)

Splits the content of the SharedImageSets into SIS containing a single image each.

Parameters
numItems  Return only the first numItems items.
Note
This choice will make the dataset uncountable.

◆ shuffle()

Dataset & shuffle ( int howMany = -1,
int seed = -1 )

Shuffles the dataset.

Parameters
howMany  Number of consecutive elements to be shuffled. Defaults to -1, which shuffles the entire dataset.
seed  Seeds the shuffling.
Exceptions
DataLoaderException  If howMany is not specified and the dataset is not countable.
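
The effect of howMany and the countability requirement can be modelled on plain integers. The sketch below is not SDK code. It assumes that howMany denotes the length of consecutive windows that are shuffled independently, which is only one possible reading of the description. toyShuffle is a hypothetical name, std::runtime_error stands in for DataLoaderException, and the seed is simplified to an unsigned value.

```cpp
#include <algorithm>
#include <cstddef>
#include <optional>
#include <random>
#include <stdexcept>
#include <vector>

// Standalone model (not SDK code). ASSUMPTION: howMany is the length of
// consecutive windows that are shuffled independently; -1 means one window
// spanning the whole dataset, which requires a known size.
std::vector<int> toyShuffle(std::vector<int> items, std::optional<std::size_t> knownSize,
                            int howMany = -1, unsigned int seed = 0)
{
    if (howMany < 0 && !knownSize)
        throw std::runtime_error("howMany not given and dataset not countable");

    const std::size_t window = howMany < 0 ? items.size() : static_cast<std::size_t>(howMany);
    if (window == 0)
        return items; // nothing to shuffle

    std::mt19937 rng(seed); // same seed, same order
    for (std::size_t start = 0; start < items.size(); start += window)
    {
        const std::size_t end = std::min(start + window, items.size());
        std::shuffle(items.begin() + static_cast<std::ptrdiff_t>(start),
                     items.begin() + static_cast<std::ptrdiff_t>(end), rng);
    }
    return items;
}
```

Under this reading, a small howMany keeps items close to their original position, which is why a full shuffle needs the dataset size up front.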

◆ map() [1/2]

Dataset & map ( std::function< void(DataItem &)> func,
int numParallelCalls = 1 )

Applies a mapping to each item of the dataset.

Parameters
func  Mapping function to be applied to each DataItem.
numParallelCalls  Number of asynchronous threads used for the mapping.

◆ map() [2/2]

Dataset & map ( const std::string & funcKey,
int numParallelCalls = 1 )

Applies a mapping to each item of the dataset. This overload is useful when configuring the loading pipeline from properties: the user can register a mapping function and use its registry key to configure the map decorator.

Parameters
funcKey  Name of the mapping function registered in the MapFuncRegistry.
numParallelCalls  Number of asynchronous threads used for the mapping.
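
The idea of selecting a mapping function by its registry key can be sketched without the SDK. The interface of MapFuncRegistry is not shown in this reference, so ToyMapRegistry, ToyItem and toyMapByKey below are hypothetical stand-ins that only illustrate the lookup-then-apply pattern.

```cpp
#include <functional>
#include <map>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

// Hypothetical toy types; the real DataItem and MapFuncRegistry are not used.
using ToyItem = std::vector<float>;
using ToyMapFunc = std::function<void(ToyItem&)>;

// Maps string keys to in-place mapping functions.
class ToyMapRegistry
{
public:
    void add(const std::string& key, ToyMapFunc func) { m_funcs[key] = std::move(func); }

    const ToyMapFunc& get(const std::string& key) const
    {
        auto it = m_funcs.find(key);
        if (it == m_funcs.end())
            throw std::out_of_range("no mapping function registered as '" + key + "'");
        return it->second;
    }

private:
    std::map<std::string, ToyMapFunc> m_funcs;
};

// Applies the function registered under funcKey to every item, in place.
void toyMapByKey(std::vector<ToyItem>& items, const ToyMapRegistry& registry,
                 const std::string& funcKey)
{
    const ToyMapFunc& func = registry.get(funcKey);
    for (ToyItem& item : items)
        func(item);
}
```

Because the function is referenced by a string, the same pipeline description can be stored in a configuration file and resolved at run time.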

◆ filter() [1/2]

Dataset & filter ( std::function< bool(const DataItem &)> func)

Filters the dataset according to a user-defined criterion.

Parameters
func  Filtering function to be applied to each DataItem.
Note
Using filter makes the dataset dynamic, since the output of func is conditional.

◆ filter() [2/2]

Dataset & filter ( const std::string & funcKey)

Filters the dataset according to a user-defined criterion. This overload is useful when configuring the loading pipeline from properties: the user can register a filtering function and use its registry key to configure the filter decorator.

Parameters
funcKey  Name of the filtering function registered in the FilterFuncRegistry.
Note
Using filter makes the dataset dynamic, since the output of the filtering function is conditional.
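
Why filtering makes the size unknowable in advance can be shown with a toy model. The sketch below is not SDK code; ToyFiltered is a hypothetical class, and returning an empty optional from size() is an assumption about what "dynamic" implies here.

```cpp
#include <cstddef>
#include <functional>
#include <optional>
#include <utility>
#include <vector>

// Toy model (not SDK code) of a lazily filtered source. The predicate is only
// evaluated while iterating, so the number of surviving items is not known up
// front and size() reports an empty optional.
class ToyFiltered
{
public:
    ToyFiltered(std::vector<int> source, std::function<bool(const int&)> keep)
        : m_source(std::move(source)), m_keep(std::move(keep))
    {
    }

    // Skips rejected items and returns the next accepted one, if any.
    std::optional<int> next()
    {
        while (m_pos < m_source.size())
        {
            const int candidate = m_source[m_pos++];
            if (m_keep(candidate))
                return candidate;
        }
        return std::nullopt;
    }

    // ASSUMPTION: a filtered dataset cannot report a fixed size.
    std::optional<std::size_t> size() const { return std::nullopt; }

private:
    std::vector<int> m_source;
    std::function<bool(const int&)> m_keep;
    std::size_t m_pos = 0;
};
```

Operations that need a countable dataset, such as a full shuffle, would then have to be placed before the filter or be given an explicit window.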

◆ prefetch()

Dataset & prefetch ( size_t prefetchSize,
bool syncToGl = true )

Prefetches datasets in a separate thread, independently of the regular pipeline.

The user can define how many images are prefetched.

Parameters
prefetchSize  Size of the prefetch queue; up to prefetchSize images are pre-loaded.
syncToGl  If true, the images are synchronized to the GPU memory after they have been prefetched.

◆ preprocess() [1/2]

Dataset & preprocess ( const std::vector< Operation::Specs > & preprocPipeline,
Phase execPhase = Phase::Always,
int numParallelCalls = 1 )

Applies a preprocessing pipeline to each item of the dataset.

Parameters
preprocPipeline  Specifies how to perform the preprocessing.
execPhase  Specifies the execution phase of the preprocessing pipeline.
numParallelCalls  Number of asynchronous threads used for the preprocessing.

◆ preprocess() [2/2]

Dataset & preprocess ( const std::vector< std::shared_ptr< Operation > > & operations,
int numParallelCalls = 1 )

Applies a preprocessing pipeline to each item of the dataset.

Parameters
operations  Operations that perform the processing.
numParallelCalls  Number of asynchronous threads used for the preprocessing.

◆ memoryCache()

Dataset & memoryCache ( bool makeExclusiveCPU = false,
bool lazy = true,
int compressionLevel = 0,
bool shuffle = false,
int numThreads = 2 )

Caches the dataset already loaded.

Parameters
makeExclusiveCPU  Releases the GPU memory if true.
lazy  Caches items when requested (otherwise caches the whole dataset at initialization).
compressionLevel  Controls compression. Higher means more compression, but slower. 0 disables compression.
shuffle  Reshuffles the order of the cache every epoch.
numThreads  The number of items to prefetch from the cache in the background. The cache needs to create copies of the data, so fetching is more expensive than one may anticipate. Only has an effect if makeExclusiveCPU is true.
Exceptions
DataLoaderException  If the dataset is not countable.
std::bad_alloc  If the system runs out of memory.

◆ diskCache()

Dataset & diskCache ( const std::string & location,
bool lazy = true,
bool reload = false,
bool compression = false,
bool shuffle = false )

Caches the dataset loaded in a persistent manner (on a disk location).

Parameters
location  Path to the folder where all the data will be cached.
lazy  Caches items when requested (otherwise caches the whole dataset at initialization).
reload  Tries to reload the cache from a previous session.
compression  Enables saving with compression.
shuffle  Reshuffles the order of the cache every epoch.
Exceptions
DataLoaderException  If the dataset is not countable.

◆ chain()

template<typename Loader, typename... Params>
Dataset & chain ( Params &&... params)
inline

Chains the list of loaders with a custom defined one on top.

Template Parameters
Loader  A data loader implementing the DataLoader interface.
Parameters
params  The construction parameters of the Loader class.
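
The decorator-style chaining behind this template can be sketched with toy types. The DataLoader interface itself is not shown in this reference, so ToyLoader, ToyVectorLoader, ToyOffsetLoader and ToyPipeline below are hypothetical. The sketch also assumes that a custom loader receives the loader it wraps as its first constructor argument, followed by the forwarded params.

```cpp
#include <cstddef>
#include <memory>
#include <optional>
#include <utility>
#include <vector>

// Hypothetical minimal loader interface (not the SDK's DataLoader).
struct ToyLoader
{
    virtual ~ToyLoader() = default;
    virtual std::optional<int> next() = 0;
};

// Source loader that yields the values of a vector.
class ToyVectorLoader : public ToyLoader
{
public:
    explicit ToyVectorLoader(std::vector<int> values) : m_values(std::move(values)) {}
    std::optional<int> next() override
    {
        if (m_pos >= m_values.size())
            return std::nullopt;
        return m_values[m_pos++];
    }

private:
    std::vector<int> m_values;
    std::size_t m_pos = 0;
};

// Custom decorator: adds a constant offset to every item of the wrapped loader.
class ToyOffsetLoader : public ToyLoader
{
public:
    ToyOffsetLoader(std::unique_ptr<ToyLoader> inner, int offset)
        : m_inner(std::move(inner)), m_offset(offset)
    {
    }
    std::optional<int> next() override
    {
        auto item = m_inner->next();
        if (item)
            *item += m_offset;
        return item;
    }

private:
    std::unique_ptr<ToyLoader> m_inner;
    int m_offset;
};

class ToyPipeline
{
public:
    explicit ToyPipeline(std::unique_ptr<ToyLoader> source) : m_top(std::move(source)) {}

    // Wraps the current top loader in a new Loader, forwarding the extra
    // constructor arguments, and returns *this so that calls can be chained.
    template <typename Loader, typename... Params>
    ToyPipeline& chain(Params&&... params)
    {
        m_top = std::make_unique<Loader>(std::move(m_top), std::forward<Params>(params)...);
        return *this;
    }

    std::optional<int> next() { return m_top->next(); }

private:
    std::unique_ptr<ToyLoader> m_top;
};
```

Returning a reference to the pipeline is what allows several chain calls to be written in one expression, in the same way as the other decorator methods that return Dataset &.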

The documentation for this class was generated from the following file: