#include <ImFusion/ML/Dataset.h>

Class for creating an iterable dataset by chaining data loading and transforming operations executed in a lazy fashion.

Detailed Description

Class for creating an iterable dataset by chaining data loading and transforming operations executed in a lazy fashion.

Example Usage:

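// files to load (left empty here for brevity)
std::vector<FileReaderColumn> dataLists;
Dataset ds(dataLists);
// define the pipeline; each call appends a stage and returns the dataset
std::vector<std::shared_ptr<Operation>> operations;
ds.preprocess(operations);
ds.memoryCache();
ds.shuffle();
// iterate over the dataset; next() returns an empty optional once it is exhausted
while (auto item = ds.next())
{
   // do something with *item
}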

Public Member Functions
	Dataset (bool verbose=false)
	Constructs an empty dataset with no starting loader.

	Dataset (const std::vector< FileReaderColumn > &dataLists, bool shuffle=false, bool verbose=false)
	Constructs a dataset from a list of h5 filenames to be loaded.

	Dataset (std::unique_ptr< DataReader > reader, bool verbose=false)
	Constructs a dataset from an existing DataReader.

	Dataset (const std::string &readerType, const Properties &readerProperties, bool verbose=false)
	Create a Dataset by specifying the DataReader type and properties.

std::optional< DataItem >	next ()
	Returns the next item in the dataset. When the dataset is exhausted, an empty optional is returned.

std::optional< size_t >	size () const
	Determines the overall size of the dataset.

void	reset ()
	Reset the dataset to the beginning of the data loading pipeline.

void	reinit ()
	Reinitializes the dataset to a state equivalent to its state after construction, clearing all the state that survives reset() (i.e. data caches).

Cardinality	cardinality () const
	Get cardinality of the data set.

Dataset &	read (const std::string &readerType, Properties readerProperties, bool verbose=false)

Dataset &	batch (int batchSize, bool pad=false, int overlap=0)
	Batches the next `batchSize` items into a single item before returning it.

Dataset &	split (int numItems=-1)
	Splits the content of the SharedImagesSets into SIS containing a single image.

Dataset &	repeat (int numEpochRepetitions, int numItemRepetitions=1)
	Repeats the epoch `numEpochRepetitions` times and each individual item `numItemRepetitions` times. For both parameters, a value of -1 indicates infinite repetition.

Dataset &	shuffle (int howMany=-1, int seed=-1)
	Shuffles the dataset.

Dataset &	map (std::function< void(DataItem &)> func, int numParallelCalls=1)
	Applies a mapping to each item of the dataset.

Dataset &	map (const std::string &funcKey, int numParallelCalls=1)
	Applies a mapping to each item of the dataset. This overload is useful when configuring the loading pipeline from properties: the user can register a mapping function and pass its registry key to the map decorator.

Dataset &	filter (std::function< bool(const DataItem &)> func)
	Filters the dataset according to a user defined criterion.

Dataset &	filter (const std::string &funcKey)
	Filters the dataset according to a user-defined criterion. This overload is useful when configuring the loading pipeline from properties: the user can register a filtering function and pass its registry key to the filter decorator.

Dataset &	prefetch (size_t prefetchSize, bool syncToGl=true)
	Prefetches datasets in a separate thread, independently of the regular pipeline.

Dataset &	preprocess (const std::vector< Operation::Specs > &preprocPipeline, Phase execPhase=Phase::Always, int numParallelCalls=1)
	Applies a preprocessing pipeline to each item of the dataset.

Dataset &	preprocess (const std::vector< std::shared_ptr< Operation > > &operations, int numParallelCalls=1)
	Applies a preprocessing pipeline to each item of the dataset.

Dataset &	sample (const std::vector< Operation::Specs > &samplingPipeline, int samplerSelectionSeed=-1, int numParallelCalls=1)
	Samples from each item of the dataset.

Dataset &	sample (const std::shared_ptr< ImageROISampler > &sampler, int numParallelCalls)
	Samples from each item of the dataset.

Dataset &	sample (const std::vector< std::shared_ptr< ImageROISampler > > &samplers, const std::optional< std::vector< float > > &weights, int samplerSelectionSeed=-1, int numParallelCalls=1)
	Samples from each item of the dataset.

Dataset &	memoryCache (bool makeExclusiveCPU=false, bool lazy=true, int compressionLevel=0, bool shuffle=false, int numThreads=2)
	Caches the dataset already loaded.

Dataset &	diskCache (const std::string &location, bool lazy=true, bool reload=false, bool compression=false, bool shuffle=false)
	Caches the loaded dataset persistently (in a location on disk).

template<typename Loader, typename... Params>
Dataset &	chain (Params &&... params)
	Chains the list of loaders with a custom defined one on top.

void	buildPipeline (const std::vector< DataLoaderSpecs > &specsList, Phase configPhase=Phase::Always)
	Configures the decorator calls from a list of Properties.

void	setRandomSeed (unsigned int seed)
	Seeds the data loading pipeline.

bool	verbose () const

void	setVerbose (bool verbose)

Constructor & Destructor Documentation

◆ Dataset()

Dataset	(	const std::vector< FileReaderColumn > &	dataLists,
		bool	shuffle = false,
		bool	verbose = false )

explicit

Constructs a dataset from a list of h5 filenames to be loaded.

Note: It will be assumed that from each file two SIS are loaded, the first is mapped to "data" and the second to "label".

Member Function Documentation

◆ size()

std::optional< size_t > size ( ) const

Determines the overall size of the dataset.

Note: If the dataset doesn't have a fixed cardinality, its full size cannot be determined. In that case the returned optional is empty.

◆ reset()

void reset ( )

Reset the dataset to the beginning of the data loading pipeline.

This is useful, for instance, when the dataset has been completely consumed, e.g. at the end of a training epoch. Note: reset() does not necessarily restore the complete state of the dataset; data caches and random seeds, for example, survive this call.
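The interplay of next() and reset() across several epochs can be illustrated with a toy stand-in. This is only a sketch: ToyDataset and sumOverEpochs are hypothetical names that model the optional-returning next() and the rewinding reset() on plain integers. They are not part of the ImFusion API and say nothing about caches or seeds.

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <utility>
#include <vector>

// Hypothetical stand-in for Dataset: it only models next() and reset().
class ToyDataset
{
public:
	explicit ToyDataset(std::vector<int> items)
		: m_items(std::move(items))
	{
	}

	// Returns the next item, or an empty optional once the data is exhausted.
	std::optional<int> next()
	{
		if (m_pos >= m_items.size())
			return std::nullopt;
		return m_items[m_pos++];
	}

	// Rewinds to the first item so that another epoch can be consumed.
	void reset() { m_pos = 0; }

private:
	std::vector<int> m_items;
	std::size_t m_pos = 0;
};

// Consumes the dataset numEpochs times and returns the sum of all items seen.
inline long sumOverEpochs(ToyDataset& ds, int numEpochs)
{
	assert(numEpochs >= 0);
	long sum = 0;
	for (int epoch = 0; epoch < numEpochs; ++epoch)
	{
		while (auto item = ds.next())
			sum += *item; // a training step would use *item here
		ds.reset();       // without this call the next epoch would yield nothing
	}
	return sum;
}
```

The optional is tested for presence, not for its value, so an item equal to 0 does not end the loop early.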

◆ reinit()

void reinit ( )

Reinitializes the dataset to a state equivalent to its state after construction, clearing all the state that survives reset() (i.e. data caches).

◆ batch()

Dataset & batch	(	int	batchSize,
		bool	pad = false,
		int	overlap = 0 )

Batches the next batchSize items into a single item before returning it.

Parameters

batchSize	Number of consecutive items to batch together
pad	Whether to ensure that the last batch has the specified size. If true, the last item is appended repeatedly until the batch contains batchSize items
overlap	Number of overlapping items shared by consecutive batches. Example: batch(3, false, 1): [1, 2, 3, 4, 5, 6, 7, 8] -> [{1, 2, 3}, {3, 4, 5}, {5, 6, 7}, {7, 8}]
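The effect of pad and overlap can be reproduced on plain integers. The following is a minimal sketch of the documented semantics, not the library's implementation: batchItems is a hypothetical helper, and it assumes that batching stops as soon as a batch reaches the last item.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Standalone illustration of batching with overlap and padding.
// Assumption: no further batch is emitted once a batch contains the last item.
inline std::vector<std::vector<int>> batchItems(const std::vector<int>& items, int batchSize, bool pad = false, int overlap = 0)
{
	assert(batchSize > 0 && overlap >= 0 && overlap < batchSize);
	const std::size_t size = static_cast<std::size_t>(batchSize);
	// Consecutive batches start (batchSize - overlap) items apart.
	const std::size_t stride = static_cast<std::size_t>(batchSize - overlap);

	std::vector<std::vector<int>> batches;
	for (std::size_t start = 0; start < items.size(); start += stride)
	{
		const std::size_t end = std::min(start + size, items.size());
		std::vector<int> batch(items.begin() + start, items.begin() + end);
		// With pad == true, repeat the last item until the batch is full.
		while (pad && batch.size() < size)
			batch.push_back(batch.back());
		batches.push_back(std::move(batch));
		if (end == items.size())
			break;
	}
	return batches;
}
```

With the items 1 to 8, batchItems(items, 3, false, 1) yields the four batches of the example above, and pad = true turns the last batch into {7, 8, 8}.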

◆ split()

Dataset & split ( int numItems = -1 )

Splits the content of the SharedImagesSets into SIS containing a single image.

Parameters

numItems	If specified, split returns only the first numItems items

Note: This choice will make the dataset uncountable.

◆ shuffle()

Dataset & shuffle	(	int	howMany = -1,
		int	seed = -1 )

Shuffles the dataset.

Parameters

howMany	Number of consecutive elements to be shuffled. Defaults to -1, which shuffles the entire dataset
seed	Seed for the shuffling

Exceptions

DataLoaderException if howMany is not specified and the dataset is not countable.
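One plausible reading of howMany is that items are shuffled within consecutive windows of that many elements. The sketch below illustrates that reading on plain integers. It is an interpretation, not the library's implementation: shuffleWindows is a hypothetical helper, and treating a negative seed as a request for a non-deterministic seed is an assumption.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <random>
#include <vector>

// Shuffles items within consecutive windows of howMany elements.
// Assumptions: howMany == -1 means one window covering everything, and a
// negative seed means "pick a non-deterministic seed".
inline std::vector<int> shuffleWindows(std::vector<int> items, int howMany = -1, int seed = -1)
{
	assert(howMany == -1 || howMany > 0);
	std::mt19937 rng(seed < 0 ? std::random_device{}() : static_cast<unsigned int>(seed));
	const std::size_t window = (howMany == -1) ? items.size() : static_cast<std::size_t>(howMany);

	for (std::size_t start = 0; start < items.size(); start += window)
	{
		const std::size_t end = std::min(start + window, items.size());
		// Elements never leave their window, so no look-ahead beyond it is needed.
		std::shuffle(items.begin() + start, items.begin() + end, rng);
	}
	return items;
}
```

Under this reading, a bounded window only needs howMany items at a time, whereas shuffling everything requires the total number of items, which is consistent with the exception raised for uncountable datasets.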

◆ map() [1/2]

Dataset & map	(	std::function< void(DataItem &)>	func,
		int	numParallelCalls = 1 )

Applies a mapping to each item of the dataset.

Parameters

func	mapping function to be applied on each DataItem
numParallelCalls	how many asynchronous threads are used for the mapping

◆ map() [2/2]

Dataset & map	(	const std::string &	funcKey,
		int	numParallelCalls = 1 )

Applies a mapping to each item of the dataset. This overload is useful when configuring the loading pipeline from properties: the user can register a mapping function and pass its registry key to the map decorator.

Parameters

funcKey	Name of the mapping function registered in the MapFuncRegistry
numParallelCalls	how many asynchronous threads are used for the mapping

◆ filter() [1/2]

Dataset & filter ( std::function< bool(const DataItem &)> func )

Filters the dataset according to a user defined criterion.

Parameters

func	filtering function to be applied on each DataItem

Note: Using filter makes the dataset dynamic, since the output of func is conditional.
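Why a filtered dataset has no fixed size can be seen in a toy model. The sketch below assumes that "dynamic" means the number of accepted items is unknown until every item has been tested. ToyFilter is a hypothetical class operating on plain integers, not the ImFusion filter decorator.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <optional>
#include <utility>
#include <vector>

// Hypothetical stand-in for a filtered loader over plain ints.
class ToyFilter
{
public:
	ToyFilter(std::vector<int> items, std::function<bool(const int&)> keep)
		: m_items(std::move(items))
		, m_keep(std::move(keep))
	{
		assert(m_keep);
	}

	// Skips rejected items lazily and returns the next accepted one, if any.
	std::optional<int> next()
	{
		while (m_pos < m_items.size())
		{
			const int candidate = m_items[m_pos++];
			if (m_keep(candidate))
				return candidate;
		}
		return std::nullopt;
	}

	// The number of accepted items is unknown until every item has been tested,
	// so no size is reported.
	std::optional<std::size_t> size() const { return std::nullopt; }

private:
	std::vector<int> m_items;
	std::function<bool(const int&)> m_keep;
	std::size_t m_pos = 0;
};
```

A single call to next() may consume several underlying items before one passes the criterion, which is why the item count cannot be derived from the source size.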

◆ filter() [2/2]

Dataset & filter ( const std::string & funcKey )

Filters the dataset according to a user-defined criterion. This overload is useful when configuring the loading pipeline from properties: the user can register a filtering function and pass its registry key to the filter decorator.

Parameters

funcKey Name of the filtering function registered in the FilterFuncRegistry

Note: Using filter makes the dataset dynamic, since the output of the filtering function is conditional.

◆ prefetch()

Dataset & prefetch	(	size_t	prefetchSize,
		bool	syncToGl = true )

Prefetches datasets in a separate thread, independently of the regular pipeline.

The user can define how many images are prefetched.

Parameters

prefetchSize	Size of the prefetch queue; up to prefetchSize images are pre-loaded.
syncToGl	If true, the images are synchronized to the GPU memory after they have been pre-fetched.

◆ preprocess() [1/2]

Dataset & preprocess	(	const std::vector< Operation::Specs > &	preprocPipeline,
		Phase	execPhase = Phase::Always,
		int	numParallelCalls = 1 )

Applies a preprocessing pipeline to each item of the dataset.

Parameters

preprocPipeline	specifies how to perform the preprocessing
execPhase	specifies the execution phase of the preprocessing pipeline
numParallelCalls	how many asynchronous threads are used for the preprocessing

◆ preprocess() [2/2]

Dataset & preprocess	(	const std::vector< std::shared_ptr< Operation > > &	operations,
		int	numParallelCalls = 1 )

Applies a preprocessing pipeline to each item of the dataset.

Parameters

operations	operations that perform the processing
numParallelCalls	how many asynchronous threads are used for the preprocessing

◆ memoryCache()

Dataset & memoryCache	(	bool	makeExclusiveCPU = false,
		bool	lazy = true,
		int	compressionLevel = 0,
		bool	shuffle = false,
		int	numThreads = 2 )

Caches the dataset already loaded.

Parameters

makeExclusiveCPU	releases the GPU memory if true
lazy	caches items when requested (otherwise caches the whole dataset at initialization)
compressionLevel	controls compression. Higher means more compression, but slower. 0 disables compression
shuffle	reshuffle the order of the cache every epoch
numThreads	the number of items to prefetch from the cache in the background. The cache needs to create copies of the data so fetching is more expensive than one may anticipate. Only has an effect if makeExclusiveCPU is true.

Exceptions

DataLoaderException	if the dataset is not countable
std::bad_alloc	if the system runs out of memory

◆ diskCache()

Dataset & diskCache	(	const std::string &	location,
		bool	lazy = true,
		bool	reload = false,
		bool	compression = false,
		bool	shuffle = false )

Caches the loaded dataset persistently (in a location on disk).

Parameters

location	path to the folder where all the data will be cached
lazy	caches items when requested (otherwise caches the whole dataset at initialization)
reload	try to reload the cache from a previous session
compression	enable saving with compression
shuffle	reshuffle the order of the cache every epoch

Exceptions

DataLoaderException if the dataset is not countable

◆ chain()

template<typename Loader, typename... Params>

Dataset & chain ( Params &&... params )

inline

Chains the list of loaders with a custom defined one on top.

Template Parameters

Loader A data loader implementing the DataLoader interface.

Parameters

params The construction parameters of the Loader class.
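The pattern behind chain(), a variadic template that forwards its arguments to the constructor of a custom loader placed on top of the existing ones, can be sketched in isolation. Everything below is hypothetical: ToyLoader, CountingLoader, ScaleLoader, and ToyPipeline are illustrative names, and the convention that a loader receives the previous loader as its first constructor argument is an assumption, not the documented DataLoader interface.

```cpp
#include <cassert>
#include <memory>
#include <optional>
#include <utility>

// Hypothetical loader interface; the real DataLoader interface may differ.
struct ToyLoader
{
	virtual ~ToyLoader() = default;
	virtual std::optional<int> next() = 0;
};

// Source loader yielding 0, 1, ..., count - 1; it ignores any previous loader.
class CountingLoader : public ToyLoader
{
public:
	CountingLoader(std::unique_ptr<ToyLoader> /*previous*/, int count)
		: m_count(count)
	{
	}

	std::optional<int> next() override
	{
		if (m_next >= m_count)
			return std::nullopt;
		return m_next++;
	}

private:
	int m_count;
	int m_next = 0;
};

// Decorator multiplying every item of the wrapped loader by a factor.
class ScaleLoader : public ToyLoader
{
public:
	ScaleLoader(std::unique_ptr<ToyLoader> previous, int factor)
		: m_previous(std::move(previous))
		, m_factor(factor)
	{
		assert(m_previous);
	}

	std::optional<int> next() override
	{
		auto item = m_previous->next();
		if (item)
			*item *= m_factor;
		return item;
	}

private:
	std::unique_ptr<ToyLoader> m_previous;
	int m_factor;
};

class ToyPipeline
{
public:
	// Assumption: every Loader takes the previous loader as its first
	// constructor argument, followed by the forwarded params.
	template <typename Loader, typename... Params>
	ToyPipeline& chain(Params&&... params)
	{
		m_head = std::make_unique<Loader>(std::move(m_head), std::forward<Params>(params)...);
		return *this;
	}

	std::optional<int> next()
	{
		if (!m_head)
			return std::nullopt;
		return m_head->next();
	}

private:
	std::unique_ptr<ToyLoader> m_head;
};
```

Returning a reference to the pipeline allows calls to be chained, e.g. pipeline.chain<CountingLoader>(3).chain<ScaleLoader>(10), in the same way the Dataset member functions return Dataset &.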

The documentation for this class was generated from the following file:

ImFusion/ML/Dataset.h
