BigMultiPipe

class bigmultipipe.BigMultiPipe(num_processes=None, mem_available=None, mem_frac=0.8, process_size=None, outdir=None, create_outdir=False, outname_append='_bmp', pre_process_list=None, post_process_list=None, PoolClass=None, **kwargs)[source]

Bases: object

Base class for memory- and processing power-optimized pipelines

Parameters
outdir, create_outdir, outname_append, outname: see

outname_create()

num_processes, mem_available, mem_frac, process_sizeoptional

These parameters tune computer processing and memory resources and are used when the pipeline() method is executed. See documentation for num_can_process() for use, noting that the num_to_process argument of that function is set to the number of input filenames in pipeline()

pre_process_listlist

See pre_process_list attribute

post_process_listlist

See post_process_list attribute

PoolClassclass name or None, optional

Typcally a subclass of multiprocessing.pool.Pool. The map() method of this class implements the multiprocessing feature of this module. If None, multiprocessing.pool.Pool is used. Default is None.

**kwargsoptional

Python’s **kwargs construct stores additional keyword arguments as a dict accessed in the function as kwargs. In order to implement the control stream discussed in the introduction to this module, this dict is captured as property on instantiation. When any methods are run, the kwargs passed to that method are merged with the property kwargs using kwargs_merge(). This allows the parameters passed to the methods at runtime to override the parameters passed to the object at instantiation time.

Notes

Just like **kwargs, all named parameters passed at object instantiation are stored as property and used to initialize the identical list of parameters to the BigMultiPipe.pipeline() method. Any of these parameters except pre_process_list and post_process_list can be overridden when pipeline() is called by using the corresponding keyword. This enables definition of a default pipeline configuration when the object is instantiated that can be modified at run-time. The exception to this are pre_process_list and post_process_list. When these are provided to pipeline(), they are added to corresponding lists provided at instantiation time. To erase these lists in the object simply set their property to None: e.g. BigMultipipe.pre_process_list = None

Attributes Summary

post_process_list

list or None : List of post-processing routines

pre_process_list

list or None : List of pre-processing routines

Methods Summary

data_process(data, **kwargs)

Process the data.

data_process_meta_create(data, **kwargs)

Process data and create metadata

file_process(in_name, **kwargs)

Process one file in the bigmultipipe system

file_read(in_name, **kwargs)

Reads data file(s) from disk.

file_write(data, outname[, meta, create_outdir])

Create outdir of create_outdir is True.

kwargs_merge(**kwargs)

Merge **kwargs with **kwargs provided on instantiation

outname_create(*args, **kwargs)

Create output filename (including path) using bigmultipipe.outname_creator

pipeline(in_names[, num_processes, ...])

Runs pipeline, maximizing processing and memory resources

post_process(data[, post_process_list, ...])

Conduct post-processing tasks, including creation of metadata

pre_process(data[, pre_process_list])

Conduct pre-processing tasks

Attributes Documentation

post_process_list

list or None : List of post-processing routines

List of functions called by post_process() after primary processing step. Indended to enable additional processing steps and produce metadata as discussed in Discussion of Design. Each function must accept one positional parameter, data, one optional keyword parameter, bmp_meta, any keywords needed by the function, and an arbitrary list of keywords handled by the **kwargs feature. bmp_meta is of type dict. The return value of each function is intended to be the data but it not restricted in any way. If None is return, processing stops for that file, None is returned for that file’s data and the metadata accumulated to that point is returned as that file’s metadata. bmp_meta.clear() can be used in the terminating post_process_list routine if it is desirable to erase the metadata. See Example for examples of a simple functioning pipeline.

>>> def later_booster(data, need_to_boost_by=None, **kwargs):
>>>     if need_to_boost_by is not None:
>>>         data = data + need_to_boost_by
>>>     return data
>>> 
>>> def median(data, bmp_meta=None, **kwargs):
>>>     m = np.median(data)
>>>     if bmp_meta is not None:
>>>         bmp_meta['median'] = m
>>>     return data
>>> 
pre_process_list

list or None : List of pre-processing routines

List of functions called by pre_process() before primary processing step. Intended to implement filtering and control features as described in Discussion of Design. Each function must accept one positional parameter, data, keyword arguments necessary for its internal functioning, and **kwargs to ignore keyword parameters not processed by the function. If the return value of the function is None, processing of that file stops, no output file is written, and None is returned instead of an output filename. This is how filtering is implemented. Otherwise, the return value is either data or a dict with two keys: bmp_data and bmp_kwargs. In the later case, bmp_kwargs will be merged into **kwargs. This is how the control channel is implemented. Below are examples. See Example to see this code in use in a functioning pipeline.

>>> def reject(data, reject_value=None, **kwargs):
>>>     if reject_value is None:
>>>         return data
>>>     if data[0,0] == reject_value:
>>>         # --> Return data=None to reject data
>>>         return None
>>>     return data
>>> 
>>> def boost_later(data, boost_target=None, boost_amount=None, **kwargs):
>>>     if boost_target is None or boost_amount is None:
>>>         return data
>>>     if data[0,0] == boost_target:
>>>         add_kwargs = {'need_to_boost_by': boost_amount}
>>>         retval = {'bmp_data': data,
>>>                   'bmp_kwargs': add_kwargs}
>>>         return retval
>>>     return data

Methods Documentation

data_process(data, **kwargs)[source]

Process the data. Intended to be overridden in subclass

Parameters
dataany type

Data to be processed

Returns
dataany type

Processed data

data_process_meta_create(data, **kwargs)[source]

Process data and create metadata

Parameters
dataany type

Data to be processed by pre_process(), data_process(), and post_process()

kwargssee Notes in BigMultiPipe Parameter section
Returns
(data, meta)tuple

Data is the processed data. Meta is created by post_process()

file_process(in_name, **kwargs)[source]

Process one file in the bigmultipipe system

This method can be overridden to interface with applications where the primary processing routine already reads the input data from disk and writes the output data to disk,

Parameters
in_name: str

Name of file to process. Data from the file will be read by file_read() and processed by data_process_meta_create(). Output filename will be created by outname_create() and data will be written by file_write()

kwargssee Notes in BigMultiPipe Parameter section
Returns
(outname, meta)tuple

Outname is the name of file to which processed data was written. Meta is the dictionary element of the tuple returned by data_process_meta_create()

file_read(in_name, **kwargs)[source]

Reads data file(s) from disk. Intended to be overridden by subclass

Parameters
in_namestr or list

If str, name of file to read. If list, each element in list is processed recursively so that multiple files can be considered a single “data” in bigmultipipe nomenclature

kwargssee Notes in BigMultiPipe Parameter section
Returns
dataany type

Data to be processed

file_write(data, outname, meta=None, create_outdir=False, **kwargs)[source]

Create outdir of create_outdir is True. MUST be overridden by subclass to actually write the data, in which case, this would be an example first line of the subclass.file_write method:

BigMultiPipe(self).file_write(data, outname, **kwargs)

Parameters
dataany type

Processed data

outnamestr

Name of file to write

metadict

BigMultiPipe metadata dictionary

create_outdirbool, optional

If True, create outdir and any needed parent directories. Does not raise an error if outdir already exists. Overwritten by create_outdir key in meta. Default is False

kwargssee Notes in BigMultiPipe Parameter section
Returns
outnamestr

Name of file written

kwargs_merge(**kwargs)[source]

Merge **kwargs with **kwargs provided on instantiation

Intended to be called by methods

Parameters
**kwargskeyword arguments
outname_create(*args, **kwargs)[source]

Create output filename (including path) using bigmultipipe.outname_creator

Returns
outnamestr

output filename to be written, including path

pipeline(in_names, num_processes=None, mem_available=None, mem_frac=None, process_size=None, PoolClass=None, outdir=None, create_outdir=None, outname_append=None, outname=None, **kwargs)[source]

Runs pipeline, maximizing processing and memory resources

Parameters
in_nameslist of str

List of input filenames. Each file is processed using file_process()

All other parameterssee Parameters to BigMultiPipe
Returns
poutlist of tuples (outname, meta), one tuple for

each in_name. Outname is str or None, meta is a dict containing metadata output. If outname is str, it is the name of the file to which the processed data were written. If None and meta = {}, the convenience function prune_pout() can be used to remove this tuple from pout and the corresponding in_name from the in_names list.

post_process(data, post_process_list=None, no_bmp_cleanup=False, **kwargs)[source]

Conduct post-processing tasks, including creation of metadata

This method can be overridden to permanently insert post-processing tasks in the pipeline for each instantiated object or the post_process_list feature can be used for a more dynamic approach to inserting post-processing tasks at object instantiation and/or when the pipeline is run

Parameters
dataany type

Data to be processed by the functions in pre_process_list

post_process_listlist

See documentation for this parameter in Parameters section of BigMultiPipe

no_bmp_cleanupbool

Do not run bmp_cleanup post-processing task even if keywords have been added to the bmp_cleanup_list (provided for debugging purposes). Default is False

kwargssee Notes in BigMultiPipe Parameter section
Returns
(data, meta)tuple

Data are the post-processed data. Meta are the combined meta dicts from all of the post_process_list functions.

pre_process(data, pre_process_list=None, **kwargs)[source]

Conduct pre-processing tasks

This method can be overridden to permanently insert pre-processing tasks in the pipeline for each instantiated object and/or the pre_process_list feature can be used for a more dynamic approach to inserting pre-processing tasks at object instantiation and/or when the pipeline is run

Parameters
dataany type

Data to be processed by the functions in pre_process_list

pre_process_listlist

See documentation for this parameter in Parameters section of BigMultiPipe

kwargssee Notes in BigMultiPipe Parameter section
Returns
(data, kwargs)tuple

Data are the pre-processed data. Kwargs are the combined kwarg outputs from all of the pre_process_list functions.