BigMultiPipe¶

class bigmultipipe.BigMultiPipe(num_processes=None, mem_available=None, mem_frac=0.8, process_size=None, outdir=None, create_outdir=False, outname_append='_bmp', pre_process_list=None, post_process_list=None, PoolClass=None, **kwargs)[source]¶

Bases: object

Base class for memory- and processing power-optimized pipelines

Parameters

outdir, create_outdir, outname_append, outname: see: outname_create()
num_processes, mem_available, mem_frac, process_sizeoptional: These parameters tune computer processing and memory resources and are used when the pipeline() method is executed. See documentation for num_can_process() for use, noting that the num_to_process argument of that function is set to the number of input filenames in pipeline()
pre_process_listlist: See pre_process_list attribute
post_process_listlist: See post_process_list attribute
PoolClassclass name or None, optional: Typcally a subclass of multiprocessing.pool.Pool. The map() method of this class implements the multiprocessing feature of this module. If None, multiprocessing.pool.Pool is used. Default is None.
**kwargsoptional: Python’s **kwargs construct stores additional keyword arguments as a dict accessed in the function as kwargs. In order to implement the control stream discussed in the introduction to this module, this dict is captured as property on instantiation. When any methods are run, the kwargs passed to that method are merged with the property kwargs using kwargs_merge(). This allows the parameters passed to the methods at runtime to override the parameters passed to the object at instantiation time.

Notes

Just like **kwargs, all named parameters passed at object instantiation are stored as property and used to initialize the identical list of parameters to the BigMultiPipe.pipeline() method. Any of these parameters except pre_process_list and post_process_list can be overridden when pipeline() is called by using the corresponding keyword. This enables definition of a default pipeline configuration when the object is instantiated that can be modified at run-time. The exception to this are pre_process_list and post_process_list. When these are provided to pipeline(), they are added to corresponding lists provided at instantiation time. To erase these lists in the object simply set their property to None: e.g. BigMultipipe.pre_process_list = None

Attributes Summary

`post_process_list`	list or None : List of post-processing routines
`pre_process_list`	list or None : List of pre-processing routines

Methods Summary

`data_process`(data, **kwargs)	Process the data.
`data_process_meta_create`(data, **kwargs)	Process data and create metadata
`file_process`(in_name, **kwargs)	Process one file in the `bigmultipipe` system
`file_read`(in_name, **kwargs)	Reads data file(s) from disk.
`file_write`(data, outname[, meta, create_outdir])	Create outdir of create_outdir is True.
`kwargs_merge`(**kwargs)	Merge kwargs with kwargs provided on instantiation
`outname_create`(args, *kwargs)	Create output filename (including path) using `bigmultipipe.outname_creator`
`pipeline`(in_names[, num_processes, ...])	Runs pipeline, maximizing processing and memory resources
`post_process`(data[, post_process_list, ...])	Conduct post-processing tasks, including creation of metadata
`pre_process`(data[, pre_process_list])	Conduct pre-processing tasks

Attributes Documentation

post_process_list¶

list or None : List of post-processing routines

List of functions called by post_process() after primary processing step. Indended to enable additional processing steps and produce metadata as discussed in Discussion of Design. Each function must accept one positional parameter, data, one optional keyword parameter, bmp_meta, any keywords needed by the function, and an arbitrary list of keywords handled by the **kwargs feature. bmp_meta is of type dict. The return value of each function is intended to be the data but it not restricted in any way. If None is return, processing stops for that file, None is returned for that file’s data and the metadata accumulated to that point is returned as that file’s metadata. bmp_meta.clear() can be used in the terminating post_process_list routine if it is desirable to erase the metadata. See Example for examples of a simple functioning pipeline.

>>> def later_booster(data, need_to_boost_by=None, **kwargs):
>>>     if need_to_boost_by is not None:
>>>         data = data + need_to_boost_by
>>>     return data
>>> 
>>> def median(data, bmp_meta=None, **kwargs):
>>>     m = np.median(data)
>>>     if bmp_meta is not None:
>>>         bmp_meta['median'] = m
>>>     return data
>>> 

pre_process_list¶

list or None : List of pre-processing routines

List of functions called by pre_process() before primary processing step. Intended to implement filtering and control features as described in Discussion of Design. Each function must accept one positional parameter, data, keyword arguments necessary for its internal functioning, and **kwargs to ignore keyword parameters not processed by the function. If the return value of the function is None, processing of that file stops, no output file is written, and None is returned instead of an output filename. This is how filtering is implemented. Otherwise, the return value is either data or a dict with two keys: bmp_data and bmp_kwargs. In the later case, bmp_kwargs will be merged into **kwargs. This is how the control channel is implemented. Below are examples. See Example to see this code in use in a functioning pipeline.

>>> def reject(data, reject_value=None, **kwargs):
>>>     if reject_value is None:
>>>         return data
>>>     if data[0,0] == reject_value:
>>>         # --> Return data=None to reject data
>>>         return None
>>>     return data
>>> 
>>> def boost_later(data, boost_target=None, boost_amount=None, **kwargs):
>>>     if boost_target is None or boost_amount is None:
>>>         return data
>>>     if data[0,0] == boost_target:
>>>         add_kwargs = {'need_to_boost_by': boost_amount}
>>>         retval = {'bmp_data': data,
>>>                   'bmp_kwargs': add_kwargs}
>>>         return retval
>>>     return data

Methods Documentation

data_process(data, **kwargs)[source]¶

Process the data. Intended to be overridden in subclass

Parameters

dataany type: Data to be processed

Returns

dataany type: Processed data

data_process_meta_create(data, **kwargs)[source]¶

Process data and create metadata

Parameters

dataany type: Data to be processed by pre_process(), data_process(), and post_process()
kwargssee Notes in BigMultiPipe Parameter section

Returns

(data, meta)tuple: Data is the processed data. Meta is created by post_process()

file_process(in_name, **kwargs)[source]¶

Process one file in the bigmultipipe system

This method can be overridden to interface with applications where the primary processing routine already reads the input data from disk and writes the output data to disk,

Parameters

in_name: str: Name of file to process. Data from the file will be read by file_read() and processed by data_process_meta_create(). Output filename will be created by outname_create() and data will be written by file_write()
kwargssee Notes in BigMultiPipe Parameter section

Returns

(outname, meta)tuple: Outname is the name of file to which processed data was written. Meta is the dictionary element of the tuple returned by data_process_meta_create()

file_read(in_name, **kwargs)[source]¶

Reads data file(s) from disk. Intended to be overridden by subclass

Parameters

in_namestr or list: If str, name of file to read. If list, each element in list is processed recursively so that multiple files can be considered a single “data” in bigmultipipe nomenclature
kwargssee Notes in BigMultiPipe Parameter section

Returns

dataany type: Data to be processed

file_write(data, outname, meta=None, create_outdir=False, **kwargs)[source]¶

Create outdir of create_outdir is True. MUST be overridden by subclass to actually write the data, in which case, this would be an example first line of the subclass.file_write method:

BigMultiPipe(self).file_write(data, outname, **kwargs)

Parameters

dataany type: Processed data
outnamestr: Name of file to write
metadict: BigMultiPipe metadata dictionary
create_outdirbool, optional: If True, create outdir and any needed parent directories. Does not raise an error if outdir already exists. Overwritten by create_outdir key in meta. Default is False
kwargssee Notes in BigMultiPipe Parameter section

Returns

outnamestr: Name of file written

kwargs_merge(**kwargs)[source]¶

Merge **kwargs with **kwargs provided on instantiation

Intended to be called by methods

Parameters

**kwargskeyword arguments

outname_create(*args, **kwargs)[source]¶

Create output filename (including path) using bigmultipipe.outname_creator

Returns

outnamestr: output filename to be written, including path

pipeline(in_names, num_processes=None, mem_available=None, mem_frac=None, process_size=None, PoolClass=None, outdir=None, create_outdir=None, outname_append=None, outname=None, **kwargs)[source]¶

Runs pipeline, maximizing processing and memory resources

Parameters

in_nameslist of str: List of input filenames. Each file is processed using file_process()
All other parameterssee Parameters to BigMultiPipe

Returns

poutlist of tuples (outname, meta), one tuple for: each in_name. Outname is str or None, meta is a dict containing metadata output. If outname is str, it is the name of the file to which the processed data were written. If None and meta = {}, the convenience function prune_pout() can be used to remove this tuple from pout and the corresponding in_name from the in_names list.

post_process(data, post_process_list=None, no_bmp_cleanup=False, **kwargs)[source]¶

Conduct post-processing tasks, including creation of metadata

This method can be overridden to permanently insert post-processing tasks in the pipeline for each instantiated object or the post_process_list feature can be used for a more dynamic approach to inserting post-processing tasks at object instantiation and/or when the pipeline is run

Parameters

dataany type: Data to be processed by the functions in pre_process_list
post_process_listlist: See documentation for this parameter in Parameters section of BigMultiPipe
no_bmp_cleanupbool: Do not run bmp_cleanup post-processing task even if keywords have been added to the bmp_cleanup_list (provided for debugging purposes). Default is False
kwargssee Notes in BigMultiPipe Parameter section

Returns

(data, meta)tuple: Data are the post-processed data. Meta are the combined meta dicts from all of the post_process_list functions.

pre_process(data, pre_process_list=None, **kwargs)[source]¶

Conduct pre-processing tasks

This method can be overridden to permanently insert pre-processing tasks in the pipeline for each instantiated object and/or the pre_process_list feature can be used for a more dynamic approach to inserting pre-processing tasks at object instantiation and/or when the pipeline is run

Parameters

dataany type: Data to be processed by the functions in pre_process_list
pre_process_listlist: See documentation for this parameter in Parameters section of BigMultiPipe
kwargssee Notes in BigMultiPipe Parameter section

Returns

(data, kwargs)tuple: Data are the pre-processed data. Kwargs are the combined kwarg outputs from all of the pre_process_list functions.

Navigation

BigMultiPipe¶