BigMultiPipe¶
- class bigmultipipe.BigMultiPipe(num_processes=None, mem_available=None, mem_frac=0.8, process_size=None, outdir=None, create_outdir=False, outname_append='_bmp', pre_process_list=None, post_process_list=None, PoolClass=None, **kwargs)[source]¶
Bases:
object
Base class for memory- and processing power-optimized pipelines
- Parameters
- outdir, create_outdir, outname_append, outname: see
- num_processes, mem_available, mem_frac, process_sizeoptional
These parameters tune computer processing and memory resources and are used when the
pipeline()
method is executed. See documentation fornum_can_process()
for use, noting that thenum_to_process
argument of that function is set to the number of input filenames inpipeline()
- pre_process_listlist
See
pre_process_list
attribute- post_process_listlist
See
post_process_list
attribute- PoolClassclass name or None, optional
Typcally a subclass of
multiprocessing.pool.Pool
. Themap()
method of this class implements the multiprocessing feature of this module. IfNone
,multiprocessing.pool.Pool
is used. Default isNone.
- **kwargsoptional
Python’s
**kwargs
construct stores additional keyword arguments as adict
accessed in the function askwargs
. In order to implement the control stream discussed in the introduction to this module, thisdict
is captured as property on instantiation. When any methods are run, thekwargs
passed to that method are merged with the propertykwargs
usingkwargs_merge()
. This allows the parameters passed to the methods at runtime to override the parameters passed to the object at instantiation time.
Notes
Just like
**kwargs
, all named parameters passed at object instantiation are stored as property and used to initialize the identical list of parameters to theBigMultiPipe.pipeline()
method. Any of these parameters exceptpre_process_list
andpost_process_list
can be overridden whenpipeline()
is called by using the corresponding keyword. This enables definition of a default pipeline configuration when the object is instantiated that can be modified at run-time. The exception to this arepre_process_list
andpost_process_list
. When these are provided topipeline()
, they are added to corresponding lists provided at instantiation time. To erase these lists in the object simply set their property to None: e.g.BigMultipipe.pre_process_list
= NoneAttributes Summary
list or None : List of post-processing routines
list or None : List of pre-processing routines
Methods Summary
data_process
(data, **kwargs)Process the data.
data_process_meta_create
(data, **kwargs)Process data and create metadata
file_process
(in_name, **kwargs)Process one file in the
bigmultipipe
systemfile_read
(in_name, **kwargs)Reads data file(s) from disk.
file_write
(data, outname[, meta, create_outdir])Create outdir of create_outdir is True.
kwargs_merge
(**kwargs)Merge **kwargs with **kwargs provided on instantiation
outname_create
(*args, **kwargs)Create output filename (including path) using
bigmultipipe.outname_creator
pipeline
(in_names[, num_processes, ...])Runs pipeline, maximizing processing and memory resources
post_process
(data[, post_process_list, ...])Conduct post-processing tasks, including creation of metadata
pre_process
(data[, pre_process_list])Conduct pre-processing tasks
Attributes Documentation
- post_process_list¶
list or None : List of post-processing routines
List of functions called by
post_process()
after primary processing step. Indended to enable additional processing steps and produce metadata as discussed in Discussion of Design. Each function must accept one positional parameter,data
, one optional keyword parameter,bmp_meta
, any keywords needed by the function, and an arbitrary list of keywords handled by the**kwargs
feature.bmp_meta
is of typedict
. The return value of each function is intended to be the data but it not restricted in any way. IfNone
is return, processing stops for that file,None
is returned for that file’s data and the metadata accumulated to that point is returned as that file’s metadata.bmp_meta.clear()
can be used in the terminatingpost_process_list
routine if it is desirable to erase the metadata. See Example for examples of a simple functioning pipeline.>>> def later_booster(data, need_to_boost_by=None, **kwargs): >>> if need_to_boost_by is not None: >>> data = data + need_to_boost_by >>> return data >>> >>> def median(data, bmp_meta=None, **kwargs): >>> m = np.median(data) >>> if bmp_meta is not None: >>> bmp_meta['median'] = m >>> return data >>>
- pre_process_list¶
list or None : List of pre-processing routines
List of functions called by
pre_process()
before primary processing step. Intended to implement filtering and control features as described in Discussion of Design. Each function must accept one positional parameter,data
, keyword arguments necessary for its internal functioning, and**kwargs
to ignore keyword parameters not processed by the function. If the return value of the function isNone
, processing of that file stops, no output file is written, andNone
is returned instead of an output filename. This is how filtering is implemented. Otherwise, the return value is eitherdata
or adict
with two keys:bmp_data
andbmp_kwargs
. In the later case,bmp_kwargs
will be merged into**kwargs
. This is how the control channel is implemented. Below are examples. See Example to see this code in use in a functioning pipeline.>>> def reject(data, reject_value=None, **kwargs): >>> if reject_value is None: >>> return data >>> if data[0,0] == reject_value: >>> # --> Return data=None to reject data >>> return None >>> return data >>> >>> def boost_later(data, boost_target=None, boost_amount=None, **kwargs): >>> if boost_target is None or boost_amount is None: >>> return data >>> if data[0,0] == boost_target: >>> add_kwargs = {'need_to_boost_by': boost_amount} >>> retval = {'bmp_data': data, >>> 'bmp_kwargs': add_kwargs} >>> return retval >>> return data
Methods Documentation
- data_process(data, **kwargs)[source]¶
Process the data. Intended to be overridden in subclass
- Parameters
- dataany type
Data to be processed
- Returns
- dataany type
Processed data
- data_process_meta_create(data, **kwargs)[source]¶
Process data and create metadata
- Parameters
- dataany type
Data to be processed by
pre_process()
,data_process()
, andpost_process()
- kwargssee Notes in
BigMultiPipe
Parameter section
- Returns
- (data, meta)tuple
Data is the processed data. Meta is created by
post_process()
- file_process(in_name, **kwargs)[source]¶
Process one file in the
bigmultipipe
systemThis method can be overridden to interface with applications where the primary processing routine already reads the input data from disk and writes the output data to disk,
- Parameters
- in_name: str
Name of file to process. Data from the file will be read by
file_read()
and processed bydata_process_meta_create()
. Output filename will be created byoutname_create()
and data will be written byfile_write()
- kwargssee Notes in
BigMultiPipe
Parameter section
- Returns
- (outname, meta)tuple
Outname is the name of file to which processed data was written. Meta is the dictionary element of the tuple returned by
data_process_meta_create()
- file_read(in_name, **kwargs)[source]¶
Reads data file(s) from disk. Intended to be overridden by subclass
- Parameters
- in_namestr or list
If
str
, name of file to read. Iflist
, each element in list is processed recursively so that multiple files can be considered a single “data” inbigmultipipe
nomenclature- kwargssee Notes in
BigMultiPipe
Parameter section
- Returns
- dataany type
Data to be processed
- file_write(data, outname, meta=None, create_outdir=False, **kwargs)[source]¶
Create outdir of create_outdir is True. MUST be overridden by subclass to actually write the data, in which case, this would be an example first line of the subclass.file_write method:
BigMultiPipe(self).file_write(data, outname, **kwargs)
- Parameters
- dataany type
Processed data
- outnamestr
Name of file to write
- metadict
BigMultiPipe
metadata dictionary- create_outdirbool, optional
If
True
, create outdir and any needed parent directories. Does not raise an error if outdir already exists. Overwritten bycreate_outdir
key inmeta
. Default isFalse
- kwargssee Notes in
BigMultiPipe
Parameter section
- Returns
- outnamestr
Name of file written
- kwargs_merge(**kwargs)[source]¶
Merge **kwargs with **kwargs provided on instantiation
Intended to be called by methods
- Parameters
- **kwargskeyword arguments
- outname_create(*args, **kwargs)[source]¶
Create output filename (including path) using
bigmultipipe.outname_creator
- Returns
- outnamestr
output filename to be written, including path
- pipeline(in_names, num_processes=None, mem_available=None, mem_frac=None, process_size=None, PoolClass=None, outdir=None, create_outdir=None, outname_append=None, outname=None, **kwargs)[source]¶
Runs pipeline, maximizing processing and memory resources
- Parameters
- in_names
list
ofstr
List of input filenames. Each file is processed using
file_process()
- All other parameterssee Parameters to
BigMultiPipe
- in_names
- Returns
- pout
list
of tuples(outname, meta)
, onetuple
for each
in_name
.Outname
isstr
orNone
,meta
is adict
containing metadata output. Ifoutname
isstr
, it is the name of the file to which the processed data were written. IfNone
and meta = {}, the convenience functionprune_pout()
can be used to remove this tuple frompout
and the corresponding in_name from the in_names list.
- pout
- post_process(data, post_process_list=None, no_bmp_cleanup=False, **kwargs)[source]¶
Conduct post-processing tasks, including creation of metadata
This method can be overridden to permanently insert post-processing tasks in the pipeline for each instantiated object or the post_process_list feature can be used for a more dynamic approach to inserting post-processing tasks at object instantiation and/or when the pipeline is run
- Parameters
- dataany type
Data to be processed by the functions in pre_process_list
- post_process_listlist
See documentation for this parameter in Parameters section of
BigMultiPipe
- no_bmp_cleanupbool
Do not run
bmp_cleanup
post-processing task even if keywords have been added to the bmp_cleanup_list (provided for debugging purposes). Default isFalse
- kwargssee Notes in
BigMultiPipe
Parameter section
- Returns
- (data, meta)tuple
Data are the post-processed data. Meta are the combined meta dicts from all of the post_process_list functions.
- pre_process(data, pre_process_list=None, **kwargs)[source]¶
Conduct pre-processing tasks
This method can be overridden to permanently insert pre-processing tasks in the pipeline for each instantiated object and/or the pre_process_list feature can be used for a more dynamic approach to inserting pre-processing tasks at object instantiation and/or when the pipeline is run
- Parameters
- dataany type
Data to be processed by the functions in pre_process_list
- pre_process_listlist
See documentation for this parameter in Parameters section of
BigMultiPipe
- kwargssee Notes in
BigMultiPipe
Parameter section
- Returns
- (data, kwargs)tuple
Data are the pre-processed data. Kwargs are the combined kwarg outputs from all of the pre_process_list functions.