PairwiseEMD
The PairwiseEMD
functionality handles the collective computation of Wasserstein distances between pairs of events. Adopting language from particle physics, we will call the distributions "events," the discrete entities in the ground space "particles," and the particle weights (probability mass) "energy". It automatically parallelizes the computations across a specified number of threads for the greatest efficiency.
Python
The Python PairwiseEMD
function returns an object (either wasserstein.PairwiseEMDFloat64
or wasserstein.PairwiseEMDFloat32
, each of which are instantiations of the C++ template PairwiseEMD
) that can be used to efficientl compute pairs of EMD distances. It is designed to work with numpy arrays efficiently. The PairwiseEMDYPhi
function behaves similarly but implements 2\pi periodicity in the second coordinate dimension, and so is suited for using the rapidity-azimuth plane as a ground space.
wasserstein.PairwiseEMD(R=1.0, beta=1.0, norm=False,
num_threads=-1, print_every=-10, verbose=1,
request_mode=False, store_sym_emds_raw=True, throw_on_error=False,
omp_dynamic_chunksize=10,
n_iter_max=100000, epsilon_large_factor=1000.0, epsilon_small_factor=1.0,
dtype='float64')
wasserstein.PairwiseEMDYPhi(R=1.0, beta=1.0, norm=False,
num_threads=-1, print_every=-10, verbose=1,
request_mode=False, store_sym_emds_raw=True, throw_on_error=False,
omp_dynamic_chunksize=10,
n_iter_max=100000, epsilon_large_factor=1000.0, epsilon_small_factor=1.0,
dtype='float64')
Arguments
- R : float
- The R parameter in the EMD definition that controls the relative importance of the two terms. Must be greater than or equal to half of the maximum ground distance in the space in order for the EMD to be a valid metric satisfying the triangle inequality.
- beta : float
- The angular weighting exponent. The internal pairwsie distance matrix is raised to this power prior to solving the optimal transport problem.
- norm : bool
- Whether or not to normalize the particle weights to sum to one prior to computing the EMD.
- num_threads : int
- The number of threads to use when executing the computations. A value of
-1
uses the maximum number according toomp_get_max_threads()
and should be used for the greatest efficiency.
- The number of threads to use when executing the computations. A value of
- print_every : int
- This controls how the slate of EMD computations is divied up. If positive, it is essentially a batch size - i.e. how many computations to do before printing progress (see
verbose
), checking for signals, etc., and then continuing. If negative, it is the approximate total number of times to print progress, etc. A value of0
is equivalent to-1
. Setting this to a small positive number or a large (in absolute value) negative number tends to be inefficient.
- This controls how the slate of EMD computations is divied up. If positive, it is essentially a batch size - i.e. how many computations to do before printing progress (see
- verbose : int
- Controls verbosity of the object. All printing is turned off if equal to
0
. Larger numbers turn on successively more printing. Currently there is only one verbosity level but more may be added. For capturing this output in a Jupyter notebook, run%load_ext wurlitzer
prior to starting the computation.
- Controls verbosity of the object. All printing is turned off if equal to
- request_mode : bool
- Sets up the
PairwiseEMD
object to accept requests for EMD computations via theemd(i, j)
method, rather than computing all EMDs immediately.
- Sets up the
- store_sym_emds_raw : bool
- Determines whether a symmetric distance matrix is stored square-form (redundant) or vector-form (raw). Wasserstein follows SciPy's convention for vector-form distance matrices.
- throw_on_error : bool
- Determines if the computation is interrupted by an error in the EMD calculation, or if the error is handled silently. In all cases, the errors are logged and can be accessed with the
error_messages()
method.
- Determines if the computation is interrupted by an error in the EMD calculation, or if the error is handled silently. In all cases, the errors are logged and can be accessed with the
- omp_dynamic_chunksize : int
- Number of EMD computations to spool to each thread at a time.
- n_iter_max : int
- Maximum number of iterations for solving the optimal transport problem.
- epsilon_large_factor : float
- Controls some tolerances in the optimal transport solver. This value is multiplied by the floating points epsilon (around 1e-16 for 64-bit floats) to determine the actual tolerance.
- epsilon_small_factor : float
- Analogous to
epsilon_large_factor
but used where the numerical tolerance can be stricter.
- Analogous to
- dtype : one of
{'float64', 'float32', numpy.float32, numpy.float64}
- Controls the floating point precision used in the EMD computation.
Compute Distances
__call__
__call__(eventsA, eventsB=None, gdim=None, mask=False, event_weightsA=None, event_weightsB=None)
Compute the Wasserstein distances between all pairs of events using the Euclidean ground distance. If eventsB
is None
then the computations will be between pairs of events drawn from eventsA
, otherwise they are between all pairs of events between eventsA
and eventsB
. Each event should be a two-dimensional numpy array where the first column is the weights and the remaining columns are the coordinates of the particles in the Euclidean ground space. This method does not return anything; see the Access Results section below.
Arguments
- eventsA : {numpy.ndarray, list} of numpy.ndarray
- The first dataset of events as a list or numpy object array of events. Each event should be a two-dimensional numpy array where the weights are the first column and the Euclidean coordinates are the remaining columns.
- eventsB : {numpy.ndarray, list} of numpy.ndarray or
None
- A collection of events with the same structure as
eventsA
, if notNone
. This may beNone
in which case the computations are between all pairs of events fromeventsA
(effectively,eventsB
is set equal toeventsA
).
- A collection of events with the same structure as
- gdim : int or
None
- If
None
, has no effect. If an integer, then this is the dimension of the ground space and the firstgdim
columns after the first (which contains the weights) are used as the particle coordinates. Concretely, for an event the weights areevent[:,0]
and the coordinates are#!python event[:,1:gdim+1]
.
- If
- mask : bool
- If
True
, then particles farther thanR
away from the origin of the ground space will be ignored in the EMD computations, whereR
is the parameter given to the constructor of thePairwiseEMD
object.
- If
- event_weightsA : numpy.ndarray or
None
- Weights to associate to each event in
eventsA
, used only if anExternalEMDHandler
has been provided which uses event weights. IfNone
, the weight for event is taken to be one.
- Weights to associate to each event in
- event_weightsS : numpy.ndarray or
None
- Same as
event_weightsA
but corresponding toeventsB
.
- Same as
Access Results
These methods access quantities determined during the most recent call to __call__
.
emds
emds()
Matrix of Wasserstein distances as a numpy array. If eventsB
was None
then this will have shape (nevA, nevA)
, otherwise it will have shape (nevA, nevB)
.
Returns
- 2d numpy.ndarray
- Matrix of Wasserstein distances. If
eventsB
wasNone
then it will be symmetric.
- Matrix of Wasserstein distances. If
raw_emds
raw_emds()
Vector of Wasserstein distances as a numpy array, of length nevA*(nevA-1)/2
. Throws an error if eventsB
was not None
. This corresponds to what SciPy terms the vector-form of a square symmetric distance matrix, where the distance between events i
and j
is given by:
where n = nevA
and i < j
.
Returns
- 1d numpy.ndarray
- Vector of Wasserstein distances.
emd
emd(i, j)
The Wasserstein distance between event i
from eventsA
and event j
from eventsB
(or eventsA
if eventsB
is None
). Note that negative indices are accepted and count from the end, as usual in Python. If the PairwiseEMD
object is in request mode, then this method is used to trigger the computation of the requested EMD computation.
Arguments
- i : int
- The index of an event from
eventsA
. AnIndexError
is raised if it is out of range.
- The index of an event from
- j : int
- The index of an event from
eventsB
, oreventsA
ifeventsB
wasNone
. AnIndexError
is raised if it is out of range.
- The index of an event from
Returns
- float
- The Wasserstein distance between the specified events.
nevA
nevA()
Returns
- int
- The number of events in dataset A.
nevB
nevB()
If eventsB
was None
, this will be the same as nevA
.
Returns
- int
- The number of events in dataset B.
num_emds
num_emds()
Returns
- int
- The number of unique EMD computations carried out.
duration
duration()
Returns
- float
- The wall time, in seconds, of the previous computation.
errored
errored()
Whether or not any individual computations had a non-zero return status, indicating an error.
Returns
- bool
- Whether any errors occurred among the computations.
error_messages
error_messages()
Error messages, if any, that have occurred during the computations.
Returns
- tuple of str
- A tuple of the error messages for anything that went wrong during the computations. An empty tuple means there were no errors.
Get/Set Options
These methods can be used to get the current settings of the PairwiseEMD
object or to set new ones. The getter methods are:
R()
: returns floatbeta()
: returns floatnorm()
: returns boolrequest_mode()
: returns boolomp_dynamic_chunksize()
: returns inthave_external_emd_handler()
: returns bool- This will be
True
if there is an external EMD handler associated with thePairwiseEMD
object andFalse
otherwise.
- This will be
storage()
: returns int- Returns an integer corresponding to an
EMDPairsStorage
enum value.
- Returns an integer corresponding to an
The setter methods are:
set_R(new_R)
: accepts floatset_beta(new_beta)
: accepts floatset_norm(new_norm)
: accepts boolset_request_mode(new_mode)
: accepts boolset_omp_dynamic_chunksize(new_chunksize)
: accepts intset_network_simplex_params(n_iter_max=100000, epsilon_large_factor=1000.0, epsilon_small_factor=1.0)
- This method resets all of the underlying network simplex solver's parameters at once.
external_emd_handler
set_external_emd_handler(emd_handler)
Associates this PairwiseEMD
object with an ExternalEMDHandler
. The results of the EMD computations will not be internally stored but will be passed on to the handler in a thread-safe manner for processing.
Arguments
- emd_handler :
ExternalEMDHandler
- An instance of
ExternalEMDHandler
that processes EMD values in an online manner.
- An instance of
Other Methods
preprocess_CenterWeightedCentroid
preprocess_CenterWeightedCentroid()
This method adds a CenterWeightedCentroid
preprocessor to the internal list. The particles of each event will be adjusted so that the origin of the ground space corresponds to their weighted centroid.
description
description(write_preprocessors=True)
Returns a string that describes the PairwiseEMD
object. Printing the PairwiseEMD
object uses this method to describe the object.
Arguments
- write_preprocessors : bool
- Whether or not to include preprocessors in the description. There are currently no preprocessors included in the Wasserstein package but this may change in the future.
Returns
- string
- The description of the
PairwiseEMD
object.
- The description of the
clear
clear()
Frees some memory in use by the PairwiseEMD
object. This should not normally need to be called by the user, except possibly after a larger computation. Note that this also removes any external EMD handler that may be present.
C++
The PairwiseEMD
C++ class is a template class that accepts a fully-qualified EMD
type parameter, from which it obtains information about event types, pairwise distance types, etc. There is a second template parameter which defaults to the value type of the EMD
type, but can also be set separately. There are two constructors: one that takes many of the same arguments as the EMD
class in order to construct them internally and one that acquires these settings from an EMD
object.
template<class EMD, typename Value = typename EMD::value_type>
wasserstein::PairwiseEMD(Value R = 1, Value beta = 1, bool norm = false,
int num_threads = -1,
std::ptrdiff_t print_every = -10,
unsigned verbose = 1,
bool request_mode = false,
bool store_sym_emds_raw = true,
bool throw_on_error = false,
int omp_dynamic_chunksize = 10,
unsigned n_iter_max = 100000,
Value epsilon_large_factor = 1000,
Value epsilon_small_factor = 1,
std::ostream & os = std::cout);
template<class EMD>
wasserstein::PairwiseEMD(const EMD & emd,
int num_threads = -1,
std::ptrdiff_t print_every = -10,
unsigned verbose = 1,
bool store_sym_emds_raw = true,
bool throw_on_error = false,
int omp_dynamic_chunksize = 10,
std::ostream & os = std::cout)
See the Python class of the same name for the meaning of these arguments. os
is an output stream that will be used for printing if verbose > 0
.
Compute Distances
The operator()
and compute
methods are overloaded to compute collections of EMD distances between two sets of events in different ways. Each has a one-argument version that computes all EMDs between every pair of events in the provided collection. Each also has a two-argument version that computes all EMDs between pairs of events with one from each set. Just as with the EMD
methods of the same name, the operator()
methods accept "proto events" (which may be of type Event
or are something that Event
can be constructed from) that will be preprocessed and compute
accepts fully-constructed events that will not be preprocessed. Events or proto events may be provided as a vector (for both operator()
and compute
) or as a first and last iterator (operator()
only)
operator()
// computes EMDs between pairs of events in one collection, provided as a vector
template<class ProtoEvent>
void operator()(const std::vector<ProtoEvent> & proto_events,
const std::vector<Value> & event_weights = {});
// computes EMDs between pairs of events in one collection, provided as iterators
template<class ProtoEventIt>
void operator()(ProtoEventIt proto_events_first, ProtoEventIt proto_events_last,
const std::vector<Value> & event_weights = {});
// computes EMDs between pairs of events in two collections, provided as vectors
template<class ProtoEventA, class ProtoEventB>
void operator()(const std::vector<ProtoEventA> & proto_eventsA,
const std::vector<ProtoEventB> & proto_eventsB,
const std::vector<Value> & event_weightsA = {},
const std::vector<Value> & event_weightsB = {});
// computes EMDs between pairs of events in two collections, provided as iterators
template<class ProtoEventAIt, class ProtoEventBIt>
void operator()(ProtoEventAIt proto_eventsA_first, ProtoEventAIt proto_eventsA_last,
ProtoEventBIt proto_eventsB_first, ProtoEventBIt proto_eventsB_last,
const std::vector<Value> & event_weightsA = {},
const std::vector<Value> & event_weightsB = {});
This version preprocesses each event to ensure any preprocessors are called and the particle weights are normalized properly (if norm = true
).
Arguments
- proto_events[A/B] (vector-style only)
- Collection of proto events; the actual events will be obtained by calling
Event(proto_event)
for eachproto_event
in the vector.
- Collection of proto events; the actual events will be obtained by calling
- proto_events[A/B]_first (iterator-style only)
- Iterator to first (proto) event. Should be at least a forward iterator.
- proto_events[A/B]_last (iterator-style only)
- Iterator pointing to the end of the range of (proto) event . Should be at least a forward iterator.
- event_weights[A/B]
- Vector of weights associated to each event; possibly used by an
ExternalEMDHandler
. If no event weights are provided, each event will be treated as having weight 1.
- Vector of weights associated to each event; possibly used by an
compute
void compute(const std::vector<Event> & events);
void compute(const std::vector<Event> & eventsA, const std::vector<Event> & eventsB);
This version does not apply any preprocessing.
Arguments
- events[A/B]
- Vector of fully-constructed events.
Access Results
Many of the methods share the same names as the Python ones. These are:
const std::vector<Value> & emds(bool raw = false);
const std::vector<Event> & events() const;
Value emd(std::ptrdiff_t i, std::ptrdiff_t j) const;
std::ptrdiff_t nevA() const;
std::ptrdiff_t nevB() const;
std::ptrdiff_t num_emds() const;
wasserstein::EMDPairsStorage storage() const;
double duration() const;
bool errored() const;
const std::vector<std::string> & error_messages() const;
The emds
method returns a reference to a vector of the EMD results (raw = true
corresponds to the Python raw_emds
method). The events
method returns a const reference to the vector of events that was used for the computations.
Get/Set Options
These methods can be used to get the current settings of the EMD
object or to set new ones. The getter methods are:
Value R() const;
Value beta() const;
bool norm() const;
bool have_external_emd_handler() const;
int omp_dynamic_chunksize() const;
bool request_mode() const;
template<class EMDHandler>
EMDHandler * external_emd_handler();
The external_emd_handler
method accepts one template parameter that is used to dynamically cast the internal pointer to that type. The setter methods are:
void set_R(Value R);
void set_beta(Value beta);
void set_norm(bool norm);
void set_external_emd_handler(ExternalEMDHandler & handler);
void set_omp_dynamic_chunksize(int chunksize);
void set_request_mode(bool mode);
void set_network_simplex_params(unsigned n_iter_max = 100000,
Value epsilon_large_factor = 1000,
Value epsilon_small_factor = 1);
See External EMD Handlers for more on the provided external EMD handler classes provided in Wasserstein.
Other Methods
preprocess
template<template<class> class P, typename... Args>
PairwiseEMD & preprocess(Args && ... args)
Adds a preprocessor to the internal list. Each event will be preprocessed by the preprocessors in the order they were given. Currently, there is one preprocessor as part of Wasserstein, CenterWeightedCentroid
. Since this preprocessor takes no arguments, it can be added as:
pairwise_emd_obj.preprocess<wasserstein::CenterWeightedCentroid>()
In general, any arguments to the preprocessor class are given as arguments to this method. A reference to the PairwiseEMD
object is returned.
description
std::string description()
Returns
- A string that describes the
PairwiseEMD
object.
clear
void clear()
Frees some memory in use by the PairwiseEMD
object. This should not normally need to be called by the user.