PairwiseEMD

The PairwiseEMD functionality handles the collective computation of Wasserstein distances between pairs of events. Adopting language from particle physics, we will call the distributions "events," the discrete entities in the ground space "particles," and the particle weights (probability mass) "energy". It automatically parallelizes the computations across a specified number of threads for the greatest efficiency.

Python

The Python PairwiseEMD function returns an object (either wasserstein.PairwiseEMDFloat64 or wasserstein.PairwiseEMDFloat32, each of which are instantiations of the C++ template PairwiseEMD) that can be used to efficientl compute pairs of EMD distances. It is designed to work with numpy arrays efficiently. The PairwiseEMDYPhi function behaves similarly but implements $2\pi$ periodicity in the second coordinate dimension, and so is suited for using the rapidity-azimuth plane as a ground space.

wasserstein.PairwiseEMD(R=1.0, beta=1.0, norm=False,
                        num_threads=-1, print_every=-10, verbose=1,
                        request_mode=False, store_sym_emds_raw=True, throw_on_error=False,
                        omp_dynamic_chunksize=10,
                        n_iter_max=100000, epsilon_large_factor=1000.0, epsilon_small_factor=1.0,
                        dtype='float64')

wasserstein.PairwiseEMDYPhi(R=1.0, beta=1.0, norm=False,
                            num_threads=-1, print_every=-10, verbose=1,
                            request_mode=False, store_sym_emds_raw=True, throw_on_error=False,
                            omp_dynamic_chunksize=10,
                            n_iter_max=100000, epsilon_large_factor=1000.0, epsilon_small_factor=1.0,
                            dtype='float64')

Arguments

R : float
- The R parameter in the EMD definition that controls the relative importance of the two terms. Must be greater than or equal to half of the maximum ground distance in the space in order for the EMD to be a valid metric satisfying the triangle inequality.
beta : float
- The angular weighting exponent. The internal pairwsie distance matrix is raised to this power prior to solving the optimal transport problem.
norm : bool
- Whether or not to normalize the particle weights to sum to one prior to computing the EMD.
num_threads : int
- The number of threads to use when executing the computations. A value of -1 uses the maximum number according to omp_get_max_threads() and should be used for the greatest efficiency.
print_every : int
- This controls how the slate of EMD computations is divied up. If positive, it is essentially a batch size - i.e. how many computations to do before printing progress (see verbose), checking for signals, etc., and then continuing. If negative, it is the approximate total number of times to print progress, etc. A value of 0 is equivalent to -1. Setting this to a small positive number or a large (in absolute value) negative number tends to be inefficient.
verbose : int
- Controls verbosity of the object. All printing is turned off if equal to 0. Larger numbers turn on successively more printing. Currently there is only one verbosity level but more may be added. For capturing this output in a Jupyter notebook, run %load_ext wurlitzer prior to starting the computation.
request_mode : bool
- Sets up the PairwiseEMD object to accept requests for EMD computations via the emd(i, j) method, rather than computing all EMDs immediately.
store_sym_emds_raw : bool
- Determines whether a symmetric distance matrix is stored square-form (redundant) or vector-form (raw). Wasserstein follows SciPy's convention for vector-form distance matrices.
throw_on_error : bool
- Determines if the computation is interrupted by an error in the EMD calculation, or if the error is handled silently. In all cases, the errors are logged and can be accessed with the error_messages() method.
omp_dynamic_chunksize : int
- Number of EMD computations to spool to each thread at a time.
n_iter_max : int
- Maximum number of iterations for solving the optimal transport problem.
epsilon_large_factor : float
- Controls some tolerances in the optimal transport solver. This value is multiplied by the floating points epsilon (around 1e-16 for 64-bit floats) to determine the actual tolerance.
epsilon_small_factor : float
- Analogous to epsilon_large_factor but used where the numerical tolerance can be stricter.
dtype : one of {'float64', 'float32', numpy.float32, numpy.float64}
- Controls the floating point precision used in the EMD computation.

Compute Distances

call

__call__(eventsA, eventsB=None, gdim=None, mask=False, event_weightsA=None, event_weightsB=None)

Compute the Wasserstein distances between all pairs of events using the Euclidean ground distance. If eventsB is None then the computations will be between pairs of events drawn from eventsA, otherwise they are between all pairs of events between eventsA and eventsB. Each event should be a two-dimensional numpy array where the first column is the weights and the remaining columns are the coordinates of the particles in the Euclidean ground space. This method does not return anything; see the Access Results section below.

Arguments

eventsA : {numpy.ndarray, list} of numpy.ndarray
- The first dataset of events as a list or numpy object array of events. Each event should be a two-dimensional numpy array where the weights are the first column and the Euclidean coordinates are the remaining columns.
eventsB : {numpy.ndarray, list} of numpy.ndarray or None
- A collection of events with the same structure as eventsA, if not None. This may be None in which case the computations are between all pairs of events from eventsA (effectively, eventsB is set equal to eventsA).
gdim : int or None
- If None, has no effect. If an integer, then this is the dimension of the ground space and the first gdim columns after the first (which contains the weights) are used as the particle coordinates. Concretely, for an event the weights are event[:,0] and the coordinates are #!python event[:,1:gdim+1].
mask : bool
- If True, then particles farther than R away from the origin of the ground space will be ignored in the EMD computations, where R is the parameter given to the constructor of the PairwiseEMD object.
event_weightsA : numpy.ndarray or None
- Weights to associate to each event in eventsA, used only if an ExternalEMDHandler has been provided which uses event weights. If None, the weight for event is taken to be one.
event_weightsS : numpy.ndarray or None
- Same as event_weightsA but corresponding to eventsB.

Access Results

These methods access quantities determined during the most recent call to __call__.

emds

emds()

Matrix of Wasserstein distances as a numpy array. If eventsB was None then this will have shape (nevA, nevA), otherwise it will have shape (nevA, nevB).

Returns

2d numpy.ndarray
- Matrix of Wasserstein distances. If eventsB was None then it will be symmetric.

raw_emds

raw_emds()

Vector of Wasserstein distances as a numpy array, of length nevA*(nevA-1)/2. Throws an error if eventsB was not None. This corresponds to what SciPy terms the vector-form of a square symmetric distance matrix, where the distance between events i and j is given by:

$\binom{n}{2}-\binom{n-i}{2}+(j-i-1)$

where n = nevA and i < j.

Returns

1d numpy.ndarray
- Vector of Wasserstein distances.

emd

emd(i, j)

The Wasserstein distance between event i from eventsA and event j from eventsB (or eventsA if eventsB is None). Note that negative indices are accepted and count from the end, as usual in Python. If the PairwiseEMD object is in request mode, then this method is used to trigger the computation of the requested EMD computation.

Arguments

i : int
- The index of an event from eventsA. An IndexError is raised if it is out of range.
j : int
- The index of an event from eventsB, or eventsA if eventsB was None. An IndexError is raised if it is out of range.

Returns

float
- The Wasserstein distance between the specified events.

nevA

nevA()

Returns

int
- The number of events in dataset A.

nevB

nevB()

If eventsB was None, this will be the same as nevA.

Returns

int
- The number of events in dataset B.

num_emds

num_emds()

Returns

int
- The number of unique EMD computations carried out.

duration

duration()

Returns

float
- The wall time, in seconds, of the previous computation.

errored

errored()

Whether or not any individual computations had a non-zero return status, indicating an error.

Returns

bool
- Whether any errors occurred among the computations.

error_messages

error_messages()

Error messages, if any, that have occurred during the computations.

Returns

tuple of str
- A tuple of the error messages for anything that went wrong during the computations. An empty tuple means there were no errors.

Get/Set Options

These methods can be used to get the current settings of the PairwiseEMD object or to set new ones. The getter methods are:

R() : returns float
beta() : returns float
norm() : returns bool
request_mode() : returns bool
omp_dynamic_chunksize() : returns int
have_external_emd_handler() : returns bool
- This will be True if there is an external EMD handler associated with the PairwiseEMD object and False otherwise.
storage() : returns int
- Returns an integer corresponding to an EMDPairsStorage enum value.

The setter methods are:

set_R(new_R) : accepts float
set_beta(new_beta) : accepts float
set_norm(new_norm) : accepts bool
set_request_mode(new_mode) : accepts bool
set_omp_dynamic_chunksize(new_chunksize) : accepts int
set_network_simplex_params(n_iter_max=100000, epsilon_large_factor=1000.0, epsilon_small_factor=1.0)
- This method resets all of the underlying network simplex solver's parameters at once.

external_emd_handler

set_external_emd_handler(emd_handler)

Associates this PairwiseEMD object with an ExternalEMDHandler. The results of the EMD computations will not be internally stored but will be passed on to the handler in a thread-safe manner for processing.

Arguments

emd_handler : ExternalEMDHandler
- An instance of ExternalEMDHandler that processes EMD values in an online manner.

Other Methods

preprocess_CenterWeightedCentroid

preprocess_CenterWeightedCentroid()

This method adds a CenterWeightedCentroid preprocessor to the internal list. The particles of each event will be adjusted so that the origin of the ground space corresponds to their weighted centroid.

description

description(write_preprocessors=True)

Returns a string that describes the PairwiseEMD object. Printing the PairwiseEMD object uses this method to describe the object.

Arguments

write_preprocessors : bool
- Whether or not to include preprocessors in the description. There are currently no preprocessors included in the Wasserstein package but this may change in the future.

Returns

string
- The description of the PairwiseEMD object.

clear

clear()

Frees some memory in use by the PairwiseEMD object. This should not normally need to be called by the user, except possibly after a larger computation. Note that this also removes any external EMD handler that may be present.

C++

The PairwiseEMD C++ class is a template class that accepts a fully-qualified EMD type parameter, from which it obtains information about event types, pairwise distance types, etc. There is a second template parameter which defaults to the value type of the EMD type, but can also be set separately. There are two constructors: one that takes many of the same arguments as the EMD class in order to construct them internally and one that acquires these settings from an EMD object.

template<class EMD, typename Value = typename EMD::value_type>
wasserstein::PairwiseEMD(Value R = 1, Value beta = 1, bool norm = false,
                         int num_threads = -1,
                         std::ptrdiff_t print_every = -10,
                         unsigned verbose = 1,
                         bool request_mode = false,
                         bool store_sym_emds_raw = true,
                         bool throw_on_error = false,
                         int omp_dynamic_chunksize = 10,
                         unsigned n_iter_max = 100000,
                         Value epsilon_large_factor = 1000,
                         Value epsilon_small_factor = 1,
                         std::ostream & os = std::cout);

template<class EMD>
wasserstein::PairwiseEMD(const EMD & emd,
                         int num_threads = -1,
                         std::ptrdiff_t print_every = -10,
                         unsigned verbose = 1,
                         bool store_sym_emds_raw = true,
                         bool throw_on_error = false,
                         int omp_dynamic_chunksize = 10,
                         std::ostream & os = std::cout)

See the Python class of the same name for the meaning of these arguments. os is an output stream that will be used for printing if verbose > 0.

Compute Distances

The operator() and compute methods are overloaded to compute collections of EMD distances between two sets of events in different ways. Each has a one-argument version that computes all EMDs between every pair of events in the provided collection. Each also has a two-argument version that computes all EMDs between pairs of events with one from each set. Just as with the EMD methods of the same name, the operator() methods accept "proto events" (which may be of type Event or are something that Event can be constructed from) that will be preprocessed and compute accepts fully-constructed events that will not be preprocessed. Events or proto events may be provided as a vector (for both operator() and compute) or as a first and last iterator (operator() only)

operator()

// computes EMDs between pairs of events in one collection, provided as a vector
template<class ProtoEvent>
void operator()(const std::vector<ProtoEvent> & proto_events,
                const std::vector<Value> & event_weights = {});

// computes EMDs between pairs of events in one collection, provided as iterators
template<class ProtoEventIt>
void operator()(ProtoEventIt proto_events_first, ProtoEventIt proto_events_last,
                const std::vector<Value> & event_weights = {});

// computes EMDs between pairs of events in two collections, provided as vectors
template<class ProtoEventA, class ProtoEventB>
void operator()(const std::vector<ProtoEventA> & proto_eventsA,
                const std::vector<ProtoEventB> & proto_eventsB,
                const std::vector<Value> & event_weightsA = {},
                const std::vector<Value> & event_weightsB = {});

// computes EMDs between pairs of events in two collections, provided as iterators
template<class ProtoEventAIt, class ProtoEventBIt>
void operator()(ProtoEventAIt proto_eventsA_first, ProtoEventAIt proto_eventsA_last,
                ProtoEventBIt proto_eventsB_first, ProtoEventBIt proto_eventsB_last,
                const std::vector<Value> & event_weightsA = {},
                const std::vector<Value> & event_weightsB = {});

This version preprocesses each event to ensure any preprocessors are called and the particle weights are normalized properly (if norm = true).

Arguments

proto_events[A/B] (vector-style only)
- Collection of proto events; the actual events will be obtained by calling Event(proto_event) for each proto_event in the vector.
proto_events[A/B]_first (iterator-style only)
- Iterator to first (proto) event. Should be at least a forward iterator.
proto_events[A/B]_last (iterator-style only)
- Iterator pointing to the end of the range of (proto) event . Should be at least a forward iterator.
event_weights[A/B]
- Vector of weights associated to each event; possibly used by an ExternalEMDHandler. If no event weights are provided, each event will be treated as having weight 1.

compute

void compute(const std::vector<Event> & events);
void compute(const std::vector<Event> & eventsA, const std::vector<Event> & eventsB);

This version does not apply any preprocessing.

Arguments

events[A/B]
- Vector of fully-constructed events.

Access Results

Many of the methods share the same names as the Python ones. These are:

const std::vector<Value> & emds(bool raw = false);
const std::vector<Event> & events() const;
Value emd(std::ptrdiff_t i, std::ptrdiff_t j) const;
std::ptrdiff_t nevA() const;
std::ptrdiff_t nevB() const;
std::ptrdiff_t num_emds() const;
wasserstein::EMDPairsStorage storage() const;
double duration() const;
bool errored() const;
const std::vector<std::string> & error_messages() const;

The emds method returns a reference to a vector of the EMD results (raw = true corresponds to the Python raw_emds method). The events method returns a const reference to the vector of events that was used for the computations.

Get/Set Options

These methods can be used to get the current settings of the EMD object or to set new ones. The getter methods are:

Value R() const;
Value beta() const;
bool norm() const;
bool have_external_emd_handler() const;
int omp_dynamic_chunksize() const;
bool request_mode() const;

template<class EMDHandler>
EMDHandler * external_emd_handler();

The external_emd_handler method accepts one template parameter that is used to dynamically cast the internal pointer to that type. The setter methods are:

void set_R(Value R);
void set_beta(Value beta);
void set_norm(bool norm);
void set_external_emd_handler(ExternalEMDHandler & handler);
void set_omp_dynamic_chunksize(int chunksize);
void set_request_mode(bool mode);
void set_network_simplex_params(unsigned n_iter_max = 100000,
                                Value epsilon_large_factor = 1000,
                                Value epsilon_small_factor = 1);

See External EMD Handlers for more on the provided external EMD handler classes provided in Wasserstein.

Other Methods

preprocess

template<template<class> class P, typename... Args>
PairwiseEMD & preprocess(Args && ... args)

Adds a preprocessor to the internal list. Each event will be preprocessed by the preprocessors in the order they were given. Currently, there is one preprocessor as part of Wasserstein, CenterWeightedCentroid. Since this preprocessor takes no arguments, it can be added as:

pairwise_emd_obj.preprocess<wasserstein::CenterWeightedCentroid>()

In general, any arguments to the preprocessor class are given as arguments to this method. A reference to the PairwiseEMD object is returned.

description

std::string description()

Returns

A string that describes the PairwiseEMD object.

clear

void clear()

Frees some memory in use by the PairwiseEMD object. This should not normally need to be called by the user.