EMD

The EMD class handles computation of individual Wasserstein distances between pairs of distributions. Adopting language from particle physics, we will call the distributions "events," the discrete entities in the ground space "particles," and the particle weights (probability mass) "energy".

The optimal transport problem can be specified as:

\text{EMD}_{\beta,R}(\mathcal E,\mathcal E^\prime)=\min_{\{f_{ij}\ge0\}}\sum_{i=1}^M\sum_{j=1}^{M'}f_{ij}\left(\frac{ \theta_{ij}}{R}\right)^\beta + \left|\sum_{i=1}^ME_i-\sum_{j=1}^{M'}E^\prime_j \right|,
\sum_{j=1}^{M'}f_{ij}\le E_i, \quad \sum_{i=1}^Mf_{ij}\le E^\prime_j, \quad\sum_{i=1}^M\sum_{j=1}^{M'}f_{ij}=\min\left(\sum_{i=1}^ME_i,\sum_{j=1}^{M'}E^\prime_j\right),

where \mathcal E and \mathcal E' are events with M and M' particles, energies \{E_i\} and \{E'_j\}, respectively, and pairwise distance matrix \theta_{ij}. R>0 is a parameter controlling the tradeoff between transporting and creating/destroying energy (R should be large than half the maximum distance in the ground space in order to ensure the EMD defines a proper metric). \beta>0 is an angular exponent, and the EMD only satisfies the triangle inequality for \beta>1 if it is raised to the 1/\beta power.


Python

The Python EMD function returns an object (either wasserstein.EMDFloat64 or wasserstein.EMDFloat32, each of which are instantiations of the C++ template EMD) that can be used to compute EMD distances. It is designed to work with numpy arrays efficiently. The EMDYPhi function behaves similarly but implements 2\pi periodicity in the second coordinate dimension, and so is suited for using the rapidity-azimuth plane as a ground space.

wasserstein.EMD(R=1.0, beta=1.0, norm=False, do_timing=False,
                n_iter_max=100000, epsilon_large_factor=1000.0, epsilon_small_factor=1.0,
                dtype='float64')

wasserstein.EMDYPhi(R=1.0, beta=1.0, norm=False, do_timing=False,
                    n_iter_max=100000, epsilon_large_factor=1000.0, epsilon_small_factor=1.0,
                    dtype='float64')

Arguments

  • R : float
    • The R parameter in the EMD definition that controls the relative importance of the two terms. Must be greater than or equal to half of the maximum ground distance in the space in order for the EMD to be a valid metric satisfying the triangle inequality.
  • beta : float
    • The angular weighting exponent. The internal pairwsie distance matrix is raised to this power prior to solving the optimal transport problem.
  • norm : bool
    • Whether or not to normalize the particle weights to sum to one prior to computing the EMD.
  • do_timing : bool
    • Whether or not to keep track of the duration of the underlying computation.
  • n_iter_max : int
    • Maximum number of iterations for solving the optimal transport problem.
  • epsilon_large_factor : float
    • Controls some tolerances in the optimal transport solver. This value is multiplied by the floating points epsilon (around 1e-16 for 64-bit floats) to determine the actual tolerance.
  • epsilon_small_factor : float
    • Analogous to epsilon_large_factor but used where the numerical tolerance can be stricter.
  • dtype : one of {'float64', 'float32', numpy.float32, numpy.float64}
    • Controls the floating point precision used in the EMD computation.

Run Computation

__call__

__call__(weights0, coords0, weights1, coords1)
__call__(weights0, weights1, dists)

Compute the Wasserstein distance between two events using the EMD object.

Arguments

  • weights0 : 1d numpy.ndarray
    • The weight values of the first event.
  • weights1 : 1d numpy.ndarray
    • The weight values of the second event.

In the first version, the Euclidean ground distance is used where the particle coordinates are specified as:

  • coords0 : 2d numpy.ndarray
    • The Cartesian coordinates of the particles in the first event. If the ground space dimension is d, then it has shape (M0, d) where M0 is the length of weights0.
  • coords1 : 2d numpy.ndarray
    • The Cartesian coordinates of the particles in the second event. If the ground space dimension is d, then it has shape (M1, d) where M1 is the length of weights1.

In the second version, the ground space distances are provided directly:

  • dists : 2d numpy.ndarray
    • The ground distances between particles in the first and second events. To use the above notation, it has shape (M0, M1).

Returns

  • float
    • The Wasserstein distance between event0 and event1. See also emd().

Access Results

These methods access quantities determined during the most recent computation.

emd

emd()

The Wasserstein distance.

Returns

  • float
    • The cost of transporting event0 to event1.

dists

dists()

The ground distances between the particles.

Returns

  • numpy.ndarray (first version only)
    • The ground distance matrix as an array with shape (n0, n1).

flows

flows()

The optimal flow matrix.

Returns

  • numpy.ndarray (first version only)
    • The flow matrix as an array with shape (n0, n1).

flow

flow(i, j)
flow(ind)
  • First version: Access the flow between particle i in event0 and j in event1. Note that negative indices are accepted and count from the end, as usual in Python.
  • Second version: Access the flow using a 1D index such that ind = i*n1() + j.

Arguments

  • i : int
    • Index of particle in event0. An IndexError is raised if out of range.
  • j : int
    • Index of particle in event1. An IndexError is raised if out of range.
  • ind : int
    • Raw flow index, equal to i*n1() + j.

Returns

  • float
    • The amount of flow between the specified particles.

status

status()

The status code of the solver. A non-zero value indicates a problem. See the check_emd_status function for interpretting the code.

Returns

  • int
    • Networks simplex status code. Zero indicates success, a non-zero value indicates an error.

n0

n0()

Number of entities passed to the network simplex solver for event0.

Returns

  • int
    • The number of particles ultimately used in event0. This may be one greater than the number of given particles in event0 if norm=False and event0 has less total weight than event1.

n1

n1()

Number of entities passed to the network simplex solver for event1.

Returns

  • int
    • The number of particles ultimately used in event1. This may be one greater than the number of given particles in event1 if norm=False and event1 has less total weight than event0.

extra

extra()

Which event, if any, got an extra particle.

Returns

  • int
    • Which event, 0 or 1, got an extra particle, or if neither did, -1.

weightdiff

weightdiff()

Total weight in event1 minus total weight in event0.

Returns

  • float
    • The internally calculated difference in total weight between event1 and event0.

scale

scale()

The optimal transport problem is internally rescaled in order to provide more numerically stable computations. The scale is the value that the weights are divided by prior to running the network simplex algorithm.

Returns

  • float
    • The internally utilized scaling factor for the weights.

duration

duration()

The duration of the core of the optimal transport calculation via the network simplex algorithm.

Returns

  • float
    • The time, in seconds, of the computation. If do_timing was not True for this computation then this value is not meaningful.

n_iter

````python n_iter()

The number of iterations the most recent optimal transport calculation took to finish.

**Returns**

- _int_
    - Number of iterations run by the Network Simplex solver.

#### node_potentials

```python
node_potentials()

The potentials assigned in the Network Simplex algorithm to the particles in each event.

Returns

  • (numpy.ndarray, numpy.ndarray)
    • A pair of NumPy arrays corresponding to the node potentials in the first and second event.

Get/Set Options

These methods can be used to get the current settings of the EMD object or to set new ones. The getter methods are:

  • R() : returns float
  • beta() : returns float
  • norm() : returns bool
  • do_timing : returns bool

The setter methods are:

  • set_R(new_R) : accepts float
  • set_beta(new_beta) : accepts float
  • set_norm(new_norm) : accepts bool
  • set_do_timing(new_do_timing) : accepts bool
  • set_network_simplex_params(n_iter_max=100000, epsilon_large_factor=1000.0, epsilon_small_factor=1.0)
    • This method resets all of the underlying network simplex solver's parameters at once.

Other Methods

preprocess_CenterWeightedCentroid

preprocess_CenterWeightedCentroid()

This method adds a CenterWeightedCentroid preprocessor to the internal list. The particles of each event will be adjusted so that the origin of the ground space corresponds to their weighted centroid.

description

description(write_preprocessors=True)

Returns a string that describes the EMD object. Printing the EMD object uses this method to describe the object.

Arguments

  • write_preprocessors : bool
    • Whether or not to include preprocessors in the description. There are currently no preprocessors included in the Wasserstein package but this may change in the future.

Returns

  • string
    • The description of the EMD object.

clear

clear()

Frees some memory in use by the EMD object. This should not normally need to be called by the user.


C++

The Wasserstein C++ interface makes heavy use of templates in order to give the user full control over the EMD computation. Most classes are templated to accept a floating point type (either float or double), though some aliases are provided to make things easier. The EMD class provides the implementation of the EMD as defined above. It accepts four template parameters: a floating point Value type, an Event class, a PairwiseDistance class, and a NetworkSimplex class. The default namespace for the package is wasserstein.

template<typename Value,
         template<typename> class Event = DefaultEvent,
         template<typename> class PairwiseDistance = DefaultPairwiseDistance,
         template<typename> class NetworkSimplex = DefaultNetworkSimplex>
wasserstein::EMD(Value R = 1, Value beta = 1, bool norm = false,
                 bool do_timing = false, bool external_dists = false,
                 unsigned n_iter_max = 100000,
                 Value epsilon_large_factor = 1000,
                 Value epsilon_small_factor = 1);

See the Python class of the same name for the meaning of these arguments. external_dists is a new flag indicating whether or not the internal PairwiseDistance object should be used to obtain ground distances or whether the ground distances will be externally provided (see the ground_dists() method below).

Ordinarily, users will make use of one of the template specializations of EMD, which only require one to specify an Event and a PairwiseDistance class. These are:

// double precision, default NetworkSimplex
template<template<typename> class Event = DefaultEvent,
         template<typename> class PairwiseDistance = DefaultPairwiseDistance>
using EMDFloat64 = EMD<double, Event, PairwiseDistance>;

// single precision, default NetworkSimplex
template<template<typename> class Event = DefaultEvent,
         template<typename> class PairwiseDistance = DefaultPairwiseDistance>
using EMDFloat32 = EMD<float, Event, PairwiseDistance>;

See the Event and PairwiseDistance docs for more on what classes are provided to fill these roles.


Run Computation

There are two methods that can be used to compute an EMD distance between two events, operator() and compute. The difference is that the first is templated to accept "proto events" (which may be of type Event, or are something that Event can be constructed from) and also preprocesses the events. The second accepts two already constructed events, does not preprocess them, and does not check that their weights have been setup properly.

operator()

template<class ProtoEvent0, class ProtoEvent1>
Value operator()(const ProtoEvent0 & pev0, const ProtoEvent1 & pev1);

This version preprocesses each event to ensure any preprocessors are called and the weights are normalized properly (if norm=true).

Arguments

  • pev0
    • The first event will be constructed by calling Event(pev0).
  • pev1
    • The second event will be constructed by calling Event(pev1).

Returns

  • The Wasserstein distance between the two events.

compute

Value compute(const Event & ev0, const Event & ev1);

This version computes the distance between the two events as they are, without any preprocessing.

Arguments

  • ev0
    • The first event.
  • ev1
    • The second event.

Returns

  • The Wasserstein distance between the two events.

ground_dists

std::vector<Value> & ground_dists();
const std::vector<Value> & ground_dists() const;

To provide external ground distances, one should use this method to access a reference to the vector of distances used by the NetworkSimplex object. The user is free to resize this vector and copy/assign distances into it. Note that if norm=false, then one of the events will generically get an extra particle to account for the difference in weights, and the external dists should account for this. The distances are expected to be provided in row-major (C-style) order.

Returns - A (const) reference to the vector of ground distances used by the NetworkSimplex object.


Access Results

Many of the methods share the same names as the Python ones. These are:

Value emd() const;
wasserstein::EMDStatus status() const;

std::vector<Value> dists() const;
std::vector<Value> flows() const;
Value flow(std::ptrdiff_t i, std::ptrdiff_t j) const;
Value flow(std::size_t ind) const;

std::ptrdiff_t n0() const;
std::ptrdiff_t n1() const;
wasserstein::ExtraParticle extra() const;
Value weightdiff() const;
Value scale() const;

double duration() const;
std::size_t n_iter() const;

std::pair<std::vector<Value>, std::vector<Value>> node_potentials() const;

EMDStatus and ExtraParticle are enums; see the C++ Utils


Get/Set Options

These methods can be used to get the current settings of the EMD object or to set new ones. The getter methods are:

Value R() const;
Value beta() const;
bool norm() const;
bool do_timing() const;
bool external_dists() const;
const NetworkSimplex & network_simplex() const;
const PairwiseDistance & pairwise_distance() const;

external_dists() indicates whether or not to use the internal PairwiseDistance object or to assume that the ground distances have been externally set. network_simplex() and pairwise_distance() access the underlying objects used to do the heavy lifting of the computation.

The setter methods are:

void set_R(Value R);
void set_beta(Value beta);
void set_norm(bool norm);
void set_do_timing(bool do_timing);
void set_external_dists(bool exdists);
void set_network_simplex_params(unsigned n_iter_max = 100000,
                                Value epsilon_large_factor = 1000,
                                Value epsilon_small_factor = 1);

Other Methods

preprocess

template<template<class> class P, typename... Args>
EMD & preprocess(Args && ... args)

Adds a preprocessor to the internal list. Each event will be preprocessed by the preprocessors in the order they were given. Currently, there is one preprocessor as part of Wasserstein, CenterWeightedCentroid. Since this preprocessor takes no arguments, it can be added as:

emd_obj.preprocess<wasserstein::CenterWeightedCentroid>()

In general, any arguments to the preprocessor class are given as arguments to this method. A reference to the EMD object is returned.

description

std::string description(bool write_preprocessors = true)

Returns a string that describes the EMD object.

Arguments

  • write_preprocessors
    • Whether or not to include preprocessors in the description. There are currently no preprocessors included in the Wasserstein package but this may change in the future.

Returns

  • The description of the EMD object.

clear

void clear()

Frees some memory in use by the EMD object. This should not normally need to be called by the user.