EMD
The EMD
class handles computation of individual Wasserstein distances between pairs of distributions. Adopting language from particle physics, we will call the distributions "events," the discrete entities in the ground space "particles," and the particle weights (probability mass) "energy".
The optimal transport problem can be specified as:
where \mathcal E and \mathcal E' are events with M and M' particles, energies \{E_i\} and \{E'_j\}, respectively, and pairwise distance matrix \theta_{ij}. R>0 is a parameter controlling the tradeoff between transporting and creating/destroying energy (R should be large than half the maximum distance in the ground space in order to ensure the EMD defines a proper metric). \beta>0 is an angular exponent, and the EMD only satisfies the triangle inequality for \beta>1 if it is raised to the 1/\beta power.
Python
The Python EMD
function returns an object (either wasserstein.EMDFloat64
or wasserstein.EMDFloat32
, each of which are instantiations of the C++ template EMD
) that can be used to compute EMD distances. It is designed to work with numpy arrays efficiently. The EMDYPhi
function behaves similarly but implements 2\pi periodicity in the second coordinate dimension, and so is suited for using the rapidity-azimuth plane as a ground space.
wasserstein.EMD(R=1.0, beta=1.0, norm=False, do_timing=False,
n_iter_max=100000, epsilon_large_factor=1000.0, epsilon_small_factor=1.0,
dtype='float64')
wasserstein.EMDYPhi(R=1.0, beta=1.0, norm=False, do_timing=False,
n_iter_max=100000, epsilon_large_factor=1000.0, epsilon_small_factor=1.0,
dtype='float64')
Arguments
- R : float
- The R parameter in the EMD definition that controls the relative importance of the two terms. Must be greater than or equal to half of the maximum ground distance in the space in order for the EMD to be a valid metric satisfying the triangle inequality.
- beta : float
- The angular weighting exponent. The internal pairwsie distance matrix is raised to this power prior to solving the optimal transport problem.
- norm : bool
- Whether or not to normalize the particle weights to sum to one prior to computing the EMD.
- do_timing : bool
- Whether or not to keep track of the duration of the underlying computation.
- n_iter_max : int
- Maximum number of iterations for solving the optimal transport problem.
- epsilon_large_factor : float
- Controls some tolerances in the optimal transport solver. This value is multiplied by the floating points epsilon (around 1e-16 for 64-bit floats) to determine the actual tolerance.
- epsilon_small_factor : float
- Analogous to
epsilon_large_factor
but used where the numerical tolerance can be stricter.
- Analogous to
- dtype : one of
{'float64', 'float32', numpy.float32, numpy.float64}
- Controls the floating point precision used in the EMD computation.
Run Computation
__call__
__call__(weights0, coords0, weights1, coords1)
__call__(weights0, weights1, dists)
Compute the Wasserstein distance between two events using the EMD object.
Arguments
- weights0 : 1d numpy.ndarray
- The weight values of the first event.
- weights1 : 1d numpy.ndarray
- The weight values of the second event.
In the first version, the Euclidean ground distance is used where the particle coordinates are specified as:
- coords0 : 2d numpy.ndarray
- The Cartesian coordinates of the particles in the first event. If the ground space dimension is
d
, then it has shape(M0, d)
whereM0
is the length ofweights0
.
- The Cartesian coordinates of the particles in the first event. If the ground space dimension is
- coords1 : 2d numpy.ndarray
- The Cartesian coordinates of the particles in the second event. If the ground space dimension is
d
, then it has shape(M1, d)
whereM1
is the length ofweights1
.
- The Cartesian coordinates of the particles in the second event. If the ground space dimension is
In the second version, the ground space distances are provided directly:
- dists : 2d numpy.ndarray
- The ground distances between particles in the first and second events. To use the above notation, it has shape
(M0, M1)
.
- The ground distances between particles in the first and second events. To use the above notation, it has shape
Returns
- float
- The Wasserstein distance between event0 and event1. See also
emd()
.
- The Wasserstein distance between event0 and event1. See also
Access Results
These methods access quantities determined during the most recent computation.
emd
emd()
The Wasserstein distance.
Returns
- float
- The cost of transporting
event0
toevent1
.
- The cost of transporting
dists
dists()
The ground distances between the particles.
Returns
- numpy.ndarray (first version only)
- The ground distance matrix as an array with shape
(n0, n1)
.
- The ground distance matrix as an array with shape
flows
flows()
The optimal flow matrix.
Returns
- numpy.ndarray (first version only)
- The flow matrix as an array with shape
(n0, n1)
.
- The flow matrix as an array with shape
flow
flow(i, j)
flow(ind)
- First version: Access the flow between particle
i
inevent0
andj
inevent1
. Note that negative indices are accepted and count from the end, as usual in Python. - Second version: Access the flow using a 1D index such that
ind = i*n1() + j
.
Arguments
- i : int
- Index of particle in
event0
. AnIndexError
is raised if out of range.
- Index of particle in
- j : int
- Index of particle in
event1
. AnIndexError
is raised if out of range.
- Index of particle in
- ind : int
- Raw flow index, equal to
i*n1() + j
.
- Raw flow index, equal to
Returns
- float
- The amount of flow between the specified particles.
status
status()
The status code of the solver. A non-zero value indicates a problem. See the check_emd_status
function for interpretting the code.
Returns
- int
- Networks simplex status code. Zero indicates success, a non-zero value indicates an error.
n0
n0()
Number of entities passed to the network simplex solver for event0
.
Returns
- int
- The number of particles ultimately used in
event0
. This may be one greater than the number of given particles inevent0
ifnorm=False
andevent0
has less total weight thanevent1
.
- The number of particles ultimately used in
n1
n1()
Number of entities passed to the network simplex solver for event1
.
Returns
- int
- The number of particles ultimately used in
event1
. This may be one greater than the number of given particles inevent1
ifnorm=False
andevent1
has less total weight thanevent0
.
- The number of particles ultimately used in
extra
extra()
Which event, if any, got an extra particle.
Returns
- int
- Which event,
0
or1
, got an extra particle, or if neither did,-1
.
- Which event,
weightdiff
weightdiff()
Total weight in event1
minus total weight in event0
.
Returns
- float
- The internally calculated difference in total weight between
event1
andevent0
.
- The internally calculated difference in total weight between
scale
scale()
The optimal transport problem is internally rescaled in order to provide more numerically stable computations. The scale
is the value that the weights are divided by prior to running the network simplex algorithm.
Returns
- float
- The internally utilized scaling factor for the weights.
duration
duration()
The duration of the core of the optimal transport calculation via the network simplex algorithm.
Returns
- float
- The time, in seconds, of the computation. If
do_timing
was notTrue
for this computation then this value is not meaningful.
- The time, in seconds, of the computation. If
n_iter
````python n_iter()
The number of iterations the most recent optimal transport calculation took to finish.
**Returns**
- _int_
- Number of iterations run by the Network Simplex solver.
#### node_potentials
```python
node_potentials()
The potentials assigned in the Network Simplex algorithm to the particles in each event.
Returns
- (numpy.ndarray, numpy.ndarray)
- A pair of NumPy arrays corresponding to the node potentials in the first and second event.
Get/Set Options
These methods can be used to get the current settings of the EMD
object or to set new ones. The getter methods are:
R()
: returns floatbeta()
: returns floatnorm()
: returns booldo_timing
: returns bool
The setter methods are:
set_R(new_R)
: accepts floatset_beta(new_beta)
: accepts floatset_norm(new_norm)
: accepts boolset_do_timing(new_do_timing)
: accepts boolset_network_simplex_params(n_iter_max=100000, epsilon_large_factor=1000.0, epsilon_small_factor=1.0)
- This method resets all of the underlying network simplex solver's parameters at once.
Other Methods
preprocess_CenterWeightedCentroid
preprocess_CenterWeightedCentroid()
This method adds a CenterWeightedCentroid
preprocessor to the internal list. The particles of each event will be adjusted so that the origin of the ground space corresponds to their weighted centroid.
description
description(write_preprocessors=True)
Returns a string that describes the EMD
object. Printing the EMD
object uses this method to describe the object.
Arguments
- write_preprocessors : bool
- Whether or not to include preprocessors in the description. There are currently no preprocessors included in the Wasserstein package but this may change in the future.
Returns
- string
- The description of the
EMD
object.
- The description of the
clear
clear()
Frees some memory in use by the EMD
object. This should not normally need to be called by the user.
C++
The Wasserstein C++ interface makes heavy use of templates in order to give the user full control over the EMD computation. Most classes are templated to accept a floating point type (either float
or double
), though some aliases are provided to make things easier. The EMD
class provides the implementation of the EMD as defined above. It accepts four template parameters: a floating point Value
type, an Event
class, a PairwiseDistance
class, and a NetworkSimplex
class. The default namespace for the package is wasserstein
.
template<typename Value,
template<typename> class Event = DefaultEvent,
template<typename> class PairwiseDistance = DefaultPairwiseDistance,
template<typename> class NetworkSimplex = DefaultNetworkSimplex>
wasserstein::EMD(Value R = 1, Value beta = 1, bool norm = false,
bool do_timing = false, bool external_dists = false,
unsigned n_iter_max = 100000,
Value epsilon_large_factor = 1000,
Value epsilon_small_factor = 1);
See the Python class of the same name for the meaning of these arguments. external_dists
is a new flag indicating whether or not the internal PairwiseDistance
object should be used to obtain ground distances or whether the ground distances will be externally provided (see the ground_dists()
method below).
Ordinarily, users will make use of one of the template specializations of EMD
, which only require one to specify an Event
and a PairwiseDistance
class. These are:
// double precision, default NetworkSimplex
template<template<typename> class Event = DefaultEvent,
template<typename> class PairwiseDistance = DefaultPairwiseDistance>
using EMDFloat64 = EMD<double, Event, PairwiseDistance>;
// single precision, default NetworkSimplex
template<template<typename> class Event = DefaultEvent,
template<typename> class PairwiseDistance = DefaultPairwiseDistance>
using EMDFloat32 = EMD<float, Event, PairwiseDistance>;
See the Event and PairwiseDistance docs for more on what classes are provided to fill these roles.
Run Computation
There are two methods that can be used to compute an EMD distance between two events, operator()
and compute
. The difference is that the first is templated to accept "proto events" (which may be of type Event
, or are something that Event
can be constructed from) and also preprocesses the events. The second accepts two already constructed events, does not preprocess them, and does not check that their weights have been setup properly.
operator()
template<class ProtoEvent0, class ProtoEvent1>
Value operator()(const ProtoEvent0 & pev0, const ProtoEvent1 & pev1);
This version preprocesses each event to ensure any preprocessors are called and the weights are normalized properly (if norm=true
).
Arguments
- pev0
- The first event will be constructed by calling
Event(pev0)
.
- The first event will be constructed by calling
- pev1
- The second event will be constructed by calling
Event(pev1)
.
- The second event will be constructed by calling
Returns
- The Wasserstein distance between the two events.
compute
Value compute(const Event & ev0, const Event & ev1);
This version computes the distance between the two events as they are, without any preprocessing.
Arguments
- ev0
- The first event.
- ev1
- The second event.
Returns
- The Wasserstein distance between the two events.
ground_dists
std::vector<Value> & ground_dists();
const std::vector<Value> & ground_dists() const;
To provide external ground distances, one should use this method to access a reference to the vector of distances used by the NetworkSimplex
object. The user is free to resize this vector and copy/assign distances into it. Note that if norm=false
, then one of the events will generically get an extra particle to account for the difference in weights, and the external dists should account for this. The distances are expected to be provided in row-major (C-style) order.
Returns
- A (const) reference to the vector of ground distances used by the NetworkSimplex
object.
Access Results
Many of the methods share the same names as the Python ones. These are:
Value emd() const;
wasserstein::EMDStatus status() const;
std::vector<Value> dists() const;
std::vector<Value> flows() const;
Value flow(std::ptrdiff_t i, std::ptrdiff_t j) const;
Value flow(std::size_t ind) const;
std::ptrdiff_t n0() const;
std::ptrdiff_t n1() const;
wasserstein::ExtraParticle extra() const;
Value weightdiff() const;
Value scale() const;
double duration() const;
std::size_t n_iter() const;
std::pair<std::vector<Value>, std::vector<Value>> node_potentials() const;
EMDStatus
and ExtraParticle
are enums; see the C++ Utils
Get/Set Options
These methods can be used to get the current settings of the EMD
object or to set new ones. The getter methods are:
Value R() const;
Value beta() const;
bool norm() const;
bool do_timing() const;
bool external_dists() const;
const NetworkSimplex & network_simplex() const;
const PairwiseDistance & pairwise_distance() const;
external_dists()
indicates whether or not to use the internal PairwiseDistance
object or to assume that the ground distances have been externally set. network_simplex()
and pairwise_distance()
access the underlying objects used to do the heavy lifting of the computation.
The setter methods are:
void set_R(Value R);
void set_beta(Value beta);
void set_norm(bool norm);
void set_do_timing(bool do_timing);
void set_external_dists(bool exdists);
void set_network_simplex_params(unsigned n_iter_max = 100000,
Value epsilon_large_factor = 1000,
Value epsilon_small_factor = 1);
Other Methods
preprocess
template<template<class> class P, typename... Args>
EMD & preprocess(Args && ... args)
Adds a preprocessor to the internal list. Each event will be preprocessed by the preprocessors in the order they were given. Currently, there is one preprocessor as part of Wasserstein, CenterWeightedCentroid
. Since this preprocessor takes no arguments, it can be added as:
emd_obj.preprocess<wasserstein::CenterWeightedCentroid>()
In general, any arguments to the preprocessor class are given as arguments to this method. A reference to the EMD
object is returned.
description
std::string description(bool write_preprocessors = true)
Returns a string that describes the EMD
object.
Arguments
- write_preprocessors
- Whether or not to include preprocessors in the description. There are currently no preprocessors included in the Wasserstein package but this may change in the future.
Returns
- The description of the
EMD
object.
clear
void clear()
Frees some memory in use by the EMD
object. This should not normally need to be called by the user.