Colloquium Talk: Welcome to Optimal Transport

Here is the rough content of our presentation October 18, 2022 at MRU. EA 1026.

Optimal Transport starts with question: how to get from $x$ to $y$? Or specifically, how to get unit mass $\delta_x$ from a productive source to a unit mass $\delta_y$ at consumption target? Here is instance where we have too many choices! So we begin to consider Hamiltonian Principle, and looking for ground states of energies, or, cost minimizers.

To motivate transport between Borel-Radon measures. We did not emphasize the continuity and stability of OT like we wanted to. But Borel-Radon measures and weak_$*$ convergence and the continuity properties are very important for applications to topology. So we imagine a distribution $\sigma$ of productive sources and a given target $\tau$ of consumptive targets, and the question we are asking is "How to get from $\sigma$ to $\tau$?"

Now as we'll describe the OT program requires a choice of $\sigma$, $\tau$, and a choice of cost $c: X\times Y \to {\bf{R}} \cup\{ +\infty\}$.

Monge, Deblais et Reblais

Let's go through history and start with Monge [(1761)] where he's literally moving volumes of earth for Napoleon. The Monge Problem is $$(MP) : ~~ \inf_{T} \int ||x- T(x)||d\sigma(x)$$ where the infimum is over all maps $T: {\bf{R}}^n \to {\bf{R}}^n$ satisfying $T\#\sigma = \tau$ (if any such maps exist at all!) Given some basic regularity on $T$, i.e. Lipschitz, then we can infer the Monge-Ampere equation $$det(DT(x)) = \frac{g(T(x))}{f(x)}$$ but this equation is effectively useuless in practice.

Okay, so our point is this: that Monge started with a very difficult problem, namely with the unfortunate cost $c(x,y)=d^1(x,y) = ||x-y||$. The fact that this cost is not strictly convex is problematic in practice. Moreover the fact that $x\mapsto ||x||$ is not differentiable at $x=0$ is also extremely problematic in practice. In otherwords, the $L^1$ norm is a very difficult choice of cost, especially to begin with. As we will explain below, in a certain sense the most canonical cost is actually the quadratic cost $c(x,y)=d^2(x,y)/2 = ||x-y||^2/2$.

Kantorovich's Linearization

After Monge, we next consider Leonid Kantorovich, and this is big jump in history. We say that Kantorovich basically linearized the whole problem, replacing the rather intractable space of maps $T: X \to Y$ with the space of Borel-Radon measures $\pi$ on the product $X\times Y$, and the pushforward condition $T\# \sigma = \tau$ being replaced with two constraints $$pr_X \# \pi \leq \sigma, ~~~ pr_Y \# \pi =\tau.$$ Here $pr_X$ and $pr_Y$ are the canonical projections from $X\times Y$ to $X,Y$, respectively. Let $SC(\sigma, \tau)$ represent this set of semicoupling measures from the source $\sigma$ to target $\tau$. The important point is that $SC(\sigma, \tau)$ is a nonempty, convex, weak-$*$ compact subset of Borel-Radon measures on $X\times Y$.

Next Kantorovich observes that for continuous costs $c: X\times Y \to {\bf{R}} \cup \{+\infty\}$ we obtain continuous linear functionals on $SC$ defined by $$\pi \mapsto \int c(x,y) ~d\pi(x,y).$$ So for every choice of source, target, and cost, we find the Monge-Kantorovich program: $$(MK): ~~~ \min_{\pi\in SC(\sigma, \tau)} \int c(x,y) ~ d\pi(x,y).$$ From this program we see that indeed the minimum is attained, being a linear function on a convex compact set.

At this stage we would like to focus on finitely supported source and targets, namely $$\sigma = \delta_{x_1} +\cdots+\delta_{x_n}$$ and $$\tau = \delta_{y_1} +\cdots+\delta_{y_n}.$$

Then the set of measure-preserving mappings from $\sigma$ to $\tau$ corresponds to bijective maps from the n-element sets, i.e. basically corresponding to a permutation from $X$ to $Y$. However Kantorovich introduces the set of bistochastic $n\times n$ bistochastic matrices $$BS(n) = \{(a_{ij}) \} $$ where $a_{ij}$ satisfes $$ a_{ij} \geq 0,~~~ \sum_i a_{ij} = 1 = \sum_j a_{ij}~~ $$ for every choice fo indexes $i,j$. In otherwords all the couplings from $X$ to $Y$ corresponds to an invertible stochastic transformation, where the sum of every column and the sum of every row is equal to unity.

The key point about $BS(n)$ is the so-called Choquet-Birkhoff-von Neumman theorem that the set of extreme points of $BS(n)$ is exactly the $n!$ set of permutation matrices $extreme(BS(n)) \approx Sym_n$. This means every linear functional on $BS(n)$ is minimized at a permutation matrix, i.e. represented by a bijection map $X \to Y$. Equivalently the optimal coupling will not "split" any mass. The above remarks are intended to give some idea of the structure of the set of semicouplings $SC$. It's somehow accessible.

So again the Monge perspective is to minimize over the bijection maps $T: X\to Y$, i.e. permutations $T: \underline{n} \to \underline{n}$. The linearized Kantorovich perspective to study the convex set $BS(n)$ which consists of multivalued type maps between $\underline{n}$ and $\underline{n}$. Then we seek to minimize a linear function over a convex set (as opposed to minimizing over the discrete set of extreme points).

Kantorovich's Dual Max Program

The next step is to introduce Kantorovich's dual max program. How can we motivate this amazing step? Abruptly we can just introduce the transporters point of view, which consists of setting prices $-\phi(x)$ on the source $x\in X$, and prices $\psi(y)$ on the target $y\in Y$. The prices are competitive if and only if they satisfy the important inequality $$-\phi(x) + \psi(y) \leq c(x,y)$$ pointwise for all $x\in X$, $y\in Y$. The Kantorovich program looks to maximize the surplus over all pairs of potentials $\phi, \psi$ on $X, Y$, respectively. Specifically: $$(MAX)~:~~\sup_{\phi, \psi~} \left( -\int \phi(x)d\sigma(x) + \int \psi(y) d\tau(y) \right),$$ where the supremum is taken over all potentials satisfying the above inequality $-\phi+\psi \leq c$. This inequality implies immediately that (MAX) is less than or equal to (MK), i.e. $$(MAX) \leq (MK).$$ However the amazing fact underlying optimal transportation is that the max equals the min, and there is no duality gap. This implies that $c$-optimal semicouplings are supported on the set where equality is attained $$-\phi(x)+\psi(y) \leq c(x,y)$$ with equality if and only if $(x,y)\in spt(\pi)$. This motivates the definition of $c$-subdifferentials $\partial^c \phi(x)$.

(Twist) Condition and Uniqueness

Basic Regularity

Topics Briefly Touched On:

Kantorovich's dual program enables a dimension reduction! We don't really need to search out the $dim(X) \times dim(Y)$ space of semicouplings, but instead the optimal transports are defined by basically $dim(Y)$ parameters, or equivalently $dim(X)$ parameters.
We did not cover McCann's displacement convexity, the qualitative curvature descriptions, that $R_{ij} \geq 0$ if and only if entropy is concave along displacement interpolations.
The singularity $Z$ and the basic Brouwer No-Retract theorem. We interpret semicouplings from $X$ to $\partial X$ as basically representing deformation retracts.
We briefly elaborated how $c(x,y)=d^2(x,y)/2=||x-y||^2/2$ is the canonical cost, or equivalently $c(x,y)=\langle x ,y \rangle$. However there is no real canonical cost from ${\bf{R}}^n \to {\bf{R}}^k$.
Challenging to construct costs between arbitrary surfaces, e.g. $T^2 \times S^2 \to \bf{R}$ or $S_g \times S^2 \to \bf{R}$ where $S_g$ is a genus $g$ hyperbolic surface. In fact more interesting is constructing costs from $X$ to $AG_0(S^2)$ using the Dold-Thom theorem.