Locality-Sensitive Hashing

\gdef\S{\mathrm{S}} \gdef\darc{d_{\mathrm{arc}}} \gdef\a#1#2{\mathrm{A}_{#1}\p{#2}} \gdef\sign#1{\mathrm{sign}\p{#1}}

Let \S_n be the unit sphere in \R^n

\begin{align} \mathrm{S}_n = \setb{\vec x ∈ \R^n}{\norm{\vec x}_2 = 1} \end{align}

and let \darc be the arc-length distance

\begin{align} \darc\p{\vec a, \vec b} = \cos^{-1}\p{\vec a ⋅ \vec b} \end{align}
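
For concreteness, a minimal numerical version (Python with NumPy; the function name is my own):

```python
import numpy as np

def d_arc(a: np.ndarray, b: np.ndarray) -> float:
    """Arc-length (angular) distance between unit vectors a and b."""
    # Clip guards against floating-point dot products slightly outside [-1, 1].
    return float(np.arccos(np.clip(np.dot(a, b), -1.0, 1.0)))
```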

Random projections

A hyperplane through the origin is defined by its unit normal vector \vec h ∈ \S_n. For a given vector \vec x we can determine on which side of the hyperplane it lies from the sign of the dot product \vec x ⋅ \vec h. A uniformly random hyperplane separates \vec a and \vec b with probability \darc\p{\vec a, \vec b}/π, so

\begin{align} \mathrm{Pr}_{\vec h ∈ \S_n}\left[\sign{\vec h ⋅ \vec a} = \sign{\vec h ⋅ \vec b} \right] = 1 - \frac{\darc\p{\vec a, \vec b}}{π} \end{align}

If we pick m random hyperplanes \vec h_i ∈ \S_n, we can map each vector \vec x ∈ \S_n to an m-bit vector b(\vec x):

\begin{align} b_i\p{\vec x} = \begin{cases} 0 & \vec h_i ⋅ \vec x ≤ 0 \\ 1 & \vec h_i ⋅ \vec x > 0 \end{cases} \end{align}
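
A concrete sketch of this map (my own helper names, NumPy assumed). Drawing the normals as normalized Gaussian vectors makes them uniform on \S_n:

```python
import numpy as np

def random_hyperplanes(m: int, n: int, rng: np.random.Generator) -> np.ndarray:
    """m unit normals uniform on the sphere: normalized Gaussian rows."""
    H = rng.normal(size=(m, n))
    return H / np.linalg.norm(H, axis=1, keepdims=True)

def bitvector(H: np.ndarray, x: np.ndarray) -> np.ndarray:
    """b_i(x) = 1 iff h_i ⋅ x > 0 (0 on the boundary, as defined above)."""
    return (H @ x > 0).astype(np.uint8)
```

Normalizing the rows does not change any sign, but it matches the definition \vec h_i ∈ \S_n.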

Assuming the hyperplanes are drawn independently and uniformly at random, the probability that the two bitvectors are equal is:

\begin{align} \Pr{b\p{\vec a} = b\p{\vec b}} = \p{1 - \frac{\darc\p{\vec a, \vec b}}{π} }^m \end{align}
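
A quick Monte Carlo check of this formula (a self-contained sketch; the parameters are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, trials, phi = 64, 16, 20_000, 0.1   # phi = target arc distance

# Construct unit vectors a, b at arc distance exactly phi.
a = rng.normal(size=n); a /= np.linalg.norm(a)
u = rng.normal(size=n); u -= (u @ a) * a; u /= np.linalg.norm(u)  # u ⟂ a
b = np.cos(phi) * a + np.sin(phi) * u

hits = 0
for _ in range(trials):
    H = rng.normal(size=(m, n))                  # fresh random normals
    hits += np.array_equal(H @ a > 0, H @ b > 0)

print(hits / trials)               # empirical collision rate
print((1 - phi / np.pi) ** m)      # formula, ≈ 0.60 for these parameters
```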

The hyperplanes form a tessellation of \S_n in which each region has a unique bitvector and each boundary crossing flips a single bit. Interestingly, this implies that a Hamiltonian path through the regions (if one exists) traces out a Gray code.

We have two distributions on \S_n, \mathcal{U}_{match} and \mathcal{U}_{non-match}. Given a vector \vec x, we want to determine from which distribution it was drawn.
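
In the hashed domain this becomes a threshold test on Hamming distance (a sketch; by the single-hyperplane probability above, each bit differs independently with probability \darc/π):

```python
import numpy as np

def estimate_arc(bits_a: np.ndarray, bits_b: np.ndarray) -> float:
    """Each bit differs independently with probability d_arc/π, so the
    fraction of differing bits, times π, is an unbiased estimate of d_arc."""
    return float(np.pi * np.mean(bits_a != bits_b))

def classify(bits_a: np.ndarray, bits_b: np.ndarray, T: float) -> bool:
    """Declare a match when the estimated arc distance is below threshold T."""
    return estimate_arc(bits_a, bits_b) < T
```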

Assumption. Iris embeddings of different irises are uniformly distributed on \S_n, so the \darc of non-matching irises follows the distribution of the angle between uniformly random points, and their collision probability follows the formula above.
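
That angle distribution is a standard fact (my addition, for context): for independent uniform \vec a, \vec b ∈ \S_n, the density of θ = \darc\p{\vec a, \vec b} is

\begin{align} p\p{θ} \propto \sin^{n-2} θ, \qquad θ ∈ [0, π] \end{align}

In high dimensions this concentrates sharply around π/2, so non-matching pairs are nearly orthogonal and almost never collide.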

Amplification

What is the probability of a match if \darc < T?

We are looking for a family of hash functions \mathcal H, h: \S_n → H, that is \p{d_1, d_2, p_1, p_2}-sensitive: for all \vec a, \vec b ∈ \S_n

\begin{align} \darc\p{\vec a, \vec b} ≤ d_1 ⇒ \Pr{h\p{\vec a} = h\p{\vec b}} ≥ p_1, \qquad \darc\p{\vec a, \vec b} ≥ d_2 ⇒ \Pr{h\p{\vec a} = h\p{\vec b}} ≤ p_2 \end{align}

The gap between p_1 and p_2 is then amplified by taking the AND of k hashes within a band and the OR of L bands.
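
A minimal sketch of how this amplification transforms the per-hash collision probability (the standard AND-OR construction; the function and parameter names are mine):

```python
def amplified(p: float, k: int, L: int) -> float:
    """Collision probability after AND over k hashes and OR over L bands.

    A band matches only if all k of its hashes agree (probability p**k);
    the query matches if at least one of the L bands matches.
    """
    return 1.0 - (1.0 - p ** k) ** L
```

With the random-projection family, p = 1 - \darc/π, so k and L can be tuned to make the collision probability drop steeply around a chosen threshold T.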

Irreversibility

Assume we are given H\p{\vec a}; what can we learn about \vec a?

We can at best hope to learn all of the hyperplane signs, which amounts to m⋅k bits of information. The subset of \S_n with the same signs is a spherically convex cell with an (assumed) expected area of 2^{-m⋅k} A_n, where A_n is the total surface area of \S_n. We would not have any more information than this, so our posterior is at best the uniform distribution over this region.

We can start by drawing many samples from \S_n and keeping the k vectors whose hash matches H\p{\vec a}.
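
A brute-force sketch of that attack (illustrative parameters; with m hyperplanes the acceptance rate is on the order of 2^{-m} per draw):

```python
import numpy as np

rng = np.random.default_rng(1)
n, m = 16, 8                        # kept small so matches are easy to find

H = rng.normal(size=(m, n))         # the (known or leaked) hyperplane normals
a = rng.normal(size=n); a /= np.linalg.norm(a)
target = H @ a > 0                  # the published bitvector b(a)

# Rejection sampling: uniform draws on the sphere, keep the hash matches.
matches = []
while len(matches) < 100:
    x = rng.normal(size=n); x /= np.linalg.norm(x)
    if np.array_equal(H @ x > 0, target):
        matches.append(x)

# The accepted samples approximate the uniform posterior over a's cell;
# their (normalized) mean is a crude point estimate of a.
est = np.mean(matches, axis=0); est /= np.linalg.norm(est)
print(float(est @ a))               # cosine similarity to the true vector
```

This matches the bound above: the accepted samples carry no information beyond the sign bits themselves.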


