Locality-Sensitive Hashing
$$ \gdef\S{\mathrm{S}} \gdef\darc{d_{\mathrm{arc}}} \gdef\a#1#2{\mathrm{A}_{#1}\p{#2}} \gdef\sign#1{\mathrm{sign}\p{#1}} $$
Let $\S_n$ be an $n$-dimensional unit $n$-sphere
$$ \begin{align} \mathrm{S}_n = \setb{\vec x ∈ \R^n}{\norm{\vec x}_2 = 1} \end{align} $$
and let $\darc$ be the arc length distance
$$ \begin{align} \darc\p{\vec a, \vec b} = \cos^{-1}\p{\vec a ⋅ \vec b} \end{align} $$
Random projections
-
$\mathcal S$ The uniform distribution on $\S^n$.
-
$\mathcal F$ The distribution by constructing.
-
$n$ The number of dimensions.
-
$m$ The number of hyperplanes in a vector.
-
$k$ The number of vectors.
Hyperplanes are defined by a normal unit vector $\vec h ∈ \S^n$. For a given vector $\vec x$ we can determine on which side of the plane it lies by the sign of the dot product $\vec x ⋅ \vec h$.
$$ \begin{align} \mathrm{Pr}_{\vec h ∈ \mathcal S}\left[\sign{\vec h ⋅ \vec a} = \sign{\vec h ⋅ \vec b} \right] = 1 - \frac{\darc\p{\vec a, \vec b}}{π} \end{align} $$
If we pick $m$ random hyperplanes $\vec h_i ∈ \S^n$ we can map each vector $\vec x ∈ \S^n$ into a $m$-bit vector $b(\vec x)$:
$$ \begin{align} b_i\p{\vec x} = \begin{cases} 0 & \vec h_i ⋅ \vec x ≤ 0 \\ 1 & \vec h_i ⋅ \vec x > 0 \end{cases} \end{align} $$
Assuming the hyperplanes are uniformly random, the probability of two bitvectors being equal is:
$$ \begin{align} \Pr{b\p{\vec a} = b\p{\vec b}} = \p{1 - \frac{\darc\p{\vec a, \vec b}}{π} }^m \end{align} $$
The hyperplanes form a tesselation of $\S^n$ where each region has a unique bitvector and each boundary represents a single bit change. Interestingly this implies that a Hamiltonian path through regions (if one exists) creates a Gray code.
We have two distributions on $\S^n$, $\mathcal{U}_{match}$ and $\mathcal{U}_{non-match}$. Given a vector $\vec x$ we want to classify it's distribution.
Assumption. Iris embeddings of different irises are uniformly distributed on $\S^n$ so the $\darc$ of non-matching irises follows the formula.
$$ \begin{align} \Pr{b\p{\vec a} = b\p{\vec b}} = \p{1 - \frac{\darc\p{\vec a, \vec b}}{π} }^m \end{align} $$
Amplification
What is the probability of a match if $\darc < T$.
We are a looking for a family of hash functions $\mathcal H$, $h: \S^n → H$ such that
- $∀_{\vec a, \vec b ∈ \S^n}\, \darc\p{\vec a, \vec b} < T → \mathrm{Pr}_{h ∈ \mathcal H}\left[h\p{\vec a} = h\p{\vec b} \right] ≥ P_1$
- $∀_{\vec a \S^n}\, \mathrm{Pr}_{h ∈ \mathcal H}\left[h\p{\vec a} = h\p{\vec b} \right] ≥ P_1$
Irreversibility
Assume we are given $H\p{\vec a}$, what can we learn about $\vec a$?
We can at best hope to learn all of the hyper-plane signs. This is $m⋅k$ bits of information. The subset of $S^n$ that has the same signs is a convex polygon with an (assumed) expected area of $2^{-m⋅k} A_n$. We would not have any more information than this, so our posterior would at best be the uniform distribution over this region.
We can start by drawing many samples from $\mathcal S$. and find matching vectors $k$.
TODO https://www.notion.so/Draft-Outline-99ecafcd45384d58a06bc1003fe1088b
https://en.wikipedia.org/wiki/Locality-sensitive_hashing
https://www.cs.princeton.edu/courses/archive/spring04/cos598B/bib/CharikarEstim.pdf
https://arxiv.org/abs/2010.09393
https://randorithms.com/2019/09/19/Visual-LSH.html