Locality-Sensitive Hashing
\gdef\S{\mathrm{S}} \gdef\darc{d_{\mathrm{arc}}} \gdef\a#1#2{\mathrm{A}_{#1}\p{#2}} \gdef\sign#1{\mathrm{sign}\p{#1}}
Let \S_n be an n-dimensional unit n-sphere
\begin{align} \mathrm{S}_n = \setb{\vec x ∈ \R^n}{\norm{\vec x}_2 = 1} \end{align}
and let \darc be the arc length distance
\begin{align} \darc\p{\vec a, \vec b} = \cos^{-1}\p{\vec a ⋅ \vec b} \end{align}
Random projections
-
\mathcal S The uniform distribution on \S^n.
-
\mathcal F The distribution by constructing.
-
n The number of dimensions.
-
m The number of hyperplanes in a vector.
-
k The number of vectors.
Hyperplanes are defined by a normal unit vector \vec h ∈ \S^n. For a given vector \vec x we can determine on which side of the plane it lies by the sign of the dot product \vec x ⋅ \vec h.
\begin{align} \mathrm{Pr}_{\vec h ∈ \mathcal S}\left[\sign{\vec h ⋅ \vec a} = \sign{\vec h ⋅ \vec b} \right] = 1 - \frac{\darc\p{\vec a, \vec b}}{π} \end{align}
If we pick m random hyperplanes \vec h_i ∈ \S^n we can map each vector \vec x ∈ \S^n into a m-bit vector b(\vec x):
\begin{align} b_i\p{\vec x} = \begin{cases} 0 & \vec h_i ⋅ \vec x ≤ 0 \\ 1 & \vec h_i ⋅ \vec x > 0 \end{cases} \end{align}
Assuming the hyperplanes are uniformly random, the probability of two bitvectors being equal is:
\begin{align} \Pr{b\p{\vec a} = b\p{\vec b}} = \p{1 - \frac{\darc\p{\vec a, \vec b}}{π} }^m \end{align}
The hyperplanes form a tesselation of \S^n where each region has a unique bitvector and each boundary represents a single bit change. Interestingly this implies that a Hamiltonian path through regions (if one exists) creates a Gray code.
We have two distributions on \S^n, \mathcal{U}_{match} and \mathcal{U}_{non-match}. Given a vector \vec x we want to classify it's distribution.
Assumption. Iris embeddings of different irises are uniformly distributed on \S^n so the \darc of non-matching irises follows the formula.
\begin{align} \Pr{b\p{\vec a} = b\p{\vec b}} = \p{1 - \frac{\darc\p{\vec a, \vec b}}{π} }^m \end{align}
Amplification
What is the probability of a match if \darc < T.
We are a looking for a family of hash functions \mathcal H, h: \S^n → H such that
- ∀_{\vec a, \vec b ∈ \S^n}\, \darc\p{\vec a, \vec b} < T → \mathrm{Pr}_{h ∈ \mathcal H}\left[h\p{\vec a} = h\p{\vec b} \right] ≥ P_1
- ∀_{\vec a \S^n}\, \mathrm{Pr}_{h ∈ \mathcal H}\left[h\p{\vec a} = h\p{\vec b} \right] ≥ P_1
Irreversibility
Assume we are given H\p{\vec a}, what can we learn about \vec a?
We can at best hope to learn all of the hyper-plane signs. This is m⋅k bits of information. The subset of S^n that has the same signs is a convex polygon with an (assumed) expected area of 2^{-m⋅k} A_n. We would not have any more information than this, so our posterior would at best be the uniform distribution over this region.
We can start by drawing many samples from \mathcal S. and find matching vectors k.
TODO https://www.notion.so/Draft-Outline-99ecafcd45384d58a06bc1003fe1088b
https://en.wikipedia.org/wiki/Locality-sensitive_hashing
https://www.cs.princeton.edu/courses/archive/spring04/cos598B/bib/CharikarEstim.pdf
https://arxiv.org/abs/2010.09393
https://randorithms.com/2019/09/19/Visual-LSH.html