2024-06-28

Locality-Sensitive Hashing

$\gdef\S{\mathrm{S}} \gdef\darc{d_{\mathrm{arc}}} \gdef\a#1#2{\mathrm{A}_{#1}\p{#2}} \gdef\sign#1{\mathrm{sign}\p{#1}}$

Let $\S_n$ be an $n$ -dimensional unit $n$ -sphere

$\begin{align} \mathrm{S}_n = \setb{\vec x ∈ \R^n}{\norm{\vec x}_2 = 1} \end{align}$

and let $\darc$ be the arc length distance

$\begin{align} \darc\p{\vec a, \vec b} = \cos^{-1}\p{\vec a ⋅ \vec b} \end{align}$

Random projections

$\mathcal S$ The uniform distribution on $\S^n$ .
$\mathcal F$ The distribution by constructing.
$n$ The number of dimensions.
$m$ The number of hyperplanes in a vector.
$k$ The number of vectors.

Hyperplanes are defined by a normal unit vector $\vec h ∈ \S^n$ . For a given vector $\vec x$ we can determine on which side of the plane it lies by the sign of the dot product $\vec x ⋅ \vec h$ .

$\begin{align} \mathrm{Pr}_{\vec h ∈ \mathcal S}\left[\sign{\vec h ⋅ \vec a} = \sign{\vec h ⋅ \vec b} \right] = 1 - \frac{\darc\p{\vec a, \vec b}}{π} \end{align}$

If we pick $m$ random hyperplanes $\vec h_i ∈ \S^n$ we can map each vector $\vec x ∈ \S^n$ into a $m$ -bit vector $b(\vec x)$ :

$\begin{align} b_i\p{\vec x} = \begin{cases} 0 & \vec h_i ⋅ \vec x ≤ 0 \\ 1 & \vec h_i ⋅ \vec x > 0 \end{cases} \end{align}$

Assuming the hyperplanes are uniformly random, the probability of two bitvectors being equal is:

$\begin{align} \Pr{b\p{\vec a} = b\p{\vec b}} = \p{1 - \frac{\darc\p{\vec a, \vec b}}{π} }^m \end{align}$

The hyperplanes form a tesselation of $\S^n$ where each region has a unique bitvector and each boundary represents a single bit change. Interestingly this implies that a Hamiltonian path through regions (if one exists) creates a Gray code.

We have two distributions on $\S^n$ , $\mathcal{U}_{match}$ and $\mathcal{U}_{non-match}$ . Given a vector $\vec x$ we want to classify it's distribution.

Assumption. Iris embeddings of different irises are uniformly distributed on $\S^n$ so the $\darc$ of non-matching irises follows the formula.

$\begin{align} \Pr{b\p{\vec a} = b\p{\vec b}} = \p{1 - \frac{\darc\p{\vec a, \vec b}}{π} }^m \end{align}$

Amplification

What is the probability of a match if $\darc < T$ .

We are a looking for a family of hash functions $\mathcal H$ , $h: \S^n → H$ such that

$∀_{\vec a, \vec b ∈ \S^n}\, \darc\p{\vec a, \vec b} < T → \mathrm{Pr}_{h ∈ \mathcal H}\left[h\p{\vec a} = h\p{\vec b} \right] ≥ P_1$
$∀_{\vec a \S^n}\, \mathrm{Pr}_{h ∈ \mathcal H}\left[h\p{\vec a} = h\p{\vec b} \right] ≥ P_1$

Irreversibility

Assume we are given $H\p{\vec a}$ , what can we learn about $\vec a$ ?

We can at best hope to learn all of the hyper-plane signs. This is $m\cdotk$ bits of information. The subset of $S^n$ that has the same signs is a convex polygon with an (assumed) expected area of $2^{-m⋅k} A_n$ . We would not have any more information than this, so our posterior would at best be the uniform distribution over this region.

We can start by drawing many samples from $\mathcal S$ . and find matching vectors $k$ .

TODO https://www.notion.so/Draft-Outline-99ecafcd45384d58a06bc1003fe1088b

https://en.wikipedia.org/wiki/Locality-sensitive_hashing

https://www.cs.princeton.edu/courses/archive/spring04/cos598B/bib/CharikarEstim.pdf

https://arxiv.org/abs/2010.09393

https://randorithms.com/2019/09/19/Visual-LSH.html

https://arxiv.org/pdf/2010.09393.pdf

https://arxiv.org/pdf/1901.09769.pdf