$\ell^p$ distance

Under construction

This section is currently being developed. If you're interested in helping write this section, feel free to open a pull request; otherwise, please check back later.

Definition

The $\ell^p$ distance between two length-$n$ vectors $x$ and $y$ is defined as

\[\|x - y\|_{\ell^p} = \left(\sum_{i=1}^n \left|x_i - y_i\right|^p\right)^{1/p}\]

$\ell^p$ distance is a valid norm for all $p \ge 1$, and is bounded between $0$ and $+\infty$. In the context of locality-sensitive hashing, we say that two points are similar if $\|x - y\|_{\ell^p}$ is small, and dissimilar if $\|x - y\|_{\ell^p}$ is large[1].

In the LSHFunctions module, you can calculate the $\ell^p$ distance between two points using the function ℓp. The functions ℓ1 and ℓ2 are also defined for $\ell^1$ and $\ell^2$ distance, respectively, since they're so commonly used:

julia> using LSHFunctions;

julia> x = [1, 2, 3]; y = [4, 5, 6];

julia> ℓ1(x,y) == ℓp(x,y,1) == abs(1-4) + abs(2-5) + abs(3-6)
true

julia> ℓ2(x,y) == ℓp(x,y,2) == √(abs(1-4)^2 + abs(2-5)^2 + abs(3-6)^2)
true

You can also compute the $\ell^p$-norm of a vector ($\|x\|_{\ell^p}$, or equivalently $\|x - 0\|_{\ell^p}$) by calling ℓ1_norm, ℓ2_norm, or ℓp_norm:

julia> x = [1, 2, 3];

julia> ℓ1_norm(x) == ℓ1(x,zero(x))
true

julia> ℓ2_norm(x) == ℓ2(x,zero(x))
true

julia> ℓp_norm(x,2.2) == ℓp(x,zero(x),2.2)
true

LpHash

This module defines L1Hash and L2Hash to hash vectors on their $\ell^1$ and $\ell^2$ distances. It is based on Datar et al. (2004)[Datar04], who use the notion of a $p$-stable distribution to construct their hash function. Such distributions exist for all $p$ such that $0 < p \le 2$; the LSH family of Datar et al. (2004)[Datar04] is able to hash vectors on their $\ell^p$ distance for all $p$ in this range.

Limitations on p

The LSHFunctions package currently only supports hashing $\ell^p$ distances of order $p = 1$ and $p = 2$ due to some additional complexity involved with sampling $p$-stable distributions of different orders. This problem has been filed under issue #18.

Using L1Hash and L2Hash

Currently only $p = 1$ and $p = 2$ are supported. You can construct hash functions for $\ell^1$ distance and $\ell^2$ distance using L1Hash and L2Hash:

julia> hashfn = L1Hash();

julia> n_hashes(hashfn)
1

julia> hashfn = L2Hash(10);

julia> n_hashes(hashfn)
10

To hash a vector, simply call hashfn(x). Note that the hashes returned by an LpHash type such as L1Hash or L2Hash are signed integers:

julia> hashfn = L2Hash(128);

julia> hashtype(hashfn)
Int32

julia> x = rand(20);

julia> hashes = hashfn(x);

julia> typeof(hashes)
Array{Int32,1}

L1Hash and L2Hash support a keyword parameter called scale. scale impacts the collision probability: if scale is large then hash collisions are more likely (even among distant points). If scale is small, then hash collisions are less likely (even among close points).

julia> x = rand(10); y = rand(10);

julia> hashfn_1 = L1Hash(128; scale=0.1);  # Small value of scale

julia> n_collisions_1 = sum(hashfn_1(x) .== hashfn_1(y));

julia> hashfn_2 = L1Hash(128; scale=10.);  # Large value of scale

julia> n_collisions_2 = sum(hashfn_2(x) .== hashfn_2(y));

julia> n_collisions_2 > n_collisions_1
true

Good values of scale will depend on your dataset. If your data points are very far apart then you will likely want to choose a large value of scale; if they're tightly packed together then a small value is generally better. You can use the collision_probability function to help you choose a good value of scale.

Collision probability

The probability that two vectors $x$ and $y$ collide under a hash function sampled from the LpHash family is

\[Pr[h(x) = h(y)] = \int_0^r \frac{1}{c}f_p\left(\frac{t}{c}\right)\left(1 - \frac{t}{r}\right) \hspace{0.15cm} dt\]

where

  • $r$ is the reciprocal of the scale factor used by LpHash, i.e. r = 1/scale;
  • $c = \|x - y\|_{\ell^p}$; and
  • $f_p$ is the p.d.f. of the absolute value of the $p$-stable distribution used to construct the hash.

The most important ideas to take away from this equation are that the collision probability $Pr[h(x) = h(y)]$ increases as scale increases (or, equivalently, as $r$ increases), and that it decreases as $\|x - y\|_{\ell^p}$ increases. The figure below visualizes the relationship between $\ell^p$ distance and collision probability for $p = 1$ (left) and $p = 2$ (right).

Probability of collision for L1Hash and L2Hash

For further information about the collision probability, see Section 3.2 of the reference paper[Datar04].

Footnotes

  • 1"small" and "large" are relative terms, of course. LpHash has a parameter scale that influences the relationship between $\ell^p$ distance and collision probability, which helps us differentiate between what distances are small and which are large.
  • Datar04Datar, Mayur and Indyk, Piotr and Immorlica, Nicole and Mirrokni, Vahab. (2004). Locality-sensitive hashing scheme based on p-stable distributions. Proceedings of the Annual Symposium on Computational Geometry. 10.1145/997817.997857.