Cosine similarity

Under construction

This section is currently being developed. If you're interested in helping write this section, feel free to open a pull request; otherwise, please check back later.

Definition

Cosine similarity, roughly speaking, measures the angle between a pair of inputs. Two inputs are very similar if the angle between them is low, and their similarity drops as the angle between them increases.

Concretely, cosine similarity is computed as

\[\text{cossim}(x,y) = \frac{\left\langle x,y\right\rangle}{\|x\|\cdot\|y\|} = \left\langle\frac{x}{\|x\|},\frac{y}{\|y\|}\right\rangle\]

where $\left\langle\cdot,\cdot\right\rangle$ is an inner product (e.g., dot product) and $\|\cdot\|$ is the norm derived from that inner product. $\text{cossim}(x,y)$ goes from $-1$ to $1$, where $-1$ corresponds to low similarity and $1$ corresponds to high similarity. To calculate cosine similarity, you can use the cossim function exported from the LSH module:

julia> using LSHFunctions, LinearAlgebra

julia> x = [5, 3, -1, 1];  # norm(x) == 6

julia> y = [2, -2, -2, 2]; # norm(y) == 4

julia> cossim(x,y) == dot(x,y) / (norm(x)*norm(y))
true

julia> cossim(x,y) == (5*2 + 3*(-2) + (-1)*(-2) + 1*2) / (6*4)
true

SimHash

SimHash^[1]^[Charikar02] is a family of LSH functions for hashing with respect to cosine similarity. You can generate a new hash function from this family by calling SimHash:

julia> hashfn = SimHash();

julia> n_hashes(hashfn)
1

julia> hashfn = SimHash(40);

julia> n_hashes(hashfn)
40

Once constructed, you can start hashing vectors by calling hashfn(x):

hashfn = SimHash(100)

# x and y have high cosine similarity since they point in the same direction
# x and z have low cosine similarity since they point in opposite directions
x = randn(128)
y = 2x
z = -x

hx, hy, hz = hashfn(x), hashfn(y), hashfn(z)

# Among the 100 hash functions that we generated, we expect more hash
# collisions between x and y than between x and z
sum(hx .== hy) > sum(hx .== hz)

Note that SimHash is a one-bit hash function. As a result, hashfn(x) returns a BitArray:

julia> hashfn = SimHash();

julia> n_hashes(hashfn)
1

julia> hashes = hashfn(randn(4));

julia> typeof(hashes)
BitArray{1}

julia> length(hashes)
1

Since a single-bit hash doesn't do much to reduce the cost of similarity search, you usually want to generate multiple hash functions at once. For instance, in the snippet below we sample 10 hash functions, so that hashfn(x) is a length-10 BitArray:

julia> hashfn = SimHash(10);

julia> n_hashes(hashfn)
10

julia> hashes = hashfn(randn(4));

julia> length(hashes)
10

The probability of a hash collision (for a single hash) is

\[Pr[h(x) = h(y)] = 1 - \frac{\theta}{\pi}\]

where $\theta = \text{arccos}(\text{cossim}(x,y))$ is the angle between $x$ and $y$. This collision probability is shown in the plot below.

Probability of collision for SimHash

Footnotes

1 SimHash API reference
Charikar02Moses S. Charikar. Similarity estimation techniques from rounding algorithms. In Proceedings of the Thiry-Fourth Annual ACM Symposium on Theory of Computing, STOC '02, page 380–388, New York, NY, USA, 2002. Association for Computing Machinery. 10.1145/509907.509965.