FuzzyEmbeddingMatch.jl

Overview

FuzzyEmbeddingMatch provides fuzzy string matching based on embeddings. It offers types and functions to embed strings, compute the similarity between embeddings, and find either the best match or all matches for a string within a set of candidates. The key components are EmbeddedString, MatchCandidate, bestmatch, and allmatches.

String embeddings are memoized, so embedding the same string again does not trigger another API call.
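As a rough illustration of the memoization idea, the sketch below caches results in a Dict keyed by the input string. It is not the package's implementation: fake_embed, CALLS, CACHE, and cached_embed are hypothetical names, and fake_embed is a local stand-in for a real embedding request so that the example runs without an API key.

```julia
# Hypothetical stand-in for a remote embedding call; it counts how often it runs.
const CALLS = Ref(0)
function fake_embed(s::AbstractString)
    CALLS[] += 1
    return Float64[length(s), count(isspace, s)]
end

# Cache keyed by the input string (illustrative, not the package's cache).
const CACHE = Dict{String, Vector{Float64}}()

# get! computes and stores the value only when the key is missing.
cached_embed(s::AbstractString) = get!(() -> fake_embed(s), CACHE, String(s))

cached_embed("hello world")
cached_embed("hello world")
println(CALLS[])  # the stand-in ran once; the second call was served from the cache
```

A cache like this trades memory for fewer requests, which matters when each embedding is a billed network call.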

Installation

You can install this package with:

import Pkg
Pkg.add("FuzzyEmbeddingMatch")

or, from the Julia REPL (the leading ] switches to Pkg mode):

] add FuzzyEmbeddingMatch

Usage

To begin, make sure that the OPENAI_API_KEY environment variable is set. If it is not set at the system level, you can set it for the current Julia session with

ENV["OPENAI_API_KEY"] = "........" # Replace this with your key
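To catch a missing key before any request is made, a small check along these lines may help. This is only a sketch: has_api_key is a hypothetical helper, not part of the package.

```julia
# Returns true when the key is present and non-empty in the given environment.
# `env` defaults to the process environment; a Dict can be passed for testing.
has_api_key(env = ENV) = !isempty(get(env, "OPENAI_API_KEY", ""))

# Warn early instead of waiting for a request to fail.
has_api_key() || @warn "OPENAI_API_KEY is not set; embedding requests will likely fail"
```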

Structures

  • EmbeddedString: Represents a string with its associated embedding.
  • MatchCandidate: A candidate for matching, containing two strings, their embeddings, and a similarity score.

Key Functions

  • embed: Embeds a string using aiembed from PromptingTools.jl.
  • corpus: Generates a corpus of embedded strings.
  • getembeddings: Returns embeddings for a vector of strings.
  • cosinesimilarity: Calculates cosine similarity between two embeddings.
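
For reference, cosine similarity is the dot product of two vectors divided by the product of their norms. The sketch below shows that formula on plain vectors. cosine_sketch is a hypothetical name, and the package's cosinesimilarity may differ in signature and in the types it accepts.

```julia
using LinearAlgebra: dot, norm

# Cosine similarity: dot product divided by the product of the vector norms.
function cosine_sketch(a::AbstractVector{<:Real}, b::AbstractVector{<:Real})
    length(a) == length(b) || throw(DimensionMismatch("embeddings must have equal length"))
    return dot(a, b) / (norm(a) * norm(b))
end

println(cosine_sketch([1.0, 0.0], [1.0, 0.0]))  # same direction
println(cosine_sketch([1.0, 0.0], [0.0, 1.0]))  # orthogonal
```

With real embeddings, identical strings can score marginally below 1 because of floating-point rounding, which is consistent with the 0.9999999999999998 in the example output further down.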

Matching Functions

  • allmatches: Finds all matches for a given string in a list of candidates.
  • bestmatch: Finds the best match for a given string in a list of candidates.
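
Conceptually, matching of this kind scores every candidate against the query and, for the best match, keeps the highest score. The sketch below illustrates that idea with made-up two-dimensional vectors in place of real embeddings. The names toy_embeddings, cosine, score_all, and pick_best are hypothetical, and the package's allmatches and bestmatch may be implemented differently.

```julia
using LinearAlgebra: dot, norm

# Toy 2-D vectors standing in for real embeddings (made-up values).
toy_embeddings = Dict(
    "cat"    => [1.0, 0.1],
    "kitten" => [0.9, 0.2],
    "car"    => [0.1, 1.0],
)

cosine(a, b) = dot(a, b) / (norm(a) * norm(b))

# Score every candidate against the query, preserving candidate order.
function score_all(query, candidates, emb)
    return [(c, cosine(emb[query], emb[c])) for c in candidates]
end

# Keep the candidate with the highest score.
function pick_best(query, candidates, emb)
    scored = score_all(query, candidates, emb)
    return scored[argmax(last.(scored))]
end

best = pick_best("cat", ["kitten", "car"], toy_embeddings)
println(first(best))  # prints "kitten"
```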

Examples

Using allmatches

using FuzzyEmbeddingMatch  # assumes allmatches is exported by the package

# Example string and candidates
thing = "Example string"
candidates = ["Sample text", "Example string", "Another example"]

# Find all matches
matches = allmatches(thing, candidates)

# Print each match
for m in matches  # `m` avoids shadowing Base.match
    println(m)
end

Output:

MatchCandidate("Example string", "Sample text", 0.9022957888579418)
MatchCandidate("Example string", "Example string", 0.9999999999999998)
MatchCandidate("Example string", "Another example", 0.8847227646389876)

Using bestmatch

using FuzzyEmbeddingMatch  # assumes bestmatch is exported by the package

# Example string and candidates
thing = "Example string"
candidates = ["Sample text", "Example string", "Another example"]

# Find the best match
best_match = bestmatch(thing, candidates)

# Print the best match
println("Best match: ", best_match)

Output:

Best match: MatchCandidate("Example string", "Example string", 0.9999999999999998)