Building a Vector Database in Ruby Using Hash and PStore


This is an excerpt from my new book; you can learn more about it at the end of this post.

Today, I’m sharing a very cool piece of code I’ve written. It is a vector database in pure Ruby that you could use for tiny datasets or to test some things in your console quickly.

First, why would we need a vector database?

A quick aside: what are vector databases, you ask? And why would you want to use one? Vector databases have recently become very popular in AI because they allow us to store “embeddings” and find other “embeddings” that are close to them. “Embeddings” are vectors that capture the meaning of text or images as numbers. They are designed in such a way that if two embeddings are ‘similar’ (close) to each other, the meanings of the texts they represent are also close to each other.

So, the embeddings of “I like pizza” and “I like souvlaki” are more “similar” to each other than either is to the embedding of “The car has a flat tyre.”

In this post, I will show you how we do these calculations and what these embeddings look like.

Why build one in Ruby?

I needed to play with some semantic search queries in a console without setting up a dedicated Vector database. So I thought, if we leave out a few things that a ‘real’ vector database does, we can probably write a bit of Ruby code that stores documents and vectors in an Array that we can loop through to find items closest to our query.

Very quickly, I thought about inheriting from Hash. Hash is already sort of like a database; it allows us to store items under a key (our ID), and it is Enumerable, so we can loop through items.

However, when you use Hash, you can’t persist your data between sessions. So, the next step is to port the Hash database to PStore so that everything is persisted.

I’ll start with Hash, though, and before that, with generating embeddings.

Generating embeddings

How do we get these embeddings? OpenAI has a practical API for this, and we can use Alex Rudall’s great ruby-openai gem to generate the embeddings.

Let’s create a little wrapper for it to isolate the functionality:

# app/lib/openai_client.rb

class OpenaiClient
  def embedding(text, model: 'text-embedding-3-small')
    response = api_client.embeddings(
      parameters: {
        model: model,
        input: text,
      },
    )

    response.dig("data", 0, "embedding")
  end

  private

  def api_client
    @api_client ||= OpenAI::Client.new(access_token: Rails.application.credentials.openai[:api_key])
  end
end
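A side note before we take it for a spin: the embeddings endpoint also accepts an array of inputs, so a batch variant of this wrapper (a hypothetical addition, not used in the rest of this post) could embed many documents in one request:

# Hypothetical batch variant; the API returns one embedding per input in "data".
def embeddings(texts, model: 'text-embedding-3-small')
  response = api_client.embeddings(
    parameters: {
      model: model,
      input: texts,
    },
  )

  response["data"].map { |item| item["embedding"] }
end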

We use the text-embedding-3-small model, which returns vectors with 1536 dimensions (numbers).

Replace Rails.application.credentials.openai[:api_key] with whatever way you prefer to store your API keys.
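For example, outside Rails you could read the key from an environment variable (a minimal sketch, assuming the key lives in OPENAI_API_KEY):

OpenAI::Client.new(access_token: ENV.fetch("OPENAI_API_KEY"))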

Let’s take it for a spin now.

oc = OpenaiClient.new

oc.embedding("Crime")

=>
[-0.001625204,
 0.01763012,
 -0.008460364,
 0.09889614,
 0.026863838,
 # ...

 oc.embedding("Criminal")

[0.029961154,
 0.00703114,
 0.0073463763,
 0.0476418,
 -0.00031073904,
 -0.0002184382,
 # ...
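Each of these vectors has the 1536 dimensions mentioned above:

oc.embedding("Crime").length
# => 1536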

Hmm, as they say in Germany, I only understand train station; that is to say, I don’t understand a thing, but the algorithms probably do.

Let’s work with something easier to read:

godfather = oc.embedding("A coming-of-age story of a violent mafia son and his father's unhealthy obsession with oranges.")
meet = oc.embedding("A funny meeting between a father and a man who can milk just about anything with nipples, not having seen this is a crime.")
big = oc.embedding("A man gets his rug soiled by German nihilists who have no regard of the law.")

query = oc.embedding("Crime movie")

Now, we have a few arrays and need to calculate the cosine similarity. To do this, we need a few methods:

  def dot_product(vector1, vector2)
    vector1.zip(vector2).map { |a, b| a * b }.sum
  end

The dot product is the sum of the element-wise products of two vectors. It can be calculated like so for vectors a and b:

a · b = a[0] × b[0] + a[1] × b[1] + ...

In Ruby, we can ‘zip’ two arrays together with the zip method:

vec1 = [1, 2, 3, 4]
vec2 = [4, 5, 6, 7]

puts vec1.zip(vec2).inspect
=> [[1, 4], [2, 5], [3, 6], [4, 7]]

We map through this zipped array of the two vectors, multiply them (a * b) and then sum these multiplied numbers.

puts dot_product(vec1, vec2)

=> 60

This would already be enough for the OpenAI vectors, since they are normalised: their magnitudes are all 1. For normalised vectors, the dot product directly gives the cosine similarity. We can’t always assume this will be the case, though, so we will continue and calculate the full cosine similarity.

  def magnitude(vector)
    Math.sqrt(vector.map { |component| component**2 }.reduce(:+))
  end

Here are the magnitudes of the un-normalised vectors vec1 and vec2:

puts "Magnitude vec1: #{magnitude(vec1)}"
puts "Magnitude vec2: #{magnitude(vec2)}"
=>
Magnitude vec1: 5.477225575051661
Magnitude vec2: 11.224972160321824

Cosine similarity normalizes the dot product by the magnitudes of the vectors, effectively focusing on the direction rather than the magnitude.

Here is how we calculate it.

  def cosine_similarity(vector1, vector2)
    dot_prod = dot_product(vector1, vector2)
    magnitude1 = magnitude(vector1)
    magnitude2 = magnitude(vector2)
    dot_prod / (magnitude1 * magnitude2)
  end

puts cosine_similarity(vec1, vec2)
=> 0.9759000729485332
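As a quick sanity check of the earlier claim about normalised vectors, we can scale vec1 and vec2 to magnitude 1 first; the plain dot product then matches the cosine similarity (up to floating-point rounding):

# Scale both vectors to magnitude 1
norm1 = vec1.map { |c| c / magnitude(vec1) }
norm2 = vec2.map { |c| c / magnitude(vec2) }

# The dot product of normalised vectors is their cosine similarity
dot_product(norm1, norm2)
# => 0.9759000729485332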

Now let’s try it on our movies dataset with the query: “Crime movie.”

cosine_similarity(query, meet)
=> 0.22359566594195673
cosine_similarity(query, big)
=> 0.21230758115230425
cosine_similarity(query, godfather)
=> 0.3426902243058039

Firstly, these numbers are hard to interpret. Are they close or far apart? They live in 1536-dimensional space, after all. Bear with me a little bit. Further down, we will try many more queries, which will make things clearer.

We can see here that The Godfather is the closest, probably because it mentions ‘story’ and ‘mafia’. The Big Lebowski is the furthest; Meet the Parents scores slightly higher, likely because its description doesn’t mention an actual crime but does contain a saying with the word ‘crime’.

Nice! It’s a bit like what we expected. Let’s try another:

query = oc.embedding("Gangster")

cosine_similarity(query, meet)
=> 0.18473859726676306
cosine_similarity(query, big)
=> 0.21540343497659717
cosine_similarity(query, godfather)
=> 0.3779570884570852

The word Mafia has a much stronger effect here because members of the mafia are also called gangsters.

Building the database

Now that we’ve figured out how to generate our embeddings, let’s start with the Hash-based database.

class VectorDb < Hash
  def add_item(id, content:, embedding: nil)
    self[id] = { content:, embedding: }
  end
end

This will store an item on a given ID along with its content and embedding.
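In case the hash literal looks odd: { content:, embedding: } is Ruby 3.1’s shorthand syntax that picks up the local variables of the same name.

# Equivalent long form, for Ruby < 3.1:
self[id] = { content: content, embedding: embedding }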

We also want to generate and store the embedding if one wasn’t passed in; for this, we need the OpenAI client:

class VectorDb < Hash
  # ...

  def add_item(id, content:, embedding: nil)
    embedding ||= openai_client.embedding(content)
    self[id] = { content:, embedding: }
  end

  private

  def openai_client
    @openai_client ||= OpenaiClient.new
  end
end

We add the same calculation methods as before to calculate distances between vectors:

class VectorDb < Hash
  # ...

  def dot_product(vector1, vector2)
    vector1.zip(vector2).map { |a, b| a * b }.reduce(:+)
  end

  def magnitude(vector)
    Math.sqrt(vector.map { |component| component**2 }.reduce(:+))
  end

  def cosine_similarity(vector1, vector2)
    dot_prod = dot_product(vector1, vector2)
    magnitude1 = magnitude(vector1)
    magnitude2 = magnitude(vector2)
    dot_prod / (magnitude1 * magnitude2)
  end
end

Now, I implement a search method where this all comes together.

The concept here is that we will generate the query embedding, loop through the stored embeddings, and calculate their similarity to that query embedding.

class VectorDb < Hash
  def search(query)
    query_embedding = openai_client.embedding(query)

    result = each_with_object({}) do |(id, item), results|
      results[id] = cosine_similarity(query_embedding, item[:embedding])
    end

    result.sort_by { |_id, similarity| similarity }.reverse.to_h
  end

  #...
end
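Since search returns the whole sorted hash, a tiny convenience wrapper (my own hypothetical addition, not used below) can pull out just the top n matches:

class VectorDb < Hash
  # ...

  def top_matches(query, n = 3)
    search(query).first(n).to_h
  end
end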

🧙‍♂️ Lo and behold ✨ a vector database.

Let’s try it out:

vb = VectorDb.new

vb.add_item("The Godfather", content: "A coming of age story of a violent mafia son and his father's unhealthy obsession with oranges.")
vb.add_item("Meet the Parents", content: "A funny meeting between a father and a man who can milk just about anything with nipples, not having seen this is a crime.")
vb.add_item("The Big Lebowski", content: "A man gets his rug soiled by German nihilists who have no regard of the law.")

vb.search("Being criminal")
=>
{"The Big Lebowski" => 0.2902403092098413,
 "Meet the Parents" => 0.20513406388011676,
 "The Godfather" => 0.19525673473056426}

vb.search("Movie about breaking the law")
=>
{"The Godfather" => 0.35009263294726534,
 "The Big Lebowski" => 0.2518669715823097,
 "Meet the Parents" => 0.22558495028153422}

vb.search("Farming cattle")
=>
{"Meet the Parents" => 0.22182056416947168,
 "The Godfather" => 0.12570011314672785,
 "The Big Lebowski" => 0.008469368264364811}

Seems like it works. Let’s add an item to skew the results a bit:

vb.add_item("Snatch", content: "A movie about a bunch of gangsters stealing a diamond and a dog.")

# Rerun our query
vb.search("Movie about breaking the law")
=>
{"Snatch"=>0.5023058260371518,
 "The Godfather"=>0.35009263294726534,
 "The Big Lebowski"=>0.2518669715823097,
 "Meet the Parents"=>0.22558495028153422}

I suspect adding “a movie about” has influenced the search quite a bit here, which is a good thing: the other descriptions in no way reflected that they were about movies. We can test this assumption by adding yet another movie and rerunning the search:

vb.add_item("Snatch v2", content: "A bunch of gangsters stealing a diamond and a dog.")

vb.search("Movie about breaking the law")
=>
{"Snatch"=>0.5023058260371518,
 "The Godfather"=>0.35009263294726534,
 "Snatch v2"=>0.34647917573476067,
 "The Big Lebowski"=>0.2518669715823097,
 "Meet the Parents"=>0.22558495028153422}

The hypothesis was correct. Who would’ve thought adding more context would give better results? It highlights the need to provide the models with high-quality, well-labeled information.

Ruby’s Vector type

Now that we understand how this all works, I’ll tell you a little secret. Ruby has a Vector data type!

This is how it works:

require 'matrix'

vec1 = Vector[1, 2, 3, 4]
vec2 = Vector[4, 5, 6, 7]

puts vec1.dot(vec2)
=> 60

puts vec1.magnitude
puts vec2.magnitude
=>
5.477225575051661
11.224972160321824

# `magnitude` is aliased as `r` and `norm`

def cosine_similarity(vector1, vector2)
  vector1.dot(vector2) / (vector1.norm * vector2.norm)
end
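One caveat: Vector lives in the matrix standard library, which became a bundled gem in Ruby 3.1, so under Bundler you may need to declare it:

# Gemfile
gem "matrix"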

You can create a vector from an embedding like this:

embedding = openai_client.embedding("My text")
vector = Vector[*embedding]
# or
vector = Vector.elements(embedding)
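And to_a converts back to a plain array, which is handy if you ever need to serialise the embedding again:

vector.to_a == embedding
# => true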

Updating VectorDb to use the Vector type

We can replace all the calculation methods with the following:

def cosine_similarity(vector1, vector2)
  vector1.dot(vector2) / (vector1.norm * vector2.norm)
end

This leaves us with the following final result:

require 'matrix'

class VectorDb < Hash
  def search(query)
    query_embedding = openai_client.embedding(query)
    query_embedding = Vector.elements(query_embedding)

    result = each_with_object({}) do |(id, item), results|
      results[id] = cosine_similarity(query_embedding, item[:embedding])
    end

    result.sort_by { |_id, similarity| similarity }.reverse.to_h
  end

  def add_item(id, content:, embedding: nil)
    embedding = openai_client.embedding(content) if embedding.nil?

    self[id] = { content: content, embedding: Vector.elements(embedding) }
  end

  def cosine_similarity(vector1, vector2)
    vector1.dot(vector2) / (vector1.norm * vector2.norm)
  end

  private

  def openai_client
    @openai_client ||= OpenaiClient.new
  end
end

Let’s rerun the first query to double-check the result is the same:

vb.search("Being criminal")
=>
# Before
{"The Big Lebowski"=>0.2902403092098413,
 "Meet the Parents"=>0.20513406388011676,
 "The Godfather"=>0.19525673473056426}

# After
{"The Big Lebowski"=>0.2902403092098413,
 "Meet the Parents"=>0.20513406388011676,
 "The Godfather"=>0.19525673473056426}

Looking very good.

PStore

So far, since we have been using Hash as a basis for our database, we have had to hit the API every time to get the embeddings again. This not only costs time but also money. It isn’t a big problem for testing similarities between a handful of sentences, but if we could store our data, we’d even be able to use it as a real database.

Enter PStore.

PStore implements a file based persistence mechanism based on a Hash. User code can store hierarchies of Ruby objects (values) into the data store file by name (keys). An object hierarchy may be just a single object. User code may later read values back from the data store or even update data, as needed. — the ruby/pstore README

Let’s get started with the implementation. I will call the new database Vstore and add the add_item and cosine_similarity methods.

require 'pstore'

class Vstore < PStore
  def add_item(id, content:, embedding: nil)
    transaction do
      embedding = openai_client.embedding(content) if embedding.nil?
      embedding = Vector.elements(embedding)

      self[id] = { content: content, embedding: embedding }
    end
  end

  def cosine_similarity(vector1, vector2)
    vector1.dot(vector2) / (vector1.norm * vector2.norm)
  end

  private

  def openai_client
    @openai_client ||= OpenaiClient.new
  end
end

As you can see, PStore is very similar to Hash, except that we have to wrap operations in a transaction block.
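While we’re at it, a small read helper (my own hypothetical addition, not needed for the rest of the post) shows the same transaction pattern for fetching a single item back:

class Vstore < PStore
  # ...

  def fetch_item(id)
    transaction(true) { self[id] }
  end
end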

We need to pass a path to a location when initialising the database, like so:

vdb = Vstore.new("my_vector_store.pstore")

The search method will be a bit different, though, because of how we have to loop over the records:

def search(query)
  query_embedding = openai_client.embedding(query)
  query_embedding = Vector.elements(query_embedding)
  result = {}

  transaction(true) do
    roots.each do |id|
      item = self[id]
      next if !item.key?(:embedding) || item[:embedding].nil?

      result[id] = cosine_similarity(query_embedding, item[:embedding])
    end
  end

  result.sort_by { |_id, similarity| similarity }.reverse.to_h
end

We start a read-only transaction with transaction(true), loop over all the keys in the store by calling roots, and fetch the data with self[id] within the each block.
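The read-only flag is enforced, too: attempting a write inside such a transaction raises a PStore::Error, which is a nice guard for a search method that should never mutate the store.

vdb.transaction(true) do
  vdb["oops"] = "nope" # raises PStore::Error (read-only transaction)
end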

Apart from this, the implementation is the same as that of the Hash-based one.

Let’s check the results:

vb = Vstore.new('movies.pstore')

vb.add_item("The Godfather", content: "A coming of age story of a violent mafia son and his father's unhealthy obsession with oranges.")
vb.add_item("Meet the Parents", content: "A funny meeting between a father and a man who can milk just about anything with nipples, not having seen this is a crime.")
vb.add_item("The Big Lebowski", content: "A man gets his rug soiled by German nihilists who have no regard of the law.")

vb.search("Being criminal")
=>
{"The Big Lebowski"=>0.2902403092098413,
 "Meet the Parents"=>0.20513406388011676,
 "The Godfather"=>0.19525673473056426}

# With Hash:
{"The Big Lebowski"=>0.2902403092098413,
 "Meet the Parents"=>0.20513406388011676,
 "The Godfather"=>0.19525673473056426}

Beautiful! We’ve created a vector database that generates the embeddings for the items we put into it and allows us to do semantic search!
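The real payoff is that the embeddings now survive a console restart. In a fresh session, we can reopen the same file and search immediately, without re-adding items or paying for their embeddings again; only the query itself needs one new API call:

# In a new console session:
vb = Vstore.new('movies.pstore')

vb.search("Being criminal")
# => {"The Big Lebowski"=>0.2902403092098413, ...}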

Want to learn more?

This post is based on a chapter from my new book, “From RAG to Riches: How to Build Insanely Good AI Applications That Use Your Own Data,” which is out now!

I wrote the book to share all the knowledge I gained from building AI applications that use documents and other data to give far better results.

It irks me that big tech companies are trying to sell simple AI features as expensive add-ons or products, when the underlying technology is not too complicated for you to understand. It is just so new that it takes a while to figure out how it all comes together.

I’ve combined all my learnings in the book, and after reading it, you should be at the top of the industry in using AI in Rails applications.


Get my new book From RAG to Riches.