Building a Vector Database in Ruby Using Hash and PStore
Today, I'm sharing a very cool piece of code I've written: a vector database in pure Ruby that you can use for tiny datasets or to quickly test things in your console.
First, why would we need a vector database?
A quick aside: what are vector databases, you ask? Why would I want to use one? Vector databases have recently become very popular in AI because they allow us to store "embeddings" and find other "embeddings" that are close to them. "Embeddings" are vectors that capture the meaning of text or images as numbers. They are designed in such a way that if the meanings of two texts are similar, their vectors are also close to each other.
So, the vectors for "I like pizza" and "I like souvlaki" are more "similar" to each other than to the vector for "The car has a flat tyre."
In this post, I will show you how we do these calculations and what these embeddings look like.
Why build one in Ruby?
I needed to play with some semantic search queries in a console without setting up a dedicated vector database. So I thought, if we leave out a few things that a "real" vector database does, we can probably write a bit of Ruby code that stores documents and vectors in an Array that we can loop through to find items closest to our query.
Very quickly, I thought about inheriting from Hash. Hash is already sort of like a database: it allows us to store items under a key (an ID), and it is Enumerable, so we can loop through the items.
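As a tiny illustration of the idea (a throwaway sketch, not part of the final code), a plain Hash already gives us keyed storage and iteration:
db = {}
db["doc-1"] = { content: "I like pizza" }
db["doc-2"] = { content: "I like souvlaki" }

db.each { |id, item| puts "#{id}: #{item[:content]}" }
# doc-1: I like pizza
# doc-2: I like souvlaki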
However, when you use Hash, you can't persist your data. So, the next step is to port the Hash database to PStore so that everything can be persistent.
I'll start with Hash, though, and before that, with generating embeddings.
Generating embeddings
How do we get these embeddings? OpenAI has a practical API for this, and we can use Alex Rudall's great ruby-openai gem to generate the embeddings.
Let's create a little wrapper for it to isolate the functionality:
# app/lib/openai_client.rb
class OpenaiClient
  def embedding(text, model: 'text-embedding-3-small')
    response = api_client.embeddings(
      parameters: {
        model: model,
        input: text,
      },
    )

    response.dig("data", 0, "embedding")
  end

  private

  def api_client
    @api_client ||= OpenAI::Client.new(access_token: Rails.application.credentials.openai[:api_key])
  end
end
We use the text-embedding-3-small model, which returns vectors with a length of 1536 (so each embedding is an array of 1536 numbers).
Replace Rails.application.credentials.openai[:api_key] with whatever way you prefer to store your API keys.
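If you are not in a Rails app, you could read the key from an environment variable instead (the variable name here is just an assumption, adjust it to your setup):
@api_client ||= OpenAI::Client.new(access_token: ENV.fetch("OPENAI_API_KEY"))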
Let's take it for a spin now.
oc = OpenaiClient.new
oc.embedding("Crime")
=>
[-0.001625204,
0.01763012,
-0.008460364,
0.09889614,
0.026863838,
# ...
oc.embedding("Criminal")
[0.029961154,
0.00703114,
0.0073463763,
0.0476418,
-0.00031073904,
-0.0002184382,
# ...
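The output above is truncated; each embedding is a plain Ruby Array with 1536 elements, which we can verify in the console:
oc.embedding("Crime").length
=> 1536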
Hmm, as they say in Germany, I only understand train station; that is to say, I don't understand, but the algorithms probably do.
Let's work with something easier to read:
godfather = oc.embedding("A coming-of-age story of a violent mafia son and his father's unhealthy obsession with oranges.")
meet = oc.embedding("A funny meeting between a father and a man who can milk just about anything with nipples, not having seen this is a crime.")
big = oc.embedding("A man gets his rug soiled by German nihilists who have no regard of the law.")
query = oc.embedding("Crime movie")
Now, we have a few arrays and need to calculate the cosine similarity. To do this, we need a few methods:
def dot_product(vector1, vector2)
  vector1.zip(vector2).map { |a, b| a * b }.sum
end
The dot product is the sum of the element-wise products of two vectors. For vectors a and b it is calculated like so:
a · b = a[0] × b[0] + a[1] × b[1] + ...
In Ruby, we can "zip" two arrays together with the zip method:
vec1 = [1, 2, 3, 4]
vec2 = [4, 5, 6, 7]
puts vec1.zip(vec2).inspect
=> [[1, 4], [2, 5], [3, 6], [4, 7]]
We map over this zipped array of the two vectors, multiply each pair (a * b), and then sum the products.
puts dot_product(vec1, vec2)
=> 60
This would be enough for the OpenAI vectors since they are normalised, meaning all of their magnitudes are 1. In that case, the dot product of two vectors directly gives their similarity. We can't always assume this will be the case, though, so we will continue on to the full cosine similarity.
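As a quick sanity check (a small sketch with a hand-rolled normalise helper), normalising vec1 and vec2 to magnitude 1 and taking their plain dot product gives the same value as the cosine similarity we calculate below:
def normalise(vector)
  mag = Math.sqrt(vector.map { |component| component**2 }.sum)
  vector.map { |component| component / mag }
end

puts dot_product(normalise(vec1), normalise(vec2))
# => roughly 0.9759, the same as cosine_similarity(vec1, vec2) further down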
def magnitude(vector)
  Math.sqrt(vector.map { |component| component**2 }.reduce(:+))
end
Here are the magnitudes of the un-normalised vectors vec1 and vec2:
puts "Magnitude vec1: #{magnitude(vec1)}"
puts "Magnitude vec2: #{magnitude(vec2)}"
=>
Magnitude vec1: 5.477225575051661
Magnitude vec2: 11.224972160321824
Cosine similarity normalizes the dot product by the magnitudes of the vectors, effectively focusing on the direction rather than the magnitude.
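In formula form, and filling in the numbers from above:
cosine_similarity(a, b) = (a · b) / (|a| × |b|)
60 / (5.477225575051661 × 11.224972160321824) ≈ 0.976
which matches the result we get below.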
Here is how we calculate it.
def cosine_similarity(vector1, vector2)
  dot_prod = dot_product(vector1, vector2)
  magnitude1 = magnitude(vector1)
  magnitude2 = magnitude(vector2)

  dot_prod / (magnitude1 * magnitude2)
end
puts cosine_similarity(vec1, vec2)
=> 0.9759000729485332
Now let's try it on our movies dataset with the query: "Crime movie."
cosine_similarity(query, meet)
=> 0.22359566594195673
cosine_similarity(query, big)
=> 0.21230758115230425
cosine_similarity(query, godfather)
=> 0.3426902243058039
Firstly, these numbers are hard to interpret. Are they close or far apart? They live in 1500+ dimensional space, after all. Bear with me a little bit. Further down, we will try many more queries, which will make things clearer.
We can see here that The Godfather is the closest, probably because it mentions "story" and "mafia". The Big Lebowski ends up the furthest, with Meet the Parents slightly ahead of it, probably because its description contains the word "crime" even though it doesn't describe an actual crime.
Nice! It's roughly what we expected. Let's try another:
query = oc.embedding("Gangster")
cosine_similarity(query, meet)
=> 0.18473859726676306
cosine_similarity(query, big)
=> 0.21540343497659717
cosine_similarity(query, godfather)
=> 0.3779570884570852
The word "mafia" has a much stronger effect here because members of the mafia are also called gangsters.
Building the database
Now that we've figured out how to generate our embeddings, let's start with the Hash-based database.
class VectorDb < Hash
  def add_item(id, content:, embedding: nil)
    self[id] = { content:, embedding: }
  end
end
This will store an item on a given ID along with its content and embedding.
We now want to generate and store the embedding if we haven't passed one in; for this, we need the OpenAI client:
class VectorDb < Hash
  # ...

  def add_item(id, content:, embedding: nil)
    embedding ||= openai_client.embedding(content)
    self[id] = { content:, embedding: }
  end

  private

  def openai_client
    @openai_client ||= OpenaiClient.new
  end
end
We add the same calculation methods as before to calculate distances between vectors:
class VectorDb < Hash
  # ...

  def dot_product(vector1, vector2)
    vector1.zip(vector2).map { |a, b| a * b }.reduce(:+)
  end

  def magnitude(vector)
    Math.sqrt(vector.map { |component| component**2 }.reduce(:+))
  end

  def cosine_similarity(vector1, vector2)
    dot_prod = dot_product(vector1, vector2)
    magnitude1 = magnitude(vector1)
    magnitude2 = magnitude(vector2)

    dot_prod / (magnitude1 * magnitude2)
  end
end
Now, I implement a search method where this all comes together.
The concept here is that we will generate the query embedding, loop through the stored embeddings, and calculate their similarity to that query embedding.
class VectorDb < Hash
  def search(query)
    query_embedding = openai_client.embedding(query)

    result = each_with_object({}) do |(id, item), results|
      results[id] = cosine_similarity(query_embedding, item[:embedding])
    end

    result.sort_by { |_id, similarity| similarity }.reverse.to_h
  end

  # ...
end
🧙‍♂️ Lo and behold ✨ a vector database.
Let's try it out:
vb = VectorDb.new
vb.add_item("The Godfather", content: "A coming of age story of a violent mafia son and his father's unhealthy obsession with oranges.")
vb.add_item("Meet the Parents", content: "A funny meeting between a father and a man who can milk just about anything with nipples, not having seen this is a crime.")
vb.add_item("The Big Lebowski", content: "A man gets his rug soiled by German nihilists who have no regard of the law.")
vb.search("Being criminal")
=>
{"The Big Lebowski" => 0.2902403092098413,
"Meet the Parents" => 0.20513406388011676,
"The Godfather" => 0.19525673473056426}
vb.search("Movie about breaking the law")
=>
{"The Godfather" => 0.35009263294726534,
"The Big Lebowski" => 0.2518669715823097,
"Meet the Parents" => 0.22558495028153422}
vb.search("Farming cattle")
=>
{"Meet the Parents" => 0.22182056416947168,
"The Godfather" => 0.12570011314672785,
"The Big Lebowski" => 0.008469368264364811}
Seems like it works. Let's add an item to skew the results a bit:
vb.add_item("Snatch", content: "A movie about a bunch of gangsters stealing a diamond and a dog.")
# Rerun our query
vb.search("Movie about breaking the law")
=>
{"Snatch"=>0.5023058260371518,
"The Godfather"=>0.35009263294726534,
"The Big Lebowski"=>0.2518669715823097,
"Meet the Parents"=>0.22558495028153422}
I suspect adding "a movie about" has influenced the search quite a bit here, which is a good thing: the other descriptions in no way reflected that they were about movies. We can test this assumption by adding yet another movie and rerunning the search:
vb.add_item("Snatch v2", content: "A bunch of gangsters stealing a diamond and a dog.")
vb.search("Movie about breaking the law")
=>
{"Snatch"=>0.5023058260371518,
"The Godfather"=>0.35009263294726534,
"Snatch v2"=>0.34647917573476067,
"The Big Lebowski"=>0.2518669715823097,
"Meet the Parents"=>0.22558495028153422}
The hypothesis was correct. Who would've thought adding more context would give better results? It highlights the need to provide the models with high-quality, well-labeled information.
Ruby's Vector type
Now that we understand how this all works, I'll tell you a little secret: Ruby has a Vector data type!
This is how it works:
require 'matrix'
vec1 = Vector[1, 2, 3, 4]
vec2 = Vector[4, 5, 6, 7]
puts vec1.dot(vec2)
=> 60
puts "Magnitude vec1: #{vec1.magnitude}"
puts "Magnitude vec2: #{vec2.magnitude}"
=>
Magnitude vec1: 5.477225575051661
Magnitude vec2: 11.224972160321824
# `magnitude` is aliased as `r` and `norm`
def cosine_similarity(vector1, vector2)
  vector1.dot(vector2) / (vector1.norm * vector2.norm)
end
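Running this on our small example vectors gives the same number as before:
puts cosine_similarity(vec1, vec2)
=> 0.9759000729485332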
You can create a vector from an embedding like this:
embedding = openai_client.embedding("My text")
vector = Vector[*embedding]
# or
vector = Vector.elements(embedding)
Let's update VectorDb to use the Vector type.
We can replace all the calculation methods with the following:
def cosine_similarity(vector1, vector2)
  vector1.dot(vector2) / (vector1.norm * vector2.norm)
end
This leaves us with the following final result:
require 'matrix'

class VectorDb < Hash
  def search(query)
    query_embedding = openai_client.embedding(query)
    query_embedding = Vector.elements(query_embedding)

    result = each_with_object({}) do |(id, item), results|
      results[id] = cosine_similarity(query_embedding, item[:embedding])
    end

    result.sort_by { |_id, similarity| similarity }.reverse.to_h
  end

  def add_item(id, content:, embedding: nil)
    embedding = openai_client.embedding(content) if embedding.nil?
    self[id] = { content: content, embedding: Vector.elements(embedding) }
  end

  def cosine_similarity(vector1, vector2)
    vector1.dot(vector2) / (vector1.norm * vector2.norm)
  end

  private

  def openai_client
    @openai_client ||= OpenaiClient.new
  end
end
Let's rerun the first query to double-check the result is the same:
vb.search("Being criminal")
=>
# Before
{"The Big Lebowski"=>0.2902403092098413,
"Meet the Parents"=>0.20513406388011676,
"The Godfather"=>0.19525673473056426}
# After
{"The Big Lebowski"=>0.2902403092098413,
"Meet the Parents"=>0.20513406388011676,
"The Godfather"=>0.19525673473056426}
Looking very good.
PStore
So far, because we have been using Hash as the basis of our database, nothing is persisted: every new console session has to hit the API to generate the embeddings all over again. This not only costs time but also money. It isn't a big problem for testing similarities between a handful of sentences, but if we could store our data, we'd even be able to use it as a real database.
Enter PStore.
PStore implements a file based persistence mechanism based on a Hash. User code can store hierarchies of Ruby objects (values) into the data store file by name (keys). An object hierarchy may be just a single object. User code may later read values back from the data store or even update data, as needed. — from the ruby/pstore README
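To get a feel for the API before we build on it, here is a minimal PStore round trip (a throwaway sketch; the file name is just an example):
require 'pstore'

store = PStore.new("example.pstore")

# All reads and writes happen inside a transaction block.
store.transaction do
  store["greeting"] = "hello"
end

# Passing true opens a read-only transaction.
store.transaction(true) do
  puts store["greeting"]
end
# => hello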
Let's get started with the implementation. I will call the new database Vstore and add the add_item and cosine_similarity methods.
require 'pstore'

class Vstore < PStore
  def add_item(id, content:, embedding: nil)
    transaction do
      embedding = openai_client.embedding(content) if embedding.nil?
      embedding = Vector.elements(embedding)
      self[id] = { content: content, embedding: embedding }
    end
  end

  def cosine_similarity(vector1, vector2)
    vector1.dot(vector2) / (vector1.norm * vector2.norm)
  end

  private

  def openai_client
    @openai_client ||= OpenaiClient.new
  end
end
As you can see, PStore is very similar to Hash, except that we have to wrap operations in a transaction block.
We need to pass a path to a location when initialising the database, like so:
vdb = Vstore.new("my_vector_store.pstore")
The search method will be a bit different, though, because of how we have to loop over the records:
def search(query)
  query_embedding = openai_client.embedding(query)
  query_embedding = Vector.elements(query_embedding)

  result = {}
  transaction(true) do
    roots.each do |id|
      item = self[id]
      next if !item.key?(:embedding) || item[:embedding].nil?

      result[id] = cosine_similarity(query_embedding, item[:embedding])
    end
  end

  result.sort_by { |_id, similarity| similarity }.reverse.to_h
end
We start a read-only transaction with transaction(true), loop over all the keys in the store by calling roots, and fetch the data with self[id] within the each block.
Apart from this, the implementation is the same as that of the Hash-based one.
Let's check the results:
vb = Vstore.new('movies.pstore')
vb.add_item("The Godfather", content: "A coming of age story of a violent mafia son and his father's unhealthy obsession with oranges.")
vb.add_item("Meet the Parents", content: "A funny meeting between a father and a man who can milk just about anything with nipples, not having seen this is a crime.")
vb.add_item("The Big Lebowski", content: "A man gets his rug soiled by German nihilists who have no regard of the law.")
vb.search("Being criminal")
=>
{"The Big Lebowski"=>0.2902403092098413,
"Meet the Parents"=>0.20513406388011676,
"The Godfather"=>0.19525673473056426}
# With Hash:
{"The Big Lebowski"=>0.2902403092098413,
"Meet the Parents"=>0.20513406388011676,
"The Godfather"=>0.19525673473056426}
Beautiful! We've created a vector database that generates the embeddings for the items we put into it and allows us to do semantic search!
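And because the data now lives on disk, a new console session can reopen the same file and search it right away; only the query embedding is generated, while the stored item embeddings are read back from movies.pstore:
vb = Vstore.new('movies.pstore')
vb.search("Being criminal")
# Same results as above, without any add_item calls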
Want to learn more?
This post is based on a chapter from my new book, "From RAG to Riches: How to Build Insanely Good AI Applications That Use Your Own Data," which is out now!
I wrote the book to share all the knowledge I gained from building AI applications that use documents and other data to give far better results.
It irks me that big tech companies are trying to sell simple AI features as expensive add-ons or products, while the underlying technology is not so complicated that you wouldn't understand it. It is just so new at the moment that it will take some time to figure out how it all comes together.
I've combined all my learnings in the book, and after reading it, you should be at the top of the industry in using AI in Rails applications.