Creating a Tiny Vector Database, Part 1

Mudit Bachhawat
5 min read · Apr 17, 2024

Introduction

Vector databases allow you to search for approximate nearest neighbors in a large set of vectors. They provide an alternative to brute-force nearest neighbor search, trading some accuracy and additional memory for much faster queries. Previously, many advancements in NLP and deep learning were limited to small scales due to the lack of scalable nearest neighbor search. Recently, with RAG and generative AI, the applications for vector databases have exploded. Metric learning is popularly used to project any document into an embedding space. Vector databases are then used to index a large number of document embeddings and return the nearest neighbors for a query embedding. Common use cases for vector databases are semantic search and multi-modal search.

Popular Methods

  • Early algorithms such as the k-d tree were used, but they do not perform well on higher-dimensional embeddings. A k-d tree splits the dataset along one dimension at a specific value and recursively splits the resulting subsets until a stopping criterion is reached. During inference, it traverses the tree to find the subset of points closest to the query point.
  • Locality Sensitive Hashing (LSH) computes a hash for each embedding and groups points by their hash values. As the name suggests, the hash function is designed so that points close to each other are likely to receive the same hash value. A common scheme draws a random vector v' with the same dimension as the embedding and uses the sign of its dot product with the embedding as one bit of the hash; several such random vectors together define the buckets (a minimal sketch of this idea follows the list).
  • Graph algorithms like NSW and HNSW create weighted graphs with embeddings as vertices and the distances between them as bidirectional edges. During construction, new points are added iteratively and connected to some of their nearest neighbors. During inference, a greedy traversal is performed starting from a random initial point.
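
As a concrete illustration of the LSH idea above, the sketch below implements random-hyperplane hashing for cosine similarity: each random vector contributes one bit of the hash based on the sign of its dot product with the embedding. All names here are illustrative and not taken from any particular library.

#include <cstdlib>
#include <vector>

// Sketch of random-hyperplane LSH for cosine similarity (illustrative only).
// Each of the `numBits` random vectors defines a hyperplane; the sign of the
// dot product with the embedding contributes one bit of the hash, so points
// on the same side of every hyperplane land in the same bucket.
// Assumes numBits <= 32.
struct RandomHyperplaneLSH {
  std::vector<std::vector<float>> planes; // one random vector per hash bit

  RandomHyperplaneLSH(int dim, int numBits)
      : planes(numBits, std::vector<float>(dim)) {
    for (auto &p : planes)
      for (auto &v : p)
        v = 2.0f * rand() / RAND_MAX - 1.0f; // random direction in [-1, 1]^dim
  }

  unsigned int Hash(const std::vector<float> &x) const {
    unsigned int h = 0;
    for (size_t b = 0; b < planes.size(); b++) {
      float dot = 0.0f;
      for (size_t i = 0; i < x.size(); i++)
        dot += planes[b][i] * x[i];
      if (dot > 0)
        h |= (1u << b); // set bit b when x is on the positive side
    }
    return h;
  }
};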

NSW (Navigable Small World)

NSW is a graph structure with a high clustering coefficient, in which nodes are connected to their nearest neighbors. The number of hops needed to navigate between any two nodes grows roughly logarithmically with the total number of nodes. The NSW graph is constructed iteratively from all embeddings; during a query, it enables fast greedy navigation to the nearest neighbors, giving approximately logarithmic search complexity.
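
The C++ snippets in the rest of the post rely on a few aliases and helpers (vf, pif, vpif, Graph, CompareSecond, Dot, KeepTrackHighest) whose definitions are not shown. Based purely on how they are used, they might look roughly like the following; treat these definitions as assumptions rather than the repository's actual code.

#include <algorithm>
#include <utility>
#include <vector>

using namespace std; // the original snippets use unqualified vector, set, ...

// Assumed aliases, reconstructed from usage; the actual definitions may differ.
using vf = vector<float>;           // a single embedding
using pif = pair<int, float>;       // (document index, similarity)
using vpif = vector<pif>;           // list of (index, similarity) pairs
using Graph = vector<vector<int>>;  // adjacency lists, one per document

// Orders candidates by similarity so that the priority queue in Search
// pops the most similar candidate first.
struct CompareSecond {
  bool operator()(const pif &a, const pif &b) const {
    return a.second < b.second;
  }
};

// Dot product; for unit vectors this equals cosine similarity.
inline float Dot(const vf &a, const vf &b) {
  float s = 0.0f;
  for (size_t i = 0; i < a.size(); i++) s += a[i] * b[i];
  return s;
}

// Keeps at most `limit` highest-similarity entries in `heap`, stored as a
// min-heap on similarity so that heap[0] is always the weakest kept entry.
inline void KeepTrackHighest(vpif &heap, const pif &item, int limit) {
  auto worse = [](const pif &a, const pif &b) { return a.second > b.second; };
  heap.push_back(item);
  push_heap(heap.begin(), heap.end(), worse);
  if ((int)heap.size() > limit) {
    pop_heap(heap.begin(), heap.end(), worse);
    heap.pop_back();
  }
}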

Building Graph with NSW

// Inserts a new embedding into the graph by connecting it (bidirectionally)
// to its approximate nearest neighbours among the nodes inserted so far.
void NSW::AddToGraph(Graph &graph, const vf &element) {
  int newNode = graph.size();          // index the new node will take
  vector<int> newNodeNeighbors;

  // find up to M approximate nearest neighbours of the new element
  auto res = Search(element, this->M, this->efConstruction);
  for (auto r : res) {
    int index = r.first;
    graph[index].push_back(newNode);   // edge: existing neighbour -> new node
    newNodeNeighbors.push_back(index); // edge: new node -> existing neighbour
  }
  graph.push_back(newNodeNeighbors);   // append the new node's adjacency list
}
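
With AddToGraph in place, building the full index presumably amounts to inserting the embeddings one at a time. A possible sketch of that loop is shown below; the Build method and the special-casing of the first node are assumptions, not necessarily how the repository does it.

// Illustrative only: insert every embedding into the graph one by one.
// Assumes `documents` and `graph` are members of NSW kept in sync.
void NSW::Build(const vector<vf> &embeddings) {
  for (const auto &e : embeddings) {
    if (this->graph.empty()) {
      // first node: nothing to connect to, just store it with an
      // empty adjacency list
      this->documents.push_back(e);
      this->graph.push_back({});
      continue;
    }
    this->documents.push_back(e); // the new node's index equals graph.size()
    AddToGraph(this->graph, e);   // wire it to ~M approximate nearest neighbours
  }
}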

Finding Nearest Neighbours (single nearest point)

  • Given a query point q, we start from a random node in the graph.
  • We repeatedly move to whichever neighbor of the current node is closer to q.
  • When no neighbor is closer, we stop and take the current node as our closest point.
  • This greedy walk can get stuck in a local optimum; to mitigate that, the same loop is repeated from several random starting points.
  • A visualization of this can be found here; a minimal sketch of the greedy walk follows this list.
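
To make the steps above concrete, here is a minimal sketch of greedy search with random restarts, written against the same members (graph, documents, Dot) used in the other snippets. FindNearest and the restarts parameter are illustrative names, not part of the original code.

// Illustrative greedy search for the single nearest neighbour.
// Repeatedly moves to the most similar neighbour of the current node and
// stops when no neighbour improves on it. Restarting from several random
// nodes reduces the chance of ending in a local optimum.
int NSW::FindNearest(const vf &query, int restarts) {
  int best = -1;
  float bestSim = -1e9f;
  for (int r = 0; r < restarts; r++) {
    int curr = rand() % this->graph.size();
    float currSim = Dot(query, this->documents[curr]);
    bool improved = true;
    while (improved) {
      improved = false;
      for (int idx : this->graph[curr]) {
        float sim = Dot(query, this->documents[idx]);
        if (sim > currSim) { // move towards the query greedily
          curr = idx;
          currSim = sim;
          improved = true;
        }
      }
    }
    if (currSim > bestSim) { // keep the best local optimum found so far
      best = curr;
      bestSim = currSim;
    }
  }
  return best;
}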

Finding Nearest Neighbours (top k points)

vpif NSW::Search(const vf &query, int K = 10, int ef = 20) {
  vector<pif> ans(0);
  int nodes = this->graph.size();
  std::set<int> visited;

  // repeat the greedy search L times, from different random starting points,
  // to reduce the chance of getting stuck in a local optimum
  for (int x = 0; x < this->L; x++) {
    // min-heap of the best `ef` candidates found in this pass
    // (minheap[0] holds the weakest of them)
    vector<pif> minheap(0);

    // max-heap of candidates to expand, ordered by similarity to the query
    std::priority_queue<pif, vector<pif>, CompareSecond> q;

    int startNode = rand() % nodes;
    auto startSim = Dot(query, this->documents[startNode]);

    q.push({startNode, startSim});

    while (q.size()) {
      pif node = q.top();
      q.pop();
      int currNode = node.first;
      float currSim = node.second;

      // break when the best remaining candidate is worse than the weakest
      // kept point, i.e. the result set cannot be improved further
      if (minheap.size() > 0 && currSim < minheap[0].second)
        break;
      // skip nodes that have already been visited
      if (visited.find(currNode) != visited.end())
        continue;
      visited.insert(currNode);

      // keep track of the highest-similarity `ef` points seen in this pass
      KeepTrackHighest(minheap, node, ef);

      // push all neighbours of the current node as new candidates
      auto &neighbors = this->graph[currNode];
      for (auto idx : neighbors) {
        auto neighborSim = Dot(query, this->documents[idx]);
        q.push({idx, neighborSim});
      }
    }

    // merge this pass into the global top-K across all random restarts
    for (auto each : minheap) {
      KeepTrackHighest(ans, each, K);
    }
  }
  return ans;
}
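
Calling the search might then look like this (purely illustrative; index and queryEmbedding are assumed to be an already-built NSW object and a unit-normalised query vector):

// fetch the top-10 approximate neighbours for a query embedding
vpif results = index.Search(queryEmbedding, /*K=*/10, /*ef=*/20);
for (const auto &r : results) {
  // r.first is the document index, r.second its similarity to the query
  printf("doc %d  similarity %.4f\n", r.first, r.second);
}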

Benchmark

We benchmark the above algorithm on a dummy dataset of randomly generated vectors. Each vector has 128 dimensions, with every component drawn uniformly between -1 and 1. All vectors are then normalized by their L2 norm so that they are unit vectors. The benchmarks were run on an Intel i7 MacBook Pro with cosine similarity as the metric. You can find the complete code on GitHub.
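
The dataset described above can be generated in a few lines. The sketch below uses the standard <random> facilities, which may differ from the exact code in the repository.

#include <cmath>
#include <random>
#include <vector>

// Generates `n` random `dim`-dimensional unit vectors: each component is
// drawn uniformly from [-1, 1], then the vector is divided by its L2 norm,
// so that the dot product of two vectors equals their cosine similarity.
std::vector<std::vector<float>> MakeDataset(int n, int dim = 128) {
  std::mt19937 gen(42);
  std::uniform_real_distribution<float> dist(-1.0f, 1.0f);
  std::vector<std::vector<float>> data(n, std::vector<float>(dim));
  for (auto &v : data) {
    float norm = 0.0f;
    for (auto &x : v) {
      x = dist(gen);
      norm += x * x;
    }
    norm = std::sqrt(norm);
    for (auto &x : v)
      x /= norm; // unit-normalise
  }
  return data;
}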

Challenges with NSW:

The NSW algorithm takes a completely greedy approach, which creates problems during search. One major issue is its inability to make large jumps quickly. Long-range edges are mostly created in the early stages of graph construction, when few nodes exist, and they become proportionally rarer as more nodes are inserted. A greedy search starting from a random node therefore spends much of its time stepping through nodes that are far from the query. Ideally, the search should first take long jumps to cover ground quickly and then switch to short jumps as it closes in on the answer. This is exactly the strategy used by the HNSW algorithm, which avoids many unnecessary similarity computations between far-apart nodes and makes the search more efficient.

What’s Next?

HNSW addresses the limitations of NSW by building a hierarchy of layers: each point is assigned a maximum layer and appears in that layer and every layer below it, so the upper layers are sparse while the bottom layer contains all points. The search starts in the sparse upper layers, where a few similarity computations are enough to find an approximate location, and then descends layer by layer, refining the result in progressively denser neighborhoods, which leads to faster queries overall.
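
For reference, the hierarchy in HNSW comes from assigning each point a maximum layer sampled from an exponentially decaying distribution, so most points appear only in the bottom layer and the upper layers stay sparse. A sketch of that assignment rule (as described in the HNSW paper; the next post may implement it differently):

#include <algorithm>
#include <cmath>
#include <random>

// Samples the maximum layer for a new point. With mL = 1 / ln(M), each layer
// contains roughly a constant fraction of the points in the layer below it.
int RandomLevel(std::mt19937 &gen, double mL) {
  std::uniform_real_distribution<double> unif(0.0, 1.0);
  double u = std::max(unif(gen), 1e-12); // avoid log(0)
  return static_cast<int>(std::floor(-std::log(u) * mL));
}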
