Qdrant under the hood: Product Quantization

Kacper Łukawski

·

May 30, 2023

Qdrant under the hood: Product Quantization

Qdrant 1.1.0 brought the support of Scalar Quantization, a technique of reducing the memory footprint by even four times, by using int8 to represent the values that would be normally represented by float32.

The memory usage in vector search might be reduced even further! Please welcome Product Quantization, a brand-new feature of Qdrant 1.2.0!

Product Quantization

Product Quantization converts floating-point numbers into integers like every other quantization method. However, the process is slightly more complicated than Scalar Quantization and is more customizable, so you can find the sweet spot between memory usage and search precision. This article covers all the steps required to perform Product Quantization and the way it’s implemented in Qdrant.

Let’s assume we have a few vectors being added to the collection and that our optimizer decided to start creating a new segment.

A list of raw vectors

Cutting the vector into pieces

First of all, our vectors are going to be divided into chunks aka subvectors. The number of chunks is configurable, but as a rule of thumb - the lower it is, the higher the compression rate. That also comes with reduced search precision, but in some cases, you may prefer to keep the memory usage as low as possible.

A list of chunked vectors

Qdrant API allows choosing the compression ratio from 4x up to 64x. In our example, we selected 16x, so each subvector will consist of 4 floats (16 bytes), and it will eventually be represented by a single byte.

Clustering

The chunks of our vectors are then used as input for clustering. Qdrant uses the K-means algorithm, with $ K = 256 $. It was selected a priori, as this is the maximum number of values a single byte represents. As a result, we receive a list of 256 centroids for each chunk and assign each of them a unique id. The clustering is done separately for each group of chunks.

Clustered chunks of vectors

Each chunk of a vector might now be mapped to the closest centroid. That’s where we lose the precision, as a single point will only represent a whole subspace. Instead of using a subvector, we can store the id of the closest centroid. If we repeat that for each chunk, we can approximate the original embedding as a vector of subsequent ids of the centroids. The dimensionality of the created vector is equal to the number of chunks, in our case 2.

A new vector built from the ids of the centroids

Full process

All those steps build the following pipeline of Product Quantization:

Full process of Product Quantization

Measuring the distance

Vector search relies on the distances between the points. Enabling Product Quantization slightly changes the way it has to be calculated. The query vector is divided into chunks, and then we figure the overall distance as a sum of distances between the subvectors and the centroids assigned to the specific id of the vector we compare to. We know the coordinates of the centroids, so that’s easy.

Calculating the distance of between the query and the stored vector

Qdrant implementation

Search operation requires calculating the distance to multiple points. Since we calculate the distance to a finite set of centroids, those might be precomputed and reused. Qdrant creates a lookup table for each query, so it can then simply sum up several terms to measure the distance between a query and all the centroids.

Centroid 0Centroid 1
Chunk 00.142130.51242
Chunk 10.084210.00142

Benchmarks

Product Quantization comes with a cost - there are some additional operations to perform so that the performance might be reduced. However, memory usage might be reduced drastically as well. As usual, we did some benchmarks to give you a brief understanding of what you may expect.

Again, we reused the same pipeline as in the other benchmarks we published. We selected Arxiv-titles-384-angular-no-filters and Glove-100 datasets to measure the impact of Product Quantization on precision and time. Both experiments were launched with $ EF = 128 $. The results are summarized in the tables:

Glove-100

Original1D clusters2D clusters3D clusters
Mean precision0.71580.71430.67310.5854
Mean search time2336 µs2750 µs2597 µs2534 µs
Compressionx1x4x8x12
Upload & indexing time147 s339 s217 s178 s

Product Quantization increases both indexing and searching time. The higher the compression ratio, the lower the search precision. The main benefit is undoubtedly the reduced usage of memory.

Arxiv-titles-384-angular-no-filters

Original1D clusters2D clusters4D clusters8D clusters
Mean precision0.98370.96770.91430.80680.6618
Mean search time2719 µs4134 µs2947 µs2175 µs2053 µs
Compressionx1x4x8x16x32
Upload & indexing time332 s921 s597 s481 s474 s

It turns out that in some cases, Product Quantization may not only reduce the memory usage, but also the search time.

Good practices

Compared to Scalar Quantization, Product Quantization offers a higher compression rate. However, this comes with considerable trade-offs in accuracy, and at times, in-RAM search speed.

Product Quantization tends to be favored in certain specific scenarios:

  • Deployment in a low-RAM environment where the limiting factor is the number of disk reads rather than the vector comparison itself
  • Situations where the dimensionality of the original vectors is sufficiently high
  • Cases where indexing speed is not a critical factor

In circumstances that do not align with the above, Scalar Quantization should be the preferred choice.

Qdrant documentation on Product Quantization will help you to set and configure the new quantization for your data and achieve even up to 64x memory reduction.

Get Started with Qdrant Free

Get Started