as a tradeoff between memory and recall. The baseline is Float32, which offers high fidelity at a high memory cost. The basic solution is scalar quantization, which reduces each value to fewer bits (around 4× compression) with a slight recall loss. Binary quantization pushes much harder, often reaching 32× compression, but retrieval results can become inconsistent due to information loss. Product quantization can be more efficient, but it is harder to tune and operate in real production.
In early May of 2026, Qdrant released TurboQuant, a new quantization method, and claimed that “TurboQuant can reduce memory use without making retrieval quality too unstable”. That sounds like exactly the kind of feature vector search teams want.
However, I wondered whether TurboQuant still holds up when we test it across different dataset sizes. Does it give a real improvement over common quantization methods, or does its advantage depend on the data?
I ran experiments to compare it with more familiar quantization methods such as scalar and binary quantization. The goal was to understand where TurboQuant is useful, where it is risky, and whether it can be treated as a serious default option for vector search.
I believe that this will help engineers, ML practitioners, and vector database users understand where TurboQuant fits compared with more common quantization methods, especially when moving from experiments to production.
Every float32 number in a vector uses 4 bytes. As a result, a 1536-dimensional embedding takes about 6 KB per vector; at a million vectors, that is roughly 6 GB just for the vectors.
This is where quantization comes in. Quantization stores each number in a vector with fewer bits. The standard approach is scalar quantization. It starts by finding the min and max of each dimension. That range is then divided into 255 equal steps, which gives 256 levels. Every value in the vector is rounded to the nearest level, and the level index is stored as a single byte instead of four.
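The arithmetic is easy to check. The sketch below is only a back-of-the-envelope calculator: the helper names are mine, it counts raw vector storage only (no index or metadata overhead), and it uses decimal gigabytes.

```python
def vector_bytes(dims: int, bits_per_value: float) -> float:
    """Bytes needed to store one vector at a given bit width."""
    return dims * bits_per_value / 8


def collection_gb(n_vectors: int, dims: int, bits_per_value: float) -> float:
    """Raw vector storage in decimal gigabytes."""
    return n_vectors * vector_bytes(dims, bits_per_value) / 1e9


for label, bits in [("float32", 32), ("uint8", 8), ("4-bit", 4), ("1-bit", 1)]:
    per_vector = vector_bytes(1536, bits)
    total = collection_gb(1_000_000, 1536, bits)
    print(f"{label:8s} {per_vector:7.0f} B/vector  {total:6.3f} GB per 1M vectors")
```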
The original Float32 embedding now becomes a uint8 embedding at 4x compression, meaning 4 times smaller in storage size.
Figure 1 below is a simple demonstration of this process on a 6D vector.

The tiny error in the last row is called quantization error, and it accumulates across 6 dimensions of the vector during dot product computation. This is what makes similarity scores slightly wrong.
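To make the rounding step concrete, here is a minimal sketch of scalar quantization on a made-up 6D vector. It is not Qdrant's implementation: the vector, the query, and the shared [-1, 1] range are all assumptions chosen for illustration (a real quantizer would derive the range from the data).

```python
def scalar_quantize(values, lo, hi, levels=256):
    """Map floats in [lo, hi] to integer codes 0..levels-1 (one byte when levels=256)."""
    step = (hi - lo) / (levels - 1)  # 255 equal steps between 256 levels
    codes = [min(levels - 1, max(0, round((v - lo) / step))) for v in values]
    return codes, step


def dequantize(codes, lo, step):
    """Turn codes back into approximate floats."""
    return [lo + c * step for c in codes]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


vec = [0.12, -0.53, 0.87, -0.02, 0.44, -0.91]    # made-up 6D embedding
query = [0.30, 0.10, -0.25, 0.60, -0.40, 0.05]   # made-up query
lo, hi = -1.0, 1.0                               # assumed value range

codes, step = scalar_quantize(vec, lo, hi)
approx = dequantize(codes, lo, step)
errors = [a - v for a, v in zip(approx, vec)]

print("codes:         ", codes)
print("max abs error: ", max(abs(e) for e in errors))
print("exact dot:     ", dot(query, vec))
print("quantized dot: ", dot(query, approx))
```

Each coordinate is off by at most half a step, and the dot product inherits the sum of those small errors weighted by the query.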
There are also more aggressive compression levels, such as 8x (4-bit), 16x (2-bit), or 32x (1-bit). The higher the compression, the smaller the vector and the larger the error relative to the original. You can see it in Figure 2 below, which demonstrates the error after converting a Float32 number to different quantization spaces.

The tradeoff between compression and recall (or memory and recall) is obvious. More compression results in lower recall.
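The same effect can be reproduced numerically. The sketch below rounds random values in [-1, 1] onto a uniform grid at several bit widths and reports the mean absolute error. The uniform grid and the random inputs are assumptions for illustration, not how any specific Qdrant quantizer picks its levels.

```python
import random


def quantize_roundtrip(value, bits, lo=-1.0, hi=1.0):
    """Round a value onto a uniform grid with 2**bits levels and map it back."""
    levels = 2 ** bits
    step = (hi - lo) / (levels - 1)
    return lo + round((value - lo) / step) * step


def mean_abs_error(values, bits):
    return sum(abs(quantize_roundtrip(v, bits) - v) for v in values) / len(values)


rng = random.Random(0)
values = [rng.uniform(-1.0, 1.0) for _ in range(10_000)]
errors = {bits: mean_abs_error(values, bits) for bits in (8, 4, 2, 1)}

for bits, err in errors.items():
    print(f"{bits}-bit ({32 // bits}x compression): mean abs error {err:.4f}")
```

At 1 bit the grid collapses to the two endpoints, which is the sign-only case.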
The real question is: what vector geometry remains after compression?
Traditional quantizers, in most cases, directly compress the vector. Scalar quantization applies the same fixed grid to every dimension, whether that dimension contains a useful signal or noise. Binary quantization keeps only the sign bit. Therefore, neither method first checks whether some dimensions carry more signal than others.
Qdrant 1.18 changes this pattern by integrating TurboQuant. Based on a Google Research algorithm presented at ICLR 2026, TurboQuant rotates the vector before compression. This random rotation spreads variance more evenly across dimensions, so each bit can preserve more useful information.
TurboQuant is not better because it uses fewer bits. It is better because it makes the vector easier to compress before spending those bits.
The key differences between TurboQuant and others are shown in Figure 3 below.
TurboQuant first makes all dimensions look alike, then uses one well-designed codebook for all of them. It is like reshaping every foot to the same size so that one pair of shoes fits everyone.

Every vector in an embedding model has structure.
A 1536-dimensional embedding might carry most of its useful signal in only a small subset of coordinates. The remaining dimensions often contribute much less, but they still appear in every vector, which adds noise and makes distance comparisons less reliable.
The idea is simple. Before compressing, spin the vector through a random orthogonal rotation. That rotation does not change distances - it just redistributes energy so every dimension carries roughly the same amount of information. Then, a single precomputed codebook is applied to the rotated vectors, and it can handle all dimensions equally well. No per-dimension tuning needed. No training on your data.
Check Figure 4 below for a summary of the process.


In Figure 5, before rotation, a few dimensions carry most of the energy. The rest carry much less signal and often more noise.
After rotation, every dimension carries roughly equal energy and an equal amount of information.
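A small experiment shows both effects at once. The sketch below uses a randomized Hadamard transform (random sign flips followed by a normalized Walsh-Hadamard transform) as the rotation, applied to synthetic data with 4 loud dimensions and 60 quiet ones. The data, sizes, and helper names are assumptions for illustration only.

```python
import math
import random


def fwht(vec):
    """Normalized fast Walsh-Hadamard transform (length must be a power of two)."""
    out = list(vec)
    h = 1
    while h < len(out):
        for start in range(0, len(out), 2 * h):
            for i in range(start, start + h):
                a, b = out[i], out[i + h]
                out[i], out[i + h] = a + b, a - b
        h *= 2
    scale = 1.0 / math.sqrt(len(out))
    return [x * scale for x in out]


def rotate(vec, signs):
    """Random sign flip followed by a Hadamard transform: an orthogonal map."""
    return fwht([s * x for s, x in zip(signs, vec)])


def top_energy_share(vectors, top=4):
    """Share of total energy held by the `top` most energetic dimensions."""
    energy = sorted((sum(x * x for x in col) for col in zip(*vectors)), reverse=True)
    return sum(energy[:top]) / sum(energy)


def distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


rng = random.Random(0)
DIMS, COUNT = 64, 500
stds = [10.0 if d < 4 else 0.5 for d in range(DIMS)]   # synthetic anisotropy
vectors = [[rng.gauss(0.0, s) for s in stds] for _ in range(COUNT)]
signs = [rng.choice((-1.0, 1.0)) for _ in range(DIMS)]
rotated = [rotate(v, signs) for v in vectors]

share_before = top_energy_share(vectors)
share_after = top_energy_share(rotated)
dist_before = distance(vectors[0], vectors[1])
dist_after = distance(rotated[0], rotated[1])

print(f"top-4 energy share: {share_before:.2f} before, {share_after:.2f} after")
print(f"distance between two vectors: {dist_before:.6f} before, {dist_after:.6f} after")
```

The top four dimensions go from holding nearly all of the energy to holding a share close to 4/64, while the distance between any two vectors stays the same up to floating-point error.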
But does this redistribution of energy really preserve the important information, and keep the distance to another vector the same as with the original?
I ran a simple computation on two 4D vectors. Vector A is transformed with TurboQuant; at inference time, Vector B is rotated with the same matrix, and the cosine similarity is measured in that rotated space. This value is then compared with the cosine similarity between the original Vector A and the original Vector B.

In Figure 6, after applying TurboQuant to the original Vector A, the similarity between the new Vector A and Vector B barely changes compared with the original pair. In this example, the important geometry between the vectors is preserved, which is what keeps recall high.
Qdrant handles TurboQuant in two separate processes, indexing and querying:
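A check of this kind can be sketched in a few lines. This is not the exact computation behind Figure 6: the two vectors are made up, a fixed 4x4 Hadamard matrix stands in for the random rotation, and a uniform 4-bit grid stands in for TurboQuant's codebook.

```python
import math

# Fixed orthogonal matrix (4x4 Hadamard scaled by 1/2), standing in for the random rotation.
ROTATION = [[0.5 * s for s in row] for row in (
    (1, 1, 1, 1), (1, -1, 1, -1), (1, 1, -1, -1), (1, -1, -1, 1))]


def rotate(vec):
    return [sum(r * x for r, x in zip(row, vec)) for row in ROTATION]


def normalize(vec):
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def quantize(vec, bits=4):
    """Round each coordinate onto a uniform grid over [-1, 1]."""
    step = 2.0 / (2 ** bits - 1)
    return [-1.0 + round((x + 1.0) / step) * step for x in vec]


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


vec_a = [0.9, 0.1, -0.3, 0.2]    # made-up database vector
vec_b = [0.7, -0.2, -0.1, 0.5]   # made-up query vector

original = cosine(vec_a, vec_b)
stored_a = quantize(rotate(normalize(vec_a)))   # rotate, then compress
rotated_b = rotate(vec_b)                       # query is rotated but kept in float
compressed = cosine(stored_a, rotated_b)

print(f"original cosine:   {original:.4f}")
print(f"compressed cosine: {compressed:.4f}")
```

Rotation alone leaves the cosine unchanged; the small remaining gap comes entirely from rounding Vector A onto the grid.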

The overview of Indexing Flow is visualized in Figure 7. Basically, the vector is processed as follows:
original vector → normalize/prepare depending on metric → pad if needed → Hadamard rotation → optional per-coordinate calibration: x → (x + shift) · scale → Lloyd-Max centroid assignment → packed TurboQuant codes
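The flow above can be sketched end to end in plain Python. This is a rough illustration under several assumptions, not Qdrant's code: the vector is normalized as for cosine distance, the optional calibration step is skipped, the codebook is fitted with simple 1D Lloyd iterations on synthetic Gaussian samples, and codes are packed two per byte with the low nibble first.

```python
import math
import random


def lloyd_max_codebook(bits, samples=10_000, iterations=25, seed=0):
    """Fit 2**bits centroids to standard-normal samples with 1D Lloyd iterations."""
    rng = random.Random(seed)
    data = sorted(rng.gauss(0.0, 1.0) for _ in range(samples))
    k = 2 ** bits
    centroids = [data[(2 * i + 1) * samples // (2 * k)] for i in range(k)]  # quantile init
    for _ in range(iterations):
        sums, counts, idx = [0.0] * k, [0] * k, 0
        for x in data:  # data and centroids are sorted, so idx only moves right
            while idx + 1 < k and abs(centroids[idx + 1] - x) < abs(centroids[idx] - x):
                idx += 1
            sums[idx] += x
            counts[idx] += 1
        centroids = [s / c if c else old for s, c, old in zip(sums, counts, centroids)]
    return centroids


def fwht(vec):
    """Normalized fast Walsh-Hadamard transform (length must be a power of two)."""
    out = list(vec)
    h = 1
    while h < len(out):
        for start in range(0, len(out), 2 * h):
            for i in range(start, start + h):
                a, b = out[i], out[i + h]
                out[i], out[i + h] = a + b, a - b
        h *= 2
    return [x / math.sqrt(len(out)) for x in out]


def encode(vector, signs, centroids):
    """normalize -> pad -> rotate -> nearest-centroid codes (calibration omitted)."""
    norm = math.sqrt(sum(x * x for x in vector)) or 1.0
    padded = [x / norm for x in vector] + [0.0] * (len(signs) - len(vector))
    rotated = fwht([s * x for s, x in zip(signs, padded)])
    scaled = [x * math.sqrt(len(signs)) for x in rotated]  # roughly unit variance
    return [min(range(len(centroids)), key=lambda c: abs(centroids[c] - x)) for x in scaled]


def pack_4bit(codes):
    """Pack two 4-bit codes into each byte (low nibble first)."""
    padded = codes + [0] * (len(codes) % 2)
    return bytes(padded[i] | (padded[i + 1] << 4) for i in range(0, len(padded), 2))


rng = random.Random(1)
DIMS = 100
PADDED = 1 << (DIMS - 1).bit_length()   # next power of two, needed by the Hadamard step
signs = [rng.choice((-1.0, 1.0)) for _ in range(PADDED)]
codebook = lloyd_max_codebook(bits=4)
vector = [rng.gauss(0.0, 1.0) for _ in range(DIMS)]

codes = encode(vector, signs, codebook)
packed = pack_4bit(codes)
print(len(vector) * 4, "bytes as float32 ->", len(packed), "bytes packed")
```

Padding is the reason a 100-dimensional vector ends up with 128 codes: the Hadamard transform needs a power-of-two length.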
For TurboQuant specifically, Qdrant stores the information below as written in Table 1:

An important step introduced by Qdrant is length renormalization, also called the scaling factor. After quantization, Qdrant measures how much shorter the quantized reconstruction is than the original vector, stores that ratio as a per-vector scaling factor, and applies it during scoring at query time.
The scaling factor = original_length / centroid_reconstruction_length
Why do we need Length Renormalization?
The observation after quantization is that the quantized vector points in the right direction but is too short.
Quantization always introduces an error, and that error systematically shrinks the length of every vector. At query time, the dot product between a quantized vector and a rotated, encoded query is therefore computed against a slightly-too-short vector, which gives a score that is consistently too low. Qdrant calls this the “recall-degrading bias”.
To fix this, Qdrant multiplies the lost length back in during the scoring phase instead of modifying the stored vectors. This tactic is simple and effective.
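The shrinkage and its correction are easy to reproduce. The sketch below quantizes a Gaussian vector with the commonly tabulated 2-bit Lloyd-Max levels for a unit Gaussian, then applies the scaling factor from the formula above. The vector, the query, and the helper names are assumptions for illustration, and a plain Gaussian vector stands in for a rotated embedding.

```python
import math
import random

# Commonly tabulated 2-bit Lloyd-Max levels for a unit Gaussian.
CENTROIDS = (-1.510, -0.4528, 0.4528, 1.510)


def reconstruct(vector):
    """Replace each coordinate with its nearest centroid."""
    return [min(CENTROIDS, key=lambda c: abs(c - x)) for x in vector]


def length(vector):
    return math.sqrt(sum(x * x for x in vector))


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def scores(query, vector):
    """Return (exact, raw quantized, renormalized) dot-product scores."""
    approx = reconstruct(vector)
    scale = length(vector) / length(approx)   # original_length / reconstruction_length
    raw = dot(query, approx)
    return dot(query, vector), raw, scale * raw


rng = random.Random(0)
vector = [rng.gauss(0.0, 1.0) for _ in range(4096)]    # stand-in for a rotated vector
query = [x + rng.gauss(0.0, 0.3) for x in vector]      # a query close to that vector

shrink = length(reconstruct(vector)) / length(vector)
exact, raw, corrected = scores(query, vector)

print(f"reconstruction length / original length: {shrink:.3f}")
print(f"exact score:        {exact:.1f}")
print(f"raw quantized:      {raw:.1f}")
print(f"after renormalizing: {corrected:.1f}")
```

In this toy setup the correction moves the score back toward the exact value but does not close the gap completely, because the reconstruction also points in a slightly different direction.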

Figure 8 shows the process of querying with the TurboQuant vector database.
The query is rotated and converted into a SIMD scoring representation, and Qdrant uses asymmetric scoring to compare that encoded query directly against the packed TurboQuant codes stored for database vectors.
After that, the score is multiplied by the stored scaling factor.
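Conceptually, asymmetric scoring means the query stays in float while the database side is read straight from its codes. The sketch below shows that idea with a hypothetical 16-entry codebook and a made-up stored vector. The low-nibble-first byte layout is my assumption, and the SIMD representation Qdrant uses is left out entirely.

```python
def unpack_4bit(packed, count):
    """Unpack two 4-bit codes per byte (low nibble first) and trim to `count`."""
    codes = []
    for byte in packed:
        codes.append(byte & 0x0F)
        codes.append(byte >> 4)
    return codes[:count]


def asymmetric_score(rotated_query, packed_codes, centroids, scale):
    """Float query against quantized vector: look up centroids, accumulate, rescale."""
    codes = unpack_4bit(packed_codes, len(rotated_query))
    raw = sum(q * centroids[c] for q, c in zip(rotated_query, codes))
    return scale * raw


# Hypothetical codebook and one stored vector (packed codes plus its scaling factor).
centroids = [-1.5 + 0.2 * i for i in range(16)]
stored_codes = [3, 12, 7, 8, 0, 15]
packed = bytes(stored_codes[i] | (stored_codes[i + 1] << 4) for i in range(0, 6, 2))
stored_scale = 1.05

rotated_query = [0.4, -0.1, 0.9, 0.2, -0.7, 0.3]   # already rotated, kept in float
score = asymmetric_score(rotated_query, packed, centroids, stored_scale)
print(f"score: {score:.4f}")
```

Only the database vector pays the quantization error here; the query keeps its full precision.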
Qdrant offers multiple choices for quantization, and TurboQuant also offers multiple bit-compression variants such as bits4, bits2, bits1.5, and bits1.
According to the Qdrant documentation, lower bit depths offer higher compression at the cost of accuracy.
Figure 9 shows some suggestions for reference in case you still wonder which compression methods to use.

Enabling TurboQuant takes a single config change in your current Qdrant code. Your existing collections remain untouched.
The code snippet below shows the details.
from qdrant_client import QdrantClient, models

client = QdrantClient("localhost", port=6333)

# New collection: one config change
client.create_collection(
    collection_name="my_collection",
    vectors_config=models.VectorParams(
        size=1536,
        distance=models.Distance.COSINE,
    ),
    quantization_config=models.TurboQuantization(
        turbo=models.TurboQuantQuantizationConfig(
            bits=models.TurboQuantBitSize.BITS4,
            always_ram=True,
        )
    ),
)

# Existing collection: patch without recreating vectors
client.update_collection(
    collection_name="existing_collection",
    quantization_config=models.TurboQuantization(
        turbo=models.TurboQuantQuantizationConfig(
            bits=models.TurboQuantBitSize.BITS4,
            always_ram=True,
        )
    ),
)
For more configuration options, see the Qdrant documentation for TurboQuant.
To test TurboQuant against every other Qdrant quantizer on real embeddings, I ran multiple tests at different sizes (10K, 50K, and 100K vectors) across Qdrant's quantization methods.
I chose the DBpedia embeddings dataset (License: CC-BY-SA 4.0 and GNU Free Documentation License) because it has a coordinate variance ratio of 233.5x - highly anisotropic. A few dimensions carry most of the signal; the rest carry noise. This is exactly the distribution where TurboQuant’s rotation should help most, and where scalar quantization’s fixed grid wastes the most bits.
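For readers who want to check their own embeddings, a ratio like this can be computed in a few lines. The sketch below assumes the ratio is the largest per-dimension variance divided by the smallest, which is my reading of the metric. The data is a synthetic stand-in, not DBpedia.

```python
import random
import statistics


def coordinate_variance_ratio(vectors):
    """Max per-dimension variance divided by min per-dimension variance."""
    variances = [statistics.pvariance(column) for column in zip(*vectors)]
    return max(variances) / min(variances)


rng = random.Random(0)
# Synthetic stand-in for real embeddings: one loud dimension, seven quiet ones.
stds = [10.0] + [1.0] * 7
sample = [[rng.gauss(0.0, s) for s in stds] for _ in range(2_000)]

ratio = coordinate_variance_ratio(sample)
print(f"coordinate variance ratio: {ratio:.1f}x")
```

A ratio near 1 means the dimensions are already balanced, so rotation has less to fix.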
Please check the details of the test environment in the Appendix section, part 9.2.
The recall results are shown in Figure 10.
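Recall in this kind of test is usually measured as recall@k: the share of the exact (Float32) top-k results that the quantized search also returns. The helper below is a minimal sketch of that common definition with made-up ids.

```python
def recall_at_k(approx_ids, exact_ids, k):
    """Fraction of the exact top-k that also appears in the approximate top-k."""
    return len(set(approx_ids[:k]) & set(exact_ids[:k])) / k


exact = [11, 42, 7, 93, 58]    # ids from a full-precision search (made up)
approx = [42, 11, 93, 16, 7]   # ids from a quantized search (made up)

print(recall_at_k(approx, exact, k=5))   # 4 of 5 ids are shared -> 0.8
```

Order inside the top-k does not matter for this metric, only membership.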

Four things jump out:
The latency results are shown in Figure 11.

Figure 12 below shows the testing storage size for each quantization method.

The index building times are shown in Figure 13.

Indexing time is more environment-sensitive, so treat these numbers as directional rather than absolute. Results can vary depending on CPU, memory bandwidth, disk I/O, parallelism, and the overall machine load during the run.
Overall, TurboQuant looks promising when we prioritize the balance of compression and stable retrieval quality. The results show that not all compressed formats behave the same as the dataset grows. Some methods lose recall quickly, while others stay much closer to the Float32 baseline.
In short, TurboQuant is not only about reducing memory. TQ 4-bit is the most balanced option for general use. TQ 1.5-bit with rescoring is better when compression is the top priority. The effective pattern is to pair TurboQuant with rescoring.
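Rescoring is worth a quick illustration, because it explains why an aggressive quantizer can still be usable. The sketch below is a toy version: it ranks with crude 1-bit sign codes, keeps an oversampled shortlist, and re-ranks that shortlist with the original float vectors. The data is random, and the `oversampling` and `rescore` parameters here are names for this sketch only.

```python
import random


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def top_ids(scores, k):
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]


def search(query, vectors, codes, k, oversampling=1, rescore=False):
    """Rank by compressed scores; optionally rescore an oversampled shortlist in float."""
    coarse = [dot(query, code) for code in codes]
    shortlist = top_ids(coarse, k * oversampling)
    if not rescore:
        return shortlist[:k]
    return sorted(shortlist, key=lambda i: dot(query, vectors[i]), reverse=True)[:k]


rng = random.Random(0)
vectors = [[rng.gauss(0.0, 1.0) for _ in range(32)] for _ in range(300)]
codes = [[1.0 if x >= 0 else -1.0 for x in v] for v in vectors]   # crude 1-bit stand-in
query = [rng.gauss(0.0, 1.0) for _ in range(32)]
K = 10

exact = set(top_ids([dot(query, v) for v in vectors], K))
plain = search(query, vectors, codes, K)
rescored = search(query, vectors, codes, K, oversampling=3, rescore=True)

hits_plain = len(exact & set(plain))
hits_rescored = len(exact & set(rescored))
print(f"true top-{K} found without rescoring: {hits_plain}")
print(f"true top-{K} found with rescoring:    {hits_rescored}")
```

Rescoring can only help recall here, since every true neighbor that makes it into the shortlist is guaranteed to survive the float re-ranking. The cost is the extra reads of original vectors.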
Important: These numbers should not be treated as a production rule. These act as a reference for your own judgment. Measure the performance on your embeddings, your queries, your hardware, and your recall targets before migrating to production.

TurboQuant improves the compression tradeoff. But it does not remove the tradeoff completely.
It is also still new. It launched on May 11, 2026, so real production experience is still limited. The safe approach is simple: benchmark it first, then decide whether it should become your default.
I want to lay out some limitations that need to be considered. A summary of the limitations can be found in Figure 14:
The first limitation is maturity. Qdrant’s benchmark results look promising. But your data may behave differently. Your embedding model, query pattern, filters, and data distribution may not match the benchmark datasets. So TurboQuant should be treated as a strong option, not an automatic replacement.
TurboQuant may also be slower than Binary Quantization at the same storage size. This matters if your main goal is throughput or speed. If you care more about speed than recall, Binary Quantization is still the better choice. TurboQuant is more useful when you want better recall from a small memory budget.
There is also a calibration cost. TurboQuant needs a one-time calibration step for each segment. This usually takes seconds, not minutes. But it is still a cost. If your system creates many segments or rebuilds indexes often, this extra step should be considered.
Distance type is another limitation. TurboQuant works best with L2, dot product, and cosine similarity. Rotation preserves these distance relationships well. But it does not preserve L1 (Manhattan) distance in the same way. L1 distance can still work, but it needs full vector reconstruction for each comparison. That can make search slower. If Manhattan distance is important in your system, Scalar Quantization is the safer choice.
As the test results show, TQ 1-bit is not a safe choice. It gives very high compression, but recall can drop too much. The rotation step helps, but 1 bit per dimension is often too little to preserve enough geometry at scale. Consider rescoring if TQ 1-bit does not give you the expected performance. Otherwise, TQ 1.5-bit looks like a more practical lower limit. It still gives strong compression, but it keeps recall more stable. For very aggressive compression, it is a safer choice than TQ 1-bit.
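A two-dimensional example is enough to see why rotation and Manhattan distance do not mix. The sketch below rotates two made-up points by 45 degrees and compares both distances before and after.

```python
import math


def rotate_2d(point, degrees):
    """Rotate a 2D point around the origin."""
    angle = math.radians(degrees)
    x, y = point
    return (x * math.cos(angle) - y * math.sin(angle),
            x * math.sin(angle) + y * math.cos(angle))


def l1(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))


def l2(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


a, b = (1.0, 0.0), (0.0, 0.0)
ra, rb = rotate_2d(a, 45), rotate_2d(b, 45)

print(f"L2 before: {l2(a, b):.4f}  after: {l2(ra, rb):.4f}")
print(f"L1 before: {l1(a, b):.4f}  after: {l1(ra, rb):.4f}")
```

The L2 distance is unchanged, while the L1 distance grows from 1 to the square root of 2, so scores computed in the rotated space no longer match Manhattan distances in the original space.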
The main lesson is not “always use TurboQuant.” The main lesson is to test what matters for your own data. TurboQuant shifts the tradeoff in a better direction. It helps reduce recall loss before the bit budget is spent. But it does not make compression free. You still need to choose between memory, speed, recall, and distance behavior.
In short, TurboQuant is a strong new option. It is especially useful with rescoring and moderate bit settings. But it should not be used blindly. Benchmark it on your own embeddings first and measure it carefully before moving to production.
Figure 15 below summarizes four quantization offerings in popular vector databases for your reference.
Qdrant is one of the first services to offer TurboQuant in the market.
