
Creating two IndexPQ with specific parameters continuously causes unexpected termination #3803

Closed
qwevdb opened this issue Aug 28, 2024 · 4 comments

@qwevdb

qwevdb commented Aug 28, 2024

Summary

When two IndexPQ instances with specific configuration parameters are created back to back in a Python script, the script is unexpectedly killed during execution.

Platform

OS: Ubuntu 24.04 LTS

Faiss version: faiss-gpu 1.8.0

Installed from: Anaconda

Faiss compilation options:

Running on:

  • [✔] CPU
  • GPU

Interface:

  • C++
  • [✔] Python

Reproduction instructions

import faiss
import numpy as np

# Set parameters
d = 554
M = 1
nbits = 56
metric = faiss.METRIC_INNER_PRODUCT

index0 = faiss.IndexPQ(d, M, nbits, metric)
# The process gets killed here
index1 = faiss.IndexPQ(d, M, nbits, metric)

np.random.seed(0)
nb = 10000
nq = 1
xb = np.random.random((nb, d)).astype('float32')
xq = np.random.random((nq, d)).astype('float32')

index1.train(xb)
index1.add(xb)

k = 5
D, I = index1.search(xq, k)

print("Index:\n", I)
print("Distance:\n", D)

When running the above Python script, for an unknown reason, the script itself gets killed as follows.

Killed

We have also observed that the failure mode depends on the platform.
When the above Python script runs on a server with Ubuntu 24.04 LTS (Intel Core i7-11700 CPU, 64 GB of RAM), it gets killed.
However, the following error occurs when the script runs on a laptop with Ubuntu 22.04.3 LTS in WSL (AMD Ryzen 5 4600H CPU, 16 GB of RAM).

Traceback (most recent call last):
  File "/mnt/d/faiss/bug.py", line 12, in <module>
    index0 = faiss.IndexPQ(d, M, nbits, metric)
  File "/home/xxx/anaconda3/envs/faiss/lib/python3.10/site-packages/faiss/swigfaiss_avx2.py", line 5063, in __init__
    _swigfaiss_avx2.IndexPQ_swiginit(self, _swigfaiss_avx2.new_IndexPQ(*args))
MemoryError: std::bad_alloc
@ramilbakhshyiev
Contributor

@qwevdb I have played around with this, created more failure scenarios, and was able to reproduce this on the same Intel machine, both with bad_alloc and with another failure. I'm posting the script below to help the team get started. It looks like nbits values of 56, 58, 60, 62, and 63 (and potentially others) cause the error messages you saw. In my tests, Killed came back with exit code 137, which indicates out-of-memory and aligns with the tests in the script below. As a workaround, you can use a higher or lower nbits to unblock yourself for the time being. Thanks!

import faiss
import numpy as np
import psutil

# Set parameters
d = 554
M = 1
nbits = 56
metric = faiss.METRIC_INNER_PRODUCT

# print starting memory used
print("Base memory used: %s" % psutil.virtual_memory().used)

# create first index that will eat up about 35GB of memory
index0 = faiss.IndexPQ(554, 1, 56, metric)
print("After first index: %s" % psutil.virtual_memory().used)

# create the second index that will require virtually no additional memory
index1 = faiss.IndexPQ(554, 2, 64, metric)
print("After second index: %s" % psutil.virtual_memory().used)

# and this will cause bad alloc
faiss.IndexPQ(554, 1, 60, metric)

# and this will cause a swap error
faiss.IndexPQ(554, 1, 63, metric)

@mdouze
Contributor

mdouze commented Sep 2, 2024

Maybe we could just refuse to build PQ indices with nbits > 16 bits.
Otherwise the codebook tables will be too large to be practical.

@mnorris11

Hi @qwevdb, after internal discussion, we added #3833, which sets the maximum nbits to 24 for IndexPQ. We noticed your number of subquantizers per vector (M) is 1. You can increase the number of subquantizers and decrease nbits for the same compression.
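To make that trade-off concrete, here is a back-of-envelope sketch (plain arithmetic, no Faiss calls, and an approximation rather than the library's exact accounting): the compressed code is M * nbits bits per vector, while the codebook stores d * 2**nbits float32 values regardless of M, so moving bits from nbits into M keeps compression comparable at a tiny fraction of the memory. (Faiss requires d to be divisible by M; 554 = 2 × 277, so M=2 works here.)

```python
d = 554  # vector dimension from the report

def code_bits(M, nbits):
    return M * nbits                       # compressed size per vector, in bits

def codebook_mib(nbits):
    # M subquantizers, each with 2**nbits centroids of dimension d/M,
    # stored as float32: M * 2**nbits * (d/M) * 4 = d * 2**nbits * 4 bytes.
    return d * (2 ** nbits) * 4 / 2**20

print(code_bits(1, 56), f"{codebook_mib(56):.3e} MiB")   # original config: impractical codebook
print(code_bits(2, 16), f"{codebook_mib(16):.1f} MiB")   # 32-bit codes, ~138 MiB codebook
```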

Actually, anything above nbits = 31 will cause integer overflow for size_t. The nbits = 64 that Ramil tried above didn't increase memory usage because it overflowed twice, back down to 0. The nbits = 56 overflowed but was still large enough to OOM with two of them. (Thanks @mengdilin for the investigation.)
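One plausible reading of the ~35 GB allocation Ramil observed (an assumption about the failure mode, not verified against the Faiss source): if ksub = 2**nbits is computed with a 32-bit shift, x86 masks the shift count to its low 5 bits, so nbits=56 silently behaves like nbits=24 — large enough to allocate tens of gigabytes instead of failing outright.

```python
d = 554

def wrapped_ksub(nbits):
    # x86 masks a 32-bit shift count to its low 5 bits, so if ksub were
    # computed as `1 << nbits` on a 32-bit int, nbits=56 would act like 24.
    return 1 << (nbits & 31)

# Codebook allocation: d * ksub float32 values.
alloc_gib = d * wrapped_ksub(56) * 4 / 2**30
print(f"nbits=56 wraps to ksub=2^{56 & 31}: {alloc_gib:.1f} GiB")  # ~34.6 GiB
```

This lines up with the numbers in the script above: the wrapped size is ~34.6 GiB for nbits=56, and nbits=60 would wrap to 2**28 centroids (~550 GiB), which fails with bad_alloc.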


This issue is stale because it has been open for 7 days with no activity.
