feat: one-pass IVF_PQ accelerated builds #3001
Conversation
@@ -1448,6 +1448,7 @@ def create_index(
     precomputed_partition_dataset: Optional[str] = None,
     storage_options: Optional[Dict[str, str]] = None,
     filter_nan: bool = True,
     one_pass_ivfpq: bool = False,
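For context, a call exercising the new flag might look like the following sketch. This is against the PR branch, not released lance; the dataset path, column name, and partition/sub-vector counts are illustrative, not taken from this PR.

```python
# Sketch only: one_pass_ivfpq exists on this PR branch, not in released lance.
import lance
import torch

ds = lance.dataset("wiki40m.lance")  # illustrative path
ds.create_index(
    "vector",                        # illustrative column name
    index_type="IVF_PQ",
    num_partitions=1024,             # illustrative values
    num_sub_vectors=96,
    accelerator=torch.device("cpu"),
    one_pass_ivfpq=True,             # the new flag from this PR
)
```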
let's just make one pass the only choice for IVFPQ
I would rather we don't, since it is incompatible with local PQ. I'm not strongly against making it the default, though.
```python
# Handle timing for various parts of accelerated builds
timers = {}
if one_pass_ivfpq and accelerator is not None:
```
nit: let's extract this routine to a separate function?
I'm staring at this right now, and imo this would become an even bigger mess than it already is if I did that. Unfortunately, the way this function is set up right now, it uses a large number of variables to hold state. So if we separated it out, we'd be passing an enormous list of params and returning a giant tuple of 7 things (if we include the timers, which would also have to be combined to be clean). This whole function should probably be refactored at some point, once we finish adding some of the related features.
lgtm, let's add a test before I approve? Just call with accelerator=torch.device("cpu") and check recall?
can you test in-sample recall too? we have a util function for this https://github.com/lancedb/lance/blob/main/python/python/lance/util.py#L174
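The linked lance util computes in-sample recall; as a standalone illustration (this is not the lance helper itself, just the metric it reports), recall between brute-force ground-truth neighbors and index results can be computed like:

```python
import numpy as np

def recall_at_k(gt: np.ndarray, pred: np.ndarray) -> float:
    """Fraction of ground-truth neighbors recovered by the index.

    gt, pred: (n_queries, k) arrays of row ids -- gt from exhaustive
    (flat) search, pred from the IVF_PQ index under test.
    """
    hits = sum(len(np.intersect1d(g, p)) for g, p in zip(gt, pred))
    return hits / gt.size

# Toy check: query 0 recovers 1 of 2 neighbors, query 1 recovers both.
gt = np.array([[0, 1], [2, 3]])
pred = np.array([[1, 5], [2, 3]])
print(recall_at_k(gt, pred))  # 0.75
```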
This feature reduces dependence on disk IO, but it is quite limited: it only works when the index type is IVF_PQ, and it will not work efficiently with local PQ in the future (unless we store all the PQ models in VRAM).
Importantly, this allows us to bypass local temp storage for residuals. However, PQ codes are still staged locally (a consequence of how we've implemented accelerator support), though these are much smaller (the exact ratio depends on params).
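As a rough back-of-the-envelope sketch of why PQ codes are much smaller than residuals (the dimension and PQ settings below are assumptions for illustration, not this benchmark's actual configuration):

```python
# Residuals are full-precision vectors; PQ codes are one byte per sub-vector.
dim = 768                          # assumed embedding dimension
residual_bytes = dim * 4           # float32 residual per vector
num_sub_vectors = 96               # assumed PQ setting
pq_code_bytes = num_sub_vectors    # uint8 code per sub-vector
ratio = residual_bytes / pq_code_bytes
print(ratio)  # 32.0 -- PQ codes are ~32x smaller under these assumptions
```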
I tested on my local machine, which is fast enough that the accelerated builds are IO-limited (though IO is also fast there). I used wikipedia-40M.
New feature disabled:

ivf training time: 52s
ivf transform time: 89s
pq training time: 18s
pq assignment time: 143s
create_index rust time: 8.9s
New feature enabled:

combined training time: 63.7s (not actually sure why this is faster, but it's not the big part anyway)
combined transform time: 158.8s
create_index rust time: 8.6s
Improvement should be more noticeable for bigger datasets, as usual.