-
Notifications
You must be signed in to change notification settings - Fork 265
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: handle nulls when creating indices with cuda #2910
Conversation
python/python/lance/sampler.py
Outdated
if len(columns) == 1 and filter.lower() == f"{columns[0]} is not null": | ||
table = pc.drop_null(table) | ||
elif filter is not None: | ||
raise Exception(f"Can't yet run filter <{filter}> in-memory") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is maybe a bit frail at the moment.
The problem is that I can't apply the filter with lance because I am using a limit/offset and they won't be pushed down if a filter is present.
Potential solutions:
- Add a scan_range to lance that interprets limit/offset as a pre-filter limit/offset.
- Add filter support to take (and just pass a contiguous range to take)
- Use duckdb or something else to apply the filter here
logging.info("Done sampling: centroids shape: %s", init_centroids.shape) | ||
|
||
ds = TorchDataset( | ||
dataset, | ||
batch_size=20480, | ||
columns=[column], | ||
samples=sample_size, | ||
filter=filt, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Using filters here seems to cause a significant performance hit now (on the order of several minutes vs several seconds for kmeans clustering).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, maybe worth a fast path if we know there are no nulls then.
No description provided.