
feat: enable composable and customizable sampler in PyTorch data loader #1900

Merged
merged 19 commits into main Feb 2, 2024

Conversation

eddyxu
Contributor

@eddyxu eddyxu commented Feb 2, 2024

  • Provide a set of composable Samplers that work with Lance datasets
  • The new ruff release made a number of formatting changes

@eddyxu eddyxu changed the title feat: enable customized sampler in PyTorch feat: enable composable and customizable sampler in PyTorch data loader Feb 2, 2024
@eddyxu eddyxu requested review from westonpace, wjones127 and chebbyChefNEQ and removed request for westonpace and wjones127 February 2, 2024 01:00
@eddyxu eddyxu force-pushed the lei/pytorch_example branch from 89faedd to 48eb133 Compare February 2, 2024 01:04
@@ -184,3 +186,130 @@ def reservoir_sampling(stream: Iterable[T], k: int) -> list[T]:
samples = [i.item for i in heap]
del heap
return samples


class Sampler(ABC):
Contributor

Right now the implementations just scan in order; they don't randomize (which almost makes them not really meet the definition of "sampling"). Users could do shuffling / reservoir sampling on the batches, but it would be much more efficient to do it on fragment_ids and batch indices. Do you have any plans to integrate that with this API?
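To illustrate the suggestion above: shuffling at the fragment-id level keeps each fragment's reads sequential while still randomizing the visit order. A minimal sketch, where `shuffled_fragment_order` is a hypothetical helper and not part of this PR:

```python
import numpy as np


def shuffled_fragment_order(num_fragments: int, seed=None) -> list:
    # Permute fragment ids up front: I/O within each fragment stays
    # sequential, but fragments are visited in random order.
    rng = np.random.default_rng(seed)
    return rng.permutation(num_fragments).tolist()
```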

Contributor Author

@eddyxu eddyxu Feb 2, 2024

Reservoir shuffling is I/O friendly with sequential reads (which is NFS friendly), while yielding random batches with a uniform distribution. The time complexity is O(k + log(n/k)), where n is the number of batches and k is small, with an O(k) memory footprint, and it amortizes the file path lookup and metadata overhead across the scan. Within Lance itself, read_batch is more performant than take. In many cases, reservoir shuffling can provide pretty decent performance. We need more performance numbers for sure.

I might need to get another PR out to add np.random.select(fragments) and reservoir_shuffle(batches), though. This one establishes the APIs.

Contributor Author

@eddyxu eddyxu Feb 2, 2024

That being said, reservoir sampling needs to be changed to:

def reservoir_sampling(
    stream: Iterable[T], k: int, rank: int, world_size: int
) -> Iterable[T]:
    rng = np.random.default_rng()
    heap = []
    for idx, item in enumerate(stream):
        entry = PrioritizedItem(rng.integers(0, k * 2), item)
        if len(heap) < k:
            heappush(heap, entry)
        else:
            vic = heappushpop(heap, entry)
            if idx % world_size == rank:  # <<<<< CHANGE TO YIELD HERE
                yield vic
            del vic
        if idx % 10240 == 0:
            logging.info("Force Python GC")
            gc.collect()
    # Note: inside a generator, this `return` value is delivered via
    # StopIteration.value rather than as an ordinary return value.
    samples = [i.item for i in heap]
    del heap
    return samples

Run this with n=1M, k=8, world_size=1
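One subtlety of the suggested change: once the function yields, it becomes a generator, so the final `return samples` is delivered via `StopIteration.value`, not as an ordinary return. A toy sketch of consuming such a generator (`drain` and `toy_sampler` are illustrative helpers, not from this PR):

```python
def toy_sampler(n: int, k: int):
    # Mimics the yield-plus-return shape of the modified reservoir sampler:
    # fill a buffer of size k, yield everything evicted, return the buffer.
    buf = []
    for i in range(n):
        if len(buf) < k:
            buf.append(i)
        else:
            yield i  # evicted item, streamed to the caller
    return buf  # final reservoir, carried on StopIteration.value


def drain(gen):
    # Collect all yielded items and capture the generator's return value.
    yielded = []
    try:
        while True:
            yielded.append(next(gen))
    except StopIteration as stop:
        return yielded, stop.value
```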

Each rank / process will process a subset of the batches.
"""

def __init__(self, rank: int, world_size: int):
Contributor

nit: add a from_torch method

Contributor Author

the one that reads from torch distributed?

Contributor

yeah

Each rank / process will process a subset of the fragments.
"""

def __init__(self, rank: int, world_size: int):
Contributor

ditto

@eddyxu eddyxu merged commit 5407db8 into main Feb 2, 2024
9 checks passed
@eddyxu eddyxu deleted the lei/pytorch_example branch February 2, 2024 02:56
westonpace added a commit that referenced this pull request Feb 9, 2024
When support was added for sampling in #1900, it broke support for
filtering on full scans (combining sampling and filtering is not yet
supported). This PR repairs support for filtering on full scans.

Closes #1932
3 participants