numba njit support #771
Not sure I follow. Zarr is a storage format for arrays. One can use those arrays in other pipelines with Numba, Dask, etc.
@jakirkham
Yeah, I wouldn't really expect in-place modifications to work. There's a lot of encoding, decompression, chunk aggregation, etc. logic that needs to happen.
If one loaded this with Dask, could use
If I were to try with that code, would do something like this (should add I haven't run this, so please check as well):

```python
import numba as nb
import zarr
import numpy as np

@nb.njit()
def test(arr):
    for i in range(arr.shape[0]):
        arr[i] = 5.0

zarr_arr = zarr.full((100,), fill_value=np.nan, dtype='float64')
arr = zarr_arr[0:100]    # read the slice into an in-memory NumPy array
test(arr)                # mutate the NumPy array with the compiled function
zarr_arr[0:100] = arr    # write the result back to the Zarr array
```
To pick up this thread a bit – I think numba support would be really great. I think the biggest use case here is working with very large amounts of data as quickly as possible. Ideally I could iterate through the chunks of my zarr array and compute something from them without having to swap back and forth between numba and python. That swapping is a bit of a pain, and carries a fair bit of overhead. I can see how this would be difficult; I'm not sure how much of the stack (e.g. possibly all of numcodecs?) would have to be compiled to make this work.
This is generally what everyone wants to do. But I'm not sure that this proposed integration is needed for it. Zarr and Numba solve orthogonal problems. Zarr accelerates data I/O, which can speed up I/O-bound problems: it helps get data from files or object storage into memory quickly. Numba accelerates computation, which can speed up compute-bound problems: it operates on in-memory data. If you want to process a lot of data quickly using Zarr and Numba, use Zarr to read each chunk into an in-memory NumPy array, then pass that array to a Numba-compiled function.
Can you explain why this workflow does not meet your needs? A more sophisticated use case would involve using Dask to coordinate and schedule many simultaneous reading / processing tasks.
+1 to everything Ryan said.
I think an interesting question for users asking about this would be: are you using, or have you tried using, Zarr + Dask + Numba? If so, what pain points have you experienced when doing that? What do your workflows look like? If we find enough common use cases of such a workflow, we might be able to dive deeper into how these could be improved.
I wanted to write up a longer response to this with an example of indexing into an on-disk sparse matrix, but that requires me digging up some old branches. I think I've got a nice small example, though. I would like to be able to efficiently search a sorted set of genomic intervals. Basically I would need to do a pair of binary searches over the start and end columns, and I would like to have just one implementation of the search. I may also want to find all sets of overlaps by iterating through a pair of interval sets. This code gets quite messy if there has to be a function barrier between the numba code and the code that retrieves the chunks from files. In both of these cases I would be dynamically choosing which chunks are read when, so these are not good fits for dask.
Hello,
are there any plans to support numba njit?
It could be very useful.