Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

test_asarray_copy[dask.array] fails on CI with dask==2024.12 #209

Closed
ev-br opened this issue Dec 5, 2024 · 9 comments · Fixed by #211
Closed

test_asarray_copy[dask.array] fails on CI with dask==2024.12 #209

ev-br opened this issue Dec 5, 2024 · 9 comments · Fixed by #211

Comments

@ev-br
Copy link
Member

ev-br commented Dec 5, 2024

Observed in #208 :

FAILED tests/test_common.py::test_asarray_copy[dask.array] - assert False
 +  where False = <function test_asarray_copy.<locals>.<lambda> at 0x7ff78986c220>(dask.array<getitem, shape=(), dtype=float32, chunksize=(), chunktype=numpy.ndarray> == 0.0)

Note that it

@ev-br
Copy link
Member Author

ev-br commented Dec 5, 2024

The failing test is

       # Use the standard library array to test the buffer protocol

       # ... snip ...
    
        a = array.array('f', [1.0])
        b = asarray(a, copy=None)
        assert is_lib_func(b)
        a[0] = 0.0
        if library == 'cupy':
            # A copy is required for libraries where the default device is not CPU
            assert all(b[0] == 1.0)
        else:
           assert all(b[0] == 0.0)      # <<<<< here

A prime suspect is dask/dask#11524.

@ev-br
Copy link
Member Author

ev-br commented Dec 5, 2024

This is indeed a breaking change in dask: 2024.12: dask.from_asarray does

3588 -> if is_arraylike(x) and hasattr(x, "copy"):
3589 x = x.copy()

Here x is a numpy array which we constructed from an array.array instance at https://github.com/data-apis/array-api-compat/blame/main/array_api_compat/dask/array/_aliases.py#L142

So the meaning of array_api_compat.dask.array.asarray(..., copy=None) changes: there is no way it could reuse the existing buffer anymore, and copy=None is equivalent to copy=True.

Does this sound right @asmeurer ?

@ev-br
Copy link
Member Author

ev-br commented Dec 5, 2024

Also ping @lithomas1 : would be able to weigh in on the dask.array changes?

@lithomas1
Copy link
Contributor

lithomas1 commented Dec 5, 2024

Yeah, part of the reason we wrap asarray is because dask's asarray doesn't have a copy keyword.
(I do a manual copy in there for copy=True)

I think your analysis is right that dask is always doing a copy now.

@dcherian
Copy link

dcherian commented Dec 5, 2024

cc @phofl for awareness

@ev-br
Copy link
Member Author

ev-br commented Dec 5, 2024

So I suppose the main question for this project is whether there's a supported way going forward to convert a numpy array (or anything that supports the buffer protocol) into a dask.array without copying, so that the data is shared and remains writeable.

@crusaderky
Copy link
Contributor

This is indeed a breaking change in dask

I disagree.
The definition of copy=None is

If None, the function must reuse existing memory buffer if possible and copy otherwise.

"if possible" should be something left at the library developer's discretion, so a change of behaviour there should not be treated as a breaking change. I believe instead that the test is over-eager in expecting a specific outcome.

As @phofl and @fjetter correctly observe on dask/dask#11524, it is extremely unhealthy to store large arrays inside the dask graph. asarray and from_array are meant to be used strictly for small data - e.g. kilobytes to a few megabytes - whereas to load gigabytes one must always use dask's compute-time data loading methods (from_zarr etc.). So it makes perfect sense to disallow sharing the buffer between these small data arrays and the dask graph, as it gives no material performance benefit and is a major source of trouble, e.g. when the input array is a smaller view of something worth gigabytes or when the user accidentally (or deliberately !!!!!) modifies the input data after using it to build the graph.

@lithomas1
Copy link
Contributor

lithomas1 commented Dec 6, 2024

Hm, not sure I completely understand here why this can't be possible.

In the linked issue, there is this comment:

There are other reasons that are more complex to explain (TLDR if we use a specialized API we can make certain assumption we can't make for the generic from_array)

This suggests that dask has an internal API somewhere that can create a dask Array from a numpy array or something else in a zero-copy many. Maybe we can ask for that to be exposed instead?

This'd be pretty useful in the case where I know I won't use the original array I create the dask array from anymore after passing it into asarray.

@crusaderky
Copy link
Contributor

It's not impossible, it's just very, very unhealthy to let something at the root of your dask graph point to an array that could be updated by the user later on.

It's an antipattern to use from_array/asarray to embed large arrays in the graph to begin with, as it carries all sorts of performance issues and will very likely kill off or at least hamstring the scheduler and/or the worker processes. So the wastefulness of always deep-copying an array that must be small (<10 MB) becomes inconsequential.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
4 participants