Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Experiment with interfacing with Dask #8227

Open
bsipocz opened this issue Dec 5, 2018 · 12 comments
Open

Experiment with interfacing with Dask #8227

bsipocz opened this issue Dec 5, 2018 · 12 comments

Comments

@bsipocz
Copy link
Member

bsipocz commented Dec 5, 2018

Look into the different submodules whether there are blocking issues for adding support for Dask.

cc @Cadair @astrofrog @eteq

@Cadair
Copy link
Member

Cadair commented Dec 5, 2018

As a first pass this is probably really simple things like removing lines like https://github.com/astropy/astropy/blob/master/astropy/nddata/nddata.py#L235 which cause a dask array to load values into memory just for the repr.

@astrofrog
Copy link
Member

Just some notes on some initial hacking: one thing we were interested in was whether it's possible to construct a quantity-like object based on dask. Doing this is pretty simple if we construct a quantity object that is just a container (not subclassing anything) for data-like objects:

https://github.com/astrofrog/dasktropy/blob/master/quantities/dask_quantity.py

Here's an example of usage:

https://github.com/astrofrog/dasktropy/blob/master/quantities/demo_dask_quantity.ipynb

Unfortunately these kinds of objects can't be passed to e.g. SkyCoord because this relies on classes like Angle, Longitude, and so on and those are still ndarray subclasses (being vanilla Quantity subclasses). If we wanted to be able to seamlessly use dask in coordinates, we'd likely need to move away from having Quantity be an ndarray subclass and instead just be a container of an array-like object (whether numpy, dask, etc.) and a unit.

So one question is whether there is any possibility in future of refactoring Quantity such that it doesn't actually inherit from ndarray, but instead wraps it? If we actually re-implemented most public attributes and methods on ndarray, could we minimize any breakage? Note that this then opens up the option of having e.g. masked arrays or other types of array inside Quantity.

@mhvk - I'd be interested whether you have any thoughts on this!

@pllim
Copy link
Member

pllim commented Dec 6, 2018

⭐️ 🦄 🌈 ❗️ 👏 😍 👍

Are you able to measure performance improvement at this point or is it too early to do so?

@astrofrog
Copy link
Member

Well at the moment it's not super useful as not usable in astropy functions, so no point in worrying about performance yet :)

@mhvk
Copy link
Contributor

mhvk commented Dec 6, 2018

@astrofrog - I have not directly thought about making Quantity a container class - but it is become a possibility since with __array_ufunc__ and the future __array_function__ (experimental in numpy 1.16, hopefully available in 1.17), one can now override numpy functions more generally.

Another option that may be worth investigating is whether it is possible make Quantity an "auto-mixin" class (like the new Distribution), i.e., Quantity.__new__ would create a new class ArrayLikeQuantity(Quantity, ArrayLike) which inherits also from whatever the class of value is. This may be substantially cleaner since it would pass any isinstance(q, DaskArray) that might happen inside dask. And obviously, it would mean that the base mix with ndarray would not have to change. The only constraint would be that ArrayLike is sufficiently like ndarray that one can assume that it has a shape, etc.

@keflavich
Copy link
Contributor

I've done some alternative experimentation in spectral-cube (radio-astro-tools/spectral-cube#567), but it's unclear that any of my approaches "work". When I've tested them on other machines (i.e., not my laptop), I've gotten peculiar and unreproducible errors.

I'd like to try to understand what it would take to have spectral-cube built on a dask array object from the bottom up. A lot of spectral-cube's machinery would work better in a naturally delayed framework.

My main question, which I think is appropriate for this thread but not directly related (maybe it's more in line with sunpy's work), is whether we can get io.fits to read directly into a (delayed read) dask array. If we can do that, a lot of spectral-cube machinery can be refactored to use those tools. The quantity and ndarray stuff is all very secondary for that purpose.

@pllim
Copy link
Member

pllim commented Aug 5, 2019

whether we can get...

Just because you can

@keflavich
Copy link
Contributor

Life... finds a way. When frog DNA are involved. We have an astrofrog on the team. See? All good!

@wtbarnes
Copy link

I'm not sure what the progress is on Dask integration so rather than create a new issue, I'll just leave this here. When creating a Quantity from a Dask array, the resulting Quantity object does not preserve the delayed nature of the Dask array and instead executes it eagerly, e.g.

>>> import numpy as np;import dask.array;import astropy.units as u
>>> foo = np.random.rand(1)
>>> foo_dask = dask.array.from_array(foo)
>>> foo_dask
dask.array<array, shape=(1,), dtype=float64, chunksize=(1,), chunktype=numpy.ndarray>
>>> u.Quantity(foo_dask)
<Quantity [0.06568409]>

However, creating a Quantity by multiplying the Dask array by a Unit does preserved the "delayed-ness" by simply adding the "multiply-by-unit" operation to the DAG,

>>> foobar_dask = foo_dask * u.dimensionless_unscaled
>>> foobar_dask
dask.array<mul, shape=(1,), dtype=float64, chunksize=(1,), chunktype=astropy.Quantity> 

However, this isn't really an ideal solution as the resulting object is a Dask array (that happens to produce a Quantity once it is computed) rather than a Quantity backed by a Dask array. For example, foobar_dask.unit raises AttributeError: 'Array' object has no attribute 'unit' because it doesn't know it's a Quantity until it's computed.

@mhvk
Copy link
Contributor

mhvk commented Oct 26, 2021

@wtbarnes - thanks for the update. Unfortunately, nothing really has changed on this end - as will have been clear from the thread, this is not trivial. What is changing is that numpy is moving forward with providing dtype that can hold units. With that, things should be much easier.

@nstarman
Copy link
Member

@mhvk @eteq. I'm just seeing this issue for the first time (mentioned in astropy/astropy-project#283). Proposal astropy/astropy-project#269 will outline similar ideas. Nice to see convergent evolution! I believe Proposal astropy/astropy-project#269 solves (all?) of these issues, based on work done in #12921.

@mhvk
Copy link
Contributor

mhvk commented Jul 11, 2022

Gheez, I see indeed that my blue-sky #8227 (comment) is now becoming possible...

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

8 participants