Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Initial Roadmap #4

Open
quinnj opened this issue May 24, 2019 · 5 comments
Open

Initial Roadmap #4

quinnj opened this issue May 24, 2019 · 5 comments

Comments

@quinnj
Copy link
Member

quinnj commented May 24, 2019

I'm opening this issue as a place to discuss the path forward for this package and for people to give their feedback. Here's where things are at in my mind:

  • I've opened PRs for Tables.jl, DataValues.jl, CategoricalArrays.jl, StatsBase.jl, and DataFrames.jl showing how packages would take a dependency on DataAPI.jl (once registered obviously) and extend functions from DataAPI.jl instead of defining themselves or using Requires.jl
  • I haven't made any changes to packages using the refX DataAPI.jl functions; while I understand the basics, I'm not as familiar w/ the implementations, but I'm willing to take a stab at it if people would like. @nalimilan and @piever are much more aware of how packages like PooledArrays, CategoricalArrays, and StructArrays can take advantage of sharing the common ref functions. I think it was also suggested at some point that we may want a RefArrays.jl package that was home to various sorting/grouping optimization routines that DataFrames/StructArrays could then share. I'm happy to push forward on making those changes, but I'll need to have some discussions w/ @nalimilan and @piever for guidance.
  • I've made @davidanthoff and @piever admins in this repo, in addition to the regular JuliaData maintainers who have access by default (@nalimilan, @bkamins, @andyferris, @andreasnoack, @ararslan , etc.); I think it's important that at least the initial set of package maintainers who depend on DataAPI.jl have an equal say in how the package works/is maintained and have some control over things.

In terms of steps forward, here's what I think:

  • @nalimilan, @piever, and myself (and anyone else who'd like) work on making sure the ref* function API is solid and draft PRs showing how packages could share these functions
  • Confirm w/ @davidanthoff that the DataValues.jl PR looks good (it's very minimal code, so hopefully should be uncontroversial); I know he's been swamped lately, but hopefully we can find a few minutes to sync on this
  • Register DataAPI.jl officially; I'm planning on the initial release being 1.0.0
  • Merge PRs to respective packages taking on the dependency

Please ping anyone else who might be interested or have something useful to add to the discussion here. I don't think there's a super rush on any of this, but I know DataFrames is approaching a 1.0 release in the next few months and it would be good to cleanup its dependencies soon.

@piever
Copy link
Collaborator

piever commented May 26, 2019

I've made WIP PRs for PooledArrays and WeakRefStrings, which are enough to remove both as StructrArrays dependency in exchange for DataAPI. I was wondering: shouldn't placeholders for the Tables interface also be here? Say istable, rows, columns, rowacess, columnaccess, schema and Schema?

@davidanthoff
Copy link

Alright, I finally had some time to look at this.

I really like defaultarray. In fact I wish this was in Base.

I don't have an opinion about the ref* stuff, I'm not familiar enough with it.

I'm not a fan of nondatavaluetype and datavaluetype. I don't understand what the generic API here is, it just looks like a hacky optional dependency story to me. In particular, how would datavaluetype ever be useful unless DataValues.jl is loaded? More on this below.

I wonder whether one can avoid unwrap as an interface. Has there been progress to move Missings.T or an equivalent into base? If that existed, one could define Missings.T(::DataValue{T}) where T = T in DataValues.jl, and then anyone who wants something like unwrap could just define unwrap(x) = convert(Union{Missings.T(x),Missing}, x), without a need for a shared unwrap function. That seems like a more generic solution to me.

Also no opinion about describe.

I really don't like that all of these things that seem pretty unrelated to me are just put into one generic package. For example, the defaultarray stuff in my mind has nothing to do with data, I think that should go into a package like ArrayInterfaceExtensions.jl, and live in the juliaarrays org.

My understanding of the two datavalue function is that the only reason for that is to work around optional dependencies in Tables.jl? I think generally in a situation where package A opted for a design where it take an optional dependency on package B, that it is not a good solution to then ask package B to take on a dependency on some third package C to solve this (unless there is a clear, useful abstract interface that one can factor out into C). That approach clearly doesn't scale, and it seems to me less than ideal to have design decisions in package A impact the dependency story for package B.

So I think for the DataValues.jl situation, maybe we should investigate the following two options next instead:
a) do we understand why Requires.jl is slow? I'm actually a bit puzzled by that. I would have thought that until one actually loads one of the optional dependencies, it wouldn't do much, and should be a pretty harmless dependency? Is there maybe some inefficiency in Requires?
b) @nalimilan mentioned something somewhere that sounded like optional dependencies might come in julia 1.3 for good? If there is a realistic chance of that, then I think we should not at this point start to introduce redesigns of package dependencies, but just wait for that. If the performance problems of the Tables.jl/Requires.jl situation are really severe and need to be fixed, then maybe Tables.jl could just take a normal dependency on DataValues.jl until proper optional dependencies arrive?

@quinnj
Copy link
Member Author

quinnj commented Jun 14, 2019

it just looks like a hacky optional dependency story to me. In particular, how would datavaluetype ever be useful unless DataValues.jl is loaded? More on this below.

Yes and no. This is generally agreed upon as the best (i.e. simplest) way to handle optional dependencies at the moment, based on the extensive discussion in a discourse thread. The point of datavaluetype is that it's useful indeed WHEN users also load DataValues, and it's invisible otherwise. I.e. I want to write generic code that works on missing, but if my code happens to encounter DataValues, then I can write generic code to 1) detect (via datavaluetype, 2) compute the corresponding Missing type (nondatavaluetype), and 3) actually convert the DataValue to the Missing equivalent (unwrap). Otherwise, we leave the burden on users to take a dependency on DataValues.jl and explicitly handle things in two code paths, which isn't ideal since the operation is well-defined and could be done once for everyone to benefit. This is about convenience to users who want to convert between DataValues and Missing representations and also for developers who don't have to bifurcate their code all over.

I wonder whether one can avoid unwrap as an interface. Has there been progress to move Missings.T or an equivalent into base?

Yes, typesubtract exists in Base, but I'm not totally clear whether it would be overloadable in the way you describe in DataValues. @nalimilan, do you know whether DataValues could overload typesubtract like that? If so, that would indeed be simpler.

really don't like that all of these things that seem pretty unrelated to me are just put into one generic package

They're not unrelated. These are important functions to a whole set of "data"-related packages. As stated above, this is generally agreed upon (including core devs) as a valid, currently good way to handle optional dependencies. I've also laid out clear guidelines around how the package will be managed, including strict constraints on what can be included to ensure fast-as-possible load times and a lightweight dependency. I'm personally not a fan of creating multiple "interface" packages just for the sake of separating things that can just as well live in a single interface package; I also strongly believe in keeping things together until there's a clear need/request to separate, which we can always do later on. It's just a lot of unnecessary administrative churn for the registry, package maintenance, etc. to have multiple packages and then find out we don't need one.

package B to take on a dependency on some third package C to solve this

Isn't this the approach you've taken w/ IterableTables.jl though? Implemented the interfaces for a bunch of packages, and then work on moving those implementations upstream? I mean, I agree that generally it's not desirable, but it's also the current lack-of-optional-dependency world we live in. Interfaces are being created and people want implementations and we don't always get them defined in the upstream parent package right away, and probably for the best so that implementations can mature and be proven out. But ideally, once things settle, things can be moved upstream.

do we understand why Requires.jl is slow?

Requires.jl is indeed slow and subtly so. I spent a full two days digging into its internals and basically started the discourse thread above based on my findings. Every time a package is loaded, it does checks around any registered callbacks; the checks are compounded when multiple packages get involved, each registering callbacks, and an expensive invokelatest call is required to run the callback. Compared to the time to load the package, particularly small packages, using Requires.jl can be 2x-3x slower. I think Requires.jl could be useful for very large packages with lots of dependencies, that are more "top of the stack" kind of scenarios, but for lower in the stack packages, Requires.jl is a performance killer. Unfortunately, this whole flow is as efficient as possible given the current "callback" workflow provided by Base loading code.

I'm not aware of any current efforts to provide real optional dependency management in 1.3 or even in the foreseeable future.

Taking a dependency on DataValues.jl in Tables.jl also isn't an option; currently DataValues.jl takes ~0.2 seconds to load, and Tables.jl w/o Requires.jl takes ~0.1 seconds; Tables.jl w/ Requires.jl (current master) takes ~0.3 seconds to load, so it wouldn't help at all. Having Requires.jl is also problematic for Tables.jl because it is one of these "lower stack" type packages; lots of other packages depend on it and the Requires.jl performance hit is compounded.

@nalimilan
Copy link
Member

I wonder whether one can avoid unwrap as an interface. Has there been progress to move Missings.T or an equivalent into base?

Yes, typesubtract exists in Base, but I'm not totally clear whether it would be overloadable in the way you describe in DataValues. @nalimilan, do you know whether DataValues could overload typesubtract like that? If so, that would indeed be simpler.

Base.nonmissingtype exists and there's even a PR to export it (JuliaLang/julia#31562). I guess DataValues could overload it (that wouldn't be type piracy). But we'd have to ensure it cannot have unexpected consequences, since we usually assume that nonmissingtype(T) <: T. Maybe not a big issue.

Regarding defaultarray, I've thought another another approach which sounds even nicer and doesn't require defining a function in DataAPI: use similar(AbstractArray{T}, ...) and/or similar(AbstractArray, T, ...). Base already supports things like similar(Vector{Int}, (3,)), so allowing AbstractArray sounds like a logic generalization to request a mutable AbstractArray{T} object of the most appropriate type. If you agree I can file an issue in Julia to discuss that with core devs, and then we could add fallbacks to Compat (or to DataAPI with the blessing or core devs).

@quinnj
Copy link
Member Author

quinnj commented Jun 14, 2019

But even if DataValues overloads Base.nonmissingtype, we still don't have the inverse: datavaluetype, where for a given Union{T, Missing}, we want DataValue{T}.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants