Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Multiple arguments / activities? #311

Closed
gdalle opened this issue Jun 6, 2024 · 36 comments
Closed

Multiple arguments / activities? #311

gdalle opened this issue Jun 6, 2024 · 36 comments
Labels
backend Related to one or more autodiff backends core Related to the core utilities of the package

Comments

@gdalle
Copy link
Member

gdalle commented Jun 6, 2024

At the moment the only viable way I see is to support one active and one inactive argument, and leave it to ComponentArrays to combine several arguments if necessary. It could be done by passing a tuple (x, c) instead of the input x. @wsmoses do you think that might get us 90% of the way to where we need to be?

See the following discussions:

@wsmoses
Copy link

wsmoses commented Jun 6, 2024

That would be an improvement but we need to make sure that the decomposition of the arguments doesn't result in a materialized intermediate variable

@gdalle
Copy link
Member Author

gdalle commented Jun 6, 2024

Can you give an example and explain why that's a bad thing?

@wsmoses
Copy link

wsmoses commented Jun 6, 2024 via email

@gdalle
Copy link
Member Author

gdalle commented Jun 6, 2024

Yeah I added that link above, I guess I'm just wondering what you mean by intermediate variable in this setting. I don't see why grouping multiple arguments into one or unpacking them would require a closure

@jbrea
Copy link

jbrea commented Jun 7, 2024

and leave it to ComponentArrays to combine several arguments if necessary

Not sure, this would work for my use case. Typically, I have a function logp(data, model, parameters). data is constant and usually a vector of named tuples of whatever. model is usually a structure that contains configuration and state variables. The state variables need to be shadowed, as they change, but I don't care about their derivatives. The only thing I care about are the partial derivatives of logp with respect to parameters. Maybe it would make sense to follow Lux style and separate the model into two parts: constant configuration and non-contant state, which would result in logp(data, config, state, parameters). Would it really make sense to put data, model or data, config, state into a ComponentArray? How much would it limit the way data, config and state can be written?

To be clear: I don't care, if DifferentiationInterface covers my use case or not. I'm fine with using Enzyme directly. I was just exploring this direction and thought I report back my experience 🙂 .

@gdalle
Copy link
Member Author

gdalle commented Jun 7, 2024

Thanks for your feedback! I imagine the following people may also have opinions:

@gdalle
Copy link
Member Author

gdalle commented Jun 7, 2024

As for your specific use case, we might have to make a distinction between "active / inactive" and "shadowed (mutated) / not shadowed"?

@willtebbutt
Copy link
Member

No strongs views on my end -- activity analysis isn't presently really a thing in Tapir.jl, so happy to take the lead from others.

@gdalle
Copy link
Member Author

gdalle commented Jun 7, 2024

Also note that the reason I'm mentioning ComponentArrays is because the original plan was never for DI to support an arbitrary number of arguments: that's what we agreed AbstractDifferentation would try to achieve based on DI (ping @mohamed82008). Hence I feel like disguising multiple arguments into one (x) or two (x and c for constant) is the best we can hope for in the short term.

@gdalle
Copy link
Member Author

gdalle commented Jun 7, 2024

@Vaibhavdixit02 @ChrisRackauckas what kind of activity granularity would we need for Optimization.jl?

@ChrisRackauckas
Copy link
Member

I think Optimization.jl is the one that should be reasonable to hit with the current interfaces. We can start there, and SciMLSensitivity.jl would need the full generality 😅

@wsmoses
Copy link

wsmoses commented Jun 7, 2024

Forcing multiple arguments into a single struct — even two structs (one active one constant) will still result in the performance issues I outlined in the log density link.

@gdalle
Copy link
Member Author

gdalle commented Jun 7, 2024

Which is why I'm suggesting to force all the active arguments together into x, and all the inactive arguments together into c

@wsmoses
Copy link

wsmoses commented Jun 7, 2024

Yes but I’m saying that will still have performance oenalties — if say you were condensing two active args into one

@wsmoses
Copy link

wsmoses commented Jun 7, 2024

Copying my write up from before here — most of the arguments apply equally to non differentiated code. Thus the “wrap multiple args together forcing indirection” be it through a closure struct or whatever else will cause perf difficulties of the undifferentiated and differentiated code and possibly prevent differentiation.

Essentially replace the use of the word closures below with struct or whatever else (though box will only come up in a closure iirc).

Closures can present several issues both for Enzyme's ability to AD something, the performance of differentiated programs, and even the performance of programs not being differniated.

I'll write up a few concerns below, in no particular order:

  • Type stability. Since you mention this already I won't dive more into.

  • Boxing. If a variable is modified in a closure, it may end up being boxed by julia. In essence Julia will force it to be captured by reference. Even if the function is otherwise type stable, a box is essentially a type unstable any type, even it otherwise would be type stable. Furthermore an unnecessary box will add a level of indirection which hurts performance (see below)

julia> function h(x,y)
           function g(z)
               x[y] *= 2; y+=1; return z*x[y]
           end
           return g
       end
h (generic function with 1 method)

julia> h([1,2,3],1)
(::var"#g#3"{Vector{Int64}}) (generic function with 1 method)

julia> fieldtypes(typeof(h([1,2,3],1)))
(Vector{Int64}, Core.Box)

julia> fn = h([1,2,3],1)
(::var"#g#3"{Vector{Int64}}) (generic function with 1 method)

julia> fn(1)
2

julia> fn(1)
3
  • Indirection / Aliasing: Indirection can cause overhead. For example consider an array of ints vs an array of Ref{Int} (notably both being type stable). The first is much faster than the other.
julia> function sumvals(x)
           res = 0
           @simd for i in 1:length(x)
             res += @inbounds x[i]
           end
           return res
       end^C

julia> function sumvals_ref(x)
           res = 0
           @simd for i in 1:length(x)
             res += @inbounds x[i][]
           end
           return res
       end
sumvals_ref (generic function with 1 method)

julia> array = collect(1:10000); 

julia> rarray = collect(map(Ref, 1:10000));

julia> @btime sumvals(array)
  595.877 ns (1 allocation: 16 bytes)
50005000

julia> @btime sumvals_ref(rarray)
  3.534 μs (1 allocation: 16 bytes)
50005000

There are many reasons for this. First of all there's one less load instruction per loop in the top (the first loop needs to load the data from the array, the second loads a ref from the array, then the float from the ref pointer). Secondly, this data is layed out consecutively. This allows vector instructions to apply on the first, whereas it is impossible in the second (vector loads/stores require consecutive data). Secondly you get much better cache performance. In the second case you can get a cache miss per iteration. In the first case you get a cache hit only once per cache size. (since a single cache hit loads a consecutive cache size number of elements all at once).

Beyond the extra information, closures can obscure aliasing information. Aliasing is the ability to say that two pointers point to different memory. This is critical for optimization. Consider the following.

@noinline function inner(y)
   @show y
end

function twouse(x, y)
  x1 =  x * 2
  inner(y)
  x2 = x * 3
  return x1 + x2
end

julia> @code_llvm twouse(2, 3)
;  @ REPL[30]:1 within `twouse`
define i64 @julia_twouse_778(i64 signext %0, i64 signext %1) #0 {
top:
;  @ REPL[30]:3 within `twouse`
  %2 = call i64 @j_inner_780(i64 signext %1)
;  @ REPL[30]:5 within `twouse`
; ┌ @ int.jl:87 within `+`
   %3 = mul i64 %0, 5
; └
  ret i64 %3
}

Here the two uses of x get optimized into one, fantastic. However if now we have it go through a pointer (via a Ref)

@noinline function inner_ref(y)
   @show y[]
end

function twouse_ref(x, y)
  x1 =  x[] * 2
  inner(y)
  x2 = x[] * 3
  return x1 + x2
end

julia> @code_llvm twouse_ref(Ref(2), Ref(3))
;  @ REPL[33]:1 within `twouse_ref`
define i64 @julia_twouse_ref_790({}* noundef nonnull align 8 dereferenceable(8) %0, {}* noundef nonnull align 8 dereferenceable(8) %1) #0 {
top:
;  @ REPL[33]:2 within `twouse_ref`
; ┌ @ refvalue.jl:59 within `getindex`
; │┌ @ Base.jl:37 within `getproperty`
    %2 = bitcast {}* %0 to i64*
    %3 = load i64, i64* %2, align 8
; └└
; ┌ @ int.jl:88 within `*`
   %4 = shl i64 %3, 1
; └
;  @ REPL[33]:3 within `twouse_ref`
  %5 = call nonnull {}* @j_inner_792({}* nonnull %1)
;  @ REPL[33]:4 within `twouse_ref`
; ┌ @ refvalue.jl:59 within `getindex`
; │┌ @ Base.jl:37 within `getproperty`
    %6 = load i64, i64* %2, align 8
; └└
; ┌ @ int.jl:88 within `*`
   %7 = mul i64 %6, 3
; └
;  @ REPL[33]:5 within `twouse_ref`
; ┌ @ int.jl:87 within `+`
   %8 = add i64 %7, %4
; └
  ret i64 %8
}

You'll note that now the two loads of x[] are there explicitly and not combined into one. This is because the call to inner could possibly write to the same memory of x, making it possible that it changed the value of x inside the function (even though here we know it doesn't). Getting rid of the inner function call indeed does let a single load replace both uses.

julia> function twouse_ref2(x, y)
         x1 =  x[] * 2
         x2 = x[] * 3
         return x1 + x2
       end
twouse_ref2 (generic function with 1 method)

julia> @code_llvm twouse_ref2(Ref(2), Ref(3))
;  @ REPL[35]:1 within `twouse_ref2`
define i64 @julia_twouse_ref2_793({}* noundef nonnull align 8 dereferenceable(8) %0, {}* noundef nonnull align 8 dereferenceable(8) %1) #0 {
top:
;  @ REPL[35]:2 within `twouse_ref2`
; ┌ @ refvalue.jl:59 within `getindex`
; │┌ @ Base.jl:37 within `getproperty`
    %2 = bitcast {}* %0 to i64*
    %3 = load i64, i64* %2, align 8
; └└
;  @ REPL[35]:4 within `twouse_ref2`
; ┌ @ int.jl:87 within `+`
   %4 = mul i64 %3, 5
; └
  ret i64 %4
}

Introducing a closure means that there is significant worse aliasing information since each of the captured variables is behind a pointer, and any aliasing information (e.g. variables x and y don't overlap) is now lost.

This makes the regular code slow, and can also make the genereated AD code much slower. As the various enzyme papers have shown running optimization before AD makes a compound impact on the generated code. In this case worse aliasing information may mean that Enzyme has to cache a lot more data, dramatically slowing things down and possibly preventing previously differntiable code from being able to be run at all (if available memory is insufficient).

  • Mixed activity. Just like how closures can introduce type instability, they can introduce activity instability. In this case the question is whether a variable is differentiable or not. If one stores a constant vector into a struct alongside other duplicated vectors that will create a mixed activity error (since there is no derivative vector for that which can be defined consistently -- in the worst case leading to wrong results if done incorrectly, so instead here we error). Notably the user can enable runtime activity to work around this like described in our docs, but this will come with a performance penalty (again again provide some combination of previously working code no longer working and slowdown).

  • Unnecessary differentiation. Multiple arguments allows you to mark some data as non-differentiable, which feeds an analysis (activity analysis) which determines the minimum number of instructions/values to differentiate. Combining a bunch of variables together (e.g. in a struct/closure), and in particular no longer specify some as being non-differentiated, can prevent the analysis from proving some code doesn't need to be differentiated. If that code isn't presently supported (e.g. due to a type instability in the function being differentiated), you'll now error

@gdalle
Copy link
Member Author

gdalle commented Jun 7, 2024

Thanks for the clarification, I didn't realize that some of the issues with closures also apply to indirection.

Do you have suggestions for a middle ground that would improve performance on enzyme without being as permissive as enzyme itself? I really am open to hearing your ideas cause mine are running out.
The trouble is that everything I allow for the enzyme backend I also have to support and test for the other 12 (and for combinations thereof in the case of second order). Except that for most other backends, this level of generality will be an outright nuisance. For instance, ForwardDiff and ReverseDiff only support single argument functions.

@gdalle
Copy link
Member Author

gdalle commented Jun 7, 2024

At the moment Enzyme has lots of momentum, so it makes sense for me to go a little out of my way to improve support for you. But three years ago it was Zygote everywhere, and maybe three years from now it will be Tapir everywhere, who knows. Either project may run out of funding or volunteer manpower. In case changing backends becomes necessary, people will be happy to have an interface that is designed to be a good compromise, instead of fully customized to Enzyme's very specific needs. That's why I'm kinda hesitant to go in all the way on activities and shadows and stuff.

@adrhill
Copy link
Collaborator

adrhill commented Jun 7, 2024

In case changing backends becomes necessary, people will be happy to have an interface that is designed to be a good compromise

It's worth reiterating that this goes both ways. For me, part of the motivation for DI and the DifferentiationInterfaceTest.jl tests and benchmarks has always been to make the transition from Zygote to Enzyme on my own projects as simple as possible. And in my eyes, swapping Zygote.gradient(f, x) for gradient(f, AutoZygote(), x) and finally gradient(f, AutoEnzyme(), x) achieves this beautifully.

To add more background to this comment:
The original idea of DI was to implement the "intersection" of functionality between AD backends, while AbstractDiff would build on DI and extend it to the "union" of functionality. Guillaume has managed to squeeze a lot of backend-specific functionality into a unified interface by introducing the extras argument and operator preparation, but by supporting multiple arguments and Enzyme-specific functionality, we might end up exporting an "interface" that isn't fully compatible with all backends.

@gdalle gdalle added backend Related to one or more autodiff backends core Related to the core utilities of the package labels Jun 8, 2024
@wsmoses
Copy link

wsmoses commented Jun 10, 2024

It's obviously completely up to you how you want your package to progress.

Fundamentally it's going to be a question of trade offs.

Either you choose to have less/simpler code in DifferentiationInterface and consequently worse support and/or performance for various AD package backends. Obviously this makes it easier to get things up and running, and perhaps an easier entry point for users. However, the trade off you make here is that it'll be harder for bigger / more workload-critical users to adopt the package. For example, if a codebase already successfully makes use of a given AD package, its going to be much more difficult to have them swap to DI if the use of DI itself (in contrast to direct use of the AD package) prevents code from being differentiated, or makes the differentiated code slower. I'm struggling to find the right link, but I recall that closures + Zygote was a significantly reason it didn't take off. If any interface were causing such a "cannot differentiate and/or slower perf issue" for any backend that doesn't exist on the original tool, I would likely recommend the user directly use a tool and file a bug on the interface in the meantime.

On the other hand, you could have more/specialized code in DI. Obviously this means more code/complexity here, in exchange for more potential users.

Either way, happy to try to support you.

Side note, all the code examples I gave above weren't even differentiated -- so at least the performance penalties I mention would be true of most julia code regardless of additional AD.

@gdalle
Copy link
Member Author

gdalle commented Jun 28, 2024

Not at the moment unfortunately. My current priority is to make the single-input part good enough that it can be used by most of the SciML ecosystem (in particular Optimization and DifferentialEquations). Once I get that right, I'll turn to multiple arguments, but it is an insane amount of code to write and I'm just one dude ^^

If you want to help speed things up, the most useful contribution would be suggestions on how I could achieve this without 1) completely transforming the existing API and 2) turning into an Enzyme front end. What's hard is that a solid 2/3 of backends don't support this at all, so I need to make it work for them with a closure while allowing top performance for more sophisticated ones like Tapir and Enzyme.

@mohamed82008
Copy link
Member

Also note that the reason I'm mentioning ComponentArrays is because the original plan was never for DI to support an arbitrary number of arguments: that's what we agreed AbstractDifferentation would try to achieve based on DI (ping @mohamed82008). Hence I feel like disguising multiple arguments into one (x) or two (x and c for constant) is the best we can hope for in the short term.

I don't think aiming for supporting multiple arguments in DI.jl for backends that don't natively support multiple arguments will be too useful. It might be better to document backend limitations and ask users to define their own closures. If DI.jl is to implement multiple argument support, we might as well deprecate AD.jl and go with DI.jl for everything. I personally have no issue with that if DI eventually supported everything AD currently supports. I would have preferred AD.jl to be the one package but development here is clearly much more active than in AD so maybe that's better for the ecosystem as a whole. Alternatively, we could keep the arrangement we agreed on but you contribute the multiple argument support to AD.jl. I think everyone here has write access to AD.jl and can contribute there instead. It's really up to you where you want to contribute this code. If we decide to make DI.jl the new AD.jl, then perhaps we should also figure out a way to give some credit to the AD.jl contributors and original designers.

@wsmoses
Copy link

wsmoses commented Jun 29, 2024

@mohamed82008 I don't think I have write access. I left comments on the AD.jl wip Enzyme PR a while ago, I'd be happy to do some minor fixups but don't have permission to xD

@mohamed82008
Copy link
Member

@mohamed82008 I don't think I have write access.

Fixed :)

@gdalle
Copy link
Member Author

gdalle commented Jun 29, 2024

Thanks for your answer Mohamed 🙏 A lot has happened in DI over the last couple of months, and it's already being used by big players like SciML and JuliaSmoothOptimizers. As a result, I honestly think it would be best for the ecosystem to put all of this stuff in one place, and aim for multiple argument support in DI as well (either for a few backends or for all with a closure fallback, not sure yet).
Putting myself in your shoes, I understand how frustrating that can be, and we definitely need to give credit where credit is due. DI could not exist without AbstractDifferentiation, which was literally the blueprint I based my code on. How about we put both the AbstractDifferentiation paper and the future DI paper in the CITATION.bib here?

@mohamed82008
Copy link
Member

How about we put both the AbstractDifferentiation paper and the future DI paper in the CITATION.bib here?

That might be a reasonable way to give credit to the paper's authors but not all of the contributors, e.g David. I suppose there is no perfect solution. But the CITATION.bib suggestion + a README comment would be the best we can do. I am ok with that.

@gdalle
Copy link
Member Author

gdalle commented Jul 1, 2024

That might be a reasonable way to give credit to the paper's authors but not all of the contributors, e.g David.

Yeah I agree that is a problem, but how do you even cite contributors that are not authors, except by citing the repo itself? Which we can also recommend btw

@gdalle
Copy link
Member Author

gdalle commented Jul 1, 2024

@mohamed82008
Copy link
Member

What do you think of the wording here?
https://github.com/gdalle/DifferentiationInterface.jl?tab=readme-ov-file#citation

Looks good. Thanks. I think the reference to AD.jl as "the inspiration" sufficiently gives credit to the AD.jl contributors and the citation sufficiently gives credit to the AD.jl designers and paper authors. I don't think we can do much better than that. Asking users to cite the AD.jl repo as well may be a bit much but if David prefers that, it would be nice to do as well. I am happy with the current status.

@gdalle
Copy link
Member Author

gdalle commented Jul 15, 2024

This is me trying to sum up the discussion and propose a new API.

Problem

Some backends support multiple arguments or activity specifications.
With DI's current API, we force the user to put everything in the same array and differentiate through it all, which is suboptimal. In addition, closures are not guaranteed to give the correct result (see #339).

Backend abilities

backend multiple arguments activities
ChainRulesCore yes kinda
Diffractor yes no
Enzyme yes yes
FastDifferentiation yes no
FiniteDiff no no
FiniteDifferences no no
ForwardDiff yes no
PolyesterForwardDiff no no
ReverseDiff no no
Symbolics yes no
Tapir yes someday
Tracker yes no
Zygote yes kinda

The most generic backend is Enzyme's with Const(x), Active(x) and Duplicated(x, dx).

Possible API

At the moment, DI supports:

signature input output return mutated constant
f(x) x f(x) output - -
f!(y, x) x y nothing y -

For a given input argument, we can ask:

  • Whether we care about its initial value
  • Whether we care about its derivatives => provide derivative seed or not
  • Whether it is mutated => provide shadow or not
  • Whether it's also the output (there can be only one output)

First proposition: split between:

  • "input": care about initial value and derivatives, not mutated
  • "output": care about derivatives but not initial value, possibly mutated
  • "parameters": care about initial value but not derivatives, not mutated
  • "cache": don't care about initial value or derivatives, mutated

The API could look like:

jacobian(f, backend, input, params, cache)
jacobian(f!, output, backend, input, params, cache)

pushforward(f, backend, input, params, cache, input_seed)
pushforward(f!, output, backend, input, params, cache, input_seed)

pullback(f, backend, input, params, cache, output_seed)
pullback(f!, output, backend, input, params, cache, output_seed)

Not sure about the name params for something that is constant, maybe constant would be better?

Implementation

In practice, input, params and cache can each be something like a NamedTuple, and we could probably splat them inside the function if necessary.

For most backends other than Enzyme, operators would fall back on the following closure:

f(input) = f(input, params, cache)
f!(output, input) = f!(output, input, params, cache)

@gdalle
Copy link
Member Author

gdalle commented Jul 15, 2024

The gist is that DI is a math-inspired interface: functions $f : x \mapsto y$ only have one input and one output, everything else is either a parameter of the function $f_p$ or a cache

@Vaibhavdixit02
Copy link
Contributor

I don't fully understand what you mean by "care about initial values"?

@gdalle
Copy link
Member Author

gdalle commented Jul 16, 2024

When you provide a cache, you don't give a damn what's there at the start of the function, you can initialize it with similar. When you provide a parameter, you do care

@lassepe
Copy link

lassepe commented Jul 21, 2024

This is me trying to sum up the discussion and propose a new API.

I like this API and I believe that it is near optimal given the constraint that DI should be the "intersection of all AD packages" (a job which it does really well right now). Beyond enabling the use of Enzyme, your API would also facilitate some additional performance optimization; i.e. to work around the slow closure bug. I don't think that there is a much more flexible API hat all packages can support. This also makes it easier to support future packages quickly in DI without them having to develop a large set of features before joining the DI family.

The only alternative I see is to "wait for Enzyme to be able to handle closures reliably" and stick to single-arg API; the simplicity of which is certainly appealing. If "one day" ™️ the slow closure bug is gone and Enzyme is able to handle closures more reliably, I would certainly prefer the single-arg + closure route.

As for naming bike shedding: I feel like users with ML background would expect the function to be differentiable in params. I think constant or context would be clearer.

@adrhill
Copy link
Collaborator

adrhill commented Aug 5, 2024

Something we might want to consider as well when discussing future API changes are "auxiliary" outputs à la JAX:

has_aux – Optional, bool. Indicates whether fun returns a pair where the first element is considered the output of the mathematical function to be differentiated and the second element is auxiliary data. Default False.

@gdalle
Copy link
Member Author

gdalle commented Sep 25, 2024

As of the newly-released v0.6, users can specify an arbitrary number of "context" arguments which are not differentiated. At the moment, there is only one context type named Constant, but I'll add a second one named Cache soon enough.
The syntax looks as follows:

gradient(f, backend, x, Constant(c1), Constant(c2))

There still cannot be more than a single active (differentiated) argument x, but I think this goes most of the way towards what is needed in the ecosystem (especially since many backends support non-array inputs in practice). Closing this issue in favor of more focused ones.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backend Related to one or more autodiff backends core Related to the core utilities of the package
Projects
None yet
Development

No branches or pull requests

10 participants