-
Notifications
You must be signed in to change notification settings - Fork 306
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Extending the standard library #312
Comments
I strongly oppose this. One of the reasons I am able to read WDLs of other people is because they cannot have their own functions. We use WDL all the time, and the times we needed to do some stuff that was not in the standardlib we made a task. This works well enough. It does not happen often. One big advantage of the stdlib is: there is documentation!. People just don't document their customized function. They will start using customized functions instead of stdlib functions (why read the docs if you can have a customized function right away?) . WDL will become universally more unreadable because non-standard functions are allowed. I think this undermines one of the core strengths of WDL. The fact that you can NOT write your own custom code in it. |
I agree with @rhpvorderman for the most part. Although I wouldn't say there is a huge difference between writing custom code (or functions) and writing a task containing some (eg.) python code. The biggest difference (I guess) is that having to write a task might dissuade people from doing non-standard things too often, whereas introducing a custom function syntax would promote it. I would prefer to dissuade here. Also as far as the append functionality is concerned, there is a task append {
input {
Array[String] list
Sting extraValue
}
command <<<
python <<OEF
for x in ['~{sep="','" list}']:
print(x)
print('~{extraValue}')
OEF
>>>
output {
Array[String] newList = read_lines(stdout())
}
} |
@rhpvorderman, this is fair criticism. The alternative is to implement a standard library that handles maps, arrays, pairs, user-defined structs, and recursive structures. Here is what relevant languages have: These are large libraries, with significant conformance testing suites. As a practical matter, I can see the Broad supporting this. But I can't see a second implementation, the resources just aren't there. |
@orodeh Sorry if I missed it, but how are you proposing the syntax to look? FWIW I'm skeptical, purely IMO WDL has already gone too far down the road towards general purpose programming language and this would be another big leap further. |
I agree with @geoffjentry. Let's not compare WDL to python or scala, because it has a different purpose. I like the fact that the stdlib of WDL is small, so it does not require me as much effort to learn as an entire language. Can you name any concrete examples of stuff that can't be solved by a task, and for which we would need such a syntax? |
A syntax for implementing a function, that would be very similar to tasks, would:
To call it, you would do:
The kind of method that is missing from the current stdlib is |
So why not do this in a task? task YamlToJson {
input {
File yaml
String outputJson = basename(yaml, "\.ya?ml$") + ".json"
String dockerTag = "3.13-py37-slim"
}
command {
set -e
mkdir -p $(dirname ~{outputJson})
python <<CODE
import json
import yaml
with open("~{yaml}", "r") as input_yaml:
content = yaml.load(input_yaml)
with open("~{outputJson}", "w") as output_json:
json.dump(content, output_json)
CODE
}
output {
File json = outputJson
}
runtime {
docker: "biowdl/pyyaml:" + dockerTag
}
} No need to add a If such a task is used thousands of times, and everyone has this use case, we can think of adding it to the stdlib. But adding a general
THIS is the future I envision for Furthermore, Snakemake and Apache Airflow already exist, there you can add custom code to the pipelines. Knowing the horrible effects this has on the readability and maintainability I fled to WDL, which was much nicer. I have strong opinions on this, because not having things such as |
The original problem we are trying to solve is how to extend the standard library. Cataloging the requests we already have, we get about four modules: Math Array Map String Clearly, all four categorizes will grow. We'll need string operations, math operations, and data structure methods. As an engine implementer, I would like to build such a WDL standard library without writing a massive amount of code. Assuming the engine is written in Scala, a way to solve this is to translate WDL expressions to Scala expressions. Then, runtime evaluation reduces to Scala evaluation. For example, if you have a workflow snippet such as: Map[String, Int] chromosomes = {
"1": 259,
"2": 242,
"3": 198,
"4": 190,
"X": 156}
Array[String] s = chromosomes.keys The compiler will convert val chromosomes : Map[String, Int] = Map(
"1" -> 259,
"2" -> 242,
"3" -> 198,
"4" -> 190,
"X" -> 156)
val s : Vector[String] = chromosomes.keys.toVector Since the expression sub-language is purely functional, it has equivalents in scala. The same can be done for python, including lambda functions (both languages support this). I hope this will allow using much of the math, string, array, and map operations, without needing to write them ourselves. |
The question is whether we need full maths and string operations support in wdl. I get that there are some features there that would be useful, but we should remember that wdl was intended to be a workflow language and not a programming language. I honestly feel that, unless some functionality is vital to defining a workflow, we shouldn't add loads of functions to the stdlib. To quote the openwdl website:
I feel that extending the standard library too much or adding syntax like the one suggested in "map over an array (#203)", would greatly impact understandability for those without a lot of programming experience. |
Ah, this is where the misunderstanding arises. I don't think a small standard library is a problem. I think it is a blessing.
I don't see how all of these things can not be accomplished with a task. With the exception of #43 I think these PRs are too far-fetched for a Workflow Description Language. As @DavyCats pointed out it is not a Workflow Programming Language. If you want to use a programming language: use a programming language! WDL allows you all the freedom you could ever want within the command brackets of a task statement. @orodeh You have not answered my core concern: why can these things not be done in a simple task? |
If you are fine with writing stdlib stuff in tasks, then, a way to do this is outlined at the top of the thread. To recap: |
@orodeh thank you for putting this thought experiment together! I think this issue really is a defining conversation about what is the difference between WDL and a general purpose programming language. To me, WDL is and and has always been a Workflow language that provides some high level functionallity for wrapping my low level functions (ie docker, bash scripts etc). While there has been some situations where I have been frustrated to no end due to the lack of certain stdlib functions, this minimalist approach is a blessing in disguise in my opinion. IMO one of the reasons why WDL has caught on is because of its simplicity in writing and documentation. WDL Is easy to read, and easily portable. The introduction of user defined functions opens a can of worms on multiple fronts. First of all, readability plummets, as was mentioned above. Also we start to shift work OUTSIDE of the scope of the Additionally, if we were to go this route, I can forsee us quickly getting into the dreaded use case where users want a fully functioning programming language within a I think there is a good balance that we can strive for between functionality and minimalism when it comes to the |
@orodeh I also think buried in this discussion is a very legitimate question about how |
Relatedly, I quite like the idea of some kind of "standard library" of tasks, which could presumably do many of the things which people would want from "extension" functions, without the need to re-write them individually each time someone is writing a WDL. Running a full task just to append to an array sucks, but from an engine/performance point of view I could also see (if it turns out to be important) a small set of "known" tasks be optimized either ad-hoc (eg with a The nice thing about doing this "in engine" from my point of view is, people don't need to know what the magic is, it just happens for them. All they'd need to know is where the standard library is and how to import its tasks. |
Just to clarify: when I say "standard library of tasks", I mean being able to write something like |
@cjllanwarne Over at @biowdl we have a common.wdl for such things. I'm not against having a standard library of commonly used tasks, although we might want to consider naming it differently as to not confuse the engine functions (currently refered to as the standard library) with these standard wdl tasks. |
Should we consider a package repository to centralize task libraries? We’re
planning to make such a library for GATK and probably a couple of other
popular bioinfx tools. Would be nice for people to be able to look for
what’s available in a single place.
And maybe that’s just dockstore, but with something to identify task
libraries from workflows. Thoughts?
On Wed, May 1, 2019 at 9:46 AM DavyCats ***@***.***> wrote:
@cjllanwarne <https://github.com/cjllanwarne> Over at @biowdl
<https://github.com/biowdl> we have a common.wdl
<https://github.com/biowdl/tasks/blob/develop/common.wdl> for such
things. I'm not against having a standard library of commonly used tasks,
although we might want to consider naming it differently as to not confuse
the engine functions (currently refered to as the standard library) with
these standard wdl tasks.
—
You are receiving this because you are subscribed to this thread.
Reply to this email directly, view it on GitHub
<#312 (comment)>, or mute
the thread
<https://github.com/notifications/unsubscribe-auth/AAU7AEZGZKZSH34HDBHQYNLPTGNKZANCNFSM4HIORIXQ>
.
--
Geraldine A. Van der Auwera, Ph.D.
Associate Director of Outreach and Communications
Data Sciences Platform
Broad Institute
|
@orodeh One question I have - I'm interpreting the purpose of the function as a hint to the engine "Hey, you can run this in band if you want", does that sound right? If so, why couldn't that just always be true of a task? (In fact, as far as I'm concerned, that's already the case). My $0.02 - I already wish the stdlib wasn't as grandiose as it is, but have come to terms with its existence. There's a tipping point somewhere between "easy to grok DSL" and "might as well use a GP language" and IMO we're hovering around that point. I do quite like the idea of bringing more attention to projects like @biowdl and encouraging people to place their libraries of useful WDL tasks into Dockstore. |
This is exactly what I meant by By the way, a question about a library function for find in arrays was recently posted in the WDL forum. I think this issue is not going away. |
@orodeh I think runtime attributes would serve this purpose (letting the engine know a task should be run locally) well enough. We could model an attribute after cromwell's backend concept and standardize a |
@orodeh I can see the desire to have a task run locally, and I think if we define this as a hint, then it would be fine. My main concerns with that would certain around the performance impact that running a task locally could have. As long as the behaviour was completely opt in able then its not that bad. I personally would favor just starting a new VM entirely even for a small task as we do normally. As for the |
In light of this discussion, I suggest the following two ideas:
Perhaps we can use the existing We could stop here. I think that a topic for future discussion is a limited form of stdlib extension. Functions that do not take lambda expressions, but do perform data structure manipulation. For example: but not: I believe the majority view is that we do not want to support lambda expressions. If so, it would be good to codify. |
For what it's worth @orodeh the |
@orodeh I am closing this since it looks like for the time we are decided. For your first point, #315 should actually provide a mechanism to to let your BE know that the task CAN be run locally. this approach does not enshrine it in the spec, but allows minor optimizations through hints. Additionally, for your 2nd point, i agree with you that should be supported. IF you would not mind making a PR or clarifying the current spec if this is not the current behaviour. |
Continuing this discussion slightly, I do wish there was a way to communicate to the engine that you're just shuffling files, rather than attempting to re-upload them. (i.e., to split on lanes, etc) |
A common request is to extend the standard library (#296, #276, #234, #203, #170, #137, #133, #120, #117, #43). This proposal, or thought experiment, describes a method for allowing users to write their own functions.
Assume we want to add an
append
function. Given an array of stringselements
, and a stringa
, it will return a new array withelements
, anda
concatenated at the end.The type would be:
It can (almost) be implemented as a regular task
Calling it from a workflow, can look like this:
This almost works in Cromwell version 40. Running it as above results in the errors:
The issue is that
write_json
doesn't work onArray[String]
norString
types. By improving the implementation so that it will work, at least for types that have a simple mapping into JSON (Array
,String
,Int
), we would allow writing additional functions in a language of the user's choosing. We would thus avoid the need to add these functions to the standard library.A potential optimization down the road is running such tasks in an efficient manner by the engine, avoiding the need to launch a job for the evaluation. Language support can be added for this purpose, for example, using
function
instead oftask
.The text was updated successfully, but these errors were encountered: