-
-
Notifications
You must be signed in to change notification settings - Fork 104
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
auk: eBird Data Extraction with AWK #136
Comments
Hi @mstrimas, Editor checks:
Here are some notes from a
|
Thanks, @karthik, I just fixed several of these, but the following remain:
|
@mstrimas No worries. Thank you for fixing the warnings. Regarding those warnings, have you tried adding a blank line at the end to those files. That should make the warnings go away. |
Reviewer 1 is @aurielfournier |
Reviewer 2 is @emhart |
DocumentationThe package includes all the following forms of documentation:
Functionality
Final approval (post-review)
Estimated hours spent reviewing: 6 (this is my first package review so I spent more time then I suspect I might on future reviews) Reviewer Comments
The package is solid, though since I am not well versed in AWK by any means I'm unable to comment on those fine details. The vignette is quite extensive, which is fantastic! I'm reading this vignette thinking about 'the average ebird data user' who isn't necessarily someone with a extensive R background and so while this is a very detailed vignette, and I think the detail is good and important, it might be better if it was rearranged so that the heavy technical detail was towards the end, and the 'how to use this package' is more up front. Since heavy users will keep reading, but less experienced users may get overwhelmed by details that are not essential to them using the package. My biggest suggestion would be to remove I would encourage you to avoid abbreviations since most people aren't going to read the vignette word for word, and consider not using Throughout the vignette and function documentation you use pipes, which is great, I like pipes, but lots of people don't. In some cases because they don't like them and in others because they find them confusing. I think it would be valuable to also include examples of how the functions would be used without pipes in the vignette and in the function specific help files. Since it is not good practice to write over an object with the same name Function specific feedbackThe function help examples in the different filter functions don't include Build/InstallI don't check build/installation on things very often. So this is not going to be the high point of my review.
|
Package ReviewPlease check off boxes as applicable, and elaborate in comments below. Your review is not limited to these topics, as described in the reviewer guide
DocumentationThe package includes all the following forms of documentation:
Functionality
Final approval (post-review)
Estimated hours spent reviewing: 5Review CommentsThe authors present a very elegant solution to a difficult problem in R, how to handle a very large data set such as the eBird data (larger than most people's personal computers could load into RAM) when most users only need a subset of the data in the full file. Their solution is to provide a way to automatically build an Awk script, execute it, and write a new output file. While this could be done without this R package, they make the data accessible to a much larger audience. Overall I found the code to very comprehensive and elegant and think it will be a good addition to the rOpenSci package suite. The authors largely adhere to the rOpenSci package guidlines and are exceedingly diligent in their error handling in each function. Also I was impressed with their coverage use cases in their tests. They went above in beyond in writing an exhaustive suite of tests for each function. I find no major issues with how their code is written. I do think there are a couple minor areas for improvement in making it easier for end users. The biggest issue I had was initially grokking that this was a multi-step workflow that involved writing a file to disk. My first impression was that I could simlply run a bunch of filters and the Minor comments
Community guidelines A CONTRIBUTING or way to cotribute in the README is not present. Consider adding contributor guidelines. Examples I ran all examples using Tests I ran all tests with Checks I built the package on the following system using All checks were passed with no notes, errors, or warninging. Test coverage I checked for the amount of test coverage using Furthermore I reviewed all the tests in sessionInfo() Just so you can see what versions of packages I used to run my tests:
|
@mstrimas As an aside from my review I wanted to say that this is a really cool solution to a big problem in R that I actually encounter in my work often. I might have a dataset that's a 20-30 GB and I don't want to actually crunch the whole thing in R. So I do a slightly more hacky approach which is to do some filtering in the data export phase (in SQL) and then some basic shell commands to sample / trim it down more, and then read it into R to do things like model POC. Do you think there's a way to make this package completely generic? I'm imagining a scenario where I input the file location, column header names, a series of generic filters, and then the same basic workflow happens, awk executes the script and then writes an output file. Then this same workflow could work on any large text file. It seems like that would be a really powerful tool that would extend the functionality of this approach beyond eBird. Do you think that would be feasible? |
Thanks for all the helpful feedback! I'll start working through your suggestions and incorporating them. @emhart yes, I think there is potential to make a more general AWK package for working with large files. In fact, I did originally considering doing that first, then making |
@aurielfournier A gentle ping 🙏 |
@aurielfournier Sorry I totally missed that your review was above Ted's. My apologies. |
No problem @karthik ! Our reviews came in withing a few hours of each other, easy to miss. I appreciate the gentle reminder, those are often necessary to keep me on top of things. |
Finally getting to this, here are responses to @aurielfournier comments:
Thanks for the code review!!! |
Here are my responses to @emhart:
Thanks!!! |
@emhart Just added ability to manually set awk path by setting the |
@aurielfournier I've added pipe-free examples to all functions for pipe haters. @karthik what's the next step here? The eBird taxonomy and EBD was just updated and my intention is to submit a new version of auk to CRAN in the next few days reflecting the taxonomy changes and the suggestion from the reviewers. |
Sorry for the delay @mstrimas here are a few quick thoughts:
|
@emhart thanks! I like the idea of column subsetting and will start looking at the best way to implement that. |
|
@karthik just released a new version to CRAN with most of the changes suggested by the review process included, as well as a variety of other new features and bug fixes. let me know what the next steps are to get this up on rOpenSci. Thanks! |
@emhart @aurielfournier Could you two take a look at the recent updates and let me know if you are ready to sign off? 🙏 |
I will do my best to get to this in the next two weeks. This are a bit swamped on my end at the moment. |
Thank you @aurielfournier! much appreciated. 🙏 |
Congrats on your package being accepted @mstrimas! 🎉 🎈 Here are your next steps:
Please also add a footer to the bottom of your README
Once moved, please re-run all checks in preparation for submission to CRAN. I can help with this if you run into any issues. Welcome aboard! We'd also love a blog post about your package, either a short-form intro to it (https://ropensci.org/tech-notes/) or long-form post with more narrative about its development. ((https://ropensci.org/blog/). If you are, @stefaniebutland will be in touch about content and timing. |
Hi @karthik, |
@mstrimas Hi Matt, understood. You can skip the transfer step and I'll look into the best way for setting up a mirror and get back to you with further details. |
Hello @mstrimas. Congratulations on |
@stefaniebutland sure, I'm happy to modify the vignette into a blog post. Probably won't be able to get to it for a week or two though. |
No problem @mstrimas. I have Tues Feb 27 available for a post and we typically ask for a draft for review at least a week before the post date. What do you think about Tues Feb 20 to submit a draft via pull request? |
That works for me, @stefaniebutland, thanks! |
@mstrimas Just wanted to give you a quick update. We are still working out a good way to do the mirroring, but I'll let you know soon. Also I am about to travel for a bit, so I'll update the thread upon my return (late Feb). 🙏 |
Hi @stefaniebutland, looks like there's going to be a large update to the data underlying this package in mid March that will requiring some changes to the package that break backward compatibility. Would you be open to pushing the blog post back until the new version is released? If this messes things up for you, no worries, I can proceed with the post as is and just avoid the features that will get broken. |
@mstrimas It's whatever you think is best. Blog post timing is flexible. You only really get the audience once for this kind of thing though, so if you prefer to publish after updates to avoid frustrating people once they're engaged, then we can postpone. Perhaps you can draft your post ideas for yourself now before they go stale ;-) and then fill in soon after you've made the required changes to the package. I'll mark my calendar to check in with you in late March. |
Ok, that's what I'm thinking, only one chance to catch people's eye so better to have the package in tip top shape. Late March sounds good. Thanks! |
Checking in to see if timing is right for to draft a blog post - no rush if pkg not updated yet |
By the way, it'd be grand if the blog post explained a bit how to choose between using |
@stefaniebutland I think the package is ready, I can start putting something together this week. Thanks for the reminder! @maelle |
Thanks a lot for the explanations @mstrimas! It'd be a nice footnote as well in my opinion (of the post and READMEs). When you say 30 days of observation you mean for raw occurrence data right? For frequency derived from it it seems you can get older data e.g. https://github.com/stephhazlitt/ruhu-ebird-observations/blob/master/R/ruhu-ebird-observations.md |
@maelle I wasn't aware of the I'll add something to the README explaining the difference, thanks for the suggestion! |
Awesome! It'll be super useful to guide users finding any of the 2 packages first! I wonder if the info should also live in the vignette because of people installing from CRAN and therefore not having the README 🤔 |
@maelle updated the README and vignette as per your suggestion |
Fantastic! Speaking of other rOpenSci packages, I am also wondering whether/how one could use |
@stefaniebutland here's a first draft of a blog post. What topicid and date should I use? Also, is there somewhere in the website repo I can put a couple data files (~ 3 MB). If there isn't a good spot, I'll just leave them in my GitHub repo. If this looks good I can submit a pull request to the rOpenSci website repo. |
@mstrimas I apologize for my long delay in responding. I am temporarily putting most blog post reviews and scheduling on hold until after our unconference. Rest assured your post will get a proper review and be published on our blog. I anticipate getting to this in early June.
Leave topicid blank. I add that immediately before publication and that links comments to our discussion forum. We'll determine the date when I review your draft. Sorry about this delay. We really appreciate it when package authors do the extra work of contributing a post! |
Hi @mstrimas. I hope you're still up for a post about
Scott says good to leave them in your GitHub repo. Happy to answer any questions |
@mstrimas Are you still interested in writing a post or a short tech note about |
Sorry @stefaniebutland, I've been out of town off and on the last couple weeks, and won't be back until the end of next week, so this slipped through the cracks. If you feel what I already put together looks good for a blog post, I'd be happy to use that: https://github.com/mstrimas/auk-blog-post/blob/master/auk.md What needs to be done to that to get it ready for posting? |
That reads nicely as a blog post @mstrimas. Please read on when you have time. I have a couple of suggestions for edits. When you're ready, please submit a pull request as outlined here: https://github.com/ropensci/roweb2#contributing-a-blog-post. Use date 2018-08-07 and if it is ready sooner and a spot opens up, we can change the date and publish sooner.
@maelle is going to refer to using Hope this is helpful. Happy to answer any questions. |
Ahah no I actually reviewed the post simultaneously, great minds meet @stefaniebutland! Thanks again for your interesting post @mstrimas, my comments are over at https://github.com/mstrimas/auk-blog-post/issues/1 |
Summary
Access to the eBird database, consisting of over 400 million observations, is provided via a huge (>150 GB) text file. The
auk
package extracts records from this file and imports them into R for analysis. Both presence only and presence/absence data can be generated.URL for the package (the development repository, not a stylized html page): https://github.com/CornellLabofOrnithology/auk
Please indicate which category or categories from our package fit policies this package falls under and why(? (e.g., data retrieval, reproducibility. If you are unsure, we suggest you make a pre-submission inquiry.):
This package falls somewhere at the intersection of data retrieval and extraction. It provides access to the eBird database; however, it does so by processing a text file downloaded from eBird that contains the full database.
Anyone looking to work with eBird data for science or conservation.
yours differ or meet our criteria for best-in-category?
rebird
provides access to eBird data via the eBird API; however, this only gives access to last 30 days of data. This package is the only one giving access to full eBird database.Requirements
Confirm each of the following by checking the box. This package:
Publication options
paper.md
with a high-level description in the package root or ininst/
.Detail
Does
R CMD check
(ordevtools::check()
) succeed? Paste and describe any errors or warnings:Does the package conform to rOpenSci packaging guidelines? Please describe any exceptions:
If this is a resubmission following rejection, please explain the change in circumstances:
If possible, please provide recommendations of reviewers - those with experience with similar packages and/or likely users of your package - and their GitHub user names:
The text was updated successfully, but these errors were encountered: