-
Notifications
You must be signed in to change notification settings - Fork 36
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
rdata, read R datasets from Python #144
Comments
Hi @vnmabus, just letting you know we have started the search for an editor for this review. In the meantime, here are the initial editor checks. I am happy to report that rdata passes with flying colors. Please see a couple of comments below. You are not required to address these for us to start the review, but I do think doing so would benefit your package--just trying to help, you get a free review from the editor in chief too 🙂 Editor in Chief checksHi there! Thank you for submitting your package for pyOpenSci Please check our Python packaging guide for more information on the elements
Editor commentsTwo suggestions:
More on adding data: >>> parsed = rdata.parser.parse_file(rdata.TESTDATA_PATH / "test_vector.rda")
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/pimienta/Documents/repos/coding/opensci/pyos/test-rdata/.venv/lib/python3.10/site-packages/rdata/parser/_parser.py", line 1002, in parse_file
data = path.read_bytes()
File "/home/linuxbrew/.linuxbrew/opt/python@3.10/lib/python3.10/pathlib.py", line 1126, in read_bytes
with self.open(mode='rb') as f:
File "/home/linuxbrew/.linuxbrew/opt/python@3.10/lib/python3.10/pathlib.py", line 1119, in open
return self._accessor.open(self, mode, buffering, encoding, errors,
FileNotFoundError: [Errno 2] No such file or directory: '/home/pimienta/Documents/repos/coding/opensci/pyos/test-rdata/.venv/lib/python3.10/site-packages/rdata/tests/data/test_vector.rda' I think what's going on is that the test data is not in the built package? You may have intended this snippet to be run with just the development version, but I would strongly suggest you add small amounts of data to the package itself. It's also possible you meant to "include" the data in your built package, but the setuptools build is not configured correctly? I have found the way setuptools does this a bit confusing in the past, but I think it's now possible for you to include all the files without using a MANIFEST.in file (which if I understand correctly is now a "legacy" way of including files with setuptools that should no longer be used). Note that other build backends have their own methods for this; for example flit would include everything by default, although you can have more fine grained control. You can then provide access to this data to users (and yourself, for tests) by using importlib-resources as discussed in this talk. Here's an example of using importlib-resources in a package I develop--it might be a little easier to read than more general scientific Python packages that have a lot of internal infrastructure around their built-in datasets. I have a draft section of the guide on how to do this here--it's very rough still but the core ideas are there and might help you. If you have any feedback on this section, we'd love to hear it. |
@NickleDave I have added the test data to the manifest and changed the package to use Some observations of the process:
|
Great, thank you @vnmabus--this feedback is really helpful. I will read this in more detail, and share in our Slack team, but I want to let you know right away I appreciate it. It might be worth sharing your observations on the process you went through to include data in our Discourse, if you wouldn't mind starting a new topic there, since that's open and we can get more eyeballs on it: https://pyopensci.discourse.group/. We could share the topic on social media too (e.g. Mastodon). I just sent out an email to another potential editor today--will update you as soon as we have one! We want to make sure we find someone with the right expertise in both Python and R. |
or if you're ok with it @vnmabus I can start the topic on our Discourse and you could feel free to reply there. Please let me know We'd like to get input from maintainers of setuptools there, to better understand when we need MANIFEST.IN files -- this has come up before in GitHub review of our packaging guide |
If you can start the topic and give me a link, that would be awesome, thank you! |
Hi again @vnmabus I asked a question about MANIFEST.in here, please feel free to chime in if I'm not understanding all the complexities of your particular situation: |
Hi @vnmabus, I'm very happy to report that @isabelizimm will be the editor for this review. We are now looking for reviewers. |
@NickleDave And I am also happy to report that I have added a couple of more detailed examples, as per your second suggestion. Please, tell me if that is what you had in mind: https://rdata.readthedocs.io/en/latest/auto_examples/index.html |
These look great @vnmabus, thanks so much. This is exactly what I had in mind. |
Hey there! I'm excited to be the editor for this package--I can't count the amount of times I've done the R->Python data dance. I'm on the hunt for reviewers, and will check back end of next week with an update! |
👋 Hi @rich-iannone and @has2k1! Thank you SO MUCH for volunteering to review for pyOpenSci 🎉 @vnmabus these two reviewers are individuals who are deeply involved in both the R and Python data and packaging world. I am super excited to be part of and learn from them through this review process! Please fill out our pre-review surveyBefore beginning your review, please fill out our pre-review survey. This helps us improve all aspects of our review and better understand our community. No personal data will be shared from this survey - it will only be used in an aggregated format by our Executive Director to improve our processes and programs.
The following resources will help you complete your review:
If anyone has any questions/comments/concerns, please don't hesitate to get in touch! Your review is due: Dec 8 |
Hi @vnmabus, First, I am happy to review rdata. Second, currently there is no tag for 0.9.2.dev1. Though, as there aren't many changes between when this issue was opened (after v0.9.1) and today, I am fine with addressing my review to any version/commit after that. |
Yes, sorry, the 0.9.2.dev1 version was the number of the develop version when I opened this issue. As I added some breaking changes, it was renamed and published until version 0.10 instead. That said, it would be better if you include in your review also the recent changes in develop (I know it is a moving target, but this package does not usually change that much). In particular, there is a recent PR from a user that greatly improves XDR parsing speed, removing the |
That is okay with me. On |
Package ReviewPlease check off boxes as applicable, and elaborate in comments below. Your review is not limited to these topics, as described in the reviewer guide
DocumentationThe package includes all the following forms of documentation:
Readme file requirements
The README should include, from top to bottom:
NOTE: If the README has many more badges, you might want to consider using a table for badges: see this example. Such a table should be more wide than high. (Note that the a badge for pyOpenSci peer-review will be provided upon acceptance.)
UsabilityReviewers are encouraged to submit suggestions (or pull requests) that will improve the usability of the package as a whole.
Functionality
For packages also submitting to JOSS
Note: Be sure to check this carefully, as JOSS's submission requirements and scope differ from pyOpenSci's in terms of what types of packages are accepted. The package contains a
Final approval (post-review)
Estimated hours spent reviewing: 3 Review CommentsThe software is good and useful. I find myself having a need for this very thing. In terms of letting people know what the package does, it's somewhat terse at the Other use cases can be presented to get the casual user's imagination going, as well. This may involve hosting .rds files (maybe in a public GH repo) and demonstrating Some additional documentation things:
Package README and Documentation Site The organization of the project's GH page is pretty good. Without a direct link, however, it's a bit hard to search for (appearing on the 4th page of results) and that's A useful badge to have is the Python versions supported badge. The list of versions could match those that you're testing and the relevant Community Considerations To make things better for community contributions, I suggest adding these two things:
You already have a CONTRIBUTING document that describes ways to make contributions and this is great. The PR and issue templates can link to that document (this increases I recommend adding a CODE_OF_CONDUCT.md file. These are very commonplace so you can get a file from another project's repo. Code Comments I found the code to well organized and easy to follow. All conventions seem to be followed and many helpful docstrings were present for virtually all classes and methods. Testing It's good to see that testing is being done on a solid matrix of recent Python versions and platforms (Linux/Ubuntu, Windows, MacOS). There does need to be some updates Another recommendation is to run pylint for linting, perhaps through pre-commit (see https://pylint.pycqa.org/en/latest/user_guide/installation/pre-commit-integration.html). There are a lot of tests in place and very good line coverage. Having the coverage results sent to CodeCov makes it very easy to see which code paths go through tests. It's Perhaps consider adding Python 3.12 to the testing matrix. If you want to go even further with testing, have different R versions on different platforms generate .rds/.rda files and test those directly. Citation File The citation file should have a version field, updated every time a version of the software is released. This can be a bit of a pain but it makes for a better, more |
Package ReviewPlease check off boxes as applicable, and elaborate in comments below. Your review is not limited to these topics, as described in the reviewer guide
DocumentationThe package includes all the following forms of documentation:
Readme file requirements
The README should include, from top to bottom:
NOTE: If the README has many more badges, you might want to consider using a table for badges: see this example. Such a table should be more wide than high. (Note that the a badge for pyOpenSci peer-review will be provided upon acceptance.)
UsabilityReviewers are encouraged to submit suggestions (or pull requests) that will improve the usability of the package as a whole.
Functionality
For packages also submitting to JOSS
Note: Be sure to check this carefully, as JOSS's submission requirements and scope differ from pyOpenSci's in terms of what types of packages are accepted. The package contains a
Final approval (post-review)
Estimated hours spent reviewing: 5 hours Review CommentsNotes & Suggestions
parsed = rdata.parser.parse_file("dataframe.rds")
data = rdata.conversion.convert(parsed) would be data = rdata.read_rds("dataframe.rds") And maybe also a These functions could include optional arguments to pass on to the parser and/or converter. This would simplify the user experience as the
Final CommentWhile I have noted some suggestions, the project is otherwise well motivated, implemented and tested and I do not consider the above suggestions to be "blockers". |
THANK YOU!!!! To @has2k1 and @rich-iannone for your very thorough reviews 🙌 your time is so appreciated, and I think a great asset to this project!
I have gone ahead and changed the status of the issue to awaiting changes. There were a number of updates suggested by the reviewers; some of these comments will be quick to implement, and others might take a bit more time. This part of the review is a bit more back and forth, as @vnmabus updates Thank you all again for your collaboration! |
Ah, this is a great round of improvements-- thank you for the update @vnmabus! For @has2k1 and @rich-iannone, are you able to comment on if these fixes are what you expected? |
@vnmabus has resolved the all core queries and addressed the 2 conditional issues. To keep track of one, I have filed an issue about the comparison with other python packages, and the In between, the tables for default conversions are great and super informative. Great work Carlos. |
Sorry for being very late here, but thank you @vnmabus for all the improvements made. I've made a few comments in error previously so I thank you for the clarifications on those! The examples are great, no need to go further beyond the loading part (which is the key thing here). As far as adding a version number to the citation file, I take what I said back since I'm in favor of your reasoning (and you bring up some good points about practical usage of the citation text). Testing is always a potential marathon, you're doing great with this and the inclusion of asv tests is very helpful. The page in the documentation about the supported R translations is quite useful. Thanks for adding that in. To wrap up, this package fully meets my quality and usability expectations. Excellent work! |
With two reviewers giving the 👍 ... it is my absolute pleasure to say that ... 🎉 ... Thank you @vnmabus for submitting Author Wrap Up TasksThere are a few things left to do to wrap up this submission:
It looks like you would like to submit this package to JOSS. Here are the next steps:
🎉 Congratulations! You are now published with both JOSS and pyOpenSci! 🎉 Editor Final ChecksPlease complete the final steps to wrap up this review. Editor, please do the following:
If you have any feedback for us about the review process please feel free to share it here. We are always looking to improve our process and documentation in the peer-review-guide. |
Thank you @isabelizimm for the smooth reviewing experience, and also thanks to the reviewers @rich-iannone and @has2k1! I already did the steps at section "There are a few things left to do to wrap up this submission:". I will plan to do the necessary steps to submit to JOSS and create a blog post in the next weeks. |
@isabelizimm I wanted to tell you that I have written a draft of a blog post, and I am not sure how to send it to you for review. On an unrelated note: rdata shows now in the main pyOpenSci page, but it is not shown in a particular category (maybe it should be shown in "data munging"?). |
hey @vnmabus just a quick note that i saw this message and noticed a bug in our website! things should be working and rdata should filter as expected now: |
So @isabelizimm , @lwasser : is it possible to add a blog post or not? |
Apologies for late response-- you are welcome to make a PR with your post to the pyOpenSci website's repo! Here is an example of what this PR can look like 😄 pyOpenSci/pyopensci.github.io#42 |
hi team! i'm circling back on this review and doing a bit of cleanup. rdata was approved - yay and i see the pyos-accepted label! i also believe we published your blog post @vnmabus did this package however get to joss? and if it did it should have a doi if it was inscope. we should close the issue if we you decided to forgo joss! but if it had a joss submission we should link to that issue in JOSS and update the label to also include joss! many thanks!! |
Sorry, I got a new job the same day this package was accepted in PyOpenSci, so I had a few months without a lot of time to invest on it. My plan is still to submit to JOSS after vnmabus/rdata#40 is merged. |
hey @vnmabus no worries. no need to apologize! A few notes
there is no huge rush but I highly encourage you to submit to joss before adding more code to your package to keep our existing review synced with what joss accepts as a publication. please let me know if you have any questions! |
Is there no way to communicate to them the changes since your review, so that they can review the new commits if they want to? I feel that the changes in that PR, and a couple of PRs before by the same author, enrich the package in such a way that it seems unfair not to mention them in the JOSS paper. In fact, I wanted to offer the author of these changes to co-author the JOSS paper with me if he wants. Sorry, because I completely understand the issue with the fast-tracking and DOI. This was not planned when I submitted the package for pyOpenSci review: I wanted to submit the same version to JOSS at that time. Otherwise, I would have just waited to do these changes before submitting to pyOpenSci. So, what do you think it is the best way to approach this problem in a way convenient for everyone? |
@vnmabus i'm going to ping @arfon on this specific issue from JOSS. We ask authors to submit a package to us when the API has stabilized and similar to an academic paper being reviewed, it is not ideal if changes are being made while the software / (or paper!) is in active review. Arfon - in this case we have reviewed the rdata package and it has been pyOpenSci accepted. However, since the pyOpenSci acceptance of rdata there are new commits that look to me like fairly significant enhancements that have been added to the package. Because there is new functionality, we may not want to move forward with a fast track (these changes might require another review). Generally we ask the maintainer to submit based on the release / doi created at the end of our pyOpenSci review and as such JOSS can trust that we have reviewed the entire code base. In this case we have a divergence. How would you prefer to handle this on the JOSS side of things? I believe that this is the first time this has happened. many thanks! |
Hrm, yes isn't ideal (I agree with the idea that the software should essentially be the same between the pyOpenSci and JOSS reviews). That said, we do have the ability to associate a specific version of a software package with a JOSS publication (e.g., |
I could do that, but I wonder if it would at least be possible to mention the post-reviews PRs, maybe as future/ongoing work? Again, I would like to offer @trossi to be a coauthor, given that his contributions added a lot of value to the package (and a JOSS paper is mostly an academic/citable record of the software itself), so it would make sense to at least mention them in passing. (Again, I want to apologize and thank everyone for their time providing guidance in this matter). |
@vnmabus I appreciate that! I'd really like @arfon or someone from JOSS to answer that question. From our end, you are always welcome to update a blog post (or write a blog post) noting new features, etc post review. We know that development goes on. pyOpenSci is here to support you through the process as well! @arfon would it make sense for @vnmabus to submit the package to review for JOSS noting this specific item (when they have bandwidth!)? And you / JOSS can make a call on addressing the new features in that paper over there? We will link back to the review here as we always do so there is a clear communication trail. if we do that our fearless editor @isabelizimm can watch the package go through JOSS to ensure things get wrapped up on both ends. ✨ |
@vnmabus – you're welcome to include future-looking work in the paper, and @trossi as author too given their current role on the project.
Sounds good to me! |
Ok, wonderful. @vnmabus, why don't you go ahead and submit it to JOSS? Please provide the link to the issue when the review opens here. Could you include me in the conversation? That way, the editors at JOSS can make a call. One of two things might happen: Once they accept your package at the release point received here, it would be 0.11.0. OR there is another review. This decision needs to be made by the JOSS editors rather than us. Please let me know if you have any questions. |
I already submitted to JOSS. The link to the pre-review issue is here: openjournals/joss-reviews#7442 |
Submitting Author: (@vnmabus)
All current maintainers: (@vnmabus)
Package Name: rdata
One-Line Description of Package: Read R datasets from Python.
Repository Link: https://github.com/vnmabus/rdata
Version submitted: 0.9.2.dev1
Editor: @isabelizimm
Reviewer 1: @rich-iannone
Reviewer 2: @has2k1
Archive:
JOSS DOI: TBD
Version accepted: 0.11.0
Date accepted (month/day/year): 2/29/2024
Code of Conduct & Commitment to Maintain Package
Description
.rda
and.rds
files, containing serialized R objects, and convert them to Python. The users can influence this conversion and provide conversion routines for custom classes.Scope
Please indicate which category or categories.
Check out our package scope page to learn more about our
scope. (If you are unsure of which category you fit, we suggest you make a pre-submission inquiry):
Domain Specific & Community Partnerships
Community Partnerships
If your package is associated with an
existing community please check below:
Its main purpose is to be able to read .rda and .rds files, the files used for storing data in the R programming language, and convert them to Python objects for further processing.
The target audience includes users that want to open in Python datasets created in R. These include scientists working in both Python and R, scientists who want to compare results among the two languages using the same data, or simply Python scientists that want to be able to use the numerous datasets available in CRAN, the R repository of packages.
The package rpy2 can be used to interact with R from Python. This includes the ability to load data in the RData format, and to convert these data to equivalent Python objects. Although this is arguably the best package to achieve interaction between both languages, it has many disadvantages if one wants to use it just to load RData datasets. In the first place, the package requires an R installation, as it relies in launching an R interpreter and communicating with it. Secondly, launching R just to load data is inefficient, both in time and memory. Finally, this package inherits the GPL license from the R language, which is not compatible with most Python packages, typically released under more permissive licenses.
The recent package pyreadr also provides functionality to read some R datasets. It relies in the C library librdata in order to perform the parsing of the RData format. This adds an additional dependency from C building tools, and requires that the package is compiled for all the desired operating systems. Moreover, this package is limited by the functionalities available in librdata, which at the moment of writing does not include the parsing of common objects such as R lists and S4 objects. The license can also be a problem, as it is part of the GPL family and does not allow commercial use.
@tag
the editor you contacted:#143
Technical checks
For details about the pyOpenSci packaging requirements, see our packaging guide. Confirm each of the following by checking the box. This package:
Publication Options
JOSS Checks
paper.md
matching JOSS's requirements with a high-level description in the package root or ininst/
.Note: JOSS accepts our review as theirs. You will NOT need to go through another full review. JOSS will only review your paper.md file. Be sure to link to this pyOpenSci issue when a JOSS issue is opened for your package. Also be sure to tell the JOSS editor that this is a pyOpenSci reviewed package once you reach this step.
Are you OK with Reviewers Submitting Issues and/or pull requests to your Repo Directly?
This option will allow reviewers to open smaller issues that can then be linked to PR's rather than submitting a more dense text based review. It will also allow you to demonstrate addressing the issue via PR links.
Confirm each of the following by checking the box.
Please fill out our survey
submission and improve our peer review process. We will also ask our reviewers
and editors to fill this out.
P.S. Have feedback/comments about our review process? Leave a comment here
Editor and Review Templates
The editor template can be found here.
The review template can be found here.
Footnotes
Please fill out a pre-submission inquiry before submitting a data visualization package. ↩
The text was updated successfully, but these errors were encountered: