-
Notifications
You must be signed in to change notification settings - Fork 15
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Live updates and sorting for R data explorer #287
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good! I think it just needs some changes for tibble support.
709783a
to
bf01f7a
Compare
@lionel- thanks for the detailed review! I think I've addressed all your comments, LMK if anything still looks off. |
I'll also take a look tomorrow morning |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good!
I've just pushed a couple of commits that add tests for tibble data frames. I've confirmed that the previous way of subsetting columns causes a failure with this new test, but the new way succeeds.
I was expecting this to live update, it does in RStudio, am I doing it wrong? library(nycflights13)
x <- flights
View(x)
x <- data.frame(a = 1:10)
Screen.Recording.2024-04-03.at.9.42.25.AM.movOh, the env isn't passed through in the #[harp::register]
pub unsafe extern "C" fn ps_view_data_frame(x: SEXP, title: SEXP) -> anyhow::Result<SEXP> {
let x = RObject::new(x);
let title = RObject::new(title);
let title = unwrap!(String::try_from(title), Err(_) => "".to_string());
let main = RMain::get();
let comm_manager_tx = main.get_comm_manager_tx().clone();
RDataExplorer::start(title, x, None, comm_manager_tx)?;
Ok(R_NilValue)
} |
#' @export | ||
.ps.null_count <- function(col) { | ||
# Include NA and NaN values in the null count | ||
is_null <- function(x) { is.na(x) | is.null(x) | is.nan(x) } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is_null <- function(x) { is.na(x) | is.null(x) | is.nan(x) } | |
is_null <- function(x) { is.na(x) | is.null(x) } |
NaN
s are included in is.na()
crates/harp/src/table.rs
Outdated
/// - `column_index` - The index of the column to extract. Passed directly to R, | ||
/// so uses 1-based indexing. | ||
/// - `kind` - The kind of table `x` is (matrix or data frame). | ||
/// | ||
pub fn tbl_get_column(x: SEXP, column_index: i32, kind: TableKind) -> anyhow::Result<RObject> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@lionel- and I typically use a convention where we do the 0 vs 1 based indexing at the FFI boundary. This typically makes the code cleaner and easier to understand in the long run as it avoids a proliferation of + 1
or - 1
everywhere. So I'd expect column_index
to be 0 based here and the conversion to happen internally as you build the call
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like that. Done in 1c05603.
// Generate an intial set of row indices that are just the | ||
// row numbers | ||
let row_indices: Vec<i32> = (1..=shape.num_rows).collect(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I had to manually check that (1..shape.num_rows)
gives an empty vector when num_rows = 0
! That's nice rust behavior here, but I was a little surprised by it! I'm not sure if there is something we can do to make it clearer that we do indeed nicely handle the 0 row case
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's unnecessarily verbose, but maybe this is clearer? 609d14a
@@ -1,20 +1,25 @@ | |||
// |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Feature request - RStudio has this behavior where if you update the binding to something it can't show but then reupdate the binding to a data frame again, then it picks up where it left off just fine.
That's a pretty nice behavior for interactive work, where its definitely possible you can temporarily mask that binding with something that isn't a data frame, but then you say "oops let me undo that" and the data viewer picks right back up.
library(nycflights13)
library(dplyr)
df <- flights
# cool!
df <- data.frame(x = 1:5)
# disconnect
df <- 3
# was hoping this would reupdate my existing window, like rstudio
df <- flights
(this may be a bad example because the rstudio viewer can indeed show 3
, but try with like environment()
or something and this still works)
Screen.Recording.2024-04-03.at.10.30.37.AM.mov
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's a great idea, but would be kind of tough to draw with the current set of stencils. Currently, the "disconnect" state tears down everything -- the thread that services RPCs, the underlying objects, the UI, etc. There's no concept of "reconnect" for a comm; you can open a new one, of course, but that's it.
Probably the way to do this would be for the comm to include some sort of unique key (separate from the comm ID). In the reopen case we'd open a new comm with a matching unique key and the UI could attach that to an existing explorer tab if there is one.
At any rate we don't even have a "disconnected" state yet so that probably has to happen first: posit-dev/positron#2516
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added a note to ☝️ to make sure we keep track of this idea.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks!
} | ||
// Add the sort order per column | ||
order.param("decreasing", RObject::try_from(decreasing)?); | ||
order.param("method", RObject::from("radix")); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm. Tricky:
- On one hand,
radix
is nice because it gives- Consistent C locale ordering
decreasing
per column- It is fast
- On the other hand
- People probably want to see sorting in their own locale, it affects how capitalization is handled and how special characters like
ñ
in Spanish orØ
in Danish are sorted (C locale doesn't put them in the "right place", typically they unexpectedly end up afterz
).
- People probably want to see sorting in their own locale, it affects how capitalization is handled and how special characters like
RStudio seems to use base::order()
without radix
because it seems to respect the current locale (which is nice), but it doesn't seem to have to deal with the length 1 decreasing
restriction because only 1 sort key seems to be allowed at a time.
Note how when I specify C locale, the Abc
is no longer grouped with abc
as one may expect. This is the kind of stuff we lose in the C locale. It gets more annoying in other languages too when it affects actual characters and not just capitalization.
Screen.Recording.2024-04-03.at.10.49.45.AM.mov
The fully generic way we do in dplyr is to use vctrs:::vec_order_radix()
, which allows for length >1 decreasing
and also has a chr_proxy_collate
argument that transforms a character vector into a secondary character vector that can be sorted in the C locale but the result with be as if you sorted in the user locale. We use stringi with something like stringi::stri_sort_key(col, locale = "en")
for this but adding both of these deps to ark is kind of a big deal.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added radix
on @lionel- 's recommendation. And you're right, RStudio does not support sorting on more than one column so it doesn't have to deal with some of these problems!
We can probably start with radix
for now and see what feedback people give us. We might be able to get to 95% by using radix
sometimes (as I suspect most sorting happens on numeric values) and only dropping it when one of the sort columns is a character.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've promoted this to an issue here: posit-dev/positron#2647
Co-authored-by: Davis Vaughan <davis@rstudio.com>
@DavisVaughan Not hard to do! Implemented now. |
Co-authored-by: Davis Vaughan <davis@rstudio.com>
This change adds live updates and sorting for the data explorer backend for R.
Addresses posit-dev/positron#2159 (sorting portion)
Addresses posit-dev/positron#2386
Addresses posit-dev/positron#2333
Most of this is accomplished by having the data explorer's backend maintain knowledge of the object it's looking at (in the form of a name/environment binding), and the current sorting state (in the form of a set of sorting keys and sorted row indices).
Includes integration tests for the new functionality.