Live updates and sorting for R data explorer #287

jmcphers · 2024-04-01T23:44:04Z

This change adds live updates and sorting for the data explorer backend for R.

Addresses posit-dev/positron#2159 (sorting portion)
Addresses posit-dev/positron#2386
Addresses posit-dev/positron#2333

Most of this is accomplished by having the data explorer's backend maintain knowledge of the object it's looking at (in the form of a name/environment binding), and the current sorting state (in the form of a set of sorting keys and sorted row indices).
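As a rough illustration of the sorting state described above, here is a minimal Rust sketch. It assumes a column-major table of integers; `SortKey`, `SortState`, and `resort` are hypothetical names chosen for this example and are not taken from the PR. It keeps a list of sort keys plus the row order they produce, and recomputes that order on demand (for example, after the underlying data changes).

```rust
use std::cmp::Ordering;

/// One sort key: which column to sort by and in which direction.
/// (Hypothetical type for illustration; not taken from the PR.)
#[derive(Clone, Copy, Debug, PartialEq)]
pub struct SortKey {
    pub column: usize,
    pub ascending: bool,
}

/// Sorting state kept alongside the data: the active keys plus the
/// row order they produce (0-based row indices into the table).
#[derive(Debug, Default)]
pub struct SortState {
    pub keys: Vec<SortKey>,
    pub row_indices: Vec<usize>,
}

impl SortState {
    /// Recompute `row_indices` for a column-major table of integers.
    /// With no keys, this falls back to the natural row order.
    pub fn resort(&mut self, columns: &[Vec<i64>]) {
        let num_rows = columns.first().map_or(0, |c| c.len());
        let mut indices: Vec<usize> = (0..num_rows).collect();
        // `sort_by` is stable, so ties keep their original relative order.
        indices.sort_by(|&a, &b| {
            for key in &self.keys {
                let col = &columns[key.column];
                let ord = col[a].cmp(&col[b]);
                let ord = if key.ascending { ord } else { ord.reverse() };
                if ord != Ordering::Equal {
                    return ord;
                }
            }
            Ordering::Equal
        });
        self.row_indices = indices;
    }
}
```

Clearing `keys` and calling `resort` again restores the natural row order, which mirrors the "reset row indices after clearing sorting" step in the commit list.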

Includes integration tests for the new functionality.

lionel-

Looks good! I think it just needs some changes for tibble support.

crates/ark/src/data_explorer/r_data_explorer.rs

crates/harp/src/object.rs

crates/ark/src/data_explorer/r_data_explorer.rs

crates/ark/tests/data_explorer.rs

jmcphers · 2024-04-02T23:24:22Z

@lionel- thanks for the detailed review! I think I've addressed all your comments, LMK if anything still looks off.

DavisVaughan · 2024-04-02T23:52:45Z

I'll also take a look tomorrow morning

lionel-

Looks good!

I've just pushed a couple of commits that add tests for tibble data frames. I've confirmed that the previous way of subsetting columns causes a failure with this new test, but the new way succeeds.

DavisVaughan · 2024-04-03T13:43:51Z

I was expecting this to live update, it does in RStudio, am I doing it wrong?

library(nycflights13)
x <- flights
View(x)
x <- data.frame(a = 1:10)

Screen.Recording.2024-04-03.at.9.42.25.AM.mov

Oh, the env isn't passed through in the View() case. I think that will be important to do. I know that's the primary way my wife pulls up the data viewer in rstudio.

#[harp::register]
pub unsafe extern "C" fn ps_view_data_frame(x: SEXP, title: SEXP) -> anyhow::Result<SEXP> {
    let x = RObject::new(x);

    let title = RObject::new(title);
    let title = unwrap!(String::try_from(title), Err(_) => "".to_string());

    let main = RMain::get();
    let comm_manager_tx = main.get_comm_manager_tx().clone();

    RDataExplorer::start(title, x, None, comm_manager_tx)?;

    Ok(R_NilValue)
}

DavisVaughan · 2024-04-03T13:47:13Z

crates/ark/src/modules/positron/r_data_explorer.R

+#' @export
+.ps.null_count <- function(col) {
+    # Include NA and NaN values in the null count
+    is_null <- function(x) { is.na(x) | is.null(x) | is.nan(x) }


Suggested change

-    is_null <- function(x) { is.na(x) | is.null(x) | is.nan(x) }
+    is_null <- function(x) { is.na(x) | is.null(x) }

NaNs are included in is.na()

crates/harp/src/object.rs

DavisVaughan · 2024-04-03T13:56:59Z

crates/harp/src/table.rs

+/// - `column_index` - The index of the column to extract. Passed directly to R,
+///    so uses 1-based indexing.
+/// - `kind` - The kind of table `x` is (matrix or data frame).
+///
+pub fn tbl_get_column(x: SEXP, column_index: i32, kind: TableKind) -> anyhow::Result<RObject> {


@lionel- and I typically use a convention where we do the 0- vs 1-based index conversion at the FFI boundary. This typically makes the code cleaner and easier to understand in the long run, as it avoids a proliferation of + 1 or - 1 everywhere. So I'd expect column_index to be 0-based here and the conversion to happen internally as you build the call.

I like that. Done in 1c05603.
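A minimal sketch of that convention, assuming a 0-based `usize` index on the Rust side. `to_r_index` and `column_call` are hypothetical helpers written for this example, and the call is rendered as a string only to keep the sketch self-contained; the single `+ 1` lives in one place, right where the R call is built.

```rust
use std::convert::TryFrom;

/// Convert a 0-based Rust index to the 1-based index R expects.
/// Returns `None` if the result would not fit in an `i32`.
pub fn to_r_index(index: usize) -> Option<i32> {
    i32::try_from(index).ok()?.checked_add(1)
}

/// Render the text of an R `[[` extraction for a 0-based column index.
/// (Hypothetical helper: a string stands in for a real call object.)
pub fn column_call(object: &str, column_index: usize) -> Option<String> {
    // The only place where the 0-based -> 1-based shift happens.
    let r_index = to_r_index(column_index)?;
    Some(format!("{}[[{}]]", object, r_index))
}
```

Callers then work with 0-based indices throughout, and `column_call("df", 2)` produces the text `df[[3]]`.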

DavisVaughan · 2024-04-03T14:13:12Z

crates/ark/src/data_explorer/r_data_explorer.rs

+                    // Generate an initial set of row indices that are just the
+                    // row numbers
+                    let row_indices: Vec<i32> = (1..=shape.num_rows).collect();


I had to manually check that (1..=shape.num_rows) gives an empty vector when num_rows = 0! That's nice Rust behavior here, but I was a little surprised by it! I'm not sure if there is something we can do to make it clearer that we do indeed nicely handle the 0 row case.

It's unnecessarily verbose, but maybe this is clearer? 609d14a
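For reference, the range behavior under discussion can be isolated in a small function. This is a sketch only: `initial_row_indices` is a hypothetical name, and it is not the code from the commit mentioned above.

```rust
/// Initial row indices for a table with `num_rows` rows, 1-based as in R.
/// An inclusive range whose start exceeds its end is empty, so zero rows
/// yields an empty vector rather than a panic or `[1, 0]`.
pub fn initial_row_indices(num_rows: i32) -> Vec<i32> {
    (1..=num_rows).collect()
}
```

Wrapping the range in a named, documented function is one way to make the zero-row case explicit to readers without adding a branch.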

crates/ark/src/data_explorer/r_data_explorer.rs

DavisVaughan · 2024-04-03T14:33:42Z

crates/ark/src/data_explorer/r_data_explorer.rs

@@ -1,20 +1,25 @@
 //


Feature request - RStudio has this behavior where if you update the binding to something it can't show but then reupdate the binding to a data frame again, then it picks up where it left off just fine.

That's a pretty nice behavior for interactive work, where it's definitely possible you can temporarily mask that binding with something that isn't a data frame, but then you say "oops let me undo that" and the data viewer picks right back up.

library(nycflights13)
library(dplyr)

df <- flights

# cool!
df <- data.frame(x = 1:5)

# disconnect
df <- 3

# was hoping this would reupdate my existing window, like rstudio
df <- flights

(this may be a bad example because the rstudio viewer can indeed show 3, but try with like environment() or something and this still works)

Screen.Recording.2024-04-03.at.10.30.37.AM.mov

That's a great idea, but would be kind of tough to draw with the current set of stencils. Currently, the "disconnect" state tears down everything -- the thread that services RPCs, the underlying objects, the UI, etc. There's no concept of "reconnect" for a comm; you can open a new one, of course, but that's it.

Probably the way to do this would be for the comm to include some sort of unique key (separate from the comm ID). In the reopen case we'd open a new comm with a matching unique key and the UI could attach that to an existing explorer tab if there is one.

At any rate we don't even have a "disconnected" state yet so that probably has to happen first: posit-dev/positron#2516

Added a note to ☝️ to make sure we keep track of this idea.
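To make the unique-key idea concrete, here is a hypothetical sketch of a UI-side registry, written in Rust for consistency with the rest of the thread. None of these types exist in the PR, and the key format is an assumption. A newly opened comm that carries a known key is attached to the existing tab instead of opening a new one.

```rust
use std::collections::HashMap;

/// A data explorer tab and the comm currently backing it.
/// (Hypothetical type for illustration.)
#[derive(Clone, Debug, PartialEq)]
pub struct Tab {
    pub tab_id: u32,
    pub comm_id: String,
}

/// Hypothetical registry mapping a stable key (separate from the comm ID)
/// to the explorer tab that shows the corresponding data.
#[derive(Default)]
pub struct ExplorerTabs {
    tabs: HashMap<String, Tab>,
    next_tab_id: u32,
}

impl ExplorerTabs {
    /// Attach a newly opened comm. Reuses the tab with a matching key if
    /// one exists; otherwise creates a new tab.
    pub fn attach(&mut self, key: &str, comm_id: &str) -> Tab {
        if let Some(tab) = self.tabs.get_mut(key) {
            // "Reconnect": same tab, new comm.
            tab.comm_id = comm_id.to_string();
            return tab.clone();
        }
        let tab = Tab {
            tab_id: self.next_tab_id,
            comm_id: comm_id.to_string(),
        };
        self.next_tab_id += 1;
        self.tabs.insert(key.to_string(), tab.clone());
        tab
    }
}
```

The sketch leaves open how the key would be derived and when stale entries would be dropped; both would depend on the "disconnected" state tracked in the linked issue.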

crates/ark/src/data_explorer/r_data_explorer.rs

crates/ark/src/modules/positron/r_data_explorer.R

DavisVaughan · 2024-04-03T14:59:39Z

crates/ark/src/data_explorer/r_data_explorer.rs

        }
+        // Add the sort order per column
+        order.param("decreasing", RObject::try_from(decreasing)?);
+        order.param("method", RObject::from("radix"));


Hmm. Tricky:

On one hand, radix is nice because it gives

- Consistent C locale ordering
- decreasing per column
- It is fast

On the other hand

- People probably want to see sorting in their own locale, it affects how capitalization is handled and how special characters like ñ in Spanish or Ø in Danish are sorted (C locale doesn't put them in the "right place", typically they unexpectedly end up after z).
- RStudio seems to use base::order() without radix because it seems to respect the current locale (which is nice), but it doesn't seem to have to deal with the length 1 decreasing restriction because only 1 sort key seems to be allowed at a time.

Note how when I specify C locale, the Abc is no longer grouped with abc as one may expect. This is the kind of stuff we lose in the C locale. It gets more annoying in other languages too when it affects actual characters and not just capitalization.

Screen.Recording.2024-04-03.at.10.49.45.AM.mov

The fully generic way we do in dplyr is to use vctrs:::vec_order_radix(), which allows for length >1 decreasing and also has a chr_proxy_collate argument that transforms a character vector into a secondary character vector that can be sorted in the C locale but the result will be as if you sorted in the user locale. We use stringi with something like stringi::stri_sort_key(col, locale = "en") for this, but adding both of these deps to ark is kind of a big deal.
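The difference between C-locale ordering and a proxy-key sort can be sketched in Rust. This is only an illustration under a strong simplification: lowercasing stands in for a real collation key, so it addresses capitalization but not locale-specific placement of characters like ñ, which needs a proper collation key such as the stringi one mentioned above. Both function names are hypothetical.

```rust
/// Sort strings by raw bytes, which is what C-locale ordering amounts to
/// for UTF-8 text: uppercase ASCII sorts before lowercase, and non-ASCII
/// letters such as "ñ" sort after "z".
pub fn sort_c_locale(values: &[&str]) -> Vec<String> {
    let mut out: Vec<String> = values.iter().map(|s| s.to_string()).collect();
    out.sort(); // `String` ordering is byte-wise
    out
}

/// Sort by a proxy key that is itself compared byte-wise. Lowercasing is
/// a crude stand-in for a real collation key (an assumption made for this
/// sketch); it groups "Abc" with "abc" but still leaves "ñ" after "z".
pub fn sort_with_proxy(values: &[&str]) -> Vec<String> {
    let mut out: Vec<String> = values.iter().map(|s| s.to_string()).collect();
    // Stable sort: equal keys keep their original relative order.
    out.sort_by_cached_key(|s| s.to_lowercase());
    out
}
```

With the input `["abc", "Abd", "Abc", "z", "ñ"]`, the byte-wise sort puts both capitalized strings first, while the proxy sort keeps "abc" and "Abc" adjacent.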

I added radix on @lionel- 's recommendation. And you're right, RStudio does not support sorting on more than one column so it doesn't have to deal with some of these problems!

We can probably start with radix for now and see what feedback people give us. We might be able to get to 95% by using radix sometimes (as I suspect most sorting happens on numeric values) and only dropping it when one of the sort columns is a character.

I've promoted this to an issue here: posit-dev/positron#2647
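The "radix sometimes" idea could look roughly like the following sketch. `ColumnKind` and `order_method` are hypothetical, and returning "shell" for character columns is an assumption about which non-radix method of R's `order()` one would fall back to. Note that, as discussed above, the non-radix methods only accept a length-one `decreasing`, so this fallback would not by itself support mixed sort directions.

```rust
/// Column kinds relevant to picking a sort method (hypothetical enum).
#[derive(Clone, Copy, Debug, PartialEq)]
pub enum ColumnKind {
    Numeric,
    Character,
    Other,
}

/// Pick the `method` argument for R's `order()`: radix unless any of the
/// sort columns is a character vector, where locale-aware ordering matters.
pub fn order_method(sort_columns: &[ColumnKind]) -> &'static str {
    if sort_columns.iter().any(|kind| *kind == ColumnKind::Character) {
        "shell"
    } else {
        "radix"
    }
}
```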

jmcphers · 2024-04-03T22:02:55Z

Oh, the env isn't passed through in the View() case. I think that will be important to do. I know that's the primary way my wife pulls up the data viewer in rstudio.

@DavisVaughan Not hard to do! Implemented now.

jmcphers added 23 commits March 27, 2024 16:36

add test for data updates

ff9dd6c

use 'attr' as accessor for attributes

a439df2

Merge remote-tracking branch 'origin/main' into feature/data-explorer…

a9453fe

…-updates

pass an optional variable binding into the data explorer

b33b214

listen for prompt signals; emit updates when data changes

a16fc2c

supply a binding in data explorer test

3102325

add a test for schema level updates

1ed7d1f

store column schema in a cache

1e659c8

test for schema vs. data update

34d180d

add test for refreshed schema

6c2a49f

Merge remote-tracking branch 'origin/main' into feature/data-explorer…

30ad29d

…-updates

store current set of sort keys

a709a6d

super basic sorting implementation

0d434a1

reset row indices after clearing sorting

78817e7

support for ascending/descending sorting

65480b6

add unit test for Vec<bool> conversion

09227e4

add a test to cover multi column sort and descending sort

460203b

ensure sorting also affects row labels

c8e20a3

implement and test sorting by matrix column

a317f70

Merge remote-tracking branch 'origin/main' into feature/data-explorer…

28ce156

…-sorting

re-sort data after detecting a change

03bb91d

close the comm when binding goes away

8dd0955

omit length when computing column type names

cee26b4

lionel- reviewed Apr 2, 2024

jmcphers added 6 commits April 2, 2024 09:34

Merge remote-tracking branch 'origin/main' into feature/data-explorer…

20dd0f1

…-sorting

clean up unused imports

9eb3bc3

simplify construction of bool vector

7216ee3

better naming for info about object's name/env

76801bc

simplify ternary expression

eef64a1

specify radix sort

f52d9fa

jmcphers added 2 commits April 2, 2024 11:41

protect result of update check

216221e

use [[ to extract columns from data frames

bf01f7a

jmcphers force-pushed the feature/data-explorer-sorting branch from 709783a to bf01f7a Compare April 2, 2024 21:40

implement null count summary stats

8ceb87f

jmcphers requested a review from lionel- April 2, 2024 23:23

lionel- added 2 commits April 3, 2024 08:53

Extract first test in a reusable function

b00ea6f

Test data explorer with a tibble

c0c1e45

lionel- approved these changes Apr 3, 2024

Clean up tibble after test

e8bc98a

DavisVaughan reviewed Apr 3, 2024

jmcphers and others added 5 commits April 3, 2024 08:51

Update crates/ark/src/data_explorer/r_data_explorer.rs

cf43115

Co-authored-by: Davis Vaughan <davis@rstudio.com>

simplify null/NA counting

db0e77c

Merge remote-tracking branch 'origin/main' into feature/data-explorer…

007eabd

…-sorting

monitor variable for updates even if provided via View()

d1d59a4

simplify object name derivation

a7aa213

jmcphers and others added 3 commits April 3, 2024 15:11

convert indices to 1-based at the FFI boundary

1c05603

clarify behavior of initial row population

609d14a

Update crates/harp/src/object.rs

22ec3bd

Co-authored-by: Davis Vaughan <davis@rstudio.com>

jmcphers merged commit 3f96567 into main Apr 3, 2024
1 check passed

jmcphers mentioned this pull request Apr 4, 2024

R: Data Explorer column sort is not lexical for character strings posit-dev/positron#2647

Open

posit-dev deleted a comment from jmcphers Apr 5, 2024

DavisVaughan deleted the feature/data-explorer-sorting branch October 14, 2024 12:41
