Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Avoid //expr XPaths #1358

Closed
40 tasks done
MichaelChirico opened this issue Jun 4, 2022 · 3 comments · Fixed by #1606
Closed
40 tasks done

Avoid //expr XPaths #1358

MichaelChirico opened this issue Jun 4, 2022 · 3 comments · Fixed by #1606

Comments

@MichaelChirico
Copy link
Collaborator

MichaelChirico commented Jun 4, 2022

The following currently linters use //expr:

  • any_duplicated_linter.R
  • any_is_na_linter.R
  • brace_linter.R
  • class_equals_linter.R
  • condition_message_linter.R
  • conjunct_test_linter.R
  • consecutive_stopifnot_linter.R
  • duplicate_argument_linter.R
  • equals_na_linter.R
  • expect_comparison_linter.R
  • expect_identical_linter.R
  • expect_length_linter.R
  • expect_named_linter.R
  • expect_not_linter.R
  • expect_null_linter.R
  • expect_s3_class_linter.R
  • expect_true_false_linter.R
  • expect_type_linter.R
  • fixed_regex_linter.R
  • function_left_parentheses_linter.R
  • ifelse_censor_linter.R
  • inner_combine_linter.R
  • literal_coercion_linter.R
  • missing_package_linter.R
  • nested_ifelse_linter.R
  • object_usage_linter.R
  • outer_negation_linter.R
  • package_hooks_linter.R
  • paste_linter.R
  • pipe_call_linter.R
  • redundant_ifelse_linter.R
  • regex_subset_linter.R
  • seq_linter.R
  • sprintf_linter.R
  • string_boundary_linter.R
  • strings_as_factors_linter.R
  • system_file_linter.R
  • unneeded_concatenation_linter.R
  • unused_import_linter.R
  • yoda_test_linter.R

Original issue raised

As seen in #1353, #1340, #1310, there are subtle performance implications to the way we write our XPaths.

One thing that became clear is that writing //NODE1[NODE2] is slower than //NODE2[parent::NODE1] if NODE2 is far less frequent than NODE1.

And <expr> is by far the most common node; here's a guess at the general frequency by tabulating across r-devel and my local R packages/scripts:

library(data.table)
library(knitr)

get_r_files = function(dir) list.files(dir, pattern = "\\.R$", recursive = TRUE, full.names = TRUE)
get_freq = function(f) {
  pc = tryCatch(parse(f), error = identity, warning = identity)
  if (inherits(pc, "condition")) return(NULL)
  setDT(getParseData(pc))[, .N, by = token]
}

r_files = c(get_r_files("~/github"), get_r_files("~/svn"))
token_freq = rbindlist(lapply(r_files, get_freq))
token_freq = token_freq[, .(N = sum(N)), by = token][order(-N)]
token_freq[, pct := round(100 * N/sum(N), 1)]
knitr::kable(token_freq[pct > 1])
token N pct
expr 5713829 36.4
',' 1514487 9.6
SYMBOL 1357662 8.6
'(' 1119392 7.1
')' 1119392 7.1
NUM_CONST 1086271 6.9
SYMBOL_FUNCTION_CALL 917895 5.8
STR_CONST 375598 2.4
LEFT_ASSIGN 334988 2.1
COMMENT 307772 2.0
EQ_SUB 263926 1.7
SYMBOL_SUB 254903 1.6

i.e., //expr eliminates at most 2/3 of tokens, while other tokens typically eliminate >90% of the tree.

The trade-off here is for readability. XPaths with a lot of parent::/preceding-sibling::/following-sibling:: axes tend to be less readable -- our current XPaths are fairly readable IMO. Moreover, most of our linters are built around expression-level lints, and having a comparatively small tree is the norm in that case -- I guess the overhead of iterating over expressions is usually higher than the savings from fine-tuning XPaths, and that in the presence of cacheing, performance gains will be unnoticeable in all but edge cases.

So we should proceed gently on this issue. Some ideas:

@AshesITR
Copy link
Collaborator

AshesITR commented Jun 19, 2022

~copied into the main issue body for tracking progress at a glance~

@IndrajeetPatil
Copy link
Collaborator

Just wanted to also include here a benchmark.

I am assuming the performance benefits scale with the complexity of the code tree.

library(xml2)
library(xmlparsedata)

x <- "switch(stat,
      o = {
        x <- 0.01
      },
      b = {
        x <- 0.05
      },
      # else
      {
        x <- 0.001
      }
    )"

xml <- xml_parse_data(getParseData(parse(text = x)), pretty = TRUE)

xpath1 <- "//expr[FUNCTION and @line1 != @line2 and not(expr[OP-LEFT-BRACE])]"
xpath2 <- "//FUNCTION/parent::expr[@line1 != @line2 and not(expr[OP-LEFT-BRACE])]"

xml <- as_xml_document(xml)

bench::mark(
  "with //expr"    = xml_find_all(xml, xpath1),
  "without //expr" = xml_find_all(xml, xpath2)
)
#> # A tibble: 2 × 6
#>   expression          min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>     <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 with //expr      20.3µs   21.1µs    46530.    26.4KB     18.6
#> 2 without //expr   15.2µs   15.9µs    61757.        0B     30.9

Created on 2022-10-02 with reprex v2.0.2

@MichaelChirico
Copy link
Collaborator Author

see some of the other cited issues, e.g. #1348. the issue gets way worse on complex R files.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants