Optimizing the R code parsing #618

Open · Tal500 opened this issue May 14, 2023 · 0 comments

Tal500 commented May 14, 2023

Following the discussion started by @kobytg17 in #599 (in brief: we have an enormous amount of R code at our company), I've been reviewing the parsing process to understand why it is so slow. The main parsing function of this language server is:

languageserver/R/document.R, lines 402 to 429 at 1e71561:

parse_document <- function(uri, content) {
    if (length(content) == 0) {
        content <- ""
    }
    # replace tab with a space since the width of a tab is 1 in LSP but 8 in getParseData().
    content <- gsub("\t", " ", content, fixed = TRUE)
    expr <- tryCatch(parse(text = content, keep.source = TRUE), error = function(e) NULL)
    if (!is.null(expr)) {
        parse_env <- function() {
            env <- new.env(parent = .GlobalEnv)
            env$packages <- character()
            env$nonfuncts <- character()
            env$functs <- character()
            env$functions <- list()
            env$signatures <- list()
            env$definitions <- list()
            env$documentation <- list()
            env$xml_data <- NULL
            env$xml_doc <- NULL
            env
        }
        env <- parse_env()
        parse_expr(content, expr, env)
        env$packages <- basename(find.package(env$packages, quiet = TRUE))
        env$xml_data <- xmlparsedata::xml_parse_data(expr)
        env
    }
}
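
To see which of these calls actually dominates, one can time them individually on a large input (a rough sketch; "big_script.R" is a hypothetical stand-in for one of our large files):

    # Hypothetical large input file, read into a single string.
    content <- paste(readLines("big_script.R"), collapse = "\n")

    system.time(expr <- parse(text = content, keep.source = TRUE))  # the parse() step
    system.time(xml_data <- xmlparsedata::xml_parse_data(expr))     # the XML encoding step
    system.time(xml_doc <- xml2::read_xml(xml_data))                # the XML decoding step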

It seems that the main parsing process, for each file, proceeds in the following stages:

  1. Parse the file with R's built-in parse() function.
  2. Scan the result of step 1 (a recursive walk over the AST) and save this result in a recursive R environment object.
  3. Encode the result of step 1 into a string representing an XML document (line 426, the xml_parse_data() call in the code above); this XML string is then decoded again later (a standalone round-trip sketch follows this list), in:

         update_parse_data = function(uri, parse_data) {
             if (!is.null(parse_data$xml_data)) {
                 parse_data$xml_doc <- tryCatch(
                     xml2::read_xml(parse_data$xml_data), error = function(e) NULL)
             }
             self$documents$get(uri)$update_parse_data(parse_data)
         },
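
To make the step-3 round trip concrete, here is a minimal standalone sketch of the encode/decode cycle (the toy expression is mine, and the XPath query is only illustrative of the kind of lookup the XML document is used for):

    expr <- parse(text = "x <- 1\nprint(x + 1)", keep.source = TRUE)
    xml_string <- xmlparsedata::xml_parse_data(expr)  # encode the parse data as XML text
    xml_doc <- xml2::read_xml(xml_string)             # decode the text back into a document
    xml2::xml_find_all(xml_doc, "//SYMBOL_FUNCTION_CALL")  # e.g. find the print() call token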

I'm trying to see how we can make this parsing process faster (and, hopefully, make it consume less RAM).

One option is to handle this problem the way RStudio does: write the whole language server purely in C++. RStudio doesn't have a language server, but its core is written entirely in C++. To get an impression, see their tokenizing code here: https://github.com/rstudio/rstudio/blob/02e3810fabbca032fcb664196a1c008e6306ac7f/src/cpp/core/include/core/r_util/RTokenizer.hpp#L346. Of course, this doesn't seem like a reasonable option to me, given the huge responsibility and refactoring it would require.

Therefore, I'm focusing on finding ways to optimize the parsing. I see two immediate solutions:

  1. Get rid of step 3 above. The XML encoding and decoding intuitively look like wasted computation; there must be a way to obtain the needed parse information directly (see the first sketch below).
  2. Implement step 2 (the recursive walk over the AST that produces the recursive R environment object) in raw C++ (see the second sketch below). Later, if the memory footprint is still high (it probably will be, since the result is the same), we could also replace the environment object with a raw C++ recursive struct or similar, and implement some more AST utility functions in C++.
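
For option 1, one possible direction (just a sketch, not the server's actual API) is to query the data frame from getParseData() directly, skipping the XML text entirely:

    expr <- parse(text = "x <- 1\nprint(x + 1)", keep.source = TRUE)
    # getParseData() returns a data frame with token, text, line1/col1, line2/col2, id, parent.
    pd <- utils::getParseData(expr)

    # e.g. locate all function-call tokens without any XML encode/decode step:
    pd[pd$token == "SYMBOL_FUNCTION_CALL", c("line1", "col1", "text")]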

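For option 2, here is a very rough sense of what a C++ walk over an R expression could look like via Rcpp. This is hypothetical sketch code, not a proposed implementation; it only collects the head symbols of calls (including operators):

    Rcpp::cppFunction('
    CharacterVector collect_call_heads(SEXP e) {
        // Iterative walk over the pairlist-based AST of a single call.
        std::vector<std::string> out;
        std::vector<SEXP> stack;
        stack.push_back(e);
        while (!stack.empty()) {
            SEXP cur = stack.back();
            stack.pop_back();
            if (TYPEOF(cur) == LANGSXP) {
                SEXP head = CAR(cur);
                if (TYPEOF(head) == SYMSXP)
                    out.push_back(CHAR(PRINTNAME(head)));
                for (SEXP p = CDR(cur); p != R_NilValue; p = CDR(p))
                    stack.push_back(CAR(p));
            }
        }
        return wrap(out);
    }')

    collect_call_heads(quote(print(x + 1)))  # "print" "+"
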
What do you think?
