Optimizing the R code parsing #618

Open · Tal500 opened this issue May 14, 2023 · 0 comments

Tal500 commented May 14, 2023

Following the discussion started by @kobytg17 in #599 (in brief: we have an enormous amount of R code at our company), I've been reviewing the parsing process to understand why it is so slow. The main parsing function of this language server is:

languageserver/R/document.R, lines 402 to 429 at 1e71561:

parse_document <- function(uri, content) {
    if (length(content) == 0) {
        content <- ""
    }
    # replace tab with a space since the width of a tab is 1 in LSP but 8 in getParseData().
    content <- gsub("\t", " ", content, fixed = TRUE)
    expr <- tryCatch(parse(text = content, keep.source = TRUE), error = function(e) NULL)
    if (!is.null(expr)) {
        parse_env <- function() {
            env <- new.env(parent = .GlobalEnv)
            env$packages <- character()
            env$nonfuncts <- character()
            env$functs <- character()
            env$functions <- list()
            env$signatures <- list()
            env$definitions <- list()
            env$documentation <- list()
            env$xml_data <- NULL
            env$xml_doc <- NULL
            env
        }
        env <- parse_env()
        parse_expr(content, expr, env)
        env$packages <- basename(find.package(env$packages, quiet = TRUE))
        env$xml_data <- xmlparsedata::xml_parse_data(expr)
        env
    }
}
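
To see which of these calls actually dominates, one can time them individually on a large input (a rough sketch; "big_script.R" is a hypothetical stand-in for one of our large files):

    # Hypothetical large input file, read into a single string.
    content <- paste(readLines("big_script.R"), collapse = "\n")

    system.time(expr <- parse(text = content, keep.source = TRUE))  # the parse() step
    system.time(xml_data <- xmlparsedata::xml_parse_data(expr))     # the XML encoding step
    system.time(xml_doc <- xml2::read_xml(xml_data))                # the XML decoding step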

It seems that the main parsing process, for each file, proceeds in the following stages:

  1. Parse the file with R's built-in parse() function.
  2. Scan the result of step 1 (a recursive walk over the AST) and save this result in a recursive R environment object.
  3. Encode the result of step 1 into a string representing an XML document (line 426, the xml_parse_data() call in the code above); this XML string is then decoded again later (a standalone round-trip sketch follows this list), in:

         update_parse_data = function(uri, parse_data) {
             if (!is.null(parse_data$xml_data)) {
                 parse_data$xml_doc <- tryCatch(
                     xml2::read_xml(parse_data$xml_data), error = function(e) NULL)
             }
             self$documents$get(uri)$update_parse_data(parse_data)
         },
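
To make the step-3 round trip concrete, here is a minimal standalone sketch of the encode/decode cycle (the toy expression is mine, and the XPath query is only illustrative of the kind of lookup the XML document is used for):

    expr <- parse(text = "x <- 1\nprint(x + 1)", keep.source = TRUE)
    xml_string <- xmlparsedata::xml_parse_data(expr)  # encode the parse data as XML text
    xml_doc <- xml2::read_xml(xml_string)             # decode the text back into a document
    xml2::xml_find_all(xml_doc, "//SYMBOL_FUNCTION_CALL")  # e.g. find the print() call token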

I'm trying to see how we can make this parsing process faster (and, hopefully, make it consume less RAM).

One option is to handle this problem the way RStudio does: write the whole language server purely in C++. RStudio doesn't have a language server, but its core is written entirely in C++. To get an impression, see their tokenizing code here: https://github.com/rstudio/rstudio/blob/02e3810fabbca032fcb664196a1c008e6306ac7f/src/cpp/core/include/core/r_util/RTokenizer.hpp#L346. Of course, this doesn't seem like a reasonable option to me, given the huge responsibility and refactoring it would require.

Therefore, I'm focusing on finding ways to optimize the parsing. I see two immediate solutions:

  1. Get rid of step 3 above. The XML encoding and decoding intuitively look like wasted computation; there must be a way to obtain the needed parse information directly (see the first sketch below).
  2. Implement step 2 (the recursive walk over the AST that produces the recursive R environment object) in raw C++ (see the second sketch below). Later, if the memory footprint is still high (it probably will be, since the result is the same), we could also replace the environment object with a raw C++ recursive struct or similar, and implement some more AST utility functions in C++.
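
For option 1, one possible direction (just a sketch, not the server's actual API) is to query the data frame from getParseData() directly, skipping the XML text entirely:

    expr <- parse(text = "x <- 1\nprint(x + 1)", keep.source = TRUE)
    # getParseData() returns a data frame with token, text, line1/col1, line2/col2, id, parent.
    pd <- utils::getParseData(expr)

    # e.g. locate all function-call tokens without any XML encode/decode step:
    pd[pd$token == "SYMBOL_FUNCTION_CALL", c("line1", "col1", "text")]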

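For option 2, here is a very rough sense of what a C++ walk over an R expression could look like via Rcpp. This is hypothetical sketch code, not a proposed implementation; it only collects the head symbols of calls (including operators):

    Rcpp::cppFunction('
    CharacterVector collect_call_heads(SEXP e) {
        // Iterative walk over the pairlist-based AST of a single call.
        std::vector<std::string> out;
        std::vector<SEXP> stack;
        stack.push_back(e);
        while (!stack.empty()) {
            SEXP cur = stack.back();
            stack.pop_back();
            if (TYPEOF(cur) == LANGSXP) {
                SEXP head = CAR(cur);
                if (TYPEOF(head) == SYMSXP)
                    out.push_back(CHAR(PRINTNAME(head)));
                for (SEXP p = CDR(cur); p != R_NilValue; p = CDR(p))
                    stack.push_back(CAR(p));
            }
        }
        return wrap(out);
    }')

    collect_call_heads(quote(print(x + 1)))  # "print" "+"
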
What do you think?
