Error with xml_serialize()/xml_unserialize() roundtrip: Opening and ending tag mismatch [PATCH] #407

Closed
HenrikBengtsson opened this issue Oct 3, 2023 · 3 comments
Labels
bug an unexpected problem or unintended behavior

Comments

@HenrikBengtsson
Contributor

Issue

The xml_serialize()/xml_unserialize() roundtrip fails with: "Opening and ending tag mismatch: link line 12 and head [76]"

I'd expect a roundtrip to always work.

Reproducible Example

doc <- xml2::read_html("https://www.r-project.org")
doc
#> {html_document}
#> <html lang="en">
#> [1] <head>\n<meta http-equiv="Content-Type" content="text/html; charset=UTF-8 ...
#> [2] <body>\n    <div class="container page">\n      <div class="row">\n       ...

raw <- xml2::xml_serialize(doc, connection = NULL)
doc2 <- xml2::xml_unserialize(raw)
#> Error in read_xml.raw(charToRaw(enc2utf8(x)), "UTF-8", ..., as_html = as_html,  : 
#>   Opening and ending tag mismatch: link line 12 and head [76]

Traceback:

> traceback()
4: read_xml.raw(charToRaw(enc2utf8(x)), "UTF-8", ..., as_html = as_html, 
       options = options)
3: read_xml.character(unclass(object), ...)
2: read_xml(unclass(object), ...)
1: xml2::xml_unserialize(raw)

Session Info

> devtools::session_info() # Paste output below
Session info ─────────────────────
 setting  value
 version  R version 4.3.1 (2023-06-16)
 os       Ubuntu 22.04.3 LTS
 system   x86_64, linux-gnu
 ui       X11
 language (EN)
 collate  en_US.UTF-8
 ctype    en_US.UTF-8
 tz       America/Los_Angeles
 date     2023-10-03
 pandoc   3.1.7 @ /home/henrik/shared/software/CBI/pandoc-3.1.7/bin/pandoc

Packages ──────────────────────
 package     * version date (UTC) lib source
 cachem        1.0.8   2023-05-01 [1] CRAN (R 4.3.0)
 callr         3.7.3   2022-11-02 [1] CRAN (R 4.3.0)
 cli           3.6.1   2023-03-23 [1] CRAN (R 4.3.0)
 crayon        1.5.2   2022-09-29 [1] RSPM (R 4.3.0)
 devtools      2.4.5   2022-10-11 [1] RSPM (R 4.3.0)
 digest        0.6.33  2023-07-07 [1] CRAN (R 4.3.1)
 ellipsis      0.3.2   2021-04-29 [1] CRAN (R 4.3.0)
 fastmap       1.1.1   2023-02-24 [1] CRAN (R 4.3.0)
 fs            1.6.3   2023-07-20 [1] RSPM (R 4.3.0)
 glue          1.6.2   2022-02-24 [1] CRAN (R 4.3.0)
 htmltools     0.5.6   2023-08-10 [1] CRAN (R 4.3.1)
 htmlwidgets   1.6.2   2023-03-17 [1] RSPM (R 4.3.0)
 httpuv        1.6.11  2023-05-11 [1] RSPM (R 4.3.0)
 later         1.3.1   2023-05-02 [1] CRAN (R 4.3.0)
 lifecycle     1.0.3   2022-10-07 [1] CRAN (R 4.3.0)
 magrittr      2.0.3   2022-03-30 [1] RSPM (R 4.3.0)
 memoise       2.0.1   2021-11-26 [1] CRAN (R 4.3.0)
 mime          0.12    2021-09-28 [1] CRAN (R 4.3.0)
 miniUI        0.1.1.1 2018-05-18 [1] RSPM (R 4.3.0)
 pkgbuild      1.4.2   2023-06-26 [1] CRAN (R 4.3.1)
 pkgload       1.3.3   2023-09-22 [1] CRAN (R 4.3.1)
 prettyunits   1.2.0   2023-09-24 [1] RSPM (R 4.3.0)
 processx      3.8.2   2023-06-30 [1] CRAN (R 4.3.1)
 profvis       0.3.8   2023-05-02 [1] RSPM (R 4.3.0)
 promises      1.2.1   2023-08-10 [1] CRAN (R 4.3.1)
 ps            1.7.5   2023-04-18 [1] RSPM (R 4.3.0)
 purrr         1.0.2   2023-08-10 [1] CRAN (R 4.3.1)
 R6            2.5.1   2021-08-19 [1] CRAN (R 4.3.0)
 Rcpp          1.0.11  2023-07-06 [1] CRAN (R 4.3.1)
 remotes       2.4.2.1 2023-07-18 [1] CRAN (R 4.3.1)
 rlang         1.1.1   2023-04-28 [1] CRAN (R 4.3.0)
 sessioninfo   1.2.2   2021-12-06 [1] RSPM (R 4.3.0)
 shiny         1.7.5   2023-08-12 [1] RSPM (R 4.3.0)
 stringi       1.7.12  2023-01-11 [1] CRAN (R 4.3.0)
 stringr       1.5.0   2022-12-02 [1] CRAN (R 4.3.0)
 urlchecker    1.0.1   2021-11-30 [1] RSPM (R 4.3.0)
 usethis       2.2.2   2023-07-06 [1] CRAN (R 4.3.1)
 vctrs         0.6.3   2023-06-14 [1] CRAN (R 4.3.0)
 xtable        1.8-4   2019-04-21 [1] RSPM (R 4.3.0)

 [1] /home/henrik/R/ubuntu22_04-x86_64-pc-linux-gnu-library/4.3-CBI-gcc11
 [2] /home/henrik/shared/software/CBI/_ubuntu22_04/R-4.3.1-gcc11/lib/R/library
@HenrikBengtsson
Contributor Author

Same example using the example HTML file that comes with the package:

library(xml2)
file <- system.file("extdata", "r-project.html", package = "xml2")
doc <- read_html(file)
class(doc)
#> [1] "xml_document" "xml_node"

raw <- xml2::xml_serialize(doc, connection = NULL)
doc2 <- xml2::xml_unserialize(raw)
#> Error in read_xml.raw(charToRaw(enc2utf8(x)), "UTF-8", ..., as_html = as_html,  : 
#>  Opening and ending tag mismatch: link line 12 and head [76]

Now, the problem seems to be with the xml_document class; it works with the xml_nodeset class:

children <- xml_children(doc)
class(children)
#> [1] "xml_nodeset"

raw <- xml2::xml_serialize(children, connection = NULL)
children2 <- xml2::xml_unserialize(raw)
children2
#> {xml_nodeset (2)}
#> [1] <head>\n  <meta http-equiv="Content-Type" content="text/html; charset=UTF ...
#> [2] <body>\n  <div class="container page">\n    <div class="row">\n      <div ...

all.equal(children2, children)
#> [1] TRUE

@HenrikBengtsson
Contributor Author

I think it's because xml_unserialize() attempts to read it as an XML file and not as an HTML file in:

xml2/R/xml_serialize.R

Lines 66 to 67 in ef2310b

} else if (inherits(object, "xml_serialized_document")) {
res <- read_xml(unclass(object), ...)

We get the same error message if we try:

file <- system.file("extdata", "r-project.html", package = "xml2")
doc <- xml2::read_xml(file)
#> Error in read_xml.character(file) : 
#>   Opening and ending tag mismatch: link line 14 and head [76]
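
To illustrate the diagnosis above, here is a minimal sketch that parses the same text twice, once with the XML parser and once with the HTML parser. It assumes that as.character(doc) yields the same text that xml_serialize() stores for a document, which is what the patch further down suggests; the variable names are only for illustration.

```r
library(xml2)

file <- system.file("extdata", "r-project.html", package = "xml2")
doc <- read_html(file)

# Text form of the document (assumed to match what xml_serialize() stores)
txt <- as.character(doc)

# Parsing the text as XML is expected to fail with the tag mismatch error;
# capture the error message instead of stopping
xml_result <- tryCatch(read_xml(txt), error = function(e) conditionMessage(e))

# Parsing the same text with the HTML parser should succeed
html_result <- read_xml(txt, as_html = TRUE)
class(html_result)
```

If the diagnosis holds, xml_result holds the error message as a character string and html_result is a regular xml_document.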

@HenrikBengtsson
Contributor Author

HenrikBengtsson commented Oct 7, 2023

One solution is to have xml_serialize.xml_document() also record the document type and then have xml_unserialize() set as_html accordingly. Here's a working patch:

$ git diff -u R/xml_serialize.R
diff --git a/R/xml_serialize.R b/R/xml_serialize.R
index 3f7357f..74e1608 100644
--- a/R/xml_serialize.R
+++ b/R/xml_serialize.R
@@ -22,7 +22,7 @@ xml_serialize.xml_document <- function(object, connection, ...) {
     connection <- file(connection, "w", raw = TRUE)
     on.exit(close(connection))
   }
-  serialize(structure(as.character(object, ...), class = "xml_serialized_document"), connection)
+  serialize(structure(as.character(object, ...), doc_type = doc_type(object), class = "xml_serialized_document"), connection)
 }
 
 #' @export
@@ -64,7 +64,13 @@ xml_unserialize <- function(connection, ...) {
     # Select only the root
     res <- xml_find_first(x, "/node()")
   } else if (inherits(object, "xml_serialized_document")) {
-    res <- read_xml(unclass(object), ...)
+    read_xml_int <- function(object, as_html = FALSE, ...) {
+      if (missing(as_html)) {
+        as_html <- identical(attr(object, "doc_type", exact = TRUE), "html")
+      }
+      read_xml(unclass(object), as_html = as_html, ...)
+    }
+    res <- read_xml_int(unclass(object), ...)
   } else {
     stop("Not a serialized xml2 object", call. = FALSE)
   }

I've submitted this patch in PR #408.
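
Until a fixed version is installed, a possible workaround is to pass as_html explicitly. This is a sketch that assumes xml_unserialize() forwards its `...` arguments to read_xml(), as the lines quoted from R/xml_serialize.R suggest; it is not a confirmed, documented interface.

```r
library(xml2)

file <- system.file("extdata", "r-project.html", package = "xml2")
doc <- read_html(file)

raw <- xml_serialize(doc, connection = NULL)

# Assumption: `...` is forwarded to read_xml(), so as_html = TRUE
# makes the stored text get parsed as HTML rather than XML
doc2 <- xml_unserialize(raw, as_html = TRUE)
class(doc2)
```

The caller has to know that the serialized document was HTML, which is the bookkeeping the patch above moves into the serialized object itself.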

@HenrikBengtsson HenrikBengtsson changed the title Error with xml_serialize()/xml_unserialize() roundtrip: Opening and ending tag mismatch: link line 12 and head [76] Error with xml_serialize()/xml_unserialize() roundtrip: Opening and ending tag mismatch [PATCH] Oct 8, 2023
@hadley hadley added the bug an unexpected problem or unintended behavior label Oct 30, 2023
HenrikBengtsson added a commit to HenrikBengtsson/xml2 that referenced this issue Nov 8, 2023
HenrikBengtsson added a commit to HenrikBengtsson/xml2 that referenced this issue Nov 8, 2023
@hadley hadley closed this as completed in b9f65ba Nov 10, 2023