wikipedia: increase REXML entity expansion limit during XML parsing
`Datasets::Wikipedia#each` raised `entity expansion has grown too large (RuntimeError)`.
This error occurs because ruby/rexml#187 sets an entity expansion limit in REXML,
and `Datasets::Wikipedia#each` exceeds that limit.

Red Datasets is meant to handle large datasets, so raising the entity expansion limit is not a problem here.
Therefore, we temporarily increase the limit while parsing and restore the previous value afterwards.

```ruby
require 'datasets'

wikipedia = Datasets::Wikipedia.new
wikipedia.each do |wiki|
  pp wiki
end
```

```console
$ cd red-datasets && bundle && bundle exec ruby wiki
/home/otegami/.rbenv/versions/3.3.3/lib/ruby/gems/3.3.0/gems/rexml-3.3.4/lib/rexml/parsers/baseparser.rb:560:in `block in unnormalize': entity expansion has grown too large (RuntimeError)
```
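
For illustration, the temporary increase can be sketched outside Red Datasets as follows. This is a minimal sketch, not the code in this commit: `with_raised_text_limit` and `RAISED_LIMIT` are hypothetical names, and only the `REXML::Security.entity_expansion_text_limit` accessor is taken from the change itself. The value 1_342_177_280 equals 1280 * 1024 * 1024.

```ruby
require "rexml/security"

# Hypothetical name; same value as the constant in this commit (1_342_177_280).
RAISED_LIMIT = 1280 * 1024 * 1024

# Hypothetical helper: raise the limit for the duration of the block,
# then restore the previous value even if the block raises.
def with_raised_text_limit(limit)
  previous = REXML::Security.entity_expansion_text_limit
  REXML::Security.entity_expansion_text_limit = limit
  yield
ensure
  REXML::Security.entity_expansion_text_limit = previous
end

before = REXML::Security.entity_expansion_text_limit
inside = with_raised_text_limit(RAISED_LIMIT) do
  # The raised limit is visible only inside the block.
  REXML::Security.entity_expansion_text_limit
end
after = REXML::Security.entity_expansion_text_limit
```

Because the restore sits in `ensure`, the global limit should return to its previous value even when parsing fails partway through.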
otegami committed Aug 5, 2024
1 parent a76b917 commit 9786129
Showing 1 changed file with 14 additions and 1 deletion.
15 changes: 14 additions & 1 deletion lib/datasets/wikipedia.rb
@@ -48,11 +48,16 @@ def each(&block)
       open_data do |input|
         listener = ArticlesListener.new(block)
         parser = REXML::Parsers::StreamParser.new(input, listener)
-        parser.parse
+        with_increased_entity_expansion_text_limit do
+          parser.parse
+        end
       end
     end

     private
+
+    ENTITY_EXPANSION_TEXT_LIMIT = 1_342_177_280
+
     def base_name
       "#{@language}wiki-latest-#{type_in_path}.xml.bz2"
     end
@@ -80,6 +85,14 @@ def type_in_path
       end
     end

+    def with_increased_entity_expansion_text_limit
+      default_limit = REXML::Security.entity_expansion_text_limit
+      REXML::Security.entity_expansion_text_limit = ENTITY_EXPANSION_TEXT_LIMIT
+      yield
+    ensure
+      REXML::Security.entity_expansion_text_limit = default_limit
+    end
+
     class ArticlesListener
       include REXML::StreamListener

