Feature: Add Ox for Improved Performance #13

Open
wants to merge 8 commits into
base: master
2 changes: 2 additions & 0 deletions .gitignore
@@ -6,3 +6,5 @@
/pkg/
/spec/reports/
/tmp/
.ruby-gemset
.ruby-version
2 changes: 2 additions & 0 deletions Gemfile
@@ -1,3 +1,5 @@
# frozen_string_literal: true

source 'https://rubygems.org'

# Specify your gem's dependencies in easy_sax.gemspec
10 changes: 10 additions & 0 deletions Gemfile.lock
@@ -4,6 +4,7 @@ PATH
easy_sax (0.2.0)
activesupport (~> 7.0.8)
nokogiri (~> 1.16.2)
ox (~> 2.14.18)

GEM
remote: https://rubygems.org/
@@ -13,10 +14,12 @@ GEM
i18n (>= 1.6, < 2)
minitest (>= 5.1)
tzinfo (~> 2.0)
byebug (11.1.3)
coderay (1.1.3)
concurrent-ruby (1.2.3)
i18n (1.14.1)
concurrent-ruby (~> 1.0)
memory_profiler (1.1.0)
method_source (1.0.0)
minitest (5.20.0)
nokogiri (1.16.2-arm64-darwin)
@@ -25,9 +28,13 @@ GEM
racc (~> 1.4)
nokogiri (1.16.2-x86_64-linux)
racc (~> 1.4)
ox (2.14.18)
pry (0.14.1)
coderay (~> 1.1)
method_source (~> 1.0)
pry-byebug (3.10.1)
byebug (~> 11.0)
pry (>= 0.13, < 0.15)
racc (1.7.3)
rake (13.0.6)
tzinfo (2.0.6)
@@ -37,14 +44,17 @@ PLATFORMS
arm64-darwin-21
arm64-darwin-22
arm64-darwin-23
arm64-darwin-24
x86_64-darwin-20
x86_64-linux

DEPENDENCIES
bundler (~> 2.3.6)
easy_sax!
memory_profiler
minitest (~> 5.0)
pry
pry-byebug
rake

BUNDLED WITH
143 changes: 125 additions & 18 deletions README.md
@@ -1,6 +1,14 @@
# EasySax

EasySax allows you to easily parse large files without the messy syntax needed for working with most Sax parsers. It was inspired after attempting to use [SaxMachine](https://github.com/pauldix/sax-machine) to parse a 500mb XML file that resulted in a huge spike to 2gbs of memory inside a Rails app. EasySax is very lightweight and only stores the element currently being used in memory. It also allows you to access parent elements without storing the whole parent tree in memory. Testing with the same file above, the memory stayed constant and it processed the file much faster. EasySax is currently used in production at EasyBroker.
EasySax allows you to easily parse large files without the messy syntax needed
for working with most SAX parsers. It was inspired by an attempt to use
[SaxMachine](https://github.com/pauldix/sax-machine) to parse a 500 MB XML file,
which caused a huge spike to 2 GB of memory inside a Rails app. EasySax is
very lightweight and only stores the element currently being used in memory. It
also allows you to access parent elements without storing the whole parent tree
in memory. In testing with the same file, memory usage stayed constant and the
file was processed much faster. EasySax is currently used in production at
EasyBroker.

## Installation

@@ -12,14 +20,20 @@ gem 'easy_sax'

And then execute:

$ bundle
```shell
bundle
```

Or install it yourself as:

$ gem install easy_sax
```shell
gem install easy_sax
```

## Usage

Given the following test XML

```xml
<agencies>
<agency id="1">
@@ -36,8 +50,8 @@ Given the following test XML
<property id="3">
<title>Test 3</title>
<images>
<image url="http://test.com/4.jpg"/>
<image url="http://test.com/5.jpg"/>
<image url="http://test.com/4.jpg"/>
</images>
</property>
</properties>
@@ -56,6 +70,7 @@ Given the following test XML
</agency>
</agencies>
```

You can parse all the property elements with

```ruby
@@ -67,15 +82,18 @@ end

Outputs

```
```shell
Property id[2] title[Test 2]
Property id[3] title[Test 3]
Property id[4] title[Test 4]
```

You can also use the `text_for` method if you prefer to get text elements. `property.text_for(:title)` is the same as `property[:title].text` except it returns nil if the title element doesn't exist.
You can also use the `text_for` method if you prefer to get text elements.
`property.text_for(:title)` is the same as `property[:title].text` except it
returns nil if the title element doesn't exist.
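
The difference between the two calls can be illustrated with a small stand-in. This is only a sketch: `SketchElement` is a hypothetical class written for this example, not EasySax's real element class, and it assumes that looking up a missing child returns `nil`.

```ruby
# Hypothetical stand-in for a parsed element; NOT EasySax's real class.
class SketchElement
  attr_reader :text

  def initialize(text: nil, children: {})
    @text = text
    @children = children
  end

  # Returns the child element, or nil when it is absent (assumed behaviour).
  def [](name)
    @children[name]
  end

  # Nil-safe shortcut: the child's text, or nil when the child is absent.
  def text_for(name)
    self[name]&.text
  end
end

WITH_TITLE = SketchElement.new(
  children: { title: SketchElement.new(text: 'Test 2') }
)
WITHOUT_TITLE = SketchElement.new

puts WITH_TITLE[:title].text               # both forms agree when the child exists
puts WITH_TITLE.text_for(:title)
puts WITHOUT_TITLE.text_for(:title).inspect # nil instead of an exception
# WITHOUT_TITLE[:title].text would raise NoMethodError in this sketch,
# because nil does not respond to #text.
```

Under that assumption, `text_for` is the safer choice whenever an element is optional in the feed.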

If you want to print the property image urls you need to let the parser know that it is an array
If you want to print the property image URLs, you need to let the parser know
that it is an array

```ruby
parser = EasySax.parser(File.open('test.xml'))
@@ -87,13 +105,14 @@ end

Outputs

```
```shell
Property id[2] images ["http://test.com/1.jpg", "http://test.com/2.jpg"]
Property id[3] images ["http://test.com/4.jpg", "http://test.com/5.jpg"]
Property id[4] images ["http://test.com/3.jpg", "http://test.com/4.jpg"]
```

Now for something really cool. If you want the root ancestor use the second param in the `parse_each` block
Now for something really cool. If you want the root ancestor, use the second
param in the `parse_each` block

```ruby
parser = EasySax.parser(File.open('test.xml'))
@@ -104,13 +123,14 @@ end

Outputs

```
```shell
Property id[2] agency id[1]
Property id[3] agency id[1]
Property id[4] agency id[2]
```

Now maybe you're lazy like me and don't care about the `agencies` element and want the `agency` to be the oldest ancestor.
Now maybe you're lazy like me and don't care about the `agencies` element and
want the `agency` to be the oldest ancestor.

```ruby
parser = EasySax.parser(File.open('test.xml'))
@@ -121,25 +141,112 @@ end

Outputs

```
```shell
Property id[2] agency id[1]
Property id[3] agency id[1]
Property id[4] agency id[2]
```

You can also use `ignore` to speed up the parser by allowing it to know that it doesn't need to keep track of the those elements.
You can also use `ignore` to speed up the parser by telling it that it
doesn't need to keep track of those elements.

## Performance improvement (alpha version)

There are currently two parser methods. `EasySax.parser` is the existing
Nokogiri-based parser and is well tested in production. The new `ox_parser`
method is backward compatible with existing code and with the examples listed
in this README.

Behind the scenes, the improvement comes from replacing Nokogiri with Ox.

### Benchmark setup

```text
OS: macOS Sequoia 15.1.1 arm64
Host: MacBook Pro (14-inch, 2021)
Kernel: Darwin 24.1.0
CPU: Apple M1 Pro (8) @ 3.23 GHz
GPU: Apple M1 Pro (14) @ 1.30 GHz [Integrated]
Memory: 32.00 GiB
ruby 3.3.6 (2024-11-05 revision 75015d4c1f) [arm64-darwin24]
```

### Results

```text
Time Benchmark:
user system total real
Nokogiri: 0.000114 0.000015 0.000129 ( 0.000128)
Ox: 0.000058 0.000002 0.000060 ( 0.000062)

Memory Benchmark:

Nokogiri Parser:
Total allocated memory: 22.90625 KB
Total retained memory: 0.0 KB
Total objects allocated: 430
Total objects retained: 0

Ox Parser:
Total allocated memory: 14.984375 KB
Total retained memory: 0.078125 KB
Total objects allocated: 205
Total objects retained: 2
```
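
The benchmark script times each parser with a single `Benchmark.bm` run, so microsecond-scale figures like these can vary noticeably between runs. One way to steady them is to average over many iterations. The following is a minimal sketch, assuming a hypothetical `average_seconds` helper that is not part of this gem; the workload shown is a placeholder for the parser calls.

```ruby
# Hypothetical helper: average wall-clock seconds per call over many runs.
# A short warm-up is executed first and excluded from the measurement.
def average_seconds(iterations: 1_000, warmup: 100)
  warmup.times { yield }
  started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  iterations.times { yield }
  elapsed = Process.clock_gettime(Process::CLOCK_MONOTONIC) - started
  elapsed / iterations
end

# Placeholder workload. In the benchmark script this block would instead
# build a parser and run parse_each once for each parser class.
per_call = average_seconds { (1..100).sum }
puts format('%.9f s per call', per_call)
```

A monotonic clock is used so that system clock adjustments do not distort the measurement.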

### Performance Conclusion

The new `ox_parser` shows clear improvements over the existing `EasySax.parser`,
which relies on Nokogiri. The figures come from a single run over a small
in-memory document. Below is a summary of the key metrics:

1. Execution Time:

   - `ox_parser` takes ~52% less real execution time than the Nokogiri-based
     parser (about 2x faster).
     - Nokogiri: 0.000128 seconds
     - Ox: 0.000062 seconds

2. Memory Usage:

   - Total allocated memory is reduced by ~35% when using `ox_parser`.
     - Nokogiri: 22.91 KB
     - Ox: 14.98 KB

   - Object allocation is reduced by ~52%:
     - Nokogiri: 430 objects
     - Ox: 205 objects

3. Retained Memory:
   - While Nokogiri retains 0 KB, `ox_parser` retains 0.078 KB (2 objects) in
     this run. This is small next to the reduction in allocated memory.
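
The percentages above can be re-derived from the raw figures in the Results section. A short sketch of that arithmetic, where `percent_reduction` and `speedup` are helper names made up for this example:

```ruby
# Percentage by which `improved` is lower than `baseline`.
def percent_reduction(baseline, improved)
  ((baseline - improved) / baseline.to_f * 100).round(1)
end

# How many times smaller `improved` is than `baseline`.
def speedup(baseline, improved)
  (baseline / improved.to_f).round(2)
end

# Raw figures copied from the Results section (Nokogiri first, Ox second).
REAL_TIME_SECONDS = [0.000128, 0.000062].freeze
ALLOCATED_KB      = [22.90625, 14.984375].freeze
ALLOCATED_OBJECTS = [430, 205].freeze

puts "Real time reduction:   #{percent_reduction(*REAL_TIME_SECONDS)}%" # 51.6%
puts "Speed-up factor:       #{speedup(*REAL_TIME_SECONDS)}x"           # 2.06x
puts "Allocated memory:      #{percent_reduction(*ALLOCATED_KB)}%"      # 34.6%
puts "Allocated objects:     #{percent_reduction(*ALLOCATED_OBJECTS)}%" # 52.3%
```

Note that a ~52% reduction in elapsed time corresponds to roughly a 2x speed-up, which is the figure used in the next section.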

### Why Switch to ox_parser?

- Speed: `ox_parser` is approximately 2x faster in the benchmark above, which
  helps applications with high performance needs.
- Efficiency: It allocates less memory and fewer objects, benefiting
  applications running in constrained environments.
- Backward Compatibility: `ox_parser` works with existing code and the
  examples listed in this README.

> [!CAUTION]
> `ox_parser` still needs testing and monitoring in production environments.

## Development

After checking out the repo, run `bin/setup` to install dependencies. Then, run `rake test` to run the tests. You can also run `bin/console` for an interactive prompt that will allow you to experiment.
After checking out the repo, run `bin/setup` to install dependencies. Then, run
`rake test` to run the tests. You can also run `bin/console` for an interactive
prompt that will allow you to experiment.

To install this gem onto your local machine, run `bundle exec rake install`. To release a new version, update the version number in `version.rb`, and then run `bundle exec rake release`, which will create a git tag for the version, push git commits and tags, and push the `.gem` file to [rubygems.org](https://rubygems.org).
To install this gem onto your local machine, run `bundle exec rake install`. To
release a new version, update the version number in `version.rb`, and then run
`bundle exec rake release`, which will create a git tag for the version, push
git commits and tags, and push the `.gem` file to
[rubygems.org](https://rubygems.org).

## Contributing

Bug reports and pull requests are welcome on GitHub at https://github.com/easybroker/easy_sax.

Bug reports and pull requests are welcome on GitHub at [issues](https://github.com/easybroker/easy_sax/issues).

## License

The gem is available as open source under the terms of the [MIT License](http://opensource.org/licenses/MIT).

12 changes: 7 additions & 5 deletions Rakefile
@@ -1,10 +1,12 @@
require "bundler/gem_tasks"
require "rake/testtask"
# frozen_string_literal: true

require 'bundler/gem_tasks'
require 'rake/testtask'

Rake::TestTask.new(:test) do |t|
  t.libs << "test"
  t.libs << "lib"
  t.libs << 'test'
  t.libs << 'lib'
  t.test_files = FileList['test/**/*_test.rb']
end

task :default => :test
task default: :test
69 changes: 69 additions & 0 deletions benchmark/benchmark_parsers.rb
@@ -0,0 +1,69 @@
# frozen_string_literal: true

$LOAD_PATH.unshift File.expand_path('../lib', __dir__)
require 'benchmark'
require 'memory_profiler'
require 'stringio'
require 'easy_sax'

TEST_XML = <<~XML
  <agencies>
    <agency id="1">
      <name>Foo</name>
      <phone>12345678</phone>
      <properties>
        <property id="2">
          <title>Test 2</title>
          <images>
            <image url="http://test.com/1.jpg"/>
            <image url="http://test.com/2.jpg"/>
          </images>
        </property>
      </properties>
    </agency>
    <agency id="2">
      <name>Bar</name>
      <properties>
        <property id="3">
          <title>Test 3</title>
          <images>
            <image url="http://test.com/3.jpg"/>
            <image url="http://test.com/4.jpg"/>
          </images>
        </property>
      </properties>
    </agency>
  </agencies>
XML

def create_parser(parser_class)
  parser_class.new(StringIO.new(TEST_XML))
end

def parse_with_parser(parser)
  agencies = []
  parser.parse_each(:agency, ignore: %w[agencies], arrays: %w[properties images]) do |agency|
    agencies << agency
  end
end

puts 'Time Benchmark:'
Benchmark.bm(10) do |x|
  x.report('Nokogiri:') { parse_with_parser(create_parser(EasySax::Parser)) }
  x.report('Ox:') { parse_with_parser(create_parser(EasySax::OxParser)) }
end

puts "\nMemory Benchmark:"
[[:nokogiri, EasySax::Parser], [:ox, EasySax::OxParser]].each do |name, parser_class|
  report = MemoryProfiler.report do
    parser = create_parser(parser_class)
    parse_with_parser(parser)
  end

  puts "\n#{name.capitalize} Parser:"
  puts "Total allocated memory: #{report.total_allocated_memsize / 1024.0} KB"
  puts "Total retained memory: #{report.total_retained_memsize / 1024.0} KB"
  puts "Total objects allocated: #{report.total_allocated}"
  puts "Total objects retained: #{report.total_retained}"
  # Uncomment the line below for more detailed output:
  # report.pretty_print(scale_bytes: true)
end