add a benchmark and profile script and hook into CI #1028

Merged · 4 commits · Feb 12, 2025
26 changes: 26 additions & 0 deletions .github/workflows/benchmarks.yml
@@ -0,0 +1,26 @@
name: Benchmarks

on: [push, pull_request]

jobs:
  build:
    runs-on: ubuntu-latest

    steps:
      - uses: actions/checkout@v4
      - name: Install Memcached 1.6.23
        working-directory: scripts
        env:
          MEMCACHED_VERSION: 1.6.23
        run: |
          chmod +x ./install_memcached.sh
          ./install_memcached.sh
          memcached -d
          memcached -d -p 11222
      - name: Set up Ruby
        uses: ruby/setup-ruby@v1
        with:
          ruby-version: 3.2
          bundler-cache: true # 'bundle install' and cache
      - name: Run Benchmarks
        run: RUBY_YJIT_ENABLE=1 BENCH_TARGET=all bundle exec bin/benchmark
38 changes: 38 additions & 0 deletions .github/workflows/profile.yml
@@ -0,0 +1,38 @@
name: Profiles

on: [push, pull_request]
Collaborator: Do we need these on every push / PR?

Collaborator (author): I think it is good to have on any PR, since we can review the profile whenever we have concerns that a change might impact performance.

jobs:
  build:
    runs-on: ubuntu-latest

    steps:
      - uses: actions/checkout@v4
      - name: Install Memcached 1.6.23
        working-directory: scripts
        env:
          MEMCACHED_VERSION: 1.6.23
        run: |
          chmod +x ./install_memcached.sh
          ./install_memcached.sh
          memcached -d
      - name: Set up Ruby
        uses: ruby/setup-ruby@v1
        with:
          ruby-version: 3.4
          bundler-cache: true # 'bundle install' and cache
      - name: Run Profiles
        run: RUBY_YJIT_ENABLE=1 BENCH_TARGET=all bundle exec bin/profile
      - name: Upload profile results
        uses: actions/upload-artifact@v4
Collaborator: Where do these get uploaded to? Any instructions on how to pull them down?

Collaborator (author): Good call; I added documentation to the action file on how to get them.

        with:
          name: profile-results
          path: |
            client_get_profile.json
            socket_get_profile.json
            client_set_profile.json
            socket_set_profile.json
            client_get_multi_profile.json
            socket_get_multi_profile.json
            client_set_multi_profile.json
            socket_set_multi_profile.json
255 changes: 255 additions & 0 deletions bin/benchmark
@@ -0,0 +1,255 @@
#!/usr/bin/env ruby
# frozen_string_literal: true

# This helps benchmark the current performance of Dalli
# as well as compare the performance of optimized and non-optimized calls like multi-set vs set
#
# run with:
# bundle exec bin/benchmark
# RUBY_YJIT_ENABLE=1 BENCH_TARGET=get bundle exec bin/benchmark
require 'bundler/inline'
require 'json'

gemfile do
source 'https://rubygems.org'
gem 'benchmark-ips'
gem 'logger'
end

require_relative '../lib/dalli'
require 'benchmark/ips'
require 'monitor'

##
# StringSerializer is a serializer that avoids the overhead of Marshal or JSON.
##
class StringSerializer
def self.dump(value)
value
end

def self.load(value)
value
end
end

dalli_url = ENV['BENCH_CACHE_URL'] || '127.0.0.1:11211'
bench_target = ENV['BENCH_TARGET'] || 'set'
bench_time = (ENV['BENCH_TIME'] || 10).to_i
bench_warmup = (ENV['BENCH_WARMUP'] || 3).to_i
bench_payload_size = (ENV['BENCH_PAYLOAD_SIZE'] || 700_000).to_i
payload = 'B' * bench_payload_size
TERMINATOR = "\r\n"
puts "yjit: #{RubyVM::YJIT.enabled?}"

client = Dalli::Client.new(dalli_url, serializer: StringSerializer, compress: false, raw: true)
Collaborator: I didn't realize we supported a raw client option in upstream Dalli? I thought it was only in our fork.

multi_client = Dalli::Client.new('localhost:11211,localhost:11222', serializer: StringSerializer, compress: false,
Collaborator: nit, suggested change:
multi_client = Dalli::Client.new('localhost:11211,localhost:11222', serializer: StringSerializer, compress: false,
ring_client = Dalli::Client.new('localhost:11211,localhost:11222', serializer: StringSerializer, compress: false,

raw: true)

# The raw socket implementation is used to benchmark the performance of dalli & the overhead of the various abstractions
# in the library.
sock = TCPSocket.new('127.0.0.1', '11211', connect_timeout: 1)
sock.setsockopt(Socket::IPPROTO_TCP, Socket::TCP_NODELAY, true)
sock.setsockopt(Socket::SOL_SOCKET, Socket::SO_KEEPALIVE, true)
# Benchmarks didn't see any performance gains from increasing the SO_RCVBUF buffer size
# sock.setsockopt(Socket::SOL_SOCKET, ::Socket::SO_RCVBUF, 1024 * 1024 * 8)
# Benchamrks did see an improvement in performance when increasing the SO_SNDBUF buffer size
Collaborator: We should use the same buffer size that is in Dalli proper.

Collaborator (author): OK, yeah, Dalli can also take these adjustments, but you are correct: we should default to the same values and only adjust the socket when folks pass in options that adjust Dalli too.

Collaborator (author): For now I am just dropping this setting, but as we look at tweaking better defaults we will want to try a few things out.

Collaborator: Suggested change:
# Benchamrks did see an improvement in performance when increasing the SO_SNDBUF buffer size
# Benchmarks did see an improvement in performance when increasing the SO_SNDBUF buffer size
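For context, a minimal sketch (not part of this PR) of applying a send-buffer size to the raw benchmark socket only when one is explicitly requested, so the raw-socket baseline stays tuned the same way as the Dalli client under test; the BENCH_SNDBUF environment variable here is hypothetical:

# Illustrative only: apply SO_SNDBUF to the raw socket only when a hypothetical
# BENCH_SNDBUF override is provided, mirroring whatever tuning Dalli itself receives.
if ENV['BENCH_SNDBUF']
  sock.setsockopt(Socket::SOL_SOCKET, Socket::SO_SNDBUF, ENV['BENCH_SNDBUF'].to_i)
end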

# sock.setsockopt(Socket::SOL_SOCKET, Socket::SO_SNDBUF, 1024 * 1024 * 8)

# ensure the clients are all connected and working
client.set('key', payload)
Collaborator: Should we do a get afterwards to confirm the key was set?

multi_client.set('multi_key', payload)
sock.write("set sock_key 0 3600 #{payload.bytesize}\r\n")
Collaborator: Why the ASCII protocol?

sock.write(payload)
sock.write(TERMINATOR)
sock.flush
sock.readline # clear the buffer

raise 'dalli client mismatch' if payload != client.get('key')

raise 'multi dalli client mismatch' if payload != multi_client.get('multi_key')

sock.write("mg sock_key v\r\n")
sock.readline
sock_value = sock.read(payload.bytesize)
sock.read(TERMINATOR.bytesize)
raise 'sock mismatch' if payload != sock_value

# ensure we have basic data for the benchmarks and get calls
Collaborator: I don't quite get why we are doing this... and why is the payload so much smaller / not configurable?

Collaborator (author): These are the defaults for get_multi. I can make the payload adjustable, but we don't typically see get_multi with 1 MB values, so I picked something more in the normal range. I will make it configurable and have it default to 1/10th of the full get/set size.
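A minimal sketch of what that configurable default could look like (illustrative only; the BENCH_SMALLER_PAYLOAD_SIZE variable is hypothetical and not defined by this PR):

# Illustrative only: allow overriding the multi-get payload size, defaulting to
# one tenth of the main payload; the env var name is hypothetical.
payload_smaller_size = (ENV['BENCH_SMALLER_PAYLOAD_SIZE'] || bench_payload_size / 10).to_i
payload_smaller = 'B' * payload_smaller_size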

payload_smaller = 'B' * (bench_payload_size / 10)
pairs = {}
100.times do |i|
pairs["multi_#{i}"] = payload_smaller
end
client.quiet do
pairs.each do |key, value|
client.set(key, value, 3600, raw: true)
end
end
Collaborator: Should there be a corresponding get call to make sure the keys were set successfully?

Collaborator (author): The benchmark verifies during the get bench that they are there, or it raises an exception, so I don't think we need to check before the benchmark runs; checking up front would fail faster but also duplicate more code.

Collaborator (author): So this is handled with raise 'mismatch' unless result == payload.


###
# GC Suite
# benchmark without GC skewing things
###
class GCSuite
def warming(*)
run_gc
end

def running(*)
run_gc
end

def warmup_stats(*); end

def add_report(*); end

private

def run_gc
GC.enable
GC.start
GC.disable
end
end
suite = GCSuite.new

# rubocop:disable Metrics/MethodLength
# rubocop:disable Metrics/PerceivedComplexity
# rubocop:disable Metrics/AbcSize
# rubocop:disable Metrics/CyclomaticComplexity
def sock_get_multi(sock, pairs)
count = pairs.length
pairs.each_key do |key|
count -= 1
tail = count.zero? ? '' : 'q'
sock.write("mg #{key} v f k #{tail}\r\n")
end
sock.flush
# read all the memcached responses back and build a hash of key value pairs
results = {}
last_result = false
while (line = sock.readline.chomp!(TERMINATOR)) != ''
last_result = true if line.start_with?('EN ')
next unless line.start_with?('VA ') || last_result

_, value_length, _flags, key = line.split
results[key[1..]] = sock.read(value_length.to_i)
sock.read(TERMINATOR.length)
break if results.size == pairs.size
break if last_result
end
results
end
# rubocop:enable Metrics/MethodLength
# rubocop:enable Metrics/PerceivedComplexity
# rubocop:enable Metrics/AbcSize
# rubocop:enable Metrics/CyclomaticComplexity

if %w[all set].include?(bench_target)
Benchmark.ips do |x|
x.config(warmup: bench_warmup, time: bench_time, suite: suite)
x.report('client set') { client.set('key', payload) }
# x.report('multi client set') { multi_client.set('string_key', payload) }
x.report('raw sock set') do
sock.write("ms sock_key #{payload.bytesize} T3600 MS\r\n")
sock.write(payload)
sock.write("\r\n")
sock.flush
sock.readline # clear the buffer
end
x.compare!
end
end

@lock = Monitor.new
if %w[all get].include?(bench_target)
Benchmark.ips do |x|
x.config(warmup: bench_warmup, time: bench_time, suite: suite)
x.report('get dalli') do
result = client.get('key')
raise 'mismatch' unless result == payload
end
# NOTE: while this is the fastest, it is not thread safe and uses blocking IO, so it is not IO-sharing friendly
x.report('get sock') do
sock.write("mg sock_key v\r\n")
sock.readline
result = sock.read(payload.bytesize)
sock.read(TERMINATOR.bytesize)
raise 'mismatch' unless result == payload
end
# NOTE: this shows that adding thread safety & non-blocking IO makes us slower for the single process/thread use case
x.report('get sock non-blocking') do
@lock.synchronize do
sock.write("mg sock_key v\r\n")
sock.readline
count = payload.bytesize
value = String.new(capacity: count + 1)
loop do
begin
value << sock.read_nonblock(count - value.bytesize)
rescue Errno::EAGAIN
sock.wait_readable
retry
rescue EOFError
puts 'EOFError'
break
end
break if value.bytesize == count
end
sock.read(TERMINATOR.bytesize)
raise 'mismatch' unless value == payload
end
end
x.compare!
end
end

if %w[all get_multi].include?(bench_target)
Benchmark.ips do |x|
x.config(warmup: bench_warmup, time: bench_time, suite: suite)
x.report('get 100 keys') do
result = client.get_multi(pairs.keys)
raise 'mismatch' unless result == pairs
end
x.report('get 100 keys raw sock') do
result = sock_get_multi(sock, pairs)
raise 'mismatch' unless result == pairs
end
x.compare!
end
end

if %w[all set_multi].include?(bench_target)
Benchmark.ips do |x|
x.config(warmup: bench_warmup, time: bench_time, suite: suite)
x.report('write 100 keys simple') do
client.quiet do
pairs.each do |key, value|
client.set(key, value, 3600, raw: true)
end
end
end
# TODO: uncomment this once we add PR adding set_multi
# x.report('multi client set_multi 100') do
# multi_client.set_multi(pairs, 3600, raw: true)
# end
x.report('write 100 keys rawsock') do
count = pairs.length
tail = ''
value_bytesize = payload_smaller.bytesize
ttl = 3600

pairs.each do |key, value|
count -= 1
tail = count.zero? ? '' : 'q'
sock.write(String.new("ms #{key} #{value_bytesize} c F0 T#{ttl} MS #{tail}\r\n",
capacity: key.size + value_bytesize + 40) << value << TERMINATOR)
Collaborator: What's 40?

Collaborator (author): It just needs to cover all the command characters like ms, ' ', c, F0, etc. I picked a number while I was modifying this command a few times; a few extra unused bytes in the buffer didn't matter.
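For illustration only (not part of the diff), the buffer capacity could also be derived from the header itself rather than a fixed pad:

# Illustrative sketch: size the write buffer from the actual meta-set header
# instead of a fixed 40-byte allowance.
header = "ms #{key} #{value_bytesize} c F0 T#{ttl} MS #{tail}\r\n"
sock.write(String.new(header, capacity: header.bytesize + value_bytesize + TERMINATOR.bytesize) << value << TERMINATOR)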

end
sock.flush
sock.gets(TERMINATOR) # clear the buffer
end
# x.report('write_multi 100 keys') { client.set_multi(pairs, 3600, raw: true) }
x.compare!
end
end