Commit
readme update
rw committed Nov 17, 2016
1 parent 4df21b7 commit 9cdd71c
Showing 1 changed file with 17 additions and 4 deletions.
@@ -6,13 +6,16 @@ Current databases supported:
+ InfluxDB
+ Elasticsearch ([announcement blog here](https://influxdata.com/blog/influxdb-markedly-elasticsearch-in-time-series-data-metrics-benchmark/))
+ Cassandra ([InfluxDB Tops Cassandra in Time-Series Data & Metrics Benchmark](https://www.influxdata.com/influxdb-vs-cassandra-benchmark-time-series-metrics/))
+ MongoDB ([InfluxDB is 27x Faster vs MongoDB for Time-Series Workloads](https://www.influxdata.com/influxdb-is-27x-faster-vs-mongodb-for-time-series-workloads/))
+ OpenTSDB

## Testing Methodology

In an attempt to make our performance comparison both realistic and relatable, we decided to build our benchmark suite according to real-world use cases. Micro-benchmarks are useful for database engineers, but using realistic data helps us better understand how our software performs under practical workloads.

Currently, the benchmarking tools focus on the DevOps use case. We create data and queries that mimic what a system administrator would see when operating a fleet of hundreds or thousands of virtual machines. We create and query values like CPU load; RAM usage; number of active, sleeping, or stalled processes; and disk used. Future benchmarks will expand to include the IoT and application monitoring use cases.

We benchmark bulk load performance and synchronous query execution performance. The benchmark suite is written in Go, and attempts to be as fair to each database as possible by removing test-related computational overhead (by pre-generating our datasets and queries, and using database-specific drivers where possible).

Although the data is randomly generated, our data and queries are entirely deterministic. By supplying the same PRNG (pseudo-random number generator) seed to the test generation code, each database is loaded with identical data and queried using identical queries.
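The seed-driven determinism described above can be sketched as follows. This is a minimal illustration, not the suite's actual generator: the function name and the random-walk model are assumptions, but the principle is the same — the same PRNG seed always yields the same series.

```go
package main

import (
	"fmt"
	"math/rand"
)

// genCPUSeries deterministically generates n CPU-load samples from a PRNG
// seed (illustrative helper). Reusing the seed reproduces the series
// exactly, so every database under test is loaded with identical data.
func genCPUSeries(seed int64, n int) []float64 {
	rng := rand.New(rand.NewSource(seed))
	out := make([]float64, n)
	load := 50.0
	for i := range out {
		load += rng.NormFloat64() // random walk around a 50% baseline
		out[i] = load
	}
	return out
}

func main() {
	a := genCPUSeries(42, 3)
	b := genCPUSeries(42, 3)
	fmt.Println(a[0] == b[0] && a[1] == b[1] && a[2] == b[2]) // true: same seed, same data
}
```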

@@ -22,7 +25,7 @@ There are five phases when using the benchmark suite: data generation, data load

### Phase 1: Data generation

Each benchmark begins with data generation.

The DevOps data generator creates time series points that correspond to server telemetry, similar to what a server fleet would send at regular intervals to a metrics collection service (like Telegraf or collectd). Our DevOps data generator runs a simulation for a pre-specified number of hosts, and emits serialized points to stdout. For each simulated machine, nine different measurements are written at 10-second intervals.
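The emission schedule described above can be sketched roughly as below; the helper name, the measurement subset, and the output format are illustrative, not the suite's actual API.

```go
package main

import "fmt"

// tickUnix returns Unix-second timestamps for one simulated host: one
// sample every stepSec seconds, n ticks total (illustrative helper).
func tickUnix(startUnix, stepSec int64, n int) []int64 {
	ts := make([]int64, n)
	for i := range ts {
		ts[i] = startUnix + int64(i)*stepSec
	}
	return ts
}

func main() {
	// A subset of the nine measurements, emitted at each 10-second tick.
	for _, t := range tickUnix(1451606400, 10, 2) {
		for _, m := range []string{"cpu", "mem", "disk"} {
			fmt.Printf("%s host_0 %d\n", m, t)
		}
	}
}
```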

@@ -80,10 +83,17 @@ Generated data is written in a database-specific format that directly equates to
For InfluxDB, the bulk load protocol is described at:
https://docs.influxdata.com/influxdb/v0.12/guides/writing_data/#writing-multiple-points
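As a sketch of what that protocol looks like, one point per newline-terminated line; the tag and field names here (`hostname`, `usage_user`) are illustrative, not necessarily the suite's schema.

```go
package main

import "fmt"

// influxLine formats one point in InfluxDB's line protocol:
// measurement,tag=value field=value nanosecond-timestamp.
func influxLine(measurement, host string, usage float64, tsNanos int64) string {
	return fmt.Sprintf("%s,hostname=%s usage_user=%g %d", measurement, host, usage, tsNanos)
}

func main() {
	fmt.Println(influxLine("cpu", "host_0", 64.25, 1451606400000000000))
	// cpu,hostname=host_0 usage_user=64.25 1451606400000000000
}
```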

For Elasticsearch, the bulk load protocol is described at:
https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-bulk.html
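A minimal sketch of that bulk format, assuming an illustrative index and document shape (not the suite's actual mapping): each document is preceded by an action line, and every line is newline-terminated.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// esBulkBody builds an Elasticsearch _bulk request body: one action line
// plus one document line per point, each newline-terminated.
func esBulkBody(index string, docs []map[string]interface{}) string {
	var b strings.Builder
	for _, d := range docs {
		action, _ := json.Marshal(map[string]interface{}{
			"index": map[string]string{"_index": index, "_type": "point"},
		})
		doc, _ := json.Marshal(d)
		b.Write(action)
		b.WriteByte('\n')
		b.Write(doc)
		b.WriteByte('\n')
	}
	return b.String()
}

func main() {
	fmt.Print(esBulkBody("benchmark_db", []map[string]interface{}{
		{"measurement": "cpu", "usage_user": 64.25},
	}))
}
```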

For Cassandra, we use the native protocol version 4, described at:
https://github.com/apache/cassandra/blob/trunk/doc/native_protocol_v4.spec

For MongoDB, we use standard BSON with the mgo client:
http://labix.org/mgo
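As a rough sketch of the MongoDB side, a loader would insert BSON documents shaped something like the struct below; the struct, its bson tags, and the database/collection names are illustrative assumptions, not the suite's actual schema.

```go
package main

import "fmt"

// mongoPoint sketches the document shape a MongoDB loader might insert
// through the mgo client; field names and bson tags are illustrative.
type mongoPoint struct {
	Measurement string             `bson:"measurement"`
	TimestampNs int64              `bson:"timestamp_ns"`
	Tags        map[string]string  `bson:"tags"`
	Fields      map[string]float64 `bson:"fields"`
}

func main() {
	p := mongoPoint{
		Measurement: "cpu",
		TimestampNs: 1451606400000000000,
		Tags:        map[string]string{"hostname": "host_0"},
		Fields:      map[string]float64{"usage_user": 64.25},
	}
	// With mgo this would be inserted via something like:
	//   session.DB("benchmark_db").C("points").Insert(p)
	fmt.Println(p.Measurement, p.Fields["usage_user"])
}
```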

For OpenTSDB, we use the standard HTTP query interface (not the batch input tool) described at:
http://opentsdb.net/docs/build/html/api_http/put.html
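The /api/put body is plain JSON with `metric`, `timestamp`, `value`, and `tags` keys, per the documentation linked above. A minimal sketch (in a real loader the body would be POSTed over HTTP rather than printed; the metric and tag names are illustrative):

```go
package main

import (
	"encoding/json"
	"fmt"
)

// openTSDBPoint mirrors the JSON body accepted by OpenTSDB's /api/put.
type openTSDBPoint struct {
	Metric    string            `json:"metric"`
	Timestamp int64             `json:"timestamp"`
	Value     float64           `json:"value"`
	Tags      map[string]string `json:"tags"`
}

// putBody serializes one point for /api/put.
func putBody(p openTSDBPoint) string {
	b, _ := json.Marshal(p)
	return string(b)
}

func main() {
	fmt.Println(putBody(openTSDBPoint{
		Metric:    "cpu.usage_user",
		Timestamp: 1451606400,
		Value:     64.25,
		Tags:      map[string]string{"hostname": "host_0"},
	}))
	// {"metric":"cpu.usage_user","timestamp":1451606400,"value":64.25,"tags":{"hostname":"host_0"}}
}
```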

### Phase 2: Data loading

@@ -107,11 +117,14 @@ InfluxDB:
Number of workers to use to make bulk load writes in parallel, and
How many points to include in each write batch.

Loader programs for the other databases take similar parameters.
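The worker-count and batch-size parameters above can be sketched as a simple fan-out, where each worker performs one bulk write per batch. The function name and signature are illustrative, not the loaders' actual interface.

```go
package main

import (
	"fmt"
	"sync"
)

// loadBatches fans batches of pre-generated points out to workerCount
// concurrent writers (illustrative sketch of the loaders' parameters).
func loadBatches(points []string, batchSize, workerCount int, write func([]string)) {
	batches := make(chan []string)
	var wg sync.WaitGroup
	for w := 0; w < workerCount; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for b := range batches {
				write(b) // one bulk write per batch
			}
		}()
	}
	for i := 0; i < len(points); i += batchSize {
		end := i + batchSize
		if end > len(points) {
			end = len(points)
		}
		batches <- points[i:end]
	}
	close(batches)
	wg.Wait()
}

func main() {
	var mu sync.Mutex
	written := 0
	loadBatches(make([]string, 10), 4, 2, func(b []string) {
		mu.Lock()
		written += len(b)
		mu.Unlock()
	})
	fmt.Println(written) // 10: every point written exactly once
}
```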

(For calibration, there is also an option to disable writing to the database; this mode is used to check the speed of data deserialization.)

Note that the bulk loaders will not start writing data if there is already data in the destination database at the beginning of a test. This helps ensure that the database is empty, as if it were newly-installed. It also prevents users from clobbering existing data.

#### Elasticsearch-specific configuration

Both Elasticsearch and InfluxDB are ready out-of-the-tarball for storing time series data. However, after meeting with Elasticsearch experts, we decided to make some reasonable configuration tweaks to Elasticsearch to try to optimize its performance.

First, the configuration for the Elasticsearch daemon was changed to set the ES_HEAP_SIZE environment variable to half of the server machine’s available RAM. For example, on a 32GB machine, ES_HEAP_SIZE is 16g. This is standard practice when administering Elasticsearch.
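As a sketch, on a hypothetical 32GB host that tweak amounts to setting the environment variable before the daemon starts; exactly how the variable reaches the Elasticsearch process depends on the init system and install method.

```shell
# Give the Elasticsearch JVM half of the machine's RAM (32GB host shown).
export ES_HEAP_SIZE=16g
```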
