Ongoing Benchmarking of system module #20877

Closed
fearful-symmetry opened this issue Aug 31, 2020 · 8 comments

fearful-symmetry (Contributor) commented Aug 31, 2020

Over the weekend I started benchmarking the disk usage of the system module in Metricbeat, based on conversations with @andresrc and @mukeshelastic. We're hoping this data can better inform our decisions about where to take the system integration over the next few releases. Keep in mind this is ongoing, and I'm running more tests as I write this.

Parameters

Each test was run for 710 minutes on the same server with 64GB RAM, a 6TB main disk, and two 6-core Xeons. In addition to Metricbeat, the server was running Elasticsearch and Kibana, since we're not as interested in benchmarking storage on idle servers. We might want to run some additional tests with different runtimes in the future.

Benchmark 1: All Defaults

This test is just Metricbeat OOTB. Nothing changed except for connectivity settings.

After 710 minutes, metricbeat had used 18.5mb over 75096 docs:

index                              shard prirep state       docs  store ip            node
metricbeat-7.9.0-2020.08.30-000001 0     p      STARTED    75096 18.5mb 192.168.1.135 shoebill
metricbeat-7.9.0-2020.08.30-000001 0     r      UNASSIGNED
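
For reference, a shard listing with these columns is the default output of the _cat shards API, so something along these lines (index pattern assumed) reproduces it:

GET _cat/shards/metricbeat-*?v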

Here's a breakdown of documents by metricset, as obtained by a terms aggregation:

process          29589
network          21305
cpu               4262
load              4262
memory            4262
process_summary   4262
socket_summary    4262
filesystem        2133
fsstat             711
uptime              48
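
For reference, a minimal sketch of the kind of terms aggregation used for this breakdown (index name taken from the output above, bucket size assumed):

GET metricbeat-7.9.0-2020.08.30-000001/_search
{
  "size": 0,
  "aggs": {
    "by_metricset": {
      "terms": { "field": "metricset.name", "size": 20 }
    }
  }
}

Each bucket's doc_count corresponds to a row in the table above.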

Here's the same breakdown as a pie chart:

[pie chart: defaults_precise_pie — share of documents by metricset]

Process and Network alone take up more than half of the space.

Benchmark 2: Disabling metadata processors

This test was the same as above, but I disabled the add_*_metadata processors to see how much overhead they add.

index                              shard prirep state       docs  store ip            node
metricbeat-7.9.0-2020.08.29-000001 0     p      STARTED    76012 16.1mb 192.168.1.135 shoebill
metricbeat-7.9.0-2020.08.29-000001 0     r      UNASSIGNED

In this case, it looks like the metadata processors added about 2.4MB after 710 minutes.
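
For context, the stock metricbeat.yml ships a processors block roughly like the one below (the exact list can vary by version); benchmark 2 amounts to commenting it out:

processors:
  - add_host_metadata: ~
  - add_cloud_metadata: ~
  - add_docker_metadata: ~
  - add_kubernetes_metadata: ~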

Benchmark 3: Separate indices per-metricset

For this test, I disabled ILM and used the metricset.name field to create dynamic indices per-metricset. I wanted to test this separately, since I wasn't sure how it would affect on-disk compression or storage.
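
The output config for this test would look roughly like the sketch below (the hosts value and date pattern are placeholders; the setup.template settings are required once the default index name is overridden):

setup.ilm.enabled: false
setup.template.name: "metricbeat"
setup.template.pattern: "metricbeat-*"

output.elasticsearch:
  hosts: ["localhost:9200"]
  index: "metricbeat-%{[metricset.name]}-%{[agent.version]}-%{+yyyy.MM.dd}"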

index                                       shard prirep state       docs   store ip            node
metricbeat-load-7.9.0-2020.08.29            0     p      STARTED     4285   690kb 192.168.1.135 shoebill
metricbeat-load-7.9.0-2020.08.29            0     r      UNASSIGNED                             
metricbeat-memory-7.9.0-2020.08.29          0     p      STARTED     4285   1.1mb 192.168.1.135 shoebill
metricbeat-memory-7.9.0-2020.08.29          0     r      UNASSIGNED                             
metricbeat-uptime-7.9.0-2020.08.29          0     p      STARTED       48  50.1kb 192.168.1.135 shoebill
metricbeat-uptime-7.9.0-2020.08.29          0     r      UNASSIGNED                             
metricbeat-socket_summary-7.9.0-2020.08.29  0     p      STARTED     4285   710kb 192.168.1.135 shoebill
metricbeat-socket_summary-7.9.0-2020.08.29  0     r      UNASSIGNED                             
metricbeat-fsstat-7.9.0-2020.08.29          0     p      STARTED      715 164.7kb 192.168.1.135 shoebill
metricbeat-fsstat-7.9.0-2020.08.29          0     r      UNASSIGNED                             
metricbeat-filesystem-7.9.0-2020.08.29      0     p      STARTED     2145 419.4kb 192.168.1.135 shoebill
metricbeat-filesystem-7.9.0-2020.08.29      0     r      UNASSIGNED                             
metricbeat-network-7.9.0-2020.08.29         0     p      STARTED    21425   2.8mb 192.168.1.135 shoebill
metricbeat-network-7.9.0-2020.08.29         0     r      UNASSIGNED                             
metricbeat-process-7.9.0-2020.08.29         0     p      STARTED    25881   7.1mb 192.168.1.135 shoebill
metricbeat-process-7.9.0-2020.08.29         0     r      UNASSIGNED                             
metricbeat-process_summary-7.9.0-2020.08.29 0     p      STARTED     4285 665.7kb 192.168.1.135 shoebill
metricbeat-process_summary-7.9.0-2020.08.29 0     r      UNASSIGNED                             
metricbeat-cpu-7.9.0-2020.08.29             0     p      STARTED     4285 979.3kb 192.168.1.135 shoebill
metricbeat-cpu-7.9.0-2020.08.29             0     r      UNASSIGNED     

Or, presented in a more useful format, with size in KB:

[table screenshot: index size in KB per metricset]

Some of this data is pretty interesting, and suggests there's some kind of document compression going on behind the scenes that I don't quite understand, as it looks like larger indices are compressed more efficiently. Also, the total for all indices comes out to 14.7MB, compared with 18.5MB in benchmark 1. Regardless, it demonstrates that process and network are the biggest offenders by a wide margin. Keep in mind that process reports one event per process, and network one per interface.

Benchmark 4: Separate indices per-metricset, same period across all metricsets

This is the same as benchmark 3, but all metricsets are now on a 10-second period. In the default settings, uptime is on a 15-minute period, and fsstat and filesystem are on 1-minute periods.
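
A sketch of the system module config for this run, assuming everything is collapsed into a single 10-second block (the stock modules.d/system.yml splits these metricsets across 10s, 1m, and 15m blocks):

- module: system
  period: 10s
  metricsets:
    - cpu
    - load
    - memory
    - network
    - process
    - process_summary
    - socket_summary
    - filesystem
    - fsstat
    - uptime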

health status index                                       uuid                   pri rep docs.count docs.deleted store.size pri.store.size
yellow open   metricbeat-socket_summary-7.9.0-2020.08.31  p33JGkD_Q5aDPcjVYCwKSQ   1   1       4261            0    657.7kb        657.7kb
yellow open   metricbeat-uptime-7.9.0-2020.08.31          hT-yPctNTyKZFgIrWN9-5Q   1   1       4261            0    496.6kb        496.6kb
yellow open   metricbeat-process_summary-7.9.0-2020.08.31 XloSgYL2TXiE_TnwDQi60A   1   1       4261            0      589kb          589kb
yellow open   metricbeat-network-7.9.0-2020.08.31         mRI14pDLSeuqND1sMwryPg   1   1      21305            0      2.9mb          2.9mb
yellow open   metricbeat-load-7.9.0-2020.08.31            wdg7g98FTEu3vV3CGzlynQ   1   1       4261            0      619kb          619kb
yellow open   metricbeat-memory-7.9.0-2020.08.31          NTN8klEfRHa6tg50XF-tcw   1   1       4261            0        1mb            1mb
yellow open   metricbeat-cpu-7.9.0-2020.08.31             iCiaLU8_Smi7PVRp8ROkHg   1   1       4261            0    917.5kb        917.5kb
yellow open   metricbeat-filesystem-7.9.0-2020.08.31      9Gl3QzHSScWVfOgEsaVzkQ   1   1      12783            0      1.9mb          1.9mb
yellow open   metricbeat-fsstat-7.9.0-2020.08.31          1vl70H_ER6KiMYS5dPMCqg   1   1       4261            0    624.7kb        624.7kb
yellow open   metricbeat-process-7.9.0-2020.08.31         SF_wmMSOQ9SpeXT7uFucnA   1   1      30617            0      8.5mb          8.5mb

Here's another more useful chart:

[bar chart: index size per metricset with a uniform 10s period]

It's a bit easier to see that process's outsize disk usage is down to the number of events it's sending: five per period by default. The high usage of memory is due to the extra metrics added on Linux.

fearful-symmetry added the Team:Integrations label and self-assigned this issue on Aug 31, 2020
elasticmachine (Collaborator)

Pinging @elastic/integrations (Team:Integrations)

andresrc (Contributor) commented Sep 1, 2020

Thanks a lot!

mukeshelastic commented

@fearful-symmetry thanks for running these benchmark tests, they are very helpful.

Looks like process has the largest share of the disk footprint, and reducing it from per-event to every 10s didn't result in much savings (29589 vs 30617 docs for the process metricset). I assume the per-event data footprint will depend on the process activity on the machine. Is that correct? If yes, it may be worth extending this test by running an application on the box and load testing it, to see if it makes a difference in the number of process docs?

Also, it looks like the storage footprint for process (30617 docs -> 8.5MB) is higher than any other metricset. Are you able to share a raw process document? I'm curious whether there are ways to optimize the document size for the process metricset.

FYI @sorantis in case you haven't seen this already.

fearful-symmetry (Author) commented

@mukeshelastic

Looks like process has the largest share of the disk footprint, and reducing it from per-event to every 10s didn't result in much savings

The process metricset is set to 10s by default; the change in benchmark #4 only affected fsstat, filesystem, and uptime.

I assume per event data footprint will depend on the process activity on the machine. Is that correct?

[screenshot: variation in process document counts over time]

There's some variation in process documents, but it's not too wild.

If yes, it may be worth extending this test by running an application on the box and load testing it, to see if it makes a difference in the number of process docs?

Metricbeat here is running on the same server as ES, Kibana, and a Jupyter Lab instance I keep running, precisely so I could collect data in a somewhat more realistic environment.

Are you able to share a raw process document?

Here's one of the chunkier documents. Depending on the process in question, we might get more fields or larger fields. I suspect the disk usage has more to do with the fact that it's sending a relatively large number of documents (5-7) per period. For example, the memory document on Linux is pretty sizable, but it's only sent once every 10 seconds.

{
  "_index": "metricbeat-process-7.9.0-2020.08.31",
  "_type": "_doc",
  "_id": "jRUBRnQBRMp_e_S9cetP",
  "_version": 1,
  "_score": null,
  "_source": {
    "@timestamp": "2020-08-31T19:32:12.893Z",
    "agent": {
      "ephemeral_id": "627cced5-6ac7-4597-937f-8835e3029635",
      "id": "faf8c866-5cdb-42b1-b560-b09e52a8ef05",
      "name": "shoebill.nest",
      "type": "metricbeat",
      "version": "7.9.0",
      "hostname": "shoebill.nest"
    },
    "user": {
      "name": "elasticsearch"
    },
    "metricset": {
      "name": "process",
      "period": 10000
    },
    "event": {
      "module": "system",
      "duration": 120181941,
      "dataset": "system.process"
    },
    "service": {
      "type": "system"
    },
    "system": {
      "process": {
        "cmdline": "/usr/share/elasticsearch/jdk/bin/java -Xshare:auto -Des.networkaddress.cache.ttl=60 -Des.networkaddress.cache.negative.ttl=10 -XX:+AlwaysPreTouch -Xss1m -Djava.awt.headless=true -Dfile.encoding=UTF-8 -Djna.nosys=true -XX:-OmitStackTraceInFastThrow -XX:+ShowCodeDetailsInExceptionMessages -Dio.netty.noUnsafe=true -Dio.netty.noKeySetOptimization=true -Dio.netty.recycler.maxCapacityPerThread=0 -Dio.netty.allocator.numDirectArenas=0 -Dlog4j.shutdownHookEnabled=false -Dlog4j2.disable.jmx=true -Djava.locale.providers=SPI,COMPAT -Xms1g -Xmx1g -XX:+UseG1GC -XX:G1ReservePercent=25 -XX:InitiatingHeapOccupancyPercent=30 -Djava.io.tmpdir=/tmp/elasticsearch-13961428356282655902 -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/var/lib/elasticsearch -XX:ErrorFile=/var/log/elasticsearch/hs_err_pid%p.log -Xlog:gc*,gc+age=trace,safepoint:file=/var/log/elasticsearch/gc.log:utctime,pid,tags:filecount=32,filesize=64m -XX:MaxDirectMemorySize=536870912 -Des.path.home=/usr/share/elasticsearch -Des.path.conf=/etc/elasticsearch -Des.distribution.flavor=default -Des.distribution.type=rpm -Des.bundled_jdk=true -cp /usr/share/elasticsearch/lib/* org.elasticsearch.bootstrap.Elasticsearch -p /var/run/elasticsearch/elasticsearch.pid --quiet",
        "cpu": {
          "total": {
            "norm": {
              "pct": 0.0015
            },
            "value": 9905200,
            "pct": 0.035
          },
          "start_time": "2020-08-28T16:43:27.000Z"
        },
        "memory": {
          "size": 16139251712,
          "rss": {
            "bytes": 2481897472,
            "pct": 0.0368
          },
          "share": 36675584
        },
        "fd": {
          "limit": {
            "soft": 65535,
            "hard": 65535
          },
          "open": 481
        },
        "state": "sleeping"
      }
    },
    "ecs": {
      "version": "1.5.0"
    },
    "host": {
      "ip": [
        "192.168.1.135",
        "fe80::6c0b:c4ca:3751:a64c"
      ],
      "mac": [
        "84:2b:2b:41:26:44",
        "84:2b:2b:41:26:46",
        "84:2b:2b:41:26:48",
        "84:2b:2b:41:26:4a"
      ],
      "hostname": "shoebill.nest",
      "architecture": "x86_64",
      "os": {
        "version": "32 (Server Edition)",
        "family": "redhat",
        "name": "Fedora",
        "kernel": "5.7.17-200.fc32.x86_64",
        "platform": "fedora"
      },
      "id": "c786975bee764ea69d2541ff7788b762",
      "containerized": false,
      "name": "shoebill.nest"
    },
    "process": {
      "ppid": 1,
      "pgid": 3468,
      "working_directory": "/usr/share/elasticsearch",
      "executable": "/usr/share/elasticsearch/jdk/bin/java",
      "args": [
        "/usr/share/elasticsearch/jdk/bin/java",
        "-Xshare:auto",
        "-Des.networkaddress.cache.ttl=60",
        "-Des.networkaddress.cache.negative.ttl=10",
        "-XX:+AlwaysPreTouch",
        "-Xss1m",
        "-Djava.awt.headless=true",
        "-Dfile.encoding=UTF-8",
        "-Djna.nosys=true",
        "-XX:-OmitStackTraceInFastThrow",
        "-XX:+ShowCodeDetailsInExceptionMessages",
        "-Dio.netty.noUnsafe=true",
        "-Dio.netty.noKeySetOptimization=true",
        "-Dio.netty.recycler.maxCapacityPerThread=0",
        "-Dio.netty.allocator.numDirectArenas=0",
        "-Dlog4j.shutdownHookEnabled=false",
        "-Dlog4j2.disable.jmx=true",
        "-Djava.locale.providers=SPI,COMPAT",
        "-Xms1g",
        "-Xmx1g",
        "-XX:+UseG1GC",
        "-XX:G1ReservePercent=25",
        "-XX:InitiatingHeapOccupancyPercent=30",
        "-Djava.io.tmpdir=/tmp/elasticsearch-13961428356282655902",
        "-XX:+HeapDumpOnOutOfMemoryError",
        "-XX:HeapDumpPath=/var/lib/elasticsearch",
        "-XX:ErrorFile=/var/log/elasticsearch/hs_err_pid%p.log",
        "-Xlog:gc*,gc+age=trace,safepoint:file=/var/log/elasticsearch/gc.log:utctime,pid,tags:filecount=32,filesize=64m",
        "-XX:MaxDirectMemorySize=536870912",
        "-Des.path.home=/usr/share/elasticsearch",
        "-Des.path.conf=/etc/elasticsearch",
        "-Des.distribution.flavor=default",
        "-Des.distribution.type=rpm",
        "-Des.bundled_jdk=true",
        "-cp",
        "/usr/share/elasticsearch/lib/*",
        "org.elasticsearch.bootstrap.Elasticsearch",
        "-p",
        "/var/run/elasticsearch/elasticsearch.pid",
        "--quiet"
      ],
      "name": "java",
      "pid": 3468
    }
  },
  "fields": {
    "system.process.cpu.start_time": [
      "2020-08-28T16:43:27.000Z"
    ],
    "@timestamp": [
      "2020-08-31T19:32:12.893Z"
    ]
  },
  "sort": [
    1598902332893
  ]
}

sorantis (Contributor) commented Sep 1, 2020

From the Metrics UI perspective we don't need everything from the process metricset. Currently we're looking at the following details to expose in the UI.

Generally, the process information would need to be collected at a higher rate, because it's so dynamic. We should control the amount of information produced for each process, as well as the retention period for the process metricset, which I presume should be shorter than for other data.
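
As a rough illustration of the knobs that already exist here, the process metricset can be trimmed with something like the sketch below (the top-N values and dropped fields are illustrative, not a recommendation):

- module: system
  period: 10s
  metricsets: ["process"]
  # Only report the heaviest processes instead of every match.
  process.include_top_n:
    by_cpu: 3
    by_memory: 3
  # Skip the Linux-only cgroup metrics.
  process.cgroups.enabled: false
  processors:
    # Drop the verbose command-line fields, often the largest strings in a process doc.
    - drop_fields:
        fields: ["system.process.cmdline", "process.args"]
        ignore_missing: true

Retention could then be handled separately with a shorter ILM policy on whatever index the process data lands in.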

fearful-symmetry (Author) commented

@sorantis I think you're on the right track. Speaking as someone who used the stack extensively at my last job, the top-like graphs in the system dashboards are extremely useful, and we should work on migrating that functionality to the Metrics UI. Stripping down the data we're collecting and increasing the frequency seems like a solid bet.

andresrc (Contributor) commented Sep 1, 2020

Thanks again for the data @fearful-symmetry. With this, we could say that an upper bound with the current defaults is around 40 MB/day (single replica) per host.
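
(For reference, extrapolating from benchmark 1: 18.5 MB over 710 minutes works out to roughly 18.5 × 1440 / 710 ≈ 37.5 MB per 24 hours for the primary shard alone.)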

fearful-symmetry (Author) commented

Closing this, since we have all the data we need.
