Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CLOUDSTACK-9998: Prometheus Exporter for CloudStack #2287

Merged
merged 1 commit into from
Oct 11, 2017

Conversation

rohityadavcloud
Copy link
Member

This implements a CloudStack Prometheus exporter as a plugin, that serves
metrics on a HTTP port.

FS: https://cwiki.apache.org/confluence/display/CLOUDSTACK/CloudStack+Metrics+Exporter+for+Prometheus

Prometheus: https://prometheus.io/

New global settings:

  1. prometheus.exporter.enable - (default: false), Enable the prometheus
    exporter plugin, management server restart needed.
  2. prometheus.exporter.port - (default: 9595), The prometheus exporter
    server port.
  3. prometheus.exporter.allowed.ips - (default: 127.0.0.1), List of comma
    separated prometheus server ips (with no spaces) that should be allowed to
    access the URLs.

The following list of metrics are provided per pop (zone) with the exporter:
• Per host:
o CPU cores: used, total
o CPU usage: used, total (in MHz)
o Memory usage: used, total (in MiBs)
o Total VMs running on the host
• CPU cores: allocated (per zone)
• CPU usage: allocated (per zone, in MHz)
• Memory usage: allocated (per zone, in MiBs)
• Hosts: online, offline, total
• VMs: in all states -- starting, running, stopping, stopped, destroyed,
expunging, migrating, error, unknown
• Volumes: ready, destroyed, total
• Primary Storage Pool: (Disk size) used, allocated, unallocated, total (in GiBs)
• Secondary Storage Pool: (Disk size) used, allocated, unallocated, total (in GiBs)
• Private IPs: allocated, total
• Public IPs: allocated, total
• Shared Network IPs: allocated, total
• VLANs: allocated, total

Additional metrics for the environment:
• Summed domain (level=1) limit for CPU cores
• Summed domain (level=1) limit for memory/ram

Pinging for review - @ustcweizhou @DaanHoogland @borisstoyanov @nvazquez @PaulAngus and others

This implements a CloudStack Prometheus exporter as a plugin, that serves
metrics on a HTTP port.

New global settings:

1. prometheus.exporter.enable - (default: false), Enable the prometheus
exporter plugin, management server restart needed.
2. prometheus.exporter.port - (default: 9595), The prometheus exporter
server port.
3. prometheus.exporter.allowed.ips - (default: 127.0.0.1), List of comma
separated prometheus server ips (with no spaces) that should be allowed to
access the URLs.

The following list  of  metrics are provided  per pop (zone)  with  the exporter:
• Per host:
o CPU cores:  used, total
o CPU usage:  used, total (in MHz)
o Memory  usage:  used, total (in MiBs)
o Total VMs running on  the host
• CPU cores:  allocated (per  zone)
• CPU usage:  allocated (per  zone, in  MHz)
• Memory  usage:  allocated (per  zone, in  MiBs)
• Hosts:  online, offline,  total
• VMs: in all states -- starting, running, stopping, stopped, destroyed,
       expunging, migrating,  error, unknown
• Volumes:  ready,  destroyed,  total
• Primary Storage Pool: (Disk size) used, allocated,  unallocated,  total (in GiBs)
• Secondary Storage Pool: (Disk size) used, allocated,  unallocated,  total (in GiBs)
• Private IPs:  allocated,  total
• Public  IPs:  allocated,  total
• Shared  Network IPs:  allocated,  total
• VLANs:  allocated,  total

Additional metrics for the environment:
• Summed  domain  (level=1) limit for CPU cores
• Summed  domain  (level=1) limit for memory/ram

Signed-off-by: Rohit Yadav <rohit.yadav@shapeblue.com>
@rohityadavcloud
Copy link
Member Author

@blueorangutan package

@blueorangutan
Copy link

@rhtyd a Jenkins job has been kicked to build packages. I'll keep you posted as I make progress.

@blueorangutan
Copy link

Packaging result: ✔centos6 ✔centos7 ✔debian. JID-1137

@rohityadavcloud
Copy link
Member Author

@blueorangutan test

@blueorangutan
Copy link

@rhtyd a Trillian-Jenkins test job (centos7 mgmt + kvm-centos7) has been kicked to run smoke tests

@blueorangutan
Copy link

Trillian test result (tid-1564)
Environment: kvm-centos7 (x2), Advanced Networking with Mgmt server 7
Total time taken: 40694 seconds
Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr2287-t1564-kvm-centos7.zip
Intermitten failure detected: /marvin/tests/smoke/test_privategw_acl.py
Intermitten failure detected: /marvin/tests/smoke/test_vpc_redundant.py
Intermitten failure detected: /marvin/tests/smoke/test_vpc_vpn.py
Test completed. 60 look OK, 2 have error(s)

Test Result Time (s) Test File
test_01_vpc_remote_access_vpn Failure 55.89 test_vpc_vpn.py
test_04_rvpc_privategw_static_routes Failure 406.37 test_privategw_acl.py
test_change_service_offering_for_vm_with_snapshots Skipped 0.00 test_vm_snapshots.py
test_09_copy_delete_template Skipped 0.02 test_templates.py
test_06_copy_template Skipped 0.00 test_templates.py
test_static_role_account_acls Skipped 0.02 test_staticroles.py
test_11_ss_nfs_version_on_ssvm Skipped 0.02 test_ssvm.py
test_01_scale_vm Skipped 0.00 test_scale_vm.py
test_01_primary_storage_iscsi Skipped 0.09 test_primary_storage.py
test_vm_nic_adapter_vmxnet3 Skipped 0.00 test_nic_adapter_type.py
test_nested_virtualization_vmware Skipped 0.00 test_nested_virtualization.py
test_06_copy_iso Skipped 0.00 test_iso.py
test_list_ha_for_host_valid Skipped 0.02 test_hostha_simulator.py
test_list_ha_for_host_invalid Skipped 0.02 test_hostha_simulator.py
test_list_ha_for_host Skipped 0.02 test_hostha_simulator.py
test_hostha_enable_feature_without_setting_provider Skipped 0.02 test_hostha_simulator.py
test_hostha_enable_feature_valid Skipped 0.02 test_hostha_simulator.py
test_hostha_disable_feature_valid Skipped 0.04 test_hostha_simulator.py
test_hostha_configure_invalid_provider Skipped 0.02 test_hostha_simulator.py
test_hostha_configure_default_driver Skipped 0.02 test_hostha_simulator.py
test_ha_verify_fsm_recovering Skipped 0.02 test_hostha_simulator.py
test_ha_verify_fsm_fenced Skipped 0.02 test_hostha_simulator.py
test_ha_verify_fsm_degraded Skipped 0.02 test_hostha_simulator.py
test_ha_verify_fsm_available Skipped 0.02 test_hostha_simulator.py
test_ha_multiple_mgmt_server_ownership Skipped 0.02 test_hostha_simulator.py
test_ha_list_providers Skipped 0.02 test_hostha_simulator.py
test_ha_enable_feature_invalid Skipped 0.03 test_hostha_simulator.py
test_ha_disable_feature_invalid Skipped 0.02 test_hostha_simulator.py
test_ha_configure_enabledisable_across_clusterzones Skipped 0.02 test_hostha_simulator.py
test_configure_ha_provider_valid Skipped 0.02 test_hostha_simulator.py
test_configure_ha_provider_invalid Skipped 0.02 test_hostha_simulator.py
test_deploy_vgpu_enabled_vm Skipped 0.03 test_deploy_vgpu_enabled_vm.py
test_3d_gpu_support Skipped 0.04 test_deploy_vgpu_enabled_vm.py

Copy link
Contributor

@DaanHoogland DaanHoogland left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm mostly. one remark

super();
}

private void addHostMetrics(final List<Item> metricsList, final long dcId, final String zoneName, final String zoneUuid) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would love to see this method split. It doesn't fit in my mind. Guess it doesn't differ for the functionality.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I kept this method to follow the hierarical metrics output at host level in a zone; splitting may make redundant loops.

public abstract String toMetricsString();
}

class ItemVM extends Item {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

a lot of doublures in fields i.e. zoneName and zoneUuid or even zoneFilter. maybe a different hierarchy is an idea.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What approach should we take?

Copy link
Contributor

@borisstoyanov borisstoyanov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM tests don't show any new failures

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants