GNIP-50: GeoNode monitoring #3137

Closed
cezio opened this issue Jun 22, 2017 · 6 comments
Labels
gnip A GeoNodeImprovementProcess Issue

Comments

@cezio
Contributor

cezio commented Jun 22, 2017

GNIP: geonode monitoring

Overview

GeoNode monitoring is infrastructure for extracting and presenting information on an installation's health status and on resource (layers, maps, documents) usage. Monitoring is an additional Django/GeoNode application which will:

  • collect data (events) from software components
  • calculate usage statistics (metrics)
  • display it in human-friendly form
  • provide a way to set fault thresholds and send alerts when they are reached

GeoNode monitoring is not limited to plain GeoNode: it will also collect data from accompanying GeoServer instances, and from the operating system on hardware resource usage.

Proposed by

Cezary Statkiewicz GeoSolutions

Assigned to release

None yet.

Motivation

GeoNode lacks information on resource usage and system health, which is a problem whenever operators want insight into a running system. This was identified as a significant problem by GFDRR’s Innovation Lab, which, through the Open Data for Resilience Initiative, has assisted in the creation of national and regional geospatial data sharing platforms since 2010. Many of these platforms were deployed outside formal data centers and have administrators with other responsibilities unrelated to GeoNode. Some real problems raised:

  • Some services become unavailable with time, requiring follow up with the hosting institutions.
  • Given that GeoNode allows any registered user to upload datasets, some GeoNodes may contain layers that are partially configured and not accessible.
  • In order to know what resources the public finds most valuable, we need to understand whether or not a GeoNode in a particular location is being actively used: new layers are being added or modified, new users are registering on the site, recurring users are viewing or downloading data, the origin of the users visiting the site, etc.
  • When errors occur on a GeoNode they are logged in different files but the administrator may not have enough experience to diagnose and fix issues. It is desirable to send exceptions and errors back to a central registry where they can be categorized and studied by consultants helping the administrator.
  • Since most GeoNodes start only with a handful of users we need tools that track performance metrics (hard drive usage, memory usage, CPU usage, bandwidth usage, statistics on the time of HTTP responses, etc) to help identify when hardware upgrades are needed.

Technically, the data needed to deduce such information could be extracted in various ways (client-side analytics, log parsing, external monitoring), but each way has its drawbacks, and none would show the full picture. Also, existing or previous attempts (GeoHealthCheck, geonode-monitor) are quite incomplete and focus mostly on measuring external visibility/state only.

This proposal introduces a contrib monitoring application, which would provide insight into actual data usage and perform health checks of the underlying system. The application should be optional, although there are a few integration points in GeoNode core.

Note that this is not a replacement for full-fledged monitoring systems like Zabbix or Nagios. GeoNode monitoring is simplified, especially from the user's perspective. However, while GeoNode monitoring can work in stand-alone mode, it could also be integrated with 3rd-party systems as a data source (not covered by this GNIP).

Proposal

GN monitoring has two main tasks:

  • collecting data from probes,
  • calculating usage stats and presenting them.

Collecting data starts with recording each request with its context: besides the basic HTTP context, it should also contain information about the resources used, the service that was used, more detailed information on the client, etc. A similar data structure is already available in GeoServer.

Data collection may be implemented in several ways. By default, data will be pulled from probes, although there should be a way for probes to push data to the collector. Also, monitoring should be ready to handle data exchange through AMQP, which will be the future default way of handling notifications. Data collection can be performed periodically or continuously in real time.

Statistics calculation is performed periodically, aggregating data into fixed-length periods. Aggregated data would contain general statistics and per-resource statistics, so the presentation layer can show system status from the overview down to layer level without much recalculation.

Architecture overview

[Figure: GeoNode monitoring architecture]

Monitoring is composed of several components, described below. Note that these are logical units. Code should reside in the geonode.contrib.monitoring module as a Django application.

GeoNode probes

Probes are points of integration in GeoNode core, which will record its activity. They are built with:

  • middleware - requests are marked with start/end time, and after view is processed, request is recorded to database with context information,
  • views - core views for layers, maps and documents should mark request with resources affected by this request.
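As a rough illustration of the two integration points above, the sketch below times a request in a middleware-style wrapper and lets a view tag the request with the resources it touches. It is written in plain Python so it runs without Django. The names `RequestTimingMiddleware`, `mark_resource`, the `monitoring` attribute, and the injected `recorder` and `clock` arguments are assumptions made for this example, not the names used in geonode.contrib.monitoring.

```python
import time


class RequestTimingMiddleware:
    """Hypothetical middleware: wraps a view and records start/end times."""

    def __init__(self, get_response, recorder, clock=time.time):
        self.get_response = get_response  # the wrapped view (or next middleware)
        self.recorder = recorder          # callable that persists one record
        self.clock = clock                # injectable for testing

    def __call__(self, request):
        # Views append (resource_type, name) pairs to this list.
        request.monitoring = {"resources": []}
        started = self.clock()
        response = self.get_response(request)
        ended = self.clock()
        # After the view has run, record the request with its context.
        self.recorder({
            "path": request.path,
            "method": request.method,
            "status": response.status_code,
            "started": started,
            "ended": ended,
            "resources": list(request.monitoring["resources"]),
        })
        return response


def mark_resource(request, resource_type, name):
    """Called from a view to tag the request with an affected resource."""
    request.monitoring["resources"].append((resource_type, name))
```

In a real Django project the recorder would more likely write to a model than be passed to the constructor; it is injected here only to keep the sketch self-contained.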

GeoServer probes

GeoServer provides a Monitoring/Audit API that can be used for this. GeoServer improvements will be handled outside this GNIP.

System-level probes

GeoNode monitoring can collect system-level data (CPU usage, memory usage, disk usage). System-level data can be extracted by reading system indicators from the GeoNode and GeoServer processes; GeoServer exposes them through its Status API. GeoNode would have an Expose API, a set of views that present system-level data as of the moment of the request.
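A minimal sketch of what a host-type probe could return, using only the Python standard library. The function names and the shape of the returned dictionary are assumptions for illustration; the actual Expose API payload is not specified here. Memory usage is left out because the standard library has no portable call for it.

```python
import json
import os
import shutil
import time


def system_snapshot(path="/"):
    """Return system-level indicators as of the moment of the call."""
    disk = shutil.disk_usage(path)
    snapshot = {
        "timestamp": time.time(),
        "cpu_count": os.cpu_count(),
        "disk": {"total": disk.total, "used": disk.used, "free": disk.free},
    }
    # Load average exists only on Unix-like systems and may be unobtainable.
    if hasattr(os, "getloadavg"):
        try:
            snapshot["load_avg"] = list(os.getloadavg())
        except OSError:
            pass
    return snapshot


def expose_view_body(path="/"):
    """Serialize the snapshot the way a JSON view might return it."""
    return json.dumps(system_snapshot(path))
```

Because the snapshot is computed on each call, a collector polling this view gets data for the current moment only, which matches the host-type probe behaviour described in this proposal.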

Collector

This is the core element of monitoring, because it connects both main functionalities. The collector provides the following facilities:

  • receive or acquire raw data from probes, in any of the following ways:
    • actively query probes for data over HTTP or another transport (pull),
    • expose a view reachable from probes, to which they report their data (push),
    • hybrid: using the upcoming AMQP infrastructure, probes publish events to a queue and the collector acts as a consumer, collecting them from the broker,
  • normalize, calculate and store metrics,
  • expose metrics for the status UI and notifications.

The collector can be run as:

  • a periodic command, pulling data from probes (implemented as collect_metrics),
  • a long-running process (as an AMQP consumer),
  • a view, passively receiving data from probes.
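The pull and push modes can share one normalization step, so the stored records look the same however they arrive. The sketch below assumes probes are plain callables returning lists of event dictionaries and that storage is a list. The `Collector` class and its method names are hypothetical and are not the code behind collect_metrics.

```python
class Collector:
    """Hypothetical collector supporting pull and push acquisition."""

    def __init__(self, probes, store):
        self.probes = probes  # mapping: service name -> callable returning events
        self.store = store    # list-like sink for normalized records

    def pull(self):
        """Pull mode: actively query every probe; return the number of events."""
        collected = 0
        for service, fetch in self.probes.items():
            for event in fetch():
                self.store.append(self.normalize(service, event))
                collected += 1
        return collected

    def receive(self, service, event):
        """Push mode: a probe (or a view acting for it) hands over one event."""
        self.store.append(self.normalize(service, event))

    @staticmethod
    def normalize(service, event):
        """Coerce a raw event into a uniform record."""
        return {
            "service": service,
            "path": event.get("path", ""),
            "status": int(event.get("status", 0)),
        }
```

An AMQP consumer would fit the same shape: its message callback would decode the payload and call `receive`.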

Dashboard/Status UI

(Note: these are designs, not the actual implementation.)

[Design mockups: main view; list of captured exceptions; exception details; notifications configuration; response statistics; resources statistics]

Status UI is a set of views and a client-side application that will present metrics. The user should get the main indicators in simplified form (to judge whether the system is working properly), and have a way to see more detailed data a few clicks away. Status UI should also provide a way to configure notifications and the collector.

Notifications

Monitoring Notifications shouldn't be confused with the GeoNode notifications app, which is a separate entity. However, Monitoring Notifications will use general notifications as a backend for sending alerts. The user should be able to configure thresholds for certain indicators, each of which can consist of several metrics. Notifications will check the metrics for each indicator after each metrics calculation, and send alerts in alarm conditions.
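One way to picture the threshold check is as a function run after each metrics calculation. In this sketch an indicator is a named list of upper-bound thresholds, and `send_alert` stands in for the general notifications backend. The metric names, the `Threshold` class and both function names are made up for the example.

```python
from dataclasses import dataclass


@dataclass
class Threshold:
    metric: str       # metric name, e.g. a hypothetical "response.time.avg"
    max_value: float  # alarm when the metric exceeds this value


def check_indicator(thresholds, metrics):
    """Return (metric, value, limit) for every breached threshold."""
    breaches = []
    for threshold in thresholds:
        value = metrics.get(threshold.metric)
        if value is not None and value > threshold.max_value:
            breaches.append((threshold.metric, value, threshold.max_value))
    return breaches


def run_checks(indicators, metrics, send_alert):
    """Check each indicator and alert once per indicator in alarm."""
    for name, thresholds in indicators.items():
        breaches = check_indicator(thresholds, metrics)
        if breaches:
            send_alert(name, breaches)
```

Only upper bounds are modelled here; a fuller version would probably also need lower bounds and some way to avoid repeating the same alert every period.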

Beacon

Beacon is an API that exposes the current status of GeoNode for external monitoring.

Data model

[Figure: monitoring data model]

Collected data

There are different types of probes and data they provide. Two base types are distinguished: service type and host type. A service-type probe provides a stream of events from a service (GeoNode, GeoServer). The stream can contain past data or be provided in real time. A host-type probe provides data only for the current moment.

The collector will get the following data from probes:

  • request with context (client location, affected resources, timing, errors),
  • exception information for errors that occurred during request processing,
  • system-level data.

Data will be aggregated and stored in fixed-length periods. For near-present data, periods should be 1-5 minutes; for older data, periods could be longer.
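The fixed-length periods can be sketched as simple bucketing of event timestamps, with older data rolled up into longer periods. The function names are illustrative, and timestamps are assumed to be seconds since the epoch.

```python
from collections import Counter


def period_start(timestamp, length=60):
    """Return the start of the fixed-length period containing timestamp."""
    return int(timestamp // length) * length


def count_per_period(timestamps, length=60):
    """Count events per period; keys are period start times."""
    return Counter(period_start(ts, length) for ts in timestamps)


def roll_up(counts, length):
    """Re-aggregate per-period counts into longer periods."""
    rolled = Counter()
    for start, n in counts.items():
        rolled[period_start(start, length)] += n
    return rolled
```

For example, events at 0, 59, 60 and 125 seconds fall into the one-minute periods starting at 0, 0, 60 and 120; rolling those up to five-minute periods puts all four into the period starting at 0.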

Metrics

Metric is an aggregated value for a specific indicator. There are three types of metrics:

  • value (where we store a value and count its occurrences within a period of time, for example: request method)
  • rate (where we store the average rate within a period of time, for example: net interface tx/rx transfer rates)
  • count (where we count occurrences within a specific period of time, for example: errors count, net interface tx/rx bytes for a given period)

While the metric types seem similar, they are handled differently when they are aggregated in the API.
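The difference shows up when several periods are merged into one. The sketch below is one plausible reading rather than the documented API behaviour: it assumes counts are summed, rates are averaged over equal-length periods, and value metrics are merged by summing occurrences per distinct value. The `aggregate` function and its argument shapes are hypothetical.

```python
from collections import Counter


def aggregate(metric_type, samples):
    """Merge per-period samples of one metric into a single figure."""
    if metric_type == "count":
        # samples: numbers of occurrences per period
        return sum(samples)
    if metric_type == "rate":
        # samples: average rates of equal-length periods
        return sum(samples) / len(samples) if samples else 0.0
    if metric_type == "value":
        # samples: (value, occurrences) pairs
        totals = Counter()
        for value, occurrences in samples:
            totals[value] += occurrences
        return dict(totals)
    raise ValueError("unknown metric type: %r" % (metric_type,))
```

Summing a rate or averaging a count would give misleading numbers, which is why the type has to be stored with the metric.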

A metric has several main properties:

  • valid period (valid_from, valid_to),
  • service for which it is calculated,
  • a name, like request.ip or request.count (defined in the MetricName model),
  • numeric value.

Additionally, a metric can be associated with:

  • a specific resource (layer, map, document),
  • an OWS service type (names stored in the OWSService model),
  • a free-text label.

This metric organization allows different levels of granularity (per-service, per-metric, per-resource, etc.) and further aggregation (increased intervals, aggregating the total request count from the sum of requests to specific resources, etc.).
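As a rough picture of this organization, the sketch below models a metric as a plain dataclass rather than a Django model. The field names follow the properties listed above; the class itself, the use of integer timestamps and the `total_for` helper are assumptions for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Metric:
    name: str                          # e.g. "request.count"
    service: str                       # service the metric is calculated for
    valid_from: int                    # period start (epoch seconds, assumed)
    valid_to: int                      # period end
    value: float
    resource: Optional[str] = None     # layer, map or document, if any
    ows_service: Optional[str] = None  # OWS service type, if any
    label: Optional[str] = None        # free-text label, if any


def total_for(metrics, name, valid_from, valid_to):
    """Sum per-resource values of one metric within a wider period."""
    return sum(
        m.value
        for m in metrics
        if m.name == name and m.valid_from >= valid_from and m.valid_to <= valid_to
    )
```

Summing per-resource request.count rows over a longer interval is an instance of the "further aggregation" mentioned above: the same rows serve both the per-layer view and the overall total.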

Errors

Errors captured by GN or GS are stored along with request details, and are exposed through a dedicated API endpoint. Error information contains:

  • error class
  • error message
  • stack trace
  • request context
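The four pieces of error information can be gathered from a caught exception with the standard `traceback` module. This is a sketch only; the `capture_error` function and the dictionary keys are assumed names, not the stored model's fields.

```python
import traceback


def capture_error(exc, request_context):
    """Build an error record from a caught exception and request details."""
    cls = type(exc)
    return {
        "error_class": "%s.%s" % (cls.__module__, cls.__qualname__),
        "message": str(exc),
        "stack_trace": "".join(
            traceback.format_exception(cls, exc, exc.__traceback__)
        ),
        "request": dict(request_context),
    }
```

A probe middleware could call this from its exception hook and hand the result to the collector together with the request record.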

Monitoring API

Detailed API description: https://github.com/geosolutions-it/geonode/wiki/Monitoring:-API

@afabiani afabiani added enhancement gnip A GeoNodeImprovementProcess Issue labels Jun 26, 2017
@afabiani
Member

+1

@capooti
Member

capooti commented Jun 27, 2017

Great. Also have a look at Hypermap: https://github.com/cga-harvard/HHypermap

We use it to track health check of thousands of services and layers, including our GeoNode instance (WorldMap). Here is our live instance: http://hh.worldmap.harvard.edu/

For example here is the situation for WorldMap: http://hh.worldmap.harvard.edu/registry/hypermap/service/2a96b71c-96b2-4432-b31f-219c45f3fc52/

@cezio
Contributor Author

cezio commented Jun 29, 2017

@capooti thanks. Looks interesting, but correct me if I'm wrong here: this is just an external visibility check, right?

@capooti
Member

capooti commented Jun 29, 2017

We test services and layers using OWSLib and ArcREST.
A service is tested (and its response time measured) by fetching the capabilities document.
A layer is tested (and its response time measured) with a GetMap request (or the equivalent for ArcGIS REST services).

@safezpa

safezpa commented Jul 2, 2017

I think maybe ELK would be another nice solution for monitoring.

@cezio
Contributor Author

cezio commented Oct 31, 2017

code merged in, closing

@cezio cezio closed this as completed Oct 31, 2017
@afabiani afabiani changed the title GNIP: GeoNode monitoring GNIP-50: GeoNode monitoring Aug 22, 2019

4 participants