Skip to content

Commit

Permalink
feat(system_monitor): add some smart information to diagnostics (tier…
Browse files Browse the repository at this point in the history
…4#560)

* feat(system_monitor): add some smart information to diagnostics

Signed-off-by: kk-inoue-esol <kk-inoue@esol.co.jp>

* ci(pre-commit): autofix

* modify regex for nvme device name

Signed-off-by: kk-inoue-esol <kk-inoue@esol.co.jp>

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
  • Loading branch information
2 people authored and boyali committed Oct 19, 2022
1 parent 98af2ca commit 526348c
Show file tree
Hide file tree
Showing 9 changed files with 428 additions and 156 deletions.
6 changes: 3 additions & 3 deletions system/system_monitor/README.md
Original file line number Diff line number Diff line change
Expand Up @@ -128,12 +128,12 @@ Thermal throttling event can be monitored by reading contents of MSR(Model Speci

## <u>HDD Monitor</u>

Generally, S.M.A.R.T. information is used to monitor HDD temperature, and normally accessing disk device node is allowed for root user or disk group.<br>
Generally, S.M.A.R.T. information is used to monitor HDD temperature and life of HDD, and normally accessing disk device node is allowed for root user or disk group.<br>
As with the CPU monitor, this package provides an approach to minimize security risks as much as possible:<br>

- Provide a small program named 'hdd_reader' which accesses S.M.A.R.T. information and sends HDD temperature to HDD monitor by using socket programming.
- Provide a small program named 'hdd_reader' which accesses S.M.A.R.T. information and sends some items of it to HDD monitor by using socket programming.
- Run 'hdd_reader' as a specific user.
- HDD monitor is able to know HDD temperature as an unprivileged user since HDD temperature is sent by socket communication.
- HDD monitor is able to know some items of S.M.A.R.T. information as an unprivileged user since those are sent by socket communication.

### Instructions before starting

Expand Down
3 changes: 3 additions & 0 deletions system/system_monitor/config/hdd_monitor.param.yaml
Original file line number Diff line number Diff line change
Expand Up @@ -7,5 +7,8 @@
name: /
temp_warn: 55.0
temp_error: 70.0
power_on_hours_warn: 3000000
total_data_written_warn: 4915200 # =150TB (1unit=32MB)
total_data_written_safety_factor: 0.05
free_warn: 5120 # MB(8hour)
free_error: 100 # MB(last 1 minute)
22 changes: 11 additions & 11 deletions system/system_monitor/docs/hdd_reader.md
Original file line number Diff line number Diff line change
Expand Up @@ -2,15 +2,15 @@

## Name

hdd_reader - Read S.M.A.R.T. information for monitoring HDD temperature
hdd_reader - Read S.M.A.R.T. information for monitoring HDD temperature and life of HDD

## Synopsis

hdd_reader [OPTION]

## Description

Read S.M.A.R.T. information for monitoring HDD temperature.<br>
Read S.M.A.R.T. information for monitoring HDD temperature and life of HDD.<br>
This runs as a daemon process and listens to a TCP/IP port (7635 by default).

**Options:**<br>
Expand All @@ -24,15 +24,15 @@ Returns 0 if OK; non-zero otherwise.

## Notes

The 'hdd_reader' accesses minimal data enough to get Model number, Serial number, and HDD temperature.<br>
The 'hdd_reader' accesses minimal data enough to get Model number, Serial number, HDD temperature, and life of HDD.<br>
This is an approach to limit its functionality, however, the functionality can be expanded for further improvements and considerations in the future.<br><br>

### [ATA]

| Purpose | Name | Length |
| --------------------------- | -------------------- | -------------------- |
| Model number, Serial number | IDENTIFY DEVICE data | 256 words(512 bytes) |
| HDD temperature | SMART READ DATA | 256 words(512 bytes) |
| Purpose | Name | Length |
| ---------------------------- | -------------------- | -------------------- |
| Model number, Serial number | IDENTIFY DEVICE data | 256 words(512 bytes) |
| HDD temperature, life of HDD | SMART READ DATA | 256 words(512 bytes) |

For details please see the documents below.<br>

Expand All @@ -43,10 +43,10 @@ For details please see the documents below.<br>

### [NVMe]

| Purpose | Name | Length |
| --------------------------- | ---------------------------------- | ---------------- |
| Model number, Serial number | Identify Controller data structure | 4096 bytes |
| HDD temperature | SMART / Health Information | 1 Dword(4 bytes) |
| Purpose | Name | Length |
| ---------------------------- | ---------------------------------- | ------------------- |
| Model number, Serial number | Identify Controller data structure | 4096 bytes |
| HDD temperature, life of HDD | SMART / Health Information | 36 Dword(144 bytes) |

For details please see the documents below.<br>

Expand Down
14 changes: 9 additions & 5 deletions system/system_monitor/docs/ros_parameters.md
Original file line number Diff line number Diff line change
Expand Up @@ -22,11 +22,15 @@ hdd_monitor:

&nbsp;&nbsp;disks:

| Name | Type | Unit | Default | Notes |
| :--------- | :----: | :--: | :-----: | :-------------------------------------------------------------------------- |
| name | string | n/a | none | The disk name to monitor temperature. (e.g. /dev/sda) |
| temp_error | float | DegC | 55.0 | Generates warning when HDD temperature reaches a specified value or higher. |
| temp_error | float | DegC | 70.0 | Generates error when HDD temperature reaches a specified value or higher. |
| Name | Type | Unit | Default | Notes |
| :------------------------------- | :----: | :---------------: | :-----: | :--------------------------------------------------------------------------------- |
| name | string | n/a | none | The disk name to monitor temperature. (e.g. /dev/sda) |
| temp_warn | float | DegC | 55.0 | Generates warning when HDD temperature reaches a specified value or higher. |
| temp_error | float | DegC | 70.0 | Generates error when HDD temperature reaches a specified value or higher. |
| power_on_hours_warn | int | Hour | 3000000 | Generates warning when HDD power-on hours reaches a specified value or higher. |
| total_data_written_attribute_id | int | n/a | 0xF1 | S.M.A.R.T attribute ID of total data written. |
| total_data_written_warn | int | depends on device | 4915200 | Generates warning when HDD total data written reaches a specified value or higher. |
| total_data_written_safety_factor | int | %(1e-2) | 0.05 | Safety factor of HDD total data written. |

hdd_monitor:

Expand Down
44 changes: 43 additions & 1 deletion system/system_monitor/docs/topics_hdd_monitor.md
Original file line number Diff line number Diff line change
@@ -1,4 +1,4 @@
# ROS topics: CPU Monitor
# ROS topics: HDD Monitor

## <u>HDD Temperature</u>

Expand All @@ -22,6 +22,48 @@
| HDD [0-9]: serial | S4EMNF0M820682 |
| HDD [0-9]: temperature | 37.0 DegC |

## <u>HDD PowerOnHours</u>

/diagnostics/hdd_monitor: HDD PowerOnHours

<b>[summary]</b>

| level | message |
| ----- | -------------- |
| OK | OK |
| WARN | lifetime limit |

<b>[values]</b>

| key | value (example) |
| ------------------------- | ----------------------- |
| HDD [0-9]: status | OK / lifetime limit |
| HDD [0-9]: name | /dev/nvme0 |
| HDD [0-9]: model | PHISON PS5012-E12S-512G |
| HDD [0-9]: serial | FB590709182505050767 |
| HDD [0-9]: power on hours | 4834 Hours |

## <u>HDD TotalDataWritten</u>

/diagnostics/hdd_monitor: HDD TotalDataWritten

<b>[summary]</b>

| level | message |
| ----- | --------------- |
| OK | OK |
| WARN | warranty period |

<b>[values]</b>

| key | value (example) |
| ----------------------------- | ---------------------------- |
| HDD [0-9]: status | OK / warranty period |
| HDD [0-9]: name | /dev/nvme0 |
| HDD [0-9]: model | PHISON PS5012-E12S-512G |
| HDD [0-9]: serial | FB590709182505050767 |
| HDD [0-9]: total data written | 146295330 <br> not available |

## <u>HDD Usage</u>

/diagnostics/hdd_monitor: HDD Usage
Expand Down
36 changes: 36 additions & 0 deletions system/system_monitor/include/hdd_reader/hdd_reader.hpp
Original file line number Diff line number Diff line change
Expand Up @@ -24,9 +24,39 @@
#include <boost/serialization/serialization.hpp>
#include <boost/serialization/string.hpp>

#include <bitset>
#include <map>
#include <string>

/**
* @brief ATA attribute IDs
*/
enum class ATAAttributeIDs : uint8_t { TEMPERATURE = 0, POWER_ON_HOURS = 1, SIZE };

/**
* @brief HDD device
*/
struct HDDDevice
{
std::string name_; //!< @brief Device name
uint8_t
total_data_written_attribute_id_; //!< @brief S.M.A.R.T attribute ID of total data written

/**
* @brief Load or save data members.
* @param [inout] ar archive reference to load or save the serialized data members
* @param [in] version version for the archive
* @note NOLINT syntax is needed since this is an interface to serialization and
* used inside boost serialization.
*/
template <typename archive>
void serialize(archive & ar, const unsigned /*version*/) // NOLINT(runtime/references)
{
ar & name_;
ar & total_data_written_attribute_id_;
}
};

/**
* @brief HDD information
*/
Expand All @@ -38,6 +68,9 @@ struct HDDInfo
uint8_t temp_; //!< @brief temperature(DegC)
// Lowest byte of the raw value contains the exact temperature value (Celsius degrees)
// in S.M.A.R.T. information.
uint64_t power_on_hours_; //!< @brief power on hours count
uint64_t total_data_written_; //!< @brief total data written
bool is_valid_total_data_written_; //!< @brief whether total_data_written_ is valid value

/**
* @brief Load or save data members.
Expand All @@ -53,6 +86,9 @@ struct HDDInfo
ar & model_;
ar & serial_;
ar & temp_;
ar & power_on_hours_;
ar & total_data_written_;
ar & is_valid_total_data_written_;
}
};

Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -21,6 +21,7 @@
#define SYSTEM_MONITOR__HDD_MONITOR__HDD_MONITOR_HPP_

#include <diagnostic_updater/diagnostic_updater.hpp>
#include <hdd_reader/hdd_reader.hpp>

#include <climits>
#include <map>
Expand All @@ -32,13 +33,35 @@
*/
struct HDDParam
{
std::string device_; //!< @brief device
float temp_warn_; //!< @brief HDD temperature(DegC) to generate warning
float temp_error_; //!< @brief HDD temperature(DegC) to generate error
int free_warn_; //!< @brief HDD free space(MB) to generate warning
int free_error_; //!< @brief HDD free space(MB) to generate error
std::string device_; //!< @brief device
float temp_warn_; //!< @brief HDD temperature(DegC) to generate warning
float temp_error_; //!< @brief HDD temperature(DegC) to generate error
int power_on_hours_warn_; //!< @brief HDD power on hours to generate warning
uint64_t total_data_written_warn_; //!< @brief HDD total data written to generate warning
float total_data_written_safety_factor_; //!< @brief safety factor of HDD total data written
int free_warn_; //!< @brief HDD free space(MB) to generate warning
int free_error_; //!< @brief HDD free space(MB) to generate error

HDDParam()
: temp_warn_(55.0),
temp_error_(70.0),
power_on_hours_warn_(3000000),
total_data_written_warn_(4915200),
total_data_written_safety_factor_(0.05),
free_warn_(5120),
free_error_(100)
{
}
};

HDDParam() : temp_warn_(55.0), temp_error_(70.0), free_warn_(5120), free_error_(100) {}
/**
* @brief SMART information items to check
*/
enum class HDDSMARTInfoItem : uint32_t {
TEMPERATURE = 0,
POWER_ON_HOURS = 1,
TOTAL_DATA_WRITTEN = 2,
SIZE
};

class HDDMonitor : public rclcpp::Node
Expand All @@ -50,11 +73,6 @@ class HDDMonitor : public rclcpp::Node
*/
explicit HDDMonitor(const rclcpp::NodeOptions & options);

/**
* @brief Update the diagnostic state.
*/
void update();

protected:
using DiagStatus = diagnostic_msgs::msg::DiagnosticStatus;

Expand All @@ -64,7 +82,36 @@ class HDDMonitor : public rclcpp::Node
* @note NOLINT syntax is needed since diagnostic_updater asks for a non-const reference
* to pass diagnostic message updated in this function to diagnostic publish calls.
*/
void checkTemp(diagnostic_updater::DiagnosticStatusWrapper & stat); // NOLINT(runtime/references)
void checkSMARTTemperature(
diagnostic_updater::DiagnosticStatusWrapper & stat); // NOLINT(runtime/references)

/**
* @brief check HDD power on hours
* @param [out] stat diagnostic message passed directly to diagnostic publish calls
* @note NOLINT syntax is needed since diagnostic_updater asks for a non-const reference
* to pass diagnostic message updated in this function to diagnostic publish calls.
*/
void checkSMARTPowerOnHours(
diagnostic_updater::DiagnosticStatusWrapper & stat); // NOLINT(runtime/references)

/**
* @brief check HDD total data written
* @param [out] stat diagnostic message passed directly to diagnostic publish calls
* @note NOLINT syntax is needed since diagnostic_updater asks for a non-const reference
* to pass diagnostic message updated in this function to diagnostic publish calls.
*/
void checkSMARTTotalDataWritten(
diagnostic_updater::DiagnosticStatusWrapper & stat); // NOLINT(runtime/references)

/**
* @brief check S.M.A.R.T. information
* @param [out] stat diagnostic message passed directly to diagnostic publish calls
* @note NOLINT syntax is needed since diagnostic_updater asks for a non-const reference
* to pass diagnostic message updated in this function to diagnostic publish calls.
*/
void checkSMART(
diagnostic_updater::DiagnosticStatusWrapper & stat,
HDDSMARTInfoItem item); // NOLINT(runtime/references)

/**
* @brief check HDD usage
Expand Down Expand Up @@ -94,19 +141,39 @@ class HDDMonitor : public rclcpp::Node
*/
std::string getDeviceFromMountPoint(const std::string & mount_point);

/**
* @brief timer callback
*/
void onTimer();

/**
* @brief update HDD information list
*/
void updateHDDInfoList();

diagnostic_updater::Updater updater_; //!< @brief Updater class which advertises to /diagnostics
rclcpp::TimerBase::SharedPtr timer_; //!< @brief timer to get HDD information from HDDReader

char hostname_[HOST_NAME_MAX + 1]; //!< @brief host name

int hdd_reader_port_; //!< @brief port number to connect to hdd_reader
std::map<std::string, HDDParam> hdd_params_; //!< @brief list of error and warning levels
std::vector<std::string> hdd_devices_; //!< @brief list of devices
std::vector<HDDDevice> hdd_devices_; //!< @brief list of devices
//!< @brief diagnostic of connection
diagnostic_updater::DiagnosticStatusWrapper connect_diag_;
HDDInfoList hdd_info_list_; //!< @brief list of HDD information

/**
* @brief HDD temperature status messages
* @brief HDD SMART status messages
*/
const std::map<int, const char *> temp_dict_ = {
{DiagStatus::OK, "OK"}, {DiagStatus::WARN, "hot"}, {DiagStatus::ERROR, "critical hot"}};
const std::map<int, const char *> smart_dicts_[static_cast<uint32_t>(HDDSMARTInfoItem::SIZE)] = {
// temperature
{{DiagStatus::OK, "OK"}, {DiagStatus::WARN, "hot"}, {DiagStatus::ERROR, "critical hot"}},
// power on hours
{{DiagStatus::OK, "OK"}, {DiagStatus::WARN, "lifetime limit"}, {DiagStatus::ERROR, "unused"}},
// total data written
{{DiagStatus::OK, "OK"}, {DiagStatus::WARN, "warranty period"}, {DiagStatus::ERROR, "unused"}},
};

/**
* @brief HDD usage status messages
Expand Down
Loading

0 comments on commit 526348c

Please sign in to comment.