Monitoring system health

To monitor system health, Arm uses Prometheus, an open-source toolkit for system monitoring and alerting.

Prometheus can collect metrics from any service that exposes them in OpenMetrics, a standard text-based format. Secure Factory services expose their metrics in this format.

With the Secure Factory service metrics detailed in this document, you can set up alerts and monitor system behavior using:

- Prometheus Alertmanager, a rule-based engine that can trigger notifications about system behaviors.
- Grafana, to visualize the metrics.

Using Prometheus in the factory

Arm recommends monitoring the system at various levels:

- Node monitoring using Prometheus node exporter to validate:
  - Node availability, uptime, and number of available nodes.
  - Node health, including memory and CPU consumption and disk space.
  - Network traffic at node level.
- Docker container monitoring using Google cAdvisor to validate:
  - Docker container health in each of the nodes.
  - Network traffic.
- Secure Factory Service with Arm-provided Docker images. For each of the running containers, measure specific application metrics, such as:
  - Factory manufacturing information.
  - Batch key retrieval events and batch key expiration information.
  - Internal communication errors between servers.
- Database health using MongoDB exporter, which Arm provides with the Secure Factory deployment.
- HSM (Hardware Security Module) health with Arm-provided Docker images.

Using the metrics

Secure Factory Service metrics

Secure Factory Service metrics URL: https://:8444/actuator/prometheus

Secure Factory Admin service metrics URL: https://:9444/actuator/prometheus

Available metrics:

http_server_requests_seconds_count

The number of incoming HTTP requests that the server received.

Useful labels for monitoring are exception, method, outcome, status, and uri.

For example, use these labels to retrieve:
- GET requests to /v1/device_response that return a 5xx or 4xx response to monitor the total number of failed provisioning requests.
- GET requests that return a 5xx or 4xx response to monitor the total number of failed workstation registration requests.
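As a rough sketch of the first example, the following query totals failed requests to /v1/device_response. The label values shown (an uppercase GET method and the exact uri string) are assumptions about how the service populates these labels, so check them against the actual metrics output before relying on the query.

```
# Assumption: method is reported as "GET" and uri matches the path exactly.
sum(http_server_requests_seconds_count{method="GET", uri="/v1/device_response", status=~"4..|5.."})
```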
http_client_requests_seconds_count

The number of outgoing HTTP requests that the service sent as a client.

Useful labels for grouping similar metrics are clientName, instance, method, status, and uri.

For example, use these queries to monitor requests to Device Management or the HSM service that result in an error. The `=~` operator matches the status label against a regular expression, so `5..` matches any 5xx status:
```
http_client_requests_seconds_count{status=~"5.."}
```
or
```
http_client_requests_seconds_count{status=~"4.."}
```
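Because this metric is a cumulative count, an error rate is often easier to alert on than the raw value. A minimal sketch, assuming a five-minute window (an arbitrary choice) and grouping by the clientName label:

```
# Assumption: a 5m window suits your scrape interval; adjust as needed.
sum by (clientName) (rate(http_client_requests_seconds_count{status=~"4..|5.."}[5m]))
```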
batch_key_minutes_to_expiration

Time, in minutes, left before the batch key expires. A negative value indicates an expired batch key.

For example, a value of 10079 indicates that the batch key expires in just under seven days (seven days is 10080 minutes).
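One possible way to use this metric is as an alert condition that fires shortly before expiration. The one-day threshold below (1440 minutes) is a hypothetical value, not an Arm recommendation:

```
# Assumption: alert when less than one day (1440 minutes) remains.
batch_key_minutes_to_expiration < 1440
```

To display the remaining time in days, for example on a dashboard, divide by the number of minutes in a day:

```
batch_key_minutes_to_expiration / 1440
```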
batch_key_update_count_total

The number of batch key updates after the initial setup.
uploaded_reports_count_total

Total number of manufacturing reports Secure Factory Service uploads to Device Management.
prepared_reports_count_total

Total number of manufacturing reports Secure Factory Service generates. These are the reports the Manufacturing Statistics API returns.
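As an illustrative sketch, you can compare how many reports were generated and uploaded over the same period. The one-hour window is an assumption, and whether a persistent gap indicates a problem depends on how the service schedules uploads, which this document does not specify:

```
# Assumption: both metrics carry the same label set, so the series match one-to-one.
increase(prepared_reports_count_total[1h]) - increase(uploaded_reports_count_total[1h])
```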
process_uptime_seconds

The uptime of the Java virtual machine.

You can use this metric to monitor how long the Secure Factory Service Docker container has been up.
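For instance, a low uptime value can indicate that the container restarted recently. A minimal sketch, assuming a hypothetical ten-minute threshold:

```
# Assumption: treat any uptime under 600 seconds as a recent restart.
process_uptime_seconds < 600
```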
system_cpu_usage

The recent CPU usage for the entire system.
jvm_memory_used_bytes

The amount of used memory.

Use the area label to monitor heap or non-heap memory.
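A possible query for total heap usage per service instance is shown below. The label value "heap" is an assumption based on common JVM metric conventions; confirm the actual values in the metrics output.

```
# Assumption: area="heap" is the label value; the sum combines any per-pool series.
sum by (instance) (jvm_memory_used_bytes{area="heap"})
```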

MongoDB metrics

MongoDB service metrics URL: https://:9216/metrics

Available metrics:

mongodb_instance_uptime_seconds

The number of seconds the mongos or mongod process has been active.

For example, the value 3.275797e+06 indicates that MongoDB started 37.9 days ago.
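The conversion in this example is a division by 86400, the number of seconds in a day: 3275797 / 86400 ≈ 37.9. The same calculation as a query:

```
# Uptime in days (86400 seconds per day).
mongodb_instance_uptime_seconds / 86400
```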
mongodb_mongod_replset_number_of_members

The number of replica set members. This metric can indicate whether one or more MongoDB members disconnected from the replica set.

For example, because the number of connected members is vital, you can set an alert for when the number of available replica set members is less than two.
mongodb_mongod_replset_member_health

Indicates whether the member is up (1) or down (0).

You can set an alert for when a specific replica set member is no longer available.
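The two replica set alerts described above could be expressed with conditions along these lines. The threshold of two members comes from the example above; everything else is a plain comparison on the metric value:

```
# Fewer than two replica set members available.
mongodb_mongod_replset_number_of_members < 2
```

```
# A specific member is reported as down (0).
mongodb_mongod_replset_member_health == 0
```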
mongodb_memory

The memory data structure holds information regarding the target system architecture of mongod and current memory usage in megabytes.

Use the type label to create alerts when virtual or resident memory exceeds its limit.
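A minimal sketch of such an alert condition follows. Both the label value "resident" and the 4096 MB limit are assumptions for illustration; use the label values your exporter reports and a limit suited to your hardware.

```
# Assumption: type="resident" is the label value; the limit is a hypothetical 4096 MB.
mongodb_memory{type="resident"} > 4096
```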
mongodb_extra_info_heap_usage_bytes

The total size of heap space the database process uses.
mongodb_network_bytes_total

The number of bytes that MongoDB sends and receives over the network.

You can use the state label to monitor in_bytes or out_bytes data.
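For example, the following sketch shows inbound network throughput in bytes per second. The five-minute window is an arbitrary choice:

```
# Inbound bytes per second, averaged over the last 5 minutes.
rate(mongodb_network_bytes_total{state="in_bytes"}[5m])
```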
mongodb_network_metrics_num_requests_total

The total number of distinct requests that the server received.

Use this value to provide context for the in_bytes or out_bytes values and ensure that MongoDB’s network utilization is consistent with expectations and application use.
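One way to combine the two metrics is to estimate the average number of inbound bytes per request. This sketch assumes the two metrics share the same labels apart from state, which is why the division ignores that label:

```
# Assumption: the series differ only by the state label.
rate(mongodb_network_bytes_total{state="in_bytes"}[5m])
  / ignoring(state)
rate(mongodb_network_metrics_num_requests_total[5m])
```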
mongodb_mongod_metrics_document_total

Reflects document access and modification patterns and data use.

Use the state label to monitor updates, inserts, and deleted documents.
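As a sketch, grouping by the state label shows the per-second document rate for each kind of change without needing to know the exact label values in advance:

```
# Documents per second for each state value the exporter reports.
sum by (state) (rate(mongodb_mongod_metrics_document_total[5m]))
```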
mongodb_op_counters_total

Provides an overview of database operations by type and makes it possible to analyze the load on the database in a more granular manner.

These numbers grow over time and in response to database use. Analyze these values over time to track database usage.

Use the type label to monitor delete, insert, query or update operations.
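For example, you could track the combined write load by selecting the write operation types named above. The five-minute window is an assumption:

```
# Write operations per second (insert, update, and delete combined).
sum(rate(mongodb_op_counters_total{type=~"insert|update|delete"}[5m]))
```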
mongodb_connections_metrics_created_total

A count of all incoming connections to the server, including closed connections.
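Because this value only grows, the rate of new connections is usually more informative than the total. A minimal sketch, assuming a five-minute window:

```
# New connections per second, averaged over the last 5 minutes.
rate(mongodb_connections_metrics_created_total[5m])
```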

HSM service metrics

HSM service metrics URL: https://:9001/metrics

Available metrics:

hsm_machines

The number of HSM machines connected to the HSM service.

When you work with two HSM machines, it is vital to know whether one of them is down. If an HSM machine is unavailable, the HSM service no longer replicates keys and certificates.
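For a deployment with two HSM machines, an alert condition could be as simple as the following. The threshold assumes exactly two machines are expected:

```
# Assumption: two HSM machines are expected to be connected.
hsm_machines < 2
```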
http_server_requests_seconds_count

The number of incoming HTTP requests that the server received.

Useful labels for monitoring are exception, method, handler, and status.

For example, use these labels to retrieve GET requests that return 4xx and 5xx failure responses to monitor the HSM's total number of failed Diffie-Hellman key derivations.
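A sketch of such a query is shown below, grouped by handler so that you can pick out the endpoint of interest. The method value "GET" is an assumption about how the HSM service populates the label:

```
# Assumption: method is reported as "GET"; handler identifies the endpoint.
sum by (handler) (http_server_requests_seconds_count{method="GET", status=~"4..|5.."})
```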
performance_monitor_response_time_seconds

A histogram of method completion times.

Use this metric to monitor the number and latency of HSM service operations.

For example:
- Use this query to monitor the total number of HSM service requests to the HSM to get a stored key:
```
performance_monitor_response_time_seconds_count{method="getKey",resource="hsm",status="success"}
```
- Use this query to monitor the total number of such requests (HSM service requests to the HSM to get a stored key) that complete within one second:
```
performance_monitor_response_time_seconds_bucket{class="HsmServiceService",method="getKey",resource="hsm",status="success",le="1.0"}
```
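Building on the two queries above, you can derive latency indicators from the histogram. The first sketch computes the fraction of successful getKey requests that completed within one second; the second estimates the 95th percentile latency over a five-minute window. The window and the percentile are illustrative choices.

```
# Fraction of getKey requests completing within one second.
performance_monitor_response_time_seconds_bucket{method="getKey",resource="hsm",status="success",le="1.0"}
  / ignoring(le)
performance_monitor_response_time_seconds_count{method="getKey",resource="hsm",status="success"}
```

```
# Estimated 95th percentile latency, in seconds, over the last 5 minutes.
histogram_quantile(0.95, sum by (le) (rate(performance_monitor_response_time_seconds_bucket{method="getKey",resource="hsm",status="success"}[5m])))
```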
http_requests_error_500_total

The total number of HTTP requests with 5xx error statuses.

Use this metric to trigger an alert when an HSM service request to access the physical HSM or get input from Secure Factory Service results in an error.
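One possible alert condition fires whenever the counter has grown recently. The five-minute window is an assumption:

```
# Any new 5xx errors in the last 5 minutes.
increase(http_requests_error_500_total[5m]) > 0
```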
process_start_time_seconds

The start time of the process, in seconds since the Unix epoch. Indicates when the HSM service started.
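Assuming the metric follows the usual Prometheus client convention of reporting a Unix timestamp, you can derive the uptime of the HSM service by subtracting it from the current time:

```
# Assumption: the value is a Unix timestamp in seconds; the result is uptime in seconds.
time() - process_start_time_seconds
```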
jvm_memory_bytes_used

The used bytes of a given JVM memory area.

Use the area label to monitor heap or non-heap memory.
