Monitoring system health
To monitor our system’s health, Arm uses Prometheus, an open-source toolkit for system monitoring and alerts.
Prometheus can collect metrics from any service that exposes them in OpenMetrics, a text-based standard format that Secure Factory services use.
With the Secure Factory service metrics detailed in this document, you can set up alerts and monitor system behavior using:
- Prometheus Alertmanager, a rule-based engine that can trigger notifications about system behaviors.
- Grafana to visualize the metrics.
Using Prometheus in the factory
Arm recommends monitoring the system at various levels:
-
Node monitoring using Prometheus node exporter to validate:
- Node availability, uptime, and number of available nodes.
- Node health, including memory consumption, CPU consumption, and disk space.
- Network traffic at node level.
-
Docker container monitoring using Google cAdvisor to validate:
- Docker container health in each of the nodes.
- Network traffic.
-
Secure Factory Service with Arm-provided Docker images.
For each of the running containers, measure specific application metrics, such as:
- Factory manufacturing information.
- Batch key retrieval events and batch key expiration information.
- Internal communication errors between servers.
-
Database health using MongoDB exporter, which Arm provides with the Secure Factory deployment.
-
HSM (Hardware Security Module) health with Arm-provided Docker images.
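As a rough illustration of the node-level and container-level checks listed above, the following queries are a minimal sketch. They assume a scrape job named node (a hypothetical name; use the job name from your own Prometheus configuration) and the metric names that recent versions of the Prometheus node exporter and cAdvisor expose, which may differ in your deployment.

```promql
# Number of node exporter targets that Prometheus can currently scrape.
# The job name "node" is an assumption.
count(up{job="node"} == 1)

# Fraction of memory still available on each node.
node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes

# Fraction of CPU time each node spent non-idle over the last five minutes.
1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m]))

# Fraction of disk space still available on the root filesystem.
node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"}

# Bytes received per second on each network interface.
rate(node_network_receive_bytes_total[5m])

# Memory used by each named Docker container, as reported by cAdvisor.
container_memory_usage_bytes{name!=""}
```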
Using the metrics
Secure Factory Service metrics
Secure Factory Service metrics URL: https://
Secure Factory Admin service metrics URL: https://
Available metrics:
-
http_server_requests_seconds_count
The count of server HTTP input requests.
exception, method, outcome, status, and uri are useful labels for monitoring.
For example, use these labels to retrieve:
- GET requests to /v1/device_response that return a 5xx or 4xx response to monitor the total number of failed provisioning requests.
- GET requests that return a 5xx or 4xx response to monitor the total number of failed workstation registration requests.
-
http_client_requests_seconds_count
The count of client HTTP output requests.
clientName, instance, method, status, and uri are useful labels for grouping similar metrics.
For example, use these queries to monitor requests to Device Management or the HSM service that result in an error. The =~ operator matches the status label against a regular expression:
http_client_requests_seconds_count{status=~"5.."}
or
http_client_requests_seconds_count{status=~"4.."}
-
batch_key_minutes_to_expiration
Time, in minutes, left before the batch key expires. A negative value indicates an expired batch key.
For example, a value of 10079 indicates that the batch key expires in just under seven days.
-
batch_key_update_count_total
The number of batch key updates after the initial setup.
-
uploaded_reports_count_total
Total number of manufacturing reports Secure Factory Service uploads to Device Management.
-
prepared_reports_count_total
Total number of manufacturing reports Secure Factory Service generates. These are the reports the Manufacturing Statistics API returns.
-
process_uptime_seconds
The uptime of the Java virtual machine.
You can use this metric to monitor how long the Secure Factory Service Docker container has been running.
-
system_cpu_usage
The recent CPU usage for the entire system.
-
jvm_memory_used_bytes
The amount of used memory.
Use the area label to monitor heap or non-heap memory.
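The Secure Factory Service metrics above can be combined into queries and alert conditions. The following is a minimal sketch. The five-minute windows, the seven-day and ten-minute thresholds, and the exact form of the uri label value are assumptions to adapt to your deployment.

```promql
# Failed provisioning requests (4xx or 5xx responses) over the last five minutes.
# Assumes the uri label holds the path exactly as "/v1/device_response".
sum(increase(http_server_requests_seconds_count{method="GET", uri="/v1/device_response", status=~"4..|5.."}[5m]))

# Batch keys that expire in less than seven days (7 * 24 * 60 = 10080 minutes).
batch_key_minutes_to_expiration < 10080

# Batch keys that have already expired.
batch_key_minutes_to_expiration < 0

# Manufacturing reports uploaded to Device Management during the last hour.
increase(uploaded_reports_count_total[1h])

# Instances that started less than ten minutes ago, which can indicate a restart.
process_uptime_seconds < 600

# Total heap memory in use per instance.
sum by (instance) (jvm_memory_used_bytes{area="heap"})
```

A comparison such as batch_key_minutes_to_expiration < 10080 returns only the series that satisfy the condition, so you can use it directly as the expression of a Prometheus alerting rule.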
MongoDB metrics
MongoDB service metrics URL: https://
Available metrics:
-
mongodb_instance_uptime_seconds
The number of seconds the mongos or mongod process has been active.
For example, the value 3.275797e+06 indicates that MongoDB started 37.9 days ago.
-
mongodb_mongod_replset_number_of_members
The number of replica set members. This metric can indicate whether one or more MongoDB members disconnected from the replica set.
For example, because the number of connected members is vital, you can set an alert for when the number of available replica set members is less than two.
-
mongodb_mongod_replset_member_health
Indicates whether the member is up (1) or down (0).
You can set an alert for when a specific replica set member is no longer available.
-
mongodb_memory
The memory data structure holds information about the target system architecture of mongod and its current memory usage, in megabytes.
Use the type label to create alerts when virtual or resident memory exceeds its limit.
-
mongodb_extra_info_heap_usage_bytes
The total size of heap space the database process uses.
-
mongodb_network_bytes_total
The amount of data, in bytes, that MongoDB sends and receives over the network.
You can use the state label to monitor in_bytes or out_bytes data.
-
mongodb_network_metrics_num_requests_total
The total number of distinct requests that the server received.
Use this value to provide context for the in_bytes or out_bytes values and ensure that MongoDB’s network utilization is consistent with expectations and application use.
-
mongodb_mongod_metrics_document_total
Reflects document access and modification patterns and data use.
Use the state label to monitor updated, inserted, and deleted documents.
-
mongodb_op_counters_total
Provides an overview of database operations by type and makes it possible to analyze the load on the database in a more granular manner.
These numbers grow over time and in response to database use. Analyze these values over time to track database usage.
Use the type label to monitor delete, insert, query, or update operations.
-
mongodb_connections_metrics_created_total
A count of all incoming connections to the server, including closed connections.
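The MongoDB metrics above lend themselves to simple alert conditions and rate queries. The following is a minimal sketch. The 4096 MB memory limit and the five-minute windows are hypothetical values, and the label value resident is an assumption based on the description of the type label.

```promql
# Fewer than two replica set members are available.
mongodb_mongod_replset_number_of_members < 2

# A specific replica set member is down.
mongodb_mongod_replset_member_health == 0

# Resident memory exceeds a hypothetical limit of 4096 MB.
mongodb_memory{type="resident"} > 4096

# Inbound network traffic, in bytes per second.
rate(mongodb_network_bytes_total{state="in_bytes"}[5m])

# Database operations per second, grouped by operation type.
sum by (type) (rate(mongodb_op_counters_total[5m]))

# MongoDB uptime in days (86400 seconds per day).
mongodb_instance_uptime_seconds / 86400
```

Counters such as mongodb_op_counters_total only grow, so wrapping them in rate() shows the current load rather than the running total.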
HSM service metrics
HSM service metrics URL: https://
Available metrics:
-
hsm_machines
The number of HSM machines connected to the HSM service.
When you work with two HSM machines, it is vital to know when one of them is down, because the HSM service then stops replicating keys and certificates.
-
http_server_requests_seconds_count
The count of server HTTP input requests.
exception, method, handler, and status are useful labels for monitoring.
For example, use these labels to retrieve GET requests that return 4xx and 5xx failure responses to monitor the HSM's total number of failed Diffie-Hellman key derivations.
-
performance_monitor_response_time_seconds
A histogram of method completion times.
Use this metric to monitor the number and latency of HSM service operations.
For example:
-
Use this query to monitor the total number of HSM service requests to the HSM to get a stored key:
performance_monitor_response_time_seconds_count{method="getKey",resource="hsm",status="success"}
-
Use this query to monitor the total number of such requests (HSM service requests to the HSM to get a stored key) that complete in one second or less:
performance_monitor_response_time_seconds_bucket{class="HsmServiceService",method="getKey",resource="hsm",status="success",le="1.0"}
-
http_requests_error_500_total
The total number of HTTP requests with 5xx error statuses.
Use this metric to trigger an alert when an HSM service request to access the physical HSM or get input from Secure Factory Service results in an error.
-
process_start_time_seconds
The start time of the process, in seconds since the Unix epoch. Indicates when the HSM service started.
-
jvm_memory_bytes_used
The used bytes of a given JVM memory area.
Use the area label to monitor heap or non-heap memory.
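The HSM service metrics above can be queried in a similar way. The following is a minimal sketch. It assumes a two-machine HSM deployment, five-minute windows, and a 95th percentile target, none of which come from the Secure Factory documentation.

```promql
# Fewer than two HSM machines are connected to the HSM service.
hsm_machines < 2

# Fraction of successful getKey requests that completed in one second or less
# over the last five minutes.
sum(rate(performance_monitor_response_time_seconds_bucket{method="getKey", resource="hsm", status="success", le="1.0"}[5m]))
/
sum(rate(performance_monitor_response_time_seconds_count{method="getKey", resource="hsm", status="success"}[5m]))

# Estimated 95th percentile latency, in seconds, of successful getKey requests.
histogram_quantile(0.95, sum by (le) (rate(performance_monitor_response_time_seconds_bucket{method="getKey", resource="hsm", status="success"}[5m])))

# At least one HTTP 5xx error occurred in the last five minutes.
increase(http_requests_error_500_total[5m]) > 0

# Seconds elapsed since the HSM service process started.
time() - process_start_time_seconds

# Heap memory currently in use.
jvm_memory_bytes_used{area="heap"}
```

The histogram_quantile() function estimates the percentile from the bucket boundaries, so its accuracy depends on which le buckets the HSM service exposes.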