The `cloud_service_monitoring` backend uses the Cloud Service Monitoring API to manage your SLOs.
```yaml
backends:
  cloud_service_monitoring:
    project_id: "${WORKSPACE_PROJECT_ID}"
```

SLOs are created from standard metrics available in Cloud Monitoring, and the data is stored in the Cloud Service Monitoring API (see docs).
The following methods are available to compute SLOs with the `cloud_service_monitoring` backend:

- `basic` to create standard SLOs for Google App Engine, Google Kubernetes Engine, and Cloud Endpoints.
- `good_bad_ratio` for metrics of type `DELTA` or `CUMULATIVE`.
- `distribution_cut` for metrics of type `DELTA` and unit `DISTRIBUTION`.
The `basic` method is used to let the Cloud Service Monitoring API automatically generate standardized SLOs for the following GCP services:
- Google App Engine
- Google Kubernetes Engine (with Istio)
- Google Cloud Endpoints
The SLO configuration uses Cloud Monitoring GCP metrics and only requires minimal configuration compared to custom SLOs.
Example config (App Engine availability):
```yaml
backend: cloud_service_monitoring
method: basic
service_level_indicator:
  app_engine:
    project_id: ${GAE_PROJECT_ID}
    module_id: ${GAE_MODULE_ID}
    availability: {}
```

For details on filling the `app_engine` fields, see the AppEngine spec.
Example config (Cloud Endpoint latency):
```yaml
backend: cloud_service_monitoring
method: basic
service_level_indicator:
  cloud_endpoints:
    service_name: ${ENDPOINT_URL}
    latency:
      threshold: 724 # ms
```

For details on filling the `cloud_endpoints` fields, see the CloudEndpoint spec.
Example config (Istio service latency):
```yaml
backend: cloud_service_monitoring
method: basic
service_level_indicator:
  mesh_istio:
    mesh_uid: ${GKE_MESH_UID}
    service_namespace: ${GKE_SERVICE_NAMESPACE}
    service_name: ${GKE_SERVICE_NAME}
    latency:
      threshold: 500 # ms
```

For details on filling the `mesh_istio` fields, see the MeshIstio spec.
Example config (Istio service latency) [DEPRECATED]:
```yaml
backend: cloud_service_monitoring
method: basic
service_level_indicator:
  cluster_istio:
    project_id: ${GKE_PROJECT_ID}
    location: ${GKE_LOCATION}
    cluster_name: ${GKE_CLUSTER_NAME}
    service_namespace: ${GKE_SERVICE_NAMESPACE}
    service_name: ${GKE_SERVICE_NAME}
    latency:
      threshold: 500 # ms
```

For details on filling the `cluster_istio` fields, see the ClusterIstio spec.
The `good_bad_ratio` method is used to compute the ratio between two metrics:

- Good events, i.e. events we consider 'good' from the user perspective.
- Bad or valid events, i.e. events we consider either 'bad' from the user perspective, or all events we consider 'valid' for the computation of the SLO.
This method is often used for availability SLOs, but can be used for other purposes as well (see examples).
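As a sketch of the arithmetic behind this method (the function name and event counts below are illustrative, not part of the backend):

```python
def sli_good_bad_ratio(good_count, bad_count=None, valid_count=None):
    """Compute an SLI as good / valid, where valid events are either
    counted directly or derived as good + bad."""
    if valid_count is None:
        valid_count = good_count + bad_count
    return good_count / valid_count

# 950 good events out of 1000 valid events -> SLI of 0.95
assert sli_good_bad_ratio(950, valid_count=1000) == 0.95
# Equivalent when counting 50 bad events instead of all valid events:
assert sli_good_bad_ratio(950, bad_count=50) == 0.95
```

This mirrors the two configuration styles below: `filter_good` paired with either `filter_valid` or `filter_bad`.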
Example config:
```yaml
backend: cloud_service_monitoring
method: good_bad_ratio
service_level_indicator:
  filter_good: >
    project="${GAE_PROJECT_ID}"
    metric.type="appengine.googleapis.com/http/server/response_count"
    resource.type="gae_app"
    metric.labels.response_code >= 200
    metric.labels.response_code < 500
  filter_valid: >
    project="${GAE_PROJECT_ID}"
    metric.type="appengine.googleapis.com/http/server/response_count"
```

You can also use the `filter_bad` field, which identifies bad events, instead of the `filter_valid` field, which identifies all valid events.
The `distribution_cut` method is used for Cloud distribution-type metrics, which are usually used for latency metrics.
A distribution metric records the statistical distribution of the extracted values in histogram buckets. The extracted values are not recorded individually; instead, their distribution across the configured buckets is recorded, along with the count, mean, and sum of squared deviation of the values.
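To illustrate what cutting a distribution does, here is a minimal sketch that computes the fraction of 'good' events from histogram buckets. The bucket layout and function name are assumptions for illustration, not the API's actual data model:

```python
def distribution_cut_sli(bucket_bounds, bucket_counts, range_max):
    """Fraction of events in histogram buckets whose upper bound falls
    within range_max (the 'good' latency range).

    bucket_bounds[i] is the upper bound of bucket i; bucket_counts may
    contain one extra trailing count for the unbounded overflow bucket.
    """
    good = sum(
        count for bound, count in zip(bucket_bounds, bucket_counts)
        if bound <= range_max
    )
    total = sum(bucket_counts)
    return good / total

# Buckets with upper bounds 100 ms, 500 ms, 724 ms, plus overflow:
bounds = [100, 500, 724]
counts = [600, 250, 100, 50]  # 50 events were slower than 724 ms
assert distribution_cut_sli(bounds, counts, range_max=724) == 0.95
```

In practice the Cloud Service Monitoring API performs this computation server-side from the `range_min` / `range_max` fields shown in the example below.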
Example config:
```yaml
backend: cloud_service_monitoring
method: distribution_cut
service_level_indicator:
  filter_valid: >
    project=${GAE_PROJECT_ID}
    metric.type=appengine.googleapis.com/http/server/response_latencies
    metric.labels.response_code >= 200
    metric.labels.response_code < 500
  range_min: 0
  range_max: 724 # ms
```

The `range_min` and `range_max` fields specify the latency range that we consider 'good'.
Since Cloud Service Monitoring API persists Service and ServiceLevelObjective objects, we need ways to keep our local SLO YAML configuration synced with the remote objects.
Auto-imported
Some services are auto-imported by the Service Monitoring API: they correspond to SLO configurations using the basic method.
The following conventions are used by the Service Monitoring API to give a unique id to an auto-imported Service:
- App Engine: `gae:{project_id}_{module_id}` → Make sure that the `app_engine` block in your config has the correct fields corresponding to your App Engine service.
- Cloud Endpoints: `ist:{project_id}-{service}` → Make sure that the `cloud_endpoints` block in your config has the correct fields corresponding to your Cloud Endpoints service.
- Mesh Istio [NOT YET RELEASED]: `ist:{project_id}-{mesh_uid}-{service_namespace}-{service_name}` → Make sure that the `mesh_istio` block in your config has the correct fields corresponding to your Istio service.
- Cluster Istio [DEPRECATED SOON]: `ist:{project_id}-{suffix}-{location}-{cluster_name}-{service_namespace}-{service_name}` → Make sure that the `cluster_istio` block in your config has the correct fields corresponding to your Istio service.
You cannot import an existing ServiceLevelObjective object, since they use a random id.
Custom
Custom services are the ones you create yourself using the Cloud Service Monitoring API and the slo-generator.
The following conventions are used by the slo-generator to give a unique id to custom Service and ServiceLevelObjective objects:

- `service_id = ${metadata.service_name}-${metadata.feature_name}`
- `slo_id = ${metadata.service_name}-${metadata.feature_name}-${metadata.slo_name}-${window}`
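A minimal sketch of how these ids are assembled (the helper and sample values are illustrative; the real implementation lives in the slo-generator code):

```python
def custom_ids(metadata, window):
    """Build the Service id and SLO id the slo-generator derives from
    an SLO config's metadata and the Error Budget Policy window."""
    service_id = f"{metadata['service_name']}-{metadata['feature_name']}"
    slo_id = f"{service_id}-{metadata['slo_name']}-{window}"
    return service_id, slo_id

# Hypothetical SLO config metadata:
metadata = {
    "service_name": "gae",
    "feature_name": "app",
    "slo_name": "availability",
}
service_id, slo_id = custom_ids(metadata, window=3600)
assert service_id == "gae-app"
assert slo_id == "gae-app-availability-3600"
```

Because the ids are derived from these fields, changing any of them makes the slo-generator target a different remote object, which is why the fields below must stay stable.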
To keep track of those, do not update any of the following fields in your configs:
- `metadata.service_name`, `metadata.feature_name` and `metadata.slo_name` in the SLO config.
- `window` in the Error Budget Policy.
If you need to update any of those fields, first run the slo-generator with the `-d` (delete) flag (see #deleting-objects), then re-run normally.
To import an existing custom Service object, find its service id from the API and fill in the `service_id` field in the `service_level_indicator` configuration.
To delete an SLO object in the Cloud Monitoring API using the `cloud_service_monitoring` class, run the slo-generator with the `-d` (or `--delete`) flag:

```sh
slo-generator -f <SLO_CONFIG_PATH> -b <ERROR_BUDGET_POLICY> --delete
```

See the Cloud Service Monitoring docs for instructions on alerting.
Complete SLO samples using Cloud Service Monitoring are available in samples/cloud_service_monitoring. Check them out!