Observability Guide

What is CVT Observability?

CVT provides comprehensive observability features to monitor the health, performance, and usage of your contract validation infrastructure. This includes Prometheus metrics for real-time monitoring, Grafana dashboards for visualization, and structured logging for debugging and audit trails.

Architecture Context

For system architecture context including how observability integrates with the CVT server, see the Architecture Overview.

Overview

Observability in CVT is built on three components:

  • Prometheus Metrics: Real-time metrics collection and storage
  • Grafana Dashboards: Visual monitoring and analytics
  • Structured Logging: Detailed operation logs with Zap logger

Architecture

Quick Start

Start the Observability Stack

# Start CVT server, Prometheus, and Grafana
make up

# Check status
make observability-status

Access the UIs

  • Grafana: http://localhost:3000 (log in with admin / admin)
  • Prometheus UI: open with make prometheus
  • CVT metrics endpoint: http://localhost:9551/metrics

Quick Commands

# View metrics in terminal
make metrics

# Open Grafana dashboard
make grafana

# Open Prometheus UI
make prometheus

# View observability logs
make observability-logs

Metrics Collected

Schema Registration Metrics

| Metric | Type | Labels | Description |
|---|---|---|---|
| cvt_schemas_registered_total | Counter | status (success/failure) | Total number of schemas registered |
| cvt_schema_registration_errors_total | Counter | error_type | Schema registration errors by type |

Validation Metrics

| Metric | Type | Labels | Description |
|---|---|---|---|
| cvt_validations_total | Counter | schema_id, method, result | Total validations performed |
| cvt_validation_duration_seconds | Histogram | schema_id, method | Validation operation duration |
| cvt_validation_errors_total | Counter | error_category | Validation errors by category |

Error Categories:

  • input_validation: Invalid request parameters
  • schema_not_found: Schema not found in cache
  • http_request_creation: Failed to create HTTP request for validation
  • router_creation: Failed to create OpenAPI router
  • route_not_found: Route not found in OpenAPI spec
  • request_invalid: HTTP request validation failed
  • response_invalid: HTTP response validation failed

Cache Metrics

| Metric | Type | Description |
|---|---|---|
| cvt_cache_hits_total | Counter | Total cache hits |
| cvt_cache_misses_total | Counter | Total cache misses |
| cvt_cache_size_bytes | Gauge | Current cache size in bytes |
| cvt_cache_items_total | Gauge | Current number of cached items |

gRPC Metrics

| Metric | Type | Labels | Description |
|---|---|---|---|
| cvt_grpc_requests_total | Counter | method, status | Total gRPC requests |
| cvt_grpc_request_duration_seconds | Histogram | method | gRPC request duration |

Methods:

  • RegisterSchema
  • ValidateInteraction
  • GetSchema
  • ListSchemas
  • ValidateProducerResponse
  • CompareSchemas
  • GenerateFixture
  • ListEndpoints
  • RegisterConsumer
  • ListConsumers
  • DeregisterConsumer
  • CanIDeploy

Schema Versioning Metrics

| Metric | Type | Labels | Description |
|---|---|---|---|
| cvt_breaking_changes_detected_total | Counter | change_type | Breaking changes detected by type |
| cvt_schema_versions_total | Gauge | schema_id | Number of versions per schema |

Change Types:

  • ENDPOINT_REMOVED
  • REQUIRED_FIELD_ADDED
  • TYPE_CHANGED
  • REQUIRED_PARAMETER_ADDED
  • RESPONSE_SCHEMA_CHANGED
  • ENUM_VALUE_REMOVED

Authentication Metrics

| Metric | Type | Labels | Description |
|---|---|---|---|
| cvt_auth_success_total | Counter | - | Total successful authentications |
| cvt_auth_failure_total | Counter | reason | Authentication failures by reason |

Failure Reasons:

  • missing_key: No API key provided
  • invalid_key: Invalid API key

Governance Metrics

| Metric | Type | Labels | Description |
|---|---|---|---|
| cvt_schemas_by_owner | Gauge | owner | Number of schemas per owner |
| cvt_schemas_by_team | Gauge | team | Number of schemas per team |
| cvt_read_only_violations_total | Counter | - | Attempts to modify read-only schemas |

Audit Metrics

| Metric | Type | Labels | Description |
|---|---|---|---|
| cvt_audit_events_total | Counter | event_type | Total audit events by type |

Grafana Dashboard

The CVT Grafana dashboard visualizes the metrics above in real time through the following panels.

Panels

  1. Validations per Second (Stat)

    • Current rate of validations
    • Threshold indicators for performance
  2. Validation Results Over Time (Time Series)

    • Valid vs Invalid vs Error validations
    • Trends and patterns
  3. Validation Latency (Percentiles) (Time Series)

    • p50, p95, p99 latency
    • Performance SLOs
  4. Cache Hit Rate (Gauge)

    • Visual indicator of cache effectiveness
    • Color-coded thresholds:
      • Red: < 50%
      • Yellow: 50-80%
      • Green: > 80%
  5. Validation Errors by Category (Time Series)

    • Error distribution over time
    • Helps identify problem areas
  6. gRPC Requests by Method (Time Series)

    • Request distribution
    • Usage patterns
  7. Summary Stats (Stats)

    • Total Schemas Registered
    • Cache Hits/sec
    • Cache Misses/sec

Accessing the Dashboard

  1. Open Grafana: http://localhost:3000
  2. Log in with admin / admin
  3. Navigate to Dashboards > CVT - Contract Validator Toolkit

Prometheus Configuration

The Prometheus configuration scrapes metrics from the CVT server every 10 seconds:

scrape_configs:
  - job_name: 'cvt-server'
    scrape_interval: 10s
    static_configs:
      - targets: ['cvt-server:9551']

Useful Prometheus Queries

Validation Rate

sum(rate(cvt_validations_total[5m]))

Cache Hit Rate

sum(rate(cvt_cache_hits_total[5m])) /
(sum(rate(cvt_cache_hits_total[5m])) + sum(rate(cvt_cache_misses_total[5m])))
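
The same ratio can be worked through in Go to make the arithmetic and the edge case explicit. This is a minimal sketch, not CVT code: hitRate and threshold are hypothetical helper names, the input rates are made-up values, and the color bands are taken from the Cache Hit Rate panel description (80% itself is assumed to fall in the yellow band).

```go
package main

import "fmt"

// hitRate returns hits / (hits + misses). The second result is false when
// there is no cache traffic, the case where the PromQL expression divides
// zero by zero and yields NaN.
func hitRate(hits, misses float64) (float64, bool) {
	total := hits + misses
	if total == 0 {
		return 0, false
	}
	return hits / total, true
}

// threshold maps a hit rate to the dashboard color bands:
// red below 50%, yellow from 50% to 80%, green above 80%.
func threshold(rate float64) string {
	switch {
	case rate < 0.5:
		return "red"
	case rate <= 0.8:
		return "yellow"
	default:
		return "green"
	}
}

func main() {
	// Hypothetical per-second rates standing in for the rate(...[5m]) results.
	rate, ok := hitRate(90, 10)
	if !ok {
		fmt.Println("no cache traffic in the window")
		return
	}
	fmt.Printf("hit rate %.0f%% -> %s\n", rate*100, threshold(rate))
}
```

Returning an explicit "no traffic" flag avoids treating an idle cache as a 0% hit rate, which would otherwise look like a problem on a quiet system.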

p95 Latency

histogram_quantile(0.95, sum(rate(cvt_validation_duration_seconds_bucket[5m])) by (le))
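
To show what this query estimates, the sketch below reimplements the core idea in Go: find the cumulative bucket that contains the target rank, then interpolate linearly inside it. It is a simplified illustration rather than Prometheus's actual implementation. The names bucket, quantile, and sampleBuckets are hypothetical, the two finite buckets come from the example output in this guide, and the +Inf bucket count of 42 is an assumption.

```go
package main

import (
	"fmt"
	"math"
	"sort"
)

// bucket is one cumulative histogram bucket: the number of observations
// less than or equal to upperBound (the "le" label).
type bucket struct {
	upperBound float64
	count      float64
}

// quantile estimates the q-quantile (0 <= q <= 1) by linear interpolation
// inside the bucket that contains the target rank. It sorts the slice in
// place and expects the highest bucket to be le="+Inf".
func quantile(q float64, buckets []bucket) float64 {
	if len(buckets) < 2 || q < 0 || q > 1 {
		return math.NaN()
	}
	sort.Slice(buckets, func(i, j int) bool {
		return buckets[i].upperBound < buckets[j].upperBound
	})
	last := len(buckets) - 1
	total := buckets[last].count
	if total == 0 {
		return math.NaN()
	}
	rank := q * total
	// First finite bucket whose cumulative count reaches the rank.
	i := sort.Search(last, func(i int) bool { return buckets[i].count >= rank })
	if i == last {
		// The rank falls in the +Inf bucket: report the highest finite bound.
		return buckets[last-1].upperBound
	}
	lower, below := 0.0, 0.0
	if i > 0 {
		lower, below = buckets[i-1].upperBound, buckets[i-1].count
	}
	inBucket := buckets[i].count - below
	if inBucket == 0 {
		return lower
	}
	return lower + (buckets[i].upperBound-lower)*(rank-below)/inBucket
}

// sampleBuckets returns illustrative data; the +Inf count is assumed.
func sampleBuckets() []bucket {
	return []bucket{{0.001, 10}, {0.005, 35}, {math.Inf(1), 42}}
}

func main() {
	fmt.Printf("p50 ~ %.5fs\n", quantile(0.50, sampleBuckets()))
	fmt.Printf("p95 ~ %.5fs\n", quantile(0.95, sampleBuckets()))
}
```

With these buckets the p95 rank lands in the +Inf bucket, so the estimate is capped at the highest finite bound. That is a reminder that bucket boundaries need to cover the latencies you want to alert on.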

Error Rate by Category

sum by (error_category) (rate(cvt_validation_errors_total[5m]))

Breaking Changes by Type

sum by (change_type) (rate(cvt_breaking_changes_detected_total[5m]))

Authentication Failure Rate

sum by (reason) (rate(cvt_auth_failure_total[5m]))

Custom Metrics Endpoint

The CVT server exposes a Prometheus-compatible /metrics endpoint on port 9551:

curl http://localhost:9551/metrics

Example Output

# HELP cvt_validations_total Total number of validations performed
# TYPE cvt_validations_total counter
cvt_validations_total{method="POST",result="valid",schema_id="petstore-v3"} 42

# HELP cvt_validation_duration_seconds Duration of validation operations in seconds
# TYPE cvt_validation_duration_seconds histogram
cvt_validation_duration_seconds_bucket{le="0.001",method="POST",schema_id="petstore-v3"} 10
cvt_validation_duration_seconds_bucket{le="0.005",method="POST",schema_id="petstore-v3"} 35
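
For quick scripting against this output, a sample line can be split into its metric name, labels, and value. The Go sketch below is deliberately simplified: parseSample and the sample type are hypothetical names, and it assumes no trailing timestamp and no commas or escaped quotes inside label values. A production tool would more likely rely on an existing exposition-format parser.

```go
package main

import (
	"errors"
	"fmt"
	"strconv"
	"strings"
)

// sample is one parsed line of the text exposition format.
type sample struct {
	name   string
	labels map[string]string
	value  float64
}

// parseSample handles lines of the form name{k="v",...} value.
// Comment lines (# HELP, # TYPE) and blank lines are rejected.
func parseSample(line string) (sample, error) {
	line = strings.TrimSpace(line)
	if line == "" || strings.HasPrefix(line, "#") {
		return sample{}, errors.New("not a sample line")
	}
	sp := strings.LastIndex(line, " ")
	if sp < 0 {
		return sample{}, errors.New("missing value")
	}
	value, err := strconv.ParseFloat(line[sp+1:], 64)
	if err != nil {
		return sample{}, fmt.Errorf("bad value: %w", err)
	}
	series := strings.TrimSpace(line[:sp])
	s := sample{name: series, labels: map[string]string{}, value: value}
	open := strings.Index(series, "{")
	if open < 0 {
		return s, nil // metric without labels
	}
	if !strings.HasSuffix(series, "}") {
		return sample{}, errors.New("unterminated label set")
	}
	s.name = series[:open]
	for _, pair := range strings.Split(series[open+1:len(series)-1], ",") {
		if pair == "" {
			continue
		}
		k, v, ok := strings.Cut(pair, "=")
		if !ok {
			return sample{}, fmt.Errorf("bad label %q", pair)
		}
		s.labels[k] = strings.Trim(v, `"`)
	}
	return s, nil
}

func main() {
	line := `cvt_validations_total{method="POST",result="valid",schema_id="petstore-v3"} 42`
	s, err := parseSample(line)
	if err != nil {
		fmt.Println("parse error:", err)
		return
	}
	fmt.Println(s.name, s.labels["schema_id"], s.labels["result"], s.value)
}
```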

Structured Logging

CVT uses Zap for structured logging.

Log Levels

Set the log level with the LOG_LEVEL environment variable:

  • debug: Verbose logging (development)
  • info: Standard logging (production)
  • warn: Warnings only
  • error: Errors only

Example Logs

{
  "level": "info",
  "ts": 1638360000.123,
  "caller": "server/cvtservice/service.go:110",
  "msg": "Schema registered successfully",
  "schemaId": "petstore-v3"
}

{
  "level": "info",
  "ts": 1638360001.456,
  "caller": "server/cvtservice/service.go:237",
  "msg": "Interaction validated successfully",
  "schemaId": "petstore-v3",
  "method": "POST",
  "path": "/pets"
}
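
Because each entry is a single JSON object, the logs are straightforward to process with standard tooling. The following Go sketch decodes an entry shaped like the examples above and converts the ts field (epoch seconds with a fractional part) into a timestamp. The logEntry type and parseLogEntry function are hypothetical names, and the field set is assumed from the two example entries only; real entries may carry additional keys.

```go
package main

import (
	"encoding/json"
	"fmt"
	"math"
	"time"
)

// logEntry mirrors the fields seen in the example log lines.
type logEntry struct {
	Level    string  `json:"level"`
	TS       float64 `json:"ts"`
	Caller   string  `json:"caller"`
	Msg      string  `json:"msg"`
	SchemaID string  `json:"schemaId"`
	Method   string  `json:"method,omitempty"`
	Path     string  `json:"path,omitempty"`
}

// Time converts the fractional epoch-seconds "ts" value to a UTC time.
func (e logEntry) Time() time.Time {
	sec, frac := math.Modf(e.TS)
	return time.Unix(int64(sec), int64(frac*1e9)).UTC()
}

// parseLogEntry decodes one JSON log line; unknown keys are ignored.
func parseLogEntry(line []byte) (logEntry, error) {
	var e logEntry
	err := json.Unmarshal(line, &e)
	return e, err
}

func main() {
	line := []byte(`{"level":"info","ts":1638360001.456,"caller":"server/cvtservice/service.go:237","msg":"Interaction validated successfully","schemaId":"petstore-v3","method":"POST","path":"/pets"}`)
	e, err := parseLogEntry(line)
	if err != nil {
		fmt.Println("decode error:", err)
		return
	}
	fmt.Printf("%s [%s] %s schema=%s %s %s\n",
		e.Time().Format(time.RFC3339), e.Level, e.Msg, e.SchemaID, e.Method, e.Path)
}
```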

Production Deployment

# docker-compose.yml
services:
  cvt-server:
    environment:
      - LOG_LEVEL=info
      - CVT_PORT=9550
      - CVT_METRICS_PORT=9551

  prometheus:
    volumes:
      - prometheus-data:/prometheus
    restart: unless-stopped

  grafana:
    volumes:
      - grafana-data:/var/lib/grafana
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=${GRAFANA_PASSWORD}
    restart: unless-stopped

Security Considerations

  1. Change default credentials: Update Grafana admin password
  2. Network isolation: Use Docker networks to isolate services
  3. TLS encryption: Enable HTTPS for Grafana in production
  4. Authentication: Configure Grafana OAuth or LDAP
  5. Firewall rules: Restrict access to metrics and monitoring ports

Retention and Storage

  • Prometheus: Default retention is 15 days
  • Grafana: Stores dashboard configurations only
  • Metrics cardinality: Monitor label cardinality (for example, schema_id) to keep the number of time series bounded

# Increase Prometheus retention
prometheus:
  command:
    - '--storage.tsdb.retention.time=30d'
    - '--storage.tsdb.retention.size=10GB'
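
On the cardinality point above, a rough upper bound on the number of time series for one metric is the product of the distinct values of each of its labels. The Go sketch below works through that arithmetic for cvt_validations_total. The helper name seriesUpperBound and all three counts are assumptions chosen for illustration, not measured values.

```go
package main

import "fmt"

// seriesUpperBound multiplies the number of distinct values per label.
// Real series counts are usually lower, since not every combination occurs.
func seriesUpperBound(distinctValues map[string]int) int {
	total := 1
	for _, n := range distinctValues {
		total *= n
	}
	return total
}

func main() {
	labels := map[string]int{
		"schema_id": 1000, // assumed: one value per registered schema
		"method":    5,    // assumed: number of HTTP methods in use
		"result":    3,    // assumed: valid, invalid, error
	}
	fmt.Println("cvt_validations_total series (upper bound):", seriesUpperBound(labels))
}
```

Histograms multiply this further, because each label combination produces one series per bucket plus the sum and count series, so per-schema labels on histograms deserve the closest attention.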

Alerting

Sample Alert Rules

Create observability/alert-rules.yml:

groups:
  - name: cvt_alerts
    interval: 30s
    rules:
      - alert: HighErrorRate
        expr: sum(rate(cvt_validation_errors_total[5m])) > 10
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: High validation error rate
          description: Error rate is {{ $value }} errors/sec

      - alert: LowCacheHitRate
        expr: |
          sum(rate(cvt_cache_hits_total[5m])) /
          (sum(rate(cvt_cache_hits_total[5m])) + sum(rate(cvt_cache_misses_total[5m]))) < 0.5
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: Cache hit rate is low
          description: Cache hit rate is {{ $value | humanizePercentage }}

      - alert: HighLatency
        expr: histogram_quantile(0.95, sum(rate(cvt_validation_duration_seconds_bucket[5m])) by (le)) > 0.1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: High validation latency
          description: p95 latency is {{ $value }}s
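
The interaction between the group's interval and each rule's for clause is easy to misread: an alert becomes pending as soon as its expression is true, and only fires once the expression has stayed true for the full for duration. The Go sketch below simulates that state machine for the HighErrorRate timings (30s interval, 5m hold). It is an illustration of the semantics, not Prometheus code; the alert type, eval, and simulate are hypothetical names.

```go
package main

import (
	"fmt"
	"time"
)

const (
	evalInterval = 30 * time.Second // "interval: 30s" on the rule group
	holdFor      = 5 * time.Minute  // "for: 5m" on the rule
)

// alert tracks whether the condition is currently true and since when.
type alert struct {
	active      bool
	activeSince time.Time
}

// eval returns the alert state after one rule evaluation at time now.
func (a *alert) eval(now time.Time, condition bool, hold time.Duration) string {
	if !condition {
		a.active = false // any false evaluation resets the hold timer
		return "inactive"
	}
	if !a.active {
		a.active = true
		a.activeSince = now
	}
	if now.Sub(a.activeSince) >= hold {
		return "firing"
	}
	return "pending"
}

// simulate evaluates one condition value per step and returns the state
// observed after each evaluation.
func simulate(conditions []bool, step, hold time.Duration) []string {
	var a alert
	start := time.Unix(0, 0)
	states := make([]string, 0, len(conditions))
	for i, c := range conditions {
		states = append(states, a.eval(start.Add(time.Duration(i)*step), c, hold))
	}
	return states
}

func main() {
	// Hypothetical run: the expression is true for 11 evaluations, then false.
	conditions := make([]bool, 12)
	for i := 0; i < 11; i++ {
		conditions[i] = true
	}
	for i, s := range simulate(conditions, evalInterval, holdFor) {
		fmt.Printf("t=%v %s\n", time.Duration(i)*evalInterval, s)
	}
}
```

A single false evaluation resets the timer, so short dips below the threshold delay firing by a full for period. This is worth keeping in mind when choosing how long a condition must hold.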

Troubleshooting

Metrics Not Appearing

  1. Check CVT server is running:

    make status
  2. Verify metrics endpoint is accessible:

    curl http://localhost:9551/metrics
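  3. Check that Prometheus is scraping: open the Prometheus UI (make prometheus), go to Status > Targets, and confirm that the cvt-server target is up.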

Grafana Dashboard Not Loading

  1. Check Grafana is running:

    docker ps | grep grafana
  2. Verify datasource configuration:

    • Open http://localhost:3000
    • Go to Configuration > Data Sources
    • Verify Prometheus is configured and working
  3. Check logs:

    make observability-logs

High Memory Usage

Monitor cache size and adjust configuration:

const (
	MaxSchemas = 1000           // Reduce if memory is constrained
	SchemaTTL  = 24 * time.Hour // Reduce to free memory faster
)

Further Reading