Beta Program Telemetry Dashboard¶

Monitor SkillMeat beta program health, usage patterns, and performance in real-time via Grafana dashboard.

Overview¶

The telemetry dashboard provides real-time insights into: - User engagement and adoption - Feature usage patterns - System performance and reliability - Error rates and troubleshooting - User satisfaction signals

This data drives daily decisions about prioritization, bug fixes, and feature completeness assessment for GA release.

Accessing the Dashboard¶

Prerequisites¶

Docker and Docker Compose installed
2GB free disk space for metrics database
Ports 3001 (Grafana), 9090 (Prometheus), 3100 (Loki) available

Start the Observability Stack¶

# Navigate to repo root
cd /path/to/skillmeat

# Start observability stack
docker-compose -f docker-compose.monitoring.yml up -d

# Wait for services to be ready (30 seconds)
sleep 30

# Verify all services running
docker-compose -f docker-compose.monitoring.yml ps

Access Grafana¶

Open browser to http://localhost:3001
Login with default credentials:
Username: admin
Password: admin
Change password immediately (admin will prompt)
Navigate to "Dashboards" → "SkillMeat Beta" folder

Dashboards Available¶

1. Overview Dashboard¶

High-level metrics for daily monitoring: - Daily active users (DAU) trend - Commands executed per user (7-day average) - Top 5 used features - Error rate (%) - Average API response time - System uptime

Use for: Daily standup, quick health check, identifying critical issues

2. Engagement Dashboard¶

Deep dive into user behavior: - Cohort retention (users active by day since invite) - Feature adoption timeline - Session duration distribution - Commands executed distribution - User segment analysis (by role, platform, collection size)

Use for: Engagement analysis, adoption tracking, cohort health

3. Performance Dashboard¶

System performance and reliability: - API response times (P50, P95, P99) by endpoint - Request volume over time - Error rate by endpoint - Database query latency - Memory and CPU usage - Network latency

Use for: Performance debugging, capacity planning, SLA verification

4. Features Dashboard¶

Feature-specific metrics: - Feature usage frequency (command/UI action counts) - Feature adoption rate over time - Most/least used features - Feature error rates - Time-to-first-use by feature

Use for: Feature prioritization, identifying gaps, success measurement

5. Marketplace Dashboard¶

Marketplace-specific metrics: - Package search volume - Package installation counts - Top searched packages - Publish success rate - Bundle export/import counts

Use for: Marketplace health, popular content, integration effectiveness

6. Team Features Dashboard¶

Team collaboration metrics: - Bundle exports (daily count, size distribution) - Bundle imports (daily count, success rate) - Team member counts per collection - Sharing frequency

Use for: Team features health, collaboration patterns

7. Error & Issues Dashboard¶

Error tracking and troubleshooting: - Error count by type - Error distribution by endpoint - Stack traces for recent errors - Crash reports (if enabled) - Platform-specific error rates

Use for: Bug triage, priority assessment, issue clustering

8. MCP Management Dashboard¶

MCP server management metrics: - MCP deployments (count, success rate) - MCP health check results - MCP latency distribution - Server uptime - Configuration change frequency

Use for: MCP integration health, deployment patterns

Key Metrics Explained¶

Engagement Metrics¶

Daily Active Users (DAU) - Users who executed at least one command in a calendar day - Target for beta: DAU > 70% of total participants by Week 3 - Indicates overall engagement level

Feature Adoption (%) - Percentage of DAU who used a given feature - Target: 70%+ for core features by Week 4 - Helps prioritize which features work/don't work

Session Duration - Average time per session (period of active usage) - Target: 10-15 minutes - Too short = feature incomplete or confusing - Too long = power users or issues with workflow

Retention by Day - Percentage of initial participants still active on Day N - Target: 80%+ by Day 7, 60%+ by Day 21 - Shows if product becomes sticky

Performance Metrics¶

API Response Time (P95) - 95th percentile response time in milliseconds - Target for beta: <100ms - P99 should be <200ms - Indicates user experience quality

Error Rate (%) - Percentage of requests resulting in error - Target: <1% - Spike indicates regression or service issue

Uptime - Percentage of time service is available - Target: 99.5%+ - Lower indicates infrastructure issues

Quality Metrics¶

Crash Rate - Percentage of sessions ending with crash - Target: <0.5% - Tracks system stability

Failed Operations - Counts by operation type (install, deploy, sync, etc.) - Target: 0 for critical operations - Highlights unreliable workflows

Real-Time Alerting¶

Configured alerts notify team of critical issues:

Alert Rules¶

Alert	Condition	Action
High Error Rate	Error rate > 5%	Page on-call engineer immediately
Service Down	Uptime < 99%	Page on-call, notification to Slack
P0 Bug Report	New P0 issue filed	Notify engineering team
Performance Degradation	P95 response time > 200ms	Investigate query/resource usage
DAU Decline	DAU drops > 20% from previous day	Investigate recent changes
Platform Specific Issue	Error rate > 10% on single platform	Investigate platform-specific code

Slack Integration¶

Alerts automatically post to #skillmeat-beta-alerts Slack channel:

[ERROR] 🔴 High Error Rate
Endpoint: POST /api/v1/skills/add
Error Rate: 8.2% (target: <1%)
Affected: 247 requests
Impact: High

Debug: Check logs in Loki dashboard
Contact: @on-call-eng

Viewing Logs¶

Loki Log Aggregation¶

All SkillMeat logs are aggregated in Loki for searching and debugging:

Access Loki Explorer: 1. In Grafana, navigate to: Explore → Select "Loki" 2. Query examples:

# All errors
{job="skillmeat"} | "ERROR"

# Specific endpoint errors
{job="skillmeat", endpoint="POST /api/v1/skills/add"} | "ERROR"

# Platform-specific issues
{job="skillmeat", platform="windows"} | "ERROR"

# Performance (requests taking >100ms)
{job="skillmeat"} | duration > 100

# Recent crash reports
{job="skillmeat"} | "panic" or "fatal"

Log Levels: - DEBUG: Detailed diagnostic info (disabled in prod for performance) - INFO: General informational messages - WARN: Warning messages that may indicate issues - ERROR: Error conditions (always indexed) - FATAL: Fatal errors causing shutdown

Searching Logs¶

Use Loki query language (LogQL) for powerful searching:

# Search by participant ID
{job="skillmeat"} | "participant_id=abc123"

# Search by command
{job="skillmeat"} | "command=add_skill"

# Filter by response time
{job="skillmeat"} | duration >= 500

# Combine conditions
{job="skillmeat", platform="macos"} | "ERROR" | duration >= 200

Data Retention¶

Data Type	Retention	Resolution
Metrics (Prometheus)	30 days	1-minute interval
Logs (Loki)	7 days	Raw logs
Dashboards	Permanent	Configuration stored in Git
Alerts	30 days	Audit log of all alerts

Privacy and Security¶

Data Collection Policy¶

SkillMeat collects: - Command names (anonymized counts) - Feature usage (which features used, how often) - Performance data (latencies, error rates) - System info (OS, Python version, for debugging) - Aggregated collection stats (avg size, not content)

SkillMeat does NOT collect: - Skill names or content - Personal information - Passwords or authentication tokens - Private repository data - File system contents

Data Redaction¶

Sensitive data is automatically redacted: - GitHub tokens: ***REDACTED*** - API keys: ***REDACTED*** - Email addresses: ***@***.*** - File paths with PII: /home/[USER]/*** → /home/user/***

Access Control¶

Dashboard access is restricted to: - SkillMeat core team - Beta program leads - On-call engineers - Authorized data analysts

Common Dashboard Tasks¶

Task: Identify Slow Endpoints¶

Open "Performance Dashboard"
Look at "Response Time by Endpoint" chart
Click on bar for slow endpoint
Drill down to individual requests
Check logs in Loki for errors or timeouts

Task: Investigate Error Spike¶

Open "Overview Dashboard"
Notice error rate spike (red area in graph)
Click on timestamp to drill down
Open "Error & Issues Dashboard"
Use "Error Distribution" to identify affected endpoint
Check Loki logs for error details

Task: Check Feature Adoption¶

Open "Features Dashboard"
Review "Feature Adoption Timeline" chart
Features below 50% adoption need investigation
Check "Feature Error Rates" for bugs blocking adoption
Review feedback in GitHub Discussions for complaints

Task: Assess Release Readiness¶

Review "Overview Dashboard" - all metrics green?
Check "Error & Issues Dashboard" - any P0/P1 bugs?
Review "Engagement Dashboard" - is retention >60%?
Check "Performance Dashboard" - is P95 <100ms?
If all passing: ready for GA

Troubleshooting Dashboard Issues¶

Metrics not updating¶

# Check Prometheus is running
docker-compose -f docker-compose.monitoring.yml ps prometheus

# Check Prometheus targets
curl http://localhost:9090/api/v1/targets

# Restart Prometheus if needed
docker-compose -f docker-compose.monitoring.yml restart prometheus

No logs appearing in Loki¶

# Check Loki is running
docker-compose -f docker-compose.monitoring.yml ps loki

# Check log files exist
ls -la /var/log/skillmeat/

# Restart log collection
docker-compose -f docker-compose.monitoring.yml restart loki

Dashboard is slow¶

# Check Grafana memory usage
docker stats grafana

# If memory usage high, restart Grafana
docker-compose -f docker-compose.monitoring.yml restart grafana

# Clean up old data
docker exec prometheus promtool query instant 'up' | head -100

Exporting Data¶

Export Dashboard as PDF¶

In Grafana, open dashboard
Click dashboard title → Share
Select "Render as PDF"
Send to stakeholders

Export Metrics as CSV¶

In Grafana, click "Explore"
Run query
Click "Download as CSV"

Programmatic Access¶

Query Prometheus API directly:

# Query metrics via API
curl 'http://localhost:9090/api/v1/query_range?query=up&start=1609459200&end=1609545600&step=300'

# Export all SkillMeat metrics
curl 'http://localhost:9090/api/v1/label/job/values' | grep skillmeat