What We Need
A Cloudability feature that analyzes Azure Event Hubs usage and tells us when we're paying for more capacity than we need (or when we're about to hit limits). Since tier and capacity are set at the namespace level, we need rightsizing recommendations at that level - but with visibility into what each individual event hub is consuming.
Why This Matters
Event Hubs can get expensive fast, especially Premium tier. We need to know:
- Are we overprovisioned? (wasting money)
- Are we about to hit limits? (performance risk)
- Should we split or consolidate namespaces?
- Should we switch between Standard and Premium tiers?
Analysis Scope - What to Support
Multi-Level Analysis (Priority Order)
1. Single Namespace Analysis - Deep dive with individual event hub breakdown
   - Shows total namespace utilization
   - Breaks down which event hubs are consuming what percentage
   - Identifies "noisy neighbors" hogging resources
2. Batch Namespace Analysis - Multiple namespaces in one report
   - Summary view: "15 namespaces analyzed, 8 optimization opportunities, $12K/month savings"
   - Drill down into any namespace for details
3. Subscription-Level - All namespaces in a subscription
4. Cross-Subscription - All namespaces across multiple subscriptions
Why Individual Event Hub Visibility Matters
Even though we rightsize at the namespace level, we need to see individual event hub usage to make smart decisions:
Example: prod-eventhub-namespace (Premium, 2 PUs, $2,400/month)
├─ orders-hub:    85% of throughput → maybe needs its own namespace
├─ inventory-hub: 10% of throughput ─┐
├─ logging-hub:    3% of throughput  ├─ could consolidate these three
└─ analytics-hub:  2% of throughput ─┘  into one Standard namespace
What to Collect
Basic Info
- Subscription/Account Name
- Vendor: Azure
- Resource Group
- Namespace Name
- Current Tier: Standard or Premium
- Current Capacity: # of TUs (Standard) or PUs (Premium)
- Auto-Inflate Status (Standard tier only): Enabled/Disabled + max units
- Date Range: minimum 30 days recommended
Metrics to Track
Throughput (Most Important for Rightsizing)
- Incoming bytes/sec (ingress) - converted to MB/s
- Outgoing bytes/sec (egress) - converted to MB/s
- Incoming messages/sec
- Outgoing messages/sec
- Track for each: Average, Peak, P95, P99
Why these matter:
- Standard: 1 TU = 1 MB/s ingress OR 2 MB/s egress (whichever hits first)
- Premium: 1 PU ≈ 8 MB/s combined throughput
- Your bottleneck is whichever limit you hit first (usually egress on Standard)
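The capacity math above can be sketched in a few lines. This is a minimal illustration, not production code: the function names are made up here, and the 8 MB/s-per-PU figure is the approximate guidance this document assumes, not a hard Azure limit.

```python
# Sketch of the namespace throughput-utilization math described above.
# Assumptions (from this document): 1 TU = 1 MB/s ingress or 2 MB/s egress;
# 1 PU ≈ 8 MB/s combined throughput. Function names are illustrative.

def standard_utilization(tus: int, ingress_mbps: float, egress_mbps: float) -> float:
    """Standard tier: the binding limit (ingress or egress) is the bottleneck."""
    ingress_util = ingress_mbps / (tus * 1.0)   # 1 MB/s ingress per TU
    egress_util = egress_mbps / (tus * 2.0)     # 2 MB/s egress per TU
    return max(ingress_util, egress_util)       # constraining metric wins

def premium_utilization(pus: int, ingress_mbps: float, egress_mbps: float) -> float:
    """Premium tier: combined throughput against ~8 MB/s per PU."""
    return (ingress_mbps + egress_mbps) / (pus * 8.0)

# 10 TUs, P95 egress 7.2 MB/s -> 7.2 / 20 MB/s = 36% utilization
print(round(standard_utilization(10, 2.0, 7.2) * 100))  # -> 36
```

Note the `max()` on Standard: a namespace with low ingress can still be egress-bound, which is why the document recommends always using the higher of the two as the constraining metric.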
Performance Issues
- Throttled requests (send + consumer)
- Server errors
- User errors
- Success rate
Connections
- Active connections (current, peak, average)
- Connections opened/closed per period
Premium Tier Only
- CPU usage % (average, peak)
- Memory usage % (average, peak)
Additional
- Namespace storage utilization
- Capture backlog (if using the Capture feature)
Rightsizing Thresholds - When to Recommend Changes
Important: All utilization percentages below refer to throughput utilization - the percentage of ingress/egress capacity being used based on the calculations above. Always use the higher of ingress or egress as your constraining metric.
Safe to Downsize (High Confidence)
- P95 throughput utilization < 45%
- Peak throughput utilization < 65%
- Throttling < 0.1% of requests
- Sustained for 70%+ of analysis period
- Action: Reduce capacity by 20-30%
- Example: 10 TUs, P95 egress 7.2 MB/s → 36% utilization → reduce to 7 TUs
Critical Downsize (Very Safe)
- P95 throughput utilization < 30%
- Peak throughput utilization < 50%
- Zero throttling
- Action: Reduce capacity by 30-40%
- Example: 10 TUs, P95 egress 4.5 MB/s → 22% utilization → reduce to 6 TUs
Needs Upsize (Performance Risk)
- P95 throughput utilization > 75%
- Peak throughput utilization > 90%
- Throttling > 1% of requests
- Action: Increase capacity by 20-30%
- Example: 10 TUs, P95 egress 16.2 MB/s → 81% utilization → increase to 13 TUs
Critical Upsize (Act Now)
- P95 throughput utilization > 85%
- Peak throughput utilization > 95%
- Throttling > 5% of requests
- Sustained high usage > 1 hour
- Action: Increase capacity by 40-50% immediately
- Example: 10 TUs, peak egress 19.5 MB/s → 97% utilization → increase to 15 TUs NOW
Optimal Range (No Changes)
- P95 throughput: 55-70%
- Peak throughput: 75-85%
- Throttling: < 1%
This range provides:
- Enough headroom for traffic spikes
- Cost efficiency (not grossly overprovisioned)
- Performance safety margin
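The threshold bands above translate directly into a small decision function. The sketch below copies the numbers from this section; the function name and return strings are illustrative, and the "sustained for 70%+ of the period" and "sustained > 1 hour" duration conditions are deliberately omitted to keep it short.

```python
# Hedged sketch of the rightsizing threshold logic above. Thresholds are
# taken from this section; duration-based conditions are not modeled.
# Utilization inputs are fractions (0-1); throttle_rate is the fraction
# of requests throttled.

def classify(p95_util: float, peak_util: float, throttle_rate: float) -> str:
    # Check upsize conditions first: performance risk outranks savings.
    if p95_util > 0.85 or peak_util > 0.95 or throttle_rate > 0.05:
        return "critical upsize: +40-50% capacity immediately"
    if p95_util > 0.75 or peak_util > 0.90 or throttle_rate > 0.01:
        return "upsize: +20-30% capacity"
    # Then the downsize bands, strictest (safest) first.
    if p95_util < 0.30 and peak_util < 0.50 and throttle_rate == 0:
        return "critical downsize: -30-40% capacity"
    if p95_util < 0.45 and peak_util < 0.65 and throttle_rate < 0.001:
        return "downsize: -20-30% capacity"
    return "no change (optimal or inconclusive)"

# 10 TUs, P95 egress 7.2 MB/s (36%), peak 55%, no throttling
print(classify(0.36, 0.55, 0.0))  # -> downsize: -20-30% capacity
```

Ordering matters: upsize checks run first so that a namespace with a low P95 but heavy throttling is never flagged as a savings opportunity.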
Tier Change Recommendations
Premium → Standard:
- P95 < 35% consistently
- Predictable, steady workload
- No dedicated resource requirements
- Potential 40-60% cost savings
Standard → Premium:
- Frequent throttling at max TUs
- Need predictable performance
- CPU/memory constraints on Standard
What Each Recommendation Should Include
Summary View
Current State:
- Namespace: prod-events-ns
- Tier: Premium, 3 PUs
- Current Cost: $3,600/month
- Ingress: 2.1 MB/s (P95), 3.2 MB/s (Peak)
- Egress: 4.3 MB/s (P95), 6.8 MB/s (Peak)
- Combined: 6.4 MB/s (P95), 10.0 MB/s (Peak)
- Capacity: 24 MB/s (3 PUs × 8 MB/s)
- Utilization: 27% (P95), 42% (Peak)
Recommendation:
- Reduce to 2 PUs (Premium)
- New Capacity: 16 MB/s
- New Utilization: 40% (P95), 63% (Peak)
- New Cost: $2,400/month
- Savings: $1,200/month ($14,400/year)
- Confidence: High (95%)
- Buffer: 37% headroom above Peak
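The arithmetic behind this summary view can be reproduced with a short sketch. The per-PU price is back-derived from this example ($3,600 / 3 PUs), and the 8 MB/s-per-PU capacity figure follows this document's assumption; key names are illustrative.

```python
# Sketch of the Premium summary-view arithmetic above.
# COST_PER_PU is derived from this document's $3,600 / 3 PU example;
# MBPS_PER_PU is the ~8 MB/s-per-PU assumption used throughout.

COST_PER_PU = 1200.0   # $/month per PU (assumed from the example)
MBPS_PER_PU = 8.0      # combined MB/s per PU (assumed)

def premium_summary(current_pus: int, target_pus: int,
                    p95_mbps: float, peak_mbps: float) -> dict:
    """Project utilization and savings for a proposed PU count."""
    new_capacity = target_pus * MBPS_PER_PU
    monthly_savings = (current_pus - target_pus) * COST_PER_PU
    return {
        "new_capacity_mbps": new_capacity,
        "new_p95_util_pct": round(p95_mbps / new_capacity * 100, 1),
        "new_peak_util_pct": round(peak_mbps / new_capacity * 100, 1),
        "monthly_savings": monthly_savings,
        "annual_savings": monthly_savings * 12,
    }

# prod-events-ns: 3 PUs -> 2 PUs, combined P95 6.4 MB/s, peak 10.0 MB/s
print(premium_summary(3, 2, 6.4, 10.0))
```

Running this on the example above reproduces the recommendation's figures: 16 MB/s of new capacity, 40% P95 utilization, ~63% peak utilization, and $1,200/month ($14,400/year) in savings.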