Jet Li 13acfccd77 (feat): Basic docs on Freeleaps Infra

2025-09-04 00:58:59 -07:00

🐰 RabbitMQ Management Analysis & Production Guide

Complete Guide to Managing RabbitMQ in Your FreeLeaps Production Environment
From configuration to monitoring to troubleshooting

📋 Table of Contents

🎯 Quick Start
🏗️ Your Production Setup
🔧 Current Configuration Analysis
📊 Management UI Guide
🔍 Production Monitoring
🚨 Troubleshooting Guide
⚡ Performance Optimization
🔒 Security Best Practices
📈 Scaling & High Availability
🛠️ Maintenance Procedures

🎯 Quick Start

🚀 First Day Checklist

Access RabbitMQ Management UI: Port forward to http://localhost:15672
Check your queues: Verify freeleaps.devops.reconciler.* queues exist
Monitor connections: Check if reconciler is connected
Review metrics: Check message rates and queue depths
Test connectivity: Verify RabbitMQ is accessible from your apps

🔑 Essential Commands

# Access your RabbitMQ cluster
kubectl get pods -n freeleaps-alpha | grep rabbitmq

# Port forward to management UI
kubectl port-forward svc/rabbitmq-headless -n freeleaps-alpha 15672:15672

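# Check RabbitMQ logs
# (RabbitMQ runs as a StatefulSet with pods rabbitmq-0, rabbitmq-1, ..., so target statefulset/rabbitmq rather than a Deployment)
kubectl logs -f statefulset/rabbitmq -n freeleaps-alpha

# Access RabbitMQ CLI
kubectl exec -it statefulset/rabbitmq -n freeleaps-alpha -- rabbitmqctl list_queues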

🏗️ Your Production Setup

🌐 Production Architecture

┌─────────────────────────────────────────────────────────────┐
│                    RABBITMQ PRODUCTION SETUP                │
├─────────────────────────────────────────────────────────────┤
│  ┌─────────────────┐  ┌─────────────────┐  ┌──────────────┐ │
│  │   freeleaps-    │  │   freeleaps-    │  │   freeleaps- │ │
│  │   devops-       │  │   apps          │  │   monitoring │ │
│  │   reconciler    │  │   (Your Apps)   │  │   (Metrics)  │ │
│  └─────────────────┘  └─────────────────┘  └──────────────┘ │
│           │                    │                    │        │
│           │ AMQP 5672          │ AMQP 5672          │        │
│           │ HTTP 15672         │ HTTP 15672         │        │
│           └────────────────────┼────────────────────┘        │
│                                │                             │
│  ┌─────────────────────────────────────────────────────────┐ │
│  │              RABBITMQ CLUSTER                           │ │
│  │  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐    │ │
│  │  │   Node 1    │  │   Node 2    │  │   Node 3    │    │ │
│  │  │ (Primary)   │  │ (Replica)   │  │ (Replica)   │    │ │
│  │  │ Port: 5672  │  │ Port: 5672  │  │ Port: 5672  │    │ │
│  │  │ UI: 15672   │  │ UI: 15672   │  │ UI: 15672   │    │ │
│  │  └─────────────┘  └─────────────┘  └─────────────┘    │ │
│  └─────────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────┘

📊 Production Namespaces

Environment	Namespace	Purpose	Status
Alpha	`freeleaps-alpha`	Development & Testing	✅ Active
Production	`freeleaps-prod`	Live Production	✅ Active

🔧 Production Services

# Your actual RabbitMQ services
kubectl get svc -n freeleaps-alpha | grep rabbitmq
kubectl get svc -n freeleaps-prod | grep rabbitmq

# Service details:
# - rabbitmq-headless: Internal cluster communication
# - rabbitmq: External access (if needed)
# - rabbitmq-management: Management UI access

🔧 Current Configuration Analysis

📋 Configuration Sources

1. Helm Chart Configuration

# Location: freeleaps-ops/freeleaps/helm-pkg/3rd/rabbitmq/
# Primary configuration files:
# - values.yaml (base configuration)
# - values.alpha.yaml (alpha environment overrides)
# - values.prod.yaml (production environment overrides)

2. Reconciler Configuration

# Location: freeleaps-devops-reconciler/helm/freeleaps-devops-reconciler/values.yaml
rabbitmq:
  host: "rabbitmq-headless.freeleaps-alpha.svc.cluster.local"
  port: 5672
  username: "user"
  password: "NjlhHFvnDuC7K0ir"
  vhost: "/"

3. Python Configuration

# Location: freeleaps-devops-reconciler/reconciler/config/config.py
RABBITMQ_HOST = os.getenv('RABBITMQ_HOST', 'localhost')
RABBITMQ_PORT = int(os.getenv('RABBITMQ_PORT', '5672'))
RABBITMQ_USERNAME = os.getenv('RABBITMQ_USERNAME', 'guest')
RABBITMQ_PASSWORD = os.getenv('RABBITMQ_PASSWORD', 'guest')
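
The Python settings above do not include the vhost that the Helm values set to "/". Below is a minimal sketch of assembling one AMQP URI from the same environment variables. It assumes a hypothetical `RABBITMQ_VHOST` variable and a hypothetical helper named `build_amqp_url`; neither is part of the reconciler's code. The vhost "/" has to be percent-encoded as `%2F` inside a URI.

```python
import os
from urllib.parse import quote


def build_amqp_url(env=os.environ):
    """Assemble an AMQP URI from the environment variables used in config.py."""
    host = env.get("RABBITMQ_HOST", "localhost")
    port = int(env.get("RABBITMQ_PORT", "5672"))
    username = env.get("RABBITMQ_USERNAME", "guest")
    password = env.get("RABBITMQ_PASSWORD", "guest")
    # RABBITMQ_VHOST is an assumed variable; config.py does not define it.
    vhost = env.get("RABBITMQ_VHOST", "/")
    # safe="" percent-encodes "/" and other reserved characters.
    return "amqp://{}:{}@{}:{}/{}".format(
        quote(username, safe=""),
        quote(password, safe=""),
        host,
        port,
        quote(vhost, safe=""),
    )


if __name__ == "__main__":
    print(build_amqp_url({"RABBITMQ_HOST": "rabbitmq.example.internal"}))
```

Encoding the username and password as well keeps the URI valid when a generated password contains characters such as `@` or `/`.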

🔍 Configuration Analysis

✅ What's Working Well

Helm-based deployment - Consistent and repeatable
Environment separation - Alpha vs Production
Clustering enabled - High availability
Management plugin - Web UI available
Resource limits - Proper resource management

⚠️ Issues Identified

1. Configuration Mismatch

# ❌ PROBLEM: Different image versions
# Helm chart: bitnami/rabbitmq:4.0.6-debian-12-r0
# Reconciler: rabbitmq:3.12-management-alpine

# ❌ PROBLEM: Different credentials
# Alpha: username: "user", password: "NjlhHFvnDuC7K0ir"
# Production: Different credentials (not shown in config)

2. Security Concerns

# ❌ PROBLEM: Hardcoded passwords in values files
auth:
  username: user
  password: "NjlhHFvnDuC7K0ir"  # Should be in Kubernetes secrets

3. Network Configuration

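# ❌ PROBLEM: Hard-coded namespace in the hostname
# Reconciler uses: rabbitmq-headless.freeleaps-alpha.svc.cluster.local
# This is already a Kubernetes DNS name, but the fixed "freeleaps-alpha"
# means the same values cannot be reused for freeleaps-prod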

🎯 Recommended Improvements

1. Centralized Configuration

# Create a centralized RabbitMQ configuration
# Location: freeleaps-ops/config/rabbitmq/
rabbitmq-config:
  image:
    repository: bitnami/rabbitmq
    tag: "4.0.6-debian-12-r0"
  auth:
    username: ${RABBITMQ_USERNAME}
    password: ${RABBITMQ_PASSWORD}
  clustering:
    enabled: true
    name: "freeleaps-${ENVIRONMENT}"

2. Secret Management

# Use Kubernetes secrets instead of hardcoded values
apiVersion: v1
kind: Secret
metadata:
  name: rabbitmq-credentials
  namespace: freeleaps-alpha
type: Opaque
data:
  username: dXNlcg==  # base64 of "user"
  password: <base64-encoded-password>  # generate with: echo -n '<password>' | base64
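
Values under a Secret's `data` field must be base64-encoded, and a single wrong character produces a different credential. Below is a small sketch for encoding a value and decoding it again to check it before applying the manifest. The helper names are illustrative. Base64 is an encoding, not encryption, so the manifest still needs to stay out of version control.

```python
import base64


def encode_secret_value(value: str) -> str:
    """Base64-encode a value the way a Secret's `data` field expects."""
    return base64.b64encode(value.encode("utf-8")).decode("ascii")


def decode_secret_value(encoded: str) -> str:
    """Decode a `data` value back to text to verify it before applying."""
    return base64.b64decode(encoded).decode("utf-8")


if __name__ == "__main__":
    print(encode_secret_value("user"))      # dXNlcg==
    print(decode_secret_value("dXNlcg=="))  # user
```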

3. Service Discovery

# Use consistent service discovery
# Instead of hardcoded hostnames, use:
RABBITMQ_HOST: "rabbitmq-headless.${NAMESPACE}.svc.cluster.local"
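
The same templating can be done in application code. This is a sketch only: `rabbitmq_host` is a hypothetical helper, and the default service name is taken from the hostname shown above.

```python
import re

# Kubernetes namespace names are DNS labels: lowercase alphanumerics and "-", at most 63 characters.
_DNS_LABEL = re.compile(r"^[a-z0-9]([-a-z0-9]*[a-z0-9])?$")


def rabbitmq_host(namespace: str, service: str = "rabbitmq-headless") -> str:
    """Build the in-cluster DNS name of the RabbitMQ service for a namespace."""
    if len(namespace) > 63 or not _DNS_LABEL.match(namespace):
        raise ValueError(f"invalid namespace: {namespace!r}")
    return f"{service}.{namespace}.svc.cluster.local"


if __name__ == "__main__":
    for ns in ("freeleaps-alpha", "freeleaps-prod"):
        print(rabbitmq_host(ns))
```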

📊 Management UI Guide

🌐 Accessing the Management UI

Method 1: Port Forward (Recommended)

# Port forward to RabbitMQ management UI
kubectl port-forward svc/rabbitmq-headless -n freeleaps-alpha 15672:15672

# Access: http://localhost:15672
# Username: user
# Password: NjlhHFvnDuC7K0ir

Method 2: Ingress (If configured)

# If you have ingress configured for RabbitMQ
# Access: https://rabbitmq.freeleaps.mathmast.com

📋 Management UI Features

1. Overview Dashboard

Cluster status and health indicators
Node information and resource usage
Connection counts and message rates
Queue depths and performance metrics

2. Queues Management

# Your actual queues to monitor:
# - freeleaps.devops.reconciler.queue (heartbeat)
# - freeleaps.devops.reconciler.input (input messages)
# - freeleaps.devops.reconciler.output (output messages)

# Queue operations:
# - View queue details and metrics
# - Purge queues (remove all messages)
# - Delete queues (with safety confirmations)
# - Monitor message rates and consumer counts
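
The queue figures shown in the UI are also available from the management plugin's HTTP API (`GET /api/queues/{vhost}`). Below is a hedged sketch that lists the reconciler queues using only the standard library. It assumes the port-forward from the Quick Start is running; the function names and the sample payload are illustrative, and only the `name`, `messages` and `consumers` fields are read.

```python
import base64
import json
import urllib.request
from urllib.parse import quote


def queues_url(base_url: str, vhost: str = "/") -> str:
    """Build the management API URL that lists queues in one vhost."""
    return f"{base_url.rstrip('/')}/api/queues/{quote(vhost, safe='')}"


def summarize_queues(payload: str, prefix: str = "freeleaps.devops.reconciler."):
    """Return (name, messages, consumers) for queues whose name starts with prefix."""
    return [
        (q["name"], q.get("messages", 0), q.get("consumers", 0))
        for q in json.loads(payload)
        if q["name"].startswith(prefix)
    ]


def fetch_queues(base_url: str, username: str, password: str) -> str:
    """GET the queue list with HTTP basic auth (needs network access to the UI port)."""
    request = urllib.request.Request(queues_url(base_url))
    token = base64.b64encode(f"{username}:{password}".encode()).decode()
    request.add_header("Authorization", f"Basic {token}")
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.read().decode("utf-8")


if __name__ == "__main__":
    # Offline demonstration with a hand-written sample payload.
    sample = '[{"name": "freeleaps.devops.reconciler.input", "messages": 3, "consumers": 1}]'
    print(queues_url("http://localhost:15672"))
    print(summarize_queues(sample))
```

Against a live broker, `summarize_queues(fetch_queues("http://localhost:15672", username, password))` would return the same tuple shape.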

3. Exchanges Management

# Your actual exchanges:
# - amq.default (default direct exchange)
# - amq.topic (topic exchange)
# - amq.fanout (fanout exchange)

# Exchange operations:
# - View exchange properties and bindings
# - Create new exchanges with custom types
# - Monitor message routing and performance

4. Connections & Channels

# Monitor your reconciler connections:
# - Check if reconciler is connected
# - Monitor connection health and performance
# - View channel details and limits
# - Force disconnect if needed

5. Users & Permissions

# Current user setup:
# - Username: user
# - Permissions: Full access to vhost "/"
# - Tags: management

# User management:
# - Create new users for different applications
# - Set up proper permissions and access control
# - Monitor user activity and connections

🔧 Practical UI Operations

Monitoring Your Reconciler

# 1. Check if reconciler is connected
# Go to: Connections tab
# Look for: freeleaps-devops-reconciler connections

# 2. Monitor message flow
# Go to: Queues tab
# Check: freeleaps.devops.reconciler.* queues
# Monitor: Message rates and queue depths

# 3. Check cluster health
# Go to: Overview tab
# Monitor: Node status and resource usage

Troubleshooting via UI

# 1. Check for stuck messages
# Go to: Queues > freeleaps.devops.reconciler.input
# Look for: High message count or no consumers

# 2. Check connection issues
# Go to: Connections tab
# Look for: Disconnected or error states

# 3. Monitor resource usage
# Go to: Overview tab
# Check: Memory usage and disk space

🔍 Production Monitoring

📊 Key Metrics to Monitor

1. Cluster Health

# Check cluster status
kubectl exec -it statefulset/rabbitmq -n freeleaps-alpha -- rabbitmqctl cluster_status

# Monitor node health (rabbitmqctl has no list_nodes command; use rabbitmq-diagnostics)
kubectl exec -it statefulset/rabbitmq -n freeleaps-alpha -- rabbitmq-diagnostics check_running

2. Queue Metrics

# Check queue depths
kubectl exec -it statefulset/rabbitmq -n freeleaps-alpha -- rabbitmqctl list_queues name messages consumers

# Monitor message rates
# Use Management UI: Queues tab > Queue details > Message rates

3. Connection Metrics

# Check active connections
kubectl exec -it statefulset/rabbitmq -n freeleaps-alpha -- rabbitmqctl list_connections

# Check open channels
kubectl exec -it statefulset/rabbitmq -n freeleaps-alpha -- rabbitmqctl list_channels

4. Resource Usage

# Check memory usage
kubectl exec -it statefulset/rabbitmq -n freeleaps-alpha -- rabbitmqctl status

# Monitor disk usage
kubectl exec -it statefulset/rabbitmq -n freeleaps-alpha -- df -h

🚨 Alerting Setup

1. Queue Depth Alerts

# Alert when queue depth exceeds threshold
# Queue: freeleaps.devops.reconciler.input
# Threshold: > 100 messages
# Action: Send Slack notification

2. Connection Loss Alerts

# Alert when reconciler disconnects
# Monitor: freeleaps-devops-reconciler connections
# Threshold: Connection count = 0
# Action: Page on-call engineer

3. Resource Usage Alerts

# Alert when memory usage is high
# Threshold: Memory usage > 80%
# Action: Scale up or investigate
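
The three alert rules above can be expressed as one small evaluation function. This is a sketch of the thresholds only: the `Snapshot` fields and `evaluate_alerts` are hypothetical names, and collecting the metrics and delivering to Slack or the pager are left out.

```python
from dataclasses import dataclass


@dataclass
class Snapshot:
    input_queue_depth: int        # messages in freeleaps.devops.reconciler.input
    reconciler_connections: int   # open connections from the reconciler
    memory_used_fraction: float   # 0.0 to 1.0 of the memory limit


def evaluate_alerts(snapshot: Snapshot) -> list:
    """Return (alert name, action) pairs for every threshold that is crossed."""
    alerts = []
    if snapshot.input_queue_depth > 100:
        alerts.append(("queue_depth", "send Slack notification"))
    if snapshot.reconciler_connections == 0:
        alerts.append(("connection_loss", "page on-call engineer"))
    if snapshot.memory_used_fraction > 0.80:
        alerts.append(("memory_high", "scale up or investigate"))
    return alerts


if __name__ == "__main__":
    print(evaluate_alerts(Snapshot(250, 0, 0.55)))
```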

📈 Monitoring Dashboard

Grafana Dashboard

# Your existing RabbitMQ dashboard
# Location: freeleaps-ops/cluster/manifests/freeleaps-monitoring-system/kube-prometheus-stack/dashboards/rabbitmq.yaml
# Access: https://grafana.mathmast.com
# Dashboard: RabbitMQ Management Overview

Key Dashboard Panels

Queue Depth - Monitor message accumulation
Message Rates - Track throughput
Connection Count - Monitor client connections
Memory Usage - Track resource consumption
Error Rates - Monitor failures

🚨 Troubleshooting Guide

🔍 Common Issues & Solutions

1. Reconciler Connection Issues

Problem: Reconciler can't connect to RabbitMQ

# Symptoms:
# - Reconciler logs show connection errors
# - No connections in RabbitMQ UI
# - Pods restarting due to connection failures

# Diagnosis:
kubectl logs -f deployment/freeleaps-devops-reconciler -n freeleaps-devops-system
kubectl exec -it statefulset/rabbitmq -n freeleaps-alpha -- rabbitmqctl list_connections

# Solutions:
# 1. Check network connectivity
kubectl exec -it deployment/freeleaps-devops-reconciler -n freeleaps-devops-system -- ping rabbitmq-headless.freeleaps-alpha.svc.cluster.local

# 2. Verify credentials (assumes the rabbitmq-credentials secret recommended under Secret Management has been created)
kubectl get secret rabbitmq-credentials -n freeleaps-alpha -o yaml

# 3. Check RabbitMQ status
kubectl exec -it statefulset/rabbitmq -n freeleaps-alpha -- rabbitmqctl status

2. Queue Message Accumulation

Problem: Messages stuck in queues

# Symptoms:
# - High message count in queues
# - No consumers processing messages
# - Increasing queue depth

# Diagnosis:
kubectl exec -it statefulset/rabbitmq -n freeleaps-alpha -- rabbitmqctl list_queues name messages consumers

# Solutions:
# 1. Check consumer health
kubectl logs -f deployment/freeleaps-devops-reconciler -n freeleaps-devops-system

# 2. Restart consumers
kubectl rollout restart deployment/freeleaps-devops-reconciler -n freeleaps-devops-system

# 3. Purge stuck messages (if safe)
# Via Management UI: Queues > Queue > Purge
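
To spot stuck queues without reading the listing by eye, the output of `rabbitmqctl list_queues name messages consumers` can be filtered for queues that hold messages but have no consumers. This is a minimal sketch that assumes the CLI's default tab-separated output; `find_stuck_queues` is an illustrative name.

```python
def find_stuck_queues(cli_output: str):
    """Return (queue name, message count) for queues with messages and no consumers."""
    stuck = []
    for line in cli_output.splitlines():
        fields = line.strip().split("\t")
        # Banner and header lines do not have two numeric columns, so they are skipped.
        if len(fields) != 3 or not fields[1].isdigit() or not fields[2].isdigit():
            continue
        name, messages, consumers = fields[0], int(fields[1]), int(fields[2])
        if messages > 0 and consumers == 0:
            stuck.append((name, messages))
    return stuck


if __name__ == "__main__":
    sample = (
        "Listing queues for vhost / ...\n"
        "name\tmessages\tconsumers\n"
        "freeleaps.devops.reconciler.input\t42\t0\n"
        "freeleaps.devops.reconciler.output\t0\t1\n"
    )
    print(find_stuck_queues(sample))
```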

3. Memory Pressure

Problem: RabbitMQ running out of memory

# Symptoms:
# - High memory usage
# - Slow performance
# - Connection drops

# Diagnosis:
kubectl exec -it statefulset/rabbitmq -n freeleaps-alpha -- rabbitmqctl status
kubectl top pods -n freeleaps-alpha | grep rabbitmq

# Solutions:
# 1. Increase memory limits (temporary: Helm reverts a manual patch on the next upgrade, so also update resources.limits in the values file)
kubectl patch statefulset rabbitmq -n freeleaps-alpha -p '{"spec":{"template":{"spec":{"containers":[{"name":"rabbitmq","resources":{"limits":{"memory":"2Gi"}}}]}}}}'

# 2. Restart RabbitMQ
kubectl rollout restart statefulset/rabbitmq -n freeleaps-alpha

# 3. Find the queues that use the most memory
kubectl exec -it statefulset/rabbitmq -n freeleaps-alpha -- rabbitmqctl list_queues name memory

4. Cluster Issues

Problem: RabbitMQ cluster not healthy

# Symptoms:
# - Nodes not in sync
# - Replication lag
# - Split-brain scenarios

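# Diagnosis:
kubectl exec -it statefulset/rabbitmq -n freeleaps-alpha -- rabbitmqctl cluster_status
kubectl exec -it statefulset/rabbitmq -n freeleaps-alpha -- rabbitmq-diagnostics check_running

# Solutions:
# 1. Check node connectivity
kubectl get pods -n freeleaps-alpha | grep rabbitmq

# 2. Restart problematic nodes
kubectl delete pod rabbitmq-0 -n freeleaps-alpha

# 3. Rejoin cluster if needed. Run this on the node that left (rabbitmq-1 in this example).
#    join_cluster only works while the RabbitMQ app is stopped; use the target node name
#    exactly as cluster_status prints it.
kubectl exec -it rabbitmq-1 -n freeleaps-alpha -- rabbitmqctl stop_app
kubectl exec -it rabbitmq-1 -n freeleaps-alpha -- rabbitmqctl join_cluster rabbit@rabbitmq-0
kubectl exec -it rabbitmq-1 -n freeleaps-alpha -- rabbitmqctl start_app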

🛠️ Debugging Commands

Essential Debugging Commands

# Check RabbitMQ status
kubectl exec -it statefulset/rabbitmq -n freeleaps-alpha -- rabbitmqctl status

# List all queues
kubectl exec -it statefulset/rabbitmq -n freeleaps-alpha -- rabbitmqctl list_queues

# List all exchanges
kubectl exec -it statefulset/rabbitmq -n freeleaps-alpha -- rabbitmqctl list_exchanges

# List all bindings
kubectl exec -it statefulset/rabbitmq -n freeleaps-alpha -- rabbitmqctl list_bindings

# List all connections
kubectl exec -it statefulset/rabbitmq -n freeleaps-alpha -- rabbitmqctl list_connections

# List all channels
kubectl exec -it statefulset/rabbitmq -n freeleaps-alpha -- rabbitmqctl list_channels

# Check user permissions
kubectl exec -it statefulset/rabbitmq -n freeleaps-alpha -- rabbitmqctl list_users
kubectl exec -it statefulset/rabbitmq -n freeleaps-alpha -- rabbitmqctl list_user_permissions user

Advanced Debugging

# Check RabbitMQ logs
kubectl logs -f statefulset/rabbitmq -n freeleaps-alpha

# Check logs of the previous container after a crash or restart
# (containers do not run systemd, so journalctl is not available inside the pod)
kubectl logs rabbitmq-0 -n freeleaps-alpha --previous

# Check listening ports (netstat may not be installed in the image)
kubectl exec -it statefulset/rabbitmq -n freeleaps-alpha -- rabbitmq-diagnostics listeners

# Check disk usage
kubectl exec -it statefulset/rabbitmq -n freeleaps-alpha -- df -h

# Check memory usage (free may not be installed in the image)
kubectl exec -it statefulset/rabbitmq -n freeleaps-alpha -- rabbitmq-diagnostics memory_breakdown

⚡ Performance Optimization

🎯 Performance Tuning

1. Memory Optimization

# Optimize memory settings
# Location: values.alpha.yaml
configuration: |-
  # Memory management
  vm_memory_high_watermark.relative = 0.6
  vm_memory_high_watermark_paging_ratio = 0.5
  
  # Message store
  msg_store_file_size_limit = 16777216
  msg_store_credit_disc_bound = 4000

2. Disk Optimization

# Optimize disk settings
configuration: |-
  # Disk free space
  disk_free_limit.relative = 2.0
  
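  # Queue leader placement (RabbitMQ 4.x name; replaces queue_master_locator = min-masters)
  queue_leader_locator = balanced
  
  # Default consumer prefetch (not a persistence setting; verify this key against the rabbitmq.conf reference for your version)
  queue.default_consumer_prefetch = 50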
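
As a worked example of what the relative memory and disk thresholds above mean in absolute terms, the sketch below converts them to bytes. It assumes a 2 GiB memory limit (the figure used in the memory patch command in the troubleshooting section) and that the node treats the container limit as its total memory; both are assumptions to check against `rabbitmqctl status`.

```python
GIB = 1024 ** 3


def memory_watermark_bytes(total_memory_bytes: int, relative: float = 0.6) -> int:
    """Memory use at which publishers are blocked (vm_memory_high_watermark.relative)."""
    return int(total_memory_bytes * relative)


def disk_free_limit_bytes(total_memory_bytes: int, relative: float = 2.0) -> int:
    """Free disk space below which publishers are blocked (disk_free_limit.relative, a multiple of RAM)."""
    return int(total_memory_bytes * relative)


if __name__ == "__main__":
    total = 2 * GIB
    print(f"memory watermark: {memory_watermark_bytes(total) / GIB:.2f} GiB")
    print(f"disk free limit:  {disk_free_limit_bytes(total) / GIB:.2f} GiB")
```

With these inputs the watermark is 1.20 GiB and the disk free limit is 4.00 GiB, so the persistent volume needs comfortably more than 4 GiB free for the broker to keep accepting messages.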

3. Network Optimization

# Optimize network settings
configuration: |-
  # TCP settings
  tcp_listen_options.backlog = 128
  tcp_listen_options.nodelay = true
  
  # Heartbeat
  heartbeat = 60
  
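  # Connection limits (node-wide; confirm the key name against the rabbitmq.conf reference for your version)
  connection_max = 1000
  # Per-user limits are not rabbitmq.conf settings; set them at runtime with:
  #   rabbitmqctl set_user_limits user '{"max-connections": 100}'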

📊 Performance Monitoring

Key Performance Indicators

Message Throughput - Messages per second
Latency - Message processing time
Queue Depth - Messages waiting to be processed
Memory Usage - Heap and process memory
Disk I/O - Write and read operations

Performance Benchmarks

# Your expected performance:
# - Message rate: 1000+ messages/second
# - Latency: < 10ms for local messages
# - Queue depth: < 100 messages (normal operation)
# - Memory usage: < 80% of allocated memory
# - Disk usage: < 70% of allocated storage
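
The queue depth and message rate targets above are linked: a backlog only shrinks when consumers run faster than publishers. A small sketch of that arithmetic follows; `drain_seconds` is an illustrative helper and the rates in the example are made-up inputs, not measurements.

```python
from typing import Optional


def drain_seconds(backlog: int, publish_rate: float, consume_rate: float) -> Optional[float]:
    """Seconds until a backlog is cleared, or None if it never drains."""
    net_rate = consume_rate - publish_rate  # messages removed per second
    if net_rate <= 0:
        return None
    return backlog / net_rate


if __name__ == "__main__":
    print(drain_seconds(100, publish_rate=1000, consume_rate=1100))  # 1.0
    print(drain_seconds(100, publish_rate=1000, consume_rate=1000))  # None
```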

🔒 Security Best Practices

🛡️ Current Security Analysis

✅ Security Strengths

Namespace separation - RabbitMQ runs in its own Kubernetes namespace (this does not restrict network traffic until NetworkPolicies are added)
Resource limits - Memory and CPU limits set
Non-root user - Runs as non-root in container
TLS support - SSL/TLS configuration available

⚠️ Security Weaknesses

Hardcoded passwords - Passwords in YAML files
Default permissions - Overly permissive user access
No audit logging - Limited security event tracking
No network policies - No ingress/egress restrictions

🔧 Security Improvements

1. Secret Management

# Use Kubernetes secrets
apiVersion: v1
kind: Secret
metadata:
  name: rabbitmq-credentials
  namespace: freeleaps-alpha
type: Opaque
data:
  username: dXNlcg==  # base64 encoded
  password: <base64-encoded-password>
---
# Reference in Helm values (key names follow the Bitnami chart; confirm against your chart version's values.yaml)
auth:
  username: user
  existingPasswordSecret: rabbitmq-credentials
  existingSecretPasswordKey: password

2. User Access Control

# Create application-specific users
# Instead of one user with full access:
# - freeleaps-reconciler (reconciler access only)
# - freeleaps-monitoring (read-only access)
# - freeleaps-admin (full access, limited to admins)
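
RabbitMQ permissions are three regular expressions per user and vhost (configure, write, read), as passed to `rabbitmqctl set_permissions`. The sketch below models the three users listed above to show which resource names each pattern would admit. The user names come from that list; the patterns themselves are illustrative assumptions, not the current configuration.

```python
import re

# Hypothetical (configure, write, read) patterns for each proposed user.
RECONCILER = r"^freeleaps\.devops\.reconciler\..*$"
PERMISSIONS = {
    "freeleaps-reconciler": (RECONCILER, RECONCILER, RECONCILER),
    "freeleaps-monitoring": (r"^$", r"^$", r".*"),  # "^$" matches no named resource
    "freeleaps-admin": (r".*", r".*", r".*"),
}
OPERATIONS = ("configure", "write", "read")


def is_allowed(user: str, operation: str, resource: str) -> bool:
    """Return True if the user's pattern for the operation matches the resource name."""
    pattern = PERMISSIONS[user][OPERATIONS.index(operation)]
    return re.search(pattern, resource) is not None


if __name__ == "__main__":
    queue = "freeleaps.devops.reconciler.input"
    for user in PERMISSIONS:
        print(user, [op for op in OPERATIONS if is_allowed(user, op, queue)])
```

Anchoring the reconciler pattern with `^` and `$` matters because an unanchored pattern would also match names that merely contain the prefix.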

3. Network Policies

# Restrict network access
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: rabbitmq-network-policy
  namespace: freeleaps-alpha
spec:
  podSelector:
    matchLabels:
      app: rabbitmq
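  policyTypes:
  - Ingress  # Egress is not listed: declaring it without egress rules would block DNS and inter-node traffic
  ingress:
  # Client traffic from the reconciler namespace
  - from:
    - namespaceSelector:
        matchLabels:
          kubernetes.io/metadata.name: freeleaps-devops-system  # label Kubernetes sets on every namespace
    ports:
    - protocol: TCP
      port: 5672
    - protocol: TCP
      port: 15672
  # Inter-node traffic between RabbitMQ pods (epmd and Erlang distribution)
  - from:
    - podSelector:
        matchLabels:
          app: rabbitmq
    ports:
    - protocol: TCP
      port: 4369
    - protocol: TCP
      port: 25672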

4. Audit Logging

# Enable audit logging
configuration: |-
  # Audit logging
  log.file.level = info
  log.file.rotation.date = $D0
  log.file.rotation.size = 10485760
  
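  # Connection lifecycle events, which include failed authentication attempts
  # (rabbitmq.conf has no log.security key)
  log.connection.level = info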

📈 Scaling & High Availability

🏗️ Current HA Setup

Cluster Configuration

# Your current clustering setup
clustering:
  enabled: true
  name: "freeleaps-alpha"
  addressType: hostname
  rebalance: false
  forceBoot: false
  partitionHandling: autoheal

Replication Strategy

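# Queue replication
# - Classic queues live on a single node and are not replicated (RabbitMQ 4.x removed classic queue mirroring)
# - Quorum queues are replicated across cluster nodes and elect a new leader if the leader's node fails
# - Check the type of each freeleaps.devops.reconciler.* queue before relying on failover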

🚀 Scaling Strategies

1. Horizontal Scaling

# Scale RabbitMQ cluster (Helm reverts a manual scale on the next upgrade, so also set replicaCount in the values file)
kubectl scale statefulset rabbitmq -n freeleaps-alpha --replicas=5

# Verify scaling
kubectl get pods -n freeleaps-alpha | grep rabbitmq
kubectl exec -it statefulset/rabbitmq -n freeleaps-alpha -- rabbitmqctl cluster_status

2. Vertical Scaling

# Increase resource limits
resources:
  requests:
    cpu: 500m
    memory: 1Gi
  limits:
    cpu: 2000m
    memory: 4Gi

3. Queue Partitioning

# Partition large queues across nodes
# Strategy: Hash-based partitioning
# Benefits: Better performance and fault tolerance

🔧 High Availability Best Practices

1. Node Distribution

# Use pod anti-affinity so that no two RabbitMQ pods share a Kubernetes node
# (to spread across availability zones instead, use topologyKey: topology.kubernetes.io/zone)
affinity:
  podAntiAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
    - labelSelector:
        matchExpressions:
        - key: app
          operator: In
          values:
          - rabbitmq
      topologyKey: kubernetes.io/hostname

2. Data Replication

# Configure proper replication
# - Use quorum queues for critical data
# - Give quorum queues an odd number of replicas, at least 3, so a majority survives one node failure
# - Monitor replication lag
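
The replica count matters because a quorum queue needs a majority of its replicas online. A short sketch of that arithmetic, with illustrative function names:

```python
def quorum_size(replicas: int) -> int:
    """Smallest number of replicas that forms a majority."""
    return replicas // 2 + 1


def tolerated_failures(replicas: int) -> int:
    """How many replicas can be lost while a majority remains."""
    return replicas - quorum_size(replicas)


if __name__ == "__main__":
    for n in (2, 3, 4, 5):
        print(f"replicas={n} majority={quorum_size(n)} tolerated failures={tolerated_failures(n)}")
```

Two replicas tolerate no failures and four tolerate only one, the same as three, which is why odd counts are the usual choice.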

3. Backup Strategy

# Back up RabbitMQ definitions (users, vhosts, queues, exchanges, bindings, policies; messages are not included)
kubectl exec -it statefulset/rabbitmq -n freeleaps-alpha -- rabbitmqctl export_definitions /tmp/rabbitmq-definitions.json

# Restore definitions from backup
kubectl exec -it statefulset/rabbitmq -n freeleaps-alpha -- rabbitmqctl import_definitions /tmp/rabbitmq-definitions.json

🛠️ Maintenance Procedures

📅 Regular Maintenance Tasks

Daily Tasks

# 1. Check cluster health
kubectl exec -it statefulset/rabbitmq -n freeleaps-alpha -- rabbitmqctl cluster_status

# 2. Monitor queue depths
kubectl exec -it statefulset/rabbitmq -n freeleaps-alpha -- rabbitmqctl list_queues name messages

# 3. Check connection count (banner and header are suppressed so only connections are counted; no -t, which would add carriage returns to piped output)
kubectl exec statefulset/rabbitmq -n freeleaps-alpha -- rabbitmqctl list_connections --quiet --no-table-headers | wc -l

# 4. Review error logs (RabbitMQ writes the level in lowercase, e.g. [error])
kubectl logs --tail=100 statefulset/rabbitmq -n freeleaps-alpha | grep -i error

Weekly Tasks

# 1. Review performance metrics
# Access Grafana dashboard: RabbitMQ Management Overview

# 2. Check disk usage
kubectl exec -it statefulset/rabbitmq -n freeleaps-alpha -- df -h

# 3. Review user permissions
kubectl exec -it statefulset/rabbitmq -n freeleaps-alpha -- rabbitmqctl list_users
kubectl exec -it statefulset/rabbitmq -n freeleaps-alpha -- rabbitmqctl list_user_permissions user

# 4. Backup configurations (the file is written inside the pod; copy it out with the backup script, because /tmp is lost when the pod restarts)
kubectl exec -it statefulset/rabbitmq -n freeleaps-alpha -- rabbitmqctl export_definitions /tmp/weekly-backup-$(date +%Y%m%d).json

Monthly Tasks

# 1. Security audit
# Review user access and permissions
# Check for unused queues and exchanges
# Verify network policies

# 2. Performance review
# Analyze message rates and latency
# Review resource usage trends
# Optimize configurations

# 3. Capacity planning
# Project growth based on usage trends
# Plan for scaling if needed
# Review backup and disaster recovery procedures

🔧 Maintenance Scripts

Health Check Script

#!/bin/bash
# scripts/rabbitmq-health-check.sh
set -euo pipefail

NAMESPACE="freeleaps-alpha"
# The label selector is an assumption; confirm it with: kubectl get pods -n "$NAMESPACE" --show-labels
POD_NAME=$(kubectl get pods -n "$NAMESPACE" -l app=rabbitmq -o jsonpath='{.items[0].metadata.name}')

echo "🐰 RabbitMQ Health Check - $(date)"
echo "=================================="

# Check cluster status (no -it: scripts usually run without a TTY)
echo "📊 Cluster Status:"
kubectl exec "$POD_NAME" -n "$NAMESPACE" -- rabbitmqctl cluster_status

# Check queue depths
echo "📋 Queue Depths:"
kubectl exec "$POD_NAME" -n "$NAMESPACE" -- rabbitmqctl list_queues name messages consumers

# Check connections (banner and header suppressed so the count is accurate)
echo "🔗 Active Connections:"
kubectl exec "$POD_NAME" -n "$NAMESPACE" -- rabbitmqctl list_connections --quiet --no-table-headers | wc -l

# Check resource usage
echo "💾 Resource Usage:"
kubectl top pods -n "$NAMESPACE" | grep rabbitmq

Backup Script

#!/bin/bash
# scripts/rabbitmq-backup.sh
set -euo pipefail

NAMESPACE="freeleaps-alpha"
POD_NAME="rabbitmq-0"  # kubectl cp needs a pod name; it does not accept deployment/... or statefulset/...
BACKUP_DIR="/tmp/rabbitmq-backups"
DATE=$(date +%Y%m%d_%H%M%S)

mkdir -p "$BACKUP_DIR"

echo "📦 Creating RabbitMQ backup..."

# Export definitions (topology, users and policies; messages are not included)
kubectl exec "$POD_NAME" -n "$NAMESPACE" -- rabbitmqctl export_definitions "/tmp/rabbitmq-definitions-$DATE.json"

# Copy backup file out of the same pod
kubectl cp "$NAMESPACE/$POD_NAME:/tmp/rabbitmq-definitions-$DATE.json" "$BACKUP_DIR/rabbitmq-definitions-$DATE.json"

echo "✅ Backup created: $BACKUP_DIR/rabbitmq-definitions-$DATE.json"
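
A backup is only useful if it contains what you expect. The sketch below summarizes an exported definitions file by counting the entries in its top-level sections. The section names are the usual top-level keys of a definitions export, the sample document is made up, and `summarize_definitions` is an illustrative name. Messages never appear in this file.

```python
import json

SECTIONS = ("users", "vhosts", "queues", "exchanges", "bindings", "policies")


def summarize_definitions(text: str) -> dict:
    """Count the entries in each top-level section of a definitions export."""
    definitions = json.loads(text)
    return {section: len(definitions.get(section, [])) for section in SECTIONS}


if __name__ == "__main__":
    sample = json.dumps({
        "users": [{"name": "user"}],
        "vhosts": [{"name": "/"}],
        "queues": [
            {"name": "freeleaps.devops.reconciler.input", "vhost": "/"},
            {"name": "freeleaps.devops.reconciler.output", "vhost": "/"},
        ],
    })
    print(summarize_definitions(sample))
```

Running it against each new backup file and comparing the counts with the previous one is a cheap way to notice an empty or truncated export.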

🚨 Emergency Procedures

1. RabbitMQ Node Failure

# If a RabbitMQ node fails:
# 1. Check node status (compare the Disk Nodes and Running Nodes sections of the output)
kubectl exec -it statefulset/rabbitmq -n freeleaps-alpha -- rabbitmqctl cluster_status

# 2. Restart failed node
kubectl delete pod rabbitmq-1 -n freeleaps-alpha

# 3. Verify cluster health
kubectl exec -it statefulset/rabbitmq -n freeleaps-alpha -- rabbitmqctl cluster_status

2. Data Loss Recovery

# If definitions (queues, exchanges, users, policies) are lost:
# Note: export_definitions backups contain no messages, so this restores topology only
# 1. Stop all consumers
kubectl scale deployment freeleaps-devops-reconciler -n freeleaps-devops-system --replicas=0

# 2. Restore from backup (kubectl cp needs a pod name, not deployment/...)
kubectl cp backup-file.json freeleaps-alpha/rabbitmq-0:/tmp/backup-file.json
kubectl exec -it rabbitmq-0 -n freeleaps-alpha -- rabbitmqctl import_definitions /tmp/backup-file.json

# 3. Restart consumers
kubectl scale deployment freeleaps-devops-reconciler -n freeleaps-devops-system --replicas=1

3. Performance Emergency

# If performance is severely degraded:
# 1. Check resource usage
kubectl top pods -n freeleaps-alpha | grep rabbitmq

# 2. Scale up resources (temporary: also update the Helm values, or the next upgrade reverts the patch)
kubectl patch statefulset rabbitmq -n freeleaps-alpha -p '{"spec":{"template":{"spec":{"containers":[{"name":"rabbitmq","resources":{"limits":{"memory":"4Gi","cpu":"2000m"}}}]}}}}'

# 3. Restart RabbitMQ
kubectl rollout restart statefulset/rabbitmq -n freeleaps-alpha

🎯 Summary & Next Steps

📊 Current State Assessment

✅ Strengths

Production-ready setup - Clustering, monitoring, management UI
Helm-based deployment - Consistent and repeatable
Environment separation - Alpha vs Production
Integration working - Reconciler successfully using RabbitMQ
Monitoring available - Grafana dashboards and metrics

⚠️ Areas for Improvement

Security hardening - Remove hardcoded passwords, implement secrets
Configuration standardization - Centralize configuration management
Performance optimization - Tune settings for your workload
Documentation - Create runbooks for common operations
Automation - Implement automated health checks and alerts

🚀 Recommended Actions

Immediate (This Week)

Implement secret management - Move passwords to Kubernetes secrets
Standardize configuration - Create centralized RabbitMQ config
Set up monitoring alerts - Configure alerts for critical metrics
Document procedures - Create runbooks for common operations

Short Term (Next Month)

Security audit - Review and improve security posture
Performance tuning - Optimize settings based on usage patterns
Automation - Implement automated health checks and backups
Training - Train team on RabbitMQ management and troubleshooting

Long Term (Next Quarter)

High availability - Implement multi-zone deployment
Disaster recovery - Set up automated backup and recovery procedures
Advanced monitoring - Implement predictive analytics and alerting
Capacity planning - Plan for growth and scaling

📚 Additional Resources

Official Documentation

RabbitMQ Documentation - Official guides
RabbitMQ Management UI - UI documentation
RabbitMQ Clustering - Cluster setup

Community Resources

RabbitMQ Slack - Community support
RabbitMQ GitHub - Source code
RabbitMQ Blog - Latest updates and tips

Books & Courses

"RabbitMQ in Depth" by Gavin M. Roy
"RabbitMQ Essentials" by Lovisa Johansson
RabbitMQ Tutorials - Official tutorial series

🎉 You now have a comprehensive understanding of your RabbitMQ production environment! Use this guide to maintain, monitor, and optimize your message broker infrastructure.

Last updated: 2025-09-04
Maintained by: FreeLeaps DevOps Team
