One of the more interesting challenges that infrastructure teams, site reliability engineers, and DevOps developers face when building or refactoring the deployments of their containerized microservices is designing for failure of the cluster on which the pods are running. If the Kubernetes masters stop working, or an entire namespace fails (a frequent occurrence where I work), your wonderfully resilient system becomes unavailable. Global load balancing with multi-region and multi-cluster or multi-namespace redundancy can mitigate this problem, but it adds complexity and introduces new kinds of cascading failure, such as flapping, undetected service failures, and false alarms. Even AWS Fargate and Google Cloud's GKE Autopilot are not yet immune to these problems, though Autopilot has shown extremely high availability in my personal testing.
A simpler and more elegant abstraction that solves this problem is ArgoCD. I am embarrassed to admit that I first heard of ArgoCD when this silly video came out (the ArgoCD reference is about three minutes in). Here is a very gentle, 5-minute introduction to ArgoCD concepts that can help you get started quickly.
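To make the idea a little more concrete, here is a minimal sketch of how one Git repository could drive the same application onto several clusters through ArgoCD `Application` resources. It is only an illustration under stated assumptions: the repository URL, the cluster names and API server addresses, the `apps/<name>` directory layout, and the `checkout` application are all hypothetical, and it assumes each cluster has already been registered with ArgoCD (for example with `argocd cluster add`). The script just builds the manifests as Python dictionaries and prints them as JSON, which `kubectl` accepts alongside YAML.

```python
import json

# Hypothetical values for illustration only; none of these come from a real setup.
REPO_URL = "https://example.com/your-org/your-manifests.git"
CLUSTERS = {
    "us-east": "https://us-east.k8s.example.com",
    "eu-west": "https://eu-west.k8s.example.com",
}


def build_application(app_name, cluster_name, cluster_server):
    """Return an ArgoCD Application manifest (as a dict) targeting one cluster."""
    return {
        "apiVersion": "argoproj.io/v1alpha1",
        "kind": "Application",
        "metadata": {
            "name": f"{app_name}-{cluster_name}",
            # Assumes ArgoCD is installed in its conventional "argocd" namespace.
            "namespace": "argocd",
        },
        "spec": {
            "project": "default",
            # Where the desired state lives: a path inside the Git repository.
            "source": {
                "repoURL": REPO_URL,
                "targetRevision": "HEAD",
                "path": f"apps/{app_name}",  # assumed repository layout
            },
            # Where the desired state should be applied.
            "destination": {"server": cluster_server, "namespace": app_name},
            # Keep the cluster in sync with Git without manual intervention.
            "syncPolicy": {"automated": {"prune": True, "selfHeal": True}},
        },
    }


def build_all(app_name):
    """Build one Application per registered cluster, in a stable (sorted) order."""
    return [
        build_application(app_name, name, server)
        for name, server in sorted(CLUSTERS.items())
    ]


if __name__ == "__main__":
    for manifest in build_all("checkout"):
        print(json.dumps(manifest, indent=2))
```

Because every cluster gets its own `Application` pointing at the same repository path, a replacement cluster can be brought up to the same state by adding one more entry to `CLUSTERS`. Whether automated pruning and self-healing are appropriate depends on your environment, so treat the `syncPolicy` shown here as a starting point rather than a recommendation.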