Canary and Blue-Green Deployments Enabled by KubeStellar — Part 1 — failed approach

Mar 31, 2024

I recently attended KubeCon EU in Paris, France. As part of the CNCF Sandbox program, KubeStellar is afforded some really nice perks. We have opportunities to give a lightning talk, offer a ContribFest (a hackathon of sorts), and be part of the CNCF project pavilion, where you can showcase your wares and talk to real customers. All of this is included in the program, and all I had to do was show up.

During the opening night event I was fortunate to talk to quite a few folks, and some had really interesting ideas about how they might use KubeStellar to meet a real business need. I am working to bring a handful of these scenarios/use cases to life in documented form. Canary and blue-green operation, WASM Edge, and consolidated Continuous Delivery were the 3 items I saw potential in. In this blog I am going to take on the first of the 3: canary and blue-green operation.

Understanding Deployment Strategies: Canary vs. Blue-Green

Before diving into our exploration of canary and blue-green deployments using KubeStellar, let’s clarify what each deployment strategy entails.

Canary Deployment

In a canary deployment, a new version of an application is gradually rolled out to a subset of users or servers. This approach allows for testing the new version in a real-world environment while minimizing the potential impact of bugs or issues. The term “canary” refers to using a small, controlled subset (like a canary in a coal mine) to detect potential problems before fully deploying the new version.

Blue-Green Deployment

In contrast, a blue-green deployment involves maintaining two identical production environments: one active (blue) and one inactive (green). The new version of the application is deployed to the inactive environment (green), allowing for extensive testing and validation. Once the new version is deemed stable and ready for production, traffic is switched from the active environment (blue) to the updated environment (green), effectively swapping the roles of the two environments. Blue-green deployment is a complete cutover rather than the gradual shift that defines a canary deployment.
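To make the difference concrete, here is a small illustrative sketch that prints the traffic split (percent to the old and new versions) at each step of a rollout. The canary step sizes are arbitrary examples I picked for illustration, not a recommendation, and `rollout_steps` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Illustrative sketch: print the old/new traffic split (percent) per rollout step.
# The canary step sizes are arbitrary examples.
rollout_steps() {
  local strategy="$1" new
  case "$strategy" in
    canary)     set -- 0 10 25 50 100 ;;  # gradual shift toward the new version
    blue-green) set -- 0 100 ;;           # one complete cutover
    *) echo "unknown strategy: $strategy" >&2; return 1 ;;
  esac
  for new in "$@"; do
    echo "old=$((100 - new)) new=${new}"
  done
}

# Usage:
#   rollout_steps canary      # five steps, ending at old=0 new=100
#   rollout_steps blue-green  # old=100 new=0, then old=0 new=100
```

Both strategies end in the same place; what differs is how many intermediate states real users see along the way.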

Blue-Green on AWS EKS with ALB

For the blue-green deployment on AWS EKS with ALB, I initially attempted to follow the approach outlined in the AWS blog (https://aws.amazon.com/blogs/containers/using-aws-load-balancer-controller-for-blue-green-deployment-canary-deployment-and-a-b-testing/), using the AWS Load Balancer Controller. However, despite a few days of non-stop learning and effort, I encountered several challenges (many due to my lack of EKS knowledge) and the deployment did not work as I had expected. And for good reason: my scenario was quite different from the one used by the authors of the AWS blog.

Multicluster Deployment with KubeStellar

It’s important to note that the scenario I attempted with KubeStellar involves multicluster deployment across different VPCs and regions. This means we’re dealing with a more complex setup compared to the single-cluster deployment described by AWS. Managing multiple clusters distributed across different VPCs and regions adds additional layers of complexity, especially when coordinating deployments and traffic routing between these clusters.

In fact, an ALB->NLB combination was never going to work. An ALB (provisioned here by the AWS Load Balancer Controller, see https://docs.aws.amazon.com/eks/latest/userguide/alb-ingress.html) works at layer 7 of the OSI model and expects protocols like HTTP and HTTPS. An NLB (which fronts ingress-nginx in this setup, see https://aws.amazon.com/blogs/opensource/network-load-balancer-nginx-ingress-controller-eks/) works at layer 4 of the OSI model and makes routing decisions based on protocol (TCP, UDP) and port. For more on the AWS ALB, check out https://docs.aws.amazon.com/elasticloadbalancing/latest/application/introduction.html.
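A toy model (not AWS code) may help show what each layer can "see" when it routes. A layer-4 balancer only has the protocol and destination port, while a layer-7 balancer parses the HTTP request and can route on the Host header and path. The hostnames, ports, and backend names below are made up for illustration.

```shell
#!/usr/bin/env bash
# Toy model, not AWS code. Hosts and backend names are hypothetical.

# Layer 4 (NLB-like): only the transport protocol and destination port are known.
l4_route() {
  local proto="$1" port="$2"
  case "${proto}:${port}" in
    TCP:443|TCP:9443) echo "https-backend" ;;
    TCP:80|TCP:9080)  echo "http-backend" ;;
    *)                echo "no-listener" ;;
  esac
}

# Layer 7 (ALB-like): the HTTP request is parsed, so Host and path are usable.
# Assumes LF line endings for simplicity (real HTTP uses CRLF).
l7_route() {
  local request="$1" host path
  path=$(printf '%s\n' "$request" | head -n1 | awk '{print $2}')
  host=$(printf '%s\n' "$request" | awk -F': ' 'tolower($1) == "host" {print $2}')
  case "${host}${path}" in
    app.example.com/v2*) echo "hello-kubernetes-v2" ;;
    app.example.com/*)   echo "hello-kubernetes-v1" ;;
    *)                   echo "default-backend" ;;
  esac
}

# Usage:
#   l4_route TCP 9443                                               # -> https-backend
#   l7_route "$(printf 'GET /v2/ HTTP/1.1\nHost: app.example.com')" # -> hello-kubernetes-v2
```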

Failed Approach

My lack of respect for the differences in ingress mechanics led me to a failed implementation. But I learned quite a bit and can now make better-informed architectural decisions as a result. While it was frustrating, I did come away with a working implementation (the next blog in the series).

I wrote this blog to record what I did so that I can a) show others what to avoid, and b) use this information in the future when a different scenario arises. The AWS load-balancer-controller is really slick, and I like the way Amazon lets you create infrastructure from within state-based Kubernetes controllers. To me, the key takeaway is that you are not limited to Upbound's Crossplane to CRUD infrastructure; AWS is making good strides in this direction also.

The setup

1. I had limited success using a simple 'eksctl create cluster' to create EKS clusters that could communicate with public internet endpoints. I am sure there is a way to do it, but I stuck with what I know works: I deployed 3 EKS clusters with "VPC and more" (see https://clubanderson.medium.com/how-to-install-kubestellar-on-a-collection-of-aws-eks-managed-clusters-aa1615e671a0).
2. Remember to spin up nodes in node groups for each of the 3 clusters and add the EBS CSI driver add-on (after associating OIDC and creating policies for all 3 clusters).
3. Get all the kubeconfigs:

eksctl utils write-kubeconfig --cluster=bg-wec1 --kubeconfig=eks.kubeconfig --region us-east-1
eksctl utils write-kubeconfig --cluster=bg-wec2 --kubeconfig=eks.kubeconfig --region us-east-1
eksctl utils write-kubeconfig --cluster=bg-core --kubeconfig=eks.kubeconfig --region us-east-1

NOTE: write the hub (bg-core) to the kubeconfig last so that your current context defaults to the hub at the start of this exercise.
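The commands in this walkthrough address clusters by short context names such as `bg-core`. Assuming eksctl wrote its default context names of the form `<identity>@<cluster>.<region>.eksctl.io`, one way to get the short names is to rename the contexts once. This is my sketch, not a step from the original setup; the helper names are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch, assuming eksctl's default context naming
# "<identity>@<cluster>.<region>.eksctl.io".
short_context_name() {
  local ctx="${1#*@}"   # drop "<identity>@"
  echo "${ctx%%.*}"     # drop ".<region>.eksctl.io"
}

# Renames every eksctl-style context in the given kubeconfig (needs kubectl).
# Not invoked here; call it yourself.
rename_eks_contexts() {
  local kubeconfig="$1" ctx
  for ctx in $(KUBECONFIG="$kubeconfig" kubectl config get-contexts -o name); do
    case "$ctx" in
      *@*.eksctl.io)
        KUBECONFIG="$kubeconfig" kubectl config rename-context "$ctx" "$(short_context_name "$ctx")" ;;
    esac
  done
}

# Usage:
#   rename_eks_contexts eks.kubeconfig
```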

4. Install ingress-nginx (exposed through an AWS Network Load Balancer) FIRST to support KubeStellar (see https://clubanderson.medium.com/how-to-install-kubestellar-on-a-collection-of-aws-eks-managed-clusters-aa1615e671a0):

kubectl --context bg-core apply -f https://raw.githubusercontent.com/kubernetes/ingress-nginx/main/deploy/static/provider/aws/deploy.yaml

4.a. Add SSL passthrough by editing the controller deployment:

kubectl --context bg-core edit deployment.apps/ingress-nginx-controller -n ingress-nginx

Add '- --enable-ssl-passthrough' to the controller container's args.

4.b. Add more ports for KubeStellar. Edit the controller service and append the following entries under spec.ports:

kubectl --context bg-core edit service ingress-nginx-controller -n ingress-nginx

- name: proxied-tcp-9443
  nodePort: 31345
  port: 9443
  protocol: TCP
  targetPort: 443
- name: proxied-tcp-9080
  nodePort: 31226
  port: 9080
  protocol: TCP
  targetPort: 80

4.c. Check for a new Network Load Balancer in your AWS console at https://console.aws.amazon.com/ec2/home?#LoadBalancers
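If you prefer a non-interactive alternative to the two `kubectl edit` steps, JSON Patch "add" operations can append the args entry and the ports. The sketch below builds the Service patch from the same values as the YAML above; the helper names are hypothetical, and the deployment patch assumes the controller is the first (index 0) container and already has an `args` list.

```shell
#!/usr/bin/env bash
# Hypothetical helper: one JSON Patch "add" operation that appends a TCP port
# to the Service's spec.ports list ("/-" appends to the end of an array).
port_patch_op() {
  local name="$1" port="$2" target="$3" node_port="$4"
  printf '{"op":"add","path":"/spec/ports/-","value":{"name":"%s","port":%s,"protocol":"TCP","targetPort":%s,"nodePort":%s}}' \
    "$name" "$port" "$target" "$node_port"
}

# Combine both KubeStellar ports into one JSON Patch document.
ks_ports_patch() {
  printf '[%s,%s]' \
    "$(port_patch_op proxied-tcp-9443 9443 443 31345)" \
    "$(port_patch_op proxied-tcp-9080 9080 80 31226)"
}

# Usage (requires kubectl and access to bg-core):
#   kubectl --context bg-core -n ingress-nginx patch service ingress-nginx-controller \
#     --type=json -p "$(ks_ports_patch)"
#   kubectl --context bg-core -n ingress-nginx patch deployment ingress-nginx-controller \
#     --type=json -p '[{"op":"add","path":"/spec/template/spec/containers/0/args/-","value":"--enable-ssl-passthrough"}]'
```

Patches are repeatable in scripts, but note that applying the Service patch twice appends duplicate port names, which the API server rejects.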

5. Now that ingress is established, you can install KubeStellar (WDS0 hosted control plane, WDS1 pointing at bg-wec1 and WDS2 pointing at bg-wec2). Use the instructions at https://github.com/kubestellar/kubeflex/blob/main/docs/users.md#use-a-different-dns-service. Remember to use the '--domain' flag for kubeflex. The domain you give it is one you define in AWS Route53, and it plays an important role in ingress for the KubeStellar IMBS/ITS and WDSes. Without a domain, neither this failed approach nor the working approach (next blog in the series) will work!

6. Test that your KubeStellar endpoints work with your Route53 domain. A '403' response counts as success here: we are calling the IMBS1, WDS1, and WDS2 control planes anonymously, without the proper certificates, so the API server refuses the request. Getting that response shows the control planes are alive and responding, and that's all we need to confirm here.

curl -k https://imbs1.kubestellar.org:9443

{
  "kind": "Status",
  "apiVersion": "v1",
  "metadata": {},
  "status": "Failure",
  "message": "forbidden: User \"system:anonymous\" cannot get path \"/\"",
  "reason": "Forbidden",
  "details": {},
  "code": 403
}

curl -k https://wds1.kubestellar.org:9443
{
  "kind": "Status",
  "apiVersion": "v1",
  "metadata": {},
  "status": "Failure",
  "message": "forbidden: User \"system:anonymous\" cannot get path \"/\"",
  "reason": "Forbidden",
  "details": {},
  "code": 403
}

curl -k https://wds2.kubestellar.org:9443
{
  "kind": "Status",
  "apiVersion": "v1",
  "metadata": {},
  "status": "Failure",
  "message": "forbidden: User \"system:anonymous\" cannot get path \"/\"",
  "reason": "Forbidden",
  "details": {},
  "code": 403
}
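The three curl checks can be folded into one loop that only looks at the HTTP status code. A minimal sketch, treating exactly 403 as "alive" as described above; the function names are hypothetical and the hostnames are the ones used in this post.

```shell
#!/usr/bin/env bash
# Sketch: a 403 means the API server answered but refused our anonymous request,
# which is all this step needs to confirm.
status_means_alive() {
  [ "$1" = "403" ]
}

# Probes each host on port 9443 (not invoked here; needs curl and working DNS).
check_endpoints() {
  local host code rc=0
  for host in "$@"; do
    # -s: silent, -k: skip TLS verification, -o: discard body,
    # -w: print only the HTTP status ("000" if no response was received)
    code=$(curl -sk -o /dev/null -w '%{http_code}' "https://${host}:9443")
    if status_means_alive "$code"; then
      echo "${host}: alive (HTTP ${code})"
    else
      echo "${host}: unexpected HTTP ${code}"
      rc=1
    fi
  done
  return "$rc"
}

# Usage:
#   check_endpoints imbs1.kubestellar.org wds1.kubestellar.org wds2.kubestellar.org
```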

Where are we? Check-in #1

Ok, so let's take inventory of what we have done so far. We have 3 EKS clusters (bg-core, bg-wec1, and bg-wec2). We have ingress-nginx set up and working on bg-core. We have 2 KubeStellar WDSes set up to stage workloads for delivery to bg-wec1 and bg-wec2, respectively. The IMBS1, WDS1, and WDS2 control planes are responding over ingress using their domain names.

I am going to install a different version of a single app to WDS1 (which KubeStellar will sync to the remote bg-wec1 cluster) and to WDS2 (which KubeStellar will sync to the remote bg-wec2 cluster). Once that app is deployed, I will apply a configuration to an application load balancer (ALB) that will share the traffic load between the app on bg-wec1 and the app on bg-wec2. This is the flawed thinking; I will explain why.

Why is this flawed? An application load balancer (ALB) works at layer 7 with protocols such as HTTP and HTTPS. If I want to point traffic at an app on bg-wec1, I must give the ALB a pointer to the service on bg-wec1, and the same is true for the app on bg-wec2. The dilemma is how to do this when the services are on separate clusters, in separate VPCs, and potentially in separate regions. The regions are not the issue. The VPCs are a slight issue, but that can be worked around. The real issue is that the services are on different clusters. Let's find out why…

Also important to note (whether an ALB can reach a target group in a different VPC): https://stackoverflow.com/questions/73091883/can-application-load-balancer-access-target-group-in-different-vpc

Deploy the AWS Application Load Balancer Controller

As I stated earlier, the AWS ALB Controller is really slick. AWS took the time to integrate the creation of an ALB and its listeners into a controller (operator) that watches Ingress objects through the Kubernetes API. Because the controller runs in Kubernetes and reacts to Ingress changes, you can see load balancers, target groups, listeners, etc. change in your AWS console. Really cool!

I used the EKS User Guide page on installing the AWS Load Balancer Controller with Helm (https://docs.aws.amazon.com/eks/latest/userguide/lbc-helm.html) as a reference for the instructions herein.

1. Setting up IAM roles: We began by associating an IAM OIDC provider with each cluster and creating the IAM policy and roles required by the AWS Load Balancer Controller.

# associate oidc with bg-core, bg-wec1, and bg-wec2 clusters
eksctl utils associate-iam-oidc-provider --region=us-east-1 --cluster=bg-core --approve
eksctl utils associate-iam-oidc-provider --region=us-east-1 --cluster=bg-wec1 --approve
eksctl utils associate-iam-oidc-provider --region=us-east-1 --cluster=bg-wec2 --approve

# get the JSON policy definition needed
curl -O https://raw.githubusercontent.com/kubernetes-sigs/aws-load-balancer-controller/v2.7.1/docs/install/iam_policy.json

# apply policy
aws iam create-policy \
--policy-name AWSLoadBalancerControllerIAMPolicy \
--policy-document file://iam_policy.json

# get your account id and store it in a var
ACCOUNT_ID=$(aws sts get-caller-identity --query Account --output text)

# did we get our account id?
echo $ACCOUNT_ID

# setup the load balancer controller on each cluster
eksctl create iamserviceaccount \
--cluster=bg-core \
--namespace=kube-system \
--name=aws-load-balancer-controller \
--role-name AmazonEKSLoadBalancerControllerRole \
--attach-policy-arn=arn:aws:iam::$ACCOUNT_ID:policy/AWSLoadBalancerControllerIAMPolicy \
--approve \
--region us-east-1

eksctl create iamserviceaccount \
--cluster=bg-wec1 \
--namespace=kube-system \
--name=aws-load-balancer-controller-wec1 \
--role-name AmazonEKSLoadBalancerControllerRoleWec1 \
--attach-policy-arn=arn:aws:iam::$ACCOUNT_ID:policy/AWSLoadBalancerControllerIAMPolicy \
--approve \
--region us-east-1

eksctl create iamserviceaccount \
--cluster=bg-wec2 \
--namespace=kube-system \
--name=aws-load-balancer-controller-wec2 \
--role-name AmazonEKSLoadBalancerControllerRoleWec2 \
--attach-policy-arn=arn:aws:iam::$ACCOUNT_ID:policy/AWSLoadBalancerControllerIAMPolicy \
--approve \
--region us-east-1

NOTE: if you somehow goof up the creation of the iamserviceaccount - use 

eksctl delete iamserviceaccount --cluster=bg-core --name=aws-load-balancer-controller --namespace=kube-system 

to remove it.
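Because IAM role names must be unique within an AWS account, copy-pasting these three commands makes it easy to reuse a role name by accident. A sketch that derives the per-cluster names instead; the helper names are hypothetical, and the naming scheme simply mirrors the commands above (it is a convention, not an AWS rule).

```shell
#!/usr/bin/env bash
# Service account name per cluster, following the convention used above.
sa_name_for() {
  case "$1" in
    bg-core) echo "aws-load-balancer-controller" ;;
    *)       echo "aws-load-balancer-controller-${1#bg-}" ;;
  esac
}

# IAM role name per cluster: bg-wec1 -> AmazonEKSLoadBalancerControllerRoleWec1.
role_name_for() {
  local base="AmazonEKSLoadBalancerControllerRole" suffix="${1#bg-}"
  if [ "$1" = "bg-core" ]; then
    echo "$base"
  else
    # capitalize the first letter portably (bash 3.2 lacks ${var^})
    echo "${base}$(printf '%s' "${suffix:0:1}" | tr '[:lower:]' '[:upper:]')${suffix:1}"
  fi
}

# Prints (does not run) one eksctl command per cluster. ${ACCOUNT_ID} is left
# literal so the shell that runs the output expands it.
print_iamsa_commands() {
  local cluster
  for cluster in "$@"; do
    echo "eksctl create iamserviceaccount --cluster=${cluster} --namespace=kube-system" \
      "--name=$(sa_name_for "$cluster") --role-name $(role_name_for "$cluster")" \
      "--attach-policy-arn=arn:aws:iam::\${ACCOUNT_ID}:policy/AWSLoadBalancerControllerIAMPolicy" \
      "--approve --region us-east-1"
  done
}

# Usage:
#   export ACCOUNT_ID
#   print_iamsa_commands bg-core bg-wec1 bg-wec2 | bash
```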

2. Helm installation: Next, we used Helm to install the AWS Load Balancer Controller on each cluster (bg-core, bg-wec1, bg-wec2), making sure to specify the cluster name and service account details.

# add the helm repo
KUBECONFIG=eks.kubeconfig helm repo add eks https://aws.github.io/eks-charts

# update the repo
KUBECONFIG=eks.kubeconfig helm repo update eks

# install the AWS load-balancer-controller on each of the 3 EKS clusters
KUBECONFIG=eks.kubeconfig helm --kube-context bg-core install aws-load-balancer-controller eks/aws-load-balancer-controller -n kube-system \
--set clusterName=bg-core \
--set serviceAccount.create=false \
--set serviceAccount.name=aws-load-balancer-controller 

KUBECONFIG=eks.kubeconfig helm --kube-context bg-wec1 install aws-load-balancer-controller eks/aws-load-balancer-controller -n kube-system \
--set clusterName=bg-wec1 \
--set serviceAccount.create=false \
--set serviceAccount.name=aws-load-balancer-controller-wec1

KUBECONFIG=eks.kubeconfig helm --kube-context bg-wec2 install aws-load-balancer-controller eks/aws-load-balancer-controller -n kube-system \
--set clusterName=bg-wec2 \
--set serviceAccount.create=false \
--set serviceAccount.name=aws-load-balancer-controller-wec2

3. Tagging public subnets: It was crucial to tag the public subnets with the appropriate Kubernetes role for ELB (see https://repost.aws/knowledge-center/eks-vpc-subnet-discovery):

tag key 'kubernetes.io/role/elb' with a value of '1'
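The tag can also be applied from the AWS CLI with `aws ec2 create-tags`. A minimal sketch that prints the command for a set of public subnet IDs; the subnet IDs in the usage line are placeholders, and `elb_tag_command` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Sketch: print the tagging command for the public subnets of one cluster's VPC.
elb_tag_command() {
  [ "$#" -gt 0 ] || { echo "usage: elb_tag_command SUBNET_ID..." >&2; return 1; }
  echo "aws ec2 create-tags --resources $* --tags Key=kubernetes.io/role/elb,Value=1"
}

# Usage (placeholder IDs; repeat for each cluster's VPC):
#   elb_tag_command subnet-0aaa1111 subnet-0bbb2222
# Check which subnets carry the tag:
#   aws ec2 describe-subnets --filters Name=tag:kubernetes.io/role/elb,Values=1 \
#     --query 'Subnets[].SubnetId' --output text
```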

4. Test that the AWS LBC is working on all 3 clusters

KUBECONFIG=eks.kubeconfig kubectl --context bg-core get deployment -n kube-system aws-load-balancer-controller
KUBECONFIG=eks.kubeconfig kubectl --context bg-wec1 get deployment -n kube-system aws-load-balancer-controller
KUBECONFIG=eks.kubeconfig kubectl --context bg-wec2 get deployment -n kube-system aws-load-balancer-controller

5. Deploying Test App: We deployed a test application on the core cluster to check if the Application Load Balancer (ALB) was created successfully.

# deploy game 2048 on the core cluster
kubectl --context bg-core apply -f https://raw.githubusercontent.com/kubernetes-sigs/aws-load-balancer-controller/v2.7.1/docs/examples/2048/2048_full.yaml

6. Check for an application load balancer in https://console.aws.amazon.com/ec2/home?#LoadBalancers

7. In a few minutes, you should see an address show up in the ingress listing on the bg-core cluster:

kubectl --context bg-core get ingress/ingress-2048 -n game-2048

8. Try that address in a browser after about 3–4 minutes. You should see the 2048 game.

9. We are done with the game for now — let’s remove it from the bg-core cluster

# delete game 2048 from the core cluster
kubectl --context bg-core delete -f https://raw.githubusercontent.com/kubernetes-sigs/aws-load-balancer-controller/v2.7.1/docs/examples/2048/2048_full.yaml

10. If you do not see an ALB created in your AWS console, check the controller logs for answers:

# check logs if ALB does not create
kubectl --context bg-core logs -f -n kube-system -l app.kubernetes.io/instance=aws-load-balancer-controller
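Because the ALB address takes a few minutes to appear, a small polling helper can replace manual re-runs of `kubectl get ingress`. This is a generic sketch (the helper name is hypothetical); the usage line reads the hostname the controller writes into the Ingress status.

```shell
#!/usr/bin/env bash
# Generic polling sketch: rerun a command until it prints something non-empty,
# or give up after the given number of attempts.
wait_for_output() {
  local attempts="$1" delay="$2" out i
  shift 2
  for ((i = 1; i <= attempts; i++)); do
    out=$("$@" 2>/dev/null)
    if [ -n "$out" ]; then
      echo "$out"
      return 0
    fi
    sleep "$delay"
  done
  return 1
}

# Usage (needs kubectl); polls every 10s for up to ~5 minutes:
#   wait_for_output 30 10 kubectl --context bg-core -n game-2048 get ingress ingress-2048 \
#     -o jsonpath='{.status.loadBalancer.ingress[0].hostname}'
```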

Where are we? Check-in #2

Ok, so now we have a working AWS LBC and we can see Game 2048 exposed on a random URL. Now let’s start working towards deploying 2 different versions of the same app on 2 different clusters using KubeStellar.

1. I used an example workload called "hello-kubernetes" (https://github.com/paulbouwer/hello-kubernetes) because it shows a different page for each version deployed (thank you, paulbouwer!). That makes it great for demonstrating blue-green and canary deployment scenarios.

[Screenshot: version 2 of the "hello-kubernetes" app]

2. Deploy the Helm chart to the KubeStellar WDSes:

git clone https://github.com/paulbouwer/hello-kubernetes.git
    
# deploy to the KubeStellar WDS1 which represents bg-wec1
KUBECONFIG=eks.kubeconfig helm --kube-context wds1 install --create-namespace --namespace hello-kubernetes v1 \
./hello-kubernetes/deploy/helm/hello-kubernetes \
--set message="You are reaching hello-kubernetes version 1" \
--set ingress.configured=true \
--set service.type="ClusterIP"

# deploy to the KubeStellar WDS2 which represents bg-wec2
KUBECONFIG=eks.kubeconfig helm --kube-context wds2 install --create-namespace --namespace hello-kubernetes v2 \
./hello-kubernetes/deploy/helm/hello-kubernetes \
--set message="You are reaching hello-kubernetes version 2" \
--set ingress.configured=true \
--set service.type="ClusterIP"

3. Deploy the KubeStellar binding policies necessary to populate the bg-wec1 and bg-wec2 clusters with the “hello kubernetes” app

# on WDS1 for bg-wec1
KUBECONFIG=eks.kubeconfig kubectl --context wds1 apply -f - <<EOF
apiVersion: control.kubestellar.io/v1alpha1
kind: BindingPolicy
metadata:
  name: bg-wec1-v1-bindingpolicy
spec:
  wantSingletonReportedState: true
  clusterSelectors:
  - matchLabels:
      name: bg-wec1
  downsync:
  - objectSelectors:
    - matchLabels: 
        app.kubernetes.io/instance: v1
EOF

# on WDS2 for bg-wec2
KUBECONFIG=eks.kubeconfig kubectl --context wds2 apply -f - <<EOF
apiVersion: control.kubestellar.io/v1alpha1
kind: BindingPolicy
metadata:
  name: bg-wec2-v2-bindingpolicy
spec:
  wantSingletonReportedState: true
  clusterSelectors:
  - matchLabels:
      name: bg-wec2
  downsync:
  - objectSelectors:
    - matchLabels: 
        app.kubernetes.io/instance: v2
EOF
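The two BindingPolicies differ only in cluster name and Helm release, so they can be rendered from one template. A sketch under that assumption: the field names are copied from the policies above (check them against your KubeStellar release), and `binding_policy` is a hypothetical helper.

```shell
#!/usr/bin/env bash
# Sketch: render the BindingPolicy above for any cluster/release pair.
binding_policy() {
  local cluster="$1" release="$2"
  cat <<EOF
apiVersion: control.kubestellar.io/v1alpha1
kind: BindingPolicy
metadata:
  name: ${cluster}-${release}-bindingpolicy
spec:
  wantSingletonReportedState: true
  clusterSelectors:
  - matchLabels:
      name: ${cluster}
  downsync:
  - objectSelectors:
    - matchLabels:
        app.kubernetes.io/instance: ${release}
EOF
}

# Usage:
#   binding_policy bg-wec1 v1 | KUBECONFIG=eks.kubeconfig kubectl --context wds1 apply -f -
#   binding_policy bg-wec2 v2 | KUBECONFIG=eks.kubeconfig kubectl --context wds2 apply -f -
```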

Now we have the “hello-kubernetes” app on bg-wec1 and bg-wec2 clusters in 2 different VPCs in the same region. At this point, the authors of the original Blue-Green blog (https://aws.amazon.com/blogs/containers/using-aws-load-balancer-controller-for-blue-green-deployment-canary-deployment-and-a-b-testing/) start implementing ingress with:

    apiVersion: networking.k8s.io/v1
    kind: Ingress
    metadata:
      name: "hello-kubernetes"
      namespace: "hello-kubernetes"
      annotations:
        kubernetes.io/ingress.class: alb
        alb.ingress.kubernetes.io/scheme: internet-facing
        alb.ingress.kubernetes.io/target-type: ip
        alb.ingress.kubernetes.io/actions.blue-green: |
          {
            "type":"forward",
            "forwardConfig":{
              "targetGroups":[
                {
                  "serviceName":"hello-kubernetes-v1",
                  "servicePort":"80",
                  "weight":100
                },
                {
                  "serviceName":"hello-kubernetes-v2",
                  "servicePort":"80",
                  "weight":0
                }
              ]
            }
          }
      labels:
        app: hello-kubernetes
    spec:
      rules:
        - http:
            paths:
              - path: /
                pathType: Prefix
                backend:
                  service:
                    name: blue-green
                    port:
                      name: use-annotation
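For reference, in the single-cluster setup from the AWS blog, shifting traffic means rewriting the weights in that annotation and re-applying it. The sketch below builds the forward-action JSON for a given percentage to v2, reusing the service names from the Ingress above and expressing weights as percentages as the AWS blog does; `blue_green_action` is a hypothetical helper name. This only works when both services live next to the Ingress in one cluster.

```shell
#!/usr/bin/env bash
# Sketch: forward-action JSON sending v2pct percent of traffic to v2, the rest to v1.
blue_green_action() {
  local v2="$1"
  # accept only a whole number from 0 to 100
  case "$v2" in
    ''|*[!0-9]*) echo "weight must be an integer 0-100" >&2; return 1 ;;
  esac
  v2=$((10#$v2))   # strip leading zeros so "08" is not parsed as octal
  if [ "$v2" -gt 100 ]; then
    echo "weight must be an integer 0-100" >&2
    return 1
  fi
  printf '{"type":"forward","forwardConfig":{"targetGroups":[{"serviceName":"hello-kubernetes-v1","servicePort":"80","weight":%d},{"serviceName":"hello-kubernetes-v2","servicePort":"80","weight":%d}]}}\n' \
    "$((100 - v2))" "$v2"
}

# Usage (single cluster, same namespace as both services):
#   kubectl -n hello-kubernetes annotate ingress hello-kubernetes --overwrite \
#     alb.ingress.kubernetes.io/actions.blue-green="$(blue_green_action 10)"
```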

But now the epiphany comes to light. How will this work? The 'alb.ingress.kubernetes.io/actions.blue-green' annotation defines a forward action whose target groups are named by Kubernetes services, and the load balancer controller resolves those names in the Ingress's own namespace on its own cluster. Should I apply this object definition on bg-core? On bg-wec1 via WDS1? On bg-wec2 via WDS2? How does bg-core know about service hello-kubernetes-v1 on cluster bg-wec1? Short answer: it doesn't. I am stuck!!!

What next? Well, I did some research on target groups and TargetGroupBindings. Surely this would solve the problem. I would just define 2 target groups and bind (with a TargetGroupBinding) the services on the remote clusters back to those target groups. Listeners should be created, and then I should see my service endpoints running. Let's try that out…

1. I set up 2 target groups:

aws elbv2 create-target-group \
--name target-group-bg-wec1 \
--protocol HTTP \
--port 80 \
--target-type ip \
--vpc-id $VPC_ID_BG_CORE

aws elbv2 create-target-group \
--name target-group-bg-wec2 \
--protocol HTTP \
--port 80 \
--target-type ip \
--vpc-id $VPC_ID_BG_CORE

NOTE: I originally set --vpc-id to the VPC IDs of the bg-wec1 and bg-wec2 clusters, respectively. However, error messages from the ALB told me that all target groups must be in the same VPC as the ALB itself. I figured the bg-core VPC would work for that, so I went with it.
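The target group commands reference `$VPC_ID_BG_CORE`, and the TargetGroupBinding ARNs reference `$REGION` and `$ACCOUNT_ID`. A sketch of one way to fill those in; the helper names and the `TG_ARN_WEC1` variable are hypothetical, and the lookups need the aws CLI with credentials.

```shell
#!/usr/bin/env bash
# Compose a target group ARN from its parts (the trailing ID is assigned by AWS).
tg_arn() {
  printf 'arn:aws:elasticloadbalancing:%s:%s:targetgroup/%s/%s\n' "$1" "$2" "$3" "$4"
}

# Look-ups against AWS (not run here).
vpc_id_of_cluster() {
  aws eks describe-cluster --name "$1" --query 'cluster.resourcesVpcConfig.vpcId' --output text
}
tg_arn_by_name() {
  aws elbv2 describe-target-groups --names "$1" --query 'TargetGroups[0].TargetGroupArn' --output text
}

# Usage:
#   REGION=us-east-1
#   ACCOUNT_ID=$(aws sts get-caller-identity --query Account --output text)
#   VPC_ID_BG_CORE=$(vpc_id_of_cluster bg-core)
#   TG_ARN_WEC1=$(tg_arn_by_name target-group-bg-wec1)
```

Asking AWS for the ARN avoids pasting the generated ID suffix by hand.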

2. I set up 2 TargetGroupBindings:

# on WDS1 for bg-wec1
KUBECONFIG=eks.kubeconfig kubectl --context wds1 apply -f - <<EOF
apiVersion: elbv2.k8s.aws/v1beta1
kind: TargetGroupBinding
metadata:
  name: tgb-v1
  namespace: hello-kubernetes
  labels:
    app.kubernetes.io/instance: v1
spec:
  serviceRef:
    name: hello-kubernetes-v1
    port: 80
  targetGroupARN: arn:aws:elasticloadbalancing:$REGION:$ACCOUNT_ID:targetgroup/target-group-bg-wec1/a5291e82fcc6b879
  targetType: ip
EOF

# on WDS2 for bg-wec2
KUBECONFIG=eks.kubeconfig kubectl --context wds2 apply -f - <<EOF
apiVersion: elbv2.k8s.aws/v1beta1
kind: TargetGroupBinding
metadata:
  name: tgb-v2
  namespace: hello-kubernetes
  labels:
    app.kubernetes.io/instance: v2
spec:
  serviceRef:
    name: hello-kubernetes-v2
    port: 80
  targetGroupARN: arn:aws:elasticloadbalancing:$REGION:$ACCOUNT_ID:targetgroup/target-group-bg-wec2/41813480453d8864
  targetType: ip
EOF

Why would I apply these on the remote clusters (via the KubeStellar WDSes, of course)? Again, bg-core does not know about the "hello-kubernetes-v1" or "hello-kubernetes-v2" services on bg-wec1 and bg-wec2, so creating these TargetGroupBindings on bg-core would fail. After applying these object definitions, I waited for the ALB listeners to be created in the AWS console. After much tinkering, I finally saw the target groups and targets appearing.

Surely this might start working… Alas, I was wrong.

What happened was a whole lot of:

And it makes perfect sense to me now. How is an ALB supposed to know anything about services on remote clusters? Could AWS update their LBC to be multi-cluster aware? Sure, but that might just be an anti-pattern. I scoured the internet for some sort of comfort and found https://www.reddit.com/r/aws/comments/10od7ei/how_to_send_load_balancer_traffic_to_a_target/, an active discussion about VPC peering and VPC service endpoints. But there's a catch: they were talking about non-Kubernetes deployments. You can easily lose sight of the fact that a Kubernetes service can be targeted either as a URL endpoint or as a service inside Kubernetes. The TargetGroupBinding attempts to tie the Kubernetes service back to the target group (and its listener) but falls a little short. Also, notice that the target group creation specifies a protocol (HTTP, layer 7), while the TargetGroupBinding specifies target type IP (layer 3) as well as a port (layer 4). I am not sure how to connect the target group and the TargetGroupBinding any more than I did.

And then it dawned on me: why not simplify this using DNS, with weighted traffic definitions? Surely Route53 must offer this. Lo and behold, it does. And that is where this blog ends. I got very close, but unless you deploy the services on nodes within the same cluster, all of the above is useless. Or maybe I could have pushed a little harder to make it work. In the end, I am not sure it would have been worth it. It is awfully complicated, isn't it? I learned a hell of a lot, though. I hope you did also.
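As a sketch of the DNS idea only (the working approach is the subject of the next post), a Route53 change batch with two weighted CNAME records for the same name looks roughly like this. Route53 answers each record in proportion to its weight divided by the sum of the weights. The record name, target hostnames, and helper names are hypothetical placeholders.

```shell
#!/usr/bin/env bash
# One weighted CNAME record as a Route53 change (UPSERT).
weighted_record() {
  local name="$1" id="$2" weight="$3" target="$4"
  printf '{"Action":"UPSERT","ResourceRecordSet":{"Name":"%s","Type":"CNAME","SetIdentifier":"%s","Weight":%d,"TTL":60,"ResourceRecords":[{"Value":"%s"}]}}' \
    "$name" "$id" "$weight" "$target"
}

# A change batch splitting one name between a "blue" and a "green" target.
change_batch() {
  local name="$1" blue_weight="$2" green_weight="$3" blue_target="$4" green_target="$5"
  printf '{"Changes":[%s,%s]}\n' \
    "$(weighted_record "$name" blue "$blue_weight" "$blue_target")" \
    "$(weighted_record "$name" green "$green_weight" "$green_target")"
}

# Usage (needs the aws CLI and your hosted zone ID):
#   change_batch hello.kubestellar.org 90 10 wec1-lb.example.com wec2-lb.example.com > batch.json
#   aws route53 change-resource-record-sets --hosted-zone-id "$ZONE_ID" \
#     --change-batch file://batch.json
```

Unlike the ALB weights, DNS weights need no knowledge of Kubernetes services at all; each cluster only has to expose its own load balancer hostname.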

Ok, so stay tuned here. I am writing the happy path blog next. See you there.

Thanks for stopping by!