How to install KubeStellar on a collection of AWS EKS managed clusters

Andy Anderson
11 min read · Mar 9, 2024


Lately I have been working on integrating KubeStellar with a few fellow CNCF sandbox, incubating, and graduated projects. The nice part about working on these integrations is that you pick up other knowledge along the way. In this particular case I started off using a local Kind cluster and slowly realized the work would need more CPU and memory than my Mac M2 could provide.

I could have used any cloud vendor for this experiment but I wanted to test out KubeStellar on AWS' Elastic Kubernetes Service (EKS). I was curious to experience a "managed Kubernetes cluster service" and compare/contrast that with the years of Red Hat OpenShift administration and operation I have accumulated. The differences, as you might expect, are substantial. AWS' EKS offering is built around the Kubernetes upstream, which means you can use the latest versions as they become available. OpenShift is an opinionated distribution of a recently released upstream Kubernetes version. This means that with OpenShift you will have access to a few releases, but not the most recent release of Kubernetes. EKS requires that you manually or automatically (CloudFormation) create the infrastructure (VPC, subnets, routes, security groups, Route53, EFS/EBS storage, worker nodes, internet gateway, NAT gateway, firewall, etc.) as part of the provisioning process. OpenShift does require you to choose the size/capability/number of worker nodes but otherwise automates the provisioning of your cluster. My point is that the definition of "managed cluster" varies when applied to Kubernetes clusters. Be sure to read the fine print, and there is no substitute for hands-on experience to learn the differences.

But this article is not about comparing cloud and vendor offerings; let's talk about what it takes to provision a pair of AWS EKS clusters and then get KubeStellar deployed so that it can deploy and manage your configurations. My hope is that this blog helps others get started and avoid some of the time I wasted on a couple of configuration issues. I hit snags in subnet, EBS add-on, and ingress configuration along the way. Let's dig in, shall we?

I will split the installation and configuration into four sections: 1) AWS infrastructure preparation and provisioning, 2) EKS provisioning, 3) EKS post-provisioning tasks, and 4) kubeconfig and Kubernetes ingress.

AWS Infrastructure Preparation and Provisioning

Preparing the AWS infrastructure is straightforward once you understand how worker nodes communicate with the internet.

Mistake #1 — private subnets only

I made the mistake of using the default AWS VPC in my account in my first attempt at provisioning EKS clusters. The default VPC had access to a handful of private subnets in different availability zones (AZ). This is helpful from the standpoint of reliability but not helpful from the perspective of access from/to the internet for things like pulling container images AND ingress access. Deploying an EKS cluster with worker nodes on private subnets only is a recipe for failure. With private subnets you can have internet access for pulling container images or private communication between nodes/pods, but not both simultaneously. After provisioning an EKS cluster with an internet gateway as the default route in each private subnet / availability zone pairing, I only had the ability to establish ingress via an application load balancer and occasionally pull images successfully from AWS' Elastic Container Registry (ECR). I found that whenever I created a deployment that referenced an external public registry, the kubelet on the worker node was unable to pull the images required for proper initialization. None of my pods were reaching a ready state. How annoying was this?

VERY high-level AWS EKS architecture

I searched high and low for help on why my deployments could not pull images and there was nothing on the interwebs that pointed to… “hey dummy, you need private AND public subnets.”

Starting Over

I decided to start from scratch with a new VPC. I created two new VPCs, and this time I used the "VPC and More" feature. I created two VPCs because I am creating a hub and a spoke cluster. I will cover creating a single VPC in this article, but the process is the same for the second cluster. I can't stress enough how helpful the "VPC and More" option was in my next attempt at provisioning a set of EKS clusters.

“VPC and More”

The only option I changed was adding one NAT gateway per AZ. You can quickly see that the new VPC configuration has a public and private subnet pair in each AZ. The control plane and worker nodes will connect (via ENIs) to both subnets and use route tables to govern their communications.

Once I clicked "create" I saw the immediate value of the "VPC and More" feature. The provisioning of private and public subnets, route tables, elastic IPs, and internet and NAT gateways is all automated. This saves lots of time and removes the guesswork involved in preparing for an EKS cluster. I did miss one important setting here though: I needed to enable "auto-assign public IPv4 address" on the public subnets. This option is important because without it your EKS worker nodes won't provision properly. EKS worker node creation includes attaching ENIs to the public and private subnets the nodes will communicate across. While connecting to the public subnets, the worker node creation automation will attempt to associate a public address. This action will fail if the public subnet's "auto-assign" feature is not enabled.

make sure you check off “auto-assign” for public subnets or your worker nodes won’t provision later on
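If you missed this checkbox during VPC creation, you can also enable it afterwards from the AWS CLI. This is just a sketch with a placeholder subnet ID; repeat it for each public subnet:

# Turn on auto-assignment of public IPv4 addresses for a public subnet (placeholder ID)
aws ec2 modify-subnet-attribute \
  --subnet-id subnet-0aaa1111 \
  --map-public-ip-on-launch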

For the most part, this concludes the preparation and provisioning process required for EKS. You can take some time to create/adjust a few more items:

  1. Modify your network ACL settings to allow ingress on 80, 443, 9080, 9443, and 8443. The default network ACL settings allow all incoming and outgoing traffic. You might want to restrict this a bit for public subnet traffic (see the CLI sketch after this list).
  2. (optional) Create a domain in Route53 — this will be used later on when we deploy KubeStellar. You will need a point of ingress, and a domain name is a prettier representation of the alphanumeric default domain that AWS randomly assigns you. If you create a domain, be sure to include a wildcard CNAME. A wildcard record allows KubeStellar's KubeFlex "API server in an API server" feature to be reachable via a context in your kubeconfig (covered later in this blog).
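If you do decide to tighten the network ACL, entries can also be added from the CLI. This is only a sketch with a placeholder ACL ID and rule number; repeat it for each port you want to allow:

# Allow inbound TCP 443 on the network ACL attached to the public subnets (placeholder ID)
aws ec2 create-network-acl-entry \
  --network-acl-id acl-0aaa1111 \
  --ingress \
  --rule-number 110 \
  --protocol tcp \
  --port-range From=443,To=443 \
  --cidr-block 0.0.0.0/0 \
  --rule-action allow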

EKS Provisioning

The next step is to create your EKS cluster(s). I used the newly created VPC that includes the private and public subnet pairs. I used the most recent version of Kubernetes (1.29 at the time of this writing) and took the defaults for the remaining settings and add-ons (kube-proxy, CoreDNS, and Amazon VPC CNI).

Private and Public subnets included in the EKS provisioning process

This step is fairly straightforward and takes some time to complete. Time for a cup of coffee.
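I created my clusters through the console, but if you prefer the CLI, a roughly equivalent eksctl invocation might look like the sketch below. The cluster name, region, and subnet IDs are placeholders, and the node group is omitted because we add it in the next section:

# Create the EKS control plane in the new VPC's subnets (placeholder IDs)
eksctl create cluster \
  --name hub-eks \
  --region us-east-1 \
  --version 1.29 \
  --vpc-private-subnets subnet-0aaa1111,subnet-0bbb2222 \
  --vpc-public-subnets subnet-0ccc3333,subnet-0ddd4444 \
  --without-nodegroup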

EKS Post-Provisioning Tasks

After EKS completes the initial provisioning task you are left with a control plane and no worker nodes. It is now time to add a node group to our cluster. This is the fulfillment of the "E" (elastic) in EKS. Node groups can auto-scale based on manual or automatic metric analysis for elements like cluster load. Handy feature, but it could lead to more spend than you are willing to afford. I find this feature invaluable because I can increase or decrease the capacity of my cluster(s) whenever the need arises. I start off with a set of t3.2xlarge nodes, move up to c4.4xlarge when I need more headroom, and scale back down again.

t3.2xlarge (3 nodes) is a good starting point for KubeStellar Core. Remote clusters (WECs) can be any size you like because our transport agent is negligible in size.
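For reference, a comparable managed node group can also be created from the CLI. This is a hedged sketch with placeholder names and sizing; I created mine through the console:

# Add an auto-scaling managed node group of t3.2xlarge workers
eksctl create nodegroup \
  --cluster hub-eks \
  --region us-east-1 \
  --name hub-workers \
  --node-type t3.2xlarge \
  --nodes 3 \
  --nodes-min 3 \
  --nodes-max 5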

Worker nodes are now available and we avoided the "auto-assign public IP address" issue detailed above. Compute is complete. Networking has been configured correctly with CoreDNS, kube-proxy, and VPC CNI support. Next is storage. We need a Kubernetes CSI driver in place to allow the creation of persistent volumes (PV) and persistent volume claims (PVC). This is where I made another mistake.

Mistake #2 — jumbled CSI configuration

As you may recall, I was unsuccessful in my first attempt to create an EKS cluster. During the first attempt I configured the Amazon EBS CSI add-on successfully, but the cluster's network configuration did not allow for pulling images. When I tore down the failed first-attempt cluster, I did not fully clean up the IAM policies associated with the EBS CSI. Don't forget to clean this up. The EBS CSI is not just an add-on; it requires OIDC and IAM for proper authentication between the kubelet requesting storage on behalf of a PV/PVC pair and the AWS EBS storage backing it.

Getting it right the X-th time…

The best advice I can give is to use the "AWS Management Console" set of instructions (not the eksctl or aws cli instructions). Why use the console for this? First, you will become familiar with the IAM portion of operating and administering an EKS cluster, and second, you will use the wizard to create the important 'trust entity relationship' required to link the Kubernetes EBS CSI back to the EBS service. Your experience may vary, but the instructions at https://docs.aws.amazon.com/eks/latest/userguide/ebs-csi.html were invaluable in getting me past this error:

Warning ProvisioningFailed 13m ebs.csi.aws.com_ebs-csi-controller-7cbfc87f8f-l27tq_8dcfa20e-a67b-4ff4-802a-65bdc513b06b failed to provision volume with StorageClass "gp2": rpc error: code = Internal desc = Could not create volume "pvc-9fbac510-fd12-41a4-9917-a2a43d6c7197": could not create volume in EC2: WebIdentityErr: failed to retrieve credentials
caused by: AccessDenied: Not authorized to perform sts:AssumeRoleWithWebIdentity

This error message will show up in your "kubectl describe" of any PVC until your AWS "trust entity relationship" and OIDC pairing for the EBS CSI add-on are working properly. Follow the instructions at https://docs.aws.amazon.com/eks/latest/userguide/ebs-csi.html carefully to avoid this configuration mistake.

NOTE: unless the OIDC provider shows up under the EKS cluster's access tab, do not proceed with adding the EBS add-on. For some reason, even after you explicitly associate it with the cluster and verify the issuer with:

eksctl utils associate-iam-oidc-provider --cluster $cluster_name --approve
aws eks describe-cluster --name $cluster_name --query "cluster.identity.oidc.issuer" --output text

and even though the result is positive and an OIDC issuer string is returned, you can still receive "api error AccessDeniedException: Cross-account pass role is not allowed" when you try:

ACCOUNT_ID=$(aws sts get-caller-identity | python3 -c "import sys,json; print (json.load(sys.stdin)['Account'])")
eksctl create addon --name aws-ebs-csi-driver --cluster $cluster_name --service-account-role-arn arn:aws:iam::$ACCOUNT_ID:role/AmazonEKS_EBS_CSI_DriverRole --force

Still investigating why this happens or how to get around it — not sure yet.
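Once the add-on is installed and the trust relationship is in place, a quick smoke test confirms that dynamic provisioning works end to end. This is a throwaway example of my own (the names are arbitrary, any small container image will do, and it assumes your kubeconfig currently points at the new cluster; we grab the kubeconfig formally in the next section). The PVC should move to Bound shortly after the pod is scheduled, since the default gp2 StorageClass typically waits for the first consumer:

cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: ebs-csi-smoke-test
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: gp2
  resources:
    requests:
      storage: 1Gi
---
apiVersion: v1
kind: Pod
metadata:
  name: ebs-csi-smoke-test
spec:
  containers:
    - name: app
      # any small image works here
      image: public.ecr.aws/docker/library/busybox:latest
      command: ["sh", "-c", "sleep 3600"]
      volumeMounts:
        - name: data
          mountPath: /data
  volumes:
    - name: data
      persistentVolumeClaim:
        claimName: ebs-csi-smoke-test
EOF

# The PVC should report Bound once the pod lands on a node
kubectl get pvc ebs-csi-smoke-test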

Getting your cluster’s kubeconfig

AWS provides a handy feature in their AWS cli to grab the kubeconfig for your shiny new cluster:

aws eks update-kubeconfig --region us-east-1 --name hub-eks --kubeconfig aws-eks.kubeconfig

and then I renamed the context to something a little shorter:

export KUBECONFIG=~/aws-eks.kubeconfig 
kubectl config rename-context "arn:aws:eks:us-east-1:xxxabc123:cluster/hub-eks" "hub-eks"
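A quick sanity check that the renamed context works and the worker nodes are visible:

kubectl --context hub-eks get nodes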

Kubernetes Ingress

Inching closer to using the cluster for experiments, the next step is to deploy an ingress controller on the cluster to allow external access to services and other endpoints. Ingress will allow us to expose services backed by pods, defined by deployments and replicated by replicasets and statefulsets. To do this properly I used the Kubernetes AWS ingress-nginx workload definition at https://raw.githubusercontent.com/kubernetes/ingress-nginx/main/deploy/static/provider/aws/deploy.yaml

kubectl --context hub-eks apply -f https://raw.githubusercontent.com/kubernetes/ingress-nginx/main/deploy/static/provider/aws/deploy.yaml

Remember to add the "--enable-ssl-passthrough" argument to the ingress-nginx-controller deployment after applying the original ingress-nginx definition.

kubectl --context hub-eks edit deployment.apps/ingress-nginx-controller -n ingress-nginx
- --enable-ssl-passthrough
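If you would rather not hand-edit the deployment, a JSON patch accomplishes the same thing. This is a sketch that assumes the controller is the first (and only) container in the pod template, which it is in the stock manifest as of this writing:

# Append the ssl-passthrough flag to the controller container's args
kubectl --context hub-eks -n ingress-nginx patch deployment ingress-nginx-controller \
  --type=json \
  -p='[{"op": "add", "path": "/spec/template/spec/containers/0/args/-", "value": "--enable-ssl-passthrough"}]'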

We also need to expose some more ports in addition to the default set ingress-nginx is configured for. I used https://kubernetes.github.io/ingress-nginx/user-guide/exposing-tcp-udp-services/ as a guide to complete this task. Here is what my configuration looked like after editing the ingress-nginx-controller service.

kubectl --context hub-eks edit service ingress-nginx-controller -n ingress-nginx
  - name: proxied-tcp-9443
    nodePort: 31345
    port: 9443
    protocol: TCP
    targetPort: 443
  - name: proxied-tcp-9080
    nodePort: 31226
    port: 9080
    protocol: TCP
    targetPort: 80

AND… DO NOT FORGET TO TAG YOUR PUBLIC SUBNETS!!!

You need the 'internal-elb' tag for the ingress-nginx Network Load Balancer, and the 'elb' tag for AWS' Application Load Balancer (for other cool things I will be sharing in a future blog). So put them both in there now. I have two public subnets for my cluster, so I added the tags to both:

key: kubernetes.io/role/internal-elb   value: 1
key: kubernetes.io/role/elb            value: 1
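If you prefer to tag from the CLI, something like this should work (the subnet IDs are placeholders for your two public subnets):

# Tag both public subnets so Kubernetes can discover them for load balancers
aws ec2 create-tags \
  --resources subnet-0ccc3333 subnet-0ddd4444 \
  --tags Key=kubernetes.io/role/elb,Value=1 Key=kubernetes.io/role/internal-elb,Value=1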

Once the deployment and pods are in a ready state, you should see an ‘external-ip’ on your console output for the service/ingress-nginx-controller

kubectl --context hub-eks get all -n ingress-nginx
...

NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
service/ingress-nginx-controller LoadBalancer 172.20.71.175 abc123xyy.elb.us-east-1.amazonaws.com 80:31777/TCP,443:31079/TCP,9443:31345/TCP,9080:31226/TCP 16h

Use the EXTERNAL-IP hostname as the target of the wildcard entry (*.) in the Route53 domain you defined earlier in this set of instructions.
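If you manage Route53 from the CLI, an UPSERT of the wildcard record might look like this sketch. The hosted zone ID, domain, and load-balancer hostname are placeholders; substitute your own values:

# Point *.yourdomain.org at the ingress-nginx network load balancer
aws route53 change-resource-record-sets \
  --hosted-zone-id Z0123456789EXAMPLE \
  --change-batch '{
    "Changes": [{
      "Action": "UPSERT",
      "ResourceRecordSet": {
        "Name": "*.yourdomain.org",
        "Type": "CNAME",
        "TTL": 300,
        "ResourceRecords": [{"Value": "abc123xyy.elb.us-east-1.amazonaws.com"}]
      }
    }]
  }'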

You are now ready to deploy KubeStellar’s KubeFlex and the KubeStellar deployment and configuration project. For more information on how to deploy KubeStellar, go to https://github.com/kubestellar/kubeflex/blob/main/docs/users.md#use-a-different-dns-service.

You will know you have ingress working properly when you can see two things:

  1. Navigate to http://<your route-53 domain> or https://<your route-53 domain>. In either case you should see:

this is a good sign, believe it or not

  2. An EC2 load balancer with settings like:

NOTE: When initializing KubeFlex, remember to use the "--domain" parameter to include the ingress-nginx external-ip OR your Route53 domain name.

kflex -k ~/aws-eks.kubeconfig init --domain yourdomain.org

- OR -

kflex -k ~/aws-eks.kubeconfig init --domain abc123xyy.elb.us-east-1.amazonaws.com

Conclusion

AWS EKS and OpenShift definitions of "managed service" vary in many ways. There are pros and cons to using either platform. Once provisioned, either choice is easy to configure using KubeStellar. Visit us at https://KubeStellar.io to see the cool things we are working on.

Thanks for stopping by!


Andy Anderson

IBM Research - KubeStellar, DevOps, Technology Adoption, and Kubernetes. Views are my own.