2023-07-25
Karpenter is an open-source, production-ready worker-node provisioning controller that AWS built for Kubernetes. Compared with the traditional Cluster Autoscaler, Karpenter provisions capacity faster, is more flexible, and achieves higher resource utilization, which makes it the preferred auto scaling solution for Amazon Elastic Kubernetes Service (EKS); the table below compares the two. This post walks through how to migrate from Cluster Autoscaler to Karpenter on EKS.
| Feature | Cluster Autoscaler | Karpenter |
| --- | --- | --- |
| Resource management | Takes a reactive approach: it scales nodes based on the resource utilization of existing nodes. | Takes a proactive approach: it provisions nodes based on the current resource requirements of unscheduled pods. |
| Node management | Manages nodes based on the resource demands of the present workload, using predefined Auto Scaling groups. | Scales, provisions, and manages nodes based on the configuration of custom Provisioners. |
| Scaling | Focuses on node-level scaling, so it can effectively add nodes to meet increased demand, but it may be less effective at scaling resources back down. | Offers more effective and granular scaling based on specific workload requirements; in other words, it scales according to actual usage, and it lets users specify scaling policies or rules to match their requirements. |
| Scheduling | Scheduling is simpler: it scales up or down based on the present requirements of the workload. | Can schedule workloads based on factors such as Availability Zones and resource requirements. It can optimize for the cheapest pricing via Spot, but it is unaware of commitments such as RIs or Savings Plans. |
To demonstrate the migration, we use an EKS 1.26 cluster to simulate a production environment. The cluster spans 3 public subnets and 3 private subnets, runs the current workload on 1 managed node group, has the OIDC provider set up for IAM and service account integration (IRSA), and has eksctl and the AWS CLI installed and configured in advance. The cluster creation steps are omitted here; refer to the official quick start guide.
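For reference, a cluster of this shape could be created with an eksctl config similar to the sketch below. This is illustrative only: the availability zone list and node sizing mirror the test setup but are assumptions, not an exact reproduction of the cluster used in this post.

# cluster.yaml - illustrative eksctl config (names, AZs, and sizes are assumptions)
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: prd-eks
  region: ap-southeast-1
  version: "1.26"
availabilityZones: [ap-southeast-1a, ap-southeast-1b, ap-southeast-1c]   # 3 AZs -> 3 public + 3 private subnets
iam:
  withOIDC: true                  # creates the OIDC provider used later for IRSA
managedNodeGroups:
  - name: worknode
    instanceType: t3.medium
    minSize: 2
    desiredCapacity: 3
    maxSize: 4
    privateNetworking: true       # place worker nodes in the private subnets

eksctl create cluster -f cluster.yaml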
To minimize the impact of the migration on applications, we strongly recommend configuring Pod Disruption Budgets (PDBs). During the migration we will evict application pods and shrink the managed node group, and a PDB guarantees that while pods are being moved the number of running replicas never drops below the threshold you configure. For example, if a deployment currently runs 10 pods and you set minAvailable to "50%", at least 5 pods are guaranteed to keep serving during the disruption.
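A minimal sketch of the minAvailable form described above (the name and label are placeholders; step 3 below deploys the equivalent maxUnavailable form for the test application):

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: example-pdb          # placeholder name
spec:
  minAvailable: "50%"        # at least half of the matched pods must stay running
  selector:
    matchLabels:
      app: example           # placeholder label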
The migration steps are roughly as follows:
1. Check the EKS cluster information
$ kubectl get node
NAME STATUS ROLES AGE VERSION
ip-192-168-31-163.ap-southeast-1.compute.internal Ready <none> 40h v1.26.4-eks-0a21954
ip-192-168-44-153.ap-southeast-1.compute.internal Ready <none> 40h v1.26.4-eks-0a21954
ip-192-168-7-103.ap-southeast-1.compute.internal Ready <none> 40h v1.26.4-eks-0a21954
$ eksctl get nodegroup --region ap-southeast-1 --cluster prd-eks
CLUSTER NODEGROUP STATUS CREATED MIN SIZE MAX SIZE DESIRED CAPACITY INSTANCE TYPE IMAGE ID ASG NAME TYPE
prd-eks worknode ACTIVE 2023-05-31T02:26:25Z 2 4 3 t3.medium AL2_x86_64 eks-worknode-c2c437d0-581a-8182-d61f-d7888271bfbb managed
2. Deploy a test Nginx application
Create the nginx.yaml file:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-to-scaleout
spec:
  selector:
    matchLabels:
      app: nginx
  replicas: 4
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
      - name: nginx
        image: nginx:1.14.2
        ports:
        - containerPort: 80
kubectl apply -f nginx.yaml
3. Deploy the PDB
Create the nginx-pdb.yaml file:
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: nginx-pdb
spec:
  maxUnavailable: 50%
  selector:
    matchLabels:
      app: nginx
kubectl apply -f nginx-pdb.yaml
Check the PDB status; this way we guarantee that at most 50% of the pods are affected at any point during the migration.
kubectl get poddisruptionbudgets
NAME MIN AVAILABLE MAX UNAVAILABLE ALLOWED DISRUPTIONS AGE
nginx-pdb N/A 50% 2 3m9s
1. Prepare the environment variables for the Karpenter installation
CLUSTER_NAME=<your cluster name>
AWS_PARTITION="aws" # if you are not using standard partitions, set this to aws-cn / aws-us-gov
AWS_REGION="$(aws configure list | grep region | tr -s ' ' | cut -d' ' -f3)"
OIDC_ENDPOINT="$(aws eks describe-cluster --name ${CLUSTER_NAME} \
    --query 'cluster.identity.oidc.issuer' --output text)"
AWS_ACCOUNT_ID=$(aws sts get-caller-identity --query 'Account' \
    --output text)
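An optional sanity check to confirm the variables resolved before continuing:

echo "${CLUSTER_NAME} ${AWS_REGION} ${AWS_ACCOUNT_ID} ${OIDC_ENDPOINT}"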
2. Create the two IAM roles that Karpenter needs
1) Create the role for the nodes that Karpenter will launch
echo '{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {
                "Service": "ec2.amazonaws.com"
            },
            "Action": "sts:AssumeRole"
        }
    ]
}' > node-trust-policy.json
aws iam create-role --role-name "KarpenterNodeRole-${CLUSTER_NAME}" \
    --assume-role-policy-document file://node-trust-policy.json
2) Attach the managed policies to this role
aws iam attach-role-policy --role-name "KarpenterNodeRole-${CLUSTER_NAME}" \
    --policy-arn arn:${AWS_PARTITION}:iam::aws:policy/AmazonEKSWorkerNodePolicy
aws iam attach-role-policy --role-name "KarpenterNodeRole-${CLUSTER_NAME}" \
    --policy-arn arn:${AWS_PARTITION}:iam::aws:policy/AmazonEKS_CNI_Policy
aws iam attach-role-policy --role-name "KarpenterNodeRole-${CLUSTER_NAME}" \
    --policy-arn arn:${AWS_PARTITION}:iam::aws:policy/AmazonEC2ContainerRegistryReadOnly
aws iam attach-role-policy --role-name "KarpenterNodeRole-${CLUSTER_NAME}" \
    --policy-arn arn:${AWS_PARTITION}:iam::aws:policy/AmazonSSMManagedInstanceCore
3) Create an EC2 instance profile and add the role to it
aws iam create-instance-profile \
    --instance-profile-name "KarpenterNodeInstanceProfile-${CLUSTER_NAME}"
aws iam add-role-to-instance-profile \
    --instance-profile-name "KarpenterNodeInstanceProfile-${CLUSTER_NAME}" \
    --role-name "KarpenterNodeRole-${CLUSTER_NAME}"
# Output of the create-instance-profile command (add-role-to-instance-profile returns nothing):
{
    "InstanceProfile": {
        "Path": "/",
        "InstanceProfileName": "KarpenterNodeInstanceProfile-prd-eks",
        "InstanceProfileId": "AIPAXGRAMRX5HISZDSH2O",
        "Arn": "arn:aws:iam::495062xxxxx:instance-profile/KarpenterNodeInstanceProfile-prd-eks",
        "CreateDate": "2023-05-31T03:31:40Z",
        "Roles": []
    }
}
4) Create the role for the Karpenter controller itself; it relies on the OIDC provider for IRSA (IAM roles for service accounts) authorization
$ cat << EOF > controller-trust-policy.json
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {
                "Federated": "arn:${AWS_PARTITION}:iam::${AWS_ACCOUNT_ID}:oidc-provider/${OIDC_ENDPOINT#*//}"
            },
            "Action": "sts:AssumeRoleWithWebIdentity",
            "Condition": {
                "StringEquals": {
                    "${OIDC_ENDPOINT#*//}:aud": "sts.amazonaws.com",
                    "${OIDC_ENDPOINT#*//}:sub": "system:serviceaccount:karpenter:karpenter"
                }
            }
        }
    ]
}
EOF
$ aws iam create-role --role-name KarpenterControllerRole-${CLUSTER_NAME} \
    --assume-role-policy-document file://controller-trust-policy.json
cat << EOF > controller-policy.json
{
    "Statement": [
        {
            "Action": [
                "ssm:GetParameter",
                "ec2:DescribeImages",
                "ec2:RunInstances",
                "ec2:DescribeSubnets",
                "ec2:DescribeSecurityGroups",
                "ec2:DescribeLaunchTemplates",
                "ec2:DescribeInstances",
                "ec2:DescribeInstanceTypes",
                "ec2:DescribeInstanceTypeOfferings",
                "ec2:DescribeAvailabilityZones",
                "ec2:DeleteLaunchTemplate",
                "ec2:CreateTags",
                "ec2:CreateLaunchTemplate",
                "ec2:CreateFleet",
                "ec2:DescribeSpotPriceHistory",
                "pricing:GetProducts"
            ],
            "Effect": "Allow",
            "Resource": "*",
            "Sid": "Karpenter"
        },
        {
            "Action": "ec2:TerminateInstances",
            "Condition": {
                "StringLike": {
                    "ec2:ResourceTag/karpenter.sh/provisioner-name": "*"
                }
            },
            "Effect": "Allow",
            "Resource": "*",
            "Sid": "ConditionalEC2Termination"
        },
        {
            "Effect": "Allow",
            "Action": "iam:PassRole",
            "Resource": "arn:${AWS_PARTITION}:iam::${AWS_ACCOUNT_ID}:role/KarpenterNodeRole-${CLUSTER_NAME}",
            "Sid": "PassNodeIAMRole"
        },
        {
            "Effect": "Allow",
            "Action": "eks:DescribeCluster",
            "Resource": "arn:${AWS_PARTITION}:eks:${AWS_REGION}:${AWS_ACCOUNT_ID}:cluster/${CLUSTER_NAME}",
            "Sid": "EKSClusterEndpointLookup"
        }
    ],
    "Version": "2012-10-17"
}
EOF
$ aws iam put-role-policy --role-name KarpenterControllerRole-${CLUSTER_NAME} \
    --policy-name KarpenterControllerPolicy-${CLUSTER_NAME} \
    --policy-document file://controller-policy.json
# Output of the create-role command above (put-role-policy itself returns no output):
{
    "Role": {
        "Path": "/",
        "RoleName": "KarpenterControllerRole-prd-eks",
        "RoleId": "AROAXGRAMRX5A5OS6FYJ3",
        "Arn": "arn:aws:iam::495062xxxxx:role/KarpenterControllerRole-prd-eks",
        "CreateDate": "2023-05-31T03:35:56Z",
        "AssumeRolePolicyDocument": {
            "Version": "2012-10-17",
            "Statement": [
                {
                    "Effect": "Allow",
                    "Principal": {
                        "Federated": "arn:aws:iam::495062xxxxxx:oidc-provider/oidc.eks.ap-southeast-1.amazonaws.com/id/BC13F53BC96566C2087EEF318D307864"
                    },
                    "Action": "sts:AssumeRoleWithWebIdentity",
                    "Condition": {
                        "StringEquals": {
                            "oidc.eks.ap-southeast-1.amazonaws.com/id/BC13F53BC96566C2087EEF318D307864:aud": "sts.amazonaws.com",
                            "oidc.eks.ap-southeast-1.amazonaws.com/id/BC13F53BC96566C2087EEF318D307864:sub": "system:serviceaccount:karpenter:karpenter"
                        }
                    }
                }
            ]
        }
    }
}
3. Tag the subnets and security groups
1) Tag the subnets
for NODEGROUP in $(aws eks list-nodegroups --cluster-name ${CLUSTER_NAME} \
    --query 'nodegroups' --output text); do aws ec2 create-tags \
        --tags "Key=karpenter.sh/discovery,Value=${CLUSTER_NAME}" \
        --resources $(aws eks describe-nodegroup --cluster-name ${CLUSTER_NAME} \
        --nodegroup-name $NODEGROUP --query 'nodegroup.subnets' --output text )
done
2) Tag the security groups referenced by the managed node group's launch template
NODEGROUP=$(aws eks list-nodegroups --cluster-name ${CLUSTER_NAME} \
    --query 'nodegroups[0]' --output text)
LAUNCH_TEMPLATE=$(aws eks describe-nodegroup --cluster-name ${CLUSTER_NAME} \
    --nodegroup-name ${NODEGROUP} --query 'nodegroup.launchTemplate.{id:id,version:version}' \
    --output text | tr -s "\t" ",")
SECURITY_GROUPS=$(aws ec2 describe-launch-template-versions \
    --launch-template-id ${LAUNCH_TEMPLATE%,*} --versions ${LAUNCH_TEMPLATE#*,} \
    --query 'LaunchTemplateVersions[0].LaunchTemplateData.[NetworkInterfaces[0].Groups||SecurityGroupIds]' \
    --output text)
aws ec2 create-tags \
    --tags "Key=karpenter.sh/discovery,Value=${CLUSTER_NAME}" \
    --resources ${SECURITY_GROUPS}
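To double-check that the discovery tags landed where expected, you can list the tagged resources (an optional verification, not part of the original procedure):

aws ec2 describe-subnets --filters "Name=tag:karpenter.sh/discovery,Values=${CLUSTER_NAME}" \
    --query 'Subnets[].SubnetId' --output text
aws ec2 describe-security-groups --filters "Name=tag:karpenter.sh/discovery,Values=${CLUSTER_NAME}" \
    --query 'SecurityGroups[].GroupId' --output text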
4. Update the aws-auth ConfigMap
We need to grant the node role we just created for Karpenter permission to join the cluster.
kubectl edit configmap aws-auth -n kube-system
Replace {AWS_PARTITION} with your account partition, {AWS_ACCOUNT_ID} with your account ID, and {CLUSTER_NAME} with your cluster name, but do not replace {{EC2PrivateDNSName}}.
Check the following content and add the second group entry (the second - groups: block under mapRoles below):
apiVersion: v1
data:
  mapRoles: |
    - groups:
      - system:bootstrappers
      - system:nodes
      rolearn: arn:aws:iam::495062xxxxx:role/eksctl-prd-eks-nodegroup-worknode-NodeInstanceRole-1WM8CGG0R5KS3
      username: system:node:{{EC2PrivateDNSName}}
    - groups:
      - system:bootstrappers
      - system:nodes
      rolearn: arn:aws:iam::495062xxxxx:role/KarpenterNodeRole-prd-eks
      username: system:node:{{EC2PrivateDNSName}}
kind: ConfigMap
metadata:
  creationTimestamp: "2023-05-31T02:27:10Z"
  name: aws-auth
  namespace: kube-system
  resourceVersion: "1143"
  uid: 1c952f89-d8a3-4c62-8b44-d0d070f6460a
1. Set the Karpenter version environment variable
# v0.27.5 was the latest release at the time of writing
export KARPENTER_VERSION=v0.27.5
2. Generate the Karpenter manifest with helm template
helm template karpenter oci://public.ecr.aws/karpenter/karpenter --version ${KARPENTER_VERSION} --namespace karpenter \
    --set settings.aws.defaultInstanceProfile=KarpenterNodeInstanceProfile-${CLUSTER_NAME} \
    --set settings.aws.clusterName=${CLUSTER_NAME} \
    --set serviceAccount.annotations."eks\.amazonaws\.com/role-arn"="arn:${AWS_PARTITION}:iam::${AWS_ACCOUNT_ID}:role/KarpenterControllerRole-${CLUSTER_NAME}" \
    --set controller.resources.requests.cpu=1 \
    --set controller.resources.requests.memory=1Gi \
    --set controller.resources.limits.cpu=1 \
    --set controller.resources.limits.memory=1Gi > karpenter.yaml
3. Configure node affinity
System-critical workloads such as CoreDNS, controllers, CNI and CSI plugins, and operators need stability more than elasticity, so we recommend keeping them on the managed node group. The following uses karpenter itself as an example of configuring node affinity.
Edit karpenter.yaml and locate the affinity rules of the karpenter Deployment. Modify the affinity so that karpenter runs on one of the existing node group nodes, changing the values to match your $NODEGROUP, one node group per line.
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
      - matchExpressions:
        - key: karpenter.sh/provisioner-name
          operator: DoesNotExist
      - matchExpressions:
        - key: eks.amazonaws.com/nodegroup
          operator: In
          values:
          - ${NODEGROUP}
  podAntiAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
    - topologyKey: "kubernetes.io/hostname"
4. Deploy the Karpenter resources
kubectl create namespace karpenter
kubectl create -f \
    https://raw.githubusercontent.com/aws/karpenter/${KARPENTER_VERSION}/pkg/apis/crds/karpenter.sh_provisioners.yaml
kubectl create -f \
    https://raw.githubusercontent.com/aws/karpenter/${KARPENTER_VERSION}/pkg/apis/crds/karpenter.k8s.aws_awsnodetemplates.yaml
kubectl apply -f karpenter.yaml
5. Create the default Provisioner
cat <<EOF | kubectl apply -f -
apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  name: default
spec:
  requirements:
    - key: karpenter.k8s.aws/instance-category
      operator: In
      values: [c, m, r]
    - key: karpenter.k8s.aws/instance-generation
      operator: Gt
      values: ['2']
  providerRef:
    name: default
---
apiVersion: karpenter.k8s.aws/v1alpha1
kind: AWSNodeTemplate
metadata:
  name: default
spec:
  subnetSelector:
    karpenter.sh/discovery: '${CLUSTER_NAME}'
  securityGroupSelector:
    karpenter.sh/discovery: '${CLUSTER_NAME}'
EOF
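If you also want Karpenter to cap total capacity and scale under-utilized nodes back in, the default Provisioner above can be extended with limits and consolidation. A sketch (the CPU limit value is an assumption, not a value from this walkthrough):

cat <<EOF | kubectl apply -f -
apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  name: default
spec:
  requirements:
    - key: karpenter.k8s.aws/instance-category
      operator: In
      values: [c, m, r]
    - key: karpenter.k8s.aws/instance-generation
      operator: Gt
      values: ['2']
  limits:
    resources:
      cpu: "1000"            # hard cap on total vCPUs Karpenter may provision (assumed value)
  consolidation:
    enabled: true            # let Karpenter remove or replace under-utilized nodes
  providerRef:
    name: default
EOF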
Check the Karpenter status:
$ kubectl get po -n karpenter
NAME READY STATUS RESTARTS AGE
karpenter-5d7f8596f6-2ml6s 1/1 Running 0 10s
karpenter-5d7f8596f6-t9lwk 1/1 Running 0 10s
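You can also tail the controller logs to confirm it started cleanly (assuming the Helm chart's default labels and container name):

kubectl logs -f -n karpenter -c controller -l app.kubernetes.io/name=karpenter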
1. Karpenter is now running normally, so we can disable Cluster Autoscaler (CAS)
kubectl scale deploy/cluster-autoscaler -n kube-system --replicas=0
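A quick check that CAS is no longer running (using the same deployment name as the scale command above):

kubectl get deploy cluster-autoscaler -n kube-system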
Then we can evict the pods and reclaim the managed node group capacity.
2. Reclaim the surplus node group capacity
Because the PDB is in place, scaling in will not disrupt the service, and we can verify this in two ways:
Option 1: expose nginx through a ClusterIP Service and check its availability from another pod.
$ kubectl expose deployment nginx-to-scaleout --port=80 --target-port=80
$ kubectl get svc
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
kubernetes ClusterIP 10.100.0.1 <none> 443/TCP 7d8h
nginx-to-scaleout ClusterIP 10.100.101.74 <none> 80/TCP 20h
# Exec into a pod that has curl installed
kubectl exec -it curl-777d588d65-lk6xm -- /bin/bash
# Check the Service availability
root@curl-777d588d65-lk6xm:/# while true;do
> curl -Is http://10.100.101.74 | head -1
> sleep 1
> done
HTTP/1.1 200 OK
HTTP/1.1 200 OK
HTTP/1.1 200 OK
HTTP/1.1 200 OK
Option 2: watch the deployment to observe pod changes during the process.
$ kubectl get deploy nginx-to-scaleout -w
NAME READY UP-TO-DATE AVAILABLE AGE
nginx-to-scaleout 4/4 4 4 22h
The scale-in command is:
aws eks update-nodegroup-config --cluster-name ${CLUSTER_NAME} \
    --nodegroup-name ${NODEGROUP} \
    --scaling-config 'minSize=2,maxSize=2,desiredSize=2'
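After the node group shrinks, any pods that no longer fit become unschedulable and Karpenter should provision replacement capacity. One way to spot the new nodes is the provisioner label (assuming the default Provisioner created above):

kubectl get nodes -L karpenter.sh/provisioner-name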
Because this is a test environment, we keep only 2 nodes. For a production environment, we still recommend keeping the managed node group at 3 nodes spread across 3 AZs.
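For example, a production cluster could pin the managed node group at three nodes (a sketch using the same variables as above):

aws eks update-nodegroup-config --cluster-name ${CLUSTER_NAME} \
    --nodegroup-name ${NODEGROUP} \
    --scaling-config 'minSize=3,maxSize=3,desiredSize=3'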
Karpenter is a new Kubernetes auto scaling tool, and its advanced features make it our preferred solution for node scaling. This post walked through the full migration from CAS to Karpenter, using a PDB to keep the service accessible throughout the process.