Downloading and installing k8s

I installed it with curl; a single command is enough to download it:

curl -LO "https://dl.k8s.io/release/$(curl -L -s https://dl.k8s.io/release/stable.txt)/bin/linux/amd64/kubectl"

Install kubectl:

sudo install -o root -g root -m 0755 kubectl /usr/local/bin/kubectl
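
To double-check that the binary landed on the PATH and runs, a quick sanity check (not part of the original steps) is:

kubectl version --client   # prints only the client version; fails if kubectl is missing or not executable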

This is the k8s architecture diagram; as you can see, it is actually made up of many components. [k8s architecture diagram]

At a minimum, the components below still need to be installed.

Configuring kubectl

I use zsh, so I need to set up kubectl auto-completion.

Add the following to ~/.zshrc:

source <(kubectl completion zsh)

I'm running this on Debian under WSL, and to my surprise, after adding this the shell started completing kubectl.exe instead — WSL really knows how to have fun.
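
If you run into the same thing, it is worth checking which kubectl the shell actually resolves — a quick diagnostic, assuming the Windows interop directories are appended to $PATH:

which -a kubectl   # every kubectl visible to the shell, in PATH order
type -a kubectl    # what the shell will actually execute / complete against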

A small detour

I discovered that my WSL is using mirrored networking mode (Mirrored), so the WSL instance shares the host's IP address, which means I can't just run k8s directly inside WSL.

So I took advantage of a Tencent Cloud promotion, bought a new VPS, and upgraded my existing Alibaba Cloud one while I was at it ☝️🤓

Installing kubeadm and related packages

Update the apt package index and install the packages needed to use the Kubernetes apt repository:

sudo apt-get update
sudo apt-get install -y apt-transport-https ca-certificates curl gpg

Download the public signing key for the Kubernetes package repository:

# If the /etc/apt/keyrings directory does not exist, create it before running curl; see the note below.
# sudo mkdir -p -m 755 /etc/apt/keyrings
curl -fsSL https://pkgs.k8s.io/core:/stable:/v1.32/deb/Release.key | sudo gpg --dearmor -o /etc/apt/keyrings/kubernetes-apt-keyring.gpg

Add the Kubernetes apt repository:

# This overwrites any existing configuration in /etc/apt/sources.list.d/kubernetes.list.
echo 'deb [signed-by=/etc/apt/keyrings/kubernetes-apt-keyring.gpg] https://pkgs.k8s.io/core:/stable:/v1.32/deb/ /' | sudo tee /etc/apt/sources.list.d/kubernetes.list

Update the apt package index, install kubelet, kubeadm, and kubectl, and pin their versions:

sudo apt-get update
sudo apt-get install -y kubelet kubeadm kubectl
sudo apt-mark hold kubelet kubeadm kubectl

A quick check shows everything is installed — though I was puzzled by the **WARNING: apt does not have a stable CLI interface. Use with caution in scripts.** message. (That warning shows up whenever `apt` output is piped or used in a script; `apt` is meant for interactive use and the warning is harmless here — `apt-get`/`apt-cache` are the script-friendly equivalents.)

So there really is a way to hold a version in place (apt-mark hold), which keeps these packages from being upgraded automatically.
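
For a quick check that does not trip that warning, plain version and hold queries are enough (just a sanity check, not from the original post):

kubeadm version -o short    # e.g. v1.32.x
kubectl version --client
kubelet --version
apt-mark showhold           # should list kubeadm, kubectl and kubelet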

Initializing k8s (master/worker setup)

Disable swap

sudo swapoff -a                           # disable swap for the current boot
sudo sed -i '/ swap / s/^/#/' /etc/fstab  # disable it permanently by commenting out the fstab entry
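
To confirm swap is really gone before initializing, something like this should show nothing active (a quick check, not part of the original steps):

swapon --show   # no output means no active swap devices
free -h         # the Swap line should read 0B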

The reasons k8s discourages running with swap are explained here.

Initialize the cluster

# 10.244.0.0/16 is the pod network CIDR used here (Flannel's default; Calico defaults to 192.168.0.0/16)
# replace <MASTER_IP> with the master node's public IP
sudo kubeadm init \
  --pod-network-cidr=10.244.0.0/16 \
  --apiserver-advertise-address=<MASTER_IP>

Sorting it out on the host

After running it, I got the following error:

# root @ k8s-master in ~ [17:14:08] C:1
$ sudo kubeadm init \
  --pod-network-cidr=10.244.0.0/16 \  
  --apiserver-advertise-address=114.514.191.981  

unknown command "\u00a0" for "kubeadm init"
To see the stack trace of this error execute with --v=5 or higher
zsh: command not found:  

It turned out to be a bizarre whitespace problem (a non-breaking space, \u00a0); retyping the command by hand fixed it.

# root @ k8s-master in ~ [17:19:19] C:127
$ sudo kubeadm init \
> --pod-network-cidr=10.244.0.0/16 \
> --apiserver-advertise-address=19.19.81.0
[init] Using Kubernetes version: v1.32.2
[preflight] Running pre-flight checks
[preflight] Pulling images required for setting up a Kubernetes cluster
[preflight] This might take a minute or two, depending on the speed of your internet connection
[preflight] You can also perform this action beforehand using 'kubeadm config images pull'
W0301 17:20:46.188020  457167 checks.go:846] detected that the sandbox image "registry.k8s.io/pause:3.6" of the container runtime is inconsistent with that used by kubeadm.It is recommended to use "registry.k8s.io/pause:3.10" as the CRI sandbox image.
[certs] Using certificateDir folder "/etc/kubernetes/pki"
[certs] Generating "ca" certificate and key
[certs] Generating "apiserver" certificate and key
[certs] apiserver serving cert is signed for DNS names [k8s-master kubernetes kubernetes.default kubernetes.default.svc kubernetes.default.svc.cluster.local] and IPs [10.96.0.1 19.19.81.0]
[certs] Generating "apiserver-kubelet-client" certificate and key
[certs] Generating "front-proxy-ca" certificate and key
[certs] Generating "front-proxy-client" certificate and key
[certs] Generating "etcd/ca" certificate and key
[certs] Generating "etcd/server" certificate and key
[certs] etcd/server serving cert is signed for DNS names [k8s-master localhost] and IPs [19.19.81.0 127.0.0.1 ::1]
[certs] Generating "etcd/peer" certificate and key
[certs] etcd/peer serving cert is signed for DNS names [k8s-master localhost] and IPs [19.19.81.0 127.0.0.1 ::1]
[certs] Generating "etcd/healthcheck-client" certificate and key
[certs] Generating "apiserver-etcd-client" certificate and key
[certs] Generating "sa" key and public key
[kubeconfig] Using kubeconfig folder "/etc/kubernetes"
[kubeconfig] Writing "admin.conf" kubeconfig file
[kubeconfig] Writing "super-admin.conf" kubeconfig file
[kubeconfig] Writing "kubelet.conf" kubeconfig file
[kubeconfig] Writing "controller-manager.conf" kubeconfig file
[kubeconfig] Writing "scheduler.conf" kubeconfig file
[etcd] Creating static Pod manifest for local etcd in "/etc/kubernetes/manifests"
[control-plane] Using manifest folder "/etc/kubernetes/manifests"
[control-plane] Creating static Pod manifest for "kube-apiserver"
[control-plane] Creating static Pod manifest for "kube-controller-manager"
[control-plane] Creating static Pod manifest for "kube-scheduler"
[kubelet-start] Writing kubelet environment file with flags to file "/var/lib/kubelet/kubeadm-flags.env"
[kubelet-start] Writing kubelet configuration to file "/var/lib/kubelet/config.yaml"
[kubelet-start] Starting the kubelet
[wait-control-plane] Waiting for the kubelet to boot up the control plane as static Pods from directory "/etc/kubernetes/manifests"
[kubelet-check] Waiting for a healthy kubelet at http://127.0.0.1:10248/healthz. This can take up to 4m0s
[kubelet-check] The kubelet is healthy after 500.924104ms
[api-check] Waiting for a healthy API server. This can take up to 4m0s

But then the following error appeared:

Unfortunately, an error has occurred:
        context deadline exceeded

This error is likely caused by:
        - The kubelet is not running
        - The kubelet is unhealthy due to a misconfiguration of the node in some way (required cgroups disabled)

If you are on a systemd-powered system, you can try to troubleshoot the error with the following commands:
        - 'systemctl status kubelet'
        - 'journalctl -xeu kubelet'

Additionally, a control plane component may have crashed or exited when started by the container runtime.
To troubleshoot, list all containers using your preferred container runtimes CLI.
Here is one example how you may list all running Kubernetes containers by using crictl:
        - 'crictl --runtime-endpoint unix:///var/run/containerd/containerd.sock ps -a | grep kube | grep -v pause'
        Once you have found the failing container, you can inspect its logs with:
        - 'crictl --runtime-endpoint unix:///var/run/containerd/containerd.sock logs CONTAINERID'
error execution phase wait-control-plane: could not initialize a Kubernetes cluster
To see the stack trace of this error execute with --v=5 or higher

Checking the logs via systemd showed the following:

# root @ k8s-master in ~ [17:25:24] C:1
$ systemctl status kubelet
● kubelet.service - kubelet: The Kubernetes Node Agent
     Loaded: loaded (/lib/systemd/system/kubelet.service; enabled; preset: enabled)
    Drop-In: /usr/lib/systemd/system/kubelet.service.d
             └─10-kubeadm.conf
     Active: active (running) since Sat 2025-03-01 17:21:23 CST; 7min ago
       Docs: https://kubernetes.io/docs/
   Main PID: 457406 (kubelet)
      Tasks: 11 (limit: 4490)
     Memory: 35.9M
        CPU: 3.162s
     CGroup: /system.slice/kubelet.service
             └─457406 /usr/bin/kubelet --bootstrap-kubeconfig=/etc/kubernetes/bootstrap-kubelet.conf --kubeconfig=/etc/kubernetes/kubelet.conf --config=/var/li…
Mar 01 17:28:18 k8s-master kubelet[457406]: I0301 17:28:18.531707  457406 kubelet_node_status.go:76] "Attempting to register node" node="k8s-master"
Mar 01 17:28:22 k8s-master kubelet[457406]: E0301 17:28:22.753930  457406 kubelet.go:3196] "No need to create a mirror pod, since failed to get node info from …
Mar 01 17:28:22 k8s-master kubelet[457406]: I0301 17:28:22.753996  457406 scope.go:117] "RemoveContainer" containerID="f20da810cbdb2ccf29cd8e95001c597b5b32d3e3…
Mar 01 17:28:22 k8s-master kubelet[457406]: E0301 17:28:22.754118  457406 pod_workers.go:1301] "Error syncing pod, skipping" err="failed to \"StartContainer\" …
Mar 01 17:28:23 k8s-master kubelet[457406]: E0301 17:28:23.833994  457406 eviction_manager.go:292] "Eviction manager: failed to get summary stats" err="failed …
Mar 01 17:28:24 k8s-master kubelet[457406]: W0301 17:28:24.467354  457406 reflector.go:569] k8s.io/client-go/informers/factory.go:160: failed to list *v1.Node:…
Mar 01 17:28:24 k8s-master kubelet[457406]: E0301 17:28:24.467417  457406 reflector.go:166] "Unhandled Error" err="k8s.io/client-go/informers/factory.go:160: F…
Mar 01 17:28:25 k8s-master kubelet[457406]: W0301 17:28:25.965697  457406 reflector.go:569] k8s.io/client-go/informers/factory.go:160: failed to list *v1.CSIDr…
Mar 01 17:28:25 k8s-master kubelet[457406]: E0301 17:28:25.965764  457406 reflector.go:166] "Unhandled Error" err="k8s.io/client-go/informers/factory.go:160: F…
Mar 01 17:28:26 k8s-master kubelet[457406]: E0301 17:28:26.387444  457406 controller.go:145] "Failed to ensure lease exists, will retry" err="Get \"https://43.…

My first guess was that I had forgotten to open some firewall ports, so as a test I opened everything in the Tencent Cloud security group (don't do this lightly) and ran the steps below. If the problem persisted, my suspicion was that the cloud provider binds the NIC to the private network only, and I would need to bind the public IP to an interface myself.

# Reset kubeadm
sudo kubeadm reset

# Restart containerd and kubelet
sudo systemctl restart containerd
sudo systemctl restart kubelet

# Clean up CNI configuration (if any)
sudo rm -rf /etc/cni/net.d/*

# Re-initialize
sudo kubeadm init --pod-network-cidr=10.244.0.0/16 --apiserver-advertise-address=<IP> --v=5

After the reset and re-init it still failed in exactly the same way, so I tried binding the public IP to a virtual interface on the master:

# First create the directory (if it doesn't exist)
sudo mkdir -p /etc/network/interfaces.d/

# Create the virtual interface configuration
sudo tee /etc/network/interfaces.d/eth0:1 <<EOF
auto eth0:1
iface eth0:1 inet static
    address 19.19.81.0
    netmask 255.255.255.255
EOF

# Apply the configuration immediately
sudo ifup eth0:1
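
To confirm the alias actually came up with the public address bound (a quick check; the interface name eth0 is assumed, as above):

ip addr show eth0   # the public IP should now appear as an additional address on eth0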

Sure enough — once the advertised IP was actually bound to an interface, the init succeeded.

It printed the following:

Your Kubernetes control-plane has initialized successfully!

To start using your cluster, you need to run the following as a regular user:

  mkdir -p $HOME/.kube
  sudo cp -i /etc/kubernetes/admin.conf $HOME/.kube/config
  sudo chown $(id -u):$(id -g) $HOME/.kube/config

Alternatively, if you are the root user, you can run:

  export KUBECONFIG=/etc/kubernetes/admin.conf

You should now deploy a pod network to the cluster.
Run "kubectl apply -f [podnetwork].yaml" with one of the options listed at:
  https://kubernetes.io/docs/concepts/cluster-administration/addons/

Then you can join any number of worker nodes by running the following on each as root:

kubeadm join 19.19.81.0:6443 --token qwim1m.ov8fzob0zq41m8xc \
        --discovery-token-ca-cert-hash sha256:14b6902dee789af56faeff6ea78830e7e9fad1192fd7a7bf301a56219e67a24c 

Just follow the instructions above.

After that, a network plugin still needs to be installed:

# Use the Calico network plugin
kubectl apply -f https://docs.projectcalico.org/manifests/calico.yaml
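
It can take a minute or two for the CNI pods to come up after applying the manifest; I would watch them with something like this (a sanity check, not from the original post):

kubectl get pods -n kube-system -w   # calico-node and calico-kube-controllers should reach Running
kubectl get nodes                    # the master should flip from NotReady to Ready once the CNI is up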

Joining the worker node

While bringing up the slave (worker) node, I hit the following error:

# root @ k8s-hk-node1 in ~ [11:29:36] 
$ sudo kubeadm init --pod-network-cidr=10.244.0.0/16 --apiserver-advertise-address=114.514.19.19
[init] Using Kubernetes version: v1.32.2
[preflight] Running pre-flight checks
W0302 11:29:51.341165  133953 checks.go:1077] [preflight] WARNING: Couldn't create the interface used for talking to the container runtime: failed to create new CRI runtime service: validate service connection: validate CRI v1 runtime API for endpoint "unix:///var/run/containerd/containerd.sock": rpc error: code = Unimplemented desc = unknown service runtime.v1.RuntimeService
        [WARNING Swap]: swap is supported for cgroup v2 only. The kubelet must be properly configured to use swap. Please refer to https://kubernetes.io/docs/concepts/architecture/nodes/#swap-memory, or disable swap on the node
[preflight] Pulling images required for setting up a Kubernetes cluster
[preflight] This might take a minute or two, depending on the speed of your internet connection
[preflight] You can also perform this action beforehand using 'kubeadm config images pull'
error execution phase preflight: [preflight] Some fatal errors occurred:
failed to create new CRI runtime service: validate service connection: validate CRI v1 runtime API for endpoint "unix:///var/run/containerd/containerd.sock": rpc error: code = Unimplemented desc = unknown service runtime.v1.RuntimeService[preflight] If you know what you are doing, you can make a check non-fatal with `--ignore-preflight-errors=...`
To see the stack trace of this error execute with --v=5 or higher

The error points to a CRI (Container Runtime Interface) problem.

Run the following on the worker node:

# Generate the default configuration
sudo mkdir -p /etc/containerd
sudo containerd config default | sudo tee /etc/containerd/config.toml > /dev/null

# Edit the configuration file
sudo vi /etc/containerd/config.toml

Looking through it, the problem was the SystemdCgroup setting.

After I changed the config file, containerd was back to normal.
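
For reference, the change in question is typically flipping SystemdCgroup to true in the runc options of the generated config — a minimal sketch of doing it non-interactively, assuming containerd's default config layout:

# enable the systemd cgroup driver for runc, then restart containerd
sudo sed -i 's/SystemdCgroup = false/SystemdCgroup = true/' /etc/containerd/config.toml
sudo systemctl restart containerd

# verify the CRI endpoint answers now
sudo crictl --runtime-endpoint unix:///var/run/containerd/containerd.sock info > /dev/null && echo "CRI OK"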

Running the kubeadm join command again, it just got stuck here...

Since I'm doing this over the public internet, there was good reason to suspect a connectivity problem; a quick telnet test confirmed it — no connection.

# root @ k8s-hk-node1 in ~ [13:41:08] 
$ telnet 19.19.81.0 6443
Trying 19.19.81.0...
telnet: Unable to connect to remote host: Connection refused
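
Before blaming the network path, it is worth confirming on the master that something is actually listening on 6443 and on which address (a quick diagnostic, not part of the original steps):

sudo ss -tlnp | grep 6443   # run on the master; shows whether kube-apiserver is bound and to which address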

Time for a break before figuring out how to solve this connectivity issue... deploying over the public internet really is a pain.

Back after the intermission — time to capture some packets with tcpdump:

# root @ k8s-master in ~ [21:48:04] 
$ tcpdump -i eth0 'tcp port 6443 and host 114.514.19.19'
tcpdump: verbose output suppressed, use -v[v]... for full protocol decode
listening on eth0, link-type EN10MB (Ethernet), snapshot length 262144 bytes
21:48:15.463636 IP 114.514.19.19.42444 > 10.3.0.13.6443: Flags [S], seq 474053441, win 64240, options [mss 1424,sackOK,TS val 2636673266 ecr 0,nop,wscale 7], length 0
21:48:15.463693 IP 10.3.0.13.6443 > 114.514.19.19.42444: Flags [R.], seq 0, ack 474053442, win 0, length 0

You can see the SYN packet does reach the target host, but the host's TCP stack immediately rejects the connection (the RST in the second line).

At this point, on top of the cluster being unstable, there is this connectivity problem as well, so I decided to reinstall k8s on the master.

Troubleshooting

After removing the old configuration files and re-initializing, the cluster came up normally.
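
Roughly what "removing the old configuration files" amounts to — a hedged sketch, the paths being the usual kubeadm/CNI/kubeconfig locations rather than anything specific to this setup:

sudo kubeadm reset -f                       # tears down /etc/kubernetes and the static Pod manifests
sudo rm -rf /etc/cni/net.d                  # leftover CNI configuration
rm -rf $HOME/.kube/config                   # stale kubeconfig pointing at the old cluster
sudo systemctl restart containerd kubelet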

As for the network problem, I suspected Calico, so I tweaked its manifest:

sed -i 's|192.168.0.0/16|10.244.0.0/16|g' calico.yaml

After restarting the cluster it still didn't work, so I decided to just try a different network plugin and switched to Flannel:

wget https://raw.githubusercontent.com/flannel-io/flannel/v0.22.0/Documentation/kube-flannel.yml
kubectl apply -f kube-flannel.yml
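
To check that Flannel is healthy before moving on (in this manifest version the DaemonSet lives in the kube-flannel namespace — adjust if yours differs):

kubectl get pods -n kube-flannel -o wide   # kube-flannel-ds-* should be Running on every node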

After switching to Flannel, the cluster started up normally!

The worker joined successfully as well.
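
Seen from the master, the join can be verified with a plain node listing (run on the master, since the worker has no admin kubeconfig by default):

kubectl get nodes -o wide   # the new worker should appear and eventually report Ready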

The first application

For the first application, let's just go with nginx!

Use the official nginx pod image.

Create the nginx deployment:

kubectl create deployment nginx --image=nginx

Pull the Nginx image (on the nodes):

nerdctl pull nginx:latest
crictl pull nginx:latest

Check the deployment status:

kubectl get deployments

It's up and running!
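
To actually reach it from outside the cluster, one option is to expose the deployment as a NodePort service — a minimal sketch, where the node IP and port are placeholders rather than values from this setup:

kubectl expose deployment nginx --port=80 --type=NodePort
kubectl get svc nginx               # note the allocated NodePort (3xxxx)
curl http://<NODE_IP>:<NODE_PORT>   # should return the nginx welcome page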

References:

  1. Run a stateless application in Kubernetes
  2. Overview of the key components that make up a Kubernetes cluster
  3. Adding Linux worker nodes to a kubeadm cluster
  4. https://juejin.cn/post/7143809823925092389