Create a Google Kubernetes Engine (GKE) Cluster within its VPC Using Terraform
In this tutorial, we will look at how to use Terraform to create a Google Kubernetes Engine (GKE) cluster within its own VPC. GKE is a managed, production-ready environment for deploying containerized applications on Google Cloud.
Terraform is an open-source Infrastructure as Code (IaC) tool developed by HashiCorp. It's a cloud-agnostic tool used to automate and manage your infrastructure in a predictable and consistent manner. Terraform uses a declarative language to describe your infrastructure, which makes it easier to understand and reason about.
What are we going to create?
In this blog we will look at how to create the following resources in Google Cloud.
VPC and Subnet
GKE Cluster
Worker nodepools
Connect to the GKE cluster and install the nginx helm chart to validate the functionality
We will not go into detail on Terraform modules, as we have already covered them in our blog Create EKS cluster within its VPC.
The complete terraform code for what we will discuss below is in this repository.
Prerequisites
Before you get started, you need to have the following in place:
A Google Cloud Platform (GCP) account and a project where you want to create the resources. Also create a Service Account with the permissions Terraform needs to interact with GCP APIs. Please see the sections Setup your GCP Project and Create a Service Account for Terraform below.
The Google Cloud SDK installed and initialized on your local machine
Basic understanding of GKE, VPCs and subnets
Terraform installed.
kubectl compatible with the GKE version you are installing.
terraform-docs if you want to auto-generate the documentation, and tfswitch to manage multiple versions of Terraform
helm, a package manager for Kubernetes manifests; we will use it to install the nginx helm chart once the cluster is created.
Setup your GCP Project
First, we need to set up a project on GCP. This can be done through the GCP console. Note the Project ID, as we will need it in the Terraform scripts below.
Create a Service Account for Terraform
Terraform needs permissions to interact with the GCP API. This is accomplished by creating a service account. In the GCP console, navigate to IAM & Admin > Service Accounts > Create Service Account, provide a name, and grant it the Kubernetes Engine Admin and Service Account User roles.
Next, create a JSON key for this service account and download it. Keep this file safe and secure, as it provides administrative access to your GCP project.
NOTE: Creating and downloading a service account key can indeed be accomplished with Terraform. However, it's important to note that handling these keys is a sensitive operation, as they grant access to your Google Cloud resources.
You should handle these keys in a secure manner and avoid exposing them whenever possible. Storing secrets in Terraform state is considered an anti-pattern because the state is often stored in plaintext, so ensure you are managing these secrets according to your organization's best practices for secret management.
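As a rough illustration of the Terraform route mentioned in the note, the sketch below creates the service account and grants the two roles, but deliberately stops short of generating a key so that no secret lands in the state. The variable project_id, the resource names and the account_id value terraform-gke are placeholders of ours and are not part of the repository.

```hcl
# Hypothetical input: the Project ID noted earlier.
variable "project_id" {
  type        = string
  description = "GCP project in which the service account is created"
}

# Service account that Terraform will run as (account_id is a placeholder).
resource "google_service_account" "terraform" {
  project      = var.project_id
  account_id   = "terraform-gke"
  display_name = "Terraform GKE provisioning"
}

# Grant the two roles named above:
#   roles/container.admin        = Kubernetes Engine Admin
#   roles/iam.serviceAccountUser = Service Account User
resource "google_project_iam_member" "terraform" {
  for_each = toset([
    "roles/container.admin",
    "roles/iam.serviceAccountUser",
  ])

  project = var.project_id
  role    = each.value
  member  = "serviceAccount:${google_service_account.terraform.email}"
}
```

Note that applying this requires an identity that can already manage IAM on the project, so it would have to be run with credentials other than those of the service account it creates. The JSON key can then be created out of band, as described above.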
Module Structure
We have explained Terraform modularization in detail in our previous blog Create EKS cluster within its VPC, so we will not go over those details again. The following is the outline of the modules we are going to build in the sections below.
You can simply clone the complete working code from this repository.
my-gke-tf/ # root directory
.
├── cluster # scaffold module which invokes gke and vpc_and_subnets module
│   ├── main.tf
│   ├── outputs.tf
│   └── variables.tf
├── modules
│   ├── gke # module to create k8s cluster and worker nodepools
│   │   ├── main.tf
│   │   ├── outputs.tf
│   │   └── variables.tf
│   └── vpc_and_subnets # module to create vpc and subnets
│       ├── main.tf
│       ├── outputs.tf
│       └── variables.tf
├── main.tf # invokes cluster module to create gke cluster in its vpc
├── outputs.tf
├── sample.tfvars # sample variables file
└── variables.tf
Terraform Modules
The following are the Terraform modules we will create in the my-gke-tf directory. You can refer to the section above for the directory structure. We will look at the respective Terraform files below. Please note that the files shown may have been abbreviated; the complete code is available in this repository.
modules
These are the APIs created by the Platform team. In the real world, these modules can be moved to a dedicated repository and referenced as remote modules from the modules written by users who want to claim the infrastructure.
vpc_and_subnets
This is an opinionated module created by the Platform team to create a GCP VPC and subnet. Create the following files under the modules/vpc_and_subnets directory.
The main.tf file below pins the google provider version we have validated this module with, and also externalizes variables like vpc_name, subnet_name, cidrBlock and the region where the resources need to be created.
The following file may have been abbreviated for brevity. The complete working code can be found here
# setup google terraform provider
terraform {
required_providers {
google = {
source = "hashicorp/google"
version = "4.74.0"
}
}
}
# VPC - https://registry.terraform.io/providers/hashicorp/google/4.74.0/docs/resources/compute_network
resource "google_compute_network" "vpc" {
name = var.vpc_name
description = var.vpc_description
# the network is created in "custom subnet mode"
# we will explicitly connect subnetwork resources below using google_compute_subnetwork resource
auto_create_subnetworks = false
}
# Subnet - https://registry.terraform.io/providers/hashicorp/google/4.74.0/docs/resources/compute_subnetwork
resource "google_compute_subnetwork" "subnet" {
name = var.subnet_name
description = var.subnet_description
region = var.region
network = google_compute_network.vpc.name
ip_cidr_range = var.cidrBlock
}
The variables.tf file declares the variables accepted as inputs from the user, which you can see being referenced as var.<name> in the main.tf file above.
The following variables.tf may have been abbreviated for brevity. The complete working code can be found here.
variable "vpc_name" {
type = string
description = "Name of the resource. Provided by the client when the resource is created. The name must be 1-63 characters long, and comply with RFC1035. Specifically, the name must be 1-63 characters long and match the regular expression [a-z]([-a-z0-9]*[a-z0-9])? which means the first character must be a lowercase letter, and all following characters must be a dash, lowercase letter, or digit, except the last character, which cannot be a dash."
}
variable "vpc_description" {
type = string
description = "An optional description of this resource. The resource must be recreated to modify this field."
default = ""
}
variable "subnet_name" {
type = string
description = "The name of the resource, provided by the client when initially creating the resource. The name must be 1-63 characters long, and comply with RFC1035. Specifically, the name must be 1-63 characters long and match the regular expression [a-z]([-a-z0-9]*[a-z0-9])? which means the first character must be a lowercase letter, and all following characters must be a dash, lowercase letter, or digit, except the last character, which cannot be a dash."
}
variable "subnet_description" {
type = string
description = "An optional description of this resource. Provide this property when you create the resource. This field can be set only at resource creation time."
default = ""
}
variable "cidrBlock" {
type = string
description = "The range of internal addresses that are owned by this subnetwork. Provide this property when you create the subnetwork. For example, 10.0.0.0/8 or 192.168.0.0/16. Ranges must be unique and non-overlapping within a network. Only IPv4 is supported."
}
variable "region" {
type = string
description = "The GCP region for this subnetwork."
}
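The naming rule quoted in the descriptions can also be enforced at plan time. The following is a minimal sketch, assuming Terraform 0.13 or later (where custom validation rules are available). These declarations would replace the plain vpc_name and cidrBlock declarations above rather than sit alongside them, and the shortened descriptions and error messages are our own wording.

```hcl
variable "vpc_name" {
  type        = string
  description = "Name of the VPC; must comply with RFC1035."

  validation {
    # 1-63 characters, starts with a lowercase letter, does not end with a dash
    condition     = length(var.vpc_name) <= 63 && can(regex("^[a-z]([-a-z0-9]*[a-z0-9])?$", var.vpc_name))
    error_message = "The vpc_name must be 1-63 characters and match [a-z]([-a-z0-9]*[a-z0-9])?."
  }
}

variable "cidrBlock" {
  type        = string
  description = "The range of internal addresses owned by this subnetwork."

  validation {
    # cidrhost fails on input that is not valid CIDR notation;
    # this is a syntax check only and does not detect overlapping ranges
    condition     = can(cidrhost(var.cidrBlock, 0))
    error_message = "The cidrBlock must be a valid CIDR range such as 10.1.0.0/16."
  }
}
```

With rules like these, a bad name or range is rejected during terraform plan instead of surfacing later as an API error from GCP.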
The outputs.tf file exposes the resource attributes that a user of this module might need, typically as inputs to other modules. For example, we will need vpc_self_link and subnet_self_link as inputs to the gke module below.
The following file may have been abbreviated for brevity. The complete working code can be found here.
output "vpc_self_link" {
description = "The URI of the created resource"
value = google_compute_network.vpc.self_link
}
output "subnet_self_link" {
description = "The URI of the created resource"
value = google_compute_subnetwork.subnet.self_link
}
output "vpc_name" {
description = "vpc network name"
value = google_compute_network.vpc.name
}
output "subnet_name" {
description = "subnet name"
value = google_compute_subnetwork.subnet.name
}
gke
This is an opinionated module to create a GKE cluster with the optional ability to create more worker nodepools. Create the following files under the modules/gke directory.
The main.tf file below pins the google provider version we have validated this module with, and also externalizes variables like cluster_name, k8s_version, the nodepools config, etc. to create the GKE cluster. We retrieve the first 3 availability zones using the data source google_compute_zones, and these zones are used to create the GKE cluster and nodepools. We also retrieve the latest available version matching the k8s_version provided by the user, using the data source google_container_engine_versions, and use it as min_master_version.
To be able to successfully create the GKE cluster and nodepools, we also need to enable the project services compute, container and cloudresourcemanager.
We make sure that the release_channel is hard-coded to UNSPECIFIED, so that the cluster is not enrolled in a release channel and upgrades stay under our control (the nodepools below also set auto_upgrade = false). This is a recommended practice if you want to control the k8s upgrades. We also create separately managed nodepools instead of the default nodepool, so that we can easily add or remove nodepools as needed.
The following file may have been abbreviated for brevity. The complete working code can be found here.
# setup google terraform provider
terraform {
required_providers {
google = {
source = "hashicorp/google"
version = "4.74.0"
}
}
}
locals {
zones = slice(data.google_compute_zones.available.names, 0, 3)
# we only care about compute and container service here hence only enabling these project services
# without cloudresourcemanager we get errors
services = toset(["compute", "container", "cloudresourcemanager"])
# we will pick the latest k8s version
master_version = data.google_container_engine_versions.main.valid_master_versions[0]
}
# allows management of a single API service for a Google Cloud Platform project.
# official documentation - https://registry.terraform.io/providers/hashicorp/google/4.74.0/docs/resources/google_project_service
resource "google_project_service" "main" {
for_each = local.services
service = "${each.value}.googleapis.com"
disable_on_destroy = false
}
# to extract the UP available compute zones for the provided region
# official documentation - https://registry.terraform.io/providers/hashicorp/google/4.74.0/docs/data-sources/compute_zones
data "google_compute_zones" "available" {
region = var.region
status = "UP"
depends_on = [google_project_service.main]
}
# to retrieve the latest k8s version supported for the provided k8s version in a region
# official documentation - https://registry.terraform.io/providers/hashicorp/google/4.74.0/docs/data-sources/container_engine_versions
data "google_container_engine_versions" "main" {
location = var.region
# Since this is just a string match, it's recommended that you append a . after minor versions
# to ensure that prefixes such as 1.1 don't match versions like 1.12.5-gke.10 accidentally.
version_prefix = "${var.k8s_version}."
depends_on = [google_project_service.main]
}
# GKE cluster - https://registry.terraform.io/providers/hashicorp/google/4.74.0/docs/resources/container_cluster
resource "google_container_cluster" "gke" {
name = var.cluster_name
location = var.region
node_locations = local.zones
min_master_version = local.master_version
# to prevent automatic updates to cluster
release_channel {
channel = "UNSPECIFIED"
}
# cluster cannot be created without a nodepool but since we want to use self managed nodes
# we will instruct to remove the default node pool on successful cluster creation
remove_default_node_pool = true
initial_node_count = 1
network = var.network
subnetwork = var.subnetwork
}
# self managed nodepool
# official documentation - https://registry.terraform.io/providers/hashicorp/google/latest/docs/resources/container_node_pool
resource "google_container_node_pool" "nodepools" {
for_each = var.nodepools
name = each.value.name
location = var.region
cluster = google_container_cluster.gke.name
node_count = each.value.node_count
autoscaling {
min_node_count = 0
max_node_count = 100
}
management {
auto_repair = true
auto_upgrade = false
}
node_config {
machine_type = each.value.machine_type
labels = each.value.node_labels
}
lifecycle {
ignore_changes = [
initial_node_count,
node_count,
node_config,
node_locations
]
}
version = local.master_version
depends_on = [google_container_cluster.gke]
}
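One thing to be aware of in the locals above: slice(..., 0, 3) raises an error if the region reports fewer than three UP zones. A hedged variant that caps the slice at the number of zones actually returned, together with two debugging outputs, could look like the sketch below. The names zone_names, selected_zones and resolved_master_version are ours and are not in the repository, and the zones entry shown here would replace the one in the existing locals block.

```hcl
locals {
  zone_names = data.google_compute_zones.available.names

  # take at most three zones, but never more than the region offers
  zones = slice(local.zone_names, 0, min(3, length(local.zone_names)))
}

# Optional outputs to inspect what the data sources resolved to.
output "selected_zones" {
  description = "Zones picked for the cluster nodes"
  value       = local.zones
}

output "resolved_master_version" {
  description = "Full GKE version resolved from the k8s_version prefix"
  value       = local.master_version
}
```

Printing the resolved values is mainly useful when a plan proposes an unexpected version change, since the data source is re-read on every run.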
The variables.tf file allows the user to configure the network and subnet where the GKE cluster and nodepools need to be created, along with the configuration for the nodepools. These variables are referenced in main.tf as var.<name>.
The following file may have been abbreviated for brevity. The complete working code can be found here.
variable "cluster_name" {
type = string
description = "The name of the cluster, unique within the project and location."
}
variable "region" {
type = string
description = "The location (region or zone) in which the cluster master will be created, as well as the default node location. If you specify a zone (such as us-central1-a), the cluster will be a zonal cluster with a single cluster master. If you specify a region (such as us-west1), the cluster will be a regional cluster with multiple masters spread across zones in the region, and with default node locations in those zones as well"
}
variable "network" {
type = string
description = "The name or self_link of the Google Compute Engine network to which the cluster is connected. For Shared VPC, set this to the self link of the shared network."
}
variable "subnetwork" {
type = string
description = "The name or self_link of the Google Compute Engine subnetwork in which the cluster's instances are launched."
}
variable "k8s_version" {
type = string
description = "kubernetes version"
default = "1.27"
}
variable "nodepools" {
description = "Nodepools for the Kubernetes cluster"
type = map(object({
name = string
node_count = number
node_labels = map(any)
machine_type = string
}))
default = {
worker = {
name = "worker"
node_labels = { "worker-name" = "worker" }
machine_type = "n1-standard-1"
node_count = 1
}
}
}
The outputs.tf file exposes the values which may be useful to the end user. You may observe that the client_certificate, client_key and cluster_ca_certificate outputs are marked as sensitive = true, which prevents Terraform from printing this sensitive information on stdout, though it doesn't prevent it from being stored in the tfstate file.
The following file may have been abbreviated for brevity. The complete working code can be found here.
output "cluster_id" {
description = "an identifier for the resource with format projects/{{project}}/locations/{{zone}}/clusters/{{name}}"
value = google_container_cluster.gke.id
}
output "cluster_master_version" {
description = "The current version of the master in the cluster. This may be different than the min_master_version set in the config if the master has been updated by GKE."
value = google_container_cluster.gke.master_version
}
output "client_certificate" {
description = "Base64 encoded public certificate used by clients to authenticate to the Kubernetes cluster."
value = google_container_cluster.gke.master_auth.0.client_certificate
sensitive = true
}
output "client_key" {
description = "Base64 encoded private key used by clients to authenticate to the cluster endpoint."
value = google_container_cluster.gke.master_auth.0.client_key
sensitive = true
}
output "cluster_ca_certificate" {
description = "Base64 encoded public certificate that is the root certificate of the cluster."
value = google_container_cluster.gke.master_auth.0.cluster_ca_certificate
sensitive = true
}
output "endpoint" {
description = "The IP address of this cluster's Kubernetes master."
value = google_container_cluster.gke.endpoint
}
cluster modules
In the sections above we created the modules/APIs; now it's time to invoke these modules in a consolidated module named cluster. You can imagine this module being written by a client of the platform team, which can be any application team wanting to claim infrastructure resources. This module is further opinionated, catering to the needs of the application team.
The main.tf below accepts cluster_name as an input and uses the same name for vpc_name, subnet_name and cluster_name. You can also observe that the gke_with_node_group module uses vpc_self_link (the VPC's unique URL) and subnet_self_link (the subnet's unique URL) from the vpc_with_subnets module's outputs, which creates an implicit dependency on the vpc_with_subnets module. This means the gke_with_node_group module will wait for the vpc_with_subnets module to finish before executing.
The following file may have been abbreviated for brevity. The complete working code can be found here.
# invoking vpc and subnets modules
module "vpc_with_subnets" {
# invoke vpc_and_subnets module under modules directory
source = "../modules/vpc_and_subnets"
# create vpc and subnet with the same name as cluster name
vpc_name = var.cluster_name
subnet_name = var.cluster_name
# region where the resources need to be created
region = var.region
cidrBlock = var.cidrBlock
}
# invoking gke module to create gke cluster and node group
module "gke_with_node_group" {
# invoke gke module under modules directory
source = "../modules/gke"
cluster_name = var.cluster_name
k8s_version = var.k8s_version
region = var.region
nodepools = var.nodepools
network = module.vpc_with_subnets.vpc_self_link
subnetwork = module.vpc_with_subnets.subnet_self_link
}
The variables.tf file accepts fewer parameters than the vpc and gke modules we saw earlier because, as you can see above, main.tf is written in an opinionated manner catering to the needs of a team. Each team can write their own version of the module.
The following file may have been abbreviated for brevity. The complete working code can be found here.
variable "cluster_name" {
type = string
description = "vpc, subnet and gke cluster name"
}
variable "k8s_version" {
type = string
description = "kubernetes version"
default = "1.27"
}
variable "region" {
type = string
description = "gcp region where the gke cluster must be created, this region should match where you have created the vpc and subnet"
}
variable "cidrBlock" {
type = string
description = "The cidr block for subnet"
default = "10.1.0.0/16"
}
variable "nodepools" {
description = "Nodepools for the Kubernetes cluster"
type = map(object({
name = string
node_count = number
node_labels = map(any)
machine_type = string
}))
default = {
worker = {
name = "worker"
node_labels = { "worker-name" = "worker" }
machine_type = "n1-standard-1"
node_count = 1
}
}
}
The outputs.tf file only exposes the values the team may need. The following file may have been abbreviated for brevity. The complete working code can be found here.
output "cluster_id" {
description = "an identifier for the resource with format projects/{{project}}/locations/{{zone}}/clusters/{{name}}"
value = module.gke_with_node_group.cluster_id
}
output "cluster_master_version" {
description = "The current version of the master in the cluster. This may be different than the min_master_version set in the config if the master has been updated by GKE."
value = module.gke_with_node_group.cluster_master_version
}
output "client_certificate" {
description = "Base64 encoded public certificate used by clients to authenticate to the Kubernetes cluster."
value = module.gke_with_node_group.client_certificate
sensitive = true
}
output "client_key" {
description = "Base64 encoded private key used by clients to authenticate to the cluster endpoint."
value = module.gke_with_node_group.client_key
sensitive = true
}
output "cluster_ca_certificate" {
description = "Base64 encoded public certificate that is the root certificate of the cluster."
value = module.gke_with_node_group.cluster_ca_certificate
sensitive = true
}
output "endpoint" {
description = "The IP address of this cluster's Kubernetes master."
value = module.gke_with_node_group.endpoint
}
output "vpc_self_link" {
description = "The URI of the created resource"
value = module.vpc_with_subnets.vpc_self_link
}
output "subnet_self_link" {
description = "The URI of the created resource"
value = module.vpc_with_subnets.subnet_self_link
}
prepare to invoke the cluster module
Now we are at the final stage, where members of the team may want to invoke the cluster module for various use cases. For example, we may want to create dev, stage and prod GKE clusters.
The main.tf below only overrides the cluster_name, k8s_version and region variables of the cluster module we created above, and uses the default values for everything else.
Along with that, it sets the Terraform backend to store the tfstate file in S3. This backend is configured at initialization time using terraform init, as shown in the section below. We have explained this in our earlier blog on how to Create EKS cluster within its VPC.
This also configures the google provider. You will see in the section below that we supply the required parameters through environment variables to make sure that Terraform creates the resources in the desired GCP project. You will also notice that the cluster module invocation points its source at the cluster module we created in the section above.
The following file may have been abbreviated for brevity. The complete working code is here.
# to use s3 backend
# s3 bucket is configured at command line
terraform {
backend "s3" {}
}
# setup google provider
# the environment variables below will be set before invoking the module
# GOOGLE_CREDENTIALS - path/to/service/account/credentials/file
# GOOGLE_PROJECT - google project id where the resources need to be created
provider "google" {}
# invoke cluster module which creates vpc, subnet and gke cluster with a default worker nodepool
module "cluster" {
source = "./cluster"
region = var.region
cluster_name = var.cluster_name
k8s_version = var.k8s_version
}
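Since the cluster module also exposes cidrBlock and nodepools, a team that needs more than the defaults could override them in the same invocation. The sketch below is only an illustration: the second pool named batch, its label, the e2-standard-4 machine type and the CIDR range are example values of ours, not taken from the repository.

```hcl
module "cluster" {
  source = "./cluster"

  region       = var.region
  cluster_name = var.cluster_name
  k8s_version  = var.k8s_version

  # override the default 10.1.0.0/16 subnet range
  cidrBlock = "10.10.0.0/16"

  # each map entry becomes one google_container_node_pool via for_each
  nodepools = {
    worker = {
      name         = "worker"
      node_count   = 1
      node_labels  = { "worker-name" = "worker" }
      machine_type = "n1-standard-1"
    }
    batch = {
      name         = "batch"
      node_count   = 2
      node_labels  = { "worker-name" = "batch" }
      machine_type = "e2-standard-4"
    }
  }
}
```

Because the cluster is regional, node_count applies per zone, so the batch pool in this example would start with two nodes in each of the selected zones.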
The variables.tf and outputs.tf files are as follows. The actual files are here - variables.tf and outputs.tf.
variable "region" {
type = string
description = "gcp region where the resources are being created"
}
variable "cluster_name" {
type = string
description = "gke cluster name, same name is used for vpc and subnets"
default = "platformwale"
}
variable "k8s_version" {
type = string
description = "k8s version"
default = "1.27"
}
output "endpoint" {
description = "The IP address of this cluster's Kubernetes master."
value = module.cluster.endpoint
}
Now we also need to create a .tfvars file. You can think of this as the input file used while invoking the module; this way you can get different behavior based on your requirements. For example, as discussed earlier, you may have dev.tfvars, stage.tfvars and prod.tfvars for environment-specific clusters with distinct configurations. The following is the sample.tfvars which we will use in the sections below for provisioning the infrastructure. The complete code can be found here.
# gcp region
region = "us-east1"
# gke cluster name, this is the same name used to create the vpc and subnet
# hence this name must be unique
cluster_name = "platformwale"
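For the per-environment files mentioned above, a prod.tfvars might differ from the sample in only a few values. This is a hypothetical example; the region, cluster name and pinned version are placeholders of ours.

```hcl
# prod.tfvars (hypothetical)

# gcp region for the production cluster
region = "us-central1"

# also used as the vpc and subnet name, so it must be unique
cluster_name = "platformwale-prod"

# pin the minor version explicitly instead of relying on the default
k8s_version = "1.27"
```

It would be passed the same way as sample.tfvars, for example terraform plan -var-file="prod.tfvars". Each environment would also need its own tfstate key at terraform init time so that the states do not overwrite each other.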
With all these modules in place, we are all set to see the infrastructure for the GKE cluster come to life. Please refer to the sections below for further instructions.
Setting Up the Environment
1. Create a GCP project and note the Project ID, as we will need it below.
2. Terraform needs permissions to interact with the GCP API. This is accomplished by creating a service account. In the GCP console, navigate to IAM & Admin > Service Accounts > Create Service Account, provide a name, and grant it the Kubernetes Engine Admin and Service Account User roles. Next, create a JSON key for this service account and download it. Keep this file safe and secure, as it provides administrative access to your GCP project. We will need this file below.
3. Before we dive into creating our resources, let's authenticate the gcloud CLI with our GCP account:
gcloud auth application-default login
4. Set the following environment variables to prepare to create the GKE cluster in the designated project in your GCP account.
export GOOGLE_CREDENTIALS="path/to/service/account/credentials/JSON from #2 above"
export GOOGLE_PROJECT="GCP Project ID from #1 above"
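The empty provider "google" {} block in the root main.tf picks these values up from the environment. If you would rather make the project explicit in code, the provider also accepts arguments directly. The following is a sketch, assuming a hypothetical project_id variable that is not in the repository; credentials are still left to GOOGLE_CREDENTIALS so that the key path stays out of version control.

```hcl
# Hypothetical variable, not part of the repository.
variable "project_id" {
  type        = string
  description = "GCP project id where the resources need to be created"
}

provider "google" {
  # explicit project instead of GOOGLE_PROJECT
  project = var.project_id

  # default region for regional resources
  region = var.region
}
```

The trade-off is one more value in every .tfvars file in exchange for a configuration that does not depend on the shell it is run from.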
Deploy and Validate
In this section we will look at how to execute the Terraform modules we prepared above to create the GKE cluster within its VPC, connect to the cluster, and deploy the nginx helm chart to validate the functionality of the cluster.
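Create an S3 bucket to store the tfstate file. The init command below assumes the bucket is in us-east-1; for any other region, aws s3api create-bucket also needs --create-bucket-configuration LocationConstraint=<region>.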
aws s3api create-bucket --bucket "your-bucket-name" --region "your-aws-region"
Initialize the Terraform module. Run this from the root of my-gke-tf, where you have prepared the Terraform files to invoke the cluster module.
# tfstate file name
tfstate_file_name="<some name e.g. gke-1111111111>"
# tfstate s3 bucket name, this bucket will hold the tfstate file which you can use for further runs of this terraform module
# for example to upgrade k8s version or add new node pools etc. The bucket name must be globally unique, and the bucket must already exist because the s3 backend does not create it
tfstate_bucket_name="unique s3 bucket name you created above e.g. my-tfstate-<myname>"
# initialize the terraform module
terraform init -backend-config "key=${tfstate_file_name}" -backend-config "bucket=${tfstate_bucket_name}" -backend-config "region=us-east-1"
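The same settings can also live in a small backend configuration file instead of repeated -backend-config flags. The following is a sketch, assuming a file named backend.hcl (our name, not in the repository) next to main.tf; the values are the placeholders used above.

```hcl
# backend.hcl (hypothetical file name)

# bucket created in the previous step
bucket = "my-tfstate-<myname>"

# tfstate file name inside the bucket
key = "gke-1111111111"

# AWS region of the bucket
region = "us-east-1"
```

It would then be passed as terraform init -backend-config=backend.hcl. Keeping one such file per environment makes it harder to initialize against the wrong state by accident.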
Retrieve the terraform plan, a preview of what will happen when you apply this Terraform module. Reviewing the plan before applying is a best practice to understand the change.
terraform plan -var-file="path/to/your/terraform.tfvars"
# example
terraform plan -var-file="sample.tfvars"
If you are satisfied with the plan above, the final step is to apply the Terraform module and wait for the resources to be created. It will take about 20 minutes for all the resources to be created.
terraform apply -var-file="path/to/your/terraform.tfvars"
# example
terraform apply -var-file="sample.tfvars"
After successful cluster creation, retrieve the kubeconfig, connect to the GKE cluster, and validate that the kubeconfig context is now pointing to the new cluster.
gcloud auth login
gcloud container clusters get-credentials CLUSTER_NAME --region REGION --project PROJECT_ID
# example
gcloud auth login
gcloud container clusters get-credentials "platformwale" --region "us-east1" --project "${GOOGLE_PROJECT}"
# validate that the kubeconfig context is pointing to the new cluster
kubectl config current-context
Install the nginx helm chart. This creates a LoadBalancer service, and the nginx pods coming up successfully proves the functionality of the GKE cluster.
helm repo add bitnami https://charts.bitnami.com/bitnami
helm install -n default nginx bitnami/nginx
# validate nginx pod and load balancer service
kubectl get pods -n default
kubectl get svc -n default
# example
$ kubectl get pods -n default
NAME READY STATUS RESTARTS AGE
nginx-7c8ff57685-ck9pn 1/1 Running 0 3m31s
$ kubectl get svc -n default nginx
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
nginx LoadBalancer 10.19.255.56 XX.XXX.XXX.XXX 80:32142/TCP 76s
You can open http://<EXTERNAL-IP>:80 in a browser and see the nginx welcome page.
Clean Up
When you're done with your resources, you can destroy them with the following commands. This is an extremely important step, as otherwise you will incur unexpected costs for the resources in your account.
# uninstall nginx helm chart to make sure load balancer is deleted
helm uninstall -n default nginx
# destroy infrastructure
terraform destroy -var-file="sample.tfvars"
Make sure to delete the service account and GCP project created earlier from GCP console.
Conclusion
There you have it! You've successfully created a GKE cluster within a VPC using Terraform. With the power of IaC, you can easily manage, replicate, and version control your infrastructure. Happy Terraforming!
References
Please note that this tutorial is a basic guide, and best practices such as state management, data security, and others are not covered here. We recommend further study to understand and implement these practices for production-level projects.
Author Notes
Feel free to reach out with any concerns or questions you have, either on the GitHub repository or directly on this blog. I will make every effort to address your inquiries and provide resolutions. Stay tuned for the upcoming blog in this series dedicated to Platformwale (Engineers who work on Infrastructure Platform teams).
Originally published at platformwale.blog on July 25, 2023.