Running production workloads in EKS using Spot instances

Running resilient workloads in EKS using Spot instances

Spot Instances Overview

A Spot Instance is an instance that uses spare EC2 capacity and is available for less than the On-Demand price (up to 90% cheaper). This makes it a very cost-efficient option, but it comes with some downsides: Spot Instances can be reclaimed by the AWS EC2 Spot service in what is called a Spot Instance interruption. Amazon EC2 might interrupt your Spot Instances when it needs the capacity back, when the Spot price rises above the maximum price you set, or when a constraint in your Spot request (such as a launch group or Availability Zone group) can no longer be met.

You can see the historical interruption rates for your instance type in the Spot Instance Advisor.

If a Spot Instance is stopped, hibernated, or terminated, you can use CloudTrail to see whether Amazon EC2 interrupted the Spot Instance. In AWS CloudTrail, the event name BidEvictedEvent indicates that Amazon EC2 interrupted the Spot Instance. To view BidEvictedEvent events in CloudTrail:

Open the CloudTrail console.
In the navigation pane, choose Event history.
In the filter drop-down, choose Event name, and then in the filter field to the right, enter BidEvictedEvent.
Choose BidEvictedEvent in the resulting list to view its details. Under Event record, you can find the instance ID.

Spot instances are usually suitable for stateless, fault-tolerant applications that are able to checkpoint and continue after an interruption, as well as batch jobs.

Considerations when using Spot Instances

At giffgaff we run all our applications in an EKS cluster using 100% Spot instances. We make use of the Spot platform and its Ocean clusters feature.

Spot continuously monitors the different capacity pools across operating systems, instance types, availability zones and regions to decide in real time which instances to provision and which ones to proactively rebalance and replace before an interruption happens.

We see somewhere between 40 and 60 interruptions in a single day. Although Spot does a great job, sometimes instances are taken away before the rebalance has fully completed, which can result in downtime for some applications.

To minimise the impact of interruptions and avoid downtime, these are some of the measures we’ve put in place over the last few years, allowing us to take advantage of Spot instances without compromising availability.

Minimum number of replicas

As stated above, Spot instances can be interrupted by the AWS EC2 Spot service. We very often see 2 instances taken down at the same time, and sometimes even 4.

There’s a chance that all pods for a particular application are running in one of the instances being taken away. There are a few ways to avoid that, like configuring anti-affinity rules, or even better, pod topology spread constraints. We’ll be discussing these later on.

By default, we set the minimum number of pods to 4 in our critical applications, and we force those pods to be scheduled on different instances. This way we can lose up to 3 instances at the same time, each running one of the pods of a particular application, without incurring any downtime.

Instance types and availability zones

Deploying your application across many different instance types and availability zones further enhances availability. When multiple instances are taken down at the same time, they usually belong to the same family and instance type in a particular availability zone, because demand for that instance type is high there.

We configure our cluster to run a wide variety of instance types. At the same time, we spread the pods of a particular application across multiple availability zones, minimising the risk of a service interruption.

Pod disruption budget

A Pod Disruption Budget (PDB) limits the number of pods of a replicated application that are down simultaneously due to voluntary disruptions, such as a node drain. This is especially helpful during cluster upgrades, as nodes are drained and upgraded in an order that respects the PDBs.

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: test-svc
spec:
  maxUnavailable: 1
  selector:
    matchLabels:
      app: test-svc
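
The same protection can also be expressed with minAvailable instead of maxUnavailable (a PDB accepts only one of the two fields). The following is a sketch of that alternative, not an additional object to apply alongside the one above; the value 3 is an assumption based on a deployment running 4 replicas.

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: test-svc
spec:
  # Assumed: the application runs 4 replicas, so keeping 3 available
  # allows at most 1 pod to be evicted at a time.
  minAvailable: 3
  selector:
    matchLabels:
      app: test-svc
```

When the replica count changes with autoscaling, maxUnavailable is usually the easier of the two to reason about, because it does not need to be adjusted as the number of pods grows.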

Pod anti-affinity

As stated in the official documentation, pod anti-affinity allows you to constrain which nodes your pod is eligible to be scheduled on, based on the labels of pods that are already running on a node rather than on the labels of the node itself. The rule is of the form “this pod should not run in X if that X is already running one or more pods that meet rule Y”. Y is expressed as a LabelSelector.

Conceptually, X is a topology domain like a node, cloud provider zone, cloud provider region, etc. You express it using a topologyKey.

In the example below, a pod cannot be scheduled on a node that is already running a pod with the label app: app-name.

spec:
  affinity:
    podAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchExpressions:
          - key: app
            operator: In
            values:
            - app-name
        topologyKey: kubernetes.io/hostname

Up until recently, we’ve been using pod anti-affinity to spread pods across Spot instances, reducing the risk of downtime. However, pod anti-affinity doesn’t work very well when you have an application with a large number of replicas, or one that scales up and down based on a cron schedule (have a look at my article about Event Driven Autoscaling). For example, an application scaling up to 30 pods would require 30 different nodes to schedule all 30 pods. This is neither efficient nor performant, as the application would have to wait for new nodes to be added to the cluster, and these nodes would most likely be half empty.

Pod Topology Spread Constraints fix this issue.

Pod Topology Spread Constraints

Promoted to stable in Kubernetes v1.19, Pod Topology Spread Constraints help you control how Pods are spread across your cluster among domains such as regions, zones, nodes, and other user-defined topology domains. This can help achieve high availability as well as efficient resource utilisation.

With the following configuration, we are able to schedule pods evenly across different availability zones, and across different nodes:

topologySpreadConstraints:
  - labelSelector:
      matchLabels:
        app: test-svc
    maxSkew: 1
    topologyKey: topology.kubernetes.io/zone
    whenUnsatisfiable: DoNotSchedule
  - labelSelector:
      matchLabels:
        app: test-svc
    maxSkew: 2
    topologyKey: kubernetes.io/hostname
    whenUnsatisfiable: ScheduleAnyway

Unlike pod anti-affinity, with Pod Topology Spread Constraints we could have multiple pods for the same application running on the same node. However, as we make use of 3 availability zones, the first topology constraint will make sure there are at least 3 pods running on 3 different instances. The second topology constraint will try to schedule the 4th pod on a different instance (topologyKey: kubernetes.io/hostname), but this is not guaranteed (ScheduleAnyway tells the scheduler to still schedule it while prioritising nodes that minimise the skew).
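
For context, these constraints live in the pod template of the workload. The following is a minimal sketch of a Deployment carrying the configuration above. The image name is a hypothetical placeholder, and replicas: 4 is assumed from the minimum replica count discussed earlier.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: test-svc
spec:
  replicas: 4  # assumed: the minimum replica count discussed earlier
  selector:
    matchLabels:
      app: test-svc
  template:
    metadata:
      labels:
        app: test-svc  # must match the labelSelector of each constraint
    spec:
      topologySpreadConstraints:
        # Hard rule: zones may differ by at most 1 matching pod
        - labelSelector:
            matchLabels:
              app: test-svc
          maxSkew: 1
          topologyKey: topology.kubernetes.io/zone
          whenUnsatisfiable: DoNotSchedule
        # Soft rule: prefer spreading across nodes, but never block scheduling
        - labelSelector:
            matchLabels:
              app: test-svc
          maxSkew: 2
          topologyKey: kubernetes.io/hostname
          whenUnsatisfiable: ScheduleAnyway
      containers:
        - name: test-svc
          image: registry.example.com/test-svc:1.0.0  # hypothetical image
```

With 4 replicas and 3 zones that all have eligible nodes, the zone constraint allows a 2/1/1 split but rejects 2/2/0 or 3/1/0, since those would leave a difference of more than 1 between the fullest and the emptiest zone.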

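maxSkew describes the degree to which Pods may be unevenly distributed. You must specify this field and the number must be greater than zero. Its semantics differ according to the value of whenUnsatisfiable: with DoNotSchedule, maxSkew is the maximum permitted difference between the number of matching pods in a topology domain and the global minimum (the smallest number of matching pods in any eligible domain); with ScheduleAnyway, the scheduler gives higher precedence to topologies that would help reduce the skew.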

Other settings that help with reliability

There are other settings that help when running highly available applications in Kubernetes, whether on Spot instances or not:

Horizontal Pod Autoscaling (HPA): automatically updates a workload resource (such as a Deployment or StatefulSet), scaling it to match demand (e.g. CPU, memory, or any other metric if you’re using a metrics adapter, like KEDA).
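
As an illustration, a CPU-based HPA for the test-svc Deployment could look like the sketch below. The maxReplicas value and the 70% utilisation target are assumptions chosen for the example; minReplicas: 4 mirrors the minimum replica count discussed earlier.

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: test-svc
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: test-svc
  minReplicas: 4   # never scale below the minimum discussed earlier
  maxReplicas: 12  # assumed upper bound, for illustration only
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70  # assumed target, tune per application
```

Resource-based targets like this one are calculated against the CPU requests of the containers, so the pods need CPU requests set for the autoscaler to work.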

Termination grace period: after a SIGTERM signal is sent to the pod, Kubernetes waits for a specified time called the termination grace period. By default, this is 30 seconds.

If your app finishes shutting down and exits before the termination grace period is over, Kubernetes moves to the next step immediately. If your pod usually takes longer than 30 seconds to shut down, make sure you increase the grace period by setting terminationGracePeriodSeconds in the pod spec.
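
A minimal sketch of how the grace period is set is shown below. The 90-second value and the image name are illustrative assumptions, not values taken from our setup.

```yaml
# Fragment of a pod spec (for a Deployment, this sits under spec.template.spec)
spec:
  terminationGracePeriodSeconds: 90  # illustrative value; the default is 30
  containers:
    - name: test-svc
      image: registry.example.com/test-svc:1.0.0  # hypothetical image
```

Keep in mind that a Spot interruption notice only gives around two minutes of warning, so a very long grace period may not be honoured in full when the instance itself is reclaimed.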

rollingUpdate strategy: set the right values for the maxSurge and maxUnavailable parameters in your rolling update strategy to avoid having multiple copies of your application down during a rollout.
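
One conservative combination is sketched below, assuming a Deployment with 4 replicas. The specific values are an example rather than a recommendation for every workload.

```yaml
# Fragment of a Deployment spec
spec:
  replicas: 4  # assumed: the minimum replica count discussed earlier
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1        # at most 1 extra pod above the desired count during a rollout
      maxUnavailable: 0  # never drop below the desired count while rolling
```

With these values a rollout adds one new pod, waits for it to become ready, and only then removes an old one. This is slower than the defaults (25% for both fields) but keeps full capacity throughout.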

Conclusion

Using Spot instances to run resilient production workloads is not as scary as it may sound, provided applications are configured following some (or all) of the settings mentioned in this article. We’ve been doing this since the very beginning, taking advantage of the cost savings while achieving high levels of reliability to give the best experience to our members.