Skip to main content
Version: 0.6

Managing your Redis Cluster

Provisioning

Your Redis cluster can either be CPU bound (limited by CPU) or memory bound (limited by memory). In both cases, you will need to scale out the Redis cluster. Unlike Dynamo which either auto-scales or can be scaled by specifying QPS, with Redis you will need to add shards to your cluster. This includes selecting the shard size and the number of shards.

While Redis cluster usage (both CPU and memory) can vary depending on the read and write workload, we have generally observed that the shard cache.m5.2xlarge can handle a peak of 25,000 aggregate read and write QPS and 25 GB of data. While this is the peak, we typically suggest not going above 75% CPU or memory capacity. Hence for provisioning, we typically use the heuristic that one cache.m5.2xlarge shard can handle 18,000 QPS and 18GB of data in memory. Additionally, we strongly suggest customers have one replica shard for every primary shard.

Let’s look at a few provisioning examples.

Total QPSTotal Dataset SizePrimary Node CountTotal Node Count (with replicas)Comments
10,000100 GB612Memory bound cluster
200,000100 GB1224CPU bound cluster
500,000500 GB2856Both CPU and memory bound cluster

In certain situations, it is possible to select a cheaper node type than cache.m5.2xlarge by selecting one that is compute/memory optimized or has SSD. However, these situations are out of scope for this analysis.

Scaling

We suggest adding shards to your cluster (preferably of the same shard type) when you hit 75% of CPU or memory. Additionally, scaling should preferably be done at off-peak times.

When you add shards, Redis needs to reshard your data, by moving keys and values from the original shards to new shards, to rebalance your data across the new cluster. For a given key, this typically involves a read from the original shard, a write to the new shard, and a delete from the original shard. Since Redis is single threaded, this occurs serially for a given shard, but parallelly across shards.

Re-sharding time depends on:

  • The total amount of data that needs to be moved.
  • The number of existing shards and the number of new shards to be added.
  • Whether the data that needs to be moved is being read from memory or disk.
    • If the cluster is a data-tiering cluster, some keys will be fetched from the disk, which increases processing time.

It can take up to 20-30 seconds to migrate one slot in a Redis cluster. A Redis cluster has 16384 slots. If your cluster currently has 8 nodes and you try to double the cluster, then you will be migrating 8192 slots (as only half of the data will move to new nodes). Slot migration will happen serially for one shard but parallelly across shards. Hence, the expected best case time will be (8192 (slots) * 30 (seconds/slot)/ 8 shards (in parallel)) = 8.5 hours. The actual time will be impacted by other ongoing operations in the cluster.

While a cluster is being scaled, the cluster will be in a Modifying state. When in this state:

  • The scaling operation cannot be cancelled and no actions can be performed in the cluster.
  • It is possible for materialization jobs to be impacted.

Memory management

Redis stores all of its data in memory. This is primarily why it performs so well.

There is also a version of Redis that has data tiering, where all the data does not need to fit in memory and some of it can be stored on an SSD attached to each node. However, even for such Redis nodes, all reads and writes go through the memory and are moved to SSD (based on key last usage).

Overview of memory fragmentation

Memory on a Redis node can be challenging to maintain due to fragmentation.

Key deletions are a cause of memory fragmentation. Key deletions occur when feature views are deleted, individual keys expire due to TTL enforcement, or because of data movement due to scaling.

High memory fragmentation may cause a Redis cluster to run out of memory, resulting in failure of subsequent writes or the crash of the cluster.

Managing memory fragmentation

The activedefrag configuration setting can help to reduce fragmentation, but has a CPU usage trade-off.

You should always set activedefrag to yes. Control the CPU resources it uses by configuring active-defrag-threshold-lower and active-defrag-threshold-upper. Note that if fragmentation is high and the CPU resources available for defrag are low, then de-fragmentation can take a long time. This can be mitigated by running defrag with a higher CPU usage at off-peak times.

Suggested Elasticache parameters

We suggest applying the following parameter settings to your Elasticache cluster.

  1. Parameter Name: cluster-allow-reads-when-down
    1. Type: string
    2. Default Value: no
    3. Suggested Value: yes
    4. Why: When set to yes, a Redis (cluster mode enabled) replication group continues to process read commands even when a node is not able to reach a quorum of primaries. This allows us to prioritize reads and feature serving to models.
  2. Parameter Name: activedefrag
    1. Type: boolean
    2. Default Value: no
    3. Suggested Value: yes
    4. Why: When there are writes and deletes happening, memory within a node can become very fragmented. As such, you can have some nodes with high memory consumption and some with less. Setting this parameter to yes enables a background process to reclaim memory more efficiently.
  3. Parameter Name: maxmemory-policy
    1. Type: string
    2. Default Value: volatile-lru
    3. Suggested Value: noeviction
    4. Why: We are using Redis as the primary data store and do not want data to get evicted silently. As such, when memory reaches capacity, new writes will fail.
  4. Parameter Name: reserved-memory-percent
    1. Type: integer
    2. Default Value: 25
    3. Suggested Value: 25
    4. Why: This parameter is important for manager backup and failover operations. While decreasing this can increase capacity, if you don’t have sufficient memory available for all the writes, the process fails. Backups may also affect the latency of services using redis as a backend. See more here about manage latency increases cause by backups.

Monitoring

See this document for information on monitoring your Elasticache metrics in Cloudwatch.

Tecton also shows the following metrics in the Web UI. These metrics are located on the Online Store Monitoring tab, which appears when clicking Services on the left navigation bar.

  • Total Redis Serving QPS
  • Redis Read Latencies
  • Number of primary and replica nodes in the cluster
  • Memory Utilization
  • Total number of keys in the cluster

Alerting

We strongly suggest adding Cloudwatch alerts for CPU and memory consumption of the nodes in the cluster.

Was this page helpful?

Happy React is loading...