Blog Post

Data Skew in Data Engineering: What It Is and How to Fix It

,

You kick off a distributed job expecting it to finish in minutes — but one task keeps running while all others have long since completed. The culprit is almost always data skew. It is one of the most common performance killers in distributed data engineering, yet it is often misdiagnosed or ignored until it starts breaking SLAs.

Data skew happens when data is unevenly distributed across partitions or compute nodes. In systems like Apache Spark, Amazon Redshift, or Azure Synapse, the workload is designed to run in parallel. When one partition holds a disproportionate share of the data — because of a high-frequency key or a poor distribution choice — that one node does all the heavy lifting while the rest sit idle. The job cannot complete until every partition finishes, so your entire pipeline is only as fast as the slowest, most overloaded worker.

1. What causes data skew

The root cause is almost always a skewed join or grouping key. If 60% of your orders belong to a single customer_id, or your event log has millions of rows for one status = 'active' value, Spark or your MPP engine will route all of those rows to the same partition during a shuffle. The result: one executor handles millions of rows while ten others process a few hundred.

In Amazon Redshift, the problem manifests at the storage layer. If your DISTKEY column has low cardinality or a heavily concentrated value distribution, Redshift routes too many rows to the same compute node slice, forcing it to do more work than the others during every query that touches that table.

Types of data skew

Data skew is not a single pattern — it shows up in different shapes depending on the root cause. The four most common types in distributed pipelines:

Key skew — hot partition key

One key (e.g. a single customer_id) floods one partition while all others stay near-empty.

Partition skew — uneven data split

Wide size variation across partitions due to a low-cardinality key like status or country.

Join skew — NULL key explosion

A large share of rows with NULL or a default value all land in the same executor during shuffle.

Temporal skew — time-based spike

Event data clusters at specific timestamps (e.g. end-of-day batch), overloading those time-aligned partitions.

Normal partition

Skewed / hot partition

Ideal even distribution / spike window

2. Issues data skew causes in your pipelines

The consequences go beyond slow queries. Here are the most damaging effects data skew causes in practice:

  • Pipeline bottlenecks: A single overloaded partition stalls the entire stage. Every downstream task waits on it.
  • Out-of-memory errors: The skewed executor receives far more data than it can fit in memory, causing spills to disk or outright failures.
  • Wasted cluster resources: Most nodes finish quickly and remain idle while one struggles. You are paying for compute that does nothing.
  • Unpredictable SLA breaches: The job duration becomes tied to the size of the hot partition rather than total data volume, making runtime estimates unreliable.
  • Cascading failures in multi-step ETL: A skew-induced timeout in one stage can trigger retries, delay downstream jobs, and corrupt incremental watermarks.

3. Detecting skew before it becomes an issue

In Spark (including Azure Synapse Analytics and Databricks), open the Spark UI and look at the task duration histogram under a shuffle stage. A long tail of one task while all others finish quickly is the signature of skew. You can also profile the key distribution directly before running a heavy join:

# Quick skew check in PySpark
df.groupBy("customer_id")
  .count()
  .orderBy(col("count").desc())
  .show(20)
  

In Redshift, the SVV_TABLE_INFO view exposes a skew_rows column that reports the ratio between the most-populated and least-populated node slices. AWS documentation recommends reviewing the distribution key on any table with a skew_rows value of 4.0 or higher — values below that are generally acceptable natural imbalance. The Identifying tables with data skew guide in the AWS docs walks through how to interpret this output.

-- Check skew_rows ratio in Redshift
SELECT "table", diststyle, skew_rows
FROM   svv_table_info
WHERE  skew_rows >= 4.0
ORDER BY skew_rows DESC;
  

4. Technique: Salting to break hot keys

Salting is the most widely used fix for skewed joins and aggregations. The idea is to append a random integer suffix to the skewed key, artificially spreading one large partition across multiple smaller ones. On the smaller (lookup) side of the join, you replicate each row once per salt value so that both sides still match correctly. After the join, you strip the salt column.

The Azure Synapse Spark performance guide explicitly recommends salting the entire key — or only the subset of skewed keys — as the primary remedy for shuffle-induced slowdowns.

from pyspark.sql import functions as F

SALT_BUCKETS = 10

# Salt the large (skewed) table
df_large = df_large.withColumn(
    "salt", (F.rand() * SALT_BUCKETS).cast("int")
)

# Explode the small (lookup) table to match all salt values
df_small = df_small.withColumn(
    "salt", F.explode(F.array([F.lit(i) for i in range(SALT_BUCKETS)]))
)

# Join on both the original key and the salt
result = df_large.join(df_small, ["customer_id", "salt"]).drop("salt")
  

5. Technique: Broadcast joins for small dimension tables

When one side of a join is small enough to fit in executor memory, a broadcast join eliminates the shuffle entirely. Instead of moving data between nodes, Spark sends a full copy of the small table to every executor, so each worker can join locally. This is the fastest fix when the skewed side is the larger table and the lookup side is compact.

from pyspark.sql.functions import broadcast

result = df_large.join(
    broadcast(df_small), "customer_id"
)
  

In Redshift, the equivalent approach is setting DISTSTYLE ALL on small dimension tables. This stores a full copy of the table on every compute node, so join colocation is guaranteed without any redistribution at query time. The Redshift distribution styles documentation recommends this for slow-moving reference tables under a few hundred megabytes.

6. Technique: Adaptive Query Execution (Spark 3.x)

Since Spark 3.0, Adaptive Query Execution (AQE) can detect and automatically split skewed partitions at runtime without any manual intervention. When AQE identifies a task whose partition size significantly exceeds the median, it breaks that partition into smaller sub-tasks and replicates the corresponding data from the other join side to match. You can enable it with two configuration properties:

spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")
  

AQE is a strong first line of defence on modern Spark environments. If it does not resolve the skew fully — for example when the hot key is extreme — fall back to manual salting as described in the previous section.

7. Fixing skew in Amazon Redshift: choosing the right DISTKEY

In Redshift, skew is a storage-level problem. The fix is selecting a distribution key with high cardinality and uniform value distribution. Columns like order_id or event_id make far better distribution keys than status or country_code. The AWS prescriptive guidance on Redshift table design recommends validating cardinality before committing to a key column — if you notice skewness, change the distribution key.

For tables used heavily in joins, align the distribution key with the join column used most often. This ensures that matching rows already reside on the same node slice, eliminating costly data redistribution at query time. If automatic table optimization is enabled, Redshift will manage distribution style changes in the background as table sizes evolve.

Wrapping Up

Data skew is a structural problem that compounds over time: as tables grow, hot keys attract more rows, and pipelines that once ran in minutes begin taking hours. The pattern to follow is consistent — profile key distributions early, detect skew via the Spark UI or system views like SVV_TABLE_INFO in Redshift, and apply the right fix for the context: salting for heavy shuffle joins, broadcast joins when the lookup side is small, AQE for automatic handling in Spark 3.x, and a well-chosen DISTKEY for Redshift table design. The most reliable rule: never use a column with low cardinality as a partition or distribution key.

Original post (opens in new tab)

Rate

You rated this post out of 5. Change rating

Share

Share

Rate

You rated this post out of 5. Change rating