Blog Post

Stop Using Pandas for Aggregations — Try DuckDB Instead

,

If you've ever loaded a 2 GB CSV into pandas just to run a few aggregations — and watched your machine struggle — there's a better tool for the job.

DuckDB is a free, open-source, in-process OLAP database that runs SQL directly inside your application, with no server to set up and no external dependencies.

Think of it as SQLite for analytics: install it in seconds, query Parquet files, CSVs, JSON, or even pandas DataFrames, and get columnar-speed performance on your laptop.

1. Installing DuckDB

DuckDB is available for Python, R, the CLI, and several other runtimes. The most common way to get started is via pip or the CLI binary from the official installation page.

# Python
pip install duckdb
 
# R
install.packages("duckdb")
 
# macOS CLI via Homebrew
brew install duckdb
 
# Windows CLI via winget
winget install DuckDB.cli
  

2. What "in-process" actually means

Traditional databases run as a separate server process. Your application connects over a network socket, serializes the query, waits for the server to deserialize and execute it, and then serializes the result back. That overhead adds up quickly, especially for analytical workloads.

DuckDB runs inside your process — no server, no network round-trip. When you use the Python package, for example, DuckDB can run queries directly on an existing pandas DataFrame without ever copying the data. This makes it extremely fast for ad-hoc analytics and ETL transformations.

# Install
pip install duckdb
 
# Query a pandas DataFrame directly — no data copy
import duckdb, pandas as pd
 
df = pd.read_csv("sales.csv")
 
result = duckdb.sql("""
    SELECT region, SUM(amount) AS total
    FROM df
    GROUP BY region
    ORDER BY total DESC
""").df()
  

3. Querying files directly — no table creation needed

One of DuckDB's most practical features is its ability to query files directly from the FROM clause. You can point it at a local Parquet file, a remote CSV over HTTPS, or even an S3 bucket — without defining a schema or creating a table first.

DuckDB auto-detects column names and types, so exploratory work is fast. This is documented in the official DuckDB Guides.

-- Query a local Parquet file
SELECT customer_id, COUNT(*) AS orders
FROM   'orders_2024.parquet'
WHERE  status = 'completed'
GROUP BY customer_id
ORDER BY orders DESC
LIMIT  10;
 
-- Or query a remote CSV over HTTPS (httpfs extension auto-loads)
SELECT COUNT(*) FROM 'https://data.example.com/events.csv';
  

4. Where DuckDB fits in a data engineering workflow

DuckDB is not a replacement for SQL Server, Redshift, or PostgreSQL in transactional or multi-user systems. Its strength is in local analytical processing — the kind of work that typically lands in a Python script or Jupyter notebook. Common patterns include:

  • ETL pre-processing: filter, join, and reshape raw CSV or Parquet files before loading into a data warehouse.
  • Format conversion: read JSON or CSV and write out clean Parquet with a single SQL statement using COPY ... TO.
  • Ad-hoc analytics on exports: query a data warehouse export locally without spinning up another cloud service.
  • Privacy-first analysis: process sensitive data (health records, financial statements) entirely on-premises, with data never leaving the machine.
-- Convert a CSV to Parquet in one shot
COPY (
    SELECT *
    FROM   'raw_data.csv'
    WHERE  event_date >= '2024-01-01'
) TO 'clean_data.parquet' (FORMAT PARQUET);
  

5. Columnar execution — why it's fast

DuckDB uses a columnar-vectorized query engine. Instead of processing one row at a time (like PostgreSQL or MySQL do internally for OLTP), DuckDB processes large batches of column values together in a single CPU operation. This reduces the per-value overhead dramatically, making aggregations and scans over wide tables significantly faster than row-oriented engines. For a data engineer, this means you can comfortably run GROUP BY queries over tens of millions of rows on a standard laptop in seconds.

Wrapping Up

DuckDB fills a gap that pandas alone struggles with: fast, SQL-native, columnar analytics that run in-process with zero infrastructure. It reads Parquet, CSV, and JSON natively; integrates cleanly into Python and R workflows; and handles files too large for memory by spilling to disk automatically. For data engineers doing local transformations, format conversions, or exploratory analysis on raw exports, DuckDB is often the fastest path from raw data to insight. Start with the official DuckDB documentation — the installation takes under a minute, and the CLI is ready to query immediately.

Original post (opens in new tab)

Rate

You rated this post out of 5. Change rating

Share

Share

Rate

You rated this post out of 5. Change rating