
Performance Mode

DataStore has two compatibility modes that control whether output is shaped for pandas compatibility or optimized for raw SQL performance.

Overview

| Mode | `compat_mode` value | Description |
|---|---|---|
| Pandas (default) | `"pandas"` | Full pandas behavior compatibility: row order preserved, MultiIndex, `set_index`, dtype corrections, stable sort tiebreakers, `-If`/`isNaN` wrappers. |
| Performance | `"performance"` | SQL-first execution. All pandas compatibility overhead removed. Maximum throughput, but results may differ structurally from pandas. |

What Performance Mode Disables

| Overhead | Pandas mode behavior | Performance mode behavior |
|---|---|---|
| Row-order preservation | `_row_id` injection, `rowNumberInAllBlocks()`, `__orig_row_num__` subqueries | Disabled: row order not guaranteed |
| Stable sort tiebreaker | `rowNumberInAllBlocks() ASC` appended to `ORDER BY` | Disabled: ties may have arbitrary order |
| Parquet `preserve_order` | `input_format_parquet_preserve_order=1` | Disabled: parallel Parquet reading allowed |
| GroupBy auto `ORDER BY` | `ORDER BY group_key` added (pandas default `sort=True`) | Disabled: groups returned in arbitrary order |
| GroupBy dropna `WHERE` | `WHERE key IS NOT NULL` added (pandas default `dropna=True`) | Disabled: NULL groups included |
| GroupBy `set_index` | Group keys set as index | Disabled: group keys stay as columns |
| MultiIndex columns | `agg({'col': ['sum','mean']})` returns MultiIndex columns | Disabled: flat column names (`col_sum`, `col_mean`) |
| `-If`/`isNaN` wrappers | `sumIf(col, NOT isNaN(col))` for skipna | Disabled: plain `sum(col)` (ClickHouse natively skips NULL) |
| `toInt64` on count | `toInt64(count())` to match pandas int64 | Disabled: native SQL dtype returned |
| `fillna(0)` for all-NaN sum | Sum of an all-NaN group returns 0 (pandas behavior) | Disabled: returns NULL |
| Dtype corrections | `abs()` unsigned→signed, etc. | Disabled: native SQL dtypes |
| Index preservation | Restores original index after SQL execution | Disabled |
| `first()`/`last()` | `argMin`/`argMax(col, rowNumberInAllBlocks())` | `any(col)` / `anyLast(col)`: faster but non-deterministic |
| Single-SQL aggregation | ColumnExpr groupby materializes an intermediate DataFrame | Injects `LazyGroupByAgg` into the lazy ops chain: single SQL query |

Enabling Performance Mode

Using config object

from chdb.datastore.config import config

# Enable performance mode
config.use_performance_mode()

# Back to pandas compatibility
config.use_pandas_compat()

# Check current mode
print(config.compat_mode)  # 'pandas' or 'performance'

Using module-level functions

from chdb.datastore.config import set_compat_mode, CompatMode, is_performance_mode

# Enable performance mode
set_compat_mode(CompatMode.PERFORMANCE)

# Check
print(is_performance_mode())  # True

# Back to default
set_compat_mode(CompatMode.PANDAS)
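The functions above suggest an enum-backed module-level toggle. A minimal sketch of how such a switch could be structured (hypothetical re-implementation for illustration only, not the actual chdb source):

```python
from enum import Enum

class CompatMode(Enum):
    PANDAS = "pandas"
    PERFORMANCE = "performance"

# pandas compatibility is the default
_current_mode = CompatMode.PANDAS

def set_compat_mode(mode: CompatMode) -> None:
    """Switch the global compatibility mode."""
    global _current_mode
    _current_mode = mode

def is_performance_mode() -> bool:
    return _current_mode is CompatMode.PERFORMANCE
```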

Using convenience imports

from chdb import use_performance_mode, use_pandas_compat

use_performance_mode()
# ... high-performance operations ...
use_pandas_compat()
Note

Setting performance mode automatically sets the execution engine to chdb. You do not need to call config.use_chdb() separately.


When to Use Performance Mode

Use performance mode when:

  • Processing large datasets (hundreds of thousands to millions of rows)
  • Running aggregation-heavy workloads (groupby, sum, mean, count)
  • Row order does not matter (e.g., aggregated results, reports, dashboards)
  • You want maximum SQL throughput and minimal overhead
  • Memory usage is a concern (parallel Parquet reading, no intermediate DataFrames)

Stay in pandas mode when:

  • You need exact pandas behavior (row order, MultiIndex, dtypes)
  • You rely on first()/last() returning the true first/last row
  • You use shift(), diff(), cumsum() that depend on row order
  • You're writing tests that compare DataStore output with pandas

Behavior Differences

Row Order

In performance mode, row order is not guaranteed for any operation. This includes:

  • Filter results
  • GroupBy aggregation results
  • head() / tail() without explicit sort_values()
  • first() / last() aggregations

If you need ordered results, add an explicit sort_values():

config.use_performance_mode()

ds = pd.read_csv("data.csv")

# Unordered (fast)
result = ds.groupby("region")["revenue"].sum()

# Ordered (still fast, just adds ORDER BY)
result = ds.groupby("region")["revenue"].sum().sort_values()
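The same trade-off is visible in plain pandas, where `sort=False` leaves groups in encounter order. This pure-pandas sketch (illustrative only) shows why an explicit sort is the right fix when order matters:

```python
import pandas as pd

df = pd.DataFrame({"region": ["west", "east", "west", "south"],
                   "revenue": [10, 20, 30, 45]})

# sort=False returns groups in first-appearance order,
# analogous to performance mode's arbitrary group order
unordered = df.groupby("region", sort=False)["revenue"].sum()
assert list(unordered.index) == ["west", "east", "south"]

# an explicit sort makes the output deterministic
ordered = unordered.sort_values(ascending=False)
assert list(ordered.index) == ["south", "west", "east"]
```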

GroupBy Results

| Aspect | Pandas mode | Performance mode |
|---|---|---|
| Group key location | Index (via `set_index`) | Regular column |
| Group order | Sorted by key (default) | Arbitrary order |
| NULL groups | Excluded (default `dropna=True`) | Included |
| Column format | MultiIndex for multi-agg | Flat names (`col_func`) |
| `first()`/`last()` | Deterministic (row order) | Non-deterministic (`any()`/`anyLast()`) |
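For reference, the performance-mode column can be approximated in pure pandas with explicit groupby flags (illustrative only; DataStore does this at the SQL level):

```python
import pandas as pd

df = pd.DataFrame({"cat": ["a", None, "a"], "val": [1, 2, 3]})

# pandas defaults: key becomes the index, groups sorted, NULL keys dropped
default = df.groupby("cat")["val"].sum()
assert list(default.index) == ["a"] and default.loc["a"] == 4

# performance-mode-like shape: key stays a column, no sorting, NULL group kept
perf_like = df.groupby("cat", sort=False, dropna=False, as_index=False)["val"].sum()
assert len(perf_like) == 2  # includes the NULL group
```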

Aggregation

config.use_performance_mode()

# Sum of all-NaN group returns NULL (not 0)
# Count returns native uint64 (not forced int64)
# No -If wrappers: sum() instead of sumIf()
result = ds.groupby("cat")["val"].sum()
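The pandas side of the all-NaN difference can be checked directly (pure pandas; in performance mode the corresponding SQL result would be NULL rather than 0):

```python
import pandas as pd
import numpy as np

s = pd.Series([np.nan, np.nan])

# pandas: sum of an all-NaN series is 0.0 -- pandas mode emulates this
# with a fillna(0) step; performance mode skips it and returns NULL
assert s.sum() == 0.0

# pandas: count() yields an int64 result, which pandas mode matches via
# toInt64(count()); performance mode keeps ClickHouse's native UInt64
assert s.count() == 0
```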

Single-SQL Execution

In performance mode, ColumnExpr groupby aggregation (e.g., ds[condition].groupby('col')['val'].sum()) is executed as a single SQL query instead of the two-step process used in pandas mode:

config.use_performance_mode()

# Pandas mode: two SQL queries (filter → materialize → groupby)
# Performance mode: one SQL query (WHERE + GROUP BY in same query)
result = ds[ds["rating"] > 3.5].groupby("category")["revenue"].sum()

# Generated SQL (single query):
# SELECT category, sum(revenue) FROM data WHERE rating > 3.5 GROUP BY category

This eliminates the intermediate DataFrame materialization and can significantly reduce memory usage and execution time.
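The mechanics can be sketched with a toy lazy-query builder that folds a filter predicate and a groupby aggregation into one statement (hypothetical sketch; the actual `LazyGroupByAgg` internals in chdb will differ):

```python
class LazyQuery:
    """Toy builder: accumulates a WHERE predicate and a GROUP BY
    aggregation, then renders both into a single SQL statement."""

    def __init__(self, table):
        self.table = table
        self.where = None
        self.agg = None  # (key, func, col)

    def filter(self, predicate):
        self.where = predicate
        return self

    def groupby_sum(self, key, col):
        self.agg = (key, "sum", col)
        return self

    def to_sql(self):
        key, func, col = self.agg
        where = f" WHERE {self.where}" if self.where else ""
        return f"SELECT {key}, {func}({col}) FROM {self.table}{where} GROUP BY {key}"

q = LazyQuery("data").filter("rating > 3.5").groupby_sum("category", "revenue")
# q.to_sql() renders filter and aggregation as one query
```

Because nothing executes until `to_sql()` is rendered, no intermediate result is materialized between the filter and the aggregation.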


Comparison with Execution Engine

Performance mode (compat_mode) and execution engine (execution_engine) are independent configuration axes:

ConfigControlsValues
execution_engineWhich engine runs the computationauto, chdb, pandas
compat_modeWhether to reshape output for pandas compatibilitypandas, performance

Setting compat_mode='performance' automatically sets execution_engine='chdb', since performance mode is designed for SQL execution.

from chdb.datastore.config import config

# These are independent
config.use_chdb()              # Force chDB engine, keep pandas compat
config.use_performance_mode()  # Force chDB + remove pandas overhead

Testing with Performance Mode

When writing tests for performance mode, results may differ from pandas in row order and structural format. Use these strategies:

Sort-then-compare (aggregations, filters)

import numpy as np

# ds: a DataStore frame; pd_df: the equivalent pandas DataFrame.
# Sort both sides by the same columns before comparing
ds_result = ds.groupby("cat")["val"].sum()
pd_result = pd_df.groupby("cat")["val"].sum()

ds_sorted = ds_result.sort_index()
pd_sorted = pd_result.sort_index()
np.testing.assert_array_equal(ds_sorted.values, pd_sorted.values)

Value-range check (first/last)

# first() with any() returns an arbitrary element from the group
result = ds.groupby("cat")["val"].first()
for group_key in groups:
    assert result.loc[group_key] in group_values[group_key]
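A self-contained pandas version of the same pattern (here `first()` is deterministic, but the membership check is what carries over to performance mode):

```python
import pandas as pd

df = pd.DataFrame({"cat": ["a", "a", "b"], "val": [1, 2, 3]})
group_values = df.groupby("cat")["val"].apply(set).to_dict()

# performance mode may return any element of the group,
# so assert membership rather than equality with a specific row
result = df.groupby("cat")["val"].first()
for group_key, value in result.items():
    assert value in group_values[group_key]
```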

Schema-and-count (LIMIT without ORDER BY)

# head() without sort_values: row set is non-deterministic
result = ds.head(5)
assert len(result) == 5
assert set(result.columns) == expected_columns
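The same check as a runnable pandas sketch (column names are illustrative):

```python
import pandas as pd

df = pd.DataFrame({"id": range(20), "name": ["x"] * 20})
expected_columns = {"id", "name"}

# head() without an explicit sort: assert only on row count and schema,
# never on which specific rows were returned
result = df.head(5)
assert len(result) == 5
assert set(result.columns) == expected_columns
```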

Best Practices

1. Enable early in your script

from chdb.datastore.config import config

config.use_performance_mode()

# All subsequent operations benefit
ds = pd.read_parquet("data.parquet")
result = ds[ds["amount"] > 100].groupby("region")["amount"].sum()

2. Add explicit sorting when order matters

# For display or downstream processing that expects order
result = (ds
    .groupby("region")["revenue"].sum()
    .sort_values(ascending=False)
)

3. Use for batch/ETL workloads

config.use_performance_mode()

# ETL pipeline — order doesn't matter, throughput does
summary = (ds
    .filter(ds["date"] >= "2024-01-01")
    .groupby(["region", "product"])
    .agg({"revenue": "sum", "quantity": "sum", "rating": "mean"})
)
summary.to_df().to_parquet("summary.parquet")

4. Switch modes within a session

# Performance mode for heavy computation
config.use_performance_mode()
aggregated = ds.groupby("cat")["val"].sum()

# Back to pandas mode for exact-match comparison
config.use_pandas_compat()
detailed = ds[ds["val"] > 100].head(10)