
Key Differences from pandas

While DataStore is highly compatible with pandas, there are important differences to understand.

Summary Table

| Aspect | pandas | DataStore |
|---|---|---|
| Execution | Eager (immediate) | Lazy (deferred) |
| Return types | DataFrame/Series | DataStore/ColumnExpr |
| Row order | Preserved | Preserved (automatic); not guaranteed in performance mode |
| inplace | Supported | Not supported |
| Index | Full support | Simplified |
| Memory | All data in memory | Data at source |

1. Lazy vs Eager Execution

pandas (Eager)

Operations execute immediately:

import pandas as pd

df = pd.read_csv("data.csv")  # Loads entire file NOW
result = df[df['age'] > 25]   # Filters NOW
grouped = result.groupby('city')['salary'].mean()  # Aggregates NOW

DataStore (Lazy)

Operations are deferred until results are needed:

from chdb import datastore as pd

ds = pd.read_csv("data.csv")  # Just records the source
result = ds[ds['age'] > 25]   # Just records the filter
grouped = result.groupby('city')['salary'].mean()  # Just records

# Execution happens here:
print(grouped)        # Executes when displaying
df = grouped.to_df()  # Or when converting to pandas

Why It Matters

Lazy execution enables:

  • Query optimization: Multiple operations compile to one SQL query
  • Column pruning: Only needed columns are read
  • Filter pushdown: Filters apply at the source
  • Memory efficiency: Don't load data you don't need
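To make the idea concrete, here is a minimal toy sketch of lazy query recording. This is purely illustrative and not DataStore's actual internals: each operation appends to a plan, and nothing touches the data until the plan is compiled into a single SQL string.

```python
# Toy illustration of lazy query building -- NOT DataStore's real internals.
class LazyTable:
    def __init__(self, source):
        self.source = source      # where the data lives
        self.filters = []         # recorded conditions, not yet executed
        self.columns = None       # None means "all columns"

    def filter(self, condition):
        new = LazyTable(self.source)
        new.filters = self.filters + [condition]  # immutable: copy the plan
        new.columns = self.columns
        return new

    def select(self, columns):
        new = LazyTable(self.source)
        new.filters = list(self.filters)
        new.columns = columns
        return new

    def to_sql(self):
        # All recorded steps collapse into ONE query: column pruning and
        # filter pushdown fall out of the plan naturally.
        cols = ", ".join(self.columns) if self.columns else "*"
        sql = f"SELECT {cols} FROM file('{self.source}')"
        if self.filters:
            sql += " WHERE " + " AND ".join(self.filters)
        return sql

plan = LazyTable("data.csv").filter("age > 25").select(["city", "salary"])
print(plan.to_sql())  # SELECT city, salary FROM file('data.csv') WHERE age > 25
```

Three recorded operations become one query, which is why only the needed columns are read and the filter runs at the source.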

2. Return Types

pandas

df['col']           # Returns pd.Series
df[['a', 'b']]      # Returns pd.DataFrame
df[df['x'] > 10]    # Returns pd.DataFrame
df.groupby('x')     # Returns DataFrameGroupBy

DataStore

ds['col']           # Returns ColumnExpr (lazy)
ds[['a', 'b']]      # Returns DataStore (lazy)
ds[ds['x'] > 10]    # Returns DataStore (lazy)
ds.groupby('x')     # Returns LazyGroupBy

Converting to pandas Types

# Get pandas DataFrame
df = ds.to_df()
df = ds.to_pandas()

# Get pandas Series from column
series = ds['col'].to_pandas()

# Or trigger execution
print(ds)  # Automatically converts for display

3. Execution Triggers

DataStore executes when you need actual values:

| Trigger | Example | Notes |
|---|---|---|
| print() / repr() | print(ds) | Display needs data |
| len() | len(ds) | Need row count |
| .columns | ds.columns | Need column names |
| .dtypes | ds.dtypes | Need type info |
| .shape | ds.shape | Need dimensions |
| .values | ds.values | Need actual data |
| .index | ds.index | Need index |
| to_df() | ds.to_df() | Explicit conversion |
| Iteration | for row in ds | Need to iterate |
| equals() | ds.equals(other) | Need comparison |

Operations That Stay Lazy

| Operation | Returns |
|---|---|
| filter() | DataStore |
| select() | DataStore |
| sort() | DataStore |
| groupby() | LazyGroupBy |
| join() | DataStore |
| ds['col'] | ColumnExpr |
| ds[['a', 'b']] | DataStore |
| ds[condition] | DataStore |

4. Row Order

pandas

Row order is always preserved:

df = pd.read_csv("data.csv")
print(df.head())  # Always same order as file

DataStore

Row order is automatically preserved for most operations:

ds = pd.read_csv("data.csv")
print(ds.head())  # Matches file order

# Filter preserves order
ds_filtered = ds[ds['age'] > 25]  # Same order as pandas

DataStore automatically tracks original row positions internally (using rowNumberInAllBlocks()) to ensure order consistency with pandas.

When Order Is Preserved

  • File sources (CSV, Parquet, JSON, etc.)
  • pandas DataFrame sources
  • Filter operations
  • Column selection
  • After explicit sort() or sort_values()
  • Operations that define order (nlargest(), nsmallest(), head(), tail())

When Order May Differ

  • After groupby() aggregations (use sort_values() to ensure consistent order)
  • After merge() / join() with certain join types
  • In performance mode (config.use_performance_mode()): row order is not guaranteed for any operation. See Performance Mode.
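The sort_values() fix looks the same on either side. Shown here with plain pandas (the data and column names are made up for illustration); the identical call applies to a DataStore aggregation result:

```python
import pandas as pd

df = pd.DataFrame({
    'city': ['NY', 'LA', 'NY', 'SF'],
    'salary': [90, 80, 110, 120],
})

# Don't rely on the order an aggregation happens to produce; sort
# explicitly to get the same order every time, on any engine.
means = df.groupby('city')['salary'].mean().sort_values(ascending=False)
print(list(means.index))  # ['SF', 'NY', 'LA']
```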

5. No inplace Parameter

pandas

df.drop(columns=['col'], inplace=True)  # Modifies df
df.fillna(0, inplace=True)              # Modifies df
df.rename(columns={'old': 'new'}, inplace=True)

DataStore

inplace=True is not supported. Always assign the result:

ds = ds.drop(columns=['col'])           # Returns new DataStore
ds = ds.fillna(0)                       # Returns new DataStore
ds = ds.rename(columns={'old': 'new'})  # Returns new DataStore

Why No inplace?

DataStore uses immutable operations to enable:

  • Query building (lazy evaluation)
  • Thread safety
  • Easier debugging
  • Cleaner code
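Because every operation returns a new object, transformations chain naturally instead of mutating in place. A small sketch using plain pandas (the column names are illustrative); the doc's earlier examples show the same pattern works unchanged on a DataStore:

```python
import pandas as pd

df = pd.DataFrame({'old': [1.0, None, 3.0], 'col': [4, 5, 6]})

# Each call returns a new object, so steps chain without inplace=True.
result = (
    df.drop(columns=['col'])
      .fillna(0)
      .rename(columns={'old': 'new'})
)
print(result['new'].tolist())  # [1.0, 0.0, 3.0]
```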

6. Index Support

pandas

Full index support:

df = df.set_index('id')
df.loc['user123']           # Label-based access
df.loc['a':'z']             # Label-based slicing
df.reset_index()
df.index.name = 'user_id'

DataStore

Simplified index support:

# Basic operations work
ds.loc[0:10]               # Integer position
ds.iloc[0:10]              # Same as loc for DataStore

# For pandas-style index operations, convert first
df = ds.to_df()
df = df.set_index('id')
df.loc['user123']

DataStore Source Matters

  • DataFrame source: Preserves pandas index
  • File source: Uses simple integer index

7. Comparison Behavior

Comparing with pandas

pandas doesn't recognize DataStore objects:

import pandas as pd
from chdb import datastore as ds

pdf = pd.DataFrame({'a': [1, 2, 3]})
dsf = ds.DataFrame({'a': [1, 2, 3]})

# This doesn't work as expected
pdf == dsf  # pandas doesn't know DataStore

# Solution: convert DataStore to pandas
pdf.equals(dsf.to_pandas())  # True

Using equals()

# DataStore.equals() also works
dsf.equals(pdf)  # Compares with pandas DataFrame
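As a reminder of the pandas semantics involved: == compares elementwise and returns a DataFrame of booleans, while equals() returns a single bool, which is why equals() is the right tool for whole-frame comparison tests:

```python
import pandas as pd

a = pd.DataFrame({'a': [1, 2, 3]})
b = pd.DataFrame({'a': [1, 2, 3]})

print(type(a == b).__name__)  # DataFrame -- elementwise comparison
print(a.equals(b))            # True -- one bool for the whole frame
```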

8. Type Inference

pandas

Uses numpy/pandas types:

df['col'].dtype  # int64, float64, object, datetime64, etc.

DataStore

May use ClickHouse types:

ds['col'].dtype  # Int64, Float64, String, DateTime, etc.

# Types are converted when going to pandas
df = ds.to_df()
df['col'].dtype  # Now pandas type

Explicit Casting

# Force specific type
ds['col'] = ds['col'].astype('int64')

9. Memory Model

pandas

All data lives in memory:

df = pd.read_csv("huge.csv")  # 10GB in memory!

DataStore

Data stays at source until needed:

ds = pd.read_csv("huge.csv")  # Just metadata
ds = ds.filter(ds['year'] == 2024)  # Still just metadata

# Only filtered result is loaded
df = ds.to_df()  # Maybe only 1GB now

10. Error Messages

Different Error Sources

  • pandas errors: From pandas library
  • DataStore errors: From chDB or ClickHouse

# May see ClickHouse-style errors
# "Code: 62. DB::Exception: Syntax error..."

Debugging Tips

# View the SQL to debug
print(ds.to_sql())

# See execution plan
ds.explain()

# Enable debug logging
from chdb.datastore.config import config
config.enable_debug()

Migration Checklist

When migrating from pandas:

  • Change import statement
  • Remove inplace=True parameters
  • Add explicit to_df() where pandas DataFrame is required
  • Add sorting if row order matters
  • Use to_pandas() for comparison tests
  • Test with representative data sizes

Quick Reference

| pandas | DataStore |
|---|---|
| df[condition] | Same (returns DataStore) |
| df.groupby() | Same (returns LazyGroupBy) |
| df.drop(inplace=True) | ds = ds.drop() |
| df.equals(other) | ds.to_pandas().equals(other) |
| df.loc['label'] | ds.to_df().loc['label'] |
| print(df) | Same (triggers execution) |
| len(df) | Same (triggers execution) |