
Key Differences from pandas

While DataStore is highly compatible with pandas, there are important differences to understand.

Summary Table

| Aspect | pandas | DataStore |
|---|---|---|
| Execution | Eager (immediate) | Lazy (deferred) |
| Return types | DataFrame/Series | DataStore/ColumnExpr |
| Row order | Preserved | Preserved (automatic); not guaranteed in performance mode |
| inplace | Supported | Not supported |
| Index | Full support | Simplified |
| Memory | All data in memory | Data at source |

1. Lazy vs Eager Execution

pandas (Eager)

Operations execute immediately:

import pandas as pd

df = pd.read_csv("data.csv")  # Loads entire file NOW
result = df[df['age'] > 25]   # Filters NOW
grouped = result.groupby('city')['salary'].mean()  # Aggregates NOW

DataStore (Lazy)

Operations are deferred until results are needed:

from chdb import datastore as pd

ds = pd.read_csv("data.csv")  # Just records the source
result = ds[ds['age'] > 25]   # Just records the filter
grouped = result.groupby('city')['salary'].mean()  # Just records

# Execution happens here:
print(grouped)        # Executes when displaying
df = grouped.to_df()  # Or when converting to pandas

Why It Matters

Lazy execution enables:

  • Query optimization: Multiple operations compile to one SQL query
  • Column pruning: Only needed columns are read
  • Filter pushdown: Filters apply at the source
  • Memory efficiency: Don't load data you don't need
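To make the idea concrete, here is a minimal toy sketch of lazy query recording. This is purely illustrative and not DataStore's actual internals: each operation appends to a plan, and nothing touches the data until the plan is compiled into a single SQL string.

```python
# Toy illustration of lazy query building -- NOT DataStore's real internals.
class LazyTable:
    def __init__(self, source):
        self.source = source      # where the data lives
        self.filters = []         # recorded conditions, not yet executed
        self.columns = None       # None means "all columns"

    def filter(self, condition):
        new = LazyTable(self.source)
        new.filters = self.filters + [condition]  # immutable: copy the plan
        new.columns = self.columns
        return new

    def select(self, columns):
        new = LazyTable(self.source)
        new.filters = list(self.filters)
        new.columns = columns
        return new

    def to_sql(self):
        # All recorded steps collapse into ONE query: column pruning and
        # filter pushdown fall out of the plan naturally.
        cols = ", ".join(self.columns) if self.columns else "*"
        sql = f"SELECT {cols} FROM file('{self.source}')"
        if self.filters:
            sql += " WHERE " + " AND ".join(self.filters)
        return sql

plan = LazyTable("data.csv").filter("age > 25").select(["city", "salary"])
print(plan.to_sql())  # SELECT city, salary FROM file('data.csv') WHERE age > 25
```

Three recorded operations become one query, which is why only the needed columns are read and the filter runs at the source.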

2. Return Types

pandas

df['col']           # Returns pd.Series
df[['a', 'b']]      # Returns pd.DataFrame
df[df['x'] > 10]    # Returns pd.DataFrame
df.groupby('x')     # Returns DataFrameGroupBy

DataStore

ds['col']           # Returns ColumnExpr (lazy)
ds[['a', 'b']]      # Returns DataStore (lazy)
ds[ds['x'] > 10]    # Returns DataStore (lazy)
ds.groupby('x')     # Returns LazyGroupBy

Converting to pandas Types

# Get pandas DataFrame
df = ds.to_df()
df = ds.to_pandas()

# Get pandas Series from column
series = ds['col'].to_pandas()

# Or trigger execution
print(ds)  # Automatically converts for display

3. Execution Triggers

DataStore executes when you need actual values:

| Trigger | Example | Notes |
|---|---|---|
| print() / repr() | print(ds) | Display needs data |
| len() | len(ds) | Need row count |
| .columns | ds.columns | Need column names |
| .dtypes | ds.dtypes | Need type info |
| .shape | ds.shape | Need dimensions |
| .values | ds.values | Need actual data |
| .index | ds.index | Need index |
| to_df() | ds.to_df() | Explicit conversion |
| Iteration | for row in ds | Need to iterate |
| equals() | ds.equals(other) | Need comparison |

Operations That Stay Lazy

| Operation | Returns |
|---|---|
| filter() | DataStore |
| select() | DataStore |
| sort() | DataStore |
| groupby() | LazyGroupBy |
| join() | DataStore |
| ds['col'] | ColumnExpr |
| ds[['a', 'b']] | DataStore |
| ds[condition] | DataStore |

4. Row Order

pandas

Row order is always preserved:

df = pd.read_csv("data.csv")
print(df.head())  # Always same order as file

DataStore

Row order is automatically preserved for most operations:

ds = pd.read_csv("data.csv")
print(ds.head())  # Matches file order

# Filter preserves order
ds_filtered = ds[ds['age'] > 25]  # Same order as pandas

DataStore automatically tracks original row positions internally (using rowNumberInAllBlocks()) to ensure order consistency with pandas.

When Order Is Preserved

  • File sources (CSV, Parquet, JSON, etc.)
  • pandas DataFrame sources
  • Filter operations
  • Column selection
  • After explicit sort() or sort_values()
  • Operations that define order (nlargest(), nsmallest(), head(), tail())

When Order May Differ

  • After groupby() aggregations (use sort_values() to ensure consistent order)
  • After merge() / join() with certain join types
  • In performance mode (config.use_performance_mode()): row order is not guaranteed for any operation. See Performance Mode.
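The sort_values() fix looks the same on either side. Shown here with plain pandas (the data and column names are made up for illustration); the identical call applies to a DataStore aggregation result:

```python
import pandas as pd

df = pd.DataFrame({
    'city': ['NY', 'LA', 'NY', 'SF'],
    'salary': [90, 80, 110, 120],
})

# Don't rely on the order an aggregation happens to produce; sort
# explicitly to get the same order every time, on any engine.
means = df.groupby('city')['salary'].mean().sort_values(ascending=False)
print(list(means.index))  # ['SF', 'NY', 'LA']
```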

5. No inplace Parameter

pandas

df.drop(columns=['col'], inplace=True)  # Modifies df
df.fillna(0, inplace=True)              # Modifies df
df.rename(columns={'old': 'new'}, inplace=True)

DataStore

inplace=True is not supported. Always assign the result:

ds = ds.drop(columns=['col'])           # Returns new DataStore
ds = ds.fillna(0)                       # Returns new DataStore
ds = ds.rename(columns={'old': 'new'})  # Returns new DataStore

Why No inplace?

DataStore uses immutable operations to enable:

  • Query building (lazy evaluation)
  • Thread safety
  • Easier debugging
  • Cleaner code
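Because every operation returns a new object, transformations chain naturally instead of mutating in place. A small sketch using plain pandas (the column names are illustrative); the doc's earlier examples show the same pattern works unchanged on a DataStore:

```python
import pandas as pd

df = pd.DataFrame({'old': [1.0, None, 3.0], 'col': [4, 5, 6]})

# Each call returns a new object, so steps chain without inplace=True.
result = (
    df.drop(columns=['col'])
      .fillna(0)
      .rename(columns={'old': 'new'})
)
print(result['new'].tolist())  # [1.0, 0.0, 3.0]
```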

6. Index Support

pandas

Full index support:

df = df.set_index('id')
df.loc['user123']           # Label-based access
df.loc['a':'z']             # Label-based slicing
df.reset_index()
df.index.name = 'user_id'

DataStore

Simplified index support:

# Basic operations work
ds.loc[0:10]               # Integer position
ds.iloc[0:10]              # Same as loc for DataStore

# For pandas-style index operations, convert first
df = ds.to_df()
df = df.set_index('id')
df.loc['user123']

DataStore Source Matters

  • DataFrame source: Preserves pandas index
  • File source: Uses simple integer index

7. Comparison Behavior

Comparing with pandas

pandas doesn't recognize DataStore objects:

import pandas as pd
from chdb import datastore as ds

pdf = pd.DataFrame({'a': [1, 2, 3]})
dsf = ds.DataFrame({'a': [1, 2, 3]})

# This doesn't work as expected
pdf == dsf  # pandas doesn't know DataStore

# Solution: convert DataStore to pandas
pdf.equals(dsf.to_pandas())  # True

Using equals()

# DataStore.equals() also works
dsf.equals(pdf)  # Compares with pandas DataFrame
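As a reminder of the pandas semantics involved: == compares elementwise and returns a DataFrame of booleans, while equals() returns a single bool, which is why equals() is the right tool for whole-frame comparison tests:

```python
import pandas as pd

a = pd.DataFrame({'a': [1, 2, 3]})
b = pd.DataFrame({'a': [1, 2, 3]})

print(type(a == b).__name__)  # DataFrame -- elementwise comparison
print(a.equals(b))            # True -- one bool for the whole frame
```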

8. Type Inference

pandas

Uses numpy/pandas types:

df['col'].dtype  # int64, float64, object, datetime64, etc.

DataStore

May use ClickHouse types:

ds['col'].dtype  # Int64, Float64, String, DateTime, etc.

# Types are converted when going to pandas
df = ds.to_df()
df['col'].dtype  # Now pandas type

Explicit Casting

# Force specific type
ds['col'] = ds['col'].astype('int64')

9. Memory Model

pandas

All data lives in memory:

df = pd.read_csv("huge.csv")  # 10GB in memory!

DataStore

Data stays at source until needed:

ds = pd.read_csv("huge.csv")  # Just metadata
ds = ds.filter(ds['year'] == 2024)  # Still just metadata

# Only filtered result is loaded
df = ds.to_df()  # Maybe only 1GB now

10. Error Messages

Different Error Sources

  • pandas errors: From pandas library
  • DataStore errors: From chDB or ClickHouse

# May see ClickHouse-style errors
# "Code: 62. DB::Exception: Syntax error..."

Debugging Tips

# View the SQL to debug
print(ds.to_sql())

# See execution plan
ds.explain()

# Enable debug logging
from chdb.datastore.config import config
config.enable_debug()

Migration Checklist

When migrating from pandas:

  • Change import statement
  • Remove inplace=True parameters
  • Add explicit to_df() where pandas DataFrame is required
  • Add sorting if row order matters
  • Use to_pandas() for comparison tests
  • Test with representative data sizes

Quick Reference

| pandas | DataStore |
|---|---|
| df[condition] | Same (returns DataStore) |
| df.groupby() | Same (returns LazyGroupBy) |
| df.drop(inplace=True) | ds = ds.drop() |
| df.equals(other) | ds.to_pandas().equals(other) |
| df.loc['label'] | ds.to_df().loc['label'] |
| print(df) | Same (triggers execution) |
| len(df) | Same (triggers execution) |