Data Interface¶
Overview¶
The COLAData class is the central data container in COLA. It manages both factual (original) and counterfactual (generated) data, ensuring consistency and providing convenient methods for data manipulation and inspection.
Key Features¶
Unified Interface: Works with both Pandas DataFrames and NumPy arrays
Automatic Validation: Ensures data consistency between factual and counterfactual data
Feature Type Tracking: Distinguishes between numerical and categorical features
Preprocessor Integration: Supports sklearn transformers for data preprocessing
Summary Information: Quick overview of your data structure
When to Use COLAData¶
Use COLAData when you need to:
Organize factual instances and their counterfactual explanations together
Automatically track which features are numerical vs categorical
Integrate with preprocessing pipelines (StandardScaler, OneHotEncoder, etc.)
Ensure data consistency before refinement
Basic Usage¶
Scenario 1: Using Pandas DataFrame¶
The most common and recommended approach:
from xai_cola.ce_sparsifier.data import COLAData
import pandas as pd
# Prepare your data with a target column
df  # DataFrame of factual instances, including the label column 'Risk'
# Create COLAData instance
data = COLAData(
    factual_data=df,  # pandas DataFrame
    label_column='Risk',
    numerical_features=['Age', 'Credit amount', 'Duration']  # must be columns of df
)
# Check the data
data.summary()
Output:
========== COLAData Summary ==========
{'factual_samples': 10,
'feature_count': 9,
'label_column': 'Risk',
'all_columns': ['Age',
'Sex',
'Job',
'Housing',
'Saving accounts',
'Checking account',
'Credit amount',
'Duration',
'Purpose',
'Risk'],
'has_counterfactual': False,
'has_transform_method': False,
'has_transformed_data': False,
'counterfactual_samples': None}
======================================
Scenario 2: Using NumPy Arrays¶
If you’re working with NumPy arrays instead of DataFrames:
import numpy as np
# Prepare the NumPy array (it must include the label column)
X  # NumPy array of factual instances, including the label column 'Risk'
all_columns = ['Age','Sex','Job','Housing','Saving accounts','Checking account','Credit amount','Duration','Purpose','Risk']
# Column names MUST be provided when using NumPy arrays
data = COLAData(
    factual_data=X,  # NumPy array
    label_column='Risk',
    column_names=all_columns,
    numerical_features=['Age', 'Credit amount', 'Duration']  # must appear in all_columns
)
Warning
When using NumPy arrays, you must provide column_names, and the list must include the label column.
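Before passing an array to COLAData, it can help to check the column names yourself. The following is a minimal sketch using only NumPy and pandas; the toy array and the variable names are assumptions made for illustration, and the checks are not part of COLA itself.

```python
import numpy as np
import pandas as pd

# Toy stand-in for a factual array: two features plus the label in the last column.
X = np.array([
    [25, 3000, 1],
    [40, 5200, 0],
    [33, 1500, 1],
])
column_names = ["Age", "Credit amount", "Risk"]
label_column = "Risk"

# One name per array column, and the label column must be among them.
assert X.shape[1] == len(column_names), "one name per array column is required"
assert label_column in column_names, "column_names must include the label column"

# A labelled view of the same data, which is easier to inspect by eye.
df = pd.DataFrame(X, columns=column_names)
print(df.head())
```

If both assertions pass, the same `X` and `column_names` can be handed to COLAData as in Scenario 2.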
Scenario 3: Adding Counterfactuals after initializing COLAData¶
After generating counterfactuals using DiCE, DisCount, or another explainer:
# Generate counterfactuals (example with DiCE)
from xai_cola.ce_generator import DiCE
explainer = DiCE(ml_model=ml_model)
factual, counterfactual = explainer.generate_counterfactuals(
data=data,
factual_class=1,
total_cfs=2
)
# Add counterfactuals to COLAData
data.add_counterfactuals(counterfactual, with_target_column=True)
# Now summary shows both
data.summary()
Output:
========== COLAData Summary ==========
{'factual_samples': 10,
'feature_count': 9,
'label_column': 'Risk',
'all_columns': ['Age',
'Sex',
'Job',
'Housing',
'Saving accounts',
'Checking account',
'Credit amount',
'Duration',
'Purpose',
'Risk'],
'has_counterfactual': True,
'has_transform_method': False,
'has_transformed_data': False,
'counterfactual_samples': 20}
...
Scenario 4: Adding Counterfactuals when initializing COLAData¶
If you already have counterfactuals (from DiCE, DisCount, or another explainer), pass them directly when creating the COLAData instance:
from xai_cola.ce_sparsifier.data import COLAData
import pandas as pd
# Prepare your data with a target column
df  # DataFrame of factual instances, including the label column 'Risk'
cf_df  # DataFrame of counterfactuals, including the label column 'Risk'
# Create COLAData instance
data = COLAData(
    factual_data=df,  # pandas DataFrame
    label_column='Risk',
    counterfactual_data=cf_df,  # pandas DataFrame of counterfactuals
    numerical_features=['Age', 'Credit amount', 'Duration']  # must be columns of df
)
# Check the data
data.summary()
Output:
========== COLAData Summary ==========
{'factual_samples': 10,
'feature_count': 9,
'label_column': 'Risk',
'all_columns': ['Age',
'Sex',
'Job',
'Housing',
'Saving accounts',
'Checking account',
'Credit amount',
'Duration',
'Purpose',
'Risk'],
'has_counterfactual': True,
'has_transform_method': False,
'has_transformed_data': False,
'counterfactual_samples': 20}
...
Scenario 5: With Preprocessing¶
Integrate sklearn preprocessors for automatic transformation:
from sklearn.preprocessing import StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
# Define preprocessor
preprocessor = ColumnTransformer(
transformers=[
('num', StandardScaler(), ['Age', 'Income']),
('cat', OneHotEncoder(drop='first'), ['HasLoan'])
]
)
# Fit the preprocessor
preprocessor.fit(df[['Age', 'Income', 'HasLoan']])
# Create COLAData with preprocessor
data = COLAData(
factual_data=df,
label_column='Risk',
numerical_features=['Age', 'Income'],
preprocessor=preprocessor
)
Note
transform_method and preprocessor are aliases; use either one, not both.
Common Issues¶
Issue 1: Missing Label Column¶
Error:
KeyError: "Label column 'Risk' not found in data"
Solution:
Make sure your DataFrame includes the label column:
# ❌ Wrong - label column missing
df = pd.DataFrame({'Age': [25], 'Income': [30000]})
data = COLAData(factual_data=df, label_column='Risk')
# ✅ Correct - label column included
df = pd.DataFrame({'Age': [25], 'Income': [30000], 'Risk': [1]})
data = COLAData(factual_data=df, label_column='Risk')
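A quick pre-check can surface this problem with a more descriptive message before COLAData is constructed. The sketch below uses only pandas; `missing_label_message` is a hypothetical helper written for this example and is not part of COLA.

```python
import pandas as pd


def missing_label_message(df: pd.DataFrame, label_column: str):
    """Return a readable message if label_column is absent, otherwise None.

    Hypothetical helper for illustration; not part of COLA.
    """
    if label_column in df.columns:
        return None
    return (
        f"Label column {label_column!r} not found; "
        f"available columns: {list(df.columns)}"
    )


# Label column missing: the helper reports which columns are available.
df = pd.DataFrame({"Age": [25], "Income": [30000]})
print(missing_label_message(df, "Risk"))

# Label column present: nothing to report.
df["Risk"] = [1]
print(missing_label_message(df, "Risk"))
```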
Issue 2: Column Mismatch¶
Error:
ValueError: Counterfactual columns don't match factual columns
Solution:
Ensure counterfactuals have the same columns as factuals:
# Check column names
print(data.factual_df.columns)
print(counterfactual_df.columns)
# Make sure they match (order doesn't matter)
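Comparing the two column sets directly shows exactly which names differ. The following pandas-only sketch uses toy DataFrames as stand-ins for `data.factual_df` and your counterfactual DataFrame; the `Extra` column is an assumption added to trigger a mismatch, and whether dropping an unexpected column is appropriate depends on where it came from.

```python
import pandas as pd

# Toy stand-ins for the factual and counterfactual DataFrames.
factual_df = pd.DataFrame({"Age": [25], "Duration": [12], "Risk": [1]})
counterfactual_df = pd.DataFrame(
    {"Risk": [0], "Age": [25], "Duration": [6], "Extra": [0]}
)

factual_cols = set(factual_df.columns)
cf_cols = set(counterfactual_df.columns)

# Set differences ignore column order and name the offending columns.
missing = sorted(factual_cols - cf_cols)
unexpected = sorted(cf_cols - factual_cols)
print("missing from counterfactuals:", missing)
print("unexpected in counterfactuals:", unexpected)

# If nothing is missing, select the factual columns in the factual order.
# This also drops unexpected columns such as 'Extra'.
if not missing:
    aligned_cf = counterfactual_df[list(factual_df.columns)]
    print(list(aligned_cf.columns))
```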
Issue 3: NumPy Without Column Names¶
Error:
ValueError: column_names must be provided when using numpy array
Solution:
Always provide column names for NumPy arrays:
# ❌ Wrong
data = COLAData(factual_data=X, label_column='Risk')
# ✅ Correct
data = COLAData(
factual_data=X,
label_column='Risk',
column_names=['Age', 'Income', 'Risk']
)
Best Practices¶
✅ DO:
Always specify numerical_features explicitly
data = COLAData(
    factual_data=df,
    label_column='Risk',
    numerical_features=['Age', 'Income', 'Duration']
)
Use Pandas DataFrames when possible - easier to debug
Add counterfactuals before using COLA
data.add_counterfactuals(cf_df, with_target_column=True)
sparsifier = COLA(data=data, ml_model=model)
Check summary after creation
data.summary() # Verify everything looks correct
❌ DON’T:
Don’t modify factual_df directly after adding counterfactuals
data.add_counterfactuals(cf_df)
data.factual_df['NewColumn'] = 0  # ❌ Don't do this
Don’t forget the label column in your data
Don’t assume feature types - always specify numerical_features
Don’t mix preprocessed and raw data - be consistent
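One way to follow the advice about feature types is to list the candidate numerical columns from the DataFrame dtypes and then review them by hand. This is a minimal pandas sketch on toy data; the column values and the decision to treat `Job` as categorical are assumptions for illustration, since integer-coded categories have a numeric dtype and only you know which columns are truly numerical.

```python
import pandas as pd

# Toy data; 'Job' is an integer-coded category, not a quantity.
df = pd.DataFrame({
    "Age": [25, 40, 33],
    "Job": [2, 1, 3],
    "Housing": ["own", "rent", "own"],
    "Credit amount": [3000, 5200, 1500],
    "Risk": [1, 0, 1],
})
label_column = "Risk"

# Inspect feature columns only; the label is not a feature.
features = df.drop(columns=[label_column])

# Columns with a numeric dtype are candidates, not a final answer.
numeric_dtype_cols = features.select_dtypes(include="number").columns.tolist()
other_cols = [c for c in features.columns if c not in numeric_dtype_cols]
print("numeric dtype:", numeric_dtype_cols)
print("other dtype:", other_cols)

# Manual review step: remove integer-coded categories from the candidates.
numerical_features = [c for c in numeric_dtype_cols if c != "Job"]
print("numerical_features:", numerical_features)
```

The resulting list is what you would pass as `numerical_features`.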
API Reference¶
For complete parameter details, see COLAData.
Next Steps¶
Learn about Model Interface - Wrapping your ML model
Explore Counterfactual Explainers - Generating counterfactuals
Continue to Matching Policies - Configuring COLA