Data Interface

Overview

The COLAData class is the central data container in COLA. It manages both factual (original) and counterfactual (generated) data, ensuring consistency and providing convenient methods for data manipulation and inspection.

Key Features

  • Unified Interface: Works with both Pandas DataFrame and NumPy arrays

  • Automatic Validation: Ensures data consistency between factual and counterfactual

  • Feature Type Tracking: Distinguishes between numerical and categorical features

  • Preprocessor Integration: Supports sklearn transformers for data preprocessing

  • Summary Information: Quick overview of your data structure

When to Use COLAData

Use COLAData when you need to:

  • Organize factual instances and their counterfactual explanations together

  • Automatically track which features are numerical vs categorical

  • Integrate with preprocessing pipelines (StandardScaler, OneHotEncoder, etc.)

  • Ensure data consistency before refinement

Basic Usage

Scenario 1: Using Pandas DataFrame

The most common and recommended approach:

from xai_cola.ce_sparsifier.data import COLAData
import pandas as pd

# Prepare your data with a target column
df # dataframe of factual, including label column 'Risk'

# Create COLAData instance
data = COLAData(
    factual_data=df, # pandas dataframe
    label_column='Risk',
    numerical_features=['Age', 'Income']
)

# Check the data
data.summary()

Output:

========== COLAData Summary ==========
{'factual_samples': 10,
'feature_count': 9,
'label_column': 'Risk',
'all_columns': ['Age',
'Sex',
'Job',
'Housing',
'Saving accounts',
'Checking account',
'Credit amount',
'Duration',
'Purpose',
'Risk'],
'has_counterfactual': True,
'has_transform_method': False,
'has_transformed_data': False,
'counterfactual_samples': None}
======================================

Scenario 2: Using NumPy Arrays

If you’re working with NumPy arrays instead of DataFrames:

import numpy as np

# Prepare numpy array (must include label column)
X  # numpy array of factual, including label column 'Risk'
all_columns = ['Age','Sex','Job','Housing','Saving accounts','Checking account','Credit amount','Duration','Purpose','Risk']
# MUST provide column names when using numpy
data = COLAData(
    factual_data=X, # numpy array
    label_column='Risk',
    column_names=all_columns,
    numerical_features=['Age', 'Income']
)

Warning

When using NumPy arrays, you must provide column_names that includes the label column.

Scenario 3: Adding Counterfactuals after initializing COLAData

After generating counterfactuals using DiCE, DisCount, or another explainer:

# Generate counterfactuals (example with DiCE)
from xai_cola.ce_generator import DiCE

explainer = DiCE(ml_model=ml_model)
factual, counterfactual = explainer.generate_counterfactuals(
    data=data,
    factual_class=1,
    total_cfs=2
)

# Add counterfactuals to COLAData
data.add_counterfactuals(counterfactual, with_target_column=True)

# Now summary shows both
data.summary()

Output:

========== COLAData Summary ==========
{'factual_samples': 10,
'feature_count': 9,
'label_column': 'Risk',
'all_columns': ['Age',
'Sex',
'Job',
'Housing',
'Saving accounts',
'Checking account',
'Credit amount',
'Duration',
'Purpose',
'Risk'],
'has_counterfactual': True,
'has_transform_method': False,
'has_transformed_data': False,
'counterfactual_samples': 20}
...

Scenario 4: Adding Counterfactuals when initializing COLAData

After generating counterfactuals using DiCE, DisCount, or another explainer:

from xai_cola.ce_sparsifier.data import COLAData
import pandas as pd

# Prepare your data with a target column
df # dataframe of factual, including label column 'Risk'
cf_df # dataframe of counterfactuals, including label column 'Risk'
# Create COLAData instance
data = COLAData(
    factual_data=df, # pandas dataframe
    label_column='Risk',
    counterfactual_data=cf_df, # pandas dataframe of counterfactuals
    numerical_features=['Age', 'Income']
)

# Check the data
data.summary()

Output:

========== COLAData Summary ==========
{'factual_samples': 10,
'feature_count': 9,
'label_column': 'Risk',
'all_columns': ['Age',
'Sex',
'Job',
'Housing',
'Saving accounts',
'Checking account',
'Credit amount',
'Duration',
'Purpose',
'Risk'],
'has_counterfactual': True,
'has_transform_method': False,
'has_transformed_data': False,
'counterfactual_samples': 20}
...

Scenario 5: With Preprocessing

Integrate sklearn preprocessors for automatic transformation:

from sklearn.preprocessing import StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Define preprocessor
preprocessor = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), ['Age', 'Income']),
        ('cat', OneHotEncoder(drop='first'), ['HasLoan'])
    ]
)

# Fit the preprocessor
preprocessor.fit(df[['Age', 'Income', 'HasLoan']])

# Create COLAData with preprocessor
data = COLAData(
    factual_data=df,
    label_column='Risk',
    numerical_features=['Age', 'Income'],
    preprocessor=preprocessor
)

Note

transform_method and preprocessor are aliases - use either one.

Common Issues

Issue 1: Missing Label Column

Error:

KeyError: "Label column 'Risk' not found in data"

Solution:

Make sure your DataFrame includes the label column:

# ❌ Wrong - label column missing
df = pd.DataFrame({'Age': [25], 'Income': [30000]})
data = COLAData(factual_data=df, label_column='Risk')

# ✅ Correct - label column included
df = pd.DataFrame({'Age': [25], 'Income': [30000], 'Risk': [1]})
data = COLAData(factual_data=df, label_column='Risk')

Issue 2: Column Mismatch

Error:

ValueError: Counterfactual columns don't match factual columns

Solution:

Ensure counterfactuals have the same columns as factuals:

# Check column names
print(data.factual_df.columns)
print(counterfactual_df.columns)

# Make sure they match (order doesn't matter)

Issue 3: NumPy Without Column Names

Error:

ValueError: column_names must be provided when using numpy array

Solution:

Always provide column names for NumPy arrays:

# ❌ Wrong
data = COLAData(factual_data=X, label_column='Risk')

# ✅ Correct
data = COLAData(
    factual_data=X,
    label_column='Risk',
    column_names=['Age', 'Income', 'Risk']
)

Best Practices

DO:

  1. Always specify numerical_features explicitly

    data = COLAData(
        factual_data=df,
        label_column='Risk',
        numerical_features=['Age', 'Income', 'Duration']
    )
    
  2. Use Pandas DataFrames when possible - easier to debug

  3. Add counterfactuals before using COLA

    data.add_counterfactuals(cf_df, with_target_column=True)
    sparsifier = COLA(data=data, ml_model=model)
    
  4. Check summary after creation

    data.summary()  # Verify everything looks correct
    

DON’T:

  1. Don’t modify factual_df directly after adding counterfactuals

    data.add_counterfactuals(cf_df)
    data.factual_df['NewColumn'] = 0  # ❌ Don't do this
    
  2. Don’t forget the label column in your data

  3. Don’t assume feature types - always specify numerical_features

  4. Don’t mix preprocessed and raw data - be consistent

API Reference

For complete parameter details, see COLAData.

Next Steps