================== Matching Policies ================== Overview ======== Matching policies determine how COLA pairs factual instances with counterfactual instances before refinement. The choice of matching policy affects both the quality and computational cost of the refined counterfactuals. COLA provides four matching strategies: 1. **Exact matching (ECT)** - Fast, deterministic matching (recommended) 2. **Optimal Transport (OT)** - Globally optimal matching (recommended) 3. **Nearest Neighbor (NN)** - Simple proximity-based matching (recommended) 4. **Coarsened Exact Matching (CEM)** - Coarsened exact matching Quick Start =========== .. code-block:: python from xai_cola import COLA from xai_cola.ce_sparsifier.data import COLAData from xai_cola.ce_sparsifier.models import Model # Initialize COLA sparsifier = COLA(data=data, ml_model=ml_model) # Set matching policy sparsifier.set_policy( matcher="ot", # Matching strategy attributor="pshap", # Feature attribution method random_state=42 # For reproducibility ) # Query minimum actions needed min_actions = sparsifier.query_minimum_actions() # Refine counterfactuals refined = sparsifier.refine_counterfactuals(limited_actions=min_actions) Matching Strategies =================== 1. Exact Matching (ECT) -------------------------------- **When to use:** - You need fast results - You have clear class transitions (e.g., 0→1, 1→0) - Number of factuals equals number of counterfactuals (n factuals = n counterfactuals) - One-to-one matching is desired (creating an n×n identity matrix) **How it works:** Matches factual instances to counterfactuals based on exact class transitions. For instance, factuals with class 0 are matched to counterfactuals with class 1. .. code-block:: python sparsifier.set_policy( matcher="ect", attributor="pshap" ) **Advantages:** - ✅ Very fast - ✅ Deterministic results - ✅ Simple and interpretable - ✅ No hyperparameters **Disadvantages:** - ⚠️ May not be globally optimal - ⚠️ Requires balanced classes - ⚠️ Limited flexibility **Best for:** Binary classification with similar class sizes. 2. Optimal Transport (OT) ------------------------- **When to use:** - You want the best quality results - Computational cost is acceptable - You have many counterfactuals per instance **How it works:** Solves an optimal transport problem to find the globally optimal assignment between factual and counterfactual instances, minimizing total transportation cost. .. code-block:: python sparsifier.set_policy( matcher="ot", attributor="pshap" ) **Advantages:** - ✅ Globally optimal matching - ✅ Best refinement quality - ✅ Considers all possible pairings - ✅ Theoretically grounded **Disadvantages:** - ⚠️ Slower than other methods - ⚠️ Complexity: O(n³) for n instances 3. Nearest Neighbor (NN) ------------------------- **When to use:** - You want the simplest approach - Computational resources are very limited - Quick prototyping **How it works:** Matches each factual to its nearest counterfactual in feature space using Euclidean distance. .. code-block:: python sparsifier.set_policy( matcher="nn", attributor="pshap" ) **Advantages:** - ✅ Extremely fast - ✅ Simple to understand - ✅ Works with any data **Disadvantages:** - ⚠️ Locally optimal only - ⚠️ Sensitive to scale 4. Coarsened Exact Matching (CEM) ---------------------------------- **When to use:** - You want to match on coarsened/binned feature values - Variables have natural stratification (e.g., age groups, income brackets) - You need balance on important covariates - Exact matching is too restrictive but you want interpretable strata **How it works:** Temporarily coarsens (bins) continuous variables into discrete strata, performs exact matching on these coarsened values, then uses original feature values for refinement. This balances the trade-off between exact matching precision and matching feasibility. .. code-block:: python sparsifier.set_policy( matcher="cem", attributor="pshap" ) **Advantages:** - ✅ More flexible than exact matching - ✅ Ensures balance on key covariates - ✅ Interpretable stratification - ✅ Reduces model dependence **Disadvantages:** - ⚠️ Requires choosing binning strategy - ⚠️ May reduce sample size if strata are too fine - ⚠️ Less optimal than OT for complex relationships Feature Attribution =================== PSHAP Attributor ---------------- COLA uses PSHAP for feature attribution, determining which features are most important for the transition from factual to counterfactual. .. code-block:: python sparsifier.set_policy( matcher="ot", attributor="pshap", random_state=42 ) **How PSHAP works:** 1. For each factual-counterfactual pair, compute Shapley values 2. Rank features by their contribution to the class change 3. Select top-k features with highest importance 4. Generate refined counterfactual using only these features **Parameters:** - ``random_state`` (int): Random seed for reproducibility Querying Minimum Actions ========================= Before refining, you can query the minimum number of actions needed: .. code-block:: python # Query minimum actions min_actions = sparsifier.query_minimum_actions() print(f"Minimum actions needed: {min_actions}") # Use this value for refinement refined = sparsifier.refine_counterfactuals(limited_actions=min_actions) This tells you the theoretical minimum number of feature changes needed for your dataset. Refinement Options ================== Basic Refinement ---------------- .. code-block:: python # Refine with specific action limit refined = sparsifier.refine_counterfactuals(limited_actions=5) With Feature Restrictions -------------------------- You can restrict which features can be modified: .. code-block:: python # Only allow these features to change refined = sparsifier.refine_counterfactuals( limited_actions=5, features_to_vary=['Income', 'Duration', 'LoanAmount'] ) .. note:: This is different from the explainer's ``features_to_vary``. The explainer controls CF generation, while this controls CF refinement. Getting All Results ------------------- Get factual, original counterfactual, and refined counterfactual together: .. code-block:: python factual_df, ce_df, ace_df = sparsifier.get_all_results(limited_actions=5) print("Original CF actions:", (factual_df != ce_df).sum().sum()) print("Refined ACE actions:", (factual_df != ace_df).sum().sum()) Complete Examples ================= Example 1: Optimal Transport with Minimum Actions -------------------------------------------------- .. code-block:: python from xai_cola import COLA from xai_cola.ce_sparsifier.data import COLAData from xai_cola.ce_sparsifier.models import Model from xai_cola.ce_generator import DiCE # Setup data = COLAData(factual_data=df, label_column='Risk') ml_model = Model(model=trained_model, backend="sklearn") # Generate CFs explainer = DiCE(ml_model=ml_model) _, cf = explainer.generate_counterfactuals( data=data, factual_class=1, total_cfs=2 ) data.add_counterfactuals(cf, with_target_column=True) # Refine with OT sparsifier = COLA(data=data, ml_model=ml_model) sparsifier.set_policy(matcher="ot", attributor="pshap", random_state=42) # Find and use minimum actions min_actions = sparsifier.query_minimum_actions() refined = sparsifier.refine_counterfactuals(limited_actions=min_actions) print(f"Refined {len(refined)} counterfactuals") print(f"Using {min_actions} feature changes per instance") Example 2: Fast ECT Matching ----------------------------- .. code-block:: python # For quick results, use ECT sparsifier = COLA(data=data, ml_model=ml_model) sparsifier.set_policy(matcher="ect", attributor="pshap") # ECT is much faster than OT import time start = time.time() refined = sparsifier.refine_counterfactuals(limited_actions=5) print(f"Refinement time: {time.time() - start:.2f}s") Example 3: Comparing Matchers ------------------------------ .. code-block:: python import pandas as pd results = [] for matcher in ["ect", "ot", "nn", "softcem"]: sparsifier = COLA(data=data, ml_model=ml_model) sparsifier.set_policy(matcher=matcher, attributor="pshap") min_actions = sparsifier.query_minimum_actions() refined = sparsifier.refine_counterfactuals(limited_actions=min_actions) # Count changes factual_df, ce_df, ace_df = sparsifier.get_all_results( limited_actions=min_actions ) n_changes = (factual_df != ace_df).sum().sum() results.append({ 'Matcher': matcher, 'Min Actions': min_actions, 'Total Changes': n_changes }) results_df = pd.DataFrame(results) print(results_df) Example 4: With Feature Restrictions ------------------------------------- .. code-block:: python # Scenario: Only financial features can change financial_features = ['Income', 'LoanAmount', 'Duration'] sparsifier = COLA(data=data, ml_model=ml_model) sparsifier.set_policy(matcher="ot", attributor="pshap") refined = sparsifier.refine_counterfactuals( limited_actions=3, features_to_vary=financial_features ) # Verify only financial features changed factual_df, _, ace_df = sparsifier.get_all_results(limited_actions=3) for col in factual_df.columns: if col not in financial_features + ['Risk']: assert (factual_df[col] == ace_df[col]).all(), f"{col} changed!" print("✓ Only financial features were modified") Choosing the Right Policy ========================== Decision Guide -------------- .. code-block:: text ┌─────────────────────────────────────┐ │ Need best quality? │ │ ├─ Yes → Use OT │ │ └─ No → Continue │ └─────────────────────────────────────┘ │ ▼ ┌─────────────────────────────────────┐ │ Need fast results? │ │ ├─ Yes → Use ECT │ │ └─ No → Continue │ └─────────────────────────────────────┘ │ ▼ ┌─────────────────────────────────────┐ │ Have complex overlaps? │ │ ├─ Yes → Use SoftCEM │ │ └─ No → Use NN │ └─────────────────────────────────────┘ Recommendation Table -------------------- +------------------+----------------+----------------+-----------------+ | Scenario | Matcher | Speed | Quality | +==================+================+================+=================+ | Production use | **OT** | Medium | Best | +------------------+----------------+----------------+-----------------+ | Quick iteration | **ECT** | Fast | Good | +------------------+----------------+----------------+-----------------+ | Binary class | **ECT** | Fast | Good | +------------------+----------------+----------------+-----------------+ | Large dataset | **ECT/NN** | Fast | Acceptable | +------------------+----------------+----------------+-----------------+ | Research | **OT** | Medium | Best | +------------------+----------------+----------------+-----------------+ | Prototype | **NN** | Very Fast | Basic | +------------------+----------------+----------------+-----------------+ Common Issues ============= Issue 1: Matching Takes Too Long --------------------------------- **Problem:** OT matching is slow on large datasets. **Solution:** Use ECT or NN for faster results: .. code-block:: python # ❌ Slow on 1000+ instances sparsifier.set_policy(matcher="ot", attributor="pshap") # ✅ Much faster sparsifier.set_policy(matcher="ect", attributor="pshap") Issue 2: Unbalanced Classes ---------------------------- **Problem:** CEM fails with unbalanced class distribution. **Error:** .. code-block:: text ValueError: Cannot match - unbalanced class distribution **Solution:** Use OT which handles imbalance: .. code-block:: python # ✅ Works with any class distribution sparsifier.set_policy(matcher="ot", attributor="pshap") Issue 3: Inconsistent Results ------------------------------ **Problem:** Results vary between runs. **Solution:** Set random_state for reproducibility: .. code-block:: python # ✅ Reproducible results sparsifier.set_policy( matcher="ot", attributor="pshap", random_state=42 # Fixed seed ) Best Practices ============== ✅ **DO:** 1. **Start with ECT for exploration** .. code-block:: python # Quick first pass sparsifier.set_policy(matcher="ect", attributor="pshap") 2. **Use OT for final results** .. code-block:: python # Best quality for production sparsifier.set_policy(matcher="ot", attributor="pshap") 3. **Always set random_state for research** .. code-block:: python sparsifier.set_policy( matcher="ot", attributor="pshap", random_state=42 ) 4. **Query minimum actions before refining** .. code-block:: python min_actions = sparsifier.query_minimum_actions() refined = sparsifier.refine_counterfactuals(limited_actions=min_actions) ❌ **DON'T:** 1. **Don't use CEM as default when having few samples** - it's lowest quality 2. **Don't ignore computational cost** - OT can be slow on large datasets 3. **Don't forget to set the policy** - must call ``set_policy()`` before refinement .. code-block:: python # ❌ Error - no policy set sparsifier = COLA(data=data, ml_model=ml_model) refined = sparsifier.refine_counterfactuals(limited_actions=5) # ✅ Correct sparsifier.set_policy(matcher="ot", attributor="pshap") refined = sparsifier.refine_counterfactuals(limited_actions=5) API Reference ============= For complete parameter details, see: - :class:`~xai_cola.ce_sparsifier.COLA` - :class:`~xai_cola.ce_sparsifier.policies.matching.CounterfactualOptimalTransportPolicy` - :class:`~xai_cola.ce_sparsifier.policies.feature_attributor.PSHAP` Next Steps ========== - Learn about :doc:`visualization` - Visualizing refinement results - See :doc:`explainers` - Generating counterfactuals - Review :doc:`data_interface` - Managing data