# Benchmarks
FeatCopilot has been extensively benchmarked to demonstrate its effectiveness in automated feature engineering. This page presents comprehensive results across 63 datasets spanning classification, regression, forecasting, and text tasks.
## Executive Summary

- **Simple Models Benchmark**: +7.52% average improvement; +144% max improvement (triple_interaction)
- **FE Tools Comparison**: #1 overall with an 80% win rate; beats autofeat and featuretools
- **AutoML Benchmark**: +1.85% avg improvement (FLAML); 90% of datasets improved
- **Feature Generation**: feature expansion ranging from 7→24 to 54→57; GBM-based importance selection
## Simple Models Benchmark
Testing FeatCopilot with RandomForest (n_estimators=200, max_depth=20) and LogisticRegression/Ridge across 63 datasets to measure feature engineering impact.
### Summary Results
| Configuration | Datasets | Improved | Avg Improvement | Best Improvement |
|---|---|---|---|---|
| Tabular Engine | 63 | 31 (49%) | +7.52% | +144% (triple_interaction) |
### Classification Results (26 Datasets)
| Dataset | Baseline | +FeatCopilot | Improvement | Features |
|---|---|---|---|---|
| xor_classification | 0.6960 | 0.8120 | +16.67% | 20→24 |
| complex_classification | 0.7125 | 0.8300 | +16.49% | 15→23 |
| polynomial_classification | 0.7875 | 0.8675 | +10.16% | 15→21 |
| interaction_classification | 0.7650 | 0.8075 | +5.56% | 12→17 |
| credit_risk | 0.8525 | 0.8675 | +1.76% | 10→17 |
| bioresponse | 0.7700 | 0.7802 | +1.32% | 419→419 |
| covertype_cat | 0.8747 | 0.8819 | +0.82% | 54→55 |
| magic_telescope | 0.8509 | 0.8572 | +0.75% | 10→12 |
| credit_card_fraud | 0.9840 | 0.9840 | +0.00% | 30→40 |
| employee_attrition | 0.9252 | 0.9252 | +0.00% | 11→16 |
### Regression Results (30 Datasets)
| Dataset | Baseline R² | +FeatCopilot R² | Improvement | Features |
|---|---|---|---|---|
| triple_interaction_regression | 0.3542 | 0.8649 | +144% | 18→23 |
| xor_regression | 0.3330 | 0.6801 | +104% | 20→24 |
| pairwise_product_regression | 0.5132 | 0.8698 | +69.5% | 16→23 |
| nonlinear_regression | 0.6086 | 0.8756 | +43.9% | 12→18 |
| complex_regression | 0.6339 | 0.8725 | +37.6% | 15→20 |
| quadratic_heavy_regression | 0.7134 | 0.9341 | +30.9% | 18→25 |
| polynomial_regression | 0.7321 | 0.8692 | +18.7% | 12→19 |
| sqrt_log_regression | 0.8725 | 0.8997 | +3.12% | 15→25 |
| bike_sharing | 0.9534 | 0.9697 | +1.71% | 10→12 |
| house_prices | 0.9798 | 0.9953 | +1.58% | 14→16 |
| spotify_tracks | 0.9529 | 0.9648 | +1.25% | 13→17 |
| ecommerce_product | 0.9462 | 0.9564 | +1.08% | 10→11 |
> **Key Insight**
>
> FeatCopilot provides the largest improvements on datasets with complex feature interactions (XOR, polynomial, triple interactions), where simple models struggle with raw features. GBM-based feature selection ensures only high-quality derived features are kept.
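The GBM-importance selection described above can be sketched with scikit-learn. This is a minimal illustration of the general technique, not FeatCopilot's actual internals; the model, dataset, and `"median"` threshold here are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.feature_selection import SelectFromModel

# Synthetic stand-in for a benchmark dataset
X, y = make_classification(
    n_samples=500, n_features=20, n_informative=5, random_state=42
)

# Fit a GBM and keep only features whose importance clears the median,
# mirroring the "median threshold" selection strategy described above
selector = SelectFromModel(
    GradientBoostingClassifier(random_state=42),
    threshold="median",
)
X_selected = selector.fit_transform(X, y)
```

In a real pipeline, the same selector would be applied to the pool of generated candidate features so that only informative derivations survive.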
## FE Tools Comparison
Comparing FeatCopilot against autofeat and featuretools across 10 datasets using FLAML (30s budget).
### Overall Ranking
| Rank | Tool | Win Rate | Avg Improvement | Speed | Coverage | Composite |
|---|---|---|---|---|---|---|
| 🥇 | FeatCopilot | 80% (8/10) | +1.89% | 1.9s | 100% | 0.606 |
| 🥈 | featuretools | 0% (0/10) | -2.71% | 0.1s | 100% | 0.397 |
| 🥉 | autofeat | 40% (2/5) | +1.46% | 48.1s | 50% | 0.351 |
### Key Advantages
- Highest win rate: FeatCopilot wins 8 out of 10 datasets
- Best coverage: Works on all datasets (autofeat times out on 50%)
- Fast: ~25x faster than autofeat, with intelligent feature selection
- No harm: Never significantly hurts performance (min improvement -0.02%)
> **Why autofeat has low coverage**
>
> autofeat uses L1 regularization, which is computationally expensive for classification tasks; it timed out (>120s) on 5 of 10 datasets. FeatCopilot uses GBM-based selection, which is much faster.
## AutoML Benchmark
Testing FeatCopilot with FLAML and AutoGluon (120s time budget) across 10 datasets to evaluate feature engineering benefits with AutoML optimization.
### Cross-Framework Summary
| Framework | Datasets | Improved | Avg Improvement | Max Improvement |
|---|---|---|---|---|
| FLAML | 10 | 9 (90%) | +1.85% | +6.67% |
| AutoGluon | 10 | 9 (90%) | +1.55% | +7.62% |
| Combined | 20 | 18 (90%) | +1.70% | — |
### Per-Dataset Results
| Dataset | FLAML Baseline | FLAML +FE | Δ | AutoGluon Base | AutoGluon +FE | Δ |
|---|---|---|---|---|---|---|
| complex_classification | 0.7875 | 0.8400 | +6.67% 🔥 | 0.7550 | 0.8125 | +7.62% 🔥 |
| xor_classification | 0.8180 | 0.8640 | +5.62% 🔥 | 0.8280 | 0.8480 | +2.42% 🔥 |
| polynomial_regression | 0.9103 | 0.9375 | +2.99% 🔥 | 0.9404 | 0.9492 | +0.94% |
| titanic | 0.8156 | 0.8268 | +1.37% | 0.8212 | 0.8324 | +1.36% |
| complex_regression | 0.8704 | 0.8790 | +0.99% | 0.9166 | 0.9401 | +2.57% 🔥 |
| credit_risk | 0.8500 | 0.8550 | +0.59% | 0.8525 | 0.8650 | +1.47% |
| house_prices | 0.9963 | 0.9972 | +0.09% | 0.9969 | 0.9975 | +0.06% |
> **AutoML Observations**
>
> With AutoML (FLAML/AutoGluon), improvements are more modest because these frameworks already use powerful gradient-boosting models. FeatCopilot still provides consistent value by generating meaningful derived features that help AutoML find better models faster.
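The Δ columns in the table above are plain relative improvements over the baseline score. A small hypothetical helper (not part of the benchmark code) makes the arithmetic explicit:

```python
def pct_improvement(baseline: float, with_fe: float) -> float:
    """Relative improvement of the +FE score over the baseline, in percent."""
    return (with_fe - baseline) / baseline * 100.0

# complex_classification under FLAML (values from the table above):
# 0.7875 -> 0.8400 works out to roughly +6.67%
delta = pct_improvement(0.7875, 0.8400)
```

Note that the same absolute gain translates to a smaller Δ on near-perfect baselines, which is why datasets like house_prices show tiny percentages despite real improvement.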
## When FeatCopilot Excels
Based on comprehensive benchmarking across 63 datasets, FeatCopilot provides the most value in these scenarios:
| Scenario | Expected Benefit | Evidence |
|---|---|---|
| Complex interactions | Very High (+30-144%) | triple_interaction: +144%, xor: +104% |
| Small feature sets | High (+5-20%) | polynomial_classification: +10%, complex_classification: +16.5% |
| AutoML enhancement | Medium (+1-8%) | complex_classification: +6.67% (FLAML), +7.62% (AutoGluon) |
| Already high-performing | Low (0-1%) | credit_card_fraud: +0.00% (baseline 0.984) |
| FE tool comparison | Winner (80% win rate) | Composite score 0.606 (#1 of 3 tools) |
> **Key Insight**
>
> FeatCopilot provides the largest improvements on datasets where:
>
> - **Feature interactions matter**: XOR, polynomial, and multi-way interactions
> - **The feature set is small**: more potential for derived features
> - **Baseline performance has room to grow**: near-perfect baselines show minimal improvement
>
> FeatCopilot's GBM-based feature selection ensures it rarely hurts performance, even when improvements are small.
## Running Benchmarks

### Quick Start
```bash
# Clone and install
git clone https://github.com/thinkall/featcopilot.git
cd featcopilot
pip install -e ".[benchmark]"

# Run simple models benchmark (63 datasets)
python -m benchmarks.simple_models.run_simple_models_benchmark --all

# Run with LLM engine
python -m benchmarks.simple_models.run_simple_models_benchmark --all --with-llm

# Run AutoML benchmark (FLAML + AutoGluon)
python -m benchmarks.automl.run_automl_benchmark --framework all

# Run FE tools comparison
python -m benchmarks.compare_tools.run_fe_tools_comparison
```
### Available Benchmarks

| Benchmark | Command | Description |
|---|---|---|
| Simple Models | `python -m benchmarks.simple_models.run_simple_models_benchmark --all` | RF/Ridge on 63 datasets |
| Simple Models + LLM | `python -m benchmarks.simple_models.run_simple_models_benchmark --all --with-llm` | With LLM-generated features |
| AutoML (FLAML) | `python -m benchmarks.automl.run_automl_benchmark --framework flaml` | FLAML on 10 datasets |
| AutoML (AutoGluon) | `python -m benchmarks.automl.run_automl_benchmark --framework autogluon` | AutoGluon on 10 datasets |
| AutoML (All) | `python -m benchmarks.automl.run_automl_benchmark --framework all` | Both frameworks |
| Tool Comparison | `python -m benchmarks.compare_tools.run_fe_tools_comparison` | vs autofeat, featuretools |
### Benchmark Options

```bash
# Specific datasets
python -m benchmarks.simple_models.run_simple_models_benchmark --datasets titanic,house_prices

# By category
python -m benchmarks.automl.run_automl_benchmark --category classification

# Use AutoGluon instead of FLAML
python -m benchmarks.automl.run_automl_benchmark --framework autogluon

# Custom time budget
python -m benchmarks.automl.run_automl_benchmark --time-budget 300
```
### Benchmark Reports

```text
benchmarks/
├── simple_models/
│   ├── SIMPLE_MODELS_BENCHMARK.md       # Tabular only
│   └── SIMPLE_MODELS_BENCHMARK_LLM.md   # Tabular + LLM
├── automl/
│   ├── AUTOML_BENCHMARK.md              # Combined (FLAML + AutoGluon)
│   ├── AUTOML_FLAML_BENCHMARK.md        # FLAML only
│   └── AUTOML_AUTOGLUON_BENCHMARK.md    # AutoGluon only
└── compare_tools/
    └── FE_TOOLS_COMPARISON.md           # FeatCopilot vs competitors
```
## Methodology

### Evaluation Protocol
- **Train/Test Split**: 80/20 with `random_state=42` for reproducibility
- **Metrics**: Accuracy for classification, R² for regression
- **Models**:
    - Simple: RandomForest (`n_estimators=200`, `max_depth=20`), LogisticRegression/Ridge
    - AutoML: FLAML / AutoGluon with a 120s time budget
    - FE Comparison: FLAML with a 30s time budget
- **Feature Selection**: GBM-based importance with a median threshold, plus GBM refinement
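The protocol above can be reproduced in a few lines with scikit-learn. The synthetic dataset here is a stand-in for any of the 63 benchmark datasets; the split, seed, and model hyperparameters follow the protocol as stated.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Stand-in classification dataset
X, y = make_classification(n_samples=1000, n_features=15, random_state=42)

# 80/20 split with random_state=42, as in the evaluation protocol
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Simple-model baseline: RandomForest with the stated hyperparameters
model = RandomForestClassifier(n_estimators=200, max_depth=20, random_state=42)
model.fit(X_train, y_train)
baseline_acc = accuracy_score(y_test, model.predict(X_test))
```

The "+FeatCopilot" score is computed the same way, with the model fit on the transformed training features instead of the raw ones.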
### FeatCopilot Configuration
```python
from featcopilot import AutoFeatureEngineer

# Standard configuration
engineer = AutoFeatureEngineer(
    engines=["tabular", "relational"],
    max_features=100,
    verbose=False,
)

X_transformed = engineer.fit_transform(X_train, y_train)
X_test_transformed = engineer.transform(X_test)
```
### Dataset Coverage
63 datasets across 4 categories:
| Category | Count | Examples |
|---|---|---|
| Classification | 26 | titanic, xor_classification, complex_classification, higgs, covertype |
| Regression | 30 | house_prices, complex_regression, polynomial_regression, diamonds |
| Forecasting | 3 | sensor_anomaly, retail_demand, server_latency |
| Text | 4 | product_reviews, fake_news, news_classification, medical_notes |
Dataset sources include:

- OpenML (INRIA benchmark suite)
- Kaggle (popular ML datasets)
- Synthetic (interaction-heavy datasets for FE evaluation)
## Conclusion
FeatCopilot demonstrates consistent improvements across diverse datasets and frameworks:
- Simple Models: +7.52% average, up to +144% maximum improvement across 63 datasets
- AutoML: +1.70% average across FLAML and AutoGluon, 90% improvement rate
- vs Competitors: #1 ranked FE tool with 80% win rate, beating autofeat and featuretools
Best use cases:

- Datasets with complex feature interactions (XOR, polynomial, multi-way)
- Small feature sets that benefit from derived features
- Any ML pipeline where consistent, safe feature engineering is needed

Key strengths:

- GBM-based selection ensures features rarely hurt performance
- Fast (~2s) compared to alternatives (autofeat: ~48s)
- Works on all dataset types (100% coverage vs autofeat's 50%)