Benchmarks¶
FeatCopilot has been extensively benchmarked to demonstrate its effectiveness in automated feature engineering. This page presents comprehensive results across 42 datasets spanning classification and regression tasks.
Executive Summary¶
- **Simple Models Benchmark**: +4.54% average improvement; up to +197% max improvement (delays_zurich)
- **LLM-Enhanced Results**: up to +420% max improvement with the LLM engine; 55% of datasets improved (23/42)
- **AutoML Benchmark**: +8.55% best improvement (abalone); 46% of datasets improved with FLAML
- **Feature Generation**: expansions from 7→30 up to 54→100 features, with smart importance-based selection
Simple Models Benchmark¶
Testing FeatCopilot with RandomForest (n_estimators=200, max_depth=20) and LogisticRegression/Ridge across 42 datasets to measure feature engineering impact.
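For reference, the baseline models can be instantiated as below. This is an illustrative sketch, not the exact benchmark harness; the `random_state=42` seed is assumed to match the split seed given in the methodology section.

```python
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.linear_model import LogisticRegression, Ridge

# Simple-model baselines as described above
clf_models = {
    "random_forest": RandomForestClassifier(n_estimators=200, max_depth=20, random_state=42),
    "logistic_regression": LogisticRegression(max_iter=1000),
}
reg_models = {
    "random_forest": RandomForestRegressor(n_estimators=200, max_depth=20, random_state=42),
    "ridge": Ridge(),
}
```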
Summary Results¶
| Configuration | Datasets | Improved | Avg Improvement | Best Improvement |
|---|---|---|---|---|
| Tabular Engine | 42 | 20 (48%) | +4.54% | +197% (delays_zurich) |
| Tabular + LLM | 42 | 23 (55%) | +6.12% | +420% (delays_zurich) |
Classification Results (22 Datasets)¶
| Dataset | Baseline | +FeatCopilot | Improvement | +LLM | LLM Imp | Features |
|---|---|---|---|---|---|---|
| complex_classification | 0.8800 | 0.9300 | +5.68% | 0.9300 | +5.68% | 15→100 |
| road_safety | 0.7815 | 0.7895 | +1.02% | 0.8040 | +2.88% | 32→89 |
| customer_churn | 0.7575 | 0.7650 | +0.99% | 0.7650 | +0.99% | 10→82 |
| albert | 0.6558 | 0.6591 | +0.50% | 0.6529 | -0.44% | 31→100 |
| bioresponse | 0.7700 | 0.7729 | +0.38% | 0.7773 | +0.95% | 419→419 |
| higgs | 0.7129 | 0.7081 | -0.67% | 0.7154 | +0.35% | 24→100 |
| magic_telescope | 0.8509 | 0.8528 | +0.22% | 0.8528 | +0.22% | 10→65 |
| electricity | 0.8984 | 0.8986 | +0.03% | 0.9006 | +0.25% | 8→50 |
| covertype_cat | 0.8747 | 0.8749 | +0.02% | 0.8839 | +1.05% | 54→100 |
| titanic | 0.9218 | 0.9162 | -0.61% | 0.9162 | -0.61% | 7→27 |
| credit_card_fraud | 0.9840 | 0.9840 | +0.00% | 0.9840 | +0.00% | 30→100 |
| employee_attrition | 0.9558 | 0.9558 | +0.00% | 0.9558 | +0.00% | 11→74 |
Regression Results (20 Datasets)¶
| Dataset | Baseline R² | +FeatCopilot R² | Improvement | +LLM R² | LLM Imp | Features |
|---|---|---|---|---|---|---|
| delays_zurich | 0.0051 | 0.0153 | +197% | 0.0268 | +420% | 11→57 |
| abalone | 0.5287 | 0.5762 | +8.98% | 0.5769 | +9.12% | 7→30 |
| nyc_taxi | 0.6391 | 0.6253 | -2.17% | 0.6775 | +6.01% | 16→44 |
| bike_sharing | 0.8080 | 0.8082 | +0.02% | 0.8367 | +3.55% | 10→48 |
| wine_quality | 0.4972 | 0.5027 | +1.12% | 0.5067 | +1.91% | 11→45 |
| bike_sharing_inria | 0.6788 | 0.6861 | +1.07% | 0.6901 | +1.67% | 6→37 |
| miami_housing | 0.9146 | 0.9201 | +0.61% | 0.9214 | +0.74% | 13→70 |
| brazilian_houses | 0.9960 | 0.9964 | +0.04% | 1.0000 | +0.40% | 11→66 |
| diamonds | 0.9456 | 0.9461 | +0.05% | 0.9462 | +0.06% | 6→19 |
| house_prices | 0.9306 | 0.9308 | +0.02% | 0.9305 | -0.02% | 14→46 |
| cpu_act | 0.9798 | 0.9800 | +0.02% | 0.9803 | +0.05% | 21→100 |
| superconduct | 0.9300 | 0.9301 | +0.01% | 0.9299 | -0.01% | 79→100 |
Key Insight
The LLM engine provides additional value on top of tabular features, particularly for datasets where domain knowledge helps (delays_zurich: +420%, nyc_taxi: +6.01%, bike_sharing: +3.55%).
AutoML Benchmark¶
Testing FeatCopilot with FLAML (120s time budget per model) across 41 datasets to evaluate feature engineering benefits with AutoML optimization.
Summary¶
| Metric | Value |
|---|---|
| Total Datasets | 41 |
| Classification | 21 |
| Regression | 20 |
| Improved | 19 (46%) |
| Best Improvement | +8.55% (abalone) |
Top Improvements¶
| Dataset | Task | Baseline | +FeatCopilot | Improvement | Features |
|---|---|---|---|---|---|
| abalone | regression | 0.5384 | 0.5844 | +8.55% | 7→30 |
| credit_risk | classification | 0.6925 | 0.7100 | +2.53% | 10→86 |
| delays_zurich | regression | 0.0810 | 0.0828 | +2.24% | 11→57 |
| mercedes_benz | regression | 0.5763 | 0.5866 | +1.79% | 359→359 |
| complex_classification | classification | 0.9100 | 0.9225 | +1.37% | 15→100 |
| eye_movements | classification | 0.6649 | 0.6721 | +1.09% | 23→97 |
| bioresponse | classification | 0.7802 | 0.7875 | +0.93% | 419→419 |
| medical_diagnosis | classification | 0.8400 | 0.8467 | +0.79% | 12→72 |
AutoML Observations
With AutoML (FLAML), improvements are more modest because the framework already performs internal feature selection and hyperparameter tuning. FeatCopilot still provides value by generating meaningful derived features that AutoML can leverage.
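For context, a minimal FLAML baseline under the 120s budget might look like the sketch below (an assumed setup on a public stand-in dataset, not the benchmark harness itself):

```python
from flaml import AutoML
from sklearn.datasets import fetch_california_housing
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

X, y = fetch_california_housing(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# FLAML searches models and hyperparameters within the time budget (seconds)
automl = AutoML()
automl.fit(X_train=X_train, y_train=y_train, task="regression", time_budget=120)
print(f"Test R²: {r2_score(y_test, automl.predict(X_test)):.4f}")
```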
When FeatCopilot Excels¶
Based on comprehensive benchmarking across 42 datasets, FeatCopilot provides the most value in these scenarios:
| Scenario | Expected Benefit | Evidence |
|---|---|---|
| Low baseline performance | Very High (+50-400%) | delays_zurich: +197% (tabular), +420% (LLM) |
| Small feature sets | High (+5-10%) | abalone: +8.98% (7→30 features) |
| Domain-specific tasks | High (+3-6%) | bike_sharing: +3.55%, nyc_taxi: +6.01% |
| Complex classification | Medium (+1-5%) | complex_classification: +5.68% |
| Already high-performing | Low (0-1%) | credit_card_fraud: +0.00% (baseline 0.984) |
Key Insight

FeatCopilot provides the largest improvements on datasets where:

- **Baseline performance is poor**: more room for improvement
- **The feature set is small**: more potential for derived features
- **Domain knowledge helps**: the LLM can suggest meaningful features

Datasets already near perfect performance (>0.98) show minimal improvement, as expected.
Running Benchmarks¶
Quick Start¶
```bash
# Clone and install
git clone https://github.com/thinkall/featcopilot.git
cd featcopilot
pip install -e ".[benchmark]"

# Run simple models benchmark (42 datasets)
python -m benchmarks.simple_models.run_simple_models_benchmark --all

# Run with LLM engine
python -m benchmarks.simple_models.run_simple_models_benchmark --all --with-llm

# Run AutoML benchmark (41 datasets)
python -m benchmarks.automl.run_automl_benchmark --all

# Run FE tools comparison
python -m benchmarks.compare_tools.run_fe_tools_comparison --all
```
Available Benchmarks¶
| Benchmark | Command | Description |
|---|---|---|
| Simple Models | `python -m benchmarks.simple_models.run_simple_models_benchmark --all` | RF/Ridge on 42 datasets |
| Simple Models + LLM | `python -m benchmarks.simple_models.run_simple_models_benchmark --all --with-llm` | With LLM-generated features |
| AutoML | `python -m benchmarks.automl.run_automl_benchmark --all` | FLAML on 41 datasets |
| AutoML + LLM | `python -m benchmarks.automl.run_automl_benchmark --all --with-llm` | FLAML with LLM features |
| Tool Comparison | `python -m benchmarks.compare_tools.run_fe_tools_comparison --all` | vs Featuretools, OpenFE, etc. |
Benchmark Options¶
```bash
# Specific datasets
python -m benchmarks.simple_models.run_simple_models_benchmark --datasets titanic,house_prices

# By category
python -m benchmarks.automl.run_automl_benchmark --category classification

# Use AutoGluon instead of FLAML
python -m benchmarks.automl.run_automl_benchmark --framework autogluon

# Custom time budget (seconds)
python -m benchmarks.automl.run_automl_benchmark --time-budget 300
```
Benchmark Reports¶
Reports are saved with a date suffix and an LLM indicator:

```
benchmarks/
├── simple_models/
│   ├── SIMPLE_MODELS_BENCHMARK.md       # Tabular only
│   └── SIMPLE_MODELS_BENCHMARK_LLM.md   # Tabular + LLM
├── automl/
│   ├── AUTOML_BENCHMARK.md              # Tabular only
│   └── AUTOML_BENCHMARK_LLM.md          # Tabular + LLM
└── compare_tools/
    └── FE_TOOLS_COMPARISON.md
```
Methodology¶
Evaluation Protocol¶
- **Train/Test Split**: 80/20 with `random_state=42` for reproducibility
- **Metrics**: F1 (weighted) for classification, R² for regression
- **Models**:
    - Simple: RandomForest (n_estimators=200, max_depth=20), LogisticRegression/Ridge
    - AutoML: FLAML with a 120s time budget
- **Feature Selection**: importance-based with a 1% threshold filter
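The protocol above can be reproduced with a short script like the following (a minimal sketch on a stand-in dataset, not the actual benchmark runner):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# 80/20 split with the fixed seed from the protocol
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = RandomForestClassifier(n_estimators=200, max_depth=20, random_state=42)
model.fit(X_train, y_train)
baseline = f1_score(y_test, model.predict(X_test), average="weighted")

# Reported improvement is the relative gain over the baseline score:
#   improvement_pct = (engineered_score - baseline_score) / baseline_score * 100
print(f"Baseline weighted F1: {baseline:.4f}")
```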
FeatCopilot Configuration¶
```python
from featcopilot import FeatureEngineer
from featcopilot.selection import FeatureSelector

# Standard configuration
engineer = FeatureEngineer(
    engines=["tabular"],
    max_features=100,
    verbose=1,
)

# With LLM engine
engineer = FeatureEngineer(
    engines=["tabular", "llm"],
    llm_config={"model": "gpt-4o-mini"},
    max_features=100,
)

# Feature selection
selector = FeatureSelector(
    method="importance",
    threshold=0.01,  # Keep features with >1% relative importance
)
```
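End-to-end usage might then look like the fragment below. It assumes a scikit-learn-style fit/transform API, which is an assumption here; consult the FeatCopilot documentation for the exact method names.

```python
# Hypothetical usage, continuing the configuration above (API names assumed)
X_train_fe = engineer.fit_transform(X_train, y_train)   # generate candidate features
selector.fit(X_train_fe, y_train)                       # rank features by importance
X_train_final = selector.transform(X_train_fe)          # keep >1% relative importance
X_test_final = selector.transform(engineer.transform(X_test))
```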
Dataset Coverage¶
42 datasets across classification and regression:
| Category | Count | Examples |
|---|---|---|
| Classification | 22 | titanic, higgs, covertype, credit, albert, eye_movements |
| Regression | 20 | house_prices, diamonds, abalone, cpu_act, superconduct |
Dataset sources include:

- OpenML (INRIA benchmark suite)
- Kaggle (popular ML datasets)
- Custom synthetic (complex_classification, complex_regression)
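Many of the OpenML datasets can be fetched by name with scikit-learn, as in the minimal example below (the benchmark's own data loaders may differ):

```python
from sklearn.datasets import fetch_openml

# Pull the abalone dataset from OpenML as a pandas DataFrame
abalone = fetch_openml(name="abalone", version=1, as_frame=True)
X, y = abalone.data, abalone.target
print(X.shape, y.name)
```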
Conclusion¶
FeatCopilot demonstrates consistent improvements across diverse datasets:
- Simple Models: +4.54% average, up to +197% maximum improvement
- With LLM: roughly +1.6 percentage points more on average (+4.54% → +6.12%), with some datasets seeing far larger gains (delays_zurich: +197% → +420%)
- AutoML: +8.55% best improvement (abalone), with 46% of datasets improved
Best use cases:

- Datasets with poor baseline performance
- Small feature sets that benefit from derived features
- Tasks where domain knowledge helps (LLM engine)

Current limitations:

- Near-perfect baselines show minimal improvement
- Very high-dimensional datasets (400+ features) see less benefit