Benchmarks¶
FeatCopilot has been extensively benchmarked to demonstrate its effectiveness in automated feature engineering. This page presents comprehensive results across 42 datasets spanning classification and regression tasks.
Executive Summary¶
- **Simple Models Benchmark**: +4.54% average improvement; up to +197% max improvement (delays_zurich)
- **LLM-Enhanced Results**: up to +420% max improvement with the LLM engine; 55% of datasets improved (23/42)
- **AutoML Benchmark**: +8.55% best improvement (abalone); 46% of datasets improved with FLAML
- **Feature Generation**: expansions from 7→30 up to 54→100 features, with smart importance-based selection
Simple Models Benchmark¶
Testing FeatCopilot with RandomForest (n_estimators=200, max_depth=20) and LogisticRegression/Ridge across 42 datasets to measure feature engineering impact.
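For reference, the baseline models can be instantiated as below. This is an illustrative sketch, not the exact benchmark harness; the `random_state=42` seed is assumed to match the split seed given in the methodology section.

```python
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.linear_model import LogisticRegression, Ridge

# Simple-model baselines as described above
clf_models = {
    "random_forest": RandomForestClassifier(n_estimators=200, max_depth=20, random_state=42),
    "logistic_regression": LogisticRegression(max_iter=1000),
}
reg_models = {
    "random_forest": RandomForestRegressor(n_estimators=200, max_depth=20, random_state=42),
    "ridge": Ridge(),
}
```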
Summary Results¶
| Configuration | Datasets | Improved | Avg Improvement | Best Improvement |
|---|---|---|---|---|
| Tabular Engine | 42 | 20 (48%) | +4.54% | +197% (delays_zurich) |
| Tabular + LLM | 42 | 23 (55%) | +6.12% | +420% (delays_zurich) |
Classification Results (22 Datasets)¶
| Dataset | Baseline | +FeatCopilot | Improvement | +LLM | LLM Imp | Features |
|---|---|---|---|---|---|---|
| complex_classification | 0.8800 | 0.9300 | +5.68% | 0.9300 | +5.68% | 15→100 |
| road_safety | 0.7815 | 0.7895 | +1.02% | 0.8040 | +2.88% | 32→89 |
| customer_churn | 0.7575 | 0.7650 | +0.99% | 0.7650 | +0.99% | 10→82 |
| albert | 0.6558 | 0.6591 | +0.50% | 0.6529 | -0.44% | 31→100 |
| bioresponse | 0.7700 | 0.7729 | +0.38% | 0.7773 | +0.95% | 419→419 |
| higgs | 0.7129 | 0.7081 | -0.67% | 0.7154 | +0.35% | 24→100 |
| magic_telescope | 0.8509 | 0.8528 | +0.22% | 0.8528 | +0.22% | 10→65 |
| electricity | 0.8984 | 0.8986 | +0.03% | 0.9006 | +0.25% | 8→50 |
| covertype_cat | 0.8747 | 0.8749 | +0.02% | 0.8839 | +1.05% | 54→100 |
| titanic | 0.9218 | 0.9162 | -0.61% | 0.9162 | -0.61% | 7→27 |
| credit_card_fraud | 0.9840 | 0.9840 | +0.00% | 0.9840 | +0.00% | 30→100 |
| employee_attrition | 0.9558 | 0.9558 | +0.00% | 0.9558 | +0.00% | 11→74 |
Regression Results (20 Datasets)¶
| Dataset | Baseline R² | +FeatCopilot R² | Improvement | +LLM R² | LLM Imp | Features |
|---|---|---|---|---|---|---|
| delays_zurich | 0.0051 | 0.0153 | +197% | 0.0268 | +420% | 11→57 |
| abalone | 0.5287 | 0.5762 | +8.98% | 0.5769 | +9.12% | 7→30 |
| nyc_taxi | 0.6391 | 0.6253 | -2.17% | 0.6775 | +6.01% | 16→44 |
| bike_sharing | 0.8080 | 0.8082 | +0.02% | 0.8367 | +3.55% | 10→48 |
| wine_quality | 0.4972 | 0.5027 | +1.12% | 0.5067 | +1.91% | 11→45 |
| bike_sharing_inria | 0.6788 | 0.6861 | +1.07% | 0.6901 | +1.67% | 6→37 |
| miami_housing | 0.9146 | 0.9201 | +0.61% | 0.9214 | +0.74% | 13→70 |
| brazilian_houses | 0.9960 | 0.9964 | +0.04% | 1.0000 | +0.40% | 11→66 |
| diamonds | 0.9456 | 0.9461 | +0.05% | 0.9462 | +0.06% | 6→19 |
| house_prices | 0.9306 | 0.9308 | +0.02% | 0.9305 | -0.02% | 14→46 |
| cpu_act | 0.9798 | 0.9800 | +0.02% | 0.9803 | +0.05% | 21→100 |
| superconduct | 0.9300 | 0.9301 | +0.01% | 0.9299 | -0.01% | 79→100 |
Key Insight
The LLM engine provides additional value on top of tabular features, particularly for datasets where domain knowledge helps (delays_zurich: +420%, nyc_taxi: +6.01%, bike_sharing: +3.55%).
AutoML Benchmark¶
Testing FeatCopilot with FLAML (120s time budget per model) across 41 datasets to evaluate feature engineering benefits with AutoML optimization.
Summary¶
| Metric | Value |
|---|---|
| Total Datasets | 41 |
| Classification | 21 |
| Regression | 20 |
| Improved | 19 (46%) |
| Best Improvement | +8.55% (abalone) |
Top Improvements¶
| Dataset | Task | Baseline | +FeatCopilot | Improvement | Features |
|---|---|---|---|---|---|
| abalone | regression | 0.5384 | 0.5844 | +8.55% | 7→30 |
| credit_risk | classification | 0.6925 | 0.7100 | +2.53% | 10→86 |
| delays_zurich | regression | 0.0810 | 0.0828 | +2.24% | 11→57 |
| mercedes_benz | regression | 0.5763 | 0.5866 | +1.79% | 359→359 |
| complex_classification | classification | 0.9100 | 0.9225 | +1.37% | 15→100 |
| eye_movements | classification | 0.6649 | 0.6721 | +1.09% | 23→97 |
| bioresponse | classification | 0.7802 | 0.7875 | +0.93% | 419→419 |
| medical_diagnosis | classification | 0.8400 | 0.8467 | +0.79% | 12→72 |
AutoML Observations
With AutoML (FLAML), improvements are more modest because the framework already performs internal feature selection and hyperparameter tuning. FeatCopilot still provides value by generating meaningful derived features that AutoML can leverage.
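For context, a minimal FLAML baseline under the 120s budget might look like the sketch below (an assumed setup on a public stand-in dataset, not the benchmark harness itself):

```python
from flaml import AutoML
from sklearn.datasets import fetch_california_housing
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

X, y = fetch_california_housing(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# FLAML searches models and hyperparameters within the time budget (seconds)
automl = AutoML()
automl.fit(X_train=X_train, y_train=y_train, task="regression", time_budget=120)
print(f"Test R²: {r2_score(y_test, automl.predict(X_test)):.4f}")
```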
When FeatCopilot Excels¶
Based on comprehensive benchmarking across 42 datasets, FeatCopilot provides the most value in these scenarios:
| Scenario | Expected Benefit | Evidence |
|---|---|---|
| Low baseline performance | Very High (+50-400%) | delays_zurich: +197% (tabular), +420% (LLM) |
| Small feature sets | High (+5-10%) | abalone: +8.98% (7→30 features) |
| Domain-specific tasks | High (+3-6%) | bike_sharing: +3.55%, nyc_taxi: +6.01% |
| Complex classification | Medium (+1-5%) | complex_classification: +5.68% |
| Already high-performing | Low (0-1%) | credit_card_fraud: +0.00% (baseline 0.984) |
Key Insight

FeatCopilot provides the largest improvements on datasets where:

- **Baseline performance is poor**: more room for improvement
- **The feature set is small**: more potential for derived features
- **Domain knowledge helps**: the LLM can suggest meaningful features

Datasets already near perfect performance (>0.98) show minimal improvement, as expected.
Running Benchmarks¶
Quick Start¶
```bash
# Clone and install
git clone https://github.com/thinkall/featcopilot.git
cd featcopilot
pip install -e ".[benchmark]"

# Run simple models benchmark (42 datasets)
python -m benchmarks.simple_models.run_simple_models_benchmark --all

# Run with LLM engine
python -m benchmarks.simple_models.run_simple_models_benchmark --all --with-llm

# Run AutoML benchmark (41 datasets)
python -m benchmarks.automl.run_automl_benchmark --all

# Run FE tools comparison
python -m benchmarks.compare_tools.run_fe_tools_comparison --all
```
Available Benchmarks¶
| Benchmark | Command | Description |
|---|---|---|
| Simple Models | `python -m benchmarks.simple_models.run_simple_models_benchmark --all` | RF/Ridge on 42 datasets |
| Simple Models + LLM | `python -m benchmarks.simple_models.run_simple_models_benchmark --all --with-llm` | With LLM-generated features |
| AutoML | `python -m benchmarks.automl.run_automl_benchmark --all` | FLAML on 41 datasets |
| AutoML + LLM | `python -m benchmarks.automl.run_automl_benchmark --all --with-llm` | FLAML with LLM features |
| Tool Comparison | `python -m benchmarks.compare_tools.run_fe_tools_comparison --all` | vs Featuretools, OpenFE, etc. |
Benchmark Options¶
```bash
# Specific datasets
python -m benchmarks.simple_models.run_simple_models_benchmark --datasets titanic,house_prices

# By category
python -m benchmarks.automl.run_automl_benchmark --category classification

# Use AutoGluon instead of FLAML
python -m benchmarks.automl.run_automl_benchmark --framework autogluon

# Custom time budget (seconds)
python -m benchmarks.automl.run_automl_benchmark --time-budget 300
```
Benchmark Reports¶
Reports are saved with a date suffix and an LLM indicator:

```
benchmarks/
├── simple_models/
│   ├── SIMPLE_MODELS_BENCHMARK.md       # Tabular only
│   └── SIMPLE_MODELS_BENCHMARK_LLM.md   # Tabular + LLM
├── automl/
│   ├── AUTOML_BENCHMARK.md              # Tabular only
│   └── AUTOML_BENCHMARK_LLM.md          # Tabular + LLM
└── compare_tools/
    └── FE_TOOLS_COMPARISON.md
```
Methodology¶
Evaluation Protocol¶
- **Train/Test Split**: 80/20 with `random_state=42` for reproducibility
- **Metrics**: F1 (weighted) for classification, R² for regression
- **Models**:
    - Simple: RandomForest (n_estimators=200, max_depth=20), LogisticRegression/Ridge
    - AutoML: FLAML with a 120s time budget
- **Feature Selection**: importance-based with a 1% threshold filter
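The protocol above can be reproduced with a short script like the following (a minimal sketch on a stand-in dataset, not the actual benchmark runner):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# 80/20 split with the fixed seed from the protocol
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = RandomForestClassifier(n_estimators=200, max_depth=20, random_state=42)
model.fit(X_train, y_train)
baseline = f1_score(y_test, model.predict(X_test), average="weighted")

# Reported improvement is the relative gain over the baseline score:
#   improvement_pct = (engineered_score - baseline_score) / baseline_score * 100
print(f"Baseline weighted F1: {baseline:.4f}")
```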
FeatCopilot Configuration¶
```python
from featcopilot import FeatureEngineer
from featcopilot.selection import FeatureSelector

# Standard configuration
engineer = FeatureEngineer(
    engines=["tabular"],
    max_features=100,
    verbose=1,
)

# With LLM engine
engineer = FeatureEngineer(
    engines=["tabular", "llm"],
    llm_config={"model": "gpt-4o-mini"},
    max_features=100,
)

# Feature selection
selector = FeatureSelector(
    method="importance",
    threshold=0.01,  # Keep features with >1% relative importance
)
```
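End-to-end usage might then look like the fragment below. It assumes a scikit-learn-style fit/transform API, which is an assumption here; consult the FeatCopilot documentation for the exact method names.

```python
# Hypothetical usage, continuing the configuration above (API names assumed)
X_train_fe = engineer.fit_transform(X_train, y_train)   # generate candidate features
selector.fit(X_train_fe, y_train)                       # rank features by importance
X_train_final = selector.transform(X_train_fe)          # keep >1% relative importance
X_test_final = selector.transform(engineer.transform(X_test))
```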
Dataset Coverage¶
42 datasets across classification and regression:
| Category | Count | Examples |
|---|---|---|
| Classification | 22 | titanic, higgs, covertype, credit, albert, eye_movements |
| Regression | 20 | house_prices, diamonds, abalone, cpu_act, superconduct |
Dataset sources include:

- OpenML (INRIA benchmark suite)
- Kaggle (popular ML datasets)
- Custom synthetic (complex_classification, complex_regression)
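Many of the OpenML datasets can be fetched by name with scikit-learn, as in the minimal example below (the benchmark's own data loaders may differ):

```python
from sklearn.datasets import fetch_openml

# Pull the abalone dataset from OpenML as a pandas DataFrame
abalone = fetch_openml(name="abalone", version=1, as_frame=True)
X, y = abalone.data, abalone.target
print(X.shape, y.name)
```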
Conclusion¶
FeatCopilot demonstrates consistent improvements across diverse datasets:
- Simple Models: +4.54% average, up to +197% maximum improvement
- With LLM: roughly +1.6 percentage points more on average (+4.54% → +6.12%), with some datasets seeing far larger gains (delays_zurich: +197% → +420%)
- AutoML: +8.55% best improvement (abalone), with 46% of datasets improved
Best use cases:

- Datasets with poor baseline performance
- Small feature sets that benefit from derived features
- Tasks where domain knowledge helps (LLM engine)

Current limitations:

- Near-perfect baselines show minimal improvement
- Very high-dimensional datasets (400+ features) see less benefit