Skip to content

Overview

FeatCopilot provides a unified framework for automated feature engineering, combining multiple approaches into a single, easy-to-use API.

Architecture

graph TD
    subgraph AE[AutoFeatureEngineer - Main Entry Point]
        TE[Tabular Engine] --> FG[Feature Generation]
        TSE[TimeSeries Engine] --> FG
        RE[Relational Engine] --> FG
        LE[LLM Engine] --> FG
        FG --> FS[Feature Selection]
        FS --> SF[Selected Features]
    end

Core Components

1. Engines

Engines are responsible for generating new features from input data:

Engine Purpose Key Features
TabularEngine Numeric feature transformation Polynomial, interactions, math transforms
TimeSeriesEngine Time series feature extraction Statistics, autocorrelation, FFT
RelationalEngine Multi-table aggregation Joins, aggregations, group-by
TextEngine Text feature extraction Length stats, TF-IDF, embeddings
SemanticEngine LLM-powered generation Semantic understanding, code gen

2. Feature Selection

After generation, features are ranked and selected:

  • Statistical Selection: Mutual information, F-test, chi-square
  • Model-based Selection: Random Forest importance, XGBoost
  • Redundancy Elimination: Correlation-based filtering

3. Feature Representation

Every feature includes metadata:

from featcopilot.core import Feature, FeatureType, FeatureOrigin

feature = Feature(
    name="age_income_ratio",
    dtype=FeatureType.NUMERIC,
    origin=FeatureOrigin.LLM_GENERATED,
    source_columns=["age", "income"],
    explanation="Ratio indicating financial maturity",
    code="result = df['age'] / (df['income'] + 1e-8)"
)

Workflow

Basic Workflow

from featcopilot import AutoFeatureEngineer

# 1. Initialize
engineer = AutoFeatureEngineer(
    engines=['tabular'],
    max_features=50
)

# 2. Fit (learns from data)
engineer.fit(X_train, y_train)

# 3. Transform (generates features)
X_train_fe = engineer.transform(X_train)
X_test_fe = engineer.transform(X_test)

LLM-Enhanced Workflow

# 1. Initialize with LLM
engineer = AutoFeatureEngineer(
    engines=['tabular', 'llm'],
    llm_config={'model': 'gpt-5.2', 'domain': 'finance'}
)

# 2. Fit with context
engineer.fit(
    X_train, y_train,
    column_descriptions={...},
    task_description="Predict loan default"
)

# 3. Get explanations
explanations = engineer.explain_features()

# 4. Generate custom features
custom = engineer.generate_custom_features(
    prompt="Create risk indicators"
)

Design Principles

1. Modularity

Each component can be used independently:

# Use just the tabular engine
from featcopilot.engines import TabularEngine

engine = TabularEngine(polynomial_degree=2)
X_fe = engine.fit_transform(X)

2. Sklearn Compatibility

Works with sklearn pipelines:

from sklearn.pipeline import Pipeline

pipeline = Pipeline([
    ('features', AutoFeatureEngineer(engines=['tabular'])),
    ('model', RandomForestClassifier())
])

3. Interpretability

Every feature has an explanation:

for name, explanation in engineer.explain_features().items():
    print(f"{name}: {explanation}")

4. Graceful Degradation

LLM features fall back to heuristics when unavailable:

# Works even without Copilot authentication
engineer = AutoFeatureEngineer(engines=['llm'])
# Warning: Using mock LLM responses

Configuration

Engine Configuration

AutoFeatureEngineer(
    engines=['tabular', 'timeseries', 'llm'],
    max_features=100,
    selection_methods=['mutual_info', 'importance'],
    correlation_threshold=0.95,
    verbose=True
)

LLM Configuration

llm_config = {
    'model': 'gpt-5.2',           # Model to use
    'max_suggestions': 20,       # Features to suggest
    'domain': 'healthcare',      # Domain context
    'validate_features': True    # Validate generated code
}

Next Steps