# Feature Selection
After generating features, FeatCopilot automatically selects the most important ones to prevent overfitting and reduce dimensionality.
## Overview

The selection pipeline runs in four stages:

1. **Statistical Selection**: Filter by statistical significance
2. **Model-based Selection**: Rank by ML model importance
3. **Redundancy Elimination**: Remove highly correlated features
4. **Final Selection**: Keep the top N features
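The four stages can be sketched with standard scikit-learn and pandas primitives. This is an illustrative approximation of the pipeline, not FeatCopilot's actual implementation; every function below comes from scikit-learn, pandas, or NumPy.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import mutual_info_classif

# Toy data
Xa, y = make_classification(n_samples=200, n_features=10,
                            n_informative=4, random_state=0)
X = pd.DataFrame(Xa, columns=[f"f{i}" for i in range(10)])

# 1. Statistical selection: score features by mutual information
mi = pd.Series(mutual_info_classif(X, y, random_state=0), index=X.columns)

# 2. Model-based selection: rank by random-forest importance
rf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)
imp = pd.Series(rf.feature_importances_, index=X.columns)

# 3. Redundancy elimination: flag one member of each highly correlated pair
corr = X.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
redundant = [c for c in upper.columns if (upper[c] > 0.95).any()]

# 4. Final selection: keep the top N by combined (averaged) rank
combined = (mi.rank() + imp.rank()).drop(redundant).sort_values(ascending=False)
selected = list(combined.index[:5])
```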
## FeatureSelector

The unified selector combines multiple methods:

```python
from featcopilot.selection import FeatureSelector

selector = FeatureSelector(
    methods=['mutual_info', 'importance'],
    max_features=50,
    correlation_threshold=0.95,
    verbose=True
)
X_selected = selector.fit_transform(X, y)

# Get results
print(f"Selected: {len(selector.get_selected_features())} features")
print(f"Top features: {selector.get_ranking()[:5]}")
```
## Selection Methods

### Statistical Selection

```python
from featcopilot.selection import StatisticalSelector

# Mutual information
selector = StatisticalSelector(
    method='mutual_info',
    max_features=30
)

# F-test (ANOVA)
selector = StatisticalSelector(
    method='f_test',
    max_features=30
)

# Chi-square (for categorical targets)
selector = StatisticalSelector(
    method='chi2',
    max_features=30
)

# Correlation with target
selector = StatisticalSelector(
    method='correlation',
    threshold=0.1  # minimum correlation
)
```
### Model-based Selection

```python
from featcopilot.selection import ImportanceSelector

# Random forest importance
selector = ImportanceSelector(
    model='random_forest',
    max_features=30,
    n_estimators=100
)

# Gradient boosting
selector = ImportanceSelector(
    model='gradient_boosting',
    max_features=30
)

# XGBoost (if installed)
selector = ImportanceSelector(
    model='xgboost',
    max_features=30
)
```
### Redundancy Elimination

```python
from featcopilot.selection import RedundancyEliminator

eliminator = RedundancyEliminator(
    correlation_threshold=0.95,  # remove if correlation > 0.95
    method='pearson',            # 'pearson', 'spearman', or 'kendall'
    importance_scores=None       # optional: keep the more important feature
)
X_reduced = eliminator.fit_transform(X)

# See what was removed
print(f"Removed: {eliminator.get_removed_features()}")
```
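Under the hood, correlation-based pruning amounts to scanning the upper triangle of the feature correlation matrix and dropping one member of each over-threshold pair. A minimal pandas sketch of that logic (illustrative only; `drop_redundant` is a hypothetical helper, not part of FeatCopilot):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
X = pd.DataFrame({"a": rng.normal(size=100)})
X["b"] = X["a"] * 2 + rng.normal(scale=0.01, size=100)  # near-duplicate of "a"
X["c"] = rng.normal(size=100)                            # independent feature

def drop_redundant(df, threshold=0.95):
    """Drop one column of each pair whose absolute correlation exceeds threshold."""
    corr = df.corr().abs()
    # Upper triangle only, so each pair is examined once
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return df.drop(columns=to_drop), to_drop

X_reduced, removed = drop_redundant(X)
```

Passing importance scores (as `RedundancyEliminator` allows) would change only which member of the pair gets dropped.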
## Combined Selection

### In AutoFeatureEngineer

```python
from featcopilot import AutoFeatureEngineer

engineer = AutoFeatureEngineer(
    engines=['tabular'],
    max_features=50,
    selection_methods=['mutual_info', 'importance'],
    correlation_threshold=0.95
)

# Selection happens automatically in fit_transform
X_selected = engineer.fit_transform(X, y)
```
### Manual Pipeline

```python
from featcopilot.engines import TabularEngine
from featcopilot.selection import FeatureSelector

# Generate features
engine = TabularEngine(polynomial_degree=2)
X_features = engine.fit_transform(X)

# Select the best features
selector = FeatureSelector(
    methods=['mutual_info', 'f_test', 'importance'],
    max_features=30,
    correlation_threshold=0.95
)
X_selected = selector.fit_transform(X_features, y)
```
## Configuration

### Method Comparison

| Method | Best For | Speed | Handles Non-linear |
|---|---|---|---|
| `mutual_info` | General use | Medium | ✅ Yes |
| `f_test` | Linear relationships | Fast | ❌ No |
| `chi2` | Categorical features | Fast | ❌ No |
| `correlation` | Quick filtering | Fast | ❌ No |
| `importance` | Complex patterns | Slow | ✅ Yes |
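The "Handles Non-linear" column is easy to verify with scikit-learn's scoring functions (used here as stand-ins for the statistical backends): on a purely quadratic relationship, the F-test scores near zero while mutual information clearly detects the dependence.

```python
import numpy as np
from sklearn.feature_selection import f_regression, mutual_info_regression

rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, size=500)
y = x ** 2 + rng.normal(scale=0.1, size=500)  # non-linear, non-monotonic target
X = x.reshape(-1, 1)

f_stat, _ = f_regression(X, y)                      # linear test: nearly blind to x**2
mi = mutual_info_regression(X, y, random_state=0)   # detects the dependence

print(f"F-statistic: {f_stat[0]:.3f}, mutual information: {mi[0]:.3f}")
```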
### Combining Methods

```python
# Union: keep features selected by ANY method
selector = FeatureSelector(
    methods=['mutual_info', 'importance'],
    combination='union'
)

# Intersection: keep features selected by ALL methods
selector = FeatureSelector(
    methods=['mutual_info', 'importance'],
    combination='intersection'
)
```
## Accessing Results

### Feature Scores

```python
# Combined scores (normalized to 0-1)
scores = selector.get_feature_scores()

# Per-method scores
method_scores = selector.get_method_scores()
print(method_scores['mutual_info'])
print(method_scores['importance'])
```
### Feature Ranking

```python
# Sorted list of (feature, score) tuples
ranking = selector.get_ranking()
for i, (feature, score) in enumerate(ranking[:10], 1):
    print(f"{i}. {feature}: {score:.4f}")
```
### Selected Features

```python
selected = selector.get_selected_features()
print(f"Selected {len(selected)} features:")
print(selected)
```
## Best Practices

### 1. Use Multiple Methods

```python
# Combining methods gives a more robust selection
selector = FeatureSelector(
    methods=['mutual_info', 'f_test', 'importance']
)
```
### 2. Set Appropriate Thresholds

```python
# Conservative: fewer, stronger features
selector = FeatureSelector(
    max_features=20,
    correlation_threshold=0.90
)

# Liberal: more features
selector = FeatureSelector(
    max_features=100,
    correlation_threshold=0.99
)
```
### 3. Consider Task Type

```python
# Classification
selector = StatisticalSelector(method='mutual_info')

# Regression
selector = StatisticalSelector(method='f_test')
```
### 4. Handle Imbalanced Data

```python
# For imbalanced classification, importance-based selection
# often works better than statistical methods
selector = ImportanceSelector(
    model='random_forest',
    n_estimators=200
)
```
## Correlation Matrix

Visualize feature correlations:

```python
import matplotlib.pyplot as plt
import seaborn as sns

eliminator = RedundancyEliminator(correlation_threshold=0.95)
eliminator.fit(X)

# Get the correlation matrix
corr_matrix = eliminator.get_correlation_matrix()

# Plot (requires matplotlib/seaborn)
plt.figure(figsize=(12, 10))
sns.heatmap(corr_matrix, cmap='coolwarm', center=0)
plt.title('Feature Correlation Matrix')
plt.savefig('correlation_matrix.png')
```