LLM-Powered Example¶

Demonstrates FeatCopilot's unique LLM capabilities using LiteLLM (supports OpenAI, Azure, Anthropic, GitHub Models, GitHub Copilot, and more).

Prerequisites¶

LLM provider API key (e.g., OpenAI, Azure, Anthropic, GitHub)
featcopilot[litellm] installed
API key configured via environment variable

LLM Backend Options¶

FeatCopilot supports multiple LLM backends:

# Option 1: GitHub Copilot SDK (default)
llm_config = {'model': 'gpt-5.2'}

# Option 2: LiteLLM with OpenAI
llm_config = {'model': 'gpt-4o', 'backend': 'litellm'}

# Option 3: LiteLLM with Anthropic
llm_config = {'model': 'claude-3-opus', 'backend': 'litellm'}

# Option 4: LiteLLM with GitHub Marketplace Models
# Uses GITHUB_API_KEY environment variable
llm_config = {'model': 'github/gpt-4o', 'backend': 'litellm'}

# Option 5: LiteLLM with GitHub Copilot Chat API
# Uses OAuth device flow authentication (requires Copilot subscription)
llm_config = {'model': 'github_copilot/gpt-4', 'backend': 'litellm'}

# Option 6: LiteLLM with local Ollama
llm_config = {
    'model': 'ollama/llama2',
    'backend': 'litellm',
    'api_base': 'http://localhost:11434'
}

The Problem¶

Build a diabetes risk prediction model with:

Semantic feature understanding
Domain-aware feature generation
Human-readable explanations
Reusable transform rules

Setup¶

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score

from featcopilot import AutoFeatureEngineer
from featcopilot import TransformRule, TransformRuleStore, TransformRuleGenerator

Create Healthcare Data¶

def create_healthcare_data(n_samples=500):
    """Create synthetic healthcare dataset."""
    np.random.seed(42)

    data = pd.DataFrame({
        'age': np.random.randint(20, 90, n_samples),
        'bmi': np.random.normal(26, 5, n_samples),
        'blood_pressure_systolic': np.random.normal(120, 20, n_samples),
        'blood_pressure_diastolic': np.random.normal(80, 12, n_samples),
        'cholesterol_total': np.random.normal(200, 40, n_samples),
        'cholesterol_hdl': np.random.normal(55, 15, n_samples),
        'cholesterol_ldl': np.random.normal(120, 35, n_samples),
        'glucose_fasting': np.random.normal(100, 25, n_samples),
        'hba1c': np.random.normal(5.5, 1.2, n_samples),
        'smoking_years': np.random.exponential(5, n_samples),
        'exercise_hours_weekly': np.random.exponential(3, n_samples),
    })

    # Create diabetes risk target
    risk = (
        0.01 * (data['age'] - 40)
        + 0.02 * (data['bmi'] - 25)
        + 0.01 * data['glucose_fasting']
        + 0.1 * data['hba1c']
        + 0.01 * data['smoking_years']
        - 0.02 * data['exercise_hours_weekly']
    )
    risk = 1 / (1 + np.exp(-risk))
    data['diabetes_risk'] = (np.random.random(n_samples) < risk).astype(int)

    return data

data = create_healthcare_data(500)
X = data.drop('diabetes_risk', axis=1)
y = data['diabetes_risk']

Define Column Descriptions¶

This is key for LLM understanding:

column_descriptions = {
    'age': 'Patient age in years',
    'bmi': 'Body Mass Index (weight in kg / height in m squared)',
    'blood_pressure_systolic': 'Systolic blood pressure in mmHg',
    'blood_pressure_diastolic': 'Diastolic blood pressure in mmHg',
    'cholesterol_total': 'Total cholesterol level in mg/dL',
    'cholesterol_hdl': 'HDL (good) cholesterol in mg/dL',
    'cholesterol_ldl': 'LDL (bad) cholesterol in mg/dL',
    'glucose_fasting': 'Fasting blood glucose in mg/dL',
    'hba1c': 'Hemoglobin A1c percentage (3-month glucose average)',
    'smoking_years': 'Number of years patient has smoked',
    'exercise_hours_weekly': 'Average hours of exercise per week',
}

Initialize with LLM¶

Using GitHub Copilot SDK (Default)¶

engineer = AutoFeatureEngineer(
    engines=['tabular', 'llm'],
    max_features=40,
    llm_config={
        'model': 'gpt-5.2',
        'max_suggestions': 15,
        'domain': 'healthcare',
        'validate_features': True
    },
    verbose=True
)

Using LiteLLM with GitHub Copilot¶

There are two GitHub providers in LiteLLM:

GitHub Marketplace Models (`github/` prefix)¶

Access models from GitHub Marketplace using the github/ prefix. Requires GITHUB_API_KEY environment variable.

import os
os.environ['GITHUB_API_KEY'] = 'your-github-token'

engineer = AutoFeatureEngineer(
    engines=['tabular', 'llm'],
    max_features=40,
    llm_config={
        'model': 'github/gpt-4o',      # GitHub Marketplace GPT-4o
        'backend': 'litellm',
        'max_suggestions': 15,
        'domain': 'healthcare',
        'validate_features': True
    },
    verbose=True
)

Available GitHub Marketplace models: - github/gpt-4o - GPT-4o - github/gpt-4o-mini - Lighter, faster GPT-4o - github/Llama-3.2-11B-Vision-Instruct - Llama 3.2 Vision - github/Llama-3.1-70b-Versatile - Llama 3.1 70B - github/Phi-4 - Microsoft Phi-4 - github/Mixtral-8x7b-32768 - Mixtral 8x7B

GitHub Copilot Chat API (`github_copilot/` prefix)¶

Access GitHub Copilot's Chat API using the github_copilot/ prefix. Uses OAuth device flow authentication (requires paid Copilot subscription).

engineer = AutoFeatureEngineer(
    engines=['tabular', 'llm'],
    max_features=40,
    llm_config={
        'model': 'github_copilot/gpt-4',  # GitHub Copilot Chat API
        'backend': 'litellm',
        'max_suggestions': 15,
        'domain': 'healthcare',
        'validate_features': True
    },
    verbose=True
)

On first use, you'll be prompted to authenticate: 1. LiteLLM displays a device code and verification URL 2. Visit the URL and enter the code 3. Credentials are stored locally for future use

Available GitHub Copilot models: - github_copilot/gpt-4 - GPT-4 - github_copilot/gpt-5.1-codex - GPT-5.1 Codex

Using LiteLLM with OpenAI¶

import os
os.environ['OPENAI_API_KEY'] = 'your-openai-key'

engineer = AutoFeatureEngineer(
    engines=['tabular', 'llm'],
    max_features=40,
    llm_config={
        'model': 'gpt-4o',
        'backend': 'litellm',
        'max_suggestions': 15,
        'domain': 'healthcare',
    },
    verbose=True
)

Using LiteLLM with Anthropic Claude¶

import os
os.environ['ANTHROPIC_API_KEY'] = 'your-anthropic-key'

engineer = AutoFeatureEngineer(
    engines=['tabular', 'llm'],
    max_features=40,
    llm_config={
        'model': 'claude-3-opus',
        'backend': 'litellm',
        'max_suggestions': 15,
        'domain': 'healthcare',
    },
    verbose=True
)

Using LiteLLM with Local Ollama¶

engineer = AutoFeatureEngineer(
    engines=['tabular', 'llm'],
    max_features=40,
    llm_config={
        'model': 'ollama/llama2',
        'backend': 'litellm',
        'api_base': 'http://localhost:11434',
        'max_suggestions': 15,
        'domain': 'healthcare',
    },
    verbose=True
)

Feature Engineering¶

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

X_train_fe = engineer.fit_transform(
    X_train, y_train,
    column_descriptions=column_descriptions,
    task_description="Predict Type 2 diabetes risk based on patient health metrics"
)

X_test_fe = engineer.transform(X_test)

# Align and clean
common_cols = [c for c in X_train_fe.columns if c in X_test_fe.columns]
X_train_fe = X_train_fe[common_cols].fillna(0)
X_test_fe = X_test_fe[common_cols].fillna(0)

print(f"Features generated: {len(X_train_fe.columns)}")

Get Feature Explanations¶

explanations = engineer.explain_features()

print("Feature Explanations:")
print("=" * 50)

for name, explanation in list(explanations.items())[:5]:
    print(f"\n📊 {name}")
    print(f"   {explanation}")

Example Output:

Feature Explanations:
==================================================

📊 age_bmi_ratio
   Ratio of age to BMI, may indicate metabolic age vs chronological age

📊 glucose_hba1c_product
   Interaction between fasting glucose and HbA1c captures glucose control

📊 cholesterol_ratio
   Ratio of total to HDL cholesterol, key cardiovascular risk indicator

📊 blood_pressure_mean
   Mean arterial pressure approximation from systolic and diastolic

📊 lifestyle_score
   Combined score from exercise and inverse of smoking years

Get Generated Code¶

feature_code = engineer.get_feature_code()

print("Generated Feature Code:")
print("=" * 50)

for name, code in list(feature_code.items())[:3]:
    print(f"\n# {name}")
    print(code)

Example Output:

# age_bmi_ratio
result = df['age'] / (df['bmi'] + 1e-8)

# glucose_hba1c_product
result = df['glucose_fasting'] * df['hba1c']

# cholesterol_ratio
result = df['cholesterol_total'] / (df['cholesterol_hdl'] + 1e-8)

Train Model¶

model = GradientBoostingClassifier(n_estimators=100, random_state=42)
model.fit(X_train_fe, y_train)

pred = model.predict_proba(X_test_fe)[:, 1]
auc = roc_auc_score(y_test, pred)

print(f"\nROC-AUC: {auc:.4f}")

Generate Custom Features¶

Request specific features:

# Generate risk stratification features
custom_features = engineer.generate_custom_features(
    prompt="Create cardiac risk stratification features based on blood pressure and cholesterol",
    n_features=3
)

for feat in custom_features:
    print(f"\nFeature: {feat['name']}")
    print(f"Code: {feat['code']}")
    print(f"Explanation: {feat['explanation']}")

Generate Feature Report¶

from featcopilot.llm import FeatureExplainer

explainer = FeatureExplainer(model='gpt-5.2')

report = explainer.generate_feature_report(
    features=engineer._engine_instances['llm'].get_feature_set(),
    X=X_train_fe,
    column_descriptions=column_descriptions,
    task_description="Predict diabetes risk"
)

# Save report
with open('diabetes_feature_report.md', 'w') as f:
    f.write(report)

print("Report saved to diabetes_feature_report.md")

Transform Rules¶

Create reusable feature transformations from natural language that can be saved and applied across different datasets.

from featcopilot import TransformRule, TransformRuleStore, TransformRuleGenerator

# Initialize store and generator
store = TransformRuleStore()  # Default: ~/.featcopilot/rules.json
generator = TransformRuleGenerator(store=store)

# Generate rule from natural language
rule = generator.generate_from_description(
    description="Calculate the ratio of glucose to HbA1c",
    columns={"glucose_fasting": "float", "hba1c": "float"},
    tags=["healthcare", "diabetes"],
    save=True
)

# Apply to data
df = pd.DataFrame({'glucose_fasting': [95, 110], 'hba1c': [5.4, 6.2]})
result = rule.apply(df)

Reuse on Different Datasets¶

Rules automatically match columns with similar names:

# New dataset with different column names
new_data = pd.DataFrame({
    'patient_glucose': [100, 120],
    'patient_hba1c': [5.8, 6.5]
})

# Find and apply matching rules
matches = store.find_matching_rules(columns=new_data.columns.tolist())
if matches:
    rule, mapping = matches[0]
    result = rule.apply(new_data, column_mapping=mapping)

Manual Rules and Management¶

# Create manual rule
manual_rule = TransformRule(
    name="bmi_calc",
    description="Calculate BMI",
    code="result = df['weight'] / (df['height'] ** 2 + 1e-8)",
    input_columns=["weight", "height"],
    column_patterns=[".*weight.*", ".*height.*"],
)
store.save_rule(manual_rule)

# Search, list, export
store.list_rules(tags=["healthcare"])
store.search_by_description("diabetes")
store.export_rules("rules.json")

Complete Script¶

"""
FeatCopilot LLM-Powered Example
"""
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score

from featcopilot import AutoFeatureEngineer
from featcopilot import TransformRule, TransformRuleStore, TransformRuleGenerator

# Create sample healthcare data
np.random.seed(42)
n = 500
X = pd.DataFrame({
    'age': np.random.randint(20, 90, n),
    'bmi': np.random.normal(26, 5, n),
    'glucose': np.random.normal(100, 25, n),
    'hba1c': np.random.normal(5.5, 1.2, n),
})
y = ((X['glucose'] > 100) & (X['hba1c'] > 5.7)).astype(int)

# Split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# LLM-powered feature engineering
engineer = AutoFeatureEngineer(
    engines=['tabular', 'llm'],
    max_features=30,
    llm_config={'model': 'gpt-5.2', 'domain': 'healthcare'}
)

X_train_fe = engineer.fit_transform(
    X_train, y_train,
    column_descriptions={
        'age': 'Patient age',
        'bmi': 'Body Mass Index',
        'glucose': 'Fasting glucose mg/dL',
        'hba1c': 'HbA1c percentage'
    },
    task_description="Predict diabetes risk"
).fillna(0)

X_test_fe = engineer.transform(X_test).fillna(0)
cols = [c for c in X_train_fe.columns if c in X_test_fe.columns]

# Train and evaluate
model = GradientBoostingClassifier(random_state=42)
model.fit(X_train_fe[cols], y_train)
auc = roc_auc_score(y_test, model.predict_proba(X_test_fe[cols])[:, 1])

print(f"ROC-AUC: {auc:.4f}")

# Show explanations
for feat, expl in list(engineer.explain_features().items())[:3]:
    print(f"{feat}: {expl}")

# ============================================================
# Transform Rules: Create reusable transformations
# ============================================================
store = TransformRuleStore()
generator = TransformRuleGenerator(store=store)

# Generate and save a reusable rule
rule = generator.generate_from_description(
    description="Calculate glucose-HbA1c product as diabetes indicator",
    columns={"glucose": "float", "hba1c": "float"},
    tags=["healthcare", "diabetes"],
    save=True
)

print(f"\nCreated reusable rule: {rule.name}")
print(f"Code: {rule.code}")
print(f"This rule can now be reused on any dataset with similar columns!")