Command-Line Interface

FeatCopilot ships a stable, agent-friendly featcopilot CLI for using the library from shells, CI pipelines, and agentic / LLM tool-use workflows without writing Python glue. All subcommands accept --json for machine-readable stdout; user-facing errors are written to stderr with a non-zero exit code so that automation can parse failures deterministically.

The CLI is installed automatically with the package via the [project.scripts] entry point (featcopilot = "featcopilot.cli:main"), so after pip install featcopilot the featcopilot command is available on $PATH. The equivalent module form python -m featcopilot ... always works regardless of how the package was installed.

Subcommands

featcopilot info ─ Print version, supported engines, selection methods, leakage guards, I/O formats, and a runtime parquet_available flag.
featcopilot transform ─ Read a CSV / Parquet / JSON file, run AutoFeatureEngineer, and write engineered features to an output file.
featcopilot explain ─ Fit and print a JSON document with {name, explanation, code} per feature for downstream LLM consumption (no output file is written).

Run any subcommand with --help to see the full flag list:

featcopilot --help
featcopilot transform --help
featcopilot explain --help

Output contract

All three subcommands honor the same agent-friendly contract:

  • stdout carries the result. With --json (always implicit for explain), exactly one JSON document is written.
  • stderr is reserved for failures. A successful run keeps stderr empty even when AutoFeatureEngineer emits leakage warnings or verbose logger output ─ those are surfaced via the JSON payload's warnings field instead. This same contract covers warnings emitted during pandas / pyarrow read or write phases (e.g. DtypeWarning on mixed-type CSVs, FutureWarning from a successful Parquet write): they are routed to the JSON warnings field, never to stderr.
  • Exit codes: 0 on success; 2 for user-input errors (missing files, malformed config, unknown target, etc.); 1 for unexpected internal errors.
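This contract can be consumed mechanically. A minimal agent-side sketch, assuming only the documented exit codes and stream semantics (the helper name classify_result is illustrative, not part of the package):

```python
import json

def classify_result(exit_code: int, stdout: str, stderr: str):
    """Interpret a featcopilot invocation per the documented contract:
    exit 0 -> parse the single JSON document on stdout;
    exit 2 -> user-input error; anything else -> internal error."""
    if exit_code == 0:
        payload = json.loads(stdout)  # exactly one JSON document on stdout
        return ("ok", payload)
    kind = "user-input-error" if exit_code == 2 else "internal-error"
    return (kind, stderr.strip())

status, payload = classify_result(0, '{"status": "ok", "warnings": []}', "")
```

Because stderr stays empty on success, an automation layer can treat any non-empty stderr as a failure signal without parsing it.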

featcopilot info

Discover capabilities without running an engineer:

featcopilot info --json

Sample (truncated) output:

{
  "version": "0.4.0",
  "supported_engines": ["llm", "relational", "tabular", "text", "timeseries"],
  "supported_selection_methods": [
    "chi2",
    "correlation",
    "f_test",
    "importance",
    "mutual_info",
    "xgboost"
  ],
  "supported_leakage_guards": ["off", "raise", "warn"],
  "supported_input_formats": ["csv", "json"],
  "supported_output_formats": ["csv", "json"],
  "parquet_available": false
}

When a parquet engine (pyarrow or fastparquet) is importable in the current environment, "parquet" is added to supported_input_formats and supported_output_formats (in source order, so the lists become ["csv", "parquet", "json"]) and parquet_available flips to true.

parquet_available reflects whether pyarrow or fastparquet is importable in the current environment. The base FeatCopilot install does not pin a parquet engine; install one with pip install pyarrow (or fastparquet) to enable Parquet I/O.
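A rough way to reproduce that probe outside the CLI, assuming the flag simply reflects importability of either engine (the helper name is illustrative):

```python
import importlib.util

def parquet_available() -> bool:
    """Sketch of the capability probe behind the parquet_available flag:
    Parquet I/O is possible iff pyarrow or fastparquet is importable."""
    return any(
        importlib.util.find_spec(name) is not None
        for name in ("pyarrow", "fastparquet")
    )
```

find_spec only locates the module without importing it, so the check is cheap enough to run at startup.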

featcopilot transform

Run feature engineering on a tabular input and write the engineered features to disk:

featcopilot transform \
    --input data.csv --target label --output features.csv \
    --engines tabular --max-features 50 \
    --json

Common flags:

--input / -i ─ Path to input file (CSV / Parquet / JSON). Required.
--output / -o ─ Path to output file. Required.
--target / -t ─ Target column. Required when feature selection is applied (i.e. when --max-features / config max_features is set).
--input-format / --output-format ─ Override format detection (csv / parquet / json).
--engines ─ One or more engines to enable (default: tabular).
--max-features N ─ Cap on engine output / selection. Forwarded both to engine constructors and to the selector.
--no-selection ─ Skip feature selection entirely (raw feature generation).
--selection-methods ─ Override the default mutual_info + importance selection set.
--leakage-guard ─ How to handle suspicious column names: warn (default: log a warning and continue), raise (hard-fail with an error), or off (disable the check).
--include-target ─ Re-attach the target column to the output file (collision-safe).
--task-description ─ Free-form ML task description forwarded to LLM-aware engines.
--config FILE ─ JSON config with nested keys (e.g. llm_config, selection_methods). CLI flags override config values.
--verbose / --no-verbose ─ Toggle verbose logging. With --json, log records are routed to the JSON warnings field rather than stderr.
--gate-n-jobs ─ Parallelism for the do-no-harm gate's RF (default 1; -1 = all cores).
--json ─ Emit a one-line JSON status object on stdout instead of human-readable text.

A successful --json run prints something like:

{
  "status": "ok",
  "input": "data.csv",
  "output": "features.csv",
  "input_format": "csv",
  "output_format": "csv",
  "n_rows": 1000,
  "n_features": 47,
  "n_input_columns": 12,
  "n_generated_features": 47,
  "engines": ["tabular"],
  "selection_methods": ["mutual_info", "importance"],
  "max_features": 50,
  "target": "label",
  "selection_applied": true,
  "warnings": []
}
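An agent consuming this object can validate it before trusting it. A minimal sketch, assuming only the keys shown in the sample above (the helper and key set are illustrative, not an official schema):

```python
import json

# Keys this page documents in a transform --json result (subset).
REQUIRED_KEYS = {
    "status", "input", "output", "n_rows", "n_features",
    "engines", "selection_applied", "warnings",
}

def check_transform_result(doc: str) -> dict:
    """Parse the one-line status object and fail fast if any
    documented key is missing."""
    payload = json.loads(doc)
    missing = REQUIRED_KEYS - payload.keys()
    if missing:
        raise ValueError(f"missing keys: {sorted(missing)}")
    return payload

sample = ('{"status": "ok", "input": "data.csv", "output": "features.csv", '
          '"n_rows": 1000, "n_features": 47, "engines": ["tabular"], '
          '"selection_applied": true, "warnings": []}')
result = check_transform_result(sample)
```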

featcopilot explain

Fit the engineer (without writing any output file) and print a JSON catalog of generated features for downstream LLM consumption:

featcopilot explain --input data.csv --target label

Each entry in the features array contains the feature name, an LLM-style natural-language explanation, and the executable Python code used to produce it.

explain defaults to running on the full input so the metadata is a faithful description of what a corresponding transform would generate. Some engines (notably the tabular engine's categorical encoding) consult per-row / per-category statistics when planning features, so blind subsampling can silently change results. For very large inputs where metadata-only explain should not pay full memory or compute cost, opt in with:

featcopilot explain --input big.csv --target label --explain-sample-size 5000

The cap is a deterministic head slice (the first N rows), threaded through pd.read_csv(nrows=N) for CSV so memory is bounded natively. For Parquet / JSON pandas has no native row-limit, so the file is fully read and then truncated; a UserWarning explaining the limitation is emitted (and surfaced in the JSON warnings field) only when the cap actually truncates the input.
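The truncation semantics can be pictured with a small sketch, assuming a plain in-memory row list (head_slice is an illustrative name, not part of the CLI):

```python
import warnings

def head_slice(rows, n):
    """Deterministic head slice: keep the first n rows, warning only
    when the cap actually truncates the input (mirrors the documented
    --explain-sample-size behaviour for non-CSV inputs)."""
    if len(rows) > n:
        warnings.warn(f"input truncated to first {n} rows", UserWarning)
        return rows[:n]
    return rows  # cap not reached: no warning, nothing dropped
```

For CSV the CLI avoids materializing the full file at all by passing the cap as pd.read_csv(nrows=N).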

Configuration files

Pass --config config.json to provide nested keys that don't have matching CLI flags, such as the llm_config engine kwargs:

{
  "engines": ["tabular", "llm"],
  "max_features": 80,
  "selection_methods": ["mutual_info", "importance"],
  "llm_config": {
    "backend": "litellm",
    "model": "gpt-4o",
    "max_suggestions": 20
  }
}

Explicit CLI flags override values from the config file. Any malformed scalar (e.g. "max_features": "5", "verbose": "false") is rejected with a clean exit-2 error rather than failing later inside the engineer.
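The precedence and validation rules can be sketched as follows, assuming a simple dict merge (merge_config and its checks are illustrative, not the CLI's internals):

```python
def merge_config(file_cfg: dict, cli_overrides: dict) -> dict:
    """Explicit CLI flags (non-None values) win over config-file values;
    malformed scalars are rejected up front with the user-input exit code."""
    merged = {**file_cfg,
              **{k: v for k, v in cli_overrides.items() if v is not None}}
    if "max_features" in merged and not isinstance(merged["max_features"], int):
        raise SystemExit(2)  # e.g. "max_features": "5" in the config file
    if "verbose" in merged and not isinstance(merged["verbose"], bool):
        raise SystemExit(2)  # e.g. "verbose": "false" (string, not bool)
    return merged
```

Validating before constructing the engineer is what lets the real CLI fail with exit 2 instead of surfacing an opaque error mid-run.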

Parquet I/O

The base FeatCopilot install does not pin a parquet engine. To use --input file.parquet / --output file.parquet (or the parquet value of --input-format / --output-format), install one of:

pip install pyarrow      # recommended
# or
pip install fastparquet

Confirm with featcopilot info --json:

{ "parquet_available": true, ... }

If neither engine is installed, attempting Parquet I/O fails with a clean exit-2 error pointing at the missing dependency.

Agentic-usage tips

  • Always pass --json. Treat anything on stderr as a hard failure; treat anything on stdout as the JSON result.
  • Treat the JSON warnings field as a list of human-readable diagnostic strings ─ it is non-empty for transform runs that generated leakage / mock-mode / sampling notices, and empty for fully clean runs.
  • For long-running batch jobs, prefer featcopilot transform to python -m featcopilot transform only because the former is shorter; both invoke the exact same entry point.
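The first tip can be wrapped into a small helper; a sketch assuming only the documented streams and exit codes (run_json is an illustrative name, and cmd would normally be something like ["featcopilot", "transform", ..., "--json"]):

```python
import json
import subprocess

def run_json(cmd: list) -> dict:
    """Agent-side wrapper for the output contract: any non-zero exit or
    non-empty stderr is a hard failure; stdout is the JSON result."""
    proc = subprocess.run(cmd, capture_output=True, text=True)
    if proc.returncode != 0 or proc.stderr.strip():
        raise RuntimeError(f"exit {proc.returncode}: {proc.stderr.strip()}")
    return json.loads(proc.stdout)
```

Centralizing the contract in one wrapper keeps individual tool-use steps from each reimplementing (and mis-implementing) the failure handling.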

See also

  • Overview ─ the underlying AutoFeatureEngineer API.
  • Engines ─ what each engine generates.
  • LLM Features ─ configuring the LLM backend (provide an llm_config object inside the JSON file passed to --config, as shown in the Configuration files section above).