# Command-Line Interface
FeatCopilot ships a stable, agent-friendly featcopilot CLI for using the
library from shells, CI pipelines, and agentic / LLM tool-use workflows
without writing Python glue. All subcommands accept --json for
machine-readable stdout; user-facing errors are written to stderr with
a non-zero exit code so that automation can parse failures
deterministically.
The CLI is installed automatically with the package via the
[project.scripts] entry point (featcopilot = "featcopilot.cli:main"),
so after pip install featcopilot the featcopilot command is available
on $PATH. The equivalent module form python -m featcopilot ... always
works regardless of how the package was installed.
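Both forms invoke the same entry point:

```bash
featcopilot --help
python -m featcopilot --help
```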
## Subcommands

| Command | Purpose |
|---|---|
| `featcopilot info` | Print version, supported engines, selection methods, leakage guards, I/O formats, and a runtime `parquet_available` flag. |
| `featcopilot transform` | Read a CSV / Parquet / JSON file, run `AutoFeatureEngineer`, and write engineered features to an output file. |
| `featcopilot explain` | Fit and print a JSON document with `{name, explanation, code}` per feature for downstream LLM consumption (no output file is written). |
Run any subcommand with --help to see the full flag list:
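```bash
featcopilot transform --help
```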
## Output contract
All three subcommands honor the same agent-friendly contract:
- `stdout` carries the result. With `--json` (always implicit for `explain`), exactly one JSON document is written.
- `stderr` is reserved for failures. A successful run keeps `stderr` empty even when `AutoFeatureEngineer` emits leakage warnings or `verbose` logger output ─ those are surfaced via the JSON payload's `warnings` field instead. The same contract covers warnings emitted during pandas / pyarrow read or write phases (e.g. `DtypeWarning` on mixed-type CSVs, `FutureWarning` from a successful Parquet write): they are routed to the JSON `warnings` field, never to `stderr`.
- Exit codes: `0` on success; `2` for user-input errors (missing files, malformed config, unknown target, etc.); `1` for unexpected internal errors.
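For example, a CI step can capture both streams and branch on the exit code:

```bash
featcopilot transform -i data.csv -t label -o features.csv --json \
  > result.json 2> errors.log
echo $?   # 0 = success, 2 = user-input error, 1 = unexpected internal error
```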
## `featcopilot info`
Discover capabilities without running an engineer:
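```bash
featcopilot info --json
```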
Sample (truncated) output:
```json
{
  "version": "0.4.0",
  "supported_engines": ["llm", "relational", "tabular", "text", "timeseries"],
  "supported_selection_methods": [
    "chi2",
    "correlation",
    "f_test",
    "importance",
    "mutual_info",
    "xgboost"
  ],
  "supported_leakage_guards": ["off", "raise", "warn"],
  "supported_input_formats": ["csv", "json"],
  "supported_output_formats": ["csv", "json"],
  "parquet_available": false
}
```
When a parquet engine (pyarrow or fastparquet) is importable in the
current environment, `"parquet"` is added to `supported_input_formats`
and `supported_output_formats` (in source order, so the lists become
`["csv", "parquet", "json"]`) and `parquet_available` flips to `true`.
The base FeatCopilot install does not pin a parquet engine; install one with
`pip install pyarrow` (or `fastparquet`) to enable Parquet I/O.
## `featcopilot transform`
Run feature engineering on a tabular input and write the engineered features to disk:
```bash
featcopilot transform \
  --input data.csv --target label --output features.csv \
  --engines tabular --max-features 50 \
  --json
```
Common flags:
| Flag | Purpose |
|---|---|
| `--input` / `-i` | Path to the input file (CSV / Parquet / JSON). Required. |
| `--output` / `-o` | Path to the output file. Required. |
| `--target` / `-t` | Target column. Required when feature selection is applied (i.e. when `--max-features` / config `max_features` is set). |
| `--input-format` / `--output-format` | Override format detection (`csv` / `parquet` / `json`). |
| `--engines` | One or more engines to enable (default: `tabular`). |
| `--max-features N` | Cap on engine output / selection. Forwarded both to engine constructors and to the selector. |
| `--no-selection` | Skip feature selection entirely (raw feature generation). |
| `--selection-methods` | Override the default `mutual_info` + `importance` selection set. |
| `--leakage-guard` | How to handle suspicious column names: `warn` (default: log a warning and continue), `raise` (hard-fail with an error), or `off` (disable the check). |
| `--include-target` | Re-attach the target column to the output file (collision-safe). |
| `--task-description` | Free-form ML task description forwarded to LLM-aware engines. |
| `--config FILE` | JSON config with nested keys (e.g. `llm_config`, `selection_methods`). CLI flags override config values. |
| `--verbose` / `--no-verbose` | Toggle verbose logging. With `--json`, log records are routed to the JSON `warnings` field rather than `stderr`. |
| `--gate-n-jobs` | Parallelism for the do-no-harm gate's random forest (default 1; -1 = all cores). |
| `--json` | Emit a one-line JSON status object on stdout instead of human-readable text. |
A successful --json run prints something like:
```json
{
  "status": "ok",
  "input": "data.csv",
  "output": "features.csv",
  "input_format": "csv",
  "output_format": "csv",
  "n_rows": 1000,
  "n_features": 47,
  "n_input_columns": 12,
  "n_generated_features": 47,
  "engines": ["tabular"],
  "selection_methods": ["mutual_info", "importance"],
  "max_features": 50,
  "target": "label",
  "selection_applied": true,
  "warnings": []
}
```
## `featcopilot explain`
Fit the engineer (without writing any output file) and print a JSON catalog of generated features for downstream LLM consumption:
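A minimal invocation looks like the following (illustrative: `explain` shares its input flags with `transform`, so check `featcopilot explain --help` for the exact set):

```bash
featcopilot explain --input data.csv --target label --engines tabular > features.json
```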
Each entry in the features array contains the feature name, an
LLM-style natural-language explanation, and the executable Python
code used to produce it.
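For illustration only (the feature name, explanation, and code below are hypothetical, not real FeatCopilot output), an entry has this shape:

```json
{
  "name": "amount_log",
  "explanation": "Log transform of the right-skewed amount column to stabilize variance.",
  "code": "df['amount_log'] = np.log1p(df['amount'])"
}
```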
explain defaults to running on the full input so the metadata is
a faithful description of what a corresponding transform would
generate. Some engines (notably the tabular engine's categorical
encoding) consult per-row / per-category statistics when planning
features, so blind subsampling can silently change results. For very
large inputs where metadata-only explain should not pay full memory
or compute cost, opt in with:
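A sketch of the opt-in; the row-cap flag name below is an assumption, so verify the exact spelling with `featcopilot explain --help`:

```bash
# --max-rows is assumed here for illustration; confirm the flag name via --help.
featcopilot explain --input big.csv --target label --max-rows 100000
```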
The cap is a deterministic head slice (the first N rows), threaded
through `pd.read_csv(nrows=N)` for CSV so memory is bounded natively.
For Parquet / JSON, pandas has no native row limit, so the file is
fully read and then truncated; a `UserWarning` explaining the
limitation is emitted (and surfaced in the JSON `warnings` field) only
when the cap actually truncates the input.
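Roughly what that means in pandas terms (a sketch of the mechanism, not the CLI's exact code):

```python
import pandas as pd

N = 100_000  # requested row cap

# CSV: the cap is pushed into the reader, so at most N rows are ever materialized.
df_csv = pd.read_csv("big.csv", nrows=N)

# Parquet / JSON: pandas has no row-limit argument, so the whole file is read, then truncated.
df_parquet = pd.read_parquet("big.parquet").head(N)
```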
## Configuration files
Pass --config config.json to provide nested keys that don't have
matching CLI flags, such as the llm_config engine kwargs:
```json
{
  "engines": ["tabular", "llm"],
  "max_features": 80,
  "selection_methods": ["mutual_info", "importance"],
  "llm_config": {
    "backend": "litellm",
    "model": "gpt-4o",
    "max_suggestions": 20
  }
}
```
Explicit CLI flags override values from the config file. Any malformed
scalar (e.g. "max_features": "5", "verbose": "false") is rejected
with a clean exit-2 error rather than failing later inside the
engineer.
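For example, with the config above, the following run keeps the `llm_config` block but the explicit flag wins for `max_features`:

```bash
featcopilot transform \
  --input data.csv --target label --output features.csv \
  --config config.json --max-features 100 --json
```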
## Parquet I/O
The base FeatCopilot install does not pin a parquet engine. To use
--input file.parquet / --output file.parquet (or the parquet
value of --input-format / --output-format), install one of:
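```bash
pip install pyarrow
# or
pip install fastparquet
```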
Confirm with featcopilot info --json:
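```bash
featcopilot info --json | jq '.parquet_available'   # jq is optional; any JSON parser works
# -> true once a parquet engine is installed
```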
If neither engine is installed, attempting Parquet I/O fails with a clean exit-2 error pointing at the missing dependency.
## Agentic-usage tips

- Always pass `--json`. Treat anything on `stderr` as a hard failure; treat anything on `stdout` as the JSON result (see the wrapper sketch below).
- Treat the JSON `warnings` field as a list of human-readable diagnostic strings: it is non-empty for `transform` runs that generated leakage / mock-mode / sampling notices, and empty for fully clean runs.
- For long-running batch jobs, prefer `featcopilot transform` over `python -m featcopilot transform` only because the former is shorter; both invoke the exact same entry point.
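A minimal Python wrapper that follows these rules (a sketch; the helper name and file paths are placeholders):

```python
import json
import subprocess

def run_featcopilot(args: list[str]) -> dict:
    """Run a featcopilot subcommand and parse the JSON document it writes to stdout."""
    proc = subprocess.run(["featcopilot", *args], capture_output=True, text=True)
    if proc.returncode != 0:
        # 2 = user-input error, 1 = unexpected internal error; stderr carries the message.
        raise RuntimeError(f"featcopilot failed ({proc.returncode}): {proc.stderr.strip()}")
    return json.loads(proc.stdout)

result = run_featcopilot(
    ["transform", "-i", "data.csv", "-t", "label", "-o", "features.csv", "--json"]
)
for note in result["warnings"]:
    print("diagnostic:", note)
```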
## See also

- Overview ─ the underlying `AutoFeatureEngineer` API.
- Engines ─ what each engine generates.
- LLM Features ─ configuring the LLM backend (provide an `llm_config` object inside the JSON file passed to `--config`, as shown in the Configuration files section above).