
validation

Validation helpers for safer feature engineering workflows.

find_potential_leakage_columns(columns, target_name=None, keywords=None)

Find suspicious columns that may leak label or future information.

Parameters:

Name | Type | Description | Default
columns | list | Input column names or labels to inspect. | required
target_name | Any, optional | Expected target/label column name or label. Related variants will be flagged. | None
keywords | list[str] | Additional suspicious keywords to match against normalized column names. | None

Returns:

Type | Description
list | Column names or labels that deserve manual review for leakage.

Notes

Target-name matching is intentionally fuzzy: labels are normalized and substring variants are flagged so derived names such as target_encoded are reviewed.
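
The fuzzy target check described above can be sketched as follows. This is a minimal illustration, not the library's code: `normalize_label` is a hypothetical stand-in for the private `_normalize_column_name` helper (whose body is not shown on this page), and the lowercasing step is an assumption. Only the stripping of non-alphanumerics is stated in the notes.

```python
import re


def normalize_label(label):
    """Hypothetical stand-in: lowercase (assumed) and strip non-alphanumerics."""
    return re.sub(r"[^a-z0-9]", "", str(label).lower())


def target_hit(column, target_name):
    """Flag a column when either normalized name contains the other."""
    col = normalize_label(column)
    tgt = normalize_label(target_name)
    if not col or not tgt:
        # Empty normalized names carry no information to compare.
        return False
    return tgt in col or col in tgt


# Check a few column names against the target "target".
flags = {
    name: target_hit(name, "target")
    for name in ["target_encoded", "Target", "tar_get", "get", "age"]
}
```

Because the check runs in both directions, a short column name such as `get` is also flagged: it is a substring of `target`. That is consistent with the function's purpose of surfacing candidates for manual review rather than deciding leakage on its own.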

Pass keywords=[] to opt out of keyword-based matching entirely (only the explicit target_name will be used). target_name is treated as absent only when it is None or normalizes to an empty string after stripping non-alphanumerics; this lets falsy-but-meaningful values such as 0 participate in matching while preventing target_name="" from matching every column via the empty-substring trap. Symmetrically, columns whose labels normalize to an empty string (e.g. "---", "!!!") are skipped entirely so the empty normalized_column cannot be reported as a substring of every normalized_target.
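
The empty-substring trap mentioned above comes from Python's `in` operator, which treats `""` as a substring of every string. The sketch below shows why the function tests `is None` and then checks the normalized value, instead of relying on truthiness. `normalize_label` and `resolve_target` are hypothetical names introduced for illustration, and the lowercasing in the normalizer is an assumption.

```python
import re


def normalize_label(label):
    """Hypothetical stand-in: lowercase (assumed) and strip non-alphanumerics."""
    return re.sub(r"[^a-z0-9]", "", str(label).lower())


def resolve_target(target_name):
    """Return the normalized target, or None when there is nothing to match."""
    if target_name is None:
        return None
    normalized = normalize_label(target_name)
    # An empty result would match every column, so treat it as "no target".
    return normalized or None


# The trap: the empty string is a substring of any string.
empty_matches_everything = "" in "customer_id"

# A truthiness test on the raw value would wrongly discard the label 0.
truthiness_drops_zero = (0 or None) is None

resolved = {repr(value): resolve_target(value) for value in [None, "", "---", 0, "Churn!"]}
```

Under these assumptions, `None`, `""`, and `"---"` all resolve to `None`, while `0` resolves to `"0"` and still takes part in matching.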

Source code in featcopilot/utils/validation.py
def find_potential_leakage_columns(
    columns: list[Any],
    target_name: Any | None = None,
    keywords: list[str] | None = None,
) -> list[Any]:
    """
    Find suspicious columns that may leak label or future information.

    Parameters
    ----------
    columns : list
        Input column names or labels to inspect.
    target_name : Any, optional
        Expected target/label column name or label. Related variants will be flagged.
    keywords : list[str], optional
        Additional suspicious keywords to match against normalized column names.

    Returns
    -------
    list
        Column names or labels that deserve manual review for leakage.

    Notes
    -----
    Target-name matching is intentionally fuzzy: labels are normalized and substring
    variants are flagged so derived names such as ``target_encoded`` are reviewed.

    Pass ``keywords=[]`` to opt out of keyword-based matching entirely (only
    the explicit ``target_name`` will be used). ``target_name`` is treated as
    absent only when it is ``None`` *or* normalizes to an empty string after
    stripping non-alphanumerics; this lets falsy-but-meaningful values such
    as ``0`` participate in matching while preventing ``target_name=""``
    from matching every column via the empty-substring trap. Symmetrically,
    columns whose labels normalize to an empty string (e.g. ``"---"``,
    ``"!!!"``) are skipped entirely so the empty ``normalized_column`` cannot
    be reported as a substring of every ``normalized_target``.
    """
    # Use ``is None`` defaulting (rather than ``or``) so callers can pass an
    # empty list to explicitly disable keyword matching.
    if keywords is None:
        keywords = DEFAULT_LEAKAGE_KEYWORDS
    normalized_keywords = [_normalize_column_name(keyword) for keyword in keywords]

    # Use explicit ``is None`` so falsy but meaningful target labels (e.g. ``0``)
    # still participate in matching. After normalization, an empty string is
    # treated as "no target" so that ``target_name=""`` (or values like ``"---"``
    # that strip to nothing) cannot match every column via the
    # ``normalized_target in normalized_column`` substring check.
    if target_name is None:
        normalized_target: str | None = None
    else:
        normalized = _normalize_column_name(target_name)
        normalized_target = normalized if normalized else None

    suspicious: list[Any] = []
    for column in columns:
        normalized_column = _normalize_column_name(column)

        # A column whose label normalizes to an empty string (e.g. ``"---"`` or
        # ``"!!!"``) has no meaningful content to compare against. Skipping it
        # closes the column-side counterpart of the empty-target trap: an empty
        # ``normalized_column`` is a substring of every ``normalized_target``,
        # so ``normalized_column in normalized_target`` would flag it whenever
        # a target was provided. Keyword matching is unaffected either way,
        # because a non-empty keyword is never a substring of ``""``.
        if not normalized_column:
            continue

        keyword_hit = any(keyword and keyword in normalized_column for keyword in normalized_keywords)
        target_hit = normalized_target is not None and (
            normalized_column == normalized_target
            or normalized_target in normalized_column
            or normalized_column in normalized_target
        )

        if keyword_hit or target_hit:
            suspicious.append(column)

    return suspicious
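
To experiment with this logic outside the package, the two module-level names the listing relies on can be stubbed. The sketch below is a condensed restatement for illustration only: the keyword list and the normalizer body are assumptions (the real `DEFAULT_LEAKAGE_KEYWORDS` and `_normalize_column_name` are not shown on this page), and `find_leaks` is a hypothetical name, not part of the library.

```python
import re
from typing import Any

# Hypothetical stand-ins for names defined elsewhere in the real module.
DEFAULT_LEAKAGE_KEYWORDS = ["target", "label", "future"]  # assumed values


def _normalize_column_name(name: Any) -> str:
    # Assumed behavior: lowercase, then strip non-alphanumerics.
    return re.sub(r"[^a-z0-9]", "", str(name).lower())


def find_leaks(columns, target_name=None, keywords=None):
    """Condensed restatement of the matching logic in the listing above."""
    if keywords is None:
        keywords = DEFAULT_LEAKAGE_KEYWORDS
    normalized_keywords = [_normalize_column_name(k) for k in keywords]

    normalized_target = None
    if target_name is not None:
        normalized_target = _normalize_column_name(target_name) or None

    suspicious = []
    for column in columns:
        normalized_column = _normalize_column_name(column)
        if not normalized_column:
            continue  # labels such as "---" have nothing to compare
        keyword_hit = any(k and k in normalized_column for k in normalized_keywords)
        target_hit = normalized_target is not None and (
            normalized_target in normalized_column
            or normalized_column in normalized_target
        )
        if keyword_hit or target_hit:
            suspicious.append(column)
    return suspicious


columns = ["age", "churn_flag", "future_balance", "---", "income"]
with_defaults = find_leaks(columns, target_name="churn")
target_only = find_leaks(columns, target_name="churn", keywords=[])
```

With the assumed keyword list, the first call flags `churn_flag` (target match) and `future_balance` (keyword match). Passing `keywords=[]` leaves only the target match.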