sklearn_utilities.pandas package

class sklearn_utilities.pandas.DataFrameWrapper(estimator: TEstimator, *, pattern_x: str = '^(:?fit|transform|fit_transform)$', pattern_y: str = '^predict.*?$')[source]

Bases: EstimatorWrapperBase[TEstimator], Generic[TEstimator]

estimator: TEstimator

pattern_x: str

y_columns_or_name: Index[Any] | Hashable | None = None
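
The defaults for pattern_x and pattern_y are ordinary regular expressions over method names. The following minimal sketch uses only the standard re module to show which names each default matches. The list of candidate method names is illustrative, and nothing is assumed about how DataFrameWrapper applies the patterns internally.

```python
import re

# Default patterns copied from the DataFrameWrapper signature.
PATTERN_X = r"^(:?fit|transform|fit_transform)$"
PATTERN_Y = r"^predict.*?$"

# Illustrative method names; not an exhaustive list of estimator methods.
method_names = [
    "fit",
    "transform",
    "fit_transform",
    "predict",
    "predict_proba",
    "score",
]

# Keep the names that each pattern matches in full.
matches_x = [name for name in method_names if re.match(PATTERN_X, name)]
matches_y = [name for name in method_names if re.match(PATTERN_Y, name)]

print(matches_x)  # ['fit', 'transform', 'fit_transform']
print(matches_y)  # ['predict', 'predict_proba']
```

Note that (:? in the default opens a capturing group with an optional leading colon, not a non-capturing group (?:. For the method names above the two forms match the same strings.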

class sklearn_utilities.pandas.ExcludedColumnTransformerPandas(estimator: Any = IdTransformer(), exclude_columns: Sequence[str] | Callable[[Sequence[str]], Sequence[bool]] = [])[source]

Bases: BaseEstimator, TransformerMixin

A transformer that excludes columns from the input data frame.

feature_names_in_: Sequence[str]

feature_names_out_: Sequence[str]

fit(X: DataFrame, **fit_params: Any) → Self[source]

fit_transform(X: DataFrame, y: Any = None, **fit_params: Any) → DataFrame[source]

Fit to data, then transform it.

Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.

Parameters:

X (array-like of shape (n_samples, n_features)) – Input samples.
y (array-like of shape (n_samples,) or (n_samples, n_outputs), default=None) – Target values (None for unsupervised transformations).
**fit_params (dict) – Additional fit parameters.

Returns:

X_new – Transformed array.

Return type:

ndarray of shape (n_samples, n_features_new)

transform(X: DataFrame, y: Any = None, **transform_params: Any) → DataFrame[source]
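
The exclude_columns parameter accepts either a sequence of column names or a callable that maps the column names to a sequence of booleans. The sketch below shows one way such a selector could be resolved into a list of names. resolve_columns is a hypothetical helper written for illustration and is not part of sklearn_utilities. Reading True as "selected" is an assumption based on the type hint alone.

```python
from typing import Callable, Sequence, Union

Selector = Union[Sequence[str], Callable[[Sequence[str]], Sequence[bool]]]


def resolve_columns(columns: Sequence[str], selector: Selector) -> list[str]:
    """Hypothetical helper: return the column names picked by ``selector``."""
    if callable(selector):
        mask = selector(columns)
        # Assumption: True marks a selected column.
        return [name for name, picked in zip(columns, mask) if picked]
    # Explicit names: keep the order in which the columns appear in the frame.
    wanted = set(selector)
    return [name for name in columns if name in wanted]


columns = ["id", "age", "income", "id_parent"]

# Select by explicit name.
by_name = resolve_columns(columns, ["id"])
# Select with a callable that returns a boolean mask.
by_mask = resolve_columns(columns, lambda cols: [c.startswith("id") for c in cols])

print(by_name)  # ['id']
print(by_mask)  # ['id', 'id_parent']
```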

class sklearn_utilities.pandas.FeatureUnionPandas(transformer_list, *, n_jobs=None, transformer_weights=None, verbose=False)[source]

Bases: FeatureUnion

fit_transform(X: Any, y: Any = None, **fit_params: Any) → Any[source]

Fit all transformers, transform the data and concatenate results.

Parameters:

X (iterable or array-like, depending on transformers) – Input data to be transformed.
y (array-like of shape (n_samples, n_outputs), default=None) – Targets for supervised learning.
**fit_params (dict, default=None) – Parameters to pass to the fit method of the estimator.

Returns:

X_t – The hstack of results of transformers. sum_n_components is the sum of n_components (output dimension) over transformers.

Return type:

array-like or sparse matrix of shape (n_samples, sum_n_components)

steps: List[Any]

transform(X: Any) → Any[source]

Transform X separately by each transformer, concatenate results.

Parameters:

X (iterable or array-like, depending on transformers) – Input data to be transformed.

Returns:

X_t – The hstack of results of transformers. sum_n_components is the sum of n_components (output dimension) over transformers.

Return type:

array-like or sparse matrix of shape (n_samples, sum_n_components)
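
The return shape (n_samples, sum_n_components) follows from column-wise concatenation: rows stay aligned and the column counts add up. The sketch below uses plain Python lists to illustrate the shape arithmetic. hstack_rows and the two blocks are hypothetical stand-ins for transformer outputs, and this is not how FeatureUnionPandas is implemented.

```python
def hstack_rows(blocks):
    """Concatenate row-aligned 2-D blocks column-wise (a plain-Python hstack)."""
    n_samples = len(blocks[0])
    if any(len(block) != n_samples for block in blocks):
        raise ValueError("all blocks must have the same number of rows")
    # For each sample, append the values of every block in order.
    return [
        [value for block in blocks for value in block[i]]
        for i in range(n_samples)
    ]


# Outputs of two hypothetical transformers for the same 3 samples.
block_a = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]  # n_components = 2
block_b = [[0.1], [0.2], [0.3]]                 # n_components = 1

X_t = hstack_rows([block_a, block_b])
shape = (len(X_t), len(X_t[0]))

print(shape)   # (3, 3)
print(X_t[0])  # [1.0, 2.0, 0.1]
```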

class sklearn_utilities.pandas.IncludedColumnTransformerPandas(estimator: Any = IdTransformer(), include_columns: Sequence[str] | Callable[[Sequence[str]], Sequence[bool]] = [])[source]

Bases: BaseEstimator, TransformerMixin

A transformer that includes columns from the input data frame.

feature_names_in_: Sequence[str]

feature_names_out_: Sequence[str]

fit(X: DataFrame, **fit_params: Any) → Self[source]

fit_transform(X: DataFrame, y: Any = None, **fit_params: Any) → DataFrame[source]

Fit to data, then transform it.

Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.

Parameters:

X (array-like of shape (n_samples, n_features)) – Input samples.
y (array-like of shape (n_samples,) or (n_samples, n_outputs), default=None) – Target values (None for unsupervised transformations).
**fit_params (dict) – Additional fit parameters.

Returns:

X_new – Transformed array.

Return type:

ndarray of shape (n_samples, n_features_new)

transform(X: DataFrame, y: Any = None, **transform_params: Any) → DataFrame[source]

class sklearn_utilities.pandas.SmartMultioutputEstimator(estimator: TEstimator, *, n_jobs: int | None = -1, verbose: int = 1, pass_numpy: bool = False)[source]

Bases: BaseEstimator, RegressorMixin, Generic[TEstimator]

estimator: TEstimator

estimators_: list[TEstimator]

fit(X: DataFrame, y: DataFrame, **fit_params: Any) → Self[source]

predict(X: DataFrame, **predict_params: Any) → DataFrame | Series | NDArray[Any] | tuple[DataFrame | Series | NDArray[Any], ...][source]

predict_var(X: DataFrame, **predict_params: Any) → DataFrame | Series | NDArray[Any] | tuple[DataFrame | Series | NDArray[Any], ...][source]

score(X: DataFrame, y: DataFrame, **score_params: Any) → ndarray[Any, dtype[Any]][source]

Return the coefficient of determination of the prediction.

The coefficient of determination $R^2$ is defined as $1 - \frac{u}{v}$, where $u$ is the residual sum of squares ((y_true - y_pred) ** 2).sum() and $v$ is the total sum of squares ((y_true - y_true.mean()) ** 2).sum(). The best possible score is 1.0, and the score can be negative because a model can be arbitrarily worse than a constant prediction. A constant model that always predicts the expected value of y, disregarding the input features, would get an $R^2$ score of 0.0.
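
The definition can be checked with a few lines of plain Python. This is a minimal single-output sketch of the formula only. The helper name r2 and the sample values are illustrative, and the code does not reproduce the library's score method.

```python
def r2(y_true, y_pred):
    """Coefficient of determination 1 - u / v for one output.

    Raises ZeroDivisionError if y_true is constant (v == 0).
    """
    mean = sum(y_true) / len(y_true)
    u = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))  # residual sum of squares
    v = sum((t - mean) ** 2 for t in y_true)               # total sum of squares
    return 1 - u / v


y_true = [1.0, 2.0, 3.0, 4.0]

perfect = r2(y_true, y_true)                 # exact predictions
constant = r2(y_true, [2.5, 2.5, 2.5, 2.5])  # always predicts the mean of y_true
worse = r2(y_true, [4.0, 3.0, 2.0, 1.0])     # reversed predictions

print(perfect)   # 1.0
print(constant)  # 0.0
print(worse)     # -3.0
```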

Parameters:

X (array-like of shape (n_samples, n_features)) – Test samples. For some estimators this may be a precomputed kernel matrix or a list of generic objects instead with shape (n_samples, n_samples_fitted), where n_samples_fitted is the number of samples used in the fitting for the estimator.
y (array-like of shape (n_samples,) or (n_samples, n_outputs)) – True values for X.
sample_weight (array-like of shape (n_samples,), default=None) – Sample weights.

Returns:

score – $R^2$ of self.predict(X) w.r.t. y.

Return type:

float

Notes

Since scikit-learn version 0.23, calling score on a regressor computes $R^2$ with multioutput='uniform_average', consistent with the default value of r2_score(). This affects the score method of all multioutput regressors (except MultiOutputRegressor).

Submodules

sklearn_utilities.pandas.column_transformer_pandas module

class sklearn_utilities.pandas.column_transformer_pandas.ExcludedColumnTransformerPandas(estimator: Any = IdTransformer(), exclude_columns: Sequence[str] | Callable[[Sequence[str]], Sequence[bool]] = [])[source]

Bases: BaseEstimator, TransformerMixin

A transformer that excludes columns from the input data frame.

feature_names_in_: Sequence[str]

feature_names_out_: Sequence[str]

fit(X: DataFrame, **fit_params: Any) → Self[source]

fit_transform(X: DataFrame, y: Any = None, **fit_params: Any) → DataFrame[source]

Fit to data, then transform it.

Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.

Parameters:

X (array-like of shape (n_samples, n_features)) – Input samples.
y (array-like of shape (n_samples,) or (n_samples, n_outputs), default=None) – Target values (None for unsupervised transformations).
**fit_params (dict) – Additional fit parameters.

Returns:

X_new – Transformed array.

Return type:

ndarray of shape (n_samples, n_features_new)

transform(X: DataFrame, y: Any = None, **transform_params: Any) → DataFrame[source]

class sklearn_utilities.pandas.column_transformer_pandas.IncludedColumnTransformerPandas(estimator: Any = IdTransformer(), include_columns: Sequence[str] | Callable[[Sequence[str]], Sequence[bool]] = [])[source]

Bases: BaseEstimator, TransformerMixin

A transformer that includes columns from the input data frame.

feature_names_in_: Sequence[str]

feature_names_out_: Sequence[str]

fit(X: DataFrame, **fit_params: Any) → Self[source]

fit_transform(X: DataFrame, y: Any = None, **fit_params: Any) → DataFrame[source]

Fit to data, then transform it.

Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.

Parameters:

X (array-like of shape (n_samples, n_features)) – Input samples.
y (array-like of shape (n_samples,) or (n_samples, n_outputs), default=None) – Target values (None for unsupervised transformations).
**fit_params (dict) – Additional fit parameters.

Returns:

X_new – Transformed array.

Return type:

ndarray of shape (n_samples, n_features_new)

transform(X: DataFrame, y: Any = None, **transform_params: Any) → DataFrame[source]

sklearn_utilities.pandas.dataframe_wrapper module

class sklearn_utilities.pandas.dataframe_wrapper.DataFrameWrapper(estimator: TEstimator, *, pattern_x: str = '^(:?fit|transform|fit_transform)$', pattern_y: str = '^predict.*?$')[source]

Bases: EstimatorWrapperBase[TEstimator], Generic[TEstimator]

estimator: TEstimator

pattern_x: str

y_columns_or_name: Index[Any] | Hashable | None = None

sklearn_utilities.pandas.dataframe_wrapper.to_frame_or_series(array: TArray, base_index: Index[Any], base_columns_or_name: Index[Any] | Hashable | None) → DataFrame | Series | TArray[source]

sklearn_utilities.pandas.dataframe_wrapper.to_frame_or_series_tuple(array: tuple[TArray, ...] | TArray, base_index: Index[Any], base_columns_or_name: Index[Any] | Hashable) → tuple[DataFrame | Series | TArray, ...] | DataFrame | Series | TArray[source]

sklearn_utilities.pandas.feature_union_pandas module

class sklearn_utilities.pandas.feature_union_pandas.FeatureUnionPandas(transformer_list, *, n_jobs=None, transformer_weights=None, verbose=False)[source]

Bases: FeatureUnion

fit_transform(X: Any, y: Any = None, **fit_params: Any) → Any[source]

Fit all transformers, transform the data and concatenate results.

Parameters:

X (iterable or array-like, depending on transformers) – Input data to be transformed.
y (array-like of shape (n_samples, n_outputs), default=None) – Targets for supervised learning.
**fit_params (dict, default=None) – Parameters to pass to the fit method of the estimator.

Returns:

X_t – The hstack of results of transformers. sum_n_components is the sum of n_components (output dimension) over transformers.

Return type:

array-like or sparse matrix of shape (n_samples, sum_n_components)

steps: List[Any]

transform(X: Any) → Any[source]

Transform X separately by each transformer, concatenate results.

Parameters:

X (iterable or array-like, depending on transformers) – Input data to be transformed.

Returns:

X_t – The hstack of results of transformers. sum_n_components is the sum of n_components (output dimension) over transformers.

Return type:

array-like or sparse matrix of shape (n_samples, sum_n_components)

sklearn_utilities.pandas.multioutput module

class sklearn_utilities.pandas.multioutput.SmartMultioutputEstimator(estimator: TEstimator, *, n_jobs: int | None = -1, verbose: int = 1, pass_numpy: bool = False)[source]

Bases: BaseEstimator, RegressorMixin, Generic[TEstimator]

estimator: TEstimator

estimators_: list[TEstimator]

fit(X: DataFrame, y: DataFrame, **fit_params: Any) → Self[source]

predict(X: DataFrame, **predict_params: Any) → DataFrame | Series | NDArray[Any] | tuple[DataFrame | Series | NDArray[Any], ...][source]

predict_var(X: DataFrame, **predict_params: Any) → DataFrame | Series | NDArray[Any] | tuple[DataFrame | Series | NDArray[Any], ...][source]

score(X: DataFrame, y: DataFrame, **score_params: Any) → ndarray[Any, dtype[Any]][source]

Return the coefficient of determination of the prediction.

The coefficient of determination $R^2$ is defined as $1 - \frac{u}{v}$, where $u$ is the residual sum of squares ((y_true - y_pred) ** 2).sum() and $v$ is the total sum of squares ((y_true - y_true.mean()) ** 2).sum(). The best possible score is 1.0, and the score can be negative because a model can be arbitrarily worse than a constant prediction. A constant model that always predicts the expected value of y, disregarding the input features, would get an $R^2$ score of 0.0.

Parameters:

X (array-like of shape (n_samples, n_features)) – Test samples. For some estimators this may be a precomputed kernel matrix or a list of generic objects instead with shape (n_samples, n_samples_fitted), where n_samples_fitted is the number of samples used in the fitting for the estimator.
y (array-like of shape (n_samples,) or (n_samples, n_outputs)) – True values for X.
sample_weight (array-like of shape (n_samples,), default=None) – Sample weights.

Returns:

score – $R^2$ of self.predict(X) w.r.t. y.

Return type:

float

Notes

Since scikit-learn version 0.23, calling score on a regressor computes $R^2$ with multioutput='uniform_average', consistent with the default value of r2_score(). This affects the score method of all multioutput regressors (except MultiOutputRegressor).
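
With multioutput='uniform_average', $R^2$ is computed separately for each output column and the per-output scores are averaged with equal weights. The plain-Python sketch below illustrates that averaging. The helper names and data are illustrative, and no assumption is made about whether SmartMultioutputEstimator.score returns the per-output values or their average.

```python
def r2(y_true, y_pred):
    """Single-output coefficient of determination 1 - u / v."""
    mean = sum(y_true) / len(y_true)
    u = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    v = sum((t - mean) ** 2 for t in y_true)
    return 1 - u / v


def r2_uniform_average(Y_true, Y_pred):
    """Return (per-output R^2 list, equally weighted mean of that list)."""
    n_outputs = len(Y_true[0])
    scores = []
    for j in range(n_outputs):
        # Extract column j from the row-major nested lists.
        col_true = [row[j] for row in Y_true]
        col_pred = [row[j] for row in Y_pred]
        scores.append(r2(col_true, col_pred))
    return scores, sum(scores) / n_outputs


# Two outputs: the first is predicted exactly, the second with its mean.
Y_true = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0], [4.0, 40.0]]
Y_pred = [[1.0, 25.0], [2.0, 25.0], [3.0, 25.0], [4.0, 25.0]]

per_output, average = r2_uniform_average(Y_true, Y_pred)

print(per_output)  # [1.0, 0.0]
print(average)     # 0.5
```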