sklearn_utilities.pandas package

class sklearn_utilities.pandas.DataFrameWrapper(estimator: TEstimator, *, pattern_x: str = '^(:?fit|transform|fit_transform)$', pattern_y: str = '^predict.*?$')

Bases: EstimatorWrapperBase[TEstimator], Generic[TEstimator]

estimator: TEstimator
pattern_x: str
y_columns_or_name: Index[Any] | Hashable | None = None
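
The signature suggests that pattern_x and pattern_y are regular expressions matched against method names. This page does not say how the wrapper applies them, so the sketch below only checks which names the default patterns accept. It assumes matching with re.search, and the list of method names is hypothetical.

```python
import re

# Default patterns copied verbatim from the DataFrameWrapper signature.
PATTERN_X = r"^(:?fit|transform|fit_transform)$"
PATTERN_Y = r"^predict.*?$"

# Hypothetical method names, used only to probe the patterns.
method_names = [
    "fit", "transform", "fit_transform",
    "predict", "predict_proba", "predict_var", "score",
]


def matching(pattern: str, names: list[str]) -> list[str]:
    """Return the names accepted by the pattern, in their original order."""
    compiled = re.compile(pattern)
    return [name for name in names if compiled.search(name)]


x_methods = matching(PATTERN_X, method_names)
y_methods = matching(PATTERN_Y, method_names)
print(x_methods)  # ['fit', 'transform', 'fit_transform']
print(y_methods)  # ['predict', 'predict_proba', 'predict_var']
```

In the default pattern_x, `(:?` is an ordinary capturing group with an optional leading colon, so it also accepts ':fit'. For plain method names it behaves the same as the non-capturing form `(?:...)`.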

class sklearn_utilities.pandas.ExcludedColumnTransformerPandas(estimator: Any = IdTransformer(), exclude_columns: Sequence[str] | Callable[[Sequence[str]], Sequence[bool]] = [])

Bases: BaseEstimator, TransformerMixin

A transformer that excludes columns from the input data frame.

feature_names_in_: Sequence[str]
feature_names_out_: Sequence[str]
fit(X: DataFrame, **fit_params: Any) → Self
fit_transform(X: DataFrame, y: Any = None, **fit_params: Any) → DataFrame

Fit to data, then transform it.

Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.

Parameters:
  • X (array-like of shape (n_samples, n_features)) – Input samples.

  • y (array-like of shape (n_samples,) or (n_samples, n_outputs), default=None) – Target values (None for unsupervised transformations).

  • **fit_params (dict) – Additional fit parameters.

Returns:

X_new – Transformed array.

Return type:

ndarray of shape (n_samples, n_features_new)

transform(X: DataFrame, y: Any = None, **transform_params: Any) → DataFrame
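
The type of exclude_columns allows either a sequence of column names or a callable that maps the column names to a boolean mask. The pure-Python sketch below shows one way such a specification could be resolved. The helper split_columns is hypothetical, and it assumes that True in the mask marks a column as excluded; the library's actual behaviour is not described on this page.

```python
from typing import Callable, Sequence, Union

ColumnSpec = Union[Sequence[str], Callable[[Sequence[str]], Sequence[bool]]]


def split_columns(
    columns: Sequence[str], exclude_columns: ColumnSpec
) -> tuple[list[str], list[str]]:
    """Hypothetical helper: return (kept, excluded) column names."""
    if callable(exclude_columns):
        # Assumption: the callable returns one flag per column, True = excluded.
        mask = list(exclude_columns(columns))
        if len(mask) != len(columns):
            raise ValueError("mask length must equal the number of columns")
    else:
        excluded_set = set(exclude_columns)
        mask = [name in excluded_set for name in columns]
    kept = [name for name, flag in zip(columns, mask) if not flag]
    excluded = [name for name, flag in zip(columns, mask) if flag]
    return kept, excluded


columns = ["age", "height", "id", "id_parent"]
by_name = split_columns(columns, ["id"])
by_mask = split_columns(columns, lambda names: [n.startswith("id") for n in names])
print(by_name)  # (['age', 'height', 'id_parent'], ['id'])
print(by_mask)  # (['age', 'height'], ['id', 'id_parent'])
```

A callable is convenient when the excluded columns follow a naming rule and are not known in advance.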

class sklearn_utilities.pandas.FeatureUnionPandas(transformer_list, *, n_jobs=None, transformer_weights=None, verbose=False)

Bases: FeatureUnion

fit_transform(X: Any, y: Any = None, **fit_params: Any) → Any

Fit all transformers, transform the data and concatenate results.

Parameters:
  • X (iterable or array-like, depending on transformers) – Input data to be transformed.

  • y (array-like of shape (n_samples, n_outputs), default=None) – Targets for supervised learning.

  • **fit_params (dict, default=None) – Parameters to pass to the fit method of the estimator.

Returns:

X_t – The hstack of results of transformers. sum_n_components is the sum of n_components (output dimension) over transformers.

Return type:

array-like or sparse matrix of shape (n_samples, sum_n_components)

steps: List[Any]
transform(X: Any) → Any

Transform X separately by each transformer, concatenate results.

Parameters:

X (iterable or array-like, depending on transformers) – Input data to be transformed.

Returns:

X_t – The hstack of results of transformers. sum_n_components is the sum of n_components (output dimension) over transformers.

Return type:

array-like or sparse matrix of shape (n_samples, sum_n_components)
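
The "hstack of results" and the resulting shape (n_samples, sum_n_components) can be illustrated with a small pure-Python sketch. It uses nested lists in place of arrays or data frames, and the two "transformers" are hypothetical plain functions, so it shows only the concatenation step and not FeatureUnionPandas itself.

```python
from typing import List

Matrix = List[List[float]]


def hstack(blocks: List[Matrix]) -> Matrix:
    """Concatenate matrices column-wise; every block must have n_samples rows."""
    n_samples = len(blocks[0])
    if any(len(block) != n_samples for block in blocks):
        raise ValueError("all blocks need the same number of rows")
    return [
        [value for block in blocks for value in block[i]]
        for i in range(n_samples)
    ]


# Two hypothetical transformers written as plain functions.
def squares(X: Matrix) -> Matrix:
    return [[x * x for x in row] for row in X]  # n_components = 2 here


def row_sum(X: Matrix) -> Matrix:
    return [[sum(row)] for row in X]  # n_components = 1


X = [[1.0, 2.0], [3.0, 4.0]]
X_t = hstack([transform(X) for transform in (squares, row_sum)])
print(X_t)  # [[1.0, 4.0, 3.0], [9.0, 16.0, 7.0]]
```

Here n_samples is 2 and sum_n_components is 2 + 1 = 3, so X_t has shape (2, 3).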

class sklearn_utilities.pandas.IncludedColumnTransformerPandas(estimator: Any = IdTransformer(), include_columns: Sequence[str] | Callable[[Sequence[str]], Sequence[bool]] = [])

Bases: BaseEstimator, TransformerMixin

A transformer that includes columns from the input data frame.

feature_names_in_: Sequence[str]
feature_names_out_: Sequence[str]
fit(X: DataFrame, **fit_params: Any) → Self
fit_transform(X: DataFrame, y: Any = None, **fit_params: Any) → DataFrame

Fit to data, then transform it.

Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.

Parameters:
  • X (array-like of shape (n_samples, n_features)) – Input samples.

  • y (array-like of shape (n_samples,) or (n_samples, n_outputs), default=None) – Target values (None for unsupervised transformations).

  • **fit_params (dict) – Additional fit parameters.

Returns:

X_new – Transformed array.

Return type:

ndarray of shape (n_samples, n_features_new)

transform(X: DataFrame, y: Any = None, **transform_params: Any) → DataFrame

class sklearn_utilities.pandas.SmartMultioutputEstimator(estimator: TEstimator, *, n_jobs: int | None = -1, verbose: int = 1, pass_numpy: bool = False)

Bases: BaseEstimator, RegressorMixin, Generic[TEstimator]

estimator: TEstimator
estimators_: list[TEstimator]
fit(X: DataFrame, y: DataFrame, **fit_params: Any) → Self
predict(X: DataFrame, **predict_params: Any) → DataFrame | Series | NDArray[Any] | tuple[DataFrame | Series | NDArray[Any], ...]
predict_var(X: DataFrame, **predict_params: Any) → DataFrame | Series | NDArray[Any] | tuple[DataFrame | Series | NDArray[Any], ...]
score(X: DataFrame, y: DataFrame, **score_params: Any) → ndarray[Any, dtype[Any]]

Return the coefficient of determination of the prediction.

The coefficient of determination \(R^2\) is defined as \((1 - \frac{u}{v})\), where \(u\) is the residual sum of squares ((y_true - y_pred) ** 2).sum() and \(v\) is the total sum of squares ((y_true - y_true.mean()) ** 2).sum(). The best possible score is 1.0, and the score can be negative because a model can be arbitrarily worse than a constant predictor. A constant model that always predicts the expected value of y, disregarding the input features, would get an \(R^2\) score of 0.0.

Parameters:
  • X (array-like of shape (n_samples, n_features)) – Test samples. For some estimators this may be a precomputed kernel matrix or a list of generic objects instead with shape (n_samples, n_samples_fitted), where n_samples_fitted is the number of samples used in the fitting for the estimator.

  • y (array-like of shape (n_samples,) or (n_samples, n_outputs)) – True values for X.

  • sample_weight (array-like of shape (n_samples,), default=None) – Sample weights.

Returns:

score – \(R^2\) of self.predict(X) w.r.t. y.

Return type:

float

Notes

When score is called on a regressor, the \(R^2\) score uses multioutput='uniform_average' as of scikit-learn version 0.23, which keeps it consistent with the default value of r2_score(). This affects the score method of all multioutput regressors except MultiOutputRegressor.
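
The definition of \(R^2\) and the 'uniform_average' rule can be checked by hand with a short pure-Python sketch. It follows the formula in the docstring above and is not the implementation of SmartMultioutputEstimator.score, whose signature declares an ndarray return. The function names are hypothetical, and y_true is assumed not to be constant, so that \(v\) is nonzero.

```python
from typing import Sequence


def r2(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """R^2 = 1 - u / v for a single output (assumes v != 0)."""
    mean = sum(y_true) / len(y_true)
    u = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))  # residual sum of squares
    v = sum((t - mean) ** 2 for t in y_true)               # total sum of squares
    return 1.0 - u / v


def r2_uniform_average(
    Y_true: Sequence[Sequence[float]], Y_pred: Sequence[Sequence[float]]
) -> float:
    """Unweighted mean of per-output R^2; each argument is a list of output columns."""
    scores = [r2(t, p) for t, p in zip(Y_true, Y_pred)]
    return sum(scores) / len(scores)


perfect = r2([1.0, 2.0, 3.0], [1.0, 2.0, 3.0])
constant = r2([1.0, 2.0, 3.0], [2.0, 2.0, 2.0])  # always predicts the mean
worse = r2([1.0, 2.0, 3.0], [3.0, 2.0, 1.0])
averaged = r2_uniform_average(
    [[1.0, 2.0, 3.0], [1.0, 2.0, 3.0]],
    [[1.0, 2.0, 3.0], [2.0, 2.0, 2.0]],
)
print(perfect, constant, worse, averaged)  # 1.0 0.0 -3.0 0.5
```

The three single-output cases reproduce the statements in the docstring: a perfect fit scores 1.0, the mean predictor scores 0.0, and a sufficiently bad model scores below zero.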

Submodules

sklearn_utilities.pandas.column_transformer_pandas module

class sklearn_utilities.pandas.column_transformer_pandas.ExcludedColumnTransformerPandas(estimator: Any = IdTransformer(), exclude_columns: Sequence[str] | Callable[[Sequence[str]], Sequence[bool]] = [])

Bases: BaseEstimator, TransformerMixin

A transformer that excludes columns from the input data frame.

feature_names_in_: Sequence[str]
feature_names_out_: Sequence[str]
fit(X: DataFrame, **fit_params: Any) → Self
fit_transform(X: DataFrame, y: Any = None, **fit_params: Any) → DataFrame

Fit to data, then transform it.

Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.

Parameters:
  • X (array-like of shape (n_samples, n_features)) – Input samples.

  • y (array-like of shape (n_samples,) or (n_samples, n_outputs), default=None) – Target values (None for unsupervised transformations).

  • **fit_params (dict) – Additional fit parameters.

Returns:

X_new – Transformed array.

Return type:

ndarray of shape (n_samples, n_features_new)

transform(X: DataFrame, y: Any = None, **transform_params: Any) → DataFrame

class sklearn_utilities.pandas.column_transformer_pandas.IncludedColumnTransformerPandas(estimator: Any = IdTransformer(), include_columns: Sequence[str] | Callable[[Sequence[str]], Sequence[bool]] = [])

Bases: BaseEstimator, TransformerMixin

A transformer that includes columns from the input data frame.

feature_names_in_: Sequence[str]
feature_names_out_: Sequence[str]
fit(X: DataFrame, **fit_params: Any) → Self
fit_transform(X: DataFrame, y: Any = None, **fit_params: Any) → DataFrame

Fit to data, then transform it.

Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.

Parameters:
  • X (array-like of shape (n_samples, n_features)) – Input samples.

  • y (array-like of shape (n_samples,) or (n_samples, n_outputs), default=None) – Target values (None for unsupervised transformations).

  • **fit_params (dict) – Additional fit parameters.

Returns:

X_new – Transformed array.

Return type:

ndarray of shape (n_samples, n_features_new)

transform(X: DataFrame, y: Any = None, **transform_params: Any) → DataFrame

sklearn_utilities.pandas.dataframe_wrapper module

class sklearn_utilities.pandas.dataframe_wrapper.DataFrameWrapper(estimator: TEstimator, *, pattern_x: str = '^(:?fit|transform|fit_transform)$', pattern_y: str = '^predict.*?$')

Bases: EstimatorWrapperBase[TEstimator], Generic[TEstimator]

estimator: TEstimator
pattern_x: str
y_columns_or_name: Index[Any] | Hashable | None = None
sklearn_utilities.pandas.dataframe_wrapper.to_frame_or_series(array: TArray, base_index: Index[Any], base_columns_or_name: Index[Any] | Hashable | None) → DataFrame | Series | TArray
sklearn_utilities.pandas.dataframe_wrapper.to_frame_or_series_tuple(array: tuple[TArray, ...] | TArray, base_index: Index[Any], base_columns_or_name: Index[Any] | Hashable) → tuple[DataFrame | Series | TArray, ...] | DataFrame | Series | TArray

sklearn_utilities.pandas.feature_union_pandas module

class sklearn_utilities.pandas.feature_union_pandas.FeatureUnionPandas(transformer_list, *, n_jobs=None, transformer_weights=None, verbose=False)

Bases: FeatureUnion

fit_transform(X: Any, y: Any = None, **fit_params: Any) → Any

Fit all transformers, transform the data and concatenate results.

Parameters:
  • X (iterable or array-like, depending on transformers) – Input data to be transformed.

  • y (array-like of shape (n_samples, n_outputs), default=None) – Targets for supervised learning.

  • **fit_params (dict, default=None) – Parameters to pass to the fit method of the estimator.

Returns:

X_t – The hstack of results of transformers. sum_n_components is the sum of n_components (output dimension) over transformers.

Return type:

array-like or sparse matrix of shape (n_samples, sum_n_components)

steps: List[Any]
transform(X: Any) → Any

Transform X separately by each transformer, concatenate results.

Parameters:

X (iterable or array-like, depending on transformers) – Input data to be transformed.

Returns:

X_t – The hstack of results of transformers. sum_n_components is the sum of n_components (output dimension) over transformers.

Return type:

array-like or sparse matrix of shape (n_samples, sum_n_components)

sklearn_utilities.pandas.multioutput module

class sklearn_utilities.pandas.multioutput.SmartMultioutputEstimator(estimator: TEstimator, *, n_jobs: int | None = -1, verbose: int = 1, pass_numpy: bool = False)

Bases: BaseEstimator, RegressorMixin, Generic[TEstimator]

estimator: TEstimator
estimators_: list[TEstimator]
fit(X: DataFrame, y: DataFrame, **fit_params: Any) → Self
predict(X: DataFrame, **predict_params: Any) → DataFrame | Series | NDArray[Any] | tuple[DataFrame | Series | NDArray[Any], ...]
predict_var(X: DataFrame, **predict_params: Any) → DataFrame | Series | NDArray[Any] | tuple[DataFrame | Series | NDArray[Any], ...]
score(X: DataFrame, y: DataFrame, **score_params: Any) → ndarray[Any, dtype[Any]]

Return the coefficient of determination of the prediction.

The coefficient of determination \(R^2\) is defined as \((1 - \frac{u}{v})\), where \(u\) is the residual sum of squares ((y_true - y_pred) ** 2).sum() and \(v\) is the total sum of squares ((y_true - y_true.mean()) ** 2).sum(). The best possible score is 1.0, and the score can be negative because a model can be arbitrarily worse than a constant predictor. A constant model that always predicts the expected value of y, disregarding the input features, would get an \(R^2\) score of 0.0.

Parameters:
  • X (array-like of shape (n_samples, n_features)) – Test samples. For some estimators this may be a precomputed kernel matrix or a list of generic objects instead with shape (n_samples, n_samples_fitted), where n_samples_fitted is the number of samples used in the fitting for the estimator.

  • y (array-like of shape (n_samples,) or (n_samples, n_outputs)) – True values for X.

  • sample_weight (array-like of shape (n_samples,), default=None) – Sample weights.

Returns:

score – \(R^2\) of self.predict(X) w.r.t. y.

Return type:

float

Notes

When score is called on a regressor, the \(R^2\) score uses multioutput='uniform_average' as of scikit-learn version 0.23, which keeps it consistent with the default value of r2_score(). This affects the score method of all multioutput regressors except MultiOutputRegressor.