sparkkgml.feature_selection

Functions

`correlation_feature_selection`(df, threshold, features_arr)	Perform correlation-based feature selection on a DataFrame.
`sequential_feature_selection`(df, ml_model, ...[, ...])	Performs sequential feature selection based on cross-validation accuracy.

Module Contents

sparkkgml.feature_selection.correlation_feature_selection(df, threshold, features_arr)

Perform correlation-based feature selection on a DataFrame.

Parameters:

df (pyspark.sql.DataFrame) – The input DataFrame.
threshold (float) – The correlation threshold.
features_arr (list[str]) – A list of feature column names.

Returns:

The resulting DataFrame with selected non-correlated features.

Return type:

pyspark.sql.DataFrame

Notes

This function selects features based on their pairwise correlation. It follows these steps:

1. Assemble the specified feature columns into a vector column.
2. Compute the correlation matrix of the assembled features.
3. Identify the features whose absolute correlation with all other features is low, as judged against threshold.
4. Select the identified non-correlated features.
5. Return the input DataFrame with only the selected non-correlated feature columns.
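
The selection rule in the steps above can be sketched in plain Python, without Spark. This is only an illustration of the idea, not the library's implementation: the helper names pearson and select_uncorrelated are hypothetical, the data is a dict of column lists instead of a DataFrame, and the strict "<" comparison against the threshold is an assumption.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def select_uncorrelated(columns, threshold):
    """Keep the names whose |correlation| with every other column is low.

    Assumption: "low" means strictly below `threshold`.
    """
    names = list(columns)
    selected = []
    for name in names:
        others = [other for other in names if other != name]
        if all(abs(pearson(columns[name], columns[o])) < threshold for o in others):
            selected.append(name)
    return selected


# Toy data: "b" is an exact multiple of "a"; "c" is uncorrelated with both.
columns = {
    "a": [1.0, 2.0, 3.0, 4.0],
    "b": [2.0, 4.0, 6.0, 8.0],
    "c": [1.0, -1.0, -1.0, 1.0],
}
selected = select_uncorrelated(columns, threshold=0.5)
print(selected)
```

Under this literal reading of step 3, both members of a highly correlated pair ("a" and "b" here) are dropped, because each one is strongly correlated with at least one other feature.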

sparkkgml.feature_selection.sequential_feature_selection(df, ml_model, cross_validator, evaluator, label, feature_cols, threshold=0.1, descending=True)

Performs sequential feature selection based on cross-validation accuracy.

Parameters:

df (pyspark.sql.DataFrame) – The input DataFrame containing the features and label.
ml_model – The machine learning model to be used for training.
cross_validator – The cross-validator for hyperparameter tuning.
evaluator – The evaluator for model performance assessment.
label (str) – The name of the label column.
feature_cols (list) – List of feature column names.
threshold (float, optional) – The threshold for improvement in accuracy to add a feature. Default is 0.1.
descending (bool, optional) – If True, features are sorted in descending order of accuracy. Default is True.

Returns:

A list of selected feature column names.

Return type:

list

Explanation: The features are first sorted by their accuracy, in descending or ascending order according to descending. Selection starts with the highest-accuracy feature as the initial feature. The function then walks the sorted list and adds one candidate feature at a time. For each candidate, it evaluates the cross-validation performance and checks whether including the feature improves the performance by the specified threshold. If it does, the feature is kept in the selected set; otherwise, it is excluded. The final output is the list of selected feature column names, that is, the subset this greedy procedure found to improve cross-validation performance.
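
The procedure described in the explanation can be sketched in plain Python as a greedy loop. This is a minimal sketch under stated assumptions, not the library's code: sequential_select and score_fn are hypothetical names, score_fn stands in for "fit the model with cross-validation and evaluate it", and a feature is assumed to be kept when the gain is at least the threshold.

```python
def sequential_select(feature_cols, score_fn, threshold=0.1, descending=True):
    """Greedy forward selection driven by a scoring callable.

    score_fn(list_of_features) -> float is a hypothetical stand-in for
    cross-validated training plus evaluation.
    """
    # Score each feature on its own and sort the candidates by that score.
    single = {f: score_fn([f]) for f in feature_cols}
    ranked = sorted(feature_cols, key=single.get, reverse=descending)

    # Start from the feature with the highest individual score.
    first = max(ranked, key=single.get)
    selected = [first]
    best = single[first]

    # Walk the sorted list and keep a feature only if it helps enough.
    for feature in ranked:
        if feature == first:
            continue
        candidate = selected + [feature]
        score = score_fn(candidate)
        # Assumption: "improves by the threshold" means gain >= threshold.
        if score - best >= threshold:
            selected, best = candidate, score
    return selected


# Toy scorer: each feature contributes a fixed, made-up amount of accuracy.
gains = {"age": 0.5, "income": 0.3, "zip": 0.02}


def toy_score(features):
    return sum(gains[f] for f in features)


chosen = sequential_select(["zip", "income", "age"], toy_score)
print(chosen)
```

With the default threshold of 0.1, the toy "zip" feature is rejected because it adds only about 0.02 to the score; lowering the threshold to 0.01 would admit it.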