sparkkgml.feature_engineering

Classes

FeatureEngineering

A class for extracting features and their descriptions from a Spark DataFrame.

Module Contents

class sparkkgml.feature_engineering.FeatureEngineering

A class for extracting features and their descriptions from a Spark DataFrame.

_entityColumn

The name of the entity column. Defaults to an empty string ('').

Type:

str

get_entityColumn()

Getter method for the entity column.

set_entityColumn(entityColumn)

Setter method for the entity column.

getFeatures(df)

Extracts features and their descriptions from a DataFrame.

Parameters:

df (pyspark.sql.DataFrame) – The input DataFrame.

Returns:

A tuple containing the collapsed DataFrame and a dictionary of feature descriptions.

Return type:

tuple

Notes

This method analyzes each column in the input DataFrame and extracts features from it. The resulting features are stored in a collapsed DataFrame in which each row represents a unique entity. A dictionary of feature descriptions records each feature’s properties.
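To make the idea of a collapsed table concrete, the following is a minimal plain-Python sketch, not the library's Spark implementation. It assumes that "collapsing" means grouping rows by the entity column and collecting the distinct non-null values of every other column. The helper name `collapse_by_entity` and the sample data are hypothetical.

```python
from collections import defaultdict
from typing import Any, Dict, List


def collapse_by_entity(
    rows: List[Dict[str, Any]], entity_column: str
) -> Dict[Any, Dict[str, List[Any]]]:
    """Group rows by entity and collect each column's distinct non-null values.

    Hypothetical helper for illustration only; sparkkgml works on Spark DataFrames.
    """
    collapsed: Dict[Any, Dict[str, List[Any]]] = defaultdict(lambda: defaultdict(list))
    for row in rows:
        # Touch the entity first so it appears even if all its values are null.
        features = collapsed[row[entity_column]]
        for column, value in row.items():
            if column == entity_column or value is None:
                continue
            if value not in features[column]:
                features[column].append(value)
    # Convert the nested defaultdicts to plain dicts for a stable result.
    return {entity: dict(features) for entity, features in collapsed.items()}


# Hypothetical long-format input: one row per (entity, value combination).
sample_rows = [
    {"movie": "m1", "genre": "Drama", "year": 1999},
    {"movie": "m1", "genre": "Crime", "year": 1999},
    {"movie": "m2", "genre": "Comedy", "year": None},
]
collapsed_rows = collapse_by_entity(sample_rows, "movie")
```

In this sketch every entity ends up with exactly one entry, and a column that holds several values for the same entity becomes a list with more than one element.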

Feature Descriptions:

  • featureType (str): The type of the feature, combining information about whether it is a list or a single value, whether it is categorical or non-categorical, and the data type.

  • name (str): The name of the feature column.

  • nullable (bool): A flag indicating whether the feature can have null values. A feature is marked nullable if at least one of its values is null.

  • datatype (pyspark.sql.types.DataType): The data type of the feature column.

  • numberDistinctValues (int): The number of distinct values in the feature column.

  • isListOfEntries (bool): A flag indicating whether the feature is a list of entries. A feature is considered a list if at least one row holds more than one entry for it.

  • isCategorical (bool): A flag indicating whether the feature is categorical. A feature is considered categorical if the ratio of its number of distinct values to the total number of entities is less than 0.1.
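The rules above can be sketched in plain Python as follows. This is an illustrative approximation under stated assumptions, not the library's code. The function name `describe_feature`, the `(entity, value)` input format, the use of Python type names in place of Spark data types, and the underscore-joined `featureType` string are all assumptions made for the example.

```python
from typing import Any, Dict, List, Optional, Tuple


def describe_feature(
    name: str,
    rows: List[Tuple[str, Optional[Any]]],
    categorical_threshold: float = 0.1,
) -> Dict[str, Any]:
    """Describe one feature from (entity, value) pairs; None marks a null value."""
    # Collect the non-null entries of each entity.
    values_per_entity: Dict[str, List[Any]] = {}
    for entity, value in rows:
        entries = values_per_entity.setdefault(entity, [])
        if value is not None:
            entries.append(value)

    non_null = [value for _, value in rows if value is not None]
    # Nullable: at least one value is null.
    nullable = len(non_null) < len(rows)
    # List of entries: some entity has more than one entry.
    is_list = any(len(entries) > 1 for entries in values_per_entity.values())
    number_distinct = len(set(non_null))
    number_entities = len(values_per_entity)
    # Categorical: distinct values / entities is below the threshold (0.1).
    is_categorical = (
        number_entities > 0
        and number_distinct / number_entities < categorical_threshold
    )
    # Assumption: a Python type name stands in for the Spark data type.
    datatype = type(non_null[0]).__name__ if non_null else "unknown"
    # Assumption: the exact featureType string format is hypothetical.
    feature_type = "_".join([
        "ListOf" if is_list else "Single",
        "Categorical" if is_categorical else "NonCategorical",
        datatype,
    ])
    return {
        "featureType": feature_type,
        "name": name,
        "nullable": nullable,
        "datatype": datatype,
        "numberDistinctValues": number_distinct,
        "isListOfEntries": is_list,
        "isCategorical": is_categorical,
    }


# Hypothetical data: 30 entities sharing 2 country codes (ratio 2/30 < 0.1).
country_rows = [(f"person{i}", "DE" if i % 2 else "FR") for i in range(30)]
# One entity with two nicknames and one entity with a null value.
nickname_rows = [("person0", "Al"), ("person0", "Ally"), ("person1", None)]

country = describe_feature("country", country_rows)
nickname = describe_feature("nickname", nickname_rows)
```

Here `country` comes out as a single-valued, categorical, non-nullable feature, while `nickname` is a nullable list feature that is not categorical, because its two distinct values are spread over only two entities.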