sparkkgml.feature_engineering
Classes
FeatureEngineering – A class for extracting features and their descriptions from a Spark DataFrame.
Module Contents
- class sparkkgml.feature_engineering.FeatureEngineering
A class for extracting features and their descriptions from a Spark DataFrame.
- _entityColumn
The name of the entity column.
- Type:
str
- _entityColumn = ''
- getFeatures(df)
Extracts features and their descriptions from a DataFrame.
- Parameters:
df (pyspark.sql.DataFrame) – The input DataFrame.
- Returns:
A tuple containing the collapsed DataFrame and a dictionary of feature descriptions.
- Return type:
tuple
Notes
This method analyzes each column of the input DataFrame and derives a feature from it. The features are returned in a collapsed DataFrame in which each row represents one unique entity, and the accompanying dictionary describes the properties of each feature.
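To illustrate what "collapsed" means here, the following plain-Python sketch groups long-format rows by entity so that each entity ends up in exactly one record. It uses neither Spark nor sparkkgml: the `collapse` helper, the sample column names, and the choice to gather repeated values into a list are assumptions made for illustration, not the library's implementation.

```python
from collections import defaultdict


def collapse(rows, entity_key):
    """Group long-format rows (dicts) into one record per entity.

    Hypothetical helper: every non-entity column becomes a list holding
    the distinct non-null values seen for that entity, in first-seen order.
    """
    grouped = defaultdict(lambda: defaultdict(list))
    for row in rows:
        entity = row[entity_key]
        for column, value in row.items():
            if column == entity_key:
                continue
            # Looking up the column creates an empty list even for nulls,
            # so every entity keeps an entry for every column it was seen with.
            values = grouped[entity][column]
            if value is not None and value not in values:
                values.append(value)
    return {entity: dict(cols) for entity, cols in grouped.items()}


# Three input rows describing two entities (made-up sample data).
rows = [
    {"entity": "e1", "genre": "drama", "year": 1999},
    {"entity": "e1", "genre": "crime", "year": 1999},
    {"entity": "e2", "genre": "drama", "year": None},
]
collapsed = collapse(rows, "entity")
```

In this sketch, `e1` ends up with two `genre` entries in a single record, which is the situation the isListOfEntries flag below is meant to capture.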
Feature Descriptions:
featureType (str): The type of the feature, combining information about whether it is a list or a single value, whether it is categorical or non-categorical, and the data type.
name (str): The name of the feature column.
nullable (bool): Whether the feature can contain null values. A feature is marked nullable if it has at least one null value.
datatype (pyspark.sql.types.DataType): The data type of the feature column.
numberDistinctValues (int): The number of distinct values in the feature column.
isListOfEntries (bool): Whether the feature is a list of entries. A feature is treated as a list if at least one row holds more than one entry.
isCategorical (bool): Whether the feature is categorical. A feature is treated as categorical if the ratio of distinct values to the total number of entities is less than 0.1.
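The three rules above (nullable, isListOfEntries, isCategorical) can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the library's code: `describe_feature` is a hypothetical helper, the input is assumed to be one list of values per entity with `None` marking a null, and distinct values are counted over non-null entries only. The featureType and datatype fields are left out because their exact encoding is not specified here.

```python
def describe_feature(name, values_per_entity, threshold=0.1):
    """Apply the documented rules to one feature.

    values_per_entity: one list of values per entity; None marks a null.
    threshold: the 0.1 distinct-to-entities ratio from the description.
    """
    n_entities = len(values_per_entity)
    flat = [v for values in values_per_entity for v in values]
    # Assumption: nulls do not count as a distinct value.
    distinct = len({v for v in flat if v is not None})
    return {
        "name": name,
        # Nullable: at least one null value is present.
        "nullable": any(v is None for v in flat),
        "numberDistinctValues": distinct,
        # List: at least one entity has more than one entry.
        "isListOfEntries": any(len(values) > 1 for values in values_per_entity),
        # Categorical: distinct values / entities is below the threshold.
        "isCategorical": n_entities > 0 and distinct / n_entities < threshold,
    }


# 30 entities whose genre alternates between two values (made-up data).
genres = [["drama"] if i % 2 == 0 else ["crime"] for i in range(30)]
genres[0] = ["drama", "crime"]  # one entity with two entries
genres[1] = [None]              # one entity with a null
genre_description = describe_feature("genre", genres)

# 30 entities that each carry a unique identifier.
ids = [[i] for i in range(30)]
id_description = describe_feature("id", ids)
```

With this sample data, `genre` has 2 distinct values across 30 entities (ratio below 0.1), so the sketch flags it as categorical, nullable, and a list of entries, while `id` has a ratio of 1.0 and is flagged as none of the three.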