sparkkgml.vectorization
Classes
Vectorization: A class designed for feature engineering and vectorization of data in Spark DataFrames.
Module Contents
- class sparkkgml.vectorization.Vectorization
A class designed for feature engineering and vectorization of data in Spark DataFrames.
- _entityColumn
The name of the entity column used for joining and indexing.
- Type:
str
- _stopWordsRemover
A flag indicating whether to remove stop words during text processing.
- Type:
bool
- _word2vecSize
The size of the word vectors in Word2Vec embedding.
- Type:
int
- _word2vecMinCount
The minimum count of words required for Word2Vec embedding.
- Type:
int
- _digitStringStrategy
The strategy for digitizing string values (‘index’ or ‘hash’).
- Type:
str
- vectorize(df2, features)
Applies vectorization transformations to specified columns in the DataFrame based on the provided features.
Note
The vectorize method iterates over the specified columns, applies appropriate transformations based on data type and features, and returns a DataFrame with the vectorized features.
- _entityColumn = ''
- _stopWordsRemover = True
- _word2vecSize = 2
- _word2vecMinCount = 1
- _digitStringStrategy = 'index'
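The two values accepted by _digitStringStrategy can be illustrated without Spark. The following is a minimal sketch, not the library's implementation: index_strings mimics the frequency-ordered indexing that Spark's StringIndexer applies by default, and hash_strings uses SHA-256 with a hypothetical bucket count purely for illustration. Whether the class relies on these exact mechanisms is an assumption.

```python
import hashlib
from collections import Counter


def index_strings(values):
    """Map each distinct string to an integer index, most frequent first.

    Ties are broken alphabetically so the mapping is deterministic.
    """
    counts = Counter(values)
    ordered = sorted(counts, key=lambda s: (-counts[s], s))
    mapping = {s: i for i, s in enumerate(ordered)}
    return [mapping[v] for v in values]


def hash_strings(values, num_buckets=1024):
    """Map each string to a bucket with a stable hash.

    SHA-256 and num_buckets are illustrative assumptions, not library settings.
    """
    def bucket(s):
        digest = hashlib.sha256(s.encode("utf-8")).hexdigest()
        return int(digest, 16) % num_buckets

    return [bucket(v) for v in values]


colors = ["red", "blue", "red", "green", "red", "blue"]
indexed = index_strings(colors)   # needs a pass over the data to build the mapping
hashed = hash_strings(colors)     # needs no fitted mapping, but buckets can collide
```

Indexing yields compact, collision-free codes but depends on the values seen when the mapping is built. Hashing needs no mapping at all, at the cost of possible collisions between distinct strings.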
- vectorize(df2, features)
Vectorizes the specified columns in the DataFrame based on the provided features.
- Parameters:
df2 (pyspark.sql.DataFrame) – The input DataFrame.
features (dict) – A dictionary containing information about the features of each column.
- Returns:
The vectorized DataFrame, fullDigitizedDf.
- Return type:
pyspark.sql.DataFrame
- Raises:
NotImplementedError – If a transformation for a specific data type is not implemented.
- Implementation Flow:
Iterate over each column in the DataFrame.
Check the data type and features of the column to determine the vectorization strategy.
Apply the appropriate transformation based on the column’s data type and features.
- If the column is a Single Categorical String:
Apply string indexing or hashing based on the configured strategy.
- If the column is a List of Categorical Strings:
Explode the list and apply string indexing or hashing based on the configured strategy.
- If the column is a Single Non-Categorical String:
Apply Word2Vec embedding after tokenization and optional stop word removal.
- If the column is a List of Non-Categorical Strings:
Combine the list elements, apply tokenization, optional stop word removal, and Word2Vec embedding.
- If the column is a Numeric type (Integer, Long, Float, Double):
Handle both Single and List types by either joining or exploding the values.
- If the column is a Boolean type:
Cast the Boolean values to Integer (0 or 1).
- If the column is of an unsupported data type:
Raise a NotImplementedError.
Join the transformed column with the vectorized DataFrame using the entity column.
Return the resulting vectorized DataFrame.
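The non-categorical string branches in the flow above (tokenization, optional stop word removal, Word2Vec embedding) can be sketched in plain Python. This is an illustration under stated assumptions: the stop word list and the 2-dimensional TOY_VECTORS table are hypothetical stand-ins for Spark's StopWordsRemover and a trained Word2Vec model, and the averaging follows the idea that a document is represented by the mean of its word vectors. Ignoring unknown tokens in the average is a simplification.

```python
import re

# Tiny illustrative stop word list; a real remover ships a much longer one.
STOP_WORDS = {"a", "an", "the", "of", "is", "in"}

# Hypothetical 2-dimensional word vectors standing in for a trained model
# (the size matches the documented default _word2vecSize = 2).
TOY_VECTORS = {
    "spark": [1.0, 0.0],
    "graph": [0.0, 1.0],
    "knowledge": [0.5, 0.5],
}


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def remove_stop_words(tokens):
    """Drop tokens that appear in the stop word list."""
    return [t for t in tokens if t not in STOP_WORDS]


def embed(tokens, vectors, size=2):
    """Average the vectors of known tokens; return zeros if none is known."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return [0.0] * size
    return [sum(v[i] for v in known) / len(known) for i in range(size)]


def vectorize_text(text, remove_stops=True):
    """Single non-categorical string: tokenize, filter, embed."""
    tokens = tokenize(text)
    if remove_stops:
        tokens = remove_stop_words(tokens)
    return embed(tokens, TOY_VECTORS)


def vectorize_text_list(texts, remove_stops=True):
    """List of non-categorical strings: combine the elements, then embed."""
    return vectorize_text(" ".join(texts), remove_stops)


single = vectorize_text("The knowledge graph")
combined = vectorize_text_list(["Spark", "the graph"])
```

With these toy vectors, each entity ends up with one fixed-length vector regardless of how many strings or tokens it started with, which is what allows the result to be joined back on the entity column.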
Note
The implementation uses conditional branching on the data type and features of each column to select the appropriate vectorization strategy.
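The branching described above can be summarized as a small dispatch function. This is a sketch only: the function name choose_strategy, its arguments, and the returned labels are hypothetical and do not reflect how the features dictionary is actually structured or how the library names its branches.

```python
NUMERIC_TYPES = {"Integer", "Long", "Float", "Double"}


def choose_strategy(data_type, is_list, is_categorical,
                    digit_string_strategy="index"):
    """Return a label for the branch that the documented flow would take.

    data_type, is_list and is_categorical are assumed to have been derived
    from the per-column feature information beforehand.
    """
    if data_type == "String":
        if is_categorical:
            # Categorical strings: index or hash, after exploding lists.
            step = f"{digit_string_strategy} strings"
            return f"explode list, then {step}" if is_list else step
        # Free text: Word2Vec pipeline, after combining list elements.
        prefix = "combine list elements, then " if is_list else ""
        return prefix + "tokenize, optionally remove stop words, Word2Vec"
    if data_type in NUMERIC_TYPES:
        return "numeric: join or explode values"
    if data_type == "Boolean":
        return "cast to integer"
    # Mirrors the documented behavior for unsupported data types.
    raise NotImplementedError(f"No transformation for data type {data_type!r}")


examples = [
    choose_strategy("String", is_list=False, is_categorical=True),
    choose_strategy("String", is_list=True, is_categorical=False),
    choose_strategy("Double", is_list=True, is_categorical=False),
    choose_strategy("Boolean", is_list=False, is_categorical=False),
]
```

Keeping the branch selection separate from the transformations makes it straightforward to see which data types are covered and where a NotImplementedError would be raised.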