sparkkgml.vectorization

Classes

Vectorization

A class designed for feature engineering and vectorization of data in Spark DataFrames.

Module Contents

class sparkkgml.vectorization.Vectorization

A class designed for feature engineering and vectorization of data in Spark DataFrames.

_entityColumn

The name of the entity column used for joining and indexing.

Type: str

_stopWordsRemover

A flag indicating whether to remove stop words during text processing.

Type: bool

_word2vecSize

The dimensionality of the word vectors produced by the Word2Vec embedding.

Type: int

_word2vecMinCount

The minimum number of times a token must occur to be included in the Word2Vec vocabulary.

Type: int

_digitStringStrategy

The strategy for converting string values to numeric form (‘index’ or ‘hash’).

Type: str
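The difference between the two _digitStringStrategy values can be pictured with a small plain-Python sketch. It is an illustration only, not the library's code: the helper names `index_encode` and `hash_encode`, the MD5-based hash, and the bucket count of 1024 are all assumptions made for the example. The index variant orders labels by descending frequency, which is the default behaviour of Spark's StringIndexer.

```python
import hashlib
from collections import Counter


def index_encode(values):
    """Map each distinct string to an integer index, most frequent first."""
    counts = Counter(values)
    # Sort by descending frequency, then alphabetically to break ties.
    ordered = sorted(counts, key=lambda s: (-counts[s], s))
    mapping = {s: i for i, s in enumerate(ordered)}
    return [mapping[v] for v in values]


def hash_encode(values, num_buckets=1024):
    """Map each string to a bucket with a stable hash (hypothetical choice: MD5)."""
    def bucket(s):
        digest = hashlib.md5(s.encode("utf-8")).hexdigest()
        return int(digest, 16) % num_buckets
    return [bucket(v) for v in values]


colors = ["red", "blue", "red", "green", "red", "blue"]
indexed = index_encode(colors)  # needs a pass over the data to build the mapping
hashed = hash_encode(colors)    # needs no fitting pass, but buckets may collide
```

Indexing yields compact, collision-free codes at the cost of first scanning all values, while hashing can encode each value independently and may map different strings to the same bucket.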

get_entityColumn()[source]: Getter method for the entity column.

get_word2vecSize()[source]: Getter method for the Word2Vec vector size.

get_stopWordsRemover()[source]: Getter method for the stop words remover flag.

get_word2vecMinCount()[source]: Getter method for the Word2Vec minimum count.

get_digitStringStrategy()[source]: Getter method for the digit string strategy.

set_entityColumn(entityColumn)[source]: Setter method for the entity column.

set_word2vecSize(word2vecSize)[source]: Setter method for the Word2Vec vector size.

set_word2vecMinCount(word2vecMinCount)[source]: Setter method for the Word2Vec minimum count.

set_stopWordsRemover(stopWordsRemover)[source]: Setter method for the stop words remover flag.

set_digitStringStrategy(digitStringStrategy)[source]: Setter method for the digit string strategy.

vectorize(df2, features)[source]: Applies vectorization transformations to specified columns in the DataFrame based on the provided features.

Note

The vectorize method iterates over the specified columns, applies appropriate transformations based on data type and features, and returns a DataFrame with the vectorized features.

_entityColumn = ''

_stopWordsRemover = True

_word2vecSize = 2

_word2vecMinCount = 1

_digitStringStrategy = 'index'

get_entityColumn()

get_word2vecSize()

get_stopWordsRemover()

get_word2vecMinCount()

get_digitStringStrategy()

set_entityColumn(entityColumn)

set_word2vecSize(word2vecSize)

set_word2vecMinCount(word2vecMinCount)

set_stopWordsRemover(StopWordsRemover)

set_digitStringStrategy(digitStringStrategy)
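The defaults and accessors listed above follow a common Python pattern: class-level default values that setters override per instance. The sketch below is a hypothetical stand-in named `VectorizationSketch`, not the real class. It mirrors only the documented attribute names and default values, and it assumes that each setter simply assigns to the instance, which the documentation does not state.

```python
class VectorizationSketch:
    """Hypothetical stand-in mirroring the documented names and defaults."""

    # Class-level defaults, as listed in the reference above.
    _entityColumn = ''
    _stopWordsRemover = True
    _word2vecSize = 2
    _word2vecMinCount = 1
    _digitStringStrategy = 'index'

    def get_word2vecSize(self):
        return self._word2vecSize

    def set_word2vecSize(self, word2vecSize):
        # Assumed behaviour: assignment shadows the class default on this instance only.
        self._word2vecSize = word2vecSize

    def get_digitStringStrategy(self):
        return self._digitStringStrategy

    def set_digitStringStrategy(self, digitStringStrategy):
        self._digitStringStrategy = digitStringStrategy


configured = VectorizationSketch()
configured.set_word2vecSize(100)
configured.set_digitStringStrategy('hash')

untouched = VectorizationSketch()  # still sees the class-level defaults
```

Under this assumption, configuring one instance leaves the defaults of any other instance unchanged.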

vectorize(df2, features)

Vectorizes the specified columns in the DataFrame based on the provided features.

Parameters:

df2 (pyspark.sql.DataFrame) – The input DataFrame.
features (dict) – A dictionary containing information about the features of each column.

Returns:

The vectorized DataFrame.

Return type:

fullDigitizedDf (pyspark.sql.DataFrame)

Raises:

NotImplementedError – If a transformation for a specific data type is not implemented.

Implementation Flow:

1. Iterate over each column in the DataFrame.
2. Check the data type and features of the column to determine the vectorization strategy.
3. Apply the appropriate transformation based on the column’s data type and features.
   - If the column is a Single Categorical String:
     - Apply string indexing or hashing based on the configured strategy.
   - If the column is a List of Categorical Strings:
     - Explode the list and apply string indexing or hashing based on the configured strategy.
   - If the column is a Single Non-Categorical String:
     - Apply Word2Vec embedding after tokenization and optional stop word removal.
   - If the column is a List of Non-Categorical Strings:
     - Combine the list elements, apply tokenization, optional stop word removal, and Word2Vec embedding.
   - If the column is a Numeric type (Integer, Long, Float, Double):
     - Handle both Single and List types by either joining or exploding the values.
   - If the column is a Boolean type:
     - Cast the Boolean values to Integer (0 or 1).
   - If the column is of an unsupported data type:
     - Raise a NotImplementedError.
4. Join the transformed column with the vectorized DataFrame using the entity column.
5. Return the resulting vectorized DataFrame.
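The two non-categorical string branches in the flow above share a text-preparation step before Word2Vec is applied. The following plain-Python sketch shows one way that step could look. The function `prepare_tokens` and the tiny `STOP_WORDS` set are hypothetical, and lowercasing followed by whitespace splitting is assumed here because that is what Spark's Tokenizer does; the library's exact preprocessing is not documented on this page.

```python
# Tiny illustrative stop word list; a real list would be much longer.
STOP_WORDS = {"a", "an", "the", "of", "and", "is", "in"}


def prepare_tokens(value, remove_stop_words=True):
    """Turn a string, or a list of strings, into tokens ready for embedding."""
    # A list of strings is first combined into one text.
    text = " ".join(value) if isinstance(value, (list, tuple)) else value
    # Lowercase and split on whitespace.
    tokens = text.lower().split()
    # Stop word removal is optional, mirroring the _stopWordsRemover flag.
    if remove_stop_words:
        tokens = [t for t in tokens if t not in STOP_WORDS]
    return tokens


single = prepare_tokens("The capital of France")
combined = prepare_tokens(["The Eiffel Tower", "is in Paris"])
unfiltered = prepare_tokens("The capital of France", remove_stop_words=False)
```

The resulting token lists are the kind of input a Word2Vec model consumes, with _word2vecSize controlling the length of the output vector and _word2vecMinCount controlling which tokens enter the vocabulary.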

Note

The implementation follows a conditional branching based on the data type and features of each column to determine the appropriate vectorization strategy.
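The conditional branching described in this note can be sketched as a small dispatch function. This is a reading of the documented flow, not the library's implementation: the function `choose_strategy`, its parameters, the datatype strings, and the returned labels are all hypothetical. In particular, mapping single numeric values to "join" and numeric lists to "explode" is an assumed interpretation of the numeric branch.

```python
NUMERIC_TYPES = {"integer", "long", "float", "double"}


def choose_strategy(datatype, is_list, is_categorical, digit_string_strategy="index"):
    """Return a label naming the transformation for one column."""
    if datatype == "string":
        if is_categorical:
            # Categorical strings are indexed or hashed, per the configured strategy.
            step = "hash" if digit_string_strategy == "hash" else "index"
            return f"explode+{step}" if is_list else step
        # Non-categorical strings go through tokenization and Word2Vec.
        return "combine+word2vec" if is_list else "word2vec"
    if datatype in NUMERIC_TYPES:
        # Assumed reading: lists are exploded, single values are joined as-is.
        return "explode" if is_list else "join"
    if datatype == "boolean":
        return "cast_to_int"
    raise NotImplementedError(f"No transformation for data type: {datatype}")


plan = {
    "genre": choose_strategy("string", is_list=False, is_categorical=True),
    "tags": choose_strategy("string", True, True, digit_string_strategy="hash"),
    "abstract": choose_strategy("string", is_list=False, is_categorical=False),
    "ratings": choose_strategy("double", is_list=True, is_categorical=False),
    "released": choose_strategy("boolean", is_list=False, is_categorical=False),
}
```

Keeping the decision separate from the transformation itself makes the unsupported-type case explicit: any datatype outside the listed branches raises NotImplementedError instead of passing through silently.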