Vectorization
Vectorize()
The vectorize()
function is used to vectorize the specified columns in the DataFrame based on the provided features obtained from the getFeatures()
function. It prepares a machine learning-ready DataFrame by applying the appropriate transformations based on the column’s data type and features.
Transformation Strategies
The function employs different strategies to handle various types of columns:
Single Categorical String: For columns containing a single categorical string, the function applies either string indexing or hashing based on the configured strategy.
List of Categorical Strings: When dealing with columns consisting of a list of categorical strings, the function explodes the list and applies string indexing or hashing based on the configured strategy.
Single Non-Categorical String: Columns with a single non-categorical string are processed by applying Word2Vec embedding after tokenization. Optional stop word removal can also be performed.
List of Non-Categorical Strings: In the case of columns containing a list of non-categorical strings, the function combines the list elements, applies tokenization, optional stop word removal, and Word2Vec embedding.
Numeric Type: For columns of numeric types (integer, long, float, double), both single and list types are handled by either joining or exploding the values.
Boolean Type: Columns of boolean type are cast to integers (0 or 1).
Unsupported Data Type: If a column has an unsupported data type, a
NotImplementedError
is raised.
The vectorize()
function provides a flexible and extensible way to vectorize different types of columns based on their data type and features. By leveraging this function, you can easily transform your data into a machine learning-ready format.
Example Usage
1. Prepare the DataFrame
We already have a dataframe and features dictionary from last example:
Click here for the code
# Import the required libraries
from sparkkgml.feature_engineering import FeatureEngineering
# Create an instance of FeatureCollection
featureEngineeringObject=FeatureEngineering()
# Call the getFeatures function with the Spark DataFrame as input
df2,features=featureEngineeringObject.getFeatures(spark_df)
df2.show()
recipe |
calorie |
category |
creamy-orange-cake |
156.1 |
desserts |
summer-chicken-salads |
243.2 |
salad |
orange-raisin-cake |
168.2 |
desserts |
alfredo-blue |
486.7 |
main-dish |
millie-pasquinell… |
621.9 |
meat-and-poultry |
featureDescriptions:
- ‘calorie’: {‘featureType’: ‘Single_NonCategorical_Double’,
‘name’: ‘calorie’, ‘nullable’: False, ‘datatype’: DoubleType, ‘numberDistinctValues’: 193, ‘isListOfEntries’: False, ‘isCategorical’: False},
- ‘category’: {‘featureType’: ‘Single_Categorical_String’,
‘name’: ‘category’, ‘nullable’: False, ‘datatype’: StringType, ‘numberDistinctValues’: 15, ‘isListOfEntries’: False, ‘isCategorical’: True}
2. Call Vectorize
Let’s call vectorize function on top of that:
# Import the required libraries
from sparkkgml.vectorization import Vectorization
# Create an instance of Vectorization
vectorizationObject=Vectorization()
#here we are calling the vectorize function and digitazing all the columns
digitized_df=vectorizationObject.vectorize(df2,features)
digitized_df.show(5)
recipe |
calorie |
category |
creamy-orange-cake |
156.1 |
0.0 |
summer-chicken-salads |
243.2 |
6.0 |
orange-raisin-cake |
168.2 |
0.0 |
alfredo-blue |
486.7 |
3.0 |
millie-pasquinell… |
621.9 |
9.0 |
As you can see, category feature was digitized.
For more details on using vectorize function and its capabilities, please refer to the documentation.