Motif Walks

This module performs motif walks on a Knowledge Graph and extracts embeddings with a Word2Vec model. The MotifWalks class generates walks from specified entities, processes these walks through motif structures, and trains a Word2Vec model to obtain vector embeddings for entities in the graph.

Example Usage

Step 1: Creating the GraphFrame

The first step in using the MotifWalks class is to create a GraphFrame from your RDF data using the KG class. This step is essential as it converts the RDF data into a graph structure that can be used for motif walks and embedding extraction.

# Import the required libraries
from kg import KG
from motif_walks import MotifWalks
from pyspark.sql import SparkSession

# Initialize a Spark session (optional)
spark = SparkSession.builder.appName("MotifWalks-Example").getOrCreate()

# Specify the location of the RDF file; its serialization must match
# the fmt argument passed to KG (a Turtle file here)
rdf_location = "path_to_your_rdf_file.ttl"

# Create an instance of the KG class
kg = KG(location=rdf_location, fmt='turtle', sparkSession=spark)

# Create the GraphFrame
graph_frame = kg.createKG()
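
To make the conversion concrete, the following is a minimal sketch of how RDF triples map onto the two tables a GraphFrame is built from: a vertex table with an id column and an edge table with src and dst columns. It is not the KG class's implementation. The sample triples, the triples_to_tables helper, and the relationship column name are hypothetical and used only for illustration.

```python
# Hypothetical triples standing in for parsed RDF data (not produced by KG).
triples = [
    ("ex:Alice", "ex:knows", "ex:Bob"),
    ("ex:Bob", "ex:worksAt", "ex:Acme"),
    ("ex:Alice", "ex:worksAt", "ex:Acme"),
]


def triples_to_tables(triples):
    """Split (subject, predicate, object) triples into vertex and edge rows."""
    # Every subject and every object becomes one vertex, identified by "id".
    vertex_ids = sorted({s for s, _, _ in triples} | {o for _, _, o in triples})
    vertices = [{"id": v} for v in vertex_ids]
    # Every triple becomes one directed edge; the predicate is kept as an
    # edge attribute (the column name "relationship" is an assumption).
    edges = [{"src": s, "dst": o, "relationship": p} for s, p, o in triples]
    return vertices, edges


vertices, edges = triples_to_tables(triples)
```

In a Spark setting, rows shaped like these would be loaded into two DataFrames before the graph is assembled; kg.createKG() takes care of that step for you.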

Step 2: Conducting Motif Walks with motif_walk

Once the graph is ready, you can perform motif walks on it using the motif_walk method. Motif walks generate paths on the graph starting from specified entities. These paths can then be used for various downstream tasks, such as training a Word2Vec model to extract embeddings.

The motif_walk method takes a depth, which controls how many steps (hops) each walk takes from its starting entity, and a walktype, which selects one of three walk types: BFS, entity, or predicate.

# Specify the starting entities for the motif walks
entities = ["entity1", "entity2", "entity3"]

# Create an instance of the MotifWalks class
motif_walks = MotifWalks(kg_instance=kg, entities=entities, sparkSession=spark)

# Perform motif walks with a specified depth
paths_df = motif_walks.motif_walk(graph_frame, depth=3, walktype='BFS')

# Display the resulting paths
paths_df.show()

Explanation:

The entities list specifies the starting points for the motif walks.

The depth parameter determines the maximum number of steps in each walk.

The walktype parameter selects one of the three walk types (BFS, entity, or predicate); the example above uses 'BFS'.

The motif_walk method processes the graph and returns a DataFrame (paths_df) containing the generated paths.
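
As a rough illustration of what a depth-limited walk produces, the sketch below enumerates paths breadth-first over a small in-memory triple list. It is a plain-Python approximation under stated assumptions, not the motif_walk implementation, which runs on a GraphFrame. The bfs_walks helper and the sample triples are hypothetical.

```python
from collections import defaultdict

# Hypothetical triples used only for this illustration.
triples = [
    ("ex:Alice", "ex:knows", "ex:Bob"),
    ("ex:Bob", "ex:worksAt", "ex:Acme"),
    ("ex:Alice", "ex:worksAt", "ex:Acme"),
]


def bfs_walks(triples, start, depth):
    """Enumerate every path of up to `depth` hops from `start`, level by level."""
    adjacency = defaultdict(list)
    for s, p, o in triples:
        adjacency[s].append((p, o))

    frontier = [[start]]
    for _ in range(depth):
        next_frontier = []
        for path in frontier:
            extensions = adjacency.get(path[-1], [])
            if not extensions:
                # Dead end: keep the shorter path instead of dropping it.
                next_frontier.append(path)
                continue
            for predicate, obj in extensions:
                next_frontier.append(path + [predicate, obj])
        frontier = next_frontier
    return frontier


walks = bfs_walks(triples, "ex:Alice", depth=2)
for walk in walks:
    print(" -> ".join(walk))
```

Each path alternates entities and predicates, and a larger depth lets a walk reach entities further from the starting point at the cost of more paths to process.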

Step 3: Extracting Embeddings with word2Vec_embeddings

After generating paths using motif walks, you can train a Word2Vec model on these paths to extract embeddings for the entities. The word2Vec_embeddings method allows you to customize various parameters to fine-tune the model’s training process and outputs a DataFrame with the vector representations of the entities.

# Train Word2Vec model and extract embeddings
embeddings_df = motif_walks.word2Vec_embeddings(
    df=paths_df,
    vector_size=100,
    min_count=5,
    num_partitions=1,
    step_size=0.025,
    max_iter=10,
    seed=42,
    input_col="paths",
    output_col="vectors",
    window_size=5,
    max_sentence_length=1000
)

# Display the embeddings
embeddings_df.show()

Explanation:

The word2Vec_embeddings method trains a Word2Vec model using the paths generated by the motif walks.

The method outputs a DataFrame (embeddings_df) containing the vector embeddings for the entities in the model's vocabulary, that is, those that occur at least min_count times in the paths.

Parameters:

df (DataFrame): The DataFrame containing the paths to train the Word2Vec model.

vector_size (int): The size of the vectors for each entity. Larger sizes capture more information but require more computational resources.

min_count (int): The minimum number of occurrences for a word (entity) to be included in the model’s vocabulary.

num_partitions (int): The number of partitions to use for training. In distributed Word2Vec training, more partitions generally shorten training time but can reduce the accuracy of the embeddings.

step_size (float): The learning rate for training the Word2Vec model.

max_iter (int): The maximum number of iterations to run the training. More iterations can improve the model but increase training time.

seed (int): A random seed for reproducibility of results.

input_col (str): The name of the input column in the DataFrame that contains the paths.

output_col (str): The name of the output column where the vectors will be stored.

window_size (int): The window size for the skip-gram model. It determines how many words to the left and right of the target word are considered during training.

max_sentence_length (int): The maximum length of a sentence (or path) to be used for training.
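
To show how min_count and window_size interact, here is a simplified sketch of the (target, context) pairs a skip-gram model would see for a set of paths. It is an illustration only: real Word2Vec implementations add details such as window sampling that are omitted here, and the skipgram_pairs helper and sample paths are hypothetical.

```python
from collections import Counter


def skipgram_pairs(paths, window_size, min_count):
    """Build (target, context) pairs after dropping rare tokens."""
    counts = Counter(token for path in paths for token in path)
    pairs = []
    for path in paths:
        # Tokens seen fewer than min_count times never enter the vocabulary.
        kept = [t for t in path if counts[t] >= min_count]
        for i, target in enumerate(kept):
            lo = max(0, i - window_size)
            hi = min(len(kept), i + window_size + 1)
            # Pair the target with every neighbour inside the window.
            pairs.extend((target, kept[j]) for j in range(lo, hi) if j != i)
    return pairs


# Hypothetical paths: "a" and "p" occur twice, "b" and "c" only once.
paths = [["a", "p", "b"], ["a", "p", "c"]]

all_pairs = skipgram_pairs(paths, window_size=1, min_count=1)
frequent_pairs = skipgram_pairs(paths, window_size=1, min_count=2)
```

With min_count=2 the tokens "b" and "c" are dropped, so no pair mentions them and they would receive no embedding. On small graphs it can therefore be worth lowering min_count below the value used in the example call.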

By adjusting these parameters, you can control the quality and characteristics of the embeddings produced by the Word2Vec model. These embeddings are useful for various machine learning tasks such as clustering, classification, or further analysis in graph-based applications.
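
As one example of downstream use, entities can be compared by the cosine similarity of their vectors. The sketch below assumes the embeddings have already been collected into a plain dictionary mapping entity to vector. The vectors shown are made up for illustration and are not output from word2Vec_embeddings, and the helper names are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # treat a zero vector as dissimilar to everything
    return dot / (norm_u * norm_v)


def most_similar(entity, embeddings):
    """Return the other entity whose vector is closest to `entity`'s."""
    others = (e for e in embeddings if e != entity)
    return max(others, key=lambda e: cosine_similarity(embeddings[entity], embeddings[e]))


# Made-up two-dimensional vectors; real embeddings have vector_size dimensions.
embeddings = {
    "ex:Alice": [1.0, 0.0],
    "ex:Bob": [0.8, 0.6],
    "ex:Acme": [0.0, 1.0],
}

nearest = most_similar("ex:Alice", embeddings)
```

The same similarity measure can serve as the basis for nearest-neighbour lookups or as a distance in clustering.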

Conclusion

The MotifWalks class generates motif walks on a Knowledge Graph and extracts embeddings from them using Word2Vec. By letting users control the depth and type of the walks and the Word2Vec training parameters, and by offering additional features for data transformation, it supports a wide range of graph-based learning tasks.

For further customization and advanced usage, please refer to the API documentation.