sparkkgml.motif_walks

Classes

MotifWalks

Generates walks on a graph for given entities, performs motif walks, and extracts embeddings using a Word2Vec model.

Module Contents

class sparkkgml.motif_walks.MotifWalks(kg_instance, entities: List[str] = [], sparkSession: pyspark.sql.SparkSession = None)

The MotifWalks class generates walks on a graph for given entities, performs motif walks, and extracts embeddings using a Word2Vec model.

entities

List of starting entities for Motif Walks.

Type: List[str]

kg_instance: Instance of the knowledge graph.

sparkSession

SparkSession for Spark operations.

Type: SparkSession

hashed_entities

Hashed entities for efficient lookup.

Type: List[str]

create_motif_string(depth)

Generates a motif string for a given depth.

Parameters: depth (int) – Depth of the motif.
Returns: Motif string for the given depth.
Return type: str
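The exact string returned by create_motif_string is not shown here. As a minimal sketch, assuming it chains one edge pattern per hop in GraphFrames motif syntax, a depth-based motif could be built as follows. The function name build_motif_string and the vertex and edge names v0, e0, and so on are hypothetical.

```python
def build_motif_string(depth: int) -> str:
    """Chain `depth` edge patterns in GraphFrames motif syntax (illustrative only)."""
    if depth < 1:
        raise ValueError("depth must be at least 1")
    # Each hop links vertex v{i} to v{i+1} through edge e{i}; the names are assumptions.
    hops = [f"(v{i})-[e{i}]->(v{i + 1})" for i in range(depth)]
    # GraphFrames separates the patterns of a motif with semicolons.
    return "; ".join(hops)


print(build_motif_string(2))  # (v0)-[e0]->(v1); (v1)-[e1]->(v2)
```

A string of this shape can be passed to GraphFrame.find, which returns one row per matching path.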

struct_to_list(df, walktype)

Transforms the struct type in a DataFrame to a list of strings based on the specified walk type.

Parameters:

df (DataFrame) – Input DataFrame containing struct types.
walktype (str) – The type of walk to perform: ‘BFS’, ‘entity’, or ‘predicate’.

Returns:

Transformed DataFrame with each row represented as a list of strings.

Return type:

DataFrame
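Which tokens each walk type keeps is not specified above. The plain-Python sketch below shows one plausible reading, assuming a path alternates vertex and edge labels, that ‘BFS’ keeps every token, ‘entity’ keeps only vertices, and ‘predicate’ keeps only edge labels. The function flatten_walk is hypothetical and operates on a list rather than a Spark struct column.

```python
from typing import List


def flatten_walk(path: List[str], walktype: str = "BFS") -> List[str]:
    """Select tokens from a path laid out as [vertex, edge, vertex, ...].

    The per-type selection rules here are assumptions, not the library's
    documented behaviour.
    """
    if walktype == "BFS":
        return list(path)      # keep vertices and edges
    if walktype == "entity":
        return path[0::2]      # even positions hold vertices
    if walktype == "predicate":
        return path[1::2]      # odd positions hold edge labels
    raise ValueError(f"unknown walktype: {walktype!r}")


walk = ["Alice", "knows", "Bob", "worksAt", "Acme"]
print(flatten_walk(walk, "entity"))
```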

motif_walk(graph, depth, walktype='BFS')

Conducts motif walks on the given graph up to the specified depth. Each depth level is processed separately, so paths can be filtered per level, for example on vertex properties such as whether a vertex has outgoing edges.

Parameters:

graph (GraphFrame) – The graph on which to perform motif walks. The vertices should have a ‘has_outgoing_edge’ column to facilitate filtering.
depth (int) – The maximum depth (number of steps) of the motif walks.
walktype (str) – The type of walk to perform: ‘BFS’, ‘predicate’, or ‘entity’. Default is ‘BFS’.

Returns:

A DataFrame containing the paths resulting from the motif walks, with one row per path.

Return type:

DataFrame

Notes

- This function creates and processes motifs for each depth level from 1 to the specified maximum depth, providing more refined filtering options.
- It allows filtering of paths based on whether the last vertex in the path has outgoing edges, thereby potentially terminating paths early.
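The per-level behaviour described in these notes can be illustrated without Spark. The sketch below is an assumption about the semantics, not the library's implementation: it extends paths one hop at a time and stops a path early when its last vertex has no outgoing edges. The function walks_per_depth and the triple layout (subject, predicate, object) are hypothetical.

```python
from typing import Dict, List, Tuple

Edge = Tuple[str, str, str]  # (subject, predicate, object)


def walks_per_depth(edges: List[Edge], starts: List[str], depth: int) -> List[List[str]]:
    """Extend paths level by level, ending a path early at a dead-end vertex."""
    # Index outgoing edges by source vertex.
    out: Dict[str, List[Tuple[str, str]]] = {}
    for s, p, o in edges:
        out.setdefault(s, []).append((p, o))

    frontier = [[s] for s in starts]
    finished: List[List[str]] = []
    for level in range(1, depth + 1):
        # Extend every open path by one hop.
        extended = [path + [p, o] for path in frontier for p, o in out.get(path[-1], [])]
        frontier = []
        for path in extended:
            # A path is complete at the final level, or when its last vertex
            # has no outgoing edges (the role of 'has_outgoing_edge').
            if level == depth or path[-1] not in out:
                finished.append(path)
            else:
                frontier.append(path)
    return finished


edges = [("a", "p", "b"), ("b", "q", "c"), ("a", "r", "d")]
print(walks_per_depth(edges, ["a"], 2))
# [['a', 'r', 'd'], ['a', 'p', 'b', 'q', 'c']]
```

In this example the one-hop path ending at d is kept because d has no outgoing edges, while the path through b continues to the full depth.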

motif_walk_depth(graph, depth, walktype='BFS')

Conducts motif walks on the given graph for the specified depth. This function performs a motif walk across the entire specified depth in one go and returns the resulting paths.

Parameters:

graph (GraphFrame) – The graph on which to perform the motif walks.
depth (int) – The depth (number of steps) of the motif walks.
walktype (str) – The type of walk to perform: ‘BFS’, ‘predicate’, or ‘entity’. Default is ‘BFS’.

Returns:

A DataFrame containing the paths resulting from the motif walks, with one row per path.

Return type:

DataFrame

Notes

This function creates a single motif string for the entire depth and processes the graph accordingly. It does not account for intermediate filtering based on the properties of vertices encountered during the walk.
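As a rough illustration of what a single motif over the entire depth implies, the plain-Python sketch below keeps only paths with exactly the requested number of hops, so shorter paths that reach a dead end are dropped. This is an assumption about the semantics; the function walks_fixed_depth and the triple layout are hypothetical.

```python
from typing import Dict, List, Tuple

Edge = Tuple[str, str, str]  # (subject, predicate, object)


def walks_fixed_depth(edges: List[Edge], starts: List[str], depth: int) -> List[List[str]]:
    """Return only the paths that have exactly `depth` hops."""
    # Index outgoing edges by source vertex.
    out: Dict[str, List[Tuple[str, str]]] = {}
    for s, p, o in edges:
        out.setdefault(s, []).append((p, o))

    paths = [[s] for s in starts]
    for _ in range(depth):
        # Paths that cannot be extended disappear; no intermediate filtering happens.
        paths = [path + [p, o] for path in paths for p, o in out.get(path[-1], [])]
    return paths


edges = [("a", "p", "b"), ("b", "q", "c"), ("a", "r", "d")]
print(walks_fixed_depth(edges, ["a"], 2))
# [['a', 'p', 'b', 'q', 'c']]
```

Here the one-hop path from a to d does not appear in the result, because it cannot match a two-hop pattern.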

word2Vec_embeddings(df, vector_size=100, min_count=5, num_partitions=1, step_size=0.025, max_iter=1, seed=None, input_col='sentences', output_col='vectors', window_size=5, max_sentence_length=1000, **kwargs)

Trains a Word2Vec model on walks and returns the vectors of entities.

Parameters:

df (DataFrame) – DataFrame containing paths for training the Word2Vec model.
vector_size (int) – Size of the word vectors.
min_count (int) – Minimum number of occurrences for a word to be included in the vocabulary.
num_partitions (int) – Number of partitions for Word2Vec estimation.
step_size (float) – Step size (learning rate) for optimization.
max_iter (int) – Maximum number of iterations for optimization.
seed (int) – Random seed for initialization.
input_col (str) – Input column name.
output_col (str) – Output column name.
window_size (int) – Size of the window for skip-gram.
max_sentence_length (int) – Maximum length of a sentence.
**kwargs – Additional arguments for Word2Vec model.

Returns:

DataFrame with vectors of entities.

Return type:

DataFrame
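The parameter names and defaults above mirror those of pyspark.ml.feature.Word2Vec, which uses camelCase keywords. The sketch below shows how the snake_case arguments might be translated, assuming this method forwards them to that estimator; the helper spark_word2vec_params is hypothetical and is not part of sparkkgml.

```python
from typing import Any, Dict, Optional


def spark_word2vec_params(
    vector_size: int = 100,
    min_count: int = 5,
    num_partitions: int = 1,
    step_size: float = 0.025,
    max_iter: int = 1,
    seed: Optional[int] = None,
    input_col: str = "sentences",
    output_col: str = "vectors",
    window_size: int = 5,
    max_sentence_length: int = 1000,
) -> Dict[str, Any]:
    """Map snake_case arguments to pyspark.ml.feature.Word2Vec keyword names."""
    params: Dict[str, Any] = {
        "vectorSize": vector_size,
        "minCount": min_count,
        "numPartitions": num_partitions,
        "stepSize": step_size,
        "maxIter": max_iter,
        "inputCol": input_col,
        "outputCol": output_col,
        "windowSize": window_size,
        "maxSentenceLength": max_sentence_length,
    }
    # Only pass a seed when one was given, so Spark can choose its own otherwise.
    if seed is not None:
        params["seed"] = seed
    return params


print(spark_word2vec_params(vector_size=64, seed=7)["vectorSize"])  # 64
```

With PySpark installed, the resulting dictionary could be used as Word2Vec(**spark_word2vec_params(vector_size=64)).fit(df), where df holds the walks as arrays of strings in the input column; the fitted model's getVectors() then returns one vector per token. Note that with min_count=5, entities appearing in fewer than five positions across all walks are left out of the vocabulary.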