sparkkgml.motif_walks

Classes

MotifWalks

Generates walks on a graph for given entities, performs motif walks, and extracts embeddings using a Word2Vec model.

Module Contents

class sparkkgml.motif_walks.MotifWalks(kg_instance, entities: List[str] = [], sparkSession: pyspark.sql.SparkSession = None)

MotifWalks class generates walks on a graph for given entities, performs motif walks, and extracts embeddings using a Word2Vec model.

entities

List of starting entities for Motif Walks.

Type:

List[str]

kg_instance

Instance of the knowledge graph.

sparkSession

SparkSession for Spark operations.

Type:

SparkSession

hashed_entities

Hashed entities for efficient lookup.

Type:

List[str]

create_motif_string(depth)

Generates a motif string for a given depth.

Parameters:

depth (int) – Depth of the motif.

Returns:

Motif string for the given depth.

Return type:

str
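
The following is a minimal sketch of what a motif string of a given depth could look like, assuming the GraphFrames motif syntax in which edge patterns of the form (a)-[e]->(b) are joined with semicolons. The function name build_motif_string and the vertex and edge labels (v0, e0, ...) are hypothetical and may differ from what create_motif_string actually produces.

```python
def build_motif_string(depth: int) -> str:
    """Chain `depth` edge patterns in GraphFrames motif syntax.

    Hypothetical helper for illustration; the labels v{i} and e{i}
    are assumptions, not the names used by sparkkgml.
    """
    if depth < 1:
        raise ValueError("depth must be at least 1")
    # Hop i links vertex v{i} to vertex v{i+1} through edge e{i}.
    hops = [f"(v{i})-[e{i}]->(v{i + 1})" for i in range(depth)]
    return "; ".join(hops)


print(build_motif_string(2))  # (v0)-[e0]->(v1); (v1)-[e1]->(v2)
```

A string of this shape is what GraphFrame.find() accepts as its pattern argument.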

struct_to_list(df)

Adjusts the DataFrame to handle vertices and extract the ‘relationship’ from edges.

Parameters:

df (DataFrame) – Input DataFrame.

Returns:

Transformed DataFrame with structured lists.

Return type:

DataFrame

struct_to_list2(df)

Transforms the struct-typed columns of the DataFrame into lists of strings.

Parameters:

df (DataFrame) – Input DataFrame.

Returns:

Transformed DataFrame with structured lists.

Return type:

DataFrame

motif_walk(graph, depth)

Conducts motif walks on the given graph up to the specified maximum depth. Each depth level is processed separately, which allows paths to be filtered level by level, for example on whether a vertex has outgoing edges.

Parameters:
  • graph (GraphFrame) – The graph on which to perform motif walks. The vertices should have a ‘has_outgoing_edge’ column to facilitate filtering.

  • depth (int) – The maximum depth (number of steps) of the motif walks.

Returns:

A DataFrame containing the paths resulting from the motif walks, with one row per path.

Return type:

DataFrame

Notes

  • This function creates and processes motifs for each depth level from 1 to the specified maximum depth, providing more refined filtering options.

  • It allows filtering of paths based on whether the last vertex in the path has outgoing edges, thereby potentially terminating paths early.
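
One reading of the notes above can be sketched in plain Python on a toy edge list, without Spark or GraphFrames. The function walks_with_early_stop and the edge tuples are hypothetical; the check `dst not in outgoing` stands in for the ‘has_outgoing_edge’ vertex column, and the actual implementation may filter differently.

```python
from typing import Dict, List, Tuple

Edge = Tuple[str, str, str]  # (source, relationship, destination)


def walks_with_early_stop(edges: List[Edge], starts: List[str], depth: int) -> List[List[str]]:
    """Extend paths one level at a time; keep a path as soon as it
    reaches the maximum depth or a vertex without outgoing edges."""
    outgoing: Dict[str, List[Tuple[str, str]]] = {}
    for src, rel, dst in edges:
        outgoing.setdefault(src, []).append((rel, dst))

    finished: List[List[str]] = []
    frontier: List[List[str]] = [[s] for s in starts]
    for level in range(1, depth + 1):
        next_frontier: List[List[str]] = []
        for path in frontier:
            for rel, dst in outgoing.get(path[-1], []):
                extended = path + [rel, dst]
                # Stand-in for the has_outgoing_edge filter: a dead-end
                # vertex terminates the path before the maximum depth.
                if level == depth or dst not in outgoing:
                    finished.append(extended)
                else:
                    next_frontier.append(extended)
        frontier = next_frontier
    return finished


edges = [("a", "knows", "b"), ("b", "knows", "c"), ("a", "likes", "d")]
print(walks_with_early_stop(edges, ["a"], depth=2))
# [['a', 'likes', 'd'], ['a', 'knows', 'b', 'knows', 'c']]
```

The one-step path ending at d is kept even though the maximum depth is 2, because d has no outgoing edges.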

motif_walk_depth(graph, depth)

Conducts motif walks on the given graph for the specified depth. This function performs a motif walk across the entire specified depth in one go and returns the resulting paths.

Parameters:
  • graph (GraphFrame) – The graph on which to perform the motif walks.

  • depth (int) – The depth (number of steps) of the motif walks.

Returns:

A DataFrame containing the paths resulting from the motif walks, with one row per path.

Return type:

DataFrame

Notes

  • This function creates a single motif string for the entire depth and processes the graph accordingly. It does not account for intermediate filtering based on the properties of vertices encountered during the walk.
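
The effect of matching a single motif for the whole depth can be illustrated in plain Python on a toy edge list, without Spark or GraphFrames. The function fixed_depth_walks and the edge tuples are hypothetical and only mimic the behavior described here: a path appears in the result only if it has exactly the requested number of steps.

```python
from typing import Dict, List, Tuple

Edge = Tuple[str, str, str]  # (source, relationship, destination)


def fixed_depth_walks(edges: List[Edge], starts: List[str], depth: int) -> List[List[str]]:
    """Return only the paths that take exactly `depth` steps."""
    outgoing: Dict[str, List[Tuple[str, str]]] = {}
    for src, rel, dst in edges:
        outgoing.setdefault(src, []).append((rel, dst))

    paths: List[List[str]] = [[s] for s in starts]
    for _ in range(depth):
        # Paths whose last vertex has no outgoing edges are dropped here,
        # since nothing is appended for them.
        paths = [p + [rel, dst] for p in paths for rel, dst in outgoing.get(p[-1], [])]
    return paths


edges = [("a", "knows", "b"), ("b", "knows", "c"), ("a", "likes", "d")]
print(fixed_depth_walks(edges, ["a"], depth=2))
# [['a', 'knows', 'b', 'knows', 'c']]
```

The shorter path from a to d does not appear, because d has no outgoing edge and the pattern requires two steps.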

word2Vec_embeddings(df, vector_size=100, min_count=5, num_partitions=1, step_size=0.025, max_iter=1, seed=None, input_col='sentences', output_col='vectors', window_size=5, max_sentence_length=1000, **kwargs)

Trains a Word2Vec model on walks and returns the vectors of entities.

Parameters:
  • df (DataFrame) – DataFrame containing paths for training the Word2Vec model.

  • vector_size (int) – Size of the word vectors.

  • min_count (int) – Minimum number of occurrences for a word to be included in the vocabulary.

  • num_partitions (int) – Number of partitions for Word2Vec estimation.

  • step_size (float) – Step size (learning rate) for optimization.

  • max_iter (int) – Maximum number of iterations for optimization.

  • seed (int) – Random seed for initialization.

  • input_col (str) – Input column name.

  • output_col (str) – Output column name.

  • window_size (int) – Size of the window for skip-gram.

  • max_sentence_length (int) – Maximum length of a sentence.

  • **kwargs – Additional arguments for Word2Vec model.

Returns:

DataFrame with vectors of entities.

Return type:

DataFrame
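
The parameter names and defaults above line up with those of the pyspark.ml.feature.Word2Vec estimator (vectorSize, minCount, numPartitions, stepSize, maxIter, seed, inputCol, outputCol, windowSize, maxSentenceLength). Assuming the method wraps that estimator, which this page does not state explicitly, the mapping could be sketched as follows. The helpers spark_word2vec_kwargs and build_word2vec are hypothetical and not part of sparkkgml.

```python
from typing import Any, Dict, Optional


def spark_word2vec_kwargs(
    vector_size: int = 100,
    min_count: int = 5,
    num_partitions: int = 1,
    step_size: float = 0.025,
    max_iter: int = 1,
    seed: Optional[int] = None,
    input_col: str = "sentences",
    output_col: str = "vectors",
    window_size: int = 5,
    max_sentence_length: int = 1000,
) -> Dict[str, Any]:
    """Translate the snake_case arguments into PySpark's camelCase names."""
    kwargs: Dict[str, Any] = {
        "vectorSize": vector_size,
        "minCount": min_count,
        "numPartitions": num_partitions,
        "stepSize": step_size,
        "maxIter": max_iter,
        "inputCol": input_col,
        "outputCol": output_col,
        "windowSize": window_size,
        "maxSentenceLength": max_sentence_length,
    }
    # Only pass a seed when one was given, so PySpark keeps its own default.
    if seed is not None:
        kwargs["seed"] = seed
    return kwargs


def build_word2vec(**params: Any):
    """Create the estimator; needs pyspark installed (imported lazily)."""
    from pyspark.ml.feature import Word2Vec

    return Word2Vec(**spark_word2vec_kwargs(**params))


print(spark_word2vec_kwargs(vector_size=64, seed=42)["vectorSize"])  # 64
```

With min_count=5, entities that occur in fewer than five walks are left out of the vocabulary and receive no vector, so a lower value may be needed for small graphs.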