sparkkgml.data_augmentation

Functions

clean_column_names(df)

Clean column names of a Pandas DataFrame by removing invalid characters.

spark_dbpedia_lookup_linker(sparkDataFrame, column[, ...])

Perform DBpedia entity linking on a Spark DataFrame using the DBpedia Lookup service.

spark_specific_relation_generator(sparkDataFrame, columns)

Generate attributes from a specific direct relation on a Spark DataFrame.

Module Contents

sparkkgml.data_augmentation.clean_column_names(df)[source]

Clean column names of a Pandas DataFrame by removing invalid characters.

Parameters:

df (pandas.DataFrame) – The input Pandas DataFrame.

Returns:

The DataFrame with cleaned column names.

Return type:

pandas.DataFrame
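
A minimal sketch of what the cleaning step might look like. The module does not document which characters count as "invalid"; this hypothetical re-implementation assumes they are the characters Spark disallows in column names (spaces, commas, semicolons, braces, parentheses, tabs, equals signs) and replaces them with underscores:

```python
import re

import pandas as pd

# Hypothetical sketch of clean_column_names: replace characters that Spark
# rejects in column names (" ,;{}()\n\t=") with underscores. The actual
# character set used by sparkkgml may differ.
def clean_column_names(df: pd.DataFrame) -> pd.DataFrame:
    df = df.copy()
    df.columns = [re.sub(r"[ ,;{}()\n\t=]", "_", str(c)) for c in df.columns]
    return df

df = pd.DataFrame({"first name": [1], "price (usd)": [2]})
cleaned = clean_column_names(df)
print(list(cleaned.columns))  # ['first_name', 'price__usd_']
```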

sparkkgml.data_augmentation.spark_dbpedia_lookup_linker(sparkDataFrame, column, new_attribute_name='new_link', progress=True, base_url='https://lookup.dbpedia.org/api/search/', max_hits=1, query_class='', lookup_api='KeywordSearch', caching=True)[source]

Perform DBpedia entity linking on a Spark DataFrame using the DBpedia Lookup service.

Parameters:
  • sparkDataFrame (pyspark.sql.DataFrame) – The input Spark DataFrame.

  • column (str) – Name of the column whose entities should be found.

  • new_attribute_name (str, optional) – Name of the column containing the link to the knowledge graph. Defaults to ‘new_link’.

  • progress (bool, optional) – If True, progress bars are shown to track execution. Defaults to True.

  • base_url (str, optional) – The base URL of the DBpedia Lookup API. Defaults to ‘https://lookup.dbpedia.org/api/search/’.

  • max_hits (int, optional) – Maximal number of URIs that should be returned per entity. Defaults to 1.

  • query_class (str, optional) – DBpedia ontology class that the returned results should belong to (e.g. ‘dbo:Place’). Defaults to ‘’ (no class restriction).

  • lookup_api (str, optional) – The DBpedia Lookup API to use. Defaults to ‘KeywordSearch’.

  • caching (bool, optional) – If True, results of queries issued during execution are cached. Defaults to True.

Returns:

DataFrame with new column(s) containing the DBpedia URIs.

Return type:

pyspark.sql.DataFrame

Notes

This function performs DBpedia entity linking on a Spark DataFrame using the DBpedia Lookup service. It follows these steps:

  1. Convert the Spark DataFrame to a Pandas DataFrame.

  2. Apply the dbpedia_lookup_linker function from the kgextension library to the Pandas DataFrame.

  3. Convert the resulting Pandas DataFrame back to a Spark DataFrame.

  4. Return the Spark DataFrame with the new column(s) containing the DBpedia URIs.
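
The four steps above can be sketched without a running Spark session. The stand-in linker below is hypothetical; the real step 2 calls `dbpedia_lookup_linker` from kgextension, which queries https://lookup.dbpedia.org/api/search/:

```python
import pandas as pd

# Stand-in for kgextension's dbpedia_lookup_linker: appends a hypothetical
# DBpedia URI column. The real function resolves entities via the DBpedia
# Lookup web service.
def fake_lookup_linker(df, column, new_attribute_name="new_link"):
    df = df.copy()
    df[new_attribute_name] = [
        f"http://dbpedia.org/resource/{v.replace(' ', '_')}" for v in df[column]
    ]
    return df

# pandas_df = sparkDataFrame.toPandas()                # step 1
pandas_df = pd.DataFrame({"city": ["Berlin", "New York"]})
linked = fake_lookup_linker(pandas_df, column="city")  # step 2
# result = spark.createDataFrame(linked)               # steps 3-4
print(linked["new_link"].tolist())
```

In the real function the round-trip through Pandas means the full DataFrame is collected to the driver, so this linker is best applied to DataFrames of moderate size.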

sparkkgml.data_augmentation.spark_specific_relation_generator(sparkDataFrame, columns, endpoint=DBpedia, uri_data_model=False, progress=True, direct_relation='http://purl.org/dc/terms/subject', hierarchy_relation=None, max_hierarchy_depth=1, prefix_lookup=False, caching=True)[source]

Generate attributes from a specific direct relation on a Spark DataFrame.

Parameters:
  • sparkDataFrame (pyspark.sql.DataFrame) – The input Spark DataFrame.

  • columns (str or list) – Name(s) of the column(s) containing links to the knowledge graph.

  • endpoint (Endpoint, optional) – SPARQL Endpoint to be queried; ignored when ‘uri_data_model’ is True. Defaults to DBpedia.

  • uri_data_model (bool, optional) – If enabled, the URI is directly queried instead of a SPARQL endpoint. Defaults to False.

  • progress (bool, optional) – If True, progress bars are shown to track execution. Defaults to True.

  • direct_relation (str, optional) – Direct relation used to create features. Defaults to ‘http://purl.org/dc/terms/subject’.

  • hierarchy_relation (str, optional) – Hierarchy relation used to connect categories. Defaults to None.

  • max_hierarchy_depth (int, optional) – Maximal number of hierarchy steps taken. Defaults to 1.

  • prefix_lookup (bool/str/dict, optional) – If True, namespaces of prefixes are looked up at prefix.cc and added to the SPARQL query; if a str, the path to a JSON file containing prefixes and namespaces; if a dict, a mapping of prefixes to namespaces. Defaults to False.

  • caching (bool, optional) – If True, results of queries issued during execution are cached. Defaults to True.

Returns:

DataFrame with additional features.

Return type:

pyspark.sql.DataFrame

Notes

This function generates attributes from a specific direct relation on a Spark DataFrame. It follows these steps:

  1. Convert the Spark DataFrame to a Pandas DataFrame.

  2. Apply the specific_relation_generator function from the kgextension library to the Pandas DataFrame.

  3. Convert the resulting Pandas DataFrame back to a Spark DataFrame.

  4. Return the Spark DataFrame with the additional features.
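
The attribute-generation step can likewise be sketched with a stand-in. The real step 2 calls `specific_relation_generator` from kgextension, which queries the SPARQL endpoint for objects of the direct relation (by default dct:subject categories) and adds one boolean column per object; the category data and column naming below are hypothetical:

```python
import pandas as pd

# Hard-coded stand-in for the dct:subject categories that the real function
# would fetch from the SPARQL endpoint for each linked URI.
FAKE_SUBJECTS = {
    "http://dbpedia.org/resource/Berlin": ["Capitals_in_Europe"],
    "http://dbpedia.org/resource/Paris": ["Capitals_in_Europe", "Paris"],
}

# Stand-in for kgextension's specific_relation_generator: adds one boolean
# column per category found via the direct relation. The real column naming
# scheme may differ.
def fake_relation_generator(df, column):
    df = df.copy()
    categories = sorted({c for v in df[column] for c in FAKE_SUBJECTS.get(v, [])})
    for cat in categories:
        df[f"{column}_in_boolean_{cat}"] = [
            cat in FAKE_SUBJECTS.get(v, []) for v in df[column]
        ]
    return df

# pandas_df = sparkDataFrame.toPandas()                    # step 1
pandas_df = pd.DataFrame({"new_link": [
    "http://dbpedia.org/resource/Berlin",
    "http://dbpedia.org/resource/Paris",
]})
augmented = fake_relation_generator(pandas_df, "new_link")  # step 2
# result = spark.createDataFrame(augmented)                 # steps 3-4
print(list(augmented.columns))
```

The generated boolean columns can then serve directly as features for downstream machine-learning pipelines.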