Harnessing Multi-Model Capabilities with Spotify – Processing Semi-Structured Data with SAP HANA Cloud/SAP Datasphere – Part 4

This blog is part of a series explaining the multi-model capabilities of SAP HANA Cloud /SAP Datasphere with one end-to-end scenario using Spotify data.

Here are the links for the other blogs of this series

Part 1 – Architecture
Part 2 – Processing Semi-Structured data in SAP HANA Cloud
Part 3 – Processing Semi-structured data in SAP Datasphere

Part 2 discussed the options of extracting, transforming, and loading JSON documents into SAP HANA Cloud.

Now that the JSON documents are loaded and stored in SAP HANA Cloud, we can further utilize the Multi-Modeling features of SAP HANA Cloud. In this part, we are going to transform the semi-structured data into structured data to be utilized by the Graph Engine to show the relationships within the Spotify data. We will create reporting scenarios using calculation views and procedures created within SAP Business Application Studio.

This blog will cover the following:

Describe the datasets
Modify the JSON collections
Create Spotify relationships
Query JSON collections to form structured SQL Views
Form the graph tables and graph workspace
Query on the graph workspace with different scenarios

What is SAP HANA Graph?

Refer to blog Create graphs on SAP HANA Cloud by Maxime Simon that gives a great introduction to the graph capabilities of SAP HANA Cloud if you are unfamiliar.

Describe the datasets

Using the Spotify API, Get Category’s Playlists, we pulled playlist data for categories: Mood, Workout, Party and Gaming.

Get%20Categorys%20Playlists

Get Category’s Playlists API

The JSON response from the API contains the playlist id, the name and the description along with the number of tracks in each playlist.

Get%20Categorys%20Playlist%20JSON%20Response

Get Category’s Playlist JSON Response

In the section, “Executing Python Scripts” in Part 2, we were able to capture all the tracks per playlist. In this scenario, we utilized the same scripts to pull all the tracks per playlist and the track audio features by using the Spotify APIs.

A new JSON collection was created for each playlist, one for all the track information and one containing the audio features for each track.

Track%20Information

Playlist Track Information

Playlist%20Track%20Audio%20Features

Playlist Track Audio Features

Modify the JSON collections

In this scenario, it is necessary to update the JSON collections with identifiers which will help when creating the views and tables needed to build the graph workspace.

For the playlist category JSON collections, we will create a new identifier as category and type, nested in playlists.

Update%20the%20playlist%20JSON%20collections

Update the playlist category JSON collections

Updated%20playlist%20category%20JSON

Updated playlist category JSON collections

For the track JSON collections, we will create new identifiers to help in the process of creating the graph tables. We will add three different types as dummy variables: album, artist and track. We will also add the playlist id for each track.

Update%20the%20playlist%20track%20information%20JSON%20collections

Update the playlist track information JSON collections

Updated%20playlist%20track%20information%20JSON%20collections

Updated playlist track information JSON collections

Create Spotify relationships

Based on the datasets, we need to determine the relationships between the core entities. From the track JSON collections, we can learn the playlist the track belongs too, details about the track, the artist of the track, the album the track is part of and the creators of the album.

The below diagram represents the relationships we will utilize for graph capabilities. Relationships in graph can be bidirectional but, in this case, we are representing in one direction.

Spoitfy%20relationship%20diagram

Spotify relationship diagram

Query JSON collections to form structured SQL views

We will create views for each core entity: track, artist, album, and playlist. We will have to create individual views for each playlist, then will union all together to form a final view for each entity.

Track Views

For the tracks, we are pulling all the attributes for a track as well as the relationships with playlist, artist and album. A track can have more than one artist, therefore we unnested the artists so there may be more than one row per track to account for multiple artists.

select  "playlist_id", "type1", "type3", "track"."id" as track_id,"track"."name" as track_name, "track"."popularity" as popularity,
 "track"."duration_ms" as duration_ms, "track"."album"."id", a."id", a."name"
from "DEMOJSON"."json.synonyms::80s Party" UNNEST "track"."artists" as a

Select for track information

Track view

After the individual views were created for each playlist, we combined all the views together using a union.

Consolidated track view

Track Audio Views

For the tracks, we are also pulling audio features for the track from each playlist JSON collection.

select "id" as id, "acousticness", "danceability", "genres", "energy","key","loudness","mode","speechiness", "instrumentalness", "liveness", "valence", "tempo"
from "DEMOJSON"."json.synonyms::All The Feels_AUD"

Select%20for%20Track%20Audio

Select for track audio features

Track%20Audio%20View

Track audio view

After the individual views were created for each playlist, we combined all the views together using a union.

Consolidated%20Track%20Audio%20View

Consolidated track audio view

Track and Audio Feature View

Consolidating further, all the tracks and their audio features were combined into one view.

with x as (select  "playlist_id", "type1", "type3",  track_id, track_name,   popularity,  duration_ms, "track.album.id", "id", "name"
from "DEMOJSON"."json.Views.tracks::tracks_all"), y as (select "ID" , "acousticness", "danceability", "genres", "energy","key","loudness","mode","speechiness", 
"instrumentalness", "liveness", "valence", "tempo" from "DEMOJSON"."json.Views.audio::tracks_audio_all")  
select "playlist_id", "type1", "type3",  track_id, track_name,   popularity,  duration_ms, "track.album.id", "id", "name", "acousticness", 
"danceability", "genres", "energy","key","loudness","mode","speechiness", "instrumentalness", "liveness", "valence", "tempo" 
from x left outer join y on x.track_id = y.ID

Track%20and%20Audio%20Features

Select for track and audio features

Track%20and%20Audio%20Features%20View

Track and audio features view

Playlist Views

For the playlists, we are selecting all of the attributes from each of the four JSON collections (gaming, mood, party, workout).

select "playlists"."category" as category,"playlists"."type" as type, p."id" as playlist_id, p."name" as playlist_name, p."description" as description, p."tracks"."total" as total_tracks 
from "DEMOJSON"."json.synonyms::PLAYLIST_CATEGORY_GAMING" unnest "playlists"."items" as p

Select%20for%20playlist%20information

Select for playlist information

Playlist%20View

Playlist view

After the individual views were created, we combined all the views together using a union.

Consolidated%20Playlist%20View

Consolidated playlist view

Album Views

For the albums, we are pulling all the attributes for each album from each playlist.

select "track"."album"."id" as album_id, "type2", "track"."album"."name" as album_name, "track"."album"."release_date" as release_date,
"track"."album"."total_tracks" as total_tracks,  a."id" as artist_id, a."name" as artist_name 
from "DEMOJSON"."json.synonyms::80s Party" unnest "track"."album"."artists" as a

Select%20for%20album%20information

Select for album information

Album%20View

Album view

A consolidated view was created to hold all the album information from each playlist.

Consolidated%20Album%20View

Consolidated album view

Form the graph tables and graph workspace

Create the nodes table

With the help of a union in a SAP HANA calculation view, we are able to combine all the entities together with each of their specific attributes.

Album

The album attributes can be found in the consolidated albums view. The columns selected from the albums view are album_id, album_name, type2 (label for Album), release_date and total tracks.

Album%20Attributes

Album attributes

Playlist

The playlist attributes can be found in the consolidated playlists view. The columns selected from the playlist view are playlist_id, playlist_name, category, type (label for playlist), and total_tracks.

Playlist%20Attributes

Playlist attributes

Track

The track attributes can be found in the consolidated track and audio view. The columns selected from the track and audio view are track_id, track_name, type1 (label for track), popularity, duration_ms, acousticness, danceability, and genres.

Track%20Attributes

Track attributes

Artist

The artist attributes can be found in the consolidated track and audio view. The columns selected from the track and audio view are id, name and type3 (label for artist).

Artist%20Attributes

Artist attributes

A union was created to map all the separate entities and their attributes into one. Because not every entity has the same columns there will be some results with blank columns.

Union%20Node

Union node

For the case that a song may exist in more than one playlist, we need to remove duplicates for graph to work properly as the ID of a track in the nodes table needs to be unique. A calculated column IDforRank was created specifically to partition based on ID in the rank function.

Rank%20Logic

Rank logic

After the row number function generates a new column, we filter on the column to only get the top record.

Filter%20out%20duplicates

Filter out duplicates

The output of the nodes calculation view looks like this:

Data%20Preview%20of%20Nodes%20Calculation%20View

Data preview of nodes calculation view

Create the edges table

With a similar design to the nodes calculation view, the edges calculation view combines all of the entity relationships together.

Album and Artist Relationship

This relationship can be found in the consolidated albums view. The columns selected from the albums view are album_id and artist_id. A calculated column was created to define the relationship type, which in the case of album and artist is “iscreatedby”.

Album%20and%20artist%20IDs

Album and artist IDs

Album%20and%20artist%20relationship

Album and artist relationship

Playlist and Track Relationship

This relationship can be found in the consolidated tracks and audio view. The columns selected from the view are playlist_id and track_id. A calculated column was created to define the relationship type, which in the case of playlist and track is “contains”.

Playlist%20and%20track%20IDs

Playlist and track IDs

Playlist%20and%20Track%20Relationship

Playlist and track relationship

Album and Track Relationship

This relationship can be found in the consolidated tracks and audio view. The columns selected from the view are track_album_id and track_id. A calculated column was created to define the relationship type, which in the case of album and track is “contains”.

Album and track IDs

Album and track relationship

Track and Artist Relationship

This relationship can be found in the consolidated tracks and audio view. The columns selected from the view are track_id and artist_id. A calculated column was created to define the relationship type, which in the case of track and artist is “iscreatedby”.

Track%20and%20Artist%20IDs

Track and artist IDs

Track%20and%20Artist%20Relationship

Track and artist relationship

A union was created to map all the separate entity relationships into one.

Union%20Node

Union node

A rank function of row number was used to create an ID which is needed for the edges table.

Rank%20Function

Rank function

The output of the edges calculation view look like this:

Data%20preview%20of%20edges%20calculation%20view

Data preview of edges calculation view

Create the graph workspace

Tables, views and synonyms can be used as sources in the graph workspace. In this scenario, we created the sources to the graph workspace with calculation views. They can be used however because of the data volume, we decided to load the results of the calculation views into edges and nodes tables.

Using hdbtable database artifacts, we created the edges and nodes tables with the appropriate columns.

Edges%20table

Edges table

Nodes%20table

Nodes table

A simple procedure was created to load the data from the calculation view into the respective tables.

Load%20graph%20tables%20procedure

Load graph tables procedure

The graph workspace was created with the edges and nodes tables with both having ID as their unique identifier.

Graph%20workspace

Graph workspace

Vertices, edges, and workspaces

Query the graph workspace with different scenarios

OpenCypher is a declarative graph query language for pattern matching. In SAP HANA, we can use openCpher directly in SQL querying on the graph workspace we created.

If we want to understand how many tracks for a specific genre exist in each playlist and category, we can use the OPENCYPHER_TABLE SQL function that enables the embedding of an openCypher query in an SQL query.

SELECT DISTINCT *
    FROM OPENCYPHER_TABLE (
         GRAPH WORKSPACE "DEMOJSON"."json.graphs::SPOTIFY"
         QUERY '
          MATCH   (a)-[e]-(b)
            WHERE   a.GENRE contains ''country''
            and b.TYPE = ''Playlist''
             RETURN  b.PLAYLISTCATEGORY as playlist_category, 
            b.NAME as playlist_name, count(a.ID) as number_of_tracks'
    )

Result%20of%20openCypher%20query

Result of openCypher query

We can visually represent the results of the openCypher MATCH operation in Database Explorer. Playlists are represented in pink and tracks represented in blue.

Cypher%20in%20Database%20Explorer

Cypher in Database Explorer

Business users might not understand the openCypher language. We can create table functions which can be used within SAP HANA calculation views which can be exposed to the business users.

The table function would have the input parameter “genre” and execute the openCypher SQL using OPENCYPER_TABLE.

Table%20function%20with%20openCypher

Table function with openCypher

This table function will be an input into the calculation view that will have the same input parameter which is mapped to the table function.

Table%20function%20as%20input%20into%20calculation%20view

Table function as input for calculation view

Input%20parameter%20mapped%20to%20table%20function

Input parameter mapped to table function

The business user can run a select statement on the calculation view with the genre of their choice.

SELECT 
	"playlist_category",
	"playlist_name",
    "number_of_tracks"
FROM "DEMOJSON"."json.cvs::genres_tbf"
	(placeholder."$$IP_Genre$$"=>'country')
order by "number_of_tracks" desc

Select%20on%20calculation%20view

Select on calculation view

GraphScript is an imperative programming language with built in functions that can be integrated into SQL-based data processing. Graphscript can be used within stored procedures, anonymous blocks, and table functions.

We can use the built-in GraphScript function in an SAP HANA stored procedure to identify all of a artists tracks and albums based on the graph workspace.

Stored%20procedure%20with%20neighbors%20algorithm

Stored procedure with neighbors algorithm

Based on the relationships we created for Spotify, track and album are one node away from artist or depth of one. Playlist would be two nodes away or depth of two. The neighbors algorithm takes the graph workspace and a starting vertex, in this example an artist id, the minimum depth as 1 and the maximum depth as 1 and the direction as “any”. The Spotify relationships are only have one direction so “ANY” and “INCOMING” will produce the same results and “OUTGOING” will produce zero results.

CALL "DEMOJSON"."json.procedures::NEIGHBORS_P"(
i_startVertex => '58lV9VcRSjABbAbfWS6skp', 
i_minDepth => 1,
i_maxDepth => 1,
i_dir => 'ANY',
o_vertices => ?,
o_verticesCount => ?,
o_edges => ?);

Results%20of%20procedure%20call

Results of procedure call

Using the graph workspace in Database Explorer, we can use the neighborhood algorithm to visually represent the relationships. Albums are represented in green and tracks in blue.

Graph%20workspace%20neighborhood%20algorithm

Graph workspace neighborhood algorithm

If we wanted to understand what playlists Bon Jovi has tracks in, we can change the maximum depth to 2.

Result%20of%20procedure%20call%20with%20max%20depth%20of%202

Result of procedure call with max depth of 2

Graph%20workspace%20neighborhood%20algorithm%20with%20max%20depth%20of%202

Graph workspace neighborhood algorithm with max depth of 2

Conclusion

We have taken semi-structured data in the form of JSON documents and transformed it into relational data in SQL views and SAP HANA calculation views. Based on the structured data, we were able to create a graph network of relationships within the Spotify data based on artists, tracks, albums, and playlists. Once the graph workspace was created, we utilized the languages of openCypher and GraphScript to query the data for analytical processing.

What’s Next

Part 3 of the series discussed data ingestion option within SAP Datasphere for semi-structured data. If there is a scenario where SAP HANA Cloud is not part of the landscape but the need for graph analysis still exists, we can do the same in SAP Datasphere. Stay tuned for the next blog that covers this process.

Tags: SAP Business Application Studio SAP Business Technology Platform SAP HANA Cloud SAP HANA multi-model processing

Sara Sampaio

Author Since: March 10, 2022

0 0 votes

Article Rating

0 Comments

Inline Feedbacks

View all comments

Sara Sampaio

Company

Explore More