If you are comfortable with Azure Machine Learning, you might enjoy the option to use your familiar Azure front ends to trigger data processing in your SAP Data Warehouse Cloud / SAP HANA Cloud systems.
Connect from Azure to the data in SAP Data Warehouse Cloud / SAP HANA Cloud and carry out data explorations, preparations, calculations, and Machine Learning. The data remains in place, thereby avoiding unnecessary data movement and data duplication. Save your outcomes as semantic views or physical tables. And if needed, you can still extract the data. However, you might not need to move all granular records, just the aggregated / prepared / filtered data that you require.
If you would rather stay within the SAP Business Technology Platform, you have the familiar choices of deploying your Python / R code with SAP Data Intelligence, Cloud Foundry and Kyma.
Either way, SAP’s Python package hana_ml and R package hana.ml.r make it easy to trigger such an advanced analysis from any Python or R environment on the data held in SAP Data Warehouse Cloud.
Let’s go through an example, which uses a notebook in Azure Machine Learning to explore the data in SAP Data Warehouse Cloud, aggregate it, and create a time-series forecast on the aggregates. Everything is triggered from Azure and executed in SAP Data Warehouse Cloud. All necessary code is in this blog.
Prerequisites
In this scenario we are assuming that you are working with SAP Data Warehouse Cloud. However, an implementation with SAP HANA Cloud would be very similar:
- You need access to an SAP Data Warehouse Cloud system with three virtual CPUs (the free trial is not sufficient)
- The script server must be activated, see the documentation
- Create a Database User in your SAP Data Warehouse Cloud space, with the options “Enable Automated Predictive Library and Predictive Analysis Library”, “Enable Read Access” and “Enable Write Access” selected, see the documentation
- To trigger the Machine Learning in SAP Data Warehouse Cloud, you need that user’s password and the system’s host name, see this tutorial
Azure Machine Learning
How do you now use that SAP Data Warehouse Cloud system from Microsoft Azure? Open the Azure portal (the free trial is sufficient for this scenario) and go to the “Azure Machine Learning” service. Create a workspace called “DWCworkspace” with a new resource group “DWCresourcegroup”. When the deployment is complete, click “Go to resource” to open the “DWCworkspace”. From there, launch the “Azure Machine Learning Studio”.
First, create a compute instance. Since the number crunching is delegated to SAP Data Warehouse Cloud, the smallest available size is sufficient; “2 cores, 8 GB RAM” should do. Name it “DWCcompute”.
When the compute instance is running, click on it. Take note of the “Public IP address”. Typically this needs to be added to the “IP Allowlist” in SAP Data Warehouse Cloud. See the documentation.
Then create a notebook “DWC time series” in the Azure Machine Learning Studio. At the top you can see that the notebook is running on the new “DWCcompute” instance.
In the notebook, install the hana_ml library from PyPI, as well as the shapely package, which is needed for geospatial analysis.
!pip install hana-ml==2.13.22072200
!pip install shapely
import hana_ml
print(hana_ml.__version__)
import hana_ml.dataframe as dataframe

# Connect with the Database User created in SAP Data Warehouse Cloud
conn = dataframe.ConnectionContext(address='[YOURDBUSERSHOSTNAME]',
                                   port=443,
                                   user='[YOURDBUSER]',
                                   password='[YOURDBUSERSPASSWORD]')
conn.connection.isconnected()
import pandas as pd

# Load the Tetuan City power consumption dataset from the UCI Machine Learning Repository
df_data = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/00616/Tetuan%20City%20power%20consumption.csv')

# Adjust the column names and convert the timestamp column
df_data.columns = map(str.upper, df_data.columns)
df_data.columns = df_data.columns.str.replace(' ', '_')
df_data.DATETIME = pd.to_datetime(df_data.DATETIME)
import hana_ml.dataframe as dataframe

# Upload the pandas DataFrame into a table in SAP Data Warehouse Cloud
df_remote = dataframe.create_dataframe_from_pandas(connection_context=conn,
                                                   pandas_df=df_data,
                                                   table_name='POWERCONSUMPTION',
                                                   force=True,
                                                   replace=False)

# Create a hana_ml DataFrame that points to the new table
df_remote = conn.table('POWERCONSUMPTION')
df_remote.head(5).collect()
You can check on the column types.
df_remote.dtypes()
Have SAP Data Warehouse Cloud calculate some statistics about the data content.
df_remote.describe().collect()
The above distribution statistics were all calculated in SAP Data Warehouse Cloud. Only the results were transferred to your notebook. You can see the SELECT statement that was created by the describe()-method.
print(df_remote.describe().select_statement)
Explore the data with a UnifiedReport of the hana_ml library.
from hana_ml.visualizers.unified_report import UnifiedReport
UnifiedReport(df_remote).build().display()
You can also request individual calculations on the data, or ready-made exploratory plots (a small example of an individual calculation follows after the correlation plot below). Have SAP Data Warehouse Cloud calculate a correlation matrix on the granular data. As you would expect, there is for instance a fair positive correlation between temperature and power consumption.
import matplotlib.pyplot as plt
from hana_ml.visualizers.eda import EDAVisualizer
f = plt.figure()
ax1 = f.add_subplot(111) # 111 refers to 1x1 grid, 1st subplot
eda = EDAVisualizer(ax1)
ax, corr_data = eda.correlation_plot(data=df_remote, cmap='coolwarm')
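As an example of an individual calculation, you could have SAP Data Warehouse Cloud return just a single aggregate, such as the maximum power consumption of zone 1. A minimal sketch, following the same push-down pattern; only the resulting value is transferred to the notebook.
# Calculated in SAP Data Warehouse Cloud; only the single value is returned
df_remote.select('ZONE_1_POWER_CONSUMPTION').max()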
# Retrieve the most recent week of data (6 readings per hour, 24 hours, 7 days) and plot it locally
df_data = df_remote.tail(n=6*24*7, ref_col='DATETIME').collect()
df_data.drop('DATETIME', axis=1).plot(subplots=True,
                                      figsize=(10, 14));
# Recreate the DataFrame and add DATE and MONTH columns derived from the timestamp
df_remote = conn.table('POWERCONSUMPTION')
df_remote = df_remote.select('*', ('TO_DATE(DATETIME)', 'DATE'))
df_remote = df_remote.select('*', ('LEFT(DATE, 7)', 'MONTH'))
df_remote.head(5).collect()
import matplotlib.pyplot as plt
from hana_ml.visualizers.eda import EDAVisualizer
f = plt.figure()
ax1 = f.add_subplot(111) # 111 refers to 1x1 grid, 1st subplot
eda = EDAVisualizer(ax1)
# Box plot of zone 1's power consumption per month, calculated in SAP Data Warehouse Cloud
ax, cont = eda.box_plot(data=df_remote,
                        column='ZONE_1_POWER_CONSUMPTION', groupby='MONTH', outliers=True)
ax.legend(bbox_to_anchor=(1, 1));
# Aggregate the 10-minute readings to daily totals for zone 1
df_remote_daily = df_remote.agg([('sum', 'ZONE_1_POWER_CONSUMPTION', 'ZONE_1_POWER_CONSUMPTION_SUM')], group_by='DATE')

# Retrieve the last three weeks of daily totals for plotting
df_data = df_remote_daily.tail(n=21, ref_col='DATE').collect()
import seaborn as sns
sns.set_theme()
sns.set(rc={'figure.figsize':(15,5)})
sns.lineplot(data=df_data, x="DATE", y="ZONE_1_POWER_CONSUMPTION_SUM")
plt.xticks(rotation=45);
from hana_ml.algorithms.pal.tsa.additive_model_forecast import AdditiveModelForecast

# Train an additive model forecast on the daily aggregates; the training runs in SAP Data Warehouse Cloud
amf = AdditiveModelForecast(growth='linear')
amf.fit(data=df_remote_daily.select('DATE', 'ZONE_1_POWER_CONSUMPTION_SUM'))
import pandas as pd

# Build the 21 future dates for which the forecast is to be produced
strLastDate = str(df_remote_daily.select('DATE').max())
df_topredict = pd.date_range(strLastDate, periods=1 + 21, freq="D", closed='right').to_frame()  # on pandas 1.4+ use inclusive='right'
df_topredict.columns = ['DATE']
df_topredict['DUMMY'] = 0  # placeholder column so the table has the structure predict() expects
df_topredict
import hana_ml.dataframe as dataframe

# Upload the future dates into a local temporary table (the leading # keeps it session-local)
df_rem_topredict = dataframe.create_dataframe_from_pandas(connection_context=conn,
                                                          pandas_df=df_topredict,
                                                          table_name='#TOPREDICTDAILY',
                                                          force=True,
                                                          replace=False)
# Produce the forecast in SAP Data Warehouse Cloud and retrieve the result
df_rem_predicted = amf.predict(data=df_rem_topredict)
df_data = df_rem_predicted.collect()

# Plot the prediction with its lower and upper confidence bounds
from matplotlib import pyplot as plt
plt.plot(df_data['DATE'], df_data['YHAT'])
plt.fill_between(df_data['DATE'], df_data['YHAT_LOWER'], df_data['YHAT_UPPER'], alpha=.3);

# Rename the columns and persist the predictions as a table in SAP Data Warehouse Cloud
df_rem_predicted = df_rem_predicted.rename_columns(['DATE', 'PREDICTION', 'PREDICTION_LOWER', 'PREDICTION_UPPER'])
df_rem_predicted.collect()
df_rem_predicted.save('POWERCONSUMPTION_21DAYPRED', force=True)
SAP Analytics Cloud can use a live connection to SAP Data Warehouse Cloud. Just create a view in SAP Data Warehouse Cloud on top of the table, set the view’s semantic usage to “Analytical Dataset” and expose it for consumption. SAP Analytics Cloud can then access the data.
You have used the Azure Machine Learning Studio to explore data in SAP Data Warehouse Cloud, prepare the data, create a forecast and share the predictions with SAP Analytics Cloud. No data was duplicated along the way, thereby reducing architectural and governance complexity. Of course, this logic can be further automated through Azure to regularly produce and share the latest predictions.
Rounding things off
Database user password
You might not want to have the Database User’s password in clear text in the notebook. One option is to prompt for it at runtime, for instance with Python’s getpass module, and pass the entered value to the connection:
import getpass
import hana_ml.dataframe as dataframe

# Prompt for the password instead of hard-coding it in the notebook
db_password = getpass.getpass(prompt='Password of the Database User: ')
conn = dataframe.ConnectionContext(address='[YOURDBUSERSHOSTNAME]',
                                   port=443,
                                   user='[YOURDBUSER]',
                                   password=db_password)
conn.connection.isconnected()
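Alternatively, since the notebook runs in Azure Machine Learning, you could keep the password as a secret in the workspace’s default Key Vault and read it from there at runtime. A minimal sketch, assuming that azureml-core is available in the kernel and that a secret named DWCDBUSERPASSWORD has been created beforehand (both are assumptions, not part of the steps above):
from azureml.core import Workspace

# Load the current workspace and read the password from its default Key Vault
# (the secret name 'DWCDBUSERPASSWORD' is just an example)
ws = Workspace.from_config()
db_password = ws.get_default_keyvault().get_secret('DWCDBUSERPASSWORD')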
Environment
You might want to pin all package versions in a dedicated conda environment so that the setup stays stable. Save the following environment definition as a file named hanaml2.13.22072200azure.yml:
name: hanaml2.13.22072200
channels:
- conda-forge
- defaults
dependencies:
- python=3.9
- ipykernel=6.2.0
- jupyterlab=3.2.5
- matplotlib=3.4.3
- seaborn=0.11.2
- shapely=1.8.0
- numba=0.54.1
- shap=0.39.0
- plotly=5.5.0
- openpyxl=3.0.9
- pip=21.3.1
- pip:
  - hana-ml==2.13.22072200
Upload the file into the folder where you have the notebook. Then, click the little “Open terminal” icon above and create an environment with this command:
conda env create -f hanaml2.13.22072200azure.yml
When the environment has been built, add it as a kernel to the system.
python -m ipykernel install --user --name=hanaml2.13.22072200
Now create a new notebook; at the top right you can select the new kernel. Without any further pip install, you have access to the hana_ml package.
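As a quick check, importing the package in the new notebook should show the pinned version, for example:
# Confirm that the hana_ml package from the conda environment is active in this kernel
import hana_ml
print(hana_ml.__version__)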