Large-scale distributed data has become the foundation for analytics and informed decision-making processes in most businesses. A large amount of this data is also utilized for predictive modeling and building machine learning models.
There has been a rise in the number and variety of ML platforms that provide machine learning and modeling capabilities along with data storage and processing. Businesses can now use these platforms to train and deploy machine learning models efficiently.
Data scientists training ML models on Databricks face a challenge when accessing and working with SAP data. A data scientist typically relies on a data engineer to build a pipeline that extracts data from SAP source systems and prepares it for use in ML experimentation. Extracting and migrating data out of the source systems is both expensive and time-consuming. Moreover, the data scientist may need additional non-SAP data modeled together with SAP data for use in ML experimentation.
Proposed Solution:
FedML Databricks is a library built to address these issues. The library applies the data federation architecture of SAP Datasphere and provides functions that enable businesses and data scientists to build, train and deploy machine learning models on ML platforms, eliminating the need to replicate or migrate data out of its original source.
By abstracting the data connection, data load, model deployment and model inference on these ML platforms, the FedML Databricks library provides end-to-end integration with just a few lines of code.
In this blog, we use the FedML Databricks library to train an ML model with data from SAP Datasphere and deploy the model to Databricks and SAP BTP, Kyma runtime. We also run inference on the deployed model and store the inference results back in SAP Datasphere for further analysis.
Data can be federated to SAP Datasphere from numerous SAP and non-SAP data sources. Data from various sources can also be merged to create a view, which can then be used for the FedML experiment. Please ensure that the view used for the FedML experiment has consumption turned on.
Train and deploy the model using the FedML Databricks library:
Pre-requisites:
- Create a Databricks workspace in any of the three supported hyperscalers (AWS, Azure, GCP).
- Create a cluster in the Databricks Workspace by referring to the guide.
- Create a notebook in the Databricks Workspace by referring to the guide.
- Whitelist the Databricks cluster IP in SAP Datasphere by referring to this guide (a quick way to look up the cluster's outbound IP is sketched after this list).
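If you are unsure which outbound IP address your cluster uses, one quick way to check is to call a public IP echo service from a notebook cell running on that cluster. A minimal sketch, assuming the api.ipify.org echo service (any equivalent service works):

import requests

# Query a public IP echo service from the cluster to discover its outbound (egress) IP,
# then add this IP to the SAP Datasphere IP allowlist.
outbound_ip = requests.get("https://api.ipify.org", timeout=10).text
print(f"Cluster outbound IP to allowlist in SAP Datasphere: {outbound_ip}")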
Using the FedML Databricks Library:
1. Install the FedML Databricks library.
%pip install fedml-databricks --no-cache-dir --upgrade --force-reinstall
Import the necessary libraries:
from fedml_databricks import DbConnection,predict
It may also be useful to import the following libraries if you use them in your notebook:
import numpy as np
import pandas as pd
import json
2. Create a secure connection to SAP Datasphere and retrieve the data.
Create a Databricks secret scope by referring to the article Create a Databricks-backed secret scope on the Databricks website. Then, create a Databricks secret containing the SAP Datasphere connection details in the form of JSON, as described in the article. The SAP Datasphere JSON connection credentials can be obtained using the method described in this GitHub documentation – DbConnection class.
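If you prefer to script this step rather than use the Databricks CLI or UI, the Databricks Secrets REST API can create the scope and store the credentials. A minimal sketch, assuming a personal access token in the DATABRICKS_TOKEN environment variable and the connection JSON saved locally as dsp_config.json (both hypothetical names):

import os
import requests

# Hypothetical values: replace with your workspace URL, secret scope and key names.
workspace_url = "https://<databricks-instance>"
headers = {"Authorization": f"Bearer {os.environ['DATABRICKS_TOKEN']}"}

# Create a Databricks-backed secret scope (the request returns an error response if the scope already exists).
requests.post(f"{workspace_url}/api/2.0/secrets/scopes/create",
              headers=headers, json={"scope": "<secret-scope>"})

# Store the SAP Datasphere connection JSON as a secret in that scope.
with open("dsp_config.json") as f:
    requests.post(f"{workspace_url}/api/2.0/secrets/put",
                  headers=headers,
                  json={"scope": "<secret-scope>", "key": "<secret-key>",
                        "string_value": f.read()})

In the notebook, read the connection details back from the secret: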
config_str=dbutils.secrets.get('<secret-scope>','<secret-key>')
config=json.loads(config_str)
Now, create a DbConnection instance to connect to SAP Datasphere:
dsp = DbConnection(dict_obj=config)
We can now retrieve the data. There are multiple ways of retrieving the data from SAP Datasphere. The following code gets the data from SAP Datasphere in the form of a Pandas DataFrame. The appropriate schema and view name must be entered below:
df=dsp.execute_query('SELECT * FROM "<schema>"."<view>"')
df
3. Train the ML model using MLflow.
You can train an ML model using the MLflow library managed by Databricks. Follow this MLflow guide to get started.
Import the MLflow library:
import mlflow
Here is a sample linear regression model being trained using MLflow:
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
def train_model(x_train, x_test, y_train, y_test, experiment_name, model_name):
    mlflow.set_experiment(experiment_name)
    with mlflow.start_run() as run:
        # Fit a simple linear regression model and evaluate it on the hold-out set
        model = LinearRegression().fit(x_train, y_train)
        score = model.score(x_test, y_test)
        mlflow.log_metric("score", score)
        # Log and register the trained model in the MLflow Model Registry
        mlflow.sklearn.log_model(model, model_name,
                                 registered_model_name=model_name)
        run_id = run.info.run_id
    return run_id
# Split the SAP Datasphere data into features and target ('<target-column>' is a placeholder for your label column)
y = df['<target-column>']
x = df.drop(columns=['<target-column>'])
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3)
experiment_name,model_name='/Users/<user>/<experiment-name>','<model_name>'
run_id=train_model(x_train,x_test, y_train, y_test,experiment_name,model_name)
model_uri=f"runs:/{run_id}/{model_name}"
4. Deploy the ML model as a web service endpoint and run inference against the deployed model.
Option 1: Deploy the trained MLflow model to Databricks:
You can log, register and deploy MLflow models using the Databricks-managed MLflow library. More information on Databricks Machine Learning capabilities can be found in this guide.
Executing the notebook inside the Databricks workspace registers the model in the managed MLflow tracking server. If you trained the model outside of Databricks, you can register it in the MLflow Model Registry as follows:
import time
model_version = mlflow.register_model(model_uri=model_uri,name=model_name)
# Registering the model takes a few seconds, so add a small delay
time.sleep(15)
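Instead of a fixed delay, you can poll the Model Registry until the new version reports a READY status. A minimal sketch using the MLflow client:

from mlflow.tracking import MlflowClient

# Poll the registry until the newly registered version finishes processing.
registry_client = MlflowClient()
while registry_client.get_model_version(name=model_name, version=model_version.version).status != "READY":
    time.sleep(1)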
Transition the model to Production:
You can do this either in the managed MLflow UI on Databricks or from within the notebook:
from mlflow.tracking import MlflowClient
client = MlflowClient()
client.transition_model_version_stage(
    name=model_name,
    version=model_version.version,
    stage="Production",
)
You can use MLflow to deploy models for batch or streaming inference, or to set up a REST endpoint to serve the model. To run batch inference against the MLflow model deployed in Databricks:
model = mlflow.pyfunc.load_model(f"models:/{model_name}/production")
inference_result = model.predict(<test_data>)
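If you instead serve the registered model behind a REST endpoint, it can be scored from any HTTP client. A minimal sketch, assuming the classic MLflow Model Serving URL pattern and a personal access token in the DATABRICKS_TOKEN environment variable (the exact URL and payload format depend on your workspace and MLflow version):

import os
import requests

# Hypothetical endpoint for a model served from the 'Production' stage;
# newer Databricks Model Serving uses /serving-endpoints/<endpoint-name>/invocations instead.
serving_url = f"https://<databricks-instance>/model/{model_name}/Production/invocations"
headers = {"Authorization": f"Bearer {os.environ['DATABRICKS_TOKEN']}"}

# MLflow 2.x scoring servers expect the 'dataframe_split' payload format.
payload = {"dataframe_split": x_test.to_dict(orient="split")}

response = requests.post(serving_url, headers=headers, json=payload)
predictions = response.json()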
Option 2: Deploy the MLflow model to SAP BTP, Kyma runtime:
The MLflow model trained in Databricks can be deployed to SAP BTP, Kubernetes environment using the hyperscaler container registry. Currently, deployment of MLflow model to SAP BTP, Kubernetes environment is supported in AWS and Azure, with support for GCP in the pipeline.
You can deploy the MLflow model using the same hyperscaler infrastructure used by Databricks. For example, if you use Azure Databricks, you can use Azure to deploy the MLflow model trained in Azure Databricks to SAP BTP, Kyma runtime.
4.2.1. Complete the pre-requisite steps for SAP BTP, Kyma runtime by referring to the guide.
4.2.2. Take note of the ‘DATABRICKS_URL’ and ‘MODEL_URI’ by running the cell below in the Databricks notebook:
print("The DATABRICKS_URL is 'https://{}'".format(spark.conf.get("spark.databricks.workspaceUrl")))
print("The MODEL_URI is '{}'".format(model_uri))
For ease of use, you can perform steps 4.2.3 and 4.2.4 in the hyperscaler Jupyter notebook (an Azure ML notebook or an Amazon SageMaker notebook):
4.2.3. Create a configuration file with the necessary details for SAP BTP, Kyma runtime deployment for AWS or Azure using the AWS template or Azure template. The values for the configuration file can be obtained by completing the above two steps.
4.2.4. Deploy the Databricks MLflow model to the SAP BTP, Kubernetes environment using the method below. The ‘databricks_config_path’ parameter refers to the path of the configuration file created in the previous step:
from fedml_databricks import deploy_to_kyma
endpoint_url=deploy_to_kyma(databricks_config_path='<databricks-config-json-file-path>')
print("The kyma endpoint url is '{}'".format(endpoint_url))
Take note of the SAP BTP, Kubernetes environment endpoint.
Run inference against the MLflow model deployed in the SAP BTP, Kubernetes environment from within the Databricks notebook as follows:
inference_dataframe=predict(endpoint_url=<kyma-endpoint>,content_type=<content-type>,data=<test-data>)
5. The FedML Databricks library allows bi-directional data access. You can store the inference results in SAP Datasphere for further use and analysis.
5.1 Create a table in SAP Datasphere:
dsp.create_table("CREATE TABLE <table_name> (ID INTEGER PRIMARY KEY, <column_name> <data_type>,..)")
5.2 You can now restructure the data to be written back to SAP Datasphere into your desired format and insert it into the table:
dsp.insert_into_table('<table_name>',<pandas_dataframe_containing_datasphere_data>)
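For example, a minimal sketch that pairs each prediction with a generated identifier before writing it back (the column names are hypothetical and must match the table created above):

import pandas as pd

# Hypothetical example: build a DataFrame whose columns match the SAP Datasphere table.
result_df = pd.DataFrame({
    "ID": range(1, len(inference_result) + 1),  # primary key column
    "PREDICTION": inference_result,             # model output from the batch inference step
})

dsp.insert_into_table('<table_name>', result_df)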
Now that the data is inserted into the local table in SAP Datasphere, you can create a view on top of it and deploy the view in SAP Datasphere. You can then use the view for further analysis in SAP Analytics Cloud.
More information on the use of the library and end-to-end sample notebooks can be found in our GitHub repo here.
In summary, the FedML Databricks library provides an effective and convenient way to federate data from multiple SAP and non-SAP source systems, without the overhead of data migration or replication. It enables data scientists to effectively model SAP and non-SAP data in real time for use in ML experimentation. It also provides the capabilities to deploy models to SAP BTP, Kyma runtime, perform inference on the deployed web service, and store the inference results back in SAP Datasphere for further use and analysis.
Please read our blog here to learn how external data from Databricks delta tables can be federated live and combined with data from SAP applications via SAP Datasphere unified models, for real-time analytics using SAP Analytics Cloud.
If you have any questions, please leave a comment below or contact us at paa@sap.com.