LinkedIn Profile: https://www.linkedin.com/in/danish-meraj-5b3138200
In my previous blog, I discussed an algorithm for creating a risk prediction tool by combining machine learning with an aggregation algorithm. In this blog, I will take a different approach to creating a predictive analytic application, leveraging the R interface in SAP Analytics Cloud (SAC). You can follow along with this blog to achieve similar results.
Data: The data used for this demonstration is publicly available and easily accessible. You can download the data from here.
You can also read the blog here to get basic information about multiple linear regression; it also provides background on the data used in this demonstration.
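For context, multiple linear regression with two predictors takes the following general form:
heart.disease = β0 + β1 × biking + β2 × smoking + ε
Here, β0 is the intercept, β1 and β2 are the coefficients estimated from the data, and ε is the error term. This is the equation we will later reconstruct in the analytic designer using the parameters retrieved from R.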
Data Model: Let’s start by creating a data model from the dataset. In this step, the data model is created in the SAC modeller. We will add a column, i.e., a date dimension, using a calculated column formula, which allows us to create a planning model.
Note: A planning model is not mandatory for creating an analytic application. I chose a planning model so that, if I decide in the future to write a blog about making the analytic application demonstrated here more interactive, this blog will still be relevant.
Figure 1: The figure shows the data in SAC modeller; Source: Author’s own illustration.
Configuring the R widget: In this step, the data source is configured in the R widget, as shown in the figure below:
Figure 2: The figure shows the input data in R visualization widget; Source: Author’s own illustration.
Scripting in R: After configuring the data source, we will leverage the scripting capability of R to train a machine-learning model.
First, we will load some of the required packages into our R environment and create a data frame from the source data, as shown in the code snippet below.
library(ggplot2)
library(dplyr)
df <- HeartData
head(df)
summary(df)
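Optionally, you can also take a quick look at the data structure and check for missing values before modelling; this small check is not part of the original script:
#Optional: inspect structure and missing values
str(df)
colSums(is.na(df))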
After that, we will check the correlation between the two independent variables and look at the distribution of the dependent variable. The code snippet below shows these steps.
#Check correlation between two independent variables
cor(df$biking, df$smoking)
#Histogram for heart disease
hist(df$heart.disease)
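If you want to see all pairwise correlations at once, you can also compute the full correlation matrix (optional, not in the original script):
#Optional: full correlation matrix of all three variables
cor(df[, c("biking", "smoking", "heart.disease")])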
To make the results reproducible, I have set the seed to 1. This ensures you obtain the same results shown in this blog.
The next step is data partitioning: the data is divided into two parts, i.e., training and testing. In this example, approximately 70% of the dataset is used for training and 30% for testing.
#make this example reproducible
set.seed(1)
sample <- sample(c(TRUE, FALSE), nrow(df), replace=TRUE, prob=c(0.7,0.3))
train <- df[sample, ]
test <- df[!sample, ]
#Training
heart.disease.lm <- lm(heart.disease ~ biking + smoking, data = train)
#Summary of Training
summary <- summary(heart.disease.lm)
summary
#Prediction using the test data
heart.disease.predictions <- predict(heart.disease.lm, test)
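Before moving on, you can optionally confirm that the split is roughly 70/30; this quick check is not part of the original script:
#Optional: check the size of the training and test sets
nrow(train)
nrow(test)
nrow(train) / nrow(df)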
After training the model on the training dataset, the model is used to make predictions on the test dataset. The “cbind” function is used to create a table of the testing results, which contains the actual and predicted values from the testing phase. This gives an overview of the model’s performance on the test dataset, as shown in the code snippet below.
#creating a result table using column bind function
results <- cbind(heart.disease.predictions, test$heart.disease) #Taking the predicted values and actual values of the test data
colnames(results) <- c('predicted', 'actual') #Naming the columns of the result
results <- as.data.frame(results)
#Visualising the actual and predicted values for the test data
head(results)
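To go one step further, you could also quantify the model’s accuracy on the test set with simple error metrics; a minimal optional sketch, not part of the original script:
#Optional: test-set error metrics
rmse <- sqrt(mean((results$actual - results$predicted)^2))
mae <- mean(abs(results$actual - results$predicted))
rmse
mae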
The next step is to retrieve the parameters from the trained model and use them in the analytic designer environment, where the multiple linear regression formula is applied for prediction.
To retrieve the values from the training summary shown in Figure 3, we need to create a matrix of coefficients from the summary object. This allows us to extract the required parameters from the matrix. After retrieving the parameters, we can save them in variables that can be accessed from the analytic designer environment.
Figure 3: The figure shows the summary output in R console; Source: Author’s own illustration.
#Creating a matrix of coefficients from training data
matrix_coef <- summary$coefficients
matrix_coef
#Grabbing the values of coefficients from the matrix
Intercept <- matrix_coef[1,1]
Biking <- matrix_coef[2,1]
Smoking <- matrix_coef[3,1]
#Printing the coefficients
Intercept
Biking
Smoking
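As a side note, the same estimates can also be read directly from the fitted model with coef(), which returns the first column of the coefficient matrix; an equivalent optional alternative:
#Optional: equivalent way to read the coefficient estimates
coef(heart.disease.lm)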
Figure 4 shows the parameter values in the R console, which will be used in the multiple linear regression equation to calculate the predicted values.
Figure 4: The figure shows the parameter values as output in the R console; Source: Author’s own illustration.
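If you want to sanity-check these parameters before moving to the analytic designer, you can compare the manual equation with R’s predict() for a hypothetical input; the 50% biking and 20% smoking values below are made up purely for illustration:
#Optional sanity check with a hypothetical input (values are illustrative only)
new_city <- data.frame(biking = 50, smoking = 20)
predict(heart.disease.lm, newdata = new_city)
#Should match: Intercept + Biking * 50 + Smoking * 20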
Creating front end for analytic application:
In this step, the front end of the application is created using widgets such as text, text area, and input field, as shown in Figure 5, to build the analytic application shown in Figure 6.
Figure 5: The figure shows the widgets used to create application front end; Source: Author’s own illustration.
Figure 6: The figure shows the application front end and a histogram; Source: Author’s own illustration.
Button_1 OnClick script: This script is triggered when the user clicks the “Predict” button after entering the required input values, i.e., the Smoking% and Biking%.
//getting the parameter values from R Environment.
var Intercept= RVisualization_2.getEnvironmentValues().getNumber("Intercept");
var PBiking= RVisualization_2.getEnvironmentValues().getNumber("Biking");
var PSmoking= RVisualization_2.getEnvironmentValues().getNumber("Smoking");
console.log(Intercept);
console.log(PBiking);
console.log(PSmoking);
//converting the user input string values to numbers
var Biking =ConvertUtils.stringToNumber(InputField_1.getValue());
var Smoking =ConvertUtils.stringToNumber(InputField_2.getValue());
//Calculating the prediction using Multiple linear regression equation
var formula = Intercept + PBiking*Biking+ PSmoking*Smoking;
//Rounding off the output to nearest integer
var predictionFormula =Math.round(formula);
//Printing the output to the text box
Text_1.applyText("Based on the input values, the prediction for % of people having heart disease in the city is: "+ConvertUtils.numberToString(predictionFormula)+ "%");
You might be thinking that the histogram visualization (Figure 6) does not add any value to the analytic application, so why do we have this seemingly unnecessary visualization? Unfortunately, the R visualization widget must produce an output in order to pass the environment values to the analytic designer environment.
One solution is to make the visualization widget as small as possible (just small enough that the visualization still renders) and hide it behind a shape widget with a white background color, as shown in Figure 7. 😀
Let me know if you have other ideas! Seriously.
Figure 7: The figure shows hiding the R visualization widget containing the histogram using the ‘Shape’ widget; Source: Author’s own illustration.
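One idea I have not verified in SAC: if the widget only needs to produce some output in order to expose the environment values, it might be enough to render a near-blank plot instead of the histogram, for example:
#Untested idea: a near-blank plot so the R widget still produces an output
par(mar = c(0, 0, 0, 0))
plot.new()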
Testing the prediction tool:
As shown in Figure 8, our analytic application is now ready. After the user enters the required input values, it returns the predicted value, as shown in Figure 9.
Figure 8: The figure shows the final front end of the analytic application; Source: Author’s own illustration.
Figure 9: The figure shows the predicted values based on the user input; Source: Author’s own illustration.
Complete R Script used in this demonstration:
library(ggplot2)
library(dplyr)
df <- HeartData
head(df)
summary(df)
#Check correlation between two independent variables
cor(df$biking, df$smoking)
#Histogram for heart disease
hist(df$heart.disease)
#make this example reproducible
set.seed(1)
sample <- sample(c(TRUE, FALSE), nrow(df), replace=TRUE, prob=c(0.7,0.3))
train <- df[sample, ]
test <- df[!sample, ]
#Training
heart.disease.lm <- lm(heart.disease ~ biking + smoking, data = train)
#Summary of Training
summary <- summary(heart.disease.lm)
summary
#Prediction using the test data
heart.disease.predictions <- predict(heart.disease.lm, test)
#creating a result table using column bind function
results <- cbind(heart.disease.predictions, test$heart.disease) #Taking the predicted values and actual values of the test data
colnames(results) <- c('predicted', 'actual') #Naming the columns of the result
results <- as.data.frame(results)
#Visualising the actual and predicted values for the test data
head(results)
#Transfer the values of coefficients to the Analytic designer environment
#Creating a matrix of coefficients from training data
matrix_coef <- summary$coefficients
matrix_coef
#Grabbing the values of coefficients from the matrix
Intercept <- matrix_coef[1,1]
Biking <- matrix_coef[2,1]
Smoking <- matrix_coef[3,1]
#Printing the coefficients
Intercept
Biking
Smoking
Conclusion: This blog demonstrates how we can leverage R visualization widgets to build a prediction tool in SAC. The prediction tool shown in this blog solves a regression problem using a multiple linear regression algorithm.
If you found this blog helpful, please like the post and follow me for more content related to SAP Analytics Cloud. If you have any questions or feedback, please leave a comment below.
Further study on similar topics:
https://blogs.sap.com/2022/09/12/automated-machine-learning-automl-using-analytic-application/
https://blogs.sap.com/2020/06/08/r-visualizations-in-sap-analytics-cloud/
https://www.scribbr.com/statistics/simple-linear-regression/