I participated in the SAP HANA ML Challenge – Employee Churn 🤖 and I came in second place 🏆.

I would like to thank the entire SAP expert team for the initiative and organization! 👏

Presentation of the winners’ solutions on YouTube:

How to prevent Employee Churn using SAP HANA Cloud | SAP Community Call

My achievement post on LinkedIn.

Table of content

1. Getting started
2. Understanding the challenge
3. Exploratory data analysis and pre-processing
4. Baseline model with initial data
5. Understanding statistics
6. Training model with additional data
7. Reducing the number of features
8. Balancing training data with SMOTE
9. Balancing test data and increasing training data
10. Finding top SMOTE
11. Building top models
12. Analyzing top models
13. Visualizing statistics
14. Further steps

1. Getting started

The challenge was an opportunity for me to get hands-on experience with SAP Python Machine Learning Client for SAP HANA. The participants had continuous support from SAP team experts during open office hours, discussions on SAP Community Groups, and tutorials on GitHub. I gained my previous experience from Kaggle and SAP open courses. Participants without SAP HANA Cloud instance, myself included, received credentials to execute Jupyter Notebooks in the cloud. The first step was to set up the environment. I installed Anaconda, created the environment, and installed SciPy and hana-ml.

conda create --name ml_hana python=3.8
conda install -c conda-forge scikit-learn
conda install -c anaconda pandas
conda install -c anaconda jupyter
pip install hana-ml

After successfully testing the connection, I was ready to dig deeper into the problem we needed to solve.

2. Understanding the challenge

The task is to build a model that predicts if an employee will stay or quit. The collected data were provided in a CSV file. To look into the data we can use python libraries locally for light tasks and hana_ml for resource-consuming tasks. For data processing locally I used Pandas DataFrame and for the cloud the HANA ML DataFrame, libraries are aligned in terms of methods, but the syntax is specific for each. I received all the required support from the expert team and I moved on with my exploratory data analysis. I focused on building a model with the best results and alternatives to it and providing a possibility to apply A/B testing on deployed models. The results of the final model I presented with visualization libraries. The machine learning problem of the challenge is a classification problem with two classes. One class is Stay labeled with No and another one is Quit labeled with Yes. I call them class No and class Yes. The class we have to predict is Yes. Counting the rows we can find that class No : Class Yes = 17015:2100 = 8:1. Looking at proportions I can see that data are highly imbalanced and I have a majority class No and a minority class Yes. Minority class Yes is the one I have to predict.

The final code in Jupyter Notebook contains 270 cells. In the blog, I will only go through the main parts. The entire code is on GitHub, published with my last run. As with any code, it can be improved further if I had to maintain it in the long run.

3. Exploratory data analysis and pre-processing

Analyzing feature distributions.

# In [26]
# EDA features distributions
col_exc = ['ID']
for col in df_remote_add.columns:
    if col not in col_exc:
        print(df_remote_add.agg([('count', col, 'Count')], group_by=col).collect())

Data contains Null values and I filled them with ‘-999’.

# In [31]:
df_remote = df_remote.fillna(-999)
df_remote = df_remote.fillna("-999")

Feature engineering.

# In [34]
# TIMEINPREVPOSITIONMONTH - feature engineering group in intervals
# I will check by importance if this mattters
    
df_remote = df_remote.select 
    ('*',("""CASE WHEN TIMEINPREVPOSITIONMONTH <=12 THEN 'PREV_POS (,12]'
                  WHEN TIMEINPREVPOSITIONMONTH > 12 AND TIMEINPREVPOSITIONMONTH <=24 THEN 'PREV_POS (12,24]'
                  WHEN TIMEINPREVPOSITIONMONTH > 24 AND TIMEINPREVPOSITIONMONTH <=36 THEN 'PREV_POS (24,36]'
                  WHEN TIMEINPREVPOSITIONMONTH > 36 AND TIMEINPREVPOSITIONMONTH <=48 THEN 'PREV_POS (36,48]'
                  WHEN TIMEINPREVPOSITIONMONTH > 48 AND TIMEINPREVPOSITIONMONTH <=60 THEN 'PREV_POS (48,60]'
                  WHEN TIMEINPREVPOSITIONMONTH > 60 THEN 'TIMEPREV_POS (60,]'
              END""",'TIMEINPREVPOS_INT'))
df_remote = df_remote.to_tail('FLIGHT_RISK')
df_remote.head(3).collect()

4. Baseline model with initial data

I split data into train data and test data and fit train data to build the first baseline model.

# In [42]:
from hana_ml.algorithms.pal.partition import train_test_val_split as split
d_train, d_test, d_val = split( data=df_remote, partition_method='stratified', stratified_column=target, validation_percentage = 0)

To build the model I used HybridGradientBoostingTree wrapped in a function.

I didn’t have to encode features the model did it for me. The model did the magic, but I had to feed it with data in the most efficient way step by step.

# In [47]:
# Function to use repeatedly
def model_fit(d_train):
    HGBT_MODEL = UnifiedClassification('HybridGradientBoostingTree')

# In [56]:
HGBT_MODEL.statistics_.collect() # MODEL statistics

Out [56]:

1	RECALL	0.9959588537839824	No
2	PRECISION	0.9249402934152167	No
3	F1_SCORE	0.9591367415531578	No
4	SUPPORT	2722	No
5	RECALL	0.34523809523809523	Yes
6	PRECISION	0.9133858267716536	Yes
7	F1_SCORE	0.5010799136069115	Yes
8	SUPPORT	336	Yes
9	ACCURACY	0.9244604316546763	None

Looking into model statistics at first glance I can consider myself very lucky with ACCURACY = 0.92, but looking at SUPPORT No = 2722 and SUPPORT Yes = 336 I recognize that validation data used for models metrics calculations is imbalanced and this high ACCURACY is because of the majority class No with RECALL No = 0.99, while RECALL Yes = 0.34. Since I am interested in the class Yes I can see that the results are very poor. What is a RECALL? I spent some time breaking down the confusion matrix into basic elements. Because if I don’t feel confident with metrics I would not be able to improve the prediction model. Computers are fast and accurate at calculations, but the reasoning is a human task.

5. Understanding statistics

To calculate the metrics of the confusion matrix all we need is algebra, first, we count TP (true positive), FP (false positive), FN (false negative), and TN (true negative), then with these four elements we can calculate all the metrics. If classes are imbalanced I can rely only on RECALL because calculations are performed with the prediction from each class RECALL Positive = TP / TP + FP and RECALL Negative = TN / TN + FN. To calculate PRECISION I need true prediction (TP) from one class and false predictions (FP) from another class. To calculate ACCURACY I need all four elements TP, FP, FN, and TN, and with imbalanced classes, results are misleading because the extremely good results of the majority class and poor results of the minority class will make overall results look good. You can also read this blog Failure of Classification Accuracy for Imbalanced Class Distributions.

Let’s deep into confusion matrix principles, best explained with an example.

	Predicted No	Predicted Yes
Actual No	True No	False Yes
Actual Yes	False No	True Yes

Assume one built a model with extremely good metrics for majority class No.

Majority class No, Minority class Yes = 900, 100

Assume confusion matrix results.

	Predicted No	Predicted Yes
Actual No	900	9
Actual Yes	82	9

Model statistics.

RECALL No	0.99	extremely good
PRECISION No	0.92	very good
F1_SCORE No	0.95	very good
SUPPORT No	909
RECALL Yes	0.10	very poor
PRECISION Yes	0.50	poor
F1_SCORE Yes	0.17	very poor
SUPPORT Yes	91
ACCURACY	0.91	very good

If I wanted to rely on all metrics I would force balancing classes in the confusion matrix dividing Actual No by 9.

	Predicted No	Predicted Yes
Actual No	90	1
Actual Yes	82	9

Model statistics adjusted.

RECALL No	0.99	very good (same)
PRECISION No	0.52	poor
F1_SCORE No	0.68	good
SUPPORT No	91
RECALL Yes	0.10	very poor (same)
PRECISION Yes	0.90	very good
F1_SCORE Yes	0.18	very poor
SUPPORT Yes	91
ACCURACY	0.54	poor

Now I can see that with balanced metrics RECALL No and RECALL Yes remain the same, PRECISON No drops significantly, PRECISION Yes increases significantly, but F1_SCORE Yes drops to very poor, and ACCURACY drops from 0.91 (very good) to 0.54 (poor).

You can read more about balanced accuracy score here.

Model statistics calculates metrics with validations data split from train data. Validation data is unseen data however used in fitting for iterative optimization.

To see how the model performs we have to calculate metrics for absolutely unseen data the test data.

Prediction statistics wrapped in a local function.

# In [61]:
f_score_res(HGBT_MODEL,d_test)[1].collect()

Out [61]:

	STAT_NAME	STAT_VALUE	CLASS_NAME
1	RECALL	0.9964747356051704	No
2	PRECISION	0.9247546346782988	No
3	F1_SCORE	0.9592760180995475	No
4	SUPPORT	1702	No
5	RECALL	0.34285714285714286	Yes
6	PRECISION	0.9230769230769231	Yes
7	F1_SCORE	0.5000000000000001	Yes
8	SUPPORT	210	Yes
9	ACCURACY	0.9246861924686193	None

RECALL Yes = 0.34 for predicted data is very poor and ACCURACY = 0.92 doesn’t tell the true story.

In order to analyze the performance of subsequent approaches I will collect results in a local pandas DataFrame wrapped in a local function to add subsequent models results.

# In [64]:
def f_stat_all(p_model, p_d_test, p_col_name):
df_model_stat_last # model metrics
df_pred_stat_last  # prediction metrics

I will store the initial baseline results in column INIT.

# In [65]:
df_model_stat_last, df_pred_stat_last = f_stat_all(HGBT_MODEL, d_test, 'INIT')

I will take a look at the initial feature importance.

# In [68]:
# Feature importance
HGBT_MODEL.importance_.sort('IMPORTANCE', desc=True).collect()

Out [68]:

The top feature by importance is FUNCTIONALAREACHANGETYPE.

6. Training model with additional data

Meanwhile, additional data has been provided; the shape of the data is consistent.

# In [19]:
df_data_add.shape, df_remote.shape # check rows and columns

Out [19]:

((19115, 4), [19115, 43])

Joining of initial data with additional data.

# In [70]:
df_remote_new = df_remote.set_index(tab_id).join(df_remote_add.set_index(tab_id))

After fitting the model statistics are the following:

Out [75]:

1	RECALL	0.979059515062454	No
4	SUPPORT	2722	No
5	RECALL	0.6101190476190477	Yes
8	SUPPORT	336	Yes

I observe that the RECALL Yes of model statistics increased to a good one 0.61.

Analyzing feature importance I observe that the top feature HRTRAINING comes from additional data.

# Out [76]:
df_imp = HGBT_MODEL.importance_.sort('IMPORTANCE', desc=True).collect()

Prediction statistics.

Out [78]:

1	RECALL	0.9800235017626322	No
4	SUPPORT	1702	No
5	RECALL	0.5714285714285714	Yes
8	SUPPORT	210	Yes

RECALL Yes of prediction statistics is 0.57 is acceptable and less than model statistics 0.61.

I conclude that predictions improve.

7. Reducing the number of features

Looking further into the data I notice dependent features by description and content.

# In [87]:
col_age_dep = ['AGE', 'AGE_GROUP5', 'AGE_GROUP10', 'GENERATION'] # age group

# In [90]:
col_imp_prev_loc = ['PREVIOUS_COUNTRY', 'PREVCOUNTRYLAT', 'PREVCOUNTRYLON', 'PREVIOUS_REGION'] # previous country group

# In [92]:
col_imp_curr_loc = ['CURRENT_REGION', 'CURRENT_COUNTRY', 'CURCOUNTRYLAT', 'CURCOUNTRYLON'] # current country group

# In [94:]
col_imp_tenure = ['TENURE_MONTHS','TENURE_INTERVAL_YEARS','TENURE_INTERVALL_DESC'] # tenure group

From each group, I select the first top feature and fit again the model.

RECALL Yes from prediction statistics improves to 0.61 from 0.57 which means that model performs better with only top features from groups.

# In [104]:
f_score_res(HGBT_MODEL, d_test[cols_imp])[1].collect()

Out [104]:

1	RECALL	0.9782608695652174	No
4	SUPPORT	1702	No
5	RECALL	0.6190476190476191	Yes
8	SUPPORT	210	Yes

For countries, I prefer adjusted selection with top PREVIOUS_COUNTRY, and CURRENT_COUNTRY.

# In [110]:
# Reducing features with meaning preference
COL_IMP_PREVIOUS_COUNTRY_MEANING = ['PREVIOUS_COUNTRY','PREVCOUNTRYLAT', 'PREVCOUNTRYLON', 'PREVIOUS_REGION']
COL_IMP_CURRENT_REGION_MEANING = ['CURRENT_COUNTRY','CURRENT_REGION', 'CURCOUNTRYLAT', 'CURCOUNTRYLON']

Fitting with meaning preference.

# In [114]:
f_score_res(HGBT_MODEL, d_test[cols_imp_mn])[1].collect() # meaning preference

Prediction statistics.

Out [114]:

1	RECALL	0.981198589894242	No
4	SUPPORT	1702	No
5	RECALL	0.6238095238095238	Yes
8	SUPPORT	210	Yes

I observe that RECALL Yes is almost the same 0.62.

8. Balancing training data with SMOTE

A quick introduction to imbalanced data you could find in the blog Data Yoga-It is all about finding the right balance. For oversampling of minority class Yes I will use SMOTE (Synthetic minority over-sampling technique).

Checking class proportions from train data.

# In [124]:
f_class_count(d_train).collect()

Out [124]:

	FLIGHT_RISK	Count
0	No	13612
1	Yes	1680

I know that classes are highly imbalanced and I want to find a way to fix this with SMOTE.

I wrapped the calculation of max SMOTE into a local function.

# In [126]:
# return max smote amount
def f_smote_nmax(dfh):

The max SMOTE is 710.

Generating train data with max SMOTE to get balanced classes.

# In [129]:
d_train_sm = f_smote(d_train,f_smote_nmax(d_train)) # generate smote with max amount

Checking the proportions of the classes with max SMOTE.

# In [130]:
f_class_count(d_train_sm).collect() #check classes count for Majority No:Minority Yes = 1:1

Out [130]:

	FLIGHT_RISK	Count
0	No	13612
1	Yes	13608

# In [134]:
# Fit with smote balanced No:Yes with all features
HGBT_MODEL=model_fit(d_train_sm)

Prediction statistics.

Out [137]:

1	RECALL	0.945358401880141	No
3	F1_SCORE	0.9600238663484487	No
4	SUPPORT	1702	No
5	RECALL	0.8047619047619048	Yes
8	SUPPORT	210	Yes

RECALL Yes jumps from 0.62 to 0.80, which means SMOTE improves the model.

The fitting of the model with max SMOTE and important features.

# In [144]:
# Fit with smote balanced No:Yes with important features
HGBT_MODEL=model_fit(d_train_sm[cols_imp])

Prediction statistics

Out [146]:

1	RECALL	0.9195064629847238	No
2	PRECISION	0.9726538222498446	No
4	SUPPORT	1702	No
5	RECALL	0.7904761904761904	Yes
8	SUPPORT	210	Yes

RECALL Yes with important features and max SMOTE almost same 0.79

9. Balancing test data and increasing training data

I notice that SUPPORT Yes < SUPPORT No. To fix it I will keep for test rows SUPPORT No = SUPPORT Yes. The rest of the SUPPORT No rows I will add to train data because I want to use precious data.

Balanced test data.

# In [164]:
f_class_count(d_test_bl).collect() # check test classes are balanced

Out [164]:

	FLIGHT_RISK	Count
0	Yes	210
1	No	210

Increased rows of train data.

# In [166]:
d_train_target_count = f_class_count(d_train_bl) # added No from d_test that are not used (unseen)
d_train_target_count.collect()

Out [166]:

	FLIGHT_RISK	Count
0	No	15104
1	Yes	1680

Applying SMOTE to train data.

# In [168]:
f_class_count(d_train_bl_sm).collect() # #check classes count for Majority No:Minority Yes = 1:1

Out [168]:

	FLIGHT_RISK	Count
0	No	15104
1	Yes	15103

# In [171]:
# Fit with balanced d_train_bl_sm important features
HGBT_MODEL=model_fit(d_train_bl_sm[cols_imp])

Prediction statistics.

# In [173]:
f_score_res(HGBT_MODEL, d_test_bl[cols_imp])[1].collect()

Out [173]:

	STAT_NAME	STAT_VALUE	CLASS_NAME
1	RECALL	0.919047619047619	No
2	PRECISION	0.8427947598253275	No
3	F1_SCORE	0.8792710706150342	No
4	SUPPORT	210	No
5	RECALL	0.8285714285714286	Yes
6	PRECISION	0.9109947643979057	Yes
7	F1_SCORE	0.8678304239401496	Yes
8	SUPPORT	210	Yes
9	ACCURACY	0.8738095238095238	None

RECALL Yes has a small improvement to 0.82.

SUPPORT Yes = SUPPORT No that means I can rely on all statistics. I notice that ACCURACY is 0.87 and that is a very good one.

10. Finding top SMOTE

I want to execute SMOTE with the step of 100 and analyze ACCURACY from prediction statistics.

# In [188]:
# SMOTE with steps
# fire on all cylinders
dict_models_smote, df_model_statistics_sm, df_predicted_statistics_sm = 
f_smote_models(d_train_bl, d_test_bl, cols_imp, 100)
ls_smote: [0, 100, 200, 300, 400, 500, 600, 700, 799]

This is an intensive processing in the HANA ML cloud with Wall time: 8min 34s.

Model statistics are collected in local pandas DataFrame.

# In [190]:
df_model_statistics_sm

Prediction statistics are collected in the local pandas DataFrame.

# In [191]:
df_predicted_statistics_sm

Selecting SMOTE for best prediction ACCURACY with local function.

# In [194]:
# SUPPORT Yes = SUPPORT No - ACCURACY reliable
f_accuracy_descending(df_predicted_statistics_sm)

Out [194]:
STAT_NAME                   ACCURACY
STAT_VALUE_700    0.8738095238095238
STAT_VALUE_200    0.8666666666666667
STAT_VALUE_500    0.8666666666666667
STAT_VALUE_400    0.8642857142857143
STAT_VALUE_600    0.8642857142857143
STAT_VALUE_799    0.8642857142857143
STAT_VALUE_300    0.8547619047619047
STAT_VALUE_100    0.8523809523809524
STAT_VALUE_0 0.7928571428571428

11. Building top models

I fit models with smote_best_amount adding the features one by one in order of importance.

# In [207]:
# FEATURES IMPORTANCE TOP n
# fire on all cylinders
def f_smote_model_top_f(p_d_train, p_cols_imp, p_d_test, p_smote_best_amount, p_cols_imp_start, p_cols_imp_step, p_verbosity):
# ...
   return lc_dict_models_top_f, lc_df_model_statistics, lc_df_predicted_statistics

# In [212]:
%%time
# FEATURES IMPORTANCE TOP n
# fire on all cylinders
dict_models_top_f, df_model_statistics_top_f, df_predicted_statistics_top_f = 
f_smote_model_top_f(d_train_bl, sm_best_cols_imp, d_test_bl, smote_best_amount, 1, 1, True)

This is very intensive processing in the HANA ML cloud with Wall time: 21min 6s.

Relax and enjoy a cup of coffee or tea! 🙂

Selecting top features with top prediction ACCURACY with a local function.

# In [217]:
# SUPPORT Yes = SUPPORT No ACCURACY is reliable
f_accuracy_descending(df_predicted_statistics_top_f)[:10]

Out[217] STAT_NAME ACCURACY
STAT_VALUE_top15 0.8928571428571429
STAT_VALUE_top6 0.888095238095238
STAT_VALUE_top29 0.8857142857142857
STAT_VALUE_top7 0.8833333333333333
STAT_VALUE_top18 0.8833333333333333
STAT_VALUE_top8 0.8833333333333333
STAT_VALUE_top19 0.8833333333333333
STAT_VALUE_top25 0.8809523809523809
STAT_VALUE_top23 0.8761904761904762

Great surprise, with the best SMOTE and best top15 features prediction ACCURACY increases to 0.89!

I stored all models and other dependencies in the dictionary dict_models_top_f to access them easily.

# In [218]:
def f_sel_top_pred_acc(p_df,p_n):
    s =  f_accuracy_descending(p_df) # sorted series
    ls_s = s[:p_n+1].index.to_list() # sorted list with last n+1
    key = ls_s[-1] # last element
    model = dict_models_top_f[key][0]
    cols = dict_models_top_f[key][1]
    cols_len = dict_models_top_f[key][2]
    cols_name = dict_models_top_f[key][3]

    return [key, model, cols, cols_len, cols_name]

# In [219]:
# key, model, cols, cols_len, cols_name Top 1
top_1_imp_ls = f_sel_top_pred_acc(df_predicted_statistics_top_f,1)
top_1_imp_ls[0], top_1_imp_ls[1], top_1_imp_ls[2], top_1_imp_ls[3], top_1_imp_ls[4]

Out [219]:

(‘STAT_VALUE_top15’,
<hana_ml.algorithms.pal.unified_classification.UnifiedClassification at 0x25c02d19430>,
[‘EMPLOYEE_ID’,
‘HRTRAINING’,
‘FUNCTIONALAREACHANGETYPE’,
‘SICKDAYS’,
‘PROMOTION_WITHIN_LAST_3_YEARS’,
‘TIMEINPREVPOSITIONMONTH’,
‘EMPLOYMENT_TYPE_2’,
‘PREVIOUS_CAREER_PATH’,
‘RISK_OF_LOSS’,
‘CURRENT_COUNTRY’,
‘SALARY’,
‘JOBLEVELCHANGETYPE’,
‘AGE’,
‘CHANGE_IN_PERFORMANCE_RATING’,
‘PREVIOUS_COUNTRY’,
‘LINKEDIN’,
‘FLIGHT_RISK’],
17,
‘top15’)

Predictions statistics in Pandas DataFrame with results from subsequent steps.

# In Out[230]:
df_pred_stat_last

Out[230]:

	STAT_NAME	CLASS_NAME	INIT	ADD	cols_imp	cols_imp_mn	sm_max_cols_all	sm_max_cols_imp	sm_max_bl_cols_imp	sm_best_bl_cols_imp	top15
0	AUC	None	0.9733	0.9834	0.9857	0.9870	0.9784	0.9709	0.9422	0.9474	0.9498
1	RECALL	No	0.9964	0.9800	0.9782	0.9811	0.9453	0.9195	0.9190	0.9476	0.9380
2	PRECISION	No	0.9247	0.9488	0.9541	0.9548	0.9751	0.9726	0.8427	0.8257	0.8602
3	F1_SCORE	No	0.9592	0.9641	0.9660	0.9678	0.9600	0.9453	0.8792	0.8824	0.8974
4	SUPPORT	No	1702	1702	1702	1702	1702	1702	210	210	210
5	RECALL	Yes	0.3428	0.5714	0.6190	0.6238	0.8047	0.7904	0.8285	0.8	0.8476
6	PRECISION	Yes	0.9230	0.7792	0.7784	0.8036	0.6450	0.5478	0.9109	0.9385	0.9319
7	F1_SCORE	Yes	0.5000	0.6593	0.6896	0.7024	0.7161	0.6471	0.8678	0.8637	0.8877
8	SUPPORT	Yes	210	210	210	210	210	210	210	210	210
9	ACCURACY	None	0.9246	0.9351	0.9388	0.9419	0.9299	0.9053	0.8738	0.8738	0.8928
10	KAPPA	None	0.4683	0.6244	0.6562	0.6708	0.6766	0.5945	0.7476	0.7476	0.7857
11	MCC	None	0.5363	0.6336	0.6615	0.6774	0.6820	0.6079	0.7506	0.7559	0.7889

12. Analyzing top models

First I analyze the top 1 reason for the top 1 model prediction.

# In [233]:
# Top 1 contribution feauture to prediction
f_score_res(top_1_imp_ls[1], d_test_bl)[0].
    select('EMPLOYEE_ID', 'SCORE', 'CONFIDENCE', 'REASON_CODE', 
                      ('json_query("REASON_CODE", '$[0].attr')', 'Top1'), 
                      ('json_query("REASON_CODE", '$[0].pct')', 'PCT_1') ).head(5).collect()

Out[233]:

	EMPLOYEE_ID	SCORE	CONFIDENCE	REASON_CODE	Top1	PCT_1
0	10033	No	0.998697	[{“attr”:”SICKDAYS”,”val”:-2.3005012586457966,…	“SICKDAYS”	28.49847521037374
1	10034	No	0.997572	[{“attr”:”SICKDAYS”,”val”:-2.1461261874013997,…	“SICKDAYS”	28.604720941656255
2	10036	Yes	0.884731	[{“attr”:”HRTRAINING”,”val”:1.2600622007181038…	“HRTRAINING”	17.883981590687534
3	10056	Yes	0.992384	[{“attr”:”SICKDAYS”,”val”:3.2852997983870959,”…	“SICKDAYS”	30.97615277828276
4	10064	No	0.985765	[{“attr”:”HRTRAINING”,”val”:-1.738132753015509…	“HRTRAINING”	27.416934166112573

I can compare test data FLIGHT_RISK (actual) with prediction SCORE.

# In [236]:
print(f"Top 1 voting with features {top_1_imp_ls[4]}") # Top 1 voting
f_actual_vs_score(top_1_imp_ls[1], d_test_bl).head(5).collect()

Top 1 model (top15 features ) prediction and top 1 reason code.

Out [236]:

	EMPLOYEE_ID	FLIGHT_RISK	SCORE	CONFIDENCE	REASON_CODE
0	10033	No	No	0.998697	[{“attr”:”SICKDAYS”,”val”:-2.3005012586457966,…
1	10034	No	No	0.997572	[{“attr”:”SICKDAYS”,”val”:-2.1461261874013997,…
2	10036	Yes	Yes	0.884731	[{“attr”:”HRTRAINING”,”val”:1.2600622007181038…
3	10056	Yes	Yes	0.992384	[{“attr”:”SICKDAYS”,”val”:3.2852997983870959,”…
4	10064	No	No	0.985765	[{“attr”:”HRTRAINING”,”val”:-1.738132753015509…

I can analyze further top 5 REASON_CODE with code wrapped in a local function.

# In [240]:
# Top 5 contribution feauture to prediction for rows
def f_contrib_top5f_head(p_model, p_d_test, p_rows):

# In [241]:
f_contrib_top5f_head(top_1_imp_ls[1], d_test_bl, 3)

Out [241]:

	EMPLOYEE_ID	SCORE	CONFIDENCE	REASON_CODE	Top	PCT
0	10033	No	0.998697	[{“attr”:”SICKDAYS”,”val”:-2.3005012586457966,…	“SICKDAYS”	28.49847521037374
3	10033	No	0.998697	[{“attr”:”SICKDAYS”,”val”:-2.3005012586457966,…	“HRTRAINING”	20.004226264298525
6	10033	No	0.998697	[{“attr”:”SICKDAYS”,”val”:-2.3005012586457966,…	“PREVIOUS_CAREER_PATH”	13.468698519836725
9	10033	No	0.998697	[{“attr”:”SICKDAYS”,”val”:-2.3005012586457966,…	“LINKEDIN”	6.355941761599763
12	10033	No	0.998697	[{“attr”:”SICKDAYS”,”val”:-2.3005012586457966,…	“FUNCTIONALAREACHANGETYPE”	6.23681257187179

13. Visualizing statistics

Presenting model results with Visualizing Model Report as explained in the blog “ML in SAP Data Warehouse Cloud”.

# In [243]:
from hana_ml.visualizers.unified_report import UnifiedReport
UnifiedReport(top_1_imp_ls[1]).build().display()

Variable%20Importance

Variable Importance

I can analyze the contributions of features with shapely_explainer as presented in the library example.

# In [257]:
shapley_explainer = TreeModelDebriefing.shapley_explainer(feature_data=d_test_bl.head(500).select(features_shapely),
reason_code_data=pred_res_shapely.head(500).select('REASON_CODE'))
shapley_explainer.summary_plot()

Shapley%20explainer

Shapley explainer

Confusion matrix for validation data from model training as explained in the tutorial on GitHub.

In [244]:
import matplotlib.pyplot as plt
from hana_ml.visualizers.metrics import MetricsVisualizer
f, ax1 = plt.subplots(1,1)
mv1 = MetricsVisualizer(ax1, title = 'Modedl confusion matrix')
ax1 = mv1.plot_confusion_matrix(top_1_imp_ls[1].confusion_matrix_, normalize=False)

Model confusion matrix

Model%20confusion%20matrix

Model confusion matrix

# In [245]:
f, ax1 = plt.subplots(1,1)
mv1 = MetricsVisualizer(ax1, title = 'Model confusion matrix %')
ax1 = mv1.plot_confusion_matrix(top_1_imp_ls[1].confusion_matrix_, normalize=True)

Model%20confusion%20matrix

Model confusion matrix

Confusion matrix for predictions wrapped in local functions.

# In [259]:
def f_cf_mx_values(p_model, p_d_test):

# In [267]:
sns.heatmap(cf_matrix, annot=labels, fmt='', cmap='Blues')

Prediction%20confusion%20matrix

Prediction confusion matrix

14. Further steps

Regarding to further models exploring I could use as starting point Automated Predictive Libraries GradientBoostingBinaryClassifier, tutorial Fraud Detection Auto ML Script, blog Hands-On Tutorial: Automated Predictive (APL) in SAP HANA Cloud and blog SAP HANA Cloud Machine Learning Challenge. I would increase the test data percentage from split and use ensemble techniques for voting.

Thinking in terms of transfer learning as in computer vision, what if we can collect data across different companies and industries, build specialized or generalized models, shake hands with developers, and integrate prediction in applications to use internally or externally as on demand service. Data required for monthly evaluation could be collected from internal sources and external sources with coded extractors or scraping tools.

Since I am an end-to-end developer of Business Intelligence reports, I am looking forward to getting hands-on experience regarding how to integrate extraction of data from an SAP environment or other external sources to build models, consume machine learning models, and integrate prediction into the business apps.

Tags: Machine Learning

Sara Sampaio

Author Since: March 10, 2022

0 0 votes

Article Rating

0 Comments

Inline Feedbacks

View all comments

Table of content

1. Getting started

2. Understanding the challenge

3. Exploratory data analysis and pre-processing

4. Baseline model with initial data

5. Understanding statistics

6. Training model with additional data

7. Reducing the number of features

8. Balancing training data with SMOTE

9. Balancing test data and increasing training data

10. Finding top SMOTE

11. Building top models

12. Analyzing top models

13. Visualizing statistics

14. Further steps

Sara Sampaio

Company

Explore More