I am writing this blog to show how to train a model with APL using the Python package hana_ml. With APL, you can automate preprocessing to some extent.
Environment
The environment is as follows.
- Python: 3.7.14 (Google Colaboratory)
- HANA: Cloud Edition 2022.16
- APL: 2209
Python packages and their versions.
- hana_ml: 2.14.22091801
- pandas: 1.3.5
- scikit-learn: 1.0.2
As for HANA Cloud, I activated the script server and created my user. I am not aware of any other special configuration, but I may be missing something, since our HANA Cloud instance was created quite a while ago.
I didn't use HDI here, to keep the environment simple.
Python Script
1. Install Python packages
Install the Python package hana_ml, which is not pre-installed on Google Colaboratory.
As for pandas and scikit-learn, I used the pre-installed versions.
!pip install hana_ml
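If you want to reproduce the environment above exactly, you can pin the hana_ml version listed in the Environment section instead (a sketch; the version string is the one from that list):
!pip install "hana_ml==2.14.22091801"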
2. Import modules
Import python package modules.
import pprint
from hana_ml.algorithms.apl.apl_base import get_apl_version
from hana_ml.algorithms.apl.gradient_boosting_classification import GradientBoostingBinaryClassifier
from hana_ml.algorithms.pal.partition import train_test_val_split
from hana_ml.dataframe import ConnectionContext, create_dataframe_from_pandas
from hana_ml.model_storage import ModelStorage
from hana_ml.visualizers.unified_report import UnifiedReport
import pandas as pd
from sklearn.datasets import make_classification
3. Connect to HANA Cloud
Connect to HANA Cloud and check its version.
The ConnectionContext class handles the connection to HANA. You can check the APL version with the get_apl_version function.
HOST = '<HANA HOST NAME>'
SCHEMA = USER = '<USER NAME>'
PASS = '<PASSWORD>'
conn = ConnectionContext(address=HOST, port=443, user=USER,
                         password=PASS, schema=SCHEMA)
print(conn.hana_version())
# APL.Version.ServicePack in the output below is the APL version
print(get_apl_version(conn))
4.00.000.00.1660640318 (fa/CE2022.16)
name value
0 APL.Version.Major 4
1 APL.Version.Minor 400
2 APL.Version.ServicePack 2209
3 APL.Version.Patch 1
4 APL.Info Automated Predictive Library
5 AFLSDK.Version.Major 2
6 AFLSDK.Version.Minor 16
7 AFLSDK.Version.Patch 0
8 AFLSDK.Info 2.16.0
9 AFLSDK.Build.Version.Major 2
10 AFLSDK.Build.Version.Minor 13
11 AFLSDK.Build.Version.Patch 0
12 AutomatedAnalytics.Version.Major 10
13 AutomatedAnalytics.Version.Minor 2209
14 AutomatedAnalytics.Version.ServicePack 1
15 AutomatedAnalytics.Version.Patch 0
16 AutomatedAnalytics.Info Automated Analytics
17 HDB.Version 4.00.000.00.1660640318
18 SQLAutoContent.Date 2022-04-19
19 SQLAutoContent.Version 4.400.2209.1
20 SQLAutoContent.Caption Automated Predictive SQL Library for Hana Cloud
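As a side note, once connected you can also check that the script server mentioned in the Environment section is actually running, for example by querying the system view M_SERVICES. A minimal sketch, assuming your user is allowed to read M_SERVICES:
# A sketch: verify that the script server (required for APL) is active.
services = conn.sql(
    "SELECT SERVICE_NAME, ACTIVE_STATUS FROM M_SERVICES "
    "WHERE SERVICE_NAME = 'scriptserver'"
).collect()
print(services)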
4. Create test data
Create test data using scikit-learn.
There are 3 features and 1 target variable.
def make_df():
    X, y = make_classification(n_samples=1000,
                               n_features=3, n_redundant=0)
    df = pd.DataFrame(X, columns=['X1', 'X2', 'X3'])
    df['CLASS'] = y
    return df
df = make_df()
print(df)
df.info()
Here is the dataframe overview.
X1 X2 X3 CLASS
0 0.964229 1.995667 0.244143 1
1 -1.358062 -0.254956 0.502890 0
2 1.732057 0.261251 -2.214177 1
3 -1.519878 1.023710 -0.262691 0
4 4.020262 1.381454 -1.582143 1
.. ... ... ... ...
995 -0.247950 0.500666 -0.219276 1
996 -1.918810 0.183850 -1.448264 0
997 -0.605083 -0.491902 1.889303 0
998 -0.742692 0.265878 -0.792163 0
999 2.189423 0.742682 -2.075825 1
[1000 rows x 4 columns]
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 4 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 X1 1000 non-null float64
1 X2 1000 non-null float64
2 X3 1000 non-null float64
3 CLASS 1000 non-null int64
dtypes: float64(3), int64(1)
memory usage: 31.4 KB
5. Define table and upload data
Define a HANA table and upload the data using the function “create_dataframe_from_pandas”.
The function is very useful, since it defines the table and uploads the data at the same time. Please check its options for further details.
TRAIN_TABLE = 'PAL_TRAIN'
dfh = create_dataframe_from_pandas(conn, df, TRAIN_TABLE,
                                   schema=SCHEMA,
                                   force=True,    # True: truncate and insert
                                   replace=True)  # True: Null is replaced by 0
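The returned dfh is a hana_ml DataFrame pointing at the new table, so you can quickly sanity-check the upload. A small sketch using standard hana_ml DataFrame methods:
# A sketch: quick check of the uploaded table.
print(dfh.count())            # number of rows in PAL_TRAIN
print(dfh.head(3).collect())  # fetch the first rows back into pandas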
6. Split data into train and test dataset
Split the dataset using the function “train_test_val_split”. The function needs a key column, so I added one using the function “add_id”.
train, test, _ = train_test_val_split(dfh.add_id(),
                                      testing_percentage=0.2,
                                      validation_percentage=0)
print(f'Train shape: {train.shape}, Test Shape: {test.shape}')
Train shape: [800, 5], Test Shape: [200, 5]
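For a classification target you can also split in a stratified way, so the class ratio stays the same in both partitions. A sketch, assuming the partition_method and stratified_column parameters of train_test_val_split:
# A sketch: stratified split on the target column CLASS.
train_s, test_s, _ = train_test_val_split(dfh.add_id(),
                                          partition_method='stratified',
                                          stratified_column='CLASS',
                                          testing_percentage=0.2,
                                          validation_percentage=0)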
7. Training
Train a gradient boosting model using the class “GradientBoostingBinaryClassifier”. Please note that the class AutoClassifier is deprecated.
model = GradientBoostingBinaryClassifier()
model.fit(train, label='CLASS', key='ID', build_report=True)
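Here APL picks up the feature columns automatically, since the table only contains the key, the three features, and the label. If you want to be explicit, fit also accepts a features list (a sketch of the same training call on the same data):
# A sketch: the same training call with the feature columns listed explicitly.
model.fit(train, key='ID', features=['X1', 'X2', 'X3'], label='CLASS',
          build_report=True)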
8. Training result
8.1. Unified Report
The model report is shown with the code below. Please see another article, “Python hana_ml: PAL Classification Training (UnifiedClassification)”, for the report content, which is basically the same.
model.generate_notebook_iframe_report()
model.generate_html_report('apl')
8.2. Score
The score function returns the mean average accuracy.
# score: mean average accuracy. cannot output other metrics
score = model.score(test)
print(score)
8.3. Summary
The get_summary function returns the model summary.
model.get_summary().deselect('OID').collect()
8.4. Metrics
The get_performance_metrics function returns the metrics information.
>> pprint.pprint(model.get_performance_metrics())
{'AUC': 0.991,
'BalancedClassificationRate': 0.964590677634156,
'BalancedErrorRate': 0.03540932236584404,
'BestIteration': 69,
'ClassificationRate': 0.9646017699115044,
'CohenKappa': 0.9291813552683117,
'GINI': 0.4823,
'KS': 0.9195,
'LogLoss': 0.12414480396790141,
'PredictionConfidence': 0.991,
'PredictivePower': 0.982,
'perf_per_iteration': {'LogLoss': [0.617163,
0.554102,
0.499026,
<omit>
0.125448,
0.125588]}}
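The perf_per_iteration entry holds the LogLoss for each boosting iteration, so you can visualize the training curve. A sketch with matplotlib (pre-installed on Colaboratory):
# A sketch: plot the per-iteration LogLoss from get_performance_metrics().
import matplotlib.pyplot as plt

logloss = model.get_performance_metrics()['perf_per_iteration']['LogLoss']
plt.plot(range(1, len(logloss) + 1), logloss)
plt.xlabel('Iteration')
plt.ylabel('LogLoss')
plt.show()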
8.5. Statistical Report
The get_debrief_report function returns several types of statistical reports. Please see “Statistical Reports” in the SAP HANA APL Reference Guide.
reports = ['Statistics_Partition',
'Statistics_Variables',
'Statistics_CategoryFrequencies',
'Statistics_GroupFrequencies',
'Statistics_ContinuousVariables',
'ClassificationRegression_VariablesCorrelation',
'ClassificationRegression_VariablesContribution',
'ClassificationRegression_VariablesExclusion',
'Classification_BinaryClass_ConfusionMatrix']
for report in reports:
    print('\n' + report)
    display(model.get_debrief_report(report).deselect('Oid').head(3).collect())
8.6. Indicators
The get_indicators function returns all indicators in a unified format.
model.get_indicators().collect()
8.7. Model info
The get_model_info function returns several types of reports.
for model_info in model.get_model_info():
    print('\n', model_info.source_table['TABLE_NAME'])
    display(model_info.deselect('OID').head(3).collect())
9. Predict
You can predict with the predict function.
>> model.set_params(extra_applyout_settings={'APL/ApplyExtraMode': 'Individual Contributions'})
>> apply_out = model.predict(test)
>> print(apply_out.head(3).collect())
ID TRUE_LABEL PREDICTED gb_score_CLASS gb_contrib_X1 gb_contrib_X2 gb_contrib_X3 gb_contrib_constant_bias
0 12 0 0 2.592326 -0.222146 3.193908 -0.383197 0.003759
1 13 1 1 -4.876161 0.141867 -4.717393 -0.304394 0.003759
2 19 1 1 -4.074210 0.433828 -4.438335 -0.073464 0.003759
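Since the applyout contains both TRUE_LABEL and PREDICTED, you can also compute test metrics that the score function in 8.2 does not return, for example with scikit-learn. A sketch using the column names from the output above:
# A sketch: extra test metrics computed client-side with scikit-learn.
from sklearn.metrics import accuracy_score, confusion_matrix

result = apply_out.collect()  # bring the applyout into pandas
print(accuracy_score(result['TRUE_LABEL'], result['PREDICTED']))
print(confusion_matrix(result['TRUE_LABEL'], result['PREDICTED']))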
10. Save model
Save the model with the class “ModelStorage” and its function “save_model”.
ms = ModelStorage(conn)
# ms.clean_up()
model.name = 'My classification model name'
ms.save_model(model, if_exists='replace')
You can see the saved model.
# display(ms.list_models())
pprint.pprint(ms.list_models().to_dict())
{'CLASS': {0: 'hana_ml.algorithms.apl.gradient_boosting_classification.GradientBoostingBinaryClassifier'},
'JSON': {0: '{"model_attributes": {"name": "My classification model name", '
'"version": 1, "log_level": 8, "model_format": "bin", "language": '
'"en", "label": "CLASS", "auto_metric_sampling": false}, '
'"fit_params": {}, "artifacts": {"schema": "I348221", '
'"model_tables": ["HANAML_APL_MODELS_DEFAULT"], "library": '
'"APL"}, "pal_meta": {}}'},
'LIBRARY': {0: 'APL'},
'MODEL_REPORT': {0: None},
'MODEL_STORAGE_VER': {0: 1},
'NAME': {0: 'My classification model name'},
'SCHEDULE': {0: '{"schedule": {"status": "inactive", "schedule_time": "every '
'1 hours", "pid": null, "client": null, "connection": '
'{"userkey": "your_userkey", "encrypt": "false", '
'"sslValidateCertificate": "true"}, "hana_ml_obj": '
'"hana_ml.algorithms.pal.xx", "init_params": {}, '
'"fit_params": {}, "training_dataset_select_statement": '
'"SELECT * FROM YOUR_TABLE"}}'},
'STORAGE_TYPE': {0: 'default'},
'TIMESTAMP': {0: Timestamp('2022-09-21 08:57:33')},
'VERSION': {0: 1}}
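To reuse the saved model later, for example in a new session, you can load it back with load_model. A sketch, using the name and version saved above:
# A sketch: load the saved model and predict with it again.
saved_model = ms.load_model(name='My classification model name', version=1)
print(saved_model.predict(test).head(3).collect())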
11. Close connection
Last but not least, close the connection.
conn.close()