SAP HANA
Jupyter Notebook
Jupyter Notebook, along with the R programming language, is a tool of choice for tackling data science challenges, and Pandas DataFrames are used in it all the time. Strictly speaking, Jupyter Notebook only provides the graphical web interface that makes interactive programming in Python possible; the actual work is done by libraries such as Pandas, NumPy and Matplotlib. Pandas provides the popular “DataFrame” class, which fits the needs data scientists typically have in their work. These three Python libraries integrate well with each other and together form a powerful toolkit.
DBeaver
Example
Let’s start with the demo! I will introduce the solution as a step-by-step guide that you can follow and test on your own. If you have an existing SAP HANA instance with credentials, Jupyter Notebook on your computer and DBeaver, you can clone my repository, execute the notebook and follow the same steps I take here.
I downloaded and used the CSV source of this dataset. In Pandas, reading machine-readable file formats like XML, JSON or CSV is a one-liner.
import pandas as pd
import numpy as np
df = pd.read_csv('bevolkerungsbestand-in-mannheim-2009-2020.csv', sep=';')
df.head()
The first block declares the dependencies. The second block reads the CSV file into a DataFrame. The third block outputs the first rows of the dataset so you can check that everything was read correctly.
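Before persisting anything, it pays off to inspect what Pandas inferred from the CSV. The following sketch uses a small hypothetical DataFrame standing in for the Mannheim CSV (the real file has different columns); the checks themselves apply unchanged to the frame loaded above.

```python
import pandas as pd

# Hypothetical stand-in for the Mannheim population CSV: the real file
# and its column names differ, this is just for illustration.
df = pd.DataFrame({
    'Jahr': [2009, 2010, 2011],
    'Stadtteil': ['Innenstadt', 'Neckarau', 'Lindenhof'],
    'Einwohner': [10234, 8567, 7321],
})

# Quick sanity checks before persisting:
print(df.shape)    # (rows, columns)
print(df.dtypes)   # inferred dtypes, which to_sql later maps to SQL types
print(df.head())
```

The dtypes are worth a glance because they determine the column types that “to_sql” will create in the target table.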
Now comes the exciting part. I have to admit I experimented for a while until I figured out how it works: SAP HANA has security requirements that make certain connection parameters mandatory. That is also the real reason I am sharing this post: if you are working on a similar problem, you can use this example as your quickstart.
My requirement was to use the “to_sql” interface provided by DataFrame to persist the dataset. This way you do not have to worry about the data types and you reach the target faster.
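To see what “to_sql” does independently of HANA, here is a minimal round-trip sketch using an in-memory SQLite engine as a stand-in for the HANA connection; the DataFrame API calls are identical, only the connection string differs. The table name and the population figures are made-up placeholders.

```python
import pandas as pd
from sqlalchemy import create_engine

# In-memory SQLite engine as a stand-in for the HANA engine;
# the to_sql / read_sql calls are the same for both.
engine = create_engine('sqlite://')

df = pd.DataFrame({'year': [2009, 2010], 'population': [311342, 313174]})

with engine.connect() as conn:
    # if_exists='replace' drops and recreates the table on every run
    df.to_sql('MannheimTest', conn, index=False, if_exists='replace')
    # Read the table back to verify the round trip
    roundtrip = pd.read_sql('SELECT * FROM MannheimTest', conn)

print(roundtrip)
```

Note that “to_sql” derives the column types from the DataFrame’s dtypes, which is exactly why you do not have to define the target table yourself.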
from hdbcli import dbapi          # HANA driver, used by the dialect
from sqlalchemy import create_engine

address = ''    # host name of your HANA instance
port = '443'
user = ''
password = ''
For seamless integration, the libraries “hdbcli” and “sqlalchemy” are needed, together with the “sqlalchemy-hana” dialect that makes the “hana://” connection URLs available. You can install all of them with the Python package manager pip. Then add the credentials of your HANA instance to the corresponding variables: port, address, user and password.
connection_string = 'hana://%s:%s@%s:%s' % (user, password, address, port)
ssl_args = {'encrypt': 'True', 'sslValidateCertificate': 'False'}
print(connection_string)  # careful: this prints the password in clear text
The connection does not work without SSL, so the SSL arguments must be passed along with the connection information. In this example that happens via the sqlalchemy library.
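One pitfall with this URL scheme: if the user name or password contains characters like “@”, “:” or “/”, the connection string is misparsed. A small sketch with hypothetical placeholder credentials shows how to percent-encode them with the standard library before building the URL:

```python
from urllib.parse import quote_plus

# Hypothetical placeholder credentials for illustration only
user = 'DEMO_USER'
password = 'p@ss:word/123'

# Percent-encode user and password so special characters such as
# @, : and / do not break the URL parsing.
connection_string = 'hana://%s:%s@%s:%s' % (
    quote_plus(user), quote_plus(password), 'example.hana.host', 443)

print(connection_string)
```

If your credentials contain only plain letters and digits, the encoding is a no-op, so it is safe to apply it in general.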
engine = create_engine(connection_string, connect_args=ssl_args)
connection = engine.connect()
df.to_sql('MannheimTest', connection, index=False, if_exists='replace')
You create a connection and pass it to the “to_sql” command. sqlalchemy prints a warning message, which can be ignored. That’s it! It is this simple to persist a DataFrame in SAP HANA.
Check persisted data
Now let us check with DBeaver whether everything was persisted correctly. I open the table under the corresponding connection.
Voilà! This confirms that everything worked fine.
Next Steps
What I have presented here is of course only the prerequisite for many other potential use cases. For fun, I persisted the table created in the above example to the HANA instance underlying an SAP Data Warehouse Cloud tenant and built a small data pipeline.
The use case I needed this solution for is a data pipeline where I couldn’t automate the data source to SAP Data Warehouse Cloud (DWC) because it was sitting around in multiple Excel sheets and the data was changing regularly. Each time I run the Notebook, my local table in SAP DWC is updated, making this a semi-automated solution for former local datasets (e.g. in Excel, CSV). Highly relevant if you are working in a transformation project.
In the notebook I merge several Excel files from my file system and persist the result in the DWC HANA table. In DWC I can then perform further operations using Graphical Views or Data Flows.
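The merge step can be sketched in a few lines. In the real pipeline the frames come from pd.read_excel calls on the individual files; here two hypothetical frames with placeholder figures stand in for those sheets:

```python
import pandas as pd

# In the real pipeline these come from pd.read_excel('...') per file;
# the frames and numbers here are illustrative placeholders.
sheet_2019 = pd.DataFrame({'year': [2019], 'population': [309721]})
sheet_2020 = pd.DataFrame({'year': [2020], 'population': [310658]})

# Stack the sheets into a single frame; afterwards it can be
# persisted with to_sql exactly like the CSV example above.
merged = pd.concat([sheet_2019, sheet_2020], ignore_index=True)
print(merged)
```

This assumes the sheets share the same columns; if they differ, pd.concat fills the gaps with NaN, which you would want to clean up before persisting.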
Finally, I visualized Mannheim’s population data in SAP Analytics Cloud using the existing live connectors. This demonstrates a data-science use case based on sample data. I do not know yet whether I will write a separate post to go into more detail. But for now, that’s it!
Conclusion
The ability to work on SAP HANA from Jupyter Notebook is a powerful tool with a lot of potential. In the context of data science, it seems to me that this potential has not been explored sufficiently yet. Have fun exploring use cases with Jupyter Notebook and SAP HANA!
Further Read
- Database and Data Management: Find latest content on SAP HANA here.
- SAP HANA Questions and Answers: Ask questions or find answers here.
- Blogs about SAP HANA: Find more interesting blog entries here.
I hope you liked my first blogpost on SAP Blogs. Feel free to provide me some feedback and ideas on how to improve my content. That’s highly appreciated!
Feel encouraged to comment, ask questions and be curious. Did you face similar challenges in your data-science activities? Is this content useful for you? I plan to share more about my data-science activities, so follow me on SAP Blogs for more content like this.