Dataflow – Python Script Item

This blog-post will introduce a helper that facilitates the development and debugging of python script items of SAP Data Warehouse Cloud’s Data Flow.

Introduction

Assuming the following simple data flow, where the script item should process the Customers data:

Upon creation, the follow sample code snipped is added to the script item. (You find the source code on the details tab. Pressing the edit symbol, will open the full screen editor):

def transform(data):
    """
    :param data: Pandas DataFrame
    :return: Pandas DataFrame
    """
    #####################################################
    # Provide the function body for data transformation #
    #####################################################
    return data

 

For my example, I will use a script that adds a row the data parameter.

def transform(data):
    # adding a row
    newRow = {'Col1':'Text-NewRow','Col2': 6}
    data = data.append(newRow, ignore_index=True)
    return data

Development Helper – Overview

At this point of time of development you typically like to perform a test run and verify the script implementation. This requires test data and debugging capabilities.

While test data is available from the source table, we see that the debugging capabilities are behind of what is offered by client tools like e.g. MS Visual Studio Code.

Here the helper code will step in: it simulates the framework of SAP Data Warehouse Cloud and allows testing of your python code using MS Visual Studio Code.

Known Limitations:

  • Some objects and functions of python are not allowed in the context of the script item (e.g. input/output or the breakpoint statement. See documentation for full details). This deny list is not considered in the development helper.
  • The same holds true for the NumPy and Panda libraries.
  • There is no packaging of data implemented.

Here is an overview of what the development helper is doing: It converts your local test data into a pandas DataFrame object. This is the technical type of the data parameter passed into your script.

""" parameter data contains local test data """
def transform(data):

Once your script has done its work, it will return the result as pandas DataFrame object. This is the object type SAP Data Warehouse Cloud is expecting. Based on this result, you could validate your script using standard means.

Development Helper – The implementation

You find an identical function definition for transforming your data as used by SAP Data Warehouse Cloud. Here is the place for your script that you like to validate. Once it works as expected in your local environment, copy & paste it to the script editor and give it a final test.

For the sample data, two options have been implemented – one for static data defined in the source code, one for accessing a CSV file containing the data.

Hint: You could download the (in my example) customer data into a CSV file, by using the SAP HANA Database Explorer . Perform a select * from view and chose the download option from the result table.

Note: You need to install the pandas library before you can run the helper.

#install via terminal/command: pip install pandas
import pandas as pd 

# This definition needs to be copied to DWC:
# -----------------------------------------------------------
# def - start

def transform(data):

    # adding a row
    newRow = {'Col1':'Text-NewRow','Col2': 6}
    data = data.append(newRow, ignore_index=True)

    return data

# def - end
# -----------------------------------------------------------

# -----------------------------------------------------------
# Create Sample Data
# >> Please uncomment according to your source <<
# -----------------------------------------------------------

# source: local test data - hard coded
sampleData = {
    'Col1': ['Text1', 'Text2'],
    'Col2': [2, 4]
    }

# source: local test data - import local csv CSV File
# filepath        = './DWC_Panda_CSV_Test.csv'
# sampleData      = pd.read_csv(filepath, delimiter=';')

sampleDf   = pd.DataFrame(data=sampleData)

# Show the first 10 rows of the data input
print ('n','Input Data:')
print (sampleDf.head(10))

# Call the transformation
resultDf = transform(sampleDf)

# Show the first 10 rows of the result
print ('n','Output Data:')
print (resultDf.head(10),'n')

Conclusion

Using this helper reduces the development and test effort when writing scripts for the data flow.

Take this sample and tailor and enrich it to your daily needs.

I’d be happy to hear your thoughts, ideas and comments on this helper. Let me know in the comments!

Sara Sampaio

Sara Sampaio

Author Since: March 10, 2022

0 0 votes
Article Rating
Subscribe
Notify of
0 Comments
Inline Feedbacks
View all comments
0
Would love your thoughts, please comment.x
()
x