As Solution Advisors, we often need to create custom datasets to support customer opportunities.
We can create more engaging customer experiences when we have realistic datasets that closely resemble the customer's own data.
Ideally, we would be able to easily create a dataset of any size and to specify constraints on the data, such as matching data formats the customer may use or specifying the statistical distribution of the random values. It would also be nice to generate realistic-looking PII data in case we need to demonstrate data masking.
We can easily create such datasets in Python, and this blog will serve as a guide on how to use the Faker, numpy, and pandas libraries in Python to generate any dataset you need.
Once we create the datasets, we have a lot of flexibility with how we use them. For this demo, we’ll upload the newly created datasets to SAP HANA Cloud as tables.
Faker is a Python fake data generator
Faker is a Python library that generates fake data for you. It is useful for creating realistic-looking datasets and can generate many types of data. We’ll explore those most relevant for customer demos, but the documentation details all the “providers” of fake data available in the library.
We will also use the Python numpy library, since it allows us to create numeric fields (e.g. sales) based on a distribution or to randomly select from a list.
To begin, let’s make sure we have the necessary libraries installed. In addition to Faker and numpy, we’ll also need the handy pandas library. The hana_ml library will be used to upload the dataset we create to SAP HANA Cloud.
!pip install numpy
!pip install faker
!pip install pandas
!pip install hana_ml
Next, let’s instantiate the Faker library. For this demo, we’ll create an instance of Faker called fake and use that instance to generate all our fake data.
import pandas as pd
from faker import Faker
import numpy as np
fake = Faker()
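A quick tip before we start generating data: if you want reproducible output (for example, a demo script that produces the same “random” dataset on every run), both Faker and numpy can be seeded. Here is a minimal sketch; the seed value 42 is arbitrary:
# Optional: seed both generators for reproducible runs
Faker.seed(42)      # class-level seed shared by all Faker instances
np.random.seed(42)  # seeds numpy's global random generator
print(fake.first_name())  # now returns the same name on every run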
Once we have our instance, we can call any number of the fake data “providers” Faker includes. For example, we can easily generate 5 fake first names:
# First name
for _ in range(5):
    print(fake.first_name())
Faker has providers for many different attributes we might want on a fake “customer”; we generate each by calling the appropriate provider. For example:
# There are specific versions of these generators
# It can generate names
print('Male first names: ' + fake.first_name_male())
print('Female first names: ' + fake.first_name_female())
print('Last names: ' + fake.last_name())
print('Full names: ' + fake.name())
# Generate prefixes and suffixes (there are also gender specific versions e.g. prefix_female())
print('Prefix: ' + fake.prefix())
print('Suffix: ' + fake.suffix())
# Generate emails
print('Company emails: ' + fake.ascii_company_email())
print('Safe emails: ' + fake.ascii_safe_email())
print('Free emails: ' + fake.ascii_free_email())
print('ASCII Emails: ' + fake.ascii_email())
print('Emails: ' + fake.email())
If you prefer to create a company focused dataset, you can do that too.
# Company names
print('Company name: ' + fake.company())
print('Company suffix: ' + fake.company_suffix())
# Generate Address components
print('Street address: ' + fake.street_address())
print('Bldg #: ' + fake.building_number())
print('City: ' + fake.city())
print('Country: ' + fake.country())
print('Postcode: ' + fake.postcode())
# Or generate full addresses
print('Full address: ' + fake.address())
# Even generate motto, etc.
print('Catch phrase: ' + fake.catch_phrase())
print('Motto: ' + fake.bs())
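Remember the data masking use case from the intro? Faker can also generate realistic-looking (but entirely fake) PII. A small sketch using a few of the PII-related providers available in the default en_US locale:
# Generate realistic-looking fake PII
print('SSN: ' + fake.ssn())
print('Phone: ' + fake.phone_number())
print('Credit card: ' + fake.credit_card_number())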
Generate columns that match specific formats
If you need fake data that matches a specific format, such as a product code or an iPhone model, you can do that too:
# Use bothify to generate random numbers(#) or letters(?). Can limit the letters used with letters=
print(fake.bothify('PROD-??-##', letters='ABCDE'))
print(fake.bothify('iPhone-#'))
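bothify also has narrower siblings: numerify fills # placeholders with digits only, and hexify fills ^ placeholders with hexadecimal characters, which works nicely for patterns like MAC addresses. For example:
# numerify: '#' becomes a random digit
print(fake.numerify('ORDER-####'))
# hexify: '^' becomes a random hexadecimal character
print(fake.hexify('^^:^^:^^:^^:^^:^^'))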
Generate categorical columns based on probabilities/weights
For boolean columns, you can even specify the percentage chance that the random value will be “True”.
# Create fake True/False values
# Random True/False
print(fake.boolean())
# Specify % True
print(fake.boolean(chance_of_getting_true=25))
For categorical columns, you can specify a list of values to randomly choose from. Optionally, you can also specify the weights to give to each value if you don’t want each element in the list to have an equal chance of being selected.
import numpy as np
industry = ['Automotive','Health Care','Manufacturing','High Tech','Retail']
# Specify the probability of each category via p= (the weights must sum to 1.0)
weights = [0.6, 0.2, 0.1, 0.07, 0.03]
print(np.random.choice(industry, p=weights))
# Generating choice without weights (equal probability on all elements)
print(np.random.choice(industry))
Generate numeric columns centered around a distribution
For columns that represent values such as sales, you can draw numbers from a normal distribution by specifying its mean and standard deviation. Alternatively, you can generate random integers by specifying an exclusive upper bound.
# 1st argument is mean of distribution, 2nd is standard deviation
print(np.random.normal(1000, 100))
# Rounded result
print(round(np.random.normal(1000, 100)))
# Generate random integer between 0 and 4
print(np.random.randint(5))
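One detail worth knowing: the numpy calls above return a single value, but each also accepts a size argument, so you can generate an entire column in one call instead of looping. A minimal sketch reusing the industry list and weights from earlier:
# size= returns an array of values in a single call
print(np.random.normal(1000, 100, size=5))            # 5 draws from the distribution
print(np.random.choice(industry, size=5, p=weights))  # 5 weighted category picks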
Generate dates within a range
Dates or datetimes can be created in multiple ways. You can request a date within this century, decade, year, or month, or a date between two dates of your choosing.
print(fake.date_this_century().strftime('%m-%d-%Y'))
print(fake.date_this_decade().strftime('%m-%d-%Y'))
print(fake.date_this_year().strftime('%m-%d-%Y'))
print(fake.date_this_month().strftime('%m-%d-%Y'))
print(fake.time())
import pandas as pd
# Start and end dates to generate data
my_start = pd.to_datetime('01-01-2021')
my_end = pd.to_datetime('12-31-2021')
print(f'Random date between {my_start} & {my_end}')
print(fake.date_between_dates(my_start, my_end).strftime('%m-%d-%Y'))
You can even generate individual parts of dates, or dates relative to today:
print(fake.year())
print(fake.month())
print(fake.day_of_month())
print(fake.day_of_week())
print(fake.month_name())
print(fake.past_date('-1y'))
print(fake.future_date('+1d'))
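You can also combine relative offsets with a range: date_between accepts offsets like '-30d' or '+1y' in addition to explicit dates, so a window relative to today is a one-liner. The offsets below are just examples:
# A date within the last 30 days
print(fake.date_between(start_date='-30d', end_date='today'))
# A date within the next year
print(fake.date_between(start_date='today', end_date='+1y'))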
Let’s Put It All Together!
Now that we have some familiarity with how to use Faker and numpy to generate different types of columns, let’s put everything together to create a full dataset we can use.
Note: To create categorical columns based on a choice, you can also use Faker’s random_element method as an alternative to numpy’s random.choice when you don’t need to specify weights. Your choice! In the example below, I used both, for industry and industry2.
We’ll create a function that builds rows of fake data, call it to generate 5 rows, and save the result as a pandas DataFrame (df).
from faker import Faker
import numpy as np
import pandas as pd
industry = ['Automotive','Health Care','Manufacturing','High Tech','Retail']
fake = Faker()
def create_data(x):
    # Build one dictionary entry per row of fake data
    b_user = {}
    for i in range(0, x):
        b_user[i] = {}
        b_user[i]['name'] = fake.name()
        b_user[i]['job'] = fake.job()
        b_user[i]['birthdate'] = fake.date_of_birth(minimum_age=18, maximum_age=65)
        b_user[i]['email'] = fake.company_email()
        b_user[i]['company'] = fake.company()
        b_user[i]['industry'] = fake.random_element(industry)
        b_user[i]['city'] = fake.city()
        b_user[i]['state'] = fake.state()
        b_user[i]['zipcode'] = fake.postcode()
        b_user[i]['netNew'] = fake.boolean(chance_of_getting_true=65)
        b_user[i]['sales_rounded'] = round(np.random.normal(1000, 200))
        b_user[i]['sales_decimal'] = np.random.normal(1000, 200)
        b_user[i]['priority'] = fake.random_digit()
        b_user[i]['industry2'] = np.random.choice(industry)
    return b_user
df = pd.DataFrame(create_data(5)).transpose()
df.head(5)
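At this point the result is an ordinary pandas DataFrame, so before we upload it anywhere, note that you can just as easily keep a local copy. The filename here is arbitrary:
# Save a local copy of the generated dataset
df.to_csv('fake_customers.csv', index=False)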
Multi-Country Support! Localize to other Countries/Languages
Saving the best for last, I think the coolest thing about the Faker library is its ability to generate fake datasets in any localization. Although Faker uses US English by default, we can easily set the localization when we initialize the library. We can even specify multiple localizations if we want to generate datasets that are truly multi-lingual.
For example, we can generate names in a number of locales by specifying the locale code:
fake = Faker('en_US')
print(fake.name())
fake = Faker('ja_JP')
print(fake.name())
fake = Faker('ru_RU')
print(fake.name())
fake = Faker('it_IT')
print(fake.name())
fake = Faker('de_DE')
print(fake.name())
fake = Faker('pt_BR')
print(fake.name())
Generate multi-lingual datasets easily
We can generate datasets in any language by specifying the language codes when instantiating Faker.
# Instantiate Faker with multiple locales
fake = Faker(['en_US','de_DE','pt_BR','ja_JP','zh_CN'])
This modified code will create 1000 profiles across English, German, Portuguese, Japanese, and Chinese.
from faker import Faker
import numpy as np
import pandas as pd
industry = ['Automotive','Health Care','Manufacturing','High Tech','Retail']
# Instantiate Faker with multiple locales
fake = Faker(['en_US','de_DE','pt_BR','ja_JP','zh_CN'])
def create_data(x):
    # Build one dictionary entry per row of fake data
    b_user = {}
    for i in range(0, x):
        b_user[i] = {}
        b_user[i]['name'] = fake.name()
        b_user[i]['job'] = fake.job()
        b_user[i]['birthdate'] = fake.date_of_birth(minimum_age=18, maximum_age=65)
        b_user[i]['email'] = fake.company_email()
        b_user[i]['company'] = fake.company()
        b_user[i]['industry'] = fake.random_element(industry)
        b_user[i]['city'] = fake.city()
        b_user[i]['state'] = fake.state()
        b_user[i]['zipcode'] = fake.postcode()
        b_user[i]['netNew'] = fake.boolean(chance_of_getting_true=65)
        b_user[i]['sales_rounded'] = round(np.random.normal(1000, 200))
        b_user[i]['sales_decimal'] = np.random.normal(1000, 200)
        b_user[i]['priority'] = fake.random_digit()
        b_user[i]['industry2'] = np.random.choice(industry)
    return b_user
df = pd.DataFrame(create_data(1000)).transpose()
df.head(10)
The resulting dataset will contain data drawn from all of the locales listed.
Upload our Dataset to SAP HANA Cloud
Since we’re already in Python, let’s leverage the hana_ml library to connect to SAP HANA Cloud and upload our newly created dataset as a table.
# Create connection to HANA Cloud
import hana_ml.dataframe as dataframe
# Instantiate connection object
conn = dataframe.ConnectionContext(address = '<Your HANA tenant info.hanacloud.ondemand.com>',
                                   port = 443,
                                   user = '<USERNAME>',
                                   password = '<PASSWORD>',
                                   encrypt = 'true',
                                   sslValidateCertificate = 'false')
# Display HANA version to test connection
print('HANA version: ' + conn.hana_version())
# Print APL version to confirm PAL/APL are enabled
import hana_ml.algorithms.apl.apl_base as apl_base
v = apl_base.get_apl_version(conn)
v.head(5)
# Upload Pandas dataframe to HANA Cloud
dataframe.create_dataframe_from_pandas(connection_context = conn,
                                       pandas_df = df,
                                       table_name = 'FAKER',
                                       force = True)
The resulting table, including the multi-language records, is now in SAP HANA Cloud.
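As a quick sanity check, we can read a few rows back through the same connection. This is a minimal sketch assuming the FAKER table landed in the connecting user's default schema; df_check is just an illustrative variable name:
# Read a few rows back from the new table to confirm the upload
df_check = conn.table('FAKER').head(5).collect()
print(df_check)
# Close the connection when finished
conn.close()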
I hope this blog was helpful in learning more about how Python and its powerful libraries let you easily create flexible datasets to support your customer engagements. Thanks for your interest in this topic! Please let me know if you have any comments or questions in the Q&A section below.