“Have you ever wished you could wake up from a good night's sleep and ask an AI agent questions about thousands of selected pages?” 🙂

I wrote the blog post “‘Hello, world!’ your crafted chat GPT bot!” about how to use the OpenAI API to submit completions and ask follow-up questions. Those questions are restricted to the content OpenAI was trained on, and that is not always enough, as we often want to extend the capabilities with up-to-date online content or local content.

Asking questions about additional content requires data augmentation. Let's see what is possible nowadays with the OpenAI API and LlamaIndex.

Table of contents

Simple content embedding
SAP Machine Learning Embedding in OpenAI
  Collect HTML from URLs
  Collect Notebooks with git
  Collect HTML from Notebooks
  Collect HTML to TXT
  SAP HANA Machine Learning Challenge Embedding in OpenAI
Conclusions

Simple content embedding

I have created a class llama_context() with methods to prepare the structure of folders required for LlamaIndex, estimate the costs, create a vector index, start the query engine, and ask questions about content embedded in OpenAI. The entire code with links to resources is on GitHub 00 Simple content embedding.
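Running the examples assumes that the llama-index and openai packages are installed and that an OpenAI API key is available to the process; a minimal setup sketch (package names and the environment variable are those current at the time of writing):

# pip install llama-index openai
import os

# Your own key from https://platform.openai.com/account/api-keys
os.environ["OPENAI_API_KEY"] = "sk-..."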

class llama_context():
    def __init__(self, path=None):
        # code
    def load_data(self):
        self.documents = SimpleDirectoryReader(self.data_dir).load_data()
        print(f"Documents loaded: {len(self.documents)}.")
    def create_vector_store(self):
        self.index = GPTVectorStoreIndex.from_documents(self.documents)
        print("GPTVectorStoreIndex complete.")
    def save_index(self):
        self.index.storage_context.persist(persist_dir=self.perisit_dir)
        print(f"Index saved in path {self.perisit_dir}.")
    def load_index(self):
        storage_context = StorageContext.from_defaults(persist_dir=self.perisit_dir)
        self.index = load_index_from_storage(storage_context)
    def start_query_engine(self):
        self.query_engine = self.index.as_query_engine()
        print("Query_engine started.")
    def post_question(self, question, sleep=None):
        # code
    def del_data_dir(self):
        # code
    def copy_file_to_data_dir(self, file_extension='.txt', verbose=0):
        # code
    def copy_path_from_to_data_dir(self, path_from, file_extension='.txt', verbose=0):
        # code
    def estimate_tokens(self, text):
        # code
    def estimate_cost(self):
        # code
# Create lct object from class llama_context() with the working path
path_llama = "llama_mvp"
lct = llama_context(path=path_llama)
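
The # code bodies above are kept short on purpose; the full implementation is on GitHub. As an illustration, a minimal sketch of what __init__ and post_question might look like, assuming the data and storage sub-folders shown later and that post_question simply forwards the string to the query engine:

import os
import time

class llama_context_sketch():
    def __init__(self, path=None):
        self.path = path
        self.data_dir = os.path.join(path, "data")        # documents to embed
        self.perisit_dir = os.path.join(path, "storage")  # persisted vector index
    def post_question(self, question, sleep=None):
        if sleep:
            time.sleep(sleep)  # optional pause between questions (rate limiting)
        self.response = self.query_engine.query(question)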

 

# Delete data directory
lct.del_data_dir()

 

# Copy files from source to data directory
path_from = "llama_mvp/source"
lct.copy_path_from_to_data_dir(path_from) # default extension *.txt
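
The method copy_path_from_to_data_dir() is elided in the skeleton above; a minimal sketch of the idea, assuming it simply copies every file with the given extension into the data folder read by SimpleDirectoryReader (the standalone name and exact behaviour are assumptions, not the GitHub implementation):

import os
import shutil

def copy_path_from_to_data_dir_sketch(path_from, data_dir, file_extension='.txt'):
    # Copy all files with the given extension from path_from into data_dir
    os.makedirs(data_dir, exist_ok=True)
    for root, _, files in os.walk(path_from):
        for name in files:
            if name.endswith(file_extension):
                shutil.copy2(os.path.join(root, name), os.path.join(data_dir, name))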

The folder contains the file mvp.txt with the content “Bogdan was born in 1990”.

# Load documents
# Content "Bogdan was born in 1990"
lct.load_data()

 

# Vector create does embedding and costs tokens
lct.create_vector_store()

# Out:
# INFO:llama_index.token_counter.token_counter:> [build_index_from_nodes] Total LLM token usage: 0 tokens
# > [build_index_from_nodes] Total LLM token usage: 0 tokens
# INFO:llama_index.token_counter.token_counter:> [build_index_from_nodes] Total embedding token usage: 7 tokens
# > [build_index_from_nodes] Total embedding token usage: 7 tokens
# GPTVectorStoreIndex complete.

 

# Save index
lct.save_index()

 

# Method load_index() restores the index saved by create_vector_store(),
# so you don't need to upload and embed the data every time
# The index is the content knowledge
lct.load_index()

Starting the query engine with the content knowledge stored in the vector index. 🧠

# Start query engine
lct.start_query_engine()

We are ready to ask questions. 🧐

question = "What is content about?"
lct.post_question(question)
print(lct.response)
# Out:
# The content is about Bogdan and the year he was born.
question = "How old is he?"
# Out:
# Bogdan is 30 years old.
question = "What date is today?"
# Out:
# Today's date is August 8, 2020.
from datetime import date
today = date.today()
question = f"Consider current date {today}"
# Out:
# Consider current date 2023-05-15
# Bogdan is 33 years old.
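
The date hint and the follow-up question can also be combined in a single prompt; a short sketch, assuming post_question() passes the string unchanged to the query engine:

from datetime import date

today = date.today()
question = f"Consider current date {today}. How old is Bogdan?"
lct.post_question(question)
print(lct.response)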

question = "Where is the name commonly used as a given name?"
# Out:
# The name Bogdan is commonly used as a given name in Eastern European countries such as Romania, Bulgaria, and Ukraine.

 

As you can see, the AI agent is aware of the augmented content, the updated information about the current date, and further public content about countries.

SAP Machine Learning Embedding in OpenAI

Now it is time to move to SAP Machine Learning Embedding in OpenAI. The laborious part is collecting, storing, and converting data from various sources and formats.

In this experiment, I will collect data from URLs in HTML format and from GitHub notebooks in IPYNB format, then convert the data to raw TXT format. 🧪⚗️💎

Collect HTML from URLs

Collected URLs:

Blogs:
https://blogs.sap.com/2022/11/07/sap-community-call-sap-hana-cloud-machine-learning-challenge-i-quit-how-to-prevent-employee-churn/
https://blogs.sap.com/2022/11/28/i-quit-how-to-predict-employee-churn-sap-hana-cloud-machine-learning-challenge/
https://blogs.sap.com/2022/12/22/sap-hana-cloud-machine-learning-challenge-2022-the-winners-are/

https://blogs.sap.com/2023/01/09/sap-hana-cloud-machine-learning-challenge-i-quit-understanding-metrics/

Documentation:
https://help.sap.com/doc/1d0ebfe5e8dd44d09606814d83308d4b/2.0.04/en-US/hana_ml.dataframe.html
https://help.sap.com/doc/1d0ebfe5e8dd44d09606814d83308d4b/2.0.07/en-US/pal/algorithms/hana_ml.algorithms.pal.trees.HybridGradientBoostingClassifier.html

Collection of HTML from URLs is performed with the class collect_html(). The entire code with links to resources is on GitHub 01 Collect html from urls.

import urllib.request
import os

class collect_html():
    def __init__(self):
        pass
    def read_save_html(self, url, path_save = None, filename = None, mode = 0):
        # mode: 0 - save, 1 - content, 2 - save and content
        response = urllib.request.urlopen(url)
        html_file = response.read()
        # code
        if mode == 1 or mode == 2:
            return html_file

     # code
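
A hypothetical usage of the collector with one of the blog URLs listed above (path_save and filename are assumptions about the folder layout, chosen to match the file names listed below):

# Collect one blog post as HTML (illustrative parameters)
ch = collect_html()
url = "https://blogs.sap.com/2023/01/09/sap-hana-cloud-machine-learning-challenge-i-quit-understanding-metrics/"
path_save = "llama_challenge/html_challenge"
ch.read_save_html(url, path_save=path_save, filename="understanding_metrics_blog.html", mode=0)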

 

# List html files
repo_path = path_save
list_ipynb(repo_path, "html")

# URLs content is stored in files:
# llama_challenge\html_challenge\understanding_metrics_blog.html
# llama_challenge\html_challenge\challenge_20221107.html
# llama_challenge\html_challenge\challenge_20221128.html
# llama_challenge\html_challenge\challenge_20221222.html
# llama_challenge\html_challenge\hana_ml.dataframe.html
# llama_challenge\html_challenge\hana_ml.algorithms.pal.trees.HybridGradientBoostingClassifier.html
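
The helper list_ipynb() used here and below is not shown; a minimal sketch, assuming it simply walks the path and prints every file with the given extension:

import os

def list_ipynb(repo_path, file_extension):
    # Print all files under repo_path ending with the given extension
    for root, _, files in os.walk(repo_path):
        for name in files:
            if name.endswith("." + file_extension):
                print(os.path.join(root, name))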

 

Collect Notebooks with git

Collected repositories from GitHub:

https://github.com/itsergiu/sapcommunity-hanaml-challenge
https://github.com/SAP-samples/hana-ml-samples

You have to install Git (from git-scm.com) to run it in notebooks. The entire code with links to resources is on GitHub 02 Collect notebooks ipynb.

# https://github.com/SAP-samples/hana-ml-samples
# https://github.com/SAP-samples/hana-ml-samples/tree/main/Python-API/usecase-examples/sapcommunity-hanaml-challenge
# folder = "Python-API/usecase-examples/sapcommunity-hanaml-challenge"

REPO_URL = "https://github.com/itsergiu/sapcommunity-hanaml-challenge"
DOCS_FOLDER = "llama_challenge/ipynb_blog"
!git clone $REPO_URL $DOCS_FOLDER

REPO_URL = "https://github.com/SAP-samples/hana-ml-samples"
DOCS_FOLDER = "llama_challenge/ipynb_hana_ml_samples"
!git clone $REPO_URL $DOCS_FOLDER

 

repo_path = "ipynb_blog"
list_ipynb(repo_path, "ipynb")
# ipynb_blog\SAP HANA ML challendge - CHURN  v2.3 max.ipynb

 

repo_path = "ipynb_hana_ml_samples/Python-API/usecase-examples/sapcommunity-hanaml-challenge"
list_ipynb(repo_path, "ipynb")
# ipynb_hana_ml_samples\Python-API\usecase-examples\sapcommunity-hanaml-challenge\10 Connectivity Check.ipynb
# ipynb_hana_ml_samples\Python-API\usecase-examples\sapcommunity-hanaml-challenge\20 Data upload.ipynb
# ipynb_hana_ml_samples\Python-API\usecase-examples\sapcommunity-hanaml-challenge\PAL Tutorial - Unified Classification Hybrid Gradient Boosting - PredictiveQuality Example.ipynb
# ipynb_hana_ml_samples\Python-API\usecase-examples\sapcommunity-hanaml-challenge\Upload and explore Employee Churn data.ipynb

 


Collect HTML from Notebooks

LlamaIndex provides a lot of data connectors on LlamaHub.AI to load documents from different formats; for instance, file-ipynb could be used for notebooks and file-unstructured for HTML.

I will use a custom conversion instead. The conversion of notebooks into HTML is performed with the class collect_ipynb(). The entire code is on GitHub 03 Collect html from notebook ipynb.

class collect_ipynb():
    def __init__(self):
        pass

    def ipynb_to_html(self, ipynb_file, path_save=None, encoding=None, content=False, verbose=0):
        # verbose: 0 - Completion, 1 - Source & Destination
        # code
    def ipynb_path_to_html(self, repo_path=None, path_save=None, encoding=None, verbose=0):
        # verbose: 0 - Complete message | 1 - Source file & Saved file
        # code
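
The conversion bodies are elided; a minimal sketch of ipynb_to_html built on nbconvert's HTMLExporter (the helper name and the saving logic are assumptions, the real implementation is on GitHub):

import os
import nbformat
from nbconvert import HTMLExporter

def ipynb_to_html_sketch(ipynb_file, path_save=None):
    # Read the notebook, render it to HTML and save it under path_save
    notebook = nbformat.read(ipynb_file, as_version=4)
    body, _ = HTMLExporter().from_notebook_node(notebook)
    html_name = os.path.splitext(os.path.basename(ipynb_file))[0] + ".html"
    out_file = os.path.join(path_save or ".", html_name)
    with open(out_file, "w", encoding="utf-8") as f:
        f.write(body)
    return out_file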

 

# Converted notebooks are stored in files:

# Out:
# llama_challenge\ipynb_hana_ml_samples\Python-API\usecase-examples\sapcommunity-hanaml-challenge\10 Connectivity Check.html
# llama_challenge\ipynb_hana_ml_samples\Python-API\usecase-examples\sapcommunity-hanaml-challenge\20 Data upload.html
# llama_challenge\ipynb_hana_ml_samples\Python-API\usecase-examples\sapcommunity-hanaml-challenge\PAL Tutorial - Unified Classification Hybrid Gradient Boosting - PredictiveQuality Example.html
# llama_challenge\ipynb_hana_ml_samples\Python-API\usecase-examples\sapcommunity-hanaml-challenge\Upload and explore Employee Churn data.html

# Out:
# llama_challenge\ipynb_blog\SAP HANA ML challendge - CHURN  v2.3 max.html

Collect HTML to TXT

The conversion of HTML into TXT is performed with the class collect_text().

The entire code is on GitHub 04 Collect html to txt.

class collect_text():
    def __init__(self, mask_ext=None):
        # code
    def open_html(self, html_file, encoding_read=None):
        # code
    def html_to_text(self, html_content):
        # code
    def html_to_text_file(self, html_file, path_save=None, content=False, verbose=0, encoding_read=None,
                          encoding_write=None):
        # code
    def html_path_to_text(self, repo_path=None, path_save=None, encoding_read=None, encoding_write=None, verbose=0):
        # code
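
The html_to_text body is elided; a minimal sketch using BeautifulSoup to keep only the visible text (the actual cleaning rules on GitHub may differ):

from bs4 import BeautifulSoup

def html_to_text_sketch(html_content):
    # Drop scripts and styles, then return the visible text only
    soup = BeautifulSoup(html_content, "html.parser")
    for tag in soup(["script", "style"]):
        tag.decompose()
    return soup.get_text(separator="\n", strip=True)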

 

# Converted files from HTML into TXT are stored in the same location:
# Out:
llama_challenge\html_challenge\understanding_metrics_blog.txt
llama_challenge\html_challenge\challenge_20221107.txt
llama_challenge\html_challenge\challenge_20221128.txt
llama_challenge\html_challenge\challenge_20221222.txt
llama_challenge\html_challenge\hana_ml.dataframe.txt
llama_challenge\html_challenge\hana_ml.algorithms.pal.trees.HybridGradientBoostingClassifier.txt
# Out:
llama_challenge\ipynb_hana_ml_samples\Python-API\usecase-examples\sapcommunity-hanaml-challenge\readme.txt
llama_challenge\ipynb_hana_ml_samples\Python-API\usecase-examples\sapcommunity-hanaml-challenge\10 Connectivity Check.txt
llama_challenge\ipynb_hana_ml_samples\Python-API\usecase-examples\sapcommunity-hanaml-challenge\20 Data upload.txt
llama_challenge\ipynb_hana_ml_samples\Python-API\usecase-examples\sapcommunity-hanaml-challenge\PAL Tutorial - Unified Classification Hybrid Gradient Boosting - PredictiveQuality Example.txt
llama_challenge\ipynb_hana_ml_samples\Python-API\usecase-examples\sapcommunity-hanaml-challenge\Upload and explore Employee Churn data.txt
# Out:
llama_challenge\ipynb_blog\SAP HANA ML challendge - CHURN  v2.3 max.txt

SAP HANA Machine Learning Challenge Embedding in OpenAI

In previous steps, content has been collected and converted to text files. Now we can use the same class llama_context() used for simple content to load data, create the index, start the engine, and ask questions.

The entire code is on GitHub 05 SAP HANA Machine Learning content embedding.

Defining the working folder and displaying the folder structure of the lct object created from the class llama_context().

# lct = llama_context(path='llama')
path_llama = "llama_challenge"
lct = llama_context(path=path_llama)

display(lct.path)
display(lct.data_dir)
display(lct.perisit_dir)
# Out:
# 'llama_challenge'
# 'llama_challenge\data'
# 'llama_challenge\storage'

Specifying the paths for content.

path_from1 = "llama_challenge//html_challenge"
path_from2 = "llama_challenge//ipynb_blog"
path_from3 = "llama_challenge//ipynb_hana_ml_samples//Python-API//usecase-examples//sapcommunity-hanaml-challenge"

lct.copy_path_from_to_data_dir(path_from1) # default extension *.txt
lct.copy_path_from_to_data_dir(path_from2) # default extension *.txt
lct.copy_path_from_to_data_dir(path_from3) # default extension *.txt

Converted files in text format are saved in the same folders as the source files.

# Converted files into TXT are saved in folders:
# html_challenge
# ipynb_blog
# ipynb_hana_ml_samples

 

Loading data from the TXT files converted from the initial HTML and IPYNB formats.

lct.load_data()
# Out:
# Documents loaded: 12.

Estimating the minimum and maximum costs for tokens.

lct.estimate_cost()
# Out:
# Total estimated costs with model ada: $0.0175276
# Total estimated costs with model davinci: $1.31457
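
The estimate multiplies an estimated token count by the per-1K-token prices of the models; a sketch of the idea (the character-based heuristic and the hard-coded prices are assumptions that reproduce the figures above, not the exact GitHub implementation):

def estimate_cost_sketch(documents):
    # Rough heuristic: ~1 token per 4 characters of raw text
    total_tokens = sum(len(doc.text) // 4 for doc in documents)
    for model, price_per_1k in [("ada", 0.0004), ("davinci", 0.03)]:
        print(f"Total estimated costs with model {model}: ${total_tokens / 1000 * price_per_1k}")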

Ready to create the vector index with OpenAI! 🧠

lct.create_vector_store()
# API key is required. Embedding cost tokens!
# https://platform.openai.com/account/api-keys

# Out:
# Total embedding token usage: 147741 tokens
# GPTVectorStoreIndex complete.

# https://platform.openai.com/account/usage
# Usage - $0.35
# text-embedding-ada-002-v2, 24 requests
# 103,950 prompt + 0 completion = 103,950 tokens

Saving the index for later use.

lct.save_index()
# Out:
# Index saved in path llama_challenge\storage.

We could continue with the index already in memory in the lct object; however, for the purpose of the example, the index is loaded from the files saved before.

lct.load_index()
# API key is required. Loading and embedding cost tokens!
# https://platform.openai.com/account/api-keys
# Out:
# Loading all indices.

Starting the query engine! Ready to ask questions! 🤖

lct.start_query_engine()

 

Asking general questions about SAP HANA Machine Learning Challenge – “I quit!”. 🧐

question = "What is content about?"
lct.post_question(question)
print(lct.response)

# Out:
The content is about SAP HANA and its related technologies, such as SAP HANA Cloud's Auto ML 
capabilities, SAP HANA Python Client API for Machine Learning Algorithms, and SAP HANA Predictive 
Analysis Library (PAL). It also includes information about a book related to SAP HANA and a blog post 
about SAP HANA Machine Learning with ABAP Managed Database Procedures in SAP BW/4HANA.

question = "Who organized the Community Call?"
# Out:
The SAP HANA Cloud Machine Learning Challenge team organized the Community Call.

question = "What problem participants must solve?"
# Out:
Participants must solve the problem of predicting employee churn.

question = "Explain data for predicting employee churn"
# Out:
Data for predicting employee churn can include information about the employee such as their job title, years of experience, salary, performance reviews, and other factors that may influence their decision to stay or leave the company. Additionally, data can be collected from the company itself, such as the onboarding process, company culture, learning opportunities, and other factors that may influence employee churn. By analyzing this data, patterns can be identified that can help predict employee churn and inform decisions about how to improve the company's retention rate.

question = "Can you tell me which machine learning models were utilized by the participants?"
# Out:
The participants utilized the HybridGradientBoostingTree model for their machine learning.

question = "Which are the top 5 important features discoverd by the model?"
# Out:
The top 5 important features discovered by the model are: SICKDAYS, HRTRAINING, PREVIOUS_CAREER_PATH, LINKEDIN, and FUNCTIONALAREACHANGETYPE.

 

Asking questions for HANA Machine Learning code generation. 🧐

# QUESTION 1
question = "Python full code SAP HANA Machine learning HGBT example"
# Out:
The following code is an example of using the SAP HANA Python Client API for Machine Learning Algorithms to implement a HGBT (Hierarchical Gradient Boosting Tree) model. 

# Import the necessary libraries
import hana_ml
from hana_ml.algorithms.apl.hgbt import HGBT

# Create a connection to the SAP HANA system
connection_context = hana_ml.dataframe.ConnectionContext(address='<hostname>:<port>',
                                                        user='<username>',
                                                        password='<password>')

# Load the data into a dataframe
df = connection_context.table('<schema>.<table>')

# Create the HGBT model
hgbt = HGBT(conn_context=connection_context)

# Fit the model
hgbt.fit(data=df, key='<key_column>', label='<label_column>')

# Make predictions
predictions = hgbt.predict(data=df)

# Evaluate the model
hgbt.evaluate(data=df, label='

 

# QUESTION 2
question = "Python full code hana_ml dataframe example"
# Out:
In [1]:
# Import the necessary libraries
import hana_ml
import pandas as pd

# Load the CSV file into a Python object (Pandas DataFrame)
df_data = pd.read_csv(r'Emp_Churn_Train.csv', sep = ',')

# Create a connection to the HANA system
connection_context = hana_ml.dataframe.ConnectionContext(address='<HANA_SYSTEM_ADDRESS>', port=<HANA_SYSTEM_PORT>, user='<HANA_SYSTEM_USER>', password='<HANA_SYSTEM_PASSWORD>')

# Create a dataframe object from the Pandas DataFrame
df_remote = connection_context.table('EMP_CHURN_TRAIN', schema='<HANA_SYSTEM_SCHEMA>', data=df_data)

# Create training and testing set
from hana_ml.algorithms.pal import partition
hdf_train, hdf_test, hdf_val = partition.train_test_val_split( random_seed = 1017

Conclusions

The results are not impressive; however, they are good and promising. The response_mode is the default one, and the data are raw text without further cleaning or preprocessing. As expected, content embedding handles general information better than technical code generation, which is of higher complexity. Most probably the results would be similar with other LLMs, and fine-tuning would improve them only marginally. Sizeable improvements require a data-centric machine learning approach performed by data scientists: improving data quality, gathering more data, or engineering domain-specific content. What would it take in terms of resources and time to embed all the content of blogs.sap.com and GitHub SAP? 😉😊😪🤔

 
