This blog will provide  a simple overview of the process of fine tuning Large Language Models (LLMs) with Enterprise data to help it produce tailored  HANA SQL statements.

 

It’s a continuation of my previous 2 blogs:

Data Wizardry – Unleashing Live Insights with OpenAI, LangChain & SAP HANA Demonstrates how questions on live Enterprise data can be asked with natural language can be translated to SQL and executed
Under the Hood – Text to SQL with LangChain, LLM’s and Hana Provides some further details of the underlying Architecture and highlights some of the risks and challenges

 

Once again I’m trying to ask billing questions on my custom table ZSDBILLGDOCITM  which is a simplified representation of the S/4 CDS View I_BillingDocumentItemCube , which I’ve populated with dummy data.

In the earlier blogs I already explored asking questions using OpenAI and ChatGPT and it does a very good job, if your provide a lot of enterprise info along with the questions.

The Azure’s OpenAI  offering does provide the opportunity to fine-tune the language model for your specific needs.

In recent months there has been an explosion of Open Source Lanaguage Models available so rather than demonstrate finetune with Azure I thought it would be more interesting to show how a free model, would perform before and after some training.

Obviously it’s not completely free, you’ll still need to run it on a machine with some decent GPU (e.g. a Nividia A100). Time is money  😉

After a bit of searching in hugging face I opted to test the StarCoder model, it’s tiny fraction of GPT-3 size but is specialised in coding, so the hope was that it has a resonable command of SQL.

Disclaimer: Always check the models comerical usage guidelines as restrictions may vary.

 

Before Fine-Tuning

So lets give it a try with the same question from the earlier blog:

“Which was our best selling product in Austrlia during January?”

 

Not a bad attempt, but remember it doesn’t know anything about the table, field or the dataset.

There’s no chance it will run in Hana so not even worth a try.

Once again I don’t want to pass the table structures etc along with the question so instead I want to train the LMM on the enterprise data before hand.

 

 

Data Preparation

The most time consuming part of the activity is preparing some training files.

For example I need to create a list of questions and answers in the following JSON form:

 

[{
   "prompt": "What is the total quantity and amount sold for each division in the billing dataset for the North America region?",
   "completion": "SELECT division, SUM(billingquantity), SUM(itemnetamountofbillingdoc) FROM zsdbillgdocitm WHERE region = 'NA' GROUP BY division limit 10;"
}]

 

NOTE: the exact form and structure of your training data can vary with custom LLM’s, since you’re writing the logic, but to keep it simple I opted to use a similar format as that used by Azure OpenAI.

If you want to see the small training set I created you can find it here.

 

Train and Run

Next I load the dataset, tweaked the format, tokenized the data then train the model on the new dataset with the necessary transformer libraries in Python.

For a broad overview of the steps see the hugging face docs.

Training on an A100 with this tiny dataset of 100 examples took under 10min.

 

So lets ask the question again:

“Which was our best selling product in Austrlia during January?”

 

Cool.. It produced valid SQL:

SELECT product, SUM(itemnetamountofbillingdoc) AS total_sold_in_january FROM zsdbillgdocitm WHERE fiscalperiod = ‘001’ AND country = ‘AU’ GROUP BY product limit 10;

Not optimal but it runs and if the response was plugged into the LangChain logic  from my earlier blog it would likely produce the expected answer.

 

Simple right?

Well actually it wasn’t so straight forward. The SQL being generated wasn’t quite what I wanted after much iterations of training and testing, so rather than create more training data I tweaked the prompts just a little bit to give some hints.

e.g.

text = ”’Create a syntactically correct SAP Hana SQL query.
Use the following format:
Question: “Question here”
SQLQuery: “SQL Query to run”
tip:
fiscalperiod is 3 digits long e.g. 012 for December
The country code for Australia is AU
Question: “From zsdbillgdocitm table, which was our best selling Product in Austrlia for January?”
”’

The tips helped a lot, since the LLM knew nothing of the underlying data.

It’s not as fast and as accurate OpenAI but at the same time it’s potentially more flexibile depending on the use case.

The more training data you can supply the better it can be. 🙂

I hope you found this informative.

 

 

What Enterprise data would you want to train it on?

What concerns do you have about data security?

Could this be a viable option?

Sara Sampaio

Sara Sampaio

Author Since: March 10, 2022

0 0 votes
Article Rating
Subscribe
Notify of
0 Comments
Inline Feedbacks
View all comments
0
Would love your thoughts, please comment.x
()
x