Introduction
Today, companies leverage data from a variety of internal and external data sources. In order to integrate these successfully, the data must not only fit semantically, but also have a common key in order to be able to merge them in joins.
This endeavor often fails because the data contains different data qualities (cleansing problem) or because there is no common key, especially when external data sources are involved (matching problem).
SAP HANA Smart Data Quality (integrated with SAP HANA) provides a high-performance, rule-based solution to cleanse and merge the data (e.g. address data) even if no common key exists or to identify duplicates in the data sources. In this blogpost, the data cleansing and data matching functionalities of SAP HANA Smart Data Quality are demonstrated using person data and company data.
The blogpost is structured as follows: First, an example is presented, this includes two data sets with person data and two data sets with company data. Then, data cleansing is introduced and demonstrated using the data sets as examples. In the next step, data matching is shown and demonstrated on the examples. Finally, a conclusion is drawn.
Please note: The following information is presented from my point of view and does not reflect the view of SAP. All data shown is generated sample data.
Prerequisite
To implement the use case shown, a HANA on-premise instance is required. Currently, the matching and cleansing functionality is not available in HANA Cloud. For the example, a HANA database in version 2.0 SPS05 was used.
SAP HANA Smart Data Integration (SDI for short) and Smart Data Quality (SDQ for short) are the data integration and data quality capabilities of the HANA platform. These are embedded in SAP HANA and can be used via an intuitive UI using so-called SDI flow graphs. More information can be found in this blog post. To use SDI and SDQ, the script server of the HANA database must be activated (see here).
Basic knowledge of building HANA SDI flowgraphs (see here) and of HANA 2.0 (see here) is advantageous for implementing the use case.
Example Use Case
A company wants to combine data from two different data sources. Data source 1 is an ERP system with transactional data of the company, data source 2 is a CRM system with customer relationship data. Both data sources contain data on customer companies (company data) and their contacts (person data). Both data sources have stored the company and personal data with different IDs and also the names and address data are not created identically. Therefore a simple join is not possible.
The example data records of the data sources are presented below:
Example 1 Person Data
MEMBER_NO | FIRST_NAME | LAST_NAME | BIRTH_DATE | COUNTRY | POST_CODE | CITY | ADDRESS_LINE | SOURCE |
1 | Ted | Mosby | 04/11/1983 | US | 10977 | Spring Valley | 656 N. Stonybrook Street | 1 |
2 | Lilly | Aldrin | 01/12/1994 | US | 6095 | Windsor | 8354 South Drive | 1 |
3 | Marshall | Eriksen | 12/14/1989 | US | 6095 | Windsor | 8354 South Drive | 1 |
4 | Robin | Scherbatsky | 03/18/1987 | US | 20707 | Laurel | 599 Penn Ave. | 1 |
5 | Barney | Stinson | 03/23/1968 | US | 1801 | Woburn | 498 Pawnee Road | 1 |
6 | Donna | Paulsen | 08/04/1993 | US | 33030 | Homestead | 8586 S. Winchester Dr. | 1 |
7 | Harvey | Specter | 12/04/1994 | US | 33030 | Homestead | 8586 S. Winchester Dr. | 1 |
8 | Mike | Ross | 12/23/1996 | US | 12203 | Albany | 8992 East Harvey St. | 1 |
9 | Rachel | Zane | 01/20/1980 | US | 12203 | Albany | 8992 East Harvey St. | 1 |
10 | Louis | Litt | 06/24/1985 | US | 7712 | Asbury Park | 8701 Bear Hill Street | 1 |
11 | Jessica | Pearson | 09/10/2000 | US | 18940 | Newtown | 319 Brown Rd. | 1 |
12 | Charlie | Harper | 11/25/1984 | US | 7110 | Malibu | 758 Jones Rd. | 1 |
13 | Alan | Harper | 05/08/1992 | US | 7110 | Malibu | 758 Jones Rd. | 1 |
14 | Jake | Harper | 08/05/1980 | US | 7103 | Newark | 7709 El Dorado Court | 1 |
15 | Walden | Schmidt | 11/13/1987 | US | 8080 | Sewell | 5 Pacific Drive | 1 |
16 | Harry | Potter | 06/16/1981 | UK | 8701 | Lakewood | 66 Myrtle Dr. | 1 |
17 | Hermione | Granger | 07/15/1991 | UK | 46307 | Crown Point | 874 Rockcrest St. | 1 |
18 | Ron | Weasley | 10/28/1987 | UK | 60452 | Oak Forest | 7835 Crescent St. | 1 |
19 | Rubeus | Hagrid | 11/25/1982 | UK | 30701 | Calhoun | 37 Woodsman Lane | 1 |
20 | Albus | Dumbledore | 01/11/1968 | UK | 34231 | Sarasota | 10 Lilac Court | 1 |
Person Data – Data Source 1
ID | FIRST | LAST | BIRTHDAY | COUNTRY | ZIP | CITY | STREET | STREETNUMBER | SOURCE |
932093 | Teddy | Mosby | 04/11/1983 | US | 10977 | Spring Valley | North Stonybrook Street | 656 | 2 |
83294 | Lilly | Aldrin | 01/12/1994 | US | 6095 | Windsor | South Driv | 8354 | 2 |
83209841 | Marshall | Eriksen | 12/14/1989 | US | 6095 | Windsor | S. Drive | 8354 | 2 |
129321 | Robin | Scherbatsky | US | 20707 | Laurel | Penn Avenue | 2 | ||
219409 | Barney | Stinson | 03/23/1968 | US | 1801 | Woburn | Pawnee Rd. | 498 | 2 |
4329 | Donna | Paulsen | 08/04/1993 | US | 33030 | Homestead | South Winchester Drive | 8586 | 2 |
3432 | Michael James | Ross | 12/23/1996 | US | 12203 | Albany | E. Harvey Street | 8992 | 2 |
82302 | Rachel | Zane | 01/20/1980 | US | 12203 | Albany | 8992 | 2 | |
1290383 | Louis | Litt | 06/24/1985 | US | 7712 | Asbury Park | Bear Hill St. | 8701 | 2 |
3129321 | Jessica | Pearson | 09/10/2000 | US | 18940 | Newtown | Brown Rd. | 319 | 2 |
43213 | Charles | Harper | 11/25/1984 | US | 7110 | Malibu | Jones Road | 758 | 2 |
1243221 | Alan | Harper | 05/08/1992 | US | 7110 | Jones Rd. | 758 | 2 | |
2342322 | Jake | Harper | 08/05/1980 | US | 7103 | Newark | El Dorado Court | 7709 | 2 |
213234 | Harry | Potter | 06/16/1981 | UK | 8701 | Lakewood | Myrtle Drive | 66 | 2 |
234212 | Hermione | Granger | 07/15/1991 | UK | 46307 | Crown Point | Rockcrest St. | 2 | |
2321321 | Ronald | Weasley | UK | 60452 | Oak Forest | Crescent St. | 7835 | 2 | |
234232 | Rubeus | Hagrid | 11/25/1982 | UK | 30701 | Woodmans Ln. | 37 | 2 | |
219410 | Barney | Stins | 03/23/1968 | US | 1801 | Woburn | Pawnee Road | 498 | 2 |
Person Data – Data Source 2
It can be recognized that both records contain name information (First Name, Last Name), date of birth, as well as address information (country, postal code, city, street, house number). Nevertheless, the records have different IDs (MEMBER_NO, ID). For storing the street and house number, two fields are used in data source 2 (STREET, STREETNUMBER) and one field in source 1 (ADDRESS_LINE).
Example 2 Company Data
COMPANY_ID | COMPANY_NAME | COUNTRY | POST_CODE | CITY | ADDRESS_LINE | SOURCE |
1 | Tim Hortons | CA | 10977 | Spring Valley | 656 N. Stonybrook Street | 1 |
2 | Burger King | US | 6095 | Windsor | 8354 South Drive | 1 |
3 | McDonalds | US | 6095 | Windsor | 8354 South Drive | 1 |
4 | Kentucky Fried Chicken | US | 20707 | Laurel | 599 Penn Ave. | 1 |
5 | Pizza Hut | US | 1801 | Woburn | 498 Pawnee Road | 1 |
Company Data – Data Source 1
ID | COMPANY | COUNTRY | ZIP | LOCATION | STREETNAME | STREETNUMBER | SOURCE |
1 | Tim Hortons | CA | 10977 | Spring Valley | North Stonybrook Street | 656 | 2 |
2 | Burger King Food | US | 6095 | Windsor | South Drive | 8354 | 2 |
3 | BURGER KING | US | Windsor | South Drive | 8354 | 2 | |
4 | KFC | US | 20707 | Laurel | Penn Avenue | 599 | 2 |
5 | PIZZA HUT INC | US | 1801 | Woburn | Pawnee Road | 2 |
Company Data – Data Source 2
The two data sets have different IDs (COMPANY_ID, ID). The column names differ in both datasets. For storing the street and house number, two fields are used in data source 2 (STREETNAME, STREETNUMBER) and one field in source 1 (ADDRESS_LINE). In some cases, individual values are missing from the data records.
Step 1 – Data Cleansing
In order to be able to cleanse data using HANA SDQ, the “Cleanse” node (from the Data Quality section) is used in HANA Flowgraphs. This node always has exactly one input port and one output port for the data flow. It is used for identifying, parsing, validating and formatting data, these include: Addresses, people, company names, job titles, phone numbers, email addresses.
The node looks like this:
The documentation can be found here. In addition to the pure rule-based validation, address reference data can be purchased with additional licenses (per country) to achieve address validation and to use geocoding and reverse geocoding. In addition, there is the node “DQMm Cleanse” with which data can be cleaned based on a microservice (see here).
Components + Content Types
A component comprises a category of data to be cleansed and/or matched. Components include Firm, Address, Person, Email, Phone, Custom (custom defined components e.g. matching of proprietary IDs). Each component consists of 1 to n content types.
A content type is a type of data contained in a column of the data source, e.g. city, region, postal code, email address. The mapping of the content types to the source data is done automatically based on the name. The mapping can be adjusted manually if a component is not recognized or is recognized incorrectly.
The following graphic shows how a content type mapping can be adjusted in the “Input Components” tab so that the component (in this case Address) is correctly identified.
Cleanse Settings
For each component there are different special settings that can change the behavior of the cleanse, e.g. Casing, Diacritics, Script Conversion. The cleanse domain and the output format can be freely defined, e.g. based on special country formats.
The Side Effect Data Level can also be configured in the Settings. This allows additional statistics on the cleanse behavior. Depending on the level set, the information is stored in additional tables in the “_SYS_TASK” schema.
The following figure shows how the cleanse settings can be adjusted (using companies, persons and addresses as examples).
Configuration of the Cleanse Settings (Source: Own Image)
Cleansed Output
In the tab “Cleansed Output” you can define which fields are added to the output. Cleansed columns that are used specifically for SDQ matching must be explicitly switched on. Uncleansed source data can also be passed on through the output. Standardized fields get the prefix “STD_” and match fields get the prefix “MATCH_“.
The following graphic shows how the cleaned fields can be added for matching.
Example 1: Cleansing Person Data
The following example shows the cleansing of the person data from our example. The SDI flowgraph for the cleansing of the Company Data Source 1 looks as follows:
The result of the cleansing is shown in the following table (some cleansing columns have been hidden):
MATCH_PERSON | STD_PERSON_PRE | STD_PERSON_GN_FULL | STD_PERSON_FN_FULL | MATCH_PERSON_GN | MATCH_PERSON_GN_STD | MATCH_PERSON_GN_STD2 | MATCH_PERSON_FN | STD_ADDR_COUNTRY_NAME | STD_ADDR_ADDRESS_DELIVERY | STD_ADDR_LOCALITY_FULL | STD_ADDR_POSTCODE_FULL | MATCH_ADDR_COUNTRY | MATCH_ADDR_POSTCODE1 | MATCH_ADDR_REGION | MATCH_ADDR_LOCALITY | MATCH_ADDR_PRIM_NAME | MATCH_ADDR_PRIM_TYPE | MATCH_ADDR_PRIM_DIR | MATCH_ADDR_PRIM_NUMBER | … | CLEANSING_SOURCE_1_TABLE_ID | CLEANSING_SOURCE_1_ROW_ID |
TED MOSBY | MR. | TED | MOSBY | TED | EDWARD | THEODORE | MOSBY | UNITED STATES | 656 N STONYBROOK ST | SPRING VALLEY | 10977 | US | 10977 | SPRING VALLEY | STONYBROOK | ST | N | 656 | … | 4 | 0 | |
LILLY ALDRIN | MS. | LILLY | ALDRIN | LILLY | LILLIAN | ALDRIN | UNITED STATES | 8354 SOUTH DR | WINDSOR | 6095 | US | 6095 | WINDSOR | SOUTH | DR | 8354 | … | 4 | 1 | |||
MARSHALL ERIKSEN | MR. | MARSHALL | ERIKSEN | MARSHALL | ERIKSEN | UNITED STATES | 8354 SOUTH DR | WINDSOR | 6095 | US | 6095 | WINDSOR | SOUTH | DR | 8354 | … | 4 | 2 | ||||
ROBIN SCHERBATSKY | ROBIN | SCHERBATSKY | ROBIN | ROBBIN | ROBINSON | SCHERBATSKY | UNITED STATES | 599 PENN AVE | LAUREL | 20707 | US | 20707 | LAUREL | PENN | AVE | 599 | … | 4 | 3 | |||
BARNEY STINSON | MR. | BARNEY | STINSON | BARNEY | BARNABAS | BERNARD | STINSON | UNITED STATES | 498 PAWNEE RD | WOBURN | 1801 | US | 1801 | WOBURN | PAWNEE | RD | 498 | … | 4 | 4 | ||
DONNA PAULSEN | MS. | DONNA | PAULSEN | DONNA | PAULSEN | UNITED STATES | 8586 S WINCHESTER DR | HOMESTEAD | 33030 | US | 33030 | HOMESTEAD | WINCHESTER | DR | S | 8586 | … | 4 | 5 | |||
HARVEY SPECTER | MR. | HARVEY | SPECTER | HARVEY | SPECTER | UNITED STATES | 8586 S WINCHESTER DR | HOMESTEAD | 33030 | US | 33030 | HOMESTEAD | WINCHESTER | DR | S | 8586 | … | 4 | 6 | |||
MIKE ROSS | MR. | MIKE | ROSS | MIKE | MICHAEL | ROSS | UNITED STATES | 8992 E HARVEY ST | ALBANY | 12203 | US | 12203 | ALBANY | HARVEY | ST | E | 8992 | … | 4 | 7 | ||
RACHEL ZANE | MS. | RACHEL | ZANE | RACHEL | ZANE | UNITED STATES | 8992 E HARVEY ST | ALBANY | 12203 | US | 12203 | ALBANY | HARVEY | ST | E | 8992 | … | 4 | 8 | |||
LOUIS LITT | MR. | LOUIS | LITT | LOUIS | LITT | UNITED STATES | 8701 BEAR HILL ST | ASBURY PARK | 7712 | US | 7712 | ASBURY PARK | BEAR HILL | ST | 8701 | … | 4 | 9 | ||||
JESSICA PEARSON | MS. | JESSICA | PEARSON | JESSICA | PEARSON | UNITED STATES | 319 BROWN RD | NEWTOWN | 18940 | US | 18940 | NEWTOWN | BROWN | RD | 319 | … | 4 | 10 | ||||
CHARLIE HARPER | MR. | CHARLIE | HARPER | CHARLIE | CHARLENE | CHARLES | HARPER | UNITED STATES | 758 JONES RD | MALIBU | 7110 | US | 7110 | MALIBU | JONES | RD | 758 | … | 4 | 11 | ||
ALAN HARPER | MR. | ALAN | HARPER | ALAN | HARPER | UNITED STATES | 758 JONES RD | MALIBU | 7110 | US | 7110 | MALIBU | JONES | RD | 758 | … | 4 | 12 | ||||
JAKE HARPER | MR. | JAKE | HARPER | JAKE | JACOB | JAKOB | HARPER | UNITED STATES | 7709 EL DORADO CT | NEWARK | 7103 | US | 7103 | NEWARK | EL DORADO | CT | 7709 | … | 4 | 13 | ||
WALDEN SCHMIDT | MR. | WALDEN | SCHMIDT | WALDEN | SCHMIDT | UNITED STATES | 5 PACIFIC DR | SEWELL | 8080 | US | 8080 | SEWELL | PACIFIC | DR | 5 | … | 4 | 14 | ||||
HARRY POTTER | MR. | HARRY | POTTER | HARRY | HAROLD | HENRY | POTTER | UNITED KINGDOM | 66 MYRTLE DRIVE | LAKEWOOD | 8701 | GB | 8701 | LAKEWOOD | MYRTLE | DR | 66 | … | 4 | 15 | ||
HERMIONE GRANGER | MS. | HERMIONE | GRANGER | HERMIONE | GRANGER | UNITED KINGDOM | 874 ROCKCREST STREET | CROWN POINT | 46307 | GB | 46307 | CROWN POINT | ROCKCREST | ST | 874 | … | 4 | 16 | ||||
RON WEASLEY | MR. | RON | WEASLEY | RON | RONALD | WEASLEY | UNITED KINGDOM | 7835 CRESCENT STREET | OAK FOREST | 60452 | GB | 60452 | OAK FOREST | CRESCENT | ST | 7835 | … | 4 | 17 | |||
RUBEUS HAGRID | RUBEUS | HAGRID | RUBEUS | HAGRID | UNITED KINGDOM | 37 WOODSMAN LANE | CALHOUN | 30701 | GB | 30701 | CALHOUN | WOODSMAN | LA | 37 | … | 4 | 18 | |||||
ALBUS DUMBLEDORE | ALBUS | DUMBLEDORE | ALBUS | DUMBLEDORE | UNITED KINGDOM | 10 LILAC COURT | SARASOTA | 34231 | GB | 34231 | SARASOTA | LILAC | CT | 10 | … | 4 | 19 |
Cleansed Person Data (Example Data Source 1)
It can be observed that in the column “STD_PERSON_PRE” the prenames are filled (for some persons this has not been identified). In the columns “MATCH_PERSON_GN“, “MATCH_PERSON_GN_STD” and “MATCH_PERSON_GN_STD2” alternative first names are maintained (e.g. “TED”, “EDWARD”, “THEODORE”), which can be used for matching. Furthermore, it can be seen that parts of the street name are stored in the fields “MATCH_ADDR_PRIM_NAME“, “MATCH_ADDR_PRIM_TYPE” and “MATCH_ADDR_PRIM_DIR” (e.g. “South Winchester Drive” becomes “WINCHESTER”, “DR” and “S”). In the field “MATCH_ADDR_COUNTRY” you can see that “UK” is converted to “GB”.
Example 2: Cleansing Company Data
The following example shows the cleansing of the organization data from our example.
After the flowgraph is deployed, a procedure is created with the name of the flowgraph including the suffix “_SP“. This can be called as follows:
CALL "<SCHEMA_NAME>". "<NAME_FLOWGRAPH>_SP"(<NAME_FLOWGRAPH>_<NAME_DATASOURCE_TABLE> _TAB => '<NAME_INPUT_TABLE> ‘);
In the SAP WebIDE, information about the started tasks can be obtained in the “Tasks” tab (the TASK_EXECUTION_ID can also be seen here, this uniquely identifies a run of a task). Furthermore, this information can be taken from the “TASK_EXECUTIONS” table of the “_SYS_TASK” schema.
The result of the cleansing is shown in the following table (some cleansing columns have been hidden):
STD_FIRM | STD_ADDR_COUNTRY_NAME | STD_ADDR_ADDRESS_DELIVERY | STD_ADDR_LOCALITY_FULL | STD_ADDR_POSTCODE_FULL | MATCH_FIRM | MATCH_FIRM_STD | MATCH_ADDR_COUNTRY | MATCH_ADDR_POSTCODE1 | MATCH_ADDR_LOCALITY | MATCH_ADDR_PRIM_NAME | MATCH_ADDR_PRIM_TYPE | MATCH_ADDR_PRIM_NUMBER | … | CLEANSING_SOURCE_1_TABLE_ID | CLEANSING_SOURCE_1_ROW_ID | |
TIM HORTONS | CANADA | 656 N. STONYBROOK ST | SPRING VALLEY | 10977 | TIM HORTONS | CA | 10977 | SPRING VALLEY | N STONYBROOK | ST | 656 | … | 4 | 0 | ||
BURGER KING | UNITED STATES | 8354 SOUTH DR | WINDSOR | 6095 | BURGER KING | US | 6095 | WINDSOR | SOUTH | DR | 8354 | … | 4 | 1 | ||
MCDONALD’S | UNITED STATES | 8354 SOUTH DR | WINDSOR | 6095 | MCDONALDS | US | 6095 | WINDSOR | SOUTH | DR | 8354 | … | 4 | 2 | ||
KENTUCKY FRIED CHICKEN | UNITED STATES | 599 PENN AVE | LAUREL | 20707 | KENTUCKY FRIED CHICKEN | US | 20707 | LAUREL | PENN | AVE | 599 | … | 4 | 3 | ||
PIZZA HUT | UNITED STATES | 498 PAWNEE RD | WOBURN | 1801 | PIZZA HUT | US | 1801 | WOBURN | PAWNEE | RD | 498 | … | 4 | 4 |
Cleansed Company Data (Example Data Source 1)
You can see that in the column “STD_FIRM” the company names are standardized and in “MATCH_FIRM” the name is stored optimized for matching (“McDonalds” becomes “MCDONALD’S” and “MCDONALDS”). Furthermore you can see that in the fields “MATCH_ADDR_PRIM_NAME“, “MATCH_ADDR_PRIM_TYPE” and “MATCH_ADDR_PRIM_NUMBER” parts of the street name and house number are stored, which in Data Source 1 all come from the field “Address Line” (e.g. “498 Pawnee Road” becomes “PAWNEE”, “RD” and “498”).
Since in this example the Side Effect Data Level was set to “Basic“, the columns “CLEANSING_SOURCE_1_TABLE_ID” and “CLEANSING_SOURCE_1_ROW_ID” are included. By means of these IDs information about the cleansing can be extracted in the “_SYS_TASK” schema, among others in the “CLEANSE_ADDRESS_RECORD_INFO” view.
In the following the information about the cleansing is shown in the side effect tables:
Step 2 – Data Matching
To be able to match data using HANA SDQ and identify potential duplicates, the “Match” node (from the Data Quality section) is used in HANA Flowgraphs. This node always has 1 or more input ports and exactly one output port for the data flow. This node returns match groups with potentially related entries based on match components (such as address + company name), specified match policies, as well as match settings (e.g. should John Doe also match J. Doe?).
The node looks like this:
The documentation can be found here. Note: The example is not shown in this blog, but if only one data source is attached to the node, the best record of a match group can be identified as the master record based on settings.
Components + Match Policies
A component comprises a category of data to be cleansed and/or matched. Components include Firm, Address, Person, Email, Phone, Custom (custom defined components e.g. matching of proprietary IDs). Components can be identified in the match node that have already been cleansed via the “Cleanse” node in SDQ or uncleansed components can be defined.
A match policy is a rule that defines when records should match. They consist of at least one component (e.g. Person, Address, Name). Multiple match policies can be defined in descending order of priority.
The following graphic shows how match policies can be created in the “Policies” tab.
Match Settings
For each component there are different special settings that can change the behavior of the matching, for example, whether abbreviations in company names should also match. The match strictness can be set on a 7-level scale for each component.
Additional “Source Settings” can be set to get statistical data per data source and to optimize performance.
The Side Effect Data Level can also be configured in the settings. This allows additional statistics on the matching behavior. Depending on the set level, the information is stored in additional tables in the “_SYS_TASK” schema. For example, starting with the “Basic” level, the match score and the match rule for each record can be taken from the “MATCH_TRACING” view.
The following figure shows how the match settings (using the example of companies, persons and addresses) can be adjusted.
Match Output
In the tab “Output” can be defined which fields are added to the match output. Output fields are:
- GROUP_ID (INTEGER): ID of the match group; non-matching entries don’t have a GROUP_ID
- CONFLICT_GROUP (NVARCHAR(1)): Conflict Cases (several entries of a group match only indirectly)
- REVIEW_GROUP (NVARCHAR(1)): Matches within a group with low confidence
Optionally, the following fields can be added depending on Match Settings:
- MATCH_POLICY (NVARCHAR(50)): Drawn Policy, z.B. “Person, Phone”
- MATCH_SCORE (INTEGER): Confidence value derived from SDQ
- ROW_ID (INTEGER)/ TABLE_ID (INTEGER): IDs to identify records in Side Effect Tables
In addition, fields from the input of the match node can be passed through.
The following graphic shows how fields can be defined for the matching output.
Example 1: Matching Person Data
The following example shows the matching of the person data from our example.
After the flowgraph has been deployed, a procedure is created with the name of the flowgraph including the suffix “_SP”. This can be called as follows:
CALL "<SCHEMA_NAME>". "<NAME_FLOWGRAPH>_SP"(<NAME_FLOWGRAPH>_<NAME_DATASOURCE_TABLE1> _TAB => '<NAME_INPUT_TABLE1>‘, <NAME_FLOWGRAPH>_<NAME_DATASOURCE_TABLE2> _TAB => '<NAME_INPUT_TABLE2>‘,);
In the SAP WebIDE, information on the started tasks can be obtained in the “Tasks” tab (the TASK_EXECUTION_ID can also be obtained here, this uniquely identifies a run of a task). Furthermore, this information can be taken from the “TASK_EXECUTIONS” table of the “_SYS_TASK” schema.
The result of the matching is shown in the following table (some columns have been hidden, the two match sources are contrasted):
MEMBER_NO | FIRST_NAME | LAST_NAME | BIRTH_DATE | COUNTRY | POST_CODE | CITY | ADDRESS_LINE | SOURCE | ID | FIRST | LAST | BIRTHDAY | COUNTRY | ZIP | CITY | STREET | STREETNUMBER | SOURCE | REVIEW_GROUP | CONFLICT_GROUP |
1 | Ted | Mosby | 04.11.1983 | US | 10977 | Spring Valley | 656 N. Stonybrook Street | 1 | 932093 | Teddy | Mosby | 04.11.1983 | US | 10977 | Spring Valley | North Stonybrook Street | 656 | 2 | N | N |
2 | Lilly | Aldrin | 01.12.1994 | US | 6095 | Windsor | 8354 South Drive | 1 | 83294 | Lilly | Aldrin | 01.12.1994 | US | 6095 | Windsor | South Driv | 8354 | 2 | N | N |
3 | Marshall | Eriksen | 12/14/1989 | US | 6095 | Windsor | 8354 South Drive | 1 | ||||||||||||
4 | Robin | Scherbatsky | 03/18/1987 | US | 20707 | Laurel | 599 Penn Ave. | 1 | ||||||||||||
5 | Barney | Stinson | 03/23/1968 | US | 1801 | Woburn | 498 Pawnee Road | 1 | 219410 | Barney | Stins | 03/23/1968 | US | 1801 | Woburn | Pawnee Road | 498 | 2 | R | N |
5 | Barney | Stinson | 03/23/1968 | US | 1801 | Woburn | 498 Pawnee Road | 1 | 219409 | Barney | Stinson | 03/23/1968 | US | 1801 | Woburn | Pawnee Rd. | 498 | 2 | R | N |
6 | Donna | Paulsen | 08.04.1993 | US | 33030 | Homestead | 8586 S. Winchester Dr. | 1 | 4329 | Donna | Paulsen | 08.04.1993 | US | 33030 | Homestead | South Winchester Drive | 8586 | 2 | N | N |
7 | Harvey | Specter | 12.04.1994 | US | 33030 | Homestead | 8586 S. Winchester Dr. | 1 | ||||||||||||
8 | Mike | Ross | 12/23/1996 | US | 12203 | Albany | 8992 East Harvey St. | 1 | 3432 | Michael James | Ross | 12/23/1996 | US | 12203 | Albany | E. Harvey Street | 8992 | 2 | N | N |
9 | Rachel | Zane | 01/20/1980 | US | 12203 | Albany | 8992 East Harvey St. | 1 | ||||||||||||
10 | Louis | Litt | 06/24/1985 | US | 7712 | Asbury Park | 8701 Bear Hill Street | 1 | 1290383 | Louis | Litt | 06/24/1985 | US | 7712 | Asbury Park | Bear Hill St. | 8701 | 2 | N | N |
11 | Jessica | Pearson | 09.10.2000 | US | 18940 | Newtown | 319 Brown Rd. | 1 | 3129321 | Jessica | Pearson | 09.10.2000 | US | 18940 | Newtown | Brown Rd. | 319 | 2 | N | N |
12 | Charlie | Harper | 11/25/1984 | US | 7110 | Malibu | 758 Jones Rd. | 1 | 43213 | Charles | Harper | 11/25/1984 | US | 7110 | Malibu | Jones Road | 758 | 2 | N | N |
13 | Alan | Harper | 05.08.1992 | US | 7110 | Malibu | 758 Jones Rd. | 1 | 1243221 | Alan | Harper | 05.08.1992 | US | 7110 | Jones Rd. | 758 | 2 | N | N | |
14 | Jake | Harper | 08.05.1980 | US | 7103 | Newark | 7709 El Dorado Court | 1 | 2342322 | Jake | Harper | 08.05.1980 | US | 7103 | Newark | El Dorado Court | 7709 | 2 | N | N |
15 | Walden | Schmidt | 11/13/1987 | US | 8080 | Sewell | 5 Pacific Drive | 1 | ||||||||||||
16 | Harry | Potter | 06/16/1981 | UK | 8701 | Lakewood | 66 Myrtle Dr. | 1 | 213234 | Harry | Potter | 06/16/1981 | UK | 8701 | Lakewood | Myrtle Drive | 66 | 2 | N | N |
17 | Hermione | Granger | 07/15/1991 | UK | 46307 | Crown Point | 874 Rockcrest St. | 1 | ||||||||||||
18 | Ron | Weasley | 10/28/1987 | UK | 60452 | Oak Forest | 7835 Crescent St. | 1 | 2321321 | Ronald | Weasley | UK | 60452 | Oak Forest | Crescent St. | 7835 | 2 | N | N | |
19 | Rubeus | Hagrid | 11/25/1982 | UK | 30701 | Calhoun | 37 Woodsman Lane | 1 | 234232 | Rubeus | Hagrid | 11/25/1982 | UK | 30701 | Woodmans Ln. | 37 | 2 | N | N | |
20 | Albus | Dumbledore | 01.11.1968 | UK | 34231 | Sarasota | 10 Lilac Court | 1 | ||||||||||||
234212 | Hermione | Granger | 07/15/1991 | UK | 46307 | Crown Point | Rockcrest St. | 2 | ||||||||||||
82302 | Rachel | Zane | 01/20/1980 | US | 12203 | Albany | 8992 | 2 | ||||||||||||
129321 | Robin | Scherbatsky | US | 20707 | Laurel | Penn Avenue | 2 | |||||||||||||
83209841 | Marshall | Eriksen | 12/14/1989 | US | 6095 | Windsor | S. Drive | 8354 | 2 |
Match Output Person Data (Example)
It is apparent that many matches were found. For Member No. 5 from Data Source 1 (Barney Stinson), two entries were identified in Data Source 2. No match was found for Member No. 7, but there was also no matching entry in Data Source 2. Member No. 8 from Data Source 1 found a match, although the first names “Mike” and “Michael James” as well as the address data differed. Member No 18 (Ron Weasley) was able to identify a match even though no date of birth was maintained (because multiple match policies were configured). Several matches could not be identified, among them Robin Scherbatsky and Hermione Granger, because no street numbers were maintained. In this use case, one needs to think about setting the match strictness more loosely.
Since the Side Effect Data Level was turned on to “Basic” in this example, IDs can be used to extract matching information in the “_SYS_TASK” schema, including in the “MATCH_TRACING” view. In the following the information about the matching in the side effect tables is shown:
It is visible that in one case only the policy “Person and Address” were matched (Ron Weasley) otherwise “Person, Address and Date” were identified. Furthermore, it can be seen that in one case a match was found inside one data source (Barney Stinson).
Example 2: Matching Company Data
The following example shows the matching of the organizational data from our example.
The result of the matching is shown in the following table (some columns have been hidden, the two match sources are contrasted):
COMPANY_ID | COMPANY_NAME | COUNTRY | POST_CODE | CITY | ADDRESS_LINE | SOURCE | ID | COMPANY | COUNTRY | ZIP | LOCATION | STREETNAME | STREETNUMBER | SOURCE | REVIEW_GROUP | CONFLICT_GROUP |
1 | Tim Hortons | CA | 10977 | Spring Valley | 656 N. Stonybrook Street | 1 | ||||||||||
2 | Burger King | US | 6095 | Windsor | 8354 South Drive | 1 | 2 | Burger King Food | US | 6095 | Windsor | South Drive | 8354 | 2 | N | N |
2 | Burger King | US | 6095 | Windsor | 8354 South Drive | 1 | 3 | BURGER KING | US | Windsor | South Drive | 8354 | 2 | N | N | |
3 | McDonalds | US | 6095 | Windsor | 8354 South Drive | 1 | ||||||||||
4 | Kentucky Fried Chicken | US | 20707 | Laurel | 599 Penn Ave. | 1 | 4 | KFC | US | 20707 | Laurel | Penn Avenue | 599 | 2 | N | N |
5 | Pizza Hut | US | 1801 | Woburn | 498 Pawnee Road | 1 | ||||||||||
1 | Tim Hortons | CA | 10977 | Spring Valley | North Stonybrook Street | 656 | 2 | |||||||||
5 | PIZZA HUT INC | US | 1801 | Woburn | Pawnee Road | 2 |
Match Output Company Data (Example)
It is evident that no match was identified for Company ID 3 (McDonalds) from data source 1. However, there is also no potential match in data source 2. In addition, it can be seen that “Kentucky Fried Chicken” (Company ID 4 in data source 1) and the abbreviation “KFC” were identified as a match. No matches were identified for the Company ID 1 and 5 (“Tim Hortons” and “Pizza Hut”) entries, although potential matches were present in Data Source 2 with a changed name and address.
Subsequently, the match strictness for the company name was loosened and the flowgraph was redeployed and restarted.
The result is now shown in the following.
COMPANY_ID | COMPANY_NAME | COUNTRY | POST_CODE | CITY | ADDRESS_LINE | SOURCE | ID | COMPANY | COUNTRY | ZIP | LOCATION | STREETNAME | STREETNUMBER | SOURCE | REVIEW_GROUP | CONFLICT_GROUP |
1 | Tim Hortons | CA | 10977 | Spring Valley | 656 N. Stonybrook Street | 1 | ||||||||||
2 | Burger King | US | 6095 | Windsor | 8354 South Drive | 1 | 2 | Burger King Food | US | 6095 | Windsor | South Drive | 8354 | 2 | N | N |
2 | Burger King | US | 6095 | Windsor | 8354 South Drive | 1 | 3 | BURGER KING | US | Windsor | South Drive | 8354 | 2 | N | N | |
3 | McDonalds | US | 6095 | Windsor | 8354 South Drive | 1 | ||||||||||
4 | Kentucky Fried Chicken | US | 20707 | Laurel | 599 Penn Ave. | 1 | 4 | KFC | US | 20707 | Laurel | Penn Avenue | 599 | 2 | N | N |
5 | Pizza Hut | US | 1801 | Woburn | 498 Pawnee Road | 1 | 5 | PIZZA HUT INC. | US | 1801 | Woburn | Pawnee Road | 2 | R | N | |
1 | Tim Hortons | CA | 10977 | Spring Valley | North Stonybrook Street | 656 | 2 |
Match Output Company Data with looser Firmname Settings (Example)
Company_ID 5 from data source 1 and ID 5 from data source 2 could now be identified as an additional match (“Pizza Hut” and “PIZZA HUT INC.”).
Conclusion
This blog post has demonstrated how cleansing and matching use cases can be implemented with SAP HANA by using the Smart Data Quality capabilities. Within a few implementation steps it is possible to create a proof of concept and already achieve a good matching quality. Smart Data Quality includes several rules for cleaning and validating addresses, e.g. to identify street suffixes.
I would like to emphasize again at this point that the cleansing feature provides added value even detached from matching, to be used in additional use cases. Moreover, one should consider persisting the cleansed data in order not to need a cleansing step every time the matching node is called.
Thanks for reading! I hope you find this post helpful. For any questions or feedback just leave a comment below this post.
Best wishes and a happy holiday season,
Tim
Find more information and related blog posts on the topic page for Database and Data Management.