Data Anonymization in SAP HANA
Anonymization methods available in the SAP HANA database allow you to gain statistically valid insights from your data while protecting the privacy of individuals.
Why Anonymize?
In a data-driven world, a growing amount of business data contains personal or sensitive information. If this data is to be used by applications for statistical analysis, it must be protected to ensure privacy. Trivial modifications to the data like replacing information that directly identifies an individual such as name or social security number (pseudonymization) or simply removing the information is not enough. Re-identification is still possible, for example if additional information is obtained (referred to as a linkage attack).
Unlike masking and pseudonymization, anonymization methods (also called privacy-enhancing methods) provide a more structured approach to modifying data for privacy protection. The quality of such anonymized or privacy-enhanced data is still sufficient for meaningful analysis. Several anonymization methods exist.
SAP HANA supports the methods k-anonymity, l-diversity, and differential privacy. Which method provides the most appropriate level of privacy depends on your data and the potential attack scenarios and attackers.
Anonymization Workflow
A data controller – that is someone who determines when and how personal data is accessed and processed – defines an SQL view and configures the parameters of the chosen anonymization method to meet the required privacy level. Access to the anonymized view can then be granted to users using standard SAP HANA authorization mechanisms. An overview of this process is shown below:
Data Anonymization Methods
Data anonymization methods provide a structured approach to modifying data for privacy protection. SAP HANA supports the data anonymization methods k-anonymity, l-diversity, and differential privacy.
Anonymizing Data in SAP HANA – EXT Schema
Data anonymization can be applied to SQL views, thus enabling analytics on data while still protecting the privacy of individuals.
Create a table in your EXT Schema for your data
SET SCHEMA EXT;
CREATE COLUMN TABLE EXT.SALES_EMPLOYEES (
ID INTEGER NOT NULL PRIMARY KEY GENERATED ALWAYS AS IDENTITY,
NAME NVARCHAR(50),
SITE NVARCHAR(20),
GENDER NVARCHAR(10),
AGE NVARCHAR(10),
SALARY DOUBLE
);
Insert your data to your table which contains few PII
INSERT INTO SALES_EMPLOYEES VALUES ('Henry Brubaker','Madrid','Male','29',91000);
INSERT INTO SALES_EMPLOYEES VALUES ('Jaylene Jennings','Paris','Female','19',67000);
INSERT INTO SALES_EMPLOYEES VALUES ('Dean Tavalouris','Boston','Male','20',76000);
INSERT INTO SALES_EMPLOYEES VALUES ('Lydia Wong','Dallas','Female','18',81000);
INSERT INTO SALES_EMPLOYEES VALUES ('Harvey Denton','Madrid','Male','19',45000);
INSERT INTO SALES_EMPLOYEES VALUES ('Pamela Doove','Nantes','Female','21',78000);
INSERT INTO SALES_EMPLOYEES VALUES ('Luciana Trujillo','Barcelona','Female','24',92000);
INSERT INTO SALES_EMPLOYEES VALUES ('Demarcus Collins','Bordeaux','Male','18',67000);
INSERT INTO SALES_EMPLOYEES VALUES ('Edward Tattsyrup','Barcelona','Male','18',78000);
INSERT INTO SALES_EMPLOYEES VALUES ('Benjamin Denton','Paris','Male','19',98000);
INSERT INTO SALES_EMPLOYEES VALUES ('Joaquin Barry','Madrid','Male','20',34000);
commit;
Create Geography Hierarchy table
CREATE COLUMN TABLE EXT.HIER_GEO (
CHILD NVARCHAR(20),
PARENT NVARCHAR(20)
);
Insert your Geography data into table.
INSERT INTO HIER_GEO VALUES ('Europe', NULL);
INSERT INTO HIER_GEO VALUES ('North America', NULL);
INSERT INTO HIER_GEO VALUES ('Spain', 'Europe');
INSERT INTO HIER_GEO VALUES ('France', 'Europe');
INSERT INTO HIER_GEO VALUES ('USA', 'North America');
INSERT INTO HIER_GEO VALUES ('Canada', 'North America');
commit;
Create a view for the Geography Hierarchy
CREATE VIEW V_HIER_GEO AS
SELECT *
FROM HIERARCHY (
SOURCE ( SELECT CHILD AS NODE_ID, PARENT AS PARENT_ID FROM HIER_GEO )
SIBLING ORDER BY PARENT_ID, NODE_ID
);
Create a Function for Age
CREATE OR REPLACE FUNCTION EXT.GROUP_AGE (age NVARCHAR(10), level INTEGER)
RETURNS outValue NVARCHAR(50) AS
BEGIN
DECLARE defineValue INTEGER DEFAULT 5;
DECLARE interval INTEGER;
DECLARE rangeFrom INTEGER;
IF (level > 0) THEN
interval := defineValue * power(2, level - 1);
rangeFrom = FLOOR(age / interval) * interval;
outValue := '[' || rangeFrom || '-' || rangeFrom + interval - 1 || ']';
ELSE
outValue := age;
END IF;
END;
Create an anonymized view based on k-anonymity:
CREATE VIEW SALES_EMPLOYEES_ANON AS
SELECT ID, SITE, GENDER, AGE, SALARY
FROM SALES_EMPLOYEES
WITH ANONYMIZATION (
ALGORITHM 'K-ANONYMITY'
PARAMETERS '{"k": 2}'
COLUMN ID PARAMETERS '{"is_sequence": true}'
COLUMN SITE PARAMETERS '{"is_quasi_identifier":true, "hierarchy":{"schema":"EXT", "view":"V_HIER_GEO"}}'
COLUMN GENDER PARAMETERS '{"is_quasi_identifier":true, "hierarchy":{"embedded": [["Female"], ["Male"]]}}'
COLUMN AGE PARAMETERS '{"is_quasi_identifier":true, "hierarchy":{"schema":"EXT", "function":"GROUP_AGE", "levels":3}}'
);
Set the data change strategy to restricted:
PARAMETERS '{"data_change_strategy": "restricted", "k": 2}'
Access the view (as the access_user with the masked privilege):
SELECT * FROM M_ANONYMIZATION_VIEWS;
REFRESH VIEW SALES_EMPLOYEES_ANON ANONYMIZATION;
SELECT * FROM SALES_EMPLOYEES ORDER BY SITE, GENDER, AGE;
Refresh the view again and then execute the SELECT statement:
The GET_ANONYMIZATION_VIEW_STATISTICS procedure can be used to retrieve the data anonymization KPIs.
Note that only the data controller who defined the view (with CREATE VIEW) or a user with CATALOG READ privileges can currently obtain results using GET_ANONYMIZATION_VIEW_STATISTICS.
Action get_names
The get_names action retrieves all available KPIs for the specified anonymization view:
GET_ANONYMIZATION_VIEW_STATISTICS('get_names', NULL, '<schema_name>', '<view_name>')
CALL GET_ANONYMIZATION_VIEW_STATISTICS('get_names', NULL, 'EXT', 'SALES_EMPLOYEES_ANON');
Action get_values
The get_values action retrieves the computed values for all available KPIs:
GET_ANONYMIZATION_VIEW_STATISTICS('get_values', NULL, '<schema_name>', '<view_name>')
CALL GET_ANONYMIZATION_VIEW_STATISTICS('get_values', NULL, 'EXT', 'SALES_EMPLOYEES_ANON');
Age is now the most important attribute, so location is generalized up one level.
SELECT E.ID, E.SALARY, A.SALARY FROM SALES_EMPLOYEES E INNER JOIN SALES_EMPLOYEES_ANON A ON (A.ID=E.ID);
SELECT AVG(SALARY) FROM SALES_EMPLOYEES UNION SELECT AVG(SALARY) FROM SALES_EMPLOYEES_ANON;
SELECT 'RAW' AS TYPE, GENDER, AVG(SALARY) AS SALARY
FROM SALES_EMPLOYEES GROUP BY GENDER
UNION
SELECT 'ANON' AS TYPE, GENDER, AVG(SALARY) AS SALARY_ANON
FROM SALES_EMPLOYEES_ANON GROUP BY GENDER