Data Anonymization in SAP HANA

Anonymization methods available in the SAP HANA database allow you to gain statistically valid insights from your data while protecting the privacy of individuals.

Why Anonymize?

In a data-driven world, a growing amount of business data contains personal or sensitive information. If this data is to be used by applications for statistical analysis, it must be protected to ensure privacy. Trivial modifications to the data like replacing information that directly identifies an individual such as name or social security number (pseudonymization) or simply removing the information is not enough. Re-identification is still possible, for example if additional information is obtained (referred to as a linkage attack).

Unlike masking and pseudonymization, anonymization methods (also called privacy-enhancing methods) provide a more structured approach to modifying data for privacy protection. The quality of such anonymized or privacy-enhanced data is still sufficient for meaningful analysis. Several anonymization methods exist.

SAP HANA supports the methods k-anonymity, l-diversity, and differential privacy. Which method provides the most appropriate level of privacy depends on your data and the potential attack scenarios and attackers.

Anonymization Workflow

A data controller – that is someone who determines when and how personal data is accessed and processed – defines an SQL view and configures the parameters of the chosen anonymization method to meet the required privacy level. Access to the anonymized view can then be granted to users using standard SAP HANA authorization mechanisms. An overview of this process is shown below:

Data Anonymization Methods

Data anonymization methods provide a structured approach to modifying data for privacy protection. SAP HANA supports the data anonymization methods k-anonymity, l-diversity, and differential privacy.

Anonymizing Data in SAP HANA – EXT Schema

Data anonymization can be applied to SQL views, thus enabling analytics on data while still protecting the privacy of individuals.

Create a table in your EXT Schema for your data

SET SCHEMA EXT;

CREATE COLUMN TABLE EXT.SALES_EMPLOYEES (
  ID INTEGER NOT NULL PRIMARY KEY GENERATED ALWAYS AS IDENTITY,
  NAME NVARCHAR(50),
  SITE NVARCHAR(20),
  GENDER NVARCHAR(10),
  AGE NVARCHAR(10),
  SALARY DOUBLE
  );
  

Insert your data to your table which contains few PII

INSERT INTO SALES_EMPLOYEES VALUES ('Henry Brubaker','Madrid','Male','29',91000);
INSERT INTO SALES_EMPLOYEES VALUES ('Jaylene Jennings','Paris','Female','19',67000);
INSERT INTO SALES_EMPLOYEES VALUES ('Dean Tavalouris','Boston','Male','20',76000);
INSERT INTO SALES_EMPLOYEES VALUES ('Lydia Wong','Dallas','Female','18',81000);
INSERT INTO SALES_EMPLOYEES VALUES ('Harvey Denton','Madrid','Male','19',45000);
INSERT INTO SALES_EMPLOYEES VALUES ('Pamela Doove','Nantes','Female','21',78000);
INSERT INTO SALES_EMPLOYEES VALUES ('Luciana Trujillo','Barcelona','Female','24',92000);
INSERT INTO SALES_EMPLOYEES VALUES ('Demarcus Collins','Bordeaux','Male','18',67000);
INSERT INTO SALES_EMPLOYEES VALUES ('Edward Tattsyrup','Barcelona','Male','18',78000);
INSERT INTO SALES_EMPLOYEES VALUES ('Benjamin Denton','Paris','Male','19',98000);
INSERT INTO SALES_EMPLOYEES VALUES ('Joaquin Barry','Madrid','Male','20',34000);

commit;

Create Geography Hierarchy table

CREATE COLUMN TABLE EXT.HIER_GEO (
  CHILD NVARCHAR(20),
  PARENT NVARCHAR(20)
  );

Insert your Geography data into table.

INSERT INTO HIER_GEO VALUES ('Europe', NULL);
INSERT INTO HIER_GEO VALUES ('North America', NULL);
INSERT INTO HIER_GEO VALUES ('Spain', 'Europe');
INSERT INTO HIER_GEO VALUES ('France', 'Europe');
INSERT INTO HIER_GEO VALUES ('USA', 'North America');
INSERT INTO HIER_GEO VALUES ('Canada', 'North America');

commit;

Create a view for the Geography Hierarchy

CREATE VIEW V_HIER_GEO AS 
  SELECT * 
    FROM HIERARCHY (
      SOURCE ( SELECT CHILD AS NODE_ID, PARENT AS PARENT_ID FROM HIER_GEO )
      SIBLING ORDER BY PARENT_ID, NODE_ID
      );

Create a Function for Age

CREATE OR REPLACE FUNCTION EXT.GROUP_AGE (age NVARCHAR(10), level INTEGER)
RETURNS outValue NVARCHAR(50) AS
BEGIN
  DECLARE defineValue INTEGER DEFAULT 5;
  DECLARE interval INTEGER;
  DECLARE rangeFrom INTEGER;
  IF (level > 0) THEN 
    interval := defineValue * power(2, level - 1);
    rangeFrom = FLOOR(age / interval) * interval;
    outValue := '[' || rangeFrom || '-' || rangeFrom + interval - 1 || ']';
  ELSE
    outValue := age;
  END IF;
END;  

Create an anonymized view based on k-anonymity:

CREATE VIEW SALES_EMPLOYEES_ANON AS
  SELECT ID, SITE, GENDER, AGE, SALARY
  FROM SALES_EMPLOYEES
  WITH ANONYMIZATION (
    ALGORITHM 'K-ANONYMITY'
    PARAMETERS '{"k": 2}'
    COLUMN ID PARAMETERS '{"is_sequence": true}'
    COLUMN SITE PARAMETERS '{"is_quasi_identifier":true, "hierarchy":{"schema":"EXT", "view":"V_HIER_GEO"}}'
    COLUMN GENDER PARAMETERS '{"is_quasi_identifier":true, "hierarchy":{"embedded": [["Female"], ["Male"]]}}'
    COLUMN AGE PARAMETERS '{"is_quasi_identifier":true, "hierarchy":{"schema":"EXT", "function":"GROUP_AGE", "levels":3}}'
  );

Set the data change strategy to restricted:

PARAMETERS '{"data_change_strategy": "restricted", "k": 2}'

Access the view (as the access_user with the masked privilege):

SELECT * FROM M_ANONYMIZATION_VIEWS;

REFRESH VIEW SALES_EMPLOYEES_ANON ANONYMIZATION;

SELECT * FROM SALES_EMPLOYEES ORDER BY SITE, GENDER, AGE;

Refresh the view again and then execute the SELECT statement:


The GET_ANONYMIZATION_VIEW_STATISTICS procedure can be used to retrieve the data anonymization KPIs.

Note that only the data controller who defined the view (with CREATE VIEW) or a user with CATALOG READ privileges can currently obtain results using GET_ANONYMIZATION_VIEW_STATISTICS.

Action get_names

The get_names action retrieves all available KPIs for the specified anonymization view:

GET_ANONYMIZATION_VIEW_STATISTICS('get_names', NULL, '<schema_name>', '<view_name>')

CALL GET_ANONYMIZATION_VIEW_STATISTICS('get_names', NULL, 'EXT', 'SALES_EMPLOYEES_ANON');

Action get_values

The get_values action retrieves the computed values for all available KPIs:

GET_ANONYMIZATION_VIEW_STATISTICS('get_values', NULL, '<schema_name>', '<view_name>')

CALL GET_ANONYMIZATION_VIEW_STATISTICS('get_values', NULL, 'EXT', 'SALES_EMPLOYEES_ANON');

Age is now the most important attribute, so location is generalized up one level.

SELECT E.ID, E.SALARY, A.SALARY FROM SALES_EMPLOYEES E INNER JOIN SALES_EMPLOYEES_ANON A ON (A.ID=E.ID);

SELECT AVG(SALARY) FROM SALES_EMPLOYEES UNION SELECT AVG(SALARY) FROM SALES_EMPLOYEES_ANON;

SELECT 'RAW' AS TYPE, GENDER, AVG(SALARY) AS SALARY 
FROM SALES_EMPLOYEES GROUP BY GENDER 
UNION 
SELECT 'ANON' AS TYPE, GENDER, AVG(SALARY) AS SALARY_ANON 
FROM SALES_EMPLOYEES_ANON GROUP BY GENDER

 


Anonymization Views: Best Practices

Read the Article

Sara Sampaio

Sara Sampaio

Author Since: March 10, 2022

0 0 votes
Article Rating
Subscribe
Notify of
0 Comments
Inline Feedbacks
View all comments
0
Would love your thoughts, please comment.x
()
x