13 Sep, 2022

Everything You Wanted to Know About Data Anonymization

Data anonymization renders processed data incapable of revealing the identity of the persons it describes. Under the Personal Data Protection Law (the Law), anonymization means “making personal data incapable of being associated with an identified or identifiable natural person under any circumstances, even by matching it with other data”.

The purpose of anonymizing data is to break every connection between the data and the person it identifies. The processes that break these ties, carried out by automated or manual methods such as “grouping”, “masking”, “derivation”, “generalization”, and “randomization” applied to the records in the filing system where personal data is kept, are collectively called “anonymization methods”. The data obtained by applying these methods must not be able to identify a specific person.

Examples of anonymization methods, in line with the “Guideline on the Deletion, Destruction or Anonymization of Personal Data”¹ published by the Personal Data Protection Board (the Board), are given in the table below.

Anonymization Methods That Do Not Cause Value Distortion:
  • Removing Variables
  • Removing Records
  • Local Suppression
  • Generalization
  • Lower and Upper Bound Coding
  • Global Coding
  • Sampling

Anonymization Methods That Cause Value Distortion:
  • Micro-Aggregation
  • Data Swapping
  • Noise Addition

Statistical Methods That Strengthen Anonymization:
  • K-Anonymity
  • L-Diversity
  • T-Closeness

Table 1: Anonymization Methods


¹ The Guideline on the Deletion, Destruction or Anonymization of Personal Data is available here: https://www.kvkk.gov.tr/SharedFolderServer/CMSFiles/bc1cb353-ef85-4e58-bb99-3bba31258508.pdf

Data Anonymization Techniques

1. Anonymization Methods That Do Not Cause Value Distortion

In these methods, no change, addition, or removal is applied to the individual values in the data set; instead, entire rows or columns are modified or removed. The data set as a whole changes, but the values remaining in its fields keep their original state. Some of the anonymization methods that do not cause value distortion are:

  • Removing Variables: Anonymization is achieved by deleting one or more variables from the table entirely, so that the corresponding columns are removed completely. This method can be used when the variable is a strong identifier, when no more suitable solution exists, when the variable is too sensitive to be disclosed to the public, or when it serves no analytical purpose.
  • Removing Records: Anonymity is strengthened by removing rows that make the data set singular, which reduces the possibility of generating assumptions about it. The removed records are usually those that share no common values with other records and could easily be guessed by someone with background knowledge of the data set.
  • Local Suppression: The aim of local suppression (also called regional hiding) is to make the data set safer and reduce the risk of re-identification. If a combination of values for a particular record occurs very rarely and is likely to make that person distinguishable within the relevant community, the value creating the exceptional situation is replaced with “unknown”.
  • Generalization: The relevant personal data is converted from a specific value to a more general one. It is the most widely used method for producing cumulative reports and for operations on aggregate figures. The resulting values represent totals or statistics belonging to a group, making it impossible to reach a real person.
  • Lower and Upper Bound Coding: A category is defined for a given variable, and the values falling within the resulting grouping are merged. Typically, the lowest or highest values of a variable are collected together and a new label is assigned to them.
  • Global Coding: Global coding is a grouping method used for data sets that contain no numeric values, or whose values cannot be ordered numerically, where lower and upper bound coding is not applicable. It is generally applied where aggregating certain values makes predictions and assumptions easy. A common new group is created for the selected values, and all matching records in the data set are replaced with this new label.
  • Sampling: Instead of the whole data set, only a subset of it is described or shared. Since it is then unknown whether a person known to be in the full data set is included in the sample, the risk of producing accurate inferences about individuals is reduced. Simple statistical methods are used to select the subset.
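A few of these methods can be sketched in plain Python. The records, field names, and thresholds below are invented purely for illustration; they are not taken from the Guideline:

```python
import random

# Toy data set; every record, field name, and threshold is illustrative only.
records = [
    {"age": 23, "city": "Ankara", "salary": 41000},
    {"age": 37, "city": "Izmir", "salary": 52000},
    {"age": 61, "city": "Ankara", "salary": 97000},
    {"age": 45, "city": "Bursa", "salary": 64000},
]

def generalize_age(age):
    """Generalization: replace an exact age with a ten-year band."""
    low = (age // 10) * 10
    return f"{low}-{low + 9}"

def bound_code_salary(salary, low=45000, high=90000):
    """Lower and upper bound coding: collapse values below/above the
    thresholds into 'low'/'high' categories; keep a 'medium' band."""
    if salary < low:
        return "low"
    if salary > high:
        return "high"
    return "medium"

anonymized = [
    {"age": generalize_age(r["age"]),
     "city": r["city"],
     "salary": bound_code_salary(r["salary"])}
    for r in records
]

# Sampling: release a random subset instead of the full set, so it is
# uncertain whether any given person is present in the shared data.
sample = random.sample(anonymized, 2)
```

Removing variables or records would correspond simply to dropping keys from each dictionary or dropping list elements, so they are omitted here.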

2. Anonymization Methods That Cause Value Distortion

Unlike the methods above, these methods distort the data set by changing its existing values. Since the values of the records change, the benefit expected from the data set must be assessed carefully beforehand. Even though individual values change, the data can still be useful provided the aggregate statistics are preserved. Some of the anonymization methods that cause value distortion are:

  • Micro-Aggregation: All records in the data set are first sorted into a meaningful order, and the set is then divided into a number of subsets. For the chosen variable, the mean of each subset is computed and every value in that subset is replaced with the mean. The average of that variable over the entire data set therefore does not change.
  • Data Swapping: Record pairs are selected and the values of a subset of variables are exchanged between them. The method is used mainly for categorical variables, and the core idea is to transform the database by swapping variable values among records belonging to different individuals.
  • Noise Addition: Additions and subtractions of a determined magnitude are applied to a selected variable to introduce controlled distortion. The method is mostly applied to data sets containing numeric values, and the distortion is applied equally to each value.
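Micro-aggregation and noise addition lend themselves to short sketches. The salary figures, group size, and noise scale below are made-up parameters for illustration:

```python
import random

salaries = [41000, 52000, 58000, 61000, 64000, 97000]

def micro_aggregate(values, group_size=3):
    """Micro-aggregation: sort the values, split them into fixed-size
    groups, and replace each value with its group mean. With equal-size
    groups the overall mean of the variable is preserved."""
    ordered = sorted(values)
    result = []
    for i in range(0, len(ordered), group_size):
        group = ordered[i:i + group_size]
        mean = sum(group) / len(group)
        result.extend([mean] * len(group))
    return result

def add_noise(values, scale=1000.0, seed=42):
    """Noise addition: perturb every value by a random amount drawn from
    the same fixed range, so the distortion is applied equally to each."""
    rng = random.Random(seed)
    return [v + rng.uniform(-scale, scale) for v in values]

aggregated = micro_aggregate(salaries)
noisy = add_noise(salaries)
```

Note that after micro-aggregation the sum (and hence the mean) of the variable is unchanged, which is exactly the property the method is designed to keep.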

3. Statistical Methods to Strengthen Anonymization

In anonymized data sets, combinations of certain values in records with singular characteristics may still make it possible to determine the identities of the people in those records or to derive assumptions about their personal data. Anonymity can therefore be strengthened by applying statistical methods that minimize the singularity of records in the data set. The main goal of these methods is to minimize the risk of anonymity being compromised while keeping the utility of the data set at an acceptable level.

  • K-Anonymity: The realization that, in anonymized data sets, identities can be recovered by combining indirect identifiers in the right way, or that information about a particular person can be easily guessed, shook confidence in anonymization processes. It became necessary to make statistically anonymized data sets more reliable.

K-anonymity was developed so that, for certain combinations of fields in a data set, each combination matches more than one person, preventing the disclosure of information specific to individuals with singular characteristics. If a given combination of variable values corresponds to more than one record, the probability of identifying the individuals matching that combination decreases.

  • L-Diversity: Developed through studies on the shortcomings of k-anonymity, the l-diversity method also takes into account the diversity of the sensitive values that correspond to the same combination of quasi-identifying variables.
  • T-Closeness: Although l-diversity provides diversity in sensitive values, it does not consider their content or semantic closeness, so in some cases it does not provide sufficient protection. The method of computing the degree of closeness of sensitive values to one another and partitioning the data set into subclasses according to these degrees of closeness is called t-closeness.
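As a rough sketch, the k and l of a small data set can be measured directly. The records, the choice of quasi-identifiers, and the sensitive attribute below are hypothetical:

```python
from collections import Counter

def k_anonymity(records, quasi_identifiers):
    """k of a data set: the size of the smallest group of records that
    share the same combination of quasi-identifier values."""
    combos = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return min(combos.values())

def l_diversity(records, quasi_identifiers, sensitive):
    """l of a data set: the smallest number of distinct sensitive values
    found within any quasi-identifier group."""
    groups = {}
    for r in records:
        key = tuple(r[q] for q in quasi_identifiers)
        groups.setdefault(key, set()).add(r[sensitive])
    return min(len(values) for values in groups.values())

rows = [
    {"age_band": "20-29", "city": "Ankara", "diagnosis": "flu"},
    {"age_band": "20-29", "city": "Ankara", "diagnosis": "asthma"},
    {"age_band": "30-39", "city": "Izmir", "diagnosis": "flu"},
    {"age_band": "30-39", "city": "Izmir", "diagnosis": "flu"},
]

k = k_anonymity(rows, ["age_band", "city"])               # 2: each combination occurs twice
l = l_diversity(rows, ["age_band", "city"], "diagnosis")  # 1: the Izmir group has one diagnosis
```

The example shows why l-diversity was needed: the set is 2-anonymous, yet anyone known to be a 30-39-year-old from Izmir is revealed to have the flu, because that group contains only one sensitive value.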

Choosing the Anonymization Method

Data controllers decide which of the above methods to apply based on the data they hold. The Guideline recommends that data controllers take the following characteristics of the data set into account when applying anonymization methods:

  • The nature of the data,
  • The size of the data,
  • The structure of the data in physical environments,
  • The variety of the data,
  • The benefit sought from the data and the purpose of processing,
  • The frequency of processing,
  • The reliability of the party to which the data will be transferred,
  • Whether the effort required to anonymize the data is proportionate,
  • The magnitude and scope of the damage that could arise if the anonymity of the data is compromised,
  • The distribution/centralization ratio of the data,
  • The access authorization controls applied to the relevant data,
  • Whether the effort required to construct and carry out an attack that would break anonymity is likely to be worthwhile.

A data controller that considers it has anonymized data remains responsible, through contracts and risk analyses, for checking whether that data can be re-identified by combining it with information known to be held by the other institutions and organizations to which it transfers personal data.

General Data Protection Regulation and Data Anonymization

The General Data Protection Regulation (GDPR) sets out a specific set of rules that protect user data and create transparency. Although the GDPR is strict, it allows companies to collect anonymous data without consent, use it for any purpose, and store it for an indefinite period, provided they remove all identifiers from the data. Under these strict rules, if you intend to use and store data indefinitely, all identifiers, both direct and indirect, must be removed from the data to ensure the protection of individuals.

The GDPR's operative articles do not define anonymization, but Recital 26 addresses it: “The principles of data protection should therefore not apply to anonymous information, namely information which does not relate to an identified or identifiable natural person or to personal data rendered anonymous in such a manner that the data subject is not or no longer identifiable. This Regulation does not therefore concern the processing of such anonymous information, including for statistical or research purposes.”

Instead of anonymization, the GDPR defines the related term “pseudonymisation”: the processing of personal data in such a manner that it can no longer be attributed to a specific data subject without the use of additional information, provided that such additional information is kept separately and is subject to technical and organisational measures to ensure non-attribution. Pseudonymisation is regarded as a safeguard that reduces the risks to data subjects and helps controllers and processors fulfil their data protection obligations. Recital 28 emphasizes this role: “The application of pseudonymisation to personal data can reduce the risks to the data subjects concerned and help controllers and processors to meet their data-protection obligations. The explicit introduction of ‘pseudonymisation’ in this Regulation is not intended to preclude any other measures of data protection.”

In this context, pseudonymisation refers to processing personal data in such a way that it can no longer be attributed to a specific data subject without the use of additional information. That additional information must be kept separately, and technical and administrative measures must be taken to ensure that the personal data is not attributed to an identified or identifiable natural person.

When anonymization is done correctly, the data can no longer be associated with the persons concerned and therefore ceases to be personal data. Since the GDPR does not apply to anonymized data, such data can be used freely.


To request a quotation for the following: Cyber Security, Digital Transformation, MSSP, Penetration Testing, KVKK, GDPR, ISO 27001 and ISO 27701, please click here.

