Data Masking

Strictly speaking, data masking means obscuring data anywhere it lives, but in practice the term usually refers to "test data generation" or "analytical data generation": converting production data into test and development data, or into data for a data warehouse. The conversion removes or "masks" the sensitive data to protect the people it describes, while leaving the nonsensitive data usable for testing, analysis, or other purposes.

Organizations are increasingly finding that data masking is mandatory for regulatory compliance, and with good reason: it is an effective way to reduce enterprise risk and protect consumers. Development and test environments are often less secure than production, and developers rarely need access to real sensitive data. Data masking should therefore follow a number of guidelines, especially in testing and development.

Data masking must not be reversible. However you mask your data, it should never be possible to recover the original sensitive values from the masked copy.
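One simple way to get irreversibility is to replace sensitive values with random data that has no mathematical relationship to the original. The sketch below is illustrative only: the function name random_digits_like and the sample value are assumptions, not part of any particular masking product.

```python
import secrets
import string


def random_digits_like(value: str) -> str:
    """Replace every digit with a random digit; keep separators such as '-'."""
    return "".join(
        secrets.choice(string.digits) if ch.isdigit() else ch
        for ch in value
    )


# Hypothetical sample value in a social-security-number layout.
masked_ssn = random_digits_like("123-45-6789")
print(masked_ssn)  # same layout, different digits on every run
```

Because the replacement is random, nothing in the output can be used to work back to the input. The trade-off is that the same input produces a different output each time, so this approach suits columns that are not used as keys.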

The results must still be representative of the source data. The reason to mask data rather than generate random data is that masked data protects sensitive information yet still resembles actual production data for development and testing purposes. Characteristics worth keeping intact include the geographic distribution of records and the human readability of the (fake) names and addresses.
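As a rough illustration of representative masking, the following sketch swaps names and street addresses for readable fakes while leaving city, state, and ZIP code untouched, so the geographic distribution survives. The field names, the substitution pools, and the function mask_customer are all assumptions made for the example.

```python
import random

# Small hypothetical substitution pools; a real pool would be much larger.
FIRST_NAMES = ["Alex", "Jordan", "Morgan", "Taylor"]
LAST_NAMES = ["Rivera", "Chen", "Okafor", "Novak"]
STREET_NAMES = ["Oak St", "Maple Ave", "Cedar Rd", "Pine Ln"]


def mask_customer(record: dict, rng: random.Random) -> dict:
    """Return a copy with a fake name and street; other fields are unchanged."""
    masked = dict(record)
    masked["name"] = f"{rng.choice(FIRST_NAMES)} {rng.choice(LAST_NAMES)}"
    masked["street"] = f"{rng.randint(1, 9999)} {rng.choice(STREET_NAMES)}"
    return masked


customer = {
    "name": "Real Person",
    "street": "1 Real Way",
    "city": "Denver",
    "state": "CO",
    "zip": "80202",
}
masked_customer = mask_customer(customer, random.Random())
print(masked_customer)
```

The masked record still looks like a customer in Denver, which is what a developer or tester typically needs, but the name and street no longer belong to anyone in production.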

Only mask non-sensitive data if it can be used to retrieve sensitive data. It is not necessary to mask everything in your database, just the parts that are sensitive; leaving the rest alone keeps the data accurate. But remember that some non-sensitive data can be used to recreate the original sensitive data or to associate masked records back to it. This is called inference analysis, and your data masking should protect against it.
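One common defense against inference, shown here only as a sketch, is to generalize fields that are harmless alone but identifying in combination. The example assumes a record with an ISO-formatted birth_date and a five-digit zip; both field names and the function generalize are hypothetical.

```python
def generalize(record: dict) -> dict:
    """Coarsen fields that could be combined to re-identify a person."""
    out = dict(record)
    # Assumes ISO dates (YYYY-MM-DD): keep only the year.
    out["birth_date"] = record["birth_date"][:4]
    # Keep the first three ZIP digits so regional distribution is preserved.
    out["zip"] = record["zip"][:3] + "XX"
    return out


patient = {"birth_date": "1984-07-19", "zip": "80202", "plan": "standard"}
print(generalize(patient))
# {'birth_date': '1984', 'zip': '802XX', 'plan': 'standard'}
```

How far to coarsen each field depends on how many real records share the resulting combination of values; the cut-offs above are arbitrary choices for illustration.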

Referential integrity must also be maintained. If a credit card number is a primary key and is scrambled as part of the masking, then every instance of that number linked through primary/foreign key relationships must be scrambled identically.
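A minimal sketch of identical scrambling across tables, assuming a keyed one-way function (HMAC-SHA256 from the Python standard library) is acceptable for the data in question. The table layouts, the key, and the function mask_card_number are hypothetical, and the key is assumed to be stored outside the test environment.

```python
import hashlib
import hmac


def mask_card_number(card_number: str, key: bytes) -> str:
    """Map a digit string to a same-length digit string, deterministically."""
    digest = hmac.new(key, card_number.encode("ascii"), hashlib.sha256).digest()
    width = len(card_number)
    return str(int.from_bytes(digest, "big") % 10**width).zfill(width)


KEY = b"hypothetical-key-kept-out-of-test"  # assumption: managed separately

customers = [{"card_number": "4111111111111111", "name": "A. Customer"}]
payments = [
    {"card_number": "4111111111111111", "amount": 25.00},
    {"card_number": "4111111111111111", "amount": 7.50},
]

# The same function and key are applied to every table that holds the value.
masked_customers = [
    {**row, "card_number": mask_card_number(row["card_number"], KEY)}
    for row in customers
]
masked_payments = [
    {**row, "card_number": mask_card_number(row["card_number"], KEY)}
    for row in payments
]
```

Because the mapping is deterministic, joins between the masked tables still work. Two caveats apply to this sketch: reducing the digest to a fixed number of digits can in principle map two different inputs to the same output, so a real process should check key uniqueness after masking, and the masked numbers will not generally pass a card checksum.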

Finally, data masking must be a repeatable process. Development and test data needs to track constantly changing production data as closely as possible, and analytical data may need to be regenerated daily or hourly. This means your data masking and processing should be automated; a manual masking process is inefficient, expensive, and ineffective.
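To make the process repeatable, the masking rules can live in code that a scheduler runs on every refresh. The sketch below assumes the production extract arrives as CSV; the function mask_csv, the column names, and the redact rule are illustrative assumptions.

```python
import csv
import io


def mask_csv(source, target, maskers: dict) -> int:
    """Copy CSV rows from source to target, masking the configured columns.

    maskers maps a column name to a function that takes and returns a string.
    Returns the number of data rows written.
    """
    reader = csv.DictReader(source)
    writer = csv.DictWriter(target, fieldnames=reader.fieldnames or [])
    writer.writeheader()
    count = 0
    for row in reader:
        for column, masker in maskers.items():
            row[column] = masker(row[column])
        writer.writerow(row)
        count += 1
    return count


def redact(value: str) -> str:
    """Placeholder rule; a real rule would produce realistic fake values."""
    return "REDACTED"


# In-memory streams stand in for the production extract and the test copy.
source = io.StringIO("id,name\n1,Alice\n2,Bob\n")
target = io.StringIO()
rows_written = mask_csv(source, target, {"name": redact})
print(rows_written, "rows masked")
```

Keeping the column-to-rule mapping in one place means the same job can be rerun daily or hourly without manual steps, and a new sensitive column only requires adding one entry.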

Data Management