While we know that software can expose data, we sometimes forget that writing software can expose data.
When a system gets deployed, we typically build a development environment, one or more test environments, and a production environment. No surprises there. However, developing software with sample data, instead of “real” data, can allow defects that are difficult to catch. On the other hand, using “real” data (typically a subset of production data) runs considerable data security risks. In this post, I’ll discuss the notion of building a general purpose deidentification tool specifically for software development and DevOps purposes.
Software development and the corner case
Software developers are an interesting lot. We are usually deeply analytical and like to solve difficult problems. We look for logical combinations that your typical user doesn’t consider. I call these “corner cases”: places where the logic isn’t always thought out in advance.
Examples of this kind of logical corner case are easy to come across. Consider this example:
- In 1997, Contoso Inc. started running a home-grown customer relationship management system. In that system, the concept of “account” was the same as the concept of “customer”.
- In 2012, Contoso Inc. upgraded to SalesForce or Dynamics CRM, both of which have better data models. In those systems, it is entirely possible for a customer to NOT be an account. So to move the data from the old system to the new system, we created “fake customer” records from the old account records and created a one-for-one mapping.
- Now, it’s 2017, and Contoso Inc. is upgrading a front end tool that accesses customer data. Most of the customer data was created by the CRM tool (SalesForce or Dynamics), but about 15% of the customer data is still in the form of these “artificial” records created five years ago during the migration. Our new tool has to cope with both “new” records (created since 2012) and “old” records (created prior to 2012).
- The new application has to execute a rule based on the data, but the data may be inconsistent between old records and new records. How do we apply the rule?
This is a corner case: a place where the logic may not be intuitive and which architects tend to treat lightly. Handling corner cases often requires bits of logic that are highly complex and difficult to test. These bits of logic can be the difference between a successful system and a total calamity.
Corner cases are normal in software development, as most of my gentle readers will recognize. This is the kind of thing that business users often forget to describe. We have to ask. We have to design for them. We have to test them. We sometimes have to simulate operations using them. Corner cases are a bear. I’ve seen systems where over 50% of the code was spent coping with corner cases.
Risking the Exposure of Data for Testing
When writing a system, software developers often find themselves in an interesting bind. In order to test the system adequately, you need realistic data. In the best case, you need production data, so that all your logic can be tested for corner cases like the one above. But if you put production data into a partial system under development, you risk its exposure.
Your customers may not like the fact that their data was copied from the secure production system to a test environment for a new application. That test environment typically doesn’t have mature security protocols in place. The data may not be encrypted at rest, and the authentication logic may need debugging. Configuration of web services may not be “hardened” to prevent illicit access. In many organizations, your data security officer will flat out reject the idea.
In addition, the development team building that system may reside in another country from the customer, running the risk that simply shipping the data from production to test violates local laws or regulations about data privacy.
Do not take this lightly. Companies can be killed on the basis of lost customer data. Exposing customer data from the test environment is a real risk.
This catch-22 hits teams all the time, but I wonder how often people THINK about it. You cannot develop a really solid system without testing against production data, but you cannot put production data into an incomplete system where it can be exposed.
What About De-identification
Pulling production data into a software test environment creates a risk of exposure for potentially sensitive data. It’s a security risk, plain and simple. I suggest that development teams should “de-identify” production data before pulling it into a test environment.
The concept of de-identification has been around a long time. Basically, the idea is that a data record identifies a real-life person (or company) if I can look at the record and work out who that person is. The concept has been used widely in healthcare data management, where we refer to this kind of data as Protected Health Information or PHI. While I refer to health data, I believe it is time we start widely extending this concept to all personally identifiable information (PII) such as account data, purchase or rental records, telecommunication records, customer service data, legal matter data, human resource data, sales partner data, and energy generation data.
Deidentification does not apply equally to all data fields. Obvious things like name, address, and social security number have to be masked or removed, but indirect identification data fields should be addressed as well. For example, in the neighborhood where I live, there are only three families that are African American, and among those three there is only one African American male over the age of 50. (It’s a small neighborhood.) So if we can pinpoint a data record to my neighborhood for a 54 year old African American male, I can identify the individual. This is called “re-identification” and it’s a real problem. One major goal of de-identification is to prevent re-identification.
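To make the re-identification risk concrete, here is a minimal sketch that flags records whose combination of indirect identifiers is shared by fewer than k records in the dataset. The function name, the field names (zip, age_band, sex), and the sample rows are all made up for illustration. This is only a rough check in the spirit of k-anonymity, not a full risk assessment.

```python
from collections import Counter

def risky_records(records, quasi_identifiers, k=2):
    """Return records whose quasi-identifier combination appears fewer than k times."""
    def key(rec):
        return tuple(rec[field] for field in quasi_identifiers)

    counts = Counter(key(rec) for rec in records)
    return [rec for rec in records if counts[key(rec)] < k]

# Hypothetical sample data; the field names are assumptions, not a real schema.
records = [
    {"id": 1, "zip": "45201", "age_band": "50-59", "sex": "M"},
    {"id": 2, "zip": "45201", "age_band": "30-39", "sex": "F"},
    {"id": 3, "zip": "45201", "age_band": "30-39", "sex": "F"},
    {"id": 4, "zip": "45202", "age_band": "50-59", "sex": "M"},
]

flagged = risky_records(records, ["zip", "age_band", "sex"])
print([rec["id"] for rec in flagged])  # [1, 4]: each is unique on these three fields
```

Any record that comes back from a check like this is a candidate for coarser values (a wider age band, a larger geographic area) or for removal from the test set.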
In medical research, it matters that de-identification does not destroy underlying demographic data. For example, for health care data, government agencies want to know how many people of a specific age have been diagnosed with heart disease. Researchers may want to know what neighborhood the patient lives in, their gender, their race, and whether they have other conditions that may affect risk factors (like diabetes or family history of heart disease). This creates interesting challenges. After data is de-identified, it has to remain useful for analysis. The more data you alter for the sake of de-identification, the more secure the data becomes but the less useful the results become. It’s an imperfect tradeoff.
An excellent presentation on the current state of the art of de-identification and the risk of re-identification in a healthcare setting can be found here.
De-identification in software testing is a different beast
Unlike the need for research, the needs for data security in software testing are typically much lighter. We want to make sure that corner cases remain intact, but we don’t need to worry about maintaining a perfect balance of demographic fields or to have such large datasets that no single record can be traced to a single individual. We want to remove names but we don’t need to encrypt or mask the names. It’s OK if the names are simply rotated (so that records for Mary Smith and Tom Jones get translated into Mary Jones and Tom Smith).
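As a rough illustration of rotating names, the sketch below shifts the values of one field among the records by a random non-zero offset. It assumes the records are plain dictionaries; the function name rotate_field and the sample data are hypothetical.

```python
import random

def rotate_field(records, field, seed=None):
    """Return copies of the records with the values of `field` shifted among them."""
    n = len(records)
    if n < 2:
        # Nothing to rotate with; return unchanged copies.
        return [dict(rec) for rec in records]
    # A non-zero offset so that no record keeps the value from its own position.
    offset = random.Random(seed).randrange(1, n)
    values = [rec[field] for rec in records]
    return [{**rec, field: values[(i + offset) % n]} for i, rec in enumerate(records)]

people = [
    {"first": "Mary", "last": "Smith"},
    {"first": "Tom", "last": "Jones"},
]
rotated = rotate_field(people, "last")
print(rotated)  # Mary now carries "Jones" and Tom carries "Smith"
```

Note that this rotates per record, not per distinct value, so two records that originally shared a surname may end up with different ones. If that coincidence matters to a test, a value-to-value mapping would be a better fit than a positional shift.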
Unless the application involves mapping, it’s OK to rejigger address data to point to addresses that simply do not exist but still, for the sake of testing, look like addresses. (123 Maple Street, Cincinnati becomes 9812 Oak Terrace, Cincinnati, even if that street doesn’t actually exist in Cincinnati.)
However, coincidences are a big part of corner cases. We should do our best to maintain coincidences. In that case, if two records describe the same person or location, those records should be altered in the same way. For example: Let’s say our application stores driver’s license numbers. If we change the driver’s license number for Tom Smith from 321000 to 981000, and it just so happens that Mary Jones also has the driver’s license number of 321000, her number should also be changed to 981000. However, if a hacker were to get hold of that record, they should NOT be able to convert it back to the original number of 321000. In other words, we want to use a one-way function called a hash.
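One way to get this behavior is a keyed hash. The sketch below uses HMAC-SHA256 from the Python standard library and folds the result back into a fixed number of digits so the output still looks like a license number. The function name, the six-digit format, and the demo key are assumptions for illustration. I use a keyed hash rather than a plain one because a six-digit number has so few possible values that an unkeyed hash could be reversed simply by hashing every candidate.

```python
import hashlib
import hmac

def pseudonymize_number(value, secret_key, digits=6):
    """Map a numeric string to a stable, same-length pseudonym using a keyed hash."""
    digest = hmac.new(secret_key, value.encode("utf-8"), hashlib.sha256).digest()
    # Fold the first 8 bytes of the digest into the desired number of digits.
    number = int.from_bytes(digest[:8], "big") % (10 ** digits)
    return str(number).zfill(digits)

# Demo key only; a real key would be kept secret and out of the test environment.
key = b"demo-key-not-for-real-use"
tom = pseudonymize_number("321000", key)
mary = pseudonymize_number("321000", key)
print(tom == mary)  # True: equal inputs map to equal outputs
```

Folding the digest down to six digits means two different original numbers can occasionally map to the same pseudonym, which would introduce a coincidence that is not in the original data. If that matters, a longer output or a collision check over the dataset would be needed.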
As you can see, this is not trivial work. Fixing up both structured and unstructured data columns so that corner cases remain testable while reducing the risk of a data breach is not an easy algorithm. Different columns of data need different mechanisms, and the level of exposure risk has to be carefully considered with respect to the efficacy of the test. In addition, the act of de-identifying the data must not introduce data anomalies that do not exist in the original data.
This is a lot of work. There are only a handful of tools available and some are not very good. Others are fantastic but very narrowly focused around healthcare. A good example of the latter is a natural language processing application called the NLM-Scrubber that scrubs free-text medical reports. I have not investigated sufficiently but the NLM-Scrubber application appears to be very specific to a research data environment. In addition, the healthcare field has a wide array of commercial “data masking” tools that will do some of this, but they are, once again, typically designed for healthcare applications and less robust than de-identification tools.
De-identification of production data for the sake of writing a single application is hard. Harder than it needs to be. As a result, most developers don’t bother. It is easier to simply take the risk of exposure than it is to de-identify production data for testing. In my opinion, this is wrong. But it is common. Without freely available tools, developers often have no choice.
Solving the problem
I suggest that we develop an open source tool that can be embedded into a DevOps environment for deidentification of production data into a test environment.
I imagine the following design: The system will have two parts: a deidentification tool and a configurator. The configurator will allow a developer to define the rules for deidentifying data for a specific test dataset (e.g. which fields will be ignored, which fields will be masked, which will be rotated, which will be hashed, etc.). The deidentification tool will read the source data, deidentify the configured fields, and write the result back out for further processing. This could use flat files or connect to a database table. (On a Unix system, this data can be piped from stdin to stdout.) In a DevOps environment, scripts can handle the tasks of querying a database to create the source data stream and pushing the de-identified data to a test server.
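To give a feel for the shape of such a tool, here is a minimal sketch of the deidentification half, assuming CSV input with a header row and well-formed rows. The RULES dictionary stands in for whatever the configurator would produce; the rule names ("mask", "hash"), the column names, and the function names are all hypothetical. Rotation is left out because it needs the whole dataset rather than one row at a time.

```python
import csv
import hashlib
import hmac
import io

# Hypothetical rule set; columns not listed here are passed through unchanged.
RULES = {"name": "mask", "license": "hash"}

def apply_rule(rule, value, key):
    """Apply one deidentification rule to a single field value."""
    if rule == "mask":
        return "X" * len(value)
    if rule == "hash":
        # Keyed one-way hash, truncated to 12 hex characters for readability.
        return hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()[:12]
    return value

def deidentify(source, sink, rules, key):
    """Stream CSV rows from source to sink, rewriting the configured columns."""
    reader = csv.DictReader(source)
    if reader.fieldnames is None:
        return  # empty input: nothing to write
    writer = csv.DictWriter(sink, fieldnames=reader.fieldnames, lineterminator="\n")
    writer.writeheader()
    for row in reader:
        writer.writerow(
            {col: apply_rule(rules.get(col, "keep"), val, key) for col, val in row.items()}
        )

# In-memory demonstration with made-up data and a demo key.
sample = io.StringIO("name,license,state\nTom Smith,321000,OH\n")
out = io.StringIO()
deidentify(sample, out, RULES, key=b"demo-key-not-for-real-use")
print(out.getvalue())
```

Because deidentify only needs file-like objects, the same function could be handed sys.stdin and sys.stdout to act as a filter in a Unix pipeline, with the surrounding DevOps scripts doing the database export and the load into the test server.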
I’m really tired of taking this risk for granted.