This article originally appeared in the Fall 2018 edition of the U.S. Cybersecurity Magazine
When obdurate isolationist Woodrow Wilson won re-election in 1916 under the slogan, “He kept us out of war!”, he hadn’t anticipated a simple act of data sharing. On January 17, 1917, Room 40 (British Naval Intelligence) finally decrypted the infamous German “Zimmermann telegram” that read, in part, “We intend to begin unrestricted submarine warfare…We shall endeavor in spite of this to keep the United States neutral. In the event of not succeeding, we make Mexico a proposal…make war together…”. Room 40 delayed sharing that telegram with Wilson until February 26 to protect British sources and methods – an example of sharing data while preserving privacy. Nevertheless, the shared data was compelling: on April 2, Wilson asked Congress to declare war.
Fast forward to 1942. British intelligence had gradually succeeded in breaking the 3-rotor Enigma. However, on February 1 of that year, German U-boats started to use the 4-rotor version. From then until August 1942, Germany traded just 22 U-boats for some 600 Allied merchant ships. By April 1943, data sharing with the US had cracked the 4-rotor Enigma as well. In May of that year, 18 of 49 U-boats patrolling the North Atlantic were sunk, while only 2 merchant ships were lost.
A common thread in these disparate vignettes is the sharing of data that its owning agencies consider sensitive, to achieve valuable public outcomes. That thread continues today in venues outside common defense. In the US, for example, juvenile justice is a popular area for data sharing. 35 states have laws reaching back to the 1990s that permit data sharing in search of improved juvenile justice outcomes. 27 of those states share data across their child welfare and juvenile justice systems for the same purpose. Washington State integrates criminal justice, social service, health, and workforce data to understand relationships between policy and outcomes. Allegheny County, PA is another beacon of inter-agency data sharing in the service of outcome-driven policy.
Far more is possible, but privacy laws restrict both inter-agency sharing and agency-outsider sharing (for example with researchers who often provide statistical expertise to make sense of data). In general, personally identifiable information (PII) is constrained by these laws to stay inside the agency that collected it.
A recent poll of government agencies working on such inter-agency sharing found that not having the right data to share ranked far below other barriers. The leading barriers included the cost of assuring legal compliance when sharing; stakeholder concerns about whether privacy is, and is seen to be, preserved when sharing; and the difficulty of making data interoperable for sharing when de-identification works directly against such interoperability. The result? Inter-agency sharing of sensitive data often remains impossible, or impractically expensive and difficult.
How might we move both law and technology toward practical, cost-effective inter-agency sharing of sensitive data? 1917 and 1943 are a start, but we need to show it can be done in the context of modern public policy. One timely example is the bipartisan Right To Know Before You Go Act of 2017. Rising tuition and job uncertainty are at the forefront of student and parent concerns, along with the recognition that a college education is the second biggest investment that many college-bound citizens will ever make. Students and parents have a right to know all they can about how to make that investment wisely. What do they want to know? An amalgam of recent discussions sums up this way: “When other students like me choose this college, and that program of study, how much did it cost? What’s the chance that a student like me will graduate? How long will that take? Will I be able to get a job afterward, and if so, how much can I expect to earn? What’s the chance that I’ll be able to repay my student loans?” Today, most of that information comes to students from the least reliable sources: college brochures, family member input, and the Internet. The current, ridiculously bad outcome? Fewer than 60% of students graduate from the institutions where they started, and as many as 44% of students at for-profit 4-year institutions don’t graduate, according to Pew Research. Not…good…enough.
Who holds the real data that can answer these questions accurately? Several agencies that are unable or unwilling to share that data. The US Census Bureau holds residence, family size, employment, and disability data. The Internal Revenue Service holds a wealth (no pun intended) of data about income. The Department of Education’s Federal Student Aid agency holds student loan, grant, and loan repayment data. The National Student Clearinghouse (not part of .gov) holds program of study, degree, scholarship, and other details of college careers. The Department of Veterans Affairs holds service record and GI Bill data. By sharing and linking that data, we can fulfill the student Right to Know.
All of that raw data is personally identifiable, and much is quite sensitive. Yet the answers we owe to students don’t require revealing any sensitive personal information in public. The answers students need are statistical summaries that either can’t be re-identified, or that we know how to protect from re-identification — for example using techniques such as epsilon-differential privacy. So, providing those statistical outputs to students puts no PII at risk. Instead, the challenge here is to protect the privacy of subjects of that detailed, sensitive input data while it is being used to compute those results.
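To make the differential privacy idea concrete, here is a minimal sketch of the Laplace mechanism applied to a counting query. The student records, the query, and the epsilon value below are all invented for illustration; they are not drawn from any agency's data or any particular product.

```python
# Minimal sketch of the Laplace mechanism for epsilon-differential privacy.
# The records, the query, and epsilon are hypothetical, for illustration only.
import math
import random

def laplace_noise(scale):
    """Draw one sample from a Laplace(0, scale) distribution."""
    u = random.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))

def dp_count(records, predicate, epsilon):
    """Return a noisy count of records matching `predicate`.

    A counting query has sensitivity 1 (adding or removing one person changes
    the true count by at most 1), so Laplace noise with scale 1/epsilon
    satisfies epsilon-differential privacy for this single query.
    """
    true_count = sum(1 for r in records if predicate(r))
    return true_count + laplace_noise(1.0 / epsilon)

# Hypothetical student records: (program of study, graduated within 6 years)
students = [("Nursing", True), ("Nursing", False), ("History", True)] * 100

noisy = dp_count(students, lambda r: r[0] == "Nursing" and r[1], epsilon=0.5)
print(f"Noisy count of Nursing graduates: {noisy:.1f}")
```

The only tuning knob here is epsilon: smaller values add more noise and give stronger privacy at the cost of less precise statistics, which is exactly the trade-off an agency would negotiate before publishing summaries.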
Many solutions today de-identify data before sharing it. However, study after study shows that de-identification simply doesn’t prevent re-identification. In addition, such de-identification must be done anew each time the data is used for different purposes. De-identification also prevents exactly the cross-dataset linking needed to generate accurate answers, and prevents accurate data cleaning during the analysis process. In short, de-identification doesn’t work, is expensive, and destroys data utility.
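Here is a toy sketch of why de-identification fails: linking a “de-identified” release to a public roster on nothing more than quasi-identifiers (ZIP code, birth date, sex) is often enough to restore names. Every record and name below is invented purely for illustration.

```python
# Toy illustration of a linkage attack on "de-identified" data.
# Both datasets and all values are invented for illustration only.

# A released dataset with names removed but quasi-identifiers intact.
deidentified_health = [
    {"zip": "98101", "birth_date": "1999-04-12", "sex": "F", "diagnosis": "asthma"},
    {"zip": "98102", "birth_date": "2000-11-03", "sex": "M", "diagnosis": "diabetes"},
]

# A public roster (for example, a voter file) that still carries names.
public_roster = [
    {"name": "Alice Example", "zip": "98101", "birth_date": "1999-04-12", "sex": "F"},
    {"name": "Bob Example",   "zip": "98102", "birth_date": "2000-11-03", "sex": "M"},
]

QUASI_IDENTIFIERS = ("zip", "birth_date", "sex")

def reidentify(released, roster):
    """Join the two datasets on quasi-identifiers alone."""
    index = {tuple(p[q] for q in QUASI_IDENTIFIERS): p["name"] for p in roster}
    for record in released:
        key = tuple(record[q] for q in QUASI_IDENTIFIERS)
        if key in index:
            yield index[key], record["diagnosis"]

for name, diagnosis in reidentify(deidentified_health, public_roster):
    print(f"{name} re-identified with diagnosis: {diagnosis}")
```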
Some solutions take original data and create synthetic models of it – matching statistical distributions of various data attributes – before sharing the synthesized substitute. The problem here is that the synthesis process can only model distributions that are explicitly chosen and known in advance. Meaningful correlations can be lost, hiding exactly the relationships that students and parents want to discover.
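The sketch below illustrates that failure mode with invented numbers: sampling each attribute independently from its own marginal distribution keeps the per-attribute statistics but wipes out the (hypothetical) relationship between student debt and later earnings.

```python
# Sketch: marginal-only synthesis preserves per-attribute distributions
# but loses the correlation between attributes. All data is invented.
import random

random.seed(7)

# Original (hypothetical) records: in this toy population, higher debt tends
# to accompany lower earnings, a correlation worth preserving.
original = [(debt, 60000 - 0.8 * debt + random.gauss(0, 2000))
            for debt in (random.uniform(0, 40000) for _ in range(1000))]

# Naive synthesis: sample each attribute independently from its marginal.
debts = [d for d, _ in original]
earnings = [e for _, e in original]
synthetic = [(random.choice(debts), random.choice(earnings)) for _ in range(1000)]

def correlation(pairs):
    """Pearson correlation between the two attributes in `pairs`."""
    xs, ys = zip(*pairs)
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in pairs)
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx ** 0.5 * vy ** 0.5)

print(f"original correlation:  {correlation(original):+.2f}")   # strongly negative
print(f"synthetic correlation: {correlation(synthetic):+.2f}")  # near zero
```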
Secure computation is a promising alternative, though performance and usability are still being improved. Here, data is shared in full, so no utility is lost. However, that data remains encrypted at all times – even during computation, and even while results are filtered and protected with differential privacy – so that its privacy remains fully protected. With secure computation, the data and any potentially re-identifiable results are never “in the clear”, even if analysis platforms are hacked.
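Secure computation is a family of techniques (secure multi-party computation, homomorphic encryption, and others). As one minimal building block, the sketch below uses additive secret sharing so that hypothetical agencies can jointly compute a total without any participant seeing another's raw input. It is an illustration of the principle under invented inputs, not a production protocol.

```python
# Minimal sketch of additive secret sharing, one building block of secure
# multi-party computation. The inputs and party names are hypothetical.
import secrets

PRIME = 2**61 - 1  # all arithmetic is done modulo a large prime

def share(value, num_parties):
    """Split `value` into random shares that sum to it modulo PRIME."""
    shares = [secrets.randbelow(PRIME) for _ in range(num_parties - 1)]
    shares.append((value - sum(shares)) % PRIME)
    return shares

def reconstruct(shares):
    return sum(shares) % PRIME

# Each agency holds one sensitive total it cannot reveal in the clear.
agency_totals = {"AgencyA": 1200, "AgencyB": 3400, "AgencyC": 560}

# Every agency splits its value into one share per participant; each share
# on its own is uniformly random and reveals nothing about the input.
all_shares = {name: share(v, len(agency_totals)) for name, v in agency_totals.items()}

# Each participant locally adds the shares it received...
partial_sums = [sum(shares[i] for shares in all_shares.values()) % PRIME
                for i in range(len(agency_totals))]

# ...and only the combined result, never any individual input, is revealed.
print("joint total:", reconstruct(partial_sums))  # 5160
```

Real deployments layer much more on top of this (authenticated parties, malicious-security checks, and differentially private release of the final statistics), but the core idea is the same: computation proceeds on data no single party can read.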
Inter-agency sharing of sensitive data needs to move from the realm of “it would be nice, but I’m sorry…we just can’t” to “we have a wealth of administrative data among us…what public good can we do?” Technologies – some limited, others very promising – exist to get the job done. Agencies such as DARPA, IARPA, DHS S&T, Census, and NIST are improving these technologies. Some key members of Congress understand and are advocates for the possibilities. As cybersecurity professionals, we can put more examples in play to help these and other advocates push policies and statutes to keep up with technology.