shirky.com Clay Shirky's Writings About the Internet
Economics and Culture, Media and Community, Open Source

DNA, P2P, and Privacy

First published on the 'Networks, Economics, and Culture' mailing list

For decades, the privacy debate has centered on questions about databases and database interoperability: How much information about you exists in the world's databases? How easily is it retrieved? How easily is it compared or combined with other information?

Databases have two key weaknesses that affect this debate. The first is that they deal badly with ambiguity, and generally have to issue a unique number, sometimes called a primary key, to every entity they store information on. The US Social Security number is a primary key that points to you, the 6-letter Passenger Name Record is a primary key that points to a particular airline booking, and so on. This leads to the second weakness: since each database maintains its own set of primary keys, creating interoperability between different databases is difficult and expensive, and generally requires significant advance coordination.
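The two weaknesses can be seen in a small sketch. The Python below is illustrative only: the record layouts, the key formats, and the `crosswalk` table are hypothetical, invented to show why two databases with independently issued keys cannot be joined without a mapping agreed on in advance.

```python
# Hypothetical records; each database issues its own primary keys.
airline_bookings = {
    "QXRTMB": {"passenger": "J. Lee", "seat": "14C"},
    "BKPZWD": {"passenger": "John Smith", "seat": "2A"},
}
payroll_records = {
    1001: {"employee": "J. Lee", "salary_band": "C"},
    1002: {"employee": "John Smith", "salary_band": "A"},
}

# The key spaces never overlap, so there is nothing to join on directly.
shared_keys = set(airline_bookings) & set(payroll_records)

# Linking requires a crosswalk that someone has to build and maintain by hand.
crosswalk = {"QXRTMB": 1001}


def link(left, right, mapping):
    """Pair up rows from two databases using an explicit key-to-key mapping."""
    return [
        (left[left_key], right[right_key])
        for left_key, right_key in mapping.items()
        if left_key in left and right_key in right
    ]


linked = link(airline_bookings, payroll_records, crosswalk)
print(len(shared_keys), len(linked))  # 0 shared keys, 1 linked pair
```

Every additional database needs its own crosswalk to every other one, which is the advance coordination that makes interoperability expensive.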

Privacy advocates have relied on these weaknesses in creating legal encumbrances to issuing and sharing primary keys. They believe, rightly, that widely shared primary keys pose a danger to privacy. (The recent case of Princeton using its high school applicants' Social Security numbers to log in to the Yale admissions database highlights these dangers.) The current worst-case scenario is a single universal database in which all records -- federal, state, and local, public and private -- would be unified with a single set of primary keys.

New technology brings new challenges, however, and in the database world the new challenge is not a single unified database but decentralized interoperability: interoperability brought about by a single universally used ID. The ID is DNA. The interoperability comes from the curious and unique advantages DNA has as a primary key. And the effect will put privacy advocates in a position analogous to that of the RIAA, forcing them to switch from fighting the creation of a single central database to fighting a decentralized and interoperable system of peer-to-peer information storage.

DNA Markers

While much of the privacy debate around DNA focuses on the ethics of predicting mental and physical fitness for job categories and insurance premiums, this is too narrow and too long-range a view. We don't even know yet how many genes there are in the human genome, so our ability to make really sophisticated medical predictions based on a person's genome is still some way off. However, long before that day arrives, DNA will provide a cheap way to link a database record with a particular person, in a way that is much harder to change or forge than anything we've ever seen.

Everyone has a biological primary key embedded in every cell of their body in the form of DNA, and everyone has characteristic zones of DNA that can be easily read and compared. These zones serve as markers, and they differ enough from individual to individual that with fewer than a dozen of them, a person can be positively identified out of the entire world's population.
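A rough calculation suggests why so few markers could be enough. The sketch below is a simplification, not a model of real genetics: it assumes every marker is independent and that its variants are equally common, and both the figure of 10 variants per marker and the population of 8 billion are placeholder values chosen for illustration. The function name `min_markers` is hypothetical.

```python
def min_markers(variants_per_marker, population):
    """Smallest number of markers whose combinations cover the population.

    Assumes independent markers with equally common variants, which real
    DNA markers do not satisfy; treat the result as a rough lower bound.
    """
    markers, combinations = 0, 1
    while combinations < population:
        combinations *= variants_per_marker
        markers += 1
    return markers


# Placeholder inputs: 10 variants per marker, 8 billion people.
print(min_markers(10, 8_000_000_000))  # 10, since 10**10 >= 8 billion
```

Because the number of possible profiles grows exponentially with the number of markers, adding one or two more markers buys a large safety margin against accidental matches.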

DNA-as-marker, in other words, is a nearly perfect primary key, as close as we can get to being unambiguous and unforgeable. If every person has a primary key that points to their physical being, then the debate about who gets to issue such a key is over, because the keys are issued every time someone is born, and re-issued every time a new cell is created. And if the keys already exist, then the technological argument is not about creating new keys, but about reading existing ones.

The race is on among several biotech firms to be able to sequence a person's entire genome for $1000. The $1 DNA ID will be a side effect of this price drop, and it's coming soon. When the price of reading DNA markers drops below a dollar, it will be almost impossible to control who has access to reading a person's DNA.

There are few if any legal precedents that would prevent collection of this data, at least in the US. There are several large populations that do not enjoy constitutional protections of privacy, such as the armed services, prisoners, and children. Furthermore, most of the controls on private databases rely on the silo approach, where an organization can collect an almost unlimited amount of information about you, provided they abide by the relatively lax rules that govern sharing that information.

Even these weak protections have been enough, however, to prevent the creation of a unified database, because the contents of two databases cannot be easily merged without some shared primary key, and shared primary keys require advance coordination. And it is here, in the area of interoperability, that DNA markers will have the greatest effect on privacy.

You're the Same You Everywhere

Right now, things like alternate name spellings or alternate addresses make positive matching difficult across databases. It's hard to tell if Eric with the Wyoming driver's license and Shawn with the Florida arrest record are the same person, unless there is other information to tie them together. If two rows of two different databases are tied to the same DNA ID, however, they point to the same person, no matter what other material is contained in the databases, and no matter how it is organized or labeled.

No more trying to figure out if Mr. Shuler and Mr. Schuller are the same person, no more wondering if two John Smiths are different people, no more trying to guess the gender of J. Lee. Identity collapses to the body, in a way that is far more effective than fingerprints, and far more easily compared across multiple databases than more heuristic measures like retinal scans.
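The contrast between heuristic matching and key matching can be sketched in a few lines of Python. Everything here is hypothetical: the rows, the `marker_id` field, the ID strings, and the 0.85 similarity threshold are invented for illustration, and the standard library's `difflib.SequenceMatcher` stands in for whatever name-matching heuristic a real system might use.

```python
from difflib import SequenceMatcher

# Hypothetical rows from three unrelated databases.
license_row = {"name": "Eric Shuler", "marker_id": "M-48C1"}
arrest_row = {"name": "Shawn Schuller", "marker_id": "M-48C1"}
other_row = {"name": "Eric Shuler", "marker_id": "M-9F20"}


def same_person_by_name(a, b, threshold=0.85):
    """Heuristic: guess identity from how similar two names look."""
    ratio = SequenceMatcher(None, a["name"].lower(), b["name"].lower()).ratio()
    return ratio >= threshold


def same_person_by_marker(a, b):
    """Key match: two rows refer to one person if their marker IDs agree."""
    return a["marker_id"] == b["marker_id"]


# Name matching misses a real match and accepts a false one.
print(same_person_by_name(license_row, arrest_row))    # False
print(same_person_by_name(license_row, other_row))     # True

# Marker matching gets both cases right without consulting the names.
print(same_person_by_marker(license_row, arrest_row))  # True
print(same_person_by_marker(license_row, other_row))   # False
```

Tuning the threshold only trades one kind of error for the other; the key comparison has no threshold to tune.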

In this model, the single universal database never gets created, not because privacy advocates prevent it, but because it is no longer needed. If primary keys are issued by nature, rather than by each database acting alone, then there is no more need for central databases or advance coordination, because the contents of any two DNA-holding databases can be merged on demand in something close to real time.
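A minimal sketch of such an on-demand merge, assuming each database stores a shared `marker_id` field alongside its own columns. The field name, the sample rows, and the function `merge_on_marker` are all hypothetical; the point is only that no crosswalk table or prior agreement on schema is needed.

```python
def merge_on_marker(*databases):
    """Combine rows from any number of databases that share a marker ID."""
    merged = {}
    for database in databases:
        for row in database:
            profile = merged.setdefault(row["marker_id"], {})
            for field, value in row.items():
                if field != "marker_id":
                    profile[field] = value
    return merged


# Two hypothetical databases with different schemas and no coordination.
dmv = [{"marker_id": "M-48C1", "license_state": "WY"}]
court = [
    {"marker_id": "M-48C1", "arrest_state": "FL"},
    {"marker_id": "M-9F20", "arrest_state": "TX"},
]

profiles = merge_on_marker(dmv, court)
print(profiles["M-48C1"])  # {'license_state': 'WY', 'arrest_state': 'FL'}
```

The function accepts any number of databases, so a third or fourth source can be folded in later without touching the earlier ones.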

Unlike the creation of a vast central database, even a virtual one, the change here can come about piecemeal, with only a few DNA-holding databases. A car dealer, say, could simply submit a DNA marker to a person's bank asking for a simple yes-or-no match before issuing a title. In the same way the mid-90s ID requirements for US domestic travel benefited the airlines because they kept people from transferring unused tickets to friends or family, we can expect businesses to like the way DNA ties transactions to a single customer identity.
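The car-dealer exchange might look something like the following sketch. The `Bank` class, its `matches` method, the account numbers, and the marker strings are hypothetical, made up to show the shape of a yes-or-no query in which neither party hands over its records.

```python
class Bank:
    """Hypothetical institution that holds one marker ID per account."""

    def __init__(self, accounts):
        self._accounts = accounts  # account number -> marker ID

    def matches(self, account_number, marker_id):
        """Answer only yes or no; no other account data leaves the bank."""
        stored = self._accounts.get(account_number)
        return stored is not None and stored == marker_id


def dealer_can_issue_title(bank, account_number, buyer_marker_id):
    """The dealer proceeds only if the bank confirms the buyer's marker."""
    return bank.matches(account_number, buyer_marker_id)


bank = Bank({"ACCT-17": "M-48C1"})
print(dealer_can_issue_title(bank, "ACCT-17", "M-48C1"))  # True
print(dealer_can_issue_title(bank, "ACCT-17", "M-9F20"))  # False
```

Only two parties need to hold markers for this to work, which is why adoption can proceed one pair of organizations at a time.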

The privacy debate tends to be conducted as a religious one, with the absolutists making the most noise. However, for a large number of people, privacy is a relative rather than an absolute good. The use of DNA as an ID will spread in part because people want it to, in the form of credit cards that cannot be used in other hands or cars that cannot be driven by other drivers. Likewise, demands that DNA IDs be derived from populations who do not enjoy constitutional protections, whether felons or children, will be hard to deflect as the cost of reading an individual's DNA falls dramatically, and as the public sees the effective use of DNA in things like rape and paternity cases.

Peer-to-Peer Collation of Data

In the same way Kazaa has obviated the need for central storage or coordination for the world's music, the use of DNA as an ID technology makes radically decentralized data integration possible. With the primary key problem solved, interoperability will arise as a side effect, neither mandated nor coordinated centrally. Fighting this will require different tactics, not least because it is a rear-guard action. The keys and the readers both exist, and the price and general availability of the technology all point to ubiquity and vanishingly low cost within a decade.

This is a different kind of fight over privacy. As the RIAA has discovered, fighting the growth of a decentralized and latent capability is much harder than fighting organizations that rely on central planning and significant resources, because there is no longer any one place to focus the efforts, and no longer any small list of organizations that can be targeted for preventive action. In a world where database interoperability moves from a difficult and costly goal to one that arises as a byproduct of the system, the important question for privacy advocates is how they will handle the change.
