How should lawyers deal with the ever-growing volume of digital data in e-discovery? The answer likely involves far more extensive use of statistics.

“Do Your Searches Pass Judicial Scrutiny?” by Wayne C. Matus and John E. Davis of Pillsbury Winthrop Shaw Pittman (New York Law Journal, 31 Oct 2008) offers an excellent paradigm for how to use concept search in EDD. Implicit and explicit in their suggested approach is a heavy reliance on statistical analysis.

The authors suggest sampling data: “Counsel should isolate a random and statistically significant sample of the relevant datasets…” This is not as easy as it sounds. Philosophically and mathematically, “random” is a difficult concept, and drawing a truly random sample turns out to be harder than you might expect.
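Here is a minimal sketch, in Python, of one defensible way to draw such a sample. The document IDs, the sample size, and the fixed seed are illustrative assumptions of mine, not anything the article prescribes; seeding the generator means the exact draw can be re-run and verified later.

```python
# A minimal sketch of drawing a reproducible random sample from a document
# collection. The collection, sample size, and seed are all illustrative.
import random

doc_ids = [f"DOC{i:06d}" for i in range(250_000)]  # hypothetical collection

rng = random.Random(20081031)        # fixed seed makes the draw reproducible
sample = rng.sample(doc_ids, 1_500)  # simple random sample without replacement

print(sample[:5])
```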

Practically, a “statistically significant sample” requires choosing an appropriate sampling technique and defining the sample size. The latter is easier than the former: at least for well-distributed populations, the statistics behind the “central limit theorem” mean that required sample sizes do not scale linearly with population size; past a point, the precision of an estimate depends on the absolute size of the sample, hardly at all on the size of the population. In English: a valid sample is not a fixed percentage of the population. (Note that many polls of US citizens sample around 1,500 people.)
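To see why, here is a back-of-the-envelope sample-size calculation under the standard normal-approximation assumptions. The parameters (95% confidence, a 2.5% margin of error, a worst-case 50% proportion) are choices of mine for illustration; watch how little the answer moves as the population grows ten-thousand-fold.

```python
# Back-of-the-envelope sample size for estimating a proportion (say, the
# responsiveness rate), with a finite population correction. All parameter
# choices here are illustrative assumptions.
import math

def sample_size(population, margin=0.025, z=1.96, p=0.5):
    n0 = (z ** 2) * p * (1 - p) / margin ** 2           # infinite-population size
    return math.ceil(n0 / (1 + (n0 - 1) / population))  # finite correction

for pop in (10_000, 1_000_000, 100_000_000):
    print(f"{pop:>11,} documents -> sample of {sample_size(pop):,}")
# 10,000 -> 1,333; 1,000,000 -> 1,535; 100,000,000 -> 1,537
```

Which is, not coincidentally, about the size of those national polls.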

The problem is that most populations are not well-distributed; instead, they are lumpy. Huh? Think of these questions: how do you know whether to sample uniformly across a data set or more heavily in key custodian files? What date range should the sample cover, and might it be appropriate to over-weight certain ranges? Should you sample more from e-mail than from Word files? How do you sample Excel files, which may consist mainly of numbers? [In this political season, consider just one of the problems of sampling voters: younger voters are more likely than older ones to have only mobile phones, and pollsters cannot readily sample mobile phones. What gotchas like this lurk in EDD?]
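When the population is lumpy, the textbook answer is stratified sampling: carve the collection into strata, say by custodian, and allocate the sample across them on purpose. A sketch follows; the custodian names, document counts, and weights are all invented, and the real point is that the allocation is a judgment call that should be made explicit and documented.

```python
# A sketch of stratified sampling that over-weights key custodians instead
# of sampling uniformly. Custodians, counts, and weights are invented.
import random

rng = random.Random(42)

strata = {  # custodian -> hypothetical document ids
    "cfo":           [f"cfo-{i}" for i in range(40_000)],
    "controller":    [f"ctl-{i}" for i in range(25_000)],
    "everyone_else": [f"oth-{i}" for i in range(500_000)],
}
weights = {"cfo": 0.4, "controller": 0.3, "everyone_else": 0.3}  # the judgment call

total_n = 1_500
sample = []
for custodian, docs in strata.items():
    sample.extend(rng.sample(docs, round(total_n * weights[custodian])))

print(len(sample))  # 1,500 documents, heavily tilted toward key custodians
```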

In all the recent cases and articles about EDD and search techniques, I believe that judges and lawyers have overlooked a critical question: how do we know whether a document is responsive or privileged? Medical trials compare a new therapy to a proven one, the so-called gold standard; reproducible tests such as blood chemistry or imaging establish those standards. How do lawyers know the right responsiveness or privilege designation for any given document?

I’ve not seen any data on how reproducible lawyer designations are. Do we rely on the judgment of a contract lawyer who may not know the case that well? Is the partner in charge the authority on document designations? How many lawyers have run the same documents through two or more independent reviews and compared the results? Until we can agree on a method to reliably reproduce document designations, it’s not obvious to me how we can compare search techniques. [This goes back to my prior blog posts pointing out that it’s a mistake to assume that human review gets the designations right.]
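If anyone wants to try, a standard tool for measuring agreement between two reviewers is Cohen’s kappa, which discounts the agreement you would expect from chance alone. A sketch, with invented reviewer calls (“R” responsive, “N” not):

```python
# Cohen's kappa: how often two reviewers agree, corrected for chance.
# The reviewer calls below are invented; kappa of 1 is perfect agreement,
# 0 is no better than chance.
from collections import Counter

def cohens_kappa(calls_a, calls_b):
    n = len(calls_a)
    observed = sum(a == b for a, b in zip(calls_a, calls_b)) / n
    counts_a, counts_b = Counter(calls_a), Counter(calls_b)
    labels = set(calls_a) | set(calls_b)
    expected = sum(counts_a[l] * counts_b[l] for l in labels) / n ** 2
    return (observed - expected) / (1 - expected)

reviewer_1 = ["R", "R", "N", "R", "N", "N", "R", "N", "N", "N"]
reviewer_2 = ["R", "N", "N", "R", "N", "R", "R", "N", "N", "N"]
print(round(cohens_kappa(reviewer_1, reviewer_2), 2))  # 0.58: only moderate
```

On these made-up calls the two reviewers agree 80% of the time, yet kappa is only 0.58, because most documents are non-responsive and much of that raw agreement comes free from chance.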

To untangle this mess, lawyers should call in the statisticians.