Vol. 56 No. 5

Trial Magazine

Tech Bytes


AI in the E-Discovery Toolkit

Chad S. Roberts | May 2020

In the life cycle of a new technology, there is typically a period when the commercialized version of a product is overhyped, after which inflated expectations give way to market disillusionment. Nonetheless, innovators and early adopters persist, the technology evolves, and prices decline. Eventually, the innovative new technology matures into mainstream use.

The use of artificial intelligence (AI) software to manage and optimize documentary evidence is becoming a mainstream technology for plaintiff attorneys. In the litigation support context, most AI applications are currently limited to evidence in the form of text. In other industry sectors, applications that analyze images (including facial recognition and diagnostic X-rays) or sounds (including conversations) are also emerging. Knowing when AI can make electronic discovery more efficient requires an understanding of its basic principles, strengths, and limitations.

AI software allows a trial lawyer’s subjective judgments about existing evidence to be predictably replicated and applied to a much larger collection of new evidence. Initially, the lawyer makes subjective judgments about a small, manageable collection of documents. AI software then replicates these judgments and applies them to additional evidence, over and over, document by document, without the diminished quality and expense that comes from repetitive tasks or human exhaustion.

Indexing

A preliminary step—indexing—runs unseen in the background of the software. During indexing, the software “maps” the textual content of every document in the entire collection. It does this by performing a sophisticated numerical analysis of the words used in each document—their type, frequency, and position, as well as their relative proximity to other words. The result is a highly complex mathematical profile of each document’s content. When two documents have profile attributes that are similar to one another, the documents themselves are deemed to be “conceptually similar.”

Because the conceptual similarity of any two documents is a matter of degree, each document can be numerically ranked by the degree to which it is conceptually similar to another. Later in the evidence management process, this ranking will be used to find and retrieve additional documents of interest.
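The profiling and ranking described above can be sketched in miniature. This example assumes a simple TF-IDF weighting and cosine similarity; commercial e-discovery tools use far richer features (word position, proximity, and phrases), but the principle—numerical profiles compared by degree of overlap—is the same.

```python
# A minimal sketch of document "profiling" and conceptual-similarity
# ranking, assuming TF-IDF weights and cosine similarity. Illustrative
# only -- not how any particular commercial platform works internally.
import math
from collections import Counter

def profile(doc: str, docs: list[str]) -> dict[str, float]:
    """Map a document to a numerical profile: a TF-IDF weight per word."""
    words = doc.lower().split()
    tf = Counter(words)
    n = len(docs)
    prof = {}
    for w, count in tf.items():
        df = sum(1 for d in docs if w in d.lower().split())
        prof[w] = (count / len(words)) * math.log((1 + n) / (1 + df))
    return prof

def similarity(p1: dict, p2: dict) -> float:
    """Cosine similarity between two profiles: degree of conceptual overlap."""
    dot = sum(p1[w] * p2.get(w, 0.0) for w in p1)
    norm = math.sqrt(sum(v * v for v in p1.values())) * \
           math.sqrt(sum(v * v for v in p2.values()))
    return dot / norm if norm else 0.0

docs = [
    "lease bid for the federal oil and gas tract",
    "oil and gas lease bids are due tomorrow",
    "fantasy football pool standings for the office",
]
profiles = [profile(d, docs) for d in docs]
# Rank the other documents by their similarity to the first one.
ranked = sorted(range(1, len(docs)),
                key=lambda i: similarity(profiles[0], profiles[i]),
                reverse=True)
print(ranked)  # the lease email outranks the football email: [1, 2]
```

Because similarity is a continuous score rather than a yes/no match, every document in the collection can be placed somewhere on this ranked list.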

Machine Learning by ‘Ayes and Nays’

Machine learning is at the heart of AI and is the process by which the software refines the accuracy of its predictions about a document’s relevance. The process begins with an assumption that documents with conceptual similarity are likely to have similar degrees of relevance. Then machine learning is used to fine-tune the association between similar content and similar relevance.

To do this, the AI software first arranges all of the documents into a list ranked by their calculated similarity to a small set of documents already known to be relevant. Next, the software sets up a workflow in which the lawyer reviews the documents on the list and for each document is asked: “Is this what you’re looking for? Yes or No.”

In the course of this review, the smaller subset of documents that the lawyer judged to be either relevant or irrelevant begins to grow. In a continuous cycle, the software identifies the most common attributes of the relevant documents’ profiles and—just as important—the most common attributes of the irrelevant documents.

Next, the software makes small adjustments to the complex formula it uses to calculate overall similarity. The formula is continuously adjusted so that the computed similarity rankings result in a “best fit” match for the actual assessments made by the lawyer. By doing this repeatedly, the machine learns to accurately imitate (and later predict) the subjective judgments made by the reviewing lawyer.
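The review-and-adjust cycle can be sketched as a simple relevance-feedback loop. This example assumes a Rocchio-style scorer over bag-of-words vectors and a simulated “lawyer” who looks for the word “lease”; real review platforms use more sophisticated classifiers, but the loop—present the top-ranked document, record the aye or nay, fold the judgment back into the formula—is the same.

```python
# A minimal sketch of machine learning by "ayes and nays", assuming a
# Rocchio-style relevance-feedback scorer. All names are illustrative,
# not a real product's API.
from collections import Counter

def vectorize(doc: str) -> Counter:
    """Bag-of-words stand-in for the document's content profile."""
    return Counter(doc.lower().split())

def score(vec: Counter, rel: Counter, irr: Counter) -> float:
    """Reward overlap with relevant examples, penalize overlap with
    irrelevant ones -- the 'formula' the software keeps adjusting."""
    return sum(c * (rel[w] - irr[w]) for w, c in vec.items())

collection = [
    "oil and gas lease bid strategy",
    "fantasy football pool standings",
    "draft lease terms for the federal tract",
    "sports bar happy hour schedule",
]
vectors = [vectorize(d) for d in collection]
lawyer_says_relevant = lambda d: "lease" in d  # simulated reviewer

rel, irr = Counter(), Counter()   # running profiles of ayes and nays
unreviewed = list(range(len(collection)))
reviewed = {}
while unreviewed:
    # Present the highest-scoring unreviewed document next.
    i = max(unreviewed, key=lambda j: score(vectors[j], rel, irr))
    unreviewed.remove(i)
    verdict = lawyer_says_relevant(collection[i])
    reviewed[i] = verdict
    # Fold the lawyer's judgment back into the formula.
    (rel if verdict else irr).update(vectors[i])

print(reviewed)  # {0: True, 2: True, 1: False, 3: False}
```

Note the review order: after the first “aye,” the other lease document jumps to the front of the queue, so the lawyer sees the likely relevant material first.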

When the software reaches the point at which it is consistently making the correct predictions about the relevance of a document, the process is said to have reached “stability.” At that point, the lawyer can reasonably assume that the software can predictably replicate judgments about all of the remaining documents in the evidence set, even the ones the lawyer has not yet seen or reviewed. Only documents predicted to contain the content of interest are reviewed; the remaining documents are ignored.
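One way to picture the stability determination: compare the software’s advance predictions with the lawyer’s actual calls over a recent window of reviewed documents, and declare stability when agreement stays above a chosen rate. This is a hedged sketch only; real platforms apply formal statistical stopping criteria and validation protocols.

```python
# Illustrative "stability" check: the window size and threshold are
# assumptions for this sketch, not industry-standard values.
def is_stable(predictions: list[bool], verdicts: list[bool],
              window: int = 50, threshold: float = 0.95) -> bool:
    """Compare the model's advance predictions with the lawyer's actual
    calls over the most recent `window` reviewed documents."""
    recent = list(zip(predictions, verdicts))[-window:]
    if len(recent) < window:
        return False  # not enough review history yet
    agreement = sum(p == v for p, v in recent) / len(recent)
    return agreement >= threshold

# e.g., 48 of the last 50 predictions matched the lawyer's judgment:
print(is_stable([True] * 50, [True] * 48 + [False] * 2))  # True (96% >= 95%)
```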

Optimizing Machine Learning

Because machine learning greatly amplifies the judgments of the reviewing lawyer, the quality of this review effort is paramount. In most cases, it should be conducted by first- or second-chair merits counsel.1 The best outcomes result when the reviewing lawyer understands both the facts of the case and the process by which the machine learns.

An additional challenge is that all documents produced in response to a discovery request will inherently have a high degree of conceptual similarity. Differences between the targeted content and the rest of the document set can be subtle and nuanced. When there are different categories of subject matter in the same large document collection, it is generally best to set up completely independent machine learning projects organized by subject matter. Combining multiple subjects as the targeted content—liability and causation, for example—confounds the software’s ability to discriminate between desired and non-desired content.

Relevance is not the same as conceptual similarity, and disciplined adherence to this distinction is critical to keeping error out of the machine learning process. Consider a case involving conspiratorial price fixing of oil and gas leases from the government. The day before industry bids are due to be submitted, the principals of two competitors have an email exchange:

Conspirator 1: “I like to stop at Joe’s Sports Bar on the way home from work. The Oilers have a big game coming up.”

Conspirator 2: “I know the place. It’s a make or break game for the Oilers, and they all need to use the same playbook. See you there.”

This exchange may be the most relevant, most important document in the case to prove that a conspiracy occurred. However, if this document is identified as having the conceptual content you want to find in the rest of the document collection, you likely will find yourself reading internal communications about the company’s fantasy football pool instead of its oil and gas leases. Counterintuitively, this document may need to be coded as not having the content that you want to find more of.2

Similarly, if only a small portion of a very large document contains the type of content you seek (for example, three paragraphs about the subject matter buried in a corporation’s annual 10-K submission), designating the entire document as being the type you seek will also confuse the AI algorithm. The best AI software tools permit the specific content of interest contained in a portion of a larger document to be identified for machine learning purposes.

To help the software start to learn what kind of content you seek, you can insert documents related to the plaintiff’s case into the larger document collection and identify them to the software as examples of targeted content. Expert witness reports, focused excerpts from deposition transcripts, and dispositive motions all make good examples of “training documents.” If necessary, exemplars of targeted content can be created and inserted into the indexed document collection for training purposes: Try to imagine what kind of documents you would like to find, write an example of one, and train the machine to find more just like it.

Every plaintiff lawyer pressed by the limitations of time and workload wishes for clones to help with evidence review. The skillful application of AI to manage large volumes of evidence comes close to that.


Chad S. Roberts is a principal and founder of eDiscovery CoCounsel, a Miami law firm providing e-discovery services for plaintiffs only. He can be reached at chad.roberts@edcclaw.com.


Notes

  1. While the use of AI software in evidence management can generate large reductions in overall costs, lead merits counsel should anticipate that it can require a higher proportion of their own involvement in the process. A qualitative benefit of this is a more thorough familiarity with the evidence.  
  2. If a highly unique “smoking gun” document like this is discovered in a data set, techniques other than