By Jeff Guttman
At times eDiscovery technology seems like a sequel to the 1981 prehistoric film Quest for Fire.
Much like our Neanderthal relatives, lawyers and others in the legal tech space have developed tools to create more efficient processes for organizing and analyzing data. Despite its humble beginnings, the use of analytics in eDiscovery is now commonplace, with most platforms offering some sort of analytics framework. But it wasn’t always that way.
The evolution of eDiscovery analytics began in the 1960s, as computers proliferated as decision-support systems and data analytics first entered the space. In the beginning, we were all just curious humans looking at keyword searches and metadata, hoping to spot patterns (if we were lucky enough to have searchable text and available metadata!). In 1970, the Federal Rules of Civil Procedure were amended to include discovery of such data for the first time.
Unfortunately, despite our great need for the information contained in the data, we lacked efficient methods to quickly review the information. Even with the creation of Microsoft Office products, database-management platforms, and cloud storage technology, manual review of documents and data was arduous and expensive. But hope was on the horizon.
Where We’ve Been
Over time, we realized we could develop tools that would prove useful in this space. And so began our quest for the best and most efficient way to analyze digital information. Though the evolution of eDiscovery analytics has hit a few bumps along the way, in the last ten years a few key technologies have been integral in pushing data analytics forward:
- Predictive Coding. This technology determines the relevancy of a large volume of documents based on a subset of those documents. At the start of the review, a human reviewer manually reviews a set of documents – the “seed set” – and ranks each document’s relevancy. The software then analyzes the rankings and “learns” from them, allowing it to predict the relevancy of the remaining documents. The human reviewer then conducts overturn rounds, reviewing the computer’s coding and “teaching” it whether its assessments were right or wrong. Once the algorithm is stable, the remaining documents are reviewed.
Obvious benefits of this technology include increased speed and efficiency, since the software, unlike its human counterparts, can run day and night without breaks. But predictive coding requires a subject matter expert to review thousands of seed documents at the beginning of the case – and that expert is often the person most knowledgeable about the subject matter, usually a firm partner, who may need to review thousands of documents to effectively teach the technology.
Though better than manual review, few partners have the time and few associates the expertise at the beginning of a matter to competently review the documents. And since the initial reviewer is “training” the AI, an ineffective reviewer can quickly derail the review. All in all, predictive coding is a great idea but impractical for most projects.
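The seed-then-predict workflow above can be illustrated with a toy relevance scorer. This is a sketch only – real platforms use far richer features and statistical models – and the `train`/`score` names and the sample seed set are my own illustration, not any vendor’s API.

```python
from collections import Counter
import math

def tokenize(text):
    # Crude tokenizer: lowercase, keep letters only.
    return "".join(c if c.isalpha() else " " for c in text.lower()).split()

def train(seed_set):
    """Learn per-word log-odds of relevance from a human-coded seed set.

    seed_set: list of (text, is_relevant) pairs coded by the reviewer.
    """
    rel, irr = Counter(), Counter()
    for text, is_relevant in seed_set:
        (rel if is_relevant else irr).update(tokenize(text))
    vocab = set(rel) | set(irr)
    n_rel, n_irr = sum(rel.values()), sum(irr.values())
    # Laplace-smoothed log-odds: a positive weight means "looks relevant".
    return {w: math.log(((rel[w] + 1) / (n_rel + len(vocab))) /
                        ((irr[w] + 1) / (n_irr + len(vocab))))
            for w in vocab}

def score(weights, text):
    # Higher score = more likely relevant; unseen words contribute nothing.
    return sum(weights.get(w, 0.0) for w in tokenize(text))

# Hypothetical seed set: the reviewer has coded four documents.
seed = [("breach of the widget supply contract", True),
        ("contract penalty for late widget delivery", True),
        ("lunch menu for friday", False),
        ("office party friday lunch", False)]
weights = train(seed)
```

Once overturn rounds have stabilized the model, the remaining population can be bulk-scored with `score` and triaged by threshold – which is why a weak initial reviewer can derail everything downstream.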
- Continuous Active Learning (CAL). CAL uses a smaller set of example documents to score and rank the rest of the population, prioritizing initial documents it determines are likely to be most relevant to the matter. As you move through the review, the ranking engine continually updates and reshuffles the prioritized documents, learning as it goes.
The added benefit of this type of technology is that it’s very good at determining whether a document is responsive. However, one issue some reviewers have with CAL is that, in large reviews, some documents may never reach an attorney at all because the system has deemed them irrelevant along the way.
Further, sometimes CAL may be prohibited by the ESI review protocol developed by attorneys or judges. For example, if an attorney has agreed to a keyword search using fifteen specific search terms, she has inadvertently agreed to produce any and all documents with those terms, whether they are actually relevant or not.
Lastly, some data types don’t play well with the technology. For example, Excel documents and JPEGs aren’t really compatible with CAL’s search methods. For those documents that don’t receive a ranking score from CAL, reviewers may have to create a secondary workflow to ensure all responsive documents are found.
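The rank-review-relearn loop described above can be sketched as follows. This is a simplified illustration, assuming a simulated reviewer callback and a toy log-odds ranker; the `cal_review` and `_train` names are mine, not a product API.

```python
from collections import Counter
import math

def _tokens(text):
    return text.lower().split()

def _train(coded):
    # Laplace-smoothed log-odds per token, learned from reviewer decisions so far.
    rel, irr = Counter(), Counter()
    for text, is_rel in coded:
        (rel if is_rel else irr).update(_tokens(text))
    vocab = set(rel) | set(irr)
    n_r, n_i = sum(rel.values()), sum(irr.values())
    return {w: math.log(((rel[w] + 1) / (n_r + len(vocab))) /
                        ((irr[w] + 1) / (n_i + len(vocab)))) for w in vocab}

def cal_review(documents, reviewer, batch_size=2, rounds=3):
    """Simplified CAL loop: rank, review the top batch, re-learn, repeat."""
    coded, remaining = [], list(documents)
    # Bootstrap on an arbitrary first batch so the ranker has something to learn from.
    batch, remaining = remaining[:batch_size], remaining[batch_size:]
    coded += [(t, reviewer(t)) for t in batch]
    for _ in range(rounds):
        if not remaining:
            break
        weights = _train(coded)
        # Continually re-rank: likely-relevant documents float to the top.
        remaining.sort(key=lambda t: sum(weights.get(w, 0.0) for w in _tokens(t)),
                       reverse=True)
        batch, remaining = remaining[:batch_size], remaining[batch_size:]
        coded += [(t, reviewer(t)) for t in batch]
    # Whatever is left is the low-ranked tail that no attorney ever sees.
    return coded, remaining
```

Note how the returned `remaining` list embodies the concern raised above: if review stops once the ranking stabilizes, the lowest-ranked documents never reach human eyes.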
- Categorization and conceptual analytics. While CAL is good at answering the “yes/no” question of responsiveness, it cannot speak to which documents relate to specific discovery issues. Enter: categorization and conceptual analytics.
Types of conceptual analytics include concept clustering, keyword expansion/context searching, and find similar. This type of analytics technology allows reviewers to create an issue-based approach (e.g., “Find all documents that relate to fraud”).
To teach the technology, reviewers must submit exemplar documents to show the machine what it’s looking for. Usually, reviewers need between 10 and 100 example paragraphs to teach the technology effectively. If no example document can be found in the client’s files, users can simply write a paragraph similar to the material they’re searching for. For example, a reviewer may submit the following: “I’m so glad we decided not to report that income on our taxes. Now we’re rich!” Using that material, the technology searches for similar text and ranks how closely each document relates to the submitted example.
The technology is useful for “talking around” the concepts at issue without having to stick to particular keywords. This type of analytics is particularly helpful in the investigations context where bad actors may be using code or slang (as you can imagine, fraudsters rarely send emails congratulating each other on their “fraud”).
Since categorization and conceptual analytics focus on the meaning of words reviewed, rather than the specific text, reviewers are more likely to find helpful information. Despite its obvious benefits, some factors can adversely impact the utility and usage of the tools. For example, for large data sets, the time needed to index and crawl through millions of documents to build out conceptual indexes can take days or weeks. (Reviewers may be forced to work on data for an extensive period of time before even getting it into the algorithm.)
Additionally, the data structure itself can be an issue. With the increase of collaborative solutions like Slack, Teams, etc., data is simply structured in ways that seem understandable to humans but very confusing to computers.
Lastly, issues that overlap or are highly nuanced can prove problematic, as well as when a topic’s relevancy depends on who is speaking or when they are speaking.
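The find-similar step can be sketched as ranking documents by cosine similarity to the reviewer’s exemplar. Real conceptual engines match on latent semantics rather than literal word overlap, so treat this stdlib bag-of-words version purely as an illustration of the ranking idea (the function names are mine).

```python
from collections import Counter
import math

def cosine(a, b):
    # Cosine similarity between two bag-of-words Counter vectors.
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = (math.sqrt(sum(v * v for v in a.values())) *
            math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def find_similar(exemplar, documents):
    """Rank documents by closeness to a reviewer-written exemplar paragraph."""
    target = Counter(exemplar.lower().split())
    return sorted(documents,
                  key=lambda d: cosine(target, Counter(d.lower().split())),
                  reverse=True)
```

A reviewer who submitted the tax-fraud exemplar from earlier would see documents discussing unreported income float to the top of the ranking, even though no one ever typed the word “fraud.”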
Where We’re Headed
Many believe the future of analytics in eDiscovery is bright. Experts predict that certain key eDiscovery features will continue to affect the landscape:
- SaaS and cloud-based models. These models are expected to outpace traditional IT approaches in the future, adding more complexity to the eDiscovery and data analytics process.
- Effects of the COVID-19 pandemic. It’s no surprise that a world-wide pandemic has changed the legal landscape forever, with legal and corporate employees utilizing more hybrid/remote work platforms than ever before. These types of work arrangements continue to complicate data analytics and eDiscovery, with the use of collaboration tools exploding.
- Introduction of new and complex file types. As new technology continues to expand, so do the types of files necessary to store the data. As various chat and collaboration platforms have affected the legal market, new types of data will also continue to do so. Integrating agile and innovative approaches to e-discovery is necessary for law firms.
- Workflow automation. By creating a streamlined workflow, automated by some top technology platforms, lawyers and eDiscovery professionals can weed out the unnecessary steps for their review. No matter the analytics platform, by developing a clear process, reviewers can save time and money.
- Automated Early Case Assessments (ECA). Although often overlooked, ECAs are an integral part of the legal process, helping legal professionals prepare for litigation by more accurately predicting cost of litigation, as well as by creating realistic budgets to help companies understand the cost associated with litigation. By automating ECAs, tech-savvy legal teams can ensure that their time is better spent on other projects.
- Rise of the “e2eDiscovery” approach. This “end-to-end” approach entails working with trusted vendors to ensure continuity throughout the review process. Having too many cooks early in the data management and review process can derail it, particularly if team members are over- or under-aggressive. For example, if you over-filter data at the wrong moment, the results down the line can be problematic. By bundling all parts of the eDiscovery process with a vendor like Level Legal, firms can cut costs and be sure their vendor partner understands the big picture as it relates to their data.
- ChatGPT. Since its release in 2022, this technology has created quite a stir, amassing a million users in its first five days. For the legal community, its introduction has brought mixed feelings. Some immediately see the benefits it brings to the legal field in general and eDiscovery specifically: generating keyword searches based on text, summarizing specific documents, and even reviewing documents for relevancy. Others are unsettled by its mounting potential to replace lawyers. (And cautionary tales of using the technology without understanding its limits are becoming more commonplace.)
Still Lagging Behind?
Just because analytics has evolved immensely since its creation doesn’t mean there aren’t still issues with the technology. For example, despite all of these developments – smart tools, smart workflows, designed and deployed by smart people – a few issues remain:
- Data explosion. The analytics industry still can’t keep up with the data. Ten years ago, a manual review might have needed 400 people. Today, the same review would likely require just as many reviewers, because data continues to expand at such a rapid rate. Even just finding the data can prove difficult for an organization.
- Judges and lawyers. Retired Seventh Circuit judge Richard Posner was right: law lags science. Even though analytics can be immensely helpful for lawyers and judges, they still do not always agree on the use of these tools during ESI protocol meet-and-confers.
The evolution in eDiscovery and data analytics has been a remarkable journey that has revolutionized the way we understand and utilize data. From its modest start as a tool for basic data reporting, it has evolved into a powerful technology capable of uncovering meaningful insights, predicting future trends, and driving informed decision-making across industries.
By continuing to harness its power – but understanding its limits – we are embarking on an exciting path of innovation, unlocking limitless possibilities and transforming the way we live, work, and solve problems.
Contact us to discuss how we can help you with eDiscovery and data analytics.
Jeff Guttman is Level Legal’s renaissance man. A student of the great books, a former cotillion marshal, and an avid gamer, the director of client services brings a multifaceted approach to eDiscovery and managed review. A true craftsman, he’s constantly improving his people, process, and technology to delight customers.