September 14, 2016
By Dan Rubins

Processing real-world information is messy. It gets even harder with Legal Language, where both the amount of information and the degree of precision encoded in the text are extremely high. Every single word carries meaning and is often placed carefully. While Legal Language can be difficult for us mere mortals to read and understand, lawyers who craft contracts often do so with great intention.

In order for software to effectively make sense out of legal documents, we need a formula for finding the context of words. A skilled human reader can do this relatively easily, but for a machine, making sense of words based on their relationship to each other is an extremely complex task.

How We Analyze Language With a Distributional Model

One way to do this is to use a distributional model, which produces word vectors by analyzing where words appear in relation to each other. Words that are related to each other tend to be found in similar linguistic contexts, so we can use an algorithm to infer the relative meaning of those words from their context. When it comes to legal documents like contracts, this allows us to work with reliable statistical methods in addition to human interpretation for far better accuracy. It also unlocks an entire space of mathematics that lets us perform tasks like summarizing the information in the document and running comparisons against other documents to easily spot unusual language or terms.

Distributional word vector tools start by looking at millions of sentences to find instances where words naturally appear together. In statistical terms, they build a co-occurrence matrix. From this information, the algorithm creates a word vector, also called a word embedding, for each word, which is a compressed representation of that word's entries in the co-occurrence matrix. A word vector might be described as a set of numbers that gives the location of a word (like GPS coordinates) in a very high dimensional space. Like GPS, we can compare the relative locations of the words to derive the relationship between two words, such as Word #1 being “next to” Word #2, or determine the path from Word #1 to Word #2. When looking at new documents, we can use these word vectors to determine the meaning of a word in context. When we combine word vectors into larger sets, our software “learns” a numeric representation of a legal concept, like Indemnification, Assignment, or an exchange.
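To make the co-occurrence idea concrete, here is a minimal sketch in Python. The four-sentence corpus, the window size, and the function names are all made up for illustration, and the sketch compares raw co-occurrence counts directly rather than compressing them into embeddings the way a real tool would.

```python
from collections import Counter, defaultdict
import math

# Toy corpus (hypothetical sentences, for illustration only)
corpus = [
    "the buyer shall indemnify the seller",
    "the buyer shall pay the seller",
    "the tenant shall pay the landlord",
    "the tenant shall indemnify the landlord",
]

def cooccurrence_rows(sentences, window=2):
    """For each word, count how often every other word appears within `window` positions."""
    rows = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.split()
        for i, word in enumerate(tokens):
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    rows[word][tokens[j]] += 1
    return rows

def cosine(a, b):
    """Cosine similarity between two sparse count vectors stored as Counters."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

rows = cooccurrence_rows(corpus)
sim_verbs = cosine(rows["indemnify"], rows["pay"])       # two words used in the same contexts
sim_mixed = cosine(rows["indemnify"], rows["landlord"])  # two words used in different contexts
print(sim_verbs, sim_mixed)
```

In this toy corpus "indemnify" and "pay" always sit between the same neighbors, so their count vectors point in the same direction, while "indemnify" and "landlord" share far less context and score lower.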

Of course, this is an oversimplification, so let’s dive into two of the most popular ways to create word vectors.

Predictive software tools, like word2vec, learn word vectors by repeatedly predicting which words appear together or close together in text. As it works through the text, the tool adjusts each word's vector so that its predictions about neighboring words improve. Because similar words usually show up in similar contexts, they naturally end up with similar word vectors.

There are two models that word2vec uses. One is called continuous bag of words (CBOW), which predicts a word based on the surrounding words in a given window. For example, you give the software the 20 words around a gap and ask it to figure out which word belongs in the middle. The second model is called skip-gram, and it works the opposite way. You take the current word and the software predicts the words that are most likely to appear in the window around it.

Why use one method over the other? CBOW tends to be most useful in smaller datasets because it treats the entire group of context words as a single observation. Skip-gram treats each target word-context pair as a new observation, which usually works best in larger datasets.
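The difference between the two models is easiest to see in the training examples each one generates. The sketch below is an illustration only, not word2vec's actual code; the function names, the example sentence, and the window of 2 are assumptions chosen to keep the output small.

```python
def context_window(tokens, i, window):
    """Words within `window` positions of index i, excluding the word at i."""
    left = tokens[max(0, i - window):i]
    right = tokens[i + 1:i + 1 + window]
    return left + right

def cbow_examples(tokens, window=2):
    """CBOW: one example per position, (all context words) -> target word."""
    return [(context_window(tokens, i, window), tokens[i])
            for i in range(len(tokens))]

def skipgram_examples(tokens, window=2):
    """Skip-gram: one example per (target word, single context word) pair."""
    return [(tokens[i], c)
            for i in range(len(tokens))
            for c in context_window(tokens, i, window)]

tokens = "tenant shall pay rent monthly".split()
cbow = cbow_examples(tokens)
skipgram = skipgram_examples(tokens)

print(cbow[2])        # the whole context is grouped into one example for "pay"
print(skipgram[5:9])  # the same context, split into one pair per context word
print(len(cbow), len(skipgram))
```

For this five-word sentence, CBOW produces 5 examples and skip-gram produces 14, because skip-gram breaks every context window into separate pairs.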

To speed up training, word2vec also allows negative sampling. Instead of comparing each observed target word-context pair against every other word in the vocabulary, the tool chooses a few words at random and checks that the observed pair scores higher than those random pairings. Two words or terms that are unrelated would, logically, not appear together frequently. If your chosen pair does appear together more often than your random examples, they are more likely to be related.
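As a rough sketch of the idea, the snippet below scores one observed pair against three randomly chosen "noise" words. The vocabulary, the vector size, and the random starting vectors are hypothetical, and the noise words are drawn uniformly for simplicity, whereas word2vec draws them from a smoothed word-frequency distribution.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def negative_sampling_loss(target_vec, context_vec, negative_vecs):
    """Loss for one observed pair plus a few randomly drawn noise words.

    The loss is low when the observed pair scores high and the noise words score low.
    """
    loss = -math.log(sigmoid(dot(target_vec, context_vec)))
    for neg in negative_vecs:
        loss -= math.log(sigmoid(-dot(target_vec, neg)))
    return loss

rng = random.Random(0)
vocab = ["tenant", "shall", "pay", "rent", "monthly", "arbitration", "patent"]
dim = 4
# Hypothetical starting vectors: small random values, as before any training
input_vecs = {w: [rng.uniform(-0.5, 0.5) for _ in range(dim)] for w in vocab}
output_vecs = {w: [rng.uniform(-0.5, 0.5) for _ in range(dim)] for w in vocab}

target, context = "pay", "rent"
candidates = [w for w in vocab if w not in (target, context)]
negatives = rng.sample(candidates, 3)  # uniform sampling, a simplification

loss = negative_sampling_loss(
    input_vecs[target],
    output_vecs[context],
    [output_vecs[w] for w in negatives],
)
print(negatives, loss)
```

Each training step here touches only four word vectors (one observed context and three noise words) instead of the whole vocabulary, which is where the speed-up comes from.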

Using word2vec will take a little longer from the start than it would with a count-based software tool because the predictive model essentially “learns” relationships as it analyzes words. But because it is managing this analysis as it works through the data, it tends to take less memory to accomplish the same task as with a count-based tool.

Count-based tools, like GloVe (Global Vectors), essentially count the words around the key term and how far apart they are. Words are grouped in a matrix and weighted by how much distance is between them, so terms that appear close together frequently have more co-occurrence weight. These become word vectors that identify the relationship between the key term and another related word.
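Here is a minimal sketch of distance-weighted counting, assuming that a pair of words d positions apart adds 1/d to its count, which is the weighting described for GloVe. The function name, the example sentence, and the window of 3 are made up for illustration, and the sketch stops at the weighted counts rather than fitting vectors to them.

```python
from collections import defaultdict

def weighted_cooccurrence(tokens, window=3):
    """Symmetric co-occurrence counts where a pair d words apart adds 1/d."""
    counts = defaultdict(float)
    for i, word in enumerate(tokens):
        for d in range(1, window + 1):
            j = i + d
            if j >= len(tokens):
                break
            counts[(word, tokens[j])] += 1.0 / d
            counts[(tokens[j], word)] += 1.0 / d
    return counts

tokens = "tenant shall pay rent and tenant shall pay interest".split()
counts = weighted_cooccurrence(tokens)

print(counts[("shall", "pay")])    # adjacent twice, so the weight is highest
print(counts[("tenant", "pay")])   # co-occur three times, but farther apart
print(counts[("tenant", "rent")])  # only ever two or three words apart
```

Words that sit right next to each other repeatedly, like "shall" and "pay", accumulate more weight than words that co-occur at a distance.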

Because count-based tools use relatively straightforward mathematical formulas to show the relationship between words, there are not multiple models or variations as there are with predictive tools.

The matrices that GloVe creates can take up more memory than a predictive tool like word2vec needs, so large-scale analysis can require more robust hardware to process. GloVe does tend to work more quickly than a predictive tool and is suggested to be more accurate, though research on which type of software works best for a given situation is still somewhat sparse. In general, we’ve found that GloVe can produce more accurate results because it is actually counting through specific instances of the target word-context pairs.

Distributional word vectors can take different forms, but the result is an improvement in how machines can interpret natural language. Harnessing a computer's ability to analyze language, understand context, and compare language use can make working through legal documents much easier and faster than using human eyes to review each line of text.

Why it Matters

One of the reasons these techniques are so powerful: they’re unsupervised. This means that once the algorithm is coded, no human has to be involved in training or creating the word vectors, so it’s not constrained by human limits of efficiency, only by the compute power available to the training algorithm. Although a human creates the algorithm, the resulting word vectors are also not constrained by human biases and differences in judgment (just by bias in the data the algorithm reads).

The scale of processing that unsupervised algorithms afford allows interesting new applications, like Legal Robot, to take advantage of far larger datasets than a human could ever read or understand. In some ways the algorithm has the capacity to understand a greater volume and diversity of text (and certainly with greater speed), but it is shallower in its understanding than a human expert. So, the “understanding” that an algorithm can garner is different from what a human can provide, which is why these techniques work so well in tandem with humans.