The majority of natural language processing modules rely on some technique for representing text at the character, word or sequence level (sentence, paragraph or document). As a result, the performance of these modules depends heavily on the quality of the embeddings they are built on.


In order to provide our customers with the best document analysis in near real-time conditions, we pay close attention to the quality of our text embeddings, the complexity they come with, and their impact on processing time.

The point of this section is to provide a brief explanation of the most used types of textual embeddings at Hyperlex.

  • Statistical embeddings vs distributional embeddings

The statistical approach to textual analysis is based on tracking the presence of certain words in the processed texts as well as their frequency and rarity.
In this regard, the most common embedding methods are the “Bag of words” (BoW) [1] and “TF-IDF” [2].

Both methods start by building a complete inventory of the words observed across all documents, called the “vocabulary”. The Bag of Words technique represents each text as a sparse vector of the same size as the vocabulary, where each cell holds the presence (or count) of the corresponding word in the text and zeros mark the words that are absent. In the limit case of a single word, the vector contains zeros in every cell except the one at the index of this word in the vocabulary (one-hot encoding).
As for the TF-IDF method, it weights each term by its frequency in the given text (term frequency) and by its rarity in the general corpus (inverse document frequency). This means that a term matters more for the prediction the more frequent it is in its own context and the rarer it is among the other documents in the training set.
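To make these representations concrete, here is a minimal sketch in plain Python. The three toy sentences, the function names and the exact TF-IDF weighting (relative term frequency multiplied by log(N / document frequency)) are assumptions chosen for illustration; libraries use several variants of this formula.

```python
import math
from collections import Counter

# Toy corpus (illustrative only).
docs = [
    "the tenant pays the rent",
    "the landlord terminates the lease",
    "the tenant terminates the lease",
]
tokenized = [d.split() for d in docs]

# Vocabulary: sorted inventory of every word seen in the corpus.
vocab = sorted({w for doc in tokenized for w in doc})
index = {w: i for i, w in enumerate(vocab)}

def one_hot(word):
    """Vector of vocabulary size with a single 1 at the word's index."""
    vec = [0] * len(vocab)
    vec[index[word]] = 1
    return vec

def bag_of_words(tokens):
    """Count of each vocabulary word in the text (0 when absent)."""
    counts = Counter(tokens)
    return [counts[w] for w in vocab]

def tf_idf(tokens):
    """Term frequency in the text times log-scaled rarity in the corpus."""
    counts = Counter(tokens)
    n_docs = len(tokenized)
    vec = []
    for w in vocab:
        tf = counts[w] / len(tokens)
        df = sum(1 for doc in tokenized if w in doc)  # document frequency
        vec.append(tf * math.log(n_docs / df))
    return vec

# bag_of_words(tokenized[0]) -> [0, 0, 1, 1, 1, 0, 2]
```

In this toy corpus, “the” appears in every document, so its TF-IDF weight is zero, while “rent”, which appears in a single document, receives the highest weight in the first sentence.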

Legal documents in general, and contracts in particular, are often complex texts with a very specialised vocabulary and sophisticated sentence structure. So when it comes to analysing these documents, the mere presence or frequency of a word is not enough to capture the full meaning of a contract or clause. This is why we decided very early on to leverage distributional approaches for improved understanding and analysis.

Distributional embeddings are based on the distributional hypothesis in linguistics. This hypothesis states that the meaning of a word is determined by the collection of contexts in which it is used, which means that our understanding of a word improves each time we encounter it in a new context. This is why distributional methods require a large corpus to train meaningful embeddings.

The best known and most widely used approaches to distributional embeddings are Word2vec and GloVe.

Word2vec [3] is a group of models introduced in 2013 by a Google research team led by Tomas Mikolov. It relies on two training paradigms, “CBOW” and “Skip-gram”, which relate a word to its context.
The “CBOW” technique predicts the probability of a word given its context (the context being one or more surrounding words), while “Skip-gram” models predict the probability of a context given a word.
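The difference between the two paradigms is easiest to see in the training examples they generate from the same sentence. The sketch below only builds those examples (it does not train a model); the function names and the window size are assumptions made for illustration.

```python
def cbow_examples(tokens, window=2):
    """CBOW: (context words -> target word) training examples."""
    examples = []
    for i, target in enumerate(tokens):
        context = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
        if context:
            examples.append((context, target))
    return examples

def skipgram_examples(tokens, window=2):
    """Skip-gram: (target word -> one context word) training pairs."""
    return [
        (target, c)
        for context, target in cbow_examples(tokens, window)
        for c in context
    ]

tokens = "the tenant pays the rent".split()
# cbow_examples(tokens, window=1)[1]     -> (['the', 'pays'], 'tenant')
# skipgram_examples(tokens, window=1)[0] -> ('the', 'tenant')
```

CBOW produces one example per position, with the whole window as input, whereas Skip-gram produces one pair per (word, context word) combination.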

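GloVe [4], on the other hand, leverages word pair co-occurrences within a given context window to build a co-occurrence matrix of shape (V, V), V being the size of the vocabulary.
Since the vocabulary is often very large, using the rows of this matrix directly would give embeddings of size V, which are far from practical for the techniques built on top of them. To solve this problem, the matrix is reduced to N dimensions, N being the size of the final embeddings: GloVe learns these N-dimensional vectors by fitting a weighted least-squares model to the logarithm of the co-occurrence counts, which plays a role similar to a Principal Component Analysis that keeps the N most important features.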
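As an illustration of the first step only, the following sketch builds a raw (V, V) co-occurrence matrix with a symmetric window. It is a simplification: the function name and window size are assumptions, the counts are not distance-weighted, and the dimensionality reduction step is left out.

```python
def cooccurrence_matrix(sentences, window=2):
    """Count how often each pair of words appears within `window` positions."""
    vocab = sorted({w for s in sentences for w in s})
    index = {w: i for i, w in enumerate(vocab)}
    size = len(vocab)
    matrix = [[0] * size for _ in range(size)]
    for s in sentences:
        for i, w in enumerate(s):
            lo, hi = max(0, i - window), min(len(s), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    matrix[index[w]][index[s[j]]] += 1
    return vocab, matrix

sentences = [["the", "tenant", "pays"], ["the", "landlord", "pays"]]
vocab, matrix = cooccurrence_matrix(sentences, window=1)
# vocab -> ['landlord', 'pays', 'tenant', 'the'], matrix is 4 x 4 and symmetric
```

Each row of this matrix is already a (very sparse) representation of a word of size V, which is what motivates reducing it to N dimensions.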

Finally, we cannot discuss word embeddings without mentioning FastText [5], Facebook’s own word embeddings (introduced in 2016) following the original word2vec paradigm but using a collection of sub-words instead of words as input. These embeddings are known to be robust to different forms and conjugations as well as spelling errors.
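The robustness to word forms comes from the sub-word units themselves. The sketch below extracts character n-grams from a word wrapped in boundary markers, in the spirit of the FastText paper; the function name and the n-gram range are assumptions, and the actual library additionally hashes the n-grams and learns a vector for each of them.

```python
def char_ngrams(word, n_min=3, n_max=6):
    """Character n-grams of a word wrapped in '<' and '>' boundary markers."""
    wrapped = f"<{word}>"
    grams = []
    for n in range(n_min, n_max + 1):
        for i in range(len(wrapped) - n + 1):
            grams.append(wrapped[i:i + n])
    return grams

# char_ngrams("where", 3, 3) -> ['<wh', 'whe', 'her', 'ere', 're>']
shared = set(char_ngrams("contract", 3, 3)) & set(char_ngrams("contracts", 3, 3))
# The singular and plural forms share 7 of their trigrams.
```

Because two forms of the same word (or a word and its misspelling) share most of their n-grams, their embeddings end up close to each other even if one of them was never seen during training.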


  • The rise of the language models

The success of word embeddings led to a rise of interest in natural language processing among the research community. Many researchers based their work on word embeddings and used them as input features for their architectures, with a wide variety of applications ranging from sentiment analysis to information extraction.

At this point, we started noticing the limitations of word embeddings, since a word has a single embedding regardless of the context it is used in. The problem is more noticeable when applying these word embeddings to a domain-specific corpus where certain words have very different uses from the general domain. This sparked the need for contextualised word embeddings, which are trained using a language modeling approach.

Language models are statistical or probabilistic techniques that determine the probability of a given sequence of words occurring in a sentence. In most cases, we are trying to predict the next word of a sequence given the N preceding words.
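The simplest instance of this idea is a count-based bigram model, where N = 1 and probabilities are estimated from relative frequencies. The sketch below is only meant to illustrate the definition; the toy corpus, the function names and the `<s>` / `</s>` boundary tokens are assumptions, and neural language models replace the counts with learned parameters.

```python
from collections import Counter, defaultdict

def train_bigram_model(sentences):
    """Count, for each word, which words follow it."""
    counts = defaultdict(Counter)
    for tokens in sentences:
        padded = ["<s>"] + tokens + ["</s>"]
        for prev, nxt in zip(padded, padded[1:]):
            counts[prev][nxt] += 1
    return counts

def next_word_probability(counts, prev, nxt):
    """P(nxt | prev) estimated by relative frequency."""
    total = sum(counts[prev].values())
    return counts[prev][nxt] / total if total else 0.0

def sequence_probability(counts, tokens):
    """Probability of a whole sentence as a product of bigram probabilities."""
    padded = ["<s>"] + tokens + ["</s>"]
    prob = 1.0
    for prev, nxt in zip(padded, padded[1:]):
        prob *= next_word_probability(counts, prev, nxt)
    return prob

corpus = [
    ["the", "tenant", "pays"],
    ["the", "landlord", "pays"],
    ["the", "tenant", "leaves"],
]
model = train_bigram_model(corpus)
# P(tenant | the) = 2/3 and P(pays | tenant) = 1/2 in this toy corpus
```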

A good example of a popular and successful language model is ELMo (Peters et al., 2018) [6]. Upon its release, it was added to existing NLP systems, significantly improving the state of the art on every task considered at the time.

Figure 1: Improvements with language model embeddings over the state-of-the-art (Peters et al., 2018) figure 13 [7]

ELMo consists of a bi-directional LSTM (Long Short-Term Memory) neural architecture that takes ordered word sequences as input (with a word vector for each word) and is trained to predict the next word in the sequence, in both the forward and backward directions.

Like regular word embedding methods, a language model only requires raw unlabeled text that can be scraped from the internet or retrieved from specific corpora. And, to be completely transparent, a GPU is also required given the long training time this task needs.

  • Transformer based LM

Figure 2: The Transformer – model architecture (Vaswani et al.)

The Transformer (Vaswani et al.) [8] is a neural architecture based solely on attention mechanisms that learns contextual relations between the words of a text.

The original paper introduced two types of Transformer blocks: the encoder and the decoder. The encoder’s role is to transform each input token into a latent representation, and the decoder produces a prediction for the given task while attending to the input’s latent states.

Almost all language models before the release of the Transformer used some form of recurrent neural network (LSTM, GRU, ...), and the OpenAI research team was the first to use the Transformer for language modeling, in their GPT model [9].

On its release, GPT (which stands for Generative Pre-trained Transformer) took the NLP community by surprise, outperforming state-of-the-art results on a variety of language tasks such as Textual Entailment and Semantic Similarity.
A few months later, Google’s research team released its own Transformer-based language model named BERT (Bidirectional Encoder Representations from Transformers) [10], which outperformed the GPT model and confirmed the huge potential of these giant language models.

These two models have very similar architectures, the difference being that BERT is built on the encoder block of the Transformer architecture while GPT uses the decoder block.
The language modeling task is a little different as well: GPT uses regular language modeling (predicting the next word from left to right), whereas BERT uses “Masked Language Modeling” (predicting one or more masked words in the original sequence). BERT has one additional training objective, which consists of predicting whether one sentence follows another in a text.
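The two objectives differ mainly in how training examples are built from a sequence. The sketch below contrasts them on a list of tokens; the function names, the `<mask>` token and the fixed seed are assumptions for illustration. The 15% masking rate follows the BERT paper, but the sketch ignores its refinement where some selected tokens are kept or replaced by random ones.

```python
import random

def causal_lm_examples(tokens):
    """GPT-style objective: predict each token from everything to its left."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]

def masked_lm_example(tokens, mask_prob=0.15, mask_token="<mask>", seed=0):
    """BERT-style objective: hide random tokens, predict them from both sides."""
    rng = random.Random(seed)
    masked, targets = [], {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            masked.append(mask_token)
            targets[i] = tok  # position -> original token to recover
        else:
            masked.append(tok)
    return masked, targets

tokens = "the tenant shall pay the rent".split()
# causal_lm_examples(tokens)[0] -> (['the'], 'tenant')
masked, targets = masked_lm_example(tokens, mask_prob=0.5)
```

In the masked setting, the model sees the words on both sides of each hidden position, which is what makes the resulting representations bidirectional.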

Since the release of these two giants in late 2018, several variants of the BERT and GPT models have been introduced, which differ in their training protocol or choice of tokenizer, such as RoBERTa [11], ALBERT [12], etc.

For the rest of this article, we will be going through the process of fine-tuning CamemBERT on a relatively large corpus of legal contracts.
CamemBERT [13] is one of the latest BERT-like language models; it was introduced as a French version of the RoBERTa model by the ALMAnaCH team at Inria.

Figure 3: Data processing pipeline


  • Data source

Most language models available today are trained on data scraped from the internet (Common Crawl Corpus [14], Oscar [15], …).
In our case, we will be using a subset of our legal contract database consisting of hundreds of thousands of PDF/Word documents provided by our large client base.

  • Security

At Hyperlex, security and confidentiality are priorities. Therefore, we have a triple-layer storage security system, which basically means that every one of our clients’ contracts must be encrypted at all times. The decryption step is only performed in memory and on sandboxed, inaccessible servers.
These security measures add a certain level of complexity to our data preparation and model training pipelines.
In the legal realm, the most important version of a contract is the final, signed version. Most of the time, we are dealing with handwritten signatures (as opposed to electronic signatures), which means that once the contract negotiation is done, the final agreed-upon version must be printed, signed, scanned and uploaded to our application. As a result, 98% of our contract database consists of scanned PDF documents.

  • Optical Character recognition

The first step of our data pipeline is to perform Optical Character Recognition (OCR) to retrieve the textual content of these documents.
Over the last two decades, OCR methods and techniques have improved drastically thanks to the wide success of deep learning algorithms in computer vision. Nonetheless, these OCR modules are far from perfect and are highly sensitive to scan quality, document orientation, etc., which translates into spelling errors and words missing from the original version of the contract.

  • Document Segmentation

The typical output of an OCR module is, for each page of the document, a list of word bounding boxes, each paired with the word extracted from the corresponding image crop.

At this point in our data preparation pipeline, we are interested in building meaningful text samples for our Language Model to be trained on. The granularity of these text samples depends on the input size of our model.
In our case, we are interested in training/fine-tuning a CamemBERT model, which is based on the RoBERTa architecture and accepts input sequences of up to 512 tokens. This means that we can go beyond sentence granularity and train our model on blocks of text that are more meaningful in the legal context.
In fact, the backbone of every contract is a collection of clauses, each tackling a specific topic such as term, termination, confidentiality, severability, liability, etc.
This specific structure gives us very rich contexts for language modeling, provided that we extract these blocks correctly.

To do so, we have implemented a complete segmentation pipeline that leverages bounding box positions, syntactic and semantic cues, as well as regular contract patterns to build a complete document structure (lines, paragraphs, clauses, …) from the OCR output and ensure that each of our documents is properly segmented.
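To give an idea of the lowest level of such a structure, the sketch below groups OCR word boxes into text lines using only their positions. It is a deliberately simplified, hypothetical illustration and not our actual pipeline: the dictionary keys (`text`, `x`, `y`), the function name and the vertical tolerance are all assumptions.

```python
def group_words_into_lines(words, y_tolerance=5):
    """Group word boxes whose vertical positions are close into text lines.

    `words` is a list of dicts with hypothetical keys: 'text', 'x', 'y'
    (top-left corner of the bounding box, in pixels).
    """
    lines = []
    for word in sorted(words, key=lambda w: (w["y"], w["x"])):
        # Same line if the box is vertically close to the previous word.
        if lines and abs(word["y"] - lines[-1][-1]["y"]) <= y_tolerance:
            lines[-1].append(word)
        else:
            lines.append([word])
    # Within a line, read the words from left to right.
    return [
        " ".join(w["text"] for w in sorted(line, key=lambda w: w["x"]))
        for line in lines
    ]

ocr_words = [
    {"text": "1", "x": 60, "y": 101},
    {"text": "Term", "x": 10, "y": 130},
    {"text": "Article", "x": 10, "y": 100},
]
# group_words_into_lines(ocr_words) -> ['Article 1', 'Term']
```

Paragraphs and clauses would then be built on top of these lines, using the additional cues mentioned above (spacing, numbering patterns, titles, and so on).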

  • Data selection/preprocessing

In this last step of data preparation, our main interest is to filter out irrelevant and very noisy text samples generated by our document segmentation. Irrelevant samples include very short structural components of documents such as titles, company information and signature blocks.
Noisy text samples are samples containing many spelling errors or gibberish chunks of text due to poor OCR performance on low-quality documents. In those cases, we tried applying spelling correction when possible but ended up discarding most of these samples.
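A filter of this kind can be approximated with two simple heuristics: a minimum length, and a minimum share of tokens found in a reference word list. The sketch below is a hypothetical illustration rather than our production filter; the function name, the thresholds and the tiny word list are assumptions.

```python
def is_relevant_sample(text, known_words, min_tokens=10, min_known_ratio=0.7):
    """Keep a sample only if it is long enough and mostly made of known words."""
    tokens = [t.strip(".,;:()'\"").lower() for t in text.split()]
    tokens = [t for t in tokens if t]
    if len(tokens) < min_tokens:
        return False  # titles, signature blocks, company details...
    known = sum(1 for t in tokens if t in known_words)
    return known / len(tokens) >= min_known_ratio  # reject OCR gibberish

# Tiny illustrative word list; a real one would be a full dictionary.
known_words = {"the", "tenant", "shall", "pay", "rent", "on",
               "first", "day", "of", "each", "month"}

clean = "The tenant shall pay the rent on the first day of each month."
noisy = "Th3 t3nant sha11 p@y the r3nt 0n the f1rst d@y 0f e@ch m0nth"
# is_relevant_sample(clean, known_words) -> True
# is_relevant_sample(noisy, known_words) -> False
# is_relevant_sample("Signature", known_words) -> False
```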

Training Process

Experiments and evaluation
As mentioned before, we will be using the pre-trained CamemBERT as our base model and fine-tuning it for 10 epochs on our legal dataset.

In this section we will be conducting multiple experiments to study the following factors:

  • Complex embeddings: Evaluate the need for LM based embeddings by comparing with traditional embeddings techniques
  • Need to Fine-tune: Evaluate the need for fine-tuning on domain specific data
  • Text granularity: Compare models trained on sentence and paragraph level samples.
  • Fine-tuning time: Evaluate the effect/need for long fine-tuning steps
  • Freezing LM for downstream tasks: Comparing the effect of freezing the LM weights when training on downstream tasks.

To evaluate and compare these different approaches, we are going to train a clause classifier based on embeddings generated by these specific models.


  • Language modeling

We have at our disposal (post-filtering):

  • 14782 contracts
  • 3.4 million paragraphs <=> 8.4 million sentences

  • Clause classification

The clause classification dataset contains:

  • 48700 clauses
  • 64 classes

This dataset is highly imbalanced, with some categories having up to 4498 samples while other classes have as few as 50 samples.
For evaluation, we will be using 25% of our clause classification dataset (12175 samples), and our metric of choice for this imbalanced classification is the unweighted macro F1-score.
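The unweighted macro F1-score is the plain average of the per-class F1 scores, so every class counts equally regardless of its size. A minimal sketch of the computation follows; the function name and the two-class toy example are assumptions for illustration.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class is never hit.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)

# A classifier that always predicts the majority class:
y_true = ["price"] * 8 + ["term"] * 2
y_pred = ["price"] * 10
# Accuracy is 0.8, but macro F1 is only (16/18 + 0) / 2, i.e. about 0.44.
```

This is why the metric suits an imbalanced dataset: ignoring a small class is heavily penalised, whereas accuracy would barely notice it.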


To implement and train our model, we need the original CamemBERT pre-trained model, an implementation of the SentencePiece tokenizer and the actual CamemBERT (RoBERTa) architecture, preferably in PyTorch.
Luckily for us, the Hugging Face team has already addressed these common needs through their Transformers library.

The Transformers library provides general-purpose architectures (BERT, GPT-2, RoBERTa, XLM, DistilBert, XLNet…) for Natural Language Processing as well as their tokenizers and basic configurations, while maintaining deep interoperability between TensorFlow 2.0 and PyTorch.
Hugging Face hosts multiple official pre-trained models as well as pre-trained models from various sources and for different domains, in more than 100 languages. They also built a platform to share and download pre-trained models for everyone to use in their projects.


  • Complex embeddings

In this experiment, we will be evaluating the impact of the embedding generation technique on our clause classification task.
We will be training the same basic classification head for the same number of epochs with clause embeddings generated using TF-IDF, FastText and pre-trained CamemBERT.

FastText paragraph embeddings seem to be less effective than TF-IDF embeddings, while an out-of-the-box CamemBERT language model provides better results than both techniques, but at a significantly higher cost (infrastructure/inference time).

  • Need to Fine-tune

Now that we have established the usefulness of complex embedding techniques, let’s try to push them even further using transfer learning. This experiment consists of comparing the performance of out-of-the-box CamemBERT embeddings to those of a version fine-tuned on the legal domain (for 10 epochs).

Fine-tuning the original CamemBERT model on our legal domain corpus improves the overall performance of our clause classification model by approximately 4%.

After a thorough analysis of the per-class performance, we noticed that the fine-tuned model performance significantly improved for certain classes like: “clause modalités de paiement”, “clause de prix” and “clause de revision des prix”. The particularity of these categories is that they all cover the same specific topic with a lot of common expressions and very similar phrases. This is a good indication that our legal domain model has a better “understanding” of some legal concepts.

  • Text Granularity

As mentioned above, we have the possibility of fine-tuning our language model on sentences (similarly to the original paper) or clauses (one or more paragraphs). Therefore, we will be comparing embeddings generated from two language models: the first was fine-tuned on sentence-level samples while the second had paragraph-level training inputs. Both models were initialised from the pre-trained CamemBERT and trained for 4 epochs.

The granularity of text samples during fine-tuning seems to have no effect on the overall performance. Therefore, we will continue our training at the clause (paragraph) level since it best matches our inference granularity.

  • Fine-tuning time

When it comes to large language models, fine-tuning is a very expensive process given the resources required (multiple GPUs) and the considerable training time. Therefore, we are going to test the quality of embeddings generated at each fine-tuning epoch and evaluate the efficiency (cost vs performance) of further training these models.

Based on the above plot, the performance improvement is observed very early in the fine-tuning which indicates that language models don’t need extended training to adapt to specific domains. This observation is quite valuable when we consider the cost of extending the training for more epochs.

  • Freezing LM for downstream tasks

So far, we have used the pre-trained language model as an encoder to provide our classifier with clause embeddings. This means that the LM weights were frozen during the clause classification training. In this experiment, we are going to fine-tune these weights while training our clause classifier and evaluate the impact on the overall performance.

We observe a leap in performance if we continue fine-tuning the language model while training our clause classifier. This is a very interesting improvement if the LM is only used for clause classification, but it can undermine the generalization capabilities of the model if it is used for other downstream tasks. This risk is known as catastrophic forgetting.


In recent years, research in Natural language processing has been moving at a tremendous pace with more efficient techniques being introduced almost every week. The introduction of transfer learning and pretrained language models improved the performance of downstream NLP tasks and opened the door for new applications in a variety of fields such as contract analysis in the legal domain.

As observed in the previous experiments, leveraging language models helped us improve our clause classification compared to traditional embedding methods, and a similar improvement is also observed in other contract analysis tasks such as information extraction, entity relations and similarity search.

The downside of using these methods is the complexity of the infrastructure needed to deploy large models, as well as the effort of maintaining the code base for these model architectures. However, thanks to the significant contributions to the NLP community by companies such as Facebook, Google, OpenAI and Hugging Face, applying state-of-the-art techniques and models to domain-specific tasks has never been easier.

sources :


This article was written by Ahmed Touila, Senior Machine Learning Research Engineer at Hyperlex
