Hyperlex is a SaaS contract management solution that develops its own artificial intelligence. Thanks to this AI, we can offer a unique and efficient service. It analyzes contracts and extracts the important data they contain to facilitate your daily work.
This AI is developed by Hyperlex's Machine Learning team.
Among the team members is Estelle. Running the Data division (almost) single-handedly, she tells us about her career and her daily work in this exclusive interview.
What is your professional background?
I took preparatory classes and then attended an engineering school called Télécom Sud Paris. For my last year, I went to Eurecom, which offered a specialization in Data Science that was not available at my original school.
My first project was in natural language processing (NLP), on the evaluation of sentiment detection systems for Tweets. It was my first exposure to this field. I really enjoyed it!
After this degree, I wanted to continue in the field. I was then recruited by the French Ministry of the Armed Forces as an NLP engineer within their Data Science team.
I gained a fairly comprehensive knowledge base in NLP. Even if I didn't work on all the tasks myself, I was in close contact with them. That's when I started working on annotation.
I launched campaigns to collect data. I worked on writing annotation guides.
The goal? To be comprehensive enough in our instructions that people who are not familiar with artificial intelligence can understand them. Whoever reads them, the guides have to be understood and interpreted in the same way by everyone. They need to be as clean as possible.
I was then contacted by Hyperlex for a position called Data Analyst. But that's not exactly what I do...
What is your role in the ML team?
Today, my position covers three different themes:
1️⃣ The first is the annotation campaigns I mentioned earlier.
I run them hand in hand with legal experts. I have to define guidelines and then check with the team and the colleagues who use the data whether what is written in these guidelines suits them.
The objective is to annotate the elements of a contract that are useful to our customers: types (NDA, force majeure clause, non-competition clause, etc.), but also information such as amounts, addresses, organizations or even competent jurisdictions.
When you create your guidelines, there should be as little ambiguity as possible: all the answers should be in there. In practice there are always gaps, so we try to cover as many cases as possible.
I run these campaigns in particular to increase the language coverage of Hyperlex. Today, we have models trained on French, English and German. But we are working on data in Spanish, Portuguese and Italian, and soon in Japanese and Chinese. I have to check the quality of what is provided.
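One common way to check whether guidelines are "interpreted in the same way by everyone" is to have two annotators label the same items and measure their agreement. The sketch below computes Cohen's kappa in plain Python. It is an editorial illustration, not Hyperlex's tooling: the function name, the label names and the sample data are all assumptions.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labeled the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both annotators must label the same non-empty set of items")
    n = len(labels_a)
    # Observed agreement: share of items where both annotators chose the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Agreement expected by chance, from each annotator's label frequencies.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one single, identical label
    return (observed - expected) / (1 - expected)


# Hypothetical clause-type labels from two legal experts on the same four clauses.
annotator_1 = ["NDA", "NDA", "force_majeure", "non_compete"]
annotator_2 = ["NDA", "force_majeure", "force_majeure", "non_compete"]
kappa = cohen_kappa(annotator_1, annotator_2)
```

A kappa close to 1 suggests the guidelines leave little room for interpretation; a low value points to cases the guidelines do not cover clearly.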
2️⃣ The second aspect of my role starts from an observation made by the team. Even though our system uses increasingly complex neural architectures, performance varies little. In fact, we no longer make progress if the annotated corpus is not of sufficient quality.
So the ML team was looking for someone who could take charge of annotation and data quality upstream.
In this context, there is rework to do on the data: how to annotate it, clean it, gather statistics on what works and what doesn't, and think about how to minimize losses and guarantee the quality of the data that we want to add to our corpus.
When a new client signs with Hyperlex, we offer to take over their existing contracts. This means that we recover all of these contracts and import them into Hyperlex. Since our AI learns in particular from data specific to a client, labeling their contract history allows the ML models to make relevant suggestions on their data as soon as they start using Hyperlex. In the case of data recovery, the Contract Analyst delivers annotated data at the client's request. But this data is not necessarily perfectly compatible with the standards we had defined on our side...
3️⃣ Finally, the last major activity is the Data Model.
Even if Hyperlex is not very old, we realize that we need a common data repository for all teams. For example, we can say that for such and such a type of clause we use such and such a definition or such and such a format.
This is because today we have, on the one hand, global models trained on data from different customers and, on the other hand, local models trained only on a single customer's data. Only afterwards do we arbitrate between these two models so that the suggestions are as close as possible to what the customer expects.
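The interview does not say how this arbitration is done. As a purely hypothetical illustration, the sketch below prefers the client-specific (local) suggestion when it is confident enough and otherwise falls back to whichever suggestion has the higher score. The `Suggestion` class, the `arbitrate` function and the `local_threshold` value are assumptions made for this example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Suggestion:
    label: str
    confidence: float  # assumed to be a score between 0 and 1


def arbitrate(global_s: Suggestion,
              local_s: Optional[Suggestion],
              local_threshold: float = 0.6) -> Suggestion:
    """Hypothetical rule for choosing between a global and a local suggestion."""
    if local_s is None:
        # No client-specific model yet: only the global model can answer.
        return global_s
    if local_s.confidence >= local_threshold:
        # The local model knows this client's vocabulary and is confident.
        return local_s
    # Otherwise keep the more confident suggestion (ties go to the global one).
    return max(global_s, local_s, key=lambda s: s.confidence)


chosen = arbitrate(Suggestion("NDA", 0.9), Suggestion("confidentiality_agreement", 0.7))
fallback = arbitrate(Suggestion("NDA", 0.9), Suggestion("lease", 0.3))
```

Here `chosen` is the local suggestion because it clears the threshold, while `fallback` is the global one because the local model is unsure.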
In the long run, the first objective will be to identify the attributes that are in our global model but are not useful to anyone and should be removed. The second will be to define the categories that customers almost systematically ask for and that we would benefit from adding to our global model.
This is a project that we are gradually putting in place and that cuts across several teams, within Machine Learning and across Hyperlex!
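The two objectives above can be sketched as a simple comparison between the global model's attributes and what each customer actually uses or requests. This is an illustration under stated assumptions: the function name, the `min_share` threshold, the attribute names and the client data are all hypothetical.

```python
from collections import Counter


def review_global_model(global_attributes, client_attributes, min_share=0.8):
    """Suggest attributes to remove from, and add to, a global data model.

    client_attributes maps a client name to the set of attributes that
    client uses or has requested (hypothetical input format).
    """
    n_clients = len(client_attributes)
    used_by_someone = set().union(*client_attributes.values())
    # Attributes in the global model that no client uses.
    to_remove = sorted(set(global_attributes) - used_by_someone)
    # Attributes missing from the global model but requested by most clients.
    counts = Counter(attr for attrs in client_attributes.values() for attr in attrs)
    to_add = sorted(
        attr for attr, count in counts.items()
        if attr not in global_attributes and count / n_clients >= min_share
    )
    return to_remove, to_add


global_model = {"amount", "address", "fax_number"}
clients = {
    "client_a": {"amount", "renewal_date"},
    "client_b": {"amount", "address", "renewal_date"},
    "client_c": {"renewal_date", "governing_law"},
}
to_remove, to_add = review_global_model(global_model, clients)
```

With this sample data, `fax_number` is flagged for removal because no client uses it, and `renewal_date` is flagged for addition because every client asks for it.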
What is your daily life like?
I have a recurring background task, which is to support the legal experts throughout the campaigns. This can mean sending out sessions to annotate, checking that they are making progress, answering their questions, and checking quality after the fact. It doesn't take me a whole day, but it comes in dribs and drabs.
Often, I have features to develop in Python, whether to create or extend sessions, to write scripts that I need for the campaigns or for the evolution of the Data Model, or to calculate KPIs on product usage or on the quality of our ML suggestions.
I also use Metabase to surface information and get an overview of what exists. At the moment, for example, we are re-annotating old classification data for types of clauses. I have to make sure that the sessions I give the legal experts to annotate contain a varied mix of clause types. From the very first re-annotations, we must be able to obtain a balanced and efficient model on the clauses that are most used by our clients.
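One simple way to get that variety is to draw documents round-robin across clause types instead of sampling at random. The sketch below is an assumption about how such a session could be built, not a description of Discovery or of Hyperlex's scripts: `build_balanced_session`, the input format and the sample data are hypothetical, and it relies on the old clause-type labels being good enough to group documents.

```python
import random
from collections import Counter, defaultdict


def build_balanced_session(documents, session_size, seed=0):
    """Pick documents round-robin across clause types.

    documents is a list of (doc_id, clause_type) pairs, where clause_type
    is the existing (possibly noisy) label.
    """
    rng = random.Random(seed)
    by_type = defaultdict(list)
    for doc_id, clause_type in documents:
        by_type[clause_type].append(doc_id)
    for ids in by_type.values():
        rng.shuffle(ids)  # avoid always picking the oldest documents first

    session = []
    # Each pass takes at most one document per clause type, so rare types
    # are represented before frequent types fill the remaining slots.
    while len(session) < session_size and any(by_type.values()):
        for clause_type in sorted(by_type):
            if by_type[clause_type] and len(session) < session_size:
                session.append((by_type[clause_type].pop(), clause_type))
    return session


docs = (
    [(f"nda_{i}", "NDA") for i in range(10)]
    + [(f"fm_{i}", "force_majeure") for i in range(2)]
    + [(f"nc_{i}", "non_compete") for i in range(2)]
)
session = build_balanced_session(docs, session_size=6)
type_counts = Counter(clause_type for _, clause_type in session)
```

Even though NDAs dominate the sample data, the six-document session contains two documents of each clause type.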
How do you work with the other members of your team?
I'm the Data division all by myself! I have interns who are a great help. My role is to understand everyone's needs and pain points with the existing data, to identify what can be put in place to solve them.
👀 Going further: Behind the scenes with Hyperlex's Machine Learning team
What tools do you use and why?
I use Discovery, our in-house annotation tool, data visualization tools like Metabase and Grafana, and Python for all development tasks.
How would you explain your job to a 6-year-old?
Imagine you are making an apple pie with your older brother. On the table you find apples, pie crust, sugar and butter. 🍎
You would like to write this recipe down somewhere to remember it, but you can't write very well yet. However, you have special glasses that can see all these ingredients and identify them, so they can write the recipe for you.
In addition to noting what they see (apples, sugar, butter...), these glasses can recognize whether it is white sugar or brown sugar, unsalted or lightly salted butter, or even Golden, Granny or Gala apples...
My job is to help these glasses recognize as many ingredients as possible so they can write as many recipes as possible!
What do you like best about your job?
The variety of my assignments. That's something I was really looking for when I was job hunting. I already had it at the Ministry.
I don't just write code, I also do technical project monitoring. I have to plan, check, and make sure everything is moving forward.
In fact, I try to make things as clean as possible to make everyone's life easier. Even if I'm not the one using the information afterwards, it's super satisfying to think that, thanks to the data we annotated, we're going to see a model that works.
Discover the daily life of Estelle and Alaa from the Machine Learning team in this video:
These articles may interest you if you want to learn more about our technology:
- Automatic handwriting recognition in electronic contract management
- Discovering Automatic Language Processing (ALP): text classification (pt. 1)