Natural Language Processing in Production at Courier

Backend histories

Rodrigo Alarcón
6 min read · Jan 12, 2018

I joined Codeq in 2014 after an informal, pleasant conversation with Paulo about applying NLP to analyze news, app reviews, emails, and many other exciting ideas. Soon after, Dane asked me to join an “applied research startup focused on bringing big ideas to life”. That phrase sounded a bit crazy at that point: “applied research startup”… sure!

It turns out that, after these years, we have indeed done a lot of research at Codeq (see here and here for reference) and we have been able to develop some big ideas. Our most recent one: applying NLP and Deep Learning to let users spend less time reading email.

The NLP research part at Codeq has been amazingly fun. But going from research to production has been a painful road! What started as a single script to summarize our own emails has turned into a very complex architecture: a robust NLP preprocessing pipeline, a set of Deep Learning classifiers, a group of hybrid NLP modules (a conversational email summarizer, a botmail summarizer and a task extractor) and several microservices, which together make it possible to summarize millions of emails per week.

In this post we want to describe some details of our NLP backend and share what we have learned while developing such a complex architecture and deploying it to production.

Analyzing an email message

Our NLP pipeline at Courier is composed of three main processes:

Email Parser

In theory, an email is a structured text object. In theory. In practice, there are so many twisted variations in the structure of a raw message that it is very hard to implement a general parsing strategy to extract the main elements of an email: headers (to, from, cc, date, subject), salutation, main content, threads, signature, etc. That is exactly our first component: an email parser that analyzes a raw message to extract its relevant content.
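
As a rough illustration, Python's standard email library already gets you the headers and a plain-text body; everything beyond that (salutations, quoted threads, signatures) is where the real work lives. The helper below is a minimal sketch, not our production parser:

```python
# A minimal sketch of the parsing step using Python's standard email
# library. A real parser also has to strip salutations, quoted
# threads and signatures, which is far harder than this.
from email import message_from_string
from email.message import Message


def parse_raw_email(raw: str) -> dict:
    msg: Message = message_from_string(raw)
    # Headers are the easy part; the body is where the trouble starts.
    headers = {k: msg.get(k) for k in ("To", "From", "Cc", "Date", "Subject")}
    # Walk multipart messages and keep the first text/plain part.
    body = ""
    for part in msg.walk():
        if part.get_content_type() == "text/plain":
            body = part.get_payload(decode=True).decode(
                part.get_content_charset() or "utf-8", errors="replace"
            )
            break
    return {"headers": headers, "body": body}
```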

Email Classifier

Once we are able to extract the important elements of an email, including its plain text (or HTML content), we need to classify what type of message it is in order to apply the best summarization strategy. This second component is an email type classifier. First, it detects the language of the email (we currently support only English). Then it classifies the email into one of two categories: Conversational or Botmail. If the email is classified as Botmail, a subclassifier identifies different subtypes, e.g., Purchases and Payments, Travel, Social, etc.
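
Put together, the control flow looks roughly like the sketch below; the three helpers are trivial stubs standing in for the real models:

```python
# Hypothetical sketch of the classification flow; the helpers below
# are trivial stubs, not our actual language detector or classifiers.
def detect_language(text: str) -> str:
    return "en"  # stub; the real module runs a language-ID model


def classify_type(text: str) -> str:
    # stub; the real classifier is a trained Deep Learning model
    return "botmail" if "unsubscribe" in text.lower() else "conversational"


def classify_botmail_subtype(text: str) -> str:
    return "social"  # stub; e.g. "purchases", "travel", "social"


def classify_email(parsed: dict):
    body = parsed["body"]
    if detect_language(body) != "en":
        return None  # only English is supported for now
    if classify_type(body) == "botmail":
        return ("botmail", classify_botmail_subtype(body))
    return ("conversational", None)
```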

Email Summarizer

Classified emails are then summarized following different methods. All types of emails require a specific NLP preprocessing analysis, from simple steps, such as tokenization and sentence splitting, to more elaborate ones, such as speech act classification or discourse relation extraction. Finally, NLP-preprocessed emails become the input of the third component, whose main role is to summarize them and extract important information such as tasks.
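
In outline, the preprocessing output feeds the summarizer as a single document object. The stubs below only hint at the shape of that hand-off; none of these function names are the real API:

```python
# Illustrative stubs for the preprocessing chain that feeds the
# summarizer; none of these function names are the real API.
def split_sentences(text: str) -> list:
    return [s.strip() for s in text.split(".") if s.strip()]  # naive stub


def classify_speech_acts(sentences: list) -> list:
    return ["statement"] * len(sentences)  # stub for the real classifier


def preprocess(text: str) -> dict:
    sentences = split_sentences(text)
    return {"sentences": sentences,
            "speech_acts": classify_speech_acts(sentences)}


def summarize(doc: dict) -> dict:
    # stub: the real summarizer weighs speech acts, discourse
    # relations and more before selecting content and extracting tasks
    summary = doc["sentences"][0] if doc["sentences"] else ""
    return {"summary": summary, "tasks": []}
```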

NLP Services

In the beginning, all the modules described above were integrated into one single script. Does that sound familiar? I bet you have been there too: when you’re prototyping ideas, you don’t care about performance, code readability or unit testing. It is easy to keep adding new stuff as you need it. The main goal is to prove that the concept works; you don’t care whether the implementation is elegant.

But then the concept was working, and we needed to connect that script, which analyzed one email at a time, with a server in charge of fetching a user’s inbox and analyzing thousands of emails quickly.

As a first version, we developed a REST service that loaded all our NLP modules into memory and kept listening for POST requests from the server. Still, it was a monolithic service with a lot of complex Machine Learning models loaded in memory, which made it heavy, slow and difficult to debug.
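
In spirit, that first service looked something like the Tornado app below; the analyze() stub stands in for the combined parser + classifier + summarizer call:

```python
# A minimal Tornado app in the spirit of that first service: every
# model loaded in one process, one POST endpoint.
import json

import tornado.ioloop
import tornado.web


def analyze(email: dict) -> dict:
    return {"summary": "..."}  # stub; the real call ran the whole pipeline


class AnalyzeHandler(tornado.web.RequestHandler):
    def post(self):
        email = json.loads(self.request.body)
        # Parser, classifier and summarizer all lived in this one
        # process, which is what made the monolith heavy and slow.
        self.write(analyze(email))


if __name__ == "__main__":
    app = tornado.web.Application([(r"/analyze", AnalyzeHandler)])
    app.listen(8888)
    tornado.ioloop.IOLoop.current().start()
```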

From Monolith to Microservices

Thanks Reddit for the best 2001 Monolith pic I could find!

Yes, we have been there too. Going from a monolithic application to microservices was not a fancy decision in our case. It was the only way we could serve thousands of requests per hour and ingest a new user account quickly. We needed to split the original service into small single-purpose services, one in charge of each of the three processes described above. The result is a pretty simple tech stack:

  • Tornado for serving our NLP REST apps: parser, classifier and summarizer services
  • Supervisor to manage the Tornado services (a sample config is sketched after this list)
  • Datadog for monitoring and analytics
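
Managing one of these Tornado services with Supervisor takes only a few lines of configuration. The program name, paths and port below are made up for illustration:

```
[program:courier-summarizer]
command=python /opt/courier/summarizer_service.py --port=8002
directory=/opt/courier
autostart=true
autorestart=true
stderr_logfile=/var/log/courier/summarizer.err.log
```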

This stack, combined with Courier’s distributed server and load balancer, has been powerful enough to digest millions of emails during the first week of our beta.

Courier’s first beta week stats

What have we learned?

We have learned a lot since our beta launch, and we are working hard to improve our NLP backend. Some of the most important insights for us:

Everything is joy and fun, until production

You will see how painful your code is when you need to deploy it to production, so the sooner you start avoiding spaghetti code the better. Follow basic coding conventions: we all ignore them, but then we all complain about not having them (PEP 8 if your code base is in Python). Also, DO a lot of unit testing. It’s almost impossible to have too much of it, and it will always help you avoid surprises later. If you’re doing text analysis, normalize the encoding early in your pipeline (almost all the Summarizer errors in the figure above were Unicode-related issues).
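
As a concrete example of the encoding point, here is the kind of normalization helper, and unit test, worth writing on day one. normalize_text is a hypothetical helper, not our actual code:

```python
# A tiny example of the normalization tests worth writing early;
# normalize_text is a hypothetical helper.
import unicodedata
import unittest


def normalize_text(text: str) -> str:
    # NFKC folds compatibility characters (e.g. ligatures, full-width
    # forms) into their canonical equivalents.
    return unicodedata.normalize("NFKC", text)


class TestNormalization(unittest.TestCase):
    def test_ligature_is_folded(self):
        self.assertEqual(normalize_text("ﬁle"), "file")

    def test_idempotent(self):
        s = normalize_text("café")
        self.assertEqual(normalize_text(s), s)


if __name__ == "__main__":
    unittest.main()
```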

NLP is expensive

Our complete in-house NLP preprocessing pipeline includes tokenization, sentence splitting, POS tagging, lemmatization, named entity recognition, speech act classification, question classification, emotion classification, discourse relation extraction, task identification, sarcasm detection and anaphora resolution. It is an expensive pipeline in terms of performance. Even though we have been able to optimize many of our own modules, the NLP preprocessing alone takes a considerable amount of the complete backend’s analysis time.

Divide and conquer

“Around 80% to 90% of all email in the world is botmail”

Related to the last point, our best strategy for improving the performance of our NLP preprocessing was to apply specific pipelines to different types of emails. On conversational email we do much more NLP analysis than on Botmail. Sometimes a generic approach is just not possible.
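
Conceptually, this is just a lookup from email type to a list of preprocessing steps. The sketch below is hypothetical, with trivial stubs in place of the real modules:

```python
# Hypothetical illustration of per-type pipelines: botmail skips the
# expensive analysis steps that conversational email needs.
def tokenize(doc):
    return doc  # stub for a cheap step


def extract_discourse_relations(doc):
    return doc  # stub for an expensive step


PIPELINES = {
    "conversational": [tokenize, extract_discourse_relations],  # deep analysis
    "botmail": [tokenize],  # much lighter pipeline
}


def run_pipeline(text: str, email_type: str):
    doc = text
    for step in PIPELINES[email_type]:
        doc = step(doc)
    return doc
```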

Serving Machine Learning models in production

Our ML models are deeply integrated into our NLP libraries. In our case, using REST services to serve those libraries has proven to work well, and the Tornado + Supervisor combo is a powerful solution. If you are working with Deep Learning models (or ML models in general) and need to serve them in isolation in production, TensorFlow Serving may be another interesting approach to consider. Here are some insights about it.

Monitoring

We just can’t live without Datadog. Before, we did a lot of automatic log analysis of our NLP services with our own tools, which was OK. But a tool such as Datadog allowed us to expand the performance overview of our NLP backend and gave us much richer insights. It is crucial to invest in such tools. Sure, there are amazing open-source alternatives, such as the Elastic Stack, but if you want something that works out of the box, with a minimal learning curve, and is extremely easy to set up, then Datadog is an outstanding solution.
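
Instrumenting a service takes only a few lines with the datadog Python client (DogStatsD). The metric names, tags and the summarize stub below are made up for illustration:

```python
# A hedged sketch with the datadog Python client (DogStatsD); the
# metric names, tags and summarize_request body are illustrative.
from datadog import initialize, statsd

initialize(statsd_host="localhost", statsd_port=8125)


@statsd.timed("courier.summarizer.latency", tags=["service:summarizer"])
def summarize_request(email: dict) -> dict:
    result = {"summary": "..."}  # stub for the real summarizer call
    statsd.increment("courier.summarizer.processed",
                     tags=["service:summarizer"])
    return result
```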

Conclusions

Our NLP backend at Courier has become a complex system with a lot of intricate connections between our NLP libraries, Deep Learning models and REST services. In this post we wanted to give you a brief overview of our backend and share some hard learned lessons.

We are proud of what we have built at Courier, and we are working hard to help users spend less time reading email. Give it a try!
