Codeq NLP API Tutorial

In this tutorial we showcase the most recent module of Codeq’s NLP API, which identifies abusive and harmful content in texts.

This is a follow-up tutorial; previous content can be found here:

The complete list of modules in the Codeq NLP API can be found here:

Define an NLP pipeline and analyze a text

As usual, the first step is to create an instance of the Codeq Client and declare a pipeline containing the NLP annotators we are interested in. The client receives as input the text to be analyzed along with the declared pipeline. The output is a Document object that contains a list of Sentences; each sentence carries its own predicted abuse labels.

To print a quick overview of the results, you can use the method document.pretty_print(), which we will explain in detail in the following sections.

FULL DISCLAIMER: we do not endorse the content of the examples used here.

from codeq_nlp_api import CodeqClient

# Create the client with your API credentials
client = CodeqClient(user_id="USER_ID", user_key="USER_KEY")

# Declare the annotators to run, in order
pipe = ["twitter_preprocess", "abuse"]

text = "Take you #BLM bullshit and shove it up your nasty ass"

document = client.analyze(text, pipeline=pipe)


In this tutorial we are going to focus on two annotators: the abuse classifier and a tool to preprocess tweets. Specifically, we will describe:

  • the keyword (KEY) used to call each annotator,

Abuse Classifier

The goal of this module is to automatically analyze texts that need to be reviewed for abusive and harmful language, mainly in the context of user-generated content or social communities.

More specific details about this annotator can be found here:

  • KEY: abuse

Output Labels

  • Offensive: sentences that contain profanity or that could be perceived as disrespectful.
pipe = ["abuse"]

text = "Oi freak why don't you shut up! You're just another sand nigger trying to destroy America. Take you BLM bullshit and shove it up your nasty ass. Go fuck yourself, mudslime!"

document = client.analyze(text, pipeline=pipe)

# Print the predicted labels next to each sentence
for sentence in document.sentences:
    raw_sentence = sentence.raw_sentence
    abuse = sentence.abuse
    print("\n%s - %s" % (abuse, raw_sentence))
# ['Offensive', 'Insult'] - Oi freak why do not you shut up!
# ['Offensive', 'Insult', 'Hate speech/racist'] - You're just another sand nigger trying to destroy America.
# ['Offensive', 'Obscene/scatologic'] - Take you BLM bullshit and shove it up your nasty ass.
# ['Offensive', 'Obscene/scatologic', 'Insult', 'Hate speech/racist'] - Go fuck yourself, mudslime!
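
Once each sentence carries its labels, a common next step is to aggregate them, for example to count how often each label occurs or to pull out the sentences that need review. The following is a minimal sketch of that idea, not part of the Codeq API: the helper names `count_labels` and `flag_sentences` are our own, and the `SimpleNamespace` objects are hypothetical stand-ins that only mimic the `raw_sentence` and `abuse` attributes used above. The label lists are copied from the example output; the sentence texts are replaced by placeholders.

```python
from collections import Counter
from types import SimpleNamespace

# Hypothetical stand-ins for the API's Sentence objects. Only the two
# attributes used in this tutorial are mimicked; the label lists are copied
# from the example output above, the texts are placeholders.
sentences = [
    SimpleNamespace(raw_sentence="sentence 1",
                    abuse=["Offensive", "Insult"]),
    SimpleNamespace(raw_sentence="sentence 2",
                    abuse=["Offensive", "Insult", "Hate speech/racist"]),
    SimpleNamespace(raw_sentence="sentence 3",
                    abuse=["Offensive", "Obscene/scatologic"]),
    SimpleNamespace(raw_sentence="sentence 4",
                    abuse=["Offensive", "Obscene/scatologic", "Insult",
                           "Hate speech/racist"]),
]


def count_labels(sentences):
    """Count how many sentences carry each abuse label."""
    counts = Counter()
    for sentence in sentences:
        counts.update(sentence.abuse)
    return counts


def flag_sentences(sentences, labels):
    """Return the text of every sentence that has at least one of `labels`."""
    wanted = set(labels)
    return [s.raw_sentence for s in sentences if wanted.intersection(s.abuse)]


label_counts = count_labels(sentences)
hate_speech = flag_sentences(sentences, ["Hate speech/racist"])

print(label_counts)
print(hate_speech)
```

With a real Document, the same helpers could be called on `document.sentences` directly, assuming `sentence.abuse` is a list of label strings as in the output shown above.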

Twitter Preprocessor

This annotator can be used to preprocess and clean tweets, e.g., to remove artifacts like user mentions and URLs, segment hashtags into tokens and generate a list of clean words from a tweet. The output of this annotator is also stored at the Sentence level.

This module can improve the output of the abuse classifier, since some hashtags may contain important information for the detection of abusive content.

  • KEY: twitter_preprocess
pipe = ["twitter_preprocess", "abuse"]

text = "Being white is actually good. Unfortunately not everyone can try this pleasure #WhiteLivesMatter #whitepride"

document = client.analyze(text, pipeline=pipe)

# Compare the raw tokens with the cleaned tokens of each sentence
for sentence in document.sentences:
    print("\ntokens: %s" % sentence.tokens)
    print("tokens_clean: %s" % sentence.tokens_clean)
# tokens: ['Being', 'white', 'is', 'actually', 'good', '.']
# tokens_clean: ['Being', 'white', 'is', 'actually', 'good', '.']
# tokens: ['Unfortunately', 'not', 'everyone', 'can', 'try', 'this', 'pleasure', '#WhiteLivesMatter', '#whitepride']
# tokens_clean: ['Unfortunately', 'not', 'everyone', 'can', 'try', 'this', 'pleasure', 'white', 'lives', 'matter', 'white', 'pride']

Here we can see that the tags #WhiteLivesMatter and #whitepride are tokenized as “white lives matter” and “white pride” respectively.
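
To check which words the hashtags contributed, one option is to compare the two token lists. The sketch below is only an illustration and says nothing about how the Codeq segmenter works internally: the helpers `hashtags_in` and `words_added_by_cleaning` are our own names, and the token lists are copied from the second sentence of the output above.

```python
from collections import Counter

# Token lists copied from the second sentence of the example output.
tokens = ['Unfortunately', 'not', 'everyone', 'can', 'try', 'this',
          'pleasure', '#WhiteLivesMatter', '#whitepride']
tokens_clean = ['Unfortunately', 'not', 'everyone', 'can', 'try', 'this',
                'pleasure', 'white', 'lives', 'matter', 'white', 'pride']


def hashtags_in(tokens):
    """Return the raw tokens that are hashtags."""
    return [t for t in tokens if t.startswith("#")]


def words_added_by_cleaning(tokens, tokens_clean):
    """Return the clean tokens that have no matching non-hashtag raw token.

    Each non-hashtag raw token cancels out one identical clean token, so
    whatever is left over was produced by the preprocessing step.
    """
    remaining = Counter(t for t in tokens if not t.startswith("#"))
    added = []
    for word in tokens_clean:
        if remaining[word] > 0:
            remaining[word] -= 1
        else:
            added.append(word)
    return added


hashtags = hashtags_in(tokens)
added = words_added_by_cleaning(tokens, tokens_clean)

print(hashtags)
print(added)
```

The comparison is purely list-based, so it can be applied to `sentence.tokens` and `sentence.tokens_clean` of any analyzed sentence.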

Wrap up

In this tutorial we described the Abuse Classifier of the Codeq NLP API and a tool to clean tweets. The code below summarizes how to call the annotators explained here and access their output.
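
A minimal sketch of such a summary, using only the attributes shown in this tutorial, could look as follows. The helper `summarize` and the constant `PIPELINE` are our own names, and the stand-in document holds placeholder values rather than real API output, so that the snippet runs without credentials.

```python
from types import SimpleNamespace

# Annotator keys described in this tutorial.
PIPELINE = ["twitter_preprocess", "abuse"]


def summarize(document):
    """Collect the per-sentence output of both annotators as plain dicts."""
    return [
        {
            "text": sentence.raw_sentence,
            "abuse": list(sentence.abuse),
            "tokens_clean": list(sentence.tokens_clean),
        }
        for sentence in document.sentences
    ]


# With real credentials the document would come from the API:
#   from codeq_nlp_api import CodeqClient
#   client = CodeqClient(user_id="USER_ID", user_key="USER_KEY")
#   document = client.analyze(text, pipeline=PIPELINE)
#
# Hypothetical stand-in with placeholder values (not real API output).
document = SimpleNamespace(sentences=[
    SimpleNamespace(
        raw_sentence="Example tweet #SomeTag",
        abuse=[],
        tokens_clean=["Example", "tweet", "some", "tag"],
    ),
])

summary = summarize(document)
for row in summary:
    print(row)
```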

  • Take a look at our documentation to learn more about the NLP tools we provide.

Senior Computational Linguist at Codeq