Codeq NLP API Tutorial

Part 2. Linguistic features

Rodrigo Alarcón

Nov 4, 2020

We started this series of posts to explain how you can call our NLP API to analyze your own documents.

The first tutorial covered how to get started and send requests using our Python SDK.
In this second tutorial we will show you different modules that can be applied to analyze texts at a basic linguistic level.

The complete list of modules we offer can be found in our documentation:

Codeq NLP API Documentation

The first thing you need to do before you start using Codeq's NLP API is to sign up to generate a User ID and User Key…

api.codeq.com

Define a Pipeline

First, we will import the Codeq Client and define a pipeline containing different linguistic annotators.

from codeq_nlp_api import CodeqClient

client = CodeqClient(user_id="USER_ID", user_key="USER_KEY")

pipe = [
    "language", "tokenize", "ssplit",
    "stopword",
    "pos", "lemma", "stem",
    "parse", "chunk"
]

Analyze a text

We can now call the client, analyze a text, and print a quick overview of the analyzed content:

text = "This model is an expensive alternative with useless battery life."

document = client.analyze(text, pipeline=pipe)
print(document.pretty_print())

The NLP API uses two main objects to store the output:

A Document object stores global information about the text, such as the language, summary or key phrases, i.e., content that is not associated with specific sentences.
A list of Sentence objects is encapsulated within the Document; each one stores information specific to that sentence, such as tokens or part-of-speech tags.

Let’s go through each of the linguistic annotators of the pipeline and see the specific output that they produce.

Each annotator is called by a keyword (KEY), which we defined in the pipe variable above.
For each annotator, the output is stored in specific attributes (ATTR) at the Document or Sentence levels.

Language Identification

This module identifies the language in which a piece of text is written. Currently, we support the identification of 50 languages.

KEY: language
ATTR: document.language
ATTR: document.language_probability

print(document.language)
print(document.language_probability)

# OUTPUT:
# English
# 0.9999996423721313
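
Since the probability is returned next to the language, it can be used to decide whether a detection is trustworthy enough to act on. The sketch below is only an illustration: the helper is_confident_language and the 0.9 threshold are assumptions made for this example and are not part of the API.

```python
def is_confident_language(language, probability, expected="English", threshold=0.9):
    """Return True when the detected language matches `expected`
    and its probability reaches `threshold` (an arbitrary cut-off)."""
    return language == expected and probability >= threshold


# Values copied from the output above.
detected_language = "English"
detected_probability = 0.9999996423721313

print(is_confident_language(detected_language, detected_probability))
```

In practice the two values would come from document.language and document.language_probability.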

Tokenization

This module generates a segmentation of a text into words and punctuation marks. Tokens will be stored at both Document and Sentence levels.

KEY: tokenize
ATTR: document.tokens
ATTR: sentence.tokens

print(document.tokens)

# OUTPUT:
# ['This', 'model', 'is', 'an', 'expensive', 'alternative', 'with', 'useless', 'battery', 'life', '.']

Sentence Segmentation

This module generates a list of sentences from a raw text. By default, each sentence contains its position, raw_sentence and tokens.

KEY: ssplit
ATTR: document.sentences
ATTR: sentence.position
ATTR: sentence.raw_sentence
ATTR: sentence.tokens

for sentence in document.sentences:
    print(sentence.position)
    print(sentence.raw_sentence)
    print(sentence.tokens)

# OUTPUT:
# 0
# This model is an expensive alternative with useless battery life.
# ['This', 'model', 'is', 'an', 'expensive', 'alternative', 'with', 'useless', 'battery', 'life', '.']

Stopword Removal

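This annotator removes common stop words from the text. The filtered tokens are stored in a separate attribute, so sentence.tokens keeps the full list; in the example below, the final period is dropped as well.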

KEY: stopword
ATTR: sentence.tokens_filtered

for sentence in document.sentences:
    print(sentence.tokens_filtered)

# OUTPUT:
# ['model', 'expensive', 'alternative', 'useless', 'battery', 'life']
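
To see which tokens the annotator dropped, the two lists can be compared directly. This is a minimal sketch that works on the lists printed above; the helper removed_tokens is hypothetical, and it assumes the filtered list preserves the original token order.

```python
# Lists copied from the outputs above.
tokens = ['This', 'model', 'is', 'an', 'expensive', 'alternative',
          'with', 'useless', 'battery', 'life', '.']
tokens_filtered = ['model', 'expensive', 'alternative', 'useless', 'battery', 'life']


def removed_tokens(tokens, tokens_filtered):
    """Return the tokens missing from tokens_filtered, in their original order.

    Walks both lists in parallel, assuming tokens_filtered is an
    order-preserving subsequence of tokens.
    """
    removed = []
    kept = iter(tokens_filtered)
    next_kept = next(kept, None)
    for token in tokens:
        if token == next_kept:
            next_kept = next(kept, None)  # token survived the filter
        else:
            removed.append(token)         # token was filtered out
    return removed


print(removed_tokens(tokens, tokens_filtered))
```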

Part of Speech Tagging

This module identifies the category of each word from a grammatical or syntactic standpoint (e.g., verbs, nouns, prepositions). That category is referred to as a part of speech (POS).

The output of our POS tagger corresponds to the list of tags used in the Penn Treebank Project.

There are two attributes to access the POS tags: the first one contains only a list of tags, one for each token; the second one is a string with both tokens and POS tags in the format “token/POS”.

KEY: pos
ATTR: sentence.pos_tags
ATTR: sentence.tagged_sentence

for sentence in document.sentences:
    print(sentence.tokens)
    print(sentence.pos_tags)
    print(sentence.tagged_sentence)

# OUTPUT:
# ['This', 'model', 'is', 'an', 'expensive', 'alternative', 'with', 'useless', 'battery', 'life', '.']
# ['DT', 'NN', 'VBZ', 'DT', 'JJ', 'JJ', 'IN', 'JJ', 'NN', 'NN', '.']
# This/DT model/NN is/VBZ an/DT expensive/JJ alternative/JJ with/IN useless/JJ battery/NN life/NN ./.
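
The “token/POS” string can be turned back into pairs when that is more convenient to work with. The sketch below uses the string printed above; parse_tagged is a hypothetical helper, and it assumes that tokens contain no whitespace. Splitting on the last slash keeps items such as “./.” intact.

```python
# String copied from the output above.
tagged_sentence = ("This/DT model/NN is/VBZ an/DT expensive/JJ alternative/JJ "
                   "with/IN useless/JJ battery/NN life/NN ./.")


def parse_tagged(tagged):
    """Split a 'token/POS' string into a list of (token, tag) pairs."""
    pairs = []
    for item in tagged.split():
        # rsplit on the last "/" so a token that contains "/" is not broken up
        token, tag = item.rsplit("/", 1)
        pairs.append((token, tag))
    return pairs


pairs = parse_tagged(tagged_sentence)

# Penn Treebank noun tags all start with "NN" (NN, NNS, NNP, NNPS).
nouns = [token for token, tag in pairs if tag.startswith("NN")]
print(nouns)
```

The same pairs could also be built with zip(sentence.tokens, sentence.pos_tags).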

Lemmatization

This module identifies the appropriate lemma or canonical form of the words in a sentence. For example, conjugated verbs will be displayed in their infinitive form (are and is become be).

KEY: lemma
ATTR: sentence.lemmas

for sentence in document.sentences:
    print(sentence.lemmas)

# OUTPUT:
# ['this', 'model', 'be', 'a', 'expensive', 'alternative', 'with', 'useless', 'battery', 'life', '.']

Stemming

The goal of this annotator is to reduce inflectional forms of a word to a common base form. For example, the words connection, connections, or connecting should be converted to the base form connect.

KEY: stem
ATTR: sentence.stems

for sentence in document.sentences:
    print(sentence.stems)

# OUTPUT:
# ['this', 'model', 'is', 'an', 'expens', 'altern', 'with', 'useless', 'batteri', 'life', '.']
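
Lemmas and stems often coincide, so it can help to look only at the tokens where they differ. The following sketch lines up the three lists printed above; it uses no API calls and the variable names are just for illustration.

```python
# Lists copied from the outputs above.
tokens = ['This', 'model', 'is', 'an', 'expensive', 'alternative',
          'with', 'useless', 'battery', 'life', '.']
lemmas = ['this', 'model', 'be', 'a', 'expensive', 'alternative',
          'with', 'useless', 'battery', 'life', '.']
stems = ['this', 'model', 'is', 'an', 'expens', 'altern',
         'with', 'useless', 'batteri', 'life', '.']

# Keep only the positions where the lemma and the stem disagree.
differences = [
    (token, lemma, stem)
    for token, lemma, stem in zip(tokens, lemmas, stems)
    if lemma != stem
]

for token, lemma, stem in differences:
    print(f"{token:<12} lemma={lemma:<12} stem={stem}")
```

In this sentence the lemmas are dictionary words (be, a), whereas stems such as expens or batteri are truncated forms that need not be valid words.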

Dependency Parser

This annotator analyzes every word in a sentence and assigns it to a head node (another word in the same sentence) with a syntactic relation. The labels given to each relation of head and dependent reflect syntactic relations that exist between those words; for example, if a verb has a noun phrase as a subject, the relation of the verb to the head noun of its subject is given the label nsubj.

For all but one word in the sentence, its head will be another word in the same sentence; the exception has the artificial “root” node as its head.

Dependencies are stored in 3-tuples consisting of: head, dependent and relation. Head and dependent are in the format “token@@@position”. Positions are 1-indexed, with 0 being the index for the root.

The output of our Dependency Parser uses the labels for grammatical relations from the Stanford Dependencies.

KEY: parse
ATTR: sentence.dependencies

for sentence in document.sentences:
    print("dependencies:")
    for dependency in sentence.dependencies:
        print("- %s" % dependency)

# OUTPUT:
# dependencies:
# - ['model@@@2', 'This@@@1', 'det']
# - ['is@@@3', 'model@@@2', 'nsubj']
# - ['is@@@3', 'alternative@@@6', 'xcomp']
# - ['is@@@3', '.@@@11', 'punct']
# - ['alternative@@@6', 'expensive@@@5', 'amod']
# - ['alternative@@@6', 'an@@@4', 'det']
# - ['alternative@@@6', 'with@@@7', 'prep']
# - ['life@@@10', 'battery@@@9', 'nn']
# - ['life@@@10', 'useless@@@8', 'amod']
# - ['with@@@7', 'life@@@10', 'pobj']
# - ['root@@@0', 'is@@@3', 'root']
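
The “token@@@position” strings are easy to split when you need to navigate the tree. Below is a minimal sketch that works on the dependencies printed above; split_node and head_of are hypothetical helpers written for this example, not functions of the SDK.

```python
# Dependencies copied from the output above: [head, dependent, relation].
dependencies = [
    ['model@@@2', 'This@@@1', 'det'],
    ['is@@@3', 'model@@@2', 'nsubj'],
    ['is@@@3', 'alternative@@@6', 'xcomp'],
    ['is@@@3', '.@@@11', 'punct'],
    ['alternative@@@6', 'expensive@@@5', 'amod'],
    ['alternative@@@6', 'an@@@4', 'det'],
    ['alternative@@@6', 'with@@@7', 'prep'],
    ['life@@@10', 'battery@@@9', 'nn'],
    ['life@@@10', 'useless@@@8', 'amod'],
    ['with@@@7', 'life@@@10', 'pobj'],
    ['root@@@0', 'is@@@3', 'root'],
]


def split_node(node):
    """Split 'token@@@position' into (token, position as int)."""
    token, position = node.rsplit("@@@", 1)
    return token, int(position)


def head_of(dependencies):
    """Map each dependent position to (head position, relation)."""
    heads = {}
    for head, dependent, relation in dependencies:
        _, head_position = split_node(head)
        _, dependent_position = split_node(dependent)
        heads[dependent_position] = (head_position, relation)
    return heads


heads = head_of(dependencies)

# The word attached to the artificial root node (position 0).
root_position = next(pos for pos, (head, _) in heads.items() if head == 0)

# Tokens that act as the subject of some verb.
subjects = [split_node(dep)[0] for _, dep, rel in dependencies if rel == "nsubj"]

print(root_position, subjects)
```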

Chunker

Finally, the chunker annotator groups some of the words of a sentence into relatively small, non-overlapping groups that represent grammatically and/or semantically significant components. Common chunks include, for example, an adjective grouped with the noun it modifies, e.g., “tall person”, or an auxiliary verb grouped with its main verb, e.g., “will leave”.

The output of our Chunker uses the chunk type labels from the CoNLL-2000 shared task.

KEY: chunk
ATTR: sentence.chunks
ATTR: sentence.chunk_tuples

for sentence in document.sentences:
    print("chunks:")
    for chunk in sentence.chunks:
        print("- %s" % chunk)

# OUTPUT:
# chunks:
# - [NP This model]
# - [VP is]
# - an
# - [ADJP expensive alternative]
# - [PP with]
# - [NP useless battery life]
# - .

The attribute sentence.chunks represents the groups of words and their labels in a readable format, e.g., “[NP This model]”. A second attribute, sentence.chunk_tuples, stores each chunk as a tuple of chunk label, tokens and token positions. In the output below, positions are 0-based and tokens that fall outside any chunk (“an” and the final period) are omitted:

for sentence in document.sentences:
    print("chunk_tuples:")
    for chunk_tuple in sentence.chunk_tuples:
        print("- %s" % chunk_tuple)

# OUTPUT:
# chunk_tuples:
# - ['NP', 'This model', [0, 1]]
# - ['VP', 'is', [2]]
# - ['ADJP', 'expensive alternative', [4, 5]]
# - ['PP', 'with', [6]]
# - ['NP', 'useless battery life', [7, 8, 9]]
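
Because the tuples carry both a label and token positions, they lend themselves to simple filtering. The sketch below uses the values printed above to collect the noun phrases and to find the tokens that no chunk covers; it assumes the positions are indices into sentence.tokens, which is consistent with this output.

```python
# Values copied from the outputs above.
tokens = ['This', 'model', 'is', 'an', 'expensive', 'alternative',
          'with', 'useless', 'battery', 'life', '.']
chunk_tuples = [
    ['NP', 'This model', [0, 1]],
    ['VP', 'is', [2]],
    ['ADJP', 'expensive alternative', [4, 5]],
    ['PP', 'with', [6]],
    ['NP', 'useless battery life', [7, 8, 9]],
]

# Text of every noun-phrase chunk.
noun_phrases = [text for label, text, _ in chunk_tuples if label == "NP"]

# Token positions that belong to some chunk.
covered = {position for _, _, positions in chunk_tuples for position in positions}

# Tokens left outside every chunk, with their positions.
unchunked = [(i, token) for i, token in enumerate(tokens) if i not in covered]

print(noun_phrases)
print(unchunked)
```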

Wrap up

In this post we showed some of the linguistic annotators of our NLP API, the pipeline keys you can use to call them, and the attributes where their output is stored. The following code snippet summarizes the workflow explained above:
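
As a rough sketch of that workflow, the helper below gathers the attributes covered in this tutorial into plain dictionaries. The function collect_annotations is hypothetical, and a stand-in object with values copied from the outputs above replaces the real Document so that the snippet runs without credentials; the actual client calls are shown as comments.

```python
from types import SimpleNamespace

# Sentence-level attributes described in this tutorial.
SENTENCE_ATTRS = ["position", "raw_sentence", "tokens", "tokens_filtered",
                  "pos_tags", "lemmas", "stems", "dependencies", "chunks"]


def collect_annotations(document):
    """Collect the tutorial's attributes into plain dicts.

    Attributes that are absent (e.g. the annotator was not in the
    pipeline) are returned as None.
    """
    return {
        "language": getattr(document, "language", None),
        "sentences": [
            {name: getattr(sentence, name, None) for name in SENTENCE_ATTRS}
            for sentence in getattr(document, "sentences", [])
        ],
    }


# With the real client (requires a User ID and User Key, so not run here):
#   from codeq_nlp_api import CodeqClient
#   client = CodeqClient(user_id="USER_ID", user_key="USER_KEY")
#   document = client.analyze(text, pipeline=pipe)

# Stand-in for the Document, filled with values from the outputs above.
document = SimpleNamespace(
    language="English",
    sentences=[
        SimpleNamespace(
            position=0,
            raw_sentence="This model is an expensive alternative with useless battery life.",
            tokens_filtered=['model', 'expensive', 'alternative',
                             'useless', 'battery', 'life'],
        )
    ],
)

summary = collect_annotations(document)
print(summary["language"])
print(summary["sentences"][0]["tokens_filtered"])
```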

Take a look at our documentation to learn more about the NLP tools we provide.
Do you need inspiration? Go to our use case demos and see how you can integrate different tools.
In our NLP demos section you can also try our tools and find examples of the output of each module.