Codeq NLP API Tutorial

Part 2. Linguistic features

We started this series of posts to explain how you can call our NLP API to analyze your own documents.

The complete list of modules we offer can be found in our documentation.

Define a Pipeline

First, we will import the Codeq Client and define a pipeline containing different linguistic annotators.

from codeq_nlp_api import CodeqClient

client = CodeqClient(user_id="USER_ID", user_key="USER_KEY")

pipe = [
"language", "tokenize", "ssplit",
"stopword",
"pos", "lemma", "stem",
"parse", "chunk"
]
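
Because annotators are selected by plain strings, a typo in a key is easy to miss. The following is a minimal sketch of a local sanity check, assuming only the nine keys used in this post; the API offers more modules, and `KNOWN_KEYS` and `unknown_keys` are hypothetical names that are not part of the client.

```python
# Keys used in this post only; the full list lives in the API documentation.
KNOWN_KEYS = {
    "language", "tokenize", "ssplit", "stopword",
    "pos", "lemma", "stem", "parse", "chunk",
}


def unknown_keys(pipeline):
    """Return the pipeline entries that are not in KNOWN_KEYS, in order."""
    return [key for key in pipeline if key not in KNOWN_KEYS]


pipe = [
    "language", "tokenize", "ssplit",
    "stopword",
    "pos", "lemma", "stem",
    "parse", "chunk",
]
print(unknown_keys(pipe))                    # []
print(unknown_keys(["tokenize", "lemmas"]))  # ['lemmas']
```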

Analyze a text

We can now call the client, analyze a text, and print a quick overview of the analyzed content:

text = "This model is an expensive alternative with useless battery life."

document = client.analyze(text, pipeline=pipe)
print(document.pretty_print())

The NLP API uses two main objects to store the output:

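  • A Document object is used to store global information about the text, for example the language, summary or key phrases, i.e., content that is not associated with specific sentences.

  • A list of Sentence objects, available as document.sentences, stores the output tied to each sentence, for example its tokens, POS tags or dependencies.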

Let’s go through each of the linguistic annotators of the pipeline and see the specific output that they produce.

  • Each annotator is called by a keyword (KEY), which we defined in the pipe variable above.

Language Identification

This module identifies the language in which a piece of text is written. Currently, we support the identification of 50 languages.

  • KEY: language
print(document.language)
print(document.language_probability)
# OUTPUT:
# English
# 0.9999996423721313
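
The probability can be used to decide whether a text should be processed further. A minimal sketch, assuming an arbitrary threshold of 0.9 and a hypothetical helper named `accept_language`; neither comes from the API:

```python
def accept_language(language, probability, expected="English", threshold=0.9):
    """Return True when the detected language matches and is confident enough.

    The threshold is an illustrative assumption, not a recommended value.
    """
    return language == expected and probability >= threshold


# First call uses the values copied from the output above.
print(accept_language("English", 0.9999996423721313))  # True
print(accept_language("English", 0.55))                # False
```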

Tokenization

This module generates a segmentation of a text into words and punctuation marks. Tokens will be stored at both Document and Sentence levels.

  • KEY: tokenize
print(document.tokens)
# OUTPUT:
# ['This', 'model', 'is', 'an', 'expensive', 'alternative', 'with', 'useless', 'battery', 'life', '.']

Sentence Segmentation

This module generates a list of sentences from a raw text. Each sentence will contain by default its position, raw_sentence and tokens.

  • KEY: ssplit
for sentence in document.sentences:
    print(sentence.position)
    print(sentence.raw_sentence)
    print(sentence.tokens)
# OUTPUT:
# 0
# This model is an expensive alternative with useless battery life.
# ['This', 'model', 'is', 'an', 'expensive', 'alternative', 'with', 'useless', 'battery', 'life', '.']

Stopword Removal

This annotator removes common stop words from each sentence and stores the remaining tokens in the attribute tokens_filtered.

  • KEY: stopword
for sentence in document.sentences:
    print(sentence.tokens_filtered)
# OUTPUT:
# ['model', 'expensive', 'alternative', 'useless', 'battery', 'life']
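
To see what the annotator dropped, the two token lists can be compared. This is a small sketch that works on the lists printed above; `removed_tokens` is a hypothetical helper, not part of the client.

```python
from collections import Counter

tokens = ['This', 'model', 'is', 'an', 'expensive', 'alternative',
          'with', 'useless', 'battery', 'life', '.']
tokens_filtered = ['model', 'expensive', 'alternative',
                   'useless', 'battery', 'life']


def removed_tokens(tokens, tokens_filtered):
    """Return the tokens that are missing from tokens_filtered, in order.

    A Counter keeps track of repeated tokens, so a word that appears twice
    but is kept only once is still reported as removed once.
    """
    remaining = Counter(tokens_filtered)
    removed = []
    for token in tokens:
        if remaining[token] > 0:
            remaining[token] -= 1
        else:
            removed.append(token)
    return removed


print(removed_tokens(tokens, tokens_filtered))
# ['This', 'is', 'an', 'with', '.']
```

In this example the final period is dropped along with the stop words.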

Part of Speech Tagging

This module identifies the category of each word from a grammatical or syntactic standpoint (e.g., verbs, nouns, prepositions, etc.). That category is referred to as a part of speech (POS).

The output of our POS tagger corresponds to the list of tags used in the Penn Treebank Project.

There are two attributes to access the POS tags: pos_tags contains a list with one tag per token, and tagged_sentence is a string with both tokens and POS tags in the format “token/POS”.

  • KEY: pos
for sentence in document.sentences:
    print(sentence.tokens)
    print(sentence.pos_tags)
    print(sentence.tagged_sentence)
# OUTPUT:
# ['This', 'model', 'is', 'an', 'expensive', 'alternative', 'with', 'useless', 'battery', 'life', '.']
# ['DT', 'NN', 'VBZ', 'DT', 'JJ', 'JJ', 'IN', 'JJ', 'NN', 'NN', '.']
# This/DT model/NN is/VBZ an/DT expensive/JJ alternative/JJ with/IN useless/JJ battery/NN life/NN ./.
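
The “token/POS” string can be split back into pairs when it is the only thing at hand. A minimal sketch, assuming tokens never contain whitespace; `parse_tagged` is a hypothetical helper:

```python
tagged_sentence = ("This/DT model/NN is/VBZ an/DT expensive/JJ alternative/JJ "
                   "with/IN useless/JJ battery/NN life/NN ./.")


def parse_tagged(tagged):
    """Split a 'token/POS' string into (token, tag) pairs."""
    # Split on the LAST slash so that tokens containing '/' stay intact.
    return [tuple(item.rsplit("/", 1)) for item in tagged.split()]


pairs = parse_tagged(tagged_sentence)
# Penn Treebank noun tags all start with "NN" (NN, NNS, NNP, NNPS).
nouns = [token for token, tag in pairs if tag.startswith("NN")]

print(pairs[:3])  # [('This', 'DT'), ('model', 'NN'), ('is', 'VBZ')]
print(nouns)      # ['model', 'battery', 'life']
```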

Lemmatization

This module identifies the appropriate lemma or canonical form of the words in a sentence. For example, conjugated verbs will be displayed in their infinitive form (are and is become be).

  • KEY: lemma
for sentence in document.sentences:
    print(sentence.lemmas)
# OUTPUT:
# ['this', 'model', 'be', 'a', 'expensive', 'alternative', 'with', 'useless', 'battery', 'life', '.']

Stemming

The goal of this annotator is to reduce inflectional (and some derivational) forms of a word to a common base form. For example, the words connection, connections and connecting should all be converted to the base form connect.

  • KEY: stem
for sentence in document.sentences:
    print(sentence.stems)
# OUTPUT:
# ['this', 'model', 'is', 'an', 'expens', 'altern', 'with', 'useless', 'batteri', 'life', '.']
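
Lemmas and stems often differ, because a lemma is a real word while a stem is only a truncated base. The sketch below lines up the two outputs printed in this post to show where they diverge; the lists are copied from the outputs above.

```python
tokens = ['This', 'model', 'is', 'an', 'expensive', 'alternative',
          'with', 'useless', 'battery', 'life', '.']
lemmas = ['this', 'model', 'be', 'a', 'expensive', 'alternative',
          'with', 'useless', 'battery', 'life', '.']
stems = ['this', 'model', 'is', 'an', 'expens', 'altern',
         'with', 'useless', 'batteri', 'life', '.']

# Keep only the positions where the lemma and the stem disagree.
differences = [
    (token, lemma, stem)
    for token, lemma, stem in zip(tokens, lemmas, stems)
    if lemma != stem
]

for token, lemma, stem in differences:
    print(f"{token:<12} lemma={lemma:<12} stem={stem}")
```

For this sentence, five tokens differ: is, an, expensive, alternative and battery.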

Dependency Parser

This annotator analyzes every word in a sentence and assigns it to a head node (another word in the same sentence) with a syntactic relation. The labels given to each relation of head and dependent reflect syntactic relations that exist between those words; for example, if a verb has a noun phrase as a subject, the relation of the verb to the head noun of its subject is given the label nsubj.

For all but one word in the sentence, its head will be another word in the same sentence; the exception has the artificial “root” node as its head.

Dependencies are stored in 3-tuples consisting of: head, dependent and relation. Head and dependent are in the format “token@@@position”. Positions are 1-indexed, with 0 being the index for the root.

The output of our Dependency Parser uses the labels for grammatical relations from the Stanford Dependencies.

  • KEY: parse
for sentence in document.sentences:
    print("dependencies:")
    for dependency in sentence.dependencies:
        print("- %s" % dependency)
# OUTPUT:
# dependencies:
# - ['model@@@2', 'This@@@1', 'det']
# - ['is@@@3', 'model@@@2', 'nsubj']
# - ['is@@@3', 'alternative@@@6', 'xcomp']
# - ['is@@@3', '.@@@11', 'punct']
# - ['alternative@@@6', 'expensive@@@5', 'amod']
# - ['alternative@@@6', 'an@@@4', 'det']
# - ['alternative@@@6', 'with@@@7', 'prep']
# - ['life@@@10', 'battery@@@9', 'nn']
# - ['life@@@10', 'useless@@@8', 'amod']
# - ['with@@@7', 'life@@@10', 'pobj']
# - ['root@@@0', 'is@@@3', 'root']
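
The “token@@@position” strings are easy to split back into a token and an integer position. A minimal sketch that rebuilds a head-to-dependents map from the output above; `split_node` and `children` are hypothetical names:

```python
from collections import defaultdict

# Dependencies copied from the output above: [head, dependent, relation].
dependencies = [
    ['model@@@2', 'This@@@1', 'det'],
    ['is@@@3', 'model@@@2', 'nsubj'],
    ['is@@@3', 'alternative@@@6', 'xcomp'],
    ['is@@@3', '.@@@11', 'punct'],
    ['alternative@@@6', 'expensive@@@5', 'amod'],
    ['alternative@@@6', 'an@@@4', 'det'],
    ['alternative@@@6', 'with@@@7', 'prep'],
    ['life@@@10', 'battery@@@9', 'nn'],
    ['life@@@10', 'useless@@@8', 'amod'],
    ['with@@@7', 'life@@@10', 'pobj'],
    ['root@@@0', 'is@@@3', 'root'],
]


def split_node(node):
    """Split 'token@@@position' into (token, position)."""
    token, position = node.rsplit("@@@", 1)
    return token, int(position)


children = defaultdict(list)  # head position -> [(position, token, relation)]
root = None
for head, dependent, relation in dependencies:
    _, head_pos = split_node(head)
    dep_token, dep_pos = split_node(dependent)
    if head_pos == 0:  # position 0 is the artificial root node
        root = dep_token
    children[head_pos].append((dep_pos, dep_token, relation))

print(root)                 # is
print(sorted(children[3]))
# [(2, 'model', 'nsubj'), (6, 'alternative', 'xcomp'), (11, '.', 'punct')]
```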

Chunker

Finally, the chunker annotator groups some of the words of a sentence into relatively small, non-overlapping groups that represent grammatically and/or semantically significant components. Common chunks include, for example, an adjective grouped with its noun, e.g., “tall person”, or an auxiliary verb grouped with its main verb, e.g., “will leave”.

The output of our Chunker uses the labels for chunk types from CoNLL-2000.

  • KEY: chunk
for sentence in document.sentences:
    print("chunks:")
    for chunk in sentence.chunks:
        print("- %s" % chunk)
# OUTPUT:
# chunks:
# - [NP This model]
# - [VP is]
# - an
# - [ADJP expensive alternative]
# - [PP with]
# - [NP useless battery life]
# - .

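The attribute sentence.chunks represents the groups of words and their labels in a convenient string format, e.g., “[NP This model]”. A second attribute, sentence.chunk_tuples, stores the chunks as tuples indicating the chunk label, the tokens and the positions of the tokens. As the output shows, these positions are 0-indexed (unlike the 1-indexed positions of the dependency parser), and tokens outside any chunk, such as “an” and the final period, are not included: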

for sentence in document.sentences:
    print("chunk_tuples:")
    for chunk_tuple in sentence.chunk_tuples:
        print("- %s" % chunk_tuple)
# OUTPUT:
# chunk_tuples:
# - ['NP', 'This model', [0, 1]]
# - ['VP', 'is', [2]]
# - ['ADJP', 'expensive alternative', [4, 5]]
# - ['PP', 'with', [6]]
# - ['NP', 'useless battery life', [7, 8, 9]]
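
The tuple form is convenient for filtering by label or for finding the tokens that no chunk covers. A short sketch over the output above, with the data copied from this post:

```python
tokens = ['This', 'model', 'is', 'an', 'expensive', 'alternative',
          'with', 'useless', 'battery', 'life', '.']
chunk_tuples = [
    ['NP', 'This model', [0, 1]],
    ['VP', 'is', [2]],
    ['ADJP', 'expensive alternative', [4, 5]],
    ['PP', 'with', [6]],
    ['NP', 'useless battery life', [7, 8, 9]],
]

# Text of every noun phrase chunk.
noun_phrases = [text for label, text, _ in chunk_tuples if label == "NP"]

# Token positions that belong to some chunk (0-indexed, as in the output).
covered = {i for _, _, positions in chunk_tuples for i in positions}
unchunked = [(i, token) for i, token in enumerate(tokens) if i not in covered]

print(noun_phrases)  # ['This model', 'useless battery life']
print(unchunked)     # [(3, 'an'), (10, '.')]
```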

Wrap up

In this post we wanted to show some of the linguistic annotators of our NLP API, the pipeline keys you can use to call them, as well as the attributes where their output is stored. The following code snippet summarizes the workflow explained above:
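
As a compact recap, here is a sketch that gathers the attributes discussed in this post into plain dictionaries. It relies only on attribute names shown above; `collect_features` is a hypothetical helper, and the stand-in document built with `SimpleNamespace` is there only so the sketch runs without credentials.

```python
from types import SimpleNamespace


def collect_features(document):
    """Gather the attributes covered in this post into plain dicts."""
    return {
        "language": document.language,
        "tokens": document.tokens,
        "sentences": [
            {
                "raw_sentence": s.raw_sentence,
                "tokens_filtered": s.tokens_filtered,
                "pos_tags": s.pos_tags,
                "lemmas": s.lemmas,
                "stems": s.stems,
            }
            for s in document.sentences
        ],
    }


# With the real client (requires valid credentials), as shown earlier:
#   from codeq_nlp_api import CodeqClient
#   client = CodeqClient(user_id="USER_ID", user_key="USER_KEY")
#   document = client.analyze(text, pipeline=pipe)

# Stand-in object with values copied from the outputs in this post.
tokens = ['This', 'model', 'is', 'an', 'expensive', 'alternative',
          'with', 'useless', 'battery', 'life', '.']
sentence = SimpleNamespace(
    raw_sentence="This model is an expensive alternative with useless battery life.",
    tokens_filtered=['model', 'expensive', 'alternative',
                     'useless', 'battery', 'life'],
    pos_tags=['DT', 'NN', 'VBZ', 'DT', 'JJ', 'JJ', 'IN', 'JJ', 'NN', 'NN', '.'],
    lemmas=['this', 'model', 'be', 'a', 'expensive', 'alternative',
            'with', 'useless', 'battery', 'life', '.'],
    stems=['this', 'model', 'is', 'an', 'expens', 'altern',
           'with', 'useless', 'batteri', 'life', '.'],
)
document = SimpleNamespace(language="English", tokens=tokens, sentences=[sentence])

features = collect_features(document)
print(features["language"])                   # English
print(features["sentences"][0]["lemmas"][2])  # be
```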

  • Take a look at our documentation to learn more about the NLP tools we provide.

Senior Computational Linguist at Codeq