Codeq NLP API Tutorial

We started this series of tutorials to show you how to call different modules of Codeq’s NLP API. So far, we have covered the following topics:

In this tutorial we will detail the NLP modules related to the extraction and disambiguation of named entities from texts. Named entities are real-world entities, that is, things in the world, whether concrete or abstract (e.g., persons, organizations, locations). The identification of named entities plays an important role in many NLP-related use cases, for example in the summarization of texts and the monitoring of brands or products.

The complete list of modules in the Codeq NLP API can be found here:

Define an NLP pipeline and analyze a text

We initialize an instance of the Codeq Client and declare a pipeline containing the desired annotators. After that, we can send a text to the client and retrieve a Document object with a list of Sentences, where the information related to named entities is stored. To print a quick overview of the results, you can use the method document.pretty_print(), which we will explain in detail in the following sections.

from codeq_nlp_api import CodeqClient

client = CodeqClient(user_id="USER_ID", user_key="USER_KEY")

pipe = [
    "ner", "nel", "salience",
    "date", "coreference"
]

text = 'Tesla almost died earlier this year, Elon Musk said last Friday in an interview with Axios that aired on HBO. Musk said the company was "bleeding money like crazy" as it worked through the Model 3 production ramp in the spring and summer. He said the company "came within single-digit" weeks of death before it was able to meet its Model 3 production goals.'

document = client.analyze(text, pipeline=pipe)

print(document.pretty_print())

For each annotator of this pipeline we are going to explain:

  • the keyword (KEY) used to call the annotator,

Named Entity Recognition

This annotator extracts named entities from texts and provides the tokens of the entity, its type and its position in the tokenized sentence.

  • KEY: ner

Output Labels

  • PER (person)
pipe = ["ner"]

text = 'Tesla almost died earlier this year, Elon Musk said last Friday in an interview with Axios that aired on HBO.'

document = client.analyze(text, pipeline=pipe)

for sentence in document.sentences:
    named_entities = sentence.named_entities
    for ne in named_entities:
        # Each entity is a triple: tokens, type, and token positions
        ne_tokens, ne_type, ne_position = ne
        print("ne_tokens: %s" % ne_tokens)
        print("ne_type: %s" % ne_type)
        print("ne_position: %s\n" % ne_position)
# OUTPUT:
#
# ne_tokens: Tesla
# ne_type: ORG
# ne_position: ['0']
#
# ne_tokens: Elon Musk
# ne_type: PER
# ne_position: ['7', '8']
#
# ne_tokens: last Friday
# ne_type: DATE
# ne_position: ['10', '11']
#
# ne_tokens: Axios
# ne_type: ORG
# ne_position: ['16']
#
# ne_tokens: HBO
# ne_type: ORG
# ne_position: ['20']
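
Once the entities are available as (tokens, type, position) triples, a common next step is to group them by type. The snippet below is a minimal sketch of that idea; it does not call the API and instead works on sample data copied from the output above. The helper name group_by_type is our own and is not part of the Codeq client.

```python
from collections import defaultdict

# Sample data copied from the output above; with the API this would be
# the content of sentence.named_entities.
named_entities = [
    ["Tesla", "ORG", ["0"]],
    ["Elon Musk", "PER", ["7", "8"]],
    ["last Friday", "DATE", ["10", "11"]],
    ["Axios", "ORG", ["16"]],
    ["HBO", "ORG", ["20"]],
]


def group_by_type(entities):
    """Map each entity type to its entity strings, keeping text order.

    Hypothetical helper for illustration; not part of the Codeq client.
    """
    grouped = defaultdict(list)
    for ne_tokens, ne_type, _ne_position in entities:
        grouped[ne_type].append(ne_tokens)
    return dict(grouped)


entities_by_type = group_by_type(named_entities)
print(entities_by_type)
```

Grouping by type makes it easy to, for example, keep only the organizations mentioned in a text.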

Named Entity Linking

Named Entity Linking (also called Named Entity Disambiguation) refers to the process of mapping a mention of an entity extracted from a text to a unique entry in a knowledge base.

Given the high ambiguity of language, a named entity can have multiple names and a name can be linked to different named entities. Hence, the main goal of this annotator is to disambiguate the entity mentions in their textual contexts and identify a concrete referent, using Wikipedia and Wikidata as knowledge bases.

  • KEY: nel
pipe = [
    "ner", "nel"
]

text = 'Tesla almost died earlier this year, Elon Musk said last Friday in an interview with Axios that aired on HBO.'

document = client.analyze(text, pipeline=pipe)

for sentence in document.sentences:
    named_entities = sentence.named_entities
    named_entities_linked = sentence.named_entities_linked
    for i, ne in enumerate(named_entities):
        print("ne: %s" % ne)
        # The linked list is aligned with named_entities by index
        ne_linked = named_entities_linked[i]
        if ne_linked:
            label = ne_linked['label']
            description = ne_linked['description']
            wikipedia_link = ne_linked['wikipedia_link']
            wikidata_link = ne_linked['wikidata_link']
            print("- label: %s" % label)
            print("- description: %s" % description)
            print("- wikipedia: %s" % wikipedia_link)
            print("- wikidata: %s\n" % wikidata_link)
        else:
            # No referent found (e.g., DATE entities)
            print("- None\n")
# Output:
#
# ne: ['Tesla', 'ORG', ['0']]
# - label: Tesla
# - description: American automotive, energy storage and solar power company
# - wikipedia: https://en.wikipedia.org/wiki/Tesla,_Inc.
# - wikidata: https://www.wikidata.org/wiki/Q478214
#
# ne: ['Elon Musk', 'PER', ['7', '8']]
# - label: Elon Musk
# - description: South African-born American entrepreneur
# - wikipedia: https://en.wikipedia.org/wiki/Elon_Musk
# - wikidata: https://www.wikidata.org/wiki/Q317521
#
# ne: ['last Friday', 'DATE', ['10', '11']]
# - None
#
# ne: ['Axios', 'ORG', ['16']]
# - label: AXIOS Media
# - description: American news and information website
# - wikipedia: https://en.wikipedia.org/wiki/Axios_(website)
# - wikidata: https://www.wikidata.org/wiki/Q28230873
#
# ne: ['HBO', 'ORG', ['20']]
# - label: HBO
# - description: American pay television network
# - wikipedia: https://en.wikipedia.org/wiki/HBO
# - wikidata: https://www.wikidata.org/wiki/Q23633

The output stored in the variable sentence.named_entities_linked is a list with the same number of elements as the list of entities found by the Named Entity Recognition module, stored in the variable sentence.named_entities. If it is not possible to disambiguate a named entity (for example, entities of type DATE, or cases where no referent is found in the knowledge base), then the corresponding element in the list sentence.named_entities_linked will be None.
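
Because the two lists are aligned by position, they can be paired with zip and the None entries skipped. The snippet below is a small sketch of that pattern using sample data copied from the output above (only the first three entities, to keep it short). The helper linked_pairs and the extraction of the Wikidata id from the URL are our own illustration, not part of the Codeq client.

```python
# Sample data copied from the output above (first three entities only).
named_entities = [
    ["Tesla", "ORG", ["0"]],
    ["Elon Musk", "PER", ["7", "8"]],
    ["last Friday", "DATE", ["10", "11"]],
]
named_entities_linked = [
    {
        "label": "Tesla",
        "description": "American automotive, energy storage and solar power company",
        "wikipedia_link": "https://en.wikipedia.org/wiki/Tesla,_Inc.",
        "wikidata_link": "https://www.wikidata.org/wiki/Q478214",
    },
    {
        "label": "Elon Musk",
        "description": "South African-born American entrepreneur",
        "wikipedia_link": "https://en.wikipedia.org/wiki/Elon_Musk",
        "wikidata_link": "https://www.wikidata.org/wiki/Q317521",
    },
    None,  # DATE entities are not linked
]


def linked_pairs(entities, linked):
    """Pair each entity with its linked entry, dropping unlinked ones.

    Hypothetical helper for illustration; not part of the Codeq client.
    """
    if len(entities) != len(linked):
        raise ValueError("entity and link lists must have the same length")
    return [(ne, link) for ne, link in zip(entities, linked) if link is not None]


pairs = linked_pairs(named_entities, named_entities_linked)

# Map the entity string to the last path segment of its Wikidata URL.
wikidata_ids = {
    ne[0]: link["wikidata_link"].rsplit("/", 1)[-1] for ne, link in pairs
}
print(wikidata_ids)
```

Checking the lengths first guards against pairing entities with the wrong entries if the two lists ever come from different sentences.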

Named Entity Salience

The goal of this annotator is to indicate the salience of named entities, that is, how relevant they are to the content of the input document. This annotator produces a tuple for each named entity, indicating if the entity is salient or not and its salience score.

  • KEY: salience
pipe = [
"ner", "salience"
]

text = 'Tesla almost died earlier this year, Elon Musk said last Friday in an interview with Axios that aired on HBO.'

document = client.analyze(text, pipeline=pipe)

for sentence in document.sentences:
    named_entities = sentence.named_entities
    named_entities_salience = sentence.named_entities_salience
    for i, ne in enumerate(named_entities):
        # The salience list is aligned with named_entities by index
        ne_salience = named_entities_salience[i]
        is_salient, salience_score = ne_salience
        print("\nne: %s" % ne)
        print("- is_salient: %s" % is_salient)
        print("- score: %s" % salience_score)
# Output:
#
# ne: ['Tesla', 'ORG', ['0']]
# - is_salient: 1
# - score: 0.4574693441390991
#
# ne: ['Elon Musk', 'PER', ['7', '8']]
# - is_salient: 0
# - score: 0.2523033320903778
#
# ne: ['last Friday', 'DATE', ['10', '11']]
# - is_salient: 0
# - score: 0
#
# ne: ['Axios', 'ORG', ['16']]
# - is_salient: 0
# - score: 0.31706923246383667
#
# ne: ['HBO', 'ORG', ['20']]
# - is_salient: 0
# - score: 0.25251904129981995
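
The salience scores can be used to rank the entities of a text. The snippet below is a minimal sketch that works on sample data copied from the output above rather than calling the API; the variable names ranked_names and salient_names are our own.

```python
# Sample data copied from the output above; with the API these would be
# sentence.named_entities and sentence.named_entities_salience.
named_entities = [
    ["Tesla", "ORG", ["0"]],
    ["Elon Musk", "PER", ["7", "8"]],
    ["last Friday", "DATE", ["10", "11"]],
    ["Axios", "ORG", ["16"]],
    ["HBO", "ORG", ["20"]],
]
named_entities_salience = [
    [1, 0.4574693441390991],
    [0, 0.2523033320903778],
    [0, 0],
    [0, 0.31706923246383667],
    [0, 0.25251904129981995],
]

# Pair each entity with its (is_salient, score) entry and sort by score,
# highest first.
ranked = sorted(
    zip(named_entities, named_entities_salience),
    key=lambda pair: pair[1][1],
    reverse=True,
)
ranked_names = [ne[0] for ne, _salience in ranked]

# Keep only the entities flagged as salient by the annotator.
salient_names = [ne[0] for ne, (is_salient, _score) in ranked if is_salient]

print(ranked_names)
print(salient_names)
```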

Date Resolution

This annotator tries to resolve temporal expressions in natural language (e.g., next Friday, last Monday) to specific dates. Relative expressions are resolved against a reference date, which by default is the current date. The output includes the date entity, its token span and the resolved timestamp.

  • KEY: date
pipe = [
"ner", "date"
]

text = 'Tesla almost died earlier this year, Elon Musk said last Friday in an interview with Axios that aired on HBO.'
document = client.analyze(text, pipeline=pipe)

for sentence in document.sentences:
    dates = sentence.dates
    for date in dates:
        # Each date is a triple: tokens, token span, and resolved timestamp
        date_tokens = date[0]
        date_span = date[1]
        timestamp = date[2]
        print("date_tokens: %s" % date_tokens)
        print("date_span: %s" % date_span)
        print("timestamp: %s" % timestamp)
# Output:
#
# date_tokens: last Friday
# date_span: ['10', '11']
# timestamp: 2020-11-06 09:00:00 (from date referent: 11/11/2020)
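
To make the role of the reference date concrete, the snippet below sketches how an expression like "last Friday" could be resolved with Python's standard datetime module. This is only a simplified illustration of the idea and not the algorithm used by the annotator; the function last_weekday is hypothetical, and it assumes that "last Friday" means the most recent Friday strictly before the reference date.

```python
from datetime import date, timedelta

FRIDAY = 4  # date.weekday() numbers Monday as 0 and Sunday as 6


def last_weekday(reference, weekday):
    """Return the most recent given weekday strictly before the reference.

    Hypothetical helper for illustration; not the annotator's algorithm.
    """
    days_back = (reference.weekday() - weekday) % 7
    if days_back == 0:
        # The reference itself falls on that weekday: go back a full week.
        days_back = 7
    return reference - timedelta(days=days_back)


# Reference date taken from the output above (11/11/2020).
reference = date(2020, 11, 11)
resolved = last_weekday(reference, FRIDAY)
print(resolved.isoformat())
```

With the reference date from the example, this reading of "last Friday" yields the same day as the timestamp shown in the output above.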

Coreference Resolution

A coreference occurs when two or more mentions in a text refer to the same entity using different words. For example, in the sentence:

“John is working today, he is at the office.”

The pronoun he refers to the entity John.

Our coreference resolution module tries to resolve pronominal coreferences, i.e., to find the entities that personal pronouns (he, she, him, her, etc.) refer to. The annotator produces a resolved coreference including the mention of the entity, its referent and a coreference chain with all the elements that point to the same entity.

  • KEY: coreference
pipe = [
"ner", "coreference"
]

text = 'Tesla almost died earlier this year, Elon Musk said last Friday in an interview with Axios that aired on HBO. Musk said the company was "bleeding money like crazy" as it worked through the Model 3 production ramp in the spring and summer. He said the company "came within single-digit" weeks of death before it was able to meet its Model 3 production goals.'

document = client.analyze(text, pipeline=pipe)

for sentence in document.sentences:
    print("\n%s" % sentence.raw_sentence)
    coreferences = sentence.coreferences
    for coreference in coreferences:
        mention = coreference['mention']
        referent = coreference['referent']
        referent_chain = coreference['referent_chain']
        print("- mention: %s" % mention)
        print("- referent: %s" % referent)
        print("- referent_chain: %s\n" % referent_chain)
# Output:
#
# Tesla almost died earlier this year, Elon Musk said last Friday in an interview with Axios that aired on HBO.
#
# - mention: None
# - referent: ['00_07', ['Elon', 'Musk'], [7, 8]]
# - referent_chain: []
#
# Musk said the company was "bleeding money like crazy" as it worked through the Model 3 production ramp in the spring and summer.
#
# - mention: ['01_00', ['Musk'], [0]]
# - referent: ['00_07', ['Elon', 'Musk'], [7, 8]]
# - referent_chain: ['00_07']
#
# - mention: None
# - referent: ['01_16', ['Model', '3'], [16, 17]]
# - referent_chain: []
#
# He said the company "came within single-digit" weeks of death before it was able to meet its Model 3 production goals.
#
# - mention: ['02_00', ['He'], [0]]
# - referent: ['01_00', ['Musk'], [0]]
# - referent_chain: ['01_00', '00_07']
#
# - mention: ['02_19', ['Model', '3'], [19, 20]]
# - referent: ['01_16', ['Model', '3'], [16, 17]]
# - referent_chain: ['01_16']

As we can observe from the output above, a coreference element can contain a null mention if it is the first referent found in a text:

# mention: None
# referent: ['00_07', ['Elon', 'Musk'], [7, 8]]
# referent_chain: []

The referent_chain indicates the ids of the related coreference elements:

# mention: ['02_00', ['He'], [0]]
# referent: ['01_00', ['Musk'], [0]]
# referent_chain: ['01_00', '00_07']

In the example above, the referent chain ids ['01_00', '00_07'] mean that all the related tokens, including the current mention, are:

# ['02_00', ['He'], [0]]
# ['01_00', ['Musk'], [0]]
# ['00_07', ['Elon', 'Musk'], [7, 8]]
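
The expansion of a referent chain into its tokens can also be done programmatically. The snippet below is a minimal sketch that works on coreference elements copied from the output above instead of calling the API. The helpers index_by_id and expand_chain are hypothetical, and the sketch assumes that every id in a chain also appears as a mention or referent somewhere in the document.

```python
# Coreference elements copied from the output above (the Elon Musk chain).
coreferences = [
    {
        "mention": None,
        "referent": ["00_07", ["Elon", "Musk"], [7, 8]],
        "referent_chain": [],
    },
    {
        "mention": ["01_00", ["Musk"], [0]],
        "referent": ["00_07", ["Elon", "Musk"], [7, 8]],
        "referent_chain": ["00_07"],
    },
    {
        "mention": ["02_00", ["He"], [0]],
        "referent": ["01_00", ["Musk"], [0]],
        "referent_chain": ["01_00", "00_07"],
    },
]


def index_by_id(corefs):
    """Build a lookup from element id to its tokens (hypothetical helper)."""
    index = {}
    for coref in corefs:
        for element in (coref["mention"], coref["referent"]):
            if element is not None:
                element_id, tokens, _positions = element
                index[element_id] = tokens
    return index


def expand_chain(coref, index):
    """List (id, text) for the mention followed by every id in its chain."""
    ids = []
    if coref["mention"] is not None:
        ids.append(coref["mention"][0])
    ids.extend(coref["referent_chain"])
    return [(element_id, " ".join(index[element_id])) for element_id in ids]


index = index_by_id(coreferences)
expanded = expand_chain(coreferences[-1], index)
print(expanded)
```

The last element of the expanded chain is the first referent found in the text, which is usually the most informative form to substitute for a pronoun.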

Wrap up

In this tutorial we described different modules related to the extraction and disambiguation of named entities. The code below summarizes how to call the annotators explained here and access their output.

  • Take a look at our documentation to learn more about the NLP tools we provide.

Senior Computational Linguist at Codeq