Darío Garigliotti | forgotten entities

Entities, such as people, organizations and products, are at the core of many techniques and applications devised to automatically process text documents. One of these applications is reputation analysis: for example, monitoring news streams to follow the stock performance of a company or a scandal around a celebrity. This task is referred to as online reputation management, and it deals with identifying and mining the textual contexts from media streams (for example, sentences in news articles) in which a given entity occurs. To identify such contexts, it is necessary to perform entity linking: detecting mentions of entities and disambiguating them.

Although these problems are usually addressed with methods that rely on knowledge bases, such semantic resources are limited by the amount of information available for a given entity. New companies, new celebrities, new movies: new entities emerge on a daily basis, and little to no information about them is yet available in knowledge repositories. There is therefore a need to quickly acquire representations for these long-tail entities (entities at the tail of the prominence distribution), possibly by drawing on similar entities for which monitoring is already more accessible.

As an example, consider the entity Isai, a then-novel investment fund. Since it is just emerging, the entity name occurs in very few textual contexts. How can we collect as many of these contexts as possible and use them to model entity linking for this entity, given that other entities, such as the movie Isai, also occur in textual contexts, and we are not yet able to properly differentiate between them?

In our project, the main underlying idea is to utilize established entities (referred to as support entities, with rich KB entries) similar to the input entity, and their contexts (support contexts), to rank the contexts in which the input entity is mentioned. Specifically, we first perform support entities ranking (SER), to get a ranked list of entities that are similar to our entity of interest. Next, we identify support contexts for each support entity via traditional entity linking (support context ranking, SCR). Finally, we rank contexts for our input entity by considering their similarity to the support contexts (context-to-context ranking, CCR).
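
The three steps above can be sketched in a few lines of Python. This is only an illustrative toy, not the method from our article: it assumes bag-of-words cosine similarity at every step, a short seed description of the input entity, and a hand-made dictionary standing in for the output of an entity linker. The names `FundA`, `FundC` and `FilmB`, all the texts, and all the function names are invented for the example.

```python
import math
import re
from collections import Counter


def bow(text):
    """Lowercased bag-of-words count vector for a piece of text."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank_support_entities(seed_description, kb, k=2):
    """SER (toy): top-k KB entities most similar to a seed description."""
    seed = bow(seed_description)
    ranked = sorted(kb, key=lambda e: cosine(seed, bow(kb[e])), reverse=True)
    return ranked[:k]


def collect_support_contexts(support_entities, linked_contexts):
    """SCR (toy): contexts that an entity linker already attached to each support entity."""
    return [c for e in support_entities for c in linked_contexts.get(e, [])]


def rank_contexts(candidates, support_contexts):
    """CCR (toy): rank candidates by their best similarity to any support context."""
    support_vecs = [bow(c) for c in support_contexts]

    def score(context):
        vec = bow(context)
        return max((cosine(vec, s) for s in support_vecs), default=0.0)

    return sorted(candidates, key=score, reverse=True)


# Invented toy data: KB descriptions of established entities.
kb = {
    "FundA": "venture capital investment fund for technology startups",
    "FundC": "private equity investment fund",
    "FilmB": "feature film released in theaters",
}
# Invented stand-in for the output of a traditional entity linker.
linked_contexts = {
    "FundA": ["FundA led a funding round for a robotics startup"],
    "FundC": ["FundC raised capital from investors"],
    "FilmB": ["FilmB premiered in theaters to strong reviews"],
}
# Invented candidate contexts mentioning the ambiguous name "Isai".
candidates = [
    "Isai premiered in theaters last week",
    "Isai led a funding round for a software startup",
]

support = rank_support_entities("an investment fund backing startups", kb)
support_contexts = collect_support_contexts(support, linked_contexts)
ranked = rank_contexts(candidates, support_contexts)
```

With this toy data, the two funds are selected as support entities, so the funding-round sentence is ranked above the movie sentence. In practice each step would use stronger similarity signals than raw word overlap.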

If you would like to know more, please read our article for the technical details on how we:

formalize the problem of context retrieval for long-tail entities;
detail our proposed approach to acquire contextual representations for these long-tail entities in an unsupervised fashion;
build a curated test collection of long-tail entities and relevant contexts;
and show experimentally that our method substantially outperforms both an entity linking and a retrieval baseline.