Darío Garigliotti | A fact about Vietnam war facts

In the previous post, I described a systematic comparison of dimensions in type-aware entity retrieval. There, we assumed that target entity types were identified by an oracle process. Indeed, target entity types may be provided by the user explicitly as part of the search request, for example, via faceted user interfaces. Often, however, users prefer to use simple keyword queries as input. In that case, target entity types need to be identified automatically from the keyword query. In this post, I discuss how to assign target entity types from a type taxonomy to queries.

Hierarchical Target Entity Type Identification

First, let’s look at two examples that motivate the definition of the hierarchical target entity type identification task.

The query “finland car industry manufacturer saab sisu” has both Company and Automobile as valid target entity types. Related literature implicitly assumes that every query has a single target type, which is not particularly useful in practice, so we relax this assumption. We allow multiple main types, provided they are sufficiently different, i.e., they lie on different branches of the taxonomy.

Second, it can happen (and in fact it does happen for 33% of the queries considered in a prominent previous work) that a query cannot be mapped to any type in the given taxonomy. Take, for example, the query “Vietnam war facts.” What are its target types? Most known taxonomies have no Fact type, nor should they: an entity cannot be of type “fact”; rather, facts are statements about entities. So what are these Vietnam war facts about? The war as a whole? Particular events? The people involved? Soldiers, in particular? The cities? Since facts can be about entities of any kind, any type of entity, or rather any piece of information centered on any kind of entity, could satisfy the query. No single type, then, is a meaningful target entity type. We therefore allow a query to have no type at all (or, equivalently, to be tagged with a special NIL-type). This relaxation means that we can now take any query as input.

Hierarchical target entity type identification is the task of finding the main target types of a query, from a type taxonomy, such that (i) these correspond to the most specific category of entities that are relevant to the query, and (ii) main types cannot be on the same branch in the taxonomy. If no matching type can be found in the taxonomy then the query is assigned a special NIL-type.
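To make the two conditions concrete, here is a minimal sketch in Python. It assumes the taxonomy is given as a child-to-parent map and that some earlier step has already produced the set of types relevant to the query. The toy taxonomy, the constant NIL, and the function names are illustrative assumptions, not part of the original work.

```python
from typing import Dict, List, Optional

NIL = "NIL"

# Hypothetical toy taxonomy: child -> parent (None marks a root type).
PARENT: Dict[str, Optional[str]] = {
    "Agent": None,
    "Organisation": "Agent",
    "Company": "Organisation",
    "MeanOfTransportation": None,
    "Automobile": "MeanOfTransportation",
}


def ancestors(type_name: str, parent: Dict[str, Optional[str]]) -> List[str]:
    """Return all strict ancestors of type_name, nearest first."""
    result = []
    current = parent.get(type_name)
    while current is not None:
        result.append(current)
        current = parent.get(current)
    return result


def same_branch(a: str, b: str, parent: Dict[str, Optional[str]]) -> bool:
    """Two types share a branch if they are equal or one is an ancestor of the other."""
    return a == b or a in ancestors(b, parent) or b in ancestors(a, parent)


def main_types(relevant: List[str], parent: Dict[str, Optional[str]]) -> List[str]:
    """Keep the most specific relevant types, with no two on the same branch."""
    selected: List[str] = []
    # Visit deeper types first, so the most specific type on a branch wins.
    by_depth = sorted(relevant, key=lambda t: len(ancestors(t, parent)), reverse=True)
    for candidate in by_depth:
        if not any(same_branch(candidate, chosen, parent) for chosen in selected):
            selected.append(candidate)
    # No matching type at all: fall back to the special NIL-type.
    return sorted(selected) if selected else [NIL]
```

With this toy taxonomy, a query whose relevant types are Organisation, Company, and Automobile keeps Company and Automobile: Organisation is dropped because it sits on the same branch as the more specific Company. A query with no relevant types gets NIL.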

Our approach: Learning-to-Rank

We then build a test collection for evaluating this task. To keep the annotation of dataset instances simple, we assign only DBpedia types to the entities in the collection. Furthermore, we propose an approach for automatically identifying target entity types.

We consider two baseline methods from the literature. One of them uses an entity-centric strategy; that is, it uses entities as a bridge to reach entity types from a query. The other, a type-centric model, builds a textual representation of each type by concatenating the descriptions of all the entities assigned to that type.
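The difference between the two strategies can be sketched as follows. This is only an illustration: the entity data is made up, and a plain term-overlap count stands in for the retrieval models that the actual baselines rely on. The names ENTITIES, overlap_score, entity_centric, and type_centric are all hypothetical.

```python
from collections import Counter, defaultdict
from typing import Dict

# Hypothetical toy data: entity -> (description, assigned types).
ENTITIES = {
    "Saab_Automobile": ("swedish car manufacturer saab", ["Company"]),
    "Sisu_Auto": ("finland truck manufacturer sisu", ["Company"]),
    "Saab_900": ("saab car model", ["Automobile"]),
    "Helsinki": ("capital city of finland", ["City"]),
}


def overlap_score(query: str, text: str) -> int:
    """Stand-in retrieval score: occurrences of query terms in the text."""
    counts = Counter(text.split())
    return sum(counts[term] for term in query.split())


def entity_centric(query: str, k: int = 3) -> Dict[str, float]:
    """Rank entities first, then aggregate the top-k entity scores per type."""
    ranked = sorted(
        ENTITIES, key=lambda e: overlap_score(query, ENTITIES[e][0]), reverse=True
    )[:k]
    type_scores: Dict[str, float] = defaultdict(float)
    for entity in ranked:
        description, types = ENTITIES[entity]
        for type_name in types:
            type_scores[type_name] += overlap_score(query, description)
    return dict(type_scores)


def type_centric(query: str) -> Dict[str, float]:
    """Concatenate entity descriptions per type, then score each pseudo-document."""
    type_docs = defaultdict(list)
    for description, types in ENTITIES.values():
        for type_name in types:
            type_docs[type_name].append(description)
    return {t: float(overlap_score(query, " ".join(docs))) for t, docs in type_docs.items()}
```

In the entity-centric sketch, a type can only receive a score through entities that made it into the top k, whereas the type-centric sketch scores every type that has at least one entity. That is one way in which the two views can disagree on the same query.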

Our approach consists in learning to rank (LTR) target entity types. The entity-centric and type-centric models capture different aspects of target entity type identification, and it is therefore sensible to combine the two as features in our approach. In addition, we leverage other signals, including knowledge base features and type label similarities.
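As a rough illustration of how such signals could be put together, the sketch below builds one feature vector per query-type pair. The concrete choices are assumptions made for the example only: taxonomy depth stands in for a knowledge base feature, and term-level Jaccard similarity stands in for type label similarity. The actual feature set is described in the thesis.

```python
from typing import List


def jaccard(a: str, b: str) -> float:
    """Term-level Jaccard similarity between two strings."""
    set_a, set_b = set(a.lower().split()), set(b.lower().split())
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0


def feature_vector(
    query: str, type_label: str, ec_score: float, tc_score: float, depth: int
) -> List[float]:
    """Assemble features for one (query, type) pair.

    ec_score and tc_score are the scores of the entity-centric and
    type-centric baselines; depth is the type's level in the taxonomy
    (an assumed example of a knowledge base feature).
    """
    return [ec_score, tc_score, float(depth), jaccard(query, type_label)]
```

Vectors of this kind, paired with relevance labels, are the sort of input that a learning-to-rank method would be trained on.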

We evaluate target entity type identification intrinsically. We find that our supervised learning (LTR) approach significantly and substantially outperforms all baseline methods.

Moreover, our analysis of the discriminative power of the features underlines the effectiveness of textual similarity, enriched with distributional semantic representations, measured between the query and the type label.

Results: feature analysis of target entity type identification

Figure: Performance of our LTR approach, measured by NDCG@5, as features are added incrementally in order of their individual information gain, measured by Gini score.
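For readers unfamiliar with the metric, here is a minimal sketch of NDCG@k. It assumes the common formulation with linear gains and a log2 rank discount; the exact gain function used in the experiments is not stated in this post, so treat this as one standard variant rather than the definitive setup. The function names are illustrative.

```python
import math
from typing import List


def dcg_at_k(gains: List[float], k: int) -> float:
    """Discounted cumulative gain over the first k positions (ranks start at 1)."""
    return sum(g / math.log2(rank + 1) for rank, g in enumerate(gains[:k], start=1))


def ndcg_at_k(ranked_gains: List[float], all_gains: List[float], k: int = 5) -> float:
    """NDCG@k: DCG of the produced ranking divided by the DCG of the ideal ranking.

    ranked_gains holds the relevance of each returned type, in ranked order;
    all_gains holds the relevance of every relevant type for the query.
    """
    ideal = dcg_at_k(sorted(all_gains, reverse=True), k)
    return dcg_at_k(ranked_gains, k) / ideal if ideal > 0 else 0.0
```

A ranking that places the only relevant type first scores 1, and the score drops as that type moves down the list.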

If you would like to know more, please read Chapter 4 of my thesis, Task-Based Support in Search Engines, for technical details of the problem, the terminology, the methodology, and the experimental results and analysis.