This is the last part in a series on type-aware entity retrieval. In previous posts, I described our work on the core question of how to use type-based information to improve entity retrieval. We have seen results across several configurations combining choices along three dimensions of interest, all obtained in an idealized setting where target types were provided by an oracle. We have also described and evaluated a method to identify target entity types automatically. We now plug these automatically identified target types in place of the oracle, in order to study how well type-aware entity retrieval performs with this kind of type information instead of perfect, idealized information.

This corresponds to an extrinsic evaluation of the method for hierarchical target type identification, i.e., evaluating the “goodness” of the automatically identified target types by using them in an external task (here, entity retrieval). Or, in a third, more formal statement: do these entity retrieval systems have a very specific taste, like the persona in the song by Saint Motel, when it comes to type-based information, or will they do fine with less perfect, more realistically available data?

We evaluate all combinations of retrieval models, type representation modes and (target entity type) identification models, using the DBpedia Ontology as the single type taxonomy (since it is the only type system used to annotate the target entity type collection). We also compute performance using the target type labels provided by the human assessors in the collection as target entity types; we refer to this setting as Oracle 2 (to distinguish it from Oracle 1, the original oracle with perfect information).

Overall, our results show that when target entity types are identified automatically, using hierarchical relationships from ancestor types is the most effective way of representing entity type information; keeping only the most specific types is helpful only when an accurate target entity type identification method is employed.

We observe that strict filtering achieves the best performance among all configurations, and the best results are obtained when it is combined with the learning-to-rank (LTR) identification model. Moreover, we verify that an effective target entity type identification method, one that returns relevant types at the top ranks, can bring considerable retrieval improvements when used with the strict filtering retrieval model.
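To make the idea of strict filtering concrete, here is a minimal sketch in Python. It assumes the simplest reading of the model: an entity from the term-based ranking is kept only if it shares at least one type with the identified target types. The function name, data structures and toy data are hypothetical and only illustrate the idea; they are not the implementation used in the experiments.

```python
from typing import Dict, List, Set, Tuple


def strict_filter(
    ranking: List[Tuple[str, float]],
    entity_types: Dict[str, Set[str]],
    target_types: Set[str],
) -> List[Tuple[str, float]]:
    """Keep only entities that share at least one type with the target types.

    Assumption: "strict" means a hard cut; entities with no overlapping
    type are dropped, and the remaining ones keep their term-based score.
    """
    return [
        (entity, score)
        for entity, score in ranking
        if entity_types.get(entity, set()) & target_types
    ]


# Hypothetical toy data: a term-based ranking and entity-to-type assignments.
ranking = [("Brooklyn_Bridge", 2.1), ("Brooklyn", 1.7), ("Manhattan_Bridge", 1.2)]
entity_types = {
    "Brooklyn_Bridge": {"Bridge", "ArchitecturalStructure"},
    "Brooklyn": {"City", "Place"},
    "Manhattan_Bridge": {"Bridge", "ArchitecturalStructure"},
}

filtered = strict_filter(ranking, entity_types, target_types={"Bridge"})
```

Because the filter is a hard cut, a wrong target type removes relevant entities altogether, which is consistent with the observation that this model benefits most from an accurate type identification method.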

Entity retrieval performance for all combinations of retrieval models and type representation modes, using automatically identified target types from DBpedia. The red line corresponds to the term-based baseline. Performance is measured by NDCG@10.
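For readers less familiar with the metric, the following is a small sketch of how NDCG@10 can be computed for a single query. It assumes graded relevance labels used directly as gains (linear gain) and a base-2 logarithmic discount; other variants of NDCG exist, and the function names here are illustrative rather than taken from any evaluation tool.

```python
import math
from typing import Dict, List


def dcg_at_k(gains: List[float], k: int) -> float:
    """Discounted cumulative gain over the first k positions."""
    return sum(g / math.log2(i + 2) for i, g in enumerate(gains[:k]))


def ndcg_at_k(ranked: List[str], qrels: Dict[str, int], k: int = 10) -> float:
    """NDCG@k for one query.

    ranked: entity IDs in ranked order.
    qrels: graded relevance per entity (unjudged entities count as 0).
    """
    gains = [qrels.get(entity, 0) for entity in ranked]
    ideal_gains = sorted(qrels.values(), reverse=True)
    idcg = dcg_at_k(ideal_gains, k)
    return dcg_at_k(gains, k) / idcg if idcg > 0 else 0.0
```

A ranking that places the judged entities in decreasing order of relevance scores 1.0; any other ordering, or any missing relevant entity, scores lower.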


In order to better understand the effects of automatic target type identification, we break down the entity retrieval results into the four query categories present in the DBpedia-Entity v2 collection. These range from short keyword queries (e.g., “guitar origin blues”), to named entity queries (e.g., “brooklyn bridge”), to entity list queries (e.g., “products of medimmune, inc”), to natural language queries (e.g., “who was called scarface?”). We find that type-aware retrieval using the LTR method significantly and consistently outperforms the term-based baseline for all query categories.

Differences in NDCG@10 per query between type-aware entity retrieval and its corresponding (term-based) baseline, using the strict filtering model with top-level DBpedia target types automatically detected by LTR, grouped by query categories.
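The per-query analysis behind a plot like this can be sketched as follows: for each query, subtract the baseline NDCG@10 from the type-aware NDCG@10, then group the differences by query category. The query IDs, scores and category names below are made up for illustration and are not results from the experiments.

```python
from collections import defaultdict
from typing import Dict, List


def deltas_by_category(
    baseline: Dict[str, float],
    type_aware: Dict[str, float],
    category: Dict[str, str],
) -> Dict[str, List[float]]:
    """Per-query score differences (type-aware minus baseline), per category.

    Within each category, differences are sorted in decreasing order,
    which is a common layout for per-query difference plots.
    """
    grouped: Dict[str, List[float]] = defaultdict(list)
    for query_id, base_score in baseline.items():
        grouped[category[query_id]].append(type_aware[query_id] - base_score)
    return {cat: sorted(vals, reverse=True) for cat, vals in grouped.items()}


# Hypothetical toy scores for three queries in two categories.
baseline = {"q1": 0.5, "q2": 0.25, "q3": 0.75}
type_aware = {"q1": 0.75, "q2": 0.25, "q3": 0.5}
category = {"q1": "named_entity", "q2": "named_entity", "q3": "list"}

deltas = deltas_by_category(baseline, type_aware, category)
```

Positive values mark queries helped by type information, negative values mark queries hurt by it, and zeros mark queries where the ranking quality did not change.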


If you are interested in more of this, please read Chapter 4, in particular Section 4.5.2, of my thesis Task-Based Support in Search Engines, for technical details of the problem, the terminology, the methodology, and the experimental results and analysis.