Einstein was not a physicist
I mean, of course Einstein was a physicist, but is it not a very catchy title? I mean, is it not what you would conclude after not finding Physicist among the types of the entity Albert Einstein? I get it: you are aware that knowledge bases and repositories are not perfect and, in particular, are incomplete. But what about the vast majority of users, who very likely trust these resources, especially Wikipedia, as a holy book, and could not imagine that Physicist was simply missing from the types of Albert Einstein? Worse, who would never imagine that a type like Physicist is not even available? And I get it: you know Einstein was a physicist, so you don’t need to learn it by exploring (subject, predicate, object) triples in a knowledge base, or by checking the categories at the bottom of the Einstein article on Wikipedia. But what about lesser-known facts? And what about cases where that information would be useful? What about cases like, I don’t know… entity retrieval?
OK, let’s bring some order back to this very young blog. In the last post, I explained entity types and described some type systems where types are made available. And, since I was there and felt powerful, I kind of ranted about why I found it not necessarily appropriate for the DBpedia Ontology to contain the type Lunar crater while missing, by design, types that seem more useful, since they would likely cover a larger number of more “interesting” entities. By “interesting”, I mean in terms of their informative purpose when structured knowledge is used in applications like web search. Yes, I felt untouchable. I apologize. Indeed, one has to understand that behind this there are decisions made by experts in ontology design. Yet, to have you on my side, I teased that Physicist is not a type available in DBpedia. It is possible to indicate that Einstein was a Scientist, but Physicist is not among the few child types of Scientist. Physicist is actually nowhere in the DBpedia Ontology. Same as Mathematician, for example. (So, good luck to all of you physicists and mathematicians when trying to be typed as such in this ontology. We hope you don’t mind knowing that Biologist made it as a type there and you, well, you just have to wait a bit more for a newer version, hopefully. Welcome, Darwin, Pasteur, and Fleming. Sayonara, Mr. Newton!)
Entity retrieval
Entity retrieval is the task of obtaining a ranked list of entities relevant to a search query. Many queries are entity-oriented, that is, queries whose expected result is an entity or a list of entities, or queries that contain entity mentions. Ranking entities in response to a search query is then an important problem.
Type information is known to contribute to entity retrieval. As an illustration, consider the scenario of a user planning a trip across Europe, and so, wishing to know in which cities she can use the Uber car sharing service. After issuing the query “cities in europe where uber is available,” the search engine we envisage would return a list of entities, according to some ranking criteria, as shown in this figure:
To achieve this, the system could try to identify the type of entities (here, European cities) that the query is seeking, which we refer to as target entity types. This type-based information could then be combined with the types assigned to each entity in a reference knowledge base, in order to improve the ranking of results.
We conducted research around one main question: How can one exploit entity type information to improve entity retrieval? The concept of entity types, while seemingly straightforward, turns out to be a multifaceted research problem that, at the time of our work, had not been thoroughly investigated in the literature.
Historically, it was assumed that the user complements the keyword query with one or more target types, using Wikipedia categories as the type system. Further developments were motivated and driven by the peculiarities of Wikipedia’s category system. Prior to our work, it was not known whether the same methods prove effective, or even whether the same issues persist at all, with other type taxonomies. Hence, we consider and systematically compare multiple type taxonomies (DBpedia, Freebase, Wikipedia, and YAGO).
Additionally, there is the issue of representing entity type information, more specifically, to what extent the hierarchy of the taxonomy should be preserved. Three type representation modes were considered in our work: most specific types, top-level types, and all types along the path to the top type.
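To make the three representation modes concrete, here is a minimal Python sketch. The tiny taxonomy and the function names are made up for illustration (they are not the actual DBpedia hierarchy nor the code used in the thesis), and it assumes the root type itself is never counted as a type of an entity.

```python
# Hypothetical toy taxonomy, child -> parent. Not the real DBpedia Ontology.
PARENT = {
    "Scientist": "Person",
    "Person": "Agent",
    "Agent": "Thing",
    "City": "Place",
    "Place": "Thing",
}
ROOT = "Thing"  # assumed root; excluded from every representation


def path_to_root(t):
    """Return [t, parent of t, ..., top-level type], without ROOT."""
    path = []
    while t != ROOT:
        path.append(t)
        t = PARENT[t]
    return path


def all_types_along_path(assigned):
    """Every assigned type plus all of its ancestors below ROOT."""
    return {x for t in assigned for x in path_to_root(t)}


def top_level_types(assigned):
    """Only the ancestors sitting directly under ROOT."""
    return {path_to_root(t)[-1] for t in assigned if path_to_root(t)}


def most_specific_types(assigned):
    """Drop any assigned type that is an ancestor of another assigned type."""
    ancestors = set()
    for t in assigned:
        ancestors.update(path_to_root(t)[1:])
    return set(assigned) - ancestors


einstein = {"Scientist", "Person"}
print(most_specific_types(einstein))
print(top_level_types(einstein))
print(all_types_along_path(einstein))
```

With this toy hierarchy, an entity typed as both Scientist and Person keeps only Scientist in the most specific mode, only Agent in the top-level mode, and the whole chain in the third mode.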
A third question is how to combine type-based and text-based matching in the retrieval model. We studied three type-aware entity retrieval models: strict filtering, soft filtering, and interpolation.
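Roughly, the three models differ in how hard the type evidence is allowed to hit the text-based score. The sketch below is only an illustration under assumptions: the `type_match` function is a simple stand-in I made up here (the probability mass of target types that the entity has), the weight `lam` and the toy scores are hypothetical, and the exact formulations are the ones in the thesis, not these.

```python
def type_match(target_types, entity_types):
    """Toy type-based score: probability mass of the target types
    that the entity is assigned. An illustrative stand-in only."""
    return sum(p for t, p in target_types.items() if t in entity_types)


def strict_filtering(text_score, target_types, entity_types):
    """Keep the text score only if the entity has at least one target type."""
    return text_score if any(t in entity_types for t in target_types) else 0.0


def soft_filtering(text_score, target_types, entity_types):
    """Scale the text score by the type-based score."""
    return text_score * type_match(target_types, entity_types)


def interpolation(text_score, target_types, entity_types, lam=0.5):
    """Weighted sum of text-based and type-based scores (lam is assumed)."""
    return (1 - lam) * text_score + lam * type_match(target_types, entity_types)


def rank(candidates, target_types, model):
    """Sort entity names by the score the given model assigns them."""
    scored = {
        name: model(text_score, target_types, types)
        for name, (text_score, types) in candidates.items()
    }
    return sorted(scored, key=scored.get, reverse=True)


# Made-up text-based scores and types for two candidate entities.
candidates = {"Oslo": (0.4, {"City"}), "Uber": (0.6, {"Company"})}
target_types = {"City": 1.0}  # what a target type oracle could return

for model in (strict_filtering, soft_filtering, interpolation):
    print(model.__name__, rank(candidates, target_types, model))
```

In this toy example, the text-only score would put Uber first, while all three type-aware models move Oslo to the top. Strict filtering removes non-matching entities entirely, whereas interpolation still leaves them a score.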
Throughout our experiments, we make use of a so-called target entity types oracle. We assume that there is an “oracle” process in place that provides us with the (distribution of) correct target types for a given query. This corresponds to the setting that was employed at previous benchmarking campaigns, where target types are provided explicitly. We employ this idealized setting to ensure that our results reflect the full potential of using type information, without being hindered by the imperfections of an automated type detector.
We systematically evaluate all combinations of the three proposed dimensions (type taxonomies, type representation modes, and retrieval models), comparing them against a single, purely term-based baseline.
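To get a feeling for the size of that grid, the configurations can be enumerated in a few lines of Python. The labels are just the names used in this post; the sketch assumes every combination is run exactly once.

```python
from itertools import product

# The three dimensions, with the labels used in this post.
TAXONOMIES = ["DBpedia", "Freebase", "Wikipedia", "YAGO"]
REPRESENTATIONS = ["most specific", "top-level", "all along path"]
MODELS = ["strict filtering", "soft filtering", "interpolation"]

# One configuration per (taxonomy, representation, model) triple.
configurations = list(product(TAXONOMIES, REPRESENTATIONS, MODELS))

for taxonomy, representation, model in configurations:
    print(f"{taxonomy} | {representation} | {model}")
print(len(configurations), "configurations, each compared against the baseline")
```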
Some of the main results found in our research can be summarized as follows:
- Wikipedia, in combination with the most specific type representation, performs best;
- keeping only the most specific types in the hierarchy provides the best performance across the board, for all configurations;
- strict filtering with the most specific type representation is the best retrieval model.
Furthermore, we analyze particular configurations in detail at the level of individual queries, comparing each configuration against the corresponding (term-based) baseline.
If you are interested in more of this, please read Chapter 3 of my thesis Task-Based Support in Search Engines, for technical details of the problem, the terminology, the methodology, and the experimental results and analysis.