Time for a short story. My Master’s degree studies started, as usual in my faculty, with a mandatory entrance course and exam that one has to pass in order to be admitted to the degree studies themselves. It is a course in the fundamentals of maths, designed to give everybody a common ground in the basics for the degrees in mathematics, astronomy, physics and computer science. One thing in particular has stuck in my mind since then. When we refreshed the notion of a set, one example was the set A = {Pi, the Eiffel tower}. I had studied sets and set operations in early primary school and had not used them again for eleven years or more, so the novelty for me back then was the idea that a set could contain elements of different types. Until then, my intuition was of a set of, say, apples, where everything is an apple. Or a set of numbers, where all are, well, numbers. Beyond that realization, this flexibility of sets was, I’d guess, never handy for any particular purpose, other than to highlight, by contrast, the nature of lists as a data structure where every element is of the same type. A similar notion of an ensemble as a “salad of stuff” would appear again years later, when I started working with knowledge repositories and knowledge bases.

Entities

There I had it: Wikipedia, a large set of disparate things containing, among others, Pi, the Eiffel tower, Eiffel (the architect), Eiffel (the programming language), Paris (France), Paris (Texas), Paris (Hilton), the Pythagorean theorem, the NBA, the 1996-97 NBA season, and the Thirty Years’ War. People, objects, cities, abstract concepts, events. Each of these things is uniquely identified, and so corresponds to an entity. Entities, such as persons, organizations, and locations, are natural units for organizing information. And as we will see later, entities are at the core of the current paradigm of web search, since they can provide not only more focused responses, but often immediate answers.

(Semi-)Structured Knowledge

A knowledge repository contains a collection of entities in a certain representation. Wikipedia is arguably the most widely known knowledge repository, where an entity is represented as a semi-structured article. That is, most of the document content is raw (or unstructured) text, with a small amount of structured information in the infoboxes, presented as entity cards within the articles. As a collaborative project, it allows a large number of contributors to combine their efforts, and it is used daily by millions of people around the world as a trusted reference in many areas of knowledge.

Wikipedia was created for human consumption. A knowledge base (KB), instead, contains structured knowledge, often extracted from free text, intended to be accessible to machines. In contrast with the unstructured nature of textual documents, this structured format usually represents knowledge as Resource Description Framework (RDF) triples of the shape (subject, predicate, object). Each of these triples declares a fact about the subject (an entity) by relating it to an object (a literal or another entity). For example, the fact that Oslo is the capital of Norway is represented by a triple (Oslo, isCapitalOf, Norway), where Oslo is the unique identifier of an entity in the knowledge base that corresponds to the city of Oslo, Norway uniquely identifies an entity that corresponds to Norway, and isCapitalOf is a predicate that captures the relationship between the subject and the object. There may be other predicates between such a subject and object, possibly more or less interesting or useful, like isPlacedIn or isTheLargestCityOf. The object could also be a literal, as when declaring the number of inhabitants of a city. The alternative name for knowledge bases, knowledge graphs, emphasizes the relationships between entities.
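To make the triple shape concrete, here is a minimal sketch in Python of a toy knowledge base holding the Oslo facts mentioned above. It is an illustration only, not how any real KB is stored: the names Entity, Triple, facts_about and predicates_between are made up for this post, and the predicate hasLabel is a hypothetical one added just to show a literal object. A fact about the number of inhabitants would have the same shape, with a number as the object.

```python
from typing import NamedTuple, Union


class Entity(NamedTuple):
    """A unique identifier for an entity in this toy knowledge base."""
    id: str


class Triple(NamedTuple):
    """One fact: (subject, predicate, object)."""
    subject: Entity
    predicate: str
    obj: Union[Entity, str, int, float]  # another entity, or a literal


OSLO = Entity("Oslo")
NORWAY = Entity("Norway")

KB = [
    Triple(OSLO, "isCapitalOf", NORWAY),
    Triple(OSLO, "isPlacedIn", NORWAY),
    Triple(OSLO, "isTheLargestCityOf", NORWAY),
    Triple(OSLO, "hasLabel", "Oslo"),  # hypothetical predicate, literal object
]


def facts_about(kb, subject):
    """Return every triple whose subject is the given entity."""
    return [t for t in kb if t.subject == subject]


def predicates_between(kb, subject, obj):
    """Return the predicates that relate a subject to a given object."""
    return [t.predicate for t in kb if t.subject == subject and t.obj == obj]


if __name__ == "__main__":
    for triple in facts_about(KB, OSLO):
        kind = "entity" if isinstance(triple.obj, Entity) else "literal"
        print(triple.predicate, "->", triple.obj, f"({kind})")
    print(predicates_between(KB, OSLO, NORWAY))
```

Seen this way, the "graph" in knowledge graph is just the set of triples whose object is another entity: each one is an edge labeled with its predicate.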

A number of general-purpose knowledge bases have been developed in recent years. DBpedia is widely used by researchers in areas such as Semantic Web and Information Retrieval. Built by processing the semi-structured data from Wikipedia, it serves as a hub that multiple other KBs link to because of its large repository of entities. Other cross-domain KBs include YAGO, Freebase, and Wikidata. Alongside these, there are also many purpose-specific knowledge bases, for example, GeoNames, WordNet, and DBLP, usually created for projects in specific domains. All these KBs are repositories of open data, freely available to everyone.

The increased availability of structured data published in knowledge repositories and knowledge bases is the pivotal component that sparked the evolution from the traditional paradigm of web search into semantic search, or entity-oriented search. What this current paradigm looks like will become clear below, when we see some examples that appear on a daily basis in our search results. But what do we mean by semantic search?

Web search is a human experience that has evolved very rapidly and with great impact. It has become a key technology on which people rely daily for getting information about almost everything. This evolution of the search experience has also shaped what people expect from it. Many users seem to expect today’s web search engines to behave like a kind of “wise interpreter,” capable of understanding the intent and meaning behind a search query, recognizing the user’s current context, and responding to it directly and appropriately. Search by meaning, or semantic search, rather than by just literal matches, therefore relies on semantically meaningful representations of the information need expressed by the user query.

Consider the following search result from Google:

The query, “oslo weather,” leads to a widget that reports recent weather conditions, as well as forecasts for the next twenty-four hours and the next seven days.

Now consider this other query:

The display contains the result of the latest Liverpool-Barcelona match, with some statistics, and provides a link to a video with the match highlights on YouTube.

Finally, consider this example:

The search result presents an interactive chart with the stock exchange performance of Tesla Inc., the company that the query “tsla” refers to.

These three examples correspond to direct displays, in which the search engine goes beyond the traditional ten blue links by aiming to understand the query intent and, when possible, returning a direct answer in the search engine result page. Note also that the verticals provide rich content appropriate to each query: maps and images for the weather, videos and images for the match, and finance and news for the stock performance.

Direct displays and verticals are examples of how major commercial search engines have indeed responded to user expectations, capitalizing on query semantics, i.e., query understanding, by introducing features that not only provide information directly but also encourage the user to keep interacting with the results page.

Knowledge panels are perhaps one of the most prominent features in the web search experience that reflect this recent evolution from search engines into answer engines. Below is the entity card that results from searching “oslo” in Google.

Oslo is described by some of its entity properties, retrieved from a knowledge base of reference: a brief abstract, a handful of literal values like elevation and population, and related entities like the University of Oslo. Given the large portion of web search queries that look for entities, entities and their properties (attributes, types, and relationships) are first-class citizens in our space of structured knowledge.

Entity types

A characteristic property of an entity is its type. For example, the entity Oslo is of type City. Other types for this entity are Settlement, a more general type than City, and Norwegian city, more specific than City.

An entity is assigned one or more types from a type system of reference. Types in a type system are usually arranged in hierarchical relationships, which is why such a system is also known as a type taxonomy or type hierarchy.
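As a small illustration of such a hierarchy, the sketch below stores the three types used for Oslo as a child-to-parent mapping and walks it upwards. It is a toy under stated assumptions: the names PARENT, ancestors and is_subtype_of are invented here, Settlement is treated as the root only for the sake of the example, and a real type taxonomy is far larger.

```python
# Toy type hierarchy as a child -> parent mapping (None marks the root).
PARENT = {
    "Norwegian city": "City",
    "City": "Settlement",
    "Settlement": None,  # treated as the root of this toy hierarchy
}


def ancestors(type_name, parent=PARENT):
    """Return the more general types of type_name, nearest first."""
    chain = []
    current = parent[type_name]
    while current is not None:
        chain.append(current)
        current = parent[current]
    return chain


def is_subtype_of(specific, general, parent=PARENT):
    """True if `specific` equals `general` or lies below it in the hierarchy."""
    return specific == general or general in ancestors(specific, parent)


if __name__ == "__main__":
    print(ancestors("Norwegian city"))
    print(is_subtype_of("Norwegian city", "Settlement"))
```

Assigning an entity its most specific type then implies all the more general ones: an entity typed as Norwegian city is also a City and a Settlement.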

Types such as City and Norwegian city are intuitive semantic classes that group entities. It is possible to think of many other types for a given entity. For example, Oslo is also a European city, and a Norwegian city with more than 100,000 inhabitants. Yet, in an actual scenario leveraging entity-oriented structured knowledge, the types available for an entity are usually the ones provided by the type taxonomy of reference. That is, depending on the type system that is used, a type that exists in a given taxonomy might not be available in another. As an example, the DBpedia Ontology is a rather small, manually curated type hierarchy of slightly more than 700 types in its current version.

A little digression. Even though the ontology is manually curated, its small size and the decisions behind its design mean that, as we said, some facts cannot be declared properly, or at all, because the needed types are missing from the taxonomy. On the other hand, some existing types might not look that useful when compared with missing types that seem to deserve priority in getting a spot. For example, for several versions of DBpedia, one of the types in this ontology has been Lunar crater. Several lunar craters (assuming that each eventually gets a corresponding entity in the knowledge base) would actually benefit from the existence of Lunar crater to declare their type, but keep this example in mind when we move to the next blog post (hint: physicist is not a type in the DBpedia Ontology). 700 types might not seem so few, but considering how fast a hierarchy branches out from the root of the ontology, it is indeed a rather small size.

Using the DBpedia Ontology as reference, it is only possible to indicate that Oslo is a City, but not a Norwegian city, since this more specific type does not exist in the DBpedia Ontology. The Wikipedia category graph, in contrast, contains a category Cities and towns in Norway. This very large graph of categories records hierarchical relationships, but is not strictly a taxonomy. Other type hierarchies used by the research community include the YAGO taxonomy and Freebase types. This variety of possible type references will be one of the dimensions studied in the work that I will describe in the next blog post, when we discuss type-aware entity retrieval, that is, retrieval of entities in web search that takes advantage of the entity type information available in knowledge bases.
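The effect of choosing one type system over another can be sketched in a few lines of Python. This is a hedged illustration, not the method used in the thesis: the function most_specific_available and the two type sets are hypothetical, and the sets are tiny stand-ins rather than the actual contents of the DBpedia Ontology or any other taxonomy.

```python
def most_specific_available(type_chain, taxonomy_types):
    """Return the first type in type_chain (ordered from most specific to
    most general) that exists in the given taxonomy, or None if none does."""
    for type_name in type_chain:
        if type_name in taxonomy_types:
            return type_name
    return None


# Types one might want for Oslo, from most specific to most general.
OSLO_TYPE_CHAIN = ["Norwegian city", "City", "Settlement"]

# Illustrative stand-ins for two type systems of different granularity.
SMALL_TAXONOMY = {"City", "Settlement"}
LARGER_TAXONOMY = {"Norwegian city", "City", "Settlement"}

if __name__ == "__main__":
    print(most_specific_available(OSLO_TYPE_CHAIN, SMALL_TAXONOMY))
    print(most_specific_available(OSLO_TYPE_CHAIN, LARGER_TAXONOMY))
```

With the smaller set, the best that can be said about Oslo is City; with the larger one, Norwegian city becomes available. The same entity thus ends up with different type information depending on the reference taxonomy.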


If you are interested in more of this, please read Sections 1.1 and 2.2 of my thesis Task-Based Support in Search Engines.