BibTeX entries are available on my DBLP profile.
Ph.D. thesis
-
Task-based Support in Search Engines
Garigliotti, Darío
2020
[Abstract]
[PDF]
[Slides]
Web search has become a key technology on which people rely on a daily basis.
The evolution of the search experience has also shaped the expectations of people about it.
Many users seem to expect today’s search engines to behave like a kind of “wise interpreter,” capable of understanding the meaning behind a search query, realizing its current context, and responding to it directly and appropriately.
Semantic search encompasses a large portion of information retrieval (IR) research devoted to studying more meaningful representations of the user information need.
Entity cards, direct displays, and verticals are examples of how major commercial search engines have indeed capitalized on query understanding.
Search is usually performed with a specific goal underlying the query.
In many cases, this goal consists of a nontrivial task to be completed.
Current search engines support a small set of basic tasks, and most of the knowledge-intensive workload for supporting more complex tasks is left to the user.
Task-based search can be viewed as an information access paradigm that aims to enhance search engines with functionalities for recognizing the underlying tasks in searches and providing support for task completion.
The research presented in this thesis focuses on utilizing and extending methods and techniques from semantic search in the next stage of the evolution: to support users in achieving their tasks.
Our work can be grouped in three grand themes:
(1) Entity type information for entity retrieval: we conduct a systematic evaluation and analysis of methods for type-aware entity retrieval, in terms of three main dimensions.
We revisit the problem of hierarchical target type identification, present a state-of-the-art supervised learning method, and analyze the usage of automatically identified target entity types for type-aware entity retrieval;
(2) Entity-oriented search intents: we propose a categorization scheme for entity-oriented search intents, and study the distributions of entity intent categories per entity type.
We develop a method for constructing a knowledge base of entity-oriented search intents;
and (3) Task-based search: we design a probabilistic generative framework for task-based query suggestion, and estimate each of its components in a principled way.
We introduce the problems of query-based task recommendation and mission-based task recommendation, and establish respective suitable baselines.
2025
-
MLJ
When Redundancy Matters: Machine Teaching of Representations
Ferri, Cèsar,
Garigliotti, Darío,
Håvardstun, Brigt,
Hernandez-Orallo, Jose,
and Telle, Jan Arne
In Springer Machine Learning Journal (Under review)
2025
[Abstract]
[PDF]
In traditional machine teaching, a teacher wants to teach a concept to a learner, by means of a finite set of examples, the witness set. But concepts can have many equivalent representations. This redundancy strongly affects the search space, to the extent that teacher and learner may not be able to easily determine the equivalence class of each representation. In this common situation, instead of teaching concepts, we explore the idea of teaching representations. We work with several teaching schemas that exploit representation and witness size (Eager, Greedy and Optimal) and analyze the gains in teaching effectiveness for some representational languages (DNF expressions and Turing-complete P3 programs). Our theoretical and experimental results indicate that there are various types of redundancy, handled better by the Greedy schema introduced here than by the Eager schema, although both can be arbitrarily far away from the Optimal. For P3 programs we found that witness sets are usually smaller than the programs they identify, which is an illuminating justification of why machine teaching from examples makes sense at all.
-
ECAI
Relative Drawing Identification Complexity is Invariant to Modality in Vision-Language Models
Freitas, Diogo,
Håvardstun, Brigt,
Ferri, Cèsar,
Garigliotti, Darío,
Telle, Jan Arne,
and Hernandez-Orallo, Jose
In 28th European Conference on Artificial Intelligence
2025
[Abstract]
[PDF]
Large language models have become multimodal, and many of them are said to integrate their modalities using common representations. If this were true, a drawing of a car as an image, for instance, should map to a similar area in the latent space as a textual description of the strokes that form the drawing. To explore this in a black-box access regime to these models, we propose the use of machine teaching, a theory that studies the minimal set of examples a teacher needs to choose so that the learner captures the concept. In this paper, we evaluate the complexity of teaching vision-language models a subset of objects in the Quick, Draw! dataset using two presentations: raw images as bitmaps and trace coordinates in TikZ format. The results indicate that image-based representations generally require fewer segments and achieve higher accuracy than coordinate-based representations. But, surprisingly, the teaching size usually ranks concepts similarly across both modalities, even when controlling for (a human proxy of) concept priors, suggesting that the simplicity of concepts may be an inherent property that transcends modality representations.
2024
-
ACL
SDG target detection in environmental reports using Retrieval-augmented Generation with LLMs
Garigliotti, Darío
In Proceedings of the 1st Workshop on Natural Language Processing Meets Climate Change (ClimateNLP), co-located with the 62nd Annual Meeting of the Association for Computational Linguistics (ACL)
2024
[Abstract]
[PDF]
With the consolidation of Large Language Models (LLMs) as a dominant component in approaches for multiple linguistic tasks, the interest in these technologies has greatly increased within a variety of areas and domains. Climate-aware NLP is one scenario of information needs in which these approaches can be exploited. Paradigmatically, the vast manual labour of inspecting long, heterogeneous documents to find environment-relevant expressions and claims fits well within the recently established Retrieval-augmented Generation (RAG) framework. In this paper, we tackle two dual problems within environment analysis that share the common goal of detecting a Sustainable Development Goal (SDG) target being addressed in a textual passage of an environmental assessment report. We develop relevant test collections, and propose and evaluate a series of methods within the general RAG pipeline, in order to assess the current capabilities of LLMs for the tasks of SDG target evidence identification and SDG target detection.
-
ISWC
On the Relevant Set of Contexts for Evaluating Retrieval-Augmented Generation Systems
Garigliotti, Darío
In 23rd International Semantic Web Conference
2024
[Abstract]
[PDF]
The recent interest in approaching more language and knowledge processing tasks via the Retrieval-Augmented Generation (RAG) framework allows for the consideration of evaluation criteria that can lead to a discrepancy in the way that the set of relevant results is determined for assessing retrieval-based performances. In this work, we describe and reflect on the consequences of such a discrepancy, and present basic results from experimentation over a RAG-based benchmark for Question Answering.
-
ISWC
Retrieval-Augmented Generation for Query Target Type Identification
Garigliotti, Darío
In 23rd International Semantic Web Conference
2024
[Abstract]
[PDF]
The paradigm shift unleashed by Entity-Oriented Search still characterizes a vast space of the dynamics with which users engage in digital information access, from Web search to e-commerce and social networks. The progress in research around Entity Retrieval tasks has in particular shown the convenience of incorporating type-based information for entities in their methods to provide relevant answers to queries. As types are typically accessible in an ontology of reference within the knowledge base where their assigned entities live, automatically identifying query target types is a relevant problem to tackle. In this work, we propose to address the task of Query Target Type Identification by assessing the capabilities of Large Language Models that have recently shown widespread success. Our experimentation with methods based on Retrieval-Augmented Generation over a purposely built test collection from the literature challenges a well-established closed LLM by presenting it with entity type information from a resource that serves as a core hub of Linked Open Data.
-
IDEAL
Evaluating Performance and Trustworthiness of RAG Systems for Generating Administrative Text
Sánchez-Navalón, Hugo,
Monserrat, Carlos,
Garigliotti, Darío,
and Ferri, Cèsar
In Intelligent Data Engineering and Automated Learning - IDEAL 2024 - 25th International Conference, Proceedings, Part I
2024
[Abstract]
[PDF]
As administrative language tends to be formal and exempt from double meanings or figurative expressions, it is a particular domain in which to explore the performance of Language Models. This paper presents a study on the feasibility of creating RAG systems based on administrative texts to serve as chatbots, analyzing the performance of several Small and Large Language Models on this task and defining ways of evaluating whether they hallucinate and whether they provide useful information to the user. Conventional metrics depending on ground truth labels, such as cosine similarity or those from the ROUGE family, are explored, as well as new approaches that use metrics less popular in text evaluation, such as Euclidean and Manhattan distances. Moreover, all those objective metrics are compared with a subjective Likert scale to assess their performance at solving real users’ problems and to find relations between subjective perceptions and objectively measured metrics for each of the RAG systems proposed. The results show that Small Language Models (such as NeuralChat) can perform as well as an LLM if RAG programming provides them with an appropriate context.
-
IDEAL
Automatic PDF Document Classification with Machine Learning
Luna, Sócrates Llácer,
Garigliotti, Darío,
Martínez-Plumed, Fernando,
and Ramirez, Cèsar Ferri
In Intelligent Data Engineering and Automated Learning - IDEAL 2024 - 25th International Conference, Proceedings, Part I
2024
[Abstract]
[PDF]
Universitat Politècnica de València (UPV) faces challenges in managing its Alfresco document repository, which contains 600,000 PDF files, of which only 100,000 are correctly categorised. Manual classification is laborious and error-prone, hindering information retrieval and advanced search capabilities. This project presents an automated pipeline that integrates optical character recognition (OCR) and machine learning to efficiently classify documents. Our approach distinguishes between scanned and digital documents, accurately extracts text and categorises it into 51 predefined categories using models such as BERT and RF. By improving document organisation and accessibility, this work optimises UPV’s document management and paves the way for advanced search technologies and real-time classification systems.
-
IDEAL
Entity Examples for Explainable Query Target Type Identification with LLMs
Garigliotti, Darío
In Intelligent Data Engineering and Automated Learning - IDEAL 2024 - 25th International Conference, Proceedings, Part II
2024
[Abstract]
[PDF]
When answering a user query with relevant entities from a knowledge base (KB), utilizing their semantic class or type information typically structured in the KB is known to improve the retrieval performance for these entities. Accordingly, it is important to identify the target types of entities expected by a query. This work addresses the task of Target Type Identification (TTI) by replacing the established supervisedly learnt ranking approach with a generative approach powered by Large Language Models (LLMs). Beyond assessing the ability of LLMs at predicting query target types, we study aspects of the strategy to elicit generation, in particular, the role of example relevant entities in supporting the explanation of mechanisms behind the LLM predictions.
-
RecSys
On Data Contamination in Recommender Systems
Garigliotti, Darío
In Proceedings of the 18th ACM Conference on Recommender Systems
2024
[Abstract]
[PDF]
The interest in studying the phenomenon of data contamination has recently increased due to the establishment of Large Language Models as the dominant technology for a vast array of information processing tasks. This position paper describes the reasons behind this increased awareness about data contamination and reflects on its possible implications for the field of Recommender Systems.
-
ECAI
EquinorQA: Large Language Models for Question Answering Over Proprietary Data
Garigliotti, Darío,
Johansen, Bjarte,
Kallestad, Jakob Vigerust,
Cho, Seong-Eun,
and Ferri, Cèsar
In 27th European Conference on Artificial Intelligence - Including 13th Conference on Prestigious Applications of Intelligent Systems (PAIS)
2024
[Abstract]
[PDF]
Large Language Models (LLMs) have become the state-of-the-art technology in a variety of language understanding tasks. Accordingly, many commercial organizations have been increasingly trying to integrate LLMs in multiple areas of their production and analytics. A typical scenario is the need for answering questions over a domain-specific, private collection of documents, such that the answer is supported by evidence clearly referenced from those documents. The Retrieval-Augmented Generation (RAG) framework has been recently used by many applications for this kind of scenario, as it intuitively bridges dedicated data collections and state-of-the-art generative models. Yet, LLMs are known to present data contamination, a phenomenon in which their performance on evaluation data relevant to a task is influenced by said data being already incorporated into the LLM during the training phase. In this paper, we assess the performance of LLMs within the domain of Equinor, the largest energy company in Norway. Specifically, we address question answering with a RAG-based approach over a novel data collection not available for well-established LLMs during training, in order to study the effect of data contamination for this task. Beyond shedding light on LLM performance for a highly demanded, realistic industrial scenario, we also analyze its potential impact for an ensemble of personas in Equinor with particular information needs and contexts.
-
ER
Self-explanatory Retrieval-Augmented Generation for SDG Evidence Identification
Garigliotti, Darío
In Advances in Conceptual Modeling - ER
2024
[Abstract]
[PDF]
With the establishment of the Sustainable Development Goals (SDG) framework, practitioners in environmental impact assessment have an increasing requirement to detect relevant information centered on this frame of reference. The task of automatically identifying evidence that supports the project actually addressing a particular SDG target becomes crucial for enabling assessment digitalization across long, heterogeneous documents. In this work, we tackle SDG evidence identification via the well-suited Retrieval-augmented Generation (RAG) approach powered by Large Language Models (LLM). The identified evidence may also support further related tasks in conceptual modeling where reports or parts of their content are to be assigned to entries in a structured resource such as a domain-specific ontology. Beyond the measurement of performance of a series of method configurations on this task, we also assess RAG abilities for making this kind of decision when the LLM is requested to explain its own mechanisms alongside the answer it generates. Our evaluation resources are made publicly available.
-
ECML
On the implications of data contamination for Information Retrieval systems
Garigliotti, Darío
In Proceedings of Machine Learning and Principles and Practice of Knowledge Discovery in Databases
2024
[Abstract]
[PDF]
Data contamination occurs when test instances have been compromised during a training stage of building a machine learning model. The consequences of this phenomenon over the quality of learning data are crucial when evaluating a learned predictor, since it could distort the assessment of the actual capabilities of the system. Its study has recently gained more traction in the research on Large Language Models, where it is common to chase performances in order to support claims about model abilities. Since the field of Information Retrieval increasingly studies and develops approaches that rely on these data-centric technologies, this position paper considers the phenomenon of data contamination in terms of its possible consequences for this field.
-
ECML
Explaining LLM-based Question Answering via the self-interpretations of a model
Garigliotti, Darío
In Proceedings of Machine Learning and Principles and Practice of Knowledge Discovery in Databases
2024
[Abstract]
[PDF]
As Large Language Models (LLMs) become increasingly ubiquitous in data-driven methods for multiple information processing tasks, the need to provide explainability mechanisms for these methods also becomes more significant. In this work, we tackle a paradigmatic instance of the family of Question Answering problems by means of a general approach based on Retrieval-augmented Generation (RAG). We focus not only on the performance for different parameter configurations but, in particular, on augmentation strategies that ask the generator LLM itself about its own interpretations behind the answer that it provides for a question.
-
ACL
Confounders in Instance Variation for the Analysis of Data Contamination
Mehrbakhsh, Behzad,
Garigliotti, Darío,
Martínez-Plumed, Fernando,
and Hernandez-Orallo, Jose
In Proceedings of the 1st Workshop on Data Contamination (CONDA), co-located with the 62nd Annual Meeting of the Association for Computational Linguistics (ACL)
2024
[Abstract]
[PDF]
Test contamination is a serious problem for the evaluation of large language models (LLMs) because it leads to the overestimation of their performance and a quick saturation of benchmarks, even before the actual capability is achieved. One strategy to address this issue is the (adversarial) generation of variations, by including different exemplars and different rephrasings of the questions. However, these two interventions can lead to instances that can be more difficult (accumulating on the expected loss of performance by partly removing the contamination) but also to instances that can be less difficult (cancelling the expected loss of performance), which would make contamination undetectable. Understanding these two phenomena in terms of instance difficulty is critical to determine and measure contamination. In this paper we conduct a comprehensive analysis of these two interventions on an addition task with fine-tuned LLAMA-2 models.
-
ISWC
Explainable LLM-powered RAG To Tackle Tasks In The Unstructured-structured Data Spectrum
Garigliotti, Darío
In 23rd International Semantic Web Conference - Proceedings of the Special Session on Harmonising Generative AI and Semantic Web Technologies (HGAIS 2024)
2024
[Abstract]
[PDF]
In the context of multiple spaces of research and application in text and information processing dominated by Large Language Models (LLMs), Retrieval-augmented Generation (RAG) provides a general framework with which to integrate external, explicit knowledge into the vast parametric knowledge of LLMs. In this paper, we present a crosspoint of tasks of diverse nature, maturity and level of cognitive challenge for an intelligent system, that nevertheless share in their analogies the suitability for being addressed by a similar RAG approach. Based on observations from several of our recent works, we reflect on the RAG framework, in particular about methods where the LLM is prompted with strategies to explain its generation output, across these tasks with components ranging from unstructured to structured data.
2023
-
WWW
Do bridges dream of water pollutants?: Towards DreamsKG, a knowledge graph to make digital access for sustainable environmental assessment come true
Garigliotti, Darío,
Bjerva, Johannes,
Nielsen, Finn Årup,
Butzbach, Annika,
Lyhne, Ivar,
Kørnøv, Lone,
and Hose, Katja
In ACM Web Conference 2023 - Proceedings of the Companion of the
2023
[Abstract]
[PDF]
An environmental assessment (EA) report describes and assesses the environmental impact of a series of activities involved in the development of a project. As such, EA is a key tool for sustainability. Improving information access to EA reporting is a billion-euro untapped business opportunity to build an engaging, efficient digital experience for EA. We aim to become a landmark initiative in making this experience come true, by transforming the traditional manual assessment of numerous heterogeneous reports by experts into a computer-assisted approach. Specifically, a knowledge graph that represents and stores facts about EA practice allows what is so far only accessible manually to become machine-readable and, by this, enables downstream information access services. This paper describes the ongoing process of building DreamsKG, a knowledge graph that stores relevant data- and expert-driven EA reporting and practicing in Denmark. Representation of cause-effect relations in EA and integration of Sustainable Development Goals (SDGs) are among its prominent features.
-
ESWC
Environmental impact assessment reports in Wikidata and a Wikibase
Nielsen, Finn Årup,
Lyhne, Ivar,
Garigliotti, Darío,
Butzbach, Annika,
Ravn Boess, Emilia,
Hose, Katja,
and Kørnøv, Lone
In Joint Proceedings of the ESWC 2023 Workshops and Tutorials co-located with 20th European Semantic Web Conference
2023
[Abstract]
[PDF]
Environmental impact assessment (EIA) is a required process for projects in many countries, which results in the preparation and publication of a detailed report. Such EIA reports describe impacts on, e.g., humans and the environment. In this paper, we describe our efforts in modeling the metadata of EIA reports and their description of environmental impacts and mitigations with Wikidata and Wikibase. We show that it is possible to record the bibliographic metadata of Danish EIA reports and in doing so link it to the rest of the Wikidata knowledge graph, allowing multilingual search. With our dedicated instance of a Wikibase, we show how EIA reports and their associated projects along with activities, impacts, recipients, and mitigations can be represented and how we can show aggregated views of the data based on SPARQL templates so that users can explore the data and make efficient use of it.
2022
-
NLE
Recommending tasks based on search queries and missions
Garigliotti, Dario,
Balog, Krisztian,
Hose, Katja,
and Bjerva, Johannes
Natural Language Engineering
2022
[Abstract]
[PDF]
Web search is an experience that naturally lends itself to recommendations, including query suggestions and related entities. In this article, we propose to recommend specific tasks to users, based on their search queries, such as planning a holiday trip or organizing a party. Specifically, we introduce the problem of query-based task recommendation and develop methods that combine well-established term-based ranking techniques with continuous semantic representations, including sentence representations from several transformer-based models. Using a purpose-built test collection, we find that our method is able to significantly outperform a strong text-based baseline. Further, we extend our approach to using a set of queries that all share the same underlying task, referred to as search mission, as input. The study is rounded off with a detailed feature and query analysis.
2021
-
SIGWEB Newsl.
Task-Based Support in Search Engines
Garigliotti, Darío
SIGWEB Newsl.
2021
[Abstract]
[PDF]
The research conducted by Garigliotti focused on utilizing and extending methods and techniques from semantic search within an information access paradigm that aims to support users in achieving their tasks. More specifically, it aims to enhance search engines with functionalities for recognizing the underlying tasks in searches and providing support for task completion. The work presented in this thesis is organized in three grand themes: entity type information for entity retrieval, entity-oriented search intents, and task-based search. Alongside the theoretical and empirical contributions, a number of resource contributions were developed, including several corpora and test collections, and a knowledge base of entity-oriented search intents.
-
SIGIR Forum
Task-Based Support in Search Engines
Garigliotti, Darío
SIGIR Forum
2021
[Abstract]
[PDF]
Web search has become a key technology on which people rely daily for getting information about almost everything. The evolution of the search experience has also shaped the expectations of people about it. Many users seem to expect today’s web search engines to behave like a kind of "wise interpreter," capable of understanding the meaning behind a search query, realizing its current context, and responding to it directly and appropriately. Search by meaning, or semantic search, encompasses a large portion of information retrieval (IR) research devoted to studying more meaningful representations of the information need expressed by the user query. Entity cards, direct displays, and verticals are examples of how major commercial search engines have indeed responded to user expectations, capitalizing on query understanding. Search is usually performed with a specific goal underlying the query. In many cases, this goal consists of a nontrivial task to be completed. Current search engines support a small set of basic tasks, and most of the knowledge-intensive workload for supporting more complex tasks is left to the user. Task-based search can be viewed as an information access paradigm that aims to enhance search engines with functionalities for recognizing the underlying tasks in searches and providing support for task completion. The research presented in this thesis focuses on utilizing and extending methods and techniques from semantic search in the next stage of the evolution of search engines, namely, to support users in achieving their tasks. Our work can be grouped in three grand themes: (1) Entity type information for entity retrieval: we conduct a systematic evaluation and analysis of methods for type-aware entity retrieval, in terms of three main dimensions. Also, we revisit the problem of hierarchical target type identification, present a state-of-the-art supervised learning method, and analyze the usage of automatically identified target entity types for type-aware entity retrieval; (2) Entity-oriented search intents: we propose a categorization scheme for entity-oriented search intents, and study the distributions of entity intent categories per entity type. We further develop a method for constructing a knowledge base of entity-oriented search intents; and (3) Task-based search: we design a probabilistic generative framework for task-based query suggestion, and estimate each of its components in a principled way. Furthermore, we introduce the problems of query-based task recommendation and mission-based task recommendation, and establish respective methods as suitable baselines.
2019
-
Semi-supervised Learning for Word Sense Disambiguation
Garigliotti, Darío
arXiv e-prints
2019
[Abstract]
[PDF]
This work is a study of the impact of multiple aspects in a classic unsupervised word sense disambiguation algorithm. We identify relevant factors in a decision rule algorithm, including the initial labeling of examples, the formalization of the rule confidence, and the criteria for accepting a decision rule. Some of these factors are only implicitly considered in the original literature. We then propose a lightly supervised version of the algorithm, and employ a pseudo-word-based strategy to evaluate the impact of these factors. The obtained performances are comparable with those of highly optimized formulations of the word sense disambiguation method.
-
NeuType: A Simple and Effective Neural Network Approach for Predicting Missing Entity Type Information in Knowledge Bases
Hovda, Jon Arne Bø,
Garigliotti, Darío,
and Balog, Krisztian
arXiv e-prints
2019
[Abstract]
[PDF]
Knowledge bases store information about the semantic types of entities, which can be utilized in a range of information access tasks. This information, however, is often incomplete, due to new entities emerging on a daily basis. We address the task of automatically assigning types to entities in a knowledge base from a type taxonomy. Specifically, we present two neural network architectures, which take short entity descriptions and, optionally, information about related entities as input. Using the DBpedia knowledge base for experimental evaluation, we demonstrate that these simple architectures yield significant improvements over the current state of the art.
-
ICTIR
Unsupervised Context Retrieval for Long-Tail Entities
Garigliotti, Darío,
Albakour, Dyaa,
Martinez, Miguel,
and Balog, Krisztian
In Proceedings of the ACM SIGIR International Conference on Theory of Information Retrieval
2019
[Abstract]
[PDF]
[Poster]
[Slides]
[Video]
Monitoring entities in media streams often relies on rich entity representations, like structured information available in a knowledge base (KB). For long-tail entities, such monitoring is highly challenging, due to their limited, if not entirely missing, representation in the reference KB. In this paper, we address the problem of retrieving textual contexts for monitoring long-tail entities. We propose an unsupervised method to overcome the limited representation of long-tail entities by leveraging established entities and their contexts as support information. Evaluation on a purpose-built test collection shows the suitability of our approach and its robustness for out-of-KB entities.
-
IRJ
Identifying and exploiting target entity type information for ad hoc entity retrieval
Garigliotti, Darío,
Hasibi, Faegheh,
and Balog, Krisztian
Information Retrieval Journal
2019
[Abstract]
[PDF]
[Repository]
Today, the practice of returning entities from a knowledge base in response to search queries has become widespread. One of the distinctive characteristics of entities is that they are typed, i.e., assigned to some hierarchically organized type system (type taxonomy). The primary objective of this paper is to gain a better understanding of how entity type information can be utilized in entity retrieval. We perform this investigation in two settings: firstly, in an idealized “oracle” setting, assuming that we know the distribution of target types of the relevant entities for a given query; and secondly, in a realistic scenario, where target entity types are identified automatically based on the keyword query. We perform a thorough analysis of three main aspects: (i) the choice of type taxonomy, (ii) the representation of hierarchical type information, and (iii) the combination of type-based and term-based similarity in the retrieval model. Using a standard entity search test collection based on DBpedia, we show that type information can significantly and substantially improve retrieval performance, yielding up to 67% relative improvement in terms of NDCG@10 over a strong text-only baseline in an oracle setting. We further show that using automatic target type detection, we can outperform the text-only baseline by 44% in terms of NDCG@10. This is as good as, and sometimes even better than, what is attainable by using explicit target type information provided by humans. These results indicate that identifying target entity types of queries is challenging even for humans and attest to the effectiveness of our proposed automatic approach.
2018
-
CIKM
IntentsKB: A Knowledge Base of Entity-Oriented Search Intents
Garigliotti, Darío,
and Balog, Krisztian
In Proceedings of the 27th ACM International Conference on Information and Knowledge Management
2018
[Abstract]
[PDF]
[Poster]
[Repository]
[Slides]
We address the problem of constructing a knowledge base of entity-oriented search intents. Search intents are defined on the level of entity types, each comprising a high-level intent category (property, website, service, or other), along with a cluster of query terms used to express that intent. These machine-readable statements can be leveraged in various applications, e.g., for generating entity cards or query recommendations. By structuring service-oriented search intents, we take one step towards making entities actionable. The main contribution of this paper is a pipeline of components we develop to construct a knowledge base of entity intents. We evaluate performance both component-wise and end-to-end, and demonstrate that our approach is able to generate high-quality data.
-
SIGIR
A Semantic Search Approach to Task-Completion Engines
Garigliotti, Darío
In The 41st International ACM SIGIR Conference on Research and Development in Information Retrieval
2018
[Abstract]
[PDF]
[Slides]
Web search has become a key technology in society. The increased engagement of users has raised their expectations, leading search engines to evolve towards understanding the semantics of information needs. The next paradigm shift is to support task completion, that is, to help the user complete her underlying goal when issuing a search query. In this research, I propose to study semantic search components suitable for task-based search. Our contributions address three main challenges. First, we conduct a systematic formalization and evaluation of aspects in utilizing entity type information for entity retrieval. Second, we approach the understanding of entity-oriented queries by categorizing search intents. Last, we develop methods for generating high-quality task-based query suggestions. We envisage that the three identified components can complement each other in supporting task completion.
-
ECIR
Towards an Understanding of Entity-Oriented Search Intents
Garigliotti, Darío,
and Balog, Krisztian
In Advances in Information Retrieval - Proceedings of the 40th European Conference on IR Research
2018
[Abstract]
[PDF]
[Poster]
[Repository]
Entity-oriented search deals with a wide variety of information needs, from displaying direct answers to interacting with services. In this work, we aim to understand what the prominent entity-oriented search intents are and how they can be fulfilled. We develop a scheme of entity intent categories, and use them to annotate a sample of queries. Specifically, we annotate unique query refiners on the level of entity types. We observe that, on average, over half of those refiners seek to interact with a service, while over a quarter of the refiners search for information that may be looked up in a knowledge base.
-
ECIR
Generating High-Quality Query Suggestion Candidates for Task-Based Search
Ding, Heng,
Zhang, Shuo,
Garigliotti, Darío,
and Balog, Krisztian
In Advances in Information Retrieval - Proceedings of the 40th European Conference on IR Research
2018
[Abstract]
[PDF]
[Poster]
[Repository]
We address the task of generating query suggestions for task-based search. The current state of the art relies heavily on suggestions provided by a major search engine. In this paper, we solve the task without reliance on search engines. Specifically, we focus on the first step of a two-stage pipeline approach, which is dedicated to the generation of query suggestion candidates. We present three methods for generating candidate suggestions and apply them to multiple information sources. Using a purpose-built test collection, we find that these methods are able to generate high-quality suggestion candidates.
2017
-
ICTIR
On Type-Aware Entity Retrieval
Garigliotti, Darío,
and Balog, Krisztian
In Proceedings of the ACM SIGIR International Conference on Theory of Information Retrieval
2017
[Abstract]
[PDF]
[Slides]
Today, the practice of returning entities from a knowledge base in response to search queries has become widespread. One of the distinctive characteristics of entities is that they are typed, i.e., assigned to some hierarchically organized type system (type taxonomy). The primary objective of this paper is to gain a better understanding of how entity type information can be utilized in entity retrieval. We perform this investigation in an idealized "oracle" setting, assuming that we know the distribution of target types of the relevant entities for a given query. We perform a thorough analysis of three main aspects: (i) the choice of type taxonomy, (ii) the representation of hierarchical type information, and (iii) the combination of type-based and term-based similarity in the retrieval model. Using a standard entity search test collection based on DBpedia, we find that type information proves most useful when using large type taxonomies that provide very specific types. We provide further insights on the extensional coverage of entities and on the utility of target types.
-
ICTIR
Learning to Rank Target Types for Entity-Bearing Queries
Garigliotti, Darío,
and Balog, Krisztian
In Proceedings of the 1st International Workshop on LEARning Next gEneration Rankers (LEARNER 2017), co-located with the 3rd ACM International Conference on the Theory of Information Retrieval (ICTIR 2017)
2017
[Abstract]
[PDF]
[Slides]
This paper revisits the learning-to-rank approach we proposed for automatically identifying the target entity types of queries. After presenting our contributions and results, we draw on the learned lessons and encountered challenges to identify directions for future enhancements.
-
SIGIR
Target Type Identification for Entity-Bearing Queries
Garigliotti, Darío,
Hasibi, Faegheh,
and Balog, Krisztian
In Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval
2017
[Abstract]
[PDF]
[Poster]
[Repository]
Identifying the target types of entity-bearing queries can help improve retrieval performance as well as the overall search experience. In this work, we address the problem of automatically detecting the target types of a query with respect to a type taxonomy. We propose a supervised learning approach with a rich variety of features. Using a purpose-built test collection, we show that our approach outperforms existing methods by a remarkable margin.
-
SIGIR
Generating Query Suggestions to Support Task-Based Search
Garigliotti, Darío,
and Balog, Krisztian
In Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval
2017
[Abstract]
[PDF]
[Poster]
[Repository]
We address the problem of generating query suggestions to support users in completing their underlying tasks (which motivated them to search in the first place). Given an initial query, these query suggestions should provide a coverage of possible subtasks the user might be looking for. We propose a probabilistic modeling framework that obtains keyphrases from multiple sources and generates query suggestions from these keyphrases. Using the test suites of the TREC Tasks track, we evaluate and analyze each component of our model.
-
SIGIR
Nordlys: A Toolkit for Entity-Oriented and Semantic Search
Hasibi, Faegheh,
Balog, Krisztian,
Garigliotti, Darío,
and Zhang, Shuo
In Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval
2017
[Abstract]
[PDF]
[Poster]
[Repository]
We introduce Nordlys, a toolkit for entity-oriented and semantic search. It provides functionality for entity cataloging, entity retrieval, entity linking, and target type identification. Nordlys may be used as a Python library or as a RESTful API, and also comes with a web-based user interface. The toolkit is open source and is available at http://nordlys.cc.
-
WSDM
Supervised Ranking of Triples for Type-Like Relations—The Cress Triple Scorer at the WSDM Cup 2017
Hasibi, Faegheh,
Garigliotti, Darío,
Zhang, Shuo,
and Balog, Krisztian
In WSDM Cup 2017 Notebook Papers, February 10, Cambridge, UK
2017
[Abstract]
[PDF]
[Poster]
This paper describes our participation in the Triple Scoring task of WSDM Cup 2017, which aims at ranking triples from a knowledge base for two type-like relations: profession and nationality. We introduce a supervised ranking method along with the features we designed for this task. Our system ranked first with respect to average score difference and second in terms of Kendall’s tau.
-
TREC
The University of Stavanger at the TREC 2016 Tasks Track
Garigliotti, Darío,
and Balog, Krisztian
In Proceedings of the Twenty-Fifth Text REtrieval Conference
2017
[Abstract]
[PDF]
This paper describes our participation in the Task understanding task of the Tasks track at TREC 2016. We introduce a general probabilistic framework in which we combine query suggestions from web search engines with keyphrases generated from top ranked documents. We achieved top performance among all submitted systems on both official evaluation metrics, which attests to the effectiveness of our approach.
2016
-
The University of Stavanger at the TREC 2016 Tasks Track
Garigliotti, Darío,
and Balog, Krisztian
In TREC 2016 Working Notes
2016
[Abstract]
[PDF]
This paper describes our participation in the Task understanding task of the Tasks track at TREC 2016. We introduce a general probabilistic framework in which we combine query suggestions from web search engines with keyphrases generated from top ranked documents.
2015
-
ESWC
Open Knowledge Extraction Challenge
Nuzzolese, Andrea Giovanni,
Gentile, Anna Lisa,
Presutti, Valentina,
Gangemi, Aldo,
Garigliotti, Darío,
and Navigli, Roberto
In Semantic Web Evaluation Challenges - Second SemWebEval Challenge at the 12th Extended Semantic Web Conference
2015
[Abstract]
[PDF]
The Open Knowledge Extraction (OKE) challenge is aimed at promoting research in the automatic extraction of structured content from textual data and its representation and publication as Linked Data. We designed two extraction tasks: (1) Entity Recognition, Linking and Typing and (2) Class Induction and Entity Typing. Four systems participated in the challenge: CETUS-FOX and FRED in both tasks, Adel in Task 1, and OAK@Sheffield in Task 2. In this paper we describe the OKE challenge, the tasks, the datasets used for training and evaluating the systems, the evaluation method, and the obtained results.
M.Sc. thesis
-
An Interactive System for the Interpretation of Specifications
(Original title in Spanish: Un Sistema Interactivo para la Interpretación de Especificaciones)
Garigliotti, Darío
2014
[Abstract]
[PDF]
In this work we study the problem of processing a software specification expressed in natural language.
We observe and classify linguistic phenomena over a corpus of example specifications.
We also explore several systems presented in the related literature, identifying their best features.
Building on these, we design and implement a system that interprets a specification, expressed in a very simple and informative format, and produces a labeled transition system.
The resolution strategy combines an interactive approach with ad hoc decision heuristics.
The representation is enriched by handling semantic phenomena.
We offer several guidelines for extending the system, in particular towards a model for annotating examples in an educational context.
2013
-
JAIIO
Semi-supervised Learning for Word Sense Disambiguation
(Original title in Spanish: Desambiguación de Palabras Polisémicas mediante Aprendizaje Semi-supervisado)
Garigliotti, Darío
In Annals of 42nd JAIIO - Argentine Journals of Informatics
2013
[Abstract]
[PDF]
[Poster]
[Slides]
This work is a systematic exploration of the impact of different aspects of classic algorithms for unsupervised word sense disambiguation.
After identifying the factors relevant to how they work, many of which were only implicit in the description of these algorithms, we implement a simplified, lightly supervised version of a classic decision-rule algorithm for unsupervised disambiguation.
We evaluate the impact of each of these factors on its performance, among them: the light initial labeling of examples, the reliability equation of a rule, and the acceptance criteria for such decision rules.
The results, obtained through an inexpensive yet powerful evaluation with pseudo-words, show acceptable performance in comparison with highly optimized versions, and lead us to propose promising future improvements.