Shlomo Geva

Shlomo Geva is an author.

Publications

Only those publications related to wikis are shown here.
Title Keyword(s) Published in Language Date Abstract R C
An evaluation framework for cross-lingual link discovery Assessment
Cross-lingual link discovery
Evaluation framework
Evaluation metrics
Validation
Wikipedia
Information Processing and Management English 2014 Cross-Lingual Link Discovery (CLLD) is a new problem in Information Retrieval. The aim is to automatically identify meaningful and relevant hypertext links between documents in different languages. This is particularly helpful in knowledge discovery if a multi-lingual knowledge base is sparse in one language or another, or the topical coverage in each language is different; such is the case with Wikipedia. Techniques for identifying new and topically relevant cross-lingual links are a current topic of interest at NTCIR where the CrossLink task has been running since the 2011 NTCIR-9. This paper presents the evaluation framework for benchmarking algorithms for cross-lingual link discovery evaluated in the context of NTCIR-9. This framework includes topics, document collections, assessments, metrics, and a toolkit for pooling, assessment, and evaluation. The assessments are further divided into two separate sets: manual assessments performed by human assessors; and automatic assessments based on links extracted from Wikipedia itself. Using this framework we show that manual assessment is more robust than automatic assessment in the context of cross-lingual link discovery. 0 0
An english-translated parallel corpus for the CJK wikipedia collections Chinese
Corpus
Cross-lingual information retrieval
Cross-lingual link discovery
English
Japanese
Korean
Machine learning
Wikipedia
Proceedings of the 17th Australasian Document Computing Symposium, ADCS 2012 English 2012 In this paper, we describe a machine-translated parallel English corpus for the NTCIR Chinese, Japanese and Korean (CJK) Wikipedia collections. This document collection is named the CJK2E Wikipedia XML corpus. The corpus could be used by the information retrieval research community and for knowledge sharing in Wikipedia in many ways; for example, this corpus could be used for experimentation in cross-lingual information retrieval, cross-lingual link discovery, or omni-lingual information retrieval research. Furthermore, the translated CJK articles could be used to further expand the current coverage of the English Wikipedia. 0 0
Cross-lingual knowledge discovery: Chinese-to-English article linking in wikipedia Anchor identification
Chinese segmentation
Cross-lingual link discovery
Link mining
Link recommendation
Translation
Wikipedia
Lecture Notes in Computer Science English 2012 In this paper we examine automated Chinese to English link discovery in Wikipedia and the effects of Chinese segmentation and Chinese to English translation on the hyperlink recommendation. Our experimental results show that the implemented link discovery framework can effectively recommend Chinese-to-English cross-lingual links. The techniques described here can assist bi-lingual users where a particular topic is not covered in Chinese, is not equally covered in both languages, or is biased in one language; as well as for language learning. 0 0
Mobile applications of focused link discovery Focused Link Discovery
Mobile Information Seeking
User Studies Involving Documents
Wikipedia
ADCS 2011 - Proceedings of the Sixteenth Australasian Document Computing Symposium English 2011 Interaction with a mobile device remains difficult due to inherent physical limitations. This difficulty is particularly evident for search, which requires typing. We extend the One-Search-Only search paradigm by adding a novel link-browsing scheme built on top of automatic link discovery. A prototype was built for iPhone and tested with 12 subjects. A post-use interview survey suggests that the extended paradigm improves the mobile information seeking experience. 0 0
Overview of the INEX 2010 XML mining track: Clustering and classification of XML documents Classification
Clustering
Content
INEX
Structure
Wikipedia
XML document mining
Lecture Notes in Computer Science English 2011 This report explains the objectives, datasets and evaluation criteria of both the clustering and classification tasks set in the INEX 2010 XML Mining track. The report also describes the approaches and results obtained by participants. 0 0
Overview of the INEX 2010 ad hoc track Lecture Notes in Computer Science English 2011 This paper gives an overview of the INEX 2010 Ad Hoc Track. The main goals of the Ad Hoc Track were three-fold. The first goal was to study focused retrieval under resource restricted conditions such as a small screen mobile device or a document summary on a hit-list. This leads to variants of the focused retrieval tasks that address the impact of result length/reading effort, thinking of focused retrieval as a form of "snippet" retrieval. The second goal was to extend the ad hoc retrieval test collection on the INEX 2009 Wikipedia Collection with additional topics and judgments. For this reason the Ad Hoc track topics and assessments stayed unchanged. The third goal was to examine the trade-off between effectiveness and efficiency by continuing the Efficiency Track as a task in the Ad Hoc Track. The INEX 2010 Ad Hoc Track featured four tasks: the Relevant in Context Task, the Restricted Relevant in Context Task, the Restricted Focused Task, and the Efficiency Task. We discuss the setup of the track, and the results for the four tasks. 0 0
Overview of the INEX 2010 link the wiki track INEX English 2011 0 0
Topical and structural linkage in wikipedia ECIR English 2011 0 0
Overview of the INEX 2009 Ad hoc track Lecture Notes in Computer Science English 2010 This paper gives an overview of the INEX 2009 Ad Hoc Track. The main goals of the Ad Hoc Track were three-fold. The first goal was to investigate the impact of the collection scale and markup, by using a new collection that is again based on the Wikipedia but is over 4 times larger, with longer articles and additional semantic annotations. For this reason the Ad Hoc track tasks stayed unchanged, and the Thorough Task of INEX 2002-2006 returns. The second goal was to study the impact of more verbose queries on retrieval effectiveness, by using the available markup as structural constraints (now using both the Wikipedia's layout-based markup and the enriched semantic markup) and by the use of phrases. The third goal was to compare different result granularities by allowing systems to retrieve XML elements, ranges of XML elements, or arbitrary passages of text. This investigates the value of the internal document structure (as provided by the XML mark-up) for retrieving relevant information. The INEX 2009 Ad Hoc Track featured four tasks: For the Thorough Task a ranked-list of results (elements or passages) by estimated relevance was needed. For the Focused Task a ranked-list of non-overlapping results (elements or passages) was needed. For the Relevant in Context Task non-overlapping results (elements or passages) were returned grouped by the article from which they came. For the Best in Context Task a single starting point (element start tag or passage start) for each article was needed. We discuss the setup of the track, and the results for the four tasks. 0 0
Overview of the INEX 2009 XML mining track: Clustering and classification of XML documents Classification
Clustering
INEX
Structure and content
Wikipedia
XML document mining
Lecture Notes in Computer Science English 2010 This report explains the objectives, datasets and evaluation criteria of both the clustering and classification tasks set in the INEX 2009 XML Mining track. The report also describes the approaches and results obtained by the different participants. 0 0
Overview of the INEX 2009 link the Wiki track Anchor-to-BEP
Assessment
Evaluation
Focused Link Discovery
Wikipedia
Lecture Notes in Computer Science English 2010 In the third year of the Link the Wiki track, the focus has been shifted to anchor-to-bep link discovery. The participants were encouraged to utilize different technologies to resolve the issue of focused link discovery. Apart from the 2009 Wikipedia collection, the Te Ara collection was introduced for the first time in INEX. For the link the wiki tasks, 5000 file-to-file topics were randomly selected and 33 anchor-to-bep topics were nominated by the participants. The Te Ara collection does not contain hyperlinks and the task was to cross link the entire collection. A GUI tool for self-verification of the linking results was distributed. This helps participants verify the location of the anchor and bep. The assessment tool and the evaluation tool were revised to improve efficiency. Submission runs were evaluated against Wikipedia ground-truth and manual result set respectively. Focus-based evaluation was undertaken using a new metric. Evaluation results are presented and link discovery approaches are described. 0 0
Overview of the INEX 2009 link the wiki track Anchor-to-BEP
Assessment
Evaluation
Focused link discovery
Wikipedia
INEX English 2010 0 0
Overview of the INEX 2008 Ad hoc track Lecture Notes in Computer Science English 2009 This paper gives an overview of the INEX 2008 Ad Hoc Track. The main goals of the Ad Hoc Track were two-fold. The first goal was to investigate the value of the internal document structure (as provided by the XML mark-up) for retrieving relevant information. This is a continuation of INEX 2007 and, for this reason, the retrieval results are liberalized to arbitrary passages and measures were chosen to fairly compare systems retrieving elements, ranges of elements, and arbitrary passages. The second goal was to compare focused retrieval to article retrieval more directly than in earlier years. For this reason, standard document retrieval rankings have been derived from all runs, and evaluated with standard measures. In addition, a set of queries targeting Wikipedia have been derived from a proxy log, and the runs are also evaluated against the clicked Wikipedia pages. The INEX 2008 Ad Hoc Track featured three tasks: For the Focused Task a ranked-list of non-overlapping results (elements or passages) was needed. For the Relevant in Context Task non-overlapping results (elements or passages) were returned grouped by the article from which they came. For the Best in Context Task a single starting point (element start tag or passage start) for each article was needed. We discuss the results for the three tasks, and examine the relative effectiveness of element and passage retrieval. This is examined in the context of content only (CO, or Keyword) search as well as content and structure (CAS, or structured) search. Finally, we look at the ability of focused retrieval techniques to rank articles, using standard document retrieval techniques, both against the judged topics as well as against queries and clicks from a proxy log. 0 0
The importance of manual assessment in link discovery Assessment
Evaluation
INEX
Link discovery
Wikipedia
Proceedings - 32nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 2009 English 2009 Using a ground truth extracted from the Wikipedia, and a ground truth created through manual assessment, we show that the apparent performance advantage seen in machine learning approaches to link discovery is an artifact of trivial links that are actively rejected by manual assessors. 0 0
The methodology of manual assessment in the evaluation of link discovery Evaluation
Link quality
Manual assessment
Wikipedia
ADCS 2009 - Proceedings of the Fourteenth Australasian Document Computing Symposium English 2009 The link graph extracted from the Wikipedia has often been used as the ground truth for measuring the performance of automated link discovery systems. Extensive manual assessment experiments at INEX 2008 recently showed that this is unsound and that manual assessment is essential. This paper describes the methodology for link discovery evaluation which was developed for use in the INEX 2009 Link-the-Wiki track. In this approach both manual and automatic assessment sets are generated and runs are evaluated using both. The approach offers a more reliable evaluation of link discovery methods than just automatic assessment. A new evaluation measure for focused link discovery is also introduced. 0 0
Word segmentation for Chinese Wikipedia using N-gram mutual information Boundary confidence
Chinese word segmentation
Mutual information
N-gram mutual information
ADCS 2009 - Proceedings of the Fourteenth Australasian Document Computing Symposium English 2009 In this paper, we propose an unsupervised segmentation approach, named n-gram mutual information, or NGMI, which is used to segment Chinese documents into n-character words or phrases, using language statistics drawn from the Chinese Wikipedia corpus. The approach alleviates the tremendous effort that is required in preparing and maintaining the manually segmented Chinese text for training purposes, and manually maintaining ever expanding lexicons. Previously, mutual information was used to achieve automated segmentation into 2-character words. The NGMI approach extends the approach to handle longer n-character words. Experiments with heterogeneous documents from the Chinese Wikipedia collection show good results. 0 0
GPX: Ad-Hoc queries and automated link discovery in the Wikipedia GPX
INEX
Information retrieval
Link Discovery
XML
Lecture Notes in Computer Science English 2008 The INEX 2007 evaluation was based on the Wikipedia collection. In this paper we describe some modifications to the GPX search engine and the approach taken in the Ad-hoc and the Link-the-Wiki tracks. In earlier versions of GPX, scores were recursively propagated from text-containing nodes, through ancestors, all the way to the document root of the XML tree. In this paper we describe a simplification whereby the score of each node is computed directly, doing away with the score propagation mechanism. Results indicate slightly improved performance. The GPX search engine was used in the Link-the-Wiki track to identify prospective incoming links to new Wikipedia pages. We also describe a simple and efficient approach to the identification of prospective outgoing links in new Wikipedia pages. We present and discuss evaluation results. 0 0
Collaborative Knowledge Management: Evaluation of Automated Link Discovery in the Wikipedia English 2007 0 0
NLPX at INEX 2006 Lecture Notes in Computer Science English 2007 XML information retrieval (XML-IR) systems aim to better fulfil users' information needs than traditional IR systems by returning results lower than the document level. In order to use XML-IR systems users must encapsulate their structural and content information needs in a structured query. Historically, these structured queries have been formatted using formal languages such as NEXI. Unfortunately, formal query languages are very complex and too difficult to be used by experienced users, let alone casual ones, and are too closely bound to the underlying physical structure of the collection. INEX's NLP task investigates the potential of using natural language to specify structured queries. QUT has participated in the NLP task with our system NLPX since its inception. First, we discuss the changes we've made to NLPX since last year, including our efforts to port NLPX to Wikipedia. Second, we present the results from the 2006 INEX track where NLPX was the best performing participant in the Thorough and Focused tasks. 0 0