From WikiPapers
See also: List of datasets.

Dataset is included as a keyword or extra keyword in 0 datasets, 0 tools and 154 publications.


There are no datasets for this keyword.


There are no tools for this keyword.


Title Author(s) Published in Language Date Abstract R C
Cross-language and cross-encyclopedia article linking using mixed-language topic model and hypernym translation Wang Y.-C.
Wu C.-K.
Tsai R.T.-H.
52nd Annual Meeting of the Association for Computational Linguistics, ACL 2014 - Proceedings of the Conference English 2014 Creating cross-language article links among different online encyclopedias is now an important task in the unification of multilingual knowledge bases. In this paper, we propose a cross-language article linking method using a mixed-language topic model and hypernym translation features based on an SVM model to link English Wikipedia and Chinese Baidu Baike, the most widely used Wiki-like encyclopedia in China. To evaluate our approach, we compile a data set from the top 500 Baidu Baike articles and their corresponding English Wiki articles. The evaluation results show that our approach achieves 80.95% in MRR and 87.46% in recall. Our method does not heavily depend on linguistic characteristics and can be easily extended to generate cross-language article links among different online encyclopedias in other languages. 0 0
What makes a good team of Wikipedia editors? A preliminary statistical analysis Bukowski L.
Jankowski-Lorek M.
Jaroszewicz S.
Sydow M.
Lecture Notes in Computer Science English 2014 The paper concerns studying the quality of teams of Wikipedia authors with a statistical approach. We report the preparation of a dataset containing numerous behavioural and structural attributes and its subsequent analysis and use to predict team quality. We have performed exploratory analysis using partial regression to remove the influence of attributes not related to the team itself. The analysis confirmed that the key factor significantly influencing an article's quality is discussion between team members. The second part of the paper successfully uses machine learning models to predict good articles based on features of the teams that created them. 0 0
Exploiting the category structure of Wikipedia for entity ranking Rianne Kaptein
Jaap Kamps
Artificial Intelligence English 2013 The Web has not only grown in size, but also changed its character, due to collaborative content creation and an increasing amount of structure. Current Search Engines find Web pages rather than information or knowledge, and leave it to the searchers to locate the sought information within the Web page. A considerable fraction of Web searches contains named entities. We focus on how the Wikipedia structure can help rank relevant entities directly in response to a search request, rather than retrieve an unorganized list of Web pages with relevant but also potentially redundant information about these entities. Our results demonstrate the benefits of using topical and link structure over the use of shallow statistics. Our main findings are the following. First, we examine whether Wikipedia category and link structure can be used to retrieve entities inside Wikipedia as is the goal of the INEX (Initiative for the Evaluation of XML retrieval) Entity Ranking task. Category information proves to be a highly effective source of information, leading to large and significant improvements in retrieval performance on all data sets. Secondly, we study how we can use category information to retrieve documents for ad hoc retrieval topics in Wikipedia. We study the differences between entity ranking and ad hoc retrieval in Wikipedia by analyzing the relevance assessments. Considering retrieval performance, also on ad hoc retrieval topics we achieve significantly better results by exploiting the category information. Finally, we examine whether we can automatically assign target categories to ad hoc and entity ranking queries. Guessed categories lead to performance improvements that are not as large as when the categories are assigned manually, but they are still significant. We conclude that the category information in Wikipedia is a useful source of information that can be used for entity ranking as well as other retrieval tasks. 
© 2012 Elsevier B.V. All rights reserved. 0 0
Ontology-enriched multi-document summarization in disaster management using submodular function Wu K.
Li L.
Jing-Woei Li
Li T.
Information Sciences English 2013 In disaster management, a myriad of news and reports relevant to the disaster may be recorded in the form of text documents. A challenging problem is how to provide concise and informative reports from a large collection of documents, to help domain experts analyze the trend of the disaster. In this paper, we explore the feasibility of using a domain-specific ontology to facilitate the summarization task, and propose TELESUM, an ontology-enriched multi-document summarization approach, where the submodularity hidden among ontological concepts is investigated. Empirical experiments on the collection of press releases by the Miami-Dade County Department of Emergency Management during Hurricane Wilma in 2005 demonstrate the efficacy and effectiveness of TELESUM in disaster management. Further, our proposed framework can be extended to summarizing general documents by employing public ontologies, e.g., Wikipedia. Extensive evaluation of the generalized framework is conducted on the DUC04-05 datasets, and shows that our method is competitive with other approaches. © 2012 Elsevier Inc. All rights reserved. 0 0
Selecting features with SVM Rzeniewicz J.
Szymanski J.
Lecture Notes in Computer Science English 2013 A common problem with feature selection is to establish how many features should be retained at a minimum so that important information is not lost. We describe a method for choosing this number that makes use of Support Vector Machines. The method is based on controlling the angle by which the decision hyperplane is tilted due to feature selection. Experiments were performed on three text datasets generated from a Wikipedia dump. The amount of retained information was estimated by classification accuracy. Even though the method is parametric, we show that, as opposed to other methods, once its parameter is chosen it can be applied to a number of similar problems (e.g. one value can be used for various datasets originating from Wikipedia). For a constant value of the parameter, dimensionality was reduced by 78% to 90%, depending on the data set. The relative accuracy drop due to feature removal was less than 0.5% in those experiments. 0 0
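The tilt-angle criterion described in the abstract can be sketched as follows (a toy illustration, not the authors' implementation; in practice the weight vector would come from a linear SVM trained on the text data):

```python
import math

def tilt_angle(w, keep):
    """Angle (radians) between the full weight vector w and a copy of it
    with all but the `keep` highest-magnitude features zeroed out."""
    ranked = sorted(range(len(w)), key=lambda i: abs(w[i]), reverse=True)
    kept = set(ranked[:keep])
    w_trunc = [wi if i in kept else 0.0 for i, wi in enumerate(w)]
    dot = sum(a * b for a, b in zip(w, w_trunc))
    norm = math.sqrt(sum(a * a for a in w)) * math.sqrt(sum(b * b for b in w_trunc))
    return math.acos(max(-1.0, min(1.0, dot / norm)))

def min_features(w, max_angle):
    """Smallest number of retained features for which dropping the rest
    tilts the decision hyperplane by no more than max_angle radians."""
    for k in range(1, len(w) + 1):
        if tilt_angle(w, k) <= max_angle:
            return k
    return len(w)

w = [0.9, 0.05, -0.8, 0.1, 0.02]        # toy weight vector
print(min_features(w, max_angle=0.15))  # -> 2
```

With two dominant weights, two features already keep the hyperplane within the angle budget, which mirrors the paper's observation that aggressive reduction costs little accuracy.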
Streaming big data with self-adjusting computation Acar U.A.
Yirong Chen
DDFP 2013 - Proceedings of the 2013 ACM SIGPLAN Workshop on Data Driven Functional Programming, Co-located with POPL 2013 English 2013 Many big data computations involve processing data that changes incrementally or dynamically over time. Using existing techniques, such computations quickly become impractical. For example, computing the frequency of words in the first ten thousand paragraphs of a publicly available Wikipedia data set in a streaming fashion using MapReduce can take as much as a full day. In this paper, we propose an approach based on self-adjusting computation that can dramatically improve the efficiency of such computations. As an example, we can perform the aforementioned streaming computation in just a couple of minutes. 0 0
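The contrast between batch recomputation and a self-adjusting update can be illustrated with the abstract's own word-frequency example (a minimal sketch of the idea, not the paper's system):

```python
from collections import Counter

def word_counts(paragraphs):
    """Batch recomputation: cost proportional to the whole corpus."""
    c = Counter()
    for p in paragraphs:
        c.update(p.lower().split())
    return c

def apply_edit(counts, old_paragraph, new_paragraph):
    """Incremental update: subtract the old paragraph's words and add the
    new one's, so cost tracks the edit rather than the corpus size."""
    counts.subtract(Counter(old_paragraph.lower().split()))
    counts.update(Counter(new_paragraph.lower().split()))
    counts += Counter()  # drop zero and negative entries
    return counts
```

A change to one paragraph touches only that paragraph's words, which is the essence of why the self-adjusting version runs in minutes instead of a day.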
Use of transfer entropy to infer relationships from behavior Bauer T.L.
Colbaugh R.
Glass K.
Schnizlein D.
ACM International Conference Proceeding Series English 2013 This paper discusses the use of transfer entropy to infer relationships among entities. This is useful when one wants to understand relationships among entities but can only observe their behavior, not their direct interactions with one another. This is the kind of environment prevalent in network monitoring, where one can observe behavior coming into and leaving a network from many different hosts, but cannot directly observe which hosts are related to one another. In this paper, we show that networks of individuals inferred using the transfer entropy of Wikipedia editing behavior predict observed "ground truth" social networks. At low levels of recall, transfer entropy can extract these social networks with a precision approximately 20 times higher than would be expected by chance. We discuss the algorithm, the data set, and various parameter considerations when attempting to apply this algorithm to a data set. Copyright 2012 ACM. 0 0
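Transfer entropy itself can be estimated from a pair of discrete behaviour series with simple joint counts (a sketch with history length 1, not the paper's estimator or parameter choices):

```python
import math
from collections import Counter

def transfer_entropy(x, y):
    """TE(X -> Y) with history length 1, estimated from counts:
    sum over (y_t, y_{t-1}, x_{t-1}) of
    p(joint) * log2( p(y_t | y_{t-1}, x_{t-1}) / p(y_t | y_{t-1}) ).
    Positive values mean x's past helps predict y beyond y's own past."""
    triples = Counter(zip(y[1:], y[:-1], x[:-1]))
    pairs_yx = Counter(zip(y[:-1], x[:-1]))
    pairs_yy = Counter(zip(y[1:], y[:-1]))
    singles = Counter(y[:-1])
    n = len(y) - 1
    te = 0.0
    for (yt, yp, xp), c in triples.items():
        p_joint = c / n
        p_cond_full = c / pairs_yx[(yp, xp)]
        p_cond_self = pairs_yy[(yt, yp)] / singles[yp]
        te += p_joint * math.log2(p_cond_full / p_cond_self)
    return te
```

If y is a one-step delayed copy of x, TE(x to y) is large while TE(y to x) stays near zero, which is the asymmetry the relationship-inference method relies on.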
Wikipedia articles representation with matrix'u Szymanski J. Lecture Notes in Computer Science English 2013 In the article we evaluate different text representation methods used for the task of Wikipedia articles categorization. We present the Matrix'u application used for creating computational datasets of Wikipedia articles. The representations have been evaluated with SVM classifiers used for reconstructing human-made categories. 0 0
A Linked Data platform for mining software repositories Keivanloo I.
Forbes C.
Hmood A.
Erfani M.
Neal C.
Peristerakis G.
Rilling J.
IEEE International Working Conference on Mining Software Repositories English 2012 The mining of software repositories involves the extraction of both basic and value-added information from existing software repositories. The repositories will be mined to extract facts by different stakeholders (e.g. researchers, managers) and for various purposes. To avoid unnecessary pre-processing and analysis steps, sharing and integration of both basic and value-added facts are needed. In this research, we introduce SeCold, an open and collaborative platform for sharing software datasets. SeCold provides the first online software ecosystem Linked Data platform that supports data extraction and on-the-fly inter-dataset integration from major version control, issue tracking, and quality evaluation systems. In its first release, the dataset contains about two billion facts, such as source code statements, software licenses, and code clones from 18 000 software projects. In its second release the SeCold project will contain additional facts mined from issue trackers and versioning systems. Our approach is based on the same fundamental principle as Wikipedia: researchers and tool developers share analysis results obtained from their tools by publishing them as part of the SeCold portal and therefore make them an integrated part of the global knowledge domain. The SeCold project is an official member of the Linked Data dataset cloud and is currently the eighth largest online dataset available on the Web. 0 0
An automatic method of managing resources based on wikipedia Yu X.
Zhang Z.
Huang Z.
Journal of Computational Information Systems English 2012 This paper presents an unsupervised method to automatically build a resource space. It utilizes the rich content and structure of Wikipedia as background knowledge to automatically interpret and label documents and to construct the resource space. It combines the methods of sub-tree decomposition, community detection and statistical topic modeling. The results on three datasets demonstrate the efficiency of the proposed method. 0 0
Are buildings only instances? Exploration in architectural style categories Goel A.
Juneja M.
Jawahar C.V.
ACM International Conference Proceeding Series English 2012 Instance retrieval has emerged as a promising research area with buildings as the popular test subject. Given a query image or region, the objective is to find images in the database containing the same object or scene. There has been a recent surge in efforts in finding instances of the same building in challenging datasets such as the Oxford 5k dataset [19], Oxford 100k dataset and the Paris dataset [20]. We ascend one level higher and pose the question: Are Buildings Only Instances? Buildings located in the same geographical region or constructed in a certain time period in history often follow a specific method of construction. These architectural styles are characterized by certain features which distinguish them from other styles of architecture. We explore, beyond the idea of buildings as instances, the possibility that buildings can be categorized based on the architectural style. Certain characteristic features distinguish an architectural style from others. We perform experiments to evaluate how characteristic information obtained from low-level feature configurations can help in classification of buildings into architectural style categories. Encouraged by our observations, we mine characteristic features with semantic utility for different architectural styles from our dataset of European monuments. These mined features are of various scales, and provide an insight into what makes a particular architectural style category distinct. The utility of the mined characteristics is verified from Wikipedia. 0 0
Automatic classification and relationship extraction for multi-lingual and multi-granular events from Wikipedia Hienert D.
Wegener D.
Paulheim H.
CEUR Workshop Proceedings English 2012 Wikipedia is a rich data source for knowledge from all domains. As part of this knowledge, historical and daily events (news) are collected for different languages on special pages and in event portals. As only a small number of events are available in structured form in DBpedia, we extract these events with a rule-based approach from Wikipedia pages. In this paper we focus on three aspects: (1) extending our prior method for extracting events to a daily granularity, (2) the automatic classification of events and (3) finding relationships between events. As a result, we have extracted a data set of about 170,000 events covering different languages and granularities. On the basis of one language set, we have automatically built categories for about 70% of the events of another language set. For nearly every event, we have been able to find related events. 0 0
BabelNet: The automatic construction, evaluation and application of a wide-coverage multilingual semantic network Roberto Navigli
Ponzetto S.P.
Artificial Intelligence English 2012 We present an automatic approach to the construction of BabelNet, a very large, wide-coverage multilingual semantic network. Key to our approach is the integration of lexicographic and encyclopedic knowledge from WordNet and Wikipedia. In addition, Machine Translation is applied to enrich the resource with lexical information for all languages. We first conduct in vitro experiments on new and existing gold-standard datasets to show the high quality and coverage of BabelNet. We then show that our lexical resource can be used successfully to perform both monolingual and cross-lingual Word Sense Disambiguation: thanks to its wide lexical coverage and novel semantic relations, we are able to achieve state-of the-art results on three different SemEval evaluation tasks. © 2012 Elsevier B.V. 0 0
Cross-lingual knowledge linking across wiki knowledge bases Zhe Wang
Jing-Woei Li
Tang J.
WWW'12 - Proceedings of the 21st Annual Conference on World Wide Web English 2012 Wikipedia has become one of the largest knowledge bases on the Web. It attracted 513 million page views per day in January 2012. However, one critical issue for Wikipedia is that articles in different languages are very unbalanced. For example, the number of articles on Wikipedia in English has reached 3.8 million, while the number of Chinese articles is still less than half a million, and there are only 217 thousand cross-lingual links between articles of the two languages. On the other hand, there are more than 3.9 million Chinese Wiki articles on popular Chinese encyclopedias such as Baidu Baike. One important question is how to link the knowledge entries distributed across different knowledge bases. This will immensely enrich the information in the online knowledge bases and benefit many applications. In this paper, we study the problem of cross-lingual knowledge linking and present a linkage factor graph model. Features are defined according to some interesting observations. Experiments on the Wikipedia data set show that our approach can achieve a high precision of 85.8% with a recall of 88.1%. The approach found 202,141 new cross-lingual links between English Wikipedia and Baidu Baike. 0 0
Cross-modality correlation propagation for cross-media retrieval Zhai X.
Peng Y.
Jie Xiao
ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings English 2012 We consider the problem of cross-media retrieval, where the query and the retrieved results can be of different modalities. In this paper, we propose a novel cross-modality correlation propagation approach to simultaneously deal with positive correlation and negative correlation between media objects of different modalities, while existing works focus solely on the positive correlation. Negative correlation is very important because it provides the effective exclusive information. The correlation is modeled as must-link constraints and cannot-link constraints respectively. Furthermore, our approach is able to propagate the correlation between heterogeneous modalities. Experiments on the wikipedia dataset show the effectiveness of our cross-modality correlation propagation approach, compared with state-of-the-art methods. 0 0
Effective tag recommendation system based on topic ontology using Wikipedia and WordNet Subramaniyaswamy V.
Chenthur Pandian S.
International Journal of Intelligent Systems English 2012 In this paper, we propose a novel approach based on topic ontology for tag recommendation. The proposed approach intelligently generates tag suggestions for blogs. In this approach, we construct a topic ontology by enriching the set of categories in an existing small ontology, the Open Directory Project. To construct the topic ontology, a set of topics and their associated semantic relationships is identified automatically from corpus-based external knowledge resources such as Wikipedia and WordNet. The construction consists of two stages: concept acquisition and semantic relation extraction. In the first stage, a topic-mapping algorithm is developed to acquire concepts from the semantics of Wikipedia. A semantic similarity-clustering algorithm is used to compute the semantic similarity measure to group the set of similar concepts. The second is the semantic relation extraction algorithm, which derives associated semantic relations between the set of extracted topics from the lexical patterns between synsets in WordNet. A suitable software prototype is created to implement the topic ontology construction process. A Jena API framework is used to organize the set of extracted semantic concepts and their corresponding relationships in the form of a knowledge representation in Web Ontology Language. The Protégé tool then provides the platform to visualize the automatically constructed topic ontology. Using the constructed topic ontology, we can generate and suggest the most suitable tags for a new resource to users. The applicability of the topic ontology with a spreading activation algorithm supports efficient recommendation in practice and can recommend the most popular tags for a specific resource. The spreading activation algorithm assigns interest scores to the existing extracted blog content and tags. The weight of the tags is computed based on the activation score determined from the similarity between the topics in the constructed topic ontology and the content of the existing blogs. High-quality tags with the highest activation scores are recommended to the users. Finally, we conducted an experimental evaluation of our tag recommendation approach using a large set of real-world data sets. Our experimental results explore and compare the capabilities of our proposed topic ontology with the spreading activation tag recommendation approach against the existing AutoTag mechanism. We also discuss the improvement in precision and recall of recommended tags on the Delicious and BibSonomy data sets. The experiments show that tag recommendation using topic ontology results in folksonomy enrichment. Thus, we report the results of an experiment meant to improve the performance of the tag recommendation approach and its quality. 0 0
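The spreading activation step described in the abstract can be sketched as a pulse-propagation loop over an ontology graph (a toy model; the graph, decay factor and seed topics here are illustrative, not the paper's ontology):

```python
def spread_activation(graph, seeds, decay=0.5, iters=3):
    """Toy spreading activation: seed nodes start with activation 1.0 and
    push decayed activation to their neighbors for a few pulses. The
    highest-scoring tag nodes would then be recommended."""
    act = {n: 0.0 for n in graph}
    for s in seeds:
        act[s] = 1.0
    for _ in range(iters):
        new = dict(act)
        for n, nbrs in graph.items():
            if act[n] > 0 and nbrs:
                share = decay * act[n] / len(nbrs)
                for m in nbrs:
                    new[m] += share
        act = new
    return act
```

Topics connected to the seed accumulate activation while unrelated topics stay at zero, which is how the approach ranks candidate tags by interest score.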
Efficient updates for web-scale indexes over the cloud Antonopoulos P.
Konstantinou I.
Tsoumakos D.
Koziris N.
Proceedings - 2012 IEEE 28th International Conference on Data Engineering Workshops, ICDEW 2012 English 2012 In this paper, we present a distributed system which enables fast and frequent updates on web-scale Inverted Indexes. The proposed update technique allows incremental processing of new or modified data and minimizes the changes required to the index, significantly reducing the update time which is now independent of the existing index size. By utilizing Hadoop MapReduce, for parallelizing the update operations, and HBase, for distributing the Inverted Index, we create a high-performance, fully distributed index creation and update system. To the best of our knowledge, this is the first open source system that creates, updates and serves large-scale indexes in a distributed fashion. Experiments with over 23 million Wikipedia documents demonstrate the speed and robustness of our implementation: It scales linearly with the size of the updates and the degree of change in the documents and demonstrates a constant update time regardless of the size of the underlying index. Moreover, our approach significantly increases its performance as more computational resources are acquired: It incorporates a 15.4GB update batch to a 64.2GB indexed dataset in about 21 minutes using just 12 commodity nodes, 3.3 times faster compared to using two nodes. 0 0
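The incremental-update idea (apply only the delta of a changed document instead of rebuilding the index) can be illustrated with a minimal in-memory sketch; a real deployment would distribute the postings over HBase and parallelize updates with MapReduce as the paper describes:

```python
from collections import defaultdict

class IncrementalIndex:
    """Toy inverted index that applies document updates in place, so the
    update cost tracks the size of the change, not the index size."""
    def __init__(self):
        self.postings = defaultdict(set)  # term -> set of doc ids
        self.docs = {}                    # doc id -> set of its terms

    def upsert(self, doc_id, text):
        new_terms = set(text.lower().split())
        old_terms = self.docs.get(doc_id, set())
        for t in old_terms - new_terms:   # terms the edit removed
            self.postings[t].discard(doc_id)
            if not self.postings[t]:
                del self.postings[t]
        for t in new_terms - old_terms:   # terms the edit added
            self.postings[t].add(doc_id)
        self.docs[doc_id] = new_terms

    def search(self, term):
        return self.postings.get(term.lower(), set())
```

Only the symmetric difference between the old and new term sets is touched, which is why the update time is independent of the existing index size.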
Evaluating Textual Entailment Recognition for University Entrance Examinations Miyao Y.
Shima H.
Kanayama H.
Mitamura T.
ACM Transactions on Asian Language Information Processing English 2012 The present article addresses an attempt to apply questions in university entrance examinations to the evaluation of textual entailment recognition. Questions in several fields, such as history and politics, primarily test the examinee's knowledge in the form of choosing true statements from multiple choices. Answering such questions can be regarded as equivalent to finding evidential texts from a textbase such as textbooks and Wikipedia. Therefore, this task can be recast as recognizing textual entailment between a description in a textbase and a statement given in a question. We focused on the National Center Test for University Admission in Japan and converted questions into evaluation data for textual entailment recognition by using Wikipedia as a textbase. Consequently, it is revealed that nearly half of the questions can be mapped into textual entailment recognition; 941 text pairs were created from 404 questions from six subjects. This data set is provided for a subtask of NTCIR RITE (Recognizing Inference in Text), and 16 systems from six teams used the data set for evaluation. The evaluation results revealed that the best system achieved a correct answer ratio of 56%, which is significantly better than random. © 2012 ACM. 0 0
Expanding approach to information retrieval using semantic similarity analysis based on wordnet and wikipedia Fei Zhao
Fang F.
Yan F.
Jin H.
Zhang Q.
International Journal of Software Engineering and Knowledge Engineering English 2012 The performance of information retrieval (IR) systems greatly relies on textual keywords and retrieved documents. Inaccurate and incomplete retrieval results are often induced by query drift and ignorance of the semantic relationships among terms. Query expansion approaches attempt to incorporate expansion terms into the original query, such as unexplored words coming from pseudo-relevance feedback (PRF) or relevance feedback documents, or semantic words extracted from external corpora. In this paper, a semantic analysis-based query expansion method for information retrieval using WordNet and Wikipedia as corpora is proposed. We derive semantically related words from human knowledge repositories such as WordNet and Wikipedia, which are combined with words filtered by semantic mining from PRF documents. Our approach automatically generates a new semantic-based query from the original IR query. Experimental results on TREC datasets and the Google search engine show that the performance of information retrieval can be significantly improved using the proposed method over previous results. 0 0
Exploiting Wikipedia for cross-lingual and multilingual information retrieval Sorg P.
Philipp Cimiano
Data and Knowledge Engineering English 2012 In this article we show how Wikipedia as a multilingual knowledge resource can be exploited for Cross-Language and Multilingual Information Retrieval (CLIR/MLIR). We describe an approach we call Cross-Language Explicit Semantic Analysis (CL-ESA) which indexes documents with respect to explicit interlingual concepts. These concepts are considered as interlingual and universal and in our case correspond either to Wikipedia articles or categories. Each concept is associated to a text signature in each language which can be used to estimate language-specific term distributions for each concept. This knowledge can then be used to calculate the strength of association between a term and a concept which is used to map documents into the concept space. With CL-ESA we are thus moving from a Bag-Of-Words model to a Bag-Of-Concepts model that allows language-independent document representations in the vector space spanned by interlingual and universal concepts. We show how different vector-based retrieval models and term weighting strategies can be used in conjunction with CL-ESA and experimentally analyze the performance of the different choices. We evaluate the approach on a mate retrieval task on two datasets: JRC-Acquis and Multext. We show that in the MLIR settings, CL-ESA benefits from a certain level of abstraction in the sense that using categories instead of articles as in the original ESA model delivers better results. © 2012 Elsevier B.V. All rights reserved. 0 0
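The core ESA step (mapping a bag of words into an interlingual concept space) can be sketched as follows; the term-concept table here is a toy stand-in for the Wikipedia-derived association weights the paper estimates:

```python
import math

def esa_vector(doc_terms, term_concept):
    """Map a bag of words into concept space: each concept's weight is the
    sum of association strengths between the document's terms and that
    concept. term_concept is a toy {term: {concept: weight}} table."""
    vec = {}
    for t in doc_terms:
        for c, w in term_concept.get(t, {}).items():
            vec[c] = vec.get(c, 0.0) + w
    return vec

def cosine(u, v):
    """Cosine similarity between sparse concept vectors. Documents in
    different languages become comparable once both are mapped, since
    the concepts themselves are language-independent."""
    dot = sum(u[k] * v.get(k, 0.0) for k in u)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0
```

An English and a German document about the same topic end up with overlapping concept vectors, which is what makes cross-language retrieval a plain vector-space comparison.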
Exploring simultaneous keyword and key sentence extraction: Improve graph-based ranking using Wikipedia Xiaolong Wang
Lei Wang
Jing-Woei Li
Li S.
ACM International Conference Proceeding Series English 2012 Summarization and keyword selection are two important tasks in the NLP community. Although both aim to summarize the source articles, they are usually treated separately, using sentences or words. In this paper, we propose a two-level graph-based ranking algorithm to generate a summarization and extract keywords at the same time. Previous works have reached a consensus that important sentences are composed of important keywords. In this paper, we further study the mutual impact between them through context analysis. We use Wikipedia to build a two-level concept-based graph, instead of a traditional term-based graph, to express their homogeneous and heterogeneous relationships. We run PageRank and HITS on the graph to adjust both homogeneous and heterogeneous relationships. A more reasonable relatedness value is thus obtained for key sentence selection and keyword selection. We evaluate our algorithm on the TAC 2011 data set. The traditional term-based approach achieves a score of 0.255 in ROUGE-1 and a score of 0.037 in ROUGE-2, and our approach improves them to 0.323 and 0.048 respectively. 0 0
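The rank-propagation component of such graph-based methods can be illustrated with plain power-iteration PageRank over an adjacency dict (a one-level simplification; the paper's two-level concept graph and HITS pass are omitted):

```python
def pagerank(adj, damping=0.85, iters=50):
    """Power-iteration PageRank over {node: [out-neighbors]}. Each round,
    every node keeps (1 - damping)/N baseline mass and distributes a
    damped share of its rank evenly to its out-neighbors."""
    nodes = list(adj)
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iters):
        new = {n: (1 - damping) / len(nodes) for n in nodes}
        for n in nodes:
            out = adj[n]
            if not out:
                continue
            share = damping * rank[n] / len(out)
            for m in out:
                new[m] += share
        rank = new
    return rank
```

Nodes (sentences or keywords) that many other important nodes point to accumulate rank, which is the relatedness signal used for selection.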
Fast top-K similarity queries via matrix compression Low Y.
Zheng A.X.
ACM International Conference Proceeding Series English 2012 In this paper, we propose a novel method to efficiently compute the top-K most similar items given a query item, where similarity is defined by the set of items that have the highest vector inner products with the query. The task is related to the classical k-Nearest-Neighbor problem, and is widely applicable in a number of domains such as information retrieval, online advertising and collaborative filtering. Our method assumes an in-memory representation of the dataset and is designed to scale to query lengths of 100,000s of terms. Our algorithm uses a generalized Hölder's inequality to upper bound the inner product with the norms of the constituent vectors. We also propose a novel compression scheme that computes bounds for groups of candidate items, thereby speeding up computation and minimizing memory requirements per query. We conduct extensive experiments on the publicly available Wikipedia dataset, and demonstrate that, with a memory overhead of 21%, our method can provide 1-3 orders of magnitude improvement in query run-time compared to naive methods and state of the art competing methods. Our median top-10 word query time is 25 µs on 7.5 million words and 2.3 million documents. 0 0
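The pruning idea (upper-bounding an inner product by a product of norms) can be sketched with the p = q = 2 case of Hölder's inequality, i.e. Cauchy-Schwarz; the paper's group-wise compression of bounds is omitted here:

```python
import heapq
import math

def top_k(query, vectors, k):
    """Exact top-k by inner product with norm-based pruning. Candidates
    are visited in descending norm order; since q.v <= ||q|| * ||v||,
    once the bound drops below the current k-th best score no remaining
    candidate can enter the result."""
    qnorm = math.sqrt(sum(x * x for x in query))
    order = sorted(vectors.items(),
                   key=lambda kv: -math.sqrt(sum(x * x for x in kv[1])))
    heap = []  # min-heap of (score, id), size <= k
    for vid, v in order:
        bound = qnorm * math.sqrt(sum(x * x for x in v))
        if len(heap) == k and bound <= heap[0][0]:
            break  # no later (smaller-norm) vector can beat the k-th score
        score = sum(a * b for a, b in zip(query, v))
        if len(heap) < k:
            heapq.heappush(heap, (score, vid))
        elif score > heap[0][0]:
            heapq.heapreplace(heap, (score, vid))
    return sorted(heap, reverse=True)
```

The result is identical to a full scan, but low-norm candidates are never scored once the bound cannot beat the running k-th best.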
Group matrix factorization for scalable topic modeling Wang Q.
Cao Z.
Xu J.
Hua Li
SIGIR'12 - Proceedings of the International ACM SIGIR Conference on Research and Development in Information Retrieval English 2012 Topic modeling can reveal the latent structure of text data and is useful for knowledge discovery, search relevance ranking, document classification, and so on. One of the major challenges in topic modeling is to deal with large datasets and large numbers of topics in real-world applications. In this paper, we investigate techniques for scaling up the non-probabilistic topic modeling approaches such as RLSI and NMF. We propose a general topic modeling method, referred to as Group Matrix Factorization (GMF), to enhance the scalability and efficiency of the non-probabilistic approaches. GMF assumes that the text documents have already been categorized into multiple semantic classes, and there exist class-specific topics for each of the classes as well as shared topics across all classes. Topic modeling is then formalized as a problem of minimizing a general objective function with regularizations and/or constraints on the class-specific topics and shared topics. In this way, the learning of class-specific topics can be conducted in parallel, and thus the scalability and efficiency can be greatly improved. We apply GMF to RLSI and NMF, obtaining Group RLSI (GRLSI) and Group NMF (GNMF) respectively. Experiments on a Wikipedia dataset and a real-world web dataset, each containing about 3 million documents, show that GRLSI and GNMF can greatly improve RLSI and NMF in terms of scalability and efficiency. The topics discovered by GRLSI and GNMF are coherent and have good readability. Further experiments on a search relevance dataset, containing 30,000 labeled queries, show that the use of topics learned by GRLSI and GNMF can significantly improve search relevance. 0 0
Harnessing Wikipedia semantics for computing contextual relatedness Jabeen S.
Gao X.
Andreae P.
Lecture Notes in Computer Science English 2012 This paper proposes a new method of automatically measuring semantic relatedness by exploiting Wikipedia as an external knowledge source. The main contribution of our research is to propose a relatedness measure based on Wikipedia senses and hyperlink structure for computing contextual relatedness of any two terms. We have evaluated the effectiveness of our approach using three datasets and have shown that our approach competes well with other well known existing methods. 0 0
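One widely used hyperlink-based relatedness measure in this spirit is Milne and Witten's normalized link distance, shown here purely to illustrate link-structure relatedness rather than as the paper's exact formula:

```python
import math

def link_relatedness(links_a, links_b, n_articles):
    """Relatedness of two Wikipedia articles from the sets of articles
    linking to each. Returns 1.0 for identical in-link sets, 0.0 when
    the sets are disjoint; n_articles is the total article count."""
    inter = len(links_a & links_b)
    if inter == 0:
        return 0.0
    dist = ((math.log(max(len(links_a), len(links_b))) - math.log(inter))
            / (math.log(n_articles) - math.log(min(len(links_a), len(links_b)))))
    return max(0.0, 1.0 - dist)
```

Articles sharing many in-links relative to their popularity score close to 1, so term pairs can be compared through the senses (articles) they map to.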
Language models for keyword search over data graphs Mass Y.
Sagiv Y.
WSDM 2012 - Proceedings of the 5th ACM International Conference on Web Search and Data Mining English 2012 In keyword search over data graphs, an answer is a nonredundant subtree that includes the given keywords. This paper focuses on improving the effectiveness of that type of search. A novel approach that combines language models with structural relevance is described. The proposed approach consists of three steps. First, language models are used to assign dynamic, query-dependent weights to the graph. Those weights complement static weights that are pre-assigned to the graph. Second, an existing algorithm returns candidate answers based on their weights. Third, the candidate answers are re-ranked by creating a language model for each one. The effectiveness of the proposed approach is verified on a benchmark of three datasets: IMDB, Wikipedia and Mondial. The proposed approach outperforms all existing systems on the three datasets, which is a testament to its robustness. It is also shown that the effectiveness can be further improved by augmenting keyword queries with very basic knowledge about the structure. Copyright 2012 ACM. 0 0
Leave or stay: The departure dynamics of wikipedia editors Dell Zhang
Karl Prior
Mark Levene
Mao R.
Van Liere D.
Lecture Notes in Computer Science English 2012 In this paper, we investigate how Wikipedia editors leave the community, i.e., become inactive, from the following three aspects: (1) how long Wikipedia editors will stay active in editing; (2) which Wikipedia editors are likely to leave; and (3) what reasons would make Wikipedia editors leave. The statistical models built on Wikipedia edit log datasets provide insights about the sustainable growth of Wikipedia. 0 0
Link prediction on evolving data using tensor-based common neighbor Cui H. Proceedings - 2012 5th International Symposium on Computational Intelligence and Design, ISCID 2012 English 2012 Recently there has been increasing interest in researching links between objects in complex networks, which can be helpful in many data mining tasks. One of the fundamental lines of research on links between objects is link prediction. Many link prediction algorithms have been proposed and perform quite well; however, most of them only consider network structure in terms of traditional graph theory and lack information about the evolution of the network. In this paper we propose a novel tensor-based prediction method, which is designed in two steps: first, track time-dependent network snapshots in adjacency matrices, which form a multi-way tensor, using an exponential smoothing method; second, apply the Common Neighbor algorithm to compute the degree of similarity for each pair of nodes. This algorithm is quite different from the other tensor-based algorithms mentioned in this paper. To estimate the accuracy of our link prediction algorithm, we employ various popular datasets of social networks and information platforms, such as Facebook and Wikipedia networks. The results show that our link prediction algorithm performs better than the other tensor-based algorithms mentioned in this paper. 0 0
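As a rough illustration of the two steps this abstract describes (not the paper's actual implementation), the snapshot smoothing and Common Neighbor scoring might be sketched as follows; the toy snapshots and the smoothing factor are made up for the example:

```python
import numpy as np

def smooth_snapshots(snapshots, alpha=0.5):
    """Collapse a sequence of adjacency-matrix snapshots into one weighted
    matrix by exponential smoothing (more recent snapshots weigh more)."""
    smoothed = np.zeros_like(snapshots[0], dtype=float)
    for A in snapshots:
        smoothed = alpha * A + (1.0 - alpha) * smoothed
    return smoothed

def common_neighbor_scores(A):
    """Common Neighbor link-prediction scores: score(i, j) counts shared
    neighbors, which for a (weighted) adjacency matrix is A @ A."""
    S = A @ A
    np.fill_diagonal(S, 0.0)  # ignore self-links
    return S

# Three toy snapshots of a 4-node undirected network, oldest first.
snapshots = [
    np.array([[0,1,0,0],[1,0,1,0],[0,1,0,0],[0,0,0,0]], dtype=float),
    np.array([[0,1,1,0],[1,0,1,0],[1,1,0,0],[0,0,0,0]], dtype=float),
    np.array([[0,1,1,0],[1,0,1,1],[1,1,0,0],[0,1,0,0]], dtype=float),
]
scores = common_neighbor_scores(smooth_snapshots(snapshots))
# Node pairs with many recently shared neighbors get the highest scores.
```

Pairs with a high score but no current edge are the predicted future links.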
MOTIF-RE: Motif-based hypernym/hyponym relation extraction from wikipedia links Wei B.
Liu J.
Jun Ma
Zheng Q.
Weinan Zhang
Feng B.
Lecture Notes in Computer Science English 2012 Hypernym/hyponym relation extraction plays an essential role in taxonomy learning. Conventional methods based on lexico-syntactic patterns or machine learning usually make use of content-related features. In this paper, we find that the proportions of hyperlinks with different semantic types vary markedly across network motifs. Based on this observation, we propose MOTIF-RE, an algorithm for extracting hypernym/hyponym relations from Wikipedia hyperlinks. The extraction process consists of three steps: 1) build a directed graph from a set of domain-specific Wikipedia articles; 2) count the occurrences of each hyperlink in every three-node network motif and create a feature vector for every hyperlink; 3) train a classifier to identify the semantic relation of hyperlinks. We created three domain-specific Wikipedia article sets to test MOTIF-RE. Experiments on the individual datasets show that MOTIF-RE outperforms the baseline algorithm by about 30% in terms of F1-measure. Cross-domain experimental results are similar, which shows that MOTIF-RE has fairly good domain adaptability. 0 0
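The motif-counting step (step 2) can be approximated with a toy sketch. Reducing full three-node motif enumeration to the connection pattern of each third node, and the example edge list, are simplifying assumptions for illustration only, not the paper's exact feature set:

```python
def motif_features(edges, u, v):
    """For a directed hyperlink (u, v), count how often it appears in each
    three-node connection pattern: for every third node w, record which of
    the four possible edges touching w exist (u->w, w->u, v->w, w->v).
    The resulting counts form a feature vector for the edge (u, v)."""
    E = set(edges)
    nodes = {a for e in edges for a in e} - {u, v}
    counts = {}
    for w in nodes:
        pattern = (
            (u, w) in E, (w, u) in E,
            (v, w) in E, (w, v) in E,
        )
        if not any(pattern):
            continue  # w is disconnected from the pair; no three-node motif
        counts[pattern] = counts.get(pattern, 0) + 1
    return counts

# Hypothetical article-link graph: A links to B, C, D; C links to B; B to D.
edges = [("A", "B"), ("A", "C"), ("C", "B"), ("B", "D"), ("A", "D")]
features = motif_features(edges, "A", "B")
# Each key is a connection pattern of some third node w to the pair (A, B);
# the value is how many nodes w realize that pattern.
```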
Mining spatio-temporal patterns in the presence of concept hierarchies Anh L.V.Q.
Gertz M.
Proceedings - 12th IEEE International Conference on Data Mining Workshops, ICDMW 2012 English 2012 In the past, approaches to mining spatial and spatio-temporal data for interesting patterns have mainly concentrated on data obtained through observations and simulations where positions of objects, such as areas, vehicles, or persons, are collected over time. In the past couple of years, however, new datasets have been built by automatically extracting facts, as subject-predicate-object triples, from semistructured information sources such as Wikipedia. Recently some approaches, for example, in the context of YAGO2, have extended such facts by adding temporal and spatial information. The presence of such new data sources gives rise to new approaches for discovering spatio-temporal patterns. In this paper, we present a framework in support of the discovery of interesting spatio-temporal patterns from knowledge base datasets. Different from traditional approaches to mining spatio-temporal data, we focus on mining patterns at different levels of granularity by exploiting concept hierarchies, which are a key ingredient in knowledge bases.We introduce a pattern specification language and outline an algorithmic approach to efficiently determine complex patterns. We demonstrate the utility of our framework using two different real-world datasets from YAGO2 and the Website 0 0
Mining web query logs to analyze political issues Ingmar Weber
Garimella V.R.K.
Borra E.
Proceedings of the 3rd Annual ACM Web Science Conference, WebSci'12 English 2012 We present a novel approach to using anonymized web search query logs to analyze and visualize political issues. Our starting point is a list of politically annotated blogs (left vs. right). We use this list to assign a numerical political leaning to queries leading to clicks on these blogs. Furthermore, we map queries to Wikipedia articles and to fact-checked statements from, as well as applying sentiment analysis to search results. With this rich, multi-faceted data set we obtain novel graphical visualizations of issues and discover connections between the different variables. Our findings include (i) an interest in "the other side", where queries about Democrat politicians have a right leaning and vice versa, (ii) evidence that "lies are catchy" and that queries pertaining to false statements are more likely to attract large volumes, and (iii) the observation that the more right-leaning a query is, the more negative sentiments can be found in its search results. Copyright 0 0
Modeling topic hierarchies with the recursive Chinese restaurant process Kim J.H.
Kim D.
Soo-Hwan Kim
Oh A.
ACM International Conference Proceeding Series English 2012 Topic models such as latent Dirichlet allocation (LDA) and hierarchical Dirichlet processes (HDP) are simple solutions to discover topics from a set of unannotated documents. While they are simple and popular, a major shortcoming of LDA and HDP is that they do not organize the topics into a hierarchical structure which is naturally found in many datasets. We introduce the recursive Chinese restaurant process (rCRP) and a nonparametric topic model with rCRP as a prior for discovering a hierarchical topic structure with unbounded depth and width. Unlike previous models for discovering topic hierarchies, rCRP allows the documents to be generated from a mixture over the entire set of topics in the hierarchy. We apply rCRP to a corpus of New York Times articles, a dataset of MovieLens ratings, and a set of Wikipedia articles and show the discovered topic hierarchies. We compare the predictive power of rCRP with LDA, HDP, and nested Chinese restaurant process (nCRP) using heldout likelihood to show that rCRP outperforms the others. We suggest two metrics that quantify the characteristics of a topic hierarchy to compare the discovered topic hierarchies of rCRP and nCRP. The results show that rCRP discovers a hierarchy in which the topics become more specialized toward the leaves, and topics in the immediate family exhibit more affinity than topics beyond the immediate family. 0 0
Models for efficient semantic data storage demonstrated on concrete example of DBpedia Lasek I.
Vojtas P.
CEUR Workshop Proceedings English 2012 In this paper, we introduce a benchmark to test the efficiency of the RDF data model for data storage and querying on a concrete dataset. We created Czech DBpedia - a freely available dataset composed of data extracted from Czech Wikipedia. During the creation and querying of this dataset, however, we faced problems caused by the limited performance of the RDF storage used. We designed metrics to measure the efficiency of data storage approaches. Our metric quantifies the impact of data decomposition into RDF triples. Results of our benchmark applied to the dataset of Czech DBpedia are presented. 0 0
Multi-aspect query summarization by composite query Song W.
Yu Q.
Xu Z.
Liu T.
Li S.
Wen J.-R.
SIGIR'12 - Proceedings of the International ACM SIGIR Conference on Research and Development in Information Retrieval English 2012 Conventional search engines usually return a ranked list of web pages in response to a query. Users have to visit several pages to locate the relevant parts. A promising future search scenario should involve: (1) understanding user intents; (2) providing relevant information directly to satisfy searchers' needs, as opposed to relevant pages. In this paper, we present a search paradigm to summarize a query's information from different aspects. Query aspects could be aligned to user intents. The generated summaries for query aspects are expected to be both specific and informative, so that users can easily and quickly find relevant information. Specifically, we use a "Composite Query for Summarization" method, where a set of component queries are used to provide additional information for the original query. The system leverages the search engine to proactively gather information by submitting multiple component queries according to the original query and its aspects. In this way, we can get more relevant information for each query aspect and roughly classify the information. By comparatively mining the search results of different component queries, the system is able to identify query-dependent aspect words, which help to generate more specific and informative summaries. The experimental results on two data sets, Wikipedia and TREC ClueWeb2009, are encouraging. Our method outperforms two baseline methods on generating informative summaries. 0 0
Name-ethnicity classification and ethnicity-sensitive name matching Treeratpituk P.
Giles C.L.
Proceedings of the National Conference on Artificial Intelligence English 2012 Personal names are important and common information in many data sources, ranging from social networks and news articles to patient records and scientific documents. They are often used as queries for retrieving records and also as key information for linking documents from multiple sources. Matching personal names can be challenging due to variations in spelling and varied formatting of names. While many approximate name matching techniques have been proposed, most are generic string-matching algorithms. Unlike other types of proper names, personal names are highly cultural. Many ethnicities have their own unique naming systems and identifiable characteristics. In this paper we explore such relationships between ethnicities and personal names to improve name matching performance. First, we propose a name-ethnicity classifier based on multinomial logistic regression. Our model can effectively identify name-ethnicity from personal names in Wikipedia, which we use to define name-ethnicity, with 85% accuracy. Next, we propose a novel alignment-based name matching algorithm, based on the Smith-Waterman algorithm and logistic regression. Different name matching models are then trained for different name-ethnicity groups. Our preliminary experimental result on DBLP's disambiguated author dataset yields a performance of 99% precision and 89% recall. Surprisingly, textual features carry more weight than phonetic ones in name-ethnicity classification. Copyright © 2012, Association for the Advancement of Artificial Intelligence. All rights reserved. 0 0
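The Smith-Waterman algorithm the authors build on is a standard local-alignment method; a minimal sketch of its score computation (with arbitrary match/mismatch/gap weights, not the paper's trained, ethnicity-specific parameters) could look like this:

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-1):
    """Smith-Waterman local alignment score between two strings.
    Returns the best local alignment score; higher means more similar."""
    rows, cols = len(a) + 1, len(b) + 1
    H = [[0] * cols for _ in range(rows)]
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            s = match if a[i - 1] == b[j - 1] else mismatch
            H[i][j] = max(0,
                          H[i - 1][j - 1] + s,  # match/mismatch
                          H[i - 1][j] + gap,    # gap in b
                          H[i][j - 1] + gap)    # gap in a
            best = max(best, H[i][j])
    return best

# Spelling variants of the same surname align with a high local score.
print(smith_waterman("nguyen", "nguyn"))  # → 9 (one deleted letter)
print(smith_waterman("nguyen", "smith"))  # → 0 (no letters in common)
```

In the paper's setting, the substitution and gap weights would be learned per name-ethnicity group rather than fixed as here.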
Named entity disambiguation based on explicit semantics Jacala M.
Tvarozek J.
Lecture Notes in Computer Science English 2012 In our work we present an approach to the Named Entity Disambiguation based on semantic similarity measure. We employ existing explicit semantics present in datasets such as Wikipedia to construct a disambiguation dictionary and vector-based word model. The analysed documents are transformed into semantic vectors using explicit semantic analysis. The relatedness is computed as cosine similarity between the vectors. The experimental evaluation shows that the proposed approach outperforms traditional approaches such as latent semantic analysis. 0 0
Named entity recognition an aid to improve multilingual entity filling in language-independent approach Bhagavatula M.
Santosh G.S.K.
Vasudeva Varma
International Conference on Information and Knowledge Management, Proceedings English 2012 This paper details an approach to identify Named Entities (NEs) from a large non-English corpus and associate them with appropriate tags, requiring minimal human intervention and no linguistic expertise. The main objective of this paper is to focus on Indian languages like Telugu, Hindi, Tamil, Marathi, etc., which are considered to be resource-poor languages when compared to English. The inherent structure of Wikipedia was exploited in developing an efficient co-occurrence-frequency-based NE identification algorithm for Indian languages. We describe the methods by which English Wikipedia data can be used to bootstrap the identification of NEs in other languages, which generates a list of NEs. Later, the paper focuses on utilizing this NE list to improve multilingual Entity Filling, which showed promising results. On a dataset of 2,622 Marathi Wikipedia articles, with around 10,000 NEs manually tagged, an F-measure of 81.25% was achieved by our system without availing of language expertise. Similarly, an F-measure of 80.42% was achieved on around 12,000 NEs tagged within 2,935 Hindi Wikipedia articles. Copyright 2012 ACM. 0 0
Online sharing and integration of results from mining software repositories Keivanloo I. Proceedings - International Conference on Software Engineering English 2012 The mining of software repositories involves the extraction of both basic and value-added information from existing software repositories. Depending on the stakeholders (e.g., researchers, management), these repositories are mined several times for different application purposes. To avoid unnecessary pre-processing steps and improve productivity, the sharing and integration of extracted facts and results are needed. The motivation of this research is to introduce a novel collaborative sharing platform for software datasets that supports on-the-fly inter-dataset integration. We want to facilitate and promote a paradigm shift in the source code analysis domain, similar to the one brought about by Wikipedia in the knowledge-sharing domain. In this paper, we present the SeCold project, which is the first online, publicly available software ecosystem Linked Data dataset. As part of this research, not only is theoretical background on how to publish such datasets provided, but also the actual dataset. SeCold contains about two billion facts, such as source code statements, software licenses, and code clones from over 18,000 software projects. SeCold is also an official member of the Linked Data cloud and one of the eight largest online Linked Data datasets available on the cloud. 0 0
Overview of metrics and their correlation patterns for multiple-metric topology analysis on heterogeneous graph ensembles Bounova G.
De Weck O.
Physical Review E - Statistical, Nonlinear, and Soft Matter Physics English 2012 This study is an overview of network topology metrics and a computational approach to analyzing graph topology via multiple-metric analysis on graph ensembles. The paper cautions against studying single metrics or combining disparate graph ensembles from different domains to extract global patterns. This is because there often exists considerable diversity among graphs that share any given topology metric, patterns vary depending on the underlying graph construction model, and many real data sets are not actual statistical ensembles. As real data examples, we present five airline ensembles, comprising temporal snapshots of networks of similar topology. Wikipedia language networks are shown as an example of a nontemporal ensemble. General patterns in metric correlations, as well as exceptions, are discussed by representing the data sets via hierarchically clustered correlation heat maps. Most topology metrics are not independent and their correlation patterns vary across ensembles. In general, density-related metrics and graph distance-based metrics cluster and the two groups are orthogonal to each other. Metrics based on degree-degree correlations have the highest variance across ensembles and cluster the different data sets on par with principal component analysis. Namely, the degree correlation, the s metric, their elasticities, and the rich club moments appear to be most useful in distinguishing topologies. 0 0
Predicting user tags using semantic expansion Chandramouli K.
Piatrik T.
Izquierdo E.
Communications in Computer and Information Science English 2012 Manually annotating content such as Internet videos is an intellectually expensive and time-consuming process. Furthermore, keywords and community-provided tags lack consistency and present numerous irregularities. Addressing the challenge of simplifying and improving the process of tagging online videos, which is potentially not bound to any particular domain, in this paper we present an algorithm for predicting user tags from the associated textual metadata. Our approach is centred on extracting named entities by exploiting complementary textual resources such as Wikipedia and WordNet. More specifically, to facilitate the extraction of semantically meaningful tags from a largely unstructured textual corpus, we developed a natural language processing framework based on the GATE architecture. Extending the functionalities of the built-in GATE named entities, the framework integrates a bag-of-articles algorithm for effectively searching through Wikipedia articles to extract relevant ones. The proposed framework has been evaluated against the MediaEval 2010 Wild Wild Web dataset, which consists of a large collection of Internet videos. 0 0
Probabilistic deduplication for cluster-based storage systems Frey D.
Kermarrec A.-M.
Kloudas K.
Proceedings of the 3rd ACM Symposium on Cloud Computing, SoCC 2012 English 2012 The need to backup huge quantities of data has led to the development of a number of distributed deduplication techniques that aim to reproduce the operation of centralized, single-node backup systems in a cluster-based environment. At one extreme, stateful solutions rely on indexing mechanisms to maximize deduplication. However the cost of these strategies in terms of computation and memory resources makes them unsuitable for large-scale storage systems. At the other extreme, stateless strategies store data blocks based only on their content, without taking into account previous placement decisions, thus reducing the cost but also the effectiveness of deduplication. In this work, we propose, Produck, a stateful, yet lightweight cluster-based backup system that provides deduplication rates close to those of a single-node system at a very low computational cost and with minimal memory overhead. In doing so, we provide two main contributions: a lightweight probabilistic node-assignment mechanism and a new bucket-based load-balancing strategy. The former allows Produck to quickly identify the servers that can provide the highest deduplication rates for a given data block. The latter efficiently spreads the load equally among the nodes. Our experiments compare Produck against state-of-the-art alternatives over a publicly available dataset consisting of 16 full Wikipedia backups, as well as over a private one consisting of images of the environments available for deployment on the Grid5000 experimental platform. Our results show that, on average, Produck provides (i) up to 18% better deduplication compared to a stateless minhash-based technique, and (ii) an 18-fold reduction in computational cost with respect to a stateful Bloom-filter-based solution. 0 0
REWOrD: Semantic relatedness in the web of data Pirro G. Proceedings of the National Conference on Artificial Intelligence English 2012 This paper presents REWOrD, an approach to compute semantic relatedness between entities in the Web of Data representing real-world concepts. REWOrD exploits the graph nature of RDF data and the SPARQL query language to access this data. Through simple queries, REWOrD constructs weighted vectors keeping the informativeness of the RDF predicates used to make statements about the entities being compared. The most informative path is also considered to further refine informativeness. Relatedness is then computed as the cosine of the weighted vectors. Unlike previous approaches based on Wikipedia, REWOrD does not require any preprocessing or custom data transformation. Indeed, it can leverage any RDF knowledge base as a source of background knowledge. We evaluated REWOrD in different settings by using a new dataset of real-world entities and investigated its flexibility. As compared to related work on classical datasets, REWOrD obtains comparable results while, on one side, it avoids the burden of preprocessing and data transformation and, on the other side, it provides more flexibility and applicability in a broad range of domains. Copyright © 2012, Association for the Advancement of Artificial Intelligence. All rights reserved. 0 0
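The final cosine-of-weighted-vectors step that REWOrD describes is easy to sketch. The predicate names and weights below are hypothetical stand-ins for the informativeness scores the paper derives from RDF data via SPARQL:

```python
import math

def cosine(u, v):
    """Cosine similarity between two sparse vectors given as dicts mapping
    a feature (e.g., an RDF predicate) to an informativeness weight."""
    dot = sum(w * v.get(k, 0.0) for k, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

# Hypothetical predicate-weight vectors for two entities; the weights stand
# in for IDF-like informativeness scores, and the predicate IRIs are made up.
einstein = {"dbo:field": 0.9, "dbo:birthPlace": 0.3, "dbo:award": 0.7}
bohr     = {"dbo:field": 0.8, "dbo:birthPlace": 0.4, "dbo:doctoralAdvisor": 0.5}
print(round(cosine(einstein, bohr), 3))  # → 0.695
```

Entities whose statements use the same informative predicates end up with nearby vectors and thus high relatedness.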
Scientific cyberlearning resources referential metadata creation via information retrieval Xiaojiang Liu
Jia H.
Proceedings of the ACM/IEEE Joint Conference on Digital Libraries English 2012 The goal of this research is to describe an innovative method of creating scientific referential metadata for a cyberinfrastructure-enabled learning environment to enhance student and scholar learning experiences. By using information retrieval and meta-search approaches, different types of referential metadata, such as related Wikipedia Pages, Datasets, Source Code, Video Lectures, Presentation Slides, and (online) Tutorials, for an assortment of publications and scientific topics will be automatically retrieved, associated, and ranked. 0 0
Text classification using Wikipedia knowledge Su C.
Yanne P.
YanChun Zhang
ICIC Express Letters, Part B: Applications English 2012 In the real world, there are large amounts of unlabeled text documents, but traditional approaches usually require a lot of labeled documents, which are expensive to obtain. In this paper we propose an approach that uses Wikipedia for text classification. We first extract related wiki documents using the given keywords, then label the documents with representative features selected from the related wiki documents, and finally build an SVM text classifier. Experimental results on the 20-Newsgroup dataset show that the proposed method performs well and stably. 0 0
Towards better understanding and utilizing relations in DBpedia Linyun Fu
Haofen Wang
Jin W.
Yiqin Yu
Web Intelligence and Agent Systems English 2012 This paper is concerned with the problems of understanding the relations in automatically extracted semantic datasets such as DBpedia and utilizing them in semantic queries such as SPARQL. Although DBpedia has achieved great success in supporting convenient navigation and complex queries over the semantic data extracted from Wikipedia, the browsing mechanism and the organization of the relations in the extracted data are far from satisfactory. Some relations have anomalous names and are hard to understand, even for experts, from the relation names alone; there also exist synonymous and polysemous relations which may cause incomplete or noisy query results. In this paper, we propose to solve these problems by 1) exploiting the Wikipedia category system to facilitate relation understanding and query constraint selection, and 2) exploring various relation representation models for similar/super-/sub-relation detection to help users select proper relations in their queries. A prototype system has been implemented and extensive experiments are performed to illustrate the effectiveness of the proposed approach. © 2012-IOS Press and the authors. All rights reserved. 0 0
Using lexical and thematic knowledge for name disambiguation Wang J.
Zhao W.X.
Yan R.
Wei H.
Nie J.-Y.
Li X.
Lecture Notes in Computer Science English 2012 In this paper we present a novel approach to disambiguating names based on two different types of semantic information: lexical and thematic. We propose to use translation-based language models to resolve the synonymy problem in every word match, and to use a topic-based ranking function to capture rich thematic contexts for names. We test three ranking functions that combine lexical relatedness and thematic relatedness. The experiments on the Wikipedia data set and the TAC-KBP 2010 data set show that our proposed method is very effective for name disambiguation. 0 0
Visualization of Wiki-based collaboration through two-mode network patterns Modritscher F.
Taferner W.
Proceedings of the 12th IEEE International Conference on Advanced Learning Technologies, ICALT 2012 English 2012 Nowadays Wikis are considered to be a useful tool for teaching and learning. However, well-known Wiki solutions do not provide sufficient facilities for analyzing and exploring networked collaboration. In this paper we present a method for detecting and visualizing structural patterns of collaboration in Wikis. Furthermore, we summarize findings from applying our approach to one smaller and two larger datasets. Overall, our method allows characterizing Wikis according to collaboration patterns on the basis of two-mode networks, but it also enables users to explore large Wiki corpora and provides visual feedback on content creation. 0 0
WSR: A semantic relatedness measure based on Wikipedia structure Sun C.-C.
Shen D.-R.
Shan J.
Nie T.-Z.
Yu G.
Jisuanji Xuebao/Chinese Journal of Computers Chinese 2012 This paper proposes a semantic relatedness measure based on Wikipedia structure: WikiStruRel (WSR). Nowadays, Wikipedia is the largest and fastest-growing online encyclopedia, consisting of two net-like structures: an article reference network and a category tree (actually a tree-like graph), which include a lot of explicitly defined semantic information. WSR explicitly analyzes the article reference network and the category tree from Wikipedia and computes semantic relatedness between words. Tests on three common datasets show that WSR achieves effective accuracy and large coverage; moreover, since the measure does not have to process raw text, its cost is low. 0 0
A self organizing document map algorithm for large scale hyperlinked data inspired by neuronal migration Kotaro Nakayama
Yutaka Matsuo
Proceedings of the 20th International Conference Companion on World Wide Web, WWW 2011 English 2011 Web document clustering is a research topic that is being pursued continuously due to its large variety of applications. Since Web documents usually have variety and diversity in terms of domains, content and quality, one of the technical difficulties is to find a reasonable number and size of clusters. In this research, we pay attention to SOMs (Self Organizing Maps) because of their capability of visualized clustering, which helps users investigate the characteristics of data in detail. The SOM is widely known as a "scalable" algorithm because of its capability to handle large numbers of records. However, it is effective only when the vectors are small and dense. Although several research efforts on making the SOM scalable have been conducted, technical issues on scalability and performance for sparse high-dimensional data such as hyperlinked documents still remain. In this paper, we introduce MIGSOM, an SOM algorithm inspired by a recent discovery on neuronal migration. The two major advantages of MIGSOM are its scalability for sparse high-dimensional data and its clustering visualization functionality. We describe the algorithm and implementation, and show the practicality of the algorithm by applying MIGSOM to a huge real-world data set: Wikipedia's hyperlink data. 0 0
Categorising social tags to improve folksonomy-based recommendations Ivan Cantador
Ioannis Konstas
Jose J.M.
Journal of Web Semantics English 2011 In social tagging systems, users have different purposes when they annotate items. Tags not only depict the content of the annotated items, for example by listing the objects that appear in a photo, or express contextual information about the items, for example by providing the location or the time in which a photo was taken, but also describe subjective qualities and opinions about the items, or can be related to organisational aspects, such as self-references and personal tasks. Current folksonomy-based search and recommendation models exploit the social tag space as a whole to retrieve those items relevant to a tag-based query or user profile, and do not take into consideration the purposes of tags. We hypothesise that a significant percentage of tags are noisy for content retrieval, and believe that the distinction of the personal intentions underlying the tags may be beneficial to improve the accuracy of search and recommendation processes. We present a mechanism to automatically filter and classify raw tags in a set of purpose-oriented categories. Our approach finds the underlying meanings (concepts) of the tags, mapping them to semantic entities belonging to external knowledge bases, namely WordNet and Wikipedia, through the exploitation of ontologies created within the W3C Linking Open Data initiative. The obtained concepts are then transformed into semantic classes that can be uniquely assigned to content- and context-based categories. The identification of subjective and organisational tags is based on natural language processing heuristics. We collected a representative dataset from Flickr social tagging system, and conducted an empirical study to categorise real tagging data, and evaluate whether the resultant tags categories really benefit a recommendation model using the Random Walk with Restarts method. 
The results show that content- and context-based tags are considered superior to subjective and organisational tags, achieving equivalent performance to using the whole tag space. © 2010 Elsevier B.V. All rights reserved. 0 0
Categorization of wikipedia articles with spectral clustering Szymanski J. Lecture Notes in Computer Science English 2011 The article reports the application of clustering algorithms for creating hierarchical groups within Wikipedia articles. We evaluate three spectral clustering algorithms based on datasets constructed with the usage of Wikipedia categories. The selected algorithm has been implemented in a system that categorizes Wikipedia search results on the fly. 0 0
Citizens as database: Conscious ubiquity in data collection Richter K.-F.
Winter S.
Lecture Notes in Computer Science English 2011 Crowd sourcing [1], citizens as sensors [2], user-generated content [3,4], or volunteered geographic information [5] describe a relatively recent phenomenon that points to dramatic changes in our information economy. Users of a system, who often are not trained in the matter at hand, contribute data that they collected without a central authority managing or supervising the data collection process. The individual approaches vary and cover a spectrum from conscious user actions ('volunteered') to passive modes ('citizens as sensors'). Volunteered user-generated content is often used to replace existing commercial or authoritative datasets, for example, Wikipedia as an open encyclopaedia, or OpenStreetMap as an open topographic dataset of the world. Other volunteered content exploits the rapid update cycles of such mechanisms to provide improved services. For example, reports damages related to streets; Google, TomTom and other dataset providers encourage their users to report updates of their spatial data. In some cases, the database itself is the service; for example, Flickr allows users to upload and share photos. At the passive end of the spectrum, data mining methods can be used to further elicit hidden information out of the data. Researchers identified, for example, landmarks defining a town from Flickr photo collections [6], and commercial services track anonymized mobile phone locations to estimate traffic flow and enable real-time route planning. 0 0
Clustering blogs using document context similarity and spectral graph partitioning Ayyasamy R.K.
Alhashmi S.M.
Eu-Gene S.
Tahayna B.
Advances in Intelligent and Soft Computing English 2011 Semantic-based document clustering has been a challenging problem over the past few years, and its execution depends on modeling the underlying content and its similarity metrics. Existing metrics evaluate pairwise text similarity based on text content, which is referred to as content similarity. The performance of these measures is based on co-occurrences and ignores the semantics among words. Although several research works have been carried out to address this problem, we propose a novel similarity measure that exploits an external knowledge base, Wikipedia, to enhance the document clustering task. Wikipedia articles and their main categories are used to predict documents' semantic concepts and affiliate documents with them. In this measure, we incorporate context similarity by constructing a vector with each dimension representing the content similarity between a document and other documents in the collection. Experimental results on the TREC blog dataset confirm that the use of the context similarity measure can improve the precision of document clustering significantly. 0 0
Coherence progress: A measure of interestingness based on fixed compressors Schaul T.
Pape L.
Glasmachers T.
Graziano V.
Schmidhuber J.
Lecture Notes in Computer Science English 2011 The ability to identify novel patterns in observations is an essential aspect of intelligence. In a computational framework, the notion of a pattern can be formalized as a program that uses regularities in observations to store them in a compact form, called a compressor. The search for interesting patterns can then be stated as a search to better compress the history of observations. This paper introduces coherence progress, a novel, general measure of interestingness that is independent of its use in a particular agent and the ability of the compressor to learn from observations. Coherence progress considers the increase in coherence obtained by any compressor when adding an observation to the history of observations thus far. Because of its applicability to any type of compressor, the measure allows for an easy, quick, and domain-specific implementation. We demonstrate the capability of coherence progress to satisfy the requirements for qualitatively measuring interestingness on a Wikipedia dataset. 0 0
Combining heterogeneous knowledge resources for improved distributional semantic models Szarvas G.
Torsten Zesch
Iryna Gurevych
Lecture Notes in Computer Science English 2011 The Explicit Semantic Analysis (ESA) model based on term cooccurrences in Wikipedia has been regarded as the state-of-the-art semantic relatedness measure in recent years. We provide an analysis of the important parameters of ESA using datasets in five different languages. Additionally, we propose the use of ESA with multiple lexical semantic resources, thus exploiting multiple sources of evidence of term co-occurrence to improve over the Wikipedia-based measure. Exploiting the improved robustness and coverage of the proposed combination, we report improved performance over single resources in word semantic relatedness, solving word choice problems, classification of semantic relations between nominals, and text similarity. 0 0
Concept based modeling approach for blog classification using fuzzy similarity Ayyasamy R.K.
Tahayna B.
Alhashmi S.M.
Eu-Gene S.
Proceedings - 2011 8th International Conference on Fuzzy Systems and Knowledge Discovery, FSKD 2011 English 2011 As information technology develops at a faster pace, there is a steep increase in social networking, where users can share their knowledge, views, and criticism in various ways such as blogging, Facebook, microblogging, news, and forums. Among these, blogs play a distinct role, as a blog is a personal site for each user, and bloggers write lengthy posts on various topics. Several research works have been carried out to classify blogs based on machine learning techniques. In this paper, we describe a method for classifying blog posts automatically using fuzzy similarity. We perform experiments using the TREC dataset and apply our approach to six different fuzzy similarity measures. Experimental results show that the Einstein fuzzy similarity measure performs better than the other measures. 0 0
Concept-based information retrieval using explicit semantic analysis Egozi O.
Shaul Markovitch
Evgeniy Gabrilovich
ACM Transactions on Information Systems English 2011 Information retrieval systems traditionally rely on textual keywords to index and retrieve documents. Keyword-based retrieval may return inaccurate and incomplete results when different keywords are used to describe the same concept in the documents and in the queries. Furthermore, the relationship between these related keywords may be semantic rather than syntactic, and capturing it thus requires access to comprehensive human world knowledge. Concept-based retrieval methods have attempted to tackle these difficulties by using manually built thesauri, by relying on term cooccurrence data, or by extracting latent word relationships and concepts from a corpus. In this article we introduce a new concept-based retrieval approach based on Explicit Semantic Analysis (ESA), a recently proposed method that augments keyword-based text representation with concept-based features, automatically extracted from massive human knowledge repositories such as Wikipedia. Our approach generates new text features automatically, and we have found that high-quality feature selection becomes crucial in this setting to make the retrieval more focused. However, due to the lack of labeled data, traditional feature selection methods cannot be used, hence we propose new methods that use self-generated labeled training data. The resulting system is evaluated on several TREC datasets, showing superior performance over previous state-of-the-art results. 0 0
Enhancing accessibility of microblogging messages using semantic knowledge Hu X.
Tang L.
Hongyan Liu
International Conference on Information and Knowledge Management, Proceedings English 2011 The volume of microblogging messages is increasing exponentially with the popularity of microblogging services. With a large number of messages appearing in user interfaces, it hinders user accessibility to useful information buried in disorganized, incomplete, and unstructured text messages. In order to enhance user accessibility, we propose to aggregate related microblogging messages into clusters and automatically assign them semantically meaningful labels. However, a distinctive feature of microblogging messages is that they are much shorter than conventional text documents. These messages provide inadequate term co-occurrence information for capturing semantic associations. To address this problem, we propose a novel framework for organizing unstructured microblogging messages by transforming them to a semantically structured representation. The proposed framework first captures informative tree fragments by analyzing a parse tree of the message, and then exploits external knowledge bases (Wikipedia and WordNet) to enhance their semantic information. Empirical evaluation on a Twitter dataset shows that our framework significantly outperforms existing state-of-the-art methods. 0 0
Evaluation of OML and AERMOD Olesen H.R.
Berkowicz R.
Lofstrom P.
International Journal of Environment and Pollution English 2011 Results from an evaluation of three dispersion models are presented: the currently operational OML model, a new, improved 'Research Version' of OML, and the US AERMOD model. The evaluation is based on the Prairie Grass data set. For these data the OML Research Version appears superior to the other two models. Further, the paper discusses problems and pitfalls of the Prairie Grass data set. The criteria for exclusion of data have tremendous impact on evaluation metrics. A new Wiki on Atmospheric Dispersion has the potential to become a very useful focal point to pool and communicate experiences on data sets such as Prairie Grass. 0 0
Exploring entity relations for named entity disambiguation Ploch D. ACL HLT 2011 - 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, Proceedings of Student Session English 2011 Named entity disambiguation is the task of linking an entity mention in a text to the correct real-world referent predefined in a knowledge base, and is a crucial subtask in many areas like information retrieval or topic detection and tracking. Named entity disambiguation is challenging because entity mentions can be ambiguous and an entity can be referenced by different surface forms. We present an approach that exploits Wikipedia relations between entities co-occurring with the ambiguous form to derive a range of novel features for classifying candidate referents. We find that our features improve disambiguation results significantly over a strong popularity baseline, and are especially suitable for recognizing entities not contained in the knowledge base. Our system achieves state-of-the-art results on the TAC-KBP 2009 dataset. 0 0
Extracting events from Wikipedia as RDF triples linked to widespread semantic web datasets Carlo Aliprandi
Francesco Ronzano
Andrea Marchetti
Maurizio Tesconi
Salvatore Minutoli
Lecture Notes in Computer Science English 2011 Many attempts have been made to extract structured data from Web resources, exposing them as RDF triples and interlinking them with other RDF datasets: in this way it is possible to create clouds of highly integrated Semantic Web data collections. In this paper we describe an approach to enhance the extraction of semantic contents from unstructured textual documents, in particular considering Wikipedia articles and focusing on event mining. Starting from the deep parsing of a set of English Wikipedia articles, we produce a semantic annotation compliant with the Knowledge Annotation Format (KAF). We extract events from the KAF semantic annotation and then we structure each event as a set of RDF triples linked to both DBpedia and WordNet. We point out examples of automatically mined events, providing some general evaluation of how our approach may discover new events and link them to existing contents. 0 0
Finding patterns in behavioral observations by automatically labeling forms of wikiwork in Barnstars David W. McDonald
Sara Javanmardi
Mark Zachry
WikiSym 2011 Conference Proceedings - 7th Annual International Symposium on Wikis and Open Collaboration English 2011 Our everyday observations about the behaviors of others around us shape how we decide to act or interact. In social media the ability to observe and interpret others' behavior is limited. This work describes one approach to leverage everyday behavioral observations to develop tools that could improve understanding and sense making capabilities of contributors, managers and researchers of social media systems. One example of behavioral observation is Wikipedia Barnstars. Barnstars are a type of award recognizing the activities of Wikipedia editors. We mine the entire English Wikipedia to extract barnstar observations. We develop a multi-label classifier based on a random forest technique to recognize and label distinct forms of observed and acknowledged activity. We evaluate the classifier through several means, including the use of separate training and testing datasets and the application of the classifier to previously unlabeled data. We use the classifier to identify Wikipedia editors who have been observed with some predominant types of behavior and explore whether those patterns of behavior are evident and how observers seem to be making the observations. We discuss how these types of activity observations can be used to develop tools and potentially improve understanding and analysis in wikis and other online communities. 0 1
Geodesic distances for web document clustering Tekir S.
Mansmann F.
Keim D.
IEEE SSCI 2011: Symposium Series on Computational Intelligence - CIDM 2011: 2011 IEEE Symposium on Computational Intelligence and Data Mining English 2011 While traditional distance measures are often capable of properly describing similarity between objects, in some application areas there is still potential to fine-tune these measures with additional information provided in the data sets. In this work we combine such traditional distance measures for document analysis with link information between documents to improve clustering results. In particular, we test the effectiveness of geodesic distances as similarity measures under the space assumption of spherical geometry in a 0-sphere. Our proposed distance measure is thus a combination of the cosine distance of the term-document matrix and some curvature values in the geodesic distance formula. To estimate these curvature values, we calculate clustering coefficient values for every document from the link graph of the data set and increase their distinctiveness by means of a heuristic as these clustering coefficient values are rough estimates of the curvatures. To evaluate our work, we perform clustering tests with the k-means algorithm on the English Wikipedia hyperlinked data set with both traditional cosine distance and our proposed geodesic distance. The effectiveness of our approach is measured by computing micro-precision values of the clusters based on the provided categorical information of each article. 0 0
Greedy and randomized feature selection for web search ranking Pan F.
Converse T.
Ahn D.
Salvetti F.
Donato G.
Proceedings - 11th IEEE International Conference on Computer and Information Technology, CIT 2011 English 2011 Modern search engines have to be fast to satisfy users, so there are hard back-end latency requirements. The set of features useful for search ranking functions, though, continues to grow, making feature computation a latency bottleneck. As a result, not all available features can be used for ranking, and in fact, much of the time only a small percentage of these features can be used. Thus, it is crucial to have a feature selection mechanism that can find a subset of features that both meets latency requirements and achieves high relevance. To this end, we explore different feature selection methods using boosted regression trees, including both greedy approaches (i.e., selecting the features with the highest relative influence as computed by boosted trees; discounting importance by feature similarity) and randomized approaches (i.e., best-only genetic algorithm; a proposed more efficient randomized method with feature-importance-based backward elimination). We evaluate and compare these approaches using two data sets, one from a commercial Wikipedia search engine and the other from a commercial Web search engine. The experimental results show that the greedy approach that selects top features with the highest relative influence performs close to the full-feature model, and the randomized feature selection with feature-importance-based backward elimination outperforms all other randomized and greedy approaches, especially on the Wikipedia data. 0 0
HAMEX - A handwritten and audio dataset of mathematical expressions Quiniou S.
Mouchere H.
Saldarriaga S.P.
Viard-Gaudin C.
Morin E.
Petitrenaud S.
Medjkoune S.
Proceedings of the International Conference on Document Analysis and Recognition, ICDAR English 2011 In this paper, we present HAMEX, a new public dataset that contains mathematical expressions available in their on-line handwritten form and in their audio spoken form. We have designed this dataset so that, given a mathematical expression, its handwritten signal and its audio signal can be used jointly to design multimodal recognition systems. Here, we describe the different steps that allowed us to acquire this dataset, from the creation of the mathematical expression corpora (including expressions from Wikipedia pages) to the segmentation and the transcription of the collected data, via the data collection process itself. Currently, the dataset contains 4 350 on-line handwritten mathematical expressions written by 58 writers, and the corresponding audio expressions (in French) spoken by 58 speakers. The ground truth is also provided both for the handwritten expressions (as INKML files with the digital ink, the symbol segmentation, and the MATHML structure) and for the audio expressions (as XML files with the transcriptions of the spoken expressions). 0 0
Harnessing different knowledge sources to measure semantic relatedness under a uniform model Zhang Z.
Gentile A.L.
Ciravegna F.
EMNLP 2011 - Conference on Empirical Methods in Natural Language Processing, Proceedings of the Conference English 2011 Measuring semantic relatedness between words or concepts is a crucial process in many Natural Language Processing tasks. Existing methods exploit semantic evidence from a single knowledge source, and are predominantly evaluated only in the general domain. This paper introduces a method of harnessing different knowledge sources under a uniform model for measuring semantic relatedness between words or concepts. Using Wikipedia and WordNet as examples, and evaluated in both the general and biomedical domains, it successfully combines strengths from both knowledge sources and outperforms the state of the art on many datasets. 0 0
Image tag refinement using tag semantic and visual similarity Cheng W.
Xiaolong Wang
Proceedings of 2011 International Conference on Computer Science and Network Technology, ICCSNT 2011 English 2011 Social tagging on online websites provides users with interfaces for describing resources with their own tags, and vast user-provided image tags facilitate image retrieval and management. However, these tags are often not related to the actual image content, affecting the performance of tag-related applications. In this paper, a novel approach to automatically refine image tags is proposed. Firstly, information entropy of the tag is defined to refine tag frequency to predict initial tag relevance. Then, tag correlation is calculated from two sides. One side is to measure semantic similarity of tag pairs using the structured information of the free encyclopedia Wikipedia. The other is to compute the visual similarity of tag pairs based on the visual representation of the tag. Finally, to re-rank the original tags, a fast random walk with restart is used and the top ones are retained as the final tags. Experimental results on the NUS-WIDE dataset demonstrate the promising effectiveness of our approach. 0 0
Improving query expansion for image retrieval via saliency and picturability Leong C.W.
Hassan S.
Ruiz M.E.
Rada Mihalcea
Lecture Notes in Computer Science English 2011 In this paper, we present a Wikipedia-based approach to query expansion for the task of image retrieval, by combining salient encyclopaedic concepts with the picturability of words. Our model generates the expanded query terms in a definite two-stage process instead of multiple iterative passes, requires no manual feedback, and is completely unsupervised. Preliminary results show that our proposed model is effective in a comparative study on the ImageCLEF 2010 Wikipedia dataset. 0 0
Language resources extracted from Wikipedia Vrandecic D.
Sorg P.
Studer R.
KCAP 2011 - Proceedings of the 2011 Knowledge Capture Conference English 2011 Wikipedia provides an interesting amount of text for more than a hundred languages. This also includes languages where no reference corpora or other linguistic resources are easily available. We have extracted background language models built from the content of Wikipedia in various languages. The models generated from Simple and English Wikipedia are compared to language models derived from other established corpora. The differences between the models with regard to term coverage, term distribution and correlation are described and discussed. We provide access to the full dataset and create visualizations of the language models that can be used for exploratory analysis. The paper describes the newly released dataset for 33 languages, and the services that we provide on top of them. 0 0
ListOPT: Learning to optimize for XML ranking Gao N.
Deng Z.-H.
Yu H.
Jiang J.-J.
Lecture Notes in Computer Science English 2011 Many machine learning classification technologies such as boosting, support vector machines or neural networks have been applied to the ranking problem in information retrieval. However, since the purpose of these learning-to-rank methods is to directly acquire the sorted results based on the features of documents, they are unable to combine and utilize existing ranking methods proven to be effective, such as BM25 and PageRank. To address this shortcoming, we conducted a study on learning-to-optimize, which is to construct a learning model or method for optimizing the free parameters in ranking functions. This paper proposes a listwise learning-to-optimize process, ListOPT, and introduces three alternative differentiable query-level loss functions. The experimental results on the English Wikipedia XML dataset show that these approaches can be successfully applied to tuning the parameters used in an existing, highly cited ranking function, BM25. Furthermore, we found that the formulas with optimized parameters indeed improve the effectiveness compared with the original ones. 0 0
Measuring the development of wikipedia He Z. 2011 International Conference on Internet Technology and Applications, iTAP 2011 - Proceedings English 2011 The paper summarizes the development trend of Wikipedia, one of the most important Internet applications, using the current revision data set. It models the process of mass collaboration in Wikipedia using a power-law distribution, namely the Pareto distribution. This model suggests that the majority of edits in Wikipedia are contributed by a small group of people, while most participants in Wikipedia make only minor contributions. Additionally, we use maximum likelihood estimation to formulate an equation to predict the future development trend of Wikipedia. 0 0
Modelling provenance of DBpedia resources using Wikipedia contributions Fabrizio Orlandi
Alexandre Passant
Journal of Web Semantics English 2011 DBpedia is one of the largest datasets in the Linked Open Data cloud. Its centrality and its cross-domain nature make it one of the most important and most referred to knowledge bases on the Web of Data, generally used as a reference for data interlinking. Yet, in spite of its authoritative aspect, there is no work so far tackling the provenance aspect of DBpedia statements. Because DBpedia is extracted from Wikipedia, an open and collaborative encyclopedia, delivering provenance information about it would help to ensure the trustworthiness of its data, a major need for people using DBpedia data for building applications. To overcome this problem, we propose an approach for modelling and managing provenance on DBpedia using Wikipedia edits, and making this information available on the Web of Data. In this paper, we describe the framework that we implemented to do so, consisting of (1) a lightweight modelling solution to semantically represent provenance of both DBpedia resources and Wikipedia content, along with mappings to popular ontologies such as the W7 - what, when, where, how, who, which, and why - and OPM - open provenance model - models, (2) an information extraction process and a provenance-computation system combining Wikipedia articles' history with DBpedia information, (3) a set of scripts to make provenance information about DBpedia statements directly available when browsing this source, as well as being publicly exposed in RDF for letting software agents consume it. 0 0
On using crowdsourcing and active learning to improve classification performance Costa J.
Silva C.
Antunes M.
Ribeiro B.
International Conference on Intelligent Systems Design and Applications, ISDA English 2011 Crowdsourcing is an emergent trend for general-purpose classification problem solving. Over the past decade, this notion has been embodied by enlisting a crowd of humans to help solve problems. There is a growing number of real-world projects that take advantage of this technique, such as Wikipedia, Linux or Amazon Mechanical Turk. In this paper, we evaluate its suitability for classification, namely whether it can outperform state-of-the-art models by combining it with active learning techniques. We propose two approaches based on crowdsourcing and active learning and empirically evaluate the performance of a baseline Support Vector Machine when active learning examples are chosen and made available for classification to a crowd in a web-based scenario. The proposed crowdsourcing active learning approach was tested with the Jester data set, a text humour classification benchmark, resulting in promising improvements over baseline results. 0 0
Processing Wikipedia dumps: A case-study comparing the XGrid and mapreduce approaches Thiebaut D.
Yanyan Li
Jaunzeikare D.
Cheng A.
Recto E.R.
Riggs G.
Zhao X.T.
Stolpestad T.
Nguyen C.L.T.
CLOSER 2011 - Proceedings of the 1st International Conference on Cloud Computing and Services Science English 2011 We present a simple comparison of the performance of three different cluster platforms, Apple's XGrid and two deployments of Hadoop, the open-source version of Google's MapReduce, measured as the total execution time taken by each to parse a 27-GByte XML dump of the English Wikipedia. A local Hadoop cluster of Linux workstations, as well as an Elastic MapReduce cluster rented from Amazon, are used. We show that for this specific workload, XGrid yields the fastest execution time, with the local Hadoop cluster a close second. The overhead of fetching data from Amazon's Simple Storage Service (S3), along with the inability to skip the reduce, sort, and merge phases on Amazon, penalizes this platform, which is targeted at much larger data sets. 0 0
Query relaxation for entity-relationship search Elbassuoni S.
Maya Ramanath
Gerhard Weikum
Lecture Notes in Computer Science English 2011 Entity-relationship-structured data is becoming more important on the Web. For example, large knowledge bases have been automatically constructed by information extraction from Wikipedia and other Web sources. Entities and relationships can be represented by subject-property-object triples in the RDF model, and can then be precisely searched by structured query languages like SPARQL. Because of their Boolean-match semantics, such queries often return too few or even no results. To improve recall, it is thus desirable to support users by automatically relaxing or reformulating queries in such a way that the intention of the original user query is preserved while returning a sufficient number of ranked results. In this paper we describe comprehensive methods to relax SPARQL-like triple-pattern queries in a fully automated manner. Our framework produces a set of relaxations by means of statistical language models for structured RDF data and queries. The query processing algorithms merge the results of different relaxations into a unified result list, with ranking based on any ranking function for structured queries over RDF-data. Our experimental evaluation, with two different datasets about movies and books, shows the effectiveness of the automatically generated relaxations and the improved quality of query results based on assessments collected on the Amazon Mechanical Turk platform. 0 0
Ranking multilingual documents using minimal language dependent resources Santosh G.S.K.
Kiran Kumar N.
Vasudeva Varma
Lecture Notes in Computer Science English 2011 This paper proposes an approach of extracting simple and effective features that enhances multilingual document ranking (MLDR). There is limited prior research on capturing the concept of multilingual document similarity in determining the ranking of documents. However, the available literature has worked heavily with language-specific tools, making those approaches hard to reimplement for other languages. Our approach extracts various multilingual and monolingual similarity features using a basic language resource (a bilingual dictionary). No language-specific tools are used, hence making this approach extensible to other languages. We used the datasets provided by the Forum for Information Retrieval Evaluation (FIRE) for their 2010 Adhoc Cross-Lingual document retrieval task on Indian languages. Experiments have been performed with different ranking algorithms and their results are compared. The results obtained showcase the effectiveness of the features considered in enhancing multilingual document ranking. 0 0
Semantic tag recommendation using concept model Chenliang Li
Anwitaman Datta
Aixin Sun
SIGIR'11 - Proceedings of the 34th International ACM SIGIR Conference on Research and Development in Information Retrieval English 2011 The common tags given by multiple users to a particular document are often semantically relevant to the document and each tag represents a specific topic. In this paper, we attempt to emulate human tagging behavior to recommend tags by considering the concepts contained in documents. Specifically, we represent each document using a few most relevant concepts contained in the document, where the concept space is derived from Wikipedia. Tags are then recommended based on the tag concept model derived from the annotated documents of each tag. Evaluated on a Delicious dataset of more than 53K documents, the proposed technique achieved comparable tag recommendation accuracy as the state-of-the-art, while yielding an order of magnitude speed-up. 0 0
Simple English Wikipedia: A new text simplification task William Coster
David Kauchak
ACL-HLT 2011 - Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies English 2011 In this paper we examine the task of sentence simplification which aims to reduce the reading complexity of a sentence by incorporating more accessible vocabulary and sentence structure. We introduce a new data set that pairs English Wikipedia with Simple English Wikipedia and is orders of magnitude larger than any previously examined for sentence simplification. The data contains the full range of simplification operations including rewording, reordering, insertion and deletion. We provide an analysis of this corpus as well as preliminary results using a phrase-based translation approach for simplification. 0 0
Simple supervised document geolocation with geodesic grids Wing B.P.
Baldridge J.
ACL-HLT 2011 - Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies English 2011 We investigate automatic geolocation (i.e. identification of the location, expressed as latitude/longitude coordinates) of documents. Geolocation can be an effective means of summarizing large document collections and it is an important component of geographic information retrieval. We describe several simple supervised methods for document geolocation using only the document's raw text as evidence. All of our methods predict locations in the context of geodesic grids of varying degrees of resolution. We evaluate the methods on geotagged Wikipedia articles and Twitter feeds. For Wikipedia, our best method obtains a median prediction error of just 11.8 kilometers. Twitter geolocation is more challenging: we obtain a median error of 479 km, an improvement on previous results for the dataset. 0 0
Using Mahout for clustering Wikipedia's latest articles: A comparison between k-means and fuzzy c-means in the cloud Rong C.
Esteves R.M.
Proceedings - 2011 3rd IEEE International Conference on Cloud Computing Technology and Science, CloudCom 2011 English 2011 This paper compares k-means and fuzzy c-means for clustering a noisy, realistic, and large dataset. We made the comparison using a free cloud computing solution, Apache Mahout/Hadoop, and Wikipedia's latest articles. In the past, the usage of these two algorithms was restricted to small datasets. As such, studies were based on artificial datasets that do not represent a real document clustering situation. With this ongoing research we found that on a noisy dataset, fuzzy c-means can lead to worse cluster quality than k-means. The convergence speed of k-means is not always faster. We found as well that Mahout is a promising clustering technology, but the preprocessing tools are not developed enough for efficient dimensionality reduction. In our experience, the use of Apache Mahout is premature. 1 0
Using similarity-based approaches for continuous ontology development Ramezani M. International Journal on Semantic Web and Information Systems English 2011 This paper presents novel algorithms for learning semantic relations from an existing ontology or concept hierarchy. The authors suggest recommendation of semantic relations for supporting continuous ontology development, i.e., the development of ontologies during their use in social semantic bookmarking, semantic wikis, or other Web 2.0 style semantic applications. This paper assists users in placing a newly added concept in a concept hierarchy. The proposed algorithms are evaluated using datasets from the Wikipedia category hierarchy and provide recommendations. 0 0
Vandalism detection in Wikipedia: A high-performing, feature-rich model and its reduction through Lasso Sara Javanmardi
David W. McDonald
Lopes C.V.
WikiSym 2011 Conference Proceedings - 7th Annual International Symposium on Wikis and Open Collaboration English 2011 User generated content (UGC) constitutes a significant fraction of the Web. However, some wiki-based sites, such as Wikipedia, are so popular that they have become a favorite target of spammers and other vandals. In such popular sites, human vigilance is not enough to combat vandalism, and tools that detect possible vandalism and poor-quality contributions become a necessity. The application of machine learning techniques holds promise for developing efficient online algorithms for better tools to assist users in vandalism detection. We describe an efficient and accurate classifier that performs vandalism detection in UGC sites. We show the results of our classifier on the PAN Wikipedia dataset. We explore the effectiveness of a combination of 66 individual features that produce an AUC of 0.9553 on a test dataset - the best result to our knowledge. Using Lasso optimization we then reduce our feature-rich model to a much smaller and more efficient model of 28 features that performs almost as well - the drop in AUC being only 0.005. We describe how this approach can be generalized to other user generated content systems and describe several applications of this classifier to help users identify potential vandalism. 0 0
WikiLabel: An encyclopedic approach to labeling documents en masse Nomoto T. International Conference on Information and Knowledge Management, Proceedings English 2011 This paper presents a particular approach to collective labeling of multiple documents, which works by associating the documents with Wikipedia pages and labeling them with headings the pages carry. The approach has an obvious advantage over past approaches in that it is able to produce fluent labels, as they are hand-written by human editors. We carried out some experiments on the TDT5 dataset, which found that the approach works rather robustly for an arbitrary set of documents in the news domain. Comparisons were made with some baselines, including the state of the art, with results strongly in favor of our approach. 0 0
A baseline approach for detecting sentences containing uncertainty Sang E.T.K. CoNLL-2010: Shared Task - Fourteenth Conference on Computational Natural Language Learning, Proceedings of the Shared Task English 2010 We apply a baseline approach to the CoNLL-2010 shared task data sets on hedge detection. Weights have been assigned to cue words marked in the training data based on their occurrences in certain and uncertain sentences. New sentences received scores that correspond with those of their best scoring cue word, if present. The best acceptance scores for uncertain sentences were determined using 10-fold cross validation on the training data. This approach performed reasonably on the shared task's biological (F=82.0) and Wikipedia (F=62.8) data sets. 0 0
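The cue-weighting baseline described in this abstract is simple enough to sketch directly. The details below (weighting each cue by the fraction of its training occurrences that are uncertain, scoring a sentence by its best cue) follow the abstract; the toy data and threshold handling are our own assumptions.

```python
from collections import Counter

def cue_weights(sentences):
    """sentences: list of (tokens, is_uncertain). Weight each word by the
    fraction of sentences containing it that are marked uncertain."""
    in_uncertain, total = Counter(), Counter()
    for tokens, uncertain in sentences:
        for t in set(tokens):
            total[t] += 1
            if uncertain:
                in_uncertain[t] += 1
    return {t: in_uncertain[t] / total[t] for t in total}

def score(weights, tokens):
    """A sentence scores as its best-scoring cue word (0 if none present)."""
    return max((weights.get(t, 0.0) for t in tokens), default=0.0)

train = [
    (["results", "may", "indicate", "a", "role"], True),
    (["we", "confirmed", "the", "interaction"], False),
    (["this", "may", "suggest", "binding"], True),
]
w = cue_weights(train)
print(score(w, ["it", "may", "be", "possible"]))  # → 1.0: "may" is always uncertain
```

In the shared-task setting, the acceptance threshold on this score would be tuned by cross-validation on the training data, as the abstract describes.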
A classification algorithm of signed networks based on link analysis Qu Z.
Yafang Wang
Wang J.
Zhang F.
Qin Z.
2010 International Conference on Communications, Circuits and Systems, ICCCAS 2010 - Proceedings English 2010 In signed networks the links between nodes can be either positive (meaning relations of friendship) or negative (meaning relations of rivalry or confrontation), which is very useful for the analysis of real social networks. After studying data sets from the Wikipedia and Slashdot networks, we find that the signs of links in the underlying social networks can be used to classify the nodes and to forecast, with high accuracy, the signs of links that will emerge in the future, using models established across these diverse data sets. Based on these models, the proposed algorithm provides insight into some of the underlying principles that can be extracted from signed links in networks. At the same time, the algorithm sheds light on social computing applications by which the attitude of a person toward another can be predicted from evidence provided by the relationships among their surrounding friends. 0 0
A content-based image retrieval system based on unsupervised topological learning Rogovschi N.
Grozavu N.
Proc. - 6th Intl. Conference on Advanced Information Management and Service, IMS2010, with ICMIA2010 - 2nd International Conference on Data Mining and Intelligent Information Technology Applications English 2010 The Internet offers its users an ever-increasing amount of information. Among it, multimodal data (images, text, video, sound) are widely requested by users, and there is a strong need for effective ways to process and manage them. Most existing algorithms/frameworks only annotate images and search by these annotations, possibly combined with some clustering results, but most of them do not allow quick browsing of the images. Even if search is very fast, when the number of images is very large the system must give the user the possibility to browse the data. In this paper, an image retrieval system is presented, including a detailed description of the lwo-SOM (local weighting observations Self-Organizing Map) approach used and of a new interactive learning process using user information/response. We also show the use of unsupervised learning on an image dataset: we have no labels at our disposal and do not take the text accompanying the images into account. The real dataset used contains 17,812 images extracted from Wikipedia pages, each of which is characterized by its color and texture. 0 0
A hedgehop over a max-margin framework using hedge cues Georgescul M. CoNLL-2010: Shared Task - Fourteenth Conference on Computational Natural Language Learning, Proceedings of the Shared Task English 2010 In this paper, we describe the experimental settings we adopted in the context of the 2010 CoNLL shared task for detecting sentences containing uncertainty. The classification results reported on are obtained using discriminative learning with features essentially incorporating lexical information. Hyper-parameters are tuned for each domain: using BioScope training data for the biomedical domain and Wikipedia training data for the Wikipedia test set. By allowing an efficient handling of combinations of large-scale input features, the discriminative approach we adopted showed highly competitive empirical results for hedge detection on the Wikipedia dataset: our system is ranked as the first with an F-score of 60.17%. 0 0
A lucene and maximum entropy model based hedge detection system Long Chen
Di Eugenio B.
CoNLL-2010: Shared Task - Fourteenth Conference on Computational Natural Language Learning, Proceedings of the Shared Task English 2010 This paper describes the approach to hedge detection we developed in order to participate in the shared task at CoNLL-2010. A supervised learning approach is employed in our implementation. Hedge cue annotations in the training data are used as the seed to build a reliable hedge cue set. A Maximum Entropy (MaxEnt) model is used as the learning technique to determine uncertainty. By making use of Apache Lucene, we are able to do fuzzy string matching to extract hedge cues and to incorporate part-of-speech (POS) tags in hedge cues. Not only can our system determine the certainty of a sentence, but it is also able to find all the hedges it contains. Our system was ranked third on the Wikipedia dataset. In later experiments with different parameters, we further improved our results, with a 0.612 F-score on the Wikipedia dataset, and a 0.802 F-score on the biological dataset. 0 0
A monolingual tree-based translation model for sentence simplification Zhu Z.
Bernhard D.
Iryna Gurevych
Coling 2010 - 23rd International Conference on Computational Linguistics, Proceedings of the Conference English 2010 In this paper, we consider sentence simplification as a special form of translation with the complex sentence as the source and the simple sentence as the target. We propose a Tree-based Simplification Model (TSM), which, to our knowledge, is the first statistical simplification model covering splitting, dropping, reordering and substitution integrally. We also describe an efficient method to train our model with a large-scale parallel dataset obtained from Wikipedia and Simple Wikipedia. The evaluation shows that our model achieves better readability scores than a set of baseline systems. 0 0
A retrieval method for earth science data based on integrated use of Wikipedia and domain ontology Masashi Tatedoko
Toshiyuki Shimizu
Akinori Saito
Masatoshi Yoshikawa
Lecture Notes in Computer Science English 2010 Due to recent advancements in observation technologies and progress in information technologies, the total amount of earth science data has increased at an explosive pace. However, it is not easy to search and discover earth science data because earth science requires a high degree of expertise. In this paper, we propose a retrieval method for earth science data which can be used by non-experts such as scientists from other fields or students interested in earth science. In order to retrieve relevant data sets from a query, which may not include technical terminology, supplementary terms are extracted by utilizing knowledge bases: Wikipedia and a domain ontology. We evaluated our method using actual earth science data. The data, the queries, and the relevance assessments for our experiments were made by earth science researchers. The results of our experiments show that our method achieves good recall and precision. 0 0
Aligning WordNet synsets and Wikipedia articles Fernando S.
Stevenson M.
AAAI Workshop - Technical Report English 2010 This paper examines the problem of finding articles in Wikipedia to match noun synsets in WordNet. The motivation is that these articles enrich the synsets with much more information than is already present in WordNet. Two methods are used. The first is title matching, following redirects and disambiguation links. The second is information retrieval over the set of articles. The methods are evaluated over a random sample set of 200 noun synsets which were manually annotated. With 10 candidate articles retrieved for each noun synset, the methods achieve a recall of 93%. The manually annotated data set and the automatically generated candidate article sets are available online for research purposes. 0 0
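The title-matching step described above (look up a synset lemma against article titles, following redirects, then fall back to disambiguation-style candidates) can be sketched with a toy redirect table and title set. All names and data here are hypothetical illustrations, not the authors' resources.

```python
# Hypothetical toy data standing in for Wikipedia's redirect table and titles.
redirects = {"automobile": "car"}
titles = {"car", "bank (finance)", "river bank"}

def match_title(lemma):
    """Match a WordNet lemma to an article title via normalization,
    redirects, and parenthesized-disambiguation candidates."""
    t = lemma.lower().replace("_", " ")
    t = redirects.get(t, t)            # follow a redirect if one exists
    if t in titles:
        return t                       # exact title match
    # fall back to disambiguated titles sharing the lemma, e.g. "bank (finance)"
    return next((x for x in sorted(titles) if x.startswith(t + " (")), None)

print(match_title("Automobile"))  # → "car" via the redirect table
print(match_title("bank"))        # → "bank (finance)" via disambiguation
```

The paper's second method, information retrieval over the article set, would then rank such candidates when several titles survive this filter.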
Analysis of structural relationships for hierarchical cluster labeling Muhr M.
Roman Kern
Michael Granitzer
SIGIR 2010 Proceedings - 33rd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval English 2010 Cluster label quality is crucial for browsing topic hierarchies obtained via document clustering. Intuitively, the hierarchical structure should influence the labeling accuracy. However, most labeling algorithms ignore such structural properties and therefore, the impact of hierarchical structures on the labeling accuracy is yet unclear. In our work we integrate hierarchical information, i.e. sibling and parent-child relations, in the cluster labeling process. We adapt standard labeling approaches, namely Maximum Term Frequency, Jensen-Shannon Divergence, χ² Test, and Information Gain, to make use of those relationships and evaluate their impact on 4 different datasets, namely the Open Directory Project, Wikipedia, TREC Ohsumed and the CLEF IP European Patent dataset. We show that hierarchical relationships can be exploited to increase labeling accuracy, especially on high-level nodes. 0 0
Automatic evaluation of topic coherence Newman D.
Lau J.H.
Grieser K.
Baldwin T.
NAACL HLT 2010 - Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, Proceedings of the Main Conference English 2010 This paper introduces the novel task of topic coherence evaluation, whereby a set of words, as generated by a topic model, is rated for coherence or interpretability. We apply a range of topic scoring models to the evaluation task, drawing on WordNet, Wikipedia and the Google search engine, and existing research on lexical similarity/relatedness. In comparison with human scores for a set of learned topics over two distinct datasets, we show a simple co-occurrence measure based on point-wise mutual information over Wikipedia data is able to achieve results for the task at or nearing the level of inter-annotator correlation, and that other Wikipedia-based lexical relatedness methods also achieve strong results. Google produces strong, if less consistent, results, while our results over WordNet are patchy at best. 0 0
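The co-occurrence coherence measure this abstract reports can be sketched as mean pairwise PMI over a topic's top words. This is our own simplification under stated assumptions: we count co-occurrence at the document level, whereas the paper computes it over sliding windows of Wikipedia text.

```python
import math
from itertools import combinations

def pmi_coherence(top_words, corpus):
    """Mean pairwise PMI of a topic's top words, with probabilities
    estimated from document-level co-occurrence in `corpus`."""
    docs = [set(d) for d in corpus]
    n = len(docs)
    def p(*words):
        return sum(all(w in d for w in words) for d in docs) / n
    scores = []
    for w1, w2 in combinations(top_words, 2):
        joint = p(w1, w2)
        if joint > 0:  # skip pairs that never co-occur
            scores.append(math.log(joint / (p(w1) * p(w2))))
    return sum(scores) / len(scores) if scores else 0.0

corpus = [
    ["cell", "protein", "gene"],
    ["cell", "protein", "enzyme"],
    ["market", "stock", "price"],
    ["market", "price", "trade"],
]
print(pmi_coherence(["cell", "protein"], corpus))  # ≈ 0.693 (log 2): coherent pair
print(pmi_coherence(["cell", "market"], corpus))   # 0.0: words never co-occur
```

Correlating such scores with human coherence ratings is what lets the paper claim near-inter-annotator agreement for the Wikipedia-based measure.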
BabelNet: Building a very large multilingual semantic network Roberto Navigli
Ponzetto S.P.
ACL 2010 - 48th Annual Meeting of the Association for Computational Linguistics, Proceedings of the Conference English 2010 In this paper we present BabelNet - a very large, wide-coverage multilingual semantic network. The resource is automatically constructed by means of a methodology that integrates lexicographic and encyclopedic knowledge from WordNet and Wikipedia. In addition Machine Translation is also applied to enrich the resource with lexical information for all languages. We conduct experiments on new and existing gold-standard datasets to show the high quality and coverage of the resource. 0 0
Consolidating tools for model evaluation Olesen H.R.
Chang C.J.
International Journal of Environment and Pollution English 2010 An overview is provided of some central tools and data sets that are currently available for the evaluation of atmospheric dispersion models. The paper serves as a guide to the Model Validation Kit, which was introduced in 1993 but has undergone a recent revision. The Model Validation Kit is a package of field data sets and software for model evaluation, plus various supplementary materials. Further, the paper outlines the main features of a corresponding package that implements the evaluation methodology of the American Society for Testing and Materials (ASTM), as specified in its standard guide D6589 on statistical evaluation of dispersion models. The paper gives a review of the features and limitations of the two packages. 0 0
Dynamics of genre and domain intents Sushmita S.
Piwowarski B.
Lalmas M.
Lecture Notes in Computer Science English 2010 As the type of content available on the web is becoming increasingly diverse, a particular challenge is to properly determine the types of documents sought by a user, that is, the domain intent (e.g. image, video) and/or the genre intent (e.g. blog, wikipedia). In this paper, we analysed the Microsoft 2006 RFP click dataset to obtain an understanding of domain and genre intents and their dynamics, i.e., how intents evolve within search sessions and their effect on query reformulation. 0 0
E-Silkroad: A sample of combining social media with cultural tourism Wang Q.
Qi X.
Xu J.
CMM'10 - Proceedings of the 1st ACM Workshop on Connected Multimedia, Co-located with ACM Multimedia 2010 English 2010 With the development of Web 2.0, very large-scale multimedia resources have emerged on the Internet. In this paper, we present a novel framework for building a tour guide based on online knowledge resources, e.g., e-Silkroad, a photographic guide to the traditional Silk Road. The tour guide is jointly established from text information from Wikipedia and images from the Flickr website. Our method starts from the keyword "silkroad" in Wikipedia, from which typical cities are extracted and regarded as the key threads of the guide. Then a great number of images and their description tags are downloaded from the Flickr website. To highlight the most interesting places and most active photographers, the framework computes the hot spots and photographers in the dataset. To introduce each place along the Silk Road, all the images are classified by content into four categories: person, food, man-made, and sights. Finally, the images are registered on Google Maps according to their geo-tag descriptions along the silk routes to generate e-Silkroad. In our evaluation experiment, 20676 images were downloaded from 35 key cities along the Silk Road. Experimental results show that the step from social media to cultural tourism is effective in the connected environment. 0 0
Enhancing Short Text Clustering with Small External Repositories Petersen H.
Poon J.
Conferences in Research and Practice in Information Technology Series English 2010 The automatic clustering of textual data according to their semantic concepts is a challenging, yet important task. Choosing an appropriate method to apply when clustering text depends on the nature of the documents being analysed. For example, traditional clustering algorithms can struggle to correctly model collections of very short text due to their extremely sparse nature. In recent times, much attention has been directed to finding methods for adequately clustering short text. Many popular approaches employ large, external document repositories, such as Wikipedia or the Open Directory Project, to incorporate additional world knowledge into the clustering process. However the sheer size of many of these external collections can make these techniques difficult or time consuming to apply. This paper also employs external document collections to aid short text clustering performance. The external collections are referred to in this paper as Background Knowledge. In contrast to most previous literature a separate collection of Background Knowledge is obtained for each short text dataset. However, this Background Knowledge contains several orders of magnitude fewer documents than commonly used repositories like Wikipedia. A simple approach is described where the Background Knowledge is used to re-express short text in terms of a much richer feature space. A discussion of how best to cluster documents in this feature space is presented. A solution is proposed, and an experimental evaluation is performed that demonstrates significant improvement over clustering based on standard metrics with several publicly available datasets represented in the richer feature space. 0 0
Exploiting CCG structures with tree Kernels for speculation detection Sanchez L.M.
Li B.
Vogel C.
CoNLL-2010: Shared Task - Fourteenth Conference on Computational Natural Language Learning, Proceedings of the Shared Task English 2010 Our CoNLL-2010 speculative sentence detector disambiguates putative keywords based on the following considerations: a speculative keyword may be composed of one or more word tokens; a speculative sentence may have one or more speculative keywords; and if a sentence contains at least one real speculative keyword, it is deemed speculative. A tree kernel classifier is used to assess whether a potential speculative keyword conveys speculation. We exploit information implicit in tree structures. For prediction efficiency, only a segment of the whole tree around a speculation keyword is considered, along with morphological features inside the segment and information about the containing document. A maximum entropy classifier is used for sentences not covered by the tree kernel classifier. Experiments on the Wikipedia data set show that our system achieves 0.55 F-measure (in-domain). 0 0
Exploring the semantics behind a collection to improve automated image annotation Llorente A.
Motta E.
Stefan Ruger
Lecture Notes in Computer Science English 2010 The goal of this research is to explore several semantic relatedness measures that help to refine annotations generated by a baseline non-parametric density estimation algorithm. Thus, we analyse the benefits of performing a statistical correlation using the training set or the World Wide Web versus approaches based on a thesaurus like WordNet or on Wikipedia (considered as a hyperlink structure). Experiments are carried out using the dataset provided by the 2009 edition of the ImageCLEF competition, a subset of the MIR-Flickr 25k collection. The best results correspond to the approaches based on statistical correlation, as they do not depend on a prior disambiguation phase, unlike WordNet and Wikipedia. Further work needs to be done to assess whether proper disambiguation schemes might improve their performance. 0 0
Geographical classification of documents using evidence from Wikipedia Odon De Alencar R.
Davis Jr. C.A.
Goncalves M.A.
Proceedings of the 6th Workshop on Geographic Information Retrieval, GIR'10 English 2010 Obtaining or approximating a geographic location for search results often motivates users to include place names and other geography-related terms in their queries. Previous work shows that queries that include geography-related terms correspond to a significant share of the users' demand. Therefore, it is important to recognize the association of documents with places in order to adequately respond to such queries. This paper describes strategies for text classification into geography-related categories, using evidence extracted from Wikipedia. We use terms that correspond to entry titles and the connections between entries in Wikipedia's graph to establish a semantic network from which classification features are generated. Results of experiments using a news dataset, classified over Brazilian states, show that such terms constitute valid evidence for the geographical classification of documents, and demonstrate the potential of this technique for text classification. 0 0
Knowledge-rich Word Sense Disambiguation rivaling supervised systems Ponzetto S.P.
Roberto Navigli
ACL 2010 - 48th Annual Meeting of the Association for Computational Linguistics, Proceedings of the Conference English 2010 One of the main obstacles to high-performance Word Sense Disambiguation (WSD) is the knowledge acquisition bottleneck. In this paper, we present a methodology to automatically extend WordNet with large amounts of semantic relations from an encyclopedic resource, namely Wikipedia. We show that, when provided with a vast amount of high-quality semantic relations, simple knowledge-lean disambiguation algorithms compete with state-of-the-art supervised WSD systems in a coarse-grained all-words setting and outperform them on gold-standard domain-specific datasets. 0 0
Learning Word-Class Lattices for definition and hypernym extraction Roberto Navigli
Velardi P.
ACL 2010 - 48th Annual Meeting of the Association for Computational Linguistics, Proceedings of the Conference English 2010 Definition extraction is the task of automatically identifying definitional sentences within texts. The task has proven useful in many research areas including ontology learning, relation extraction and question answering. However, current approaches - mostly focused on lexicosyntactic patterns - suffer from both low recall and precision, as definitional sentences occur in highly variable syntactic structures. In this paper, we propose Word-Class Lattices (WCLs), a generalization of word lattices that we use to model textual definitions. Lattices are learned from a dataset of definitions from Wikipedia. Our method is applied to the task of definition and hypernym extraction and compares favorably to other pattern generalization methods proposed in the literature. 0 0
Lurking? Cyclopaths? A quantitative lifecycle analysis of user behavior in a geowiki Katherine Panciera
Reid Priedhorsky
Thomas Erickson
Loren Terveen
Conference on Human Factors in Computing Systems - Proceedings English 2010 Online communities produce rich behavioral datasets, e.g., Usenet news conversations, Wikipedia edits, and Facebook friend networks. Analysis of such datasets yields important insights (like the "long tail" of user participation) and suggests novel design interventions (like targeting users with personalized opportunities and work requests). However, certain key user data typically are unavailable, specifically viewing, pre-registration, and non-logged-in activity. The absence of data makes some questions hard to answer; access to it can strengthen, extend, or cast doubt on previous results. We report on analysis of user behavior in Cyclopath, a geographic wiki and route-finder for bicyclists. With access to viewing and non-logged-in activity data, we were able to: (a) replicate and extend prior work on user lifecycles in Wikipedia, (b) bring to light some pre-registration activity, thus testing for the presence of "educational lurking," and (c) demonstrate the locality of geographic activity and how editing and viewing are geographically correlated. 0 0
Mash-up approach for web video category recommendation Song Y.-C.
Hua Li
Proceedings - 4th Pacific-Rim Symposium on Image and Video Technology, PSIVT 2010 English 2010 With the advent of Web 2.0, billions of videos are now freely available online. Meanwhile, rich user-generated information for these videos, such as tags and online encyclopedias, offers us a chance to enhance existing video analysis technologies. In this paper, we propose a mash-up framework to realize video category recommendation by leveraging web information from different sources. Under this framework, we build a web video dataset from the YouTube API, and construct a concept collection for web video category recommendation (CCWV-CR) from this dataset, which consists of web video concepts with a small semantic gap and high categorization distinguishability. Besides, Wikipedia Propagation is proposed to optimize the video similarity measurement. The experiments on the large-scale dataset with 80,031 web videos demonstrate that: (1) the mash-up category recommendation framework improves greatly on existing state-of-the-art methods; (2) CCWV-CR is an efficient feature space for video category recommendation; (3) Wikipedia Propagation can boost the performance of video category recommendation. 0 0
Multi-view bootstrapping for relation extraction by exploring web features and linguistic features Yulan Yan
Hua Li
Yutaka Matsuo
Mitsuru Ishizuka
Lecture Notes in Computer Science English 2010 Binary semantic relation extraction from Wikipedia is particularly useful for various NLP and Web applications. Currently, frequent pattern mining-based methods and syntactic analysis-based methods are the two leading types of methods for the semantic relation extraction task. With a novel view on integrating syntactic analysis of Wikipedia text with redundancy information from the Web, we propose a multi-view learning approach for bootstrapping relationships between entities that exploits the complementarity between the Web view and the linguistic view. On the one hand, from the linguistic view, linguistic features are generated by linguistic parsing of Wikipedia texts, abstracting away from the different surface realizations of semantic relations. On the other hand, Web features are extracted from the Web corpus to provide frequency information for relation extraction. Experimental evaluation on a relational dataset demonstrates that linguistic analysis of Wikipedia texts and Web collective information reveal different aspects of the nature of entity-related semantic relationships. It also shows that our multi-view learning method considerably boosts performance compared to learning with only one view of features, with the weaknesses of one view complemented by the strengths of the other. 0 0
Overview of the INEX 2009 XML mining track: Clustering and classification of XML documents Nayak R.
De Vries C.M.
Kutty S.
Shlomo Geva
Ludovic Denoyer
Patrick Gallinari
Lecture Notes in Computer Science English 2010 This report explains the objectives, datasets and evaluation criteria of both the clustering and classification tasks set in the INEX 2009 XML Mining track. The report also describes the approaches and results obtained by the different participants. 0 0
Predicting positive and negative links in online social networks Leskovec J.
Huttenlocher D.
Kleinberg J.
Proceedings of the 19th International Conference on World Wide Web, WWW '10 English 2010 We study online social networks in which relationships can be either positive (indicating relations such as friendship) or negative (indicating relations such as opposition or antagonism). Such a mix of positive and negative links arise in a variety of online settings; we study datasets from Epinions, Slashdot and Wikipedia. We find that the signs of links in the underlying social networks can be predicted with high accuracy, using models that generalize across this diverse range of sites. These models provide insight into some of the fundamental principles that drive the formation of signed links in networks, shedding light on theories of balance and status from social psychology; they also suggest social computing applications by which the attitude of one user toward another can be estimated from evidence provided by their relationships with other members of the surrounding social network. 0 0
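The balance-theory intuition behind this sign-prediction work can be illustrated with a toy heuristic: a triad is balanced when the product of its edge signs is positive, so the sign of a new edge (u, v) can be guessed by a vote of s(u,w)·s(w,v) over common neighbors w. This is our own hypothetical simplification; the paper itself trains a logistic-regression model over richer triad and degree features.

```python
from collections import defaultdict

def predict_sign(edges, u, v):
    """edges: dict mapping frozenset({a, b}) -> +1/-1.
    Predict sign(u, v) by majority vote of sign products over shared neighbors."""
    nbrs = defaultdict(dict)
    for pair, s in edges.items():
        a, b = tuple(pair)
        nbrs[a][b] = s
        nbrs[b][a] = s
    vote = sum(nbrs[u][w] * nbrs[w][v]
               for w in nbrs[u] if w in nbrs[v] and w not in (u, v))
    return 1 if vote >= 0 else -1

edges = {
    frozenset({"a", "w1"}): 1, frozenset({"w1", "b"}): 1,    # friend of a friend
    frozenset({"a", "w2"}): -1, frozenset({"w2", "b"}): -1,  # enemy of an enemy
}
print(predict_sign(edges, "a", "b"))  # → 1: both triads vote positive
```

Status theory, also discussed in the paper, would instead treat signs as directed judgments of relative standing, which this undirected sketch does not capture.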
Retrieving landmark and non-landmark images from community photo collections Yannis Avrithis
Yannis Kalantidis
Giorgos Tolias
Evaggelos Spyrou
MM'10 - Proceedings of the ACM Multimedia 2010 International Conference English 2010 State-of-the-art data mining and image retrieval in community photo collections typically focus on popular subsets, e.g. images containing landmarks or associated to Wikipedia articles. We propose an image clustering scheme that, seen as vector quantization, compresses a large corpus of images by grouping visually consistent ones while providing a guaranteed distortion bound. This allows us, for instance, to represent the visual content of all thousands of images depicting the Parthenon in just a few dozen scene maps and still be able to retrieve any single, isolated, non-landmark image like a house or graffiti on a wall. Starting from a geo-tagged dataset, we first group images geographically and then visually, where each visual cluster is assumed to depict different views of the same scene. We align all views to one reference image and construct a 2D scene map by preserving details from all images while discarding repeating visual features. Our indexing, retrieval and spatial matching scheme then operates directly on scene maps. We evaluate the precision of the proposed method on a challenging one-million urban image dataset. 0 0
Semantic relatedness approach for named entity disambiguation Gentile A.L.
Zhang Z.
Linsi Xia
Iria J.
Communications in Computer and Information Science English 2010 Natural language is a means to express and discuss concepts, objects and events, i.e., it carries semantic content. One of the ultimate aims of Natural Language Processing techniques is to identify the meaning of a text, providing effective ways to make a proper linkage between textual references and their referents, that is, real-world objects. This work addresses the problem of giving a sense to proper names in a text, that is, automatically associating words representing Named Entities with their referents. The proposed methodology for Named Entity Disambiguation is based on Semantic Relatedness Scores obtained with a graph-based model over Wikipedia. We show that, without building a Bag of Words representation of the text, but only considering named entities within the text, the proposed paradigm achieves results competitive with the state of the art on two different datasets. 0 0
Similarity search and locality sensitive hashing using ternary content addressable memories Shinde R.
Goel A.
Gupta P.
Dutta D.
Proceedings of the ACM SIGMOD International Conference on Management of Data English 2010 Similarity search methods are widely used as kernels in various data mining and machine learning applications, including those in computational biology and web search/clustering. Nearest neighbor search (NNS) algorithms are often used to retrieve similar entries, given a query. While there exist efficient techniques for exact query lookup using hashing, similarity search using exact nearest neighbors suffers from a "curse of dimensionality", i.e. for high-dimensional spaces, the best known solutions offer little improvement over brute-force search and thus are unsuitable for large-scale streaming applications. Fast solutions to the approximate NNS problem include Locality Sensitive Hashing (LSH) based techniques, which need storage polynomial in n with exponent greater than 1, and query time sublinear, but still polynomial, in n, where n is the size of the database. In this work we present a new technique for solving the approximate NNS problem in Euclidean space using a Ternary Content Addressable Memory (TCAM), which needs near-linear space and has O(1) query time. In fact, this method also works around the best known lower bounds in the cell probe model for the query time using a data structure near-linear in the size of the database. TCAMs are high-performance associative memories widely used in networking applications such as address lookups and access control lists. A TCAM can query for a bit vector within a database of ternary vectors, where every bit position represents 0, 1 or *. The * is a wild card representing either a 0 or a 1. We leverage TCAMs to design a variant of LSH, called Ternary Locality Sensitive Hashing (TLSH), wherein we hash database entries represented by vectors in the Euclidean space into {0,1,*}.
By using the added functionality of a TLSH scheme with respect to the * character, we solve an instance of the approximate nearest neighbor problem with 1 TCAM access and storage nearly linear in the size of the database. We validate our claims with extensive simulations using both real-world (Wikipedia) as well as synthetic (but illustrative) datasets. We observe that using a TCAM of width 288 bits, it is possible to solve the approximate NNS problem on a database of one million points with high accuracy. Finally, we design an experiment with TCAMs within an enterprise Ethernet switch (Cisco Catalyst 4500) to validate that TLSH can be used to perform 1.5 million queries per second per 1 Gb/s port. We believe that this work can open new avenues in very high speed data mining. 0 0
Symbolic representation of text documents Guru D.S.
Harish B.S.
Manjunath S.
COMPUTE 2010 - The 3rd Annual ACM Bangalore Conference English 2010 This paper presents a novel method of representing a text document by the use of interval-valued symbolic features. A method of classification of text documents based on the proposed representation is also presented. The newly proposed model significantly reduces the dimension of feature vectors and also the time taken to classify a given document. Further, extensive experiments are conducted on the vehicles-wikipedia dataset to evaluate the performance of the proposed model. The experimental results reveal that the obtained results are on par with the existing results for the vehicles-wikipedia dataset. However, the advantage of the proposed model is that it takes relatively less time for classification, as it is based on a simple matching strategy. 0 0
The impact of research design on the half-life of the wikipedia category system Wang J.
Ma F.
Cheng J.
2010 International Conference on Computer Design and Applications, ICCDA 2010 English 2010 The Wikipedia category system exhibits a phenomenon of aging or obsolescence similar to that of periodical literature, so this paper investigates how factors related to study design and the research process, namely the observation points and the time span, affect the obsolescence of the Wikipedia category system. For the impact of different observation points, we make use of datasets at different time points under the same time span, and the results show that the observation points do have an obvious influence on the category cited half-life; and for the impact of the time span, we use datasets with different intervals at the same time point, and the results indicate that the time span has a certain impact on the categories' obsolescence. Based on this analysis, the paper further proposes some useful suggestions for similar studies on information obsolescence in the future. 0 1
The role of local and global weighting in assessing the semantic similarity of texts using latent semantic analysis Lintean M.
Moldovan C.
Rus V.
McNamara D.
Proceedings of the 23rd International Florida Artificial Intelligence Research Society Conference, FLAIRS-23 English 2010 In this paper, we investigate the impact of several local and global weighting schemes on Latent Semantic Analysis' (LSA) ability to capture semantic similarity between two texts. We worked with texts varying in size from sentences to paragraphs. We present a comparison of 3 local and 3 global weighting schemes across 3 different standardized data sets related to semantic similarity tasks. For local weighting, we used binary weighting, term-frequency, and log-type. For global weighting, we relied on binary, inverted document frequencies (IDF) collected from the English Wikipedia, and entropy, which is the standard weighting scheme used by most LSA-based applications. We studied all possible combinations of these weighting schemes on the following three tasks and corresponding data sets: paraphrase identification at sentence level using the Microsoft Research Paraphrase Corpus, paraphrase identification at sentence level using data from the intelligent tutoring system iSTART, and mental model detection based on student-articulated paragraphs in MetaTutor, another intelligent tutoring system. Our experiments revealed that for sentence-level texts, term-frequency local weighting in combination with either IDF or binary global weighting works best. For paragraph-level texts, log-type local weighting in combination with binary global weighting works best. We also found that global weights have a greater impact for sentence-level similarity, as the local weight is undermined by the small size of such texts. Copyright © 2010, Association for the Advancement of Artificial Intelligence. All rights reserved. 0 0
Using machine learning to support continuous ontology development Ramezani M.
Witschel H.F.
Braun S.
Zacharias V.
Lecture Notes in Computer Science English 2010 This paper presents novel algorithms to support the continuous development of ontologies, i.e. the development of ontologies during their use in social semantic bookmarking, semantic wikis, or other social semantic applications. Our goal is to assist users in placing a newly added concept in a concept hierarchy. The proposed algorithm is evaluated using a data set from Wikipedia and provides good-quality recommendations. These results point to novel possibilities for applying machine learning technologies to support social semantic applications. 0 0
Visualizing large-scale RDF data using subsets, summaries, and sampling in oracle Sundara S.
Atre M.
Kolovski V.
Sanmay Das
ZongDa Wu
Chong E.I.
Srinivasan J.
Proceedings - International Conference on Data Engineering English 2010 The paper addresses the problem of visualizing large-scale RDF data via a 3-S approach, namely, by using: 1) Subsets: to present only relevant data for visualization; both static and dynamic subsets can be specified; 2) Summaries: to capture the essence of the RDF data being viewed; summarized data can be expanded on demand, thereby allowing users to create hybrid (summary-detail) fisheye views of RDF data; and 3) Sampling: to further optimize visualization of large-scale data where a representative sample suffices. The visualization scheme works with both asserted and inferred triples (generated using RDF(S) and OWL semantics). This scheme is implemented in Oracle by developing a plug-in for the Cytoscape graph visualization tool, which uses functions defined in an Oracle PL/SQL package to provide fast and optimized access to the Oracle Semantic Store containing RDF data. Interactive visualization of a synthesized RDF data set (LUBM, 1 million triples), two native RDF datasets (Wikipedia, 47 million triples, and UniProt, 700 million triples), and an OWL ontology (eClassOwl, with a large class hierarchy including over 25,000 OWL classes, 5,000 properties, and 400,000 class-properties) demonstrates the effectiveness of our visualization scheme. 0 0
WikiAnalytics: Disambiguation of keyword search results on highly heterogeneous structured data Andrey Balmin
Curtmola E.
Proceedings of the ACM SIGMOD International Conference on Management of Data English 2010 Wikipedia infoboxes are an example of a seemingly structured, yet extraordinarily heterogeneous dataset, where any given record has only a tiny fraction of all possible fields. Such data cannot be queried using traditional means without a massive a priori integration effort, since even for a simple request the result values span many record types and fields. On the other hand, solutions based on keyword search are too imprecise to capture the user's intent. To address these limitations, we propose a system, referred to herein as WIKIANALYTICS, that utilizes a novel search paradigm in order to derive tables of precise and complete results from Wikipedia infobox records. The user starts with a keyword search query that finds a superset of the result records, and then browses clusters of records, deciding which are and are not relevant. WIKIANALYTICS uses three categories of clustering features based on record types, fields, and values that matched the query keywords, respectively. Since the system cannot predict which combination of features will be important to the user, it efficiently generates all possible clusters of records by all sets of features. We utilize a novel data structure, the universal navigational lattice (UNL), which compactly encodes all possible clusters. WIKIANALYTICS provides a dynamic and intuitive interface that lets the user explore the UNL and construct homogeneous structured tables, which can be further queried and aggregated using conventional tools. 0 0
WikiWars: A new corpus for research on temporal expressions Mazur P.
Dale R.
EMNLP 2010 - Conference on Empirical Methods in Natural Language Processing, Proceedings of the Conference English 2010 The reliable extraction of knowledge from text requires an appropriate treatment of the time at which reported events take place. Unfortunately, there are very few annotated data sets that support the development of techniques for event time-stamping and tracking the progression of time through a narrative. In this paper, we present a new corpus of temporally-rich documents sourced from English Wikipedia, which we have annotated with TIMEX2 tags. The corpus contains around 120000 tokens, and 2600 TIMEX2 expressions, thus comparing favourably in size to other existing corpora used in these areas. We describe the preparation of the corpus, and compare the profile of the data with other existing temporally annotated corpora. We also report the results obtained when we use DANTE, our temporal expression tagger, to process this corpus, and point to where further work is required. The corpus is publicly available for research purposes. 0 0
Wisdom of crowds versus wisdom of linguists - Measuring the semantic relatedness of words Torsten Zesch
Iryna Gurevych
Natural Language Engineering English 2010 In this article, we present a comprehensive study aimed at computing the semantic relatedness of word pairs. We analyze the performance of a large number of semantic relatedness measures proposed in the literature with respect to different experimental conditions, such as (i) the datasets employed, (ii) the language (English or German), (iii) the underlying knowledge source, and (iv) the evaluation task (computing scores of semantic relatedness, ranking word pairs, solving word choice problems). To our knowledge, this study is the first to systematically analyze semantic relatedness on a large number of datasets with different properties, while emphasizing the role of the knowledge source compiled either by the wisdom of linguists (i.e., classical wordnets) or by the wisdom of crowds (i.e., collaboratively constructed knowledge sources like Wikipedia). The article discusses benefits and drawbacks of different approaches to evaluating semantic relatedness. We show that results should be interpreted carefully to evaluate particular aspects of semantic relatedness. For the first time, we apply a vector-based measure of semantic relatedness, relying on a concept space built from documents, to the first paragraph of Wikipedia articles, to English WordNet glosses, and to GermaNet-based pseudo glosses. Contrary to previous research (Strube and Ponzetto 2006; Gabrilovich and Markovitch 2007; Zesch et al. 2007), we find that wisdom-of-crowds-based resources are not superior to wisdom-of-linguists-based resources. We also find that using the first paragraph of a Wikipedia article as opposed to the whole article leads to better precision, but decreases recall. 
Finally, we present two systems that were developed to aid the experiments presented herein and are freely available for research purposes: (i) DEXTRACT, a software tool to semi-automatically construct corpus-driven semantic relatedness datasets, and (ii) JWPL, a Java-based high-performance Wikipedia Application Programming Interface (API) for building natural language processing (NLP) applications. 0 0
2009 5th International Conference on Collaborative Computing: Networking, Applications and Worksharing, CollaborateCom 2009 No author name available 2009 5th International Conference on Collaborative Computing: Networking, Applications and Worksharing, CollaborateCom 2009 English 2009 The proceedings contain 68 papers. The topics discussed include: multi-user multi-account interaction in groupware supporting single-display collaboration; supporting collaborative work through flexible process execution; dynamic data services: data access for collaborative networks in a multi-agent systems architecture; integrating external user profiles in collaboration applications; a collaborative framework for enforcing server commitments, and for regulating server interactive behavior in SOA-based systems; CASTLE: a social framework for collaborative anti-phishing databases; VisGBT: visually analyzing evolving datasets for adaptive learning; an IT appliance for remote collaborative review of mechanisms of injury to children in motor vehicle crashes; user contribution and trust in Wikipedia; and a new perspective on experimental analysis of N-tier systems: evaluating database scalability, multi-bottlenecks, and economical operation. 0 0
A new multiple kernel approach for visual concept learning Jiang Yang
Yanyan Li
Tian Y.
Duan L.
Gao W.
Lecture Notes in Computer Science English 2009 In this paper, we present a novel multiple kernel method to learn the optimal classification function for visual concepts. Although many carefully designed kernels have been proposed in the literature to measure visual similarity, little work has been done on how these kernels actually affect learning performance. We propose a Per-Sample Based Multiple Kernel Learning method (PS-MKL) to investigate the discriminative power of each training sample in different basic kernel spaces. The optimal, sample-specific kernel is learned as a linear combination of a set of basic kernels, which leads to a convex optimization problem with a unique global optimum. As illustrated in the experiments on the Caltech 101 and Wikipedia MM datasets, the proposed PS-MKL outperforms traditional Multiple Kernel Learning methods (MKL) and achieves comparable results with the state-of-the-art methods for learning visual concepts. 0 0
An axiomatic approach for result diversification Gollapudi S.
Sharma A.
WWW'09 - Proceedings of the 18th International World Wide Web Conference English 2009 Understanding user intent is key to designing an effective ranking system in a search engine. In the absence of any explicit knowledge of user intent, search engines want to diversify results to improve user satisfaction. In such a setting, the probability ranking principle-based approach of presenting the most relevant results on top can be sub-optimal, and hence the search engine would like to trade-off relevance for diversity in the results. In analogy to prior work on ranking and clustering systems, we use the axiomatic approach to characterize and design diversification systems. We develop a set of natural axioms that a diversification system is expected to satisfy, and show that no diversification function can satisfy all the axioms simultaneously. We illustrate the use of the axiomatic framework by providing three example diversification objectives that satisfy different subsets of the axioms. We also uncover a rich link to the facility dispersion problem that results in algorithms for a number of diversification objectives. Finally, we propose an evaluation methodology to characterize the objectives and the underlying axioms. We conduct a large scale evaluation of our objectives based on two data sets: a data set derived from the Wikipedia disambiguation pages and a product database. Copyright is held by the International World Wide Web Conference Committee (IW3C2). 0 0
Automatic link detection: A sequence labeling approach Gardner J.J.
Xiong L.
International Conference on Information and Knowledge Management, Proceedings English 2009 The popularity of Wikipedia and other online knowledge bases has recently produced an interest in the machine learning community in the problem of automatic linking. Automatic hyperlinking can be viewed as two subproblems: link detection, which determines the source of a link, and link disambiguation, which determines the destination of a link. Wikipedia is a rich corpus with hyperlink data provided by authors. It is possible to use this data to train classifiers to mimic the authors in some capacity. In this paper, we introduce automatic link detection as a sequence labeling problem. Conditional random fields (CRFs) are a probabilistic framework for labeling sequential data. We show that training a CRF with different types of features from the Wikipedia dataset can be used to automatically detect links with almost perfect precision and high recall. Copyright 2009 ACM. 0 0
Binrank: Scaling dynamic authority-based search using materialized subgraphs Heasoo Hwang
Andrey Balmin
Berthold Reinwald
Erik Nijkamp
Proceedings - International Conference on Data Engineering English 2009 Dynamic authority-based keyword search algorithms, such as ObjectRank and personalized PageRank, leverage semantic link information to provide high quality, high recall search in databases and on the Web. Conceptually, these algorithms require a query-time PageRank-style iterative computation over the full graph. This computation is too expensive for large graphs, and not feasible at query time. Alternatively, building an index of pre-computed results for some or all keywords involves very expensive preprocessing. We introduce BinRank, a system that approximates ObjectRank results by utilizing a hybrid approach inspired by materialized views in traditional query processing. We materialize a number of relatively small subsets of the data graph in such a way that any keyword query can be answered by running ObjectRank on only one of the sub-graphs. BinRank generates the sub-graphs by partitioning all the terms in the corpus based on their co-occurrence, executing ObjectRank for each partition using the terms to generate a set of random walk starting points, and keeping only those objects that receive nonnegligible scores. The intuition is that a sub-graph that contains all objects and links relevant to a set of related terms should have all the information needed to rank objects with respect to one of these terms. We demonstrate that BinRank can achieve sub-second query execution time on the English Wikipedia dataset, while producing high quality search results that closely approximate the results of ObjectRank on the original graph. The Wikipedia link graph contains about 10^8 edges, which is at least two orders of magnitude larger than what prior state of the art dynamic authority-based search systems have been able to demonstrate. Our experimental evaluation investigates the trade-off between query execution time, quality of the results, and storage requirements of BinRank. 0 0
BorderFlow: A local graph clustering algorithm for natural language processing Ngomo A.-C.N.
Schumacher F.
Lecture Notes in Computer Science English 2009 In this paper, we introduce BorderFlow, a novel local graph clustering algorithm, and its application to natural language processing problems. For this purpose, we first present a formal description of the algorithm. Then, we use BorderFlow to cluster large graphs and to extract concepts from word similarity graphs. The clustering of large graphs is carried out on graphs extracted from the Wikipedia Category Graph. The subsequent low-bias extraction of concepts is carried out on two data sets consisting of noisy and clean data. We show that BorderFlow efficiently computes clusters of high quality and purity. Therefore, BorderFlow can be integrated in several other natural language processing applications. 0 0
Building a text classifier by a keyword and Wikipedia knowledge Qiang Qiu
YanChun Zhang
Junping Zhu
Qu W.
Lecture Notes in Computer Science English 2009 Traditional approaches to building text classifiers usually require a lot of labeled documents, which are expensive to obtain. In this paper, we propose a new text classification approach based on a keyword and Wikipedia knowledge, so as to avoid labeling documents manually. Firstly, we retrieve a set of related documents about the keyword from Wikipedia. Then, with the help of related Wikipedia pages, more positive documents are extracted from the unlabeled documents. Finally, we train a text classifier with these positive documents and unlabeled documents. The experimental results on the 20 Newsgroups dataset show that the proposed approach performs very competitively compared with NB-SVM, a PU learner, and NB, a supervised learner. 0 0
China physiome project: A comprehensive framework for anatomical and physiological databases from the China digital human and the visible rat Han D.
Qiaoling Liu
Luo Q.
Proceedings of the IEEE English 2009 The study of connections between biological structure and function, as well as between anatomical data and mechanical or physiological models, has become increasingly significant with the rapid advancement of experimental and computational physiology. The China Physiome Project (CPP) is dedicated to optimizing the exploration of these connections, based on standardization and integration of the structural datasets and their derivatives from cryosectional images, with various standards, collaboration mechanisms, and online services. The CPP framework incorporates three-dimensional anatomical models of human and rat anatomy, finite-element models of the whole-body human skeleton, and multiparticle radiological dosimetry data for both the human and rat computational phantoms. The ontology of CPP was defined using MeSH, with all standardized model descriptions implemented in M3L, a multiscale modeling language based on XML. Services provided, based on the wiki concept, include collaborative research, model version control, data sharing, and online analysis of M3L documents. As a sample case, a multiscale model of the human heart, in which familial hypertrophic cardiomyopathy was studied according to structure-function relations from the genetic level to the organ level, is integrated into the framework and given as a demonstration of multiscale physiological modeling based on CPP. 0 0
Coloring RDF triples to capture provenance Flouris G.
Fundulaki I.
Pediaditis P.
Theoharis Y.
Christophides V.
Lecture Notes in Computer Science English 2009 Recently, the W3C Linking Open Data effort has boosted the publication and inter-linkage of large amounts of RDF datasets on the Semantic Web. Various ontologies and knowledge bases with millions of RDF triples from Wikipedia and other sources, mostly in e-science, have been created and are publicly available. Recording provenance information of RDF triples aggregated from different heterogeneous sources is crucial in order to effectively support trust mechanisms, digital rights and privacy policies. Managing provenance becomes even more important when we consider not only explicitly stated but also implicit triples (through RDFS inference rules) in conjunction with declarative languages for querying and updating RDF graphs. In this paper we rely on colored RDF triples represented as quadruples to capture and manipulate explicit provenance information. 0 0
Context based wikipedia linking Michael Granitzer
Seifert C.
Zechner M.
Lecture Notes in Computer Science English 2009 Automatically linking Wikipedia pages can be done either content-based, by exploiting word similarities, or structure-based, by exploiting characteristics of the link graph. Our approach focuses on a content-based strategy by detecting Wikipedia titles as link candidates and selecting the most relevant ones as links. The relevance calculation is based on the context, i.e. the surrounding text of a link candidate. Our goal was to evaluate the influence of the link context on selecting relevant links and determining a link's best entry point. Results show that a whole Wikipedia page provides the best context for resolving links, and that straightforward inverse-document-frequency-based scoring of anchor texts achieves around 4% less Mean Average Precision on the provided data set. 0 0
Efficient indices using graph partitioning in RDF triple stores Yulan Yan
Chao Wang
Zhou A.
Qian W.
Ma L.
Yue Pan
Proceedings - International Conference on Data Engineering English 2009 With the advance of the Semantic Web, ever more RDF data are generated, published, queried, and reused via the Web. For example, DBpedia, a community effort to extract structured data from Wikipedia articles, surpassed 100 million RDF triples in its latest release. Likewise, the Linking Open Data (LOD) project, initiated by Tim Berners-Lee, has published and interlinked many open-licence datasets consisting of over 2 billion RDF triples so far. In this context, fast query response over such large-scale data is one of the challenges for existing RDF data stores. In this paper, we propose a novel triple indexing scheme to help an RDF query engine quickly locate instances within a small scope. By considering the RDF data as a graph, we partition the graph into multiple subgraph pieces and store them individually, over which a signature tree is built to index the URIs. When a query arrives, the signature tree index is used to quickly locate the partitions that might include matches for the query via its constant URIs. Our experiments indicate that the indexing scheme dramatically reduces query processing time in most cases, because many partitions are filtered out early and the expensive exact matching is only performed over a quite small scope of the original dataset. 0 0
Entropy-based metrics for evaluating schema reuse Luo X.
Shinavier J.
Lecture Notes in Computer Science English 2009 Schemas, which provide a way to give structure to information, are becoming more and more important for information integration. The model described here provides concrete metrics of the momentary "health" of an application and its evolution over time, as well as a means of comparing one application with another. Building upon the basic notions of actors, concepts, and instances, the presented technique defines and measures the information entropy of a number of simple relationships among these objects. The technique itself is evaluated against data sets drawn from the Freebase collaborative database, the Swoogle search engine, and an instance of Semantic MediaWiki. 0 0
Exploiting internal and external semantics for the clustering of short texts using world knowledge Hu X.
Sun N.
Zhang C.
Chua T.-S.
International Conference on Information and Knowledge Management, Proceedings English 2009 Clustering of short texts, such as snippets, presents great challenges for existing aggregated search techniques due to the problem of data sparseness and the complex semantics of natural language. As short texts do not provide sufficient term-occurrence information, traditional text representation methods, such as the "bag of words" model, have several limitations when directly applied to short-text tasks. In this paper, we propose a novel framework to improve the performance of short-text clustering by exploiting the internal semantics of the original text and external concepts from world knowledge. The proposed method employs a hierarchical three-level structure to tackle the data sparsity problem of original short texts and reconstructs the corresponding feature space with the integration of multiple semantic knowledge bases - Wikipedia and WordNet. Empirical evaluation on Reuters and a real web dataset demonstrates that our approach is able to achieve significant improvement over the state-of-the-art methods. Copyright 2009 ACM. 0 0
Google challenge: Incremental-learning for web video categorization on robust semantic feature space Song Y.-C.
Zhang Y.-D.
Xiaodan Zhang
Cao J.
Li J.-T.
MM'09 - Proceedings of the 2009 ACM Multimedia Conference, with Co-located Workshops and Symposiums English 2009 With the advent of video sharing websites, the amount of video on the internet grows rapidly. Web video categorization is an efficient methodology for organizing this huge amount of data. In this paper, we propose an effective web video categorization algorithm for large-scale datasets. It includes two factors: 1) For the great diversity of web videos, we develop an effective semantic feature space called Concept Collection for Web Video Categorization (CCWV-CD) to represent web videos, which consists of concepts with a small semantic gap and high distinguishing ability. Meanwhile, the online Wikipedia API is employed to diffuse the concept correlations in this space. 2) We propose an incremental support vector machine with a fixed number of support vectors (n-ISVM) to fit the large-scale incremental learning problem in web video categorization. Extensive experiments conducted on a dataset of the 80,024 most representative videos on YouTube demonstrate that the semantic space with Wikipedia propagation is more representative for web videos, and that n-ISVM outperforms other algorithms in efficiency when performing incremental learning. 0 0
Large scale incremental web video categorization Xiaodan Zhang
Song Y.-C.
Cao J.
Zhang Y.-D.
Li J.-T.
1st International Workshop on Web-Scale Multimedia Corpus, WSMC'09, Co-located with the 2009 ACM International Conference on Multimedia, MM'09 English 2009 With the advent of video sharing websites, the amount of video on the internet grows rapidly. Web video categorization is an efficient methodology for organizing this huge amount of videos. In this paper we investigate the characteristics of web videos and make two contributions to large-scale incremental web video categorization. First, we develop an effective semantic feature space, Concept Collection for Web Video with Categorization Distinguishability (CCWV-CD), which consists of concepts with a small semantic gap, whose correlations are diffused by a novel Wikipedia Propagation (WP) method. Second, we propose an incremental support vector machine with a fixed number of support vectors (n-ISVM) for large-scale incremental learning. To evaluate the performance of CCWV-CD, WP, and n-ISVM, we conduct extensive experiments on a dataset of the 80,021 most representative videos on a video sharing website. The experimental results show that CCWV-CD and WP are more representative for web videos, and that the n-ISVM algorithm greatly improves efficiency in the incremental learning setting. Copyright 2009 ACM. 0 0
Large-scale taxonomy mapping for restructuring and integrating Wikipedia Ponzetto S.P.
Roberto Navigli
IJCAI International Joint Conference on Artificial Intelligence English 2009 We present a knowledge-rich methodology for disambiguating Wikipedia categories with WordNet synsets and using this semantic information to restructure a taxonomy automatically generated from the Wikipedia system of categories. We evaluate against a manual gold standard and show that both category disambiguation and taxonomy restructuring perform with high accuracy. In addition, we assess these methods on automatically generated datasets and show that we are able to effectively enrich WordNet with a large number of instances from Wikipedia. Our approach produces an integrated resource, thus bringing together the fine-grained classification of instances in Wikipedia and a well-structured top-level taxonomy from WordNet. 0 0
MagicCube: Choosing the best snippet for each aspect of an entity Yafang Wang
Li Zhao
YanChun Zhang
International Conference on Information and Knowledge Management, Proceedings English 2009 Wikis are currently used in business to provide knowledge management systems, especially for individual organizations. However, building wikis manually is laborious and time-consuming work. To assist in founding wikis, we propose a methodology in this paper to automatically select the best snippets for entities as their initial explanations. Our method consists of two steps. First, we focus on extracting snippets from a given set of web pages for each entity. Starting from a seed sentence, a snippet grows by adding the most relevant neighboring sentences into itself. The sentences are chosen by the Snippet Growth Model, which employs a distance function and an influence function to make decisions. Secondly, we pick out the best snippet for each aspect of an entity. The combination of all the selected snippets serves as the primary description of the entity. We present three increasingly sophisticated methods to handle the selection process. Experimental results based on a real data set show that our proposed method works effectively in producing primary descriptions for entities such as employee names. Copyright 2009 ACM. 0 0
Towards semantic tagging in collaborative environments Chandramouli K.
Kliegr T.
Svatek V.
Izquierdo E.
DSP 2009: 16th International Conference on Digital Signal Processing, Proceedings English 2009 Tags provide an efficient and effective way of organizing resources, but they are not always available. The SCM/THD technique investigated in this paper extracts entities from free-text annotations and, using the Lin similarity measure over the WordNet thesaurus, classifies them into a controlled vocabulary of tags. Hypernyms extracted from Wikipedia are used to map uncommon entities to WordNet synsets. In collaborative environments, users can assign multiple annotations to the same object, thus increasing the amount of information available. Assuming that the semantics of the annotations overlap, this redundancy can be exploited to generate higher-quality tags. A preliminary experiment presented in the paper evaluates the consistency and quality of tags generated from multiple annotations of the same image. The results obtained on an experimental dataset comprising 62 annotations from four annotators show that the accuracy of a simple majority vote surpasses the average accuracy obtained by assessing the annotations individually by 18%. A moderate-strength correlation has been found between the quality of the generated tags and the consistency of the annotations. 0 0
Vispedia: On-demand data integration for interactive visualization and exploration Bryan Chan
Justin Talbot
Wu L.
Sakunkoo N.
Mike Cammarano
Pat Hanrahan
SIGMOD-PODS'09 - Proceedings of the International Conference on Management of Data and 28th Symposium on Principles of Database Systems English 2009 Wikipedia is an example of the large, collaborative, semi-structured data sets emerging on the Web. Typically, before these data sets can be used, they must be transformed into structured tables via data integration. We present Vispedia, a Web-based visualization system which incorporates data integration into an iterative, interactive data exploration and analysis process. This reduces the upfront cost of using heterogeneous data sets like Wikipedia. Vispedia is driven by a keyword-query-based integration interface implemented using a fast graph search. The search occurs interactively over DBpedia's semantic graph of Wikipedia, without depending on the existence of a structured ontology. This combination of data integration and visualization enables a broad class of non-expert users to more effectively use the semi-structured data available on the Web. 0 0
Clustering XML documents using closed frequent subtrees: A structural similarity approach Kutty S.
Thanh Tran
Nayak R.
Yanyan Li
Lecture Notes in Computer Science English 2008 This paper presents an experimental study conducted over the INEX 2007 Document Mining Challenge corpus employing a frequent subtree-based incremental clustering approach. Using the structural information of the XML documents, the closed frequent subtrees are generated. A matrix is then developed representing the closed frequent subtree distribution in documents. This matrix is used to progressively cluster the XML documents. In spite of the large number of documents in the INEX 2007 Wikipedia dataset, the proposed frequent subtree-based incremental clustering approach was successful in clustering the documents. 0 0
Document clustering using incremental and pairwise approaches Thanh Tran
Nayak R.
Bruza P.
Lecture Notes in Computer Science English 2008 This paper presents the experiments and results of an approach for clustering the large Wikipedia dataset in the INEX 2007 Document Mining Challenge. The clustering approach employed makes use of an incremental clustering method and a pairwise clustering method. The approach enables us to perform the clustering task on a large dataset by first reducing the dimension of the dataset to an undefined number of clusters using the incremental method. The lower-dimensional dataset is then clustered to the required number of clusters using the pairwise method. In this way, clustering of the large number of documents is performed successfully while maintaining the accuracy of the clustering solution. 0 0
EFS: Expert finding system based on Wikipedia link pattern analysis Yang K.-H.
Chen C.-Y.
Lee H.-M.
Ho J.-M.
Conference Proceedings - IEEE International Conference on Systems, Man and Cybernetics English 2008 Building an expert finding system is very important for many applications, especially in the academic environment. Previous work uses e-mails or web pages as a corpus to analyze the expertise of each expert. In this paper, we present an Expert Finding System, abbreviated as EFS, which builds experts' profiles from their journal publications. For a given proposal, the EFS first looks up the Wikipedia web site to obtain relevant link information, and then lists and ranks all associated experts using that information. In our experiments, we use a real-world dataset which comprises 882 people and 13,654 papers categorized into 9 expertise domains. Our experimental results show that the EFS works well on several expertise domains such as "Artificial Intelligence" and "Image & Pattern Recognition". 0 0
Enriching the crosslingual link structure of Wikipedia - A classification-based approach Sorg P.
Philipp Cimiano
AAAI Workshop - Technical Report English 2008 The crosslingual link structure of Wikipedia represents a valuable resource which can be exploited for crosslingual natural language processing applications. However, this requires that the link structure has reasonable coverage and is furthermore accurate. For the specific language pair German/English that we consider in our experiments, we show that roughly 50% of the articles are linked from German to English and only 14% from English to German. These figures clearly corroborate the need for an approach to automatically induce new cross-language links, especially in the light of a dynamically growing resource such as Wikipedia. In this paper we present a classification-based approach with the goal of inferring new cross-language links. Our experiments show that this approach has a recall of 70% with a precision of 94% for the task of learning cross-language links on a test dataset. 0 0
Importance of semantic representation: Dataless classification Chang M.-W.
Lev Ratinov
Dan Roth
Srikumar V.
Proceedings of the National Conference on Artificial Intelligence English 2008 Traditionally, text categorization has been studied as the problem of training a classifier using labeled data. However, people can categorize documents into named categories without any explicit training because we know the meaning of category names. In this paper, we introduce Dataless Classification, a learning protocol that uses world knowledge to induce classifiers without the need for any labeled data. Like humans, a dataless classifier interprets a string of words as a set of semantic concepts. We propose a model for dataless classification and show that the label name alone is often sufficient to induce classifiers. Using Wikipedia as our source of world knowledge, we get 85.29% accuracy on tasks from the 20 Newsgroups dataset and 88.62% accuracy on tasks from a Yahoo! Answers dataset without any labeled or unlabeled data from the datasets. With unlabeled data, we can further improve the results and show performance quite competitive with a supervised learning algorithm that uses 100 labeled examples. Copyright © 2008, Association for the Advancement of Artificial Intelligence. All rights reserved. 0 0
L3S at INEX 2007: Query expansion for entity ranking using a highly accurate ontology Gianluca Demartini
Firan C.S.
Tereza Iofciu
Lecture Notes in Computer Science English 2008 Entity ranking on Web-scale datasets is still an open challenge. Several resources, such as Wikipedia-based ontologies, can be used to improve the quality of the entity ranking produced by a system. In this paper we focus on the Wikipedia corpus and propose algorithms for finding entities based on query relaxation using category information. The main contribution is a methodology for expanding the user query by exploiting the semantic structure of the dataset. Our approach constructs queries using not only keywords from the topic, but also information about relevant categories. This is done by leveraging a highly accurate ontology which is matched to the character strings of the topic. The evaluation is performed using the INEX 2007 Wikipedia collection and entity ranking topics. The results show that our approach performs effectively, especially for early precision metrics. 0 0
Named entity normalization in user generated content Jijkoun V.
Khalid M.A.
Marx M.
Maarten de Rijke
Proceedings of SIGIR 2008 Workshop on Analytics for Noisy Unstructured Text Data, AND'08 English 2008 Named entity recognition is important for semantically oriented retrieval tasks, such as question answering, entity retrieval, biomedical retrieval, trend detection, and event and entity tracking. In many of these tasks it is important to be able to accurately normalize the recognized entities, i.e., to map surface forms to unambiguous references to real world entities. Within the context of structured databases, this task (known as record linkage and data de-duplication) has been a topic of active research for more than five decades. For edited content, such as news articles, the named entity normalization (NEN) task is one that has recently attracted considerable attention. We consider the task in the challenging context of user generated content (UGC), where it forms a key ingredient of tracking and media-analysis systems. A baseline NEN system from the literature (that normalizes surface forms to Wikipedia pages) performs considerably worse on UGC than on edited news: accuracy drops from 80% to 65% for a Dutch language data set and from 94% to 77% for English. We identify several sources of errors: entity recognition errors, multiple ways of referring to the same entity and ambiguous references. To address these issues we propose five improvements to the baseline NEN algorithm, to arrive at a language independent NEN system that achieves overall accuracy scores of 90% on the English data set and 89% on the Dutch data set. We show that each of the improvements contributes to the overall score of our improved NEN algorithm, and conclude with an error analysis on both Dutch and English language UGC. The NEN system is computationally efficient and runs with very modest computational requirements. Copyright 2008 ACM. 0 0
Object image retrieval by exploiting online knowledge resources Gang Wang
Forsyth D.
26th IEEE Conference on Computer Vision and Pattern Recognition, CVPR English 2008 We describe a method to retrieve images found on web pages with specified object class labels, using an analysis of the text around the image and of the image appearance. Our method determines whether an object is both described in the text and appears in an image using a discriminative image model and a generative text model. Our models are learnt by exploiting established online knowledge resources (Wikipedia pages for text; Flickr and Caltech data sets for images). These resources provide rich text and object appearance information. We describe results on two data sets. The first is Berg's collection of ten animal categories; on this data set, we outperform previous approaches [7, 33]. We have also collected five more categories. Experimental results show the effectiveness of our approach on this new data set. 0 0
Using wiktionary for computing semantic relatedness Torsten Zesch
Muller C.
Iryna Gurevych
Proceedings of the National Conference on Artificial Intelligence English 2008 We introduce Wiktionary as an emerging lexical semantic resource that can be used as a substitute for expert-made resources in AI applications. We evaluate Wiktionary on the pervasive task of computing semantic relatedness for English and German by means of correlation with human rankings and solving word choice problems. For the first time, we apply a concept vector based measure to a set of different concept representations such as Wiktionary pseudo glosses, the first paragraph of Wikipedia articles, English WordNet glosses, and GermaNet pseudo glosses. We show that: (i) Wiktionary is the best lexical semantic resource in the ranking task and performs comparably to other resources in the word choice task, and (ii) the concept vector based approach yields the best results on all datasets in both evaluations. Copyright © 2008, Association for the Advancement of Artificial Intelligence. All rights reserved. 0 1
Vispedia: Interactive visual exploration of Wikipedia data via search-based integration Bryan Chan
Wu L.
Justin Talbot
Mike Cammarano
Pat Hanrahan
IEEE Transactions on Visualization and Computer Graphics English 2008 Wikipedia is an example of the collaborative, semi-structured data sets emerging on the Web. These data sets have large, non-uniform schema that require costly data integration into structured tables before visualization can begin. We present Vispedia, a Web-based visualization system that reduces the cost of this data integration. Users can browse Wikipedia, select an interesting data table, then use a search interface to discover, integrate, and visualize additional columns of data drawn from multiple Wikipedia articles. This interaction is supported by a fast path search algorithm over DBpedia, a semantic graph extracted from Wikipedia's hyperlink structure. Vispedia can also export the augmented data tables produced for use in traditional visualization systems. We believe that these techniques begin to address the "long tail" of visualization by allowing a wider audience to visualize a broader class of data. We evaluated this system in a first-use formative lab study. Study participants were able to quickly create effective visualizations for a diverse set of domains, performing data integration as needed. 0 0
A comparison of dimensionality reduction techniques for Web structure mining Chikhi N.F.
Rothenburger B.
Aussenac-Gilles N.
Proceedings of the IEEE/WIC/ACM International Conference on Web Intelligence, WI 2007 English 2007 In many domains, dimensionality reduction techniques have been shown to be very effective for elucidating the underlying semantics of data. Thus, in this paper we investigate the use of various dimensionality reduction techniques (DRTs) to extract the implicit structures hidden in web hyperlink connectivity. We apply and compare four DRTs, namely Principal Component Analysis (PCA), Non-negative Matrix Factorization (NMF), Independent Component Analysis (ICA) and Random Projection (RP). Experiments conducted on three datasets allow us to assert the following: NMF outperforms PCA and ICA in terms of stability and interpretability of the discovered structures; and the well-known WebKB dataset, used in a large number of works on the analysis of hyperlink connectivity, seems ill-suited for this task, so we suggest instead the recent Wikipedia dataset, which is better adapted. 0 0
Concordance-based entity-oriented search Bautin M.
Skiena S.
Proceedings of the IEEE/WIC/ACM International Conference on Web Intelligence, WI 2007 English 2007 We consider the problem of finding the relevant named entities in response to a search query over a given text corpus. Entity search can readily be used to augment conventional web search engines for a variety of applications. To assess the significance of entity search, we analyzed the AOL dataset of 36 million web search queries with respect to two different sets of entities: namely (a) 2.3 million distinct entities extracted from a news text corpus and (b) 2.9 million Wikipedia article titles. The results clearly indicate that search engines should be aware of entities: under various matching criteria, between 18% and 39% of all web search queries can be recognized as specifically searching for entities, while 73-87% of all queries contain entities. Our entity search engine creates a concordance document for each entity, consisting of all the sentences in the corpus containing that entity. We then index and search these documents using open-source search software. This gives a ranked list of entities as the result of a search. A demonstration of our entity search engine over a large news corpus is available online. We evaluate our system by comparing the results of each query to the list of entities that have the highest statistical juxtaposition scores with the queried entity. The juxtaposition score is a measure of how strongly two entities are related, in terms of a probabilistic upper bound. The results show excellent performance, particularly over well-characterized classes of entities such as people. 0 0
Discovering unknown connections - The DBpedia relationship finder Janette Lehmann
Schuppel J.
Sören Auer
The Social Semantic Web 2007 - Proceedings of the 1st Conference on Social Semantic Web, CSSW 2007 English 2007 The Relationship Finder is a tool for exploring connections between objects in a Semantic Web knowledge base. It offers a new way to get insights about elements in an ontology, in particular for large amounts of instance data. For this reason, we applied the idea to the DBpedia data set, which contains an enormous amount of knowledge extracted from Wikipedia. We describe the workings of the Relationship Finder algorithm and present some interesting statistical discoveries about DBpedia and Wikipedia. 0 0
Exploiting Wikipedia as external knowledge for named entity recognition Jun'ichi Kazama
Kentaro Torisawa
EMNLP-CoNLL 2007 - Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning English 2007 We explore the use of Wikipedia as external knowledge to improve named entity recognition (NER). Our method retrieves the corresponding Wikipedia entry for each candidate word sequence and extracts a category label from the first sentence of the entry, which can be thought of as a definition part. These category labels are used as features in a CRF-based NE tagger. We demonstrate using the CoNLL 2003 dataset that the Wikipedia category labels extracted by such a simple method actually improve the accuracy of NER. 0 0
Featureless similarities for terms clustering using tree-traversing ants Wong W.
Wei Liu
Bennamoun M.
ACM International Conference Proceeding Series English 2006 Existing ontology engineering systems are difficult to scale across domains and to adapt to knowledge fluctuations, and the term clustering results they present are far from desirable. In this paper, we propose a new version of the ant-based method for clustering terms known as Tree-Traversing Ants (TTA). With the help of the Normalized Google Distance (NGD) and n° of Wikipedia (n°W) as measures of similarity and distance between terms, we attempt to achieve an adaptable clustering method that is highly scalable across domains. Initial experiments with two datasets show promising results and demonstrate several advantages that are not simultaneously present in standard ant-based and other conventional clustering methods. 0 0
Relevance feedback models for recommendation Utiyama M.
Michihiro Yamamoto
COLING/ACL 2006 - EMNLP 2006: 2006 Conference on Empirical Methods in Natural Language Processing, Proceedings of the Conference English 2006 We extended language modeling approaches in information retrieval (IR) to combine collaborative filtering (CF) and content-based filtering (CBF). Our approach is based on the analogy between IR and CF, especially between CF and relevance feedback (RF). Both CF and RF exploit users' preference/relevance judgments to recommend items. We first introduce a multinomial model that combines CF and CBF in a language modeling framework. We then generalize the model to another multinomial model that approximates the Polya distribution. This generalized model outperforms the multinomial model by 3.4% for CBF and 17.4% for CF in recommending English Wikipedia articles. The performance of the generalized model for three different datasets was comparable to that of a state-of-the-art item-based CF method. 0 0
Stimulating collaborative development in operations research with libOR Ven K.
Sorensen K.
Verelst J.
Sevaux M.
OSS 2005 - Proceedings of the 1st International Conference on Open Source Systems English 2005 In this paper we describe the development of libOR, an on-line library for the operations research (OR) community. The design and operation of this website are inspired by the Open Source movement and by recent developments such as Wikipedia. In operations research, data sets are exchanged between researchers in order to test the performance of newly developed algorithms. Currently, the exchange of these data sets suffers from many problems. One of the main problems is that data sets are exchanged through a centrally maintained website, which is slow to respond to new developments. By applying an Open Source approach to content creation, we hope to spur the diffusion of information within the operations research community. 0 0