(Alternative names for this keyword)
|Export and share|
|BibTeX, CSV, RDF, JSON|
|Browse properties · List of keywords|
Classification is included as keyword or extra keyword in 0 datasets, 0 tools and 30 publications.
There is no datasets for this keyword.
There is no tools for this keyword.
|Title||Author(s)||Published in||Language||DateThis property is a special property in this wiki.||Abstract||R||C|
|Exploiting the wisdom of the crowds for characterizing and connecting heterogeneous resources||Kawase R.
Pereira Nunes B.
|HT 2014 - Proceedings of the 25th ACM Conference on Hypertext and Social Media||English||2014||Heterogeneous content is an inherent problem for cross-system search, recommendation and personalization. In this paper we investigate differences in topic coverage and the impact of topics in different kinds of Web services. We use entity extraction and categorization to create fingerprints that allow for meaningful comparison. As a basis taxonomy, we use the 23 main categories of Wikipedia Category Graph, which has been assembled over the years by the wisdom of the crowds. Following a proof of concept of our approach, we analyze differences in topic coverage and topic impact. The results show many differences between Web services like Twitter, Flickr and Delicious, which reflect users' behavior and the usage of each system. The paper concludes with a user study that demonstrates the benefits of fingerprints over traditional textual methods for recommendations of heterogeneous resources.||0||0|
|Automated Decision support for human tasks in a collaborative system: The case of deletion in wikipedia||Gelley B.S.
|Proceedings of the 9th International Symposium on Open Collaboration, WikiSym + OpenSym 2013||English||2013||Wikipedia's low barriers to participation have the unintended effect of attracting a large number of articles whose topics do not meet Wikipedia's inclusion standards. Many are quickly deleted, often causing their creators to stop contributing to the site. We collect and make available several datasets of deleted articles, heretofore inaccessible, and use them to create a model that can predict with high precision whether or not an article will be deleted. We report precision of 98.6% and recall of 97.5% in the best case and high precision with lower, but still useful, recall, in the most difficult case. We propose to deploy a system utilizing this model on Wikipedia as a set of decision-support tools to help article creators evaluate and improve their articles before posting, and new article patrollers make more informed decisions about which articles to delete and which to improve. Categories and Subject Descriptors H.5.3. Collaborative Computing; Computer Supported Collaborative Work General Terms Measurement, Performance, Human Factors,. Copyright 2010 ACM.||0||0|
|Object recognition in wikimage data based on local invariant image features||Tomasev N.
|Proceedings - 2013 IEEE 9th International Conference on Intelligent Computer Communication and Processing, ICCP 2013||English||2013||Object recognition is an essential task in content-based image retrieval and classification. This paper deals with object recognition in WIKImage data, a collection of publicly available annotated Wikipedia images. WIKImage comprises a set of 14 binary classification problems with significant class imbalance. Our approach is based on using the local invariant image features and we have compared 3 standard and widely used feature types: SIFT, SURF and ORB. We have examined how the choice of representation affects the k-nearest neighbor data topology and have shown that some feature types might be more appropriate than others for this particular problem. In order to assess the difficulty of the data, we have evaluated 7 different k-nearest neighbor classification methods and shown that the recently proposed hubness-aware classifiers might be used to either increase the accuracy of prediction, or the macro-averaged F-score. However, our results indicate that further improvements are possible and that including the textual feature information might prove beneficial for system performance.||0||0|
|WikiDetect: Automatic vandalism detection for Wikipedia using linguistic features||Cioiu D.
|Lecture Notes in Computer Science||English||2013||Vandalism of the content has always been one of the greatest problems for Wikipedia, yet only few completely automatic solutions for solving it have been developed so far. Volunteers still spend large amounts of time correcting vandalized page edits, instead of using this time to improve the quality of the content of articles. The purpose of this paper is to introduce a new vandalism detection system, that only uses natural language processing and machine learning techniques. The system has been evaluated on a corpus of real vandalized data in order to test its performance and justify the design choices. The same expert annotated wikitext, extracted from the encyclopedia's database, is used to evaluate different vandalism detection algorithms. The paper presents a critical analysis of the obtained results, comparing them to existing solutions, and suggests different statistical classification methods that bring several improvements to the task at hand.||0||0|
|Are buildings only instances? Exploration in architectural style categories||Goel A.
|ACM International Conference Proceeding Series||English||2012||Instance retrieval has emerged as a promising research area with buildings as the popular test subject. Given a query image or region, the objective is to find images in the database containing the same object or scene. There has been a recent surge in efforts in finding instances of the same building in challenging datasets such as the Oxford 5k dataset , Oxford 100k dataset and the Paris dataset . We ascend one level higher and pose the question: Are Buildings Only Instances? Buildings located in the same geographical region or constructed in a certain time period in history often follow a specific method of construction. These architectural styles are characterized by certain features which distinguish them from other styles of architecture. We explore, beyond the idea of buildings as instances, the possibility that buildings can be categorized based on the architectural style. Certain characteristic features distinguish an architectural style from others. We perform experiments to evaluate how characteristic information obtained from low-level feature configurations can help in classification of buildings into architectural style categories. Encouraged by our observations, we mine characteristic features with semantic utility for different architectural styles from our dataset of European monuments. These mined features are of various scales, and provide an insight into what makes a particular architectural style category distinct. The utility of the mined characteristics is verified from Wikipedia.||0||0|
|Classifying image galleries into a taxonomy using metadata and wikipedia||Kramer G.
|Lecture Notes in Computer Science||English||2012||This paper presents a method for the hierarchical classification of image galleries into a taxonomy. The proposed method links textual gallery metadata to Wikipedia pages and categories. Entity extraction from metadata, entity ranking, and selection of categories is based on Wikipedia and does not require labeled training data. The resulting system performs well above a random baseline, and achieves a (micro-averaged) F-score of 0.59 on the 9 top categories of the taxonomy and 0.40 when using all 57 categories.||0||0|
|Detecting Wikipedia vandalism with a contributing efficiency-based approach||Tang X.
|Lecture Notes in Computer Science||English||2012||The collaborative nature of wiki has distinguished Wikipedia as an online encyclopedia but also makes the open contents vulnerable against vandalism. The current vandalism detection methods relying on basic statistic language features work well for explicitly offensive edits that perform massive changes. However, these techniques are evadable for the elusive vandal edits which make only a few unproductive or dishonest modifications. In this paper we proposed a contributing efficiency-based approach to detect the vandalism in Wikipedia and implement it with machine-learning based classifiers that incorporate the contributing efficiency along with other languages features. The results of extensional experiment show that the contributing efficiency can improve the recall of machine learning-based vandalism detection algorithms significantly.||0||0|
|Document classification by computing an echo in a very simple neural network||Brouard C.||Proceedings - International Conference on Tools with Artificial Intelligence, ICTAI||English||2012||In this paper we present a new classification system called ECHO. This system is based on a principle of echo and applied to document classification. It computes the score of a document for a class by combining a bottom-up and a top-down propagation of activation in a very simple neural network. This system bridges a gap between Machine Learning methods and Information Retrieval since the bottom-up and the top-down propagations can be seen as the measures of the specificity and exhaustivity which underlie the models of relevance used in Information Retrieval. The system has been tested on the Reuters 21578 collection and in the context of an international challenge on large scale hierarchical text classification with corpus extracted from Dmoz and Wikipedia. Its comparison with other classification systems has shown its efficiency.||0||0|
|Feature transformation method enhanced vandalism detection in wikipedia||Chang T.
|Lecture Notes in Computer Science||English||2012||A very example of web 2.0 application is Wikipedia, an online encyclopedia where anyone can edit and share information. However, blatantly unproductive edits greatly undermine the quality of Wikipedia. Their irresponsible acts force editors to waste time undoing vandalisms. For the purpose of improving information quality on Wikipedia and freeing the maintainer from such repetitive tasks, machine learning methods have been proposed to detect vandalism automatically. However, most of them focused on mining new features which seem to be inexhaustible to be discovered. Therefore, the question of how to make the best use of these features needs to be tackled. In this paper, we leverage feature transformation techniques to analyze the features and propose a framework using these methods to enhance detection. Experiment results on the public dataset PAN-WVC-10 show that our method is effective and it provides another useful method to help detect vandalism in Wikipedia.||0||0|
|Need to categorize: A comparative look at the categories of Universal Decimal Classification system and Wikipedia||Salah A.A.
|Leonardo||English||2012||This study analyzes the differences between the category structure of the Universal Decimal Classification (UDC) system (which is one of the widely used library classification systems in Europe) and Wikipedia. In particular, the authors compare the emerging structure of category-links to the structure of classes in the UDC. The authors scrutinize the question of how knowledge maps of the same domain differ when they are created socially (i.e. Wikipedia) as opposed to when they are created formally (UDC) using classification theory. As a case study, we focus on the category of "Arts".||0||0|
|TCSST: Transfer classification of short & sparse text using external data||Long G.
|ACM International Conference Proceeding Series||English||2012||Short & sparse text is becoming more prevalent on the web, such as search snippets, micro-blogs and product reviews. Accurately classifying short & sparse text has emerged as an important while challenging task. Existing work has considered utilizing external data (e.g. Wikipedia) to alleviate data sparseness, by appending topics detected from external data as new features. However, training a classifier on features concatenated from different spaces is not easy considering the features have different physical meanings and different significance to the classification task. Moreover, it exacerbates the "curse of dimensionality" problem. In this study, we propose a transfer classification method, TCSST, to exploit the external data to tackle the data sparsity issue. The transfer classifier will be learned in the original feature space. Considering that the labels of the external data may not be readily available or sufficiently enough, TCSST further exploits the unlabeled external data to aid the transfer classification. We develop novel strategies to allow TCSST to iteratively select high quality unlabeled external data to help with the classification. We evaluate the performance of TCSST on both benchmark as well as real-world data sets. Our experimental results demonstrate that the proposed method is effective in classifying very short & sparse text, consistently outperforming existing and baseline methods.||0||0|
|Validation and discovery of genotype-phenotype associations in chronic diseases using linked data||Pathak J.
|Studies in Health Technology and Informatics||English||2012||This study investigates federated SPARQL queries over Linked Open Data (LOD) in the Semantic Web to validate existing, and potentially discover new genotype-phenotype associations from public datasets. In particular, we report our preliminary findings for identifying such associations for commonly occurring chronic diseases using the Online Mendelian Inheritance in Man (OMIM) and Database for SNPs (dbSNP) within the LOD knowledgebase and compare them with Gene Wiki for coverage and completeness. Our results indicate that Semantic Web technologies can play an important role for in-silico identification of novel disease-gene-SNP associations, although additional verification is required before such information can be applied and used effectively. © 2012 European Federation for Medical Informatics and IOS Press. All rights reserved.||0||0|
|A web 2.0 approach for organizing search results using Wikipedia||Darvish Morshedi Hosseini M.
|Lecture Notes in Computer Science||English||2011||Most current search engines return a ranked list of results in response to the user's query. This simple approach may require the user to go through a long list of results to find the documents related to his information need. A common alternative is to cluster the search results and allow the user to browse the clusters, but this also imposes two challenges: 'how to define the clusters' and 'how to label the clusters in an informative way'. In this study, we propose an approach which uses Wikipedia as the source of information to organize the search results and addresses these two challenges. In response to a query, our method extracts a hierarchy of categories from Wikipedia pages and trains classifiers using web pages related to these categories. The search results are organized in the extracted hierarchy using the learned classifiers. Experiment results confirm the effectiveness of the proposed approach.||0||0|
|Beyond the bag-of-words paradigm to enhance information retrieval applications||Paolo Ferragina||Proceedings - 4th International Conference on SImilarity Search and APplications, SISAP 2011||English||2011||The typical IR-approach to indexing, clustering, classification and retrieval, just to name a few, is the one based on the bag-of-words paradigm. It eventually transforms a text into an array of terms, possibly weighted (with tf-idf scores or derivatives), and then represents that array via points in highly-dimensional space. It is therefore syntactical and unstructured, in the sense that different terms lead to different dimensions. Co-occurrence detection and other processing steps have been thus proposed (see e.g. LSI, Spectral analysis ) to identify the existence of those relations, but yet everyone is aware of the limitations of this approach especially in the expanding context of short (and thus poorly composed) texts, such as the snippets of search-engine results, the tweets of a Twitter channel, the items of a news feed, the posts of a blog, or the advertisement messages, etc.. A good deal of recent work is attempting to go beyond this paradigm by enriching the input text with additional structured annotations. This general idea has been declined in the literature in two distinct ways. One consists of extending the classic term-based vector-space model with additional dimensions corresponding to features (concepts) extracted from an external knowledge base, such as DMOZ, Wikipedia, or even the whole Web (see e.g. [4, 5, 12]). The pro of this approach is to extend the bag-of-words scheme with more concepts, thus possibly allowing the identification of related texts which are syntactically far apart. The cons resides in the contamination of these vectors by un-related (but common) concepts retrieved via the syntactic queries. The second way consists of identifying in the input text short-and-meaningful sequences of terms (aka spots) which are then connected to unambiguous concepts drawn from a catalog. The catalog can be formed by either a small set of specifically recognized types, most often People and Locations (aka Named Entities, see e.g. [13, 14]), or it can consists of millions of concepts drawn from a large knowledge base, such as Wikipedia. This latter catalog is ever-expanding and currently offers the best trade-off between a catalog with a rigorous structure but with low coverage (like WordNet, CYC, TAP), and a large text collection with wide coverage but unstructured and noised content (like the whole Web). To understand how this annotation works, let us consider the following short news: "Diego Maradona won against Mexico". The goal of the annotation is to detect "Diego Maradona" and"Mexico" as spots, and then hyper-link them with theWikipedia pages which deal with the ex Argentina's coach and the football team of Mexico. The annotator uses as spots the anchor texts which occur in Wikipedia pages, and as possible concepts for each spot the (possibly many) pages pointed in Wikipedia by that spot/anchor||0||0|
|Characterization and prediction of Wikipedia edit wars||Róbert Sumi
|WebSci Conference||English||2011||We present a new, eficient method for automatically detecting conict cases and test it on five diferent language Wikipedias. We discuss how the number of edits, reverts, the length of discussions deviate in such pages from those following the general workow.||4||2|
|From names to entities using thematic context distance||Pilz A.
|International Conference on Information and Knowledge Management, Proceedings||English||2011||Name ambiguity arises from the polysemy of names and causes uncertainty about the true identity of entities referenced in unstructured text. This is a major problem in areas like information retrieval or knowledge management, for example when searching for a specific entity or updating an existing knowledge base. We approach this problem of named entity disambiguation (NED) using thematic information derived from Latent Dirichlet Allocation (LDA) to compare the entity mention's context with candidate entities in Wikipedia represented by their respective articles. We evaluate various distances over topic distributions in a supervised classification setting to find the best suited candidate entity, which is either covered in Wikipedia or unknown. We compare our approach to a state of the art method and show that it achieves significantly better results in predictive performance, regarding both entities covered in Wikipedia as well as uncovered entities. We show that our approach is in general language independent as we obtain equally good results for named entity disambiguation using the English, the German and the French Wikipedia.||0||0|
|Overview of the INEX 2010 XML mining track: Clustering and classification of XML documents||De Vries C.M.
|Lecture Notes in Computer Science||English||2011||This report explains the objectives, datasets and evaluation criteria of both the clustering and classification tasks set in the INEX 2010 XML Mining track. The report also describes the approaches and results obtained by participants.||0||0|
|Towards effective short text deep classification||Xiaohua Sun
|SIGIR'11 - Proceedings of the 34th International ACM SIGIR Conference on Research and Development in Information Retrieval||English||2011||Recently, more and more short texts (e.g., ads, tweets) appear on the Web. Classifying short texts into a large taxonomy like ODP or Wikipedia category system has become an important mining task to improve the performance of many applications such as contextual advertising and topic detection for micro-blogging. In this paper, we propose a novel multi-stage classification approach to solve the problem. First, explicit semantic analysis is used to add more features for both short texts and categories. Second, we leverage information retrieval technologies to fetch the most relevant categories for an input short text from thousands of candidates. Finally, a SVM classifier is applied on only a few selected categories to return the final answer. Our experimental results show that the proposed method achieved significant improvements on classification accuracy compared with several existing state of art approaches.||0||0|
|Centroid-based classification enhanced with Wikipedia||Abdullah Bawakid
|Proceedings - 9th International Conference on Machine Learning and Applications, ICMLA 2010||English||2010||Most of the traditional text classification methods employ Bag of Words (BOW) approaches relying on the words frequencies existing within the training corpus and the testing documents. Recently, studies have examined using external knowledge to enrich the text representation of documents. Some have focused on using WordNet which suffers from different limitations including the available number of words, synsets and coverage. Other studies used different aspects of Wikipedia instead. Depending on the features being selected and evaluated and the external knowledge being used, a balance between recall, precision, noise reduction and information loss has to be applied. In this paper, we propose a new Centroid-based classification approach relying on Wikipedia to enrich the representation of documents through the use of Wikpedia's concepts, categories structure, links, and articles text. We extract candidate concepts for each class with the help of Wikipedia and merge them with important features derived directly from the text documents. Different variations of the system were evaluated and the results show improvements in the performance of the system.||0||0|
|Elusive vandalism detection in Wikipedia: A text stability-based approach||Wu Q.
|International Conference on Information and Knowledge Management, Proceedings||English||2010||The open collaborative nature of wikis encourages participation of all users, but at the same time exposes their content to vandalism. The current vandalism-detection techniques, while effective against relatively obvious vandalism edits, prove to be inadequate in detecting increasingly prevalent sophisticated (or elusive) vandal edits. We identify a number of vandal edits that can take hours, even days, to correct and propose a text stability-based approach for detecting them. Our approach is focused on the likelihood of a certain part of an article being modified by a regular edit. In addition to text-stability, our machine learning-based technique also takes into account edit patterns. We evaluate the performance of our approach on a corpus comprising of 15000 manually labeled edits from the Wikipedia Vandalism PAN corpus. The experimental results show that text-stability is able to improve the performance of the selected machine-learning algorithms significantly.||0||0|
|Elusive vandalism detection in wikipedia: a text stability-based approach||Qinyi Wu
|Fine grained classification of named entities in Wikipedia||Tkachenko M.
|HP Laboratories Technical Report||English||2010||This report describes the study on classifying Wikipedia articles into an extended set of named entity classes. We employed semi-automatic method to extend Wikipedia class annotation and created a training set for 15 named entity classes. We implemented two classifiers. A binary named-entity classifier decides between articles about named entities and other articles. A support vector machine (SVM) classifier trained on a variety ofWikipedia features determines the class of a named entity. Combination of the two classifiers helped us to boost classification quality and obtain classification quality that is better than state of the art. © Copyright 2010 Hewlett-Packard Development Company, L.P.||0||0|
|Overview of the INEX 2009 XML mining track: Clustering and classification of XML documents||Nayak R.
De Vries C.M.
|Lecture Notes in Computer Science||English||2010||This report explains the objectives, datasets and evaluation criteria of both the clustering and classification tasks set in the INEX 2009 XML Mining track. The report also describes the approaches and results obtained by the different participants.||0||0|
|Symbolic representation of text documents||Guru D.S.
|COMPUTE 2010 - The 3rd Annual ACM Bangalore Conference||English||2010||This paper presents a novel method of representing a text document by the use of interval valued symbolic features. A method of classification of text documents based on the proposed representation is also presented. The newly proposed model significantly reduces the dimension of feature vectors and also the time taken to classify a given document. Further, extensive experimentations are conducted on vehicles-wikipedia datasets to evaluate the performance of the proposed model. The experimental results reveal that the obtained results are on par with the existing results for vehicles-wikipedia dataset. However, the advantage of the proposed model is that it takes relatively a less time for classification as it is based on a simple matching strategy.||0||0|
|Tweets mining using Wikipedia and impurity cluster measurement||Qingcai Chen
|ISI 2010 - 2010 IEEE International Conference on Intelligence and Security Informatics: Public Safety and Security||English||2010||Twitter is one of the fastest growing online social networking services. Tweets can be categorized into trends, and are related with tags and follower/following social relationships. The categorization is neither accurate nor effective due to the short length of tweet messages and noisy data corpus. In this paper, we attempt to overcome these challenges with an extended feature vector along with a semi-supervised clustering technique. In order to achieve this goal, the training set is expanded with Wikipedia topic search result, and the feature set is extended. When building the clustering model and doing the classification, impurity measurement is introduced into our classifier platform. Our experiment results show that the proposed techniques outperform other classifiers with reasonable precision and recall.||0||0|
|Identifying document topics using the Wikipedia category network||Peter Schönhofen||Web Intelli. and Agent Sys.||English||2009||In the last few years the size and coverage of Wikipedia, a community edited, freely available on-line encyclopedia has reached the point where it can be effectively used to identify topics discussed in a document, similarly to an ontology or taxonomy. In this paper we will show that even a fairly simple algorithm that exploits only the titles and categories of Wikipedia articles can characterize documents by Wikipedia categories surprisingly well. We test the reliability of our method by predicting categories of Wikipedia articles themselves based on their bodies, and also by performing classification and clustering on 20 Newsgroups and RCV1, representing documents by their Wikipedia categories instead of (or in addition to) their texts.||0||1|
|Overview of videoCLEF 2008: Automatic generation of topic-based feeds for dual language audio-visual content||Larson M.
|Lecture Notes in Computer Science||English||2009||The VideoCLEF track, introduced in 2008, aims to develop and evaluate tasks related to analysis of and access to multilingual multimedia content. In its first year, VideoCLEF piloted the Vid2RSS task, whose main subtask was the classification of dual language video (Dutch-language television content featuring English-speaking experts and studio guests). The task offered two additional discretionary subtasks: feed translation and automatic keyframe extraction. Task participants were supplied with Dutch archival metadata, Dutch speech transcripts, English speech transcripts and ten thematic category labels, which they were required to assign to the test set videos. The videos were grouped by class label into topic-based RSS-feeds, displaying title, description and keyframe for each video. Five groups participated in the 2008 VideoCLEF track. Participants were required to collect their own training data; both Wikipedia and general web content were used. Groups deployed various classifiers (SVM, Naive Bayes and k-NN) or treated the problem as an information retrieval task. Both the Dutch speech transcripts and the archival metadata performed well as sources of indexing features, but no group succeeded in exploiting combinations of feature sources to significantly enhance performance. A small scale fluency/adequacy evaluation of the translation task output revealed the translation to be of sufficient quality to make it valuable to a non-Dutch speaking English speaker. For keyframe extraction, the strategy chosen was to select the keyframe from the shot with the most representative speech transcript content. The automatically selected shots were shown, with a small user study, to be competitive with manually selected shots. Future years of VideoCLEF will aim to expand the corpus and the class label list, as well as to extend the track to additional tasks.||0||0|
|Using community-generated contents as a substitute corpus for metadata generation||Meyer M.
|International Journal of Advanced Media and Communication||English||2008||Metadata is crucial for reuse of Learning Resources. However, in the area of e-Learning, suitable training corpora for automatic classification methods are hardly available. This paper proposes the use of community-generated substitute corpora for classification methods. As an example for such a substitute corpus, the free online Encyclopaedia Wikipedia is used as a training corpus for domain-independent classification and keyword extraction of Learning Resources.||0||0|
|Using community-generated contents as a substitute corpus for metadata generation.||Christoph Rensing and Ralf Steinmetz Marek Meyer||International Journal of Advanced Media and Communications, , No. 1, 2008||2008||Metadata is crucial for reuse of Learning Resources. However, in the area of e-Learning, suitable training corpora for automatic classification methods are hardly available. This paper proposes the use of community-generated substitute corpora for classification methods. As an example for such a substitute corpus, the free online Encyclopaedia Wikipedia is used as a training corpus for domain-independent classification and keyword extraction of Learning Resources.||0||0|
|Categorizing Learning Objects Based On Wikipedia as Substitute Corpus||Marek Meyer
|First International Workshop on Learning Object Discovery & Exchange (LODE'07), September 18, 2007, Crete, Greece||2007||As metadata is often not sufficiently provided by authors of Learning Resources, automatic metadata generation methods are used to create metadata afterwards. One kind of metadata is categorization, particularly the partition of Learning Resources into distinct subject cat- egories. A disadvantage of state-of-the-art categorization methods is that they require corpora of sample Learning Resources. Unfortunately, large corpora of well-labeled Learning Resources are rare. This paper presents a new approach for the task of subject categorization of Learning Re- sources. Instead of using typical Learning Resources, the free encyclope- dia Wikipedia is applied as training corpus. The approach presented in this paper is to apply the k-Nearest-Neighbors method for comparing a Learning Resource to Wikipedia articles. Different parameters have been evaluated regarding their impact on the categorization performance.||0||1|