Web mining

From WikiPapers

Web mining is included as keyword or extra keyword in 0 datasets, 0 tools and 19 publications.


There are no datasets for this keyword.


There are no tools for this keyword.


Title Author(s) Published in Language Date Abstract R C
Augmenting concept definition in gloss vector semantic relatedness measure using wikipedia articles Pesaranghader A.
Rezaei A.
Lecture Notes in Electrical Engineering English 2014 Semantic relatedness measures are widely used in text mining and information retrieval applications. Considering these automated measures, in this research paper we attempt to improve the Gloss Vector relatedness measure for more accurate estimation of relatedness between two given concepts. Generally, this measure, by constructing concept definitions (glosses) from a thesaurus, tries to find the angle between the concepts' gloss vectors for the calculation of relatedness. Nonetheless, this definition construction task is challenging, as thesauruses do not provide full coverage of expressive definitions for particularly specialized concepts. By employing Wikipedia articles and other external resources, we aim at augmenting these concepts' definitions. Applying both definition types to the biomedical domain, using MEDLINE as corpus, UMLS as the default thesaurus, and a reference standard of 68 concept pairs manually rated for relatedness, we show that exploiting resources available on the Web has a positive impact on the final measurement of semantic relatedness. 0 0
Leveraging open source tools for Web mining Pennete K.C. Lecture Notes in Electrical Engineering English 2014 Web mining is the most pursued research area and often the most challenging one. Using web mining, companies and individuals alike seek to uncover the knowledge hidden in the large, diverse volumes of web data. This paper presents how a researcher can leverage the knowledge available in open access sites such as Wikipedia as a source of information rather than subscribing to closed networks of knowledge, and use open source tools rather than prohibitively priced commercial mining tools to do web mining. The paper illustrates step-by-step usage of R and RapidMiner in web mining to enable a novice to understand the concepts as well as apply them in the real world. 0 0
Representation and verification of attribute knowledge Zhang C.
Niu Z.
Shi C.
Tan M.
Fu H.
Xu S.
Lecture Notes in Computer Science English 2013 With the increasing growth and popularization of the Internet, knowledge extraction from the web is an important issue in the fields of web mining, ontology engineering and intelligent information processing. The availability of real big corpora and the development of Internet and machine learning technologies make it feasible to acquire massive knowledge from the web. In addition, many web-based encyclopedias such as Wikipedia and Baidu Baike include much structured knowledge. However, knowledge quality problems, including incorrectness, inconsistency, and incompleteness, have become a serious obstacle to the wide practical application of such extracted and structured knowledge. In this paper, we build a taxonomy of relations between attributes of concepts, and propose an approach, driven by this taxonomy of attribute relations, to evaluating the knowledge about attribute values of attributes of entities. We also address an application of our approach to building and verifying attribute knowledge of entities in different domains. 0 0
Mining a multilingual association dictionary from Wikipedia for cross-language information retrieval Ye Z.
Huang J.X.
He B.
Hong Lin
Journal of the American Society for Information Science and Technology English 2012 Wikipedia is characterized by its dense link structure and a large number of articles in different languages, which make it a notable Web corpus for knowledge extraction and mining, in particular for mining the multilingual associations. In this paper, motivated by a psychological theory of word meaning, we propose a graph-based approach to constructing a cross-language association dictionary (CLAD) from Wikipedia, which can be used in a variety of cross-language accessing and processing applications. In order to evaluate the quality of the mined CLAD, and to demonstrate how the mined CLAD can be used in practice, we explore two different applications of the mined CLAD to cross-language information retrieval (CLIR). First, we use the mined CLAD to conduct cross-language query expansion; and, second, we use it to filter out translation candidates with low translation probabilities. Experimental results on a variety of standard CLIR test collections show that the CLIR retrieval performance can be substantially improved with the above two applications of CLAD, which indicates that the mined CLAD is of sound quality. 0 0
Pattern for python De Smedt T.
Daelemans W.
Journal of Machine Learning Research English 2012 Pattern is a package for Python 2.4+ with functionality for web mining (Google + Twitter + Wikipedia, web spider, HTML DOM parser), natural language processing (tagger/chunker, n-gram search, sentiment analysis, WordNet), machine learning (vector space model, k-means clustering, Naive Bayes + k-NN + SVM classifiers) and network analysis (graph centrality and visualization). It is well documented and bundled with 30+ examples and 350+ unit tests. The source code is licensed under BSD and available from http://www.clips.ua.ac.be/pages/pattern. 0 0
Using web mining for discovering spatial patterns and hot spots for spatial generalization Burdziej J.
Piotr Gawrysiak
Lecture Notes in Computer Science English 2012 In this paper we propose a novel approach to spatial data generalization, in which web user behavior information influences the generalization and mapping process. Our approach relies on combining usage information from web resources such as Wikipedia with search engines index statistics in order to determine an importance score for geographical objects that is used during map preparation. 0 0
Wiki as Ontology for knowledge discovery on WWW Yin L.
Wang J.
Huang Y.
Proceedings - 2012 9th International Conference on Fuzzy Systems and Knowledge Discovery, FSKD 2012 English 2012 Due to the increasing amount of data available online, the World Wide Web has become one of the most valuable resources for information retrieval and knowledge discovery. Web mining technologies (usually divided into content mining, structure mining and usage mining) are the right solutions for knowledge discovery on the WWW. In fact, the work depends on two essential issues: one is the knowledge itself, which means analyzing what information is required; the other is how the machine comes to know the requirement well, which means realizing a feasible method for computation and complex semantic measurement. This paper aims to discuss three aspects of knowledge that we define: content, structure and prior. That is, knowledge discovery on the WWW should consider content features, structure relations and priors from background knowledge simultaneously. A practice of using a wiki as an ontology is also proposed in this paper. The multiuser writing system offers an opportunity as a large corpus; we apply its linked data to the construction of a dynamic semantic network, which can be used in short text computation such as query expansion. With swarm intelligence in mind, the key issues and lessons are given in this paper; linked data such as wikis will provide opportunities and challenges for computability in the future. 0 0
Extracting multi-dimensional relations: A generative model of groups of entities in a corpus Au Yeung C.-M.
Iwata T.
International Conference on Information and Knowledge Management, Proceedings English 2011 Extracting relations among different entities from various data sources has been an important topic in data mining. While many methods focus only on a single type of relations, real world entities maintain relations that contain much richer information. We propose a hierarchical Bayesian model for extracting multi-dimensional relations among entities from a text corpus. Using data from Wikipedia, we show that our model can accurately predict the relevance of an entity given the topic of the document as well as the set of entities that are already mentioned in that document. 0 0
Ontology-based data instantiation using web service Rezazadeh R.
Shadgar B.
Osareh A.
Rezazadeh A.
Proceedings - UKSim 5th European Modelling Symposium on Computer Modelling and Simulation, EMS 2011 English 2011 The Semantic Web aims at creating a platform where information has its semantics and can be understood and processed by computers themselves with minimum human interference. Ontology theory and its related technology have been developed to help construct such a platform because ontology promises to encode certain levels of semantics for information and offers a set of common vocabulary for people or computers to communicate with. In this article, we introduce the open-source software called ontology instantiate. This software has been created for book ontology construction and instantiation using web services. It helps users to instantiate an ontology of book information from the Amazon web site. It also allows the user to merge another book ontology into the ontology it produces and integrates them into a single unified ontology. To integrate these ontologies, the software uses a wide range of similarity measures, including semantic similarity, string-based similarity and structural similarity. A tree is used for investigating structural similarity. Resources such as Wikipedia, WordNet, Google and Yahoo are used for investigating semantic similarity and string-based similarity. 0 0
Quality evaluation of wikipedia articles through edit history and editor groups Se Wang
Mizuho Iwaihara
APWeb English 2011 0 0
Computing semantic relatedness between named entities using Wikipedia Hongyan Liu
Yirong Chen
Proceedings - International Conference on Artificial Intelligence and Computational Intelligence, AICI 2010 English 2010 In this paper the authors suggest a novel approach that uses Wikipedia to measure the semantic relatedness between Chinese named entities, such as names of persons, books, software, etc. The relatedness is measured through articles in Wikipedia that are related to the named entities. The authors select a set of "definition words" which are hyperlinks from these articles, and then compute the relatedness between two named entities as the relatedness between two sets of definition words. The authors propose two ways to measure the relatedness between two definition words: by Wiki-articles related to the words or by categories of the words. The proposed approaches are compared with several other baseline models through experiments. The experimental results show that this method renders satisfactory results. 0 0
"All You Can Eat" Ontology-Building: Feeding Wikipedia to Cyc Samuel Sarjant
Catherine Legg
Michael Robinson
Olena Medelyan
WI-IAT English 2009 In order to achieve genuine web intelligence, building some kind of large general machine-readable conceptual scheme (i.e. ontology) seems inescapable. Yet the past 20 years have shown that manual ontology-building is not practicable. The recent explosion of free user-supplied knowledge on the Web has led to great strides in automatic ontology-building, but quality-control is still a major issue. Ideally one should automatically build onto an already intelligent base. We suggest that the long-running Cyc project is able to assist here. We describe methods used to add 35K new concepts mined from Wikipedia to collections in ResearchCyc entirely automatically. Evaluation with 22 human subjects shows high precision both for the new concepts’ categorization, and their assignment as individuals or collections. Most importantly we show how Cyc itself can be leveraged for ontological quality control by ‘feeding’ it assertions one by one, enabling it to reject those that contradict its other knowledge. 0 0
Are wikipedia resources useful for discovering answers to list questions within web snippets? Alejandro Figueroa Lecture Notes in Business Information Processing English 2009 This paper presents LiSnQA, a list question answering system that extracts answers to list queries from the short descriptions of web-sites returned by search engines, called web snippets. LiSnQA mines Wikipedia resources in order to obtain valuable information that assists in the extraction of these answers. The interesting facet of LiSnQA is, that in contrast to current systems, it does not account for lists in Wikipedia, but for its redirections, categories, sandboxes, and first definition sentences. Results show that these resources strengthen the answering process. 0 0
Automatic multilingual lexicon generation using wikipedia as a resource Shahid A.R.
Kazakov D.
ICAART 2009 - Proceedings of the 1st International Conference on Agents and Artificial Intelligence English 2009 This paper proposes a method for creating a multilingual dictionary by taking the titles of Wikipedia pages in English and then finding the titles of the corresponding articles in other languages. The creation of such multilingual dictionaries has become possible as a result of the exponential increase in the size of multilingual information on the web. Wikipedia is a prime example of such a multilingual source of information on any conceivable topic in the world, which is edited by its readers. Here, a web crawler has been used to traverse Wikipedia following the links on a given page. The crawler extracts the title along with the titles of the corresponding pages in the other targeted languages. The result is a set of words and phrases that are translations of each other. For efficiency, the URLs are organized using hash tables. A lexicon has been constructed which contains 7-tuples corresponding to 7 different languages, namely: English, German, French, Polish, Bulgarian, Greek and Chinese. 0 0
Concept vector extraction from Wikipedia category network Masumi Shirakawa
Kotaro Nakayama
Takahiro Hara
Shojiro Nishio
ICUIMC English 2009 0 0
Towards design principles for effective context-and perspective-based web mining Vaishnavi V.K.
Vandenberg A.
YanChun Zhang
Duraisamy S.
Proceedings of the 4th International Conference on Design Science Research in Information Systems and Technology, DESRIST '09 English 2009 A practical and scalable web mining solution is needed that can assist the user in processing existing web-based resources to discover specific, relevant information content. This is especially important for researcher communities where data deployed on the World Wide Web are characterized by autonomous, dynamically evolving, and conceptually diverse information sources. The paper describes a systematic design research study that is based on prototyping/evaluation and abstraction using existing and new techniques incorporated as plug and play components into a research workbench. The study investigates an approach, DISCOVERY, for using (1) context/perspective information and (2) social networks such as ODP or Wikipedia for designing practical and scalable human-web systems for finding web pages that are relevant and meet the needs and requirements of a user or a group of users. The paper also describes the current implementation of DISCOVERY and its initial use in finding web pages in a targeted web domain. The resulting system arguably meets the common needs and requirements of a group of people based on the information provided by the group in the form of a set of context web pages. The system is evaluated for a scenario in which assistance of the system is sought for a group of faculty members in finding NSF research grant opportunities that they should collaboratively respond to, utilizing the context provided by their recent publications. Copyright 2009 ACM. 0 0
Enriching Multilingual Language Resources by Discovering Missing Cross-Language Links in Wikipedia Jong-Hoon Oh
Daisuke Kawahara
Kiyotaka Uchimoto
Jun'ichi Kazama
Kentaro Torisawa
WI-IAT English 2008 0 1
Mining Wikipedia Resources for Discovering Answers to List Questions in Web Snippets Alejandro Figueroa SKG English 2008 0 0
Aisles through the category forest: Utilising the Wikipedia Category System for Corpus Building in Machine Learning Rudiger Gleim
Alexander Mehler
Matthias Dehmer
Olga Pustylnikov
Webist 2007 - 3rd International Conference on Web Information Systems and Technologies, Proceedings English 2007 The World Wide Web is a continuous challenge to machine learning. Established approaches have to be enhanced and new methods be developed in order to tackle the problem of finding and organising relevant information. It has often been motivated that semantic classifications of input documents help solving this task. But while approaches of supervised text categorisation perform quite well on genres found in written text, newly evolved genres on the web are much more demanding. In order to successfully develop approaches to web mining, respective corpora are needed. However, the composition of genre- or domain-specific web corpora is still an unsolved problem. It is time consuming to build large corpora of good quality because web pages typically lack reliable meta information. Wikipedia along with similar approaches of collaborative text production offers a way out of this dilemma. We examine how social tagging, as supported by the MediaWiki software, can be utilised as a source of corpus building. Further, we describe a representation format for social ontologies and present the Wikipedia Category Explorer, a tool which supports categorical views to browse through the Wikipedia and to construct domain specific corpora for machine learning. 0 0
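
Illustrative sketch. Several abstracts in the list above rest on comparing concept definitions as vectors; the entry "Augmenting concept definition in gloss vector semantic relatedness measure using wikipedia articles" describes finding the angle between two concepts' gloss vectors and augmenting the glosses with external text. The Python below is a minimal sketch of that general idea only, not the authors' implementation: it assumes plain bag-of-words term counts (the published Gloss Vector measure builds its vectors from corpus co-occurrence data, which is not attempted here), and the function names, glosses and extra texts are all hypothetical toy values.

```python
import math
from collections import Counter


def gloss_vector(gloss: str) -> Counter:
    """Turn a definition (gloss) into a term-count vector.

    Assumption: simple lowercase whitespace tokenisation.
    """
    return Counter(gloss.lower().split())


def cosine_relatedness(a: Counter, b: Counter) -> float:
    """Cosine of the angle between two term-count vectors (0.0 if either is empty)."""
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm_a = math.sqrt(sum(count * count for count in a.values()))
    norm_b = math.sqrt(sum(count * count for count in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def augmented_gloss(thesaurus_gloss: str, external_text: str) -> str:
    """Extend a short thesaurus gloss with text from an external resource."""
    return f"{thesaurus_gloss} {external_text}"


# Hypothetical toy data: short thesaurus glosses plus invented external text.
thesaurus = {
    "myocardial infarction": "necrosis of the myocardium",
    "heart attack": "sudden cardiac event",
}
external = {
    "myocardial infarction": "blood flow to the heart muscle stops",
    "heart attack": "damage to the heart muscle from blocked blood flow",
}

c1, c2 = "myocardial infarction", "heart attack"

# Relatedness from the thesaurus glosses alone.
base = cosine_relatedness(gloss_vector(thesaurus[c1]), gloss_vector(thesaurus[c2]))

# Relatedness after augmenting each gloss with external text.
augmented = cosine_relatedness(
    gloss_vector(augmented_gloss(thesaurus[c1], external[c1])),
    gloss_vector(augmented_gloss(thesaurus[c2], external[c2])),
)

print(f"thesaurus only: {base:.3f}")
print(f"augmented:      {augmented:.3f}")
```

In this toy case the two thesaurus glosses share no terms, so only the augmented glosses yield a non-zero score. A realistic setup would also need stop-word handling and term weighting, which the sketch leaves out.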