Regular expression

From WikiPapers

Regular expression is included as a keyword or extra keyword in 0 datasets, 0 tools and 10 publications.

Datasets

There are no datasets for this keyword.

Tools

There are no tools for this keyword.


Publications

Each entry lists: Title, Author(s), Published in, Language, Date, Abstract, R, C.
Title-based approach to relation discovery from Wikipedia
Zarrad R., Doggaz N., Zagrouba E.
IC3K 2013; KEOD 2013 - 5th International Conference on Knowledge Engineering and Ontology Development, Proceedings
English, 2013
With the advent of the Web and the explosion of available textual data, the field of domain ontology engineering has gained increasing importance. Over the last decade, several successful tools for automatically harvesting knowledge from web data have been developed, but the extraction of taxonomic and non-taxonomic ontological relationships is still far from fully solved. This paper describes a new approach which extracts ontological relations from Wikipedia. The non-taxonomic relation extraction process is performed by analyzing the titles which appear in each document of the studied corpus. The method is based on regular expressions applied to titles, from which we can extract not only the two arguments of a relationship but also the label which describes the relation. The resulting set of labels is used to retrieve new relations by analyzing the title hierarchy in each document. Other relations can be extracted from titles and subtitles containing only one term. An enrichment step is also applied by considering each term which appears as a relation argument of the extracted links, in order to discover new concepts and new relations. The experiments have been performed on French Wikipedia articles related to the medical field. The precision and recall values are encouraging and seem to validate our approach.
0 0
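
The title-based extraction described in this abstract can be pictured with a small sketch: a regular expression applied to section titles of the form "Treatment of X by Y" (or the French "Traitement de X par Y") yields the two relation arguments and the connecting label. The pattern below is an illustrative assumption, not the authors' actual grammar.

```python
import re

# Hypothetical illustration of the idea: section titles such as
# "Traitement de la grippe par antibiotiques" follow a "<label> <arg1> par <arg2>"
# shape from which a relation (label, arg1, arg2) can be read off directly.
# The pattern is an assumption for demonstration, not the paper's rule set.
TITLE_PATTERN = re.compile(
    r"^(?P<label>\w+)\s+(?:of|de la|du|des)\s+(?P<arg1>[\w\s-]+?)\s+(?:by|par)\s+(?P<arg2>[\w\s-]+)$",
    re.IGNORECASE,
)

def extract_relation(title: str):
    """Return (label, arg1, arg2) if the section title matches, else None."""
    m = TITLE_PATTERN.match(title.strip())
    if m:
        return m.group("label"), m.group("arg1").strip(), m.group("arg2").strip()
    return None

if __name__ == "__main__":
    for title in ["Treatment of influenza by antibiotics",
                  "Traitement de la grippe par antibiotiques",
                  "History"]:
        print(title, "->", extract_relation(title))
```
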
An efficient voice enabled web content retrieval system for limited vocabulary
Bharath Ram G.R., Jayakumaur R., Narayan R., Shahina A., Khan A.N.
Communications in Computer and Information Science
English, 2012
Retrieval of relevant information is becoming increasingly difficult owing to the presence of an ocean of information in the World Wide Web. Users in need of quick access to specific information are subjected to a series of web re-directions before finally arriving at the page that contains the required information. In this paper, an optimal voice-based web content retrieval system is proposed that makes use of an open source speech recognition engine to deal with voice inputs. The proposed system performs a quicker retrieval of relevant content from Wikipedia and instantly presents the textual information along with the related image to the user. This search is faster than the conventional web content retrieval technique. The current system is built with a limited vocabulary but can be extended to support a larger vocabulary. Additionally, the system is also scalable to retrieve content from a few other sources of information apart from Wikipedia.
0 0
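
As a rough illustration of only the Wikipedia lookup step (the paper's speech-recognition front end is omitted), the sketch below queries the standard MediaWiki opensearch endpoint for a recognized keyword. The endpoint and response layout are standard MediaWiki; the wrapping code is an assumption, not the authors' system.

```python
import json
import urllib.parse
import urllib.request

# Illustrative sketch of the Wikipedia lookup step only; the open-source speech
# recognizer described in the paper is assumed to supply `recognized_text`.
API = "https://en.wikipedia.org/w/api.php"

def lookup(recognized_text: str, limit: int = 3):
    """Return (title, description, url) tuples for a recognized spoken query."""
    params = urllib.parse.urlencode({
        "action": "opensearch",
        "search": recognized_text,
        "limit": limit,
        "format": "json",
    })
    req = urllib.request.Request(
        f"{API}?{params}",
        headers={"User-Agent": "voice-retrieval-demo/0.1"},  # polite identification
    )
    with urllib.request.urlopen(req) as resp:
        _query, titles, descriptions, urls = json.load(resp)
    return list(zip(titles, descriptions, urls))

if __name__ == "__main__":
    for title, desc, url in lookup("Taj Mahal"):
        print(title, url)
```
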
Development of IR tool for tree-structured MathML-based mathematical descriptions
Watanabe T., Miyazaki Y.
Joint Proceedings of the Work-in-Progress Poster and Invited Young Researcher Symposium at the 18th International Conference on Computers in Education, ICCE 2010
English, 2012
The quantity of Web content that includes math, such as Wikipedia articles and BBSs focusing on math, has been skyrocketing in recent years. Previous research has dealt with the development of IR systems targeting MathML-based math expressions; these systems are, however, still immature, lacking fuzzy search functions or suffering from low hit rates. One of the authors proposed, at ICCE 2008, an IR tool with a fuzzy search function built on the regular expressions available in MySQL. In this study, our objective is to additionally propose a "tree structure" algorithm that gives the fuzzy search function better precision.
0 0
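
The MySQL-based fuzzy search mentioned above amounts to turning a (possibly partial) MathML query into a regular expression and matching it against stored expressions. The sketch below mimics that idea in plain Python with a deliberately simplified pattern; the pattern is our assumption, not the tool's actual rule set.

```python
import re

# Toy corpus of MathML fragments as they might sit in a database column.
corpus = [
    "<mrow><mi>x</mi><mo>+</mo><mn>1</mn></mrow>",
    "<mrow><mi>x</mi><mo>+</mo><mi>y</mi></mrow>",
    "<mrow><mi>a</mi><mo>-</mo><mn>1</mn></mrow>",
]

def fuzzy_query(identifier: str, operator: str) -> re.Pattern:
    """Build a regex matching '<identifier> <operator> <anything>' while tolerating
    arbitrary intervening MathML markup (that tolerance is the 'fuzziness')."""
    return re.compile(
        rf"<mi>{re.escape(identifier)}</mi>.*?<mo>{re.escape(operator)}</mo>.*?<(?:mi|mn)>"
    )

if __name__ == "__main__":
    pattern = fuzzy_query("x", "+")          # analogous to a MySQL REGEXP clause
    for fragment in corpus:
        print(bool(pattern.search(fragment)), fragment)
```
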
Cohort shepherd: Discovering cohort traits from hospital visits
Goodwin T., Rink B., Roberts K., Harabagiu S.M.
NIST Special Publication
English, 2011
This paper describes the system created by the University of Texas at Dallas for content-based medical record retrieval, submitted to the TREC 2011 Medical Records Track. Our system builds a query by extracting keywords from a given topic using a Wikipedia-based approach, and we use regular expressions to extract age, gender, and negation requirements. Each query is then expanded by relying on UMLS, SNOMED, Wikipedia, and PubMed co-occurrence data for retrieval. Four runs were submitted: two based on Lucene with varying scoring methods, and two based on a hybrid approach with varying negation detection techniques. Our highest scoring submission achieved a MAP score of 40.8.
0 0
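
The regular-expression step mentioned in this abstract (pulling age, gender, and negation requirements out of a free-text topic) might look roughly like the sketch below. The patterns are illustrative guesses, not the rules actually used in the TREC system.

```python
import re

# Illustrative patterns only; the real system's rules are not reproduced here.
AGE_RE = re.compile(r"\b(?:aged?|age of)\s*(\d{1,3})\b|\b(\d{1,3})[- ]year[- ]old\b", re.I)
GENDER_RE = re.compile(r"\b(female|male|woman|women|man|men)\b", re.I)
NEGATION_RE = re.compile(r"\b(without|no history of|not|never)\b", re.I)

def extract_requirements(topic: str) -> dict:
    """Pull simple age/gender/negation cues out of a free-text cohort topic."""
    age_match = AGE_RE.search(topic)
    age = next((g for g in age_match.groups() if g), None) if age_match else None
    gender = GENDER_RE.search(topic)
    negations = NEGATION_RE.findall(topic)
    return {
        "age": int(age) if age else None,
        "gender": gender.group(1).lower() if gender else None,
        "negated_terms": [n.lower() for n in negations],
    }

if __name__ == "__main__":
    topic = "Female patients aged 65 with hip fracture and no history of diabetes"
    print(extract_requirements(topic))
```
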
Top-performance tokenization and small-ruleset regular expression matching: A quantitative performance analysis and optimization study on the Cell/B.E. processor
Scarpazza D.P.
International Journal of Parallel Programming
English, 2011
In the last decade, the volume of unstructured data that Internet and enterprise applications create and consume has been growing at impressive rates. The tools we use to process these data are search engines, business analytics suites, natural-language processors and XML processors. These tools rely on tokenization, a form of regular expression matching aimed at extracting words and keywords in a character stream. The further growth of unstructured data-processing paradigms depends critically on the availability of high-performance tokenizers. Despite the impressive amount of parallelism that the multi-core revolution has made available (in terms of multiple threads and wider SIMD units), most applications employ tokenizers that do not exploit this parallelism. I present a technique to design tokenizers that exploit multiple threads and wide SIMD units to process multiple independent streams of data at a high throughput. The technique benefits indefinitely from any future scaling in the number of threads or SIMD width. I show the approach's viability by presenting a family of tokenizer kernels optimized for the Cell/B.E. processor that deliver a performance seen, so far, only on dedicated hardware. These kernels deliver a peak throughput of 14.30 Gbps per chip, and a typical throughput of 9.76 Gbps on Wikipedia input. Also, they achieve almost-ideal resource utilization (99.2%). The approach is applicable to any SIMD-enabled processor and matches the trend toward wider SIMD units in contemporary architecture design.
0 0
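
Tokenization in the sense used above is ordinary regular-expression matching over a character stream. A minimal single-threaded sketch of such a tokenizer is shown below; none of the paper's SIMD or multi-thread optimizations are represented.

```python
import re

# A minimal, single-threaded regex tokenizer.  The paper's contribution is running
# this kind of matching over many independent streams with SIMD and threads,
# which this sketch does not attempt to show.
TOKEN_RE = re.compile(r"""
    (?P<word>[A-Za-z]+(?:'[A-Za-z]+)?)   # words, optionally with an apostrophe
  | (?P<number>\d+(?:\.\d+)?)            # integers and decimals
  | (?P<markup><[^>]+>)                  # markup tags, skipped by the caller
""", re.VERBOSE)

def tokenize(stream: str):
    """Yield (kind, text) pairs for every word or number in the stream."""
    for match in TOKEN_RE.finditer(stream):
        kind = match.lastgroup
        if kind != "markup":
            yield kind, match.group()

if __name__ == "__main__":
    sample = "<p>Wikipedia held 3.5 million articles in 2011</p>"
    print(list(tokenize(sample)))
```
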
An efficient web-based wrapper and annotator for tabular data
Amin M.S., Jamil H.
International Journal of Software Engineering and Knowledge Engineering
English, 2010
In the last few years, several works in the literature have addressed the problem of data extraction from web pages. The importance of this problem derives from the fact that, once extracted, data can be handled in a way similar to instances of a traditional database, which in turn facilitates web data integration and various other domain-specific applications. In this paper, we propose a novel table extraction technique that works on web pages generated dynamically from a back-end database. The proposed system can automatically discover table structure by relevant pattern mining from web pages in an efficient way, and can generate regular expressions for the extraction process. Moreover, the proposed system can assign intuitive column names to the columns of the extracted table by leveraging the Wikipedia knowledge base for the purpose of table annotation. To improve the accuracy of the assignment, we exploit the structural homogeneity of the column values and their co-location information to weed out less likely candidates. This approach requires no human intervention, and experimental results have shown its accuracy to be promising. Moreover, the wrapper generation algorithm works in linear time.
0 0
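
The kind of generated extraction pattern this abstract refers to can be pictured as a regular expression over the repeated HTML structure of a result page. The example below hand-writes such a pattern for a hypothetical page layout; it is not the authors' pattern-mining algorithm, and the Wikipedia-based column naming step is omitted.

```python
import re

# Hypothetical database-backed result page; a real wrapper would fetch it over HTTP.
html = """
<tr class="row"><td>Toyota Corolla</td><td>2009</td><td>12,500</td></tr>
<tr class="row"><td>Honda Civic</td><td>2010</td><td>13,900</td></tr>
"""

# A wrapper regex mirroring the page's repeated row template.  The paper derives
# such patterns automatically by mining the page structure; here it is hand-written.
ROW_RE = re.compile(
    r'<tr class="row"><td>(?P<model>[^<]+)</td><td>(?P<year>\d{4})</td><td>(?P<price>[^<]+)</td></tr>'
)

def extract_table(page: str):
    """Return the extracted rows as a list of column dictionaries."""
    return [m.groupdict() for m in ROW_RE.finditer(page)]

if __name__ == "__main__":
    for row in extract_table(html):
        print(row)
```
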
Deep hypertext with embedded revision control implemented in regular expressions
Grishchenko V.
WikiSym 2010
English, 2010
While text versioning was definitely a part of the original hypertext concept [21, 36, 44], it is rarely considered in this context today. Still, we know that revision control underlies the most exciting social co-authoring projects of today's Internet, namely Wikipedia and the Linux kernel. With the intention of adapting advanced revision control technologies and practices to the conditions of the Web, the paper reconsiders some obsolete assumptions and develops a new versioned text format that is fully processable with standard regular expressions (PCRE [6]). The resulting deep hypertext model allows instant access to past/concurrent versions, authorship, and changes, and enables deep links that reference changing parts of a changing text. Effectively, it allows distributed and real-time revision control on the Web, implementing the vision of co-evolution and mutation exchange among multiple competing versions of the same text. Copyright 2010 ACM.
0 0
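
To make the idea of a regex-processable versioned text concrete, the sketch below invents a toy format in which each span carries the revision that introduced it, and a single regular expression reconstructs the text visible at any revision. The format is purely illustrative; the paper defines its own encoding.

```python
import re

# Toy stand-in for a versioned text format: each span is stored as
# "@<version-number>{<text>}", i.e. tagged with the revision that introduced it.
# This encoding is invented for illustration; it is not the paper's format.
versioned = "@1{Hello }@2{brave }@1{world}@3{!}"

SPAN_RE = re.compile(r"@(\d+)\{([^}]*)\}")

def render(versioned_text: str, at_version: int) -> str:
    """Reconstruct the text as it looked at a given revision, using only a regex."""
    return "".join(
        text for version, text in SPAN_RE.findall(versioned_text)
        if int(version) <= at_version
    )

if __name__ == "__main__":
    for v in (1, 2, 3):
        print(v, repr(render(versioned, v)))
```
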
Wiki Vandalysis - Wikipedia Vandalism Analysis
Manoj Harpalani, Thanadit Phumprao, Megha Bassi, Michael Hart, Rob Johnson
CLEF
English, 2010
Wikipedia describes itself as the "free encyclopedia that anyone can edit". Along with the helpful volunteers who contribute by improving the articles, a great number of malicious users abuse the open nature of Wikipedia by vandalizing articles. Deterring and reverting vandalism has become one of the major challenges of Wikipedia as its size grows. Wikipedia editors fight vandalism both manually and with automated bots that use regular expressions and other simple rules to recognize malicious edits. Researchers have also proposed machine learning algorithms for vandalism detection, but these algorithms are still in their infancy and have much room for improvement. This paper presents an approach to fighting vandalism by extracting various features from edits for machine learning classification. Our classifier uses information about the editor, the sentiment of the edit, the "quality" of the edit (i.e. spelling errors), and targeted regular expressions to capture patterns common in blatant vandalism, such as insertion of obscene words or multiple exclamations. We have achieved an area under the ROC curve (AUC) of 0.91 on a training set of 15000 human-annotated edits and 0.887 on a random sample of 17472 edits drawn from 317443.
0 0
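
The "targeted regular expressions" mentioned in this abstract are simple patterns for blatant vandalism. A couple of illustrative patterns, used here as binary features for a classifier, are sketched below; they are examples, not the authors' actual feature set.

```python
import re

# Illustrative vandalism features; the patterns are examples, not the paper's
# feature set, and an obscene-word list is deliberately left out.
BLATANT_PATTERNS = {
    "repeated_exclamations": re.compile(r"!{3,}"),
    "shouting": re.compile(r"\b[A-Z]{6,}\b"),
    "repeated_characters": re.compile(r"(.)\1{5,}"),
}

def vandalism_features(inserted_text: str) -> dict:
    """Return binary features indicating which blatant-vandalism patterns fire."""
    return {name: bool(p.search(inserted_text)) for name, p in BLATANT_PATTERNS.items()}

if __name__ == "__main__":
    edit = "THIS ARTICLE IS GARBAGE!!!!! hahahahahaha"
    print(vandalism_features(edit))
```
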
High-performance regular expression scanning on the Cell/B.E. processor
Scarpazza D.P., Russell G.F.
Proceedings of the International Conference on Supercomputing
English, 2009
Matching regular expressions (regexps) is a very common workload. For example, tokenization, which consists of recognizing words or keywords in a character stream, appears in every search engine indexer. Tokenization also consumes 30% or more of most XML processors' execution time and represents the first stage of any programming language compiler. Despite the multi-core revolution, regexp scanner generators like flex haven't changed much in 20 years, and they do not exploit the power of recent multi-core architectures (e.g., multiple threads and wide SIMD units). This is unfortunate, especially given the pervasive importance of search engines and the fast growth of our digital universe. Indexing such data volumes demands precisely the processing power that multi-cores are designed to offer. We present an algorithm and a set of techniques for using multi-core features such as multiple threads and SIMD instructions to perform parallel regexp-based tokenization. As a proof of concept, we present a family of optimized kernels that implement our algorithm, providing the features of flex on the Cell/B.E. processor at top performance. Our kernels achieve almost-ideal resource utilization (99.2% of the clock cycles are non-NOP issues). They deliver a peak throughput of 14.30 Gbps per Cell chip, and 9.76 Gbps on Wikipedia input: a remarkable performance, comparable to dedicated hardware solutions. Also, our kernels show speedups of 57-81x over flex on the Cell. Our approach is valuable because it is easily portable to other SIMD-enabled processors, and there is a general trend toward more and wider SIMD instructions in architecture design. Copyright 2009 ACM.
0 0
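
The data-parallel idea in this abstract, scanning independent chunks of the input concurrently, can be mimicked far more modestly with process-level parallelism in Python. In the sketch below, chunk boundaries are snapped to whitespace so no token straddles a cut; this is a simplification of the paper's boundary handling, not its algorithm.

```python
import re
from concurrent.futures import ProcessPoolExecutor

WORD_RE = re.compile(r"[A-Za-z]+")

def count_tokens(chunk: str) -> int:
    """Scan one independent chunk of the input stream."""
    return sum(1 for _ in WORD_RE.finditer(chunk))

def split_on_whitespace(text: str, parts: int):
    """Cut the stream into roughly equal chunks, moving each cut forward to a space
    so that no token straddles a boundary (a simplification of the paper's scheme)."""
    size = max(1, len(text) // parts)
    chunks, start = [], 0
    while start < len(text):
        end = min(len(text), start + size)
        while end < len(text) and not text[end].isspace():
            end += 1
        chunks.append(text[start:end])
        start = end
    return chunks

if __name__ == "__main__":
    text = "the quick brown fox jumps over the lazy dog " * 10000
    with ProcessPoolExecutor() as pool:
        total = sum(pool.map(count_tokens, split_on_whitespace(text, 4)))
    print(total)  # 90000 tokens
```
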
SASL: A semantic annotation system for literature
Yuan P., Gang Wang, Zhang Q., Jin H.
Lecture Notes in Computer Science
English, 2009
Due to ambiguity, search engines for scientific literature may not return the right search results. One efficient solution to this problem is to automatically annotate the literature and attach semantic information to it. Generally, semantic annotation requires identifying entities before attaching semantic information to them. However, due to abbreviations and other reasons, it is very difficult to identify entities correctly. The paper presents a Semantic Annotation System for Literature (SASL), which utilizes Wikipedia as a knowledge base to annotate literature. SASL mainly attaches semantic information to terminology, academic institutions, conferences, journals, etc. Many of these names are abbreviations, which introduces ambiguity. SASL uses regular expressions to extract the mapping between the full names of entities and their abbreviations. Since the full names of several entities may map to a single abbreviation, SASL introduces a Hidden Markov Model to implement name disambiguation. Finally, the paper presents experimental results, which confirm that SASL achieves good performance.
0 0
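
The abbreviation-mapping step described in this abstract can be sketched with a regex that captures "Full Name (ABBR)" mentions. The pattern is an illustrative assumption, and the HMM-based disambiguation step is not shown.

```python
import re
from collections import defaultdict

# Captures mentions such as "Association for Computing Machinery (ACM)".
# Illustrative pattern only; SASL's actual rules and HMM disambiguation are omitted.
ABBR_RE = re.compile(
    r"\b((?:[A-Z][\w&-]*(?:\s+(?:of|for|on|and|the|in))*\s+){1,7}[A-Z][\w&-]*)\s*\(([A-Z]{2,10})\)"
)

def abbreviation_map(text: str) -> dict:
    """Map each abbreviation to the set of full names it was seen with."""
    mapping = defaultdict(set)
    for full_name, abbr in ABBR_RE.findall(text):
        mapping[abbr].add(full_name.strip())
    return dict(mapping)

if __name__ == "__main__":
    text = ("We attended the International Conference on Machine Learning (ICML) "
            "and a workshop of the Association for Computing Machinery (ACM).")
    print(abbreviation_map(text))
```
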