Evaluation of Text Mining, Its Use for Extraction of Effective and Efficient Data

By Therisano A. Motsilenyane
Reg. No: 14001299

Research Proposal

Department of Computer Science and Information Systems, Faculty of Science,
Botswana International University of Science & Technology
E-mail: [email protected]
Contact Number: (+267) 74338126
23 January 2018

ABSTRACT

Text mining, also referred to as text data mining, is the process of deriving high-quality information from natural language text. Once the information has been derived, it is made available to data mining algorithms. A great deal can be done with text mining, for example analysing clusters of words within a document. Text mining was first introduced in the late 1990s, when it emerged as “text data mining”. Basic lexical analysis counts the frequencies of words and terms in an attempt to classify a document by topic. Text mining carries this analytical process a step further. Data mining looks for hidden, complex patterns and relationships in datasets. The techniques involved include clustering, decision trees, classification, link analysis and many more. These techniques can be applied to data derived from textual sources, though with adjustments to accommodate the high dimensionality of text-derived information that results when every term is turned into an analytical dimension.

TABLE OF CONTENTS

ABSTRACT
TABLE OF CONTENTS
LIST OF ABBREVIATIONS
  NLP - Natural Language Processing
SECTION ONE: INTRODUCTION
  1.1 Introduction to the Research Problem
  1.2 Research Background
  1.3 Problem Statement
  1.4 Research Objectives
    1.4.1 General Objective
    1.4.2 Specific Objectives
  1.5 Research Questions
  1.6 Justification of the Study
  1.7 Proposal Structure
SECTION TWO: LITERATURE REVIEW
  2.1 Introduction
  2.X Conclusion
SECTION THREE: METHODOLOGY
  3.1 Introduction
  3.2 Ethical and Philosophical Considerations
  3.3 Research Design
  3.4 Research Methods for Specific Objective 1
  3.5 Research Methods for Specific Objective 2
  3.6 Research Methods for Specific Objective 3
  3.X Conclusion
References
APPENDICES
LIST OF ABBREVIATIONS

NLP - Natural Language Processing

SECTION ONE: INTRODUCTION

1.1 Introduction to the Research Problem

The volume of data is growing rapidly every day. Businesses, organisations and institutions of all types store their data electronically. A huge amount of text is exchanged over the internet in the form of repositories, digital libraries and other textual information such as email, blogs and social media networks. This makes it a challenge to determine appropriate patterns and trends and to extract valuable knowledge from this large volume of data. [1]

Text mining is the process of extracting interesting and significant patterns in order to explore knowledge from textual data sources. Text mining is a multi-disciplinary field based on information retrieval, data mining, machine learning, statistics, and computational linguistics.
Text mining techniques are continuously applied in industry, academia, web applications, the internet and other fields. They are used in areas such as search engines, email filtering, fraud detection, product suggestion analysis, social media, feature extraction, and predictive and trend analysis. [2]

The text mining process performs the following steps:

· Collection of unstructured data from different sources in their available formats, which may include PDF, plain text and web pages.
· Cleansing and pre-processing to detect and remove anomalies. Cleansing makes sure that the real essence of the available text is captured; it removes stop words, applies stemming and indexes the data.
· Processing and controlling operations are applied to check and further clean the data set by automatic processing.
· Pattern analysis is carried out by a Management Information System.
· Extraction of valuable and relevant information for effective and timely decision making and trend analysis.

An appropriate text mining technique reduces the time and effort needed to find relevant patterns for analysis and decision making. [3]

1.2 Research Background

Text mining describes the application of data mining techniques to the automated discovery of useful or interesting knowledge from unstructured text. Several techniques have been proposed for text mining, including conceptual structure, association rule mining, episode rule mining, decision trees, and rule induction methods.
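As a rough illustration of one of the techniques named above, association rule mining can be applied to text by treating each document as a set of terms and measuring how often one term accompanies another. The sketch below is a minimal, assumed example and not a method taken from the cited work: the function name `term_rules`, the thresholds and the whitespace tokenisation are all hypothetical choices made for illustration.

```python
from itertools import permutations


def term_rules(documents, min_support=0.5, min_confidence=0.7):
    """Return (antecedent, consequent, support, confidence) rules between single terms.

    Hypothetical helper for illustration: each document is reduced to a set of
    lower-cased whitespace tokens, which is a deliberate simplification.
    """
    term_sets = [set(doc.lower().split()) for doc in documents]
    n_docs = len(term_sets)
    if n_docs == 0:
        return []

    # Count in how many documents each term occurs.
    doc_freq = {}
    for terms in term_sets:
        for term in terms:
            doc_freq[term] = doc_freq.get(term, 0) + 1

    rules = []
    vocabulary = sorted(doc_freq)
    for a, b in permutations(vocabulary, 2):
        both = sum(1 for terms in term_sets if a in terms and b in terms)
        support = both / n_docs          # share of documents containing a and b
        confidence = both / doc_freq[a]  # share of a-documents that also contain b
        if support >= min_support and confidence >= min_confidence:
            rules.append((a, b, support, confidence))
    return rules


if __name__ == "__main__":
    sample = [
        "python developer linux",
        "python developer windows",
        "java developer linux",
        "python analyst",
    ]
    for a, b, support, confidence in term_rules(sample):
        print(f"{a} -> {b}  support={support:.2f} confidence={confidence:.2f}")
```

The pairwise loop is quadratic in the vocabulary size, so it only suits small examples; dedicated algorithms such as Apriori prune the candidate pairs before counting.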
In addition, information retrieval techniques have widely used the bag-of-words model for tasks such as document matching, ranking, and clustering. [4]

The task of information extraction is to find specific data in natural language text. The data to be extracted is given by a template that specifies a list of slots, and these slots are to be filled with substrings taken from the document. [1] In the job-posting domain, for example, a document is paired with its filled template. Such a template includes slots that are filled by strings taken directly from the document. Several slots, such as programming languages, platforms, applications, and areas, may have multiple fillers. Machine learning techniques have been developed to automatically construct information extractors for job postings. [3]

Text mining can be viewed as consisting of two phases: text refining, followed by knowledge distillation.
The text refining phase transforms free-form text documents into a chosen intermediate form. Knowledge distillation infers patterns or knowledge from the intermediate form. The intermediate form can be semi-structured, such as a conceptual graph representation, or structured, such as a relational data representation. [1]

1.3 Problem Statement

Many issues occur during the text mining process and affect the efficiency and effectiveness of decision making.
Depending on the techniques used, text mining on large amounts of data may not be effective or efficient. These techniques include Information Extraction, Information Retrieval, Natural Language Processing, Clustering and Text Summarization. [1]

1.4 Research Objectives

The objective of this paper is to analyse different text mining techniques that help to perform text analytics effectively and efficiently on large amounts of data. Moreover, the issues that arise during the text mining process are identified.

1.4.1 General Objective

· To analyse the different text mining techniques
· To analyse techniques for large amounts of data
· To identify the efficient techniques
· To observe the difficulties of text mining

1.4.2 Specific Objectives

· To come up with the effective and efficient techniques
· To select the techniques which are good for large data
· To see which technique takes a long time to complete its task

1.5 Research Questions

· How efficient is it to apply text mining techniques to analyse text?
· How effective are the text mining techniques?

1.6 Justification of the Study

The main reason for this research is to see whether text mining, as a branch of data mining, serves a beneficial and useful purpose.
The research examines in depth whether text mining benefits computer science, since analysis and patterns are important in the world of computing. The research will identify the efficient and effective techniques by elaborating on each one thoroughly.

1.7 Proposal Structure

The whole process of text mining consists of a number of subordinate tasks. It is easier to divide the tasks into smaller groups in order to obtain a good result from the analysis. First there is the stage of information retrieval, which extracts the information that is valuable for the analysis.
This is followed by natural language processing, which processes the retrieved text as natural human language. Next is the stage of named entity recognition, which recognises information according to certain common identifiers. Lastly, there are the more complicated sentiment analysis and quantitative analysis, which analyse the data from all sides, involving psychological and other aspects.
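In its simplest form, the named entity recognition stage described above can be approximated by matching text against patterns for a few common identifiers. The sketch below is an illustration under stated assumptions, not a method taken from this proposal: the entity labels, the regular expressions and the function name `find_entities` are hypothetical, and practical systems usually rely on trained models rather than hand-written patterns.

```python
import re

# Hypothetical patterns for a few "common identifiers"; deliberately simplistic.
MONTHS = (
    "January|February|March|April|May|June|July|"
    "August|September|October|November|December"
)
PATTERNS = {
    "DATE": re.compile(rf"\b\d{{1,2}} (?:{MONTHS}) \d{{4}}\b"),
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b"),
    "URL": re.compile(r"https?://[^\s,;]+"),
}


def find_entities(text):
    """Return a list of (label, matched_text) pairs in order of appearance."""
    found = []
    for label, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            found.append((match.start(), label, match.group()))
    found.sort()  # order by position in the text
    return [(label, value) for _, label, value in found]
```

Keeping the patterns in a dictionary means a new identifier type can be added by registering one more labelled expression, without touching the matching loop.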
SECTION TWO: LITERATURE REVIEW

2.1 Introduction

Text mining, also called text data mining, is defined as the process of identifying or extracting information from large amounts of data [5], [1]. It is characterised as a knowledge-intensive process in which users interact with a document using analysis tools.
According to StatSoft, the purpose of text mining is to process unstructured information and extract meaningful numerical data from the text, which makes the information contained in the text more accessible to various data mining techniques. [6] Using text mining, one can derive summaries from the documents in a set and retrieve key concepts for the whole set of documents. [7] Text mining is a combination of techniques from areas such as natural language processing, information retrieval, information extraction and data mining. [8] Moreover, each of those techniques was developed long before the term text mining was first formulated.

The following steps can be included in text mining. [5]

· Convert the unstructured text into structured data
· Identify the patterns from the structured data
· Analyse the patterns using text mining techniques
· Extract the useful information from the text

The techniques in text mining come from different areas such as information extraction, information retrieval, natural language processing (NLP), categorization and clustering. [9] These stages of the text mining process can be combined into a single workflow.
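The steps listed above can be sketched as one small workflow that turns raw text into term counts (structured data) and then reads off the most frequent terms as a crude pattern. This is a minimal sketch under stated assumptions: the stop-word list, the tokenisation rule and the function names are hypothetical and far simpler than what a real study would use.

```python
import re
from collections import Counter

# Hypothetical, very small stop-word list; real lists are much longer.
STOP_WORDS = {"a", "an", "and", "from", "in", "is", "of", "the", "to"}


def tokenize(text):
    """Lower-case the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def to_term_counts(text):
    """Step 1: convert unstructured text into structured data (term -> count)."""
    return Counter(t for t in tokenize(text) if t not in STOP_WORDS)


def top_terms(documents, k=3):
    """Steps 2-4: aggregate counts over all documents and extract the k most
    frequent terms; ties are broken alphabetically so the result is stable."""
    total = Counter()
    for doc in documents:
        total.update(to_term_counts(doc))
    ranked = sorted(total.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:k]
```

The per-document counts returned by `to_term_counts` are a bag-of-words representation, so the same structure could be passed on to clustering or classification instead of simple ranking.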
In general, text mining turns text into numbers, which can later be incorporated in other data analyses to reveal interesting statistical results. [10]

2.X Conclusion

The availability of large amounts of text-based data creates a need to process it in order to extract valuable information. Text mining techniques are used to analyse the interesting and relevant information effectively and efficiently from large amounts of unstructured data. Specific patterns and sequences are applied in order to extract useful information by eliminating irrelevant details for predictive analysis.

SECTION THREE: METHODOLOGY

3.1 Introduction

Many techniques have been developed to address the problem of text mining, which is considered to be nothing more than information retrieval according to the requirements of a user.
Information retrieval uses four methods:

i. Pattern Taxonomy Method
ii. Term Based Method
iii. Concept Based Method
iv. Phrase Based Method

3.2 Ethical and Philosophical Considerations

Text mining obtains useful data from large amounts of data, which helps the progress of industries, government institutions and research. Text mining is therefore a valuable and very helpful technique. The research will not involve interaction with human subjects.
3.3 Research Design

The first step is to set aside time to go to the library and gather the journals. The relationship between two or more variables will then be explored through a correlational analysis.

3.4 Research Methods for Specific Objective 1

To come up with the effective and efficient techniques: Content Analysis and Experiment, since the efficiency needs to be observed.

3.5 Research Methods for Specific Objective 2

To select the techniques which are good for large data: Case Studies and Experiment.

3.6 Research Methods for Specific Objective 3

To see which technique takes a long time to complete its task: Observation, since the tasks will run simultaneously and we will see which one completes first.

3.X Conclusion

In conclusion, with the rapid growth of digital data made available in recent years, knowledge discovery and data mining have attracted great attention, and there is a very important need to process data into useful information and knowledge. [7] As a result, there is growing research interest in the topic of text mining. In general, text mining consists of analysing large amounts of text documents by extracting key phrases, concepts and other useful data, and preparing the processed text for further analysis with data mining techniques. We have defined the text mining processing flow, applications of text mining and issues in text mining. The patterns generated facilitate decision making in industries. [5] An overview of the concepts, applications, tools and issues of text mining is presented to enable researchers to carry it to the next level.
Both qualitative and quantitative research methods will be used in this research.

References

[1] https://thesai.org/Downloads/Volume7No11/Paper_53-Text_Mining_Techniques_Applications_and_Issues.pdf
[2] https://thesai.org/Downloads/Volume7No11/Paper_53-.
[3] http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.403.2426&rep=rep1&type=pdf
[4] http://www.cs.utexas.edu/~ml/papers/discotex-melm-03.pdf
[5] http://www.b-eye-network.com/view/6311
[6] http://journals.plos. 0156031
[7] https://is.vsh.cz/th/12446/vsh_b/Thesis_Varfolomeeva.pdf
[8] https://paginas.fe.up.pt/~prodei/dsie15/web/papers/dsie15_submission_10.pdf
[9] Text_Mining_Techniques_Applications_and_Issues.pdf
[10] http://www.cs.
[11] Global Partnership for Sustainable Development Data, 2016. [Online]. Available: http://www.data4sdgs.org/.
[12] SEED, “Sustainable Development Goals,” 2017. [Online]. Available: https://www.seed.uno/about/work/sustainable-development-goals.html.
[13] Secretary-General Sustainable Development Agenda, “Sustainable Development Goals kick off with start of new year,” 30 December 2015. [Online]. Available: http://www.un.org/sustainabledevelopment/blog/2015/12/sustainable-development-goals-kick-off-with-start-of-new-year/.
[14] United Nations General Assembly, “Transforming our world: the 2030 Agenda for Sustainable Development,” United Nations General Assembly, 2015.
[15] United Nations General Assembly, “Report of the world commission on environment and development: Our common future,” United Nations General Assembly, Oslo, 1987.