Deep Analysis of Textual Data in Multiple formats using Hadoop Techniques

K Sivaramakrishna; K.Srinivasarao; BV Satish

doi:https://doi.org/10.14445/22315373/IJMTT-V52P514

International Journal of Mathematics Trends and Technology

Research Article | Open Access

Volume 52 | Number 2 | Year 2017 | Article Id. IJMTT-V52P514 | DOI : https://doi.org/10.14445/22315373/IJMTT-V52P514

Citation :

K Sivaramakrishna, K.Srinivasarao, BV Satish, "Deep Analysis of Textual Data in Multiple formats using Hadoop Techniques," International Journal of Mathematics Trends and Technology (IJMTT), vol. 52, no. 2, pp. 103-113, 2017. Crossref, https://doi.org/10.14445/22315373/IJMTT-V52P514

Abstract

The analysis of text content in emails, blogs, messages, forums, and other forms of written communication constitutes what we call content analysis. Content analysis is applicable to most businesses: it can help analyze a great many messages; it can break down customers' comments and questions in forums; and it can support sentiment analysis by measuring positive or negative perceptions of a company, brand, or product. Content analysis is also known as content extraction and is a subfield of natural language processing (NLP), one of the founding branches of artificial intelligence, which dates from when an interest in understanding text first developed. Content analysis is now often considered the next step in Big Data analysis. It has various subsets, including content extraction, named entity recognition, and semantic web annotated domain representation. Several techniques are currently in use, and some, such as machine learning, have gained a great deal of attention for enabling semi-supervised improvement of systems, yet they also have limitations that make them not always the only or the best choice.
A wide range of automated systems generate large amounts of data in various forms, such as factual data, text content, and biometric data, which gives rise to the term Big Data. In this research article we examine the issues, challenges, and applications of these kinds of Big Data, taking the dimensions of big data into account. We discuss social media data analysis, content-based analysis, and text data analysis, along with their issues and expected application areas. This should motivate researchers to address the issues of storage, management, and retrieval of the data known as Big Data.

Keywords

Big Data Investigation, Content Extraction, Textual Investigation, Information Measurements.
