

Please use this identifier to cite or link to this item: http://hdl.handle.net/10651/32447

Title: Herramienta web para el análisis de portales basada en Apache Nutch [Web tool for portal analysis based on Apache Nutch]
Author(s): Rozas García, Sergio
Advisor: Fernández Lanvin, Daniel
Keywords: Web crawler
Issue date: 15-Jul-2015
Series/Report no.: Máster Universitario en Ingeniería Web
Format extent: 330
Abstract: This Master's thesis develops a tool that eases the analysis and comparison of websites by examining the web resources they hold. It was carried out within the Master's Degree in Web Engineering at the University of Oviedo and is intended for the staff of the university's IT Department, who will use it for research and testing tasks on different websites. The tool is built with technologies around the Java Enterprise Edition platform and Pivotal's Spring Framework. The whole system relies on Nutch, the web crawling software from the Apache Software Foundation. This web crawler was designed with high scalability, robustness, and extensibility in mind. It also provides high-capacity batch processing because its execution cycle runs on Apache Hadoop (both are written in Java). One of the main reasons for using Apache Nutch is to gain Apache Hadoop's distributed computing capabilities and, if necessary, to deploy the system on a cluster of machines. Such a deployment may be needed when a large number of crawling operations is required, that is, when there is a large list of URLs to analyse. For demonstration purposes, the project is deployed on a single-node Hadoop setup, which is powerful enough for the system's basic usage. Another important goal is to deliver a software architecture that can be extended with further analysis algorithms whenever the University of Oviedo staff who use the system need them. To support the web crawler, a persistence backend stores the downloaded web resources; this project uses Apache Cassandra, a distributed wide-column NoSQL database management system whose internal schema approximates Google's Bigtable storage model. The project also uses Apache Solr, a search engine capable of indexing the web resources downloaded by Apache Nutch so that their source code and other metadata can be retrieved.
URI: http://hdl.handle.net/10651/32447
Appears in Collections: Trabajos Fin de Máster

Files in This Item:

File: TFM_Sergio Rozas Garcia.pdf
Size: 12.03 MB
Format: Adobe PDF


This item is licensed under a Creative Commons License

Items in DSpace are protected by copyright, with all rights reserved, unless otherwise indicated.

