Herramienta web para el análisis de portales basada en Apache Nutch

Rozas García, Sergio

dc.contributor.advisor	Fernández Lanvin, Daniel
dc.contributor.author	Rozas García, Sergio
dc.date.accessioned	2015-07-31T06:43:25Z
dc.date.available	2015-07-31T06:43:25Z
dc.date.issued	2015-07-15
dc.identifier.uri	http://hdl.handle.net/10651/32447
dc.description.abstract	This Master's thesis develops a tool that eases the analysis and comparison of web sites by examining the web resources they hold. The project was carried out within the Master's Degree in Web Engineering at the University of Oviedo and is intended for the staff of the University's IT Department, who will use it for research and testing tasks on different web sites. The tool was built with technologies around the Java Enterprise Edition platform and Pivotal's Spring framework. The whole system relies on Nutch, the web crawling software from the Apache Software Foundation. This web crawler was developed with high scalability, robustness and extensibility in mind. It also provides high-capacity batch processing because its execution cycle runs on Apache Hadoop (both are Java software). One of the main goals of using Apache Nutch is to gain Apache Hadoop's distributed computing capabilities and, if necessary, to deploy on a cluster of machines. That scenario may arise when a large number of crawling operations is required (because a large list of URLs has to be analysed). For demonstration purposes, the project will be deployed on a single-node Hadoop setup that is powerful enough for the system's basic usage. Another important goal is to deliver a software architecture that can be extended with further analysis algorithms whenever the University of Oviedo staff who use the system need them. To support the web crawler, a persistence backend will store the web resources. This project uses Cassandra, from the Apache Software Foundation. Apache Cassandra is a wide-column, distributed NoSQL database management system whose internal schema approximates Google's BigTable storage model. In addition, the project uses Solr, also from the Apache Software Foundation. Apache Solr is search engine software that indexes the web resources downloaded by Apache Nutch so that their source code and other metadata can be retrieved.	spa
dc.format.extent	330	spa
dc.language.iso	spa	spa
dc.relation.ispartofseries	Máster Universitario en Ingeniería Web
dc.rights	CC Attribution - NonCommercial - NoDerivatives 4.0
dc.rights.uri	http://creativecommons.org/licenses/by-nc-nd/4.0/
dc.subject	Web crawler	spa
dc.subject	Hadoop	spa
dc.subject	Cassandra	spa
dc.subject	Solr	spa
dc.title	Herramienta web para el análisis de portales basada en Apache Nutch	spa
dc.type	master thesis	spa
dc.rights.accessRights	open access

Files in this item

Name: TFM_Sergio Rozas Garcia.pdf
Size: 11.75 MB
Format: PDF

This item appears in the following collection(s)

Trabajos Fin de Máster [5296]
TFM

This item is licensed under a Creative Commons license