Herramienta web para el análisis de portales basada en Apache Nutch [Web tool for the analysis of web portals based on Apache Nutch]

Rozas García, Sergio

Author(s) and others:

Rozas García, Sergio

Supervisor(s):

Fernández Lanvin, Daniel

Keyword(s):

Web crawler

Hadoop

Cassandra

Solr

Publication date:

2015-07-15

Series:

Máster Universitario en Ingeniería Web

Physical description:

330

Abstract:

This end-of-master's-degree project develops a tool that eases the analysis and comparison of web sites by examining the web resources they hold. It was developed within the Master's Degree in Web Engineering at the University of Oviedo, as a tool for the staff of the university's IT Department to carry out research and testing tasks on different web sites. The tool is built with technologies around the Java Enterprise Edition platform and Pivotal's Spring framework. The whole system relies on Nutch, the web crawler from the Apache Software Foundation, which was designed with high scalability, robustness, and extensibility in mind. Nutch also provides high-capacity batch processing because its execution cycle runs on Apache Hadoop (both run on Java). One of the main goals of using Apache Nutch is to gain Apache Hadoop's distributed computing capabilities and, if necessary, to deploy the system on a cluster of machines. Such a deployment may be needed when a large number of crawling operations is required, that is, when a large list of URLs has to be analysed. For demonstration purposes, this project is deployed on a single-node Hadoop setup, which is powerful enough for the system's basic usage. Another important goal is to deliver a software architecture that can be extended with further analysis algorithms whenever the University of Oviedo staff who use the system require them. To support the web crawler, a persistence backend stores the downloaded web resources; this project uses Cassandra, from the Apache Software Foundation. Apache Cassandra is a wide-column, distributed NoSQL database management system whose internal data model approximates Google's Bigtable storage model. In addition, this project uses Solr, also from the Apache Software Foundation. Apache Solr is a search engine that indexes the web resources downloaded by Apache Nutch so that their source code and other metadata can be retrieved.

CC Attribution - NonCommercial - NoDerivatives 4.0

This item is subject to a Creative Commons license

Institutional Repository of the University of Oviedo
