Web tool for portal analysis based on Apache Nutch
Author(s) and others:
Supervisor(s):
Keyword(s):
Web crawler
Hadoop
Cassandra
Solr
Publication date:
Series:
Máster Universitario en Ingeniería Web
Physical description:
Abstract:
This End of Master's Degree Project develops a tool that eases the analysis and comparison of web sites by examining the web resources those sites hold. The project was carried out within the Master's Degree in Web Engineering at the University of Oviedo, and the tool is intended for the staff of the University's IT Department, who will use it for research and testing tasks on different web sites. The tool was built with technologies around the Java Enterprise Edition platform and Pivotal's Spring framework.
The whole system relies on Apache Nutch, the web crawling software from the Apache Software Foundation. This web crawler was designed with high scalability, robustness and extensibility in mind. It also provides high-capacity batch processing because its execution cycle runs on Apache Hadoop (both are written in Java). One of the main reasons for using Apache Nutch is to gain Apache Hadoop's distributed computing capabilities and, if necessary, to deploy the system on a cluster of machines. Such a deployment may be needed when a large number of crawling operations is required, that is, when a long list of URLs has to be analysed. For demonstration purposes, this project is deployed on a single-node Hadoop setup, which is powerful enough for the system's basic usage. Another important goal is to deliver a software architecture that can be extended with further analysis algorithms whenever the University of Oviedo staff who use the system need them.
To support the web crawler, a persistence backend stores the downloaded web resources. This project uses Apache Cassandra, a wide-column, distributed NoSQL database management system whose data model approximates Google's BigTable storage model. The project also uses Apache Solr, a search engine that indexes the web resources downloaded by Apache Nutch so that their source code and other metadata can be retrieved.
Collections
- Master's Theses (Trabajos Fin de Máster)