Máster Universitario en Soft Computing y Análisis Inteligente de Datos
In recent years the Web has undergone an intense transformation. It is no longer a mere static information repository but a dynamic system in which users have become the main content contributors. They actively share their opinions, thoughts and views about products, events and almost anything else in social networks, forums, blogs, etc. With the latest advances in mobile technologies, users can interact anytime from anywhere, and real-time information has become a reality. This mixture of social networks, discussion groups, forums and blogs is collectively called user-generated content. It has many practical applications and is potentially of major value from both the user and the business points of view. On the one hand, knowing other users' opinions is useful when we have to make decisions in our daily lives. On the other hand, it is an invaluable source of information about user preferences and tastes. Because opinion sources are so numerous and diverse, there is a need for systems able to automatically discover, analyze and summarize the sentiment expressed in user-generated content. Sentiment analysis grows out of this need. It focuses on the computational study of people's opinions, appraisals, and emotions toward entities, events and their properties. In the first three chapters of this document we introduce the problem of sentiment analysis, describing its main characteristics and difficulties; we briefly present the main theoretical background of the work; and we provide the reader with an exhaustive review of the related literature. Afterwards, we address a sentiment classification problem: learning to classify a series of movie reviews as positive or negative according to the sentiment expressed by the author.
In chapter 4 we present the dataset and its main properties, together with all the preprocessing steps we applied to the movie review texts in order to obtain useful representations. We also present the methodology used to run the experiments and to estimate the performance of the proposed approaches. In chapter 5 we describe our solutions to the problem, present the details of all the experiments performed, and evaluate and discuss the results obtained. As a baseline, we reproduced a large part of the experiments presented in [Pang et al., 2002]. We then propose a series of feature reduction approaches, with the objective of selecting a reduced and representative vocabulary for the movie review domain. Finally, we propose a novel method based on measuring word co-occurrence information in order to obtain a "meaning" representation of the text documents.