SENTIMENT ANALYSIS AND CLASSIFICATION ON SOCIAL DATA
LI, TIANQI
2016/2017
Abstract
With the growth of the Internet, social platforms such as Twitter and microblogs have emerged. Because people express their ideas on these platforms, their emotional tendencies can be analyzed through tweets. Given the huge volume of data involved, however, automated classification is essential. This thesis addresses the classification of sentiment polarity. First, it surveys the current state of sentiment analysis and the technologies used for it. Second, it proposes a methodology covering data processing, feature extraction, and algorithms. Finally, it evaluates the effect of various features and algorithms in order to identify the best method for sentiment analysis. The key points of the thesis are as follows:
(1) A variety of processing methods. Data are collected from tweets and saved in a predefined format. The data are then filtered, segmented into words, tagged with parts of speech, and so on.
(2) A variety of features. The platform considers 16 sentiment features, including punctuation, parts of speech, and other attributes. Building on existing theories and techniques, N-gram calculation and a dependency-relation method are also used, so that the combination of features yields a comprehensive analysis. To shorten classification time, features are ranked by importance and those with higher contribution are retained.
(3) A hybrid approach that combines machine learning with a lexicon to improve the performance of sentiment classification. Both are used in the algorithm module. The lexicon-based approach relies on an emotional dictionary, which contains words with known sentiment scores, and calculates the score of a sentence from the scores of its words.
(4) The combination test. Given the variety of methods available for data preprocessing, feature extraction, and algorithm selection, it is difficult for users to make the right choice. Since trying the methods one by one is time-consuming, their results are computed in a multi-threaded way. The program then selects the best way to handle the data and returns the results to the user.
Keywords: Sentiment Analysis, Machine Learning, Semi-supervised Learning Algorithm, Text Classification, Emotional Dictionary
Users may download and share the full-text documents available in UNITESI UNIPV in compliance with the Creative Commons CC BY NC ND license.
For more information, and to check whether the file is available, write to: unitesi@unipv.it.
https://hdl.handle.net/20.500.14239/21431
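As a rough illustration of the lexicon-based approach mentioned in the abstract, the following minimal Python sketch scores a sentence from the scores of its words. The abstract does not say how word scores are aggregated, so the simple sum used here is an assumption, as are the toy lexicon entries and the function names `tokenize`, `sentence_score`, and `polarity`. Negation and intensifiers are not handled.

```python
import re

# Hypothetical toy lexicon: word -> sentiment score. A real emotional
# dictionary would be far larger; these entries are illustrative only.
TOY_LEXICON = {
    "good": 1.0,
    "great": 2.0,
    "love": 2.0,
    "bad": -1.0,
    "terrible": -2.0,
    "hate": -2.0,
}


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def sentence_score(text, lexicon=TOY_LEXICON):
    """Sum the scores of all tokens found in the lexicon.

    Summation is an assumed aggregation rule; words missing from the
    lexicon contribute 0.
    """
    return sum(lexicon.get(token, 0.0) for token in tokenize(text))


def polarity(text, lexicon=TOY_LEXICON):
    """Map the summed sentence score to a polarity label."""
    score = sentence_score(text, lexicon)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"
```

Calling `polarity("I love this great phone")` on this toy lexicon yields a positive label, because both "love" and "great" carry positive scores. In a hybrid setup such as the one the abstract describes, a score like this could plausibly serve as one input alongside machine-learned features, though the thesis itself should be consulted for the actual design.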