Multi-domain Sentiment Classification
Shoushan Li and Chengqing Zong
National Laboratory of Pattern Recognition
Institute of Automation, Chinese Academy of Sciences, Beijing 100190, China
{sshanli,cqzong}@nlpr.ia.ac.cn
Proceedings of ACL-08: HLT, Short Papers (Companion Volume), pages 257–260, Columbus, Ohio, USA, June 2008. © 2008 Association for Computational Linguistics

Abstract

This paper addresses a new task in sentiment classification, called multi-domain sentiment classification, that aims to improve performance by fusing training data from multiple domains. To achieve this, we propose two fusion approaches, feature-level and classifier-level, that use training data from multiple domains simultaneously. Experimental studies show that multi-domain sentiment classification using the classifier-level approach performs much better than single domain classification (using the training data individually).

1 Introduction

Sentiment classification is a special task of text categorization that aims to classify documents according to their opinion of, or sentiment toward, a given subject (e.g., whether an opinion is supported or not) (Pang et al., 2002). This task has attracted considerable interest due to its wide applications.

Sentiment classification is a very domain-specific problem; a classifier trained on the data from one domain may fail when tested against data from another. As a result, real application systems usually require some labeled data from multiple domains to guarantee acceptable performance for different domains. However, each domain has a very limited amount of training data because creating large-scale high-quality labeled corpora is difficult and time-consuming. Given the limited multi-domain training data, an interesting task arises: how to make full use of all training data to improve sentiment classification performance. We name this new task ‘multi-domain sentiment classification’.

In this paper, we propose two approaches to multi-domain sentiment classification. In the first, called feature-level fusion, we combine the feature sets from all the domains into one feature set. Using the unified feature set, we train a classifier using all the training data regardless of domain. In the second approach, classifier-level fusion, we train a base classifier using the training data from each domain and then apply combination methods to combine the base classifiers.

2 Related Work

Sentiment classification has become a hot topic since the publication of the work on classification of movie reviews by Pang et al. (2002). This was followed by a great many studies into sentiment classification focusing on many domains besides movies.

Research into sentiment classification over multiple domains remains sparse. It is worth noting that Blitzer et al. (2007) deal with the domain adaptation problem for sentiment classification, where labeled data from one domain is used to train a classifier for classifying data from a different domain. Our work focuses on the problem of how to make multiple domains ‘help each other’ when all contain some labeled samples. These two problems are both important for real applications of sentiment classification.

3 Our Approaches

3.1 Problem Statement

In a standard supervised classification problem, we seek a predictor f (also called a classifier) that
maps an input vector x to the corresponding class label y. The predictor is trained on a finite set of labeled examples $\{(X_i, Y_i)\}$ ($i = 1, \dots, n$) and its objective is to minimize expected error, i.e.,

$$f = \arg\min_{f \in H} \sum_{i=1}^{n} L(f(X_i), Y_i)$$

where L is a prescribed loss function and H is a set of functions called the hypothesis space, which consists of functions from x to y. In sentiment classification, the input vector of one document is constructed from weights of terms. The terms $(t_1, \dots, t_N)$ are possibly words, word n-grams, or even phrases extracted from the training data, with N being the number of terms. The output label y has a value of 1 or -1, representing a positive or negative sentiment classification.

In multi-domain classification, m different domains are indexed by $k = 1, \dots, m$, each with $n_k$ training samples $(X_{i_k}, Y_{i_k})$, $i_k = 1, \dots, n_k$. A straightforward approach is to train a predictor $f_k$ for the k-th domain using only the training data $\{(X_{i_k}, Y_{i_k})\}$. We call this approach single domain classification and show its architecture in Figure 1.

[Figure 1 diagram: the training data from each of Domains 1, 2, ..., m trains its own classifier (Classifier 1, 2, ..., m), which is applied to the testing data from the same domain.]
Figure 1: The architecture of single domain classification.

3.2 Feature-level Fusion Approach

Although terms are extracted from multiple domains, some occur in all domains and convey the same sentiment (this can be called global sentiment information). For example, terms like ‘excellent’ and ‘perfect’ express positive sentiment independent of domain. To learn the global sentiment information more reliably, we can pool the training data from all domains for training. Our first approach uses a common set of terms $(t'_1, \dots, t'_{N_{all}})$ to construct a uniform feature vector $x'$ and then trains a predictor using all training data:

$$f_{all} = \arg\min_{f_{all} \in H_{all}} \sum_{k=1}^{m} \sum_{i_k=1}^{n_k} L(f_{all}(X'_{i_k}), Y_{i_k})$$

We call this approach feature-level fusion and show its architecture in Figure 2. The common set of terms is the union of the term sets from multiple domains.

[Figure 2 diagram: the training data from Domains 1, 2, ..., m are merged into training data from all domains using a uniform feature vector, which trains a single classifier; this classifier is applied to the testing data from Domains 1, 2, ..., m.]
Figure 2: The architecture of the feature-level fusion approach.

The feature-level fusion approach is simple to implement and needs no extra labeled data. Note that training data from different domains contribute differently to the learning process for a specific domain. For example, given data from three domains, books, DVDs and kitchen, suppose we want to train a classifier for classifying reviews from books. As the training data from DVDs is much more similar to books than that from kitchen (Blitzer et al., 2007), we should give the data from DVDs a higher weight. Unfortunately, the feature-level fusion approach lacks the capacity to do this. A more capable approach is required to deal with the differences among the classification abilities of training data from different domains.

3.3 Classifier-level Fusion Approach

As mentioned in Section 3.1, single domain classification trains a single classifier for each domain using the training data in the corresponding domain. As all these single classifiers aim to determine the sentiment orientation of a document, a single classifier can certainly be used to classify documents from other domains. Given multiple single classifiers, our second approach is to combine them into a multiple classifier system for sentiment classification. We call this approach classifier-level fusion and show its architecture in Figure 3. This approach consists of two main steps:
(1) train multiple base classifiers; (2) combine the base classifiers. In the first step, the base classifiers are the single classifiers $f_k$ ($k = 1, \dots, m$) from all domains. In the second step, many combination methods can be applied to combine the base classifiers. A well-known method called meta-learning (ML) has been shown to be very effective (Vilalta and Drissi, 2002). The key idea behind this method is to train a meta-classifier with input attributes that are the output of the base classifiers.

[Figure 3 diagram: the training data from Domains 1, 2, ..., m train Base Classifiers 1, 2, ..., m; the development data from Domains 1, 2, ..., m are used to build Multiple Classifier Systems 1, 2, ..., m, which are applied to the testing data from Domains 1, 2, ..., m.]
Figure 3: The architecture of the classifier-level fusion approach.

Formally, let $X_{k'}$ denote a feature vector of a sample from the development data of the k'-th domain ($k' = 1, \dots, m$). The output of the k-th base classifier $f_k$ on this sample is the probability distribution over the set of classes $\{c_1, c_2, \dots, c_n\}$, i.e.,

$$p_k(X_{k'}) = \langle p_k(c_1 \mid X_{k'}), \dots, p_k(c_n \mid X_{k'}) \rangle$$

For the k'-th domain, we train a meta-classifier $f_{k'}$ ($k' = 1, \dots, m$) using the development data from the k'-th domain with the meta-level feature vector $X^{meta}_{k'} \in R^{m \cdot n}$:

$$X^{meta}_{k'} = \langle p_1(X_{k'}), \dots, p_k(X_{k'}), \dots, p_m(X_{k'}) \rangle$$

Each meta-classifier is then used to classify the testing data from the same domain.

Unlike the feature-level approach, the classifier-level approach treats the training data from different domains individually and thus has the ability to take the differences in classification abilities into account.

4 Experiments

Data Set: We carry out our experiments on the labeled product reviews from four domains: books, DVDs, electronics, and kitchen appliances¹. Each domain contains 1,000 positive and 1,000 negative reviews.

¹ This data set was collected by Blitzer et al. (2007): http://www.seas.upenn.edu/~mdredze/datasets/sentiment/

Experiment Implementation: We apply the SVM algorithm, which has been shown to perform better than many other classification algorithms (Pang et al., 2002), to construct our classifiers. Here, we use LIBSVM² with a linear kernel function for training and testing. In our experiments, the data in each domain are partitioned randomly into training data, development data and testing data in the proportions 70%, 20% and 10% respectively. The development data are used to train the meta-classifier.

² LIBSVM is an integrated software package for SVM: http://www.csie.ntu.edu.tw/~cjlin/libsvm/

Baseline: The baseline uses the single domain classification approach described in Section 3.1. We test four different feature sets to construct our feature vector. First, we use unigrams (e.g., ‘happy’) as features and perform the standard feature selection process to find the optimal feature set of unigrams (1Gram). The selection method is Bi-Normal Separation (BNS), which is reported to be excellent in many text categorization tasks (Forman, 2003). The optimization criterion is to find the set of unigrams with the best performance on the development data by selecting the features with high BNS scores. Then, we obtain the optimal word bi-gram feature set (2Gram; e.g., ‘very happy’) and the mixed feature set (1+2Gram) in the same way. The fourth feature set (1Gram+2Gram) also consists of unigrams and bi-grams, just like the third one. The difference between them lies in their selection strategy. The third feature set is obtained by selecting the unigrams and bi-grams with high BNS scores from the mixed pool, while the fourth is obtained by simply uniting the two optimal sets of 1Gram and 2Gram.

From Table 1, we see that 1Gram+2Gram features perform much better than other types of features, which implies that we need to select good unigram and bi-gram features separately before combining them. Although the size of our training data is smaller than that reported in Blitzer et al.
(2007) (70% vs. 80%), the classification performance is comparable to theirs.

Features       Books   DVDs    Electronics   Kitchen
1Gram          0.75    0.84    0.8           0.825
2Gram          0.75    0.73    0.815         0.785
1+2Gram        0.765   0.81    0.825         0.80
1Gram+2Gram    0.79    0.845   0.85          0.845

Table 1: Accuracy results on the testing data of single domain classification using different feature sets.

We implement the fusion using 1+2Gram and 1Gram+2Gram features respectively. From Figure 4, we see that both fusion approaches generally outperform single domain classification when using 1+2Gram features. They increase the average accuracy from 0.8 to 0.82375 and 0.83875, a significant relative error reduction of 11.87% and 19.38% over the baseline.

[Figure 4 plots: two bar charts, titled ‘1+2Gram Features’ and ‘1Gram+2Gram Features’, showing Accuracy (%) for Books, DVDs, Electronics and Kitchen under three settings: single domain classification, feature-level fusion, and classifier-level fusion with ML.]
Figure 4: Accuracy results on the testing data using multi-domain classification with different approaches.

However, when the performance of the baseline increases, the feature-level approach fails to improve performance in three domains. This is mainly because the accuracies of the base classifiers are extremely unbalanced on the testing data of these domains. For example, the four base classifiers from Books, DVDs, Electronics, and Kitchen achieve accuracies of 0.675, 0.62, 0.85, and 0.79 respectively on the testing data from Electronics. To deal with such unbalanced performance, we definitely need to put a sufficiently high weight on the training data from Electronics. However, the feature-level fusion approach simply pools all training data from different domains and treats them equally. Thus it cannot capture the unbalanced information. In contrast, meta-learning is able to learn this imbalance automatically by training the meta-classifier on the development data. Therefore, it can still increase the average accuracy from 0.8325 to 0.8625, an impressive relative error reduction of 17.91% over the baseline.

5 Conclusion

In this paper, we propose two approaches to the multi-domain sentiment classification task. Empirical studies show that the classifier-level approach generally outperforms the feature-level approach. Compared to single domain classification, multi-domain classification with the classifier-level approach can consistently achieve much better results.

Acknowledgments

The research work described in this paper has been partially supported by the Natural Science Foundation of China under Grants No. 60575043 and 60121302, the National High-Tech Research and Development Program of China under Grant No. 2006AA01Z194, the National Key Technologies R&D Program of China under Grant No. 2006BAH03B02, and Nokia (China) Co. Ltd.

References

J. Blitzer, M. Dredze, and F. Pereira. 2007. Biographies, Bollywood, Boom-boxes and Blenders: Domain adaptation for sentiment classification. In Proceedings of ACL.

G. Forman. 2003. An extensive empirical study of feature selection metrics for text classification. Journal of Machine Learning Research, 3: 1533-7928.

B. Pang, L. Lee, and S. Vaithyanathan. 2002. Thumbs up? Sentiment classification using machine learning techniques. In Proceedings of EMNLP.

R. Vilalta and Y. Drissi. 2002. A perspective view and survey of meta-learning. Artificial Intelligence Review, 18(2): 77–95.
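As an illustration of the feature constructions described in Sections 3.2 and 3.3, the following minimal sketch contrasts the uniform feature vector of feature-level fusion with the meta-level feature vector of classifier-level fusion. It is only a sketch under stated assumptions: the function names, the binary term weights, the toy term sets and the toy probability values are all hypothetical and are not taken from the experiments.

```python
from typing import Dict, List, Sequence


def unified_vocabulary(domain_terms: Dict[str, List[str]]) -> List[str]:
    """Feature-level fusion: the common term set is the union of the
    per-domain term sets (sorted only to keep the order stable)."""
    union = set()
    for terms in domain_terms.values():
        union.update(terms)
    return sorted(union)


def to_uniform_vector(doc_terms: Sequence[str], vocabulary: List[str]) -> List[int]:
    """Map one document onto the uniform feature vector x'.
    Binary presence weights are an assumption of this sketch."""
    present = set(doc_terms)
    return [1 if term in present else 0 for term in vocabulary]


def meta_vector(base_outputs: Sequence[Sequence[float]]) -> List[float]:
    """Classifier-level fusion: concatenate the class distributions
    p_1(X), ..., p_m(X) of the m base classifiers into one vector of
    length m * n, which is the input of the meta-classifier."""
    meta: List[float] = []
    for distribution in base_outputs:
        meta.extend(distribution)
    return meta


# Hypothetical per-domain term sets.
domain_terms = {
    "books": ["excellent", "boring"],
    "dvds": ["excellent", "scratched"],
}
vocab = unified_vocabulary(domain_terms)
x_prime = to_uniform_vector(["excellent", "scratched"], vocab)

# Hypothetical outputs of m = 2 base classifiers over n = 2 classes.
x_meta = meta_vector([(0.7, 0.3), (0.4, 0.6)])
```

In the feature-level case every document, whatever its domain, is mapped onto the same vocabulary before a single classifier is trained. In the classifier-level case the meta-classifier of a domain only sees the base classifiers' outputs, which is what lets it weight the domains differently.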
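The relative error reductions quoted in Section 4 can be checked with a few lines of arithmetic. The sketch below assumes the usual definition, namely the reduction in error rate (1 minus accuracy) divided by the baseline error rate; the function and variable names are hypothetical.

```python
def relative_error_reduction(baseline_acc: float, new_acc: float) -> float:
    """Fraction of the baseline error (1 - accuracy) removed by the new system."""
    baseline_error = 1.0 - baseline_acc
    new_error = 1.0 - new_acc
    return (baseline_error - new_error) / baseline_error


# Average accuracies quoted in Section 4.
rer_feature_level = relative_error_reduction(0.8, 0.82375)     # 1+2Gram, feature-level fusion
rer_classifier_level = relative_error_reduction(0.8, 0.83875)  # 1+2Gram, classifier-level fusion
rer_best_features = relative_error_reduction(0.8325, 0.8625)   # 1Gram+2Gram, classifier-level fusion

for name, value in [
    ("feature-level, 1+2Gram", rer_feature_level),
    ("classifier-level, 1+2Gram", rer_classifier_level),
    ("classifier-level, 1Gram+2Gram", rer_best_features),
]:
    print(f"{name}: {100 * value:.2f}%")
```

Under this definition the three values come out at roughly 11.9%, 19.4% and 17.9%, in line with the reported figures.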