Amazon Product Review Analysis

by Xuelin Hou xuelin.amy@gmail.com


Executive summary

In this analysis, I did some basic exploratory data analysis on Amazon review data to reveal the distributions of, and correlations between, different factors.

In the second section, I introduced some basic concepts of text analysis and applied them to the customer reviews. In the sentiment analysis, I checked that the sentiment score matched the customers' product ratings well. In the topic modelling, I fitted an LDA model that clusters the reviews into 5 topics, to make the reviews easier to understand.

The dataset was downloaded from Kaggle at https://www.kaggle.com/datafiniti/consumer-reviews-of-amazon-products

In [3]:
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
import seaborn as sns
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
%matplotlib inline

print('numpy version: ' + np.__version__)
print('pandas version: ' + pd.__version__)
print('seaborn version: ' + sns.__version__)
numpy version: 1.18.1
pandas version: 1.0.1
seaborn version: 0.10.0

Exploratory data analysis

Let's take a look at the dataset, retrieved from Kaggle, by loading it with the pandas read_csv function.

In [4]:
data = pd.read_csv('Datafiniti_Amazon_Consumer_Reviews_of_Amazon_Products.csv', 
                    low_memory=False)
data.head(2)
Out[4]:
id dateAdded dateUpdated name asins brand categories primaryCategories imageURLs keys ... reviews.dateSeen reviews.doRecommend reviews.id reviews.numHelpful reviews.rating reviews.sourceURLs reviews.text reviews.title reviews.username sourceURLs
0 AVqVGZNvQMlgsOJE6eUY 2017-03-03T16:56:05Z 2018-10-25T16:36:31Z Amazon Kindle E-Reader 6" Wifi (8th Generation... B00ZV9PXP2 Amazon Computers,Electronics Features,Tablets,Electro... Electronics https://pisces.bbystatic.com/image2/BestBuy_US... allnewkindleereaderblack6glarefreetouchscreend... ... 2018-05-27T00:00:00Z,2017-09-18T00:00:00Z,2017... False NaN 0 3 http://reviews.bestbuy.com/3545/5442403/review... I thought it would be as big as small paper bu... Too small llyyue https://www.newegg.com/Product/Product.aspx%25...
1 AVqVGZNvQMlgsOJE6eUY 2017-03-03T16:56:05Z 2018-10-25T16:36:31Z Amazon Kindle E-Reader 6" Wifi (8th Generation... B00ZV9PXP2 Amazon Computers,Electronics Features,Tablets,Electro... Electronics https://pisces.bbystatic.com/image2/BestBuy_US... allnewkindleereaderblack6glarefreetouchscreend... ... 2018-05-27T00:00:00Z,2017-07-07T00:00:00Z,2017... True NaN 0 5 http://reviews.bestbuy.com/3545/5442403/review... This kindle is light and easy to use especiall... Great light reader. Easy to use at the beach Charmi https://www.newegg.com/Product/Product.aspx%25...

2 rows × 24 columns

There are more columns than we need, so I simplified the data by keeping only these columns:

  • name
  • primaryCategories
  • dateAdded
  • reviews.username
  • reviews.title
  • reviews.text
  • reviews.numHelpful
  • reviews.rating
In [5]:
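# .copy() makes the subset an independent DataFrame, so adding columns
# later does not trigger pandas' SettingWithCopyWarning
data = data[['name','primaryCategories','dateAdded',
             'reviews.username',
             'reviews.title','reviews.text',
             'reviews.numHelpful','reviews.rating']].copy()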
data.head(2)
Out[5]:
name primaryCategories dateAdded reviews.username reviews.title reviews.text reviews.numHelpful reviews.rating
0 Amazon Kindle E-Reader 6" Wifi (8th Generation... Electronics 2017-03-03T16:56:05Z llyyue Too small I thought it would be as big as small paper bu... 0 3
1 Amazon Kindle E-Reader 6" Wifi (8th Generation... Electronics 2017-03-03T16:56:05Z Charmi Great light reader. Easy to use at the beach This kindle is light and easy to use especiall... 0 5

Statistic summary of reviews

Before summarizing the dataset, I added some additional columns, such as:

  • reviews.len: length of the review text, in characters
  • hour, ym, dow: hour, year-month and day-of-week of dateAdded
In [6]:
data['dateAdded'] = pd.to_datetime(data.dateAdded)
data['reviews.len'] = data['reviews.text'].map(len)
data['hour'] = data.dateAdded.dt.strftime('%H')
data['ym'] = data.dateAdded.dt.strftime('%Y-%m')
data['dow'] = data.dateAdded.dt.strftime('%a')
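
To see what each format code produces, the same codes can be applied to a single timestamp with the standard library. This is only a small sketch, using the dateAdded value of the first row shown earlier; pandas' dt.strftime accepts the same format codes.

```python
from datetime import datetime

# dateAdded value of the first row in the dataset preview
stamp = datetime.strptime('2017-03-03T16:56:05Z', '%Y-%m-%dT%H:%M:%SZ')

hour = stamp.strftime('%H')    # two-digit hour on a 24-hour clock
ym = stamp.strftime('%Y-%m')   # year and month
dow = stamp.strftime('%a')     # abbreviated weekday name (locale dependent)

print(hour, ym, dow)
```

Because strftime returns strings, the three new columns are treated as categorical columns rather than numeric ones in the summaries that follow.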
In [7]:
# summary of numeric columns
data.describe()
Out[7]:
reviews.numHelpful reviews.rating reviews.len
count 5000.000000 5000.000000 5000.000000
mean 0.312400 4.596800 161.348400
std 3.111582 0.731804 242.597383
min 0.000000 1.000000 45.000000
25% 0.000000 4.000000 71.000000
50% 0.000000 5.000000 105.500000
75% 0.000000 5.000000 182.000000
max 105.000000 5.000000 8351.000000
In [8]:
# summary of categorical columns
data.describe(include=['O'])
Out[8]:
name primaryCategories reviews.username reviews.title reviews.text hour ym dow
count 5000 5000 5000 4987 5000 5000 5000 5000
unique 23 4 3815 3124 4385 9 11 6
top Amazon Echo Show Alexa-enabled Bluetooth Speak... Electronics Mike Great tablet Got this for my Daughter-in-Law and she loves ... 14 2017-03 Wed
freq 845 3276 26 122 4 1760 1705 1887

After examining the data summary, we can draw some preliminary insights:

  1. Most reviews (at least 75%) received no helpful votes from other users.
  2. Most reviews gave the product 5 stars (full marks).
  3. The average review length is about 161 characters.
  4. There were 3815 reviewers giving reviews for 23 products in our data.
  5. The most frequent review time was 2pm / Wednesday / March 2017.

The univariate summary above may not be sufficient for understanding the data, so we can partition it along another dimension. This is easily done with the groupby function from pandas.
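
As a minimal sketch of the idea, here is groupby with named aggregation on a tiny table. The column names and values are invented for illustration and are not taken from the review data.

```python
import pandas as pd

# Toy data, invented for illustration only
toy = pd.DataFrame({
    'category': ['A', 'A', 'B', 'B', 'B'],
    'rating':   [5, 4, 3, 5, 4],
    'length':   [100, 200, 50, 150, 100],
})

# Split by category, apply one aggregation per output column,
# then combine the results into a single summary table
summary = toy.groupby('category').agg(
    num_review=('rating', 'size'),
    avg_rating=('rating', 'mean'),
    avg_len=('length', 'mean'),
)
print(summary)
```

Each keyword argument names an output column and pairs an input column with an aggregation, which is the same pattern as the pd.NamedAgg calls used in the summaries below.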


Summary by categories

  • The categories are not well balanced. Electronics has the most products, reviews and reviewers.
  • Electronics,Media has the highest-quality reviews, with the highest average reviews.numHelpful and reviews.len, although it contains only 24 reviews of a single product.
In [9]:
res = data.groupby('primaryCategories')\
    .agg(num_product = pd.NamedAgg('name', pd.Series.nunique),
         num_reviewer = pd.NamedAgg('reviews.username', pd.Series.nunique),
         num_review = pd.NamedAgg('reviews.text', pd.Series.nunique),
         avg_review_len = pd.NamedAgg('reviews.len', lambda i: np.round(np.mean(i),2)),
         avg_rating = pd.NamedAgg('reviews.rating', lambda i: np.round(np.mean(i),2)),
         avg_review_helpful = pd.NamedAgg('reviews.numHelpful', lambda i: np.round(np.mean(i),2))
          )
res
Out[9]:
num_product num_reviewer num_review avg_review_len avg_rating avg_review_helpful
primaryCategories
Electronics 18 2535 2827 158.39 4.55 0.37
Electronics,Hardware 2 1233 1341 163.43 4.70 0.13
Electronics,Media 1 24 24 521.21 4.67 2.96
Office Supplies,Electronics 2 227 236 154.05 4.62 0.40
In [11]:
fig, axs = plt.subplots(2,3, sharey=True)
fig.set_size_inches(14, 6)
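# Pass x/y by keyword (required by newer seaborn releases) and use one
# category order everywhere so the shared y-axis labels match every panel
order = res.index
sns.barplot(x=res['num_product'],  y=res.index, order=order, ax=axs[0,0])
sns.barplot(x=res['num_reviewer'], y=res.index, order=order, ax=axs[0,1])
sns.barplot(x=res['num_review'],   y=res.index, order=order, ax=axs[0,2])
sns.boxplot(x=data['reviews.len'], y=data['primaryCategories'], order=order, ax=axs[1,0])
sns.boxplot(x=data['reviews.rating'], y=data['primaryCategories'], order=order, ax=axs[1,1])
sns.boxplot(x=data['reviews.numHelpful'], y=data['primaryCategories'], order=order, ax=axs[1,2])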
plt.tight_layout()
plt.show()

Summary by year-month

The number of reviews peaks in 2017-03, spread over 12 different products. This may be due to bias in the data collection and does not necessarily reflect the real distribution of the reviews.

In [14]:
res = data.groupby('ym')\
    .agg(num_product = pd.NamedAgg('name', pd.Series.nunique),
           num_review = pd.NamedAgg('reviews.text', pd.Series.nunique),
           avg_review_len = pd.NamedAgg('reviews.len', lambda i: np.round(np.mean(i),2)),
           avg_rating = pd.NamedAgg('reviews.rating', lambda i: np.round(np.mean(i),2))
          ).reset_index()
res
Out[14]:
ym num_product num_review avg_review_len avg_rating
0 2015-12 2 43 155.34 4.84
1 2016-03 1 82 232.80 4.65
2 2016-04 1 371 140.05 4.46
3 2016-06 1 418 129.26 4.51
4 2016-08 1 96 156.11 4.67
5 2017-01 2 781 156.87 4.56
6 2017-03 12 1524 172.30 4.57
7 2017-11 1 4 89.25 5.00
8 2018-02 1 648 180.63 4.67
9 2018-04 1 195 184.67 4.65
10 2018-05 1 590 137.47 4.75

Correlations

There is a mild correlation between reviews.len and reviews.numHelpful, which may suggest that longer reviews tend to be more helpful to other users.
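
For reference, the Pearson coefficient (the default method of pandas' DataFrame.corr) can be computed by hand as below. This is a sketch on made-up numbers, not on the review data, and the function name pearson_r is my own.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation of two equally long sequences of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of cross-products and squared deviations; the 1/n factors cancel out
    cross = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    ss_y = sum((y - mean_y) ** 2 for y in ys)
    return cross / math.sqrt(ss_x * ss_y)

# Toy values, invented for illustration: review lengths and helpful votes
lengths = [50, 100, 150, 200, 250]
helpful = [0, 1, 1, 3, 4]

r = pearson_r(lengths, helpful)
print(round(r, 3))
```

A value near +1 or -1 indicates a strong linear relationship and a value near 0 indicates none.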

In [63]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
sns.set(style="white")

# convert categorical columns to integers to estimate their correlations
fnames_categorical = ['hour','dow','ym','primaryCategories']
data_ = data.copy()
data_[fnames_categorical] = data_[fnames_categorical].apply(lambda i: pd.factorize(i)[0])
# keep numeric columns only; newer pandas versions no longer drop text columns silently
corr = data_.select_dtypes(include='number').corr()

# Generate a mask for the upper triangle
# (np.bool was removed in recent NumPy releases; the builtin bool is equivalent)
mask = np.triu(np.ones_like(corr, dtype=bool))

# Set up the matplotlib figure
f, ax = plt.subplots(figsize=(11, 9))

# Generate a custom diverging colormap
cmap = sns.diverging_palette(220, 10, as_cmap=True)

# Draw the heatmap with the mask and correct aspect ratio
sns.heatmap(corr, mask=mask, cmap=cmap, vmax=.3, center=0,
            square=True, linewidths=.5, cbar_kws={"shrink": .5})
Out[63]:
<matplotlib.axes._subplots.AxesSubplot at 0x1a2afb2128>

Because the data are sparse, many reviews have zero numHelpful, which makes the correlation pattern hard to see. After removing the reviews with a zero numHelpful count, the correlation between review length and the number of helpful votes received becomes visible.

In [68]:
sns.scatterplot(x='reviews.len', y='reviews.numHelpful', data=data.query('`reviews.numHelpful` > 0'))
Out[68]:
<matplotlib.axes._subplots.AxesSubplot at 0x1a29553ba8>

Text analysis

Let's look at a few random reviews.

In [14]:
# check some random reviews
random_reviews = data.sample(3)

for i in range(len(random_reviews)):
    print('Review #{} ({} stars) {}'.format(i, 
                                               random_reviews['reviews.rating'].iloc[i],
                                               random_reviews['dateAdded'].iloc[i]))
    print(random_reviews['reviews.title'].iloc[i])
    print(random_reviews['reviews.text'].iloc[i])
    print('-'* 50 + '\r')
Review #0 (5 stars) 2018-02-02T02:30:22Z
Our 5th Echo Show! Drop in feature is used a lot!
Our 5th Echo Show. We love the Drop In feature, allowing us to use them as an intercom. The ease of playing music on all of them at once throughout the house, and controlled from any device is used daily. The connection to our Sonos system is a real plus. Also connected to our Dish TV receiver.
--------------------------------------------------
Review #1 (4 stars) 2018-02-02T02:30:22Z
Good product and quite useful
Amazon ECHO show is excellent with add-on video features. The screen is smaller than I expected but meet the requirements.
--------------------------------------------------
Review #2 (5 stars) 2018-05-02T14:01:51Z
Great sound
Love my Alexa! Having lots of fun asking her questions and enjoy listening to music on it.
--------------------------------------------------

Tokenization

Tokenization means breaking sentences into words or phrases.

The following is an example of tokenization that splits the text on any non-alphanumeric character.

In [32]:
# Tokenize the documents.
from nltk.tokenize import RegexpTokenizer

# Split the documents into tokens.
tokenizer = RegexpTokenizer(r'\w+')

def process_text(x):
    x = x.lower()
    return tokenizer.tokenize(x)

raw_text = random_reviews['reviews.text'].iloc[0]

print('-'* 60)
print('raw review: \n' + raw_text)
print('-'* 60)
print('tokenized review: \n' + str(process_text(raw_text)))
print('-'* 60)
------------------------------------------------------------
raw review: 
Our 5th Echo Show. We love the Drop In feature, allowing us to use them as an intercom. The ease of playing music on all of them at once throughout the house, and controlled from any device is used daily. The connection to our Sonos system is a real plus. Also connected to our Dish TV receiver.
------------------------------------------------------------
tokenized review: 
['our', '5th', 'echo', 'show', 'we', 'love', 'the', 'drop', 'in', 'feature', 'allowing', 'us', 'to', 'use', 'them', 'as', 'an', 'intercom', 'the', 'ease', 'of', 'playing', 'music', 'on', 'all', 'of', 'them', 'at', 'once', 'throughout', 'the', 'house', 'and', 'controlled', 'from', 'any', 'device', 'is', 'used', 'daily', 'the', 'connection', 'to', 'our', 'sonos', 'system', 'is', 'a', 'real', 'plus', 'also', 'connected', 'to', 'our', 'dish', 'tv', 'receiver']
------------------------------------------------------------
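
The same pattern can be tried with the standard library alone. This is a rough stand-in for the tokenizer above, assuming plain re.findall with the same \w+ pattern; simple_tokenize is a name I made up for the sketch.

```python
import re

def simple_tokenize(text):
    # Lowercase, then keep every maximal run of letters, digits or underscores
    return re.findall(r'\w+', text.lower())

tokens = simple_tokenize("Don't buy it, it's too small!")
print(tokens)
```

Note that contractions are split at the apostrophe, so "don't" becomes 'don' and 't'. The topic-modelling preprocessing later removes such one-character tokens.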

Word cloud by category

A word cloud is another interesting visualization that shows the distribution of tokens in a text.

In [78]:
from wordcloud import WordCloud
fig, axs = plt.subplots(2,2)
fig.set_size_inches(14,6)
for i, cate in enumerate(data.primaryCategories.unique()):
    text = '\n'.join(data.loc[data.primaryCategories == cate, 'reviews.text'].values)
    wordcloud = WordCloud(background_color='white').generate(text)
    axs[i // 2, i % 2].imshow(wordcloud, interpolation="bilinear")
    axs[i // 2, i % 2].set_title(cate)
    axs[i // 2, i % 2].axis('off')
fig.tight_layout()
plt.show()

Sentiment Analysis

Sentiment analysis scores the sentiment expressed in human text. It can be used to monitor brand awareness and perception, or customers' attitudes towards products, by analyzing the reviews.

How does the sentiment score work?

A sentiment model stores lists of positive / neutral / negative keywords and compares the tokens with these keywords to compute the individual pos/neu/neg scores. Finally, a combined score is aggregated as the sentiment score for the sentence.
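
To make the idea concrete, here is a toy lexicon-based scorer. It is a sketch only: the lexicon and its values are made up, and the squashing formula with alpha = 15 is, as far as I know, the form VADER uses for its compound score. The real model adds many rules (negation, intensifiers, punctuation) that are left out here.

```python
import math
import re

# Hypothetical mini-lexicon: words and valence values are made up for this sketch
TOY_LEXICON = {'love': 3.0, 'great': 3.0, 'good': 2.0, 'bad': -2.5, 'terrible': -2.0}

def toy_compound(text, alpha=15):
    """Sum the valence of known words, then squash the total into (-1, 1)."""
    tokens = re.findall(r'\w+', text.lower())
    total = sum(TOY_LEXICON.get(token, 0.0) for token in tokens)
    return total / math.sqrt(total * total + alpha)

print(toy_compound('Love it, great sound'))                 # positive words only
print(toy_compound('Terrible product, bad voice quality'))  # negative words only
print(toy_compound('It arrived on Monday'))                 # no known words
```

Words missing from the lexicon contribute nothing, so a text without any sentiment keyword scores exactly zero.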

The following is an example of sentiment analysis on a random review.

  • neg: negative score
  • neu: neutral score
  • pos: positive score
  • compound: combined score
In [33]:
analyzer = SentimentIntensityAnalyzer()
text = random_reviews['reviews.text'].iloc[0]
print(text)
analyzer.polarity_scores(text)
Our 5th Echo Show. We love the Drop In feature, allowing us to use them as an intercom. The ease of playing music on all of them at once throughout the house, and controlled from any device is used daily. The connection to our Sonos system is a real plus. Also connected to our Dish TV receiver.
Out[33]:
{'neg': 0.033, 'neu': 0.833, 'pos': 0.134, 'compound': 0.7506}
In [34]:
pos_reviews = data.loc[data['reviews.rating'] == 5, :].sample(2)
neg_reviews = data.loc[data['reviews.rating'] == 1, :].sample(2)
random_reviews = pd.concat([pos_reviews, neg_reviews])
scores = random_reviews['reviews.text'].map(lambda i: analyzer.polarity_scores(i)['compound'])
random_reviews['score'] = scores

Here are some more examples of reviews with positive and negative sentiment:

In [35]:
for i in range(len(random_reviews)):
    print('Review #{} ({} stars) by {}'.format(i, 
                                               random_reviews['reviews.rating'].iloc[i],
                                               random_reviews['reviews.username'].iloc[i]))
    print(random_reviews['reviews.title'].iloc[i])
    print('{} (sentiment score: {:0.2f})'.format(random_reviews['reviews.text'].iloc[i],
                                               random_reviews['score'].iloc[i]))
    print('-'*50 + '\r')
Review #0 (5 stars) by dm94101
awesome
I was looking everywhere for this because I lost my charger for my fire stick and i finally found it at best buy. It was a great price and works perfectly! (sentiment score: 0.88)
--------------------------------------------------
Review #1 (5 stars) by LPark
Excellent tablet for the low price
The amazon fire tablet 2016 is quite good. It provides sufficient content & ability to do all of the basics one requires in a tablet. The 8 inch screen is bright and provides various settings. The tablet's light weight makes it very portable. (sentiment score: 0.77)
--------------------------------------------------
Review #2 (1 stars) by JohnS
terrible product,bad voice quality
the speaker voice quality is terrible compare the similar size my logitech UE BOOM.the price is too high, even I got on promotion with $79 (sentiment score: -0.48)
--------------------------------------------------
Review #3 (1 stars) by joesedita
Amazon Fire 7 Tablet
Too bad Amazon turned this tablet into a big advertising tool. Many apps dont work and the camera is not good. (sentiment score: -0.64)
--------------------------------------------------

I correlated the sentiment score with the review ratings as a method of validation. Generally speaking, the sentiment score tracks the review ratings well. This makes sense: angry customers write negative reviews and give the lowest possible rating.

In [36]:
scores = data['reviews.text'].map(lambda i: analyzer.polarity_scores(i)['compound'])
data['score'] = scores

# sentiment score distribution

fig, axs = plt.subplots(ncols=4, sharey=True, sharex=True)
fig.set_size_inches(12, 3)
for idx, cate in enumerate(data.primaryCategories.unique()):
    sns.distplot(data.loc[data.primaryCategories == cate, 'score'].values, ax = axs[idx])
    axs[idx].set_title(cate)
fig.tight_layout()
plt.show()
In [87]:
# relationship between score and rating
sns.violinplot(x='reviews.rating', y='score', data=data)
Out[87]:
<matplotlib.axes._subplots.AxesSubplot at 0x1a2bf92438>

Topic Modelling

Topic modelling clusters a corpus of texts into several topics (groups) by assuming:

  • A document/review is a mixture of different topics
  • A topic is a mixture (distribution) of different tokens (words)
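
These two assumptions describe a generative process, which can be sketched with the standard library. The topics, words and weights below are invented for illustration. Fitting an LDA model goes in the opposite direction: it infers such mixtures from the observed words.

```python
import random

# Hypothetical topics and word weights, invented for illustration
topics = {
    'reading': {'kindle': 0.5, 'book': 0.3, 'screen': 0.2},
    'audio':   {'alexa': 0.5, 'music': 0.3, 'sound': 0.2},
}

# One review as a mixture of topics: mostly about reading, partly about audio
doc_topic_mix = {'reading': 0.7, 'audio': 0.3}

def generate_review(topic_mix, n_words, rng):
    words = []
    for _ in range(n_words):
        # 1. pick a topic according to the document's topic mixture
        topic = rng.choices(list(topic_mix), weights=list(topic_mix.values()))[0]
        # 2. pick a word according to that topic's word distribution
        word_dist = topics[topic]
        words.append(rng.choices(list(word_dist), weights=list(word_dist.values()))[0])
    return words

rng = random.Random(0)  # fixed seed so the sketch is repeatable
review = generate_review(doc_topic_mix, 8, rng)
print(review)
```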

Topic modelling helps us to understand how close the reviews are in meaning. In the following section, I cluster the reviews into 5 topics and generate an interactive visualization to explore which words make up each topic, so that we can use our domain knowledge to come up with a specific topic tag for those reviews.

On some modern e-commerce sites (e.g. Taobao), topic tags are shown at the top of the review section so that customers can quickly filter the reviews by topic.

In [9]:
# Tokenize the documents.
from nltk.tokenize import RegexpTokenizer

# Split the documents into tokens.
tokenizer = RegexpTokenizer(r'\w+')

def process_text(x):
    x = x.lower()
    return tokenizer.tokenize(x)

docs = data['reviews.text'].map(process_text)

# Remove numbers, but not words that contain numbers.
docs = [[token for token in doc if not token.isnumeric()] for doc in docs]

# Remove words that are only one character.
docs = [[token for token in doc if len(token) > 1] for doc in docs]

# Remove stopwords (build the set once instead of reloading the list for every token)
from nltk.corpus import stopwords
stop_words = set(stopwords.words('english'))
docs = [[token for token in doc if token not in stop_words] for doc in docs]
In [99]:
# Lemmatize the documents.
from nltk.stem.wordnet import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
docs = [[lemmatizer.lemmatize(token) for token in doc] for doc in docs]

# Compute bigrams.
from gensim.models import Phrases

# Add bigrams to docs (only ones that appear 20 times or more).
bigram = Phrases(docs, min_count=20)
for idx in range(len(docs)):
    for token in bigram[docs[idx]]:
        if '_' in token:
            # Token is a bigram, add to document.
            docs[idx].append(token)
            
# Remove rare and common tokens.
from gensim.corpora import Dictionary

# Create a dictionary representation of the documents.
dictionary = Dictionary(docs)

# Filter out words that occur in fewer than 20 documents, or in more than 50% of the documents.
dictionary.filter_extremes(no_below=20, no_above=0.5)

# Bag-of-words representation of the documents.
corpus = [dictionary.doc2bow(doc) for doc in docs]

print('Number of unique tokens: %d' % len(dictionary))
print('Number of documents: %d' % len(corpus))
Number of unique tokens: 670
Number of documents: 5000
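
The filtering and bag-of-words steps can be illustrated in plain Python. This sketch only roughly mirrors what filter_extremes and doc2bow do; the toy documents and thresholds are invented, and gensim stores token ids rather than the tokens themselves.

```python
from collections import Counter

# Toy tokenized documents, invented for illustration
toy_docs = [
    ['tablet', 'great', 'price'],
    ['tablet', 'kid', 'love'],
    ['tablet', 'great', 'screen'],
    ['tablet', 'great', 'sound', 'love'],
]

# Document frequency: in how many documents does each token appear?
doc_freq = Counter(token for doc in toy_docs for token in set(doc))

no_below = 2     # keep tokens found in at least 2 documents
no_above = 0.75  # ... and in at most 75% of the documents
max_docs = no_above * len(toy_docs)
vocab = sorted(t for t, df in doc_freq.items() if no_below <= df <= max_docs)

# Bag-of-words: per document, count only the tokens that survived the filter
bows = [Counter(t for t in doc if t in vocab) for doc in toy_docs]

print(vocab)
print(bows[3])
```

Here 'tablet' is dropped because it appears in every document, and the tokens seen in only one document are dropped as too rare.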
In [100]:
# Train LDA model.
from gensim.models import LdaModel

# Set training parameters.
num_topics = 5
chunksize = 2000
passes = 20
iterations = 200
eval_every = None  # Don't evaluate model perplexity, takes too much time.

# Make an index-to-word dictionary.
temp = dictionary[0]  # This is only to "load" the dictionary.
id2word = dictionary.id2token

model = LdaModel(
    corpus=corpus,
    id2word=id2word,
    chunksize=chunksize,
    alpha='auto',
    eta='auto',
    iterations=iterations,
    num_topics=num_topics,
    passes=passes,
    eval_every=eval_every
)

top_topics = model.top_topics(corpus) #, num_words=20)

# Average topic coherence is the sum of topic coherences of all topics, divided by the number of topics.
avg_topic_coherence = sum([t[1] for t in top_topics]) / num_topics
print('Average topic coherence: %.4f.' % avg_topic_coherence)
Average topic coherence: -2.4597.
In [101]:
# Note: newer pyLDAvis releases renamed this module to pyLDAvis.gensim_models
import pyLDAvis.gensim
pyLDAvis.enable_notebook()
vis = pyLDAvis.gensim.prepare(model, corpus, dictionary)
vis
Out[101]: