by Xuelin Hou xuelin.amy@gmail.com
In this analysis, I performed some basic exploratory data analysis on Amazon review data to reveal the distributions of, and correlations between, different factors.
In the second section, I introduced some basic concepts of text analysis and applied them to the customer reviews. In sentiment analysis, I validated that the sentiment scores matched well with the customers' ratings of the products. In topic modelling, I built an LDA model that clusters the reviews into 5 distinct topics for a better understanding of the reviews.
The dataset was retrieved from Kaggle at https://www.kaggle.com/datafiniti/consumer-reviews-of-amazon-products
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
import seaborn as sns
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
%matplotlib inline
print('numpy version: ' + np.__version__)
print('pandas version: ' + pd.__version__)
print('seaborn version: ' + sns.__version__)
Let's take a look at the dataset retrieved from Kaggle by loading the data with the pandas read_csv function.
data = pd.read_csv('Datafiniti_Amazon_Consumer_Reviews_of_Amazon_Products.csv',
low_memory=False)
data.head(2)
There are more columns than we need, so we simplify the data by keeping only the following columns:
name
primaryCategories
dateAdded
reviews.username
reviews.title
reviews.text
reviews.numHelpful
reviews.rating
data = data[['name','primaryCategories','dateAdded',
'reviews.username',
'reviews.title','reviews.text',
'reviews.numHelpful','reviews.rating']]
data.head(2)
Before summarizing the dataset, I added some additional columns:
reviews.len: length of the review text
hour, ym, dow: hour, year-month, and day-of-week of when the review was added
data['dateAdded'] = pd.to_datetime(data.dateAdded)
data['reviews.len'] = data['reviews.text'].map(len)
data['hour'] = data.dateAdded.dt.strftime('%H')
data['ym'] = data.dateAdded.dt.strftime('%Y-%m')
data['dow'] = data.dateAdded.dt.strftime('%a')
# summary of numeric columns
data.describe()
# summary of categorical columns
data.describe(include=['O'])
After examining the data summary, we can find some preliminary insights.
The univariate summary above may not be sufficient for us to understand the data, so we can add another dimension to partition on. This can be done simply with the groupby function from pandas, aggregating measures such as reviews.numHelpful and reviews.len within each category.
res = data.groupby('primaryCategories')\
.agg(num_product = pd.NamedAgg('name', pd.Series.nunique),
num_reviewer = pd.NamedAgg('reviews.username', pd.Series.nunique),
num_review = pd.NamedAgg('reviews.text', pd.Series.nunique),
avg_review_len = pd.NamedAgg('reviews.len', lambda i: np.round(np.mean(i),2)),
avg_rating = pd.NamedAgg('reviews.rating', lambda i: np.round(np.mean(i),2)),
avg_review_helpful = pd.NamedAgg('reviews.numHelpful', lambda i: np.round(np.mean(i),2))
)
res
fig, axs = plt.subplots(2,3, sharey=True)
fig.set_size_inches(14, 6)
sns.barplot(res['num_product'], res.index, ax = axs[0,0])
sns.barplot(res['num_reviewer'], res.index, ax = axs[0,1])
sns.barplot(res['num_review'], res.index, ax = axs[0,2])
sns.boxplot(data['reviews.len'], data.primaryCategories, ax=axs[1,0])
sns.boxplot(data['reviews.rating'], data.primaryCategories, ax=axs[1,1])
sns.boxplot(data['reviews.numHelpful'], data.primaryCategories, ax=axs[1,2])
plt.tight_layout()
plt.show()
We can see a peak in the number of reviews in 2017-03, spread over 12 different products. This may be due to some bias in data collection and does not necessarily reflect the real distribution of the reviews.
res = data.groupby('ym')\
.agg(num_product = pd.NamedAgg('name', pd.Series.nunique),
num_review = pd.NamedAgg('reviews.text', pd.Series.nunique),
avg_review_len = pd.NamedAgg('reviews.len', lambda i: np.round(np.mean(i),2)),
avg_rating = pd.NamedAgg('reviews.rating', lambda i: np.round(np.mean(i),2))
).reset_index()
res
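To make the spike in 2017-03 easier to spot, here is a minimal plotting sketch of the monthly review counts, using only the res table computed above and the libraries already imported:
# plot the number of distinct reviews per month (sketch based on `res` above)
fig, ax = plt.subplots(figsize=(12, 4))
sns.barplot(x='ym', y='num_review', data=res, color='steelblue', ax=ax)
ax.set_xlabel('year-month')
ax.set_ylabel('number of reviews')
plt.xticks(rotation=90)
plt.tight_layout()
plt.show()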
There is some mild correlation between reviews.len and reviews.numHelpful, which may suggest that longer reviews tend to be more helpful for other users.
# numpy, pandas, seaborn and matplotlib have already been imported above
sns.set(style="white")
# convert categorical columns to integers to estimate their correlations
fnames_categorical = ['hour','dow','ym','primaryCategories']
data_ = data.copy()
data_[fnames_categorical] = data_[fnames_categorical].apply(lambda i: pd.factorize(i)[0])
corr = data_.corr()
# Generate a mask for the upper triangle
mask = np.triu(np.ones_like(corr, dtype=bool))  # np.bool is deprecated in recent numpy versions
# Set up the matplotlib figure
f, ax = plt.subplots(figsize=(11, 9))
# Generate a custom diverging colormap
cmap = sns.diverging_palette(220, 10, as_cmap=True)
# Draw the heatmap with the mask and correct aspect ratio
sns.heatmap(corr, mask=mask, cmap=cmap, vmax=.3, center=0,
square=True, linewidths=.5, cbar_kws={"shrink": .5})
Because of the sparsity of the data, there are a lot of reviews with a numHelpful of zero, which makes it difficult to see the pattern of the correlation. After removing the zero counts of numHelpful, we are able to see a correlation between review length and the number of helpful votes received.
sns.scatterplot('reviews.len', 'reviews.numHelpful', data = data.query('`reviews.numHelpful` > 0'))
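To put a rough number on the relationship visible in the scatterplot, here is a small sketch that computes Pearson and Spearman correlations on the non-zero subset; it assumes scipy is available, which is not imported elsewhere in this notebook:
# quantify the correlation between review length and helpful votes (sketch; scipy is an extra dependency)
from scipy import stats
nonzero = data.query('`reviews.numHelpful` > 0')
pearson_r, pearson_p = stats.pearsonr(nonzero['reviews.len'], nonzero['reviews.numHelpful'])
spearman_r, spearman_p = stats.spearmanr(nonzero['reviews.len'], nonzero['reviews.numHelpful'])
print('Pearson r = {:.3f} (p = {:.3g})'.format(pearson_r, pearson_p))
print('Spearman rho = {:.3f} (p = {:.3g})'.format(spearman_r, spearman_p))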
# check some random reviews
random_reviews = data.sample(3)
for i in range(len(random_reviews)):
    print('Review #{} ({} stars) {}'.format(i,
          random_reviews['reviews.rating'].iloc[i],
          random_reviews['dateAdded'].iloc[i]))
    print(random_reviews['reviews.title'].iloc[i])
    print(random_reviews['reviews.text'].iloc[i])
    print('-'* 50 + '\r')
Tokenization means breaking sentences into words / phrases.
The following is an example of tokenization that splits on any character that is not alphanumeric.
# Tokenize the documents.
from nltk.tokenize import RegexpTokenizer
# Split the documents into tokens.
tokenizer = RegexpTokenizer(r'\w+')
def process_text(x):
    x = x.lower()
    return tokenizer.tokenize(x)
raw_text = random_reviews['reviews.text'].iloc[0]
print('-'* 60)
print('raw review: \n' + raw_text)
print('-'* 60)
print('tokenized review: \n' + str(process_text(raw_text)))
print('-'* 60)
A word cloud is another interesting visualization that shows the distribution of tokens in the text.
from wordcloud import WordCloud
fig, axs = plt.subplots(2,2)
fig.set_size_inches(14,6)
for i, cate in enumerate(data.primaryCategories.unique()):
    text = '\n'.join(data.loc[data.primaryCategories == cate, 'reviews.text'].values)
    wordcloud = WordCloud(background_color='white').generate(text)
    axs[i // 2, i % 2].imshow(wordcloud, interpolation="bilinear")
    axs[i // 2, i % 2].set_title(cate)
    axs[i // 2, i % 2].axis('off')
fig.tight_layout()
plt.show()
Sentiment analysis scores the sentiment expressed in human text. It can be used to monitor brand awareness and perception, as well as customers' attitudes towards products, by analyzing their reviews.
How does a sentiment score work?
A lexicon-based sentiment model stores lists of positive / neutral / negative keywords and compares the tokens against those lists to compute the individual pos/neu/neg scores. Finally, a combined (compound) score is aggregated as the sentiment score for the sentence.
Following is an example of sentiment analysis of a random review.
analyzer = SentimentIntensityAnalyzer()
text = random_reviews['reviews.text'].iloc[0]
print(text)
analyzer.polarity_scores(text)
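The compound score ranges from -1 (most negative) to +1 (most positive). A common convention, suggested in the VADER documentation, is to treat compound >= 0.05 as positive and compound <= -0.05 as negative; the exact thresholds are a modelling choice. A small sketch of labelling a review this way:
# map a compound score to a coarse sentiment label (thresholds follow the common convention; adjust as needed)
def sentiment_label(text):
    compound = analyzer.polarity_scores(text)['compound']
    if compound >= 0.05:
        return 'positive'
    elif compound <= -0.05:
        return 'negative'
    return 'neutral'

print(sentiment_label(text))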
pos_reviews = data.loc[data['reviews.rating'] == 5, :].sample(2)
neg_reviews = data.loc[data['reviews.rating'] == 1, :].sample(2)
random_reviews = pd.concat([pos_reviews, neg_reviews])
scores = random_reviews['reviews.text'].map(lambda i: analyzer.polarity_scores(i)['compound'])
random_reviews['score'] = scores
Below are some more examples of positive and negative sentiment reviews.
for i in range(len(random_reviews)):
    print('Review #{} ({} stars) by {}'.format(i,
          random_reviews['reviews.rating'].iloc[i],
          random_reviews['reviews.username'].iloc[i]))
    print(random_reviews['reviews.title'].iloc[i])
    print('{} (sentiment score: {:0.2f})'.format(random_reviews['reviews.text'].iloc[i],
          random_reviews['score'].iloc[i]))
    print('-'*50 + '\r')
I correlated the sentiment score with the review ratings as a method of validation. Generally speaking, the sentiment score is well correlated with the review ratings. This also makes sense: angry customers write negative reviews and give ratings as low as possible.
scores = data['reviews.text'].map(lambda i: analyzer.polarity_scores(i)['compound'])
data['score'] = scores
# sentiment score distribution
fig, axs = plt.subplots(ncols=4, sharey=True, sharex=True)
fig.set_size_inches(12, 3)
for idx, cate in enumerate(data.primaryCategories.unique()):
sns.distplot(data.loc[data.primaryCategories == cate, 'score'].values, ax = axs[idx])
axs[idx].set_title(cate)
fig.tight_layout()
plt.show()
# relationship between score and rating
sns.violinplot(x='reviews.rating', y='score', data=data)
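To quantify how well the sentiment score tracks the star ratings, a quick sketch using pandas' built-in rank correlation:
# Spearman rank correlation between star rating and VADER compound score (sketch)
rho = data['reviews.rating'].corr(data['score'], method='spearman')
print('Spearman correlation between rating and sentiment score: {:.3f}'.format(rho))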
Topic modelling clusters a corpus of texts into several topics (groups), assuming that each document is a mixture of a small number of topics and that each topic is a distribution over words.
Topic modelling helps us understand the proximity of the review meanings. In the following section, I cluster the reviews into 5 topics, and an interactive visualization is generated to explore how each topic is made up of different words, so that we can use our domain knowledge to come up with a specific topic tag for those reviews.
On some modern e-commerce sites (e.g., Taobao), topic tags are added at the top of the review section so that customers can quickly filter reviews on certain types of topics.
# Tokenize the documents.
from nltk.tokenize import RegexpTokenizer
# Split the documents into tokens.
tokenizer = RegexpTokenizer(r'\w+')
def process_text(x):
    x = x.lower()
    return tokenizer.tokenize(x)
docs = data['reviews.text'].map(process_text)
# Remove numbers, but not words that contain numbers.
docs = [[token for token in doc if not token.isnumeric()] for doc in docs]
# Remove words that are only one character.
docs = [[token for token in doc if len(token) > 1] for doc in docs]
# Remove stopwords (build the set once instead of calling stopwords.words for every token).
from nltk.corpus import stopwords
stop_words = set(stopwords.words('english'))
docs = [[token for token in doc if token not in stop_words] for doc in docs]
# Lemmatize the documents.
from nltk.stem.wordnet import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
docs = [[lemmatizer.lemmatize(token) for token in doc] for doc in docs]
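Note: the stopword list and the WordNet data used above are not bundled with NLTK by default. If they are missing, a one-time download along these lines should fetch them:
# one-time download of the NLTK resources used above (sketch)
import nltk
nltk.download('stopwords')
nltk.download('wordnet')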
# Compute bigrams.
from gensim.models import Phrases
# Add bigrams and trigrams to docs (only ones that appear 20 times or more).
bigram = Phrases(docs, min_count=20)
for idx in range(len(docs)):
    for token in bigram[docs[idx]]:
        if '_' in token:
            # Token is a bigram, add to document.
            docs[idx].append(token)
# Remove rare and common tokens.
from gensim.corpora import Dictionary
# Create a dictionary representation of the documents.
dictionary = Dictionary(docs)
# Filter out words that occur in fewer than 20 documents, or in more than 50% of the documents.
dictionary.filter_extremes(no_below=20, no_above=0.5)
# Bag-of-words representation of the documents.
corpus = [dictionary.doc2bow(doc) for doc in docs]
print('Number of unique tokens: %d' % len(dictionary))
print('Number of documents: %d' % len(corpus))
# Train LDA model.
from gensim.models import LdaModel
# Set training parameters.
num_topics = 5
chunksize = 2000
passes = 20
iterations = 200
eval_every = None # Don't evaluate model perplexity, takes too much time.
# Make an index-to-word dictionary.
temp = dictionary[0] # This is only to "load" the dictionary.
id2word = dictionary.id2token
model = LdaModel(
corpus=corpus,
id2word=id2word,
chunksize=chunksize,
alpha='auto',
eta='auto',
iterations=iterations,
num_topics=num_topics,
passes=passes,
eval_every=eval_every
)
top_topics = model.top_topics(corpus) #, num_words=20)
# Average topic coherence is the sum of topic coherences of all topics, divided by the number of topics.
avg_topic_coherence = sum([t[1] for t in top_topics]) / num_topics
print('Average topic coherence: %.4f.' % avg_topic_coherence)
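Before the interactive view below, the trained model can also list the highest-weight words for each of the 5 topics in plain text; a quick sketch:
# print the top words per topic (sketch)
for topic_id, topic_words in model.print_topics(num_topics=num_topics, num_words=10):
    print('Topic {}: {}'.format(topic_id, topic_words))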
import pyLDAvis.gensim
pyLDAvis.enable_notebook()
vis = pyLDAvis.gensim.prepare(model, corpus, dictionary)
vis
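Finally, in the spirit of the Taobao-style topic tags mentioned earlier, each review can be tagged with its most probable topic. A minimal sketch using gensim's get_document_topics (the topic column name is just for illustration):
# assign each review its dominant topic (sketch)
def dominant_topic(bow):
    topic_probs = model.get_document_topics(bow)
    return max(topic_probs, key=lambda t: t[1])[0] if topic_probs else None

data['topic'] = [dominant_topic(bow) for bow in corpus]
data['topic'].value_counts()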