Ever since I started my data science journey, I have been fascinated by word clouds. A word cloud (or tag cloud) is a data visualization tool for exploring text data, where each word is sized by its frequency or significance. This is handy when you have text data and want a quick sense of which words dominate.
After months of developing my skills as a data scientist, I finally got fed up last week. I wanted to learn how to make one of those super cool word clouds that I had seen (and never forgotten)! So, I found a helpful DataCamp tutorial, read a little documentation, and was ready to practice my new knowledge.
Downloading packages and importing data
The first step of practicing a new visualization type is to find a data set that suits it. Since we are practicing word clouds, we need some data with text in it. There are many great options, such as customer reviews, tweets, or YouTube video captions. But I found a dataset on Kaggle about Kickstarter projects that I figured would be perfect for this scenario.
import numpy as np
import pandas as pd
from PIL import Image
from wordcloud import WordCloud, STOPWORDS, ImageColorGenerator
import matplotlib.pyplot as plt
%matplotlib inline

# Link to the Kickstarter Projects dataset on Kaggle:
# https://www.kaggle.com/kemical/kickstarter-projects#ks-projects-201801.csv
df = pd.read_csv('ks-projects-201801.csv')
df.head()
Cleaning the data
According to the dataset’s documentation, the usd_pledged_real and usd_goal_real columns hold the pledged and goal amounts converted to US dollars using the Fixer.io API. So, we should use those and drop the other money-related columns.
The purpose of Kickstarter is to raise as much money as possible to start your business. Therefore, a column that appropriately defines success is the amount of money pledged to each campaign. Let’s use the usd_pledged_real column to retrieve just the top 500 Kickstarter campaigns based on amount pledged.
# Copy the converted columns to shorter names and drop the ones we no longer need
df['usd_pledged'] = df.usd_pledged_real
df['usd_goal'] = df.usd_goal_real
df_ks = df.drop(columns=['goal', 'usd pledged', 'usd_pledged_real', 'usd_goal_real'])

# Get the top 500 Kickstarter campaigns by $ pledged
df_sorted = df_ks.sort_values(by='usd_pledged', ascending=False)
df_top = df_sorted.head(500)
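The same cleanup can also be written with rename() and nlargest(). Here is a minimal sketch on a tiny made-up frame: the rows are invented for illustration, and only the column names come from the dataset.

```python
import pandas as pd

# Toy stand-in for the Kickstarter data; the rows are invented for illustration.
toy = pd.DataFrame({
    "name": ["Smart Mug", "First Board Game", "Tiny Camera", "Poetry Zine"],
    "goal": [1000.0, 500.0, 2000.0, 100.0],
    "usd pledged": [900.0, 4000.0, 150.0, 20.0],
    "usd_pledged_real": [950.0, 4100.0, 160.0, 25.0],
    "usd_goal_real": [1000.0, 500.0, 2100.0, 100.0],
})

# Drop the unconverted money columns, then rename the converted ones in one step.
toy_ks = (
    toy.drop(columns=["goal", "usd pledged"])
       .rename(columns={"usd_pledged_real": "usd_pledged",
                        "usd_goal_real": "usd_goal"})
)

# nlargest(n, column) returns the n rows with the largest values, sorted descending.
toy_top = toy_ks.nlargest(2, "usd_pledged")
print(toy_top[["name", "usd_pledged"]])
```

On the real data, the last step would be nlargest(500, "usd_pledged"), which replaces the separate sort and head calls.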
Visualizing the data with a word cloud
Now comes the fun part. We can take advantage of the wordcloud library that we imported earlier and use it to build a word cloud of the names of the top 500 Kickstarter campaigns.
The WordCloud object has a method .generate() that will generate a word cloud from a string. We can use a generator expression with " ".join() to build one giant string containing all of the names of the Kickstarter campaigns.
Note that we also imported STOPWORDS from the wordcloud library. This is a built-in set of words that usually don’t add much value to text analysis (such as it, they, etc.). We can pass this set of words in with the stopwords parameter, to make sure we exclude those undesired words.
When we display the image with plt.imshow(), the interpolation parameter controls how the image is smoothed. For more information about how to set it properly, check out the matplotlib documentation. For now, it is set to bilinear.
# Join all names and separate them with whitespace
text = " ".join(str(name) for name in df_top.name)

# Create stopword list:
stopwords = set(STOPWORDS)

# Generate a word cloud image
wordcloud = WordCloud(stopwords=stopwords).generate(text)

# Display the generated image:
# the matplotlib way:
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")
plt.show()
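To see roughly what drives the word sizes, here is a simplified sketch of the counting step using only the standard library. It is not the wordcloud library's actual implementation (which, among other things, also looks at two-word phrases by default). The sample names, the tiny stopword set, and the word_counts helper are all made up for illustration.

```python
import re
from collections import Counter

# Invented campaign names and a tiny stopword set, for illustration only.
sample_names = ["The First Smart Watch", "A Smart Camera for Everyone", "The Card Game"]
sample_stopwords = {"the", "a", "for"}

def word_counts(names, stopwords):
    """Count words across all names, ignoring case and skipping stopwords."""
    text = " ".join(str(name) for name in names)
    words = re.findall(r"[a-z0-9']+", text.lower())
    return Counter(word for word in words if word not in stopwords)

counts = word_counts(sample_names, sample_stopwords)
print(counts.most_common(3))
```

The words with the highest counts are the ones that would end up with the largest fonts in the cloud.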
Let’s make it pretty
Just a few lines of code, and we’ve got a word cloud! There’s a lot going on here in a small image, so let’s make some improvements.
Let’s change the background to white to make it easier to read. Let’s make the minimum font size a little larger. Finally, let’s exclude the words “First” and “Smart” so we can see some other words pop up.
# Create stopword list:
stopwords = set(STOPWORDS)

# This time, add in your own words to ignore
stopwords.update(["First", "Smart"])

# Generate a word cloud image
wordcloud = WordCloud(stopwords=stopwords, background_color="white", min_font_size=8).generate(text)

# Display the generated image:
# But make it a little larger this time
plt.figure(figsize=(9,6))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")
plt.show()
Very interesting. Here are some takeaways from our analysis with our word cloud:
- Everyone likes to say that they’re first and smart, and it seems to be working, since these campaigns made the top 500 in this data set.
- It seems that there are many campaigns related to technology: electric, headphones, camera, 3D Printer, etc.
- There are also more mentions of games on here than I was expecting… who knew Kickstarter was so fun?
You’ve officially created a word cloud! Congratulations! If you are still thirsting for more analysis, you can build some more word clouds to investigate these topics:
- The Kickstarter data came with a category column that contains values such as Poetry and Music. Do different categories have certain words that are more common?
- The data also has a launched column, which holds a date. Do older campaigns have noticeably different word frequencies from newer campaigns?
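As a starting point for both questions, here is a hedged sketch of how the campaign names could be grouped into one string per category or per launch year, ready to pass to .generate(). The toy rows are invented. The column names name, category, and launched are the ones described above, while the names_by_group helper and the launch_year column are my own additions.

```python
import pandas as pd

# Invented rows for illustration; only the column names mirror the dataset.
toy = pd.DataFrame({
    "name": ["Ocean Poems", "Garage Band EP", "City Poems", "Jazz Night Album"],
    "category": ["Poetry", "Music", "Poetry", "Music"],
    "launched": ["2012-03-01 10:00:00", "2016-07-15 09:30:00",
                 "2016-01-20 18:45:00", "2012-11-02 12:00:00"],
})

def names_by_group(df, column):
    """Join all campaign names into one string per value of `column`."""
    grouped = df.groupby(column)["name"].apply(lambda s: " ".join(s.astype(str)))
    return grouped.to_dict()

# One block of text per category.
text_by_category = names_by_group(toy, "category")

# One block of text per launch year (launch_year is a hypothetical helper column).
toy["launch_year"] = pd.to_datetime(toy["launched"]).dt.year
text_by_year = names_by_group(toy, "launch_year")

print(text_by_category["Poetry"])
print(text_by_year[2012])
```

Each string in these dictionaries can then be turned into its own word cloud and the results compared side by side.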
Find anything interesting or shocking from your analyses? Let me know in the comments!