Analysis of activity shifts on reddit
In this work, I take a look at the changes in subreddit activity levels, how they shifted and whether certain communities on reddit shifted together.
Abstract
The years 2019 to 2021 have seen countless events with global ramifications. As individuals become increasingly interconnected online through social media, it raises questions about how they are interacting with one another on these platforms and why. To understand these changes is to understand what people find most interesting or important, and how they feel about it. In this work, we will take a stab at answering these questions by examining a subset of ~5000 reddit communities and the changes in activity level within them, as well as whether the sentiment associated with the activity is positive or negative. We will look separately at subreddits that have grown in size and those that have contracted, and try to understand what sets them apart. We will then identify whether the activity through the growth/contraction phases is largely related to positive or negative sentiment, time permitting.
Hypothesis: Activity level changes in subreddits are positively correlated with the degree of positive or negative sentiment associated with the activity.
Research questions:
- What are the similarities shared between subreddits that experienced similar activity level changes?
  - Did the ones that grew/contracted:
    - experience that growth during similar time periods, and are there any reasons why?
    - share common topics?
- What kind of sentiment can we relate to subreddits that experienced similar activity level changes?
  - Was there any sentiment associated with these changes?
  - Was growth/contraction associated with very positive/negative sentiments? (I.e. was the change driven by love/attraction or by hate/repulsion?)
The data
In this analysis we use two filtered and cleaned data sources, text_comments.csv.gz and text_submissions.csv.gz, generously provided by my course professor. They cover the top ~5000 subreddits and contain all the comments and submissions from January 2019 to June 2021.
Column Schema
For both comments and submissions:
- id: a unique id for the item
- score: score of the item (upvotes minus downvotes, with some algorithmic ‘fuzzing’ applied)
- author: username of the user who posted the item, can be ‘[deleted]’ if an item has been deleted from its author’s profile, or ‘AutoModerator’ if posted by the AutoModerator bot
- subreddit: name of the subreddit the item was posted in
- created_utc: time the item was posted, in Unix time

For comments only:
- link_id: id of the link to which this comment belongs
- body: textual content of the comment

For submissions only:
- is_self: True if a submission is a text-only ‘self-post’, False if the submission is a link
- domain: domain of the link
- title: title of the submission
- selftext: content of the self-post
Let’s take a look at the first 5 rows of data in our two data sources, text_comments.csv.gz and text_submissions.csv.gz.
Comment text data
| | id | score | link_id | author | subreddit | body | created_utc |
---|---|---|---|---|---|---|---|
0 | t1_ftjl56l | 4.0 | t3_gzv6so | mega_trex | BeautyGuruChatter | Does anyone have a good cruelty free one? The ... | 1.591756e+09 |
1 | t1_ftjpxmc | 6.0 | t3_gzv6so | [deleted] | BeautyGuruChatter | (stares at my soft glam i've had for like 3 ye... | 1.591758e+09 |
2 | t1_gzzxfyt | 22.0 | t3_nodb9e | divadream | BeautyGuruChatter | When Jen’s initial reactions came out to the s... | 1.622398e+09 |
3 | t1_gzzy7nc | 92.0 | t3_no6qaj | Ziegenkoennenfliegen | BeautyGuruChatter | I think you mean a \n>Highschool *fucking* bully | 1.622399e+09 |
4 | t1_h00tpbp | 82.0 | t3_nolx7p | meowrottenralph | BeautyGuruChatter | Ugh. I was honestly hoping that this brand wou... | 1.622415e+09 |
The comments data contains 46,413,725 rows and 7 columns.
Submission/Post Text Data
| | id | author | created_utc | domain | is_self | score | selftext | title | subreddit |
---|---|---|---|---|---|---|---|---|---|
0 | t3_npxigk | All_Consuming_Void | 1622563615 | self.BeautyGuruChatter | True | 0.0 | [removed] | Hyram launches his own brand | BeautyGuruChatter |
1 | t3_nqj6bf | AutoModerator | 1622631621 | self.BeautyGuruChatter | True | 38.0 | What are the influencers trying to influence y... | What I'm not gonna buy Wednesday - Anti-haul | BeautyGuruChatter |
2 | t3_nk0btr | barrahhhh | 1621869439 | reddit.com | False | 144.0 | NaN | Plouise goes off in facebook group for 'bullying' | BeautyGuruChatter |
3 | t3_nrbybs | [deleted] | 1622722260 | self.BeautyGuruChatter | True | 2.0 | [deleted] | Is youtube algorithm against Susan Yara? She g... | BeautyGuruChatter |
4 | t3_nl0ebd | carlosShook | 1621977767 | vm.tiktok.com | False | 0.0 | NaN | Sephora steals concept from Huntr Faulknr afte... | BeautyGuruChatter |
The submissions data contains 3,496,180 rows and 9 columns.
We have a LOT of data here. A brief look at our data tells us we have over 46 million comments, which belong to almost 3.5 million posts. Those 3.5 million posts belong to 4,465 subreddits. (And this data is already a subset of all of reddit)
We clearly have a lot of data to work with, but due to the time constraints associated with this project it would be best to narrow down a research question and take a subset of this data.
I’ll now take a look at the distribution of the number of posts per subreddit.
This data is very hard to read, as most subreddits have fewer than 20,000 posts but a few reach around 120,000 in this subset. Let’s take a look at the log distribution of the data, which spreads out the subreddits that have a small number of posts and compresses the ones that have a lot of posts.
On the log scale, a roughly normal shape starts to appear! Most subreddits have between $e^5$ and $e^7$ posts, or roughly 150 to 1,100.
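As a rough illustration of the log transform described above, here is a minimal sketch. The `submissions` frame and its three subreddits are made-up toy data, not the real dataset, and the natural log is assumed because the text talks about powers of $e$.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the submissions data: one row per post (hypothetical names).
submissions = pd.DataFrame(
    {"subreddit": ["a"] * 150 + ["b"] * 1100 + ["c"] * 20000}
)

# Number of posts per subreddit, then its natural log.
posts_per_sub = submissions["subreddit"].value_counts()
log_posts = np.log(posts_per_sub)

# Small subreddits are spread apart and large ones pulled together on this scale.
print(log_posts.round(2))
```

A histogram of `log_posts` is what the log distribution plot would then show.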
Research Question 1.
To answer our first research question, we first have to look at how subreddits have grown and shrunk over this period. We want to see whether there are any trends in groups of subreddits that have grown or contracted together, and whether those growths or contractions are associated with positive or negative sentiments. For example, Formula One has seen large growth over the 2019-2021 time period due to the release of the Drive to Survive Netflix show, attracting lots of new fans. A growth event in the Formula One subreddit should be expected to come with positive sentiment.
In order to perform this analysis, we’ll first have to make sure the dates in our data are properly processed. The created_utc column is currently in Unix time, so I have converted it to a standard, readable time format.
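One way this conversion could look in pandas is sketched below. The two rows are copied from the comment preview above; the `created_dt` column name is my own choice, and the original notebook may have done this differently.

```python
import pandas as pd

# Two rows from the comment preview; created_utc is seconds since 1970-01-01 UTC.
comments = pd.DataFrame({
    "id": ["t1_ftjl56l", "t1_gzzxfyt"],
    "created_utc": [1.591756e9, 1.622398e9],
})

# unit="s" tells pandas the numbers are Unix seconds rather than nanoseconds.
comments["created_dt"] = pd.to_datetime(comments["created_utc"], unit="s")
print(comments[["id", "created_dt"]])
```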
Calculating activity level changes through the date range given for each subreddit
Define activity level here as # posts + # comments = activity level. We’ll leave the individual activity levels as they are for now, and create a new column that merges them in our new aggregated dataset.
Since we have such high resolution data for created_utc (times down to the second!), we will aggregate up so we can get the bigger picture of activity levels. Since it is more than 2 years’ worth of data, we can aggregate activity level by week. This preserves volatility within months while cutting down the number of values we have to deal with by a lot. Weekly totals also smooth over days that had comments but no posts, or vice versa.
We can perform these actions for each dataset separately, before merging everything into one dataset with weeks as rows and the activity levels for submissions and comments as columns, all grouped by subreddit.
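The weekly roll-up and merge could be sketched roughly as follows. The helper `weekly_counts`, the toy timestamps, and the choice of Monday-based weeks are assumptions for illustration; only the column names act_lvl_comments, act_lvl_submissions and total_activity come from the tables in this post.

```python
import pandas as pd

def weekly_counts(df, name):
    """Count rows per subreddit per calendar week (weeks start on Monday)."""
    when = pd.to_datetime(df["created_utc"], unit="s")
    week = when.dt.to_period("W").dt.start_time
    return df.assign(week=week).groupby(["subreddit", "week"]).size().rename(name)

DAY = 86_400
MONDAY = 1_546_819_200  # 2019-01-07 00:00 UTC, a Monday

# Toy data: three comments and two submissions in a hypothetical subreddit "a".
comments = pd.DataFrame({
    "subreddit": ["a", "a", "a"],
    "created_utc": [MONDAY + 3600, MONDAY + 2 * DAY, MONDAY + 8 * DAY],
})
submissions = pd.DataFrame({
    "subreddit": ["a", "a"],
    "created_utc": [MONDAY + DAY, MONDAY + 15 * DAY],
})

# Outer-join the two weekly series; a week missing from one side counts as 0.
weekly = pd.concat(
    [weekly_counts(comments, "act_lvl_comments"),
     weekly_counts(submissions, "act_lvl_submissions")],
    axis=1,
).fillna(0).astype(int).sort_index()
weekly["total_activity"] = weekly["act_lvl_comments"] + weekly["act_lvl_submissions"]
print(weekly)
```

Filling missing weeks with 0 is what handles the case where a week has comments but no submissions, or the other way around.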
Summary of activity within subreddits
The following table shows us that half of the subreddits had more than roughly 2,900 comments and 319 posts/submissions over the past two and a half years (see the 50% row of the sum columns).
| | act_lvl_comments sum | act_lvl_comments mean | act_lvl_submissions sum | act_lvl_submissions mean | total_activity sum | total_activity mean |
---|---|---|---|---|---|---|
count | 4.474000e+03 | 4474.000000 | 4474.000000 | 4474.000000 | 4.474000e+03 | 4474.000000 |
mean | 8.386609e+03 | 72.432462 | 780.684175 | 6.525290 | 9.167293e+03 | 78.957752 |
std | 3.653893e+04 | 299.227174 | 2976.810469 | 23.657839 | 3.912499e+04 | 319.474804 |
min | 2.540000e+02 | 3.135802 | 0.000000 | 0.000000 | 2.570000e+02 | 3.172840 |
25% | 1.737000e+03 | 14.139313 | 181.000000 | 1.470781 | 1.973250e+03 | 16.167939 |
50% | 2.912500e+03 | 24.007634 | 319.500000 | 2.574721 | 3.300000e+03 | 27.142227 |
75% | 6.271500e+03 | 52.030395 | 662.000000 | 5.406489 | 6.981750e+03 | 57.756268 |
max | 1.895300e+06 | 14467.938931 | 128332.000000 | 979.633588 | 2.023632e+06 | 15447.572519 |
Let’s take a look at the first subreddit to see how our data looks!
For this particular subreddit we can see that total activity has dropped over time. Notice that most of the volatility is due to fluctuations in comment activity. We can also observe that activity peaks around New Year’s and then drops back down.
Measuring activity level growth and contraction
With this data, we can easily calculate changes in activity level and group subreddits based on their respective level of change. We’ll calculate the overall trend by fitting a linear regression over the entire time period for each subreddit; its slope will serve as our average change per week.
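A minimal sketch of that per-subreddit trend, assuming an ordinary least-squares line fitted against the week number with `numpy.polyfit`. The helper name `weekly_slope` and the toy numbers are hypothetical, and the sketch assumes rows are already sorted by week with no gaps.

```python
import numpy as np
import pandas as pd

def weekly_slope(activity):
    """Least-squares slope of activity against week index 0, 1, 2, ..."""
    x = np.arange(len(activity))
    slope, _intercept = np.polyfit(x, activity, deg=1)
    return slope

# Toy weekly totals for two made-up subreddits, four weeks each.
weekly = pd.DataFrame({
    "subreddit": ["grow"] * 4 + ["shrink"] * 4,
    "total_activity": [10, 12, 14, 16, 40, 30, 20, 10],
})

# One slope per subreddit: the average change in activity per week.
avg_chg = weekly.groupby("subreddit")["total_activity"].apply(
    lambda s: weekly_slope(s.to_numpy())
)
print(avg_chg)
```

A positive slope marks a subreddit that grew on average, and a negative one marks a contraction.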
Diving deeper into the data, we see exactly which subreddits had the most contraction and most growth:
subreddit | Average change in overall activity per week |
---|---|
TITWcommentdump | -348.1786 |
csci040temp | -117.8000 |
TITWleaderboard | -96.2667 |
pan_media | -83.5325 |
WallStreetbetsELITE | -69.4371 |
... | ... |
SHIBArmy | 39.5613 |
wallstreetbets | 54.2944 |
amcstock | 100.5929 |
Superstonk | 113.6901 |
kfq | 2766.8000 |
4474 rows × 1 columns
| | Average change in overall activity per week |
---|---|
count | 4474.000000 |
mean | 0.604920 |
std | 41.930156 |
min | -348.178600 |
25% | -0.041750 |
50% | 0.075600 |
75% | 0.220800 |
max | 2766.800000 |
Looking at a box plot of the average change per week, we see that we have a huge outlier with a growth rate of 2766 in our data.
We’ll examine that one subreddit later on.
Let’s plot a histogram of the average slope to analyze the distribution, omitting the outlier at 2766.
We can see that most subreddits experienced little to no change at all, as the high bar in the middle at 0 represents most of the subreddits.
Subdividing and filtering subreddits
Filtering
What we want to do now is split subreddits into those that grew in activity level, and those that contracted. I also want to filter out any subreddits that didn’t have much activity at all.
The platform shows posts (submissions) filtered by a few categories. The “hot” filter orders posts from most to fewest recent upvotes. The “new” filter orders posts from newest to oldest. The “top” filter orders posts from the highest total number of votes to the lowest. Lastly, the “rising” filter predicts which posts will be “successful” (in their words) based on post age, subreddit size, and votes per minute.
From my personal experience, I mainly use the hot and new filters while only occasionally using the top filter. This means that my interactions with subreddits are generally with current posts. If that holds for other users too, subreddits with few or no posts per week won’t generate as much interaction as those with many posts per week. For that reason I only want to look at subreddits that have at least 5 posts a week on average. I will classify any subreddit with fewer posts per week on average as too small.
According to the previous Summary of activity within subreddits output, this cutoff roughly corresponds to the 75th percentile of mean submissions per week (~5 posts).
We also want to filter out any subreddits that didn’t change much at all. Filtering only on a mean slope around 0 would ignore subreddits whose activity climbed and then dropped by the same amount over similar lengths of time. To counteract this, we will filter out subreddits with a low absolute sum of the first differences in activity levels, along with any subreddits with a slope between -1 and 1. These restrictions mean we will effectively be excluding more than 90% of our data. (This is helpful from both an interpretation perspective and a computation perspective.)
This table shows some of the subreddits we have remaining after filtering:
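The filters described above could be combined along these lines. This is only a sketch: the 5 posts per week and the slope of 1 come from the text, but `MIN_TOTAL_MOVEMENT` is a made-up threshold, and reading "absolute sum of the first differences" as the sum of absolute week-over-week changes is my assumption. The four example subreddits are invented.

```python
import pandas as pd

MIN_POSTS_PER_WEEK = 5    # from the text
MIN_ABS_SLOPE = 1         # from the text
MIN_TOTAL_MOVEMENT = 50   # hypothetical threshold, not from the original analysis

def total_movement(activity):
    """Sum of absolute week-over-week changes (assumed interpretation)."""
    return activity.diff().abs().sum()

# Toy per-subreddit summary with invented values.
stats = pd.DataFrame(
    {
        "mean_posts": [8.0, 2.0, 6.0, 7.0],
        "slope": [3.2, 4.0, 0.1, -2.5],
        "total_movement": [400, 900, 600, 20],
    },
    index=["big_grower", "too_small", "flat", "quiet_shrinker"],
)

# Keep a subreddit only if it passes all three checks.
keep = (
    (stats["mean_posts"] >= MIN_POSTS_PER_WEEK)
    & (stats["slope"].abs() > MIN_ABS_SLOPE)
    & (stats["total_movement"] >= MIN_TOTAL_MOVEMENT)
)
filtered = stats[keep]
print(filtered)
```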
| | subreddit |
---|---|
0 | 196 |
1 | 2007scape |
2 | ACNHvillagertrade |
3 | ACTrade |
4 | ACVillager |
... | ... |
321 | wallstreetbets |
322 | wholesomememes |
323 | wildrift |
324 | worldnews |
325 | yeagerbomb |
326 rows × 1 columns
Dividing subreddits into chunks
All this prep work was done in order to help us analyze the ways that subreddits grew and contracted over time. So naturally we want to understand why some subreddits grew and why some subreddits shrank.
(We won’t be looking at those that didn’t really change in activity level.)
To understand the differences between subreddits, we need to increase the resolution of our data. One way to do that is to group subreddits by their level of change, as determined by which percentile they lie in. Since we have 326 subreddits to work with after filtering, we will settle for splitting both the growing and contracting datasets into halves:
- Growth
  - High Growth, Low Growth
  - Growth Threshold: 2.01
    - This threshold determines whether a subreddit belongs in the high or low growth category
- Contraction
  - High Contraction, Low Contraction
  - Contraction Threshold: -2.44
    - This threshold determines whether a subreddit belongs in the high or low contraction category
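The split into halves can be sketched as below. Using the median of each side as the threshold is an assumption that follows from "halves"; the helper `categorize` and the toy slopes are hypothetical, while the four labels match the category column used later in this post.

```python
import pandas as pd

def categorize(avg_chg):
    """Label each subreddit by which half of its growth/contraction group it falls in."""
    growth_thr = avg_chg[avg_chg > 0].median()
    contract_thr = avg_chg[avg_chg < 0].median()

    def label(x):
        if x > 0:
            return "high_growth" if x >= growth_thr else "low_growth"
        return "high_contract" if x <= contract_thr else "low_contract"

    return avg_chg.apply(label), growth_thr, contract_thr

# Toy average weekly changes for eight made-up subreddits.
avg_chg = pd.Series(
    {"a": 54.3, "b": 5.4, "c": 1.9, "d": 1.2, "e": -1.5, "f": -1.8, "g": -3.4, "h": -6.2}
)
category, growth_thr, contract_thr = categorize(avg_chg)
print(category)
```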
Growth subreddits
Here we look at the distribution of average activity level change in subreddits that grew.
It seems that we’ve been able to capture all the subreddits with growth, even the subreddit with the largest overall growth.
High growth Subreddits
Here are some of the high growth subreddits:
- MAAU
- wildrift
- lgbt
- PoliticalHumor
- politics
- nottheonion
- mechmarket
- NoStupidQuestions
- Whatcouldgowrong
- Genshin_Impact
- SHIBArmy
- wallstreetbets
Low Growth Subreddits
Here are some of the low growth subreddits:
- cats
- formuladank
- dating_advice
- PersonalFinanceCanada
- h3h3productions
- playstation
- italy
- WatchPeopleDieInside
- fivenightsatfreddys
- LifeProTips
- FemaleDatingStrategy
- nvidia
Contracted subreddits
Here we look at the distribution of average activity level change in contracted subreddits.
Statistics | |
---|---|
count | 15307.00 |
mean | -4.12 |
std | 7.01 |
min | -83.53 |
25% | -3.75 |
50% | -2.28 |
75% | -1.46 |
max | -1.01 |
326 rows × 1 columns
Comparing our previous results to this table of average changes, it seems we’re missing the three subreddits with the largest contraction: TITWcommentdump, csci040temp, and TITWleaderboard.
The bulk of activity in TITWcommentdump and TITWleaderboard occurred in early 2021 and died off really quickly. A quick Google search tells us that people essentially spam “THIS IS THE WAY” in posts and comments, and the leaderboard subreddit tracks who spammed that phrase the most. These subreddits seem to be related to The Mandalorian, as “This is the way” is one of the show’s iconic phrases.
csci040temp is a subreddit used by students of a computer science course to spam/test bots. It makes sense that activity within this subreddit was short-lived. All three are rightfully excluded by our filter.
Let’s continue with our analysis, starting off with the high contraction followed by low contraction datasets.
High Contraction Subreddits
Here are some of the high contraction subreddits:
- Coronavirus
- GME
- AskReddit
- dankmemes
- funny
- me_irl
- gaming
- worldnews
- entitledparents
- CFB
- ukpolitics
- smashbros
Low Contraction Subreddits
Here are some of the low contraction subreddits:
- Warframe
- AnimalsOnReddit
- COVID19
- mildlyinteresting
- Overwatch
- dank_meme
- MurderedByWords
- leagueoflegends
- comedyheaven
- teslamotors
- Borderlands
- atheism
The outlier
As per the average change analysis, we know there’s a huge activity level outlier: the subreddit kfq, which saw an average activity level growth of 2766. This subreddit is mainly in Chinese. Its description is:
“The Wandering KFQ @ 20190301. Here is a mirror sub of a Chinese community which has been closed by censorship problem.”
We can reasonably assume the huge activity is due to the migration of posts from the original subreddit, and that it is used purely to store posts, since there were no new posts as of March 31st, 2019.
We will exclude that subreddit so that we can see differences in activity level change between the rest of the subreddits in a clearer fashion.
After collecting average changes for each subreddit and removing the outlier, this is now our distribution of change!
We can see that there are still lots of subreddits with average change around 0, but we know they all have an absolute average change above 1 based on our filtering.
Subreddit clustering by submission text
Now that we’ve finally cleaned and organized our subreddits, we want to be able to cluster the subreddits in each group together and examine the similarities they share. This is in an effort to understand why they have shared the same growth numbers.
I’ve decided not to include comment text in my word embeddings, because comments commonly stray off topic and relate more to personal experiences than to the submission/post. The post itself is more reliably related to the given subreddit, since many subreddits have rules on what can and cannot be posted within them.
We are essentially gathering all the words associated with a given subreddit into one column from which we can create bag of words vectors. We will then feed that matrix into the Term Frequency - Inverse Document Frequency (tf-IDF) transformer, which reweights the bag of words matrix by giving less weight to common words, improving the amount of information we can obtain.
Below I’ve joined all the text within the title and selftext columns from every submission into one column for every subreddit. The text column is what we will run the bag of words algorithm on, using CountVectorizer from sklearn.feature_extraction.text.
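The text-joining and vectorizing steps might look roughly like this. The toy submissions are invented, and both `max_features=500` (a guess at how the 500-word vocabulary mentioned later was obtained) and `stop_words="english"` are assumptions rather than the settings actually used.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# Toy submissions for two made-up subreddits; selftext can be missing for link posts.
submissions = pd.DataFrame({
    "subreddit": ["stocks", "stocks", "cats"],
    "title": ["Buy the dip", "Earnings call today", "My cat sleeping"],
    "selftext": ["The market fell again", None, "She sleeps all day"],
})

# Join title and selftext per submission, then all submissions per subreddit.
per_post = submissions["title"].fillna("") + " " + submissions["selftext"].fillna("")
text = per_post.groupby(submissions["subreddit"]).agg(" ".join)

# Bag of words counts, then tf-IDF reweighting of those counts.
counts = CountVectorizer(max_features=500, stop_words="english").fit_transform(text)
tfidf = TfidfTransformer().fit_transform(counts)
print(tfidf.shape)  # (number of subreddits, vocabulary size)
```

Each row of `tfidf` is one subreddit, so the matrix has as many rows as there are subreddits after filtering.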
subreddit | text | avg_chg | category |
---|---|---|---|
196 | \n\n[View Poll](https://www.reddit.com/poll/np... | 27.8067 | high_growth |
2007scape | looking for some west coasters to personally a... | -1.7927 | low_contract |
ACNHvillagertrade | Pink hippo bitty in boxes heavily gifted I don... | -1.5510 | low_contract |
ACTrade | Howdy all! I haven’t played AC for about a yea... | 4.7686 | high_growth |
ACVillager | Currently can do 10 NMT and 3 Million Bells or... | -6.2110 | high_contract |
... | ... | ... | ... |
wallstreetbets | Let me introduce you to the ratio spread.\n\nT... | 54.2944 | high_growth |
wholesomememes | Every shitty horror movie you’re doing just f... | -1.5314 | low_contract |
wildrift | https://ibb.co/3NLSm9G\n\nNothing impressive t... | 1.8571 | low_growth |
worldnews | [removed]\n\n[View Poll](https://www.reddit.co... | -3.4403 | high_contract |
yeagerbomb | What I love about Eldia is I just get to be m... | 5.3669 | high_growth |
326 rows × 3 columns
Visualizing our subreddits
So we’ve created a tf-IDF matrix based on the text column above. What now? Well, because each subreddit is now represented by a vector over 500 words, we can think of all our subreddits as residing in a 500 dimensional space. That’s crazy! There’s no way we can look at 500 dimensions at once, so in order to make the digestion of our data easier, we’ll use Principal Component Analysis (PCA) to reduce all those dimensions down to just 10, and we will subsequently look at the top 2 dimensions (which capture the most variance in the data).
Using K-Means, we’re able to cluster subreddits together based on their tf-IDF vectors after performing dimensionality reduction on them. We can see that the clusters below don’t really care whether a subreddit grew by a lot, or not by much.
What the following plots do is group subreddits together into topics. We will start off by examining the subreddits that experienced growth.
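A minimal sketch of the reduce-then-cluster step with scikit-learn follows. The matrix `X` is random toy data standing in for the tf-IDF matrix (a real sparse tf-IDF matrix may need `.toarray()` first), and the two clusters and fixed seeds are choices made for the toy example only; the tables below use 8 clusters for growth and 5 for contraction.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Stand-in for the tf-IDF matrix: 40 "subreddits" x 50 features in two obvious blobs.
X = np.vstack([rng.normal(0, 1, (20, 50)), rng.normal(5, 1, (20, 50))])

# Reduce to 10 principal components, as described in the text.
reduced = PCA(n_components=10, random_state=0).fit_transform(X)

# Cluster in the reduced space; the number of clusters is chosen by hand.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(reduced)

# The first two components are the coordinates used for the scatter plots.
xy = reduced[:, :2]
print(xy.shape, np.bincount(labels))
```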
Subreddits that experienced growth
It’s much harder to distinguish groups within the growth subreddits, as they are made up of many different online communities. I attempt to classify them below:
Cluster Number | Cluster Colour | Potential Grouping |
---|---|---|
5 | Purple | Memes/General |
0 | Cyan | Politics |
2 | Blue | Tech/Pokemon/Animal Crossing/CryptoCurrency |
7 | Light Green | Spanish |
6 | Orange | Gaming |
1 | Pink | Sports/memes/Anime |
4 | Red | Stocks/Crypto |
3 | Green | Turkish/Roblox/memes |
Subreddits that experienced contraction
On the left, we have a cluster of primarily gaming subreddits in green. They had a mix of low and high contraction in activity levels. These games however seemed to belong to a sub-category of big name FPS games, such as Overwatch, Call of Duty: Cold War, Battlefield V, The Last of Us 2, and Destiny.
In purple, we have a variety of subreddits, from sports like NBA and NFL, to r/wallstreetbets spinoffs like Wallstreetsilver and WallStreetbetsELITE. This cluster is not very cohesive, so it is hard to derive any insight from it.
Near the bottom, we have a red cluster of primarily joke/meme/funny topic related subreddits such as TheMemersClub, dankmemes, wholesomememes as well as some shows such as WANDAVISION, gameofthrones, and rupaulsdragrace.
On the right, we have a blue cluster of subreddits that don’t all seem to relate to each other: it contains subreddits focused on the Animal Crossing game, such as ACVillager, ACNHvillagertrade, AnimalCrossingTrading, AnimalCrossingNewHor, and TurnipExchange, but also subreddits on other topics such as todayilearned and MemeEconomy.
On the right we also have a tiny orange cluster of subreddits almost exclusively dedicated to the COVID-19 pandemic/world news.
Cluster Number | Cluster Colour | Potential Grouping |
---|---|---|
2 | Green | First Person Shooter (FPS) Video Games |
4 | Purple | NBA/NFL/Wall Street Bet spinoffs/Russian |
1 | Red | Jokes/memes/funny |
3 | Blue | Animal Crossing/memes |
0 | Orange | COVID-19 |
As we can see, there are some clear topics/communities that contracted, such as COVID-19 and FPS video games, and some clear communities that grew, such as politics, stocks/cryptocurrency, and other gaming categories.
We could probably explain the decline in COVID-19 related subreddits by people getting tired of talking and hearing about the ongoing pandemic. As restrictions lifted, people may have played fewer video games (specifically FPS games and Animal Crossing), causing activity in the related subreddits to decrease as well. Interestingly, the actual Animal Crossing subreddit had high growth in activity level, even though its related subreddits decreased in activity level.
Analyzing the growth patterns of each cluster
Growth clusters
Contraction clusters
Analysis of the community activity plots
This is a lot of data to digest! Here are some thoughts:
Growth
Starting off with the communities/clusters of subreddits that grew, we see that the Tech/Pokemon/Animal Crossing/CryptoCurrency and Stocks/Crypto clusters grew quite similarly, but that’s most likely due to the fact they both contained similar subreddits. The rest of the subreddit clusters don’t share many similarities in when they grew, so unfortunately we cannot draw any insight into whether certain clusters of communities on reddit grew together.
What we can observe from these graphs is the jump in activity in almost all the clusters around early 2020, between January and July. This correlates well with when COVID-19 lockdowns started to be put in place around the world.
Contraction
Fewer clusters worked better for the subreddits that contracted, since in the first two dimensions of our submission text embedding space the subreddits already appeared to have some semi-distinct groupings.
Much like the growth category of subreddits, we also see a jump in activity in early 2020, between January and July, for these subreddits. With this group of subreddits, however, it doesn’t seem as though that was enough to spur lasting activity.
Conclusion of part 1: Analysis of similarities between subreddits that changed together over January 2019 - June 2021.
So to conclude this section, we did quite a bit of data cleaning and separation. We first calculated the average rate of change in the number of submissions and comments in each subreddit, which we rolled up into total_activity. We then filtered out subreddits that didn’t change much over the time span of our data, and subsequently separated the rest into subreddits that grew and subreddits that contracted. After figuring out which subreddits we wanted to analyze, we were able to obtain all the submission text (including titles of posts and the content, if any, attached) and construct a tf-IDF matrix.
The NLP analysis
This matrix represents each subreddit by a vector over 500 words, with more “important” words weighing more than common words. Using this we performed PCA to identify 10 dimensions that capture the most variance between the subreddits, visualizing the top 2. We clustered subreddits together based on these 10 dimensions using the KMeans algorithm, and tried our best to interpret the clustering. From there we examined how each of our clusters changed in total_activity in relation to each other. We were unable to identify any strong similarities in growth patterns between clusters, but all of them saw a jump in activity in early 2020, between January and June, as lockdowns around the world started being put in place.
Comment data
Due to the time constraints of this project, I will not be using the comment text data for the sentiment analysis, even though it provides a lot of information. It would involve running ~2.8 GB of data through a sentiment processor, which would take too long to run.
Part 2: Sentiment Analysis
In the following section, we’ll take a look at the sentiment associated with the growth and contraction groups of subreddits respectively to determine whether there was any sentiment associated with the growth or decline in activity level.
To do this, we will be using the SentimentIntensityAnalyzer from nltk.sentiment, powered by the popular VADER lexicon.
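As a rough sketch of how per-subreddit sentiment proportions could be collected: the helpers `vader_scorer` and `subreddit_sentiment` are hypothetical names, and scoring each subreddit’s joined text in one call is an assumption, since the original analysis may have scored individual submissions and averaged them instead. The scorer is passed in as a function so the VADER lexicon is only downloaded when it is actually needed.

```python
import pandas as pd

def vader_scorer():
    """Return NLTK's VADER scoring function (downloads the lexicon if missing)."""
    import nltk
    from nltk.sentiment import SentimentIntensityAnalyzer

    nltk.download("vader_lexicon", quiet=True)
    return SentimentIntensityAnalyzer().polarity_scores

def subreddit_sentiment(text_by_subreddit, scorer):
    """Score each subreddit's text and keep the neg/neu/pos proportions."""
    rows = {name: scorer(text) for name, text in text_by_subreddit.items()}
    return pd.DataFrame.from_dict(rows, orient="index")[["neg", "neu", "pos"]]
```

With the text column from the earlier table, this would be called as `subreddit_sentiment(df["text"], vader_scorer())`, where `df` is whatever frame holds that column. Calling `.describe()` on the result gives a summary in the same shape as the tables below.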
Subreddits that grew
From the left, we see that the subreddit with the most positive sentiment also has the most negative sentiment. A deeper dive into r/TIHI shows that it’s generally about things people hate, and asks for others’ opinions on whether their opinion is justified. It seems that the subreddit itself is negative, but with the “Thanks…” required in each submission, the positive and negative sentiments appear to have balanced out.
We can also examine the average sentiment statistics:
| | neg | neu | pos |
---|---|---|---|
count | 185.000000 | 185.000000 | 185.000000 |
mean | 0.095876 | 0.765146 | 0.138962 |
std | 0.049683 | 0.079701 | 0.048680 |
min | 0.005000 | 0.274000 | 0.005000 |
25% | 0.064000 | 0.725000 | 0.119000 |
50% | 0.091000 | 0.762000 | 0.140000 |
75% | 0.124000 | 0.798000 | 0.161000 |
max | 0.409000 | 0.991000 | 0.317000 |
Based purely on sentiment, we have an average of 13.90% positive sentiment and 9.59% negative sentiment. It appears there is a bit more positive sentiment than negative sentiment within subreddits that grew.
Subreddits that contracted
Starting from the left side of this graph, we can see that r/wholesomememes had tons of positive sentiment related to it, which is understandable as being wholesome is central to what the community is, and being wholesome is trivially related to positive sentiment.
Further to the right, we can see a noticeable spike in negative sentiment in the r/MurderedByWords subreddit, which is dedicated to comebacks and counter-arguments.
We can also examine the average sentiment statistics:
| | neg | neu | pos |
---|---|---|---|
count | 140.000000 | 140.000000 | 140.000000 |
mean | 0.099221 | 0.754379 | 0.146371 |
std | 0.034446 | 0.052458 | 0.037560 |
min | 0.003000 | 0.609000 | 0.007000 |
25% | 0.079000 | 0.726250 | 0.131000 |
50% | 0.098500 | 0.755000 | 0.147500 |
75% | 0.119000 | 0.780250 | 0.165000 |
max | 0.240000 | 0.989000 | 0.326000 |
If we look at the row that describes the mean, we can see that there is on average 14.64% positive sentiment associated with these subreddits and 9.92% negative sentiment.
Based solely on this analysis, we see that the proportions of positive and negative sentiment are really similar between the subreddits that grew and those that contracted.
T-tests on positive and negative sentiment between subreddits that grew and subreddits that contracted
- P-value of positive sentiment t-test: 13.58%
- P-value of negative sentiment t-test: 49.56%
From our t-tests on the positive and negative sentiment scores of each subreddit, we can see that both p-values are greater than 5%, which means we cannot reject the null hypothesis that subreddits that contracted and subreddits that grew have the same positive and negative sentiment on average.
Therefore there is not enough statistical evidence that subreddits that grew and subreddits that contracted differ in terms of sentiment.
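For reference, a two-sample t-test of this kind can be run with SciPy as sketched below. The two arrays are synthetic placeholders, not the real sentiment scores, so the printed p-value says nothing about the results above. Using Welch’s version (`equal_var=False`) is also an assumption, as the post does not say which variant was used.

```python
import numpy as np
from scipy import stats

def sentiment_ttest(grew, contracted):
    """Two-sample t-test (Welch's variant) comparing mean sentiment of two groups."""
    result = stats.ttest_ind(grew, contracted, equal_var=False)
    return result.statistic, result.pvalue

# Synthetic placeholder scores, one value per subreddit in each group.
rng = np.random.default_rng(0)
grew_pos = rng.normal(0.14, 0.05, 185)
contracted_pos = rng.normal(0.14, 0.04, 140)

t_stat, p_value = sentiment_ttest(grew_pos, contracted_pos)
print(f"t = {t_stat:.3f}, p = {p_value:.3f}")
```

A p-value above 0.05 would be read the same way as in the text: no evidence that the two group means differ.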
Conclusion
We were able to find some interesting results with our analysis of the activity level changes between subreddits that grew and subreddits that contracted, and we were somewhat successful in clustering subreddits based on similar topics. We found that regardless of the overall activity change observed over the 2.5 years of data we have, there was a significant uptick in activity for all our filtered subreddits in early 2020, when COVID-19 lockdowns were being put in place around the world.
Based on our brief sentiment analysis of each subreddit’s submissions, we cannot say whether activity level changes in subreddits are positively correlated with the degree of positive or negative sentiment associated with the activity. We have only seen that there seems to be more positive sentiment than negative sentiment associated with all the subreddits we looked at.
Therefore we can neither accept nor reject our hypothesis that activity level changes in subreddits are positively correlated with the degree of positive or negative sentiment associated with the activity.
Thank you for taking the time to read through my data story and coming along with me on this data journey. I hope you learned something new today!
If you’d like to learn more about me, check out my video introduction.
Or you could check out my personal website.