Analysis of activity shifts on reddit

In this work, I take a look at the changes in subreddit activity levels, how they shifted and whether certain communities on reddit shifted together.

Abstract

The years 2019 to 2021 have seen countless events with global ramifications. As individuals become increasingly interconnected online through social media, questions arise about how they interact with one another on these platforms and why. To understand these changes is to understand what people find most interesting or important, and how they feel about it. In this work, we take a stab at answering these questions by examining a subset of ~5000 reddit communities, the changes in activity level within them, and whether the sentiment associated with that activity is positive or negative. We will look separately at subreddits that have grown and those that have contracted, and try to understand what sets them apart. Time permitting, we will then identify whether the activity through the growth/contraction phases is largely related to positive or negative sentiment.

Hypothesis: Activity level changes in subreddits are positively correlated with the degree of positive or negative sentiment associated with the activity.

Research questions:

  1. What are the similarities shared between subreddits that experienced similar activity level changes?
    • Did the ones that grew/contracted:
      • experience that growth during similar time periods, and are there any reasons why?
      • share common topics?
  2. What kind of sentiment can we relate to subreddits that experienced similar activity level changes?
    • Was there any sentiment associated with these changes?
    • Was growth/contraction associated with very positive/negative sentiments? (I.e. Was the change driven by love/attraction or by hate/repulsion?)

The data

In this analysis we use two filtered and cleaned data sources, text_comments.csv.gz and text_submissions.csv.gz, generously provided by my course professor. Together they contain all the comments and submissions from the top ~5000 subreddits between January 2019 and June 2021.

Column Schema

For both comments and submissions: id, author, subreddit, score, created_utc

For comments only: link_id, body

For submissions only: domain, is_self, selftext, title

Let’s take a look at the first 5 rows of data in our two data sources, text_comments.csv.gz and text_submissions.csv.gz

Comment text data

id score link_id author subreddit body created_utc
0 t1_ftjl56l 4.0 t3_gzv6so mega_trex BeautyGuruChatter Does anyone have a good cruelty free one? The ... 1.591756e+09
1 t1_ftjpxmc 6.0 t3_gzv6so [deleted] BeautyGuruChatter (stares at my soft glam i've had for like 3 ye... 1.591758e+09
2 t1_gzzxfyt 22.0 t3_nodb9e divadream BeautyGuruChatter When Jen’s initial reactions came out to the s... 1.622398e+09
3 t1_gzzy7nc 92.0 t3_no6qaj Ziegenkoennenfliegen BeautyGuruChatter I think you mean a \n>Highschool *fucking* bully 1.622399e+09
4 t1_h00tpbp 82.0 t3_nolx7p meowrottenralph BeautyGuruChatter Ugh. I was honestly hoping that this brand wou... 1.622415e+09

The comments data has shape (46413725, 7): rows and columns respectively.

Submission/Post Text Data

id author created_utc domain is_self score selftext title subreddit
0 t3_npxigk All_Consuming_Void 1622563615 self.BeautyGuruChatter True 0.0 [removed] Hyram launches his own brand BeautyGuruChatter
1 t3_nqj6bf AutoModerator 1622631621 self.BeautyGuruChatter True 38.0 What are the influencers trying to influence y... What I'm not gonna buy Wednesday - Anti-haul BeautyGuruChatter
2 t3_nk0btr barrahhhh 1621869439 reddit.com False 144.0 NaN Plouise goes off in facebook group for 'bullying' BeautyGuruChatter
3 t3_nrbybs [deleted] 1622722260 self.BeautyGuruChatter True 2.0 [deleted] Is youtube algorithm against Susan Yara? She g... BeautyGuruChatter
4 t3_nl0ebd carlosShook 1621977767 vm.tiktok.com False 0.0 NaN Sephora steals concept from Huntr Faulknr afte... BeautyGuruChatter

The submissions data has shape (3496180, 9): rows and columns respectively.

We have a LOT of data here. A brief look at our data tells us we have over 46 million comments, which belong to almost 3.5 million posts. Those 3.5 million posts belong to 4,465 subreddits. (And this data is already a subset of all of reddit.)

We clearly have a lot of data to work with, but due to the time constraints associated with this project it would be best to narrow down a research question and take a subset of this data.

I’ll now take a look at the distribution of the number of posts per subreddit.

Distribution of posts per subreddit

This plot is very hard to read, as most subreddits have fewer than 20,000 posts but a few reach around 120,000 in this subset. Let’s take a look at the log distribution of the data, which spreads out the subreddits with a small number of posts and condenses the subreddits that have a lot of posts.

From this analysis, we can see a roughly normal distribution starting to appear! Most subreddits have between $e^5$ and $e^7$ posts (roughly 150 to 1,100).
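To make the log transform concrete, here is a minimal sketch. The post counts are made-up illustrative values, not the real data, and the number of bins is arbitrary.

```python
import numpy as np

# Hypothetical post counts per subreddit (illustrative values, not the real data).
posts_per_subreddit = np.array([150, 320, 400, 660, 1100, 20000, 120000])

# The natural log compresses the long right tail and spreads out the small counts.
log_posts = np.log(posts_per_subreddit)

# Bin the log counts; on the raw scale almost everything would land in one bin.
counts, bin_edges = np.histogram(log_posts, bins=5)

# Convert the e^5 and e^7 bounds back into raw post counts.
lower, upper = np.exp(5), np.exp(7)
print(f"e^5 is about {lower:.0f} posts, e^7 is about {upper:.0f} posts")
```

Reading the bounds back on the raw scale is a handy sanity check whenever a plot uses a log axis.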

Research Question 1.

To answer our first research question, we first have to look at how subreddits have grown and shrunk over this period. We want to see if there are any trends in groups of subreddits that have grown or contracted together, and whether those growths or contractions are associated with positive or negative sentiments. For example, Formula One has seen large growth over the 2019-2021 time period due to the release of the Drive to Survive Netflix show, attracting lots of new fans. A growth event in the Formula One subreddit should be expected to come with positive sentiment.

In order to perform this analysis, we’ll first have to make sure the dates in our data are properly processed. The created_utc is currently in Unix time, so I have converted it to a standard, readable time format.
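The conversion can be sketched with pandas as follows. The two created_utc values are copied from the submission sample rows above; the name of the new column, created, is my own choice for illustration.

```python
import pandas as pd

# Two created_utc values copied from the submission sample rows (Unix seconds).
submissions = pd.DataFrame({"created_utc": [1622563615, 1621869439]})

# unit="s" tells pandas the numbers are seconds since 1970-01-01 UTC.
submissions["created"] = pd.to_datetime(submissions["created_utc"], unit="s")
print(submissions)
```

The same call should also handle the float timestamps in the comments file, since unit="s" accepts fractional seconds.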

Calculating activity level changes through the date range given for each subreddit

Define activity level here as activity level = # posts + # comments. We’ll leave the individual activity levels as they are for now, and create a new column that merges them in our new aggregated dataset.

Since we have such high resolution data for created_utc (times down to the second!), we will aggregate up so we can see the bigger picture of activity levels. Since we have more than 2 years’ worth of data, we can aggregate activity level by week. This preserves volatility within months while cutting down the number of values we have to deal with by a lot. It also smooths over days that had comments but no posts, or vice versa.

We can perform these actions for each dataset separately, before merging everything into one dataset with weeks as our rows and activity levels for both submissions and comments as columns, all grouped by subreddit.
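A minimal sketch of the weekly aggregation and merge, on tiny made-up frames. The column names act_lvl_comments, act_lvl_submissions and total_activity match the summary table; the helper weekly_counts and the created column are hypothetical names, and this is not necessarily how the original notebook did it.

```python
import pandas as pd

# Tiny made-up samples; the real frames come from text_comments.csv.gz and
# text_submissions.csv.gz with created_utc already converted to datetimes.
comments = pd.DataFrame({
    "subreddit": ["a", "a", "a", "b"],
    "created": pd.to_datetime(["2021-01-04", "2021-01-05", "2021-01-12", "2021-01-06"]),
})
submissions = pd.DataFrame({
    "subreddit": ["a", "b"],
    "created": pd.to_datetime(["2021-01-05", "2021-01-13"]),
})

def weekly_counts(df, name):
    """Count rows per subreddit per calendar week."""
    week = df["created"].dt.to_period("W").rename("week")
    return df.groupby(["subreddit", week]).size().rename(name)

# Align on (subreddit, week); weeks missing from one side are filled with 0.
weekly = pd.concat(
    [weekly_counts(comments, "act_lvl_comments"),
     weekly_counts(submissions, "act_lvl_submissions")],
    axis=1,
).fillna(0).astype(int)
weekly["total_activity"] = weekly["act_lvl_comments"] + weekly["act_lvl_submissions"]
print(weekly)
```

Filling the gaps with 0 keeps weeks that had comments but no submissions (or the reverse) in the merged table instead of silently dropping them.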

Summary of activity within subreddits

The following table shows us that the median subreddit had about 2,900 comments and 320 posts/submissions over the two and a half years of data.

act_lvl_comments act_lvl_submissions total_activity
sum mean sum mean sum mean
count 4.474000e+03 4474.000000 4474.000000 4474.000000 4.474000e+03 4474.000000
mean 8.386609e+03 72.432462 780.684175 6.525290 9.167293e+03 78.957752
std 3.653893e+04 299.227174 2976.810469 23.657839 3.912499e+04 319.474804
min 2.540000e+02 3.135802 0.000000 0.000000 2.570000e+02 3.172840
25% 1.737000e+03 14.139313 181.000000 1.470781 1.973250e+03 16.167939
50% 2.912500e+03 24.007634 319.500000 2.574721 3.300000e+03 27.142227
75% 6.271500e+03 52.030395 662.000000 5.406489 6.981750e+03 57.756268
max 1.895300e+06 14467.938931 128332.000000 979.633588 2.023632e+06 15447.572519

Let’s take a look at the first subreddit to see how our data looks!

[Figure: weekly comment, submission and total activity levels for the first subreddit]

For this particular subreddit we can see that total activity levels dropped over time. Notice that most of the volatility is due to fluctuations in comment activity. We can also observe that around New Year’s, activity levels peak and then drop back down.

Measuring activity level growth and contraction

With this data, we can easily calculate changes in activity level and group subreddits based on their respective level of change. We’ll calculate the overall trend using linear regression over the entire time period for each subreddit, which will serve as our average change over time.
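As a sketch of the per-subreddit trend, the snippet below fits a straight line to weekly totals with numpy's polyfit and keeps the slope. The data and the helper name weekly_slope are made up for illustration, and the original analysis may have used a different regression routine.

```python
import numpy as np
import pandas as pd

# Made-up weekly totals for two hypothetical subreddits.
weekly = pd.DataFrame({
    "subreddit": ["grow"] * 4 + ["shrink"] * 4,
    "week": list(range(4)) * 2,
    "total_activity": [10, 12, 14, 16, 40, 35, 30, 25],
})

def weekly_slope(group):
    """Least-squares slope of total_activity against the week index."""
    slope, _intercept = np.polyfit(group["week"], group["total_activity"], deg=1)
    return slope

avg_chg = {name: weekly_slope(group) for name, group in weekly.groupby("subreddit")}
avg_chg = pd.Series(avg_chg, name="avg_chg").sort_values()
print(avg_chg)
```

The slope is in units of activity per week, which is how the "average change in overall activity per week" values can be read.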

Diving deeper into the data, we see exactly which subreddits had the most contraction and most growth:

Average change in overall activity per week
subreddit
TITWcommentdump -348.1786
csci040temp -117.8000
TITWleaderboard -96.2667
pan_media -83.5325
WallStreetbetsELITE -69.4371
... ...
SHIBArmy 39.5613
wallstreetbets 54.2944
amcstock 100.5929
Superstonk 113.6901
kfq 2766.8000

4474 rows × 1 columns

Average change in overall activity per week
count 4474.000000
mean 0.604920
std 41.930156
min -348.178600
25% -0.041750
50% 0.075600
75% 0.220800
max 2766.800000

Looking at a box plot of the average change per week, we see that we have a huge outlier with a growth rate of 2766 in our data.

[Figure: box plot of average change in activity per week]

We’ll examine that one subreddit later on.

Let’s plot a histogram of the average slope to analyze the distribution, omitting the outlier at 2766.

[Figure: histogram of average weekly change per subreddit, outlier omitted]

We can see that most subreddits experienced little to no change at all: the tall bar in the middle at 0 accounts for most of the subreddits.

Subdividing and filtering subreddits

Filtering

What we want to do now is split subreddits into those that grew in activity level, and those that contracted. I also want to filter out any subreddits that didn’t have much activity at all.

The platform shows posts (submissions) ordered according to a few filters. The “hot” filter orders posts by the number of recent upvotes, from most to least. The “new” filter orders posts from newest to oldest. The “top” filter orders posts by total number of votes, from highest to lowest. Lastly, the “rising” filter predicts which posts will be “successful” (in their words) based on post age, subreddit size, and votes per minute.

From my personal experience, I mainly use the hot and new filters while only occasionally using the top filter. This means that my interactions with subreddits are generally with current posts. Under that assumption, subreddits with few or no posts per week don’t generate as much interaction as those with many posts per week. I therefore only want to look at subreddits that have at least 5 posts a week on average, and I will classify any subreddit with fewer posts per week on average as too small.

According to the Summary of activity within subreddits table above, this cutoff roughly corresponds to the 75th percentile of subreddits, which has a mean of ~5 submissions per week.

We also want to filter out any subreddits that didn’t change much at all. Filtering only on a mean slope around 0 would overlook subreddits whose activity climbed and then dropped by the same amount over similar lengths of time. To account for this, we filter out subreddits with a low absolute sum of first differences in activity level, along with any subreddits with a slope between -1 and 1. These restrictions mean we will effectively be excluding 90% of our data. (This is helpful from both an interpretation perspective and a computation perspective.)
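One possible reading of these filters, sketched on a made-up summary table. The 5 posts per week and the slope cutoff of 1 come from the text; the abs_diff_sum column, the min_abs_diff value, and the choice to require all three conditions at once are my assumptions, since the write-up does not spell them out.

```python
import pandas as pd

# Made-up per-subreddit summary; the rows, abs_diff_sum, and the
# min_abs_diff threshold are hypothetical.
summary = pd.DataFrame(
    {
        "mean_submissions": [0.5, 12.0, 8.0, 30.0],
        "avg_chg": [3.0, 0.2, -4.5, 27.8],
        "abs_diff_sum": [40.0, 900.0, 650.0, 5000.0],
    },
    index=["tiny", "flat", "shrinking", "growing"],
)

min_posts_per_week = 5   # from the text: at least 5 posts a week on average
min_abs_slope = 1        # from the text: drop slopes between -1 and 1
min_abs_diff = 100.0     # assumed value; the write-up does not state it

keep = (
    (summary["mean_submissions"] >= min_posts_per_week)
    & (summary["avg_chg"].abs() > min_abs_slope)
    & (summary["abs_diff_sum"] >= min_abs_diff)
)
filtered = summary[keep]
print(filtered.index.tolist())
```

Under this reading, abs_diff_sum would be the sum of absolute week-to-week differences in total activity for each subreddit.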

This table shows some of the subreddits we have remaining after filtering:

subreddit
0 196
1 2007scape
2 ACNHvillagertrade
3 ACTrade
4 ACVillager
... ...
321 wallstreetbets
322 wholesomememes
323 wildrift
324 worldnews
325 yeagerbomb

326 rows × 1 columns

Dividing subreddits into chunks

All this prep work was done in order to help us analyze the ways that subreddits grew and contracted over time. So naturally we want to understand why some subreddits grew and why some subreddits shrank.

(We won’t be looking at those that didn’t really change in activity level.)

To understand the intricacies between subreddits, we need to increase the resolution of our data. One way to do that is to group subreddits by their level of change, as determined by which percentile they lie in. Since we only have 326 subreddits to work with after filtering, we will settle for splitting both the growing and contracting datasets into halves.
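The split into halves can be sketched as a median split on the size of the average change. The category labels match the ones used later in the text table (high_growth, low_growth, high_contract, low_contract); the values and the helper label_halves are made up, and splitting exactly at the median is my assumption.

```python
import pandas as pd

# Made-up average weekly changes for already-filtered subreddits.
avg_chg = pd.Series(
    {"a": 27.8, "b": 1.9, "c": 4.8, "d": 5.4, "e": -1.8, "f": -6.2, "g": -1.5, "h": -3.4}
)

def label_halves(values, high_label, low_label):
    """Label values at or beyond the group's median magnitude as high, the rest as low."""
    magnitude = values.abs()
    is_high = magnitude >= magnitude.median()
    return is_high.map({True: high_label, False: low_label})

grew = avg_chg[avg_chg > 0]
contracted = avg_chg[avg_chg < 0]
category = pd.concat([
    label_halves(grew, "high_growth", "low_growth"),
    label_halves(contracted, "high_contract", "low_contract"),
])
print(category)
```

Working with magnitudes lets the same helper serve both groups, so "high" always means the larger change regardless of sign.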

Growth subreddits

Here we look at the distribution of average activity level change in growth subreddits.

It seems that we’ve been able to capture all the subreddits with growth, even the subreddit with the largest overall growth.

High growth Subreddits

Here are some of the high growth subreddits:

Low Growth Subreddits

Here are some of the low growth subreddits:

Contracted subreddits

Here we look at the distribution of average activity level change in contracted subreddits.

Statistics
count 15307.00
mean -4.12
std 7.01
min -83.53
25% -3.75
50% -2.28
75% -1.46
max -1.01

326 rows × 1 columns

Comparing our previous results to this table of average changes, it seems we’re missing the three subreddits with the largest contraction: TITWcommentdump, csci040temp and TITWleaderboard.

[Figure: activity levels of the excluded subreddits]

The bulk of activity in TITWcommentdump and TITWleaderboard occurred in early 2021 and died off really quickly. A quick Google search tells us that people essentially spam “THIS IS THE WAY” in posts and comments, and the leaderboard subreddit tracks who spammed that phrase the most. These subreddits seem to be related to The Mandalorian, as “This is the way” is one of the show’s iconic phrases.

csci040temp is a subreddit where students of a computer science course can spam/test bots. This makes sense, as activity within this subreddit was short-lived. These three subreddits are rightfully excluded by our filter.

Let’s continue with our analysis, starting off with the high contraction followed by low contraction datasets.

High Contraction Subreddits

Here are some of the high contraction subreddits:

Low Contraction Subreddits

Here are some of the low contraction subreddits:

The outlier

As per the average change analysis, we know there’s a huge activity level outlier: the one subreddit, kfq, that saw an average activity level growth of 2766. This subreddit is mainly in Chinese. Its description is:

“The Wandering KFQ @ 20190301. Here is a mirror sub of a Chinese community which has been closed by censorship problem.”

[Figure: activity level of kfq over time]

We can reasonably assume the huge activity is due to the migration of posts from the original subreddit, and that kfq is used purely to store posts, since there were no new posts as of March 31st, 2019.

We will exclude that subreddit so that we can see differences in activity level change between the rest of the subreddits in a clearer fashion.

After collecting average changes for each subreddit and removing the outlier, this is now our distribution of change!

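[Figure: distribution of average weekly change after removing the outlier]

We can see that there are still lots of subreddits with an average change close to 0, but we know from our filtering that every one of them has an absolute average change above 1.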

Subreddit clustering by submission text

Now that we’ve finally cleaned and organized our subreddits, we want to be able to cluster the subreddits in each group together and examine the similarities they share. This is in an effort to understand why they have shared the same growth numbers.

I’ve decided not to include comment text in my word embeddings, due to the fact comments commonly stray off topic and are more related to personal experiences rather than the submission/post. The post itself is more reliably related to the given subreddit, since many subreddits have rules on what can and cannot be posted within them.

We are essentially gathering all the words associated with a given subreddit into one column, from which we can create bag of words vectors. We will then feed that matrix into the term frequency - inverse document frequency (tf-IDF) transformer, which augments the bag of words matrix by giving less weight to common words, improving the amount of information we can obtain.

Below I’ve joined all the text within the title and selftext columns from every submission into one column for every subreddit. The text column is what we will run the bag of words algorithm on, using CountVectorizer from sklearn.feature_extraction.text.

text avg_chg category
subreddit
196 \n\n[View Poll](https://www.reddit.com/poll/np... 27.8067 high_growth
2007scape looking for some west coasters to personally a... -1.7927 low_contract
ACNHvillagertrade Pink hippo bitty in boxes heavily gifted I don... -1.5510 low_contract
ACTrade Howdy all! I haven’t played AC for about a yea... 4.7686 high_growth
ACVillager Currently can do 10 NMT and 3 Million Bells or... -6.2110 high_contract
... ... ... ...
wallstreetbets Let me introduce you to the ratio spread.\n\nT... 54.2944 high_growth
wholesomememes Every shitty horror movie you’re doing just f... -1.5314 low_contract
wildrift https://ibb.co/3NLSm9G\n\nNothing impressive t... 1.8571 low_growth
worldnews [removed]\n\n[View Poll](https://www.reddit.co... -3.4403 high_contract
yeagerbomb What I love about Eldia is I just get to be m... 5.3669 high_growth

326 rows × 3 columns
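The text-joining and tf-IDF steps can be sketched like this on made-up submissions. CountVectorizer and TfidfTransformer are the scikit-learn classes; capping the vocabulary with max_features=500 and dropping English stop words are my assumptions about how the 500-word representation was obtained.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# Made-up submissions; the real data has one row per post.
submissions = pd.DataFrame({
    "subreddit": ["stocks", "stocks", "gaming", "gaming"],
    "title": ["Buy the dip", "Earnings call today", "New patch notes", "Best build for ranked"],
    "selftext": ["Shares are cheap", None, "The patch nerfs snipers", "[removed]"],
})

# One document per subreddit: concatenate every title and selftext.
combined = submissions["title"].fillna("") + " " + submissions["selftext"].fillna("")
docs = combined.groupby(submissions["subreddit"]).agg(" ".join)

# Bag of words capped at 500 terms, then tf-idf reweighting.
# stop_words="english" is an assumption; the write-up does not list the options used.
vectorizer = CountVectorizer(max_features=500, stop_words="english")
counts = vectorizer.fit_transform(docs)
tfidf = TfidfTransformer().fit_transform(counts)
print(tfidf.shape)
```

Each row of tfidf is one subreddit and each column one of the retained terms, which is the matrix the clustering step works from.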

Visualizing our subreddits

So we’ve created a tf-IDF matrix based on the text column above. What now? Well, because each subreddit is now represented by 500 words, we can think of all our subreddits as residing in a 500-dimensional space. That’s crazy! There’s no way we can look at 500 dimensions at once, so to make our data easier to digest, we’ll use Principal Component Analysis (PCA) to reduce all those dimensions down to just 10, and we will subsequently look at the top 2 dimensions (which capture the most variance in the data).

Using K-Means, we’re able to cluster subreddits together based on their tf-IDF vectors after performing dimensionality reduction on them. We can see that the clusters below don’t really care whether a subreddit grew by a lot, or not by much.

What the following plots do is group subreddits together into topics. We will start off by examining the subreddits that experienced growth.
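A minimal sketch of the PCA and K-Means steps, using random stand-in data instead of the real tf-IDF matrix. The two well-separated synthetic groups and n_clusters=2 are chosen only so the result is easy to verify; the analysis here uses 8 clusters for growth and 5 for contraction.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

# Stand-in for the tf-idf matrix: 40 hypothetical subreddits x 500 terms,
# drawn around two separated centres so the clusters are easy to check.
rng = np.random.default_rng(0)
group_a = rng.normal(loc=0.0, scale=0.1, size=(20, 500))
group_b = rng.normal(loc=1.0, scale=0.1, size=(20, 500))
tfidf_dense = np.vstack([group_a, group_b])

# Reduce 500 dimensions to 10 principal components.
pca = PCA(n_components=10, random_state=0)
reduced = pca.fit_transform(tfidf_dense)

# Cluster on all 10 components; only the first two would be plotted.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(reduced)
xy = reduced[:, :2]
print(labels)
```

Note that scikit-learn's tf-IDF output is a sparse matrix, so with real data it would need converting (for example with .toarray()) before being passed to PCA.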

Subreddits that experienced growth

It’s much harder to distinguish groups within the growth subreddits, as they are made up of many different online communities. I attempt to classify them below:

Cluster Number Cluster Colour Potential Grouping
5 Purple Memes/General
0 Cyan Politics
2 Blue Tech/Pokemon/Animal Crossing/CryptoCurrency
7 Light Green Spanish
6 Orange Gaming
1 Pink Sports/memes/Anime
4 Red Stocks/Crypto
3 Green Turkish/Roblox/memes

Subreddits that experienced contraction

On the left, we have a cluster of primarily gaming subreddits in green. They had a mix of low and high contraction in activity levels. These games however seemed to belong to a sub-category of big name FPS games, such as Overwatch, Call of Duty: Cold War, Battlefield V, The Last of Us 2, and Destiny.

In purple, we have a variety of subreddits, from sports like NBA and NFL, to r/wallstreetbets spinoffs like Wallstreetsilver and WallStreetbetsELITE. This cluster is not very cohesive, so it is hard to derive any insight from it.

Near the bottom, we have a red cluster of primarily joke/meme/funny topic related subreddits such as TheMemersClub, dankmemes, wholesomememes as well as some shows such as WANDAVISION, gameofthrones, and rupaulsdragrace.

On the right, we have a blue cluster of subreddits that don’t all seem to relate to each other: it contains subreddits focused on the Animal Crossing game, such as ACVillager, ACNHvillagertrade, AnimalCrossingTrading, AnimalCrossingNewHor, and TurnipExchange, alongside subreddits on other topics such as todayilearned and MemeEconomy.

On the right we also have a tiny orange cluster of subreddits almost exclusively dedicated to the COVID-19 pandemic/world news.

Cluster Number Cluster Colour Potential Grouping
2 Green First Person Shooter (FPS) Video Games
4 Purple NBA/NFL/Wall Street Bet spinoffs/Russian
1 Red Jokes/memes/funny
3 Blue Animal Crossing/memes
0 Orange COVID-19

As we can see, there are some clear topics/communities that contracted such as COVID-19, FPS Video Games while some clear communities that grew such as politics, stocks/cryptocurrency, and other gaming categories.

We could probably explain the decline in COVID-19 related subreddits by people getting tired of talking and hearing about the ongoing pandemic. As restrictions lifted, people may have played fewer video games (specifically FPS games and Animal Crossing), causing related subreddits to decrease in activity as well. Interestingly, the actual Animal Crossing subreddit had high growth in activity level, even though its related subreddits decreased in activity level.

[Figure: activity level of the Animal Crossing subreddit (animal_crossing.png)]

Analyzing the growth patterns of each cluster

Growth clusters

[Figure: activity levels over time for each growth cluster]

Contraction clusters

[Figure: activity levels over time for each contraction cluster]

Analysis of the community activity plots

This is a lot of data to digest! Here are some thoughts:

Growth

Starting off with the communities/clusters of subreddits that grew, we see that the Tech/Pokemon/Animal Crossing/CryptoCurrency and Stocks/Crypto clusters grew quite similarly, but that’s most likely because they both contained similar subreddits. The rest of the subreddit clusters don’t share many similarities in when they grew, so unfortunately we cannot draw any conclusions about whether certain clusters of communities on reddit grew together.

What we can observe from these graphs is the jump in activity in almost all the clusters around early 2020, between January and July. This correlates well with when COVID-19 lockdowns started to be put in place around the world.

Contraction

Fewer clusters worked better for the subreddits that contracted, since in the first two dimensions of our submission text embedding space the subreddits already appeared to have some semi-distinct groupings.

Very similarly to the growth category, we see a jump in activity in early 2020, between January and July, for these subreddits as well. For this group of subreddits, however, it doesn’t seem as though that jump was enough to spur new activity.

Conclusion of part 1: Analysis of similarities between subreddits that changed together over January 2019 - June 2021.

So to conclude this section, we did quite a bit of data cleaning and separation. We first calculated the average rate of change in the number of submissions and comments in each subreddit, which we rolled up into total_activity. We then filtered out subreddits that didn’t change much over the time span of our data, and subsequently separated the rest into groups of subreddits that grew and subreddits that contracted. After figuring out which subreddits we wanted to analyze, we were able to obtain all the submission text (including titles of posts and the content, if any, attached) and construct a tf-IDF matrix.

The NLP analysis

This matrix represents each subreddit by a vector of 500 words, with more “important” words weighing more than common words. Using this, we performed PCA to identify the 10 dimensions that capture the most variance between the subreddits, visualizing the top 2. We clustered subreddits together based on these 10 dimensions using the KMeans algorithm, and tried our best to interpret the clustering. From there we examined how each of our clusters changed in total_activity in relation to each other. We were unable to identify any strong similarities between subreddits, but all subreddits grew in early 2020, between January and June, as lockdowns around the world started being put in place.

Comment data

Due to the time constraints of this project, I will not be using the comment text data for the sentiment analysis, even though it provides a lot of information. It would involve running ~2.8 GB of data through a sentiment processor, which would take too long.

Part 2: Sentiment Analysis

In the following section, we’ll take a look at the sentiment associated with the growth and contraction groups of subreddits respectively to determine whether there was any sentiment associated with the growth or decline in activity level.

To do this, we will be using the SentimentIntensityAnalyzer from nltk.sentiment, powered by the popular VADER lexicon.
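Here is a rough sketch of how the per-subreddit scores could be produced. SentimentIntensityAnalyzer.polarity_scores is the real NLTK call and returns neg, neu, pos and compound values; averaging those scores over a subreddit's submissions is my assumption about the aggregation, and the helper names vader_scorer, mean_sentiment and stub_scorer are hypothetical.

```python
from statistics import mean

def vader_scorer():
    """Build a scorer backed by NLTK's VADER lexicon (needs a one-time download)."""
    import nltk
    from nltk.sentiment import SentimentIntensityAnalyzer
    nltk.download("vader_lexicon", quiet=True)
    return SentimentIntensityAnalyzer().polarity_scores

def mean_sentiment(texts, scorer):
    """Average the neg/neu/pos proportions over a list of texts."""
    scores = [scorer(text) for text in texts]
    return {key: mean(score[key] for score in scores) for key in ("neg", "neu", "pos")}

# Usage with the real analyzer (commented out because it downloads the lexicon):
# print(mean_sentiment(["I love this sub", "This is awful"], vader_scorer()))

# A stub scorer with the same output keys, used here only to show the aggregation.
def stub_scorer(text):
    return {"neg": 0.0, "neu": 0.5, "pos": 0.5, "compound": 0.5}

print(mean_sentiment(["first post", "second post"], stub_scorer))
```

Passing the scorer in as an argument keeps the averaging logic testable without the lexicon download.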

Subreddits that grew

From the left, we see that the subreddit with the most positive sentiment also has the most negative sentiment. A deeper dive into r/TIHI shows that it’s generally about things people hate, with posters asking for others’ opinions on whether their own opinion is justified. The subreddit itself seems negative, but with the “Thanks…” required in each submission, positive and negative sentiments appear to balance out.

We can also examine the average sentiment statistics:

neg neu pos
count 185.000000 185.000000 185.000000
mean 0.095876 0.765146 0.138962
std 0.049683 0.079701 0.048680
min 0.005000 0.274000 0.005000
25% 0.064000 0.725000 0.119000
50% 0.091000 0.762000 0.140000
75% 0.124000 0.798000 0.161000
max 0.409000 0.991000 0.317000

Based purely on sentiment, we have an average of 13.90% positive sentiment and 9.59% negative sentiment. It appears there is a bit more positive sentiment than negative sentiment within subreddits that grew.

Subreddits that contracted

Starting from the left side of this graph, we can see that r/wholesomememes had tons of positive sentiment related to it, which is understandable as being wholesome is central to what the community is, and being wholesome is trivially related to positive sentiment.

Further to the right, we can see a noticeable spike in negative sentiment in the r/MurderedbyWords subreddit, which is dedicated to comebacks and counter-arguments.

We can also examine the average sentiment statistics:

neg neu pos
count 140.000000 140.000000 140.000000
mean 0.099221 0.754379 0.146371
std 0.034446 0.052458 0.037560
min 0.003000 0.609000 0.007000
25% 0.079000 0.726250 0.131000
50% 0.098500 0.755000 0.147500
75% 0.119000 0.780250 0.165000
max 0.240000 0.989000 0.326000

If we look at the row that describes the mean, we can see that there is on average 14.64% positive sentiment associated with these subreddits and 9.92% negative sentiment.

Based solely on this analysis, we see that the proportions of positive and negative sentiment are really similar between the subreddits that grew and those that contracted.

T-tests on positive and negative sentiment between subreddits that grew and subreddits that contracted

From our t-tests on the negative and positive sentiment scores of each subreddit, we can see that both p-values are greater than 5%, which means we cannot reject the null hypothesis that subreddits that contracted and subreddits that grew have the same negative and positive sentiment values on average.

Therefore, there is not enough statistical evidence that subreddits that grew and subreddits that contracted differ in terms of sentiment.
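As a rough cross-check, the same comparison can be approximated from the summary tables alone with scipy's ttest_ind_from_stats. The means, standard deviations and counts below are copied from the two describe tables; the choice of Welch's variant (equal_var=False) is my assumption, so the p-values may not match the original run exactly.

```python
from scipy import stats

# Means, standard deviations and counts copied from the two summary tables above.
grew = {"neg": (0.095876, 0.049683), "pos": (0.138962, 0.048680), "n": 185}
contracted = {"neg": (0.099221, 0.034446), "pos": (0.146371, 0.037560), "n": 140}

alpha = 0.05
results = {}
for label in ("neg", "pos"):
    mean_g, std_g = grew[label]
    mean_c, std_c = contracted[label]
    # equal_var=False gives Welch's t-test; which variant the original
    # analysis used is not stated, so this choice is an assumption.
    results[label] = stats.ttest_ind_from_stats(
        mean_g, std_g, grew["n"], mean_c, std_c, contracted["n"], equal_var=False
    )
    print(label, "reject null:", results[label].pvalue < alpha)
```

Working from summary statistics is only an approximation of a test on the raw per-subreddit scores, but it is enough to see whether the gap between the two means is large relative to their spread.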

Conclusion

We were able to find some interesting results with our analysis of the activity level changes between subreddits that grew and subreddits that contracted, and we were somewhat successful in clustering subreddits based on similar topics. We found that regardless of the overall activity change observed over the 2.5 years of data we have, there was a significant uptick in activity for all our filtered subreddits in early 2020, when COVID-19 lockdowns were being put in place around the world.

Based on our brief analysis of the sentiment associated with each subreddit’s submissions, we cannot say whether activity level changes in subreddits are positively correlated with the degree of positive or negative sentiment associated with the activity. We have only seen that there seems to be more positive sentiment than negative sentiment associated with all the subreddits we looked at.

Therefore we can neither accept nor reject our hypothesis that activity level changes in subreddits are positively correlated with the degree of positive or negative sentiment associated with the activity.

Thank you for taking the time to read through my data story and coming along with me on this data journey. I hope you learned something new today!

If you’d like to learn more about me, check out my video introduction

Or you could check out my personal website