The last couple of blogs that I have written have been great for beginners (Data Concepts Without Learning To Code or Developing A Data Scientist’s Mindset). But, I would really like to push myself to create content for other members of my audience as well. If you’ve got a handle on SQL basics, but you’re ready to take your query skills to the next level, you’re in the right place. Today, we are going to take a step into more intermediate data analysis territory *dun dun dun*…. and discuss the self join.
Disclaimer: This post assumes that you already understand how joins work in SQL. If you are not familiar with this concept yet, no worries at all! Save this article for later because I think it’ll definitely be useful as you master SQL in the future.
What is a “Self” Join?
A self join is actually as literal as it gets – it is joining a database table to itself. You can use any kind of join you want to perform a self join (left, right, inner, etc.) – what makes it the self join is that you use the same table on both sides. Just make sure that you select the correct join type for your specific scenario and desired outcome.
When Should I Use a Self Join?
If you’ve been working or studying in the field of data analytics and data science for more than, say, 5 minutes, you’ll know that there are always 27 ways to solve a problem. Some are better than others of course, but sometimes the differences are almost indiscernible.
That being said, there is probably never going to be one exact case where you MUST HAVE a self join or your analysis will shrivel up and die with nowhere to turn.
But, I do at least have some scenarios where I have used self joins to solve my analytics problems, at work or in personal analysis. Here’s my own spin on two of the best (AKA the ones that I remember and can think of a good example for).
Scenario #1: Message/Response
Suppose that there exists a database table called Chats that holds all of the chat messages that have been sent or received by an online clothing store business.
It would be extremely beneficial for the clothing store owner to know how long it usually takes her to respond to messages from her customers.
But, the messages from her customers and messages to her customers are in the same data source. So, we can use a self join to query the data and provide this analysis to the store owner. We will need one copy of the Chats table to get the initial message from the customer and one copy of the Chats table to get the response from the owner. Then, we can do some date math on the dates associated with those events to figure out how long the store owner is taking to respond.
I would write this hypothetical self join query as the following:
This query is relatively straightforward,. The RespondingTo column gives us a one-to-one mapping of which original message to join back to.
Scenario #2: On/Off
Let’s say you’re presented a database table AccountActivity that holds a log of events that can occur on a yoga subscription site. The yoga site offers certain “premium trial periods” where new customers can get a discounted membership rate for some period. The trials starting and ending date are tracked in this table with the Premium_Start and PremiumEnd event types.
Suppose that some employees on the business side at this yoga subscription company are asking:
- how many people have the premium trial period currently active
- how many used to have the premium trial period active but now they don’t
Analysis Request A: Accounts in Premium Trial Period
To answer the first question, we need to find events where a premium membership was started, but has not been ended yet. So, we need to join the AccountActivity table to itself to look for premium start and premium end event matches. But, we can’t use an inner join this time. We need the null rows in the end table… so left join it is.
Notice how we also check and make sure that the events we are joining are in the right order. We want the premium trial start on the left side of the join, and the premium trial end on the right side of the join. Also, make sure that the User Id matches on both sides. We wouldn’t want to join events from two different customers!
Analysis Request B: Accounts Who Used to Be in Premium Trial Period
Regarding the second question, we want to find the customers whose sweet premium trial has come to an end. We are going to need to self join AccountActivity again, but in a stricter way. We want matches from both the left and right, since in this population, the trial has ended. So, we can choose an inner join this time.
See, self joins are pretty fun. They can be pretty useful when you have events that are related to each other in the same database table. And on that note, I shall end this with my one of my favorite memes of all time that was made for self joins: the Spiderman meme.
Thanks for reading, and happy querying. 🙂