If you are someone who wants better clarity on how "WHERE" and "HAVING" clauses actually work in SQL, I have found out a place.
let's make this clear first, I assume you have a basic understanding before reading further a refresher will be a plus before reading this article. Check out my SQL essentials which are 10 mins read and come back here.
I am using here is a legendary titanic dataset, download here.
After landing on the page right-click and save as select the proper location in your pc. I will encourage you to try out it on your own dataset.
I am querying the data using pandasql inside jupyter notebook using pandas library of python. considering hard ways to import CSV into SQL I found this easy way. all syntax and clauses will work fine in your databases. regardless to say it's a great opportunity to get hands-on pandasql.
Let's start with Where first what where clause does is filter out the data according to a given condition.
import pandas as pd
import pandasql as ps
df = pd.read_csv("titanic.csv")
#Selecting everything from dataset
ps.sqldf("""
SELECT * FROM df
LIMIT 5
""")
#Filtering with where from dataset
ps.sqldf("""
SELECT * FROM df
WHERE Age = 25
LIMIT 5
""")
Note: ps.sqldf(""" """) is syntax of pandasql you can ignore if you are using another database
Look closely at the results of the WHERE clause output, you will be getting all the rows where Age is 25.
here we can conclude that WHERE is filtering our data set and coming out with output.
So here I would like to think about a problem If I am going to create another column with a SELECT statement then our WHERE can not filter out the values of that newly created column.
#creating new columns with aggregate functions.
ps.sqldf("""
SELECT
Name,
Age,
AVG(Age) AS avg_age,
SUM(Age) AS tot_age,
MIN(Age) AS min_age,
MAX(Age) AS max_age
FROM df
GROUP BY Embarked
""")
In this case, we can use HAVING to filter out the output of the above query. like this
#Using HAVING to filter out the output
ps.sqldf("""
SELECT
Name,
Age,
AVG(Age) AS avg_age,
SUM(Age) AS tot_age,
MIN(Age) AS min_age,
MAX(Age) AS max_age
FROM df
GROUP BY Embarked
HAVING Age > 30
""")
This will simply give you the filtered results on the output of the query.
we can conclude that
Where clause will filter the data from a table.
HAVING clause will filter the data from the output of the query.
You will get an understanding of what's happening here once you try out it on your own.
Till then thanks for reading, catch you in the next one.✌