Coding In Data Science and Analytics - Articles on Data36.com
https://data36.com/category/coding-data-science-analytics/

The Junior Data Scientist’s First Month (Video Course)
https://data36.com/junior-data-scientists-first-month-video-course/ (Mon, 05 Dec 2022)
100% practical 6-week data science challenge & video course — a simulation of being a junior data scientist at a true-to-life startup.

Learn Python 3 for Data Science – from scratch
https://data36.com/learn-python-3-data-science-scratch-redirect/ (Sun, 04 Dec 2022)
I put together a ‘Python 3 for Data Science’ tutorial series starting from the very basics. It features 9 practical articles that cover everything you need to know!

Learn SQL for Data Analysis – from scratch
https://data36.com/learn-sql-data-analysis-scratch-redirect/ (Sat, 03 Dec 2022)
Learning SQL is very useful for anyone in the online world. And it’s a definite must for Data Analysts/Scientists! Here are 8 articles to start with…

Learn Data Analytics in Bash – from scratch
https://data36.com/learn-data-analytics-bash-scratch-redirect/ (Fri, 02 Dec 2022)
Read these 7 free articles about Bash/Command Line and take your very first step to learn the basics of coding for Data Science!

Correlation: What is it? How to calculate it? .corr() in pandas
https://data36.com/correlation-definition-calculation-corr-pandas/ (Mon, 03 Oct 2022)

Exploratory Data Analysis (EDA), machine learning projects, economic and financial analysis, scientific research, and even individual newspaper articles on all kinds of topics involve examining correlation between variables.

  • But what is correlation?
  • How do we use it?
  • Can we measure it?
  • Can we visualize it?
  • What is causation?
  • How can it help your business?

You’ll find the answers to all those questions in this article!

The author of this article is Levente Kulcsar from Sweden. He creates awesome data science content on his Twitter account. Follow him here.

What is correlation?

According to Wikipedia: “Correlation refers to the degree to which a pair of variables are linearly related.” [1]

In plain English: correlation is a measure of a statistical relationship between two sets of data. 

Let’s call those two datasets X and Y for a little example:

Variables X and Y are positively correlated if:

  • high values of X go with high values of Y
  • low values of X go with low values of Y

Variables X and Y are negatively correlated if:

  • high values of X go with low values of Y
  • low values of X go with high values of Y

Note: It’s important to note that correlation does not imply causation. In other words, just because you see that two things are correlated to each other, it doesn’t necessarily mean that one causes the other. More on this later.

Here are visualizations of correlations. (Stay tuned, we will learn how to create these scatterplots!)

positive correlation

In this example the two variables are skill and attacking. It is clearly visible that high skill values go with high attacking values, so they are positively correlated.

The next scatterplot is a visual presentation of a negative correlation (though not a strong one):

negative correlation

In this case high height_cm values go with low movement values.

How can we measure correlation?

To measure correlation, we usually use the Pearson correlation coefficient, which gives an estimate of the linear correlation between two variables.

To compute Pearson’s coefficient, we multiply deviations from the mean for X times those for Y and divide by the product of the standard deviations. Here is the formula: [2]

r = Σ (x_i − mean(X)) · (y_i − mean(Y)) / ((n − 1) · std(X) · std(Y))

(where n is the number of data points and std() is the sample standard deviation)

Note: as always – it’s important to understand how you calculate Pearson’s coefficient – but luckily, it’s implemented in pandas, so you don’t have to type the whole formula into Python all the time; you can just call the right function… more about that later.
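
To make that note concrete, here’s a quick sketch on a tiny, made-up dataset that computes Pearson’s r by hand and compares it with pandas’ built-in .corr() method:

import pandas as pd

x = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])
y = pd.Series([2.1, 3.9, 6.2, 8.1, 9.8])

# Pearson's r "by hand": sum of mean-deviation products, divided by (n - 1) * std(X) * std(Y)
r_manual = ((x - x.mean()) * (y - y.mean())).sum() / ((len(x) - 1) * x.std() * y.std())

# the same thing with pandas' built-in method
r_pandas = x.corr(y)

print(round(r_manual, 4), round(r_pandas, 4))  # the two numbers match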

Pearson’s correlation coefficient is good to measure linear correlation. 

Wait! Do we have nonlinear correlation as well? Yes, we do – so it’s time to define the difference.

  • Linear correlation: The correlation is linear if the ratio of change is constant. [3] If we double X, Y will be doubled as well.
  • Nonlinear correlation: If the ratio of change is not constant, we are facing nonlinear correlation. [3] To measure nonlinear correlation, we use Spearman’s correlation coefficient. More on this here. [4]

So back to linear correlation and Pearson’s coefficient. The coefficient always has a value between −1 and 1:

  • -1 means perfect negative linear correlation
  • +1 means perfect positive linear correlation
  • 0 means no linear dependency between variables.

A few examples from a Wikipedia article:

source: https://en.wikipedia.org/wiki/Correlation#/media/File:Correlation_examples2.svg

What does Pearson’s correlation coefficient tell us?

  • the “noisiness” of the relationship,
  • the direction of the relationship

What does the coefficient not tell us?

  • The slope of the relationship
  • Whether a non-linear relationship exists. (E.g. in the image from the Wikipedia article above, we can assume that there is some kind of correlation in the bottom row, but since those relationships are not linear, we cannot measure them with Pearson’s correlation coefficient.)
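
To illustrate the first point with a tiny, made-up example: scaling Y changes the slope of the relationship, but Pearson’s r stays exactly the same.

import pandas as pd

x = pd.Series(range(10), dtype=float)

# both correlations are exactly 1.0, even though the slopes (2 vs. 100) are very different
print(x.corr(2 * x), x.corr(100 * x))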

Correlation vs Causation

It is important to understand that if two values are correlated it doesn’t mean that one causes the other. 

Correlation does not imply causation – as they say.

It only means that X and Y move together. But this correlation can be due to:

  1. Causation
  2. Third variable
  3. Coincidence

What is causation?

Causation means that there is a cause-and-effect link between X and Y. The result of this link is that if a change in X occurs, a change in Y will occur as well.

A really simple example:

(Generally) when someone exercises more, they will gain more muscle.

But when we think about causation we need to be careful, because some problems can emerge.

Third variable problem

For example, we usually see a positive correlation between shark attacks and ice cream sales. Can we conclude that there is causation between these variables? Of course not. Ice cream sales won’t cause shark attacks, and vice versa.

Instead, a third variable enters the conversation: temperature. 

When it’s warmer out, more people buy ice cream and more people swim in the ocean. [5]

This is a typical example of the third variable problem. The third variable problem means that X and Y are correlated, but a third variable, Z, causes the changes in both X and Y.

Directionality Problem

Another thing we need to consider is the direction of the relationship. 

Aggressive people watch lots of violence on TV. 

But does violence on TV make them aggressive? Or are they aggressive to begin with, and hence watch violence on TV?

We cannot tell for sure.

Directionality Problem means that we know that X and Y are correlated and we assume that there is a link between them, but we don’t know if X causes Y or Y causes X.

Spurious Correlation

A spurious correlation is when two variables are related through a hidden third variable or simply by coincidence. [7]

You can find some funny examples of spurious correlations here. [6]

source: https://tylervigen.com/spurious-correlations

Correlation in Pandas

Now it is time to code!

First we need to import packages and our data. In this exercise we will use Kaggle’s FIFA 22 top 650 players. 

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
data = pd.read_csv('../input/top-650-fifa-22-players-simplified/Top_650_FIFA.csv')

This dataset contains player details from the well-known soccer computer game. We will mainly focus on their skills, such as power, mentality, passing, shooting, etc. Each player has a rating out of 100 in these categories.

Note: you can learn Pandas basics and how to load a dataset into pandas, here: https://data36.com/pandas-tutorial-1-basics-reading-data-files-dataframes-data-selection/

Correlation matrix – How to use .corr()

The easiest way to check the correlation between variables is to use the .corr() method.

data.corr() will give us the correlation matrix for the dataset. Here is a small sample from the big table:

Note: If you want to learn in detail, how to read this matrix, check this article out.

We will use only some of the columns for better understanding. Also, columns like the index (Unnamed 0) and club_jersey_number are not relevant to us. We do not anticipate any connection between a jersey number and the player’s skills. 

We will define a variable with column names and apply .corr() only on those columns:

columns = ['age', 'height_cm', 'weight_kg', 'skill_moves',
'pace','shooting','passing',
'dribbling','defending','physic',
'attacking','skill','movement','power']
data[columns].corr()

Again, here is part of the table:

correlation matrix 2

Note: .corr() uses Pearson’s coefficient by default; we can change that by defining the method inside the parentheses. Use method='spearman' to check Spearman’s coefficient and nonlinear correlation.
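
For example, reusing the data DataFrame and the columns list defined above:

# Spearman (rank) correlation on the same columns
data[columns].corr(method='spearman')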

Coloring the correlation matrix (so it’s easier to read)

Since the matrix contains many numbers, it is hard to read. For better understanding, we can add some coloring. 

In this example I used a gradient background called coolwarm, by adding .style.background_gradient(cmap='coolwarm') to the end of the code defined earlier.

The result for:

data[columns].corr().style.background_gradient(cmap='coolwarm')

will be something like this:

correlation matrix heatmap

From the table presented this way, you can immediately find the negative and positive correlations.

Using these colors, it is also easy to spot that the correlation matrix contains every value twice: it is mirrored across the diagonal.

To clean up the table even further, we will use seaborn and a mask.

Note: For a better understanding of how we use the mask in this example, click here. [9]

import seaborn as sns
import matplotlib.pyplot as plt

corrmat = data[columns].corr()

# mask the upper triangle, so each correlation value shows up only once
mask = np.zeros_like(corrmat)
mask[np.triu_indices_from(mask)] = True

sns.heatmap(corrmat,
            vmax=1, vmin=-1,
            annot=True, annot_kws={'fontsize':7},
            mask=mask,
            cmap=sns.diverging_palette(20, 220, as_cmap=True))
seaborn corr pandas heatmap
corr matrix heatmap filtered pandas seaborn


Scatterplots

We can visualize a pair of variables and check if they are correlated or not on scatter plots as well.

In Pandas we just need to use .plot.scatter() and define our X and Y variables:

data.plot.scatter(x='attacking',y='skill')
positive correlation

Note: Did you notice that this is the chart that we have already discussed at the beginning? 

We know from the matrix that the correlation coefficient for the two variables is 0.95, so they are strongly, positively correlated.

We just need to change the x and y variable names to recreate the example chart for negative correlation. (corr coefficient is -0.7):

data.plot.scatter(x='movement',y='height_cm')
negative correlation

What about no-correlation? What does that look like?

Here is the example (corr coefficient is 0.1):

data.plot.scatter(x='passing',y='pace')
scatter plot corr

You can also use seaborn to visualize more than just one pair of variables on scatter plots.

Calling sns.pairplot() will create a matrix of scatterplots.

More on pairplots here [10].

columns = ['age', 'height_cm', 'weight_kg', 'movement','pace']
sns.pairplot(data[columns])
correlation matrix visualized scatter plots histograms

How can correlation help your business?

Correlation is widely used in real-life decision making. You will find it in marketing, finance, sales, and countless other domains.

A few benefits:

  • Pattern recognition. In the big data world looking at millions of rows of raw data will not tell you anything about the business. Using existing information for better decision making will be crucial in the future. It can reveal new business opportunities, give insights about existing processes, and help to communicate clearly. Recognizing patterns is one of the main goals of data science and correlation analysis can help with that.
  • Financial decision making – investment decisions. Diversifying is essential. Investing in negatively correlated sectors can help you mitigate risk. 
    For example: if the airline industry is negatively correlated with the social media industry, an investor may choose to invest in a social media stock. If a negative event affects one of those industries, the other sector will be a safer place for the money. [11]
  • Projections. If a company finds a positive correlation between two variables and has a prediction for one of them, it can try to make predictions for the second variable as well.
    For example: Company X finds a positive correlation between the number of tourists in city Y and its sales. A 10% rise in visitors for the coming year is predicted in city Y. Company X can anticipate an increase in sales as well. Of course, when it comes to predictions, one should always consider the above-mentioned correlation-causation issue.

All of the above-mentioned activities can enhance decision-making, reduce risk, and reveal new opportunities through correlation.

Cheers,
Levi Kulcsar

Sources

[1]: https://en.wikipedia.org/wiki/Correlation

[2]: Practical Statistics for Data Scientists by Peter Bruce, Andrew Bruce, and Peter Gedeck (O’Reilly).

[3]: https://www.emathzone.com/tutorials/basic-statistics/linear-and-non-linear-correlation.html

[4]: https://en.wikipedia.org/wiki/Spearman%27s_rank_correlation_coefficient

[5]: https://www.statology.org/third-variable-problem/

[6]: http://www.tylervigen.com/spurious-correlations

[7]: https://www.scribbr.com/methodology/correlation-vs-causation/

[8]: https://www.statology.org/how-to-read-a-correlation-matrix/

[9]: https://www.kdnuggets.com/2019/07/annotated-heatmaps-correlation-matrix.html#

[10]: https://twitter.com/levikul09/status/1542051235510902784?s=20&t=vPWeG5_Yhi3AJ7RDo4ZsiA

[11]: https://www.investopedia.com/terms/c/correlation.asp

K-means Clustering with scikit-learn (in Python)
https://data36.com/k-means-clustering-scikit-learn-python/ (Tue, 13 Sep 2022)

You’re here for two reasons: 1) you want to learn to create a K-means clustering model in Python, and 2) you’re a cool person because of that (people reading data36.com are cool people 😎).

Back to reason number one: it’s not surprising, because K-means clustering is one of the most popular and easy-to-grasp unsupervised machine learning models.

Lucky for you, you’re about to learn everything you need to know to get your feet wet. To code along with me, you have to have these libraries installed: pandas, scikit-learn, matplotlib.

Also, some basic knowledge of Python, statistics, and machine learning won’t hurt, either.

Let the fun begin. 😉

The difference between supervised and unsupervised machine learning, and why the latter can be scary

Broadly speaking, machine learning models can be categorized as either supervised or unsupervised.

Supervised means that your model receives both the “questions” (input data) and the “answers” (output data) during learning. So if you want your model to recognize whales, you can show images of different animals (input data) to it and also the solutions (output data, e.g. “this image shows a whale”, but “this image shows a lion, so it’s not a whale”).

This is a classification problem; however, supervised algorithms can be used to solve regression tasks as well (e.g. predicting the price of a car based on its attributes). If you’d like to learn more about these topics, you can find relevant articles on data36 for both types of supervised models (regression, classification).

With unsupervised learning, the story is a bit different. You don’t have any answers that supervise the model’s learning, just the inputs. You basically tell the model to do whatever it wants to do with the data.

And it’s kind of cool because unsupervised models can find hidden patterns and relationships in your dataset that you’d otherwise never be able to find (not even in your lifetime).

You don’t know what aspects of the data points the model takes into consideration when drawing its conclusions, but you don’t even care, because you just want to see what your model figured out on its own. An example would be finding different groups of this blog’s visitors based on what articles they read. Well, actually, we have created an analysis like that before:

graph clustering nx draw

To perform such a task, you need to use something called clustering.

Note: don’t worry about what exactly you see in the picture above. It’s just a random example of clustering. But before anything else, you have to understand what clustering is in machine learning.

What is clustering in machine learning?

Clustering means grouping.

*crickets chirping*

No, seriously. 🙂 Just remember – in the context of unsupervised learning, we let our model take control, and make its own decisions. In the case of clustering, the model studies the dataset to find similarities and differences between the data points, then it creates distinct groups out of those data points.

Here’s the funny (scary?) part: once the model is done with clustering, we are left on our own.

What do I mean by that?

The model creates the groups, but it won’t attach an explanation to them detailing what each group is about, and why certain data points belong to one group and not the other. It won’t say “this group contains people who tend to spend more money on Sundays” or “this group contains cars that will break down in 5 years”.

The model will only say that this data point belongs to group a, and that belongs to group b – interpreting the clusters will be totally up to you.

Let’s see how K-means clustering – one of the most popular clustering methods – works.

Here’s how K-means clustering does its thing

You’ll love this because it’s just a few simple steps! 🤗

For starters, let’s break down what K-means clustering means:

  • clustering: the model groups data points into different clusters,
  • K: K is a variable that we set; it represents how many clusters we want our model to create,
  • means: each cluster has a mean, and each data point will be assigned to the cluster whose mean is closest to the given data point. Read on, and you’ll get it, I promise!

Let’s look at an example. We have the following data points that we’d like to group into three groups (K = 3):

k means clustering dataset

Here’s how our K-means clustering model goes about it.

First, it randomly selects three data points (I’ve marked them with different colors):

k means clustering demo 3

These points are called cluster centroids because they mark the center of the clusters.

Our model then measures – using Euclidean distance – the distance of each data point to the cluster centroids, and assigns each data point to the cluster whose centroid is closest to it:

kmeans clustering demo 4

Now that we have three groups, each group’s cluster centroid gets recalculated (=will move to a new position) based on the average of the data points within the group, then the data points get assigned again to the closest cluster centroid (marked with a cross). Notice how data points can switch teams:

k means clustering example

The algorithm repeats the above steps until the data points stop switching groups (=each data point is assigned to its final group, no more team switching for them).

And that’s it. 🙂
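
If you’re curious what those steps look like in code, here’s a minimal NumPy sketch of the K-means loop (just an illustration with made-up names – later we’ll use scikit-learn’s ready-made implementation instead):

import numpy as np

def kmeans(points, k, n_iter=100, seed=0):
    """A bare-bones K-means: returns a cluster label for every point, plus the centroids."""
    rng = np.random.default_rng(seed)
    # step 1: pick k random data points as the initial cluster centroids
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        # step 2: assign each point to the closest centroid (Euclidean distance)
        distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # step 3: move each centroid to the mean of the points assigned to it
        # (a real implementation would also handle clusters that end up empty)
        new_centroids = np.array([points[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centroids, centroids):  # no more "team switching" - we're done
            break
        centroids = new_centroids
    return labels, centroids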

I believe two questions arise at this point:

  1. Is that all? Answer: yep.
  2. How do we know what value to choose for K? Answer: you’ll find it out in the coding section. 😉


How can you code a K-means clustering model in Python

I. Dropping unnecessary rows/columns before clustering

We’ll work with a dataset from Kaggle (download it from here).

The dataset contains customer data, like gender, age, profession, size of family, etc. Our plan is to create clusters out of these customers.

Read in the dataset, save it to df, and view five random rows of it with the sample method:

import pandas as pd

df = pd.read_csv("file-location-don't-copy/Train.csv")
df.sample(5, random_state=44)

Here’s the output:

pandas k means clustering dataset

As the Segmentation column suggests, the customers have already been segmented by a certain logic (if you read the dataset’s description, you’ll know the full story). Since we want to create our own clusters, let’s remove this column along with ID:

df = df.drop(["Segmentation", "ID"], axis="columns")

A quick df.head() will attest to our success:

df_drop clustering

It’s always recommended to get a general sense of the dataset you’re working with, so let’s do just that with df.info():

df info pandas before clustering

According to the RangeIndex part, our dataframe (df) holds 8,068 rows. By looking at the number of non-null data points in each column, we can establish that we have many missing values (e.g. Graduated has 7,990 values instead of 8,068).

For the sake of simplicity, we remove all rows with any missing values with df.dropna():

drop_na method demo pandas

df.info() now shows that all columns have 6,665 rows, but our index still goes from 0 to 8067, so we reset it with df.reset_index(), then remove the freshly created index column with df.drop("index", axis="columns"):

df = df.reset_index()
df = df.drop("index", axis="columns")
df.head()

Perfection:

clustering in pandas

I’ll level with you: to create a K-means clustering model, it’d be perfectly fine if you just removed the Segmentation and the ID columns, and the rows with missing values.

But I like to keep my data neat and organized, so that’s why you had to go through all this – sorry!

I’ll make it up to you by showing you the fun part, okay? 🙂

II. Formatting the data for the K-means clustering model

Since our soon-to-be-created model doesn’t understand categorical data (e.g. Graduated = No), we need to convert categorical values to numerical data:

df_new = pd.get_dummies(df)
df_new.head()

The output:

get_dummies method demo

pd.get_dummies() successfully converted our data, and that we’re happy about! 🙂 What we’re not happy about is the many columns that don’t even fit in one screenshot. 😟

Before we fix that, let me explain what just happened:

  • previously we had columns holding categorical data (e.g. Graduated = No),
  • after get_dummies(), we have numerical columns representing our categorical data (e.g. Graduated_No = 1 or 0),
  • where 1 means true and 0 means false (e.g. Graduated_No = 0 is just Graduated = No in a new form that our model understands).

If you pay close attention, you may notice that now we have plenty of redundant data: Graduated_No = 0 and Graduated_Yes = 1 both represent the same thing (that someone graduated), so we don’t need both columns. We can keep only the first of the two by adding drop_first=True to get_dummies():

df_kmeans = pd.get_dummies(df, drop_first=True)
df_kmeans.head()

The result:

get_dummies method pandas

(Okay, it’s not a must step, because our model would do just fine without it, but again, I like to keep things simple – mea maxima culpa. You can check with df_new.columns.size and df_kmeans.columns.size that we managed to reduce the number of columns from 28 to 22.)

III. Standardization

Now, because we have columns with different ranges of numbers (e.g. most of them are either 0 or 1, but Age can be more than 1), we need to standardize our data. According to scikit-learn’s documentation:

“Standardization of a dataset is a common requirement for many machine learning estimators: they might behave badly if the individual features do not more or less look like standard normally distributed data […]”

Let’s just believe that, and type the following:

from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaled_df_kmeans = scaler.fit_transform(df_kmeans)

The first line imports StandardScaler, which does the standardization for us. The second one creates the scaler itself, and the third one actually applies it to our data (scaled_df_kmeans is the output: our standardized data).

IV. Creating the K-means clustering model!

Finally, we can create our K-means clustering model:

from sklearn.cluster import KMeans
kmeans_model = KMeans(n_clusters=3)
clusters = kmeans_model.fit_predict(df_kmeans)
df_kmeans.insert(df_kmeans.columns.get_loc("Age"), "Cluster", clusters)
df_kmeans.head(3)

I don’t want to keep you waiting, so first I show you the output, then explain what happened.

Here’s the output:

sklearn cluster kmeans model pandas
  • from sklearn.cluster import KMeans imports the K-means clustering algorithm,
  • KMeans(n_clusters=3) saves the algorithm into kmeans_model, where n_clusters denotes the number of clusters we’d like to create,
  • kmeans_model.fit_predict(df_kmeans) clusters our customers into one of the three clusters, and then the cluster labels are saved to clusters,
  • df_kmeans.insert(df_kmeans.columns.get_loc("Age"), "Cluster", clusters) creates a new column (Cluster) holding the previously created clusters’ values (aka what cluster our model assigned to each customer in our dataset); df_kmeans.columns.get_loc("Age") is responsible for inserting Cluster as the first column.

If you’re curious, df_kmeans.Cluster.unique() will show you the labels for the clusters:

kmeans cluster unique

And that’s how you create a K-means clustering model. 🙂

Two more important things to know:

  1. Now you should get an expert who can help you interpret what each cluster might mean.
  2. You may wonder if three was the best choice for K… fortunately, there’s a method called elbow to determine the optimal number of clusters for our model. Read on to learn more about it.

Finding the ideal number of clusters with the elbow method

The so-called elbow method is a common way to find the ideal number of clusters within a dataset.

It’s called elbow, because:

  1. it takes the possible values of K that you choose,
    1. for example, beginning from two clusters and going up to eight clusters the values for K would be: 2, 3, 4, 5, 6, 7, 8,
  2. calculates the sum of squared distances (SSD for short) for every value of K (=number of clusters),
    1. SSD is calculated by measuring, within each cluster, how far the data points are from the cluster centroid, then adding up the results for all clusters; think of it as something that shows the variance of your data (fun fact: as the number of clusters increases, the SSD decreases),
  3. creates a line plot where the x-axis contains the values that K can take on, and the y-axis shows the SSD (=variance) for each K (=number of clusters).

Once the above steps have been carried out, our task is to identify the elbow point (=the number for K), after which there are no more sudden drops in the line plot (=SSD is not significantly reduced, so there’s no need to add more clusters).

You’ll immediately get what I’m talking about by looking at this chart:

the elbow method in pandas

The solution here is five; after it, there seem to be no more sudden drops, so the ideal number for this dataset would be five.


We’ll apply the elbow method to our dataset with the below code:

import matplotlib.pyplot as plt

ssd = []
for k in range(2, 9):
    kmeans_model = KMeans(n_clusters=k)
    kmeans_model.fit(df_kmeans)
    ssd.append(kmeans_model.inertia_)

plt.figure(figsize=(6, 4), dpi=100)
plt.plot(range(2, 9), ssd, color="green", marker="o")
plt.xlabel("Number of clusters (K)")
plt.ylabel("SSD for K")
plt.show()

Here’s the breakdown of the code:

  • ssd is a list of the calculated SSD values for each value of K,
  • we set a minimum and a maximum value for K with range(2, 9), and loop through them (for k in range(2, 9)),
  • at each loop, we create a K-means clustering model for k (kmeans_model = KMeans(n_clusters=k)),
  • then we fit the model (kmeans_model.fit(df_kmeans)),
  • and add its calculated SSD (ssd.append(kmeans_model.inertia_)) to ssd,
  • and finally, we visualize it with the rest of the code.

Now you should see a chart like this:

number of clusters in scikit learn

I’d say that the ideal number of clusters for this dataset is four instead of three. So we can rerun the code with K being equal to four, but I won’t do that now, because there’s still one important aspect that we need to touch on.

What if we don’t see an elbow in the chart? 🤔

Yes, it can happen. It’s not guaranteed that you’ll always get an elbow.

Why?

Because life isn’t always sunshine and puppies, and data science is hard.

But we’re tough, ain’t we? 😎

Whenever you encounter such difficulties, start googling. You’ll be encouraged by seeing others having the same problems – like here, here, or here.

The thing is, there’s always a solution. If you don’t see an elbow point, the solution could be to look at the silhouette score.

Or maybe not. 😉
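
If you do want to try the silhouette route, here’s a minimal sketch (it assumes the df_kmeans DataFrame from earlier in this article); the higher the score, the better separated the clusters:

from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

scores = {}
for k in range(2, 9):
    labels = KMeans(n_clusters=k).fit_predict(df_kmeans)
    # silhouette_score ranges from -1 to 1; higher means better-separated clusters
    scores[k] = silhouette_score(df_kmeans, labels)

print(scores)
print("best K by silhouette:", max(scores, key=scores.get))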

Conclusion

Okay, the last sentence from the previous section may have sounded too distressing – but I didn’t mean it that way! Oftentimes, data science can be challenging, but if you take one step at a time, it can be conquered.

And you did take an important step by following this tutorial because you’ve learned about a new model. If you’re curious about other machine learning models, just click here to see what we’ve got for you.

And don’t forget – if you get stuck with a problem, just Google it! 😉

Cheers,

Tamas Ujhelyi

How to Import Data into SQL Tables
https://data36.com/how-to-import-data-into-sql-tables/ (Sun, 10 Jul 2022)

Following the previous article about creating data tables in SQL, now we want to load data into our freshly created SQL table. In this article, I’ll show you three different import methods:

  1. When you want to load the data line by line.
  2. When you want to insert the data from a .csv file.
  3. When you add rows to your new SQL table that are the results of another SQL query.

Note: This is going to be a practical tutorial, so I encourage you to do the coding part with me.

Note 2: If you are new here, let’s start with these SQL articles first:

  1. How to install Python, SQL, R, and Bash (for non-devs)
  2. How to install SQL Workbench for PostgreSQL
  3. SQL for Data Analysis – Tutorial for Beginners – ep 1
  4. How to create a table in SQL


Method #1: Load the Data Line by Line (INSERT INTO)

When we have only a few lines of data, the easiest way is to add them manually. We can do this by using the INSERT SQL statement:

Let’s get back to our test_results table that we created in the previous tutorial.

create table postgresql extra parameters

Currently, it’s an empty table. Let’s change that — and add a line to it using INSERT:

INSERT INTO test_results
VALUES
('Walt', 1, '1980-12-01', 95.50, 'A', TRUE);

Excellent — that inserted a new row into our SQL table.

But let’s see the result and query our table!

SELECT * FROM test_results;

load data postgresql - insert values 1 query

Oh, yeah! Walt’s test results are in the SQL table, indeed!

While this is a very manual process, you can speed it up if you INSERT the rest of the students with one bigger SQL statement:

INSERT INTO test_results
VALUES
('Jesse', 2, '1988-02-11', 74.00, 'C', TRUE),
('Todd', 3, '1987-06-13', 60.00, 'D', TRUE),
('Tuco', 4, '1970-11-11', 15.50, 'F', FALSE),
('Gus', 5, '1975-08-08', 80.00, 'B', TRUE)
;

Query the table once again to check out the results:
SELECT * FROM test_results;

load data postgresql - insert values 2 query

Now, we have 5 students’ data loaded into this sweet SQL table. That was easy as pie, right?

Now I want you to spend a few more seconds reviewing the syntax:

  1. INSERT INTO is the SQL keyword.
  2. test_results is the name of the table that we want to put the data into.
  3. VALUES is another SQL keyword.
  4. Then the actual data rows come one by one – each of them between parentheses (()) and separated by commas (,).
  5. The field values are also separated by commas (,).
  6. Watch out for the data points whose data types are TEXT or DATE — these data points have to go between apostrophes (') when you write your SQL query!
  7. And never forget the semicolon (;) at the end of your SQL statement!

If you are more of the visual type, here’s your cheat sheet:

import data postgresql - insert into values syntax

Commit your changes!

As we have discussed in the previous article, if you make changes in your database with an SQL manager tool (like pgAdmin4 or SQL Workbench), you have to COMMIT them. Always! What does that mean? Learn more here.
But for now, let’s just run this one extra line in your SQL manager:

COMMIT;

import data postgresql - commit

Note: If you turned auto-commit on or if you are in the command line and not in an SQL query tool, then you can skip this commit step.

SQL TRUNCATE: empty your table without deleting the table

You have already learned about the DROP TABLE SQL statement that deletes your SQL table. But very often you don’t want to delete your table (because you want to keep its structure), only clear the data from it. You can do this by using the TRUNCATE TABLE statement.

Type this:
TRUNCATE TABLE test_results;
This will delete all the rows that we have inserted in the table before, but it will keep the table itself.

Don’t forget that you have to commit your changes!

import data postgresql - truncate table

COMMIT;

Note: more about emptying an SQL table here: SQL TRUNCATE TABLE and DROP TABLE tutorial.

Okay, if everything is set, let’s see the…

Method #2: insert a .csv file into an SQL table (COPY)

To be honest, this is a more common scenario than the first method I showed. As a data analyst, you quite regularly get raw data sets in file formats, like .xlsx or .csv or .txt. You can insert these data files using the COPY statement.

The general format of the statement looks like this:

COPY table_name FROM '/path/step/file_name' DELIMITER ' ';

Let me break it down for you:

  1. COPY is the SQL keyword that specifies that you’ll insert data from a file into an SQL table.
  2. table_name is the name of the table that you want to put the data into. (This is a bit counter-intuitive in the syntax… But we know that SQL is not the most “coder-friendly” tool syntax-wise. So just get over it and simply learn it this way.)
  3. FROM is another SQL keyword after that you’ll…
  4. …specify the name and the location of the file that you want to COPY the data from. This goes between apostrophes.
  5. And eventually, you have to specify the field separator in your original file by typing DELIMITER and the field separator itself between apostrophes. So in this case ' ' means that the delimiter would be a space.

Example for COPY (insert .csv data into SQL)

Let’s go through the whole process with an example.

Note: in this example, I’ll help you to create a dummy .csv file. If you have your .csv file, you can just skip STEP #1, #2, and #3.

STEP 1) First, you have to open your Terminal window and connect to your data server.
(Note: At this point, I assume you know how to do it – if not, this way please.)

import data postgresql - command line

STEP 2) Then type this (just copy-paste from here) into the command line:

echo "Walt,1,1980-12-01,95.50,A,TRUE
Jesse,2,1988-02-11,74.00,C,TRUE
Todd,3,1987-06-13,60.00,D,TRUE
Tuco,4,1970-11-11,15.50,F,FALSE
Gus,5,1975-08-08,80.00,B,TRUE" > test_results.csv

This will create a .csv file called test_results.csv.
(In real-life cases you will get this .csv file from someone at your company.)

load data postgresql - create the csv file

STEP 3) Double-check your new file: cat test_results.csv.
And find out the exact location of it by typing pwd.

STEP 4) Then you have to log in to PostgreSQL (still in your Terminal window):
psql -U [your_sql_username] -d postgres
(For me it’s psql -U dataguy -d postgres)

STEP 5) Then type the COPY statement we have just discussed above:
\COPY test_results FROM '/home/dataguy/test_results.csv' DELIMITER ',';
And boom, the data is inserted from our freshly created .csv file into our SQL table.

import data postgresql copy command line
command line: copy the content of the .csv file

You can even query it from your SQL manager tool to double-check it:

load data postgresql insert csv to SQL results
SQL Workbench: check the results

The Junior Data Scientist's First Month

A 100% practical online course. A 6-week simulation of being a junior data scientist at a true-to-life startup.

“Solving real problems, getting real experience – just like in a real data science job.”

A few comments on the .csv data-load method

  • I typed \COPY and not just COPY because my SQL user doesn’t have SUPERUSER privileges, so technically I could not use the COPY command (this is an SQL thing). Typing \COPY instead is the simplest workaround — but the best solution would be to give yourself SUPERUSER privileges then use the original COPY command. (In this video starting at 2:55 I show how to give SUPERUSER privileges to your SQL user. If you are here from one of my online courses, probably we have already fixed this issue in the course.)
  • Why didn’t we run the COPY command in our SQL manager tool? Same reason: if you don’t have SUPERUSER privileges, you can’t run the COPY command from an SQL manager tool — only from the command line. If you follow the video that I linked in the previous point, you will be able to run the same COPY statement from pgAdmin or SQL Workbench.
  • The '/home/dataguy/test_results.csv' is the location and the name of the file, together. Again, we found out the location by using the pwd command.
    import data postgresql path 2
  • And finally: if you are uncomfortable with these command-line steps, read the first few articles from my Command Line for Data Analysts article series.

And boom, the data is inserted from a .csv file into our SQL table.
Run this query from your SQL manager:

insert csv data postgresql results test

SELECT * FROM test_results;

Awesome!

Method #3: Insert the output of another SQL query into your SQL table

Do you want to store the output of your SQL query? Not a problem… Maybe you want to save your daily KPIs that are calculated from SQL tables — or you want to have the cleaned version of a data set next to the original. In SQL, you can do that easily.

Say we want to create a table that stores only the names from our test_results table. (This is a dummy example, but it’ll do the job for now.)

Step 1) Create this new SQL table:

CREATE TABLE student_names
(
name TEXT
);

Step 2)
Use the INSERT INTO statement (that we learned in the “Method #1” section, at the beginning of this article), only instead of typing the values manually, put a SELECT statement at the end of the query.

Something like this:

INSERT INTO student_names
(SELECT name FROM test_results);

Done!
The subquery between parentheses will run first — then its output will be inserted automatically into the recently created student_names table.

Check the result:
SELECT * FROM student_names;

load data postgresql insert into table

You can even combine this method with SQL functions. For instance, you can calculate the average of the test results and then save that info into a new table. Something like this:

CREATE TABLE test_averages
(
test_average DECIMAL
);

Then:

INSERT INTO test_averages
(SELECT AVG(test_result) FROM test_results);

This new SQL table will store only one value: the average test result… but if we also had math test results, biology test results, and physics test results in other SQL tables, this test_averages table would be the perfect place to collect the different averages.

insert into table function

This was the third – slightly more advanced – way to insert your data into an SQL table. Now go ahead and test these methods on your own data set!

Conclusion

In this article we learned three methods to load data into SQL tables:

  1. When you want to INSERT your data manually. (INSERT INTO ___ VALUES (____);)
  2. When you want to COPY your data from a file. (COPY ____ FROM '_____' DELIMITER ' ';)
  3. When you want to store the output of another SQL query. (INSERT INTO ____ (SELECT ____);)

Now, you know how to create new tables in SQL and how to load data into them!

Cheers,
Tomi Mester


SQL functions (SUM, COUNT, AVG, MIN, MAX) and GROUP BY | SQL for Data Analysis Tutorial, ep3
https://data36.com/sql-functions-beginners-tutorial-ep3/ (Sat, 25 Jun 2022)

In this article, I’ll show you the most essential SQL functions that you will use for calculating aggregates — such as SUM, AVG, COUNT, MAX, MIN — in a data set. Then I’ll show you how these work with the GROUP BY clause, so you will be able to use SQL for segmentation projects. (E.g. you’ll learn how to calculate averages with GROUP BY in SQL.) Eventually, you’ll learn some intermediate SQL moves using ORDER BY and DISTINCT.

The good news: these things don’t change too much over time. SQL functions and GROUP BY were the same in 2012 as they are today in 2022. So if you learn them now, you’ll be good with that knowledge at least till 2032. 😉

Anyways, you’ll learn a lot of new stuff here… so buckle up — because you have to know all these to efficiently use SQL for data analysis! Oh, and this is going to be super exciting, as we will still use our 7M+ line data set!

Note: to get the most out of this article, you should not just read it, but actually do the coding part with me!

Before we start…

…I recommend going through these articles first – if you haven’t done so yet:

  1. Set up your own data server: How to set up Python, SQL, R, and Bash (for non-devs)
  2. Install SQL Workbench to manage your SQL queries better: How to install SQL Workbench for PostgreSQL
  3. Read the first two episodes of the SQL for Data Analysis series: ep 1 and ep 2
  4. Make sure that you have the flight delays data set imported – and if you don’t, check out this article.


SQL functions to aggregate data

Okay, let’s open SQL Workbench and connect to your data server!

Can you recall our base query?
It was:

SELECT *
FROM flight_delays
LIMIT 10;

And it returned the first 10 lines of this huge data set.

SQL base query

We are going to modify this query to get answers to 5 important questions:

  • How many lines are there in our SQL table? (We’ll use the SQL COUNT function for that.)
  • What’s the total airtime of all the flights on our table? (That’ll be an SQL SUM function.)
  • What’s the average of all arrival delays in the table – and what’s it for all the departure delays? (SQL AVG function.)
  • What’s the maximum distance value in our SQL table? (SQL MAX function.)
  • What’s the minimum distance value in our SQL table? (SQL MIN function.)

Getting the answers to all these questions is going to be very easy, I promise. But again: make sure you are doing the coding part with me. Coding is the easiest to learn by doing it. So please spare no effort at this point: type in everything you see here into your SQL manager, too, and build a solid foundation of knowledge!

Okay, let’s see this!

SQL COUNT function. Let’s count lines!

The easiest aggregation function is to count lines in your SQL table. And this is what the COUNT function is for. The only thing you have to change – compared to the above base query – is what you SELECT from your table. Remember? It can be everything (*), or it can be specific columns (arrdelay, depdelay, etc). Now, let’s expand this list with functions. Copy this query into SQL Workbench and run it:

SELECT COUNT(*)
FROM flight_delays
LIMIT 10;
SQL COUNT function

The result is: 7275288.

The function itself is called COUNT, and it says to count the lines using every column (*)… You can change the * to any column’s name (e.g. arrdelay) – and you will get the very same number. Try this:

SELECT COUNT(arrdelay)
FROM flight_delays
LIMIT 10;
SQL COUNT function column

Right? Same result: 7275288.

So yes, this means that we have 7275288 lines in our flight_delays table.

Note 1: This is true only when you don’t have NULL values (empty cells) in your table! (We don’t have NULL values in the flight_delays data set at all.) I’ll get back to the importance of NULL later.
Note 2: in fact, you won’t need the LIMIT clause in this SQL query, as you will have only one line of data on your screen. But I figured that sometimes it might be better to keep it there, so even if you mistype something, your SQL Workbench won’t freeze by accidentally trying to return 7M+ lines of data.

SQL SUM function. Calculate sum!

Now we want to get the airtime for all flights – added up. In other words: get the sum of all values in the airtime column. The SUM function works with a similar logic as COUNT does. The only difference is that in the case of SUM you can’t use * — you’ll have to specify a column. In this case, it’ll be the airtime column.

Try this query:

SELECT SUM(airtime)
FROM flight_delays;

The total airtime is a massive 748015545 minutes.

SQL SUM function

SQL AVG function. Calculate averages… I mean the *mean*.

Our next challenge is to calculate the average arrival delay value and the average departure delay value. It’s important to know that there are many types of statistical averages in mathematics. But in everyday life, when we say “average”, we usually mean the type called the mean. (A quick reminder: the mean is calculated by summing all values in a dataset, then dividing by the number of values.)

In SQL, the function called AVG (which of course stands for “average”) returns the mean… so it’s exactly the average type we expect.

Note: well, I have to add that many data scientists find it a bit lazy and ambiguous that in SQL the general word “average” (AVG) is used for one specific average type: the mean. And they are right! Median and mode are also averages. In Python/pandas, for example, the function to calculate the mean is actually called mean — and then there is another one called median to calculate the median. That’s much more coherent. Well, like it or not, in SQL we have *AVG* for the mean.
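
Just to illustrate the pandas side of that note with a tiny, made-up list of delay values:

import pandas as pd

delays = pd.Series([0, 2, 5, 7, 120])  # made-up delays, in minutes
print(delays.mean())    # 26.8 - the kind of "average" that SQL's AVG returns
print(delays.median())  # 5.0 - another kind of average, much less sensitive to outliers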

The syntax and the logic are the same as it was for the previous two SQL functions.

You can try it out by running this query:

SELECT AVG(depdelay)
FROM flight_delays;
SQL AVG function depdelay

The result is 11.36.

But of course, you’d get the exact same value if you typed:

SELECT SUM(depdelay)/COUNT(depdelay)
FROM flight_delays;

But let’s not run that far ahead… Instead of that, let’s calculate the average arrdelay value, too:

SELECT AVG(arrdelay)
FROM flight_delays;

Result: 10.19

Cool!

SQL MAX and MIN functions. Let’s get maximum and minimum values.

And finally, let’s find the maximum and the minimum values of a given column. Finding the maximum and minimum distances for these flights sounds interesting enough. SQL-syntax-wise, MIN and MAX operate just like SUM, AVG and COUNT did.

Here’s the minimum distance:

SELECT MIN(distance)
FROM flight_delays;

Result: 11 miles. (Man, maybe take the bike next time.)

SQL MIN function

And here’s the maximum distance:

SELECT MAX(distance)
FROM flight_delays;

Result: 4962

SQL MAX function

Okay! That was it – these are the basic SQL functions you have to know.

  • COUNT
  • SUM
  • AVG
  • MAX
  • MIN

It wasn’t that difficult so far, so it’s time to tweak this a little bit…

Introducing the GROUP BY clause!

SQL GROUP BY – for basic segmentation analysis and more…

SQL GROUP BY – the theory

As a data scientist, you will probably run segmentation projects all the time. For instance, it’s interesting to know the average departure delay of all flights (we have just learned that it’s 11.36). But when it comes to business decisions, this number is not actionable at all.

However, if we turn this information into a more useful format – let’s say we break it down by airport – it will instantly become something we can act on!

Here’s a simplified chart showing how SQL uses GROUP BY to create automatic segmentation based on column values:

SQL GROUP BY average example

The process has three important steps:

STEP 1 – Specify which columns you want to work with as an input. In our case, we want to use the list of the airports (origin column) and the departure delays (depdelay column).

STEP 2 – Specify which column(s) you want to create your segments from. For us, it’s the origin column. SQL automatically detects every unique value in this column (in the above example these were airport 1, airport 2, and airport 3). Then it creates groups (segments) from them and sorts each line from your data table into the right group.

STEP 3 – Finally it calculates the averages (using the SQL AVG function) for each and every group (segment) and returns the results on your screen.

The only new thing here is the “grouping” at STEP 2. We have an SQL clause for that. It’s called GROUP BY. Let’s see it in action.

SQL GROUP BY – in action

Here’s a query that combines an SQL AVG function with the GROUP BY clause — and does the exact thing that I described in the theory section above:

SELECT
  AVG(depdelay),
  origin
FROM flight_delays
GROUP BY origin;
SQL average GROUP BY example

Fantastic!

If you scroll through the results, you will see that there are some airports with an average departure delay of more than 30 or even 40 minutes. From a business perspective, it’s important to understand what’s going on at those airports. On the other hand, it’s also worth taking a closer look at how the good airports (depdelay close to 0) are managing to reach this ideal phase. (Okay, I know, the business case is over-simplified, but you get the point.)

But what just happened SQL-wise?

We have selected two columns – origin and depdelay. origin has been used to create the segments (GROUP BY origin). depdelay has been used to calculate the averages of the departure delays in these segments (AVG(depdelay)).

Note: As you can see, the logic of SQL is not as linear as it was for Python, pandas, or bash. If you write an SQL query, the first line of it could highly rely on the last line. When you write long and complex queries, this might cause some unexpected errors and thus of course a little headache too… But that’s why I find it very, very important to give yourself enough time to practice the basics and make sure that you fully understand the relationships between the different clauses, functions, and other stuff in SQL.

The Junior Data Scientist's First Month

A 100% practical online course. A 6-week simulation of being a junior data scientist at a true-to-life startup.

“Solving real problems, getting real experience – just like in a real data science job.”

Test yourself #1 (SQL SUM + GROUP BY)

Here’s a little assignment to practice! Let’s try to solve this task and double-check that you understand everything so far! It’s simple:
Print the total airtime by month!
.
.
.
Ready?
Here’s my solution:

SELECT
  month,
  SUM(airtime)
FROM flight_delays
GROUP BY month;
[Screenshot: SQL test yourself #1 results]

I did pretty much the same stuff that I have done before, but now I’ve created the groups/segments based on the months – and this time I had to use the SUM function instead of AVG.

Test yourself #2 (SQL AVG + GROUP BY)

And here’s another exercise:
Calculate the average departure delay by airport again, but this time use only those flights that flew more than 2000 miles (you will find this info in the distance column).
.
.
.
Here’s the query:

SELECT
  AVG(depdelay),
  origin
FROM flight_delays
WHERE distance > 2000
GROUP BY origin;
[Screenshot: SQL AVG + GROUP BY results]

There are two takeaways from this assignment.

  1. You might have suspected this but now it’s confirmed: you can use the SQL WHERE clause with GROUP BY and SQL functions.
  2. You can filter with WHERE even on columns that are not part of your SELECT statement.

SQL ORDER BY – to sort the data based on the value of one (or more) column(s)

Let’s say we want to see which airport was the busiest in 2007.

You can get the number of departures by airport really easily using the COUNT function with the GROUP BY clause, right? We have done this before in this article:

SELECT
  COUNT(*),
  origin
FROM flight_delays
GROUP BY origin;
[Screenshot: SELECT results without ORDER BY]

The problem: this list is not sorted by default… To sort it, you need to add one more SQL clause: ORDER BY. When you use it, you always have to specify which column you want to order by… It’s pretty straightforward:

SELECT
  COUNT(*),
  origin
FROM flight_delays
GROUP BY origin
ORDER BY count;

Note: the column produced by the COUNT function is a new column… And it has to have a name – so SQL automatically names it “count” (check the latest screenshot above). When you refer to this column in your ORDER BY clause, you have to use this new name. I’ll get back to this in my next article in detail. If you find it weird, try the same query with ORDER BY origin instead – and you will understand it instantly.
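By the way, if the auto-generated name feels confusing, you can also name the column yourself with AS and order by that alias. Here’s a quick sketch (still assuming our flight_delays table and PostgreSQL, which lets you reference a column alias in ORDER BY) – the result is the same, only the column name is friendlier:

SELECT
  COUNT(*) AS departures,
  origin
FROM flight_delays
GROUP BY origin
ORDER BY departures;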

[Screenshot: SQL ORDER BY with GROUP BY results]

Hm, almost there. But the problem is that the least busy airport is on the top – in other words, we got a list in ascending order. That’s the default for ORDER BY (ascending order is the standard SQL default, not just a PostgreSQL thing). But you can change this to descending order by simply adding the DESC keyword at the end!

SELECT
  COUNT(*),
  origin
FROM flight_delays
GROUP BY origin
ORDER BY count DESC;
[Screenshot: SQL ORDER BY clause with DESC]

Excellent! Just what we wanted to see!

SQL DISTINCT — to get unique values only

This is the last new thing for today. And this will be short and sweet.

If you are curious how many different airports are in your table:

a) you can find it out using the GROUP BY clause. (Can you figure out how? :-))
b) you can find it out even more easily by using DISTINCT

DISTINCT removes all duplicates. Try this:

SELECT DISTINCT origin
FROM flight_delays;

Now you have unique airports!

[Screenshot: SQL DISTINCT results]

By the way, the GROUP BY version would look like this:

SELECT origin
FROM flight_delays
GROUP BY origin;

Though result-wise it’s pretty much the same, the preferred way to do this is to use the DISTINCT syntax. (When writing more complex queries, DISTINCT will help you to keep your query simpler… But I’ll get back to this in a later article.)
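And if you need the actual number of different airports rather than the list itself, you can combine the two things we have just learned – COUNT and DISTINCT. A small sketch, still on the same flight_delays table:

SELECT COUNT(DISTINCT origin)
FROM flight_delays;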

Test yourself #3

Today you have learned a ton of small but useful stuff. I’ll give you one more assignment that will summarize pretty much everything – even the previous two articles (ep 1 and ep 2). This is going to be a difficult one, but you can do it! If it doesn’t work, try to break it down into smaller tasks, then build and test your query until you get the result.
The task is:

List the:

  • top 5 planes (identified by the tailnum)
  • by the number of landings
  • at PHX or SEA airport
  • on Sundays

(e.g. if the plane with the tail number 'N387SW' landed 3 times in PHX and 2 times in SEA on Sundays in 2007, then it has a total of 5. And we need the top 5 planes with the highest totals.) Ready? Set! Go!
.
.
.
Done? Here’s my solution:

SELECT
  COUNT(*),
  tailnum
FROM flight_delays
WHERE dayofweek = 7
  AND dest IN ('PHX', 'SEA')
GROUP BY tailnum
ORDER BY count DESC
LIMIT 5;
[Screenshot: SQL test yourself #3 results]

And some explanation:

  • SELECT –» select…
  • COUNT(*), –» This function counts the number of rows in a given group; to do that it needs the GROUP BY clause later.
  • tailnum –» This will help to specify the groups (referred to in the GROUP BY clause later).
  • FROM flight_delays –» the name of the table, of course
  • WHERE dayofweek = 7 –» a filter for Sundays only
  • AND dest IN ('PHX', 'SEA') –» filter for PHX and SEA destinations only
  • GROUP BY tailnum –» This is the clause that helps us to put the lines into different groups by tailnumbers.
  • ORDER BY count DESC –» and let’s order by the number of lines in a given group
  • LIMIT 5; –» list only the top 5 elements.

Conclusion

And that’s it! You have learned a lot today – SQL aggregate functions (MIN, MAX, COUNT, SUM, AVG), GROUP BY and two more important SQL clauses (DISTINCT and ORDER BY).

If you managed to get the last exercise done by yourself, I can tell you that you have a really good basic knowledge of SQL! Congrats! If not, don’t worry, just make sure that you re-read these first 3 chapters (ep 1, ep 2, ep 3) before you continue with episode 4!

Cheers,
Tomi Mester


The post SQL functions (SUM, COUNT, AVG, MIN, MAX) and GROUP BY | SQL for Data Analysis Tutorial, ep3 appeared first on Data36.

Pandas groupby(), count(), sum() and Other Aggregation Methods (Pandas Tutorial 2.) https://data36.com/pandas-tutorial-2-aggregation-and-grouping/ (published Sat, 18 Jun 2022)

Let’s continue with the pandas tutorial series! This is the second episode, where I’ll introduce pandas aggregation methods — such as count(), sum(), min(), max(), etc. — and the pandas groupby() function. These are very commonly used methods in data science projects, so if you are an aspiring data scientist, make sure you go through every detail in this article… because you’ll use these probably every day in real-life projects.

Note 1: This is a hands-on tutorial, so I recommend doing the coding part with me!

Before we start

If you haven’t done so yet, I recommend checking out these articles first:

  1. How to install Python, R, SQL, and bash to practice data science
  2. Python for Data Science – Basics #1 – Variables and basic operations
  3. Python Import Statement and the Most Important Built-in Modules
  4. Top 5 Python Libraries and Packages for Data Scientists
  5. Pandas Tutorial 1: Pandas Basics (Reading Data Files, DataFrames, Data Selection)

How to Become a Data Scientist
(free 50-minute video course by Tomi Mester)

Just subscribe to the Data36 Newsletter here (it’s free)!

Aggregation – in theory

Aggregation is the process of turning the values of a dataset (or a subset of it) into one single value. Let me make this clear! If you have a pandas DataFrame like…

animal      water_need
zebra       100
lion        350
elephant    670
kangaroo    200

…then a simple aggregation method is to calculate the sum of the water_need values, which is 100 + 350 + 670 + 200 = 1320. Or a different aggregation method would be to count the number of values in the animal column, which is 4. The theory is not too complicated, right?
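If you want to try this toy example yourself, here’s a minimal sketch (the only assumption is that you have pandas installed – the DataFrame is typed in by hand from the table above):

import pandas as pd

# the small example DataFrame from the table above
animals = pd.DataFrame({
    'animal': ['zebra', 'lion', 'elephant', 'kangaroo'],
    'water_need': [100, 350, 670, 200]
})

print(animals['water_need'].sum())   # 1320
print(animals['animal'].count())     # 4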

So let’s see the rest in practice!

Pandas aggregation methods (in practice)

Where did we leave off last time? We opened a Jupyter notebook, imported pandas and numpy and loaded two datasets: zoo.csv and article_read. We will continue from there – so if you have no idea what I’ve just talked about in my previous sentence, move over to this article: pandas tutorial – episode 1!

If you are familiar with the basics, for your convenience, here are the datasets we’ll use again:

Ready? Cool!

Let’s start with our zoo dataset! (If you want to download it again, you can find it at this link.) We have loaded it by using:

pd.read_csv('zoo.csv', delimiter = ',')
[Screenshot: pd.read_csv output for zoo.csv]

Let’s store this dataframe into a variable called zoo.

zoo = pd.read_csv('zoo.csv', delimiter = ',')
[Screenshot: the zoo DataFrame stored in a variable]

To learn the basic pandas aggregation methods, let’s do five things with this data:

  1. Let’s count the number of rows (the number of animals) in zoo!
  2. Let’s calculate the total water_need of the animals!
  3. Let’s find out which is the smallest water_need value!
  4. And then the greatest water_need value!
  5. And eventually the average water_need!

Note: for a start, we won’t use the groupby() method, but don’t worry, I’ll get back to it once we have gone through the basics.

#1 pandas count()

The most basic aggregation method is counting. To count the number of the animals is as easy as applying a count pandas function on the whole zoo dataframe:

zoo.count()
[Screenshot: pandas count() example]

That’s interesting. “What are all these lines?” – you might ask…

Actually, the pandas .count() function counts the number of values in each column. In the case of the zoo dataset, there were 3 columns, and each of them had 22 values in it.

If you want to make your output clearer, you can select the animal column first by using one of the selection operators (that we learned about in the previous article). Something like this:

zoo[['animal']].count()
[Screenshot: count() on a selected column]

Or in this particular case, the result could be even nicer if you use this syntax:

zoo.animal.count()

This also selects only one column, but it turns our pandas dataframe object into a pandas series object. And the count function will be applied to that. (Which means that the output format is slightly different.)

[Screenshot: count() on a pandas Series]
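If you want to double-check the difference between the two objects yourself, here’s a tiny sketch (same zoo DataFrame as above):

type(zoo[['animal']])   # pandas.core.frame.DataFrame
type(zoo.animal)        # pandas.core.series.Series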

#2 sum() in pandas

Following the same logic, you can easily sum the values in the water_need column by typing:

zoo.water_need.sum()
[Screenshot: pandas sum() on the water_need column]

Just out of curiosity, let’s run our .sum() function on all columns, as well:

zoo.sum()
[Screenshot: pandas sum() on all columns]

Note: I love how .sum() turns the words of the animal column into one string of animal names. (By the way, it’s very much in line with the logic of Python.)

Pandas Data Aggregation #3 and #4: min() and max()

How to make pandas return the smallest value from the water_need column?

I bet you have figured it out already:

zoo.water_need.min()
[Screenshot: pandas min() output]

And getting the max value works pretty similarly:

zoo.water_need.max()
[Screenshot: pandas max() output]

#5: averages in pandas: mean() and median()

Eventually, let’s calculate statistical averages, like mean and median!

The syntax is the same as it was with the other aggregation methods above:

zoo.water_need.mean()
[Screenshot: pandas mean() output]
zoo.water_need.median()
[Screenshot: pandas median() output]

Okay, this was easy, right? Pandas aggregation methods are much, much easier than SQL’s, for instance.

So it’s time to spice this up — with a little bit of grouping! Introducing the groupby() function!

The Junior Data Scientist's First Month

A 100% practical online course. A 6-week simulation of being a junior data scientist at a true-to-life startup.

“Solving real problems, getting real experience – just like in a real data science job.”

The pandas groupby() function (aka. grouping in pandas)

As a data scientist, you will probably do segmentations all the time. For instance, it’s nice to know the mean water_need of all animals (we have just learned that it’s 347.72). But very often it’s much more actionable to break this number down – let’s say – by animal types. With that, we can compare the species to each other. (Do lions or zebras drink more?) Or we can find outliers! (Elephants drink a lot!)

Here’s a simple visual showing how pandas performs “segmentation” – with groupby and aggregation:

[Chart: pandas aggregation with groupby explained]

It’s just grouping similar values and calculating the given aggregate value (in the above example it was a mean value) for each group.

Pandas groupby() – in action

Let’s do the above-presented grouping and aggregation for real, on our zoo DataFrame!
We have to fit a groupby() method in between our zoo variable and our .mean() function:

zoo.groupby('animal').mean()
[Screenshot: pandas groupby() example]

Just as before, pandas automatically runs the .mean() calculation for every remaining column (the animal column is not among them anymore, since that’s the column we grouped by – it became the index of the result). You can either ignore the uniq_id column or you can remove it afterward by using one of these syntaxes:

zoo.groupby('animal').mean()[['water_need']] –» This returns a DataFrame object.

zoo.groupby('animal').mean().water_need –» This returns a Series object.

[Screenshot: groupby() with a single column selected]
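A third option – just a sketch, using the same zoo DataFrame – is to select the column before the aggregation, so pandas computes the mean only for the column you actually care about:

zoo.groupby('animal')['water_need'].mean()

This returns a Series object, too, just like the .water_need version above.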

Obviously, you can change the aggregation method from .mean() to any of the methods we learned above!

Let’s see one more example and combine pandas groupby and count!

Pandas groupby() and count()

I just wanted to add this example because it’s the most common operation you’ll do when you discover a new dataset. Using count and groupby together is just as simple as the previous example was.

Just type this:

zoo.groupby('animal').count()

And magically the different animals are counted by pandas:

[Screenshot: pandas groupby() + count() output]

Okay! Now you know everything you have to know!
It’s time to…

Test yourself #1 (another count + groupby challenge)

Let’s get back to our article_read dataset.

(Note: Remember, this dataset holds the data of a travel blog. If you don’t have the data yet, you can download it from here. Or you can go through the whole download-open-store process step by step by reading the previous episode of this pandas tutorial.)

[Screenshot: count + groupby in pandas example]

If you have everything set, here’s my first assignment:

What’s the most frequent source in the article_read dataframe?
.
.
.
And the solution is Reddit!

How did I get it? Use this code:

article_read.groupby('source').count()

I’ll break it down for you:

  1. Take the article_read dataset!
  2. Use groupby() and create segments by the values of the source column!
  3. And eventually, count the values in each group by using .count() after the groupby() part.
[Screenshot: pandas count() + groupby() output]

You can – optionally – remove the unnecessary columns and keep the user_id column only, like this:

article_read.groupby('source').count()[['user_id']]

Test yourself #2

Here’s another, slightly more complex challenge:

For the users of country_2, what was the most frequent topic and source combination? Or in other words: which topic, from which source, brought the most views from country_2?
.
.
.
The result is the combination of Reddit (source) and Asia (topic), with 139 reads!
And the Python code to get this result is:

article_read[article_read.country == 'country_2'].groupby(['source', 'topic']).count()
[Screenshot: pandas groupby() + count() output]

Here’s a brief explanation:

  1. First, we filtered for the users of country_2 with article_read[article_read.country == 'country_2']
  2. Then on this subset, we applied a groupby pandas method… Oh, did I mention that you can group by multiple columns? Now you know that, too! 😉 (Syntax-wise, watch out for one thing: you have to put the names of the columns into a list. That’s why there are square brackets inside the parentheses.) (That was the groupby(['source', 'topic']) part.)
  3. And as per usual: the count() function is the last piece of the puzzle.
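By the way, if you only want the winning combination itself rather than the whole counts table, one possible shortcut (a sketch, using the same article_read DataFrame as above) is to take the group sizes with .size() and ask for the index of the largest one:

article_read[article_read.country == 'country_2'].groupby(['source', 'topic']).size().idxmax()
# returns the (source, topic) pair with the most reads -- Reddit and Asia in this dataset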

Conclusion (pandas groupby, count, sum, min, max, etc.)

This was the second episode of my pandas tutorial series. Now you see that aggregation and grouping are not too hard in pandas… and believe me, you will use them a lot!

Note: If you have used SQL before, I encourage you to take a break and compare the pandas and the SQL methods of aggregation. With that, you will understand more about the key differences between the two languages!

In the next article, I’ll show you the four most commonly used pandas data wrangling methods: merge, sort, reset_index and fillna. Stay with me: Pandas Tutorial, Episode 3!

Cheers,
Tomi Mester


PS. For a detailed guide, check out pandas’ official guide (it’s regularly updated – last refreshed in 2022), here.

The post Pandas groupby(), count(), sum() and Other Aggregation Methods (Pandas Tutorial 2.) appeared first on Data36.

SQL Interview Questions: 3 Tech Screening Exercises For Data Analysts (in 2022) https://data36.com/sql-interview-questions-tech-screening-data-analysts/ (published Tue, 31 May 2022)

I’ve been part of many job interviews – on both sides of the table. The most fun, but also the most feared, part of the process is the technical screening. In this article, I’ll show you three SQL test exercises that, in my experience, are quite typical in data analyst job interviews — as of 2022. (And hey, these are “sample” SQL interview questions but they are heavily based on reality!)

Before the tasks – What can you expect in an SQL technical screening?

There are two common ways an SQL tech screening can be done.

[Image: SQL interview questions tech screening]

The simpler but less common way is that you get a computer, a data set, and a task. While you are solving the task, the interviewers are listening and asking questions. A little trial-and-error is totally fine, as long as you can come up with the correct solution in a reasonable amount of time.

The other, more difficult (and by the way much more common) way is the whiteboard interview. In this case, you don’t get a computer. You have to solve the task and sketch up the code on a whiteboard. This means that you won’t get feedback (at least not from a computer) on whether you made a logical or a syntax error in your code. Of course, you can still solve the tasks by thinking iteratively. You can crack the different SQL problems one by one… But you have to be very confident with your SQL skills.

Additionally, usually, you have to solve the tasks on the fly. Maybe you will get 3-5 minutes of thinking time but that’s the maximum you can expect.

I know, this sounds stressful. And it is. But don’t worry, there is some good news, as well. Because companies know that this is a high-stress interview type, you will get tasks that are relatively simpler than the real-life challenges. (See the difficulty level below!)

SQL tech assessments in 2022

There are several types of SQL tech assessments. The one that I described above (and for that, I’ll provide a few exercises below) is the most common one. When people say “SQL tech screening,” they usually refer to that. To be more precise, I like to call it “in-person SQL screening.”

But, in fact, there are four different types of SQL assessments:

  1. In-person SQL screening. The one that we discussed so far (and will discuss in the rest of the article).
  2. SQL quiz questions. For example: “What is a primary key?” Or “List the different types of JOINs!” That’s a stupid type of SQL tech assessment — as it focuses on theory and not on practice. Still, some companies… you know.
  3. Take-home SQL assignment. You get a more complex task and you’ll have to write multiple SQL queries to solve it. The upside is that you can work from home, as you get the task and the dataset by email. You get these on a workday you choose, and you’ll have ~12 hours to solve the task and send the solution back (SQL queries and a short presentation). I like this assessment type, as it creates a less stressful environment for the applicant.
  4. Automated SQL screening. With the rise of remote work, automated SQL screening becomes more common. It’s usually a one-hour process with a few simpler SQL tasks – that you can solve from home via a browser. This interview type is not very personal, but I like it as it’s less stressful and more flexible (e.g. you can skip tasks and go back later).

When someone asks you to do an “SQL tech screening,” either of the above can come up. Still, the most common is the in-person SQL screening. So let’s see a few examples of that!

Test yourself!

Here are three SQL interview questions that are really close to ones I actually got or gave in data analyst/scientist job interviews!

Try to solve all of them as if they were whiteboard interviews!

In the second half of the article, I’ll show you the solutions, too!

How to Become a Data Scientist
(free 50-minute video course by Tomi Mester)

Just subscribe to the Data36 Newsletter here (it’s free)!

SQL Interview Question #1

Let’s say you have two SQL tables: authors and books.
The authors dataset has 1M+ rows. Here’s a small sample, the first six rows:

author_name    book_name
author_1       book_1
author_1       book_2
author_2       book_3
author_2       book_4
author_2       book_5
author_3       book_6

The books dataset also has 1M+ rows and here’s the first six:

book_name    sold_copies
book_1       1000
book_2       1500
book_3       34000
book_4       29000
book_5       40000
book_6       4400

Create an SQL query that shows the TOP 3 authors who sold the most books in total!

(Note: Back in the day, I got almost this exact SQL interview question for a data scientist position at a very well-known Swedish IT company.)

SQL Interview Question #2

You work for a startup that makes online presentation software. You have an event log that records every time a user inserted an image into a presentation. (One user can insert multiple images.) The event_log SQL table looks like this:

user_id    event_date_time
7494212    1535308430
7494212    1535308433
1475185    1535308444
6946725    1535308475
6946725    1535308476
6946725    1535308477

…and it has over one billion rows.

Note: If the event_date_time column’s format doesn’t look familiar, google “epoch timestamp”!
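Just for context: in PostgreSQL, for instance, you could turn such a value into a readable date with the built-in to_timestamp() function (you won’t need this conversion for the task itself):

SELECT to_timestamp(1535308430);
-- e.g. 2018-08-26 18:33:50+00 (the exact display depends on your session's time zone)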

Write an SQL query to find out how many users inserted more than 1000 but less than 2000 images in their presentations!

(Note: I personally created and used this interview question to test data analysts when I was freelancing and my clients needed help in their hiring process.)

SQL Interview Question #3

You have two SQL tables!

The first table is called employees and it contains the employee names, the unique employee ids, and the department names of a company. Sample:

department_name    employee_id    employee_name
Sales              123            John Doe
Sales              211            Jane Smith
HR                 556            Billy Bob
Sales              711            Robert Hayek
Marketing          235            Edward Jorgson
Marketing          236            Christine Packard

The second SQL table is called salaries. It holds the same employee names and the same employee ids – and the salaries for each employee. Sample:

salary    employee_id    employee_name
500       123            John Doe
600       211            Jane Smith
1000      556            Billy Bob
400       711            Robert Hayek
1200      235            Edward Jorgson
200       236            Christine Packard

The company has 546 employees, so both tables have 546 rows.

Print every department where the average salary per employee is lower than $500!

(Note: I created this test question based on a real SQL interview question that I heard from a friend, who applied at one of the biggest social media companies (name starts with ‘F.’ ;))

Solution of SQL Interview Question #1

The solution code is:

SELECT authors.author_name, SUM(books.sold_copies) AS sold_sum
FROM authors
JOIN books
ON books.book_name = authors.book_name
GROUP BY authors.author_name
ORDER BY sold_sum DESC
LIMIT 3;

And here is a short explanation:

1. First you have to initiate the JOIN. I joined the two tables by using:

SELECT *
FROM authors
JOIN books
ON books.book_name = authors.book_name;

2. After that, I used a SUM() function with a GROUP BY clause. This means that in the SELECT statement I had to replace the * with the author_name and sold_copies columns. (It’s not mandatory to indicate from which table you are selecting the columns, but it’s worth it. That’s why I used authors.author_name and books.sold_copies.)

3. Eventually, I ORDERed the results in DESCending order. (Just for my convenience, I also renamed the sum column to sold_sum using the AS sold_sum method in the SELECT statement.)

Solution of SQL Interview Question #2

The solution SQL query is:

SELECT COUNT(*) FROM
  (SELECT user_id, COUNT(event_date_time) AS image_count
  FROM event_log
  GROUP BY user_id) AS image_per_user
WHERE image_count < 2000 AND image_count > 1000;

The trick in this task is that you had to use the COUNT() function two times: first, you had to count the number of images per user, then the number of users (who fulfill the given condition). The easiest way to do that is to use a subquery.

  1. Write the inner query first! Run a simple COUNT() function with a GROUP BY clause on the event_log table.
  2. Make sure that you create an alias for the subquery (AS image_per_user). It’s a syntax requirement in SQL.
  3. Eventually, in an outer query, apply a WHERE filter and a COUNT() function on the result of the subquery.

Solution of SQL Interview Question #3

Solution:

SELECT department_name, AVG(salaries.salary) AS avg_salaries
FROM employees
JOIN salaries
ON employees.employee_id = salaries.employee_id
GROUP BY department_name
HAVING AVG(salaries.salary) < 500;

Note: You can solve this task using a subquery, too – but in an interview situation the interviewer will like the above solution better.
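For the record, such a subquery version could look something like this (just a sketch – the avg_salaries and department_averages names are my own):

SELECT department_name, avg_salaries
FROM
  (SELECT employees.department_name,
          AVG(salaries.salary) AS avg_salaries
  FROM employees
  JOIN salaries
  ON employees.employee_id = salaries.employee_id
  GROUP BY employees.department_name) AS department_averages
WHERE avg_salaries < 500;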

Brief explanation:

1. First JOIN the two tables:

SELECT *
FROM employees
JOIN salaries
ON employees.employee_id = salaries.employee_id

Watch out! Use the employee_id column – not the employee_name. (You can always have two John Does at a company, but the employee id is unique!)

2. Then use the AVG() function with a GROUP BY clause — and replace the * with the appropriate columns. (Just like in the first task.)

3. And the last step is to use a HAVING clause to filter by the result of the AVG() function. (Remember: WHERE is not good here because it would be evaluated before the AVG() function – that is, before the groups and their averages even exist.)
Watch out: in the HAVING line, you can’t refer to the alias – you have to use the whole function itself again!

Prepare for SQL tech screenings by practicing!

If you solved all these questions properly, you are probably ready for a junior or even a mid-level Data Analyst SQL technical screening.

If not, let me recommend my new online course: SQL for Aspiring Data Scientists (7-day online course) – where you can level up (or brush up) your SQL skills in only 7 days. When you finish the course, just come back to this article and I guarantee that you will be able to solve these questions!

And if you are just about to start with SQL, start with my SQL For Data Analysis series on the blog!

And ultimately, if you feel that you are ready for a junior data scientist position but you want to try out how it works before you apply for a job, take my 6-week data science course:

The Junior Data Scientist's First Month

A 100% practical online course. A 6-week simulation of being a junior data scientist at a true-to-life startup.

“Solving real problems, getting real experience – just like in a real data science job.”

Conclusion

The hard part of these SQL interview questions is that they are abstract. The tasks say to “imagine the data sets” and show only a few lines of them. When you get an exercise like that, it helps a lot if you have seen similar datasets and solved similar problems before. I hope solving the tasks in this article will boost your confidence!

If you have questions or alternative solutions, don’t hesitate to send them in via email and I’ll review them for you!

Cheers,
Tomi Mester


The post SQL Interview Questions: 3 Tech Screening Exercises For Data Analysts (in 2022) appeared first on Data36.
