Machine Learning Simple Trick: Calculating the Importance of Variables (feature_importances_ and Friends)
Yesterday was the live final of our data science competition. One of our finalists pulled out a trick that I think everyone should take note of…

By the way, our top three finishers were:

  1. Roland Nagy 🏆
  2. Bernadett Horváth
  3. Tamás Berki

Congratulations and a big thank you to all three! I’ll be posting pictures from the event on my LinkedIn soon.

But let’s get back to the secret trick. (Well, it won’t be so secret after this post!)

What Makes a Dog Run Fast?

Just for context, let me quickly (and in a pretty simplified way) explain the problem that our contestants had to solve:

  • We have a large dataset containing information from dog relay races. (The specific sport is called Flyball: four dogs run as a team on a track, each taking turns. Multiple teams compete in a race, and the team of four that finishes fastest wins.)
  • The question is: What factors most influence the dogs’ running performance?

At first, the task seems simple.

We might start by examining whether bigger or smaller dogs run faster…

Then, we could look at which age group of dogs performs the best…

Or analyze which dog breed is the most skilled…

But then, the question starts to get more complicated:

  • What time of day does the team perform best?
  • Does the sequence of neutered vs. non-neutered dogs affect race time?
  • And does alternating between male and female dogs have an impact?
  • Does it matter if the dogs know each other in terms of their race time?
  • And let’s not forget to factor in the weather conditions…

The more analyses we run, the more new questions arise.

I’m sure our data science competition participants could tell you a lot about that.

But the truth is, this is almost exactly what happens in real-world business projects as well.

This is where a small Machine Learning technique can come in handy.

By the way, here’s a screenshot/snippet of the delightfully challenging dataset that had to be analyzed in the competition. (Thanks a lot to the Flyball Club of the Czech Republic for the data!)

Machine Learning Recap

Before I dive into the specific method, let’s quickly go over some basics of ML.

We typically turn to Machine Learning algorithms when:

  • We want to predict something (e.g., in a factory: “This machine is likely to break down within a month”)
    OR
  • We need to classify (e.g., in a SaaS business: “This non-paying user is very similar to paying users”)
    OR
  • We might want to do a bit of clustering (e.g., in a store: “These products are typically bought together by a certain customer group”)

(These are just a few examples, but I’ll dive deeper into this in a future blog post.)

The core idea behind Machine Learning projects (at least for most predictive and classification tasks) is that:

  1. we have a bunch of input variables
  2. and one output variable that we want to optimize for

Here’s a simple, everyday example:

SITUATION AND PROBLEM TO SOLVE:
You walk into a store, make your purchases, and want to decide which checkout line to join.

OUTPUT VARIABLE (what you’re optimizing for):
Time spent waiting in line. (You want this to be as short as possible, so you can leave the store quickly.)

INPUT VARIABLES (factors that could influence your output):

  • How many people are in each line
  • How many items are in the carts of the people ahead of you
  • The average age of the people in the line (I didn’t say this, the algorithm said it! 🙃)

If we were to solve this problem using a Machine Learning approach, we’d feed the algorithm, say, two years of past data, and it would calculate how much each input variable impacts the output. Based on that, it would estimate the waiting time for each line in the current situation, allowing us to choose the best one.

Note: Of course, I’m simplifying—there are a lot of nuances and challenges in ML, but I think this example works well as a comparison.
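To make the mechanics concrete, here’s a minimal sketch in Python (scikit-learn), with made-up numbers standing in for the two years of past data:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Hypothetical historical data: one row per past checkout-line experience.
# Columns: people in line, total items in their carts, average age in line.
X = np.array([
    [3, 25, 34],
    [5, 60, 41],
    [2, 10, 29],
    [6, 80, 52],
    [4, 35, 38],
    [1,  5, 25],
])
# Output variable: minutes spent waiting (made-up numbers).
y = np.array([6.0, 14.0, 3.0, 19.0, 8.5, 1.5])

model = RandomForestRegressor(n_estimators=100, random_state=0)
model.fit(X, y)

# Estimate the waiting time for the lines in front of us right now,
# then join the line with the smallest estimate.
lines_right_now = np.array([[4, 50, 45], [5, 30, 30]])
print(model.predict(lines_right_now))
```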

The Importance of Input Variables

And here’s the key takeaway that we can bring into any analysis project from a Machine Learning (ML) project.

Often, the greatest business value isn’t knowing which specific line to join, but rather understanding how each input variable impacts the output.

(Often, it’s not the “what?” but the “why?” that’s most interesting.)

Going back to the dog example: Roland, the winner of our DS competition, told me during the coffee break that although he mostly worked with basic retrospective analyses, he did use a bit of Machine Learning to check whether the algorithm identified the same important variables for a dog’s running time as he did. This is what he got:

Of course, this was just a small step in his otherwise very complex analysis, but it definitely caught my attention (and I know the judges liked it too).

And how cool is it in any real-life data project when you get a chart like this? It helps point you in the right direction when dealing with a seemingly endless dataset.

A Business Example

Let me show you another example.

On my blog (data36.com), I wrote articles consistently for a while. Naturally, I was curious to see which articles had the biggest impact on later conversions (in my case, course purchases).

I had the data, so I quickly ran my own analysis:

The method is quite similar to Roland’s—except, in my case, each column represents the reading of a specific article. The articles are coded as numbers in the chart; for example, column 39 represents my article “Learning Data Science (4 Untold Truths).” This article had the largest impact on whether a reader would later become a customer. (Interestingly, the next three articles on the list were also introductory and similar in topic.)

This quick and simple insight helped me a lot in deciding what new articles to write – and also in identifying which ones to promote more heavily to my audience through my international newsletter, ads, etc.

What Methods Are Available for Evaluating Variable Importance?

There are many ways to perform this kind of analysis on the importance and weights of input variables in Python.

Several methods and implementations exist, each with its own strengths. They differ slightly from one another (and can produce different results) – plus, not all methods can be applied to every model.

  • In the competition, Roland used the SHAP methodology/library. (LINK)
  • In my article example, I ran a function called feature_importances_ (LINK)
  • I’ve also seen senior data professionals use a method called Partial Dependence Plots (PDP) (LINK)

So, there are plenty of solutions, each suited for different cases and backed by different math.
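To give you a feel for the simplest option, here’s a minimal sketch of feature_importances_ on a tree-based model in scikit-learn. It uses a built-in toy dataset rather than the Flyball data, purely to show the mechanics:

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

# Any labeled dataset works here; this built-in one is just for demonstration.
data = load_breast_cancer()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = data.target

model = RandomForestClassifier(n_estimators=200, random_state=0)
model.fit(X, y)

# feature_importances_ gives one weight per input variable (they sum to 1).
importances = pd.Series(model.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False).head(10))
```

SHAP needs its own library (shap), and Partial Dependence Plots live in sklearn.inspection; they answer related but slightly different questions, which is exactly why the documentation is worth reading before you trust the numbers.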

I had ChatGPT create a handy little table to help you get started: here.

Summary

I think this input variable analysis is a super useful trick. I even use it in projects where I wouldn’t necessarily apply machine learning in the traditional sense.

To wrap things up, let me add a general data science reminder:

Although you’ll see that running one of these solutions in Python can be done easily in just a few lines (especially if you already have your ML model), it’s crucial to deeply understand what each method is actually telling you about the importance of the variables. This will prevent any misinterpretation. (The little table from ChatGPT above is a good starting point, but it’s also worth diving deeper into the documentation of these methods.)

That’s all for today!
I hope you found it interesting!

Cheers,
Tomi Mester

Data Sources: Three Free Data Collection Methods for Data Science Projects
“Where can I get data from?”

That’s a totally relevant question, especially if:

  • you’re building a data science hobby or side project,
    OR
  • you want to expand your company’s research project with external data sources,
    OR
  • something along those lines… 🙃

So, in this post, I’ll quickly and concisely gather a few options that can be a good starting point. I’ll expand this list in the future and eventually turn it into a proper library of resources.

Let me show you three popular methods.

There are more, but these are the three most commonly used:

  1. Downloading public datasets
  2. Web scraping
  3. APIs

Let’s go through them one by one.

(1) Downloading Public Datasets

There are a few websites where the creators simply gather, upload, and make a large number of datasets searchable. These vary in quality, but with thorough searching, you can find some real gems. The only downside is that these datasets are usually not “live,” meaning they don’t update regularly. So, you can typically only analyze a fixed period from the past. However, this is often enough—especially for hobby projects.

Here’s the list:

(2) Web Scraping

Web scraping is essentially the process of gathering data from public websites.

It’s like visiting a webpage and manually collecting the data on it (e.g., collecting how many stars each movie has on IMDB). But that process is repetitive, boring, and time-consuming… So, instead of doing it yourself, one of Python’s web scraping packages does it for you. (I mostly use BeautifulSoup.)
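Just to show the shape of it, here’s a minimal sketch with requests and BeautifulSoup; the URL and the tags you’d target are placeholders, since they depend entirely on the site you’re scraping:

```python
import requests
from bs4 import BeautifulSoup

# Placeholder URL -- swap in the page you actually want to scrape
# (and check its robots.txt / terms of use first).
url = "https://example.com/"
html = requests.get(url, timeout=10).text

soup = BeautifulSoup(html, "html.parser")

# As a quick demo, collect the text and target of every link on the page;
# in a real project you'd target the specific tags/classes that hold your data.
for a in soup.find_all("a", href=True):
    print(a.get_text(strip=True), "->", a["href"])
```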

I’ve created a 20-minute Python tutorial that demonstrates how this works.

It’s in English, and we use it to find out who the most popular Marvel superhero is by scraping data from Wikipedia in just a few simple steps:

* Is web scraping legal? If a website explicitly prohibits scraping, then of course you should not scrape it. Where it’s not explicitly forbidden, the legality is more ambiguous. This isn’t legal advice (consult your own lawyer), but I did some research on the matter, and different sources offer different opinions. The best generally applicable guideline I found is the principle of “fair use.” Fair use is a somewhat tricky legal category to define, but it roughly means that if you’re creating new and unique value without harming the original data owner’s interests, web scraping can potentially be legal. Again: this is not legal advice.

(3) APIs

A lot of online applications make some of their data accessible through API connections.

Examples:

  • Spotify API: You can get data about songs and artists (e.g., play count, popularity, etc.)
  • Coinbase API: You can get cryptocurrency data (e.g., current and historical prices)
  • Weather API: You can access weather data (e.g., current and past temperatures, precipitation, etc., based on location)

These API connections provide data directly from the application owners in a structured format. So, it’s guaranteed to be legal, high-quality, and live data.

Note: By “structured format,” I mean JSON, which can essentially be converted into a Python dictionary. This might seem intimidating at first, but if you’ve completed something like the Junior Data Scientist Academy, you’ll have no trouble extracting the data you need. Here’s an example of what it might look like:
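For instance, here’s a minimal sketch using the public Coinbase spot-price endpoint (assuming it’s still live in this form; any JSON API works the same way). requests fetches the response and .json() turns it into a Python dictionary:

```python
import requests

# Public, no-authentication endpoint (an assumption -- check the current docs).
url = "https://api.coinbase.com/v2/prices/BTC-USD/spot"
data = requests.get(url, timeout=10).json()  # JSON -> Python dictionary

# The response looks roughly like:
# {"data": {"base": "BTC", "currency": "USD", "amount": "67950.01"}}
print(data["data"]["amount"])
```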

Drawback: You do need to write Python code for this – though that’s not really a drawback in itself. The real issue is that the documentation for these APIs often has a bit of a “by developers, for developers” vibe… 😅 How can I put this politely? Let’s just say user-friendliness isn’t exactly the strength of these guides.

But no worries, I’ve got a demo video for this too, where I walk through the concept using the Coinbase API and the Weather API:

Context: This video was created for the internal competition of the Data Science Club, so you’ll notice a few references to that along the way.

There’s even more…

Collecting data from external sources is an endless topic with endless possibilities. 🙂

In today’s post, I wanted to highlight that there are tons and tons of free datasets available these days, so don’t let a lack of data be the thing that holds your project back!

(Sometime in the future, I plan to write specifically about internal data collection within companies… I’m just not sure how many people would be interested in that topic. If you’re one of them, feel free to drop me an email—I’d appreciate it.)

Cheers,
Tomi Mester

Aspiring Data Scientists! Start to learn Statistics with these 6 books!
Statistics is difficult. Of course it is: it’s most of the actual science part in data science. But that doesn’t mean you can’t learn it by yourself if you’re smart and determined enough.

In this article, I am going to list 6 books that I recommend starting with if you want to learn statistics. The first three are lighter reads. These books are really good for setting your mind to think more numerically, mathematically and statistically. They also do a good job of presenting why statistics is exciting (it is!).

The second three books are more scientific — with formulas and Python or R codes.

Don’t get intimidated though! Mathematics is like LEGO: if you build the small pieces up right, you won’t have trouble with the more complex parts either!

Let’s see the list!

1. You Are Not So Smart — by David McRaney

When I first saw the title, I loved it already! This is a very well written book, containing many stories — and everything in it is based on real experiments and real scientific research.

David McRaney introduces one sad but true fact of life: that our brain constantly tricks us and we are not even smart enough to realize it. For an aspiring data scientist, this book is essential, because it lists many common statistical bias types. It points out classic mistakes like the self-serving bias, the availability heuristic, and the confirmation bias. It also shows why people tend to be tricked by fake news or scams and why people don’t always help when seeing someone having a heart attack on a busy street. Being aware of these biases should be basic, but I see even practicing data professionals fall for them from time to time…

(I wrote a detailed article about Statistical Bias Types.)

2. Think Like a Freak — by Dubner & Levitt

The previous book was about why we are not so smart. But this one is about how to be smarter! Think Like a Freak shows us how critical and unconventional thinking can lead to huge success… and, hey, that’s something that as a data scientist, you should practice every day.

The book lists a bunch of case studies from everyday life, goes into details and analyzes why a solution for a problem is good or bad. Reading it will definitely boost your analytical thinking.

3. Innumeracy — by John Allen Paulos

If you hated mathematics in middle or high school, it was for one reason: you had a bad teacher. A good teacher turns mathematical equations into mystical puzzles, probability theory into detective stories, and linear algebra into the ultimate solution for all the big questions in life. Luckily, I had really good math teachers, so I was always generally excited by mathematics and statistics. Looking back, this really affected my life.

If you didn’t have a good math teacher, John Allen Paulos is here to make up the loss for you: he’s the awesome teacher you wish you’d had. Innumeracy focuses mostly on one specific segment of statistics: probability theory and calculations. It explains the math behind it, shows the formulas and puts everything into a very logical context. And it does it by showing the real life applications of these calculations, so you can immediately understand the advantage of being more math-minded.

The Junior Data Scientist's First Month

A 100% practical online course. A 6-week simulation of being a junior data scientist at a true-to-life startup.

“Solving real problems, getting real experience – just like in a real data science job.”

4. Naked Statistics — by Charles Wheelan

This book is the perfect transition between the previous light-read statistics books and the next two more scientific ones. Reading it, you can easily understand basic concepts like mean, median, mode, standard deviation, variance, and standard error, or the more advanced things like the central limit theorem, normal distribution, correlation analysis or regression analysis.

Almost needless to say, all of these are packed into metaphors for ease of understanding.

5. Practical Statistics for Data Scientists — by Andrew & Peter Bruce (2nd edition)

This book contains everything that a Junior Data Scientist has to know about the practical part of statistics. In my opinion, the biggest advantage of the book is the structure. It really makes it clear how things are built on top of each other. But it also goes into detail on the most common prediction and classification models — and it talks a bit about Machine Learning and Unsupervised Learning too.

The second edition of the book comes with Python code examples, too. (If you don’t know Python, that’s not a problem; you can simply skip those parts.)

6. Think Stats — by Allen B. Downey

Topic-wise, Think Stats is really similar to Practical Statistics for Data Scientists. I wanted to have it on the list, though, because even if the topic is the same, different writers usually approach things differently. On a topic as complex as data science, I think it’s worth looking at different angles and having things explained by two different data professionals.

Plus, this is a book from 2011. It’s good to see how much the interpretation of even these standard concepts has changed in just six years.

Oh, and I almost forgot to mention that Think Stats is available for free in PDF format, here: http://greenteapress.com/thinkstats/

And that’s it!

By reading these 6 books you can get a solid understanding of Statistics for Data Science!

Cheers,
Tomi Mester

Working Efficiently vs. Effectively with Python and Data Science
Yesterday 9 a.m. to 4 p.m. I worked on my data science project with complete focus.

No distractions.

Deep work.

Just as one wishes to work when reaching for high efficiency. And indeed, I churned out ~300 lines of Python code to crack a very challenging NLP classification problem. The result was good: the model achieved about 90% accuracy. That’s pretty good, actually! However, for this specific task, I needed closer to 99%.

So I sat down again, later the same day, around 11 p.m.

And I realized that there is a waaay better solution to the issue! I completely deleted my original script and replaced it with ~50 lines of Python — which I wrote in roughly an hour.

Now, the model reaches the desired 99% accuracy!

9am-4pm: I worked efficiently.
11pm-midnight: I worked effectively.
There is a difference.

P.S. The funny part is that it felt painful to delete my 9-to-4 code, even though I knew the second solution would be far superior. I wish I had thought of the second solution at 9 a.m., finished it by 10 a.m., and taken the rest of the day off. I guess I need to remind myself that mistakes are just stepping stones. ¯\_(ツ)_/¯

Cheers,
Tomi Mester

My Last Decade of Data Science Hobby Projects
Data science hobby projects I’ve built so far:

🏚 2012 – “Rent Scraper” — A tool to find me the cheapest apartment to rent by scraping several websites every minute. (Bash, Python)

🗞 2013 – “Prionews” — A system to summarize daily news in an unbiased manner. (Bash, SQL, Python)

🤖 2014 – “Jarvic” — A self-learning chatbot. (Python)

🎨 2015 – “Write-Here-Anything” — A hard-to-describe website and art project. (JavaScript)

🌐 2016 – “Learn-Languages” — A simple script that prints the top 1,000 most important words to learn in any language, based on the script of the Friends TV show. (Python)

🖥 2017 – “User-Generator” — A Python script that simulates log/data creation for a mobile app, intended for educational purposes. (Python)

👬 2018 – “A/B Testing Redirect” — Code to implement an A/B test without using third-party tools. (JavaScript + Python)

📈 2019 – “Simple User Log” — A basic analytics tool designed to replace Google Analytics. (JavaScript + Python + Flask)

👨‍🏫 2020 – “Best Bet” — A game that educates people about the concept of “expected value.” (Python + Flask + HTML)

☘ 2021 – “Automated Gardener” — A hardware project that automatically takes care of my plants. (Python + Bash + Raspberry Pi)

💰 2022 – “BitPanda_DCA” — A simple automation tool that performs dollar-cost averaging on the Bitpanda platform for me. (Python + Bash)

🥃 2023 – “WhiskyReturns” — A platform that collects data on whisky investments and displays them in a simple chart. (Python, Bash, SQL, HTML, APIs, etc.)

Most of these projects are retired and offline, but they’ve been invaluable in teaching me about data science and coding. Building a hobby project is never a waste of time. You should start yours!

Cheers,
Tomi Mester


Margin of Error: Formula, Examples, Calculation
Margin of error is one of the most important statistical concepts to know when you look at the results of online surveys or polls. If you don’t know it, you can easily misinterpret the results and make false conclusions. On the other hand, if you know it, you can be more confident how to use your results. Today I’ll explain this important concept for you!

Margin of error: a practical example

Let’s start with a story!

The HR Department of a company runs the same survey every year.

The company has 3000 employees, and HR received 550 responses in both 2021 and 2022 (for simplicity).

In the survey, there is a question about salary satisfaction and the results are:

2021: 85% are satisfied with the salary.

2022: 89% are satisfied with the salary.

When the HR Manager presents the results, she states that salary satisfaction clearly increased.

But she is not necessarily right.

Why?

The HR manager forgot to mention the Margin of Error.

  • What is the Margin of error?
  • How to calculate it?
  • When and how to use it?

You will get the answers in this blog post – and you’ll see how the HR manager could have done better. Let’s dig in!

What is the Margin of error?

The Margin of Error (MoE) is a statistical concept that helps to measure the uncertainty of a survey or poll result. It is typically expressed as a percentage or number.

Surveys and polls are usually based on samples: smaller groups selected from the larger population. Since the sample is not the entire population, there is some uncertainty or variability in the results. A large margin of error means the estimate is uncertain and could be far from the true population value (often a sign of a small sample). A small margin of error indicates that the sample estimate is reliable, and we can be more confident in its accuracy.

From the Margin of Error, we can calculate the confidence interval

The margin of error gives us the confidence interval. It indicates the range within which the true population value likely lies. The width of the range will be two times the margin of error. It has two boundaries:

  • Lower bound = observed score – MoE
  • Upper bound = observed score + MoE

Note: remember “MoE” stands for Margin of Error.

margin of error and confidence interval

Note: These are the values for our original example, so if you’re observant, you probably already have an idea why the HR Manager was mistaken.

There are several factors that affect the margin of error.

The most important of these is the sample size. In general, the larger the sample size, the smaller the margin of error will be. This is because a larger sample provides more information about the population, and therefore it is more likely to produce an accurate estimate.

Confidence level

There is a big difference between confidence interval and confidence level. 

Did you notice the word ‘likely’ in the definition of the confidence interval? 

When we are working with samples, we cannot be entirely sure that the sample is a perfect representation of the population, hence we cannot be certain that the confidence interval is correct all the time. We can describe this uncertainty (or certainty) with probabilities. 

The confidence level tells you how confident you can be that the confidence interval is correct and it will include the true score for the population.

In business, the standard confidence level is 95%, while some medical or scientific studies use 99%. In principle, it can be any value between 0 and 100%.

For example, a 95% confidence level means that 95% of the time, the corresponding confidence interval will include the true score.

You may ask: Why don’t we use 99% all the time? We want to be sure that our calculation is correct.

There is a logical answer to this question: the higher the confidence level, the larger the margin of error will be. If you want to be more certain that the true score lies in the confidence interval, you need to widen that interval. The MoE at a 95% confidence level will therefore be larger than at a 90% confidence level.

typical confidence levels

Z-score

The z-score is a statistical measure that represents the number of standard deviations a value is from the mean of a data set.

To compute the z-score we need to have information on the mean and standard deviation of the complete population. When we have no information on these (just like in our example), we can assume a normal distribution and use the standard z-scores for calculations.

normal distribution standard deviation and z-scores
source: Wikipedia

In a normal distribution, about 95% of values fall within ±1.96 standard deviations of the mean (the 68-95-99.7 rule is the rounded version of this). This interval is the confidence interval at a 95% confidence level.

The z-table lists the z-scores for each standard deviation from the mean, and you can use it to look up the z-score for a given confidence level. For example, the z-score for the 95% confidence level is 1.96.

Here are the most used z-scores:

Confidence Level | Z-Score
---------------- | -------
80%              | 1.28
90%              | 1.64
95%              | 1.96
99%              | 2.58
99.9%            | 3.29
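If you’d rather compute these than look them up, scipy reproduces the table: a two-sided confidence level cl corresponds to the 1 − (1 − cl)/2 quantile of the standard normal distribution.

```python
from scipy.stats import norm

for cl in [0.80, 0.90, 0.95, 0.99, 0.999]:
    z = norm.ppf(1 - (1 - cl) / 2)  # two-sided z-score for this confidence level
    print(f"{cl:.1%} -> z = {z:.2f}")

# 80.0% -> z = 1.28
# 90.0% -> z = 1.64
# 95.0% -> z = 1.96
# 99.0% -> z = 2.58
# 99.9% -> z = 3.29
```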

The Junior Data Scientist's First Month

A 100% practical online course. A 6-week simulation of being a junior data scientist at a true-to-life startup.

“Solving real problems, getting real experience – just like in a real data science job.”

Margin of Error Formulas — it’s time to calculate MoE!

We have different formulas for different scenarios. 

  1. When we calculate with numbers
  2. When we calculate with proportions
  3. When we have a small population size

In the first two cases, the general formula assumes an infinite (or very large) population with a normal distribution; the only difference between them is whether we work with numbers or proportions.

Sometimes we need to work with smaller populations and sample sizes, just like in our original HR example. In these cases, we need to adjust our calculations with the sample size, relative to the population.

Margin of Error formula #1

Let’s see an example for case #1:

You are conducting a survey to determine the average height of adult men in the United States. You collect data from a random sample of 500 men and find that the average height is 68 inches with a standard deviation of 3 inches. What is the margin of error for the survey results at a 95% confidence level?

In this example, we need to use this formula:

MoE = z × (σ / √n)

where σ is the standard deviation and n is the sample size.

The z-score for a 95% confidence level is 1.96, the standard deviation is 3 inches, and the sample size is 500 men. Plugging these values into the formula, we get: 1.96 × 3 / √500 ≈ 0.26.

This means that we can be 95% confident that the true average height of adult men in the United States is between 67.74 inches (68 – 0.26) and 68.26 inches (68 + 0.26).
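You can sanity-check this in a couple of lines of Python:

```python
import math

z, sd, n = 1.96, 3, 500
print(round(z * sd / math.sqrt(n), 2))  # 0.26
```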

Margin of Error formula #2

Here is another example for case #2, with proportions:

You run a survey to determine the proportion of people who prefer dogs over cats. You collect data from a random sample of 1000 people and find that 600 of them prefer dogs, while 400 prefer cats. What is the margin of error for the survey results at a 90% confidence level?

To solve this we need another formula:

MoE = z × √( p × (1 − p) / n )

where p is the sample proportion and n is the sample size.

In this formula, sample proportion means the proportion who chose a particular option. In this case, the proportion of people who like dogs is 600/1000 = 0.6

The z-score for a 90% confidence level is 1.645, the proportion of people who prefer dogs is 0.6 (600 out of 1000), and the sample size is 1000 people. Plugging these values into the formula, we get: 1.645 × √(0.6 × 0.4 / 1000) ≈ 0.025.

This means that we can be 90% confident that the true proportion of people who prefer dogs over cats is between 0.575 (0.6 − 0.025) and 0.625 (0.6 + 0.025). So between 57.5% and 62.5%.

Margin of Error formula #3

And case #3 (small population):

Now let’s see our original HR survey example, where we have a relatively small population (3,000 employees) so we need to adjust the formula a little bit.

MoE = z × √( p × (1 − p) / n ) × √( (N − n) / (N − 1) )

where N is the population size; the extra square-root term is the finite population correction.

The z-score for a 95% confidence level is 1.96, the proportion of people who were satisfied with the salary is 0.85, the sample size is 550 people, and the population is 3000. Plugging these values into the formula, we get 0.02697, i.e. about 2.7%.

This means that the true value for the population in 2021 is 85% ± 2.7%, so between 82.3% and 87.7%.

Using the same formula for the second survey in 2022, we get a 2.36% MoE. The real population satisfaction is 89% ± 2.36%, so between 86.64% and 91.36%.

As you can see, the two confidence intervals overlap: salary satisfaction may even have decreased, for example from 87.7% in 2021 to 86.64% in 2022.
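To put all three formulas in one place, here’s a minimal sketch of a helper function (the names are my own, not from a library); the finite population correction simply drops out when no population size is given. It reproduces the HR numbers above:

```python
import math

def margin_of_error(z, n, p=None, sd=None, population=None):
    """MoE for a proportion (pass p) or a mean (pass sd).
    Pass population to apply the finite population correction."""
    if p is not None:
        moe = z * math.sqrt(p * (1 - p) / n)        # formula #2
    else:
        moe = z * sd / math.sqrt(n)                 # formula #1
    if population is not None:                      # formula #3 adjustment
        moe *= math.sqrt((population - n) / (population - 1))
    return moe

# The HR survey: 550 responses out of 3000 employees, 95% confidence (z = 1.96)
print(margin_of_error(1.96, 550, p=0.85, population=3000))  # ~0.0270 (2021)
print(margin_of_error(1.96, 550, p=0.89, population=3000))  # ~0.0236 (2022)
```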

Conclusion

It is important to use Margin of Error (MoE) when interpreting the results of surveys or polls, as it gives an indication of how reliable the sample estimate is. Without this information, it is easy to misinterpret the results.

Besides enabling clear communication and understanding of results, the MoE helps in decision-making by indicating how accurate the sample estimate is and how much uncertainty it carries. It also allows us to compare the results of different surveys or polls.

Sample size has the biggest impact on MoE. A larger sample size provides more information about the population and is therefore more likely to produce an accurate estimate.

You can also affect the outcome by adjusting the confidence level. As we discussed, increasing (or decreasing) the confidence level will increase (or decrease) the MoE, and hence widen (or narrow) the confidence interval.

Don’t be like the HR manager, and always communicate the uncertainty of your results!

Cheers,
Levi Kulcsar

The Junior Data Scientist’s First Month (Video Course)
100% practical 6-week data science challenge & video course — simulating being a junior data scientist at a true-to-life startup.

Learn Python 3 for Data Science – from scratch
I put together a ‘Python 3 for Data Science’ tutorial series starting from the very basics. It features 9 practical articles that cover everything you need to know!

Learn SQL for Data Analysis – from scratch
Learning SQL is very useful for anyone in the online world. And it’s a definite must for Data Analysts/Scientists! Here are 8 articles to start with…

Learn Data Analytics in Bash – from scratch
Read these 7 free articles about Bash/Command Line and take your very first step to learn the basics of coding for Data Science!
