Statistics Of Data Science | Detailed Guide

Sharing is caring!

“Data Scientist is a person who is better at statistics than any programmer and better at programming than any statistician.”

Introduction To Statistics

It is important to know the method beyond various Machine Learning algorithms in order to know how and when to use them. Now the question arises, what exactly is Statistics? In this statistics tutorial, we will learn statistics for data science. 

Statistics is a Mathematical Science concern to data collection, analysis, interpretation and presentation. 

Statistics for Data Science is used for multiplex problems in the real-life so that Data Scientists and Analysts can look for relevant trends and changes in Data. In simple words, Statistics can be used to derive meaningful discernment from data by performing mathematical calculations on it. Several Statistical functions, principles, and algorithms are implemented to analyze raw data. Moving ahead in this blog on Statistics for Data Science, let’s learn more about the categories in statistics. 

Read: An Introduction To Mobile App Development Process

CATEGORIES IN STATISTICS

1. Inferential Statistics

Inferential Statistics makes conjecture and divination about a population based on a representation of data taken from the population in question.

Inferential statistics derives a large data set and applies probability to transpire at a conclusion. It recognizes you to infer parameters of the population based on sample stats and erect models on it.

Inferential Statistics - Math And Statistics For Data Science - Edureka

Hypothesis test..

In the Hypothesis test, there are two kinds of Hypothesis.
H0 (Null Hypothesis: it is already an existing thing.

H1 (Alternate Hypothesis): this is what we have to prove.
 
In this Hypothesis testing we use 
"t-test","z-test","Anova","F-test","chi-square"
to prove wheather Ho is correct (or) H1 is correct.
data<-read.csv(file.choose(),header = T)   # file.choose() used to point that file
data
##    Names Age   ID Gender Height     Region    Profit Talkative
## 1 sujith  23 4567   male   5.11        A.P   41.9136       yes
## 2  kumar  24 3456   male   5.30     Andhra  219.5820        no
## 3  atanu  25 2153   male   5.50    Kolkata    6.8714        no
## 4  karum  27 2134   male   4.50 Pondicherry -383.0310       yes
## 5   ravi  26 3214   male   5.40  Hyderabad    2.5164       yes
## 6  pavan  25 2614   male   3.40 Vijayavada   14.1694        no
## 7 shyama  22 2874 female   5.00     Kerala    1.9656       yes
## 8  pream  24 1234   male   2.10 Tamil nadu   90.7152       yes
attach(data)
names(data)
## [1] "Names"     "Age"       "ID"        "Gender"    "Height"    "Region"   
## [7] "Profit"    "Talkative"
class(Profit)
## [1] "numeric"

2. Descriptive Statistic

Descriptive Statistics helps to assemble data and basis on the characteristics of data providing parameters.

Descriptive Statistics - Math And Statistics For Data Science - Edureka

Descriptive Statistics is an abbreviation of the data at hand through definite numbers like mean, median etc. so as to make the understanding of the information easier. It does not occupy any notion or inference beyond what is presented. This means that the descriptive statistics are just the representation of the data (sample) available and not based on any presumption of probability.

Read: Artificial Intelligence Tutorial | A Combination of Science with Engineering

Commonly Used Measures

  1. Measures of Central Tendency
  2. Measures of Dispersion (or Variability)
  3. Measures of Central Tendency

A Measure of Central Tendency is a one number summing up of the data that typically describes the middle of the data. This one number summary is of three types.

1. Mean

Mean is defined as the ratio of the sum of all the notes in the data to the total number of observations. This is also identified as Average. Thus mean is a number around which the complete data set is spread.

2. Median

Median is the point which divides the entire data into two equal halves. One-half of the data is less than the median, and the other half is larger than the same. Median is designed by first arranging the data in also ascending or descending order.

If the number of observations are odd, the median is given by the middle inspection in the short form.

If the number of observations are even, the median is specified by the mean of the two middle observations in the sorted form.

A vital point to note is that the order of the data (ascending or descending) does not affect the median.

3. Mode

Mode is the number which has the limit frequency in the entire data set, or in other words, mode is the number that appears the maximum number of times. A number can have one or more than one mode.

  • If there is only one number that appears maximum number of times, the data has one mode, and is called Uni-modal.
  • If there are two numbers that emerge maximum number of times, the data has two modes, and is called Bi-modal.
  • If there are more than two numbers that emerge maximum number of times, the data has more than two modes, and is called Multi-modal.

Read: Can I Use The PAMM System To Learn And Earn At The Same Time?

Terminologies 

Four giant terms in statistics are population, sample, parameter, and statistic:

  • A population is the intact group of individuals you want to study, and a sample is a subset of that group.
  • A parameter is a quantitative characteristic of the population that you’re fascinated in estimating or testing (such as a population mean or proportion).
  • A statistic is a quantitative characteristic of a sample with the intention of often helping estimate or test the population parameter (such as a sample mean or proportion).

Descriptive statistics are a single outcome you get when you analyze a set of data — for example, the sample mean, median, standard deviation, correlation, regression line, margin of error, and test statistic.

Statistical inference refers toward using your data (and its descriptive statistics) to formulate conclusions about the population. Major types of presumption include regression, confidence intervals, and hypothesis tests.

TYPE:

An analysis of several event can be done in one of two ways:

1. Quantitative Analysis

Quantitative Analysis or Statistical Analysis is the science of collecting and interpreting data with numbers and graphs to spot patterns and trends.

2. Qualitative Analysis

Qualitative or Non-Statistical Analysis gives generic information and uses text, sound and new forms of media to do so.

For illustration if I want to purchase a coffee from Starbucks, it is available in Short, Tall and Grande. This is an example of Qualitative Analysis. But if a lay up sells 70 regular coffees a week, it is Quantitative Analysis because we have a number representing the coffees sold per week.  Although  the function of both these analyses is to provide results, Quantitative analysis provides a clearer picture hence making it pivotal in analytics. This brings us to the end of the blog on Statistics For Data Science. We hope that you were able to learn statistics for data science with the help of this statistics tutorial. 

Leave a Comment