“Data Scientist is a person who is better at statistics than any programmer and better at programming than any statistician.”
Introduction To Statistics
It is important to know the method beyond various Machine Learning algorithms in order to know how and when to use them. Now the question arises, what exactly is Statistics? In this statistics tutorial, we will learn statistics for data science.
Statistics is a Mathematical Science concern to data collection, analysis, interpretation and presentation.
Statistics for Data Science is used for multiplex problems in the real-life so that Data Scientists and Analysts can look for relevant trends and changes in Data. In simple words, Statistics can be used to derive meaningful discernment from data by performing mathematical calculations on it. Several Statistical functions, principles, and algorithms are implemented to analyze raw data. Moving ahead in this blog on Statistics for Data Science, let’s learn more about the categories in statistics.
CATEGORIES IN STATISTICS
1. Inferential Statistics
Inferential Statistics makes conjecture and divination about a population based on a representation of data taken from the population in question.
Inferential statistics derives a large data set and applies probability to transpire at a conclusion. It recognizes you to infer parameters of the population based on sample stats and erect models on it.
In the Hypothesis test, there are two kinds of Hypothesis. H0 (Null Hypothesis: it is already an existing thing. H1 (Alternate Hypothesis): this is what we have to prove. In this Hypothesis testing we use "t-test","z-test","Anova","F-test","chi-square" to prove wheather Ho is correct (or) H1 is correct. data<-read.csv(file.choose(),header = T) # file.choose() used to point that file data ## Names Age ID Gender Height Region Profit Talkative ## 1 sujith 23 4567 male 5.11 A.P 41.9136 yes ## 2 kumar 24 3456 male 5.30 Andhra 219.5820 no ## 3 atanu 25 2153 male 5.50 Kolkata 6.8714 no ## 4 karum 27 2134 male 4.50 Pondicherry -383.0310 yes ## 5 ravi 26 3214 male 5.40 Hyderabad 2.5164 yes ## 6 pavan 25 2614 male 3.40 Vijayavada 14.1694 no ## 7 shyama 22 2874 female 5.00 Kerala 1.9656 yes ## 8 pream 24 1234 male 2.10 Tamil nadu 90.7152 yes attach(data) names(data) ##  "Names" "Age" "ID" "Gender" "Height" "Region" ##  "Profit" "Talkative" class(Profit) ##  "numeric"
2. Descriptive Statistic
Descriptive Statistics helps to assemble data and basis on the characteristics of data providing parameters.
Descriptive Statistics is an abbreviation of the data at hand through definite numbers like mean, median etc. so as to make the understanding of the information easier. It does not occupy any notion or inference beyond what is presented. This means that the descriptive statistics are just the representation of the data (sample) available and not based on any presumption of probability.
Commonly Used Measures
- Measures of Central Tendency
- Measures of Dispersion (or Variability)
- Measures of Central Tendency
A Measure of Central Tendency is a one number summing up of the data that typically describes the middle of the data. This one number summary is of three types.
Mean is defined as the ratio of the sum of all the notes in the data to the total number of observations. This is also identified as Average. Thus mean is a number around which the complete data set is spread.
Median is the point which divides the entire data into two equal halves. One-half of the data is less than the median, and the other half is larger than the same. Median is designed by first arranging the data in also ascending or descending order.
If the number of observations are odd, the median is given by the middle inspection in the short form.
If the number of observations are even, the median is specified by the mean of the two middle observations in the sorted form.
A vital point to note is that the order of the data (ascending or descending) does not affect the median.
Mode is the number which has the limit frequency in the entire data set, or in other words, mode is the number that appears the maximum number of times. A number can have one or more than one mode.
- If there is only one number that appears maximum number of times, the data has one mode, and is called Uni-modal.
- If there are two numbers that emerge maximum number of times, the data has two modes, and is called Bi-modal.
- If there are more than two numbers that emerge maximum number of times, the data has more than two modes, and is called Multi-modal.
Four giant terms in statistics are population, sample, parameter, and statistic:
- A population is the intact group of individuals you want to study, and a sample is a subset of that group.
- A parameter is a quantitative characteristic of the population that you’re fascinated in estimating or testing (such as a population mean or proportion).
- A statistic is a quantitative characteristic of a sample with the intention of often helping estimate or test the population parameter (such as a sample mean or proportion).
Descriptive statistics are a single outcome you get when you analyze a set of data — for example, the sample mean, median, standard deviation, correlation, regression line, margin of error, and test statistic.
Statistical inference refers toward using your data (and its descriptive statistics) to formulate conclusions about the population. Major types of presumption include regression, confidence intervals, and hypothesis tests.
An analysis of several event can be done in one of two ways:
1. Quantitative Analysis
Quantitative Analysis or Statistical Analysis is the science of collecting and interpreting data with numbers and graphs to spot patterns and trends.
2. Qualitative Analysis
Qualitative or Non-Statistical Analysis gives generic information and uses text, sound and new forms of media to do so.
For illustration if I want to purchase a coffee from Starbucks, it is available in Short, Tall and Grande. This is an example of Qualitative Analysis. But if a lay up sells 70 regular coffees a week, it is Quantitative Analysis because we have a number representing the coffees sold per week. Although the function of both these analyses is to provide results, Quantitative analysis provides a clearer picture hence making it pivotal in analytics. This brings us to the end of the blog on Statistics For Data Science. We hope that you were able to learn statistics for data science with the help of this statistics tutorial.