Welcome to the new year. I promise that this year we will all learn together and become better data analysts and technical solution masters. I am starting the new year with the basics again to make it clear that these blogs are for serious learners who want to go beyond the traditional boundaries of either technology or business. In order to be a good data analyst, or even a technologist, you must know how to swim in data.
In this blog we will learn the basics of statistics from a business use-case point of view.
Histogram
You are in business because you want to make money. You make money by selling products or services to customers and prospects. Prospects buy your product because they see value in it or because they need it. And so it goes on; everything boils down to Maslow's hierarchy of needs. So, if you have n variants of a product, which variants should be targeted at whom? Although it is a multi-dimensional problem, it all boils down to creating customer segments and then positioning the right product at the right price at the right time to the right prospects.
Understanding the right prospect for the right product is the biggest challenge a marketer keeps juggling. This is called customer segmentation. Segmentation depends on demographics, psychographics, culture, socio-economic status, age group, income group, risk group and several other attributes of customers. If you are new to marketing and have access to data, the least you can do is take the transaction (sales) data of at least a year and find out the spending pattern of customers over quarters, months and weeks. The spending patterns can be grouped into a histogram in which we create ranges of spending: customer vs. amount, product vs. amount. What you will often see is the Pareto pattern: roughly 20% of customers contribute about 80% of your revenue. Similarly, roughly 20% of your product line contributes about 80% of your revenue.
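To make the histogram idea concrete, here is a minimal Python sketch. The transaction list, the customer IDs, the bin width of 500 and the helper names are all made up for illustration; your own sales data and ranges will look different.

```python
from collections import Counter

# Hypothetical (customer_id, amount) transactions; real data would come from your sales system.
transactions = [
    ("C1", 1200.0), ("C2", 80.0), ("C3", 45.0), ("C1", 900.0),
    ("C4", 60.0), ("C5", 30.0), ("C2", 40.0), ("C6", 25.0),
    ("C7", 70.0), ("C8", 55.0), ("C9", 35.0), ("C10", 1500.0),
]
BIN_WIDTH = 500  # assumed width of each spending range


def spend_per_customer(rows):
    """Add up the amounts for each customer."""
    totals = {}
    for customer, amount in rows:
        totals[customer] = totals.get(customer, 0.0) + amount
    return totals


def histogram(values, bin_width):
    """Count how many values fall into each range [k*w, (k+1)*w)."""
    counts = Counter(int(v // bin_width) * bin_width for v in values)
    return dict(sorted(counts.items()))


def top_share(totals, fraction=0.2):
    """Share of total revenue contributed by the top `fraction` of customers."""
    ranked = sorted(totals.values(), reverse=True)
    top_n = max(1, round(len(ranked) * fraction))
    return sum(ranked[:top_n]) / sum(ranked)


totals = spend_per_customer(transactions)
bins = histogram(totals.values(), BIN_WIDTH)
share = top_share(totals)

# Print a simple text histogram: one '#' per customer in each spending range.
for start, count in bins.items():
    print(f"{start:>5}-{start + BIN_WIDTH - 1:<5} {'#' * count}")
print(f"Top 20% of customers bring in {share:.0%} of revenue")
```

The same grouping can be repeated per product instead of per customer to see which part of the product line carries the revenue.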
What does it mean? You will be able to segment the customers. With a map of products and customers, you will be able to create a campaign for either up-selling or cross-selling. How?
By adding an incremental feature to a product at a differential price, you will be able to move the customer to the next best product.
For example, have you ever gone to buy a car? SX, EX, VX; Audi A3, A4, A5, A6? What are those? Customized cars? The same goes for prepaid top-up vouchers and ranges of merchandise with incremental properties.
The first task of segmentation is to create that histogram, a very basic tool of statistics. What happens if you have zettabytes of data? Big Data analytics?
Mean and Median
I think everybody understands the 'Mean': if there are N data points in a sample set, the mean is the sum of all data points divided by N. Hence, the mean is the average of all the data in a sample set.
Now suppose you have 9 single-digit numbers and the number 99 in a sample set of 10. The mean will be pulled up to roughly 10 or more, which is larger than nine of the ten values. Statistically, we cannot make much sense of the mean in such cases because the single outlier 99 has skewed it and the inference is not right. Think of real-estate prices. If one property is situated at a strategic location and its price is more than double that of its nearest neighbour, it will skew the average price of the whole local market.
Hence the concept of the 'Median' arrives. To derive the median, sort the sample set and pick the middle value if the sample has an odd number of elements. If the sample has an even number of elements, the median is the average of the middle two. What happens here? The outliers do not have a significant impact on the inference. Right?
So, what if you are working on a large data set and are confused about whether to take the mean or the median as a reference point?
Play smart: derive both the mean and the median and check the distance between the two. If the distance is higher than a limit (which depends on your problem statement), take the median.
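Here is a small Python sketch of that "play smart" rule. The sample values and the threshold `max_gap_ratio` are assumptions for illustration only; the right limit depends on your problem statement.

```python
from statistics import mean, median

# Nine single-digit values plus one outlier (99); the values are made up.
sample = [3, 5, 2, 8, 6, 4, 7, 1, 9, 99]


def central_value(data, max_gap_ratio=0.25):
    """Pick the mean or the median as the reference point.

    max_gap_ratio is a hypothetical limit: if the mean and median differ by
    more than this fraction of the median, treat the data as skewed and
    return the median. A median of 0 falls back to the mean.
    """
    m, med = mean(data), median(data)
    gap = abs(m - med)
    if med != 0 and gap / abs(med) > max_gap_ratio:
        return "median", med
    return "mean", m


choice, value = central_value(sample)
print(f"mean={mean(sample)}, median={median(sample)}, use the {choice}: {value}")
```

For this sample the mean is 14.4 and the median is 5.5, so the gap is large and the sketch falls back to the median.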
Standard Deviation
Standard deviation is a measure of the dispersion of a set of data from its mean. It measures the absolute variability of a distribution: the higher the dispersion or variability, the greater the standard deviation and the greater the magnitude of the deviation of the values from their mean.
The concept of Standard Deviation was introduced by Karl Pearson in 1893. It is by far the most important and widely used measure of dispersion. Its significance lies in the fact that it is free from the defects that afflicted earlier methods and satisfies most of the properties of a good measure of dispersion. Standard Deviation is also known as root-mean-square deviation, as it is the square root of the mean of the squared deviations from the arithmetic mean.
It is used in almost every domain, though its interpretation differs from one situation to another.
In the stock market, the higher the standard deviation of a stock's price across the sampled days, the higher the instability or volatility it indicates.
In telecom, it is used to segment customers. Average Revenue Per User (ARPU) is one of the parameters considered for segmenting customers into Silver, Gold, Platinum or Diamond segments. Customers whose ARPU lies within a given number of standard deviations of the mean ARPU are taken as one group, building on the histograms we discussed previously to identify the segments.
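As a rough illustration of the telecom case, the sketch below computes the standard deviation of ARPU and bands customers by how far they sit from the mean. The ARPU figures and the cut-offs (0, 1 and 2 standard deviations) are assumptions of mine, not an industry rule.

```python
from statistics import mean, pstdev

# Hypothetical monthly ARPU per customer; the names and amounts are made up.
arpu = {"A": 12.0, "B": 15.0, "C": 18.0, "D": 20.0, "E": 22.0, "F": 25.0, "G": 60.0}

values = list(arpu.values())
mu = mean(values)
# Population standard deviation: square root of the mean of squared deviations.
sigma = pstdev(values)


def segment(value):
    """Band a customer by the number of standard deviations from the mean ARPU.

    The cut-offs below are illustrative assumptions.
    """
    z = (value - mu) / sigma
    if z >= 2.0:
        return "Diamond"
    if z >= 1.0:
        return "Platinum"
    if z >= 0.0:
        return "Gold"
    return "Silver"


segments = {name: segment(value) for name, value in arpu.items()}
print(f"mean ARPU = {mu:.2f}, standard deviation = {sigma:.2f}")
for name, band in segments.items():
    print(name, arpu[name], band)
```

Note how the single high-ARPU customer both lands in the top band and inflates the standard deviation itself, which is the same outlier effect we saw with the mean.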
Regression
In statistical modelling, regression analysis is a set of statistical processes for estimating the relationships among variables. Regression analysis is also used to understand which of the independent variables are related to the dependent variable, and to explore the forms of these relationships. In other words, it is a technique for determining the statistical relationship between two or more variables, where a change in a dependent variable is associated with, and depends on, a change in one or more independent variables.
In the business world, regression analysis is a statistical approach to forecasting change in a dependent variable (sales revenue, for example) on the basis of change in one or more independent variables (population and income, for example). It is also known as curve fitting or line fitting, because a regression equation can be used to fit a curve or line to data points in such a way that the distances of the data points from the curve or line (typically the sum of their squared vertical distances) are minimized. Relationships depicted in a regression analysis are, however, associative only, and any cause-effect (causal) inference is purely subjective. It is also called the regression method or regression technique.
Make sure to look at your regression analysis: it can help you figure out where your business needs to start focusing its attention.
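A minimal sketch of line fitting, assuming the simplest case of one independent variable and ordinary least squares. The income and revenue numbers are invented purely to show the mechanics.

```python
# Hypothetical data: average income (in thousands) vs. sales revenue (in thousands).
income = [20.0, 30.0, 40.0, 50.0, 60.0]
revenue = [25.0, 33.0, 47.0, 52.0, 63.0]


def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations, and sum of squared deviations of x.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


slope, intercept = fit_line(income, revenue)
forecast = slope * 70.0 + intercept  # forecast revenue at an income of 70
print(f"revenue = {slope:.2f} * income + {intercept:.2f}")
print(f"forecast at income 70: {forecast:.1f}")
```

The fitted line only describes an association in this sample; as noted above, it does not by itself prove that income causes revenue.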
Let us take a simple example of a class of students who take a true/false test of 100 items on a subject. Let us assume that all students answer every question at random. In this case, each student's score is a realization of one of a set of independent, identically distributed random variables with an expected mean of 50. Naturally, some students will score more than 50 and some less than 50. If we take only the top-scoring 10% of the students and give them a second test on which they again answer every item at random, their mean score is still expected to be near 50. So the mean of these students regresses all the way back to the mean of all the students who took the original test. Irrespective of what a student scored on the original test, the best prediction of the score on the second test is 50.
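The thought experiment above is easy to simulate. The sketch below assumes 1000 students and a fixed random seed, both of which are arbitrary choices of mine; it compares the top 10% on the first test with the same students on the second test.

```python
import random


def simulate(num_students=1000, num_items=100, top_fraction=0.1, seed=0):
    """Simulate two purely random true/false tests and compare the top scorers."""
    rng = random.Random(seed)

    def guess_score():
        # Each item is answered correctly with probability 0.5.
        return sum(rng.random() < 0.5 for _ in range(num_items))

    first = [guess_score() for _ in range(num_students)]
    # Indices of the students with the best first-test scores.
    ranked = sorted(range(num_students), key=lambda i: first[i], reverse=True)
    top = ranked[: int(num_students * top_fraction)]

    first_top_mean = sum(first[i] for i in top) / len(top)
    # The same students guess again on a second test.
    second_top_mean = sum(guess_score() for _ in top) / len(top)
    return first_top_mean, second_top_mean


first_mean, second_mean = simulate()
print(f"top 10% on test 1: {first_mean:.1f}, same students on test 2: {second_mean:.1f}")
```

The top group looks clearly above 50 on the first test only because it was selected on luck; on the second test its average falls back to around 50.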
Correlation
Correlations are very useful because they can indicate a predictive relationship that can be exploited in practice. For example, an electrical utility may produce less power on a mild day on the basis of the correlation between electricity demand and the weather. We can see a causal relationship in this example, as extreme weather conditions make people use more electricity for heating or cooling.
Correlation is the degree and type of relationship between two or more quantities (variables) in which they vary together over a period; for example, variation in the level of expenditure or savings with variation in the level of income. A positive correlation exists where high values of one variable are associated with high values of the other variable(s). A negative correlation means that high values of one are associated with low values of the other(s). The correlation coefficient can vary from -1 to +1. Values close to +1 indicate a high degree of positive correlation, and values close to -1 indicate a high degree of negative correlation.
Values close to zero indicate poor correlation of either kind, and 0 indicates no linear correlation at all.
While correlation is useful in discovering possible connections between variables, it does not prove or disprove any cause-and-effect (causal) relationships between them.
In the financial world, one study suggested that a strong correlation existed between a high willingness to undertake risks and strong performance in the stock market.
Correlation Exemplified in Detail
Let me explain it through an example.
A local cold-drink shop keeps track of the amount of cold drinks it sells against the temperature on that day. Below are its sales and temperature figures for the last 12 days. You are required to comment on the relationship between cold drink sales and the day's temperature.
Temperature (°C) | Cold Drink Sales
14.2 | $215
16.4 | $325
11.9 | $185
15.2 | $332
18.5 | $406
22.1 | $522
19.4 | $412
25.1 | $614
23.4 | $544
18.1 | $421
22.6 | $445
17.2 | $408
Solution:
We will use Pearson’s correlation here to find the value of the correlation first.
Let us assume two variables ‘x’ and ‘y’. Here we take the temperature as ‘x’ and the sale of cold drinks as ‘y’.
We follow these steps:
1) Find the mean of ‘x’ and the mean of ‘y’.
2) Subtract the mean of ‘x’ from every value of ‘x’ and call the results ‘a’.
3) Do the same for ‘y’ and call the results ‘b’.
4) Now find a×b, a² and b² for every pair of values.
5) Sum up the values found in step 4, that is a×b, a² and b².
6) Finally, divide the sum of a×b by the square root of [(sum of a²) × (sum of b²)].
Below is the table of values:
[Table: Correlation and Regression Table]
We can write it as a formula, where a = x − (mean of x) and b = y − (mean of y):
r = Σ(a × b) / √[(Σa²) × (Σb²)]
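The six steps above can also be sketched in a few lines of Python. The data is taken from the table of temperature and sales; the function and variable names are my own.

```python
from math import sqrt

# Temperature (°C) and cold drink sales ($) for the 12 days in the table.
temperature = [14.2, 16.4, 11.9, 15.2, 18.5, 22.1, 19.4, 25.1, 23.4, 18.1, 22.6, 17.2]
sales = [215, 325, 185, 332, 406, 522, 412, 614, 544, 421, 445, 408]


def pearson(xs, ys):
    """Pearson's correlation, following steps 1 to 6."""
    # Step 1: means of x and y.
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Steps 2 and 3: deviations from the means.
    a = [x - mean_x for x in xs]
    b = [y - mean_y for y in ys]
    # Steps 4 and 5: sums of a*b, a^2 and b^2.
    sum_ab = sum(ai * bi for ai, bi in zip(a, b))
    sum_a2 = sum(ai ** 2 for ai in a)
    sum_b2 = sum(bi ** 2 for bi in b)
    # Step 6: divide the sum of a*b by sqrt(sum of a^2 * sum of b^2).
    return sum_ab / sqrt(sum_a2 * sum_b2)


r = pearson(temperature, sales)
print(f"r = {r:.2f}")  # roughly 0.96 for this data
```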
So, your comment on the relationship will be something like the following:
Since the correlation is positive and close to 1 (about 0.96 for this data), we can say that the relationship between cold drink sales and the temperature of the day is strong but not perfect. This implies that the warmer the weather, the higher the sales of cold drinks.
Unusual, Yeah! There is no perfect world.