My Data Analytics Journey | Describing data for statistical analysis

Sources of Data

As a data analyst, it is a luxury to have access to high quality, useful data. Very often, for such data to be available, a lot of behind-the-scenes actions are required. One of the ways to have high quality data is by purchasing it from companies such as Nielsen and Kantar, whose main business model is the collection of relevant data and insights of customers. Alternatively, companies might also have access to data collected for day-to-day business operations, such as the the number of visitors on their e-commerce website and their retail point-of-sales data. 


Descriptive Statistics

When confronted with a large data set, how can we quickly convey its key characteristics to someone who has never seen or worked on the data set before? One way to achieve this is through descriptive statistics,  which allows us to display and describe data. Displaying data can take many shapes and form depending on the type of data. 

Displaying Data



Describing data

When describing data, there are 3 main types of information that we are trying to convey:

1) Measures of central tendencies

In simple words, if I had to pick one data point to represent the entire dataset, what would the data point be?

Central tendencies can be measured by the
  • Mean: The Numeric "Average"
In this case, the mean is (700+705+50+820+250+400+450+50+50+100)/10 = 357.5


Even though the mean is a very useful and common measure of central tendency, it is important to note that it is very affected by outliers. For that reason, a data analyst might consider removing outliers from the dataset to prevent skewering the results. Imagine if we are trying to calculate the net worth of the "everyday American" and we do not exclude outliers such as Elon Musk, whose net worth is roughly 208 billion or $208,000,000,000. For reference, the median net worth of the American worker is $121,700, and this intuitively is a much more representative measure of the net worth of the "everyday American".

  • Median: The "Middle" Number

The median is the middle number after arranging the datapoint in ascending or descending order. For instance, in the data set above, the middle number is 5th and 6th row, with Sales values of 250 and 400 respectively. The median is thus (250+400)/2 = 325. 

Median is most useful as a measure of central tendency for ordinal data where there is strict ordering, such as median satisfaction level for customers. 
  • Mode: The "Most Common" Number

The Mode is the most common number, which in this case is the value "50", which occured on Days 3, 8 & 9. Mode is very useful as a measure of central tendency for nominal data without strict ordering, such as nationality. 

2) Measures of Spread
  • Range
Range is simply the maximum number - minimum number, which in this data set is 820 - 50 = 770. This gives us the overall "spread" of the dataset. 


  • Variance
The Variance tells us how much the data points deviate from the mean. It is calculated by taking the sum of the squared value of the difference between the x data points and the mean value, divided by number of data points -1. 

  • Standard Deviation
Since the variance is a squared value as seen from the formula, it is difficult to compare it to the mean of the data set or the data values. In order to standardise the measure of dispersion from the mean, we take the square root of variance to get the standard deviation. The standard deviation will take the same unit as the dataset. 


  • Coefficient of Variance
The coefficient of variance allows us to compare the degree of variation of one data set to another regardless of their mean values. This is because it is calculated by taking the standard deviation/mean value. In order words, it gives us the ratio of the standard deviation to the mean, and the magnitude of the ratio tells us how dispersed a data set is from the mean. 


An application of this ratio is in the calculation of the risk to reward ratio for stocks, where the standard deviation is the risk or volatility of the stock and the mean is the mean average return of the stock. As such, we would want to pick a stock that historically has a very low coefficient of variance with low volatility and higher expected return. 


3) Measures of Shape

  • Skewness
Skewness tells us how distorted the dataset is from a normal distribution, or how asymmetrical a dataset is. We can visualise this easily through a histogram. There are 3 different types of skewness: 

1) No Skew: The distribution is perfectly symmetrical (Unlikely to happen unless a very very very large random sample was taken from the population [central limit theorem]). 

2) Positive Skew or Right-Skewed: The right tail is longer than the left. 

3) Negative Skew or Left-Skewed: The left tail is longer than the right. 

Photo taken from Wikipeda


In Excel, the formula to calculate this skewness of the dataset is through the formula =skew(data)

As a general rule of thumb,

skewness > 1 and skewness < -1 are considered highly skewed

0.5 < skewness < 1 & -0.5< skewness < -1 are considered moderately skewed

-0.5 < skewness < 0.5 are considered approximately normal or symmetric. 

  • Kurtosis

Kurtosis measures the heaviness of the tail when compared to a normal distribution. High kurtosis (> 0 or < 0) tells us that there are a lot of outliers and is a source for concern, while low kurtosis (= 0) tells us that there are not many outliers in the dataset. 

When kurtosis is <0, the tails are thinner and shorter.

When kurtosis is >0, the tails are longer and fatter. 



That's a lot of work needed to describe data! Now that we understand the significance of the terms that can be used, is there an easy way to calculate them? Of course, through the power of Excel!

Excel Step-By-Step

Step 1: Click on Excel Data Tab

Step 2: Click on Data Analysis

Step 3: Make sure "Labels in first row" and "Summary statistics" are checked before clicking on ok.



Results from Excel



Inferential Statistics

The other form of statistics is inferential statistics, which allows us to infer characteristics of the larger population based on the sample we have. This is important as it is often not economical nor feasible to collect data from every single unit in our target population.



Congrats on getting through another information heavy post! Now that we know how to describe the data, we can get our hands dirty and start cleaning data for our data analysis. For that and more, I will see you in the next blog post :)

















Comments