Descriptive Stats
Author | Pujitha Gangarapu |
For Dataset | HCUP |
Descriptive Statistics
Descriptive statistics are numbers that summarize and describe a data set.
Measures of central tendency
These are the ways of describing the central position of a frequency distribution for a group of data. We can describe this central position using a number of statistics, including the mode, median, and mean.
Mean
The mean (or average) is the most popular and well-known measure of central tendency. It can be used with both discrete and continuous data. The mean is equal to the sum of all the values in the data set divided by the number of values in the data set.
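So, if a data set has $n$ values $x_1, x_2, \ldots, x_n$, the sample mean, usually denoted by $\bar{x}$, is:

$$\bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n}$$

This formula is usually written more compactly using the Greek capital letter $\Sigma$ (sigma), which means "sum of":

$$\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$$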
Let's try this out in R:
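A minimal sketch of such an example, assuming the values are drawn from 1 to 1000. The seed and the variable name `x` are illustrative choices, not taken from the original post:

```r
# Fix the seed so the draw is reproducible (the seed value is arbitrary)
set.seed(42)

# Draw 100 values from 1..1000; replace = TRUE allows repeated values
x <- sample(1:1000, 100, replace = TRUE)

# Arithmetic mean: sum of the values divided by their count
mean(x)

# The same quantity computed directly from the definition
sum(x) / length(x)
```

Both calls should print the same number, since `mean()` implements the formula given earlier.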
In the above example we drew a sample of 100 data points out of 1000 values with replace set to TRUE, which means the same value may be drawn more than once. With replace=FALSE, every drawn value is distinct.
Let us look into another example.
The iris data set is a famous data set that is available in R by default. We can load this data set and find the mean.
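A minimal sketch of this step. The variable name `data` follows the text; assigning `iris` to it is an assumption about how the original listing was written:

```r
# iris ships with base R (package 'datasets'), so nothing needs downloading
data <- iris

# First six rows, to see how the data frame is structured
head(data)

# Mean of column 2 (Sepal.Width) over all rows
mean(data[, 2])
```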
In the above example, head(data) prints the first few rows, which helps to understand how the data is structured. The output shows that the iris data has several columns (four measurements and the species). Since iris is a data frame, we can select any data point using data[rows, columns].
data[,2] indicates all the rows of data for column 2.
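A few indexing variations may make the `data[rows, columns]` pattern clearer. This is only a sketch; the particular rows and columns picked are arbitrary:

```r
data <- iris

# Rows 1 to 5 of column 2
first_five <- data[1:5, 2]

# Every column of row 3 (returns a one-row data frame)
third_row <- data[3, ]

# A column can also be selected by name instead of position
by_name <- data[, "Sepal.Width"]

# Selecting by position and by name gives the same vector
identical(data[, 2], by_name)
```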
Try this for the other numeric columns (columns 1 to 4).
Median
The Median is a simple measure of central tendency. To find the median, we arrange the observations in order from smallest to largest value.
If there is an odd number of observations, the median is the middle value.
If there is an even number of observations, the median is the average of the two middle values.
The median is less affected by outliers and skewed data than the mean.
Let's try it out in R:
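A minimal sketch of such an example, assuming 100 values drawn from 0 to 1000. The seed, the variable names, and sampling without replacement are assumptions:

```r
# Fix the seed so the draw is reproducible (the seed value is arbitrary)
set.seed(42)

# Draw 100 values from 0..1000 (without replacement, the default)
x <- sample(0:1000, 100)

# Median computed directly
median(x)

# Manual check: with an even number of observations (100), the median
# is the average of the two middle values of the sorted data
sorted <- sort(x)
mean(sorted[50:51])

# sort() can also arrange the data in descending order
head(sort(x, decreasing = TRUE))
```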
In the above example, we took a sample of 100 numbers from values ranging from 0 to 1000. The median( ) function in R directly returns the median of the object passed as its argument.
The sort( ) function sorts its argument in ascending or descending order. By default, the data is arranged in ascending order (see ?sort for more information).
Let's try it with the iris data set:
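A short sketch, again using column 2 (Sepal.Width); the choice of column is an assumption carried over from the mean example:

```r
data <- iris

# Median of column 2 (Sepal.Width)
median(data[, 2])

# For comparison, the mean of the same column
mean(data[, 2])
```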
Mode
The mode is the most frequent value in a data set. It is typically used for categorical data, where the most common category needs to be determined.
Let us illustrate the mode with an example.
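A minimal sketch of the approach. The vector below is hand-picked for illustration (it is not the original sample) and is chosen so that 221 occurs three times, matching the discussion that follows:

```r
# Hypothetical data in which 221 occurs three times
x <- c(105, 221, 340, 221, 87, 512, 340, 221, 999, 87)

# Frequency of each distinct value
freq <- table(x)
freq

# Index (with the value as its name) of the highest frequency
top <- which(freq == max(freq))

# The names are character strings, so convert them back to integers
mode_value <- as.integer(names(top))
mode_value
```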
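R has no built-in function that returns the statistical mode of a data set (the built-in mode( ) function reports an object's storage mode instead). The mode can be obtained with the method above, which uses the following functions: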
table( ) gives you the frequency of each element in the argument which is passed.
which( ) gives you the indices of the elements of a logical vector that are TRUE.
names( ) gives you the names of the object.
as.integer( ) gives the output in the integer format.
First, using the table( ) function, we found the number of times each element is repeated. It can be clearly observed from the output that 221 has been repeated 3 times.
Then we used the which( ) function to select the elements whose frequency is the maximum. Since which( ) returns index values, we used names( ) to get the corresponding names and then converted them to integers with as.integer( ).
However, one problem with the mode is that it is not always unique, which causes difficulty when two or more values share the highest frequency.
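The tie case can be sketched with a small hand-made vector (hypothetical data) in which two values share the highest frequency:

```r
# Hypothetical data: 3 and 7 each occur twice
y <- c(3, 7, 7, 2, 3, 9)

freq <- table(y)

# which() keeps every value that reaches the maximum frequency
modes <- as.integer(names(which(freq == max(freq))))
modes

# which.max() returns only the first maximum, so one mode is dropped
first_mode <- as.integer(names(which.max(freq)))
first_mode
```

Here `modes` holds both tied values while `first_mode` holds only one of them, so the choice between `which()` and `which.max()` decides whether ties are reported.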
We often test whether our data is normally distributed as this is a common assumption underlying many statistical tests.
When we have a normally distributed sample, we can legitimately use either the mean or the median as our measure of central tendency.
In any symmetrical, unimodal distribution, the mean, median, and mode are equal.
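A small check of this statement, using two hand-made vectors (both are illustrative and not from the original post): one symmetric, and one with a single large outlier that makes it skewed.

```r
# Symmetric, single-peaked sample
symmetric <- c(1, 2, 3, 3, 3, 4, 5)
mean(symmetric)
median(symmetric)
as.numeric(names(which.max(table(symmetric))))  # mode

# Same sample with the largest value replaced by an outlier
skewed <- c(1, 2, 3, 3, 3, 4, 50)
mean(skewed)    # pulled upward by the outlier
median(skewed)  # unchanged
```

In the skewed sample the mean moves away from the bulk of the data while the median stays put, which is why the median is preferred for skewed data.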
Type of variable | Best measure of central tendency |
---|---|
Nominal | Mode |
Ordinal | Median |
Interval/Ratio (not skewed) | Mean |
Interval/Ratio (skewed) | Median |