Machine Learning and Stats 2 - Univariate Exploratory Data Analysis

2023. 1. 17. 09:52 · Data science/Machine Learning


Exploratory Data Analysis (EDA) is always the starting point of data analysis, and it is about getting an intuitive understanding of the data. First, decide what to test; then use statistics to test what the data delivers.

 

Data Quality: Where does the data come from, and how accurate are they? 

The table below summarises the star rating of data quality suggested by David Spiegelhalter (Cambridge professor):

4* Numbers we can believe: well-controlled laboratory experiments
3* Numbers are reasonably accurate: well-conducted surveys, sample or field measurements
2* Numbers could be out by quite a long way: poorly conducted surveys, noisy measurements
1* Numbers are unreliable: highly biased, unrepresentative surveys or samples
0* Numbers have been made up: urban legends, memes, and fabricated experimental data

Univariate Data Vectors

Univariate case: one measurement per thing; in other words, there is one dependent variable (Y).

Dependent variable = outcome variable = response variable

cf. If there is more than one Y, the data are multivariate. (This will be covered later.)

 

The data is a single long column vector with real-valued entries. Mathematically, the univariate dataset is a length-n vector, x = (x₁, x₂, . . . , xₙ).

 

The sample mean of f(x) is

⟨f(x)⟩ = (1/n) ∑ f(xᵢ)             -- sum from i = 1 to n; ⟨ ⟩ (angle brackets): average over the data of the enclosed quantity
        = (1/n)[f(x₁) + f(x₂) + … + f(xₙ)]
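In code, the sample mean of f(x) is just the ordinary mean of f applied to each data point; a minimal NumPy sketch, with f(x) = x² chosen purely for illustration:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])

# <f(x)> with f(x) = x**2: apply f to each data point, then average
fx_mean = np.mean(x**2)   # (1 + 4 + 9 + 16) / 4 = 7.5
```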

Visualisation and Information: Lossless vs Lossy

Lossless: if viewed at sufficiently high resolution, the plot could be used to recover the original dataset

Lossy: the plot would be consistent with many different raw datasets

-> Lossy visualisation that loses the right information is the key to successfully visualising complex data.

 

A few univariate visualisations will be handled here. 

1. Rug plot (lossless): a tiny vertical tick on the x-axis for each data point; where the data are dense, the ticks merge into a continuous bar

2. Histogram (lossy): shows how many samples fall within each bin, but the exact value of each sample is unknown

3. Kernel density estimate (lossy): another way to estimate the underlying probability distribution of the data

Figure 1. The bottom part of the rug plot is merged into one bar across the bottom

4. Jitter plot (lossless): an alternative to the rug plot; it is useful when there is a vast amount of data or a restricted range of values

Figure 2. The region between 0 and 1.5 shows high density
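The lossy nature of a histogram can be seen directly in code: only the per-bin counts survive, not the individual values. A small sketch using `np.histogram`, with made-up sample values:

```python
import numpy as np

x = np.array([0.1, 0.2, 0.25, 0.9, 1.4, 2.8])

# Histogram (lossy): only the count per bin is kept; the exact values are lost
counts, edges = np.histogram(x, bins=3, range=(0.0, 3.0))
# bins [0, 1), [1, 2), [2, 3] -> counts [4, 1, 1]
```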

Summary Statistics: Measure of Central Tendency

  1. Mean: x̄ = ⟨x⟩ = (1/n) ∑ xᵢ (sum from i = 1 to n)
  2. Median: the middle value when the data are sorted by value. A special case of an order statistic. It is used when the distribution is dramatically skewed, since it is not affected by outliers.
  3. Mode: the most common value, or a local maximum in the density. For discrete data it can be determined uniquely as the most common value (count of occurrences, number of events, etc.). For continuous data it needs to be estimated (by estimating the distribution).
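A quick sketch of the three measures on a small made-up sample; the mode here is computed with `collections.Counter`, which only makes sense for discrete data:

```python
import numpy as np
from collections import Counter

x = np.array([1, 2, 2, 3, 10])   # small right-skewed sample with an outlier

mean = np.mean(x)                 # 3.6, pulled upward by the outlier 10
median = np.median(x)             # 2.0, unaffected by the outlier
mode = Counter(x.tolist()).most_common(1)[0][0]   # 2, the most common value
```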

 

Visualising Measures of Central Tendency

For a right-skewed distribution, the mode is the smallest and the mean is the largest. (For a normal distribution, all three would be equal.)

Figure 3. There is one mode, and the KDE mode is the smallest among the summary statistic values.

 4. Variance

Var(X) = E[(X - E[X])²]           // average of (X - E[X])²
       = E[X² - 2X·E[X] + E[X]²]
       = E[X²] - 2E[X]·E[X] + E[X]²
       = E[X²] - E[X]²
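The identity Var(X) = E[X²] − E[X]² can be checked numerically against NumPy's (biased) variance:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])

# Var(X) = E[X^2] - E[X]^2; np.var divides by n, matching this definition
lhs = np.var(x)
rhs = np.mean(x**2) - np.mean(x)**2
```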

5. Unbiased Variance and Computation

The unbiased estimator of the variance and the biased variance are calculated differently, but the datasets are often so large that the distinction is not essential. Still, note that Python (NumPy) calculates the biased variance by default, while R's var() calculates the unbiased variance.

Unbiased estimation: estimating the population mean through a survey.
: It can be shown that the expected value of the sample mean equals the population mean.

-> The true value and the expected value of the estimator match!
-> The statistic of the population is estimated correctly.

Biased estimation: estimating the population variance through a survey.
: It can be shown that the naive sample variance (dividing by n) is, on average, smaller than the population variance by a factor of (n − 1)/n.

-> Biased estimation returns, on average, a value different from the population one; dividing by n − 1 instead (Bessel's correction) removes the bias.

Figure 4. Biased Variance vs Unbiased Variance

# Python
np.var(x)            # biased
np.var(x, ddof=1)    # unbiased

# R
var(x)               # unbiased

library(moments)
moment(x, order = 2, central = TRUE)   # biased
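The two estimators differ only by the factor n/(n − 1); a quick numerical check in NumPy:

```python
import numpy as np

x = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
n = len(x)

biased = np.var(x)              # divides by n
unbiased = np.var(x, ddof=1)    # divides by n - 1 (Bessel's correction)

# unbiased equals biased * n / (n - 1)
```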

Natural units 

: Generally, the result does not depend on the units; it is dimensionless.

: For more general data, two commonly used quantities have the same units as the data: the mean, 𝛍 = Mean(x), and the standard deviation, 𝝈 = √Var(x).

-> These two quantities provide the definition of two transformations: centring and standardisation.

  • Centring: subtract off the mean, giving the centred data yᵢ = xᵢ − 𝛍. Then Mean(y) = 0; the scale is shifted over, but the units are retained. (The mean is redefined as ZERO, but the unit stays the same.)
  • Standardisation: divide the centred data by the standard deviation, zᵢ = yᵢ/𝝈. Then Var(z) = 1; this adjusts the scale of magnitude and makes the values dimensionless. (The variance is redefined as ONE, and the units cancel out.)
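A minimal sketch of the two transformations, checking that Mean(y) = 0 and Var(z) = 1:

```python
import numpy as np

x = np.array([10.0, 12.0, 14.0, 18.0])
mu = np.mean(x)
sigma = np.sqrt(np.var(x))

y = x - mu       # centring: mean becomes 0, units unchanged
z = y / sigma    # standardisation: variance becomes 1, dimensionless
```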

Higher Moments

 

Moments in statistics can be used to obtain summary statistics. For different orders r, the following values are obtained:

  • r = 1, moment: E(X)
  • r = 2, central moment: Var(X)
  • r = 3, standardised moment: 𝛄₁ (skewness)
  • r = 4, standardised moment: 𝛄₂ (kurtosis)

r-th moment of the data in general: mᵣ = ⟨xʳ⟩  -- expected value over the data of x to the r

Figure 5. r-th moment; Mean calculation

r-th central moment of the data: 𝛍ᵣ = ⟨(x − 𝛍)ʳ⟩ = ⟨yʳ⟩  -- expected value of y to the r, i.e. after subtracting off the mean

Figure 6. Central Moment - Variance calculation

r-th standardised moment of the data: ⟨zʳ⟩ = ⟨((x − 𝛍)/𝝈)ʳ⟩  -- expected value of the standardised data z to the r

Figure 7. Standardised moment - Skewness and Kurtosis calculation

Skewness

: It is about symmetry and tail length [right-skewness: tail longer on the right; left-skewness: tail longer on the left]

A larger (more positive) value: right-skewness, meaning that more of the data's variability comes from values of x larger than the mean.

A smaller (more negative) value: left-skewness, meaning that more of the data's variability comes from values of x smaller than the mean.

A value close to ZERO: the variability of the data is similar on either side of the mean. (This does not mean the distribution is symmetric; an asymmetric distribution can still have skewness close to zero.)

Figure 8. Skewness, from Wikipedia

# Python
ss = np.sqrt(np.var(x))
sp.stats.moment(x, 3) / (ss**3)

# R
ss = sqrt(moment(x, order = 2, central = TRUE))
moment(x, order = 3, central = TRUE) / (ss^3)
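As a sanity check, a right-skewed sample (exponential draws, chosen purely for illustration) should give a clearly positive skewness; `scipy.stats.skew` with its defaults computes the same biased standardised third moment:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = rng.exponential(size=10_000)   # right-skewed: long tail on the right

g1 = stats.skew(x)                 # biased standardised third moment, positive here
```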

Kurtosis:

It is about the peak height and the tail thickness [higher: heavy-tailed; lower: light-tailed]

Larger than three: more of the data's variance comes from the tails than it would if the data were normally distributed

Less than three: less of the data's variance comes from the tails than it would if the data were normally distributed

Close to three: consistent with the normal distribution, but this is not strong evidence for it.

※ The difference between kurtosis and three is called excess kurtosis. (compares the kurtosis of a distribution against the kurtosis of a normal distribution) : Excess Kurtosis = Kurtosis - 3

Figure 9. Kurtosis

# Python: excess kurtosis
sp.stats.moment(x, 4) / (ss**4) - 3

# R: excess kurtosis
(moment(x, order = 4, central = TRUE) / (ss^4)) - 3
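Note that `scipy.stats.kurtosis` returns excess kurtosis by default (the Fisher definition), so the "− 3" is already built in; a quick check on a normal sample:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = rng.standard_normal(10_000)

excess = stats.kurtosis(x)                       # excess kurtosis (default)
manual = stats.moment(x, 4) / np.var(x)**2 - 3   # the same quantity by hand
```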

Indicator function

: The indicator 1_A(x) equals 1 (TRUE) if x belongs to the set A, and 0 (FALSE) otherwise.

For example, A = {y : y ≤ x} or A = {x : 2 ≤ x ≤ 4}:

Figure 10. Indicator function
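In NumPy, an indicator is just a boolean mask cast to 0/1; a minimal sketch for A = {2 ≤ x ≤ 4}:

```python
import numpy as np

x = np.array([1.0, 2.5, 3.0, 4.5, 6.0])

# indicator of A = {2 <= x <= 4}: 1 where the condition holds, 0 otherwise
ind = ((x >= 2) & (x <= 4)).astype(int)   # [0, 1, 1, 0, 0]

# the average of an indicator is the fraction of data falling in A
frac = ind.mean()                         # 2/5 = 0.4
```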

ECDF (Empirical cumulative distribution function) 

: is an average over the data, an expectation

: used for comparing distributions, assessing how well a distribution fits the data, and estimating population percentiles from sample data

: a lossless visualisation of the data

: E(t) is the fraction of the data less than or equal to t, i.e. the average of the indicator 1(xᵢ ≤ t) over i = 1 to n.

What fraction of the data is less than or equal to t?
https://www.mathworks.com/help/stats/cdfplot_ko_KR.html
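The ECDF at t is just the average of the indicator 1(xᵢ ≤ t) over the data; a minimal sketch:

```python
import numpy as np

x = np.array([3.0, 1.0, 4.0, 1.0, 5.0])

def ecdf(x, t):
    """Fraction of data points <= t: the mean of the indicator 1(x_i <= t)."""
    return np.mean(x <= t)

e1 = ecdf(x, 1.0)   # 0.4, two of the five points are <= 1
e4 = ecdf(x, 4.0)   # 0.8
```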

Quantiles and Order Statistics

: The z-th percentile (Pz) is the value of x for which z% of the data is smaller than or equal to x (median(x) = P₅₀)

: The inter-quartile range (IQR): IQR(x) = P₇₅ − P₂₅

Figure 11. ECDF + Quantile
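Percentiles and the IQR can be read off with `np.percentile` (which by default interpolates linearly between order statistics):

```python
import numpy as np

x = np.arange(1, 101)   # the integers 1..100

p25, p50, p75 = np.percentile(x, [25, 50, 75])
iqr = p75 - p25          # inter-quartile range

# the 50th percentile coincides with the median
```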

Survival Function

Sometimes it is hard to see the interesting part of the plot (for the graph above, the largest values of x).

In this case, the survival function can be used: 1 − E(t), plotted with a logarithmic y-axis.

Here, conversely, we measure what fraction of the data is bigger than t.
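The survival function is simply one minus the ECDF; a minimal sketch:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 10.0, 100.0])

def survival(x, t):
    """Fraction of data strictly greater than t: 1 - ECDF(t)."""
    return np.mean(x > t)

s = survival(x, 3.0)   # 0.4, two of the five points exceed 3
```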

Multimodality: when the data have more than one local maximum

Mode: the most frequent value in a dataset

For continuous data, we need to estimate it, as exact values rarely repeat.

: Defined as the local maxima (peaks) of the probability density function

The mode is the most relevant measure of central tendency and variability for multimodal data.

A bimodal graph has two peaks, which means there is more than one mode location.

Bimodal, From Wikipedia

 

 
