Exploratory Data Analysis (EDA) is always the starting point of data analysis; it is about getting an intuitive understanding of the data. First, you decide what to test, and then you use statistics to test what the data delivers.
Data Quality: Where do the data come from, and how accurate are they?
The table below summarises the star rating of data quality suggested by David Spiegelhalter (Cambridge professor).
| Rating | Meaning | Examples |
|---|---|---|
| 4* | Numbers we can believe | Well-controlled laboratory experiments |
| 3* | Numbers are reasonably accurate | Well-conducted surveys, sample or field measurements |
| 2* | Numbers could be out by quite a long way | Poorly conducted surveys, noisy measurements |
| 1* | Numbers are unreliable | Highly biased, unrepresentative surveys or samples |
| 0* | Numbers have been made up | Urban legends, memes, fabricated experimental data |
Univariate Data Vectors
Univariate case: one measurement per thing; in other words, there is one dependent variable (Y).
Dependent variable = outcome variable = response variable
cf. if there is more than one Y, the data are multivariate. (This will be covered later.)
The data form one long column vector with real-valued entries. Mathematically, a univariate dataset is a length-n vector, x = (x₁, x₂, . . . , xₙ).
The sample mean of f(x) is
<f(x)> = (1/n) ∑ᵢ₌₁ⁿ f(xᵢ) = (1/n)[f(x₁) + f(x₂) + ... + f(xₙ)]
where the angle brackets < > denote the average over the data of the quantity enclosed.
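A minimal sketch of the sample mean in NumPy (x and f here are hypothetical stand-ins for your data and function of interest):
# Python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])   # hypothetical univariate dataset
f = np.square                        # any function of interest, e.g. f(x) = x²

sample_mean = f(x).mean()            # <f(x)> = (1/n) * sum of f(x_i)
print(sample_mean)                   # 7.5 for this toy data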
Visualisation and Information: Lossless vs Lossy
Lossless: if viewed at sufficiently high resolution, the plot could recover the original dataset.
Lossy: the plot would be consistent with many different raw datasets.
-> Lossy visualisation that loses the right information is the key to successfully visualising complex data.
A few univariate visualisations are covered below (see the plotting sketch after this list).
1. Rug plot (lossless): a tiny vertical tick on the x-axis for each data point. It stops being lossless once the ticks merge into a continuous bar.
2. Histogram (lossy): counts of samples falling into each bin; the exact value of each sample is unknown.
3. Kernel density (lossy): another way to estimate the underlying probability distribution of the data.
4. Jitter plot (lossless): an alternative to the rug plot, useful when there is a vast amount of data or a restricted range of values.
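A minimal matplotlib sketch of the four plots above (assuming x is a 1-D NumPy array; gaussian_kde stands in for the kernel density estimate):
# Python
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import gaussian_kde

x = np.random.randn(200)                             # toy data
fig, ax = plt.subplots(4, 1, figsize=(6, 8))
ax[0].plot(x, np.zeros_like(x), '|', ms=20)          # rug plot: one tick per point
ax[1].hist(x, bins=20)                               # histogram: bin counts only
grid = np.linspace(x.min(), x.max(), 200)
ax[2].plot(grid, gaussian_kde(x)(grid))              # kernel density estimate
ax[3].plot(x, np.random.uniform(size=len(x)), '.')   # jitter plot: random vertical offsets
plt.show()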
Summary Statistics: Measure of Central Tendency
1. Mean: Mean(x) = <x> = (1/n) ∑ᵢ₌₁ⁿ xᵢ
2. Median: the middle value when the data are sorted by value, a special case of an order statistic. It is used when the distribution is dramatically skewed, since it is not affected by outliers.
3. Mode: a value more common than those around it, i.e. a local maximum of the density. For discrete data it can be uniquely determined as the most common value (frequency counts, number of events, etc.). For continuous data it needs to be estimated (by estimating the distribution).
Visualising Measures of Central Tendency
For a right-skewed distribution, the mode is the smallest and the mean is the largest (mode < median < mean); for a normal distribution, all three coincide.
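A quick numerical illustration with hypothetical right-skewed toy data (the mode is crudely estimated from the tallest histogram bin):
# Python
import numpy as np

rng = np.random.default_rng(0)
x = rng.exponential(scale=2.0, size=10_000)   # right-skewed toy sample

counts, edges = np.histogram(x, bins=50)
mode_est = edges[np.argmax(counts)]           # left edge of the tallest bin

print(mode_est, np.median(x), np.mean(x))     # mode < median < mean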
4. Variance
Var(X) = E[(X − E[X])²]    // the average of (X − E[X])²
       = E[X² − 2X·E[X] + E[X]²]
       = E[X²] − 2E[X]·E[X] + E[X]²
       = E[X²] − E[X]²
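The identity is easy to check numerically; a minimal sanity check using biased (divide-by-n) sample averages:
# Python
import numpy as np

x = np.random.randn(1000)
lhs = np.mean((x - x.mean())**2)     # E[(X − E[X])²]
rhs = np.mean(x**2) - x.mean()**2    # E[X²] − E[X]²
print(np.allclose(lhs, rhs))         # True; both equal np.var(x)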
5. Unbiased Variance and Computation
The unbiased estimator of the variance and the (biased) sample variance are calculated differently, but datasets are often so large that the distinction is not essential. Still, note that Python (NumPy) calculates the biased variance by default, while R calculates the unbiased variance.
| Unbiased Estimation | Biased Estimation |
|---|---|
| Estimating the population mean through a survey: it can be proved that the expected value of the sample mean matches the population mean. -> The true value and the expected value of the estimator match, so the population statistic is estimated correctly. | Estimating the population variance through a survey: the sample variance (dividing by n) is systematically smaller than the population variance. -> The expected value of the estimator differs from the true value; dividing by n − 1 instead removes this bias. |
# Python
import numpy as np

np.var(x)           # biased variance (divides by n)
np.var(x, ddof=1)   # unbiased variance (divides by n - 1)

# R
var(x)  # unbiased variance
library(moments)
moment(x, order=2, central=TRUE)  # biased variance
Natural units
: generally, the result should not depend on the units; it should be dimensionless
: For more general data, two commonly used quantities have the same units as the data: the mean 𝛍 = Mean(x) and the standard deviation 𝝈 = √Var(x)
-> These two quantities provide the definition of two transformations: centring and standardisation (see the sketch after this list).
- Centring: subtract off the mean, giving the centred data yᵢ = xᵢ − 𝛍, so that Mean(y) = 0. This shifts the scale but retains the units (the mean is redefined as zero, but the unit stays the same).
- Standardisation: divide the centred data by the standard deviation, zᵢ = yᵢ/𝝈, so that Var(z) = 1. This adjusts the scale of magnitude and makes the data dimensionless (the variance is redefined as one, and the units cancel out).
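A minimal sketch of both transformations (note that NumPy's std() is the biased √Var by default):
# Python
import numpy as np

x = np.array([2.0, 4.0, 6.0, 8.0])   # toy data
mu, sigma = x.mean(), x.std()

y = x - mu                # centring: Mean(y) = 0, units unchanged
z = y / sigma             # standardisation: Var(z) = 1, dimensionless
print(y.mean(), z.var())  # 0.0 and 1.0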
Higher Moments
Moments in statistics can be used to derive summary statistics. Different orders r give different quantities, as shown below.
| E(X) | Var(X) | 𝛄₁ (Skewness) | 𝛄₂ (Kurtosis) |
|---|---|---|---|
| moment, r = 1 | central moment, r = 2 | standardised moment, r = 3 | standardised moment, r = 4 |
r-th moment of the data in general: mᵣ = <xʳ> -- the expected value over the data of x to the r
r-th central moment of the data: 𝛍ᵣ = <(x − 𝛍)ʳ> = <yʳ> -- the expected value of y to the r, i.e. after subtracting off the mean
r-th standardised moment of the data: 𝛄ᵣ = <zʳ> = 𝛍ᵣ/𝝈ʳ -- the expected value of z to the r, using the standardised data
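A small helper that computes all three kinds of moments directly from these definitions (a sketch; biased, divide-by-n averages throughout):
# Python
import numpy as np

def moments(x, r):
    """Return the r-th raw, central, and standardised moments of x."""
    x = np.asarray(x, dtype=float)
    m_r = np.mean(x**r)              # raw moment:          <x^r>
    y = x - x.mean()
    mu_r = np.mean(y**r)             # central moment:      <(x - mu)^r>
    gamma_r = mu_r / x.std()**r      # standardised moment: <z^r> = mu_r / sigma^r
    return m_r, mu_r, gamma_r

print(moments([1, 2, 3, 4, 5], r=2))  # (11.0, 2.0, 1.0)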
Skewness
: It is about the symmetry and the tail length [right-skewness: tail longer on the right; left-skewness: tail longer on the left]
A larger (more positive) value: right-skewness, meaning more of the data's variability comes from values of x larger than the mean.
A smaller (more negative) value: left-skewness, meaning more of the data's variability comes from values of x smaller than the mean.
A value close to ZERO: the variability of the data is similar on either side of the mean. (This does not imply a symmetric distribution: an asymmetric distribution can still have skewness close to zero.)
# Python
import numpy as np
from scipy import stats

ss = np.sqrt(np.var(x))      # (biased) standard deviation
stats.moment(x, 3) / ss**3   # skewness: third standardised moment

# R
library(moments)
ss <- sqrt(moment(x, order=2, central=TRUE))
moment(x, order=3, central=TRUE) / ss^3  # skewness
Kurtosis:
It is about the peak height and the tail thickness [higher: heavy-tailed; lower: light-tailed]
Larger than three: more of the data's variance comes from the tails than it would if the data were normally distributed.
Less than three: less of the data's variance comes from the tails than it would if the data were normally distributed.
Close to three: consistent with a normal distribution, but not strong evidence of one.
※ The difference between the kurtosis and three is called the excess kurtosis (it compares the kurtosis of a distribution against that of a normal distribution): Excess Kurtosis = Kurtosis - 3
# Python (reuses stats and ss from the skewness snippet above)
stats.moment(x, 4) / ss**4 - 3   # excess kurtosis

# R (reuses ss from the skewness snippet above)
moment(x, order=4, central=TRUE) / ss^4 - 3   # excess kurtosis
Indicator function
: If a certain value belongs to a set A, the indicator is 1 (TRUE); otherwise it is 0 (FALSE).
For example, A could be the event y ≤ x or the interval 2 ≤ x ≤ 4 (a sketch follows).
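In NumPy the indicator function is just a boolean comparison cast to 0/1; a sketch for the interval A = {2 ≤ x ≤ 4}:
# Python
import numpy as np

x = np.array([1, 2, 3, 4, 5])
indicator = ((x >= 2) & (x <= 4)).astype(int)   # 1 if x belongs to A, else 0
print(indicator)                                # [0 1 1 1 0]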
ECDF (Empirical cumulative distribution function)
: is an average over the data, i.e. an expectation
: used for comparing distributions, assessing how well a distribution fits the data, and estimating population percentiles from sample data
: a lossless visualisation of the data
: E(t) is the fraction of the n data points smaller than or equal to t -- the sample average of the indicator 𝟙(xᵢ ≤ t)
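Since the ECDF is just the sample average of an indicator, it is a one-liner; a minimal sketch:
# Python
import numpy as np

def ecdf(x, t):
    """Fraction of data points less than or equal to t."""
    return np.mean(np.asarray(x) <= t)   # average of the indicator 1(x_i <= t)

print(ecdf([3, 1, 4, 1, 5], t=3))        # 0.6: three of five points are <= 3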
Quantiles and Order Statistics
: the z-th percentile (Pz) is the value of x for which z% of the data is smaller than or equal to x (median(x) = P₅₀)
: inter-quartile range (IQR): IQR(x) = P₇₅ − P₂₅
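Percentiles and the IQR in NumPy (a sketch with toy data):
# Python
import numpy as np

x = np.random.randn(1000)
p25, p50, p75 = np.percentile(x, [25, 50, 75])
iqr = p75 - p25                            # inter-quartile range
print(np.isclose(p50, np.median(x)), iqr)  # the 50th percentile is the median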
Survival Function
Sometimes it isn't easy to visualise the interesting part of the ECDF -- here, the largest values of x.
In this case, the survival function can be used: 1 − E(t), plotted with a logarithmic y-axis.
Here, conversely, we can measure what fraction of the data is bigger than t (see the sketch below).
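A sketch of the survival function on a logarithmic y-axis, using hypothetical heavy-tailed toy data:
# Python
import numpy as np
import matplotlib.pyplot as plt

x = np.sort(np.random.pareto(2.0, size=1000))   # heavy-tailed toy data
survival = 1.0 - np.arange(len(x)) / len(x)     # fraction of data >= each sorted value

plt.semilogy(x, survival)                       # log y-axis makes the tail visible
plt.xlabel('t')
plt.ylabel('fraction of data > t')
plt.show()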
Multimodality: when the data has more than one local maximum
Mode: most frequent value in a dataset
For continuous data, we need to estimate it, as values rarely repeat exactly and there is no single most frequent value.
: defined as the local maxima (peaks) of the probability density function
The mode is the most relevant measure of central tendency and variability for multimodal data.
A bimodal graph has two peaks, meaning there is more than one mode location (a KDE-based sketch follows).
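A sketch of estimating the modes of continuous data: fit a kernel density estimate and report its local maxima (gaussian_kde and find_peaks are one reasonable choice, not the only one):
# Python
import numpy as np
from scipy.stats import gaussian_kde
from scipy.signal import find_peaks

rng = np.random.default_rng(1)
x = np.concatenate([rng.normal(-2, 0.5, 500),   # bimodal toy data:
                    rng.normal(3, 1.0, 500)])   # peaks near -2 and 3

grid = np.linspace(x.min(), x.max(), 500)
density = gaussian_kde(x)(grid)                 # estimated density on a grid
peaks, _ = find_peaks(density)                  # indices of local maxima
print(grid[peaks])                              # estimated mode locations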