[IBM]Data Analysis with Python - Exploratory Data Analysis(EDA)

2021. 5. 15. 15:11Data science/Python


Exploratory Data Analysis(EDA)

- to summarise the main characteristics of the data

- uncover the relationships between different variables 

- extract important variables for the problem

 

- Which characteristics have the most impact on the target?

Descriptive Statistics

- before building models, it's important to explore the data first 

- Calculate some Descriptive statistics for the data 

- helps describe the basic features of a data set and obtain a short summary and measures of the data

- numerical variables: use the describe() method in pandas to get basic statistics for all numerical variables

 : df.describe(); df.describe(include=['object']) - this will include object (string) columns too

- if describe() is applied to a dataframe with NaN values, the NaN values are excluded from the statistics
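As a minimal sketch (with a made-up DataFrame, since the course's automobile dataset isn't included here), describe() summarizes only the numeric columns by default and skips NaN values:

```python
import pandas as pd
import numpy as np

# hypothetical toy data; the course notebook uses the automobile dataset
df = pd.DataFrame({
    'price': [13495.0, 16500.0, np.nan, 13950.0],
    'body-style': ['convertible', 'convertible', 'hatchback', 'sedan'],
})

summary = df.describe()                  # numeric columns only
# count is 3, not 4: the NaN in 'price' is excluded
print(summary.loc['count', 'price'])

full = df.describe(include=['object'])   # summarize object (string) columns too
print(full.loc['unique', 'body-style'])  # number of distinct categories
```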

- categorical variables: can be divided into different categories or groups; use the value_counts() method

: df['drive-wheels'].value_counts()

: value_counts() only works on a Pandas Series, not a Pandas DataFrame -> use single brackets "df['drive-wheels']", not double brackets "df[['drive-wheels']]"

: convert the resulting Series to a DataFrame -> df['drive-wheels'].value_counts().to_frame()

: returns a Series containing the counts of unique values
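A runnable sketch of the snippets above, with a made-up drive-wheels column:

```python
import pandas as pd

# hypothetical column of drive-wheel categories
df = pd.DataFrame({'drive-wheels': ['fwd', 'fwd', 'rwd', '4wd', 'fwd']})

# value_counts works on a Series (single brackets), not a DataFrame
counts = df['drive-wheels'].value_counts()
print(counts['fwd'])

# convert the resulting Series to a DataFrame for nicer display
counts_df = df['drive-wheels'].value_counts().to_frame()
```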

- Box plot

: good for visualizing numeric data

: good for comparing groups and seeing the distribution within each group
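A minimal matplotlib sketch (with made-up price samples for two drive-wheel groups, not from the course dataset):

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend so this runs without a display
import matplotlib.pyplot as plt

# hypothetical price samples for two groups
fwd_prices = [9000, 10500, 11000, 12000, 13500]
rwd_prices = [15000, 17500, 19000, 22000, 30000]

fig, ax = plt.subplots()
result = ax.boxplot([fwd_prices, rwd_prices])  # one box per group
ax.set_xticklabels(['fwd', 'rwd'])
ax.set_ylabel('price')
```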

- Scatter Plot 

: for continuous variables in the data (numbers in some range)

: each observation is represented as a point

: predictor/independent variable on the x axis & target/dependent variable on the y axis

: make sure to label the x and y axes

: a scatter plot like this can show a positive, linear relationship between the variables
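A sketch of such a scatter plot with made-up engine-size/price pairs (the course plots the real automobile columns); the correlation coefficient confirms the positive relationship:

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend so this runs without a display
import matplotlib.pyplot as plt
import numpy as np

# hypothetical predictor (x axis) and target (y axis) values
engine_size = np.array([80, 100, 120, 140, 160, 180])
price = np.array([9000, 11000, 13500, 16000, 18500, 21000])

fig, ax = plt.subplots()
ax.scatter(engine_size, price)
ax.set_xlabel('engine-size')   # label the x axis
ax.set_ylabel('price')         # label the y axis

# close to +1 -> strong positive linear relationship
r = np.corrcoef(engine_size, price)[0, 1]
```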

GroupBy in Python

- use the Pandas dataframe.groupby() method

: on categorical variables 

: single or multiple variables 

df_test = df[['drive-wheels', 'body-style', 'price']]
df_grp = df_test.groupby(['drive-wheels', 'body-style'], as_index=False).mean()
df_grp 

- Pivot() in Pandas

: one variable displayed along the columns and the other variable displayed along the rows

df_pivot = df_grp.pivot(index='drive-wheels', columns='body-style')
#  drive-wheels displayed along the rows, body-style along the columns


grouped_pivot = df_pivot.fillna(0)  # fill missing values with 0
grouped_pivot

- Heatmap Plot 

: Takes a rectangular grid of data and assigns a color intensity based on the data value at the grid point 

: good for plotting the target variable over multiple variables to find the relationships among them

: in the example heatmap, the top section appears to have higher prices than the bottom section

: cmap='RdBu' - a red-to-blue color map

Correlation

- a statistical metric for measuring to what extent different variables are interdependent

- doesn't imply causation (when two variables have a relationship, we cannot say that one causes the other)

- Positive Linear Relationship(linear line = regression line) 

: we can use seaborn.regplot to create scatter plot 

sns.regplot(x="engine-size", y="price", data=df)
plt.ylim(0, )

- Negative Linear Relationship 

- Weak Correlation

: the variables cannot reliably be used to predict one another

Correlation - Statistics

- Pearson Correlation

: continuous numerical variables 

: gives you two values (the correlation coefficient and the p-value)

: Strong correlation - coefficient close to 1 or -1, and p-value less than 0.001

(Pearson correlation formula image from Wikipedia)

# using the scipy.stats package
from scipy import stats

pearson_coef, p_value = stats.pearsonr(df['horsepower'], df['price'])

pearson_coef, p_value = stats.pearsonr(df['wheel-base'], df['price'])
print("The Pearson Correlation Coefficient is", pearson_coef, " with a P-value of P =", p_value) 

: result - Pearson correlation 0.81 (close to 1), p-value 9.35e-48 (very small) -> strong positive correlation

 

- Correlation Heatmap

: the values on the diagonal are perfectly correlated (each variable with itself)
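A sketch of such a heatmap using df.corr() on made-up numeric data (column names 'a', 'b', 'c' are placeholders, not course columns):

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend so this runs without a display
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np

# hypothetical numeric data: 'b' is built to correlate with 'a'
rng = np.random.default_rng(0)
x = rng.normal(size=100)
df = pd.DataFrame({'a': x,
                   'b': 2 * x + rng.normal(size=100),
                   'c': rng.normal(size=100)})

corr = df.corr()
# every variable is perfectly correlated with itself -> 1s on the diagonal
print(corr.loc['a', 'a'])

plt.pcolor(corr, cmap='RdBu')
plt.colorbar()
```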

Exercise 

import matplotlib.pyplot as plt
# render plots inline in a Jupyter notebook
%matplotlib inline

#use the grouped results
plt.pcolor(grouped_pivot, cmap='RdBu')
plt.colorbar()
plt.show()

import numpy as np

fig, ax = plt.subplots()
im = ax.pcolor(grouped_pivot, cmap='RdBu')

#label names
row_labels = grouped_pivot.columns.levels[1]
col_labels = grouped_pivot.index

#move ticks and labels to the center
ax.set_xticks(np.arange(grouped_pivot.shape[1]) + 0.5, minor=False)
ax.set_yticks(np.arange(grouped_pivot.shape[0]) + 0.5, minor=False)

#insert labels
ax.set_xticklabels(row_labels, minor=False)
ax.set_yticklabels(col_labels, minor=False)

#rotate label if too long
plt.xticks(rotation=90)

fig.colorbar(im)
plt.show()

 

Association between two categorical variables: Chi-Square

: between 2 categorical variables - use the chi-square test for association

: measures how likely it is that an observed distribution is due to chance

: the null hypothesis is that the variables are independent

-> if the observed data doesn't fit the expected distribution, the evidence that the variables are dependent becomes stronger

-> i.e., evidence that the null hypothesis is incorrect

: does not tell you the type of the relationship, just whether the relationship exists or not

: using Pandas crosstab( a contingency table) - shows the counts in each category 

: the test statistic is the sum over all cells of (observed count - expected count) squared, divided by the expected count: χ² = Σ (O - E)² / E

: to get the expected value for each cell, follow the image above (row total × column total / grand total)

: rows of a chi-squared table -> the degrees of freedom = (number of rows - 1) × (number of columns - 1) of the contingency table

: columns -> find the closest chi-squared critical value

- use the chi2_contingency function in the scipy.stats package

: 29.6 (Chi-square test value) , 5.29...4e-08(p-value, very close to 0), 1 (degree of freedom)

: expected values returned in the array  
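A sketch of that call on a made-up 2x2 contingency table (the kind pd.crosstab produces; the counts here are invented, not the course's):

```python
import numpy as np
from scipy import stats

# hypothetical 2x2 contingency table of counts for two categorical variables
observed = np.array([[20, 30],
                     [40, 10]])

# returns the test statistic, the p-value, the degrees of freedom,
# and the array of expected counts under independence
chi2, p_value, dof, expected = stats.chi2_contingency(observed)

# dof = (rows - 1) * (columns - 1) = 1 for a 2x2 table
print(dof)
```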

 

Exercise: Use the "groupby" function to find the average "price" of each car based on "body-style".

df_gptest = df[['body-style','price']]
df_group_by_4 = df_gptest.groupby(['body-style'], as_index=False).mean()
df_group_by_4

ANOVA: Analysis of Variance

The Analysis of Variance (ANOVA) is a statistical method used to test whether there are significant differences between the means of two or more groups. ANOVA returns two parameters:

F-test score: ANOVA assumes the means of all groups are the same, calculates how much the actual means deviate from the assumption and reports it as the F-test score. A larger score means there is a larger difference between the means.

P-value: the p-value tells how statistically significant our calculated score is.

If our price variable is strongly correlated with the variable we are analyzing, expect ANOVA to return a sizeable F-test score and a small p-value.

grouped_test2=df_gptest[['drive-wheels', 'price']].groupby(['drive-wheels'])

grouped_test2.get_group('4wd')['price']

# ANOVA
f_val, p_val = stats.f_oneway(grouped_test2.get_group('fwd')['price'], grouped_test2.get_group('rwd')['price'], grouped_test2.get_group('4wd')['price'])  
 
print( "ANOVA results: F=", f_val, ", P =", p_val)  


 

 

Exercise 

1. What is the largest possible element resulting from the following operation, df.corr()?

 -> 1, because the correlation of a variable with itself is 1

 

2. 10 columns, 100 samples: how large is the output of df.corr()?

-> since each of the 10 columns can be correlated with every other column, the output of df.corr() is a 10 × 10 matrix
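This can be checked directly with random data (column names here are placeholders):

```python
import numpy as np
import pandas as pd

# 100 samples of 10 numeric columns
rng = np.random.default_rng(42)
df = pd.DataFrame(rng.normal(size=(100, 10)),
                  columns=[f'col{i}' for i in range(10)])

corr = df.corr()
print(corr.shape)  # one row and one column per variable
```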

 

