2021. 5. 11. 21:28ㆍData science/Python
*Data Analysis: acquiring data in various ways and extracting the necessary insights from a dataset
*Binary File Format: a file that is not readable as plain text because it contains formatting information
: To read such a file, it must first be opened with the appropriate software or processor.
: Examples: images (JPEG, GIF), audio (MP3), and document formats such as Word or PDF.
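As a minimal sketch of what "not readable" means in practice, the snippet below writes a few bytes to a temporary file and reads them back in binary mode ('rb'). The file name sample.bin and the helper looks_like_png are made up for illustration; the 8-byte signature is the one every PNG file starts with.

```python
import os
import tempfile

# The fixed 8-byte signature at the start of every PNG file
PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"

def looks_like_png(path):
    """Return True if the file starts with the PNG signature (hypothetical helper)."""
    with open(path, "rb") as f:      # 'rb' = read raw bytes, no text decoding
        return f.read(8) == PNG_SIGNATURE

# Write a fake "PNG" (signature plus three filler bytes) to a temporary file
tmp_dir = tempfile.mkdtemp()
sample_path = os.path.join(tmp_dir, "sample.bin")
with open(sample_path, "wb") as f:
    f.write(PNG_SIGNATURE + b"\x00\x01\x02")

# Read the whole file back as bytes
with open(sample_path, "rb") as f:
    raw = f.read()

print(raw)                           # raw bytes, not meaningful text
print(looks_like_png(sample_path))   # signature check
```

Checking the first few bytes like this is roughly how software decides which decoder a binary file needs.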
*Reading the Image file
: Python has the PIL library (maintained today as Pillow), which gives the Python interpreter image editing capabilities.
#importing PIL
from PIL import Image
import urllib.request
#downloading the image and saving it locally as dog.jpg
urllib.request.urlretrieve("http://hips.hearstapps.com/...", "dog.jpg")
#result
('dog.jpg', <http.client.HTTPMessage at 0x7fb8548e0518>)
#read image
img = Image.open('dog.jpg')
#output the image (display() is available in Jupyter/IPython; use img.show() elsewhere)
display(img)
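The same steps can be tried without a download. This is only a sketch: it creates a small solid-color image as a stand-in for dog.jpg (the file name sample.png, the size, and the color are made up for illustration) and then inspects a few attributes of the object returned by Image.open().

```python
import os
import tempfile
from PIL import Image

# Create a 64x48 RGB image as a stand-in for a downloaded picture
tmp_path = os.path.join(tempfile.mkdtemp(), "sample.png")
Image.new("RGB", (64, 48), color=(200, 120, 40)).save(tmp_path)

# Open the file and inspect basic properties
img = Image.open(tmp_path)
print(img.format)   # file format detected by PIL
print(img.size)     # (width, height) in pixels
print(img.mode)     # color mode

# Simple edit: convert to grayscale and shrink to half size
small_gray = img.convert("L").resize((32, 24))
print(small_gray.size, small_gray.mode)
```

convert() and resize() return new images, so the original img is left unchanged.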
Exercise
import pandas as pd
#reading dataset and save it into df
path = "https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-PY0101EN-SkillsNetwork/labs/Module%205/data/diabetes.csv"
df = pd.read_csv(path)
#show the first 5 rows using the dataframe.head() method
#dataframe.head(n) or dataframe.tail(n) can be used to show the first or last n rows
print("The first 5 rows of the dataFrame")
df.head(5)
#To view the dimensions of the dataframe -> the .shape attribute can be used
df.shape
result -> (768, 9) : 768 rows and 9 columns
#To print information about the DataFrame, including the index dtype, column dtypes, non-null counts and memory usage
df.info()
#To view basic statistical details such as percentiles, mean, std etc. of a DataFrame or a Series of numeric values
df.describe()
#To identify and handle missing values: .isnull(), .notnull()
#Count missing values in each column (True - missing, False - present; value_counts() counts how often each value occurs, so the count for "True" is the number of missing entries)
-> There are no missing values in this dataset (no True)
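The notes above describe the check without showing it. A minimal sketch, using a small made-up DataFrame (not the diabetes data) so that a missing value actually appears; filling with the column mean is just one possible way to handle it:

```python
import numpy as np
import pandas as pd

# Hypothetical sample data with one missing Glucose value
sample = pd.DataFrame({
    "Glucose": [148.0, np.nan, 183.0],
    "BMI": [33.6, 26.6, 23.3],
})

missing = sample.isnull()            # True where a value is missing
print(missing.sum())                 # number of missing values per column

# value_counts() on one column of the mask counts True and False separately
for column in missing.columns:
    print(column)
    print(missing[column].value_counts())

# One way to handle the gap: fill it with the column mean
filled = sample.fillna({"Glucose": sample["Glucose"].mean()})
print(filled)
```

On the diabetes data the same pattern would be df.isnull().sum(), which should report zero for every column if no values are missing.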
#To check the data type: .dtypes (an attribute, not a method) - check the data type of each column / astype() - change the data type
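A short sketch of checking and changing data types on made-up data; the DataFrame and its values are only illustrative and are not taken from the diabetes dataset:

```python
import pandas as pd

# Hypothetical data: Outcome is stored as text and needs converting
sample = pd.DataFrame({"Age": [50, 31, 32], "Outcome": ["1", "0", "1"]})

print(sample.dtypes)                  # dtype of every column
print(sample["Age"].dtype)            # dtype of a single column (Series)

# astype() returns a new object; assign it back to keep the change
sample["Outcome"] = sample["Outcome"].astype(int)
sample["Age"] = sample["Age"].astype("float64")
print(sample.dtypes)
```

Because astype() does not modify the column in place, forgetting the assignment leaves the original type unchanged.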
*Visualization: Seaborn and Matplotlib are two of Python's most widely used visualization libraries.
import matplotlib.pyplot as plt
import seaborn as sns
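#value_counts() sorts by frequency, so sort_index() puts the counts in Outcome order (0, 1)
counts = df['Outcome'].value_counts().sort_index()
labels = 'Not Diabetic', 'Diabetic'  #assumes Outcome 0 = not diabetic, 1 = diabetic
plt.pie(counts, labels=labels, autopct='%0.02f%%')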
plt.legend()
plt.show()