[IBM]Databases and SQL for Data Science with Python - Analyzing data with Python

2021. 5. 2. 19:32Data science/Database

반응형

Load CSV file into DB2 on cloud : Source-Target-Define-Finalize 

 

data.cityofchicago.org/Health-Human-Services/Census-Data-Selected-socioeconomic-indicators-in-C/kn9c-c2s2?utm_email=Email&utm_source=Nurture&utm_content=000026UJ&utm_term=10006555&utm_campaign=PLACEHOLDER&utm_id=SkillsNetwork-Courses-IBMDeveloperSkillsNetwork-DB0201EN-SkillsNetwork-20127838

 

Census Data - Selected socioeconomic indicators in Chicago, 2008 – 2012 | City of Chicago | Data Portal

 

data.cityofchicago.org

  • Community Area Number (ca): Used to uniquely identify each row of the dataset
  • Community Area Name (community_area_name): The name of the region in the city of Chicago
  • Percent of Housing Crowded (percent_of_housing_crowded): Percent of occupied housing units with more than one person per room
  • Percent Households Below Poverty (percent_households_below_poverty): Percent of households living below the federal poverty line
  • Percent Aged 16+ Unemployed (percent_aged_16_unemployed): Percent of persons over the age of 16 years that are unemployed
  • Percent Aged 25+ without High School Diploma (percent_aged_25_without_high_school_diploma): Percent of persons over the age of 25 years without a high school education
  • Percent Aged Under 18 or Over 64:Percent of population under 18 or over 64 years of age (percent_aged_under_18_or_over_64): (ie. dependents)
  • Per Capita Income (per_capita_income_): Community Area per capita income is estimated as the sum of tract-level aggragate incomes divided by the total population
  • Hardship Index (hardship_index): Score that incorporates each of the six selected socioeconomic indicators

 

We will first read the dataset source .CSV from the internet into pandas dataframe.

Then we need to create a table in our Db2 database to store the dataset. The PERSIST command in SQL "magic" simplifies the process of table creation and writing the data from a pandas dataframe into the table

import pandas
chicago_socioeconomic_data = pandas.read_csv('https://data.cityofchicago.org/resource/jcxq-k9xf.csv')
%sql PERSIST chicago_socioeconomic_data

 

Create a scatter plot using the variables per_capita_income_ and hardship_index. Explain the correlation between the two variables.

 

import matplotlib.pyplot as plt 
%matplotlib inline
import seaborn as sns

income_vs_hardship = %sql select per_capita_income_, hardship_index from chicago_socioeconomic_data; 

plot = sns.jointplot(x= 'per_capita_income_', y='hardship_index', data=income_vs_hardship.DataFrame())

 

https://gist.github.com/3eb707f353307c7d173af087134f8f64

 

Created on Skills Network Labs

Created on Skills Network Labs. GitHub Gist: instantly share code, notes, and snippets.

gist.github.com

  • You can access a database from a language like Python by using the appropriate API. Examples include ibm_db API for IBM DB2, psycopg2 for ProstgreSQL, and dblib API for SQL Server.
  • DB-API is Python's standard API for accessing relational databases. It allows you to write a single program that works with multiple kinds of relational databases instead of writing a separate program for each one.
  • The DB_API connect constructor creates a connection to the database and returns a Connection Object, which is then used by the various connection methods.
  • The connection methods are:
    • The cursor() method, which returns a new cursor object using the connection.
    • The commit() method, which is used to commit any pending transaction to the database.
    • The rollback() method, which causes the database to roll back to the start of any pending transaction.
    • The close() method, which is used to close a database connection.
  • You can use SQL Magic commands to execute queries more easily from Jupyter Notebooks. Magic commands have the general format %sql select * from tablename.
  • Cell magics start with a double %% (percent) sign and apply to the entire cell. Line magics start with a single % (percent) sign and apply to a particular line in a cell.

An ibm_db.exec_immediate function that includes connection information and a SQL statement to create the table.

connect('database name', 'username', 'password')

질문 1

A database cursor is a control structure that enables traversal over the records in a database

The ibm_db API provides a variety of useful Python functions for accessing and manipulating data in an IBM data server like Db2

A Dataframe represents a tabular, spreadsheet-like data structure containing an ordered collection of columns, each of which can be a different value type. Indicate whether the following statement is True or False:

A pandas dataframe in Python can be used for storing the result set of a SQL query.

 

 

반응형