2021. 5. 2. 19:32ㆍData science/Database
Load CSV file into DB2 on cloud : Source-Target-Define-Finalize
Census Data - Selected socioeconomic indicators in Chicago, 2008 – 2012 | City of Chicago | Data Portal
data.cityofchicago.org
- Community Area Number (ca): Used to uniquely identify each row of the dataset
- Community Area Name (community_area_name): The name of the region in the city of Chicago
- Percent of Housing Crowded (percent_of_housing_crowded): Percent of occupied housing units with more than one person per room
- Percent Households Below Poverty (percent_households_below_poverty): Percent of households living below the federal poverty line
- Percent Aged 16+ Unemployed (percent_aged_16_unemployed): Percent of persons over the age of 16 years that are unemployed
- Percent Aged 25+ without High School Diploma (percent_aged_25_without_high_school_diploma): Percent of persons over the age of 25 years without a high school education
- Percent Aged Under 18 or Over 64:Percent of population under 18 or over 64 years of age (percent_aged_under_18_or_over_64): (ie. dependents)
- Per Capita Income (per_capita_income_): Community Area per capita income is estimated as the sum of tract-level aggragate incomes divided by the total population
- Hardship Index (hardship_index): Score that incorporates each of the six selected socioeconomic indicators
We will first read the dataset source .CSV from the internet into pandas dataframe.
Then we need to create a table in our Db2 database to store the dataset. The PERSIST command in SQL "magic" simplifies the process of table creation and writing the data from a pandas dataframe into the table
import pandas
chicago_socioeconomic_data = pandas.read_csv('https://data.cityofchicago.org/resource/jcxq-k9xf.csv')
%sql PERSIST chicago_socioeconomic_data
Create a scatter plot using the variables per_capita_income_ and hardship_index. Explain the correlation between the two variables.
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
income_vs_hardship = %sql select per_capita_income_, hardship_index from chicago_socioeconomic_data;
plot = sns.jointplot(x= 'per_capita_income_', y='hardship_index', data=income_vs_hardship.DataFrame())
https://gist.github.com/3eb707f353307c7d173af087134f8f64
Created on Skills Network Labs
Created on Skills Network Labs. GitHub Gist: instantly share code, notes, and snippets.
gist.github.com
- You can access a database from a language like Python by using the appropriate API. Examples include ibm_db API for IBM DB2, psycopg2 for ProstgreSQL, and dblib API for SQL Server.
- DB-API is Python's standard API for accessing relational databases. It allows you to write a single program that works with multiple kinds of relational databases instead of writing a separate program for each one.
- The DB_API connect constructor creates a connection to the database and returns a Connection Object, which is then used by the various connection methods.
- The connection methods are:
- The cursor() method, which returns a new cursor object using the connection.
- The commit() method, which is used to commit any pending transaction to the database.
- The rollback() method, which causes the database to roll back to the start of any pending transaction.
- The close() method, which is used to close a database connection.
- You can use SQL Magic commands to execute queries more easily from Jupyter Notebooks. Magic commands have the general format %sql select * from tablename.
- Cell magics start with a double %% (percent) sign and apply to the entire cell. Line magics start with a single % (percent) sign and apply to a particular line in a cell.
An ibm_db.exec_immediate function that includes connection information and a SQL statement to create the table.
connect('database name', 'username', 'password')
질문 1
A database cursor is a control structure that enables traversal over the records in a database
The ibm_db API provides a variety of useful Python functions for accessing and manipulating data in an IBM data server like Db2
A Dataframe represents a tabular, spreadsheet-like data structure containing an ordered collection of columns, each of which can be a different value type. Indicate whether the following statement is True or False:
A pandas dataframe in Python can be used for storing the result set of a SQL query.