[IBM]Python for Data Science, AI & Development

[IBM]Python for Data Science, AI & Development - HTML for Webscraping

2021. 5. 11. 17:39ㆍData science/Python

HTML for Web Scraping

*Web Pages have lots of variable information.

*HTML Tags : marked with blue ink inside of angle brackets

*HTML Composition : head, body, <p> tag : paragraph

*HTML Anchor Tag - a piece of text which marks the beginning and/or the end of a hypertext link.

: <a hef ="https://www.ibm.com"> IBM webpage </a>

*HTML Hyperlink tag - link to the web page

HTML Trees

: HTML tag is parents of below tags. head and body are in the same level (siblings) and the others are child tags of them.

HTML Tables

<table>
	<tr>
		<td> ,,,</td>
	</tr>
</table>

Webscraping

: automatically extract information from a website and can easily be accomplished within a matter of minutes.

: We need python code + 2 modules (requests and beautiful soaps)

1) import beautifulSoup

: is a python library for pulling data out of HTML and XML files.

: Store the webpage HTML as a string in the variable HTML. To parse a document, pass it into the BeautifulSoup constructor.

-> then we get the beautifulSoup object (soup) which represents the document as a nested data structure

2)find_all

table_row = table.find__all(name='tr')

:loos through a tag's descendants and retrieves all descendants that match your filters.

:The methods signature for find_all(name, attrs, recursive, string, limit, **kwargs)

#HOW TO APPLY BeautifulSoup to a webpage.

1. import modules (requests, BeautifulSoup)

2. using get method to download the webpage

3. beautifulSoup object creating

Exercise

!pip install bs4              #!pip install requests

from bs4 import BeautifulSoup # this module helps in web scrapping.
import requests               # this module helps us to download a web page

%%html
<!DOCTYPE html>
<html>
<head>
<title>Page Title</title>
</head>
<body>
<h3><b id='boldest'>Lebron James</b></h3>
<p> Salary: $ 92,000,000 </p>
<h3> Stephen Curry</h3>
<p> Salary: $85,000, 000 </p>
<h3> Kevin Durant </h3>
<p> Salary: $73,200, 000</p>
</body>
</html>

#We can store it as a string in the variable HTML.
html="<!DOCTYPE html><html><head><title>Page Title</title></head><body><h3><b id='boldest'>Lebron James</b></h3><p> Salary: $ 92,000,000 </p><h3> Stephen Curry</h3><p> Salary: $85,000, 000 </p><h3> Kevin Durant </h3><p> Salary: $73,200, 000</p></body></html>"

#To parse a document, pass it into the BeautifulSoup constructor, the BeautifulSoup object, 
#which represents the document as a nested data structure:
soup = BeautifulSoup(html, 'html5lib')

#We can use the method prettify() to display the HTML in the nested structure:
print(soup.prettify())

1. convert to Unicode (similar to ASCII)

2. HTML entities are converted to Unicode

3. Beautiful Soup transforms a complex HTML document into a complex tree of Python Objects.

(Beautiful Soup object can create other types of objects)

#we can get the designated data we want using Tag
tag_object=soup.title
print("tag object:",tag_object)
print("tag object type:",type(tag_object))

#if there are more than 1 tag, first one will be called
tag_object=soup.h3
tag_object

#We can access to its parent or sibling 
parent_tag=tag_child.parent
parent_tag

sibling_1=tag_object.next_sibling
sibling_1

HTML Attributes

If the tag has attributes, the tag id="boldest" has an attribute id whose value is boldest. You can access a tag’s attributes by treating the tag like a dictionary:

[22]: tag_child['id']

[22]: 'boldest'

You can access that dictionary directly as attrs:

[23]: tag_child.attrs

[23]: {'id': 'boldest'}

You can also work with Multi-valued attribute check out [1] for more.

We can also obtain the content if the attribute of the tag using the Python get() method.

[24]: tag_child.get('id')

[24]: 'boldest'

https://gist.github.com/ca23b851f884f278a834b3d24309db2a

Created on Skills Network Labs

Created on Skills Network Labs. GitHub Gist: instantly share code, notes, and snippets.

gist.github.com

저작자표시 비영리 변경금지 (새창열림)

'Data science > Python' 카테고리의 다른 글

[IBM]Python for Data Science, AI & Development - Data engineering (0)	2021.05.11
[IBM]Python for Data Science, AI & Development - Working with different file formats (0)	2021.05.11
[IBM]Python for Data Science, AI & Development - REST APIs & HTTP Reqeusts (0)	2021.05.11
[IBM]Python for Data Science, AI & Development - Watson Speech to Text Translator (0)	2021.05.11
[IBM]Python for Data Science, AI & Development - API Excercise (0)	2021.05.11

Piccole Gioie🪴

Piccole Gioie🪴

태그

최근글

댓글

공지사항

아카이브

HTML Attributes

'Data science > Python' 카테고리의 다른 글

관련글

티스토리툴바