pip install bs4 html5lib
Scrape with BeautifulSoup
Scrape IBM Page
- GET request the IBM home page
- Create a BeautifulSoup object using the BeautifulSoup constructor
- Scrape all the links on the page
- Scrape all the images on the page
GET Request/Soup
from bs4 import BeautifulSoup
import requests

url = "http://www.ibm.com"

# send a GET request to retrieve the page data in text format
data = requests.get(url).text

# use the BeautifulSoup constructor to create a BeautifulSoup object
soup = BeautifulSoup(data, "html5lib")
Scrape Links
- If we use href=True without the ‘a’ tag, it will pull unwanted data
- So include ‘a’ in the argument list
links = soup.find_all('a', href=True)
links
Print the Links
for link in links:
    print(link.get('href'))
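The hrefs printed above are a mix of absolute URLs and site-relative paths. Before downloading anything, the relative ones need to be resolved against the page’s base URL; a minimal sketch using the standard library’s urljoin (the example hrefs are invented for illustration):

```python
from urllib.parse import urljoin

base_url = "http://www.ibm.com"

# Hypothetical hrefs of the kinds find_all('a', href=True) can return:
# site-relative, already absolute, and fragment-only
hrefs = ["/products", "https://www.ibm.com/cloud", "#main"]

for href in hrefs:
    # urljoin leaves absolute URLs untouched and resolves the rest
    print(urljoin(base_url, href))
```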
Scrape Images
- As we can see from the results, no images are present
soup.find_all('img')

# This will print out the entire object
soup
# in HTML an image is represented by the <img> tag
for link in soup.find_all('img'):
    print(link)
    print(link.get('src'))

# It appears that none are present, which doesn't make sense
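It does make sense once you recall that requests only fetches the initial HTML: if a page attaches its images with JavaScript (as the IBM home page appears to), no <img> tags exist in the markup BeautifulSoup sees. A small static snippet shows that find_all('img') works fine when the tags really are in the HTML (the snippet and filenames are invented for illustration):

```python
from bs4 import BeautifulSoup

# A hand-written static page: the <img> tags are present in the raw HTML
html = '<html><body><img src="logo.png"><img src="photo.jpg"></body></html>'
soup = BeautifulSoup(html, 'html.parser')

# find_all('img') returns both tags because they exist in the markup itself
for img in soup.find_all('img'):
    print(img.get('src'))
```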
Scrape IBM Wikipedia
In this example we’ll scrape the IBM Wikipedia page and extract all the links from it.
# Specify the URL of the webpage you want to scrape
url = 'https://en.wikipedia.org/wiki/IBM'

# Send an HTTP GET request to the webpage
response = requests.get(url)

# Store the HTML content in a variable
html_content = response.text

# Create a BeautifulSoup object to parse the HTML
soup = BeautifulSoup(html_content, 'html.parser')

# Display a snippet of the HTML content
print(html_content[:500])
Extract Links
- If you remember from above, links are marked with the ‘a’ tag
- To extract all the links from the page, we just call soup.find_all(‘a’)
# Find all <a> tags (anchor tags) in the HTML - the result is a list
links = soup.find_all('a')
links
# Iterate through the list of links and print their text - too long to show in full
for link in links:
    print(link.text)
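Many anchors on a Wikipedia page have no href or no visible text (citation markers, navigation widgets), so the raw list is noisy. A hedged sketch of filtering it down, run here on an invented snippet rather than the live page:

```python
from bs4 import BeautifulSoup

# Invented snippet standing in for real Wikipedia markup
html = ('<a href="/wiki/IBM">IBM</a>'
        '<a href="#cite_note-1"></a>'
        '<a>no href here</a>')
soup = BeautifulSoup(html, 'html.parser')

# Keep only anchors that have both an href and non-empty visible text
useful = [a for a in soup.find_all('a', href=True) if a.text.strip()]
for a in useful:
    print(a.text, '->', a['href'])
```

Only the first anchor survives: the second has an href but no text, and the third has text but no href.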