Scrape with BeautifulSoup

Scrape IBM Page


Send a GET request to the IBM home page

Create a BeautifulSoup object using the BeautifulSoup constructor

Scrape all the links on the page

Scrape all the images on the page

GET Request/Soup

pip install bs4
pip install html5lib
from bs4 import BeautifulSoup
import requests

url = "http://www.ibm.com"

# Send a GET request and retrieve the response body as text
data = requests.get(url).text

# Use the BeautifulSoup constructor to create a BeautifulSoup object
soup = BeautifulSoup(data, "html5lib")
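
The request above sets no timeout and does not check the HTTP status code. A minimal sketch of a more defensive version follows. The helper names fetch_html and make_soup, the 10-second default timeout, and the session parameter are all hypothetical choices made for this illustration, not part of the original walkthrough.

```python
import requests
from bs4 import BeautifulSoup


def fetch_html(url, timeout=10.0, session=None):
    """Return the body of `url` as text, raising on HTTP error statuses.

    `fetch_html`, `timeout`, and `session` are hypothetical names chosen
    for this sketch. `session` may be a requests.Session or any object
    with a compatible get() method, which makes offline testing easier.
    """
    client = session if session is not None else requests
    response = client.get(url, timeout=timeout)
    response.raise_for_status()  # raises requests.HTTPError on 4xx/5xx
    return response.text


def make_soup(html, parser="html.parser"):
    """Parse HTML text; pass parser="html5lib" to match the code above."""
    return BeautifulSoup(html, parser)
```

With these helpers, the two steps above could be written as soup = make_soup(fetch_html("http://www.ibm.com"), "html5lib").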

Scrape Links

# In HTML, a link is represented by the <a> tag
links = soup.find_all('a', href=True)
for link in links:
    print(link.get('href'))
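
Many href values on a page are relative paths rather than full URLs. The sketch below shows one way to collect every link as an absolute URL using urljoin from the standard library. The function name extract_links and the sample HTML are hypothetical, and the sample is parsed with the built-in html.parser so the example runs without network access.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup


def extract_links(html, base_url):
    """Return absolute URLs for every <a> tag that has an href attribute."""
    soup = BeautifulSoup(html, "html.parser")
    urls = []
    for a in soup.find_all("a", href=True):
        # urljoin leaves absolute URLs unchanged and resolves relative ones
        urls.append(urljoin(base_url, a["href"]))
    return urls


# Hypothetical sample markup standing in for a downloaded page
sample = '<a href="/cloud">Cloud</a><a href="https://example.com/x">X</a><a>no href</a>'
print(extract_links(sample, "http://www.ibm.com"))
# → ['http://www.ibm.com/cloud', 'https://example.com/x']
```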

Scrape Images

  • As we can see from the results, no images are present
soup.find_all('img')
# This will print out the entire object
soup
# In HTML, an image is represented by the <img> tag
for img in soup.find_all('img'):
    print(img)
    print(img.get('src'))

# No <img> tags are found in the HTML returned by requests, which is unexpected for a home page
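
One possible explanation, offered here as an assumption rather than something verified against the IBM page, is that the images are inserted by JavaScript after the page loads (requests does not execute JavaScript), or that the image URL is stored in an attribute other than src. The sketch below checks src first and then falls back to data-src, a common but not universal lazy-loading convention. The function name image_sources and the sample HTML are hypothetical.

```python
from bs4 import BeautifulSoup


def image_sources(html, attrs=("src", "data-src")):
    """Return the first non-empty source attribute of each <img> tag."""
    soup = BeautifulSoup(html, "html.parser")
    sources = []
    for img in soup.find_all("img"):
        for name in attrs:
            value = img.get(name)
            if value:
                sources.append(value)
                break  # stop at the first attribute that holds a URL
    return sources


# Hypothetical sample markup; the last tag has no usable source
sample = '<img src="a.png"><img data-src="b.jpg"><img alt="no source">'
print(image_sources(sample))
# → ['a.png', 'b.jpg']
```

If this also returns an empty list for a real page, the images are probably not present in the static HTML at all.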

Scrape IBM Wikipedia


In this example we’ll scrape the IBM Wikipedia page and extract all the links from it.

# Specify the URL of the webpage you want to scrape
url = 'https://en.wikipedia.org/wiki/IBM'

# Send an HTTP GET request to the webpage
response = requests.get(url)

# Store the HTML content in a variable
html_content = response.text

# Create a BeautifulSoup object to parse the HTML
soup = BeautifulSoup(html_content, 'html.parser')

# Display a snippet of the HTML content
print(html_content[:500])
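
The code above stops after parsing the page, so the link extraction itself still needs to be written. The following is a minimal sketch of one way to do it, keeping only internal article links and converting them to absolute URLs. The function name wiki_article_links, the filtering rules, and the sample HTML are assumptions made for this illustration. In particular, skipping every href that contains a colon is a simplification that also drops the few articles whose titles contain one.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup


def wiki_article_links(html, base_url="https://en.wikipedia.org"):
    """Return de-duplicated absolute URLs of internal /wiki/ article links."""
    soup = BeautifulSoup(html, "html.parser")
    seen = set()
    links = []
    for a in soup.find_all("a", href=True):
        href = a["href"]
        # Keep internal article links; skip namespaced pages such as "File:" or "Help:"
        if not href.startswith("/wiki/") or ":" in href:
            continue
        # Drop any #fragment so links to sections of one article collapse together
        url = urljoin(base_url, href.split("#")[0])
        if url not in seen:
            seen.add(url)
            links.append(url)
    return links


# Hypothetical sample markup standing in for html_content
sample = (
    '<a href="/wiki/Armonk,_New_York">Armonk</a>'
    '<a href="/wiki/File:IBM_logo.svg">logo</a>'
    '<a href="/wiki/Armonk,_New_York#History">History</a>'
    '<a href="https://www.ibm.com">Official site</a>'
    '<a href="#cite_note-1">[1]</a>'
)
print(wiki_article_links(sample))
# → ['https://en.wikipedia.org/wiki/Armonk,_New_York']
```

Calling wiki_article_links(html_content) applies the same filter to the downloaded page. To print every raw href instead, loop over soup.find_all('a') as in the IBM example.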