Scrape with BeautifulSoup

Scrape IBM Page


  • Send a GET request to the IBM home page
  • Create a BeautifulSoup object with the BeautifulSoup constructor
  • Scrape all the links on the page
  • Scrape all the images on the page

GET Request/Soup

# Install the dependencies first (shell commands, not Python):
#   pip install bs4
#   pip install html5lib
from bs4 import BeautifulSoup
import requests

url = "http://www.ibm.com"

# Send a GET request and retrieve the response body as text
data = requests.get(url).text

# Use the BeautifulSoup constructor to create a BeautifulSoup object
soup = BeautifulSoup(data, "html5lib")
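To see what the constructor produces without depending on a live site, the same call can be tried on a small inline HTML string. This is only a sketch: the markup and the names sample_html and sample_soup are made up for illustration, and it uses Python's built-in "html.parser" instead of html5lib so that no extra parser is required.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for a downloaded page
sample_html = """
<html>
  <head><title>Sample page</title></head>
  <body>
    <h1>Hello</h1>
    <p class="intro">First paragraph.</p>
  </body>
</html>
"""

# Same constructor as above, but with the parser from the standard library
sample_soup = BeautifulSoup(sample_html, "html.parser")

page_title = sample_soup.title.string                      # text inside <title>
heading = sample_soup.h1.get_text()                        # text of the first <h1>
intro = sample_soup.find("p", class_="intro").get_text()   # first <p class="intro">

print(page_title)
print(heading)
print(intro)
```

Once the object exists, tags can be reached either as attributes (sample_soup.title) or through find and find_all, which is what the following sections rely on.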

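Scrape Links

# In HTML, a link is represented by the <a> tag; href=True skips anchors without a target
links = soup.find_all('a', href=True)
for link in links:
    print(link.get('href'))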
https://www.ibm.com/granite?lnk=dev
https://developer.ibm.com/technologies/artificial-intelligence?lnk=dev
https://www.ibm.com/products/watsonx-code-assistant?lnk=dev
https://www.ibm.com/watsonx/developer/?lnk=dev
https://www.ibm.com/thought-leadership/institute-business-value/report/ceo-generative-ai?lnk=bus
https://www.ibm.com/think/videos/ai-academy
https://www.ibm.com/products/watsonx-orchestrate/ai-agent-for-hr?lnk=bus
https://www.ibm.com/products/guardium-data-security-center?lnk=bus
https://www.ibm.com/artificial-intelligence?lnk=ProdC
https://www.ibm.com/hybrid-cloud?lnk=ProdC
https://www.ibm.com/consulting?lnk=ProdC
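The links printed above happen to be absolute URLs, but pages often contain relative ones as well. A minimal sketch of normalizing them, assuming a made-up base address and made-up markup (base_url, sample_html and absolute_links are illustrative names, not part of the IBM page), could look like this:

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Hypothetical address of the page the markup came from
base_url = "https://www.example.com/products/"

# Hypothetical markup mixing absolute links, relative links, and an anchor without href
sample_html = """
<a href="https://www.example.com/about">About</a>
<a href="/contact">Contact</a>
<a href="pricing?lnk=dev">Pricing</a>
<a name="top">Anchor without href</a>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")

absolute_links = []
# href=True keeps only <a> tags that actually carry an href attribute
for link in sample_soup.find_all("a", href=True):
    # urljoin resolves relative paths against the base and leaves absolute URLs unchanged
    absolute_links.append(urljoin(base_url, link["href"]))

for href in absolute_links:
    print(href)
```

Resolving against the page address keeps the collected links usable when they are requested later from a different context.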

Scrape Images

  • As the results below show, find_all returns no <img> tags for this page
soup.find_all('img')
[]
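# Display the entire parsed document to inspect what was actually downloaded
soup
# In HTML, an image is represented by the <img> tag
for img in soup.find_all('img'):
    print(img)
    print(img.get('src'))

# Nothing is printed. A likely explanation (not verified here) is that the page
# adds its images with JavaScript after loading, and requests only downloads the
# initial HTML without running any scripts.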
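When images are present in the downloaded HTML, some of them may have no src attribute because the page loads them lazily. The sketch below runs on made-up markup and assumes the lazy images keep their address in a data-src attribute. That is a common convention, but it is an assumption here and has not been confirmed for the IBM page.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: one regular image, one lazy-loaded image, one with no source
sample_html = """
<div id="hero"></div>
<img src="/images/logo.png" alt="Logo">
<img data-src="/images/banner.jpg" alt="Lazy banner">
<img alt="No source at all">
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")
image_tags = sample_soup.find_all("img")

image_sources = []
for img in image_tags:
    # Prefer src; fall back to data-src (assumed lazy-loading attribute)
    source = img.get("src") or img.get("data-src")
    if source:
        image_sources.append(source)

print(len(image_tags))
print(image_sources)
```

If find_all('img') still returns an empty list, as it does above, the images are most likely not in the initial HTML at all, and a tool that executes JavaScript would be needed to see them.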

Scrape IBM Wikipedia


In this example we’ll scrape the IBM Wikipedia page and extract all the links from it.

# Specify the URL of the webpage you want to scrape
url = 'https://en.wikipedia.org/wiki/IBM'

# Send an HTTP GET request to the webpage
response = requests.get(url)

# Store the HTML content in a variable
html_content = response.text

# Create a BeautifulSoup object to parse the HTML
soup = BeautifulSoup(html_content, 'html.parser')

# Display a snippet of the HTML content
print(html_content[:500])
<!DOCTYPE html>
<html class="client-nojs vector-feature-language-in-header-enabled vector-feature-language-in-main-page-header-disabled vector-feature-sticky-header-disabled vector-feature-page-tools-pinned-disabled vector-feature-toc-pinned-clientpref-1 vector-feature-main-menu-pinned-disabled vector-feature-limited-width-clientpref-1 vector-feature-limited-width-content-enabled vector-feature-custom-font-size-clientpref-1 vector-feature-appearance-pinned-clientpref-1 vector-feature-night-mode-