Scrape with BeautifulSoup

Scrape IBM Page

Send a GET request to the IBM home page

Create a BeautifulSoup object using the BeautifulSoup constructor

Scrape all the links on the page

Scrape all the images on the page

GET Request/Soup

pip install bs4
pip install html5lib
pip install requests

from bs4 import BeautifulSoup
import requests

url = "http://www.ibm.com"

# Send a GET request and keep the response body as text
data = requests.get(url).text

# Use the BeautifulSoup constructor to parse the HTML with the html5lib parser
soup = BeautifulSoup(data, "html5lib")
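The request above assumes the server answers quickly and successfully. A minimal sketch of a more defensive fetch is shown below. The helper name `fetch_html` and its default values are assumptions made for illustration; `timeout`, `headers`, and `raise_for_status()` are standard parts of the requests library.

```python
import requests


def fetch_html(url, timeout=10.0, headers=None):
    """Return the body of `url` as text, raising on HTTP errors.

    `fetch_html` is a hypothetical helper, not part of requests or bs4.
    """
    # A timeout stops the call from waiting forever on an unresponsive server
    response = requests.get(url, timeout=timeout, headers=headers)
    # raise_for_status() raises requests.HTTPError for 4xx and 5xx responses
    response.raise_for_status()
    return response.text
```

With this helper, `data = fetch_html(url)` could stand in for `requests.get(url).text`, so a failed request raises an error instead of handing an error page to the parser.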

Scrape Links

Calling find_all(href=True) without a tag name matches every element that has an href attribute, including non-link tags such as <link>
To return only hyperlinks, pass 'a' as the first argument

links = soup.find_all('a',href=True)
links
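The difference between the two calls can be seen on a small hand-written page. This is only a sketch: the HTML below is invented for illustration and is not IBM's markup, and it uses the built-in 'html.parser' so that it runs without html5lib.

```python
from bs4 import BeautifulSoup

# Small hand-written page used only for illustration (not IBM's real markup)
html = """
<html>
  <head><link rel="stylesheet" href="/static/site.css"></head>
  <body>
    <a href="/products">Products</a>
    <a name="top">Anchor without href</a>
    <a href="https://example.com/about">About</a>
  </body>
</html>
"""

soup = BeautifulSoup(html, "html.parser")

# Every tag that has an href attribute, whatever its name
any_href = [tag.name for tag in soup.find_all(href=True)]

# Only <a> tags that have an href attribute
anchor_hrefs = [a["href"] for a in soup.find_all("a", href=True)]

print(any_href)      # ['link', 'a', 'a']
print(anchor_hrefs)  # ['/products', 'https://example.com/about']
```

The first call picks up the stylesheet <link>, while the second returns only anchors and also skips the anchor that has no href.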

Print Link URLs

for link in links:
    print(link.get('href'))
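Many href values on a real page are relative (for example "/products") or point to a fragment on the same page. One possible way to turn them into full URLs is `urllib.parse.urljoin` from the standard library. The sample HTML and the decision to skip fragment links are assumptions made for this sketch.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

base_url = "http://www.ibm.com"

# Hand-written sample links, for illustration only
html = """
<a href="/products">Products</a>
<a href="https://example.com/about">About</a>
<a href="#main">Skip to content</a>
"""

soup = BeautifulSoup(html, "html.parser")

absolute_links = []
for link in soup.find_all("a", href=True):
    href = link["href"]
    if href.startswith("#"):
        continue  # skip links that only jump within the same page
    # urljoin leaves absolute URLs unchanged and resolves relative ones
    absolute_links.append(urljoin(base_url, href))

print(absolute_links)
# ['http://www.ibm.com/products', 'https://example.com/about']
```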

Scrape Images

As the results below show, find_all('img') finds no <img> tags in the HTML returned for this page

soup.find_all('img')

# This will print out the entire object
soup

# In HTML, an image is represented by the <img> tag
for img in soup.find_all('img'):
    print(img)
    print(img.get('src'))

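# No <img> tags are found, which is unexpected for a page that shows images in a browser.
# A likely reason (an assumption, not verified here) is that the images are added by
# JavaScript or CSS, which requests does not execute.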
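When find_all('img') comes back empty or incomplete, image URLs sometimes sit in other places in the static HTML. The sketch below checks a few of them on invented markup: `src`, the common but non-standard `data-src` lazy-loading attribute, and `srcset` on <img> or <source> tags. Whether the IBM page uses any of these is not known; images inserted entirely by JavaScript would still be missed.

```python
from bs4 import BeautifulSoup

# Invented markup for illustration; not taken from the IBM page
html = """
<img src="/logo.png" alt="Logo">
<img data-src="/hero.jpg" alt="Lazy-loaded hero">
<picture>
  <source srcset="/banner-small.jpg 480w, /banner-large.jpg 1200w">
</picture>
"""

soup = BeautifulSoup(html, "html.parser")

image_urls = []
for tag in soup.find_all(["img", "source"]):
    # src is the standard attribute; data-src is a lazy-loading convention
    for attr in ("src", "data-src"):
        if tag.get(attr):
            image_urls.append(tag[attr])
    if tag.get("srcset"):
        # srcset is a comma-separated list of "URL descriptor" entries;
        # this naive split assumes the URLs themselves contain no commas
        for candidate in tag["srcset"].split(","):
            image_urls.append(candidate.strip().split()[0])

print(image_urls)
# ['/logo.png', '/hero.jpg', '/banner-small.jpg', '/banner-large.jpg']
```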

Scrape IBM Wikipedia

In this example we’ll scrape the IBM Wikipedia page and extract all the links from it

# Specify the URL of the webpage you want to scrape
url = 'https://en.wikipedia.org/wiki/IBM'

# Send an HTTP GET request to the webpage
response = requests.get(url)

# Store the HTML content in a variable
html_content = response.text

# Create a BeautifulSoup object to parse the HTML
soup = BeautifulSoup(html_content, 'html.parser')

# Display a snippet of the HTML content
print(html_content[:500])

Extract Links

As shown earlier, links are marked with the <a> tag
To extract all the links from the IBM Wikipedia page, call find_all('a') on the BeautifulSoup object

# Find all <a> (anchor) tags in the HTML; the result behaves like a list
links = soup.find_all('a')

# Iterate through the list of links and print their text (the output is too long to show here)
for link in links:
    print(link.text)
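Because the full output is long, it can help to keep only the links of interest and print a limited number of them. The sketch below keeps links whose href starts with "/wiki/", which is how Wikipedia article links are written, and pairs each with its visible text. The sample HTML and the limit of 10 are assumptions made for illustration.

```python
from bs4 import BeautifulSoup

# Hand-written sample in the style of Wikipedia links, for illustration only
html = """
<a href="/wiki/Armonk,_New_York">Armonk, New York</a>
<a href="#cite_note-1">[1]</a>
<a href="/wiki/Thomas_J._Watson">  Thomas J. Watson </a>
<a>No href here</a>
<a href="https://www.ibm.com">ibm.com</a>
"""

soup = BeautifulSoup(html, "html.parser")

article_links = []
for link in soup.find_all("a", href=True):
    href = link["href"]
    # get_text(strip=True) removes leading and trailing whitespace
    text = link.get_text(strip=True)
    if href.startswith("/wiki/") and text:
        article_links.append((text, href))

# Print only the first 10 matches to keep the output short
for text, href in article_links[:10]:
    print(f"{text} -> {href}")
```

On the sample above this keeps the two article links and drops the citation, the anchor without an href, and the external link.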