pip install bs4
pip install html5lib
Scrape with BeautifulSoup
Scrape IBM Page
- Send a GET request to the IBM home page
- Create a BeautifulSoup object using the BeautifulSoup constructor
- Scrape all the links on the page
- Scrape all the images on the page
GET Request/Soup
from bs4 import BeautifulSoup
import requests
url = "http://www.ibm.com"
# Send a GET request and retrieve the response body as text
data = requests.get(url).text
# Use the BeautifulSoup constructor to create a BeautifulSoup object
soup = BeautifulSoup(data, "html5lib")
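The request above assumes the download succeeds. A slightly more defensive version might look like the sketch below. The helper name fetch_soup, the 10-second timeout, and the default parser are illustrative assumptions, not part of the original notes.

```python
import requests
from bs4 import BeautifulSoup


def fetch_soup(url, parser="html.parser", timeout=10):
    """Download a page and parse it, raising an exception on HTTP errors.

    fetch_soup is a hypothetical helper introduced for illustration.
    """
    # A timeout keeps the call from hanging on an unresponsive server.
    response = requests.get(url, timeout=timeout)
    # Raise requests.HTTPError for 4xx/5xx responses instead of parsing an error page.
    response.raise_for_status()
    return BeautifulSoup(response.text, parser)


# Example call (requires network access):
# soup = fetch_soup("http://www.ibm.com", parser="html5lib")
```

Passing parser="html5lib" reproduces the parser used in these notes; "html.parser" ships with Python and needs no extra install.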
Scrape Links
- Calling find_all(href=True) without 'a' matches every tag that has an href attribute, which pulls in unwanted data
- So pass 'a' as the first argument to restrict the search to links
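To make the difference concrete, here is a small sketch on a hand-written HTML snippet. The markup and the example.com URLs are made up for illustration and are not taken from the IBM page.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, not taken from the IBM page.
html = """
<html>
  <head><link rel="stylesheet" href="style.css"></head>
  <body>
    <a href="https://example.com/one">One</a>
    <a name="anchor-without-href">Two</a>
  </body>
</html>
"""

soup = BeautifulSoup(html, "html.parser")

# Every tag with an href attribute, including the <link> stylesheet tag.
any_href = [tag.name for tag in soup.find_all(href=True)]

# Only <a> tags that carry an href attribute.
anchor_hrefs = [a["href"] for a in soup.find_all("a", href=True)]

print(any_href)
print(anchor_hrefs)
```

The first list includes the stylesheet's link tag, while the second contains only the one real hyperlink. The anchor without an href is skipped in both cases.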
links = soup.find_all('a', href=True)
links
[<a cta-type="local" data-dynamic-properties='{"ctaUrl":"href"}' data-link-type="local" href="https://www.ibm.com/granite?lnk=dev" icon-placement="right" target="_self">
<span class="bx--link-text" data-dynamic-inner-content="ctaLabel" data-link-text="">Start building with IBM Granite 3.0 models</span>
<!-- LTR - Left to Right version -->
<svg aria-hidden="true" fill="currentColor" focusable="false" height="20" slot="icon" viewBox="0 0 20 20" width="20" xmlns="http://www.w3.org/2000/svg">
<path d="M11.8 2.8L10.8 3.8 16.2 9.3 1 9.3 1 10.7 16.2 10.7 10.8 16.2 11.8 17.2 19 10z"></path>
</svg>
<!-- RTL - Right to Left version -->
</a>, <a cta-type="local" data-dynamic-properties='{"ctaUrl":"href"}' data-link-type="local" href="https://developer.ibm.com/technologies/artificial-intelligence?lnk=dev" icon-placement="right" target="_self">
<span class="bx--link-text" data-dynamic-inner-content="ctaLabel" data-link-text="">Explore AI courses, APIs, data sets and more</span>
<!-- LTR - Left to Right version -->
<svg aria-hidden="true" fill="currentColor" focusable="false" height="20" slot="icon" viewBox="0 0 20 20" width="20" xmlns="http://www.w3.org/2000/svg">
<path d="M11.8 2.8L10.8 3.8 16.2 9.3 1 9.3 1 10.7 16.2 10.7 10.8 16.2 11.8 17.2 19 10z"></path>
</svg>
<!-- RTL - Right to Left version -->
</a>, <a cta-type="local" data-dynamic-properties='{"ctaUrl":"href"}' data-link-type="local" href="https://www.ibm.com/products/watsonx-code-assistant?lnk=dev" icon-placement="right" target="_self">
<span class="bx--link-text" data-dynamic-inner-content="ctaLabel" data-link-text="">Accelerate software development with watsonx Code Assistant</span>
<!-- LTR - Left to Right version -->
<svg aria-hidden="true" fill="currentColor" focusable="false" height="20" slot="icon" viewBox="0 0 20 20" width="20" xmlns="http://www.w3.org/2000/svg">
<path d="M11.8 2.8L10.8 3.8 16.2 9.3 1 9.3 1 10.7 16.2 10.7 10.8 16.2 11.8 17.2 19 10z"></path>
</svg>
<!-- RTL - Right to Left version -->
</a>, <a cta-type="local" data-dynamic-properties='{"ctaUrl":"href"}' data-link-type="local" href="https://www.ibm.com/watsonx/developer/?lnk=dev" icon-placement="right" target="_self">
<span class="bx--link-text" data-dynamic-inner-content="ctaLabel" data-link-text="">Check out the watsonx.ai Developer Toolkit</span>
<!-- LTR - Left to Right version -->
<svg aria-hidden="true" fill="currentColor" focusable="false" height="20" slot="icon" viewBox="0 0 20 20" width="20" xmlns="http://www.w3.org/2000/svg">
<path d="M11.8 2.8L10.8 3.8 16.2 9.3 1 9.3 1 10.7 16.2 10.7 10.8 16.2 11.8 17.2 19 10z"></path>
</svg>
<!-- RTL - Right to Left version -->
</a>, <a cta-type="local" data-dynamic-properties='{"ctaUrl":"href"}' data-link-type="local" href="https://www.ibm.com/thought-leadership/institute-business-value/report/ceo-generative-ai?lnk=bus" icon-placement="right" target="_self">
<span class="bx--link-text" data-dynamic-inner-content="ctaLabel" data-link-text="">Read the CEO's guide to generative AI</span>
<!-- LTR - Left to Right version -->
<svg aria-hidden="true" fill="currentColor" focusable="false" height="20" slot="icon" viewBox="0 0 20 20" width="20" xmlns="http://www.w3.org/2000/svg">
<path d="M11.8 2.8L10.8 3.8 16.2 9.3 1 9.3 1 10.7 16.2 10.7 10.8 16.2 11.8 17.2 19 10z"></path>
</svg>
<!-- RTL - Right to Left version -->
</a>, <a cta-type="local" data-dynamic-properties='{"ctaUrl":"href"}' data-link-type="local" href="https://www.ibm.com/think/videos/ai-academy" icon-placement="right" target="_self">
<span class="bx--link-text" data-dynamic-inner-content="ctaLabel" data-link-text="">Explore an AI curriculum designed for business leaders</span>
<!-- LTR - Left to Right version -->
<svg aria-hidden="true" fill="currentColor" focusable="false" height="20" slot="icon" viewBox="0 0 20 20" width="20" xmlns="http://www.w3.org/2000/svg">
<path d="M11.8 2.8L10.8 3.8 16.2 9.3 1 9.3 1 10.7 16.2 10.7 10.8 16.2 11.8 17.2 19 10z"></path>
</svg>
<!-- RTL - Right to Left version -->
</a>, <a cta-type="local" data-dynamic-properties='{"ctaUrl":"href"}' data-link-type="local" href="https://www.ibm.com/products/watsonx-orchestrate/ai-agent-for-hr?lnk=bus" icon-placement="right" target="_self">
<span class="bx--link-text" data-dynamic-inner-content="ctaLabel" data-link-text="">Deploy an AI agent for HR with watsonx Orchestrate</span>
<!-- LTR - Left to Right version -->
<svg aria-hidden="true" fill="currentColor" focusable="false" height="20" slot="icon" viewBox="0 0 20 20" width="20" xmlns="http://www.w3.org/2000/svg">
<path d="M11.8 2.8L10.8 3.8 16.2 9.3 1 9.3 1 10.7 16.2 10.7 10.8 16.2 11.8 17.2 19 10z"></path>
</svg>
<!-- RTL - Right to Left version -->
</a>, <a cta-type="local" data-dynamic-properties='{"ctaUrl":"href"}' data-link-type="local" href="https://www.ibm.com/products/guardium-data-security-center?lnk=bus" icon-placement="right" target="_self">
<span class="bx--link-text" data-dynamic-inner-content="ctaLabel" data-link-text="">Protect your data with IBM Guardium Data Security Center</span>
<!-- LTR - Left to Right version -->
<svg aria-hidden="true" fill="currentColor" focusable="false" height="20" slot="icon" viewBox="0 0 20 20" width="20" xmlns="http://www.w3.org/2000/svg">
<path d="M11.8 2.8L10.8 3.8 16.2 9.3 1 9.3 1 10.7 16.2 10.7 10.8 16.2 11.8 17.2 19 10z"></path>
</svg>
<!-- RTL - Right to Left version -->
</a>, <a href="https://www.ibm.com/artificial-intelligence?lnk=ProdC">next-generation AI</a>, <a href="https://www.ibm.com/hybrid-cloud?lnk=ProdC">hybrid cloud solutions</a>, <a href="https://www.ibm.com/consulting?lnk=ProdC">IBM Consulting</a>]
for link in links:
    print(link.get('href'))
https://www.ibm.com/granite?lnk=dev
https://developer.ibm.com/technologies/artificial-intelligence?lnk=dev
https://www.ibm.com/products/watsonx-code-assistant?lnk=dev
https://www.ibm.com/watsonx/developer/?lnk=dev
https://www.ibm.com/thought-leadership/institute-business-value/report/ceo-generative-ai?lnk=bus
https://www.ibm.com/think/videos/ai-academy
https://www.ibm.com/products/watsonx-orchestrate/ai-agent-for-hr?lnk=bus
https://www.ibm.com/products/guardium-data-security-center?lnk=bus
https://www.ibm.com/artificial-intelligence?lnk=ProdC
https://www.ibm.com/hybrid-cloud?lnk=ProdC
https://www.ibm.com/consulting?lnk=ProdC
Scrape Images
- As the results below show, no <img> tags are found
soup.find_all('img')
[]
# This will print out the entire object
soup
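# In HTML, an image is represented by the <img> tag
for link in soup.find_all('img'):
    print(link)
    print(link.get('src'))
# No <img> tags appear in the result, which is surprising. One possible explanation is that
# the page adds its images with JavaScript, which requests does not execute.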
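One way to investigate an empty result is to check which tags the downloaded HTML does contain, and whether image URLs are stored somewhere other than an img tag. The sketch below runs on made-up markup; the attribute names data-src and srcset are common conventions and only an assumption about where such URLs might live, not a finding about the IBM page.

```python
from collections import Counter

from bs4 import BeautifulSoup

# Hypothetical markup imitating a script-driven page; not the IBM page.
html = """
<html><body>
  <div class="hero" data-src="https://example.com/hero.jpg"></div>
  <picture><source srcset="https://example.com/a.webp"></picture>
  <script src="https://example.com/app.js"></script>
  <noscript>Enable JavaScript to view images.</noscript>
</body></html>
"""
soup = BeautifulSoup(html, "html.parser")

# 1. Confirm that the downloaded HTML really has no <img> tags.
img_count = len(soup.find_all("img"))
print(img_count)

# 2. Count the tags that are present to see what was actually downloaded.
tag_counts = Counter(tag.name for tag in soup.find_all(True))
print(tag_counts.most_common())

# 3. Collect image URLs stored in other attributes (assumed attribute names).
candidates = []
for tag in soup.find_all(True):
    for attr in ("data-src", "srcset"):
        if tag.has_attr(attr):
            candidates.append(tag[attr])
print(candidates)
```

If the tag counts show mostly script tags and few content tags, the images are probably inserted in the browser. A plain requests call will not see them in that case.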
Scrape IBM Wikipedia
In this example we'll scrape the IBM Wikipedia page and extract all the links from it.
# Specify the URL of the webpage you want to scrape
url = 'https://en.wikipedia.org/wiki/IBM'
# Send an HTTP GET request to the webpage
response = requests.get(url)
# Store the HTML content in a variable
html_content = response.text
# Create a BeautifulSoup object to parse the HTML
soup = BeautifulSoup(html_content, 'html.parser')
# Display a snippet of the HTML content
print(html_content[:500])
<!DOCTYPE html>
<html class="client-nojs vector-feature-language-in-header-enabled vector-feature-language-in-main-page-header-disabled vector-feature-sticky-header-disabled vector-feature-page-tools-pinned-disabled vector-feature-toc-pinned-clientpref-1 vector-feature-main-menu-pinned-disabled vector-feature-limited-width-clientpref-1 vector-feature-limited-width-content-enabled vector-feature-custom-font-size-clientpref-1 vector-feature-appearance-pinned-clientpref-1 vector-feature-night-mode-
Extract Links
- As noted above, links are marked with the 'a' tag
- To extract all the links from the IBM Wikipedia page we just call find_all('a') on the BeautifulSoup object
# Find all <a> tags (anchor tags) in the HTML; the result is a list
links = soup.find_all('a')
# Iterate through the list of links and print their text (output too long to show)
for link in links:
    print(link.text)
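Wikipedia pages often use relative hrefs and in-page anchors, so the link text alone is not always enough. The sketch below pairs each link's text with an absolute URL on a hand-written fragment. The fragment and the decision to skip "#" anchors are illustrative assumptions.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

base_url = "https://en.wikipedia.org/wiki/IBM"

# Hypothetical fragment in the style of a Wikipedia page.
html = """
<p>
  <a href="/wiki/Armonk,_New_York">Armonk</a>
  <a href="#History">History</a>
  <a href="https://www.ibm.com">Official site</a>
  <a>No target</a>
</p>
"""
soup = BeautifulSoup(html, "html.parser")

rows = []
for a in soup.find_all("a", href=True):
    href = a["href"]
    if href.startswith("#"):
        continue  # skip anchors that point within the same page
    # urljoin resolves relative paths against the page URL and leaves
    # absolute URLs unchanged.
    rows.append((a.get_text(strip=True), urljoin(base_url, href)))

for text, link_url in rows:
    print(f"{text}: {link_url}")
```

On the real page, the same loop can run over the soup built from html_content, with base_url set to the page's URL.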