
Scraping the Web with Selenium and Python: A Step-By-Step Tutorial

Since the late 2000s, web scraping has become essential for extracting public data, giving a competitive edge to those who use it. A common challenge is scraping pages with delayed data loading due to dynamic content, which traditional tools often struggle with. Fortunately, Selenium Python web scraping can effectively handle this issue. In this blog post, you'll learn how to scrape dynamic web data with delayed JavaScript rendering using Python and the Selenium library, with a complete code example and a video tutorial available at the end.

Dominykas Niaura

Nov 09, 2023

10 min read


Preparing Selenium Python

First things first, let’s prepare our Selenium Python web scraping environment using the virtualenv package. Python 3.3 and above ships with the similar built-in venv module, but virtualenv itself can be installed with pip install virtualenv.

  1. Download the full project from our GitHub.
  2. Open the Terminal or command-line interface based on your operating system.
  3. Navigate to the directory where you downloaded the project to create the virtual environment. You can use the command cd path/to/directory to get there quickly.
  4. Create the virtual environment with the command virtualenv myenv (or python -m venv myenv), then activate it. On macOS and Linux: source myenv/bin/activate. On Windows (in Command Prompt or PowerShell): .\myenv\Scripts\activate.

Now, you’ll be working within the virtual environment, and any Python packages you install will be local to that environment. So, let’s talk about the packages we’ll need for this Selenium Python web scraping method:

  • Webdriver-manager is a utility tool that streamlines the process of setting up and managing different web drivers for browser automation.
  • Selenium is a powerful tool for controlling a web browser through code, facilitating automated testing and web scraping.
  • Bs4, also known as BeautifulSoup, is a parsing library that makes it easy to extract information from the scraped HTML and XML of web pages.

You can install the packages using these commands in your terminal:

pip install webdriver-manager
pip install selenium
pip install beautifulsoup4

Once we have the packages installed, the first thing to do is to import everything into the script file:

from webdriver_manager.chrome import ChromeDriverManager
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from extension import proxies
from bs4 import BeautifulSoup
import json

Setting up residential proxies

The next step involves integrating proxies into our code. Without proxies, target websites might detect your Selenium Python web scraping efforts and halt any program attempting data collection. To efficiently gather public data, your scraping project must seem like a regular internet user.

A residential proxy is an intermediary that provides the user with an IP address allocated by an Internet Service Provider (ISP). Maintaining a low profile when web scraping is essential, so residential proxies are the perfect choice. They provide a high level of anonymity and are unlikely to be blocked by websites.

We at Smartproxy offer industry-leading residential proxies with a vast 55M+ IP pool across 195+ locations, the fastest response time in the market (<0.5s), a 99.68% success rate, and an excellent entry-point via the Pay As You Go payment option.

Once you get yourself a proxy plan and set up your user, insert the proxy credentials into the code:

username = 'your_username'
password = 'your_password'
endpoint = 'proxy_endpoint'
port = 'proxy_port'

Replace your_username, your_password, proxy_endpoint, and proxy_port with your actual proxy username, password, endpoint, and port, respectively.

WebDriver page properties

Time to truly unleash the power of Selenium Python web scraping. The first line creates a set of Chrome options that define how the browser should behave. The first thing we add to the options is the proxy: we load an extension (built by the extension.py file) with our credentials, ensuring your scraping activity remains anonymous and uninterrupted. Note that you don’t have to enter your proxy information here; it’s already been defined above.

Then, we add one more Chrome option to activate headless mode instead of the visible browser mode. The last line spawns a web driver over the Chrome instance with these options, which installs the matching ChromeDriver via webdriver-manager and loads the proxy extension.

chrome_options = webdriver.ChromeOptions()
proxies_extension = proxies(username, password, endpoint, port)
chrome_options.add_extension(proxies_extension)
chrome_options.add_argument("--headless=new")
chrome = webdriver.Chrome(service=Service(ChromeDriverManager().install()), options=chrome_options)
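The proxies helper imported from extension.py ships with our GitHub project, so you don’t need to write it yourself. Still, it helps to know roughly what it does. Below is a minimal, illustrative sketch of such a helper, assuming the common approach of packaging a small Chrome extension (a manifest plus a background script) that points Chrome at the proxy server and answers its authentication prompt with your credentials; the contents of the real file may differ:

# extension.py (illustrative sketch, not the verbatim project file)
import zipfile

def proxies(username, password, endpoint, port):
    # Manifest for a small extension that requests proxy and webRequest permissions
    manifest_json = """
    {
        "version": "1.0.0",
        "manifest_version": 2,
        "name": "Proxy Auth Extension",
        "permissions": [
            "proxy", "tabs", "unlimitedStorage", "storage",
            "<all_urls>", "webRequest", "webRequestBlocking"
        ],
        "background": {"scripts": ["background.js"]},
        "minimum_chrome_version": "22.0.0"
    }
    """

    # Background script: route traffic through the proxy and supply credentials
    background_js = """
    var config = {
        mode: "fixed_servers",
        rules: {
            singleProxy: {scheme: "http", host: "%s", port: parseInt(%s)},
            bypassList: ["localhost"]
        }
    };
    chrome.proxy.settings.set({value: config, scope: "regular"}, function() {});

    chrome.webRequest.onAuthRequired.addListener(
        function(details) {
            return {authCredentials: {username: "%s", password: "%s"}};
        },
        {urls: ["<all_urls>"]},
        ["blocking"]
    );
    """ % (endpoint, port, username, password)

    # Pack both files into a .zip archive that Chrome can load as an extension
    extension_file = "proxies_extension.zip"
    with zipfile.ZipFile(extension_file, "w") as zp:
        zp.writestr("manifest.json", manifest_json)
        zp.writestr("background.js", background_js)

    return extension_file

Because add_extension() expects a path to a packed extension, the helper returns the path of the .zip archive it just wrote.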

Targeting and delaying

When performing Selenium Python web scraping, precision is of the utmost importance. A good rule of thumb is to define what you’re targeting on the page that you intend to scrape. Dynamic content sometimes means that you’ll have to adapt creatively.

In our example, we’ve selected the URL of a website dedicated to showcasing quotes from famous people. It’s a purposefully slow-loading page, so the script will raise a timeout error if we don’t give the web driver enough time before scraping. Therefore, we set a maximum wait of 30 seconds and target only the elements with the quote class name.

url = "https://quotes.toscrape.com/js-delayed/"
chrome.get(url)
wait = WebDriverWait(chrome, 30)
quote_elements = wait.until(EC.presence_of_all_elements_located((By.CLASS_NAME, "quote")))
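One optional detail, not part of the original script: if the quotes never appear within those 30 seconds (for instance, due to a network or proxy hiccup), wait.until() raises a TimeoutException. A small sketch of how you could catch it and close the browser cleanly:

from selenium.common.exceptions import TimeoutException

try:
    quote_elements = wait.until(EC.presence_of_all_elements_located((By.CLASS_NAME, "quote")))
except TimeoutException:
    # The page didn't render the quotes in time, so shut down the browser and re-raise
    print("The quotes didn't load within 30 seconds - check your connection or proxy settings.")
    chrome.quit()
    raise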

Extracting the HTML element

Once we get all the elements from the page, we can create a simple loop that iterates through them and extracts the necessary data from the HTML code. For this, we’re using the BeautifulSoup library.

We extract the quote by targeting the span element with the class text and reading its text content. We do the same for the author (a small element with the class author) and the tags (a elements with the class tag). Then, we create a dictionary and format it based on our preferences.

quote_data = []
for quote_element in quote_elements:
    soup = BeautifulSoup(quote_element.get_attribute("outerHTML"), 'html.parser')
    quote_text = soup.find('span', class_='text').text
    author = soup.find('small', class_='author').text
    tags = [tag.text for tag in soup.find_all('a', class_='tag')]
    quote_info = {
        "Quote": quote_text,
        "Author": author,
        "Tags": tags
    }
    quote_data.append(quote_info)

with open('quote_info.json', 'w') as json_file:
    json.dump(quote_data, json_file, indent=4)

chrome.quit()

Run the code by executing the following command in your terminal:

python quotes.py

The scraped data will be saved to a quote_info.json file in your project directory. The benefit of storing the results as JSON is that the data stays well-organized and easy to interpret.
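If you’d like to sanity-check the output without opening the file, a few extra lines like these (not part of the project code) will load quote_info.json back and print the first entry:

import json

with open('quote_info.json') as json_file:
    quotes = json.load(json_file)

# Print the first scraped quote and its author
print(f"{quotes[0]['Quote']} - {quotes[0]['Author']}")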

Prefer browser mode?

For those who like a visual representation of the Selenium Python web scraping process, you can switch the headless mode off. In that case, you’ll witness a Chrome instance being launched, offering a real-time view of the scraping. It’s a matter of personal preference, but it’s always good to have an option for checking if it works or at which point the errors strike.

If you go to the WebDriver page properties step, simply put a # symbol before the line that mentions headless to comment it out and make it inactive:

# chrome_options.add_argument("--headless=new")

The full Selenium Python web scraping code and video

Let's recap. The project is downloadable from our GitHub, and the full code is as follows:

from webdriver_manager.chrome import ChromeDriverManager
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from extension import proxies
from bs4 import BeautifulSoup
import json

# Credentials and Proxy Details
username = 'your_username'
password = 'your_password'
endpoint = 'proxy_endpoint'
port = 'proxy_port'

# Set up Chrome WebDriver
chrome_options = webdriver.ChromeOptions()
proxies_extension = proxies(username, password, endpoint, port)
chrome_options.add_extension(proxies_extension)

# Comment the next line to disable headless mode
chrome_options.add_argument("--headless=new")

chrome = webdriver.Chrome(service=Service(ChromeDriverManager().install()), options=chrome_options)

# Open the desired webpage
url = "https://quotes.toscrape.com/js-delayed/"
chrome.get(url)

# Wait for the "quote" divs to load
wait = WebDriverWait(chrome, 30)
quote_elements = wait.until(EC.presence_of_all_elements_located((By.CLASS_NAME, "quote")))

# Extract the HTML of all "quote" elements, parse them with BS4 and save to JSON
quote_data = []
for quote_element in quote_elements:
    soup = BeautifulSoup(quote_element.get_attribute("outerHTML"), 'html.parser')
    quote_text = soup.find('span', class_='text').text
    author = soup.find('small', class_='author').text
    tags = [tag.text for tag in soup.find_all('a', class_='tag')]
    quote_info = {
        "Quote": quote_text,
        "Author": author,
        "Tags": tags
    }
    quote_data.append(quote_info)

# Save data to JSON file
with open('quote_info.json', 'w') as json_file:
    json.dump(quote_data, json_file, indent=4)

# Close the WebDriver
chrome.quit()

To wrap up

We hope our walkthrough has taught you how to be mindful of your target and how to extract the desired data from dynamically rendering pages successfully. The Selenium Python web scraping technique in our vast digital universe of data is like equipping yourself with a Swiss army knife in the dense, unpredictable jungles of the Amazon.

Remember that our residential proxies’ added power will ensure a smooth, uninterrupted scraping journey. Whether you’re a beginner or someone with a bit more experience, combining these tools guarantees efficiency for any web scraping project.

About the author

Dominykas Niaura

Copywriter

As a fan of digital innovation and data intelligence, Dominykas delights in explaining our products’ benefits, demonstrating their use cases, and demystifying complex tech topics for everyday readers.




Frequently asked questions

What is web scraping?

Web scraping is a method to gather public data from websites. With a dedicated API (Application Programming Interface), you can automatically fetch web pages to retrieve the entire HTML code or specific data points.

At Smartproxy, we offer Social Media Scraping API for social media platforms, SERP Scraping API for search engine result pages, eCommerce Scraping API for online marketplaces, Web Scraping API for various other websites, and No-Code Scraper for codeless data collection.

But if you’re already set with a web scraping tool for your project, don’t forget to equip it with residential proxies for ultimate success.

