Selenium Scraping with Node.js

This is an article about Web Scraping with Selenium and Node.js for people interested in collecting public data from a high-value website to gain good sales leads or data for pricing analysis.

If you're driven by results, you know that Selenium is a great choice to pair with other tools for collecting information. This data can be used to conduct:

Price comparison

Brand protection

Ad verification

and lots of other crucial tasks, such as travel fare comparison, increasing your cyber security, and testing your links. Selenium is a browser automation tool that has the support of the largest browser vendors. Companies use Selenium to automate day-to-day tasks.

Using Selenium for scraping

Selenium was originally developed as a driver to test web applications, but it has since become a great tool for getting data from websites. Since it can automate a real browser, Selenium lets you avoid some honeypot traps that many scraping scripts run into on high-value websites.

Scraping is quite an intricate process, but it can also bring you results that will boost your sales, up your business intelligence game, and make you a competitor to any market leader.

From a more technical side, a great example of why people use Selenium for scraping is its wait functionality, which lets a script pause until delayed data has loaded. This is especially useful when a site uses lazy loading, Ajax, or endless scroll.
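To make the idea of waiting for delayed data concrete, here is a minimal sketch. It assumes a selenium-webdriver driver object and relies on its driver.wait() and driver.findElements() methods; the helper name waitForText and the '.lazy-item' selector in the usage comment are hypothetical and only serve as an illustration.

```javascript
// Sketch: wait until at least one element matching `locator` exists,
// then return its visible text. `driver` is assumed to be a
// selenium-webdriver WebDriver instance; `locator` is something like
// By.css('.lazy-item').
async function waitForText(driver, locator, timeoutMs = 10000) {
  // driver.wait() re-runs the condition until it returns a truthy value
  // or the timeout expires, in which case the promise rejects.
  const element = await driver.wait(async () => {
    const matches = await driver.findElements(locator);
    return matches.length > 0 ? matches[0] : false;
  }, timeoutMs);
  return element.getText();
}

// Usage (assumes selenium-webdriver is installed and `driver` is built):
//   const { By } = require('selenium-webdriver');
//   const text = await waitForText(driver, By.css('.lazy-item'), 5000);
```

Polling for the element instead of sleeping for a fixed time means the script continues as soon as the content appears, and fails with a clear timeout when it never does.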

Web scraping tools

If you want to access and gather public data at scale, you need the right tools to overcome common obstacles like IP blocking and cloaking. When a scraper rotates its IP address for each request, it becomes much harder to detect by IP alone. A fast proxy server can also help you run into cloaking, captchas, and other struggles less often.

You've probably heard about free Python scrapers like Scrapy and Beautiful Soup. The latter parses and extracts data from HTML files, while the former downloads, processes, and saves it. They both have their pros, but Scrapy covers more of the scraping workflow than Beautiful Soup. However, you don't have to choose – you can easily use both! Just import Beautiful Soup to parse the content you get through Scrapy.

One last note: Beautiful Soup is a lot more suited for someone who just started scraping, so if you're planning to learn by yourself (and BS does offer very useful documentation), then starting with this scraper might be the better choice.

Besides these scrapers, Scrapebox, Scrapy Proxy Middleware, Octoparse, Parsehub, and Apify are also quite popular in the scraping world. These tools work best with residential proxies, guaranteeing you smooth processes and reliable results.

First things first: read the ToS

First things first – there is a possibility that scraping a target site is illegal. Even if you cannot access the data you want through an API and see web scraping as the only solution to collect the data you need, you still have to consider your target site. Many scrapers ignore the target site’s request limits in the robots.txt file, but those limits are there for a reason. Many times you will be able to crash a site by sending too many requests, especially if you use a scraping proxy network like ours, which allows you to send unlimited requests at the same time through unique IP addresses.
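One simple way to stay within a site's request limits is to visit pages one at a time with a pause between them. The sketch below uses only plain Node.js; visitPolitely and the visit callback are hypothetical names, and the right delay is whatever the target site's limits call for.

```javascript
// Pause for `ms` milliseconds.
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Visit each URL one at a time, waiting `delayMs` between requests.
// `visit` is any async function you supply (hypothetical here), for
// example a wrapper around driver.get() plus your parsing code.
async function visitPolitely(urls, delayMs, visit) {
  const results = [];
  for (let i = 0; i < urls.length; i++) {
    if (i > 0) await sleep(delayMs); // no delay before the first request
    results.push(await visit(urls[i]));
  }
  return results;
}

// Usage (assumes `driver` is a selenium-webdriver instance):
//   await visitPolitely(urls, 2000, async (url) => {
//     await driver.get(url);
//     return driver.getTitle();
//   });
```

Running requests sequentially is slower than sending them in parallel, but it keeps the load on the target predictable.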

Crashing your data source is not only bad for your data collection, it’s also bad for legal reasons: your scrape might be seen as a DDoS attack, especially if you do the scrape through a datacenter proxy network. If this happens to a US company, you might face federal charges under the Computer Fraud and Abuse Act (CFAA).

Setting up Selenium with Node.js for scraping

Since you are looking to web scrape, you probably don't need information on how to install Selenium WebDriver or get Node.js for your device. To check whether you are ready to scrape after installing Selenium and Node.js, launch PowerShell, Terminal or any other command line prompt and use the command below, which prints the installed npm version:

npm -v

You will also need to download a browser driver such as ChromeDriver for Selenium to use. Using several browsers for a scrape makes it less detectable. Also, consider having a large list of random User Agents to keep the scrape under wraps, especially if you are ignoring our first tip to follow the target’s ToS.
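As a rough sketch of rotating User Agents, the helper below picks a random entry from a list and turns it into Chrome's --user-agent command line switch. The User Agent strings are illustrative examples only, and pickUserAgent and userAgentArg are hypothetical helper names, not part of Selenium.

```javascript
// Example strings only; keep your own list current.
const USER_AGENTS = [
  'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
  'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
  'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
];

// Pick one entry at random. `random` can be swapped out for testing.
function pickUserAgent(agents, random = Math.random) {
  if (agents.length === 0) throw new Error('user agent list is empty');
  return agents[Math.floor(random() * agents.length)];
}

// Build the Chrome command line switch that overrides the User Agent.
function userAgentArg(userAgent) {
  return `--user-agent=${userAgent}`;
}

// Usage (assumes selenium-webdriver is installed):
//   const { Builder } = require('selenium-webdriver');
//   const chrome = require('selenium-webdriver/chrome');
//   const options = new chrome.Options()
//     .addArguments(userAgentArg(pickUserAgent(USER_AGENTS)));
//   const driver = new Builder()
//     .forBrowser('chrome')
//     .setChromeOptions(options)
//     .build();
```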

Since we will use Selenium to access a web page and Node.js to parse the HTML file, we have to know what Selenium is capable of doing. It has many functions that let it navigate any website. For example, you can use the following for different click events:

action.click()
action.doubleClick()
action.contextClick()

Most of the time you will use only a few commands to navigate a page:

driver.get()
driver.navigate().back()
driver.navigate().forward()

Even though these examples are very simple and bare-bones, they will be enough for most scraping targets. To find out more about Selenium driver's possibilities, read the Selenium documentation.
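The commands above can be combined into a short routine. The following is a sketch that assumes a selenium-webdriver driver object; peekAtLinkedPage is a hypothetical helper name, and the locator you pass in depends entirely on your target page.

```javascript
// Sketch: open a page, click a link on it, read the title of the page
// that opens, then go back. `driver` is assumed to be a
// selenium-webdriver WebDriver instance.
async function peekAtLinkedPage(driver, url, linkLocator) {
  await driver.get(url);
  const link = await driver.findElement(linkLocator);
  // actions() builds an input sequence; perform() sends it to the browser.
  await driver.actions().click(link).perform();
  // On slow pages you may need to wait for the new page before this call.
  const title = await driver.getTitle();
  await driver.navigate().back();
  return title;
}

// Usage (assumes selenium-webdriver is installed and `driver` is built):
//   const { By } = require('selenium-webdriver');
//   const title = await peekAtLinkedPage(
//     driver, 'https://www.smartproxy.com/', By.css('a'));
```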

Now, to start Selenium and access a website, you can use code like this:

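// Load the Selenium WebDriver bindings and the By locator helper.
var webdriver = require('selenium-webdriver'),
  By = webdriver.By;
// Build a driver that controls Chrome.
var driver = new webdriver.Builder()
  .forBrowser('chrome')
  .build();

// Open the target page.
driver.get("https://www.smartproxy.com/");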

This will launch Selenium, which will use ChromeDriver to load the website you specify. You can also command Selenium to navigate the site, enter text into fields, click buttons, or perform any other actions needed to reach the page with the data. Once you can automatically navigate to the page you need, it's time to parse that data! Node.js is what lets you parse the HTML document and extract the data you need. As most scraped data is textual, scrapers use the getText() command to get any text from an element on the page.

Parsing data with Node.js

Now, we will give you a couple of ways you could scrape a web page element, but you need to combine these methods for a particular site, as each one is different and has its own structure.

To parse elements in an HTML file, you can use the findElement() or findElements() commands. You can find an element or a set of elements by id, class, name, tag name or absolute/relative XPath with Node.js.

Using getText() with Node.js to scrape text data

Since you are looking to scrape a page, you must know how to check its structure. Use any browser's developer tools to inspect an element you want to scrape, then use any method (XPath or other) to make Node.js access it and get the information you need. We'll wrap up this article with a couple of examples of how to scrape a simple web element with Node.js.

EXAMPLE 1 – scraping web page elements by their id name.

This example works for scraping data from sites that use ID names, for example:

<h2 id="employee-name">Jane Doe</h2>

As this website uses 'employee-name' ID for employee names, we can scrape it with the command:

driver.findElement(By.id('employee-name')).then(function(element){
    element.getText().then(function(text){
        console.log(text);
    });
});

This command will print the text of the first element with the ID 'employee-name' to the command prompt. Now, if there are multiple items with the same ID name and you want to scrape them all, you'll need to use this command:

driver.findElements(By.id('employee-name')).then(function(elements){
  for (var i = 0; i < elements.length; i++){
    elements[i].getText().then(function(text){
      console.log(text);
    });
  }
});

EXAMPLE 2 - scraping web page elements by xpath.

This example works for scraping data from sites that use class names instead of IDs, for example:

<figure class="box">
    <div class="name">
        Jane Doe
    </div>
</figure>

To get the employee name with a relative xpath, we can scrape it with the command:

driver.findElement(By.xpath('//figure[@class="box"]/div[@class="name"]')).then(function(element){
    element.getText().then(function(text){
        console.log(text);
    });
});

This command will print the text of the first div with class 'name' inside a figure with class 'box'. If there are multiple items that match the same XPath and you want to scrape them all, you'll need to use this command:

driver.findElements(By.xpath('//figure[@class="box"]/div[@class="name"]')).then(function(elements){
  for (var i = 0; i < elements.length; i++){
    elements[i].getText().then(function(text){
      console.log(text);
    });
  }
});
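The callback-based loops in these examples can also be written with async/await, which returns the scraped text as an array instead of only printing it. This is a sketch assuming a selenium-webdriver driver object; collectTexts is a hypothetical helper name.

```javascript
// Sketch: return the visible text of every element matching `locator`.
// `driver` is assumed to be a selenium-webdriver WebDriver instance.
async function collectTexts(driver, locator) {
  const elements = await driver.findElements(locator);
  // getText() returns a promise per element; wait for all of them.
  return Promise.all(elements.map((element) => element.getText()));
}

// Usage (assumes selenium-webdriver is installed and `driver` is built):
//   const { By } = require('selenium-webdriver');
//   const names = await collectTexts(
//     driver, By.xpath('//figure[@class="box"]/div[@class="name"]'));
//   console.log(names);
```

Returning an array keeps the results in page order and makes it easy to save them to a file or database later.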

Setting up a Selenium proxy for scraping

Selenium is very good for scraping because it can use a proxy. You can set a proxy up for Selenium with our Selenium proxy middleware on GitHub.
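As an alternative sketch that does not use the middleware mentioned above, selenium-webdriver ships a proxy module whose manual() function can be passed to the Builder's setProxy() method. The helper manualProxyConfig and the address proxy.example.com:8080 are hypothetical placeholders; substitute your own proxy endpoint.

```javascript
// Build the settings object for a manual HTTP/HTTPS proxy.
function manualProxyConfig(host, port) {
  if (!Number.isInteger(port) || port < 1 || port > 65535) {
    throw new Error('port must be an integer between 1 and 65535');
  }
  const address = `${host}:${port}`;
  return { http: address, https: address };
}

// Usage (assumes selenium-webdriver is installed; the address is a
// placeholder):
//   const { Builder } = require('selenium-webdriver');
//   const proxy = require('selenium-webdriver/proxy');
//   const driver = new Builder()
//     .forBrowser('chrome')
//     .setProxy(proxy.manual(manualProxyConfig('proxy.example.com', 8080)))
//     .build();
```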

There's so much more to learn about scraping than we can ever write in a single article. To find out more about web scraping, read our other blog posts:

What is web scraping

Selenium proxies

Need more real-life examples? Web scraping with Selenium, Python, and proxies turned out to be a winning mix for an e-commerce website. This startup managed to double its revenue by scraping Amazon and finding the best deals on offer. The company used scraping for price comparison, and the story we covered also gives you the code to DIY.

You can always LiveChat about using proxies for scraping.