Updated on 20/6/2023



The article covers scraping methods, recommends official APIs when they exist, and introduces tools like Jsoup and Selenium WebDriver. Through a financial data scraping demo, it shows why understanding your scraping strategy matters. Find the full code on GitHub.

#FinTech #Finance #API #Performance
Alex-Adrien Auger
Engineering Manager

Please note that this article will guide you through the process of extracting data from a website. Always make sure your usage respects the law and the Terms of Service of the targeted websites. Often, doing so requires a written agreement from the website owner. If an official API meets your needs, always use it instead.

Definition of web scraping

Web scraping: extracting data from a website to use it in a different context.

Also called web harvesting or web data extraction.

The web is full of resources, but the data is not always available through an API. When a website owner cannot develop an API that fits your needs but agrees to let you fetch the data in any way you see fit, web scraping is one way to extract the data you need.

Two types of scraping

Web pages are written in three main languages: HTML (HyperText Markup Language), CSS (Cascading Style Sheets), and JS (JavaScript). When your web browser accesses a URL, it downloads the page behind the requested URL. Each language has its own responsibility, detailed as follows:

[Figure: two types of scraping]

The two types of scraping don’t support the same languages.

Passive scraping

Downloading the source code of a page and parsing it.

This is called passive scraping because you cannot interact with the DOM. The main advantage of this method is speed. Scraping is a slow way to get data from the internet, and passive scraping reduces the number of operations to the strict minimum.

But performing the smallest number of operations also reduces your ability to interact with the web page, which can be restrictive.

Active scraping

Taking control of a real web browser and using it to find data on a web page.

To display a page, your web browser will perform a set of operations. Sometimes the data you want is displayed after the JavaScript is executed, which means that you need to do every operation a web browser will normally do to be able to get the data you want. In this case, you need active scraping.

Tip: Some websites detect that you are harvesting their data through passive scraping. To protect their data, they can redirect you to another page that does not contain the data you want. When that happens, you need to use active scraping to behave like a normal user of the website. (For more information about preventing web scraping: https://github.com/JonasCz/How-To-Prevent-Scraping)

This is the most elaborate way of scraping. From the web server's point of view, your computer looks like any other visitor of the page. But this ability has a downside: active scraping is generally very slow because of all the operations the computer has to perform before it is ready to parse the web page.

[Figure: active scraping]

What do we need?

  • Jsoup (maven) Jsoup is a Java library for parsing HTML. It provides the tools you need to parse a page and extract the information you want. Jsoup can parse HTML from a string, from a file, or from a URL, which means we can scrape a website by providing the URL of the page where the data is. Jsoup only reads the HTML: it cannot interact with the page the way you can with your web browser. As explained above, this is the passive way of scraping, and Jsoup is a useful tool for it.
  • Selenium WebDriver (maven) Selenium WebDriver is a web browser automation framework with Java bindings. It allows you to “take control” of a web browser. Unlike Jsoup, Selenium WebDriver will not only download the page for you but also execute it just as a web browser would, and then let you decide programmatically what you want to do.
  • Jackson (maven) Jackson is a Java library that converts Java objects to and from JSON. We will use it to store the scraped data as JSON.

Concrete example of scraping financial data

A GitHub repo is available for this example at https://github.com/alexadrien/scrapingarticle

Description of the page

As an example, we are going to scrape financial data from a page that looks like this:

[Figure: sector performance page]

This page contains the financial performance per sector. The data you can see on the right side of the page is the data we are looking for.

If you inspect the page, you will see that it performs an HTTP GET request to get the data behind that list. In a real project, performing that same request would be the simplest way to get the data. But data is not always available through an API, so for the purpose of this example we will not use it.

This page is interesting to parse because the website implemented a security check to prevent people from harvesting its data at high frequency: if it detects that your IP address requests the HTML too often, it redirects you to another page. We are going to use both passive and active scraping for the sake of the example.

Scraping strategy

Let’s first try to understand which scraping method we should use. The first step of every scraping strategy is to identify whether the data is rendered by executing JavaScript or is already embedded in the HTML response. To check, open the page and disable JavaScript for it.

[Figure: the page with JavaScript disabled]

As you can see, when the page is refreshed after the JavaScript is disabled, the donut chart does not load. The data on the right is still there, and if you click on the sector names, you can see that the data still loads with the page. This means that passive scraping is a good candidate.

[Figure: scraping strategy decision tree, first test]

Now there is a second test to perform in order to choose the correct scraping strategy, and it is about security. Most websites will not allow you to scrape them at high frequency. Loading a web page multiple times per second for several seconds is very suspicious, and if you do so, the web server will redirect you to a page where you need to prove you are human.
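Whichever method you pick, spacing out page loads makes a scraper look less like a flood of automated requests. Here is a minimal sketch using only the JDK. `RequestPacer` and the interval you pass to it are assumptions introduced for illustration, not part of the example project, and a pause is no guarantee that a website will not redirect you.

```java
/** Hypothetical helper: keeps two consecutive requests at least a given interval apart. */
final class RequestPacer {
    private RequestPacer() {}

    /** Returns how many milliseconds are still missing before the minimum interval has elapsed. */
    static long millisToWait(long lastRequestMillis, long nowMillis, long minIntervalMillis) {
        long elapsed = nowMillis - lastRequestMillis;
        return Math.max(0L, minIntervalMillis - elapsed);
    }

    /** Sleeps for the remaining time, if any, before the next request is allowed. */
    static void pause(long lastRequestMillis, long minIntervalMillis) throws InterruptedException {
        long wait = millisToWait(lastRequestMillis, System.currentTimeMillis(), minIntervalMillis);
        if (wait > 0) {
            Thread.sleep(wait);
        }
    }
}
```

You would record `System.currentTimeMillis()` right after each page load and call `RequestPacer.pause(...)` before the next one.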

A complete decision tree would be:

[Figure: complete scraping strategy decision tree]
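The two tests can also be summarized in code. This is a sketch of the decision tree as it is described in this section, not a transcription of the figure. `ScrapingStrategy` and `StrategyChooser` are hypothetical names, and the two booleans stand for the results of the manual checks (disabling JavaScript, then loading the page repeatedly).

```java
/** Hypothetical names, introduced only to illustrate the decision tree. */
enum ScrapingStrategy { PASSIVE, ACTIVE }

final class StrategyChooser {
    private StrategyChooser() {}

    /**
     * @param dataNeedsJavaScript    true if the data disappears when JavaScript is disabled
     * @param blocksFrequentRequests true if the website redirects clients that load pages too often
     */
    static ScrapingStrategy choose(boolean dataNeedsJavaScript, boolean blocksFrequentRequests) {
        if (dataNeedsJavaScript) {
            // The data only exists after scripts run, so a real browser is required.
            return ScrapingStrategy.ACTIVE;
        }
        if (blocksFrequentRequests) {
            // The HTML contains the data, but the server rejects clients that do not behave like a browser.
            return ScrapingStrategy.ACTIVE;
        }
        // The data is in the HTML and nothing blocks us: the fastest method is enough.
        return ScrapingStrategy.PASSIVE;
    }
}
```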

Now let’s describe how we are going to scrape this page. First, let’s take a look at the structure of the page.

[Figure: sector data table]

Each sector name is a link to a new page where the sector’s section is expanded and exposes new and more specific data. We will need to visit each page and scrape the specific data for each sector. For the sake of the example, we are going to scrape all the sector’s links with Jsoup (passive scraping method) and then use Selenium (active scraping method) for the rest of the data.

Targeting the correct HTML elements

One of the most frequent tasks in scraping is finding the HTML element that holds the data you want. Let’s take a look at the DOM (Document Object Model) of the page and try to find a CSS selector.

The most efficient way I have found to locate elements in the DOM is to right-click the element in the Chrome Developer Tools window (in the “Elements” tab) and select “Copy selector” in the “Copy” menu. If you target a sector name in the DOM and copy the CSS selector, you get:

#content > div > div > div.sector__data > div > div > div.sector-chart-and-data__data-table > div > div > div.sector-data-table__container > div:nth-child(8) > div > div.sector-data-table__sector-name > a

Notice that the CSS selector ends with:

div.sector-data-table__sector-name > a

That CSS selector means that we are searching for link elements that are direct children of a div with the class name shown above. So we are searching for HTML elements like this:

<div class="sector-data-table__sector-name">
<a href="#">My link</a>
</div>

It corresponds to the link elements we want.

Guidelines to find the perfect CSS selector:

  • Do you want to select multiple elements with the same query? Search for a class name that all of those HTML elements share.
  • Do you want to select one item? Search for an id on the HTML element.
  • You do not know how to find a class name or id that would work? Start by looking at the precise HTML element, and then search for a class name or id in its parent elements, starting with the direct parent.

Defining helper Java classes

Let’s define some Java classes to help us store the data in a structured way.

import java.util.ArrayList;
import java.util.List;

// One sub-sector and the value displayed next to it
class Performance {
   private String name;
   private String value;

   public Performance(String name, String value) {
       this.name = name;
       this.value = value;
   }
   // getters and setters
   // ...
}

// One sector and the performances of its sub-sectors
class Sector {
   private String name;
   private List<Performance> performances;

   public Sector(String name, List<Performance> performances) {
       this.name = name;
       this.performances = performances;
   }
   // getters and setters
   // ...
}

// Root object holding everything we scraped
class ScrapedData {
   private List<Sector> sectors = new ArrayList<>();

   public ScrapedData() {
   }
   // getters and setters
   // ...
}

Scraping the sector’s links

The first step in the Jsoup scraping method is to get the page behind a URL.

Document document = Jsoup.connect(url).get();

This statement will perform a request to the URL specified, get the source code of the page, and then parse it. The Document object is a Jsoup-specific object. The next step is to get all the sector links:

Elements linksToSector = document.select("a.sector-data-table__sector-name-link");

The select method searches the parsed HTML for all the elements that match the specified CSS query. The next step is to adapt the data we found:

ArrayList<String> stringLinksToSectors = new ArrayList<String>();
for (int i = 1; i < linksToSector.size(); i++) {
   stringLinksToSectors.add(BASE_URL + linksToSector.get(i).attr("href"));
}

The code block iterates over the links we found and prepends the base URL to each one (as those links are internal to the website, they do not contain the domain name, but we need it).

<a href="markets/sectors/telecommunication-services">
Telecommunication Services
</a>

This link does not contain the root domain name. This is why we need to prepend it.

We want to get rid of the “All Sectors” link, thus the iteration begins at i=1.

At that point, we have a List object with all the sector’s links.
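As an alternative to string concatenation, the JDK's `java.net.URI` can resolve a relative link against a base URL, which avoids missing or doubled slashes. Below is a minimal sketch. `LinkResolver` is a hypothetical helper, and `https://www.example.com/` is a placeholder, not the real website. Keep the trailing slash on the base URL so that relative paths are resolved against the root. (Jsoup also offers `absUrl("href")` on an element when the document was loaded from a URL.)

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper that turns relative hrefs into absolute URLs. */
final class LinkResolver {
    private LinkResolver() {}

    /** Resolves each href against baseUrl; baseUrl is expected to end with a slash. */
    static List<String> toAbsolute(String baseUrl, List<String> hrefs) {
        URI base = URI.create(baseUrl);
        List<String> absolute = new ArrayList<>();
        for (String href : hrefs) {
            absolute.add(base.resolve(href).toString());
        }
        return absolute;
    }
}
```

For example, `LinkResolver.toAbsolute("https://www.example.com/", List.of("markets/sectors/telecommunication-services"))` returns a list containing `https://www.example.com/markets/sectors/telecommunication-services`.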

Scraping the data

To access the sector-specific data, we need to visit and scrape every link we discovered in the previous steps. To avoid loading the pages too fast, which would look suspicious to the web server, we will use Selenium for this part.

The first step is to initialize the ChromeDriver.

System.setProperty("webdriver.chrome.driver", "chromedriver");
ChromeDriver driver = new ChromeDriver();

The first line sets the webdriver.chrome.driver system property, which tells the program where to find the chromedriver file. Selenium requires this file to control a Chrome browser. The second line instantiates a ChromeDriver Java object.

The next step is to iterate over the links we found in the previous steps and access the page behind the URL.

for (int i = 0; i < stringLinksToSectors.size(); i++) {
   String linkToCurrentSector = stringLinksToSectors.get(i);
   driver.get(linkToCurrentSector);
}

Once each page is loaded, we can find the elements we want:

// Requires: import org.openqa.selenium.By; import org.openqa.selenium.WebElement;
// findElement(By.cssSelector(...)) is used because findElementByCssSelector no longer exists in Selenium 4.
for (int i = 0; i < stringLinksToSectors.size(); i++) {
   String linkToCurrentSector = stringLinksToSectors.get(i);
   driver.get(linkToCurrentSector);
   String masterGroupName = driver.findElement(By.cssSelector(".sector-data-table__sector-row--selected a")).getText();
   List<WebElement> groupNames = driver.findElements(By.cssSelector(".sector-data-table__industry-group-name"));
   List<WebElement> groupValues = driver.findElements(By.cssSelector(".sector-data-table__industry-group-return"));
}
driver.quit(); // ends the WebDriver session and closes the browser

The three lookups that follow driver.get(...) retrieve the name of the sector whose page the driver is currently on, and then the names and values of all its sub-sectors.

The last line shuts down the Chrome instance.

Now let’s use our Java classes Performance, Sector, and ScrapedData to store all the data we scraped:

// Requires: import org.openqa.selenium.By; import org.openqa.selenium.WebElement;
ScrapedData resultData = new ScrapedData();
System.setProperty("webdriver.chrome.driver", "chromedriver");
ChromeDriver driver = new ChromeDriver();
for (int i = 0; i < stringLinksToSectors.size(); i++) {
   String linkToCurrentSector = stringLinksToSectors.get(i);
   driver.get(linkToCurrentSector);
   String sectorName = driver.findElement(By.cssSelector(".sector-data-table__sector-row--selected a")).getText();
   List<WebElement> subSectorName = driver.findElements(By.cssSelector(".sector-data-table__industry-group-name"));
   List<WebElement> subSectorValue = driver.findElements(By.cssSelector(".sector-data-table__industry-group-return"));
   // Pair each sub-sector name with the value found at the same index
   ArrayList<Performance> values = new ArrayList<Performance>();
   for (int j = 0; j < subSectorName.size(); j++) {
       values.add(
           new Performance(
               subSectorName.get(j).getText(),
               subSectorValue.get(j).getText()
           )
       );
   }
   Sector currentFinancialCategory = new Sector(sectorName, values);
   resultData.getSectors().add(currentFinancialCategory);
}
driver.quit(); // ends the WebDriver session and closes the browser
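The inner loop assumes that the page exposes exactly one value per sub-sector name. If the two selectors ever return lists of different lengths, the loop either throws an IndexOutOfBoundsException or silently drops the extra values. Here is a minimal sketch of a defensive pairing step. It uses plain strings instead of WebElement so that it runs without Selenium, and `PairingSketch.zip` is a hypothetical helper, not part of the example project.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Hypothetical helper: pairs names and values by index and fails loudly on a mismatch. */
final class PairingSketch {
    private PairingSketch() {}

    static List<Map.Entry<String, String>> zip(List<String> names, List<String> values) {
        if (names.size() != values.size()) {
            // A mismatch usually means the page layout changed or a selector is wrong.
            throw new IllegalStateException(
                "Found " + names.size() + " names but " + values.size() + " values");
        }
        List<Map.Entry<String, String>> pairs = new ArrayList<>();
        for (int i = 0; i < names.size(); i++) {
            pairs.add(Map.entry(names.get(i), values.get(i)));
        }
        return pairs;
    }
}
```

In the scraper, you would first map each WebElement to its text with getText() and then build one Performance per pair.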

Writing data to a file

Once the scraping operation is done, we need to store the data somewhere. The solution you choose for this step depends on how you will use the data. For this example, we will write the scraped data to a JSON file, using the Jackson Java library to convert it to JSON. At the end of our Java function, when the data is available, we can add:

ObjectMapper mapper = new ObjectMapper();
mapper.writeValue(new File("results.json"), resultData);

Conclusion

You can find the complete code and the guide to run the scraper at https://github.com/alexadrien/scrapingarticle.

  • Scraping the web helps you get the exact information you need without an API.
  • Key steps of scraping a website:
  1. Check whether the data is still available without JavaScript.
  2. Check whether the website has a protection that triggers if you scrape too fast.
  3. Search for CSS queries to access the data, using the Developer Tools of your web browser.
  4. Create Java classes to help you store your data.
  5. Parse the page and then use your data.