Web Scraping in Alteryx
Web scraping is the process of harvesting data (usually tabular) from websites. It is an extremely useful way to gather web-hosted data that isn’t supplied via Application Programming Interfaces (APIs).
For example, most retail store chains have a store locator map on their website. This information is quite useful for their customers, but it can be much more valuable for a number of other purposes.
What if you could extract all the data in these store locators into a handy spreadsheet full of potential business leads or competitor data? Some retail businesses have more than 100 branches, so manually retrieving the data for each store would become tedious and inefficient. You’ll find yourself spending hours of your valuable time copying and pasting the addresses, phone numbers and store hours from these maps into your own database. How can we get around this issue?
Web scraping is the answer.
Although there are many ways of achieving this with programming languages such as R or Python, you can also create a data extraction workflow to automate this time-consuming task without writing a single line of code. Here at The Information Lab Ireland, we were taught how this can be done easily with a few key tools in Alteryx.
Download your free Trial of Alteryx here and follow along!
Finding the appropriate web data
Make sure you’ve downloaded and installed Alteryx from the link above.
Navigate to the retail store website of your choice. In this example, I will be extracting data on all of the locations of an Irish retail store, which will remain anonymised for this exercise. However, the steps described below can be applied to many store locator webpages.
Then press Ctrl + Shift + I to bring up the Google Chrome Developer Tools.
Navigate over to the Network tab.
Refresh the webpage. This lets you inspect the page’s network activity, generating a list of all the different requests that make up its view and interactivity.
From this list, you can find the specific request that returns all store locations.
Once you have found the relevant request, go to the Headers tab to retrieve its URL. Copy and paste this link into a different browser tab to view its contents.
The URL we will use returns the store data in JSON format.
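If you would like to confirm outside the browser that the link really returns JSON, a minimal Python sketch is given here. It is not part of the Alteryx workflow, and the function name fetch_json and its opener parameter are hypothetical names introduced only for illustration. You would pass in the URL you copied from the Headers tab.

```python
import json
import urllib.request


def fetch_json(url, opener=urllib.request.urlopen, timeout=30):
    """Download `url` and parse the response body as JSON.

    `opener` is a hypothetical hook (defaulting to urllib's urlopen)
    so the function can be exercised without a live network call.
    """
    with opener(url, timeout=timeout) as response:
        # Assumes the endpoint returns UTF-8 encoded JSON text.
        return json.loads(response.read().decode("utf-8"))
```

If the call raises a json.JSONDecodeError, the link is probably returning HTML rather than JSON, and it is worth going back to the Network tab to look for a different request.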
Steps in Alteryx – How do we actually download the data?
There are several key tools that you will need when downloading data from the web, as highlighted in the workflow and processing steps below:
- Copy and paste the URL into a Text Input tool (as good practice, name the field URL rather than leaving it as Field1)
- Then use a Download tool to retrieve the data from the specified URL, so it can be used in downstream processing or saved to a file
- The text in the download data is in JSON format, so we will use the JSON Parse tool to parse it into a table
- Use Text To Columns to split JSON Name on a full stop delimiter. This extracts the table structure needed to crosstab the results into a table format we are familiar with
- Use a Cross Tab tool to group by the Store ID field (a number), using the column names field for the column headers and the JSON value for the values
- Deselect unnecessary fields and rename the fields we need for ease of recognition
- We can then use a Google Sheets Output tool to conveniently publish the cleansed data from the Alteryx workflow to a Google Sheets spreadsheet. You can visit this site to download and learn more about this tool. https://help.alteryx.com/zh/node/416
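For readers curious about what the parse, split and crosstab steps are doing to the data, the same logic can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the sample payload, its field names and the helper names flatten and crosstab are all hypothetical, and the real response from your chosen store locator will be structured differently.

```python
import json


def flatten(node, prefix=""):
    """Yield (dotted_name, value) pairs for every leaf in parsed JSON,
    similar in spirit to the name/value rows a JSON parse step produces."""
    if isinstance(node, dict):
        items = node.items()
    elif isinstance(node, list):
        items = enumerate(node)
    else:
        yield prefix, node
        return
    for key, child in items:
        name = f"{prefix}.{key}" if prefix else str(key)
        yield from flatten(child, name)


def crosstab(pairs):
    """Split each name on full stops, group by the record index,
    and spread the remaining field name into columns."""
    table = {}
    for name, value in pairs:
        parts = name.split(".", 2)
        if len(parts) < 3:
            continue  # not a per-store field, so skip it
        _, record_id, column = parts
        table.setdefault(record_id, {})[column] = value
    return table


# Hypothetical payload standing in for the anonymised store locator response.
sample = json.loads("""
{"stores": [
  {"name": "Store A", "address": {"city": "Dublin"}, "lat": 53.35, "lng": -6.26},
  {"name": "Store B", "address": {"city": "Cork"}, "lat": 51.9, "lng": -8.47}
]}
""")

rows = crosstab(flatten(sample))
for store_id, fields in rows.items():
    print(store_id, fields)
```

Each key in rows plays the role of the Store ID used for grouping, and each inner dictionary holds one store’s columns, which is the shape the crosstab step produces before unnecessary fields are deselected.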
And there you have it! Below are the results of this workflow showing the address and coordinate information for all store locations.
For more blog updates or any questions, please follow me on Twitter!
What is The Information Lab Ireland’s Graduate Training Program?
The Information Lab Ireland’s Graduate Training Program is a two-and-a-half-year program for people with drive and a desire to try something new in data.
When they join us, most of our candidates are completely new to Tableau and Alteryx. After a 14-week intensive training course, they become part of our consulting team, available for long-term engagements with our clients.
We train our graduates in both the technical and soft skills required to be a top-class data analytics consultant.
Want to join us?
When we are hiring, we will post any recruiting news and event information on our blog, so keep an eye out here.
The Information Lab Ireland is at the forefront of creating a data-driven culture in Ireland.
As part of its vision, The Information Lab Ireland regularly hosts free events throughout the country to show how being data-driven can improve decision making and lead to a better understanding of the world around us.