R for PDF Scraping

Nikita Parab

I recently worked on a project that required me to scrape a lot of PDFs. The problem was the structure of some of the tables, which made it very difficult to get correct values in Alteryx and Tableau Prep. So I decided to use an R script instead, and it worked wonders. If you have a PDF that is really tough to scrape in Alteryx or Tableau Prep, give this code a try!

Prerequisites

You will need RStudio Desktop, which is free (along with R itself), and Java. Once everything is downloaded and installed, open RStudio and let's get started!

The first step is to install the packages we need to scrape our PDF. Packages are collections of functions that have already been written for you. Once installed, each package also has to be loaded before its functions can be used. The packages are listed below; just copy and paste the code.

Install Packages

install.packages("RODBC")
install.packages("munsell")
install.packages("tidyverse")
install.packages("tabulizer")
install.packages("tabulizerjars")
install.packages("rJava")
install.packages("reshape2")
install.packages("miniUI")
install.packages("reshape")

If you get an error while installing the packages, make sure you have the latest versions of Java and RStudio.
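If something still fails, it can help to see which pieces are actually missing. The snippet below is a minimal sketch using only base R: it lists which of a few key packages are not yet installed and checks whether a java executable is visible on the PATH. The package subset is my own choice for illustration, and the Java check is only a rough hint, since rJava may locate Java by other means.

```r
# Packages that the scraping steps rely on most directly (illustrative subset).
needed <- c("rJava", "tabulizerjars", "tabulizer", "tidyverse")

# Compare against what is already installed in this R library.
missing <- setdiff(needed, rownames(installed.packages()))

if (length(missing) > 0) {
  message("Not installed yet: ", paste(missing, collapse = ", "))
} else {
  message("All listed packages are installed.")
}

# Sys.which() returns an empty string when the program is not on the PATH.
java_path <- Sys.which("java")
if (nzchar(java_path)) {
  message("Java found at: ", java_path)
} else {
  message("No 'java' executable found on the PATH.")
}
```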

Call Packages

library(miniUI)
library(rJava)
library(tabulizerjars)
library(tidyverse)
library(tabulizer)
library(munsell)
library(reshape2)
library(reshape)

Once all the packages are loaded, we can start with the code. The first thing to do is select the area of the table. Copy and paste the code below into RStudio, then edit the file location and the page number. PS: On Windows, the file location should be entered with '\\' instead of '\', because a single backslash is an escape character in R strings.
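To illustrate the point about backslashes, here is a small base R sketch. The path is hypothetical; replace it with the location of your own PDF. Forward slashes are also accepted by R on Windows, which avoids the doubling altogether.

```r
# Hypothetical Windows path, written with doubled backslashes.
path_escaped <- "C:\\Users\\me\\Documents\\report.pdf"

# The same hypothetical path written with forward slashes.
path_forward <- "C:/Users/me/Documents/report.pdf"

# cat() shows the string as R actually stores it: one backslash per separator.
cat(path_escaped, "\n")

# Converting one form to the other (fixed = TRUE treats "\\" as a literal backslash).
converted <- gsub("\\", "/", path_escaped, fixed = TRUE)

# Check that the file exists before trying to scrape it.
if (!file.exists(path_forward)) {
  message("File not found: ", path_forward)
}
```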

Select Table to get coordinates

area <- locate_areas("File Location",  # replace with the path to your PDF
                     pages = 1,        # replace with your page number
                     copy = FALSE)

Once you run the above statement, a pop-up will ask you to select the area. Select the area and click Done. After that, it is time to extract the table. Copy and paste the code below, editing the file location and page number, to get the table. You can enter multiple page numbers separated by commas. You need to enter the same page number twice if you need to scrape two tables from the same page.

Get Table as a list

table <- extract_tables("File Location",       # replace with the path to your PDF
                        output = "data.frame",
                        pages = c(1),          # replace with your page number(s)
                        area = list(area[[1]]),
                        guess = FALSE)
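For the case of two tables on the same page, the steps above might be combined as in the sketch below. It follows the description in this post (the page number is repeated once per table) and has not been checked against every tabulizer version. The file path and page number are hypothetical, and the guard means the code simply does nothing if tabulizer or the file is missing, or if R is not running interactively.

```r
pdf_file <- "C:/Users/me/Documents/report.pdf"  # hypothetical path
page <- 2                                       # hypothetical page number

# locate_areas() opens an interactive selector, so only run it in a live session.
if (interactive() &&
    requireNamespace("tabulizer", quietly = TRUE) &&
    file.exists(pdf_file)) {
  library(tabulizer)

  # Repeat the page once per table; you are asked to select an area each time.
  areas <- locate_areas(pdf_file, pages = c(page, page), copy = FALSE)

  # Pass the same repeated pages, with one area per page entry.
  tables <- extract_tables(pdf_file,
                           output = "data.frame",
                           pages = c(page, page),
                           area = areas,
                           guess = FALSE)

  first_table <- tables[[1]]
  second_table <- tables[[2]]
}
```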

The last step is to save the extracted table as a data frame. extract_tables returns a list, so take its first element using the code below. You can then clean it up a bit and save it using write.csv.

Save as a dataframe

data <- table[[1]]
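What the clean-up looks like depends entirely on your table. As a rough sketch in base R, the example below uses a small made-up data frame as a stand-in for table[[1]]; the column names, values, and output file name are all invented for illustration. It drops blank rows, converts a text column with thousands separators to numbers, and writes the result with write.csv.

```r
# Stand-in for table[[1]]; the values are made up for illustration.
data <- data.frame(
  Region = c("North", "South", "", "East"),
  Sales  = c("1,200", "950", "", "2,340"),
  stringsAsFactors = FALSE
)

# Drop rows where every cell is empty or NA.
is_blank <- function(x) is.na(x) | trimws(x) == ""
empty_row <- apply(data, 1, function(row) all(is_blank(row)))
data <- data[!empty_row, , drop = FALSE]

# Remove thousands separators and convert the column to numeric.
data$Sales <- as.numeric(gsub(",", "", data$Sales))

# Write to a temporary folder here; point this at your own folder instead.
out_file <- file.path(tempdir(), "scraped_table.csv")
write.csv(data, out_file, row.names = FALSE)
```

Setting row.names = FALSE keeps R's row numbers out of the CSV, which is usually what you want when the file is going into Alteryx or Tableau Prep.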

If you have any questions or feedback, feel free to reach me on LinkedIn or Twitter.

