How to Scrape PDF Data with Python – PyPDF2

The Task

One of the most valuable data analysis techniques with Python is the ability to scrape text and metadata from websites. With libraries such as Requests, Selenium, and BeautifulSoup, you can easily read and collect HTML from web pages. Several Python libraries also let you process binary data sources such as images and PDFs.

Today we will discuss a straightforward method for reading PDF data in Python using the PyPDF2 package. The library isn’t limited to reading PDF data: it also supports split/merge operations, page cropping, file encryption, and more. I’ve previously worked on computer vision projects using this library for logo detection (more on that to come!). It’s extremely powerful because it gives you programmatic access to PDF data and metadata. For example, you can compile text from several PDFs of the same type into a single report instead of editing each file manually and then merging them.

The Code </>

Since this library is written in pure Python, the code is fairly easy to write. For this example, I’ll be using a version of the script of Star Wars: Revenge of the Sith.

First, let’s import our library like so:

import PyPDF2  # PDF2Text

Next, let’s specify a file path variable. I’m running my code in the same folder as my PDF, so mine is simply the file name:

import PyPDF2  # PDF2Text

path = 'ROTS.pdf'
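A bare file name only resolves when the script is launched from the folder that holds the PDF. As a minimal sketch, assuming you would rather fail early with a clear message, the path can be checked with the standard library before any PDF code runs. The helper name resolve_pdf_path and its base_dir parameter are made up for illustration and are not part of PyPDF2:

```python
from pathlib import Path


def resolve_pdf_path(name, base_dir=None):
    """Return an absolute path to `name`, raising early if the file is missing.

    `resolve_pdf_path` and `base_dir` are illustrative names, not PyPDF2 API.
    """
    # Default to the current working directory, matching the tutorial setup.
    base = Path(base_dir) if base_dir is not None else Path.cwd()
    candidate = (base / name).resolve()
    if not candidate.is_file():
        raise FileNotFoundError(f"No PDF found at {candidate}")
    return candidate
```

With this in place, path = resolve_pdf_path('ROTS.pdf') would stand in for the plain string assignment, and a missing file shows up as a FileNotFoundError that names the full path.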

Excellent! Now we can begin writing the code to read our PDF.

Let’s start by adding a with statement that opens the PDF in binary mode and wraps the file in a PyPDF2 reader object. The reader’s pages attribute behaves like a list of page objects, so pages[0] is the first page of the PDF, and calling extract_text() on it returns that page’s text as a string. (These are the names used by recent PyPDF2 releases. Older releases call them PdfFileReader and extractText(), and PyPDF2 3.0 removed those older names.)

import PyPDF2  # PDF2Text

path = 'ROTS.pdf'

with open(path, mode='rb') as file:
    pdf = PyPDF2.PdfReader(file)
    print(pdf.pages[0].extract_text())

Perfect! Now, using this same extract_text() method, we can use a for loop to access each page of the PDF programmatically:

with open(path, mode='rb') as file:
    pdf = PyPDF2.PdfReader(file)
    for p in pdf.pages:
        print(p.extract_text())
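PyPDF2 renamed its reader class and text method between releases (PdfFileReader and extractText() in older versions, PdfReader and extract_text() in newer ones), so a script can break depending on what is installed. The sketch below is one hedged way to tolerate both and to collect the page text into a single string instead of printing it. The function names page_text and pdf_to_text are my own, not part of the library:

```python
def page_text(page):
    """Return the text of one page, using whichever method name is available."""
    # Newer PyPDF2 releases expose extract_text(); older ones only extractText().
    extract = getattr(page, "extract_text", None) or getattr(page, "extractText")
    # Fall back to an empty string in case a page yields no text.
    return extract() or ""


def pdf_to_text(path):
    """Read every page of the PDF at `path` and join the text with newlines."""
    import PyPDF2  # imported here so page_text() works without PyPDF2 installed

    # Pick whichever reader class this PyPDF2 version provides.
    reader_cls = getattr(PyPDF2, "PdfReader", None) or PyPDF2.PdfFileReader
    with open(path, mode="rb") as file:
        reader = reader_cls(file)
        return "\n".join(page_text(p) for p in reader.pages)
```

Calling pdf_to_text('ROTS.pdf') would then return the whole script as one string, which is easier to search or save than printed output.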

And with that, you can read PDF data in Python with ease. With these simple code snippets, you can iterate through the text of most PDFs, although scanned PDFs that contain only page images have no text layer to extract. Happy PDF scraping!
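The single-report idea from the introduction can be sketched with the same page loop. This is only an illustration under stated assumptions: it assumes a PyPDF2 release that provides PdfReader and extract_text(), and the names read_pdf_text and build_report, the extract parameter, and the section header format are all invented for this example:

```python
from pathlib import Path


def read_pdf_text(path):
    """Extract all page text from one PDF (assumes PyPDF2 with PdfReader)."""
    from PyPDF2 import PdfReader  # lazy import: only needed for real PDFs

    with open(path, mode="rb") as file:
        pages = PdfReader(file).pages
        return "\n".join(page.extract_text() or "" for page in pages)


def build_report(paths, out_path, extract=read_pdf_text):
    """Write the text of each PDF in `paths` to `out_path`, one section per file."""
    sections = []
    for path in paths:
        name = Path(path).name
        # A plain-text header marks where each source file starts.
        sections.append(f"===== {name} =====\n{extract(path)}")
    report = "\n\n".join(sections)
    Path(out_path).write_text(report, encoding="utf-8")
    return report
```

The extract parameter exists so the report logic can be tried with a stand-in function before pointing it at real files, for example build_report(['a.pdf', 'b.pdf'], 'report.txt').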
