PDF Parser and Reader

Context:

Portable Document Format (PDF) is a file format used to present documents in a manner independent of application software, hardware, and operating systems. Adobe is the name most of us associate with PDF, and the format has its technical foundations in PostScript.

 

Relevance to Test Automation: 

PDF files are used across many applications because of this independence from application software and hardware. When we want to print HTML content, or save HTML documents that contain a lot of formatting (colors, images, fonts) without losing any of it, we often print to a PDF as a soft copy. For example, we might want to print the legal documents on a website as PDFs, or make certain content available to customers in PDF format so that it is not easily editable. PDF technologies also let us apply watermarks, password controls and encryption to financial documents.

In this section, we will talk about the pdf-reader gem, which parses a PDF file and prints some of its metadata and text. One application where you might want to use pdf-reader is checking whether a piece of text exists anywhere in a PDF and on which page it appears. This scenario is particularly useful in legal software, where we might want to verify that we did NOT miss presenting any critical information or disclosures in client-facing documentation.
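
The page-level text check described above can be sketched roughly as follows. This is a minimal illustration, not the scenario code from this post: the helper names (normalize_text, pages_containing, find_phrase_in_pdf) are hypothetical, and the whitespace normalization is an assumption added because extracted PDF text often contains line breaks in the middle of a phrase.

```ruby
# Collapse whitespace and case so that line breaks introduced by
# text extraction do not hide a match.
def normalize_text(text)
  text.gsub(/\s+/, " ").strip.downcase
end

# Return the 1-based numbers of the pages whose text contains the phrase.
# `reader` only needs to respond to #pages, and each page to #text,
# as PDF::Reader and PDF::Reader::Page do.
def pages_containing(reader, phrase)
  needle = normalize_text(phrase)
  reader.pages.each_with_index
        .select { |page, _index| normalize_text(page.text).include?(needle) }
        .map { |_page, index| index + 1 }
end

# Hypothetical entry point: open a local file with pdf-reader and search it.
def find_phrase_in_pdf(path, phrase)
  require "pdf/reader" # provided by the pdf-reader gem
  pages_containing(PDF::Reader.new(path), phrase)
end
```

An empty array from pages_containing means the phrase was not found on any page, which is the condition a disclosure check would fail on.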

Agenda:

  • pdf-reader gem
  • Installation and use in project
  • Important lines of code
  • Complete Cucumber scenario and step definitions

1) pdf-reader:

The PDF::Reader library implements a PDF parser conforming as much as possible to the PDF specification from Adobe. Full details are available on the gem's project page.

It provides programmatic access to the contents of a PDF file with a high degree of flexibility.

2) Installation:

Add gem 'pdf-reader' to the Gemfile as follows:

[Image: gemfile_pdfreader]

 

Add require 'pdf/reader' to load the module and make it available to Cucumber:

[Image: env_pdfreader]

 

Download the sample PDF file that is used by the Cucumber scenarios below. Place the file “AdobeXmlFormsSamples.pdf” in the /features/support/datafiles directory.

3) Important lines of code:

Read a PDF file and get a reference to the reader object
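
As a rough sketch of this step: PDF::Reader.new accepts a file path and returns the reader object. The constant and helper names below (DATAFILES_DIR, datafile_path, open_pdf) are hypothetical. The directory comes from the installation notes above, and the explicit file check is an assumption added to give a clearer error message.

```ruby
# Directory (relative to the project root) where the sample PDF is stored.
DATAFILES_DIR = File.join("features", "support", "datafiles")

# Build the full path to a data file; base_dir defaults to the current directory.
def datafile_path(name, base_dir = Dir.pwd)
  File.join(base_dir, DATAFILES_DIR, name)
end

# Open a local PDF and return the PDF::Reader handle.
def open_pdf(path)
  raise ArgumentError, "PDF not found: #{path}" unless File.file?(path)

  require "pdf/reader" # provided by the pdf-reader gem
  PDF::Reader.new(path)
end
```

With the sample file in place, open_pdf(datafile_path("AdobeXmlFormsSamples.pdf")) would return the handle used in the snippets that follow.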

PDF version, document info, metadata and page count
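
A minimal sketch of reading these values, using the pdf_version, info, metadata and page_count methods of PDF::Reader. The helper names pdf_summary and print_pdf_summary are hypothetical and only group the calls for illustration.

```ruby
# Collect the document-level details exposed by a PDF::Reader-like object.
def pdf_summary(reader)
  {
    version: reader.pdf_version,   # PDF specification version, e.g. 1.4
    info: reader.info,             # hash from the document info dictionary
    metadata: reader.metadata,     # XML metadata as a string, or nil if absent
    page_count: reader.page_count  # number of pages in the document
  }
end

# Hypothetical usage: open a local file and print each detail.
def print_pdf_summary(path)
  require "pdf/reader" # provided by the pdf-reader gem
  pdf_summary(PDF::Reader.new(path)).each do |name, value|
    puts "#{name}: #{value.inspect}"
  end
end
```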

Reading a PDF from a stream using open-uri. This means you can pass a URL to the open() method (URI.open on current Ruby versions) if your website hosts a PDF at a certain URL.
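
A hedged sketch of the stream variant: PDF::Reader.new also accepts an IO-like object, so the object returned by open-uri can be passed straight in. The helper names http_url? and open_remote_pdf are hypothetical, and the URL check is an assumption added so that a local path is not handed to URI.open by mistake. On Ruby 3.0 and later, a bare open(url) no longer fetches URLs, so URI.open is used here.

```ruby
require "open-uri" # standard library; adds URI.open for http(s) URLs

# True when the string parses as an http or https URL.
def http_url?(url)
  URI.parse(url).is_a?(URI::HTTP) # URI::HTTPS is a subclass of URI::HTTP
rescue URI::InvalidURIError
  false
end

# Download a hosted PDF and return a PDF::Reader for it.
def open_remote_pdf(url)
  raise ArgumentError, "not an http(s) URL: #{url}" unless http_url?(url)

  require "pdf/reader"    # provided by the pdf-reader gem
  io = URI.open(url)      # returns an IO-like object (StringIO or Tempfile)
  PDF::Reader.new(io)
end
```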

Read fonts and text
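
One way to sketch this step is to loop over reader.pages and read the fonts and text of each PDF::Reader::Page. The helper names page_details and print_page_details are hypothetical, and the page numbers are taken from the loop index.

```ruby
# Return one hash per page with its 1-based number, fonts and extracted text.
def page_details(reader)
  reader.pages.each_with_index.map do |page, index|
    {
      number: index + 1,
      fonts: page.fonts, # font resources declared for the page
      text: page.text    # text content extracted from the page
    }
  end
end

# Hypothetical usage: print the fonts and text of every page in a local file.
def print_page_details(path)
  require "pdf/reader" # provided by the pdf-reader gem
  page_details(PDF::Reader.new(path)).each do |details|
    puts "Page #{details[:number]}"
    puts "Fonts: #{details[:fonts].inspect}"
    puts details[:text]
  end
end
```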

4) Complete Cucumber Scenario:

Step Definitions:

Output:

  • First scenario opens a local pdf file and prints meta text
  • Second scenario uses open-uri and prints meta text
  • Third scenario loops through pdf pages and prints fonts and text
  • Fourth scenario is a real-world example of how to use pdf-reader

[Image: pdfreader_output]