Convert PDF into TXT with Debian Linux

To convert Portable Document Format (PDF) files into text from the console, install the Poppler utilities. Poppler is a PDF rendering library based on the xpdf PDF viewer. The poppler-utils package provides several useful command-line utilities for inspecting PDF files and converting them into various formats:

  • pdffonts -- font analyzer 
  • pdfimages -- image extractor 
  • pdfinfo -- document information 
  • pdftoabw -- PDF to Abiword converter 
  • pdftohtml -- PDF to HTML converter 
  • pdftoppm -- PDF to PPM/PNG/JPEG image converter 
  • pdftops -- PDF to PostScript (PS) converter 
  • pdftotext -- text extraction

 

Install poppler-utils

Installing poppler-utils on a Debian box takes a single command, run as root or with sudo:

apt-get install poppler-utils

Usage examples of poppler-utils

Convert PDF into text:

pdftotext filename.pdf filename.txt

Convert first 5 pages of PDF (-l sets the last page to convert)

pdftotext -l 5 filename.pdf filename.txt

Convert from page 5 to the end of PDF (-f sets the first page to convert)

pdftotext -f 5 filename.pdf filename.txt
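
Converting only the last few pages needs the total page count, which pdftotext does not take as a relative value. The sketch below reads the count from pdfinfo (also part of poppler-utils) and passes the computed start page to -f. The function name convert_last_pages and its argument order are made up for this example, and it assumes pdfinfo prints a "Pages:" line, as it does for ordinary documents.

```shell
# convert_last_pages is a hypothetical helper, not part of poppler-utils.
# Usage: convert_last_pages COUNT input.pdf output.txt
convert_last_pages() {
    count="$1"; input="$2"; output="$3"
    # pdfinfo prints a line such as "Pages:          12"; keep the number.
    total=$(pdfinfo "$input" | awk '/^Pages:/ {print $2}')
    if [ -z "$total" ]; then
        echo "could not read page count from $input" >&2
        return 1
    fi
    # First page of the tail; clamp to 1 when the document is short.
    first=$((total - count + 1))
    if [ "$first" -lt 1 ]; then
        first=1
    fi
    # No -l is given, so conversion runs to the end of the document.
    pdftotext -f "$first" "$input" "$output"
}
```

For example, convert_last_pages 5 filename.pdf filename.txt converts pages 8 to 12 of a 12-page file.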

Convert password protected PDF

pdftotext -upw 'password' filename.pdf filename.txt
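
The single-file commands above extend naturally to a whole directory. The following is a minimal sketch of a batch conversion loop; convert_all_pdfs is a hypothetical name chosen for this example, and each output file simply reuses the PDF's base name with a .txt suffix.

```shell
# convert_all_pdfs is a hypothetical helper, not part of poppler-utils.
# Usage: convert_all_pdfs [directory]   (defaults to the current directory)
convert_all_pdfs() {
    dir="${1:-.}"
    for pdf in "$dir"/*.pdf; do
        # Skip the unexpanded pattern when the directory holds no PDF files.
        [ -e "$pdf" ] || continue
        # Strip the .pdf suffix and append .txt for the output name.
        pdftotext "$pdf" "${pdf%.pdf}.txt"
    done
}
```

Calling convert_all_pdfs /path/to/docs writes one text file next to each PDF in that directory. Quoting "$pdf" keeps file names with spaces intact.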

How to use pdftotext

Usage: pdftotext [options] [PDF-file] [text-file]

where options are:

-f number

Specifies the first page to convert.

-l number

Specifies the last page to convert.
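
The -f and -l options can be combined to pull out a page range. Here is a small sketch that wraps the two options and rejects a reversed range before calling pdftotext; the function name extract_pages and its argument order are assumptions made for illustration.

```shell
# extract_pages is a hypothetical helper, not part of poppler-utils.
# Usage: extract_pages FIRST LAST input.pdf output.txt
extract_pages() {
    first="$1"; last="$2"; input="$3"; output="$4"
    # Refuse a range whose start lies after its end.
    if [ "$first" -gt "$last" ]; then
        echo "first page ($first) is after last page ($last)" >&2
        return 1
    fi
    pdftotext -f "$first" -l "$last" "$input" "$output"
}
```

For example, extract_pages 3 7 filename.pdf filename.txt converts pages 3 through 7 only.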

-r number

Specifies the resolution, in DPI.  The default is 72 DPI.

-x number

Specifies the x-coordinate of the top-left corner of the crop area.

-y number

Specifies the y-coordinate of the top-left corner of the crop area.

-W number

Specifies the width of the crop area in pixels (default is 0).

-H number

Specifies the height of the crop area in pixels (default is 0).
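
The four crop options work together with -r: coordinates are counted in pixels at the chosen resolution, so at 72 DPI one pixel corresponds to one PDF point. The sketch below extracts a rectangle from the first page only; extract_region is a hypothetical wrapper, and the coordinates in the usage note are arbitrary values, not taken from any real document.

```shell
# extract_region is a hypothetical helper, not part of poppler-utils.
# Usage: extract_region X Y WIDTH HEIGHT input.pdf output.txt
extract_region() {
    x="$1"; y="$2"; width="$3"; height="$4"; input="$5"; output="$6"
    # Limit the conversion to page 1 and state the resolution explicitly,
    # so the pixel values below always refer to the same scale.
    pdftotext -f 1 -l 1 -r 72 \
        -x "$x" -y "$y" -W "$width" -H "$height" \
        "$input" "$output"
}
```

For example, extract_region 50 100 400 200 filename.pdf region.txt keeps only the text inside a 400 by 200 pixel box whose top-left corner sits at (50, 100).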

-layout

Maintain (as best possible) the original physical layout of the text. The default is to 'undo' physical layout (columns, hyphenation, etc.) and output the text in reading order.
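
Layout mode is handy when searching tabular PDFs, because each table row stays on one line. As a rough sketch, the function below sends the layout-preserving text to standard output and filters it with grep. It assumes your pdftotext accepts "-" as the text file name to mean standard output, as the poppler man page describes; pdf_grep is a hypothetical name.

```shell
# pdf_grep is a hypothetical helper, not part of poppler-utils.
# Usage: pdf_grep PATTERN input.pdf
pdf_grep() {
    pattern="$1"; input="$2"
    # "-" as the output name writes the text to stdout instead of a file;
    # grep -n prefixes each match with its line number in the extracted text.
    pdftotext -layout "$input" - | grep -n -- "$pattern"
}
```

For example, pdf_grep 'Total' filename.pdf lists every extracted line that contains the word Total.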

-raw

Keep the text in content stream order. This is a hack which often "undoes" column formatting, etc. Use of raw mode is no longer recommended.

-htmlmeta

Generate a simple HTML file, including the meta information. This simply wraps the text in <pre> and </pre> and prepends the meta headers.
