Convert PDF into TXT with Debian Linux
To convert Portable Document Format (PDF) files into text from the console, install the Poppler utilities. Poppler is a PDF rendering library based on the xpdf PDF viewer. The poppler-utils package provides several useful command-line utilities for inspecting PDF files and converting them into various formats:
- pdffonts -- font analyzer
- pdfimages -- image extractor
- pdfinfo -- document information
- pdftoabw -- PDF to Abiword converter
- pdftohtml -- PDF to HTML converter
- pdftoppm -- PDF to PPM/PNG/JPEG image converter
- pdftops -- PDF to PostScript (PS) converter
- pdftotext -- text extraction
Install poppler-utils
Installing poppler-utils on a Debian box takes a single command, run as root:
apt-get install poppler-utils
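To confirm that the installation worked, one option is to check that the tools are on the PATH. The sketch below assumes a POSIX shell; have_tool is a hypothetical helper name, not something shipped with poppler-utils.

```shell
#!/bin/sh
# have_tool is a hypothetical helper (not part of poppler-utils):
# it succeeds if the named command can be found on the PATH.
have_tool() {
    command -v "$1" >/dev/null 2>&1
}

# Report on a few of the utilities listed above.
for tool in pdftotext pdfinfo pdfimages; do
    if have_tool "$tool"; then
        echo "$tool: found"
    else
        echo "$tool: missing (try: apt-get install poppler-utils)"
    fi
done
```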
Usage examples of poppler-utils
Convert PDF into text
pdftotext filename.pdf filename.txt
Convert first 5 pages of PDF (-l sets the last page to convert)
pdftotext -l 5 filename.pdf filename.txt
Convert PDF from page 5 to the end (-f sets the first page to convert)
pdftotext -f 5 filename.pdf filename.txt
Convert password-protected PDF (-upw supplies the user password)
pdftotext -upw 'password' filename.pdf filename.txt
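Because -f and -l take absolute page numbers, converting the last pages of a document means finding the page count first. A minimal sketch, assuming pdfinfo (from the same package) prints a "Pages:" line; first_of_last_n and convert_last_n are hypothetical helper names and filename.pdf is a placeholder.

```shell
#!/bin/sh
# Hypothetical helper: given the total page count and the number of
# trailing pages wanted, print the first page of that range (never below 1).
first_of_last_n() {
    total=$1
    n=$2
    first=$((total - n + 1))
    if [ "$first" -lt 1 ]; then
        first=1
    fi
    echo "$first"
}

# Hypothetical wrapper: convert the last N pages of a PDF into text.
# Usage: convert_last_n N input.pdf output.txt
convert_last_n() {
    # pdfinfo prints a line such as "Pages:          12"
    pages=$(pdfinfo "$2" | awk '/^Pages:/ {print $2}')
    pdftotext -f "$(first_of_last_n "$pages" "$1")" "$2" "$3"
}
```

For example, convert_last_n 5 filename.pdf filename.txt on a 12-page file would call pdftotext with -f 8, so pages 8 to 12 are converted.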
How to use pdftotext
Usage: pdftotext [options] [PDF-File] [text-file]
where options are:
-f number
Specifies the first page to convert.
-l number
Specifies the last page to convert.
-r number
Specifies the resolution, in DPI. The default is 72 DPI.
-x number
Specifies the x-coordinate of the top-left corner of the crop area.
-y number
Specifies the y-coordinate of the top-left corner of the crop area.
-W number
Specifies the width of crop area in pixels (default is 0)
-H number
Specifies the height of crop area in pixels (default is 0)
-layout
Maintain (as best as possible) the original physical layout of the text. The default is to "undo" physical layout (columns, hyphenation, etc.) and output the text in reading order.
-raw
Keep the text in content stream order. This is a hack which often "undoes" column formatting, etc. Use of raw mode is no longer recommended.
-htmlmeta
Generate a simple HTML file, including the meta information. This simply wraps the text in <pre> and </pre> and prepends the meta headers.
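The options can be combined in a single call. The sketch below assumes a POSIX shell; extract_pages and the PDFTOTEXT variable are illustrative names, not part of poppler-utils. The variable only makes it possible to substitute another binary, or echo for a dry run.

```shell
#!/bin/sh
# PDFTOTEXT is an illustrative override hook; it defaults to pdftotext.
PDFTOTEXT=${PDFTOTEXT:-pdftotext}

# Hypothetical wrapper: extract pages FIRST..LAST, keeping the physical layout.
# Usage: extract_pages FIRST LAST input.pdf output.txt
extract_pages() {
    "$PDFTOTEXT" -f "$1" -l "$2" -layout "$3" "$4"
}
```

For example, extract_pages 2 3 filename.pdf filename.txt converts pages 2 and 3 only. Passing - as the text file sends the text to standard output instead, which is handy for piping into other tools.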