Search PDFs With PHP, MySQL, and PdfToText

Being able to search a PDF is a very useful feature on any web site.  The problem is that there aren’t many languages that give you the tools to do so right out of the box.  PHP is no exception to this.  If you want to search PDF files you’ll need some third-party tools and a little bit of ingenuity.

Pre-requisites

Your server will need the following configuration.

  • PHP (>=4)
  • MySQL (>=4)
  • Linux (Distro of your choice)

Step 1:  Download PdfToText

PdfToText is a command-line program from the Xpdf project (written in C++) that quickly converts the contents of a PDF to text, which is exactly what we’re going to use it for.  You can download it at http://www.foolabs.com/xpdf/download.html.  Once you have downloaded the file, place it somewhere in your web site directory and extract it (on most Linux systems “tar -xzf [file]” will do the trick).  Once it’s unpacked, you’ll see a program called “pdftotext”, which is what we’re after.
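
Before wiring the program into PHP, it can help to confirm that the web server user is actually allowed to run it.  The following is a minimal sketch; the function name pdftotext_is_usable and the ./pdftotext path are illustrative assumptions, not part of Xpdf.

```php
<?php
// Hypothetical helper: returns true when $binary is a regular file that the
// current user (usually the web server account) is allowed to execute.
function pdftotext_is_usable($binary) {
    return is_file($binary) && is_executable($binary);
}

// Assumed location: the directory the archive was extracted into.
if (!pdftotext_is_usable('./pdftotext')) {
    echo "pdftotext is missing or not executable; check the path and permissions.\n";
}
```

If the check fails, the usual suspects are a wrong path or a missing execute bit on the file.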

Step 2:  Convert the PDF to Text

As an astute reader, you’ve probably noticed by now that PdfToText is not a PHP file.  So how are we going to use it?  Well, we’re going to use the backtick operator (the ` character, which shares a key with the tilde [~] on most US keyboards).

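function convert_to_text($pdf) {
     // Escape the path so a hostile file name cannot inject shell commands.
     $pdf = escapeshellarg($pdf);
     $output = `./pdftotext {$pdf} temp.txt`;
     return $output;
}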

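The backtick operator runs a command in the shell, captures its output, and returns it to the caller.  It’s worth noting that it only captures standard output (stdout); anything the command writes to standard error is not returned.  Because pdftotext writes the converted text to temp.txt here, the returned string will normally be empty.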

This is probably the hardest part of this tutorial.  There may be problems with write permissions on the directory, or ownership problems, but if you can get it to work, you’re all set.
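
Since backticks only return standard output, a failed conversion can look identical to a successful one.  One way to surface problems is PHP’s exec(), which also reports the command’s exit status.  This is a sketch under assumptions: the function name run_pdftotext and its parameters are hypothetical, and the binary path is passed in rather than hard-coded.

```php
<?php
// Hypothetical wrapper around pdftotext. Returns true on success and fills
// $messages with anything the program printed (stdout and stderr).
function run_pdftotext($binary, $pdf, $txt, &$messages = array()) {
    // Quote every argument so spaces or shell metacharacters are harmless.
    $command = escapeshellarg($binary) . ' '
             . escapeshellarg($pdf) . ' '
             . escapeshellarg($txt) . ' 2>&1'; // fold stderr into stdout

    $lines = array();
    $status = 1;
    exec($command, $lines, $status);
    $messages = $lines;

    // By Unix convention, exit status 0 means the command succeeded.
    return $status === 0;
}

// Example call; manual.pdf is a placeholder file name.
$messages = array();
if (!run_pdftotext('./pdftotext', 'manual.pdf', 'temp.txt', $messages)) {
    echo "Conversion failed:\n" . implode("\n", $messages) . "\n";
}
```

Logging the captured messages makes permission and ownership problems much easier to diagnose than an empty return value.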

Step 3:  Read the Text

Now that the PDF has been converted to a text file, we need to get that information back into PHP.  To do that, we use the file_get_contents function.

function get_text() {
     $text = file_get_contents("temp.txt");
     return $text;
}
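
file_get_contents returns false when the file cannot be read, so it is worth checking the result before storing anything.  The sketch below also deletes the temporary file afterwards.  The name read_converted_text is a hypothetical helper, and the whitespace cleanup is an assumption about what you want indexed.

```php
<?php
// Hypothetical variant of get_text(): returns the extracted text, or null
// when the file is missing or unreadable.
function read_converted_text($path = 'temp.txt') {
    if (!is_readable($path)) {
        return null;
    }
    $text = file_get_contents($path);
    if ($text === false) {
        return null;
    }
    unlink($path); // remove the temporary file once its contents are in memory

    // pdftotext separates pages with form feed characters (\f); replace them
    // with newlines, then trim leading and trailing whitespace.
    return trim(str_replace("\f", "\n", $text));
}
```

A caller can then skip the database step entirely when the function returns null instead of inserting an empty row.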

Step 4:  Store the Data

This part of the tutorial assumes two things: that you have a table named pdf_data, and that the table has a column called pdf_contents with a full-text index (if you need help setting this sort of thing up, leave a comment).

function store_data() {
     $text = mysql_real_escape_string(get_text());
     $query = "INSERT INTO pdf_data (pdf_contents) VALUES ('{$text}')";
     mysql_query($query);
}
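
For readers without the table in place, here is one possible setup, written as a sketch rather than the author’s schema.  The id column, the function names, and the use of PDO (PHP 5.1 or later with the pdo_mysql driver) are all assumptions.  The mysql_* functions used above were removed in PHP 7, so a prepared statement is shown instead.

```php
<?php
// Assumed schema: only pdf_contents comes from the tutorial; id is illustrative.
// MyISAM has long supported FULLTEXT indexes; InnoDB gained them in MySQL 5.6.
function pdf_table_sql() {
    return "CREATE TABLE IF NOT EXISTS pdf_data (
                id INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
                pdf_contents LONGTEXT NOT NULL,
                FULLTEXT (pdf_contents)
            ) ENGINE=MyISAM";
}

// Inserts one document's text using a prepared statement, so no manual
// escaping is needed. $db is an already-connected PDO instance.
function store_pdf_text(PDO $db, $text) {
    $statement = $db->prepare("INSERT INTO pdf_data (pdf_contents) VALUES (?)");
    return $statement->execute(array($text));
}

// Possible usage (connection details are placeholders):
// $db = new PDO('mysql:host=localhost;dbname=example', 'user', 'password');
// $db->exec(pdf_table_sql());
// store_pdf_text($db, get_text());
```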

Step 5:  Search the Data

The final step is actually searching the data.  To do that, we’ll use the full-text searching capability of MySQL.

function search_data($term) {
     $term = mysql_real_escape_string($term);
     $query = "SELECT * FROM pdf_data WHERE MATCH(pdf_contents) AGAINST ('$term')";
     $result = mysql_query($query);
     while($row = mysql_fetch_array($result)) {
          //Do stuff with returned data.
     }
}

Replace the “Do stuff with returned data” comment with whatever processing you need.  For a natural-language full-text search in the WHERE clause with no ORDER BY, MySQL returns the rows in descending order of relevance: the most relevant result first, then the second most relevant, and so on.
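
If you would rather see the relevance score than rely on the implicit ordering, the MATCH expression can also be selected as a column.  The sketch below assumes a PDO connection and an id column on pdf_data, neither of which is part of the original tutorial, and the function name search_pdf_text is hypothetical.

```php
<?php
// Hypothetical search helper. Returns rows as associative arrays with an
// explicit relevance score, highest first.
function search_pdf_text(PDO $db, $term) {
    $sql = "SELECT id, MATCH(pdf_contents) AGAINST (?) AS score
            FROM pdf_data
            WHERE MATCH(pdf_contents) AGAINST (?)
            ORDER BY score DESC";
    $statement = $db->prepare($sql);
    // The term is bound twice: once for the score, once for the filter.
    $statement->execute(array($term, $term));
    return $statement->fetchAll(PDO::FETCH_ASSOC);
}
```

Keep in mind that natural-language searches ignore words shorter than the configured minimum length (ft_min_word_len, 4 by default for MyISAM) and, on MyISAM tables, words that appear in at least half of the rows.  A small test table can therefore return no matches for text that is clearly present.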

Other Notes

  • PdfToText may or may not be the best way to do this, but it is one of the simplest.  There are a handful of libraries out there for creating PDFs in PHP, but surprisingly few for something as common as reading a PDF.
  • There are binaries and source files available for PdfToText on the Xpdf download page (http://www.foolabs.com/xpdf/download.html).
  • This tutorial could be expanded a lot.  If you have questions or requests, please ask!

By Jack Slingerland

Founder of Kernl.us. Working and living in Raleigh, NC. I manage teams of software engineers and work in Python, Django, TypeScript, Node.js, React+Redux, Angular, and PHP. I enjoy hanging out with my wife and kids, lifting weights, and PC gaming in my free time.

11 replies on “Search PDFs With PHP, MySQL, and PdfToText”

If the PDF is trusted, I don’t think there are any issues. However, when using any type of shell execution, make sure to use escapeshellarg() to make sure that arbitrary commands can’t be executed.

What type of text file is created by default by xpdf, and what type of collation should be used for the pdf_contents column? I seem to get varying results.

I can get the text in the db however it will not let me search on text that is clearly in the data. Is this due to the collation type?

This worked for me with little modification. The host in question is Host Gator. Step 2 was changed to:


function convert_to_text($pdf) {
     $output = `pdftotext {$pdf} temp.txt`;
     return $output;
}

The pdftotext file had to be placed in my BIN folder on the server.

I don’t know if I am asking the right question, but I want to use pdftotext with WampServer. Where should I extract it, and how should I use it? I have looked all over and found the above info very helpful. Please help me get started.
