AspPDF is capable of extracting raw text information
from PDF documents for searching and indexing purposes. Text is extracted
from an individual page via the ExtractText method
of the PdfPage object. ExtractText takes an optional
parameter object or parameter string (described below).
This method always returns text strings
in Unicode format.
Text extraction with coordinates, introduced in Version 2.8, is described in Section 14.7 - Structured Text Extraction.
9.4.1 Code Sample
The following code sample extracts and prints out text data from
all pages of a PDF (we use the 1-page file 1040es.pdf from section 9.2):
VBScript:
Set Pdf = Server.CreateObject("Persits.Pdf")
' Open a PDF file for text extraction
Set Doc = Pdf.OpenDocument( Server.MapPath("1040es.pdf") )
Dim TextString
For Each Page in Doc.Pages
   TextString = TextString & Page.ExtractText
Next
Response.Write Server.HtmlEncode( TextString )
C#:
IPdfManager objPdf = new PdfManager();
// Open a PDF file for text extraction
IPdfDocument objDoc = objPdf.OpenDocument( Server.MapPath("1040es.pdf"), Missing.Value );
String strText = "";
foreach( IPdfPage objPage in objDoc.Pages )
{
   strText += objPage.ExtractText(Missing.Value);
}
lblResult.Text = Server.HtmlEncode( strText );
Click the links below to run this code sample:
http://localhost/asppdf/manual_09/09_extract.asp
http://localhost/asppdf/manual_09/09_extract.aspx
9.4.2 Possible Text Extraction Problems
PDF text extraction is not always reliable: it sometimes produces split
and conjoined words, or even unreadable gibberish.
9.4.2.1 Split and Conjoined Words
Unlike HTML or Word documents, PDFs do not usually contain
blocks of meaningful, readable text. Instead, they contain
text drawing operators that reference short phrases, individual
words, word parts and even separate characters.
As a result, an attempt to extract text information from a PDF document often
yields split and conjoined words. For example, the phrase "Brown dog"
may come out as "Browndog" (conjoined words) or "Bro wn d og"
(split words).
9.4.2.2 Gibberish
Many PDF documents, especially those using non-Latin alphabets, do not
use strings of readable characters to display text at all.
Instead, they use "glyph codes" which are numbers identifying character
appearances in a font file. "Good" PDF documents also provide mapping
tables (referred to as ToUnicode maps) enabling a consumer application
to convert those codes back to human-readable characters. However, not every
PDF document is "good". Those that aren't cannot technically be read.
An attempt to extract text from such a document yields gibberish.
Copying information from such a file via the clipboard in Acrobat Reader
fails as well.
9.4.2.3 Unknown Encoding
Certain foreign-language PDF documents use single-byte character codes in the
129 - 255 range (above the 7-bit ASCII range) to display text information.
Copying and pasting from such documents with Acrobat Reader usually produces
unreadable text. AspPDF, however, can extract text from these documents and
convert it to Unicode, provided that a code page is passed to
the ExtractText method via the CodePage parameter, such as "CodePage=1251" (Cyrillic)
or "CodePage=1256" (Arabic).
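The CodePage parameter can be supplied as a parameter string. The following is a minimal sketch rather than an official sample: the helper class and method names are hypothetical, the ASPPDFLib namespace is assumed to be the name of the AspPDF interop assembly, and the calls otherwise mirror the code sample in Section 9.4.1.

```csharp
using System.Reflection;   // Missing.Value
using System.Text;         // StringBuilder
using ASPPDFLib;           // assumption: namespace of the AspPDF interop assembly

public static class CodePageExtraction
{
    // Hypothetical helper: returns the text of all pages, asking AspPDF
    // to interpret single-byte character codes using the given code page.
    public static string ExtractWithCodePage(string pdfPath, int codePage)
    {
        IPdfManager objPdf = new PdfManager();
        IPdfDocument objDoc = objPdf.OpenDocument(pdfPath, Missing.Value);

        // Parameter string in the form described above, e.g. "CodePage=1251"
        string param = "CodePage=" + codePage;

        StringBuilder sb = new StringBuilder();
        foreach (IPdfPage objPage in objDoc.Pages)
        {
            sb.Append(objPage.ExtractText(param));
        }
        return sb.ToString();
    }
}
```

From an ASP.NET page, such a helper could be called as CodePageExtraction.ExtractWithCodePage( Server.MapPath("cyrillic.pdf"), 1251 ), where cyrillic.pdf is a placeholder file name.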
9.4.3 Permission Issues
A secure document may disallow content extraction by clearing Bit 5
of its permission flags (see Section 8.1.2).
To be in compliance with Adobe PDF licensing requirements, AspPDF
enforces this permission flag. For the content extraction functionality to work,
a secure document with Bit 5 cleared must be opened with the owner
password; otherwise an exception is thrown.
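The sketch below shows one way to handle this case in C#. It is illustrative only: the helper names are hypothetical, the ASPPDFLib namespace is assumed, and it assumes that the second argument of OpenDocument (passed as Missing.Value in Section 9.4.1) is the document password. Because the text above does not say which call raises the exception, both the open and the extraction are wrapped.

```csharp
using System;
using System.Text;
using ASPPDFLib;   // assumption: namespace of the AspPDF interop assembly

public static class SecureExtraction
{
    // Hypothetical helper: extracts text from a secured document that is
    // opened with its owner password.
    public static string ExtractFromSecureDocument(string pdfPath, string ownerPassword)
    {
        IPdfManager objPdf = new PdfManager();
        try
        {
            // Assumption: the second argument of OpenDocument is the password.
            IPdfDocument objDoc = objPdf.OpenDocument(pdfPath, ownerPassword);

            StringBuilder sb = new StringBuilder();
            foreach (IPdfPage objPage in objDoc.Pages)
            {
                sb.Append(objPage.ExtractText(Missing.Value));
            }
            return sb.ToString();
        }
        catch (Exception ex)
        {
            // Raised, for example, when content extraction is disallowed
            // and the document was not opened with the owner password.
            throw new InvalidOperationException(
                "Text extraction failed. If the document is secured, " +
                "check that the owner password is correct.", ex);
        }
    }
}
```

Wrapping the original exception keeps the component's own error message available through InnerException while giving the caller a hint about the most likely cause.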