VB Forum / General 2 / March 2007




Converting a PDF document image to text

Greg - 27 Mar 2007 15:26 GMT
I programmatically save a PDF document as text using automation.  I then
process the document in VB.

However, one of my client's clients is now sending PDF files that contain
an image of the document rather than text.

What is the best way to convert the PDF image file to text using VB?

Thanks.

Greg
David Kerber - 27 Mar 2007 16:36 GMT
> I programmatically save a PDF document as text using automation.  I then
> process the document in VB.
[quoted text clipped - 5 lines]
>
> Thanks.

AFAIK, your only choice will be to save it as an image file and use OCR
on the file.
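
A rough sketch of that route, in Python rather than VB and assuming the
command-line tools pdftoppm (from Poppler) and tesseract are installed and
on the PATH. Neither tool is mentioned in this thread, and the function
names and the "page" file name are made up for illustration. The idea is
simply: render the page to an image, then run OCR on the image.

```python
import subprocess
from pathlib import Path


def build_ocr_commands(pdf_path, work_dir):
    """Return two command lines: rasterize page 1, then OCR the image."""
    image_base = Path(work_dir) / "page"   # pdftoppm appends ".png" itself
    # -singlefile renders only the first page and adds no page digits.
    render = ["pdftoppm", "-png", "-r", "300", "-singlefile",
              str(pdf_path), str(image_base)]
    # "stdout" as the output base makes tesseract print the recognized text.
    ocr = ["tesseract", str(image_base) + ".png", "stdout"]
    return render, ocr


def ocr_first_page(pdf_path, work_dir):
    """Run both steps and return the recognized text of the first page."""
    render, ocr = build_ocr_commands(pdf_path, work_dir)
    subprocess.run(render, check=True)
    result = subprocess.run(ocr, check=True, capture_output=True, text=True)
    return result.stdout
```

The same two commands could presumably be launched from VB with Shell, with
the OCR output written to a text file and processed as before. OCR output
usually needs some cleanup, so treat the result as approximate.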

Remove the ns_ from if replying by e-mail (but keep posts in the
newsgroups if possible).

charles@home.com - 27 Mar 2007 16:58 GMT
If a PDF file is in ASCII format (as opposed to binary), it should be
possible to extract the text.

PostScript is a well-defined language, and parsing the page is a possibility.

I don't think the task is simple, but it should be possible.
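
To give an idea of what such parsing might look like, here is a minimal
sketch in Python (not VB). It assumes you already have one uncompressed
page content stream as bytes and that the text uses a simple single-byte
encoding. It only looks at the Tj and TJ text-showing operators and does
not handle nested parentheses or octal escapes. Most real PDFs compress
their streams and use font-specific encodings, so this is an illustration
and not a general extractor. The function names are made up.

```python
import re

# A literal PDF string: "(...)" with backslash escapes, no nested parentheses.
_STRING = rb"\((?:\\.|[^\\()])*\)"
# Either "(text) Tj" or "[(te) -20 (xt)] TJ".
_SHOW_TEXT = re.compile(
    rb"(" + _STRING + rb")\s*Tj|\[((?:[^\]\\]|\\.)*)\]\s*TJ", re.S)
_ESCAPE = re.compile(rb"\\([()\\])")


def _unescape(literal):
    """Strip the outer parentheses and undo escaped parentheses/backslashes."""
    return _ESCAPE.sub(rb"\1", literal[1:-1])


def extract_text(content_stream):
    """Collect the strings shown by Tj / TJ operators, in stream order."""
    pieces = []
    for match in _SHOW_TEXT.finditer(content_stream):
        if match.group(1) is not None:
            # Single string: (text) Tj
            pieces.append(_unescape(match.group(1)))
        else:
            # Array of strings and kerning numbers: [(te) -20 (xt)] TJ
            parts = re.findall(_STRING, match.group(2))
            pieces.append(b"".join(_unescape(p) for p in parts))
    return [p.decode("latin-1") for p in pieces]
```

For example, the stream `BT (Hello) Tj ET` would yield a one-item list
holding "Hello". A page that is only a scanned image has no such operators,
so the list comes back empty.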

CharlesW

> I programmatically save a PDF document as text using automation.  I then
> process the document in VB.
[quoted text clipped - 7 lines]
>
> Greg
Bob Butler - 27 Mar 2007 17:14 GMT
> If a PDF file is in ASCII format (As opposed to Binary) it should be
> possible to extract the text.
>
> Postscript is a well defined language and parsing the page is a
> possibility.

Postscript <> PDF

Reply to the group so all can participate
VB.Net: "Fool me once..."

Rick Rothstein (MVP - VB) - 27 Mar 2007 17:37 GMT
>> If a PDF file is in ASCII format (As opposed to Binary) it should be
>> possible to extract the text.
[quoted text clipped - 3 lines]
>
> Postscript <> PDF

True, but PDF uses a subset of PostScript internally...  see the
"Technology" section here:

http://en.wikipedia.org/wiki/Portable_Document_Format

And the text can be stored internally in a PDF... as text... or, of course,
as an image of text.
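
A quick way to guess which case you have is to scan the raw file for font
resources versus image objects. The following is only a heuristic sketch in
Python (not VB), with a made-up function name. It assumes the relevant
dictionaries are stored uncompressed, which is not guaranteed, since newer
PDFs can pack them into compressed object streams.

```python
import re

# "/Font" as a whole name, so "/FontDescriptor" alone does not count.
_FONT = re.compile(rb"/Font\b")
# An image XObject is declared with "/Subtype /Image".
_IMAGE = re.compile(rb"/Subtype\s*/Image\b")


def classify_pdf_bytes(data):
    """Guess whether a PDF carries real text, only page images, or both."""
    has_font = _FONT.search(data) is not None
    has_image = _IMAGE.search(data) is not None
    if has_font and has_image:
        return "mixed"
    if has_font:
        return "text"
    if has_image:
        return "image-only"
    return "unknown"
```

It would be called on the whole file, for instance
`classify_pdf_bytes(open(path, "rb").read())`. An "image-only" result
suggests the OCR route; "text" suggests the existing save-as-text
automation should still work.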

Rick
charles@home.com - 29 Mar 2007 11:06 GMT
Rick

If the text is in the form of an image, then OCR might be considered.
Although I have never looked at using that technology in VB, it is
nowadays considered to be well established and widely used.

CharlesW

>>> If a PDF file is in ASCII format (As opposed to Binary) it should be
>>> possible to extract the text.
[quoted text clipped - 13 lines]
>
> Rick
 



©2009 Advenet LLC   Privacy Policy - Terms of Use