Getting Online Content into Usable Text

Posted in Content, Document management, File Management, Screencast

Most online legal information is text: case law, statutes, blog posts, whatever. But sometimes that text isn’t really text, and that can make it hard to manage once you have found it. Take a PDF, for example. If it was made from a Microsoft Word document, chances are it has retained its text, so you can index and search it. But if someone created the PDF from a scanned image, even one that looks like text when you read it, you might find it difficult to retrieve later. When a PDF contains pictures of text rather than the text itself, you can’t keyword search it with Google Desktop, X1, or your other desktop search tools.
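If you have a folder of downloaded PDFs and want a rough idea of which ones are pictures of text, a short script can help. The sketch below is only a heuristic and rests on an assumption: a PDF that carries real text refers to at least one font, while a bare scan usually does not. The function names are made up for illustration, and encrypted or unusual PDFs may fool it.

```python
import re
import zlib


def looks_text_searchable(pdf_bytes: bytes) -> bool:
    """Rough heuristic: a PDF that carries real text references at least one font."""
    if b"/Font" in pdf_bytes:
        return True
    # Newer PDFs can tuck their object dictionaries inside compressed streams,
    # so try to inflate each stream and look there too.
    for match in re.finditer(rb"stream\r?\n(.*?)endstream", pdf_bytes, re.DOTALL):
        try:
            inflated = zlib.decompressobj().decompress(match.group(1))
        except zlib.error:
            continue  # not a Flate stream (for example, a JPEG scan)
        if b"/Font" in inflated:
            return True
    return False


def check_file(path: str) -> bool:
    """Read a PDF from disk and apply the heuristic above."""
    with open(path, "rb") as handle:
        return looks_text_searchable(handle.read())
```

A result of False suggests the file is a candidate for OCR. A result of True only means fonts are present, not that the text layer is complete or accurate.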

One way around this is to run the scanned image through an online optical character recognition (OCR) tool. This converts the picture of the text into actual text that can be searched later. Here’s an example. A document from the Ontario Superior Court of Justice was scanned into PDF, yielding a picture of a text file. First, download the file to your computer, then go to a site like Free Online OCR, highlighted by MakeUseOf. Identify the file you downloaded as the file to convert, select the output format (you can even put it BACK into PDF when finished) and click the convert button.

Here’s a quick screencast of how that works:

There are other free online OCR sites that you can find with a quick Web search. Some, like onlineocr.net, use file limitations to throttle usage, so you may want to hunt around if you end up needing to OCR more than 15-20 documents per hour on a regular basis. As I mention in the screencast, you will be uploading these files during the conversion process. If you are not comfortable having the files hosted on a remote server and out of your control, you may want to look for OCR software to install. But since much of what you find on the Internet during your research will already be public, this shouldn’t often limit your use of free OCR resources.
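For the install-it-yourself route, here is a minimal sketch that assumes you have chosen the open source Tesseract OCR engine and that its `tesseract` command is on your path. That choice is mine for illustration; any installed OCR package would serve. Tesseract reads image files such as PNG or TIFF rather than PDFs, so a scanned PDF would first need its pages saved as images. The function names and file names are hypothetical.

```python
import subprocess


def build_tesseract_command(image_path: str, output_base: str, language: str = "eng") -> list[str]:
    """Build a Tesseract call that writes <output_base>.pdf with a searchable text layer."""
    return ["tesseract", image_path, output_base, "-l", language, "pdf"]


def ocr_locally(image_path: str, output_base: str) -> None:
    """Run OCR on this machine; nothing is uploaded to a remote server."""
    subprocess.run(build_tesseract_command(image_path, output_base), check=True)
```

Calling `ocr_locally("scan.png", "scan_ocr")` should leave a searchable `scan_ocr.pdf` next to the original, with the file never leaving your computer.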

If you use Google Docs, there is built-in OCR. Click on the Upload… button and you will be prompted to select files. Select the option to convert text from PDF or image and select your files. As Google uploads the files to your Google Docs account, it will perform OCR on them. It’s a great alternative to the other services, since your files end up in your file system as soon as the upload completes.

Online File Conversions Make Files Accessible

Posted in File Management

There has been some recent chatter about Web sites that convert files.  You may have been in the predicament of having a file in one format (say, WordPerfect 6.1) and being unable to open it in another program.  These online sites can help you by making the conversion for you.

The site choices are overwhelming so here are just a few that have been mentioned recently and seem to have good options for source and destination formats:

When you find a file in a format that either you cannot open or do not want to use as a permanent storage format (like WordPerfect or a video or image format), these online services can be great resources. Some of them let you copy and paste the URL of the file to be converted, so that you do not have to first download the source file, then upload it, then download the converted file. This is similar to a recent Google improvement, where you can load a Microsoft Word document from a Web site directly into a Google viewer, avoiding the download step.
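Before picking a source format on one of these sites, it can help to confirm what the file actually is, since file extensions are sometimes wrong or missing. The sketch below checks the first few bytes of a file against a handful of well-known signatures. The list is deliberately short and the function names are hypothetical, so treat it as a starting point rather than a complete detector.

```python
# Leading bytes ("magic numbers") for a few common document formats.
SIGNATURES = [
    (b"%PDF", "PDF"),
    (b"\xffWPC", "WordPerfect"),
    (b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1", "older Microsoft Office (.doc, .xls, .ppt)"),
    (b"PK\x03\x04", "ZIP container (includes .docx and .odt)"),
]


def sniff_format(first_bytes: bytes) -> str:
    """Return a label for the first matching signature, or 'unknown'."""
    for signature, label in SIGNATURES:
        if first_bytes.startswith(signature):
            return label
    return "unknown"


def sniff_file(path: str) -> str:
    """Read just the start of a file and guess its format."""
    with open(path, "rb") as handle:
        return sniff_format(handle.read(16))
```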

As with any online service, be aware of what you are converting. These services require you to upload the original, source document. That document is stored on some server, somewhere. The converted document, or a link to it, may be emailed to you. If the content is confidential, a trade secret, or otherwise sensitive, and you are not comfortable with it being available on a remote Web server, you should probably either purchase a secure online conversion tool or purchase software that keeps the content on your machines.

[ Thx to Lee Rosen for tip on Zamzar ]
