Displaying FOIA Documents on the Web

Technical Tips and Recommendations

This is a summary of some technical "hints and kinks" that were discovered over several weeks, as we set up this new subpage to show several Freedom of Information Act (FOIA) documents that deserve a wider audience.

We began this project after noticing FOIA documents that are displayed by the National Security Archive. The two documents used here that are about Chile came from their site. We felt that there must be a better way to display FOIA material on the web, and bought a flatbed scanner for $129. It came with TextBridge Plus, an OCR (optical character recognition) program.

There are two problems with the way the Archive is displaying documents: 1) the documents are not searchable, because they haven't been converted to text; and 2) the files themselves are, for the most part, 256-grayscale JPEG files that average nearly 100 kilobytes per typescript page, with each page of a document requiring its own 100 K image. On our connection it takes up to a minute to download 100 K, and with FOIA documents that minute sometimes buys only a few lines of reading material.

This project was designed to make several documents more widely available, and by now also serves as a demonstration of how we think FOIA documents ought to be displayed on the web. It is not a criticism of the National Security Archive site (which is very impressive by any standard), but is simply meant to introduce some technical issues involved in displaying this sort of material.

The first problem is to apply OCR to the documents. The time to scan for OCR is when you have the original document in hand. Even then the document may be in such poor shape that you could end up keying it in word for word. The point here is that OCR requires a resolution of 200 to 400 dpi (dots per inch) to recognize characters. Once you size a document for a browser screen, you've reduced the resolution to something like 75 dpi. Each character from a typical typed page is now represented by perhaps 20 pixels. If one pixel is out of place (and many are), that character may already be unrecognizable. Regardless of how many colors or shades of gray you use, the resolution needed by OCR software has already been lost.
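
The resolution argument can be checked with a little arithmetic. The sketch below assumes a pica typewriter pitch of 10 characters per inch and 6 lines per inch (our assumption, not a figure from the scans themselves) and reports how many pixels each character cell gets at a given dpi. The cell includes the blank paper around the letter, so the pixels that actually carry ink are only a fraction of the total.

```python
# Assumption (not from the documents themselves): a pica typewriter sets
# 10 characters per inch horizontally and 6 lines per inch vertically.
CHARS_PER_INCH = 10
LINES_PER_INCH = 6


def character_cell(dpi: float) -> tuple[float, float]:
    """Return (width, height) in pixels of one character cell at this dpi."""
    return dpi / CHARS_PER_INCH, dpi / LINES_PER_INCH


def cell_pixels(dpi: float) -> float:
    """Total pixels in one character cell, blank paper included."""
    width, height = character_cell(dpi)
    return width * height


if __name__ == "__main__":
    # Compare a browser-sized image with the range OCR software asks for.
    for dpi in (75, 200, 300, 400):
        width, height = character_cell(dpi)
        print(f"{dpi:3d} dpi: {width:.1f} x {height:.1f} px per cell, "
              f"{cell_pixels(dpi):.0f} px in total")
```

Under these assumptions a 300 dpi scan gives each character sixteen times as many pixels as a 75 dpi browser image, which is the information the OCR software is missing.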

When we ran TextBridge Plus on a typical 256-grayscale JPEG from the Archive site, the software got about 20 percent of the words right. Unless your OCR is doing better than 90 percent, it's easier to key the document in. Two computers, one displaying the document and the other running your favorite editor, can be used side by side.

The other three documents on this subpage came from reproductions in books. OCR was more successful on these, as they could be scanned at the recommended dpi for TextBridge Plus. There is still work involved in correcting scanner errors, spell-checking, and embedding HTML codes. There's no easy way to do it, but once it's done and published on the web, no one ever has to do it again for that document. It makes sense to put in this extra initial effort for important documents that have something to say about our government.

The overwhelming advantage of doing OCR on FOIA documents is that search engines can be used on the text. Whether you are thinking in terms of your own site search engine (such as PIR's SiteGrep), or in terms of web robots such as AltaVista, there's no way to do anything with raster-imaged documents until you convert them with OCR, or key them in word for word. The OCR version is necessary for searching, but it lacks the authenticity of the photograph-like image. The bottom line is that you need both for FOIA documents.

That brings us to the second problem, which is that the file size for raster images can get unwieldy for FOIA documents. Sometimes this comes from not choosing the right JPEG or GIF format, and other times it may be that full-color or 256-grayscale is desired to preserve legibility in the document. Using a resolution higher than what fits in the typical browser window (which is between 550 and 750 pixels wide) is not a real option -- it means horizontal scrolling will be required, and each file will be far larger because many more pixels are used.

But if you've already prepared the OCR version of the document, then legibility is no longer an issue. All you need is the most economical way to present a "photograph" of the document for authentication.

We downloaded over 200 of the 256-grayscale JPEG images from the Archive site, and selected the two documents (31 images) that we used. Once we discovered that OCR was impossible on these images, we keyed them in. Our next problem was to convert them to smaller image files. We experimented with about eight graphics editing programs, both 16-bit and 32-bit Windows programs collected over the last five years, as well as a couple of DOS programs. Each of them seemed to do only one thing well, and it was always something different from what the other programs could do. What we wanted was to generate a 2-color GIF file from the 256-grayscale JPEG file. The best threshold between black and white seemed to be around 180 (on a scale where 0 is black and 255 is white).
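
The thresholding step itself is simple enough to sketch. The function below is only an illustration, not the code of any of the programs mentioned here: it takes rows of 0-255 gray values and maps each one to black or white. Whether a pixel exactly at the threshold counts as black or white is our assumption.

```python
def to_monochrome(gray_rows, threshold=180):
    """Map 0-255 gray values to 0 (black) or 255 (white).

    Values below the threshold become black. Treating a value equal to
    the threshold as white is an assumption made for this sketch.
    """
    return [[0 if value < threshold else 255 for value in row]
            for row in gray_rows]


if __name__ == "__main__":
    # Two short rows of made-up gray values: paper, smudge, faint ink, ink.
    page = [[250, 240, 175, 60],
            [245, 181, 179, 20]]
    for row in to_monochrome(page):
        print(row)
```

A threshold above the numeric midpoint of 128 turns more of the lighter grays black, which presumably is why a value near 180 kept faint typescript readable.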

On average, the resulting 2-color GIF file requires only 13 percent of the disk space of the original 256-grayscale JPEG file, while the legibility is at least 95 percent of the original. (You still want the JPEG displayed when you are keying in the document, because this extra five percent will make a difference in legibility in a tight spot.) GIF and JPEG are pretty much the only games in town for web graphics. On documents suitable for conversion to 2-color (also called "monochrome" or "black-and-white"), GIF compression is super-efficient. It's the download time that is important; disk space is mentioned only because the download time for an image is directly proportional to the size of its file. Even when disk space is plentiful, the user's time and your server's bandwidth are factors that should be considered.
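
To put the 13 percent figure in terms of waiting time, here is a small estimate. It uses the rate reported earlier (about 100 K per minute on our connection) and assumes that "K" means 1024 bytes and that the rate stays constant; both are simplifications.

```python
# From the text above: about 100 K took up to a minute to download.
# Assumption: "K" means 1024 bytes and the rate is constant.
BYTES_PER_SECOND = 100 * 1024 / 60


def download_seconds(size_bytes, bytes_per_second=BYTES_PER_SECOND):
    """Estimated transfer time for a file of the given size."""
    return size_bytes / bytes_per_second


if __name__ == "__main__":
    jpeg_bytes = 100 * 1024          # one typical grayscale JPEG page
    gif_bytes = jpeg_bytes * 0.13    # the same page as a 2-color GIF
    pages = 31                       # the two documents used here
    for label, size in (("JPEG", jpeg_bytes), ("GIF", gif_bytes)):
        per_page = download_seconds(size)
        print(f"{label}: {per_page:.1f} s per page, "
              f"{pages * per_page / 60:.1f} min for {pages} pages")
```

At that rate a page drops from about a minute to about eight seconds, and the 31 pages from roughly half an hour to about four minutes.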

If you specify a different display size in the image tag, the download time stays the same, even if the page seems to display at a different speed. This means that it is always most efficient to create any web image in the dimensions you plan to specify on the browser page. Another reason to do this is that you can't trust the browsers out there to resize your handiwork responsibly -- they seem to have a difficult enough time just getting the grays right. Older browsers on older machines will not even allow more than a few shades of gray, which is another reason to convert to a 2-color GIF.
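
One way to keep the tag and the file in agreement is to read the dimensions out of the GIF itself. The sketch below relies on the GIF file layout (a 6-byte signature followed by the logical screen width and height as 16-bit little-endian numbers); the function names and the file name in the usage note are hypothetical. For a single-image file the logical screen size is normally the image size, but that is an assumption worth checking on unusual files.

```python
import struct
from html import escape


def gif_dimensions(data: bytes) -> tuple[int, int]:
    """Return (width, height) from the logical screen descriptor of a GIF."""
    if len(data) < 10 or data[:6] not in (b"GIF87a", b"GIF89a"):
        raise ValueError("not a GIF file")
    # Bytes 6-9 hold width and height as unsigned 16-bit little-endian values.
    width, height = struct.unpack("<HH", data[6:10])
    return width, height


def img_tag(src: str, data: bytes, alt: str = "") -> str:
    """Build an image tag whose size matches the file, so no resizing occurs."""
    width, height = gif_dimensions(data)
    return (f'<img src="{escape(src)}" width="{width}" '
            f'height="{height}" alt="{escape(alt)}">')
```

Given the bytes of a hypothetical page01.gif, img_tag("page01.gif", data) returns a tag carrying the file's own width and height, so the browser has nothing to rescale.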

When looking for software that creates GIF documents, be advised that the GIF software you want may require some extra surfing. Unisys currently tries to collect licensing fees for the LZW compression scheme used in GIF. The patent belonged to Sperry, and Unisys, which was formed when Sperry merged with Burroughs, began to assert the rights some years later. Unisys even wants up-front money from little open-source developers. These days you have to look harder for GIF tools that use LZW to generate GIF files -- either the developer was rich enough to buy a license, or the programs are old enough so that they were around before Unisys got greedy, or the distributor is outside the U.S. and isn't impressed by puffery from Unisys lawyers. It's worth looking for the right tool that uses the LZW scheme. Otherwise, your GIF files may end up two or three times more bloated than they should be.
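
To see why LZW matters so much for monochrome documents, here is a textbook LZW encoder. It is a simplified illustration only: the variant used inside GIF files adds clear and end-of-information codes and packs its codes into variable-width bit fields, all of which this sketch leaves out.

```python
def lzw_encode(data: bytes) -> list[int]:
    """Textbook LZW: return the list of dictionary codes for the input."""
    # Start with one dictionary entry per possible byte value.
    table = {bytes([i]): i for i in range(256)}
    next_code = 256
    current = b""
    codes = []
    for byte in data:
        candidate = current + bytes([byte])
        if candidate in table:
            # Keep extending the phrase while the dictionary knows it.
            current = candidate
        else:
            # Emit the longest known phrase and learn the new, longer one.
            codes.append(table[current])
            table[candidate] = next_code
            next_code += 1
            current = bytes([byte])
    if current:
        codes.append(table[current])
    return codes


if __name__ == "__main__":
    blank_margin = bytes(1000)   # 1000 identical "white" pixels
    print(len(blank_margin), "pixels ->", len(lzw_encode(blank_margin)), "codes")
```

Long runs of identical pixels, such as the margins and blank lines of a typed page, collapse into a handful of codes because each code emitted stands for a longer run than the one before.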

With the graphics editing tools we had, the conversion from JPEG to GIF using Windows and DOS programs was time-consuming. Fortunately, we were able to automate the process with a Linux script that in turn calls three public domain graphics-utility executables (djpeg, ppmchange, ppmtogif) that were installed from a Red Hat CD. It takes about five seconds per file this way, and you can use wildcards to do a batch of JPEGs at the same time.
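
A batch conversion along these lines might look like the sketch below. It is not the script described above. The arguments given to ppmchange are not recorded here, so the sketch substitutes pgmtopbm with its -threshold and -value options for the middle step; that substitution, the function names, and the choice of Python are all assumptions. It needs djpeg and the Netpbm tools installed and on the path.

```python
import glob
import subprocess
from pathlib import Path


def build_pipeline(threshold: int = 180) -> list[list[str]]:
    """Commands that turn a JPEG on stdin into a 2-color GIF on stdout."""
    fraction = threshold / 255   # pgmtopbm expects a value between 0 and 1
    return [
        ["djpeg", "-grayscale"],                                   # JPEG -> PGM
        ["pgmtopbm", "-threshold", "-value", f"{fraction:.3f}"],   # PGM -> PBM
        ["ppmtogif"],                                              # PBM -> GIF
    ]


def convert(jpeg_path: str, threshold: int = 180) -> Path:
    """Convert one JPEG, writing a .gif file next to it."""
    data = Path(jpeg_path).read_bytes()
    for command in build_pipeline(threshold):
        # Feed each tool the previous tool's output; stop on any failure.
        data = subprocess.run(command, input=data,
                              stdout=subprocess.PIPE, check=True).stdout
    gif_path = Path(jpeg_path).with_suffix(".gif")
    gif_path.write_bytes(data)
    return gif_path


def convert_batch(pattern: str, threshold: int = 180) -> list[Path]:
    """Convert every file matching a wildcard pattern such as '*.jpg'."""
    return [convert(name, threshold) for name in sorted(glob.glob(pattern))]
```

Calling convert_batch("*.jpg") would process a whole directory of pages in one pass, which is the wildcard behavior described above.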

This was so edifying that we threw together several chunks of open source, wrote some code to glue them together, and compiled six programs that run under 32-bit DOS. (This is the DOS you get when you use a command-line window under Windows 95/98; it's popular for utility programs if Linux isn't handy, because any 32-bit system solves the memory limitations of 16-bit DOS.) Three programs convert from JPEG to monochrome GIF, any GIF to monochrome GIF, and BMP to monochrome GIF. Another three convert to 6-grayscale GIFs (black, white, and four "browser-safe" grays), and are recommended for documents that are very poor quality. Wildcards can be used for batch processing, and you can set the threshold for the monochrome and grayscale conversion. If the input is in color, it gets converted to grayscale before the threshold is applied. You can download the programs now -- they are in a 358,324-byte zip file.
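
The six-level conversion can be sketched as follows. This is an illustration rather than the code of the programs themselves. It assumes the four "browser-safe" grays are the values 51, 102, 153, and 204 from the usual web-safe palette, that color is reduced to gray with the common 0.299/0.587/0.114 luminance weights, and that each pixel simply goes to the nearest level; the adjustable threshold mentioned above is not modeled.

```python
# Black, four presumed browser-safe grays (0x33, 0x66, 0x99, 0xCC), and white.
SAFE_GRAYS = (0, 51, 102, 153, 204, 255)


def to_gray(red, green, blue):
    """Reduce an RGB pixel to one gray value (assumed luminance weights)."""
    return round(0.299 * red + 0.587 * green + 0.114 * blue)


def nearest_safe_gray(value):
    """Snap a 0-255 gray value to the closest of the six levels."""
    return min(SAFE_GRAYS, key=lambda level: abs(level - value))


def to_six_grays(rgb_rows):
    """Convert rows of (r, g, b) pixels to rows of six-level gray values."""
    return [[nearest_safe_gray(to_gray(*pixel)) for pixel in row]
            for row in rgb_rows]
```

Keeping a few intermediate grays preserves some of the smudged or faded strokes that a plain black-and-white threshold would either erase or turn into blots.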

The only major shareware graphics package we eventually found that lets you do something equivalent (that is, set the threshold and save to a monochrome GIF) is Graphic Workshop Professional for Windows, version 2.0a. It has a preview feature that makes it easy to find the best threshold for a particular batch of documents. It's a 5-megabyte download, and it won't do batch processing, but it's a slick piece of shareware. Here's a tip if you plan to be doing conversions between BMP, GIF, and JPEG: first set the color depth on your monitor to 256 colors. Otherwise you can expect strange results, because GIF has a 256-color maximum. The Paint program that comes with Windows always saves in BMP format, but the color depth is selectable. Be sure to use the 256-color BMP.

The CIA has a site that shows some FOIA material. The documents they've selected are a real snooze, and perhaps even diversionary (look for "Operation Condor" and you'll get one document on the Mexican drug eradication program during the 1970s -- the "Operation Condor" that no one cares about anymore). But the technology used is similar to what's advocated here. There are 8-grayscale GIFs (twice as big as a black-and-white GIF, but still much slimmer than JPEG), and the documents have been put through OCR so that full-text searching is possible. The OCR text is not available, however, apart from the CIA's own search engine. This means that AltaVista need not bother with their site.

Perhaps the OCR is automated, and the 85 percent accuracy they get is good enough for full-text searching, but not good enough for display. In other words, it could be a simple case of being too lazy to proof and correct the OCR output, and insert HTML coding to match it to the image for each new page. This is the "elbow grease" (or if you prefer, "wetware") needed to present an OCR version alongside your GIF image.

The State Department began putting up Chile documents from various agencies in July 1999. They use Acrobat Capture Server 2.0, which puts the document into Adobe's proprietary PDF format. It's an interesting system: the documents display in most of their graphic glory, but they also have hidden text behind the graphic that came from an OCR process. The person scanning the document is apparently shown a little graphic of each word the software can't understand, and can key in a substitute over the garbled word or phrase. So a document that was in wretched shape to begin with might end up with numerous different fonts, and with clean retyped text mixed in with the original text.

At least it's searchable. The proprietary nature of Acrobat is the biggest problem -- it takes major fiddling to get your browsers to work well with Acrobat, and then you get locked into their packages, which seem bloated and delicate. The time that this package may save on the server end is ultimately time spent by the user at the other end. It's not a particularly good mixture for FOIA documents -- you get an OCR that's inadequate for display, and a graphic that's a patchwork of compromises, and is several times fatter than it needs to be.

Getting a PDF document into a standard graphics format is best done by copying it to the Windows clipboard. The free Acrobat reader allows either a text "Select" for the clipboard, or a graphics "Select," assuming that both were generated when the document was created. The former shows the OCR results, while the latter is captured by the clipboard as a bitmapped graphic. This graphic can be pasted into Paint, the program that comes with Windows, and Paint saves into a BMP file.

If you are having trouble saving a graphic you see on your screen, as a last resort you can hit the Print Screen key and the entire screen is sent to the clipboard as a bitmap. From there it can be pasted into Paint. But since you can't scroll down to get the entire document, you will have to paste it into Paint in two sections. You'll need some serious cropping, which is better done in some other graphics program, once you get Paint to save it as a BMP file. You should set your monitor to 256 colors before capturing it, and always save to a 256-color BMP. Otherwise you will get strange results with later conversions, or when using older shareware graphics packages. There is no way a document can use more than 256 colors (shades of gray), so switching the color depth on your monitor is the first thing you should do when working with documents.

There is a serious lack of awareness about technical issues when it comes to displaying documents on the web. Over half of the sites displaying FOIA documents are serving files that are several times larger than they need to be. The worst case we saw was a file 100 times larger. It was a cover page with five words on it, and it weighed in at 250,000 bytes.

There are many document management systems designed for big law firms and such, but we need an integrated system that is designed for FOIA, so that you don't have to learn a dozen different packages, and scout for software, just to display a document using as little bandwidth as possible.
