Professional librarians call it the "name authority problem." Assume that you could do perfect OCR scans of book indexes. After a few thousand names are collected from a couple dozen books, you might end up with JOHN SMITH; JACK SMITH; JOHN A. SMITH; JOHN ARTHUR SMITH; J.A. SMITH; JOHN SMITH, JR.; and J. ARTHUR SMITH. Which of these refer to the same person, and which are namesakes? To answer this question, all the names must be considered in context. Sometimes, various editions of Who's Who are needed for further research, or telephone listings on CD-ROM might be used to determine correct spelling.
If several variations of a name refer to the same person, you must standardize on one format. Then there's the flip side of the problem: most individuals referred to in print have namesakes that are in print somewhere else, and will show up eventually in your index. Now extra information must be added to the name to distinguish these individuals. This might be the year of birth, or a group or company affiliation, or anything else that's available from the context.
Software cannot handle these problems. The bigger the database, the bigger the problem. Today approximately one out of every three new names added to NameBase has to be resolved -- either because it is a namesake, or because it might be the same person as a slightly different entry.
The best that software can do is to come up with candidates for matching, by using phonetic or leading-letter searches. After that, further research is required. Low-wage, off-shore data entry is used by some corporations these days, but in this case it wouldn't work because the entire NameBase library must be available to look up old references and research new ones.
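As a rough illustration of what "coming up with candidates" can look like, here is a minimal sketch in Python. It is not the NameBase software. It assumes American Soundex as the phonetic code (the text above does not say which method is used), assumes names are written given-name-first as in the JOHN SMITH examples, and groups names that share a surname code and a leading letter of the given name. The function names soundex, split_name, and candidate_groups are made up for this example. The output is only a list of names for a human to research, not a decision about who is who.

```python
import re
from collections import defaultdict

# American Soundex digit for each coded consonant (vowels, H, W, Y have none).
_CODES = {}
for digit, letters in enumerate(["BFPV", "CGJKQSXZ", "DT", "L", "MN", "R"], start=1):
    for ch in letters:
        _CODES[ch] = str(digit)

# Hypothetical list of suffixes to ignore when locating the surname.
_SUFFIXES = {"JR", "SR", "II", "III"}


def soundex(word):
    """Return the 4-character American Soundex code for a word."""
    letters = [c for c in word.upper() if "A" <= c <= "Z"]
    if not letters:
        return ""
    out = [letters[0]]
    prev = _CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "HW":
            continue  # H and W do not separate two letters with the same code
        code = _CODES.get(ch, "")
        if code and code != prev:
            out.append(code)
        prev = code  # a vowel resets prev, so a repeated code is kept
    return "".join(out)[:4].ljust(4, "0")


def split_name(name):
    """Split 'JOHN A. SMITH, JR.' into (['JOHN', 'A'], 'SMITH')."""
    tokens = [t for t in re.split(r"[\s,.]+", name.upper()) if t]
    tokens = [t for t in tokens if t not in _SUFFIXES]
    if not tokens:
        return [], ""
    return tokens[:-1], tokens[-1]


def candidate_groups(names):
    """Group names by (surname Soundex, first letter of given name)."""
    groups = defaultdict(set)
    for name in names:
        given, surname = split_name(name)
        initial = given[0][0] if given else ""
        groups[(soundex(surname), initial)].add(name)
    # Only groups with more than one spelling need a human to look at them.
    return {key: sorted(found) for key, found in groups.items() if len(found) > 1}


if __name__ == "__main__":
    sample = ["JOHN SMITH", "JACK SMITH", "J.A. SMITH", "JON SMYTHE", "MARY JONES"]
    for key, found in candidate_groups(sample).items():
        print(key, found)
```

Note that JACK SMITH and JON SMYTHE land in the same group as JOHN SMITH. The code cannot tell whether they are the same person or namesakes; that still takes research in context.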
In fact, the indexer really should have read all 800 books in NameBase, since many names can be resolved from memory if the material is already familiar. It's cumbersome enough that the index in the back of every book is completely useless to us, and that we have to key in the names and page numbers as we read. But for an indexer who hasn't read all of the books, it is much worse.
If you don't obsessively resolve name authority problems up front, and get into the habit of checking anything produced by the software that appears suspicious, the data will be a hopeless jumble of errors after a few thousand names.
The index in the back of the book is covered by the copyright on the book. Even if it were technically feasible to scan the index, doing so would most likely be considered illegal.
At the same time, a cumulative name index of hundreds of books isn't particularly helpful unless the original context is available. In other words, there must be a way for the NameBase user to find out what was said about the name. Much of the material in NameBase is obscure, and some is impossible to locate apart from our photocopy and fax service. To make the referenced pages available legally, we rely on the "fair use" provisions of the copyright law. Two considerations are important in deciding whether "fair use" is applicable: the amount of material that is reproduced, and whether it is reproduced by a nonprofit organization.
Because of the nature of the cumulative indexing in NameBase, it is rare that more than one or two pages from a single book are ever ordered by a single NameBase customer at one time. This solves the first problem. The second problem is solved because PIR is a nonprofit public charity. The copying we do is the legal equivalent of using a photocopying machine in a public library.
The fact that books are copyrighted means that there are now two excellent reasons why NameBase has no competition: 1) it's much too labor-intensive, and 2) it would be illegal if the purpose were to make money, as opposed to educating the public. Both reasons make the PIR enterprise, from a commercial perspective, a complete non-starter.
This is unfortunate. For one thing, it means that investigative books are disappearing. Typically, a journalist might spend three years on a topic, and then his book -- assuming he can find a publisher -- is remaindered after several months. When that happens, the historical impact of his contribution is diminished because it was never digitized. This is mostly due to the nature of book publishing, which evolved over many decades, and won't adapt to a digital age unless there is obvious money to be made.
It's also due to our monoculture. If a publisher is inclined toward investigative, historical, or biographical material in the first place, he will probably go for books by or about celebrities because they sell better. Publishing is becoming extremely centralized, less diverse, and more market-oriented. On the other end of this monocultural equation, a new generation of researchers tends to follow up with library work only after online searches indicate that something might be found to make a library trip worthwhile.
Today's typical Internet portal shows how these trends feed each other. The focus is on instant access (if our search engine doesn't find it, it doesn't exist), and on celebrities. There's a sense of ahistoricism and unreality that one gets from the Internet these days, which is now about 85 percent commercial. Just seven years ago, before the dot-com gold rush and e-commerce hype, it was easier for NameBase to attract interest. And that was when the Internet had a fraction of its current users, and our site had half of its current content.
These days there are fewer incentives for investigative writing than there have been for the last 100 years. That's why we think non-celebrity researchers and journalists need all the help and exposure they can get. If you agree, you can support PIR by registering for NameBase, or by sending us a tax-deductible contribution.