Before you try to crawl NameBase . . .
Ever since going online in 1995, NameBase has been like honey for automated crawlers. Some are spooky and anonymous, some are unthinking students in dorm rooms who turn their infinite bandwidth in our direction with desktop tools, and some are actual search engines, but from places like Nigeria and China.
We assume that this is because our site has information on over 129,000 proper names. These are names of prominent people, corporations, and groups from all over the world. Almost all of these names are from books and clippings that are not available elsewhere on the Internet. It took us 20 years to compile this information.
This NameBase site offers one static HTML file for each name. However, the cross-linking to other relevant names is fairly extensive in many of these files. For some unsophisticated crawlers, this means that they spend a lot of bandwidth chasing their tails.
Because of our unique situation, and because we're trying to provide a nonprofit service on very little income, we only tolerate search engine crawlers who return the favor by providing us with traffic from real eyeballs. That means Yahoo, Microsoft, Gigablast, and sometimes Google. In our experience, other crawlers do not help us serve the public. (We recommend Clusty for searching, but they don't do their own crawling.)
For crawlers that we don't find useful, we have developed some blocking techniques:
1) Our cgi-bin directory is disallowed in our robots.txt, and we are especially intolerant of bots in cgi-bin. Spider traps are in place. For bots that evade these traps, a fairly low threshold for rate-of-access is used.
2) All our data exists not only through cgi-bin search programs, but also in static files. Here, too, we have spider traps planted. Failing this, the rate-of-access cutoff is more tolerant for static files. Your crawler may even fetch a few hundred of them before getting blocked.
3) The first level of blocking is a 403 Forbidden for the entire site. This will most likely get lifted after a few days. If the attempted crawling persists even after hundreds of 403 responses, the next level is a block in our routing table. From your crawler's perspective, our IP address will not respond at all. This block lasts until we have occasion to reboot our server, which is usually several months later.
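The escalation described in points 1 through 3 can be sketched roughly as follows. This is only an illustration, not the code NameBase runs: the class name `Blocker`, the 60-second window, and every threshold are assumptions chosen for the example, since the real values are not published. Spider traps and the lifting of a 403 after a few days are left out.

```python
from collections import defaultdict, deque

# Hypothetical values; the actual thresholds are not stated anywhere above.
WINDOW_SECONDS = 60
MAX_HITS = {"cgi": 10, "static": 300}  # cgi-bin is stricter than static files
ESCALATE_AFTER_403S = 200              # stands in for "hundreds of 403 responses"


class Blocker:
    """Two-level blocking: site-wide 403 first, then a routing-level drop."""

    def __init__(self):
        # Per-IP timestamps of recent requests, kept separately per area.
        self.hits = defaultdict(lambda: {"cgi": deque(), "static": deque()})
        self.forbidden = set()                # IPs currently receiving 403
        self.denied_count = defaultdict(int)  # 403s served to each blocked IP
        self.routed_out = set()               # IPs that get no response at all

    def handle(self, ip, path, now):
        """Return 200, 403, or "drop" for a single request at time `now`."""
        if ip in self.routed_out:
            return "drop"
        if ip in self.forbidden:
            # Crawler keeps hammering despite the 403: count it and escalate.
            self.denied_count[ip] += 1
            if self.denied_count[ip] >= ESCALATE_AFTER_403S:
                self.routed_out.add(ip)
            return 403
        kind = "cgi" if path.startswith("/cgi-bin/") else "static"
        window = self.hits[ip][kind]
        window.append(now)
        # Discard timestamps that have fallen out of the sliding window.
        while window and now - window[0] > WINDOW_SECONDS:
            window.popleft()
        if len(window) > MAX_HITS[kind]:
            self.forbidden.add(ip)
            return 403
        return 200
```

A crawler that ignores the first 403 and keeps requesting pages would, under these assumed numbers, move from `403` to `"drop"` after 200 further attempts.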
If you need to crawl our information and don't want to get blocked, you can sign up for a multiple-machine registration. This costs $200 for two years, which will exempt your IP addresses from our blocks. If your crawler isn't smart enough to avoid recrawling files that it has already fetched, then you must use our CSV dump to fetch the files.
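For crawler authors, a minimal sketch of what "smart enough to avoid recrawling" might look like is given here. It keeps a set of URLs already seen, so heavily cross-linked pages are fetched once each, and it consults robots.txt rules before every fetch. The `crawl` function, the `example-bot` agent name, the in-memory `links` map that stands in for real HTTP fetches, and the two robots.txt lines are all assumptions made for illustration; only the fact that cgi-bin is disallowed comes from the text above.

```python
from collections import deque
from urllib.robotparser import RobotFileParser

# Assumed robots.txt content; the text only says that cgi-bin is disallowed.
ROBOTS_LINES = ["User-agent: *", "Disallow: /cgi-bin/"]


def crawl(start, links, robots_lines=ROBOTS_LINES, agent="example-bot"):
    """Breadth-first walk over `links`, visiting each allowed URL once.

    `links` maps a URL to the URLs it links to. A real crawler would
    fetch and parse each page instead, and pause between requests.
    """
    rules = RobotFileParser()
    rules.parse(robots_lines)
    seen = {start}           # every URL ever queued, so nothing is queued twice
    queue = deque([start])
    fetched = []
    while queue:
        url = queue.popleft()
        if not rules.can_fetch(agent, url):
            continue         # respect Disallow rules such as /cgi-bin/
        fetched.append(url)  # placeholder for the actual HTTP GET
        for nxt in links.get(url, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return fetched
```

With two pages that link to each other and to a cgi-bin search URL, this walk fetches each page once and never touches cgi-bin, which is the behavior the cross-linked static files call for.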
Without these blocking techniques, we would end up with almost all of our bandwidth getting used by crawlers rather than by real eyeballs. Many of these crawlers have no use for our data anyway. It was a simple choice -- either block most crawlers or take NameBase off of the Internet. We hope you understand.