A Webmaster Blog
Short Overview of the Resources You Need for a Search Engine
Crawler: The robots which get the web pages off that confusing web structure onto your beautiful disks. You will need lots of disks.
In most architectures, you need to merge these indices so that there is one place to find all the pages that mention a particular keyphrase. Once all these small indices are merged, the final index will be too large to fit on one system. This means you have to merge the small indices in a way that splits the final index across many machines.
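As a rough illustration of the merge-and-split step, here is a minimal Python sketch. It assumes each small index is just a dictionary from a term to a list of page ids, and it splits the merged index by hashing the term. The function names, the toy data, and the choice of three shards are all made up for the example; a real engine would work with on-disk files and might split by document instead of by term.

```python
import zlib
from collections import defaultdict


def merge_indices(small_indices):
    """Merge several term -> [page ids] maps into one map with sorted, de-duplicated postings."""
    merged = defaultdict(set)
    for index in small_indices:
        for term, page_ids in index.items():
            merged[term].update(page_ids)
    return {term: sorted(ids) for term, ids in merged.items()}


def shard_for(term, num_shards):
    """Pick a machine for a term using a hash that is stable across processes."""
    return zlib.crc32(term.encode("utf-8")) % num_shards


def split_index(merged, num_shards):
    """Split the merged index so that each term lives on exactly one shard."""
    shards = [dict() for _ in range(num_shards)]
    for term, postings in merged.items():
        shards[shard_for(term, num_shards)][term] = postings
    return shards


# Two hypothetical small indices produced by separate indexing runs.
batch_a = {"search": [1, 2], "engine": [2]}
batch_b = {"search": [7], "crawler": [7, 9]}

merged = merge_indices([batch_a, batch_b])
shards = split_index(merged, num_shards=3)
print(merged["search"])  # [1, 2, 7]
```

Hashing the term keeps every page for a keyphrase word on one machine, which is the "one place to find all the pages" property described above.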
Now you are ready to serve queries? Wrong. Now you build the runtime system that takes the users’ keyphrases, fetches their results from the right machines, and re-ranks them according to the query. All of this happens while people are drumming their fingers on their desks waiting: lots of people and, hopefully, not enough time for drumming.
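The runtime step can be sketched in a few lines of Python. This is only a toy under stated assumptions: the index is split by term hash, each shard maps a term to per-page weights, and "re-ranking" simply sums those weights for the query terms. The names `shard_for`, `build_shards`, and `search`, and the weights themselves, are invented for the example.

```python
import zlib
from collections import Counter


def shard_for(term, num_shards):
    """Map a term to the shard (machine) that holds its postings."""
    return zlib.crc32(term.encode("utf-8")) % num_shards


def build_shards(postings, num_shards):
    """Place each term's postings on the shard chosen by shard_for."""
    shards = [dict() for _ in range(num_shards)]
    for term, pages in postings.items():
        shards[shard_for(term, num_shards)][term] = pages
    return shards


def search(query, shards, top_k=10):
    """Ask the right shard for each query term, then re-rank pages by summed weight."""
    scores = Counter()
    for term in query.lower().split():
        shard = shards[shard_for(term, len(shards))]
        for page_id, weight in shard.get(term, {}).items():
            scores[page_id] += weight
    # Highest score first; ties broken by page id so the order is deterministic.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))[:top_k]


# Hypothetical postings: term -> {page id: weight}.
postings = {
    "search": {1: 0.5, 2: 1.0},
    "engine": {2: 0.75, 3: 0.25},
}
shards = build_shards(postings, num_shards=2)
results = search("search engine", shards)
print(results)  # [(2, 1.75), (1, 0.5), (3, 0.25)]
```

In a real system each shard lookup is a network call to another machine, and those calls would run in parallel so the finger-drumming stays short.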
People talk a lot about the thousands of machines needed to build a search engine. This sounds very scary. All search engines, however, started with a lot more thought and design than they did machines. So let us see what is fact and what is fallacy.
Bandwidth: Legend has it that venture capitalists used to buy hard disks for young entrepreneurs to prove that their ideas would work. Now disks are cheap, but the new bottleneck is bandwidth. Usually that takes capital. You need this bandwidth to get the pages from the Web in the first place. The “CPU-ness” or memory of the machines that you use doesn’t really matter. All that matters is how much bandwidth you have (can afford) and can use, because crawling is not a CPU endeavor; crawling is a bandwidth monster.
There are lots of ways around these issues, but the most useful is to realize that you won’t get the indexer and the servers working right for 6 months anyway, so crawl slowly and index what you have as you go along. Bugs will show up in the later phases, so the lack of pages won’t be the thing holding you up; instead it will be those nasty bugs slowing you down. So crawl continuously at whatever rate you can afford and the rest will take care of itself. By the time you have a search engine that works on the pages you have and can keep up with the super-slow crawl, perhaps you will be in a position to afford more bandwidth by raising capital.
Warning: if you are a super-small company with a small team, get the bandwidth to your office so you can keep the expenses manageable.
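To get a feel for what "whatever rate you can afford" means, a back-of-the-envelope calculation helps. The Python sketch below is only arithmetic, and every number in it is an assumption chosen for illustration: a 10 Mbit/s link, a 25 KB average page, half of the link actually usable for crawling, and a target of 100 million pages.

```python
SECONDS_PER_DAY = 86_400


def pages_per_day(link_mbps, avg_page_kb, utilization=0.5):
    """Estimate how many pages a link can fetch per day.

    link_mbps   -- link speed in megabits per second (assumed value)
    avg_page_kb -- average page size in kilobytes (assumed value)
    utilization -- fraction of the link the crawler can really use
    """
    bytes_per_second = link_mbps * 1_000_000 / 8 * utilization
    return int(bytes_per_second * SECONDS_PER_DAY / (avg_page_kb * 1_000))


def days_to_crawl(total_pages, link_mbps, avg_page_kb, utilization=0.5):
    """Estimate how many days a crawl of total_pages would take."""
    return total_pages / pages_per_day(link_mbps, avg_page_kb, utilization)


rate = pages_per_day(link_mbps=10, avg_page_kb=25)
days = days_to_crawl(100_000_000, link_mbps=10, avg_page_kb=25)
print(rate)         # 2160000 pages per day
print(round(days))  # about 46 days
```

Under these assumed numbers, even a modest link gathers a few million pages a day, which is plenty to keep an indexer busy while the bugs get shaken out.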
CPU Issues: Most people ask which types of CPUs to use for which phase of a search engine, and many argue for using stupid CPUs for crawling and fast CPUs for indexing.
Any old CPU will do for crawling. For indexing, you are doing a lot of I/O and a lot of thinking to analyze the web pages, so the bigger the better. At serve time, you need to re-rank the URLs in response to a query, so again, the bigger the better.
Since you are writing the search engine yourself, however, it has to be one size that fits all. Most indexing algorithms will probably trouble any CPU. So again, the bigger the better: get what you can afford, because the bugs you write will definitely slow you down more than those cheap CPUs. If you are choosing among the machines available in your local market, more cache is the key for indexing algorithms, because more of the page will already be in cache when it is indexed.
But if your indexing algorithm doesn’t trouble the CPU, then rethink your game plan for building a better search engine, because yours will not be the one that wins.