<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>The Crazy Mind &#187; aglets</title>
	<atom:link href="http://www.lunaticmarks.com/category/aglets/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.lunaticmarks.com</link>
	<description>A Webmaster Blog</description>
	<lastBuildDate>Mon, 08 Oct 2007 10:00:15 +0000</lastBuildDate>
	<generator>http://wordpress.org/?v=2.8.1</generator>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
			<item>
		<title>Aglets &#8211; A good idea for Web spidering.</title>
		<link>http://www.lunaticmarks.com/aglets-a-good-idea-for-web-spidering/</link>
		<comments>http://www.lunaticmarks.com/aglets-a-good-idea-for-web-spidering/#comments</comments>
		<pubDate>Sat, 24 Feb 2007 11:03:39 +0000</pubDate>
		<dc:creator>Ravi</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[aglets]]></category>
		<category><![CDATA[research]]></category>
		<category><![CDATA[search engine spiders]]></category>
		<category><![CDATA[search technology]]></category>

		<guid isPermaLink="false">http://www.lunaticmarks.com/?p=98</guid>
		<description><![CDATA[About Aglets, Aglet software, Aglets and spiders, Aglet technology, web spidering, data pull technology, data push technology, search technology, web directories, search engines, keywords, search engine cache, indexer, search engine index. ]]></description>
			<content:encoded><![CDATA[<p>Many individuals and businesses now rely on the Web for finding information, and in particular rely on centralised search databases such as <a href="http://www.lunaticmarks.com/?p=88">web directories</a> and search engines.</p>
<p>The extent to which these databases reflect the &#8220;contents&#8221; of the Web in an accurate and timely manner is now in considerable doubt, and in any event it is apparent that the methods search engines use to find new and modified Web documents are not scaling well. To ameliorate these problems, we have been exploring a &#8220;<strong>data push</strong>&#8221; model for notifying Web changes, which uses aglets (aka servlets or peerlets) to distribute the indexing task, as a replacement for the current &#8220;<strong>data pull</strong>&#8221; model.</p>
<p>This is a continuation of my earlier posts on <a href="http://www.lunaticmarks.com/?p=60">How do search engine robots work</a> and <a href="">Methods for highlighting search queries by Search Engines</a>.</p>
<p>To sum up from these earlier posts:</p>
<p>Search engines that use the current &#8220;<strong>data pull</strong>&#8221; model consist of discrete software components such as the spider.</p>
<p><b>Spider</b>: a robotic, browser-like program that downloads web pages. It finds new work in two main ways. First, it repeatedly follows hyperlinks in known documents to find unknown documents. In the initial stages of spidering, seed URLs are given to the spider and the hyperlink structure is used to populate the index. In later stages, when documents are added or modified, the hyperlinks within these documents can be followed in a similar fashion. Second, the spider tracks changes and deletions in documents already indexed by requesting header information and checking document time-stamps.</p>
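<p>The link-following step described above can be sketched as a breadth-first traversal. This is only an illustration under stated assumptions: the class <code>SpiderSketch</code>, the method <code>discover</code> and the sample page names are hypothetical, and the &#8220;web&#8221; is an in-memory map rather than real HTTP downloads.</p>

```java
import java.util.ArrayDeque;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Queue;
import java.util.Set;

// Hypothetical in-memory spider: the "web" is a map from URL to outgoing
// links, so no network access is involved.
class SpiderSketch {

    // Breadth-first discovery: start from the seed URLs and keep following
    // hyperlinks until no unknown documents remain.
    static Set<String> discover(Map<String, List<String>> web, List<String> seeds) {
        Set<String> known = new LinkedHashSet<>(seeds);   // keeps discovery order
        Queue<String> frontier = new ArrayDeque<>(seeds);
        while (!frontier.isEmpty()) {
            String url = frontier.remove();
            // A real spider would download the page here and extract its links.
            for (String link : web.getOrDefault(url, List.of())) {
                if (known.add(link)) {    // true only for a previously unknown document
                    frontier.add(link);
                }
            }
        }
        return known;
    }

    public static void main(String[] args) {
        Map<String, List<String>> web = Map.of(
            "a.html", List.of("b.html", "c.html"),
            "b.html", List.of("c.html", "d.html"),
            "c.html", List.of("a.html"),
            "e.html", List.of("a.html"));   // nothing links to e.html
        // prints [a.html, b.html, c.html, d.html]
        System.out.println(discover(web, List.of("a.html")));
    }
}
```

<p>Note that <code>e.html</code> is never found in this sketch, because no known document links to it. Following hyperlinks alone only reaches documents that are linked from the seeds.</p>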
<p>Tracking changes and deletions through HTTP is inefficient, because it is based on a &#8220;<strong>data-pull</strong>&#8221; communication model and because each request covers only a single document. If there is new work for an indexer to do, the only way to find out through HTTP is to send a request for information on each document indexed. For large web indexes this means up to 50 million requests before all documents have been checked, only a very small fraction of which will result in new work. One way of addressing this inefficiency is to reduce the number of HTTP requests a spider must send to find new work. A notable proposal is for a <strong>sitelist.txt</strong> standard, where it becomes each web server’s responsibility to provide a single text file listing all files and their time-stamps. However, in such a system the amount of communication is still great regardless of the amount of new work, if any.</p>
<p><strong>Data Pull vs Data Push</strong></p>
<p>In a data-push model, instead of 50 million largely useless requests or one large request per site, a small notification message is sent at an appropriate time, concisely describing all new work. Novel server push methods have already been proposed using, for example, a request for email notification whenever a stated document changes.</p>
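<p>To make the comparison concrete, here is a rough message-count model for the three approaches: per-document polling, a sitelist.txt-style request per site, and push notification. It is a sketch only. The class <code>NotificationCost</code>, its method names and the sample figures are assumptions made for illustration, and it counts messages without regard to their size.</p>

```java
import java.util.Map;
import java.util.Set;

// Hypothetical cost model: counts messages only, ignoring message size.
class NotificationCost {

    // Data pull over HTTP: one header request per indexed document.
    static long pullRequests(Map<String, Integer> indexedDocsPerSite) {
        long total = 0;
        for (int docs : indexedDocsPerSite.values()) {
            total += docs;
        }
        return total;
    }

    // sitelist.txt-style pull: one (large) request per site, changed or not.
    static long siteListRequests(Map<String, Integer> indexedDocsPerSite) {
        return indexedDocsPerSite.size();
    }

    // Data push: one notification per indexed site that actually has new work.
    static long pushNotifications(Map<String, Integer> indexedDocsPerSite,
                                  Set<String> sitesWithChanges) {
        long total = 0;
        for (String site : sitesWithChanges) {
            if (indexedDocsPerSite.containsKey(site)) {
                total++;
            }
        }
        return total;
    }

    public static void main(String[] args) {
        Map<String, Integer> index = Map.of("site-a", 1000, "site-b", 250, "site-c", 40);
        Set<String> changed = Set.of("site-b");
        System.out.println("pull:     " + pullRequests(index));               // 1290
        System.out.println("sitelist: " + siteListRequests(index));           // 3
        System.out.println("push:     " + pushNotifications(index, changed)); // 1
    }
}
```

<p>In this toy model the pull cost grows with the size of the index, while the push cost grows only with the amount of new work.</p>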
<p>A &#8220;<strong>data-push</strong>&#8221; model for finding new indexing work requires a certain amount of computation and state storage on the web server end. This makes it an ideal application for <strong>aglet technology</strong>, because not only can this remote computation take place, but the techniques used are determined by the indexer, so new technologies or indexing priorities can be reflected in new versions of the <strong>aglet software</strong>. An aglet would be dispatched by the spider/indexer to a web server, where it acts as the spider&#8217;s agent, accessing and perhaps predigesting local information changes at the server and then sending them to the indexer.</p>
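<p>The computation and state storage needed at the web server end might look something like the following. This is a minimal sketch, not the aglet software itself: <code>ChangeWatcher</code>, <code>ChangeNotice</code> and <code>scan</code> are hypothetical names, and the local file time-stamps are passed in as a plain map rather than read from a real file system.</p>

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical server-side agent: remembers the time-stamps it saw on its
// last scan and reports only what has changed since then.
class ChangeWatcher {

    // The concise "new work" message that would be pushed to the indexer.
    static class ChangeNotice {
        final List<String> added = new ArrayList<>();
        final List<String> modified = new ArrayList<>();
        final List<String> deleted = new ArrayList<>();

        boolean isEmpty() {
            return added.isEmpty() && modified.isEmpty() && deleted.isEmpty();
        }
    }

    // State stored at the web server: path -> last-modified time seen.
    private final Map<String, Long> lastSeen = new HashMap<>();

    // Compare the current local time-stamps with the remembered ones.
    ChangeNotice scan(Map<String, Long> current) {
        ChangeNotice notice = new ChangeNotice();
        // Sorted copies give a stable order for the reported paths.
        Map<String, Long> sortedCurrent = new TreeMap<>(current);
        for (Map.Entry<String, Long> entry : sortedCurrent.entrySet()) {
            Long previous = lastSeen.get(entry.getKey());
            if (previous == null) {
                notice.added.add(entry.getKey());
            } else if (!previous.equals(entry.getValue())) {
                notice.modified.add(entry.getKey());
            }
        }
        Map<String, Long> sortedPrevious = new TreeMap<>(lastSeen);
        for (String path : sortedPrevious.keySet()) {
            if (!current.containsKey(path)) {
                notice.deleted.add(path);
            }
        }
        // Remember this scan so the next one reports only newer changes.
        lastSeen.clear();
        lastSeen.putAll(current);
        return notice;
    }
}
```

<p>An agent built along these lines would call <code>scan</code> periodically and send a message to the indexer only when <code>isEmpty()</code> is false, so a site with no changes generates no traffic at all.</p>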
<p><strong>The state of play in the search industry</strong></p>
<p>Due to the very large volume of documents available, large web search services are expensive to run, both in terms of the cost of spidering and the cost of index building and searching. Because of this, only a small number of larger web indexes exist, and these are all commercial services (Google, Ask, Yahoo, Live, AltaVista, Excite, HotBot, Lycos and InfoSeek).</p>
<p>Even these services are forced to compromise for efficiency, only partially covering each site and polling documents infrequently. In fact, in some services it is possible for an index to be more than three months out of date with respect to changes in a particular document. If the cost of spidering is reduced, the overall cost of running a search service also decreases and the coverage of search services may increase. For this reason there is much industry enthusiasm for finding new ways of efficiently detecting change in remote documents. The solution suggested here, a data-push model employing <strong>aglets</strong>, would not only offer indexers very efficient change notification, but could also be used in the transmission of the documents themselves, by compressing, pre-indexing or even sending only changed portions of documents.</p>
<p><strong>What is an Aglet?</strong></p>
<p>An aglet is a Java-based mobile software agent. The term software agent has been given many definitions; here we refer to a piece of software that can halt its execution on one host, transfer to another host, and then continue execution from where it left off on the remote host.</p>
<p>The <strong>aglet framework</strong> was created by a team led by Danny B. Lange at the IBM Tokyo Research Laboratory in Japan.</p>
<p>Aglets are designed around an <strong>event-driven callback programming model</strong> that has similarities with the Java Applet programming model. An aglet can experience any of the following events in its life:</p>
<p><strong>Creation:</strong> An aglet is instantiated, and its main thread begins executing.</p>
<p><strong>Disposal:</strong> An aglet is destroyed and all its information is lost.</p>
<p><strong>Cloning:</strong> The aglet is replicated, with current state but new identity.</p>
<p><strong>Dispatch:</strong> The aglet and its state are sent to a remote host.</p>
<p><strong>Retract:</strong> A previously dispatched aglet is pulled back from a remote host.</p>
<p><strong>Deactivation:</strong> The aglet and its state are transferred to persistent storage.</p>
<p><strong>Activation:</strong> The aglet and its state are restored from persistent storage.</p>
<p>Before any of these events occurs, an aglet is notified of the upcoming event through a call to the appropriate callback method. For example, when an aglet is created, the onCreation() method is invoked. A programmer can override this method with one that initialises the state of the newly formed <strong>aglet</strong>.</p>
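<p>The callback model can be illustrated in plain Java without the Aglets library. Everything here is a hypothetical stand-in: <code>SketchAgent</code>, <code>SketchHost</code> and <code>CountingAgent</code> are invented for this sketch, and apart from <code>onCreation()</code>, which is named above, the callback names are assumptions rather than the Aglets API.</p>

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for an agent base class. The method names are
// illustrative and are not the Aglets API itself.
abstract class SketchAgent {
    void onCreation() {}      // called before the agent starts running
    void onDispatching() {}   // called before the agent leaves this host
    void onDisposing() {}     // called before the agent is destroyed
}

// Hypothetical host: announces each event through the matching callback
// before carrying the event out.
class SketchHost {
    final List<String> log = new ArrayList<>();

    void create(SketchAgent agent) {
        agent.onCreation();
        log.add("created");
    }

    void dispatch(SketchAgent agent, String destination) {
        agent.onDispatching();
        log.add("dispatched to " + destination);
    }

    void dispose(SketchAgent agent) {
        agent.onDisposing();
        log.add("disposed");
    }
}

// An agent that overrides one callback to initialise its own state.
class CountingAgent extends SketchAgent {
    int pagesSeen = -1;       // deliberately invalid until onCreation() runs

    @Override
    void onCreation() {
        pagesSeen = 0;
    }
}
```

<p>Calling <code>new SketchHost().create(new CountingAgent())</code> runs the overridden callback first, so the agent&#8217;s counter is already initialised by the time the host records the creation.</p>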
<p><strong>How Aglets move between hosts</strong></p>
<p>Aglets are moved from host to host using the JDK’s Object Serialisation feature. The aglet object, all serialisable objects reachable from it, and the aglet’s heap are converted into a byte stream and sent across the network. The receiving host can then reconstruct the aglet and its heap. Java does not allow access to execution stacks of the virtual machine, and so not all state is preserved in the transition. However, state can be effectively restored if the aglet is programmed like a finite state machine. Before dispatch, the current &#8220;state&#8221; can be recorded in variables on the heap, so that when execution begins at the receiving node, the state variables on the heap can be consulted to determine what to do next.</p>
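<p>The finite-state-machine technique can be tried out with standard Java serialisation alone. In this sketch the class <code>TourAgent</code>, its phases and the <code>transfer</code> helper are hypothetical. The &#8220;network&#8221; is just an in-memory byte array, which stands in for the bytes a receiving host would read.</p>

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.ArrayList;
import java.util.List;

// Hypothetical agent written as a finite state machine. The phase field lives
// on the heap, so it survives serialisation even though the call stack does not.
class TourAgent implements Serializable {
    private static final long serialVersionUID = 1L;

    enum Phase { COLLECT, REPORT, DONE }

    Phase phase = Phase.COLLECT;
    final List<String> changes = new ArrayList<>();

    // Run one step, chosen by consulting the recorded phase.
    String step() {
        switch (phase) {
            case COLLECT:
                changes.add("/index.html");   // placeholder for real local work
                phase = Phase.REPORT;
                return "collected";
            case REPORT:
                phase = Phase.DONE;
                return "reported " + changes.size() + " change(s)";
            default:
                return "idle";
        }
    }

    // Stand-in for dispatch: serialise to bytes, then rebuild a new object,
    // as a receiving host would after reading the bytes from the network.
    static TourAgent transfer(TourAgent agent) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(agent);
            }
            try (ObjectInputStream in = new ObjectInputStream(
                    new ByteArrayInputStream(bytes.toByteArray()))) {
                return (TourAgent) in.readObject();
            }
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException("transfer failed", e);
        }
    }

    public static void main(String[] args) {
        TourAgent local = new TourAgent();
        System.out.println(local.step());     // collected
        TourAgent remote = transfer(local);   // a distinct object with the same heap state
        System.out.println(remote.step());    // reported 1 change(s)
    }
}
```

<p>The copy rebuilt by <code>transfer</code> does not resume in the middle of <code>step()</code>. It starts a fresh call and uses the saved <code>phase</code> to pick up at the right point, which is the effect the paragraph above describes.</p>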
<p><strong>Aglets and Spiders</strong><br />
<strong>Aglets</strong> suggest themselves as an important part of the solution to the problem of the scalability of <strong>web spidering</strong> for a number of reasons:</p>
<ul>
<li>support of a &#8220;data push&#8221; model should decrease the network traffic and wall time required to locate changes in the web</li>
<li>the possibility exists for performance-related contracts to be agreed between data suppliers and indexers to ensure that document changes are reflected in indexes in a timely, or at least predictable, manner (opportunities for contractual consideration exist on both sides &#8211; the data provider pays in local computational power, and the indexing site guarantees index presence)</li>
<li>the functionality of the aglet code is under the control of the indexing site, and so can be updated to reflect its requirements, and its core knowledge of the indexing task</li>
<li>the binding of an aglet into a local site requires the consent of that site, but in this application such consent is likely (unlike, say, aglets produced by individual persons and dispatched for personal business)</li>
<li>the spidering problem is fundamentally a graph-traversal problem &#8211; the web can be seen as a graph with many alternative arcs (paths) between nodes with different cost-related weights</li>
<li>the ability of aglets to clone, to have an initial itinerary, and to react to local conditions permits numerous alternative traversal strategies to be explored</li>
</ul>
]]></content:encoded>
			<wfw:commentRss>http://www.lunaticmarks.com/aglets-a-good-idea-for-web-spidering/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
	</channel>
</rss>
