<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>The Crazy Mind &#187; spam</title>
	<atom:link href="http://www.lunaticmarks.com/category/spam/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.lunaticmarks.com</link>
	<description>A Webmaster Blog</description>
	<lastBuildDate>Mon, 08 Oct 2007 10:00:15 +0000</lastBuildDate>
	<generator>http://wordpress.org/?v=2.8.1</generator>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
			<item>
		<title>Scrapers exploit the sitemap.xml and make easy money</title>
		<link>http://www.lunaticmarks.com/scrapers-exploit-the-sitemapxml-and-make-easy-money/</link>
		<comments>http://www.lunaticmarks.com/scrapers-exploit-the-sitemapxml-and-make-easy-money/#comments</comments>
		<pubDate>Mon, 07 May 2007 06:04:48 +0000</pubDate>
		<dc:creator>Ravi</dc:creator>
				<category><![CDATA[Internet security]]></category>
		<category><![CDATA[RSS]]></category>
		<category><![CDATA[SEO]]></category>
		<category><![CDATA[Scraper sites]]></category>
		<category><![CDATA[cloaking]]></category>
		<category><![CDATA[keywords]]></category>
		<category><![CDATA[research]]></category>
		<category><![CDATA[search engine spiders]]></category>
		<category><![CDATA[spam]]></category>
		<category><![CDATA[website security]]></category>

		<guid isPermaLink="false">http://www.lunaticmarks.com/?p=118</guid>
		<description><![CDATA[What gets scrapers to target your website in the first place?
1. If your website is popular in its niche and gets a lot of traffic from search engines, its URLs are crawled very frequently, and this makes it easy for scrapers to steal the content and make [...]]]></description>
			<content:encoded><![CDATA[<p><strong>What gets scrapers to target your website in the first place?</strong></p>
<p>1. If your website is popular in its niche and gets a lot of traffic from search engines, its URLs are crawled very frequently, and this makes it easy for scrapers to steal the content and build <strong>Made for AdSense (MFA)</strong> sites around it.</p>
<p>2. Some scrapers are marketing analytics operations run for advertising companies: they gather data about you and your company and sell it to advertisers for profit. Their marketing strategies involve continuous observation of the following factors:</p>
<ul>
<li>Charting your Internet mind share and buzz index. Sites like compete.com, quantcast.com or spyfu.com give good information about your websites.</li>
<li>Tracking online opinion and issues.</li>
<li>Listening in on word of mouth and customer-generated media: blogs, consumer portals, special interest sites, political cause networks, online news services, and archives.</li>
</ul>
<p>Recently, many people have posted about sitemap.xml being exploited to scrape content. In the sitemap you list the URLs of the webpages on your website (the standard sitemap protocol carries each URL plus an optional last-modified date, change frequency and priority; titles and descriptions are then read from the pages themselves).</p>
<blockquote><p>Are the title and meta tags of new content scraped before the sitemap is submitted to Google by sitemap generators? And the answer is <strong>YES</strong></p></blockquote>
<p>The sitemap.xml file hands a list of your website&#8217;s URLs directly to any scraper who wants to use it for cloaking.</p>
<blockquote><p><strong>Cloaking is primarily used to show an optimized page to the search engines and a different page to humans</strong></p></blockquote>
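<p>To show how little work the sitemap saves a scraper, here is a minimal Python sketch. It is only an illustration: the example.com URLs and the helper names <code>build_sample_sitemap</code> and <code>list_sitemap_urls</code> are made up for this post, and the sample sitemap is built in memory rather than fetched from a real site. The namespace is the one defined by the sitemaps.org protocol.</p>

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sample_sitemap(urls):
    """Build a tiny sitemap document in memory (hypothetical example data)."""
    urlset = ET.Element("{%s}urlset" % SITEMAP_NS)
    for url in urls:
        entry = ET.SubElement(urlset, "{%s}url" % SITEMAP_NS)
        ET.SubElement(entry, "{%s}loc" % SITEMAP_NS).text = url
    return ET.tostring(urlset, encoding="unicode")


def list_sitemap_urls(sitemap_xml):
    """Return every page URL listed in a sitemap document."""
    root = ET.fromstring(sitemap_xml)
    return [loc.text.strip() for loc in root.iter("{%s}loc" % SITEMAP_NS)]


sample = build_sample_sitemap([
    "http://www.example.com/",
    "http://www.example.com/new-post/",
])
print(list_sitemap_urls(sample))
```

<p>A few lines of parsing are enough to recover every URL the site owner listed, which is why it matters who can find the file.</p>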
<p>Excessively scraped sites can struggle in the SERPs. When someone mirrors your content, it&#8217;s possible for your page or site to get hit with a <strong>duplicate content penalty</strong>.</p>
<p><strong>Some Ideas to make it hard for Scrapers</strong> </p>
<ul>
<li>Stop referencing the sitemap in robots.txt. Instead, submit every sitemap by ping to the search engines that use them, and give the file a randomly generated name each time a sitemap is created.</li>
<li>Search engines could offer a separate tool that generates the .xml sitemap for you. Since these files are only for search engine use, I see no reason the file name could not be randomly generated, and the tool could also delete the previous sitemap file.</li>
<li>A safe sitemap generator is better in many ways than a free sitemap generator that might send information to scraper sites without your knowledge. I would trust one from the search engines.</li>
</ul>
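<p>The random file name idea can be sketched in a few lines of Python. This is a rough sketch under stated assumptions, not a finished tool: the function name <code>write_random_sitemap</code>, the <code>sitemap-</code> prefix and the placeholder file contents are all invented for illustration, and the step of submitting the new name to each search engine is left out because it depends on what each engine currently accepts.</p>

```python
import os
import secrets
import tempfile


def write_random_sitemap(directory, sitemap_xml, previous_name=None):
    """Write the sitemap under a fresh random name and delete the old file.

    Returns the new file name so it can be submitted to search engines.
    """
    # 16 random bytes give 32 hex characters, impractical to guess.
    new_name = "sitemap-%s.xml" % secrets.token_hex(16)
    with open(os.path.join(directory, new_name), "w", encoding="utf-8") as handle:
        handle.write(sitemap_xml)

    # Remove the previous sitemap so a stale copy cannot be found later.
    if previous_name:
        old_path = os.path.join(directory, previous_name)
        if os.path.exists(old_path):
            os.remove(old_path)
    return new_name


# Demo in a throwaway directory; the file bodies are placeholders.
with tempfile.TemporaryDirectory() as folder:
    first = write_random_sitemap(folder, "placeholder sitemap, version 1")
    second = write_random_sitemap(
        folder, "placeholder sitemap, version 2", previous_name=first
    )
    print(os.listdir(folder) == [second])  # True: only the newest file remains
```

<p>The name comes from the <code>secrets</code> module rather than <code>random</code> because the whole point is that an outsider should not be able to predict it.</p>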
<p>But&#8230;.</p>
<p>Any time you give scrapers a clear path to avoid honey pots and spider traps they&#8217;ll use it. With that said, the scrapers can simply scrape a search engine first using site:mydomain.com to get the equivalent of a sitemap and avoid your spider traps anyway.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.lunaticmarks.com/scrapers-exploit-the-sitemapxml-and-make-easy-money/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Keyword phrase based Indexing and retrieval</title>
		<link>http://www.lunaticmarks.com/keyword-phrase-based-indexing-and-retrieval/</link>
		<comments>http://www.lunaticmarks.com/keyword-phrase-based-indexing-and-retrieval/#comments</comments>
		<pubDate>Mon, 19 Feb 2007 04:31:16 +0000</pubDate>
		<dc:creator>Ravi</dc:creator>
				<category><![CDATA[Google]]></category>
		<category><![CDATA[SEO]]></category>
		<category><![CDATA[Search Engines]]></category>
		<category><![CDATA[cloaking]]></category>
		<category><![CDATA[keywords]]></category>
		<category><![CDATA[research]]></category>
		<category><![CDATA[search technology]]></category>
		<category><![CDATA[spam]]></category>

		<guid isPermaLink="false">http://www.lunaticmarks.com/?p=92</guid>
		<description><![CDATA[Google has some way of isolating certain keyword phrases that they deem typical of spam sites.
But the number of keyword phrases that can cause the penalty depends on other factors like
1. Indexing
2. Weighting
3.  Ranking
4. Duplicate content
5. Spam detection and weighting (yes &#8211; not all spam is bad?)
6. Back Links/Link Profiles
7. Personalized Search  &#8211; [...]]]></description>
			<content:encoded><![CDATA[<p>Google has some way of isolating certain keyword phrases that they deem typical of spam sites.</p>
<p>But the number of keyword phrases that can cause the penalty depends on other factors like</p>
<p>1. Indexing<br />
2. Weighting<br />
3.  Ranking<br />
4. Duplicate content<br />
5. Spam detection and weighting (yes &#8211; not all spam is bad?)<br />
6. Back Links/Link Profiles<br />
7. Personalized Search &#8211; a spammer can add code to the page that leads the user to a network of spam (&#8220;<strong>keyword stuffing page</strong>&#8221;) websites.</p>
<p>So what does a search engine&#8217;s spam filter do?</p>
<p>Spam Assassin<br />
A Phrase Based Indexing and Retrieval (PaIR) system could identify a web page (or document) as a spam page by comparing the actual number of related phrases present in the document with the expected number of related phrases. A phrase rate or density that is too high or too low compared with the expected one could flag a given document for further algorithmic inspection. Expectations about plural and singular occurrences can also be evaluated as part of the spam detection process. A document identified as spam is then added to a list of spam documents.</p>
<p><strong>The whole process</strong> takes place at both indexing and retrieval. In essence, the document gets its spam score at indexing. Then, upon retrieval, should that page be included in the results, its weighting is removed and the page is devalued during ranking according to the previously calculated spam threshold score.</p>
<p>According to the folks who drafted it, a normal document contains on the order of 8-20 related topical keyword phrases, whereas the typical spam document would contain between 100 and 1000 related phrases. So by looking for statistical deviations in related phrase occurrences, the system can flag an item as spam. Once again this mostly applies at the high end, but an unusually low count can also be used as a flag (which could be compared with the link profile to look for link spam).</p>
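<p>As a rough illustration of that statistical check, here is a small Python sketch. The 8 and 20 limits are the figures quoted above; everything else (the function names, counting each related phrase once, the simple substring match) is an assumption made for this example and not how any search engine is known to implement it.</p>

```python
def count_related_phrases(text, related_phrases):
    """Count how many of the related phrases occur in the text (case-insensitive)."""
    lowered = text.lower()
    return sum(1 for phrase in related_phrases if phrase.lower() in lowered)


def flag_for_inspection(related_count, expected_low=8, expected_high=20):
    """Compare a related-phrase count with the expected range."""
    if related_count > expected_high:
        return "too many related phrases"
    if expected_low > related_count:
        return "too few related phrases"
    return "normal"


page = "The President of the United States met George Bush and Bill Clinton."
related = ["George Bush", "Bill Clinton", "White House"]
print(count_related_phrases(page, related))  # 2
print(flag_for_inspection(12))               # normal
print(flag_for_inspection(250))              # too many related phrases
```

<p>A flag here only means the document deserves a closer look, which matches the idea of passing it on for further algorithmic inspection.</p>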
<p><strong>So how do you tell the right keyword phrases from artificially created ones?</strong></p>
<p>The goal is to classify each potential keyphrase as either “a good phrase or a bad phrase” depending on its usage and frequency, and then to use those ‘good’ phrases to predict the usage of other ‘good phrases’ in the collection of web pages.</p>
<p>For example, if a normal webpage contains about 8-20 repetitions of a popular keyphrase, the same keyphrase could be present about 100-1000 times in a spam document.</p>
<p>It also matters which blocks of the webpage lead the search engine to detect it as a spam document:</p>
<blockquote><p>For example, the phrase &#8220;President of the United States&#8221; is a phrase that predicts other phrases such as &#8220;George Bush&#8221; and &#8220;Bill Clinton.&#8221; However, other phrases are not predictive, such as &#8220;fell down the stairs&#8221; or &#8220;top of the morning,&#8221; &#8220;out of the blue,&#8221; since idioms and colloquisms like these tend to appear with many other different and unrelated phrases. Thus, the <strong>keyword phrase identification phase</strong> determines which phrases are good phrases and which are bad (i.e., lacking in predictive power).</p></blockquote>
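<p>One simple way to picture &#8220;predictive power&#8221; is to compare how often two phrases really appear together with how often they would appear together by chance. The Python sketch below does that on a made-up collection of six tiny documents. The function names, the 1.5 threshold and the sample data are all assumptions for illustration; the real system is certainly more involved.</p>

```python
def cooccurrence_gain(documents, phrase, other):
    """Ratio of actual to expected co-occurrence of two phrases.

    documents is a list of sets, each holding the phrases found in one page.
    A ratio well above 1 suggests that phrase predicts other.
    """
    total = len(documents)
    with_phrase = sum(1 for doc in documents if phrase in doc)
    with_other = sum(1 for doc in documents if other in doc)
    with_both = sum(1 for doc in documents if phrase in doc and other in doc)
    # Pages expected to hold both phrases if the two were unrelated.
    expected = with_phrase * with_other / total
    return with_both / expected if expected else 0.0


def is_good_phrase(documents, phrase, candidates, threshold=1.5):
    """A phrase is 'good' if it predicts at least one other candidate phrase."""
    return any(
        cooccurrence_gain(documents, phrase, other) > threshold
        for other in candidates
        if other != phrase
    )


# Hypothetical mini collection: each set is the phrases found on one page.
pages = [
    {"president of the united states", "george bush"},
    {"president of the united states", "george bush", "bill clinton"},
    {"out of the blue", "george bush"},
    {"out of the blue", "telesales jobs"},
    {"out of the blue", "bill clinton"},
    {"telesales jobs"},
]
phrases = sorted(set().union(*pages))

print(is_good_phrase(pages, "president of the united states", phrases))  # True
print(is_good_phrase(pages, "out of the blue", phrases))                 # False
```

<p>In this toy data the idiom shows up next to everything about equally, so it predicts nothing, while the presidential phrase appears with &#8220;george bush&#8221; twice as often as chance would suggest.</p>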
<p>But I suspect that phrase density can be very high, maybe 90%, and the page can still be deemed relevant. For example, a recruitment page about &#8216;Telesales jobs in London&#8217; may contain the phrase 50 times because it describes each job advertised, and this could be a very good page. If the phrase density is very high, then Google will look for the associated phrases in pages clustered with that page. If a clustered page has the phrase in a low density, e.g. a specific job description page, then the associated phrases should be on that page. If they are, then the high-density page clustered with it is still deemed to be OK.</p>
<p>So I believe it is applied on a cluster basis. It&#8217;s all about website navigation and offering the user the <strong>best entry point</strong> into a website, rather than automatically offering the highest scoring page, which may be out of date.<br />
I think identifying &#8220;good phrases&#8221; in clusters of pages can be a good method for detecting spam, because Google engineers can detect a bunch of spam-related terms.</p>
<p>For those of us not interested in spam prevention but in rankings, do you think this model can be used with a few billion pages with a zillion terms in a few dozen languages?</p>
]]></content:encoded>
			<wfw:commentRss>http://www.lunaticmarks.com/keyword-phrase-based-indexing-and-retrieval/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
	</channel>
</rss>
