The Crazy Mind

A Webmaster Blog


Scrapers exploit the sitemap.xml and make easy money

What gets scrapers to target your website in the first place?

1. If your website is popular in its niche and gets a lot of traffic from search engines, its URLs are crawled frequently and are easy to discover. That makes it easy for scrapers to steal the content and build Made for AdSense (MFA) sites around it.

2. Some scrapers are marketing analytics operations that gather data about you and your company and sell it to advertisers for profit. Their marketing strategies involve continuous observation of the following factors:

  • Charting your Internet mind share and buzz index; sites like compete.com, quantcast.com or spyfu.com give good information about your website
  • Tracking online opinion and issues
  • Listening in on word of mouth
  • Customer-generated media: blogs, consumer portals, special interest sites, political cause networks, online news services, and archives

    Recently, many people have been posting about content problems related to sitemap.xml. In the sitemap you list the URLs of the pages on your website, and some sitemap formats and generators also carry a title and description for each page.

    Are the title and meta tags of new content scraped before sitemap generators have even submitted the sitemap to Google? The answer is yes.

    The sitemap.xml file hands a list of your website's URLs directly to any scraper who wants to use it for cloaking.

    Cloaking is primarily used to show an optimized page to search engines and a different page to human visitors.
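    To illustrate how little work a sitemap leaves for whoever reads it, here is a minimal Python sketch that turns a sitemap into a plain list of URLs. It assumes a file in the standard sitemaps.org format; the sample XML, the example.com URLs and the helper name list_sitemap_urls are placeholders made up for this illustration.

```python
import xml.etree.ElementTree as ET

# Namespace used by the standard sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

# Hypothetical sitemap content; the example.com URLs are placeholders.
SAMPLE_SITEMAP = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/</loc></url>
  <url><loc>https://example.com/new-post</loc></url>
</urlset>"""


def list_sitemap_urls(xml_text):
    """Return every <loc> URL found in a sitemap document."""
    root = ET.fromstring(xml_text)
    return [
        loc.text.strip()
        for loc in root.findall("sm:url/sm:loc", SITEMAP_NS)
        if loc.text
    ]


if __name__ == "__main__":
    for url in list_sitemap_urls(SAMPLE_SITEMAP):
        print(url)
```

    A few lines of standard-library code are enough, which is why a publicly readable sitemap.xml amounts to a ready-made crawl list.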

    Excessively scraped sites can struggle in the SERPs. When someone mirrors your content, it's possible for your page or site to be hit with a duplicate content penalty.

    Some ideas to make it hard for scrapers

  • Stop referencing the sitemap in robots.txt. Instead, submit every sitemap by ping to the search engines that use them, and give the file a randomly generated name each time a sitemap is created.
  • Search engines could offer a separate tool that generates the .xml sitemap for you. Since these files are only for search engine use, I see no reason the file name could not be randomly generated, and the tool could also delete the previous sitemap file.
  • A safe sitemap generator is better in many ways than a free sitemap generator, which might send information to scraper sites without your knowledge. I would trust one from the search engines.
  • But…

    Any time you give scrapers a clear path around honey pots and spider traps, they'll use it. That said, scrapers can simply scrape a search engine first, using site:mydomain.com to get the equivalent of a sitemap, and avoid your spider traps anyway.
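    The random file name idea from the list above could be sketched roughly as follows in Python. This is only an illustration under stated assumptions: the function name write_random_sitemap, the "sitemap-" prefix and the output directory are hypothetical, and the submission step is left out because it depends on each search engine.

```python
import secrets
from pathlib import Path
from xml.sax.saxutils import escape


def write_random_sitemap(urls, directory, prefix="sitemap-"):
    """Write a sitemap under an unguessable name and remove older ones.

    Hypothetical helper: the prefix and directory layout are assumptions.
    """
    directory = Path(directory)
    directory.mkdir(parents=True, exist_ok=True)

    # Delete sitemaps left over from earlier runs.
    for old in directory.glob(f"{prefix}*.xml"):
        old.unlink()

    # 16 random bytes -> 32 hex characters, impractical to guess.
    name = f"{prefix}{secrets.token_hex(16)}.xml"

    # Escape &, < and > so URLs with query strings stay valid XML.
    entries = "\n".join(f"  <url><loc>{escape(u)}</loc></url>" for u in urls)
    body = (
        '<?xml version="1.0" encoding="UTF-8"?>\n'
        '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n'
        f"{entries}\n"
        "</urlset>\n"
    )

    path = directory / name
    path.write_text(body, encoding="utf-8")
    return path
```

    The returned path is what you would submit to the search engines directly instead of listing it in robots.txt. As noted above, this only hides the sitemap itself; it does not stop a scraper from collecting your URLs with a site: query.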

    One Response to “Scrapers exploit the sitemap.xml and make easy money”

    1. Amitabh Says:

      Include some proof, like Google cache and Web Archive info, to support your claim. If your site is copyrighted in any country, you may include the certificate as well.
