Sitemaps & Robots.txt

Overview

Whether you like it or not, a lot of bots crawl your web app every day. Here are a few of those crawlers:

  • Search Engines like Google
  • Open Graph metadata fetchers like Facebook and Twitter
  • Web scraping tools and services
  • Malicious users with different agendas

There is very little we can do to prevent these crawlers. But we can hint at and control some of the reputable crawlers, like those from Google, Facebook, and Twitter.

Here's how we can do that:

With Sitemaps

A sitemap is a list of pages in your app that hints at how search engines should crawl those pages. Here are some things you can do with sitemaps:

  • Notify search engines about new pages
  • Ask them not to crawl some pages in the future
  • Suggest how often to crawl each page
  • Give some pages priority over others

The specification is very simple; have a look at it at sitemaps.org.

Here's an example sitemap:

<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">

   <url>
      <loc>http://www.example.com/</loc>
      <lastmod>2005-01-01</lastmod>
      <changefreq>monthly</changefreq>
      <priority>0.8</priority>
   </url>

</urlset>
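
If your app has many pages or they change often, you may prefer to generate this XML dynamically instead of maintaining a static file. Here's a minimal TypeScript sketch using Node's built-in http module; the getPages() helper and the example.com URLs are hypothetical stand-ins for your own page source:

import { createServer } from "http";

// Hypothetical helper: in a real app this would read your
// page list from a database, the file system, or a CMS.
function getPages(): { path: string; lastmod: string }[] {
  return [
    { path: "/", lastmod: "2005-01-01" },
    { path: "/about", lastmod: "2005-01-01" },
  ];
}

// Build the sitemap XML from the page list.
function buildSitemap(origin: string): string {
  const urls = getPages()
    .map(
      (page) =>
        "   <url>\n" +
        `      <loc>${origin}${page.path}</loc>\n` +
        `      <lastmod>${page.lastmod}</lastmod>\n` +
        "   </url>"
    )
    .join("\n");
  return (
    '<?xml version="1.0" encoding="UTF-8"?>\n' +
    '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n' +
    urls +
    "\n</urlset>"
  );
}

// Serve /sitemap.xml with an XML content type.
createServer((req, res) => {
  if (req.url === "/sitemap.xml") {
    res.writeHead(200, { "Content-Type": "application/xml" });
    res.end(buildSitemap("http://www.example.com"));
  } else {
    res.writeHead(404);
    res.end();
  }
}).listen(3000);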

You can submit sitemaps directly to search engines or point to them from your robots.txt file.

With the robots.txt file

First, let's have a look at a sample /robots.txt file:

User-agent: *
Disallow: /api
Sitemap: https://yourapp.com/sitemap.xml

As you can see, here we are asking every crawler not to crawl any of our API routes. We also include a link to our sitemap.
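
Rules can also target individual crawlers by name. Here's a short sketch; SomeBot is a made-up user agent, used only to illustrate the per-agent grouping the format allows:

User-agent: SomeBot
Disallow: /

User-agent: *
Disallow: /api

A crawler follows the most specific User-agent group that matches it, so here SomeBot is asked to stay away from the whole site while every other crawler only skips /api.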

Here's a resource to learn more about robots.txt.

Q: What's a reason not to use robots.txt or sitemaps?