Sitemaps & Robots.txt
Whether you like it or not, a lot of bots crawl your web app every day. Here are a few of those crawlers:
- Search Engines like Google
- Open Graph Metadata fetchers like Facebook, Twitter
- Web scraping tools and services
- Malicious users with different agendas
There is very little we can do to prevent these crawlers. But we can give hints to, and partly control, reputable crawlers from Google, Facebook, Twitter, etc.
Here's how we can do that:
With a Sitemap
A sitemap is a list of the pages in your app that hints at how search engines should crawl them. Here are some things you can do with sitemaps:
- Notify search engines of new pages
- Ask them not to crawl certain pages in the future
- Suggest how often to crawl these pages
- Give some pages priority over others
The specification is very simple; have a look at it here.
Here's an example sitemap:
```xml
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>http://www.example.com/</loc>
    <lastmod>2005-01-01</lastmod>
    <changefreq>monthly</changefreq>
    <priority>0.8</priority>
  </url>
</urlset>
```
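If your pages change often, you may not want to hand-write this file. Here's a minimal sketch of generating the sitemap dynamically with Express and TypeScript; the route list, base URL, and port are assumptions for illustration, not part of the spec.

```ts
import express from "express";

const app = express();
const BASE_URL = "https://www.example.com"; // hypothetical base URL

// Hypothetical list of pages we want crawlers to discover.
const pages = [
  { path: "/", changefreq: "monthly", priority: 0.8 },
  { path: "/blog", changefreq: "weekly", priority: 0.6 },
];

app.get("/sitemap.xml", (_req, res) => {
  // Build one <url> entry per page.
  const urls = pages
    .map(
      (p) =>
        `  <url>\n    <loc>${BASE_URL}${p.path}</loc>\n` +
        `    <changefreq>${p.changefreq}</changefreq>\n` +
        `    <priority>${p.priority}</priority>\n  </url>`
    )
    .join("\n");

  res.type("application/xml").send(
    `<?xml version="1.0" encoding="UTF-8"?>\n` +
      `<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n${urls}\n</urlset>`
  );
});

app.listen(3000);
```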
You can submit sitemaps directly to search engines or point to them from your robots.txt file.
With the robots.txt file
First, let's have a look at a sample /robots.txt file:
```
User-agent: *
Disallow: /api
Sitemap: https://yourapp.com/sitemap.xml
```
As you can see, here we are asking crawlers not to crawl any of our API routes, and we also point them to our sitemap.
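If you'd rather not keep a static file around, the same app can serve robots.txt too, which keeps the Disallow rules in sync with your routes. A minimal sketch, again assuming Express and the hypothetical yourapp.com domain:

```ts
import express from "express";

const app = express();

app.get("/robots.txt", (_req, res) => {
  res.type("text/plain").send(
    [
      "User-agent: *",                            // applies to all crawlers
      "Disallow: /api",                           // keep crawlers out of API routes
      "Sitemap: https://yourapp.com/sitemap.xml", // where the sitemap lives
    ].join("\n")
  );
});

app.listen(3000);
```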
Here's a resource to learn more about robots.txt.