A site was hacked and the attackers submitted a fake sitemap to Google. Now bots are trying to access those URLs, which all return 404. How do I block 1,000 or so URLs with no common pattern? I obviously don’t want to block Googlebot.
It’s hard to block or challenge without a pattern. But you can keep Googlebot and other legitimate indexers away from those URLs with a proper robots.txt file (please check Google’s own recommendations for robots.txt). You’ll probably get an error in Google Search Console saying something like “Submitted URL blocked by robots.txt”, but that should go away with time, as Google re-fetches your clean sitemap.
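A minimal sketch of what that robots.txt could look like. The paths here are hypothetical placeholders; you’d list each of the spam URLs (or any common prefixes they share) as its own `Disallow` line:

```
# Applies to all crawlers, including Googlebot
User-agent: *
# One Disallow line per spam path (hypothetical examples)
Disallow: /fake-page-1/
Disallow: /fake-page-2/
Disallow: /fake-page-3/
```

Note that robots.txt only asks well-behaved crawlers to stay away; malicious bots will ignore it, which is why the firewall-rule suggestions below it still matter.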
I’d also create a Firewall Rule to block anyone who is not a legit indexer from your sitemap and robots.txt files.
Yes, already implemented a robots.txt with all the URLs I want to block. What do you mean by:
Probably something like this. Block anything that’s not a known bot from scraping robots.txt or sitemap.xml
Something like this, with a repeat for robots.txt (the secret user-agent will let you visit your sitemap when needed):
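A sketch of such a firewall rule expression, assuming Cloudflare’s `cf.client.bot` field to whitelist known good bots. The user-agent string is a hypothetical placeholder you’d pick yourself; a second rule would do the same for `/robots.txt`:

```
(http.request.uri.path eq "/sitemap.xml"
 and not cf.client.bot
 and http.user_agent ne "my-secret-agent")
```

With the action set to Block, known indexers still fetch the sitemap, and you can too by setting your browser’s user agent to the secret value.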
I’d use a secret query string instead of the user agent.
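The same rule with a secret query string instead of a secret user agent; `key=SECRET` is a hypothetical placeholder:

```
(http.request.uri.path eq "/sitemap.xml"
 and not cf.client.bot
 and not http.request.uri.query contains "key=SECRET")
```

You’d then visit `/sitemap.xml?key=SECRET` yourself, which is arguably easier than changing your browser’s user agent.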
It looks like @yannis wants to block the destination URLs, not robots.txt or sitemaps.
If you know the URLs, the best option would be creating a firewall rule. Depending on the combined length of the URLs you might need more than one rule. If you have a Business plan you could use a regex rule like `(http.request.uri.path matches "(?i)url1|url2|url3")`.
You’d have to make the URL fragments unique enough not to block valid addresses, i.e. make them as long (to be unique) or as short (to save rule length) as the situation allows.
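A sketch of such a regex rule, with hypothetical path names standing in for the real spam URLs. Anchoring each alternative to the start of the path (`^/`) helps avoid accidentally matching valid addresses that merely contain the same substring (note the `matches` operator requires a Business or Enterprise plan):

```
(http.request.uri.path matches "(?i)^/(hacked-path-one|hacked-path-two|hacked-path-three)")
```

With ~1,000 URLs you’d split the list across several rules to stay under Cloudflare’s per-expression length limit.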