I have a service which is used by some clients of mine. I’m using Cloudflare in front of it for proxying.
Yesterday I created a firewall rule in Cloudflare to block all traffic from outside my country. After creating this rule I noticed many requests blocked by Cloudflare, apparently from a Yandex bot (the reverse DNS of the IP address resolves to yandex.com). This bot sends GET requests to my server with the exact routes requested by my front end (Vue), including the search term.
For example, I have a route in Laravel at /api/city. If a client of mine searches for, say, New York, the front end sends a GET request to the API endpoint (/api/city/newyork). A few minutes later this Yandex bot tries to access the same route with the same search term the user entered.
The API is protected by a login prompt.
Can someone please explain how this is possible? How does the Yandex bot know what my client entered, and how can it mirror every request?
This is part of Crawler Hints, which you have enabled in the dashboard. Cloudflare submits URLs to search engines so they can index your website more efficiently.
If you don’t want search engines crawling your API endpoints, you can disallow those paths in your robots.txt.
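For example, assuming the API lives under /api/ as described above, a minimal robots.txt served from the site root could look like this (a sketch; adjust the path to match your routes):

```
# Ask all well-behaved crawlers to skip the API routes
User-agent: *
Disallow: /api/
```

Compliant crawlers such as YandexBot and Googlebot check this file before fetching, so IndexNow-driven fetches of /api/ paths should stop, though robots.txt is only a request, not an enforcement mechanism.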
Thanks for your response. The bots are sending requests with the exact value the user typed; is this expected behaviour too? They are not just trying to access the base API endpoint, and they are sending requests every few minutes. Googlebot crawls my website every day, not every minute, and it accesses only my public endpoints.
The Yandex bot is pretty aggressive. If it receives a notification through IndexNow (which is what Cloudflare’s Crawler Hints uses), it’s likely to try to hit the URL almost immediately. Cloudflare’s documentation on Crawler Hints doesn’t say much, but I believe it fires when someone requests a URL for the first time, or when the response for a URL differs from what it was before.
If your server has content that you don’t want indexed by search engines, make sure it sends appropriate X-Robots-Tag headers with its responses. You can also use the classic robots.txt, but I personally have stopped using it in favor of X-Robots-Tag, even though they do slightly different things. With the X-Robots-Tag header you can tell crawlers not to index the page and/or not to follow any links on the page. robots.txt can prevent crawling, but it doesn’t prevent indexing (search engines WILL index uncrawled pages in certain circumstances), and it prevents your X-Robots-Tag headers from ever being seen, which is why I don’t use robots.txt anymore.
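Since the backend here is Laravel, one way to attach that header to every API response is a small middleware. This is a minimal sketch, not official guidance; the `NoIndex` class name is my own invention, and you would still need to register it on your API route group:

```php
<?php

namespace App\Http\Middleware;

use Closure;
use Illuminate\Http\Request;

// Hypothetical middleware: marks every response it wraps as
// non-indexable and non-followable for compliant crawlers.
class NoIndex
{
    public function handle(Request $request, Closure $next)
    {
        $response = $next($request);

        // Equivalent to a <meta name="robots"> tag, but works for
        // JSON and other non-HTML responses too.
        $response->headers->set('X-Robots-Tag', 'noindex, nofollow');

        return $response;
    }
}
```

The resulting response header is simply `X-Robots-Tag: noindex, nofollow`, which Google, Bing, and Yandex all document support for.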