We want to use Cloudflare to return a 404 to Googlebot: when the URL contains the pattern “_bd_prev_page=” and the user agent contains “google”, the response should have a 404 status.
Use a Redirect Rule where the User-Agent contains (or wildcard-matches) google to redirect (301) to some non-existent URI path. Google would then report a 301 rather than a 404, though.
Use a Cloudflare Worker to read the User-Agent and respond with HTTP 404 (hopefully you don’t have a lot of crawled and indexed URLs; otherwise you can expect a few thousand requests a day with that particular query string).
Use Snippets (Pro plan required).
Use a combination: deploy a custom 404 page with Pages, then redirect requests whose User-Agent contains google to that 404 (sub)domain.
Use WAF rules to block Googlebot from accessing URLs with that particular query string.
addEventListener('fetch', event => {
  event.respondWith(handleRequest(event.request));
});

async function handleRequest(request) {
  // Get the User-Agent from the request headers
  const userAgent = request.headers.get('User-Agent') || '';

  // Get the URL and check for the query parameter '_bd_prev_page='
  const url = new URL(request.url);
  const hasBdPrevPage = url.searchParams.has('_bd_prev_page');

  // Check if both conditions are met: User-Agent contains 'Googlebot' and the query parameter is present
  if (userAgent.includes('Googlebot') && hasBdPrevPage) {
    return new Response('Not found', {
      status: 404,
      statusText: 'Not Found',
    });
  }

  // For all other requests, continue with the original request
  return fetch(request);
}
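Otherwise, if it is possible, bind the Worker to a route such as example.com/?_bd_prev_page=* and use the code below, which leaves out the query check. Note that Worker route patterns generally cannot contain query parameters, so check that your route actually accepts this pattern before relying on it; if it does not, keep the query check from the Worker above.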
addEventListener('fetch', event => {
  event.respondWith(handleRequest(event.request));
});

async function handleRequest(request) {
  // Get the User-Agent from the request headers
  const userAgent = request.headers.get('User-Agent') || '';

  // Check if the User-Agent contains 'Googlebot'
  if (userAgent.includes('Googlebot')) {
    return new Response('Not found', {
      status: 404,
      statusText: 'Not Found',
    });
  }

  // For all other requests, continue with the original request
  return fetch(request);
}
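The Workers above match the literal, case-sensitive string 'Googlebot', while the question asks for any User-Agent containing “google”. If you want the looser match, the check could be pulled into a small helper along these lines. This is only a sketch: the function name shouldReturn404 is made up for illustration, and the looser match will also catch other Google crawlers (for example AdsBot-Google), which may or may not be what you want.

```javascript
// Hypothetical helper, not part of the Workers above: the name and
// signature are assumptions made for illustration.
// Returns true when the URL has a `_bd_prev_page` query parameter and the
// User-Agent contains "google" in any letter case.
function shouldReturn404(requestUrl, userAgent) {
  const url = new URL(requestUrl);
  const hasBdPrevPage = url.searchParams.has('_bd_prev_page');

  // Lowercase the header first so "Googlebot", "googlebot" and "GOOGLE" all match.
  // A missing header (null) is treated as an empty string.
  const isGoogle = (userAgent || '').toLowerCase().includes('google');

  return hasBdPrevPage && isGoogle;
}
```

Inside handleRequest you would then call shouldReturn404(request.url, request.headers.get('User-Agent')) in place of the existing if condition.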
You can create a new Pages project and deploy a custom 404 page to it, then use a Redirect Rule that sends requests there when the User-Agent contains google (or use the wildcard operator nowadays, since it matches case-insensitively).
Otherwise, a better way would be to block Googlebot and/or anyone else from accessing URLs that contain the _bd_prev_page query string.
Furthermore, Workers might get costly for this, since Googlebot may crawl these URLs very frequently. Snippets could achieve the same thing, but you’d need a Pro plan for that. A rule expression along these lines would match the requests:
(http.request.full_uri wildcard "*_bd_prev_page=*" and http.user_agent wildcard "*Google*")
Keep in mind that _bd_prev_page= might trigger a WAF signature for SQL injection. If you see 403s with a message similar to “This website is using a security service to protect itself from online attacks. The action you just performed triggered the security solution. There are several actions that could trigger this block including submitting a certain word or phrase, a SQL command or malformed data.” when running a test such as
curl -sv "https://{your_domain}/test?test&_bd_prev_page=1" -H 'User-Agent: Googlebot/2.1 (+http://www.google.com/bot.html)'
you may need to skip the WAF for these requests in order for your snippet to work.
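For the Snippets route, the same logic would have to be written in module syntax, since Snippets use a default export with a fetch handler rather than addEventListener. A minimal sketch, assuming the Snippet rule itself does not already filter on the query string, so the check stays in the code. The object name snippet is only illustrative.

```javascript
// Sketch of the 404 logic in module syntax. The name `snippet` is an
// assumption made for illustration; only the fetch(request) handler matters.
const snippet = {
  async fetch(request) {
    // Get the User-Agent from the request headers
    const userAgent = request.headers.get('User-Agent') || '';

    // Check for the '_bd_prev_page' query parameter
    const url = new URL(request.url);
    const hasBdPrevPage = url.searchParams.has('_bd_prev_page');

    if (userAgent.includes('Googlebot') && hasBdPrevPage) {
      return new Response('Not found', {
        status: 404,
        statusText: 'Not Found',
      });
    }

    // For all other requests, continue with the original request
    return fetch(request);
  },
};

// In the Snippet itself, this object would be the default export:
// export default snippet;
```

If the rule expression attached to the Snippet already restricts it to matching requests, the two checks inside the handler become redundant but harmless.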