At SiteDistrict, we have a WordPress firewall that protects sites against malicious and other undesired traffic and requests.
One of the main reasons otherwise legitimate services are blocked by our firewall is due to Improper Identification.
Simply put, HTTP requests made to SiteDistrict from services and tools must identify themselves properly, via the User-Agent
HTTP request header.
If you are a developer or provide support for a service or tool, you were probably sent to this page because you're not following the rules here, and one of our customers is having issues, is probably frustrated, and is hoping you can fix it for them.
We often refer our customers to our Our Service or Tool is Being Blocked page first, and if the issue appears to be one of Improper Identification, we will recommend they share this page.
To confirm that your tool or service is being blocked by our firewall, the easiest thing to check for is a 403 HTTP response code. The title of the page returned is "Access Denied." Assuming you are logging your outbound requests, you can probably find this in your logs. You may also try repeating the steps taken by our customer to use your service.
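If it helps, here is a minimal sketch (in Python, using the requests library) of reproducing and detecting the block from a test script. The URL and User-Agent values below are placeholders, not real endpoints:

import requests

# Placeholder values - substitute the actual site URL and whatever User-Agent your service currently sends.
URL = "https://customer-site.example.com/wp-json/wp/v2/posts"
UA = "MyService/1.0 (+https://example.com/my-service)"

resp = requests.get(URL, headers={"User-Agent": UA}, timeout=10)

if resp.status_code == 403 and "Access Denied" in resp.text:
    print("Blocked by the firewall (403 / Access Denied page)")
else:
    print(f"Not blocked here; got HTTP {resp.status_code}")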
Services and tools making requests to sites hosted on SiteDistrict generally must identify themselves properly via the User-Agent
header.
We cover exactly what that means in technical terms a bit farther down.
The main reasons for this rule are:
- The User-Agent HTTP request header remains, at least for now, the most common, universal, and easiest way to identify where a request is coming from. A properly and correctly formatted User-Agent string quickly lets anyone - even those who aren't very technical - make a reasonable conclusion that the request probably comes from the service named in the header.
- Our firewall relies on the User-Agent as one of its main identification criteria. It has proven to be an extremely reliable and effective signal for classifying traffic and blocking attacks. Because WordPress sites are often slow and scale poorly when handling a significant amount of traffic that cannot be served from a page cache, blocking this "bad" traffic is essential for maintaining uptime and performance across sites.
- By sending a proper User-Agent string, you communicate simply and easily who is sending the request. Rather than having (hundreds or thousands of) customers and hosting services add custom code, rules, or IP whitelists to identify or prevent blocking of your requests, you can simply update your one service to send a proper User-Agent string, and everyone wins.
- A regular person should be able to look at the User-Agent string, and within about 30 seconds - using Google if necessary - find your company (and probably your web page) on the Internet.
The most common issues with the User-Agent header include:

- The User-Agent header should be present and should not be blank.
- It should not be something like just Mozilla/5.0, or a random string.
- The User-Agent string sent should not be one sent by a browser, unless it actually is that browser sending the request.
- The User-Agent string should not be copied from a common or outdated browser, and then just have the name of your tool inserted or tacked onto the end.
- It should not impersonate a well-known crawler, such as Google's Googlebot, Microsoft's bingbot, etc.
- It should not be the default User-Agent string from common HTTP libraries. For example, requests with a User-Agent string like Go-http-client/1.1, python-requests/2.31.0, Java/1.8.0_265, curl/7.68.0, axios/1.6.8, node-fetch, etc. are not OK.

See this page from Mozilla for more information regarding the User-Agent string.
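As a concrete illustration of the last point in the list above, here is a minimal sketch (Python, using the requests library) of replacing the library's default python-requests/x.y.z value with one that actually identifies your service; the service name, version, and URLs are placeholders:

import requests

# Without an explicit header, this library sends something like "python-requests/2.31.0",
# which does not identify your service and may be blocked.
session = requests.Session()

# Placeholder values - use your real service name, version, and info page URL.
session.headers["User-Agent"] = "MyService/1.0 (+https://example.com/my-service)"

# Every request made through this session now carries the identifying header.
resp = session.get("https://customer-site.example.com/")
print(resp.status_code)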
As documented on the MDN page linked above, a proper User-Agent
string typically includes several elements that look like this:
<product>/<product-version> <comment>
For a browser, such as Microsoft Edge, you might see a User-Agent
string like this:
Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36 Edg/91.0.864.59
The different technologies and versions include:
- Mozilla: 5.0
- AppleWebKit: 537.36
- Chrome: 91.0.4472.124
- Safari: 537.36
- Edg: 91.0.864.59
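As a rough illustration of the <product>/<product-version> <comment> structure, this small sketch (Python) pulls the product/version pairs and the comments out of the example string above. It is a simplified parse, not a full User-Agent grammar:

import re

UA = ("Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
      "(KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36 Edg/91.0.864.59")

# Products look like "Name/version"; comments are the parenthesized groups.
products = re.findall(r"([A-Za-z][\w.-]*)/([\w.]+)", UA)
comments = re.findall(r"\(([^)]*)\)", UA)

print(products)  # [('Mozilla', '5.0'), ('AppleWebKit', '537.36'), ('Chrome', '91.0.4472.124'), ...]
print(comments)  # ['Windows NT 10.0; Win64; x64', 'KHTML, like Gecko']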
While providers do not always follow the format strictly, you should not include a technology in your User-Agent string unless your requests are actually based on it. Browsers can get away with that to a degree, but services and most bots should not.
Do NOT use a browser User-Agent
string for your service, crawler, or tool. Instead, see the next section.
For a service, tool, or crawler, a format similar to a browser might be used, but with additional technology tags to identify the service, and often, a URL is included in a comment.
Most services, however, use a simpler format, such as this one from Stripe:
Stripe/1.0 (+https://stripe.com/docs/webhooks)
Here we see the <product> is specified as Stripe and the <product-version> is specified as 1.0.
In addition, the comment contains just a URL to a page that provides more information on what requests are being sent.
Typically, the best and safest thing to do is to send something like this for your User-Agent:
MyService/1.0 (+https://example.com/my-service)
where MyService is the name of your tool or service, 1.0 represents the version of the service or tool, and https://example.com/my-service is a URL of a page that describes your tool or service and what kind of requests it makes.
The <comment>
part of the User-Agent
should ideally be present, and contain a URL to a page on your website that explains where the requests are coming from, and why they are being made.
If you do not have a page describing your service, a link to the main page of your website is a decent alternative.
Some examples of bot pages include the ones for Applebot, Googlebot, bingbot, and Slackbot.
The SiteDistrict bot information page, for our own crawlers and bots, can be found here.
As outlined, setting the User-Agent
HTTP request header is considered the best way to identify your tool, crawler, bot, or service.
What is NOT considered good enough?
There are other ways of identifying your requests, or even of verifying that they are authentically from your service. These are often helpful, but they are not sufficient to meet our requirements for proper identification.
Some examples, and why:
- PTR record / hostname: Setting up a DNS PTR record containing your domain name, for reverse DNS lookups of your IP address(es), is great, but not good enough. It requires an extra DNS lookup, lacks the details specified in the User-Agent, and isn't readily accessible to all who may view the request or logs.
- AS / network name: Requests can sometimes be attributed to an AS (Autonomous System) number or network name, but this also requires extra lookups and lacks the details of a proper User-Agent string. Also, if actual humans use browsers on the same AS, or others may relay or send traffic from the same AS network, the AS becomes even less useful for identification.

If you believe your tool or service is properly identifying itself, according to the criteria on this page, please Contact Us, and we will review.
Please provide relevant details about your service and requests, such as the User-Agent
sent, source IP addresses, approximate times when the requests were sent, and any other relevant information.
Note: Providing IP addresses is strictly for purposes of investigation, NOT for whitelisting. See the next sections about this practice.
If your system allows your customers and users to change or customize the User-Agent
themselves, and doing so will allow them to set the User-Agent
to a value that meets the above criteria, please provide them with the necessary instructions, or a link to your documentation that covers this.
For services that support this, this is often the easiest way to resolve this issue.
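As one possible approach, here is a minimal sketch (Python; the names are hypothetical) of a service that lets its users override the User-Agent through an environment variable while still defaulting to a proper value:

import os
import requests

# Hypothetical names - adapt to your own configuration or settings system.
DEFAULT_UA = "MyService/1.0 (+https://example.com/my-service)"
user_agent = os.environ.get("MYSERVICE_USER_AGENT", DEFAULT_UA)

session = requests.Session()
session.headers["User-Agent"] = user_agent
resp = session.get("https://customer-site.example.com/")
print(resp.status_code)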
Requests can sometimes be matched up to specific IP addresses or CIDR blocks, but this doesn't mean those are suitable for identification purposes. The following sections cover these topics.
It's very common for services to provide a list of IP addresses from which their traffic originates, and to ask hosting providers to "whitelist these IP addresses".
Do NOT do this.
The problems with IP whitelisting include:
There is a place where you should provide the IP addresses used by your service, if relevant: the page at the URL that you include in the User-Agent header for requests sent by your service.
By providing details about the User-Agent
, IP addresses, network names / AS numbers, and reverse DNS records on this page, you provide sufficient information to hosts and others that handle your requests, so they may identify them reliably, and figure out on their own the best way to ensure your traffic is legitimate and not blocked.
While it is true that the User-Agent
can be easily faked, this is not a justification for not using it to properly identify your service.
In practice, while fake User-Agent
strings that pretend to be from a browser are common, pretending to be from a service is much less common.
Exceptions include requests claiming to be from Googlebot
, bingbot
, Facebook, and other large companies. Requests pretending to be from these services can indeed be significant.
In such cases, we will often block traffic if the source IP does not match the list of official IP addresses provided by the service, or the domain from the reverse DNS lookup does not match.
In such cases, IP lists are indeed used - but to block traffic, not to allow it.
This is not something that you as a service provider need to worry about. We do this as seldom as possible, as it introduces additional complexity & maintenance costs. It is also the main reason we avoid whitelisting IP addresses.
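For context, the reverse DNS check mentioned above generally works like the following sketch (Python, standard library only). This illustrates forward-confirmed reverse DNS in general, not our exact firewall logic:

import socket

def looks_like_googlebot(client_ip: str) -> bool:
    """Forward-confirmed reverse DNS: PTR lookup, suffix check, then forward lookup."""
    try:
        hostname, _, _ = socket.gethostbyaddr(client_ip)  # reverse (PTR) lookup
    except socket.herror:
        return False
    if not hostname.endswith((".googlebot.com", ".google.com")):
        return False
    try:
        forward_ips = socket.gethostbyname_ex(hostname)[2]  # forward lookup
    except socket.gaierror:
        return False
    return client_ip in forward_ips

# Result depends on live DNS: a genuine Googlebot IP should pass,
# while an unrelated IP claiming to be Googlebot will fail the check.
print(looks_like_googlebot("66.249.66.1"))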
Some of the most common, yet "bad" excuses we hear:
When it comes to tools and services not behaving properly, we are also reminded of this children's book: What if everyone did that?
Fixing "bad" behavior and following certain best practices makes the Internet a better place for everyone. Other servers that process requests from these tools & services will often be able to more easily identify & debug the requests as well.
There are also literally thousands of other services that already are "doing this right". Get with the program. Be a better Internet citizen.
On a similar theme, one reason that we avoid adding exceptions for different services is that it does not scale.
The rules employed by our firewall are carefully crafted to be simple and efficient. Adding exceptions makes it harder to maintain, and can reduce security while increasing the risk of other issues.
What might seem like a simple request to add an exception may have implications and repercussions that you are not aware of, some of which can cause additional unanticipated issues for you and other SiteDistrict customers.
If a customer reaches out to our support, and we believe that their issue is due to requests from a service or tool being blocked, we will typically send them to our Our Service or Tool is Being Blocked page first.
That page may bring them here, or, if we can already tell that the reason seems to be a service not identifying itself properly, we may provide them a link to this page right away.
If they contact support for one of these services, and they are unable to get the issue resolved, we actively encourage our customers to consider one or more of the following:
We also remind customers that if the service provider is not responding or resolving the issue in a timely manner, some possible reasons are:
If your service or tool does not meet our criteria for identifying itself properly via the User-Agent
string, and you have decided not to change it, then please send our mutual customer the following email:
I'm sorry, but we cannot update our service to comply with your hosting provider's requirements.
- We are currently sending the following User-Agent string(s) with our requests:
- <list the values>
- We will not make the changes necessary because ______ .
Of course, if you think you are sending a proper User-Agent
, but your requests are still being blocked, then please see our other page, Our Service or Tool is Being Blocked, refer the customer back here, and/or Contact Us.
We are constantly working to create a better, safer Internet for all. Our firewall is a critical part of our infrastructure and protects sites by blocking requests and attacks that would otherwise cause performance issues or downtime. Ensuring that external tools and services behave properly allows us to continue to provide efficient and high-quality support.