
Service and Crawler Request Identification Rule

External services, crawlers, bots, and spiders accessing sites on SiteDistrict must follow this rule.

At SiteDistrict, we run a WordPress firewall that protects sites against malicious and otherwise undesired traffic and requests.

One of the main reasons otherwise legitimate services get blocked by our firewall is Improper Identification.

Overview

Simply put, HTTP requests made to SiteDistrict from services and tools must identify themselves properly, via the User-Agent HTTP request header.

If you are a developer or provide support for a service or tool, you were probably sent to this page because it is not following the rules described here, and one of our customers is having issues, is probably frustrated, and is hoping you can fix it for them.

We often refer our customers to our Our Service or Tool is Being Blocked page first, and if the issue appears to be one of Improper Identification, we recommend they share this page with you.

Blocked Requests

Verifying a Block

To confirm that your tool or service is being blocked by our firewall, the easiest thing to check for is this:

  • HTTP 403 Status Code: Our firewall returns an HTTP status code of 403. The title of the page returned is "Access Denied."

Assuming you are logging your outbound requests, you can probably find this in your logs. You may also try repeating the steps taken by our customer to use your service.
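
You can also check directly. Here is a minimal sketch of such a test using Python's requests library; the site URL and User-Agent value are placeholders for whatever your service actually uses.

# Minimal sketch: check whether a request with your service's current
# User-Agent is blocked by the firewall. SITE_URL and CURRENT_USER_AGENT
# are placeholders for your own values.
import requests

SITE_URL = "https://customer-site.example.com/"    # hypothetical customer site
CURRENT_USER_AGENT = "python-requests/2.31.0"      # whatever your service sends today

resp = requests.get(SITE_URL, headers={"User-Agent": CURRENT_USER_AGENT}, timeout=10)

if resp.status_code == 403 and "Access Denied" in resp.text:
    print("Blocked by the firewall (HTTP 403, 'Access Denied').")
else:
    print(f"Not blocked (HTTP {resp.status_code}).")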

Identification Rule

Services and tools making requests to sites hosted on SiteDistrict must generally identify themselves properly via the User-Agent header.

We cover exactly what that means in technical terms a bit farther down.

Reasoning

The main reasons for this rule are:

  • Easy for Everyone: The User-Agent HTTP request header remains, at least for now, the most common, universal, and easiest way to identify where a request is coming from. A properly formatted User-Agent string quickly lets anyone - even those who aren't very technical - reach a reasonable conclusion that the request probably comes from the service named in the header.
  • Analytics & Logs: At SiteDistrict we have what is probably some of the best user-facing analytics and log viewing dashboards available, at least within the WordPress hosting market. These tools let our customers and our support staff quickly and easily understand the traffic to their sites, and allow us and them to gain insights and diagnose issues far more quickly than is possible on other platforms, where issues and traffic patterns can remain undetected, or where review can be extremely technical and time consuming. Properly mapping requests to the correct location, network, browser, crawler, etc. is essential to keeping these tools accurate and useful.
  • Security & Performance: One of the main reasons the SiteDistrict firewall is strict about having services identify themselves is that a significant amount of "bad" traffic to the sites we host fails the User-Agent identification criteria. It has proven to be an extremely reliable and effective signal for classifying traffic and blocking attacks. Because WordPress sites are often slow and scale poorly when handling a significant amount of traffic that cannot be served from a page cache, blocking this "bad" traffic is essential for maintaining uptime and performance across sites.
  • Simplicity & Universal Benefit: By identifying your service properly via the User-Agent string, you communicate simply and clearly who is sending the request. Rather than having (hundreds or thousands of) customers and hosting services add custom code, rules, or IP whitelists to identify or unblock your requests, you can simply update your one service to send a proper User-Agent string, and everyone wins.

A regular person should be able to look at the User-Agent string, and within about 30 seconds - using Google if necessary - find your company (and probably your web page) on the Internet.

User-Agent Issues

The most common issues with the User-Agent header include:

  • Blank User-Agent: This header should be present and should not be blank.
  • Invalid User-Agent: This should not be just Mozilla/5.0, or a random string.
  • Browser Impersonation: The User-Agent string sent should not be one sent by a browser, unless it actually is that browser sending the request.
  • Bad Browser-Tool Hybrid: Similar to the previous issue, the User-Agent string should not be copied from a common or outdated browser with the name of your tool inserted or tacked onto the end.
  • Fake Search Engine Bot: Tools & services should not "impersonate" Google's Googlebot, Microsoft's bingbot, etc.
  • Generic HTTP Lib / Tool: Requests should not use the default User-Agent string from common HTTP libraries. For example, requests with a User-Agent string like Go-http-client/1.1, python-requests/2.31.0, Java/1.8.0_265, curl/7.68.0, axios/1.6.8, node-fetch, etc. are not OK.

Fixing the User-Agent

See this page from Mozilla for more information regarding the User-Agent string.

Technologies

As documented on the MDN page linked above, a proper User-Agent string typically includes several elements that look like this:

<product>/<product-version> <comment>

Browser User-Agent strings

For a browser, such as Microsoft Edge, you might see a User-Agent string like this:

Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36 Edg/91.0.864.59

The different technologies and versions include:

  • Mozilla : 5.0
  • AppleWebKit : 537.36
  • Chrome : 91.0.4472.124
  • Safari : 537.36
  • Edg : 91.0.864.59

While providers don't always follow this strictly, you should not include a technology in your User-Agent string unless your requests are actually built on it. Browsers can get away with that to a degree, but services and most bots should not.

Do NOT use a browser User-Agent string for your service, crawler, or tool. Instead, see the next section.

Service / Tool Example

For a service, tool, or crawler, a format similar to a browser's might be used, but with additional technology tags to identify the service, and often a URL included in a comment.

Most services however use a simpler format, such as this one from Stripe:

Stripe/1.0 (+https://stripe.com/docs/webhooks)

Here we see the <product> is specified as Stripe and the <product-version> is specified as 1.0. In addition, the comment contains just a URL to a page that provides more information on what requests are being sent.

Constructing the User-Agent

Typically, the best and safest thing to do is to send something like this for your User-Agent:

MyService/1.0 (+https://example.com/my-service)

where MyService is the name of your tool or service, 1.0 represents the version of the service or tool, and https://example.com/my-service is a URL of a page that describes your tool or service and what kind of requests it makes.
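
For illustration, here is a minimal sketch of sending this header from a service built with Python's requests library; the product name, version, and info URL are the same placeholders described above.

# Minimal sketch: send an identifying User-Agent on every outbound request
# using Python's requests library. The product name, version, and info URL
# are placeholders, as described above.
import requests

USER_AGENT = "MyService/1.0 (+https://example.com/my-service)"

session = requests.Session()
session.headers.update({"User-Agent": USER_AGENT})

# Every request made through this session now identifies the service.
resp = session.get("https://customer-site.example.com/wp-json/wp/v2/posts")
print(resp.status_code, resp.headers.get("content-type"))

Setting the header once on a shared session (or the equivalent in your HTTP library or framework) is usually better than overriding it per request, since every request your service makes is then identified.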

Bot / Crawler info page / URL

The <comment> part of the User-Agent should ideally be present, and contain a URL to a page on your website that explains where the requests are coming from, and why they are being made.

If you do not have a page describing your service, a link to the main page of your website is a decent alternative.

Some examples of bot pages include the ones for Applebot, Googlebot, bingbot, and Slackbot.

The SiteDistrict bot information page, for our own crawlers and bots, can be found here.

Not good enough

As outlined, setting the User-Agent HTTP request header is considered the best way to identify your tool, crawler, bot, or service.

What is NOT considered good enough?

There are other ways of identifying your requests, or even verifying they are authentically from your service. These are often helpful, but they are not sufficient to meet our requirements for proper identification.

Some examples, and why:

  • Reverse DNS / PTR record / hostname: Setting up a DNS PTR record containing your domain name, for reverse DNS lookups of your IP address(es), is great, but not good enough. It requires an extra DNS lookup, lacks the details a User-Agent string can carry, and isn't readily accessible to everyone who may view the request or logs.
  • Custom HTTP request header: Sending a custom HTTP request header is also not good enough. Hosts should not, and do not want to, keep track of hundreds or possibly thousands of different custom HTTP request headers, and figure out how to translate those into a service name and URL.
  • Recognizable URL: Many services make requests to a specific URL. Sometimes, this URL can be used to "guess" where a request originates from. However, the URL actually represents the destination of the request, not the source. Don't expect others to rely on it for identification.
  • Autonomous System (AS): Sending requests from your own AS is certainly a way to identify the source network for requests. However, it lacks the information that can be present in a User-Agent string. And if actual humans use browsers on the same AS, or others relay or send traffic from the same AS network, the AS becomes even less useful for identification.

False Positives

If you believe your tool or service is properly identifying itself, according to the criteria on this page, please Contact Us, and we will review.

Please provide relevant details about your service and requests, such as the User-Agent sent, source IP addresses, approximate times when the requests were sent, and any other relevant information.

Note: Providing IP addresses is strictly for purposes of investigation, NOT for whitelisting. See the next sections about this practice.

User Configurable Settings

If your system allows your customers and users to change or customize the User-Agent themselves, so that they can set it to a value that meets the above criteria, please provide them with the necessary instructions, or a link to the documentation that covers this.

For services that support this, it is often the easiest way to resolve the issue.
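
Purely as a hypothetical sketch, such a setting might look something like this in a tool written in Python; the user_agent setting name, the function, and the default value are illustrative, not any specific product's API.

# Hypothetical sketch: a tool exposing a user-configurable User-Agent.
# The "user_agent" setting name and default value are illustrative only.
DEFAULT_USER_AGENT = "MyTool/2.3 (+https://example.com/my-tool-bot)"

def resolve_user_agent(settings: dict) -> str:
    # Use the customer's override if provided, otherwise the tool's own
    # identifying default - never a blank or browser-like string.
    return settings.get("user_agent") or DEFAULT_USER_AGENT

print(resolve_user_agent({"user_agent": "AcmeMonitor/1.0 (+https://acme.example/bot)"}))
print(resolve_user_agent({}))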

IP Lists / Fake User-Agents

Requests can sometimes be matched up to specific IP addresses or CIDR blocks, but that doesn't mean those are suitable for identification purposes. The sections below cover these topics.

IP Whitelisting

It's very common for services to provide a list of IP addresses from which their traffic originates, and to ask hosting providers to "whitelist these IP addresses".

Do NOT do this.

The problems with IP whitelisting include:

  • Unnecessary Work: Stop making your customers do extra work. If they host with a platform that does web security well, IP whitelisting should typically not be necessary, provided you identify your requests properly.
  • Rented IP Addresses: Many IP addresses are rented from cloud providers such as Google Cloud, Amazon Web Services, DigitalOcean, and Hetzner, to name just a few. While you may intend to keep an IP address for a long time, it could still change in the future.
  • Not Your Place: To "whitelist" something is to jump past the problem definition to a solution. The right solution depends on a complete understanding of the problem, which includes a full understanding of how our firewall works. Therefore, it's our job. Instead, you should provide us with all the details necessary to diagnose and understand the problem on our end, so we can determine the best solution.

Bot / Service Info Pages

There is a place where you should provide the IP addresses used by your service, if relevant: the page at the URL you include in the User-Agent header sent with your service's requests.

By listing the User-Agent, IP addresses, network names / AS numbers, and reverse DNS records on this page, you give hosts and others that handle your requests enough information to identify them reliably, and to figure out on their own the best way to ensure your traffic is treated as legitimate and not blocked.

Fake User-Agents

While it is true that the User-Agent can be easily faked, this is not a justification for not using it to properly identify your service.

In practice, while fake User-Agent strings that pretend to be from a browser are common, pretending to be from a service is much less common.

Exceptions include requests claiming to be from Googlebot, bingbot, Facebook, and other large companies. Requests pretending to be from these services can indeed be significant.

In such cases, we will often block traffic if the source IP does not match the list of official IP addresses provided by the service, or the domain from the reverse DNS lookup does not match.

In such cases, IP lists are indeed used - but to block traffic, not to allow it.

This is not something that you as a service provider need to worry about. We do this as seldom as possible, as it introduces additional complexity and maintenance costs. It is also the main reason we avoid whitelisting IP addresses.
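
For context only, the check described above is commonly known as forward-confirmed reverse DNS. A rough sketch of the general technique (an illustration, not our actual firewall code) looks like this:

# Rough sketch of forward-confirmed reverse DNS (FCrDNS), the kind of check
# described above for requests claiming to be Googlebot. This illustrates
# the general technique only; it is not SiteDistrict's firewall code.
import socket

def verified_googlebot(ip, allowed_suffixes=(".googlebot.com", ".google.com")):
    try:
        hostname, _, _ = socket.gethostbyaddr(ip)               # reverse lookup: IP -> hostname
        if not hostname.endswith(allowed_suffixes):
            return False
        _, _, forward_ips = socket.gethostbyname_ex(hostname)   # forward lookup: hostname -> IPs
        return ip in forward_ips                                 # must resolve back to the original IP
    except OSError:
        return False

print(verified_googlebot("66.249.66.1"))   # an address within Google's published crawler ranges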

Additional Topics

Bad Excuses

Some of the most common, yet "bad" excuses we hear:

  • Other customers don't have a problem: At SiteDistrict, we pride ourselves on performance, security, and reliability. Most sites on SiteDistrict do not run any WordPress security or firewall plugins, nor do we recommend them. While your service might work with a different hosting provider, sites on that host may also be far more vulnerable to attacks, spam, or performance degradation. In fact, if you have had unexplained performance issues at other hosts, consider that some of those could have been due to unmitigated attacks.
  • It will take time: At SiteDistrict, we often respond to support requests within minutes, or at least within a few hours, and issues are often resolved within 24-48 hours. Our team is small and quite busy, but we still prioritize resolving issues for customers, and we engineer solutions so that we don't have to handle the same support issue multiple times. If we can do it, why can't you? Regardless of the answer, it will make your service or support look bad if you don't respond and address the issue in a timely, competent fashion. See the Customer Guidance section below for what we tell customers if they are struggling with your support.

Doing the Right Thing

When it comes to tools and services not behaving properly, we are also reminded of this children's book: What if everyone did that?

Fixing "bad" behavior and following certain best practices makes the Internet a better place for everyone. Other servers that process requests from these tools & services will often be able to more easily identify & debug the requests as well.

There are also literally thousands of other services that already are "doing this right". Get with the program. Be a better Internet citizen.

Scalability

On a similar theme, one reason we avoid adding exceptions for different services is that it does not scale.

The rules employed by our firewall are carefully crafted to be simple and efficient. Adding exceptions makes them harder to maintain, and can reduce security while increasing the risk of other issues.

What might seem like a simple request to add an exception may have implications and repercussions that you are not aware of, some of which can cause additional unanticipated issues for you and other SiteDistrict customers.

Customer Guidance

If a customer reaches out to our support, and we believe that their issue is due to requests from a service or tool being blocked, we will typically send them to our Our Service or Tool is Being Blocked page first.

That page may bring them here, or, if we can already tell that the problem is a service not identifying itself properly, we may provide them a link to this page right away.

Our Advice

If they contact support for one of these services, and they are unable to get the issue resolved, we actively encourage our customers to consider one or more of the following:

  1. Escalate support tickets until they are talking with someone that "gets it"
  2. Search for alternative services or competitors
  3. Cancel service or request a refund
  4. Consider if they really need the service, stop using it if not

Support Difficulties

We also remind customers that if the service provider is not responding or resolving the issue in a timely manner, some possible reasons are:

  1. They don't actually value you that much as a customer
  2. They don't care that most of the rest of the Internet does this correctly, but they do not
  3. Their code is fragile or their engineering team is not terribly competent
  4. They don't think it's a priority. See 1.
  5. They think SiteDistrict is actually wrong to require that requests be identifiable

Refusal to Fix

If your service or tool does not meet our criteria for identifying itself properly via the User-Agent string, and you have decided not to change it, then please send our mutual customer the following email:

I'm sorry, but we cannot update our service to comply with your hosting provider's requirements.

  1. We are currently sending the following User-Agent string(s) with our requests:
    • <list the values>
  2. We will not make the changes necessary because ______ .

Of course, if you think you are sending a proper User-Agent but your requests are still being blocked, then please see our Our Service or Tool is Being Blocked page, refer the customer back here, and/or Contact Us.

Last Words

We are constantly working to create a better, safer Internet for all. Our firewall is a critical part of our infrastructure and protects sites by blocking requests and attacks that would otherwise cause performance issues or downtime. Ensuring that external tools and services behave properly allows us to continue to provide efficient and high-quality support.

 
