It can be due to a wide variety of reasons. From our first-hand experience with customers raising this kind of issue with us, it may well be down to one of the reasons listed below.
Spiders and Bots.
Bots from Google and Bing comply with robots.txt exclusion rules. However, being friendly bots doesn’t mean they won’t crawl your website too often, which costs you bandwidth. In the case of Google, you can contact them about it and they will reduce how often Google’s bots crawl your website, which in turn reduces the bandwidth that crawling uses.
Unfortunately, not all bots or search spiders from less popular search providers will comply with robots.txt exclusion rules (if you have them in place), or reduce how often they crawl your website when you contact the company about it. We’ve had complaints from customers about search spiders for the Baidu search engine crawling their websites too often and using a huge amount of bandwidth, and if you search for this, it’s a pretty well-reported problem among webmasters.
How do I find out whether a spider/bot is using a lot of my account’s bandwidth?
You can check the statistics available in Awstats, which gives real-time statistics regarding the traffic to your website. Awstats is available to both whUK cPanel Web Hosting and whUK Windows Web Hosting customers.
Awstats can provide specific information as to what identified and unidentified spiders and bots are visiting your website and how much data has been transferred as a result (bandwidth). You’ll also find the IPs of these bots and search spiders in Awstats, in case you wish to block a specific IP or IP range.
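If you also have access to your raw access logs, a similar per-bot breakdown can be approximated by hand. The sketch below is not how Awstats works internally. It assumes the logs use the Apache "combined" format, and the sample lines, IP addresses and the bandwidth_by_agent helper are made up purely for illustration.

```python
import re
from collections import Counter

# Matches the Apache "combined" log format (an assumption about your log layout).
LOG_PATTERN = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[[^\]]+\] "[^"]*" '
    r'(?P<status>\d{3}) (?P<size>\d+|-) "[^"]*" "(?P<agent>[^"]*)"'
)


def bandwidth_by_agent(lines):
    """Sum the response bytes per user-agent string (hypothetical helper)."""
    totals = Counter()
    for line in lines:
        match = LOG_PATTERN.match(line)
        if match is None:
            continue  # skip lines that are not in the expected format
        size = match.group("size")
        # A size of "-" means no response body was sent.
        totals[match.group("agent")] += 0 if size == "-" else int(size)
    return totals


# Illustrative sample data only; the IPs come from documentation-reserved ranges.
SAMPLE_LOG = [
    '203.0.113.7 - - [10/Oct/2023:13:55:36 +0000] "GET / HTTP/1.1" 200 5120 "-" '
    '"Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)"',
    '203.0.113.7 - - [10/Oct/2023:13:55:40 +0000] "GET /about HTTP/1.1" 200 2048 "-" '
    '"Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)"',
    '198.51.100.23 - - [10/Oct/2023:13:56:01 +0000] "GET / HTTP/1.1" 200 1024 "-" '
    '"Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)"',
    '192.0.2.44 - - [10/Oct/2023:13:57:12 +0000] "GET /logo.png HTTP/1.1" 304 - "-" '
    '"Mozilla/5.0 (X11; Linux x86_64)"',
    "this line is not a valid log entry",
]

if __name__ == "__main__":
    # Print the heaviest consumers first.
    for agent, total in bandwidth_by_agent(SAMPLE_LOG).most_common():
        print(f"{total:>8} bytes  {agent}")
```

Sorting by total makes it easy to see at a glance whether a single crawler dominates your transfer.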
We would only recommend you block an IP range if doing so will not affect you. For example, if you have a business website and a large portion of your legitimate traffic comes from an IP range you are thinking of blacklisting, then blocking that range would cut off those visitors too. Blocking a single IP alone may not be enough, but it is worth trying to see whether it improves bandwidth usage.
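Before blocking a range, it can help to check which of your known visitor IPs would be caught by it. The following is a minimal sketch using Python's standard ipaddress module. The addresses_in_range helper, the visitor list and the 203.0.113.0/24 range are hypothetical placeholders, not real crawler ranges.

```python
import ipaddress


def addresses_in_range(addresses, cidr):
    """Return the addresses that fall inside the given CIDR range (hypothetical helper)."""
    # strict=False accepts ranges written with host bits set, e.g. 203.0.113.5/24.
    network = ipaddress.ip_network(cidr, strict=False)
    return [addr for addr in addresses if ipaddress.ip_address(addr) in network]


# Illustrative visitor IPs taken from documentation-reserved ranges.
VISITORS = ["203.0.113.7", "203.0.113.200", "198.51.100.23"]

if __name__ == "__main__":
    affected = addresses_in_range(VISITORS, "203.0.113.0/24")
    print(f"{len(affected)} of {len(VISITORS)} visitors would be blocked: {affected}")
```

If the affected list contains addresses you recognise as genuine customers, a narrower range or a single-IP block is probably the safer choice.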
How can I stop spiders and bots from crawling my website too often?
In the case of the Google and Bing search engines, contacting them about it may well resolve the issue. Google does reduce how often it crawls your website if you contact them with a legitimate concern.
However, for search engines like Baidu, which is a Chinese search provider and the most popular search engine in China, you may need to consider blocking their IP range or the entire country if other measures do not work first. It’s fair to say that Baidu, like many other search engines, will likely have search bots in many different countries, but blocking IP ranges may still help alleviate the issue.
While we cannot verify whether Baidu’s search spiders comply with robots.txt exclusions or not, if you search about it, it becomes questionable as to whether Baidu does actually comply with the voluntary robots exclusion rules.
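For crawlers that do honour the voluntary rules, it is worth checking that your robots.txt says what you think it says. The sketch below uses Python's standard urllib.robotparser module to test a hypothetical rule set. The paths, the delay value and the example.com URL are assumptions for illustration, and Crawl-delay is a non-standard directive that not every crawler respects.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; adjust the agents and paths to your own site.
ROBOTS_TXT = """\
User-agent: Baiduspider
Disallow: /

User-agent: bingbot
Crawl-delay: 10
Disallow: /private/

User-agent: *
Disallow: /private/
"""


def build_parser(text):
    """Parse robots.txt text into a RobotFileParser (hypothetical helper)."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


if __name__ == "__main__":
    parser = build_parser(ROBOTS_TXT)
    url = "https://www.example.com/index.html"
    for agent in ("Baiduspider", "bingbot", "Googlebot"):
        print(agent, "allowed:", parser.can_fetch(agent, url))
    print("bingbot crawl delay:", parser.crawl_delay("bingbot"))
```

This only tells you what a well-behaved crawler should do with your rules. It cannot tell you whether a given bot actually obeys them.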
Are search engines required to comply with robots.txt exclusions?
Not really. It’s more of an ethical than a legal issue. While we’re not lawyers and we could be wrong, as far as we’re aware, search engines do not need to comply with robots.txt exclusion rules at all. Your last-resort remedy is to blacklist the culprit IPs.
If you require any assistance with this particular issue, you are welcome to contact our technical support department.
As well as search spiders possibly being the root cause of bandwidth loss, there are a few other things you need to look out for. One is images not being compressed and optimised. Compressing and optimising your images means they both load faster on your website and save you bandwidth (which saves you money!).
You can use free software utilities to do this. On Macs, you can use ImageOptim. On Linux distributions you can use Trimage (also available in the Ubuntu Software Centre). However, there are plenty of web-based utilities that do the same thing.
Compressing your CSS isn’t going to give you the most impressive savings, but every little helps. There are free online tools that compress your CSS code by removing all unneeded whitespace between your CSS selectors, properties and property values.
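To give a rough idea of what such tools do, here is a deliberately naive sketch of whitespace removal in Python. The minify_css function is a made-up illustration rather than any particular tool's method, and it does not protect quoted strings or url() values, so a dedicated minifier is the better choice for real stylesheets.

```python
import re


def minify_css(css):
    """Strip comments and unneeded whitespace from CSS (naive, illustrative only)."""
    css = re.sub(r"/\*.*?\*/", "", css, flags=re.DOTALL)  # remove comments
    css = re.sub(r"\s+", " ", css)                        # collapse runs of whitespace
    css = re.sub(r"\s*([{};,])\s*", r"\1", css)           # trim around braces, ; and ,
    css = re.sub(r":\s+", ":", css)                       # trim after colons
    css = css.replace(";}", "}")                          # drop the last semicolon in a block
    return css.strip()


SAMPLE_CSS = """
/* main heading */
h1 {
    color: red;
    margin: 0 auto;
}
"""

if __name__ == "__main__":
    minified = minify_css(SAMPLE_CSS)
    print(minified)
    print(f"{len(SAMPLE_CSS)} characters reduced to {len(minified)}")
```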
Both techniques can save you a sizable portion of bandwidth especially if you have a high-traffic website.
Using the jQuery library?
If you’re using the jQuery library, always use the minified version. It works exactly the same as the full version but is much smaller in size because it is essentially compressed. However, if you don’t want to use up bandwidth serving the jQuery library from your own server, why not take advantage of the copies hosted on Google’s, Microsoft’s or jQuery’s servers through their Content Delivery Networks? The only difference is that the src attribute points to Google’s, Microsoft’s or jQuery’s servers, where the library is hosted for you to use.