internet filter database The best URL database and internet filter    
for higher productivity & less bandwidth usage    
Comparison of methods for blocking unwanted websites

Filtering methods

There are three filtering methods that block unwanted web content:
  • content scanning: block a web page if it contains one or more "bad" words
  • artificial intelligence: an improved version of content scanning
  • blacklist: block sites based on a list of categorised websites
The following criteria are used for the comparison of the filtering methods:
  • user experience: the method must be sufficiently fast for the individual user.
  • wrong blocking I: a site about breast cancer should not be blocked as if it were a site about sex. This is called overblocking, since the site should not have been blocked.
  • wrong blocking II: a site with sexual content should be blocked. If the filter fails to block it, this is called underblocking.
  • block https: the https protocol is an encrypted protocol intended for security and privacy. Because the protocol uses encryption, the content cannot be scanned for words or phrases.
  • infrastructure costs: the components are bandwidth usage and computing power.
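
The two "wrong blocking" criteria can be turned into numbers. The following is a minimal sketch, assuming you have a hand-labelled sample of sites and the filter's decision for each one; the `Result` type, the `error_rates` function and the `.example` hostnames are hypothetical and only serve as illustration.

```python
from dataclasses import dataclass


@dataclass
class Result:
    url: str
    unwanted: bool  # ground truth: should this site be blocked?
    blocked: bool   # what the filter actually did


def error_rates(results):
    """Return (overblocking rate, underblocking rate) for a labelled sample."""
    wanted = [r for r in results if not r.unwanted]
    unwanted = [r for r in results if r.unwanted]
    # Overblocking: wanted sites that were blocked anyway.
    overblocked = sum(r.blocked for r in wanted)
    # Underblocking: unwanted sites that slipped through.
    underblocked = sum(not r.blocked for r in unwanted)
    over = overblocked / len(wanted) if wanted else 0.0
    under = underblocked / len(unwanted) if unwanted else 0.0
    return over, under


sample = [
    Result("breast-cancer-info.example", unwanted=False, blocked=True),
    Result("news.example", unwanted=False, blocked=False),
    Result("adult-pictures.example", unwanted=True, blocked=False),
    Result("adult-text.example", unwanted=True, blocked=True),
]
over, under = error_rates(sample)
print(f"overblocking: {over:.0%}, underblocking: {under:.0%}")
```

A good filter keeps both rates low at the same time; lowering one by simply blocking more (or less) usually raises the other.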

Method A: content scanning

When web pages are scanned for content, they are first downloaded, which costs time and bandwidth.  Then the content is scanned for bad words like "breast", "s*x", "s*ck", "f*ck", etc.  Depending on the vendor, one or more words trigger the blocking mechanism and the unwanted content is blocked.  The theory looks nice...  In practice, however, many sites are blocked because of word combinations like "I don't like sex", "breast cancer", etc.  This is called overblocking.  On the other hand, sites with sexual content that consist only of pictures (text can also appear inside a picture) are not blocked because they do not contain any of the bad words.  This is called underblocking.  The time that it takes to scan a web page and guess its type of content varies per page (some pages on the internet are very large).  This method is sufficiently fast for an individual user.  However, for 250 or more users, a very fast computer system is required for the proxy server.
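
The behaviour described above can be reproduced with a few lines of code. This is a deliberately naive sketch, not any vendor's implementation: the word list `BAD_WORDS` and the function `scan_blocks` are hypothetical, and the page is assumed to be downloaded already.

```python
import re

# Hypothetical word list; real products use much longer lists.
BAD_WORDS = {"sex", "breast"}


def scan_blocks(page_text, bad_words=BAD_WORDS):
    """Return True if any bad word occurs as a whole word in the page text."""
    words = set(re.findall(r"[a-z]+", page_text.lower()))
    return not words.isdisjoint(bad_words)


# Overblocking: a medical page is blocked because it contains "breast".
print(scan_blocks("Support group for breast cancer patients"))

# Overblocking: the sentence rejects the topic but still contains "sex".
print(scan_blocks("I don't like sex scenes in films"))

# Underblocking: a page with only pictures has no words to match.
print(scan_blocks('<img src="photo1.jpg"><img src="photo2.jpg">'))
```

The first two calls return True and the third returns False, which are exactly the two failure modes: harmless pages are blocked and picture-only pages pass.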

Method B: artificial intelligence

When web pages are blocked based on artificial intelligence (AI), they are also downloaded first and then scanned, so this method also consumes bandwidth and time for the download process.  The various AI methods are more complex versions of method A.  To reduce the failures caused by underblocking and overblocking, all words in the web page are rated, and some word combinations are rated as well.  Some products try to find out whether a picture contains nudity by looking at colors, and claim a high level of correctness.  This improvement in the correctness of blocking comes at a high cost: it needs a lot of CPU power.  So, for 100 or more users, a very fast computer system is required for the proxy server.
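
The idea of rating words and word combinations can be sketched as a simple scoring function. This is an illustration under stated assumptions, far simpler than what real products do: the tables `WORD_SCORES` and `PHRASE_SCORES`, the weights in them and the `THRESHOLD` value are all invented for this example.

```python
import re

# Hypothetical ratings: positive values make a page more suspicious.
WORD_SCORES = {"sex": 5, "breast": 3, "nude": 4}
# Hypothetical ratings for word combinations that make a page less suspicious.
PHRASE_SCORES = {"breast cancer": -6, "sex education": -5, "don't like sex": -5}
THRESHOLD = 4  # assumed cut-off: block when the total rating reaches this value


def page_score(text):
    """Add up the ratings of all words, then adjust for rated combinations."""
    text = text.lower()
    score = sum(WORD_SCORES.get(word, 0) for word in re.findall(r"[a-z']+", text))
    score += sum(weight * text.count(phrase) for phrase, weight in PHRASE_SCORES.items())
    return score


def is_blocked(text):
    return page_score(text) >= THRESHOLD


print(is_blocked("Breast cancer screening saves lives"))
print(is_blocked("I don't like sex"))
print(is_blocked("Nude sex pictures"))
```

With these invented weights, the first two pages are no longer blocked while the third still is. Every word of every page has to be rated, which illustrates why this approach needs more CPU power than plain word matching.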

Method C: blacklist

When web pages are blocked with the use of a blacklist, they are not downloaded in order to decide whether to block them.  Instead, the URL filter module of the proxy server makes a quick decision based on the URL: www.sex.com is blocked and www.google.com is not.  The URL filter makes this decision based on a database that is often referred to as a blacklist.  This method is fast because blocked sites are never downloaded, and the URL filter ufdbGuard performs 25,000 URL verifications per second on a 2.8 GHz Intel CPU.
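
A blacklist decision needs only the URL, so it can be sketched as a dictionary lookup. This is a minimal sketch and not how ufdbGuard is implemented: the `BLACKLIST` table, the category name and the `lookup_category` function are hypothetical, and matching parent domains is an assumption made for this example.

```python
from urllib.parse import urlsplit

# Hypothetical database: domain -> category.
BLACKLIST = {"sex.com": "adult"}


def lookup_category(url, blacklist=BLACKLIST):
    """Return the category of a blacklisted URL, or None if it is not listed."""
    if "://" not in url:
        url = "http://" + url  # allow bare hostnames such as "www.sex.com"
    host = urlsplit(url).hostname or ""
    labels = host.rstrip(".").split(".")
    # Try the full hostname first, then each parent domain.
    for i in range(len(labels)):
        candidate = ".".join(labels[i:])
        if candidate in blacklist:
            return blacklist[candidate]
    return None


print(lookup_category("www.sex.com"))     # listed, so the request is blocked
print(lookup_category("www.google.com"))  # not listed, so the request passes
```

No page content is fetched at any point, which is where the speed comes from. At the quoted 25,000 verifications per second, one lookup takes on average 40 microseconds.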

ufdbGuard also features dynamic detection of https proxy tunnels and hence increases the security on your network.

Comparison of methods

The table shows why the blacklist method is the best choice.

Daily updates

No method is perfect, and none ever will be.  This is due to the large number of websites on the internet, which simply cannot all be rated and categorised perfectly.  We believe that this imperfection is not a problem as long as a method blocks 99% of unwanted content and does not block wanted content.  ufdbGuard has a feature that recognises URLs that are not yet part of the URL database and uploads these URLs so that they can be included in the next day's database.
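
The idea of collecting unknown URLs for the next update can be sketched as follows. This only illustrates the principle and makes no claim about how ufdbGuard does it: the class `UnknownUrlLog` and its methods are hypothetical, and the actual upload step is left out.

```python
class UnknownUrlLog:
    """Remember domains that are missing from the database (hypothetical helper)."""

    def __init__(self, known_domains):
        self.known = {d.lower() for d in known_domains}
        self.pending = set()

    def check(self, domain):
        """Return True if the domain is in the database; otherwise remember it."""
        domain = domain.lower()
        if domain in self.known:
            return True
        self.pending.add(domain)
        return False

    def drain(self):
        """Return the collected domains as a sorted batch and start a new one."""
        batch = sorted(self.pending)
        self.pending.clear()
        return batch


log = UnknownUrlLog(["sex.com", "google.com"])
log.check("google.com")
log.check("new-site.example")
log.check("New-Site.example")  # same domain, stored only once
print(log.drain())
```

Each unknown domain is stored once per batch, so the list that is sent for categorisation stays small even when the same new site is visited many times.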

© copyright 2004-2008 URLfilterDB. All rights reserved.  
The date is approximately Sunday, 20-Jul-2008 00:09:27 CEST