Comparison of methods for blocking unwanted websites
There are 3 filtering methods that block unwanted web content:
- content scanning: block a web page if it contains one or more of a set of "bad" words
- artificial intelligence: an improved version of content scanning
- blacklist: block sites based on a list of categorised websites
The following criteria are used for the comparison of the filtering methods:
- user experience: the method must be sufficiently fast for the individual user.
- wrong blocking I: a site about breast cancer should not be blocked as if it were a site about sex.
If such a site is blocked anyway, it is called overblocking.
- wrong blocking II: a site with sexual content should be blocked.
If the filter fails to block it, it is called underblocking.
- block https: the method should also be able to block sites that use https,
an encrypted protocol intended for security and privacy.
Because the protocol uses encryption, the content cannot be scanned for words or phrases.
- infrastructure costs: the cost components are bandwidth usage and computing power.
Method A: content scanning
When web pages are scanned for content,
they are first downloaded, which costs time and bandwidth.
Then the content is scanned for bad words like "breast", "s*x", "s*ck", "f*ck", etc.
Depending on the vendor, one or more words trigger the blocking mechanism and
unwanted content can be blocked.
The theory looks nice...
In practice, however, many harmless sites are blocked because of word combinations like
"I don't like sex", "breast cancer", etc., which is called overblocking.
On the other hand, sites with sexual content that only have
pictures (text can also appear in a picture), are not blocked because they don't
contain any of the bad words, which is called underblocking.
The time that it takes to scan and guess the type of content of a web page
varies per page (some pages on the internet are very large).
This method is sufficiently fast for an individual user.
However, for 250 or more users, a very fast computer system is required for the proxy server.
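The behaviour described above can be illustrated with a minimal sketch. The word list, the threshold and the function name are assumptions made for this example only; real products use their own lists and trigger rules.

```python
import re

# Hypothetical word list for illustration; vendors use their own lists.
BAD_WORDS = {"sex", "breast"}

def should_block(page_text, threshold=1):
    """Return True when the page contains at least `threshold` bad words."""
    words = re.findall(r"[a-z]+", page_text.lower())
    hits = sum(1 for word in words if word in BAD_WORDS)
    return hits >= threshold

# Overblocking: a medical page is blocked because of a single word.
print(should_block("Early detection of breast cancer saves lives"))  # True
# Overblocking: the context of the word is ignored.
print(should_block("I don't like sex"))                              # True
# Underblocking: a page with only pictures contains no bad words.
print(should_block('<img src="photo1.jpg"><img src="photo2.jpg">'))  # False
```

Note that the page text must already have been downloaded before should_block can run, which is where the time and bandwidth costs come from.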
Method B: artificial intelligence
When web pages are blocked based on artificial intelligence (AI),
they are also downloaded first and then scanned,
so this method also consumes bandwidth and time for the download process.
The various AI methods are more complex versions of method A.
To reduce the failures caused by underblocking and overblocking,
every word in the web page is rated, and some word combinations are rated as well.
Some products try to find out if a picture contains nudity by looking at colors and
claim a high level of correctness.
This improvement in blocking correctness comes at a high cost: it needs a lot of CPU power.
So, for 100 or more users, a very fast computer system is required for the proxy server.
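The idea of rating words and word combinations can be sketched as a simple scoring scheme. This is only an illustration under stated assumptions: the ratings, the threshold and the function names are invented for the example, and actual products use far more elaborate (and undisclosed) models.

```python
import re

# Hypothetical ratings: positive values suggest unwanted content,
# negative values suggest a harmless context.
WORD_SCORES = {"sex": 5, "breast": 3, "nude": 4}
PHRASE_SCORES = {("breast", "cancer"): -6, ("sex", "education"): -5}

def page_score(page_text):
    """Sum the ratings of all words and of all adjacent word pairs."""
    words = re.findall(r"[a-z]+", page_text.lower())
    score = sum(WORD_SCORES.get(word, 0) for word in words)
    for pair in zip(words, words[1:]):
        score += PHRASE_SCORES.get(pair, 0)
    return score

def should_block(page_text, threshold=4):
    """Block the page when its total score reaches the threshold."""
    return page_score(page_text) >= threshold

print(page_score("breast cancer awareness"))  # -3, not blocked
print(page_score("nude sex pictures"))        # 9, blocked
```

Every word and every word pair on every page has to be looked up, which hints at why this approach needs more CPU power than plain content scanning. A page that consists only of pictures still scores 0 in this sketch.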
Method C: blacklist
When web pages are blocked with the use of a blacklist,
they do not have to be downloaded to decide whether to block them or not.
Instead, the URL filter module
of the proxy server makes a quick decision
based on the URL: www.sex.com is blocked and www.google.com is not.
The URL filter makes this decision based on a database that is often
referred to as a blacklist.
This method is fast since blocked sites are not downloaded and
the URL filter ufdbGuard
does 25,000 URL verifications/sec on a 2.8 GHz Intel CPU.
ufdbGuard also features dynamic
detection of https proxy tunnels
and hence increases the security on your network.
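A URL-based decision can be sketched as a lookup of the host name and its parent domains in a table. This is a minimal illustration, not the way ufdbGuard is implemented: the table contents, the category name and the function name are assumptions made for the example.

```python
from urllib.parse import urlsplit

# Hypothetical blacklist: domain -> category.
BLACKLIST = {"sex.com": "adult"}

def lookup_category(url):
    """Return the category of a blacklisted URL, or None if it is not listed."""
    if "://" not in url:
        url = "http://" + url  # allow bare host names like www.sex.com
    host = (urlsplit(url).hostname or "").lower()
    labels = host.split(".")
    # Try www.sex.com first, then sex.com; the bare top-level domain is skipped.
    for i in range(len(labels) - 1):
        candidate = ".".join(labels[i:])
        if candidate in BLACKLIST:
            return BLACKLIST[candidate]
    return None

print(lookup_category("www.sex.com"))     # adult
print(lookup_category("www.google.com"))  # None
```

The lookup needs only the URL, so no page content is downloaded or scanned before the decision is made.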
Comparison of methods
The table shows why the blacklist method is the best choice.
Daily updates
No method is perfect, and none ever will be.
This is due to the large number of websites on the internet that
simply cannot be rated and categorised perfectly.
We believe that this imperfection is not a problem as long as
a method blocks 99% of unwanted content and does not block wanted content.
ufdbGuard has a feature that recognises URLs
that are not yet part of the URL database and
uploads these URLs so that they can be included in the next day's database.
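The general idea of collecting unknown URLs for a later database update can be sketched as follows. This is not ufdbGuard's actual mechanism: the table contents and all names are hypothetical, and the upload step itself is deliberately left out.

```python
from urllib.parse import urlsplit

# Hypothetical URL database: domain -> category.
KNOWN_DOMAINS = {"sex.com": "adult", "google.com": "search"}

# Host names seen by the filter that are not in the database yet.
uncategorised = set()

def categorise(url):
    """Return the category of a known URL; remember the host if it is unknown."""
    host = (urlsplit(url).hostname or "").lower()
    labels = host.split(".")
    for i in range(len(labels) - 1):
        candidate = ".".join(labels[i:])
        if candidate in KNOWN_DOMAINS:
            return KNOWN_DOMAINS[candidate]
    if host:
        uncategorised.add(host)
    return None

def pending_upload():
    """Return the unknown hosts collected so far, sorted for a stable report."""
    return sorted(uncategorised)

categorise("http://www.google.com/")
categorise("http://www.example.org/page")
print(pending_upload())  # ['www.example.org']
```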
© copyright 2004-2008 URLfilterDB. All rights reserved.
the date is approximately Sunday, 20-Jul-2008 00:09:27 CEST