How to Make "mailto" Safe Again
By: Lori MacVittie
Jan. 29, 2010 10:30 AM
Using HTTP headers and default browser protocol handlers provides an opportunity to rediscover the usability and simplicity of the mailto protocol.
Over the last decade it's become unsafe to use the mailto protocol on a website due to e-mail harvesters and web scraping. No one wants to put their e-mail address out on teh Internets because two minutes after doing so you end up on a trillion SPAM lists and the next thing you know you're changing your e-mail address.
But people still wanted to share contact information, so it became common practice to spell out your e-mail address, such as l.macvittie AT F5 dot com. But e-mail harvesters quickly figured out how to circumvent that practice so people got even more inventive, describing how to type the @ sign instead. For example, you can send me an e-mail at l.macvittie SHIFT 2 f5.com. But that's inconvenient and isn't easily automated, and eventually the e-mail harvesters figure that one out, too.
You could use contact forms instead to hide the e-mail address, but that’s not really sharing and it isn’t convenient for the person trying to get hold of you. Like many folks, if I need to contact you I’d like a record that I did so, and contact forms rarely provide a copy of the message, which makes managing communication more difficult. Contact forms also give spammers an easily automated method of submitting spam. What you really want is to be able to share your e-mail address and still avoid the automated e-mail harvesters.
Some folks suggest using CSS tricks that manipulate selectors to hide the e-mail address, but the problem with this approach is that (1) it doesn’t automatically launch a mail client and (2) the e-mail address is still in the text of the page; it’s just located in a different place. Some techniques use pure CSS and pseudo-class selectors, and others use CSS to expose the actual e-mail address that is “hidden” in one of the link’s attributes, often the title. But in both cases the address is still in the page, or in an external CSS file which bots might pull if they’re following all links, and a simple regular expression search will find it easily enough.
ONE SIMPLE SOLUTION
One solution to this problem lies in leveraging an HTTP redirect and the ubiquitous browser support for the mailto protocol. Another description of this (and simple PHP code) can be found in this extensive reference document listing myriad ways of “hiding” e-mail addresses from harvesters. My only nit is that the author indicates the mailto-redirect method doesn’t work as per a normal mailto link, and I’ve found that’s not the case. A header redirect to a mailto location should automatically launch the mail client with the appropriate e-mail address as expected; at least it has in the testing I’ve done thus far on the iRule code used to accomplish the redirect.
The mailto link in the presentation page is changed to a standard HTTP link which, when clicked, executes logic that sends an HTTP redirect to a mailto location instead of a more standard HTTP location. This technique works for two reasons. First, the location to which the browser is being redirected is “hidden” in the HTTP headers, which bots and spiders rarely interpret or expect to carry pertinent information. Second, it is the browser that must interpret the location, which means any client-side supported protocol, like mailto, will cause the execution of the expected action. In this case that action is launching the user’s e-mail client. This technique could, of course, be used to silently launch other client-side applications for which a protocol handler is defined as well.
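A traditional HTTP redirect to a web page carries a Location header that points at an HTTP URL, for example Location: http://www.example.com/ (a placeholder address).
And what we want is simply to make that header point at a mailto URI instead, for example Location: mailto:user@example.com (again a placeholder).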
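The two responses can be sketched in Python, used here only for illustration since the article itself refers to iRules and PHP. The `redirect_response` helper, the 302 status code, and the example.com targets are assumptions and do not come from the original listing.

```python
def redirect_response(location: str) -> str:
    """Build a minimal raw HTTP redirect response for the given Location."""
    # 302 is an assumed status code; the article only says "HTTP redirect".
    return (
        "HTTP/1.1 302 Found\r\n"
        f"Location: {location}\r\n"
        "Content-Length: 0\r\n"
        "\r\n"
    )


# Placeholder targets, not taken from the original article.
traditional = redirect_response("http://www.example.com/")
mail_redirect = redirect_response("mailto:user@example.com")

if __name__ == "__main__":
    print(traditional)
    print(mail_redirect)
```

The only difference between the two responses is the scheme in the Location header; the address never appears in the body of any page.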
There are two easy ways to implement this solution: network-side and server-side scripting.
METHOD #1: NETWORK-SIDE SCRIPTING
Network-side scripting, such as the iRules used in my testing, lets you easily accomplish this task. You can also do the same with mod_rewrite if you're running Apache, and I'm sure there's a way to do it if you're running IIS. Basically any network-side scripting enabled proxy can accomplish this task. You can accomplish this via server-side scripts as well, but that requires modification to the application and that may not be desirable, depending on your situation.
First you need a URI which you can map to an e-mail address, e.g. /getmailto. The script needs to (1) look for that URI and (2) respond to the call to that URI with an HTTP redirect containing the appropriate e-mail address.
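Those two steps might look roughly like the following. This is a standalone Python sketch built on the standard library's `http.server`, not the iRule used in the article. The `route` helper, the placeholder address, the 302 and 404 status codes, and the port are all assumptions.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

MAILTO_PATH = "/getmailto"                 # URI used as the example in the article
MAILTO_TARGET = "mailto:user@example.com"  # placeholder address (assumption)


def route(path: str):
    """Return (status, headers) for a request path."""
    # Step 1: look for the special URI, ignoring any query string.
    if path.split("?", 1)[0] == MAILTO_PATH:
        # Step 2: answer with a redirect whose Location is a mailto URI.
        return 302, {"Location": MAILTO_TARGET}
    return 404, {}


class MailtoRedirectHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        status, headers = route(self.path)
        self.send_response(status)
        for name, value in headers.items():
            self.send_header(name, value)
        self.send_header("Content-Length", "0")
        self.end_headers()


if __name__ == "__main__":
    # Assumed local address and port, for experimentation only.
    HTTPServer(("127.0.0.1", 8000), MailtoRedirectHandler).serve_forever()
```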
Now replace your mailto links with a link to the new URL. If your browser and mail client are configured properly, clicking on the link should bring up a new e-mail message with the e-mail address filled in. That supports usability needs (the e-mail address link should launch the user’s mail client) and it also keeps the address out of the page.
You'll probably want to further filter access to the URL by putting some iRule code in to detect bots and spiders and prevent them from exploring this one, but that's pretty easy, too. If you only have to replace one e-mail address, you could probably avoid rewriting the mailto links and simply use an iRule to transform the original mailto links to the new URL. And I'm sure someone out there will figure out how to change any mailto link to a new URL as well.
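For the single-address case, transforming existing mailto links on the way out could be sketched as a simple response rewrite. This is illustrative Python rather than an iRule. The regular expression and the `rewrite_mailto_links` name are assumptions, and the sketch only handles quoted href attributes.

```python
import re

# Matches href="mailto:..." or href='mailto:...' (quoted attributes only).
MAILTO_HREF = re.compile(r"""href=(["'])mailto:[^"']*\1""", re.IGNORECASE)


def rewrite_mailto_links(html: str, new_url: str = "/getmailto") -> str:
    """Replace every quoted mailto href with a link to the redirect URL."""
    return MAILTO_HREF.sub(
        lambda m: f"href={m.group(1)}{new_url}{m.group(1)}", html
    )


if __name__ == "__main__":
    page = '<a href="mailto:user@example.com">Contact me</a>'
    print(rewrite_mailto_links(page))
```

Note that this rewrites only the href; if the address also appears as the visible link text, it would still be in the page and would need separate handling.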
For example, if all e-mail addresses use the same formula, e.g. first initial, dot, last name, you could construct a URL that carries the information in its path, e.g. /lmacvittie. You can then use a network-side script to parse it into the right e-mail address and send the redirect back to the user. Using iRules you could also create a data group that maps URIs to e-mail addresses and do a quick lookup based on the URI to extract the appropriate e-mail address. As mentioned, you can do the redirect using mod_rewrite as well. I think iRules affords more flexibility in dealing with the actual data being manipulated (the e-mail address to URI mappings), but you should be able to do it using other tools as well. The trick here is in putting the e-mail address in the HTTP header rather than in the body of the page where it is easily discovered by harvesting tools.
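Both ideas, the formula and the lookup table, can be sketched in a few lines of Python. The dictionary below merely stands in for an iRules data group. The domain, the table entries, and the function names are hypothetical.

```python
from typing import Optional

DOMAIN = "example.com"  # placeholder domain (assumption)

# Stand-in for a data group mapping URIs to addresses (hypothetical entries).
URI_TO_ADDRESS = {
    "/getmailto": "user@example.com",
    "/sales": "sales@example.com",
}


def address_from_formula(path: str, domain: str = DOMAIN) -> Optional[str]:
    """Turn '/lmacvittie' into 'l.macvittie@<domain>' (first initial, dot, last name)."""
    name = path.strip("/").lower()
    if len(name) < 2 or not name.isalpha():
        return None  # does not fit the naming formula
    return f"{name[0]}.{name[1:]}@{domain}"


def mailto_location(path: str) -> Optional[str]:
    """Prefer an explicit table entry, then fall back to the formula."""
    address = URI_TO_ADDRESS.get(path) or address_from_formula(path)
    return f"mailto:{address}" if address else None
```

Checking the table first keeps explicitly mapped URIs such as /getmailto from being misread as a name by the formula.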
METHOD #2: SERVER-SIDE SCRIPTING
If you aren’t lucky enough to have your own personal, private BIG-IP or other network-side scripting enabled solution, you can also accomplish this same functionality in your application code. In a server-side script the trick is to ensure that you’re inserting the HTTP header before any other data is written to the connection. HTTP headers must be received first, before data. It’s like gravity – a law that must be obeyed.
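For example, in PHP, all you need to do is call the header function with the appropriate location before any output is sent, e.g. header('Location: mailto:user@example.com'); where the address shown is a placeholder.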
Rather than add this code to every page with an e-mail address it might be advantageous to take a service-based approach and simulate network-side scripting capabilities by creating a single “page” for all mailto redirects and then implementing the lookups and return of the appropriate HTTP redirect in a centralized, more manageable service.
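One way such a centralized service might look is sketched below as a small Python WSGI application, rather than the PHP the article mentions. The `who` query parameter, the `CONTACTS` table, and the status codes are assumptions made for illustration.

```python
from urllib.parse import parse_qs

# Hypothetical lookup table: short keys instead of addresses in the page.
CONTACTS = {
    "editor": "user@example.com",
    "sales": "sales@example.com",
}


def mailto_service(environ, start_response):
    """WSGI app: a request with ?who=<key> gets a redirect to the mapped mailto URI."""
    query = parse_qs(environ.get("QUERY_STRING", ""))
    key = query.get("who", [""])[0]
    address = CONTACTS.get(key)
    if address is None:
        start_response("404 Not Found", [("Content-Length", "0")])
        return [b""]
    # Status and headers are handed over before any body data is returned.
    start_response(
        "302 Found",
        [("Location", f"mailto:{address}"), ("Content-Length", "0")],
    )
    return [b""]


if __name__ == "__main__":
    from wsgiref.simple_server import make_server

    # Assumed local address and port, for experimentation only.
    make_server("127.0.0.1", 8000, mailto_service).serve_forever()
```

Pages would then link to something like /mailto?who=sales, and only this one service needs to know the real addresses.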
There are other solutions to prevent this type of web scraping behavior, and of course any solution combined with a good SPAM prevention solution will improve the quality of the e-mail received. SPAM may be a fact of life on the Internet, but anything we can do to preserve the user experience while cutting down on how much SPAM we receive has to be a good thing.
UPDATED NOTE: I just had a thought that because this essentially moves e-mail to a URI-based system, it should be possible to integrate techniques like a CAPTCHA to further secure access to e-mail addresses against bots, spiders, and scripts.