
Validate email addresses using regular expressions

Author: Markus Sipilä

Version: 1.0, 2006-08-02

Permanent URL: http://www.iki.fi/markus.sipila/pub/emailvalidator.php

Email address validation is quite a bit more complex than it might sound at first. This PHP script uses regular expressions to check whether the given input is a syntactically valid email address. The article doesn't cover the very basics of regexps, since there are some very good tutorials available on the web. It focuses on using regexps to validate email addresses.

The fact that RFC 2822 allows a broader set of characters in email addresses than is typically used makes things quite challenging usability-wise. It is a very good and user-friendly idea to check the input for typos (e.g. for invalid input like foo.bar.@example.com). At the same time, however, the validation should accept valid but not-so-typical addresses (e.g. foo+bar!@example.com).

The tricky part is that the validator has to balance between these two objectives. Let's take foo#@example.com as an example. It's syntactically valid but in the real world it would probably be a typo. The difficult part in the last sentence is the word "probably" because you just can't be sure.

The only good solution I can think of is to perform the validation so that it accepts rarely used (but valid) syntax but warns the user to double-check the address for typos.

How does it work?

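The local-part (the part before the "@" character) of an email address may use any of these ASCII characters [1]:

- uppercase and lowercase letters (a-z, A-Z)
- digits (0-9)
- the special characters ! # $ % & ' * + - / = ? ^ _ ` { | } ~
- the period ".", provided that it is not the first or the last character and does not appear twice in a row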

The domain part of the address is much easier to handle. The dot-separated domain labels can only include letters, digits and hyphens [1].

There are two regexps in this script. The first one will pass "normal looking" addresses like foo.bar@baz.example.com or foo+bar@example.com. This regexp won't, however, pass every syntactically valid address. For example, it rejects foo!#@example.com.

// define a regular expression for "normal" addresses
$normal = "^[a-z0-9_\+-]+(\.[a-z0-9_\+-]+)*@[a-z0-9-]+(\.[a-z0-9-]+)*\.([a-z]{2,4})$";

To understand this expression you need to be familiar with regexp syntax. You'll find links to some good tutorials at the end of this article.

The first part, ^[a-z0-9_\+-]+, means that the address has to start with letters a-z, digits 0-9 or the characters "_", "+" or "-". The final "+" means there must be 1..n of these characters. A normal username, say jsmith2, would match this expression. It also matches foo+bar.

The regexp continues with (\.[a-z0-9_\+-]+)*. It means that the characters matched so far can be followed by a period "." and then by one or more characters from the same set as before the period. The period has a special meaning in regexps, so it must be escaped with a backslash. The "+" is escaped as well, although inside square brackets it would be taken literally anyway. The final * means there can be 0..n of these sequences. This way the regexp will match the strings firstname.lastname, firstname.long-middlename.lastname and foo.bar+baz.
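The two local-part pieces described so far can be tried out on their own. The following is a minimal sketch and not part of the original script: it assumes PHP's preg_match() (PCRE) instead of eregi(), adds the anchors "^" and "$" so that the fragment can be tested without the rest of the address, and uses a hypothetical helper name, matchesLocalPart().

```php
<?php
// Hypothetical helper: tests a string against the local-part fragment only.
function matchesLocalPart($text)
{
    // [a-z0-9_+-]+         first run of allowed characters
    // (\.[a-z0-9_+-]+)*    zero or more groups of a period plus more characters
    // "i" ignores case; "D" stops "$" from matching before a trailing newline.
    return preg_match('/^[a-z0-9_+-]+(\.[a-z0-9_+-]+)*$/iD', $text) === 1;
}

$samples = array('jsmith2', 'firstname.lastname', 'foo.bar+baz', 'foo.bar.', 'foo..bar');

foreach ($samples as $sample) {
    echo $sample, ': ', matchesLocalPart($sample) ? 'match' : 'no match', "\n";
}
```

The last two samples should fail, because every period has to be followed by at least one more allowed character.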

After these characters there must be a single "@" character. It must be followed by a domain label that consists of letters, digits and hyphens. There can be 1..n domain labels separated by periods. The first label (without the period) is defined by [a-z0-9-]+. After this there can be 0..n similar sequences, each starting with a period. This is defined as (\.[a-z0-9-]+)*. At the time this article was written, most email addresses ended with a period followed by 2..4 letters (for example .fi or .info). The expression \.([a-z]{2,4})$ matches this.
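The domain half of the expression can be checked in isolation in the same way. This is only a sketch under a few assumptions: it uses preg_match() with PCRE delimiters, anchors the fragment with "^" so it can be run without the local-part, and the helper name matchesNormalDomain() is made up for the example.

```php
<?php
// Hypothetical helper: tests only the domain part of the "normal" pattern.
function matchesNormalDomain($domain)
{
    // [a-z0-9-]+          first domain label
    // (\.[a-z0-9-]+)*     zero or more further labels, each led by a period
    // \.([a-z]{2,4})$     final label of 2 to 4 letters at the very end
    return preg_match('/^[a-z0-9-]+(\.[a-z0-9-]+)*\.([a-z]{2,4})$/iD', $domain) === 1;
}

$samples = array('example.fi', 'baz.example.com', 'example.info', 'example.museum', 'localhost');

foreach ($samples as $sample) {
    echo $sample, ': ', matchesNormalDomain($sample) ? 'match' : 'no match', "\n";
}
```

Here example.museum and localhost should not match: the first ends in six letters, and the second has no period at all.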

The second regexp is supposed to match all syntactically valid addresses, even those that we don't see that often. The idea in this example is that the validator should pass those strange looking addresses but tell the user that it would probably be a good idea to double check the address.

// define a regular expression for "strange looking" but syntactically valid addresses
// (the comma is left out: RFC 2822 does not allow it in an unquoted local-part)
$validButRare = "^[a-z0-9!#\$%&'\*\+/=\?\^_`\{\|}~-]+(\.[a-z0-9!#\$%&'\*\+/=\?\^_`\{\|}~-]+)*@[a-z0-9-]+(\.[a-z0-9-]+)*\.([a-z]{2,})$";

This ugly regexp is actually quite similar to the one declared earlier. The period-separated character sequences in the local-part can now include all the special characters defined in the RFC. The characters "$", "*", "+", "?", "^", "{" and "|" all have special meanings in regular expressions, so they are escaped with a backslash. The expression now allows the domain part to end with a period followed by 2..n letters, such as .museum.
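Escaping that many characters by hand is easy to get wrong. As an alternative, PHP's preg_quote() can add the backslashes. The sketch below is an assumption-laden variation rather than the script's own code: it targets preg_match() (PCRE), limits the character class to the special characters RFC 2822 allows in an unquoted local-part, and the helper name buildRarePattern() is hypothetical.

```php
<?php
// Hypothetical helper: builds a PCRE version of the "valid but rare" pattern.
function buildRarePattern()
{
    // Special characters allowed in an unquoted local-part (RFC 2822 "atext").
    $specials = '!#$%&\'*+/=?^_`{|}~-';

    // preg_quote() inserts the backslashes; "/" is passed as the delimiter
    // so that the slash inside the class gets escaped too.
    $class = '[a-z0-9' . preg_quote($specials, '/') . ']';

    // Same overall shape as before: local-part, "@", domain labels, 2..n letters.
    return '/^' . $class . '+(\.' . $class . '+)*@[a-z0-9-]+(\.[a-z0-9-]+)*\.[a-z]{2,}$/iD';
}

$pattern = buildRarePattern();

foreach (array('foo!#@example.com', 'foo@example.museum', 'foo.bar.@example.com') as $sample) {
    echo $sample, ': ', preg_match($pattern, $sample) === 1 ? 'match' : 'no match', "\n";
}
```

Building the class once and reusing it also avoids having to keep two long copies of the same character list in sync.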

You can use these regexps as follows (in PHP):

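// eregi() was removed in PHP 7, so preg_match() is used instead. It needs
// delimiters around the pattern: "<" and ">" here, since neither occurs in
// the patterns. "i" ignores case, and "D" keeps "$" from matching before a
// trailing newline.
$safeEmail = htmlspecialchars($email); // escape the user input before echoing it

if (preg_match('<' . $normal . '>iD', $email)) {
  echo("The address $safeEmail is valid and looks normal.");
}

else if (preg_match('<' . $validButRare . '>iD', $email)) {
  echo("The address $safeEmail looks a bit strange but it is syntactically valid. You might want to check it for typos.");
}

else {
  echo("The address $safeEmail is not valid.");
}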
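If the result is needed in more than one place, the three-way decision can be wrapped in a function. The following is a rough sketch, not the script's own code: the function name classifyEmail() and its return values 'normal', 'rare' and 'invalid' are invented for the example, the patterns are written for preg_match() (PCRE), and the rare character class is limited to the special characters RFC 2822 allows in an unquoted local-part.

```php
<?php
// Hypothetical helper: returns 'normal', 'rare' or 'invalid' for an address.
function classifyEmail($email)
{
    $domain = '[a-z0-9-]+(\.[a-z0-9-]+)*';      // labels separated by periods
    $common = '[a-z0-9_+-]';                    // "normal looking" characters
    $rare   = '[a-z0-9!#$%&\'*+\/=?^_`{|}~-]';  // every allowed character

    // "i" ignores case; "D" stops "$" from matching before a trailing newline.
    $normal       = '/^' . $common . '+(\.' . $common . '+)*@' . $domain . '\.[a-z]{2,4}$/iD';
    $validButRare = '/^' . $rare . '+(\.' . $rare . '+)*@' . $domain . '\.[a-z]{2,}$/iD';

    if (preg_match($normal, $email) === 1) {
        return 'normal';
    }
    if (preg_match($validButRare, $email) === 1) {
        return 'rare';   // valid, but worth a "please double-check" warning
    }
    return 'invalid';
}

foreach (array('foo+bar@example.com', 'foo#@example.com', 'foo.bar.@example.com') as $sample) {
    echo $sample, ' => ', classifyEmail($sample), "\n";
}
```

Returning a status instead of echoing the message keeps the validation separate from the wording shown to the user.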

These regexps were inspired by and modified from the article "Using Regular Expressions in PHP" by James Ussher-Smith [2]. That article uses email address validation as an example, but the regexp it suggests doesn't accept, for example, foo+bar@example.com.

You can use these regexps in your applications but please give credit to the original authors. Feel free to drop me an email if you liked this howto. :)

Limitations

References:
[1] http://en.wikipedia.org/wiki/E-mail_address
[2] http://www.sitepoint.com/article/regular-expressions-php
Another good regexp tutorial: http://weblogtoolscollection.com/regex/regex.php