It’s a familiar situation: your email address columns contain phone numbers, important notes, credit card details…  It’s not a problem until you want to use the addresses in bulk – such as for a mailshot – and by that time it is extremely costly to deal with.

The user interface should validate at the time of entry, of course, but we need a backstop validation at database level: just strict enough to stop completely off-the-ball values being entered.  Format validation, in other words.  Non-existent domains and typos are not our concern, and don’t cause us too much trouble (that’s for a different area of data quality); but try sending a telephone number to a mail transport and it will probably raise an exception.

Here is a specification.

Email Column Specification

 email_address nvarchar(254) NULL

Email Address Validation at database layer

See http://en.wikipedia.org/wiki/E-mail_address

  • No attempt to check that the valid address exists: that’s a job for higher tiers.
  • Don’t be too strict.  SQL is not the best language for writing the complex validator that email addresses seem to need.  It would be embarrassing to reject an address that a carefully crafted UI validator has allowed through.
Database Validation rules for Email
Rule Remarks
At least 1 @ symbol must be at least one @ symbol.  Further @s are allowed in local part if quoted  – "local@part"@domain.com
Allow ! # $ % & ' * + – / = ? ^ _ ` { | } ~  . (dot)
Allow "():;<>@[\] comma
At least one other character before and after the @ symbol There could be more than one @ symbol.  Maybe not worth putting too much energy into handling this very rare case.
Minimum 5 characters overall a@com is possible.
All other characters prohibited

That would satisfy me at the database layer.  The UI is the place to make more stringent checks – when there is an opportunity to correct invalid entries.  Many invalid email addresses can get by these validations but (in a cooperative environment) they are likely to be typos: it is better to preserve the slight mistakes and submit them to correction by post-editing if desired.   Of course, if we were selling addresses to a third party we would want stricter validation, but for that it might be better to purchase a tool or use a bureau service.

UDF : FUNC_IsValidEmailAddress

This function is fairly lenient. Few characters are refused in accordance with the broad specification (£, ¬, ¦ are refused).  Therefore it should be reasonably straightforward to synchronise with higher tiers’ validation.  Nevertheless it will still be difficult to slip a textual comment or telephone number into the column.

The string operations will make this a slow function.  OK for an individual inserts and updates; maybe a problem for batches.

CREATE FUNCTION [dbo].[FUNC_IsValidEmailAddress]
    /*
    Performs minimal database-level validation on email addresses.

    Returns Y if email_address meets following criteria, else N.
		- contains @ with at least one character before and three characters after (i.e. 5 chars minimum).
		- all characters must be in this string (case insensitive; includes space):
			'abcdefghijklmnopqrstuvwxyz01234567890@ !#$%&''*+–/=?^_`{|}~."():;<>@[\],'

	NULL email_address returns Y.  

	2013-12-28  http://www.andrewing.co.uk 
*/
(
    @email_address VARCHAR(254)
)
RETURNS CHAR(1)
AS
BEGIN
    DECLARE @success_flag CHAR(1) = 'Y'
    DECLARE @allowed_chars VARCHAR(255)
    DECLARE @searched_char VARCHAR(3)
    DECLARE @email_length SMALLINT
    DECLARE @position SMALLINT

    SET @allowed_chars = 'abcdefghijklmnopqrstuvwxyz01234567890@ !#$%&''*+-/=?^_`{|}~."():;<>@[\],'

	IF (@email_address IS NULL) RETURN 'Y'

    SELECT @email_length = LEN(@email_address)
        ,@position = 0

    IF (@email_address NOT LIKE '%_@___%')
        SET @success_flag = 'N'

    WHILE @position < @email_length
        AND @success_flag = 'Y'
    BEGIN
        SET @position = @position + 1
        SET @searched_char = SUBSTRING(@email_address, @position, 1)

        SELECT @searched_char = CASE @searched_char
                WHEN '_'
                    THEN '[_]'
                WHEN '%'
                    THEN '[%]'
				WHEN '['
					THEN '[[]'

                ELSE @searched_char
                END

        IF ( NOT @allowed_chars LIKE '%' + @searched_char + '%')
            SET @success_flag = 'N'
    END

    RETURN @success_flag
END

Tests

Script should return the values indicated.

-- Tests
print 'Test dbo.FUNC_IsValidEmailAddress().'
print 'NULl.  (Should return Y)'
print dbo.FUNC_IsValidEmailAddress(null)
print 'Missing @ symbol, or @ only at start or end  (N)'
print dbo.FUNC_IsValidEmailAddress('@symbol.at.front')
print dbo.FUNC_IsValidEmailAddress('symbol.at.end@')
print dbo.FUNC_IsValidEmailAddress('no.symbol')
print 'Invalid characters: £ or ¦  (N)'
print dbo.FUNC_IsValidEmailAddress('test@£email.com')
print dbo.FUNC_IsValidEmailAddress('andrew¦test@email.com')
print 'Telephone number or comment!  (N)'
print dbo.FUNC_IsValidEmailAddress('Michelle on 01233 12333444')
print dbo.FUNC_IsValidEmailAddress('Check this record!!')
print 'But we don''t catch an email address plus a comment   (Y)'
print dbo.FUNC_IsValidEmailAddress('andrew@test.com Check before emailing.')
print 'Valid examples from Wikipedia article  (Y)'
print dbo.FUNC_IsValidEmailAddress('very.common@example.com')
print dbo.FUNC_IsValidEmailAddress('a.little.lengthy.but.fine@dept.example.com')
print dbo.FUNC_IsValidEmailAddress('disposable.style.email.with+symbol@example.com')
print dbo.FUNC_IsValidEmailAddress('other.email-with-dash@example.com')
print dbo.FUNC_IsValidEmailAddress('user@[IPv6:2001:db8:1ff::a0b:dbd0]')
print dbo.FUNC_IsValidEmailAddress('"much.more unusual"@example.com')
print dbo.FUNC_IsValidEmailAddress('"very.unusual.@.unusual.com"@example.com')
print dbo.FUNC_IsValidEmailAddress('"very.(),:;<>[]\".VERY.\"very@\\ \"very\".unusual"@strange.example.com')
print dbo.FUNC_IsValidEmailAddress('postbox@com')
print dbo.FUNC_IsValidEmailAddress('admin@mailserver1')
print dbo.FUNC_IsValidEmailAddress('!#$%&''*+-/=?^_`{}|~@example.org')
print dbo.FUNC_IsValidEmailAddress('" "@example.org')
print 'Not accepting unicode yet.  (N)'
print dbo.FUNC_IsValidEmailAddress('üñîçøðé@example.com ')
print 'Invalid examples from Wikipedia article (N)'
print dbo.FUNC_IsValidEmailAddress('Abc.example.com')
print 'Invalid examples that we don''t intend to catch with this lenient test  (Y)'
print dbo.FUNC_IsValidEmailAddress('A@b@c@example.com')
print dbo.FUNC_IsValidEmailAddress('a"b(c)d,e:f;g<h>i[j\k]l@example.com ')
print dbo.FUNC_IsValidEmailAddress('just"not"right@example.com')
print dbo.FUNC_IsValidEmailAddress('this is"not\allowed@example.com ')
print dbo.FUNC_IsValidEmailAddress('this\ still\"not\\allowed@example.com')

 

Speed

It’s not very fast.  Test:  insert 10,000 rows into A a 3-column table with no constraint on the email_address column;  B a similar table, but with a constraint on the email_address column.

  • A: 10,000 row insert, no constraint: CPU time = 15 ms,  elapsed time = 30 ms.
  • B: 10,000 row insert, constraint on email column: CPU time = 1139 ms,  elapsed time = 1214 ms.

An increase of  4,000%, mostly CPU time as you would expect – there is no additional IO.

Let’s break down the function to see what is taking the time.  Start by commenting all in the function except the length check:

1.  Length check only  310% increase:

  • A: CPU time = 32 ms,  elapsed time = 37 ms.
  • B: CPU time = 124 ms,  elapsed time = 116 ms.

2.  “email_address LIKE ‘%_@_%'”  check only  590% increase:

  • A: CPU time = 32 ms,  elapsed time = 30 ms.
  • B: CPU time = 156 ms,  elapsed time = 178 ms.

3.  valid character check only  4,000% increase:

  • A:  CPU time = 16 ms,  elapsed time = 30 ms.
  • B: CPU time = 1138 ms,  elapsed time = 1203 ms.

4. All checks.

  • A: CPU time = 15 ms,  elapsed time = 30 ms.
  • B: CPU time = 1139 ms,  elapsed time = 1214 ms.

Clearly we need an optimisation of that expensive allowed-character check.   Using CLR would give us better string manipulation and regular expressions.

Discussion

  • Speed  I’d be worried about the effect on bulk inserts or updates  (bulk updates that don’t touch the validated column won’t be affected).  In normal circumstances the customers table is not affected in bulk all that often so it might not be a concern.
  • That string of allowed characters (@allowed_chars) is hidden in the function.  It should be centralised and reportable.  It could be wrapped within another function (func_email_address_allowed_chars()).  Other options are table row (not tempting), or database property (how lightweight is that?)