I was looking for certainty  in this area and I knew that Joe Celko would provide it.  This is a thoroughly absorbing read and while it isn’t (quite rightly) a full style-sheet ready to take away it gives a wealth of guidance, discussion and stimulus to the reader to create his or her own rules.

Joe Celko's SQL Programming Style book coverThe early chapters provide an enumerated set of guidelines:  e.g. 1.2.3 Avoid Descriptive Prefixes or 2.1.5 Avoid the Use of CamelCase.  Each guideline normally comes with  a justification (sometimes with evidence and references), examples and exceptions.   Later chapters move away from visuals and give more functional guidelines for designing tables and writing views and procedures. I’ve become a more assiduous user of check constraints as a result of reading this book.

There are some familiar Celko chants :  don’t think of columns as fields; they are tables not files; and it’s reassuring to find them.   Sometimes he veers from the nicely encapsulated didacticism of his best chapters to a more woolly philosophical style (e.g. Chapter 4 Scales and Measurements), but it’s done with the right intention: to prompt thought about the data that is going into these columns.

I agree with his appeal to use external standards wherever they exist.  There are no two companies that I have worked with who use the same set of country keys, even though ISO country codes are well known.   I have never had to use a column to record sex but if I did it’s good to know that standard gender codes have been defined (1 =male 2 = female, to save you thumbing through the 24 pages of ISO/IEC 5218).

I don’t follow everything in the book:  some guidelines I disagree with and some I find plain impossible to interpret.  For example, he proposes  the suffix “_id”  for a unique identifier column but  says ‘Never use “<table_name>_id”‘ (p12).    So in other words, you should never use “customer_id” on the  “Customers” table.   I simply cannot think what one should use instead, nor does Celko give any alternatives.  So I use “customer_id”.  (Indeed Celko uses  “Drivers.driver_id” in an unrelated example on p60.)

However, the point is that Celko has thought about all these questions – and we should too if we  work with databases, even if we come to different conclusions.

References

Annotations

I accept the majority of Celko’s guidelines. There are a few that I don’t follow:

  • 6.2.4.2 Use IN(), Not OR-ed predicates (p115).    I.e. use  “WHERE a IN (b, c)” rather than “WHERE a = b OR a = c”.   That falls in the “Seek a Compact Form” section, which is generally laudable.  However SQL Server expands the compact IN form into OR-ed predicates in CHECK constraints.  So if you use compact form in DDL, schema comparison tools will always flag up a difference.  (I think that BETWEEN is similarly expanded.)

 

 

 

“Even COBOL, the most verbose programming langage on earth, allows some abbreviations” (2.4 Use Full Reserverd Words)

 

 

Column Naming Convention

I like lower case with underscores:  company_idfirst_name, telephone_nbr, booking_ref

This is Uncle Joe Celko’s convention as stated in SQL Programming Style (Morgan Kaufmann, 2005), and I accept his justification that Invoices.invoice_nbr  is more readable than Invoices.InvoiceNbr in a dense melee of Sql code.

However, pragmatically, PascalCase is more popular and I find I use it more commonly from choice.

Id/Key column Names

  • Use the singular version of the table followed by Id / Key / Nbr
  • e.g.  table Customers has column CustomerId .

There are a number of reasons for this:  having a bunch of columns all named ‘Id’ (as one popular approach has it) is never helpful.  Another reason is that it makes it practically impossible to search for references.   Searching for ‘CustomerId’ will find definite references; searching for ‘Id’ in a schema full of Ids  is useless.  The same applies to Name, Date, Cost etc used as column names.

Column Name Suffixes

The suffix is a type-and-content indicator.  Probably all columns with the same suffix will be of the similar data type and size and have similar constraints. This is reasonably easy to assert and check using a script or even by eye in a data dictionary.

Column Name Suffixes
Suffix Type Remarks
_abbrv short char or varchar an abbreviation for the item; normally for display where space is limited.
_flag char(1) Constrained to accept only uppercase Y or N.  (if an undefined value is required use _triflag)
_date any date type date or date and time
_id int numeric surrogate key – often the primary key. based on table name: table Companies has company_id
_name varchar proper name e.g. first_name. Use sparingly. name should never be used on its own for a column!
_key normally short varchar or char if natural key is of fixed length (e.g. ISO country codes) natural (normally textual) key. Normally unique except for versioned tables (i.e. tables with effective and obsolete dates).
_ref short char or varchar short textual reference – e.g. initials; company reference code.  Need not be unique.
_perc decimal(5 4) percentage scale 1.  (type specification allows maximum of 999.99%)
_token GUID a row identifier – e.g. case_guid. used for security in HTML links; etc
_triflag char(1) Constrained to accept only uppercase Y or  N or  U  (= Unknown).

 Alternative Conventions

Some other conventions that I have advocated or endured.

The Table-Prefix (c_name) Convention

This was the first convention I adopted (largely in MSAccess) and it has some attractive qualities.

  • All column names are prefixed by (normally) a single letter plus underscore.  Thus all columns in table tblCompany are prefixed with c_ e.g. c_idc_name.
  • Sometimes it isn’t possible to keep the single letter convention (what do you do once you have tblProperty, tblPerson, tblProduct?), so two or even three letter prefixes are acceptable: pn_FirstName etc.
  • Foreign key columns are named differently to the column they relate to.  E.g. tblPerson.pn_CompanyId is foreign key to tblCompany.c_Id.

Advantages

  1. Every column has a unique name.  If you don’t use fully qualified column names, adding a column can break a view if the column name already exists in one of the other tables referenced – or if someone has used Select *.
  2. It’s easy to see which table you’re referring to if you don’t use fully qualified column names:  c_id is the Company ID and p_id is the Product ID.
  3. You rarely have to alias similarly named columns:  pn_address is distinct from c_address.

It works satisfactorily in MS Access and is quick to use.   Working with recordsets in code it is clear which column is being referenced – you never see an anonymous rst!Address.  Because no aliasing is necessary there is no risk of inconsistent aliases: the column name you see in the code is the name it has in the table.

I wouldn’t choose this convention for a new database; however it has caused no problems in any database where I have used it.  With the advent of Sql Data Tools it is now possible to enforce fully qualified column names and prohibit Select * so database-unique column names are less important.  Intellisense in SSDT and SSMS makes full names (with table aliases) easier to write.  Neither of these are available for Sql in Access.

Prefixing keys with PK_ and FK_

I don’t think this helps.  The primary key is indicated in metadata. The naming convention is that it’s the singularised name of the table with a suffix. Customers.customer_id.  The foreign key is again indicated in metadata.  I don’t think confusing them is common.

I think it has disadvantages

  • It burns column properties into the column. Foreign keys might be created or dropped; the table name is unlikely to change.
  • When including the column in a view the output will be FK_customer_id or PK_customer_id – which you might be tempted to alias.  It’s not relevant to the output that the column is a FK or PK.  In fact, PK_customer_id would be confusing if it were not the PK of the rowset being returned.
  • Many to Many tables:  columns can be both PK and FK at the same time.  The conention just confuses here.
  • It obscures the fact that these columns represent the same attribute.

ORM-Generated Tables with Standard Column Names

I have worked with databases where every lookup table had exactly the same column names: twenty plus tables with the columns NameDescriptionKeyEffectiveDate, and so on.  (It was generated using ORM modelling in Visio for Enterprise Architects.  Which of us wouldn’t be flattered by the name of that program?  Unfortunately it put visual effects above good design – another bete noir; thankfully Microsoft dropped Visio for EA after 2005 and put its energies into the grown-up SSDT, which has no visuals at all!)

In addition to the lookup tables, most fact tables in the database also had a Name and Description column.  Almost every Select from more than one table would require the designer to alias columns  (or forget to – Sql Server  is happy to output two columns with the same name; applications are rarely happy to receive them!).

It seems obvious that there will be less confusion if a column name seen by the application is the same as in the table, and if a column always has the same name wherever it appears.    If a column regularly has to be aliased there is a high risk that different aliases will be used.  Why even raise the possibility?

  • Give a unique name to all columns that are likely to appear together in a Select statement.
  • Never use Name as a column name!  (there are a number of other prohibited names).

I’m happy for multiple tables to have effective_date and obsolete_date, for example, because these will normally only be used in where clauses, where full names will be enforced; they won’t often be returned for multiple tables in a view.

Naturally enough it was this database that suffered an ambiguous column reference following the addition of a column to a table.  This could never have happened with the Table-Prefix convention: perhaps that is a good reason to use it, in spite of the aversion to its Access appearance!

 

It’s a familiar situation: your email address columns contain phone numbers, important notes, credit card details…  It’s not a problem until you want to use the addresses in bulk – such as for a mailshot – and by that time it is extremely costly to deal with.

The user interface should validate at the time of entry, of course, but we need a backstop validation at database level: just strict enough to stop completely off-the-ball values being entered.  Format validation, in other words.  Non-existent domains and typos are not our concern, and don’t cause us too much trouble (that’s for a different area of data quality); but try sending a telephone number to a mail transport and it will probably raise an exception.

Here is a specification.

Email Column Specification

 email_address nvarchar(254) NULL

Email Address Validation at database layer

See http://en.wikipedia.org/wiki/E-mail_address

  • No attempt to check that the valid address exists: that’s a job for higher tiers.
  • Don’t be too strict.  SQL is not the best language for writing the complex validator that email addresses seem to need.  It would be embarrassing to reject an address that a carefully crafted UI validator has allowed through.
Database Validation rules for Email
Rule Remarks
At least 1 @ symbol must be at least one @ symbol.  Further @s are allowed in local part if quoted  – "local@part"@domain.com
Allow ! # $ % & ' * + – / = ? ^ _ ` { | } ~  . (dot)
Allow "():;<>@[\] comma
At least one other character before and after the @ symbol There could be more than one @ symbol.  Maybe not worth putting too much energy into handling this very rare case.
Minimum 5 characters overall a@com is possible.
All other characters prohibited

That would satisfy me at the database layer.  The UI is the place to make more stringent checks – when there is an opportunity to correct invalid entries.  Many invalid email addresses can get by these validations but (in a cooperative environment) they are likely to be typos: it is better to preserve the slight mistakes and submit them to correction by post-editing if desired.   Of course, if we were selling addresses to a third party we would want stricter validation, but for that it might be better to purchase a tool or use a bureau service.

UDF : FUNC_IsValidEmailAddress

This function is fairly lenient. Few characters are refused in accordance with the broad specification (£, ¬, ¦ are refused).  Therefore it should be reasonably straightforward to synchronise with higher tiers’ validation.  Nevertheless it will still be difficult to slip a textual comment or telephone number into the column.

The string operations will make this a slow function.  OK for an individual inserts and updates; maybe a problem for batches.

CREATE FUNCTION [dbo].[FUNC_IsValidEmailAddress]
    /*
    Performs minimal database-level validation on email addresses.

    Returns Y if email_address meets following criteria, else N.
		- contains @ with at least one character before and three characters after (i.e. 5 chars minimum).
		- all characters must be in this string (case insensitive; includes space):
			'abcdefghijklmnopqrstuvwxyz01234567890@ !#$%&''*+–/=?^_`{|}~."():;<>@[\],'

	NULL email_address returns Y.  

	2013-12-28  http://www.andrewing.co.uk 
*/
(
    @email_address VARCHAR(254)
)
RETURNS CHAR(1)
AS
BEGIN
    DECLARE @success_flag CHAR(1) = 'Y'
    DECLARE @allowed_chars VARCHAR(255)
    DECLARE @searched_char VARCHAR(3)
    DECLARE @email_length SMALLINT
    DECLARE @position SMALLINT

    SET @allowed_chars = 'abcdefghijklmnopqrstuvwxyz01234567890@ !#$%&''*+-/=?^_`{|}~."():;<>@[\],'

	IF (@email_address IS NULL) RETURN 'Y'

    SELECT @email_length = LEN(@email_address)
        ,@position = 0

    IF (@email_address NOT LIKE '%_@___%')
        SET @success_flag = 'N'

    WHILE @position < @email_length
        AND @success_flag = 'Y'
    BEGIN
        SET @position = @position + 1
        SET @searched_char = SUBSTRING(@email_address, @position, 1)

        SELECT @searched_char = CASE @searched_char
                WHEN '_'
                    THEN '[_]'
                WHEN '%'
                    THEN '[%]'
				WHEN '['
					THEN '[[]'

                ELSE @searched_char
                END

        IF ( NOT @allowed_chars LIKE '%' + @searched_char + '%')
            SET @success_flag = 'N'
    END

    RETURN @success_flag
END

Tests

Script should return the values indicated.

-- Tests
print 'Test dbo.FUNC_IsValidEmailAddress().'
print 'NULl.  (Should return Y)'
print dbo.FUNC_IsValidEmailAddress(null)
print 'Missing @ symbol, or @ only at start or end  (N)'
print dbo.FUNC_IsValidEmailAddress('@symbol.at.front')
print dbo.FUNC_IsValidEmailAddress('symbol.at.end@')
print dbo.FUNC_IsValidEmailAddress('no.symbol')
print 'Invalid characters: £ or ¦  (N)'
print dbo.FUNC_IsValidEmailAddress('test@£email.com')
print dbo.FUNC_IsValidEmailAddress('andrew¦test@email.com')
print 'Telephone number or comment!  (N)'
print dbo.FUNC_IsValidEmailAddress('Michelle on 01233 12333444')
print dbo.FUNC_IsValidEmailAddress('Check this record!!')
print 'But we don''t catch an email address plus a comment   (Y)'
print dbo.FUNC_IsValidEmailAddress('andrew@test.com Check before emailing.')
print 'Valid examples from Wikipedia article  (Y)'
print dbo.FUNC_IsValidEmailAddress('very.common@example.com')
print dbo.FUNC_IsValidEmailAddress('a.little.lengthy.but.fine@dept.example.com')
print dbo.FUNC_IsValidEmailAddress('disposable.style.email.with+symbol@example.com')
print dbo.FUNC_IsValidEmailAddress('other.email-with-dash@example.com')
print dbo.FUNC_IsValidEmailAddress('user@[IPv6:2001:db8:1ff::a0b:dbd0]')
print dbo.FUNC_IsValidEmailAddress('"much.more unusual"@example.com')
print dbo.FUNC_IsValidEmailAddress('"very.unusual.@.unusual.com"@example.com')
print dbo.FUNC_IsValidEmailAddress('"very.(),:;<>[]\".VERY.\"very@\\ \"very\".unusual"@strange.example.com')
print dbo.FUNC_IsValidEmailAddress('postbox@com')
print dbo.FUNC_IsValidEmailAddress('admin@mailserver1')
print dbo.FUNC_IsValidEmailAddress('!#$%&''*+-/=?^_`{}|~@example.org')
print dbo.FUNC_IsValidEmailAddress('" "@example.org')
print 'Not accepting unicode yet.  (N)'
print dbo.FUNC_IsValidEmailAddress('üñîçøðé@example.com ')
print 'Invalid examples from Wikipedia article (N)'
print dbo.FUNC_IsValidEmailAddress('Abc.example.com')
print 'Invalid examples that we don''t intend to catch with this lenient test  (Y)'
print dbo.FUNC_IsValidEmailAddress('A@b@c@example.com')
print dbo.FUNC_IsValidEmailAddress('a"b(c)d,e:f;g<h>i[j\k]l@example.com ')
print dbo.FUNC_IsValidEmailAddress('just"not"right@example.com')
print dbo.FUNC_IsValidEmailAddress('this is"not\allowed@example.com ')
print dbo.FUNC_IsValidEmailAddress('this\ still\"not\\allowed@example.com')

 

Speed

It’s not very fast.  Test:  insert 10,000 rows into A a 3-column table with no constraint on the email_address column;  B a similar table, but with a constraint on the email_address column.

  • A: 10,000 row insert, no constraint: CPU time = 15 ms,  elapsed time = 30 ms.
  • B: 10,000 row insert, constraint on email column: CPU time = 1139 ms,  elapsed time = 1214 ms.

An increase of  4,000%, mostly CPU time as you would expect – there is no additional IO.

Let’s break down the function to see what is taking the time.  Start by commenting all in the function except the length check:

1.  Length check only  310% increase:

  • A: CPU time = 32 ms,  elapsed time = 37 ms.
  • B: CPU time = 124 ms,  elapsed time = 116 ms.

2.  “email_address LIKE ‘%_@_%'”  check only  590% increase:

  • A: CPU time = 32 ms,  elapsed time = 30 ms.
  • B: CPU time = 156 ms,  elapsed time = 178 ms.

3.  valid character check only  4,000% increase:

  • A:  CPU time = 16 ms,  elapsed time = 30 ms.
  • B: CPU time = 1138 ms,  elapsed time = 1203 ms.

4. All checks.

  • A: CPU time = 15 ms,  elapsed time = 30 ms.
  • B: CPU time = 1139 ms,  elapsed time = 1214 ms.

Clearly we need an optimisation of that expensive allowed-character check.   Using CLR would give us better string manipulation and regular expressions.

Discussion

  • Speed  I’d be worried about the effect on bulk inserts or updates  (bulk updates that don’t touch the validated column won’t be affected).  In normal circumstances the customers table is not affected in bulk all that often so it might not be a concern.
  • That string of allowed characters (@allowed_chars) is hidden in the function.  It should be centralised and reportable.  It could be wrapped within another function (func_email_address_allowed_chars()).  Other options are table row (not tempting), or database property (how lightweight is that?)

 

Error Handler for T-SQL

E. Sommarskog is the important source on this subject with a set of supremely readable articles here http://www.sommarskog.se/error-handling-I.html – covering up to Sql Server 2005.

ERROR_HandleCaughtError stored procedure

This is  ES’s error handler in a form that suits me: I usually want to add some additional information to aid debugging – e.g. the values of parameters.

  • I seem to be thinking of an “error” schema with the prefix I’ve given it.  Probably dbo is the appropriate location (since little else would ideally end up there).
  • I think I have solved the question of whether the LOG option of RAISERROR is always allowed; I’ll update this post.
  • I’ve gone with ES’s 5555 return value.   Mostly I don’t use the return value, but I’ve yet to form a complete stance on them.  I wouldn’t trust them to indicate errors, largely because of the ambivalence with which they are treated.
  • I am no longer certain why I added Execute As Caller.  It might be important if I were capturing the login running the procedure (and indeed, that might be what appears in the LOG).  On the other hand, it might expose the procedure to the vagaries of Caller’s permissions.  Again, I will confirm and update the post.
CREATE PROCEDURE [dbo].[ERROR_HandleCaughtError] 

        @DebugInfo varchar(2000) = NULL

/*
        2013-01-06

        Reports details of an error to user.
        Logs error to SQL Server Log.

        IN
                @DebugInfo varchar(200) - optional.  Additional 
                                          debug information to append to error message.

	VERSION
		2014-08-19 includes database name.

*/
WITH EXECUTE AS CALLER 
AS
DECLARE @errmsg   nvarchar(2048),
        @severity tinyint,
        @state    tinyint,
        @errno    int,
        @proc     sysname,
        @lineno   int,
		@dbname   varchar(50)

SELECT @errmsg = error_message(), @severity = error_severity(),  
       @state  = error_state(), @errno = error_number(),
       @proc   = error_procedure(), @lineno = error_line(),
	   @dbname = db_name()

IF @errmsg NOT LIKE N'***%'                                         
        BEGIN 
                -- This is the first time the error has been handled in the stack.
                SELECT @errmsg = '*** Proc. ' + @dbname + '..' + coalesce(quotename(@proc), '<dynamic SQL>') + 
                                        ', Ln. ' + ltrim(str(@lineno)) + '. Err. ' + 
                                        ltrim(str(@errno)) + ': ' + @errmsg +
                                        ISNULL(' [' + @DebugInfo + ']', '')
                RAISERROR(@errmsg, @severity, @state) WITH LOG
        END    
ELSE
        BEGIN     
          -- The error information has already been concatenated; rethrow error. No logging required.
           RAISERROR(@errmsg, @severity, @state)

        END

Implementing the Error Handler

Here is a pattern showing the error handler in use in an outline stored procedure.  Copy and Paste.

CREATE PROCEDURE schema.StoredProcedure_Action

        /*
                This is what it does.

                PARAMETERS

                RETURNS

                0/0/0000  Initials : created        
        */

AS

SET NOCOUNT ON
SET XACT_ABORT ON
DECLARE @DebugInfo varchar(2000)

BEGIN TRY

        SELECT @DebugInfo = 'int_parameter=' + ISNULL(CONVERT(varchar(8), @int_parameter), 'NULL') +
                           ' date_parameter=' + ISNULL(CONVERT(varchar(20), @date_parameter, 120), 'NULL');

        -- Raise errors for any parameters that must not be Null.
        IF (@int_parameter IS NULL) RAISERROR('@int_parameter is required', 16, 1)
        -- Procedure Actions Here --

END TRY
BEGIN CATCH
        EXECUTE dbo.ERROR_HandleCaughtError @DebugInfo
        RETURN 5555
END CATCH

 Discussion

All exceptions are re-thrown and take the error number of 50000.   Sometimes I would prefer to raise more specific errors to help the application using the procedure – e.g. differentiating between invalid parameters (which could be corrected by the application) and invalid data like a missing lookup value (which couldn’t).     Custom error numbers have to be added to sys.messages: it allows you to specify a custom formatted error message (good), but the custom entries would apply to all databases on the instance, and I don’t care to take that step just yet. A universal set of conventionalised custom errors would be the answer (which ones would you need?), but I haven’t devised one yet, and I have found little evidence of anyone else using them yet.   What would happen if you loaded a third party database that defined its own conflicting custom errors?

 Revisions

1. Aug 2014.   Knowing the database that the error came from is undeniably useful when trawling logs.  Added to this version.  I haven’t yet found a way of returning the schema of the error procedure.

,

I imagine that it’s a well trodden path seeking out a satisfactory code or syntax highlighter for wordpress.

You can, just about, paste colour-formatted code from some editors into the wp text editor  (e.g. SSMS to Outlook Email to WordPress), but it creates a mess of the HTML.  Not pretty at all.

Current preference:

  • Crayon Syntax Highlighter.   Nice to use;  line numbers; expansion to new window; code-block size limits.  The main drawback seems to be that the colouring is rather random.

 

I have twice now been wrong-footed by this SQL Server error message:

“Microsoft SQL Server Native Client 10.0” Hresult: 0x80040E07 Description: “Conversion failed when converting the varchar value ‘abcd123’ to data type int.”.

The statement causing this distress was:

ISNULL(View1.long_nameView1.short_nameAS output_column

Reasonably, ISNULL will expect both arguments to be the same type will attempt conversion if they aren’t.   But in this example both columns long_name and short_name are the same: varchar – aren’t they?

Well, no they aren’t actually.   View1 was hiding some changes that were made to the source table:  long_name was dropped, and to keep things working View1 merely aliased NULL as long_name.     With no other clue, SQL Server interprets the literal NULL as type int, hence the attempted conversion.

T-SQL Pattern: No un-cast literals in views

It’s necessary for numeric literals, fairly common-sense for string literals, but less natural for literal NULL to be cast to the appropriate data type – though a moment’s thought makes it clear why it must.

       SELECT
           NULL as short_name,  -- wrong : what data type will it be? 
           CAST(NULL AS varchar(50)) as long_name  -- the correct pattern

The first line should stand out as unacceptably vague about the data type.

,