Monday, November 23, 2009

Oracle's MINUS and SQL Server's EXCEPT

Today I was a little surprised to find a difference between Oracle and SQL Server syntax that I wasn't aware of. I had a need to find a set of rows which had been incorrectly geographically coded. The incorrect coding took a number of forms and I wanted to make sure that I wasn't double counting.

Easy - I'd done the same type of thing many times before with Oracle. Just use the MINUS operator. Perhaps not. SQL Server complained about the MINUS keyword. With a little digging I found that the SQL Server equivalent is EXCEPT.

It does the same thing. It returns any rows from the first query that are not also returned by the second query (both operators also remove duplicates from the result). The Oracle syntax is:

SELECT col_a, col_b, col_c
FROM view1
MINUS
SELECT col_a, col_b, col_c
FROM view2
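
The SQL Server version just swaps the keyword:

SELECT col_a, col_b, col_c
FROM view1
EXCEPT
SELECT col_a, col_b, col_c
FROM view2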

There are a few gotchas to be aware of. The two queries must return the same number of columns, in the same order, and the data types of the corresponding columns must be compatible.

I was even more surprised to find that the EXCEPT keyword is actually the ANSI SQL standard. Oracle's MINUS is the odd one out!

Friday, November 13, 2009

Unexplained NULLs from Excel Sources in SSIS

Today, when using SSIS to transpose some data, I ran into a problem which I'd seen before but had never had time to properly investigate and understand. This time the alternative to solving the problem was a lot of manual data manipulation, so it was time to dig in and work out what was going on.

So here's the problem - you may well have struck it yourself.

My Excel spreadsheet has a text column and three columns of integers.

But when I added an SSIS Excel Source and previewed the data, rather than seeing the values "Battery Point", 2, 1 and 1 for the first data row, I saw "Battery Point", 2, 1, NULL. Where had the value for the last column gone?

Some digging around MSDN revealed that the Excel driver reads the first 8 rows of the file to determine the data type for each column. When a column has mixed data types the majority type wins, with numeric types beating text types in the event of a tie. Cells in that column with the losing type(s) are given a NULL value. In my case I had one numeric cell and seven text cells in the first eight rows of the column, so my integer value of 1, despite being in the first row, lost out under the collective weight of the following seven rows and was turned into a NULL.

It is possible to avoid this by telling the driver to work in import mode. To do this, bring up the Properties page of the connection manager and add the value IMEX=1 to the extended properties of the connection string. You'll need to type the value into the connection string manually rather than using the ellipsis button, or you won't be able to make the change.

So, in my example, I start with the default connection string and simply append IMEX=1 to the Extended Properties value, as sketched below.
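
For an Excel 97-2003 file the connection string looks something like this (the file path here is hypothetical, and your Extended Properties value may differ slightly):

Provider=Microsoft.Jet.OLEDB.4.0;Data Source=C:\Data\Suburbs.xls;Extended Properties="Excel 8.0;HDR=YES";

And with import mode switched on:

Provider=Microsoft.Jet.OLEDB.4.0;Data Source=C:\Data\Suburbs.xls;Extended Properties="Excel 8.0;HDR=YES;IMEX=1";
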
A quick check of the preview from the Excel source reveals my value of 1 is back - the NULL is gone!


By the way, it's also possible to change the number of rows the driver samples to guesstimate the data type for each column to something other than the default value of 8. This is done by editing the registry key HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Jet\4.0\Engines\Excel and changing the value of TypeGuessRows to your new value.
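
As a sketch, that change could be scripted from a command prompt run as administrator (the special value 0 reportedly makes the driver sample up to the first 16,384 rows):

reg add "HKLM\SOFTWARE\Microsoft\Jet\4.0\Engines\Excel" /v TypeGuessRows /t REG_DWORD /d 0 /f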

Thursday, November 12, 2009

Reporting Services Parameters - Something to Avoid

Recently I had a need to review some SSRS reports and came across one of my pet report development hates - querying an inappropriate table to generate a list of values for a parameter pick list.

In this case the query behind the parameter was performing a table scan on a 35 million row transactional table in order to generate just two values to display as potential parameter choices. I suspect that in development performance was quite snappy, and so no-one bothered to consider the efficiency of the SQL statement behind the parameter list. Even now (in a production environment) performance wasn't annoyingly bad - under 2 seconds to return the values required. But consider that the table is growing by tens of thousands of rows per day. The cost of the parameter query will keep rising as the table grows, and the load on the database server will be higher than it needs to be - a problem that may well be compounded if the same design approach has been used across many reports on the server.

So what are the alternatives?

Ideally, take a look at the data model to see if there is already a table which models just the data for the parameter. For example, rather than querying a transactional hospital admissions table to get a list of hospital names, look for a table which contains information at the hospital grain - i.e. one row per hospital. Chances are if the entity is important enough to be required for a parameter it will already be modelled in the data model. This is particularly true in star schemas - look for a dimension table.
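
As a rough sketch (the table and column names here are hypothetical), instead of the parameter dataset scanning the transactional table:

SELECT DISTINCT hospital_name
FROM hospital_admissions

query the table that is already at the hospital grain:

SELECT hospital_name
FROM dim_hospital
ORDER BY hospital_name

The second query touches one row per hospital rather than one row per admission, so its cost stays flat no matter how large the transactional table grows.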

If that can't be done, other options include adding a table to the database, querying a smaller transactional table, or adding an index to the transactional table which covers the column containing the values required for the parameter. Each of these approaches has potential implications for other interfaces to the database, so none of them should be attempted without consultation and consideration.
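
If the indexing route is taken, a sketch (again with hypothetical names) would be:

CREATE NONCLUSTERED INDEX IX_hospital_admissions_hospital_name
ON hospital_admissions (hospital_name);

With the parameter column covered, the parameter query can be satisfied by scanning the much narrower index rather than the whole table.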