Friday, October 23, 2009

Efficient query to find rows with min or max value in a column

I often need to find the row of data with the maximum value in a certain column. For example, let's say I have a customer order table with multiple records per customer. I need to find the latest (or the first) order for each customer based on the order date. To emphasize: I need the entire row, not just the maximum order date for each customer. (Note: this is not exactly the scenario I was dealing with; I changed it for illustrative purposes. If I screwed up the queries, it's a typo. The principle works.)

The obvious solution is to use max() in a correlated subquery, so that the query first finds the maximum order date for each customer and then retrieves the row with that date:

select *
from customer_order a
where order_date = (
    select max(b.order_date)
    from customer_order b
    where b.customer_id = a.customer_id
)


Seems a little convoluted, but OK. However, there are a couple of issues:

1. If a customer has several orders on the maximum date, I'll get all those rows, not just one. Depending on my requirements, I may need just one row.

2. If I am looking for the second or N-th max or min row, this query does not help.

Here is an efficient and more flexible alternative that uses row_number() to select the first, second, next-to-last, and last order per customer:

select *
from (
    select
        ROW_NUMBER() over (
            partition by customer_id
            order by order_date ASC) as AscRank,
        ROW_NUMBER() over (
            partition by customer_id
            order by order_date DESC) as DescRank,
        *
    from customer_order
) q
where
    AscRank in (1, 2) or
    DescRank in (1, 2)


This query assigns two numbers to each row: an ascending rank and a descending rank. So if a customer has 5 orders, the earliest (historically) record will have ascending rank = 1 and descending rank = 5. The latest record will have the opposite: ascending rank = 5, descending rank = 1.

Thus, for each customer_id this query will return the first, second, next-to-last, and last records, thanks to the or condition in the where clause. The first record will have AscRank = 1, the next AscRank = 2, the next-to-last DescRank = 2, and the last record DescRank = 1. If you want to find only the last order for each customer, use

where DescRank = 1
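
For reference, here is a minimal sketch of that condensed version, keeping only the descending rank (same table and columns as above). A nice side effect: even if a customer has several orders on the same maximum date, ROW_NUMBER() still returns exactly one row per customer (which one is arbitrary unless you add a tiebreaker column to the order by):

select *
from (
    select
        ROW_NUMBER() over (
            partition by customer_id
            order by order_date DESC) as DescRank,
        *
    from customer_order
) q
where DescRank = 1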

That's it!

I found this technique to be quick (2 sec) against a member enrollment history table with 1.3 mil rows, with every customer having 1-5 records. The same technique against a health claims table with 1.4 mil claims returns rows in 27 seconds. The claim table has fewer individual members, but far more records per member (avg 8 records).
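
One thing I have not benchmarked here, but which generally helps queries like this, is an index that matches the partition and order columns (the index name below is just an example):

-- Supports partition by customer_id, order by order_date
create index IX_customer_order_customer_date
on customer_order (customer_id, order_date)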

Wednesday, August 5, 2009

Compression in SQL Server 2008

I have been very impressed with this feature of SQL Server 2008. Our ODS database was ~425GB on SQL Server 2005. After migrating to 2008 and enabling data compression (page-type), the size of the database: 104GB.
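
For reference, this is roughly how a single table gets page compression; sp_estimate_data_compression_savings lets you preview the savings first (the table name below is just an example):

-- Preview estimated savings from PAGE compression (example table name)
exec sp_estimate_data_compression_savings
    @schema_name = 'dbo',
    @object_name = 'member_enrollment',
    @index_id = NULL,
    @partition_number = NULL,
    @data_compression = 'PAGE'

-- Enable PAGE compression on the table
ALTER TABLE dbo.member_enrollment REBUILD WITH (DATA_COMPRESSION = PAGE)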

Even more impressive is the backup compression. Full backup file size was about 400GB and it took hours to complete. Backup size now: 44GB. Full backup time: 13 minutes. Unbelievable.
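
Backup compression can be requested per backup or made the server default (sketch below; the database name and disk path are placeholders):

-- Compress an individual backup
BACKUP DATABASE ODS
TO DISK = 'X:\Backups\ODS.bak'
WITH COMPRESSION

-- Or make compression the server-wide default
exec sp_configure 'backup compression default', 1
RECONFIGURE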

Regarding table compression, here is a query that I use to show table compression status:
SELECT distinct
    OBJECT_SCHEMA_NAME(OBJECT_ID) as SchemaName,
    OBJECT_NAME(OBJECT_ID) AS TableName,
    data_compression_desc,
    rows
FROM sys.partitions
WHERE OBJECT_SCHEMA_NAME(OBJECT_ID) not in ('sys')
ORDER BY SchemaName, TableName

For our sandbox database, where business analysts do extensive "data smashing", I created this stored procedure to compress uncompressed tables:
-- This procedure compresses uncompressed tables
-- that have more than 10,000 rows

create proc [dbo].[usp_compress_tables] as

-- get uncompressed tables

declare CUR cursor local fast_forward for
SELECT distinct
    OBJECT_SCHEMA_NAME(OBJECT_ID) as SchemaName,
    OBJECT_NAME(OBJECT_ID) AS TableName
FROM sys.partitions
WHERE
    data_compression = 0 and rows > 10000
    and OBJECT_SCHEMA_NAME(OBJECT_ID) not in ('sys')

declare @tbl sysname, @sch sysname, @cmd varchar(max)

-- loop through tables
open CUR
fetch next from CUR into @sch, @tbl
while @@FETCH_STATUS = 0 begin

    -- compose: alter table ... rebuild with (data_compression = page)
    select @cmd = 'ALTER TABLE [' + @sch + '].[' + @tbl + '] ' +
        'REBUILD WITH (DATA_COMPRESSION = PAGE)'

    print @cmd
    exec ( @cmd )
    print 'Done.'

    fetch next from CUR into @sch, @tbl
end

-- clean up
close CUR
deallocate CUR

GO

Warning: I did some retyping here to make the code look nicer on the web page, so I may have introduced a syntax error. If you use it, give it a critical look-over in Management Studio first.

Friday, June 5, 2009

Restricting Users to a Schema in a Shared Database

Our healthcare analysts do lots of "data smashing," so they needed a work area where they could play with the data. Initially we were creating individual databases for each user, but this quickly proved difficult to manage. So we decided to create a single sandbox database and separate users/projects by schemas.

The main stumbling block was to figure out how to grant all of the rights that a database owner would normally have, but only in a specific schema. After some experimenting with SQL Server, here is what worked for us:

-- create the work area (Schema)
create schema HEDIS authorization dbo
GO

-- create user that will be restricted to the HEDIS schema
create user TESTUSER
for login TESTUSER
with DEFAULT_SCHEMA = HEDIS
GO

-- Grant various create rights to the user
-- with grant option (the user can authorize others)
grant create table, create view, create procedure,
    create function, create synonym
to TESTUSER
with grant option
GO

-- Grant control over the schema
grant control on schema::HEDIS
to TESTUSER with grant option
GO
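
A quick way I sanity-check the setup is to impersonate the new user and create an object; with the default schema in place, unqualified names land in HEDIS (the table name below is just an example):

-- Impersonate the restricted user
EXECUTE AS USER = 'TESTUSER'

-- Unqualified name resolves to the default schema, i.e. HEDIS.test_table
create table test_table (id int)
select * from HEDIS.test_table

REVERT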


That's it. Let me know if anyone knows a different/better way to do it or if I missed some rights/permissions.

Data Warehouse Overview

In this post I want to briefly describe what I mean by an ultra-light, ultra-useful data warehouse.

Our data warehouse development effort came about because our internal IT has largely failed to create a truly easy-to-use integrated data warehouse. We have numerous systems spanning patient encounters, claims, referrals, scheduling, prescriptions, labs, radiology, membership, CRM, provider credentialing, general ledger, payroll, HR and so on. We also have many different "data warehouses." We have such database/reporting technologies as Oracle, SQL Server, DB2 on Unix/mainframes, Sybase, and even a large Teradata complex with Crystal Enterprise. Yet, in the end, reporting analysts were using Paradox and Access to cobble together integrated data sets from these source systems and existing "data warehouses" and e-mailing PDFs to the business users.

So we have created an externally-hosted SQL Server-based platform to remedy the problem. Here are its main characteristics:

1. Total database size: 480GB
2. One production database server with SQL Server 2005 Enterprise (migrating to 2008), one production report server with SharePoint integration, and one development DB server.
3. ETLs use Integration Services and stored procedures.
4. 2.5 FTEs manage the entire solution, excluding server and software management, which is done by the hosting provider.
5. RAD methodology is used to build dimensional models and ETLs quickly and incrementally. All ETLs are framework-driven and highly automated.

In the end, the degree of integration and performance we have achieved makes this solution highly useful, as it quickly eliminated the need for Paradox-based data integration and reporting, provided consistency among analysts, delivered Web-based self-service reports, and provided a way for a rapid response to business needs. The fact that we are building this on a SQL Server platform with 2.5 FTEs and an external hosting vendor makes this an ultra-light solution in terms of the cost to the organization.

In the upcoming posts I will document our ongoing issues and lessons learned, as well as review some interesting designs, such as our highly automated single-code-based "uber-ETL" process, technical server architecture with disk layouts, and so on.

Stay tuned.