Friday, October 23, 2009

Efficient query to find rows with min or max value in a column

I often need to find the row of data with a maximum value in a certain column. For example, let's say I have a customer order table with multiple records per customer. I need to find the latest (or the first) order for each customer based on the order date. To emphasize: I need the entire row, not just the maximum order date for each customer. (Note: this is not exactly the scenario I was dealing with, I changed it for illustrative purposes. If I screwed up the queries, it's a typo. The principle works.)

The obvious solution is to use max() in the subquery, so that the query first finds the maximum order date for each customer, then retrieves the row with that date:

select *
from customer_order a
where order_date = (
select max(b.order_date)
from customer_order b
where b.customer_id = a.customer_id
)


Seems a little convoluted, but ok. But... there are several issues:

1. If a customer has several orders on the maximum date, I'll get all those rows, not just one. Depending on my requirements, I may need just one row.

2. If I am looking for the second or N-th max or min row, this query does not help.

Here is an efficient and a more flexible alternative that uses row_number() to select the first, second, next to last, and the last order per customer:

select *
from (
select
ROW_NUMBER() over (
partition by customer_id
order by order_date ASC) as AscRank,
ROW_NUMBER() over (
partition by customer_id
order by order_date DESC) as DescRank,
*
from customer_order
) q
where
AscRank in (1,2) or
DescRank in (2,1)


This query assigns two numbers for each row: ascending rank and descending rank. So if a customer has 5 orders, the earliest (historically) record will have ascending rank = 1 and descending rank = 5. The latest record will have the opposite: ascending rank = 5, descending = 1.

Thus, for each customer_id this query will return the fist and second, and the last order and next to last record due to the or condition in the where clause. The first record will have AscRank = 1, next AscRank = 2, next to last DescRank = 2, and the last record DescRank = 1. If you want to find only the last order for each customer, use

where DescRank = 1

That's it!

I found this technique to be quick (2 sec) using a member enrollment history table with 1.3 mil rows, with every customer having 1-5 records. The same technique against a health claims table with 1.4 mil claims returns rows in 27 seconds. The claim table has fewer individual members, but more fare more records per member (avg 8 records).

No comments: