Hibernate query limitations and correlated sub queries

It’s well known that one of the key limitations of writing queries in Hibernate, using either HQL or criteria, is that you cannot write queries that have a join in the from clause. What is less well known is that for some simple queries that involve a join in the from clause, you can rewrite them as correlated subqueries, to make them runnable by Hibernate. Let’s look at an example. Imagine we have a customer table and an orders table. The orders table has a foreign key to the customer table. Suppose you want to get a report of all customers that have three or more orders. If you were to write this in SQL using a join, you would write:

select id from 
customer c 
    join on 
(select customer_id,count(*) as num_orders from orders group by customer_id) co 
    where c.id = co.customer_id and co.num_orders > 2

To convert this to an SQL correlated subquery, the join is removed and the subquery gets moved to the where clause. The conditions on the join get pushed down into the subquery. Since they refer to the columns from the outer query (on customer), this is what makes the subquery a correlated subquery:

select id,forename,surname from
customer c 
where (select count(*) from orders o where o.customer_id = c.id) > 2

In HQL, this becomes:

select id,forename,surname from 
Customer c 
where (select count(*) from Order o where o.customer.id = c.id) > 2

Now suppose we want the most recent order for each customer. In the most basic variation, where you just want to get the timestamp of the most recent order for each customer id, you get both columns out of your grouping query:

select max(timestamp),customer_id from orders group by customer_id

In the more general case, where you want to get further details of either the customer or the order, you have to join this query to the customer or orders table respectively. For example, to get the order id and total as well:

select o.id,o.timestamp,o.order_total from 
orders o 
(select max(timestamp) as max_ts,customer_id from orders group by customer_id) as x 
on o.customer_id = x.customer_id and o.timestamp = x.max_ts

To convert this to a correlated subquery, the subquery gets moved from the join into the where clause, and the additional condition on the join becomes part of the subquery where clause. Note that the group by can be removed, since the restriction on customer_id has become part of the correlated subquery where clause.

select o1.id,o1.timestamp,o1.order_total from 
orders o1 
where o1.timestamp in 
(select max(timestamp) from orders o2 where o1.customer_id = o2.customer_id)

In HQL this would become:

select o1.id,o1.timestamp,o1.orderTotal from
Order o1
where o1.timestamp in 
(select max(timestamp) from Order o2 where o1.customer.id = o2.customer.id)

Suppose we want to add some details of the customer as well. In sql we would join this to the customer table:

select c.id as cust_id,c.forename,c.surname,o1.id,o1.timestamp,o1.order_total 
from orders o1
join customer c on o1.customer_id = c.id
where o1.timestamp in 
(select max(timestamp) from orders o2 where o1.customer_id = o2.customer_id)

In HQL, we simply navigate the association from order to customer to get the customer info:

select o1.customer.id,o1.customer.forename,o1.customer.surname,
from Order o1 where o1.timestamp in 
  (select max(timestamp) from Order o2 where o1.customer.id = o2.customer.id)

So, are these queries equivalent to the corresponding join queries, and should you always rewrite them in this format? The answer to the first question is that technically, a subquery could get a different execution plan at the database. In theory, evaluating a correlated subquery could mean that the database evaluates the correlated subquery separately for every single row in the outer query. Clearly, this would be a very expensive operation. However, for correlated subqueries that are actually equivalent to a join query, a good database query optimiser will recognise this, and follow the same execution plan as it would do for the join query. i.e. it will perform a join operation to get a result set that contains all of the rows in the subquery as a single operation, rather than evaluating the query once for every row in the outer query. Certainly, if I run queries like the above on SQL Server and look at the execution plan, I can see it is equivalent to the corresponding join query. What about the second question? Even if you can rewrite these sorts of query, should you? Well, this probably comes down to the normal questions surrounding Hibernate – the Hibernate version of the query won’t need updating if you change the tables or columns in the database, whereas the SQL version would. If you want a query that returns objects, the Hibernate version will make this easier for you. As a general rule, the Hibernate versions are going to be easier to use and maintain.

This article is part of a series on Hibernate. You might be in interested in other articles in the series:

Bespoke join conditions with Hibernate JoinFormula
Inheritance and polymorphism
One-to-many associations
One-to-one associations
Many-to-many associations

This entry was posted in Hibernate and tagged . Bookmark the permalink.

Leave a Reply

Your email address will not be published. Required fields are marked *

503,438 Spambots Blocked by Simple Comments

HTML tags are not allowed.