Monday, November 7, 2011

Spring Batch and Hibernate

At JavaOne 2011 I did a BoF session titled “Parallel processing with Spring Batch: Lessons learned”. Unfortunately (although some may say fortunately), 45 minutes is not enough to go into the details of all the lessons learned, which is why I decided to do a write-up of my experiences with Spring Batch and Hibernate.

Using Hibernate in batch jobs has been the subject of some debate. Hibernate is geared towards web applications, and its lazy loading of collections isn’t really a recipe for performance (many queries instead of one) when you always end up needing the collection data. However, performance isn’t really the biggest issue I have with Hibernate and batch. Hibernate’s inherent magic, on the other hand, has caused me to rip my hair out. Having said that, there are perfectly good reasons to use Hibernate in your batch. Chief among them is that your domain and existing services already rely heavily on Hibernate (and its accompanying magic). Sometimes it will just be too time consuming to re-write the business logic to use JDBC in your batch. In this blog post I'll share my experiences with Spring Batch and Hibernate. I'll explain the issues I've had and how I've solved them, finally providing a recommendation for how Spring Batch and Hibernate can be combined successfully.

I should probably say that by no means am I a Hibernate expert, nor do I claim to be. My experience is that most self-proclaimed Hibernate experts are really just API experts. They know their way around the vast number of mapping options, the usertypes, the filters and what not. Herein lies the problem: they are oblivious to the real value AND intrinsic complexity of Hibernate, that is, lazy loading, flushing and transactions. More on that later.

I assume you know some Spring Batch, but here’s a summary of the most important parts needed to understand this blog entry.

A typical step in Spring Batch is executed in chunks.

[Diagram: chunk-oriented processing, with items flowing from reader through processor to writer]
The reader returns a single object per read operation. That object is the input to the process operation. The output from the processor is added to a list that is finally sent to the writer when the number of read items matches the commit interval, or when the reader returns null (indicating that there is no more data left to process). It is also worth mentioning that the processor can return null, effectively filtering out that item (the item is not added to the list passed to the writer).

The transaction is started before the first read and committed after the write. Any exception will cause a rollback, unless otherwise explicitly specified in the configuration.
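In rough pseudo-Java, a single chunk looks something like this (a simplified sketch of the loop only; the surrounding transaction handling, skip/retry logic and listeners are left out):

<I, O> void processChunk(ItemReader<I> reader, ItemProcessor<I, O> processor,
    ItemWriter<O> writer, int commitInterval) throws Exception {
  List<O> outputs = new ArrayList<O>();
  for (int i = 0; i < commitInterval; i++) {
    I item = reader.read();
    if (item == null) {
      break; // reader exhausted: this is the last chunk
    }
    O processed = processor.process(item);
    if (processed != null) { // null from the processor filters the item out
      outputs.add(processed);
    }
  }
  writer.write(outputs);
}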

You can't use the same Hibernate session in the reader as you do in your writer (and processor)
This is likely one of the first batch/Hibernate mismatches you'll encounter when using Spring Batch with Hibernate. One of the reasons you can't use the same transaction-bound Hibernate session in the reader that you use in the writer is that a Hibernate session does not necessarily survive a rollback; there are some exceptions that will invalidate the session. That is why both of Spring Batch’s out-of-the-box Hibernate-based readers (HibernateCursorItemReader and HibernatePagingItemReader) create a separate session that is not associated with the current transaction (in other words, sessionFactory.openSession() instead of sessionFactory.getCurrentSession()). This allows the session to span several chunks, something which is pivotal if you are using the HibernateCursorItemReader, which holds a cursor open over the length of the step execution (which consists of several transactions).
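For reference, a minimal configuration of such a reader could look something like this (the bean name and query are examples, not from a real job):

<bean id="hibernateReader" class="org.springframework.batch.item.database.HibernateCursorItemReader">
 <property name="sessionFactory" ref="sessionFactory" />
 <property name="queryString" value="FROM Order" />
 <!-- true is the default: the reader opens its own StatelessSession -->
 <property name="useStatelessSession" value="true" />
</bean>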

Something you may or may not know is that a mapped association from one session cannot be associated with another. The code snippet below will result in org.hibernate.HibernateException: Illegal attempt to associate a collection with two open sessions when session2.flush() is called. (There is a CascadeType.ALL set on the Order > Item association.)

TransactionStatus transaction = transactionManager.getTransaction(new DefaultTransactionDefinition());
// Load the order in a session of its own (as the stock readers do)
Session session1 = sessionFactory.openSession();
Query query = session1.createQuery("FROM Order");
Order order = (Order) query.list().get(0);
// Then try to save it through the transaction-bound session
Session session2 = sessionFactory.getCurrentSession();
order.getItems().add(createItem("Product 4"));
session2.saveOrUpdate(order);
session2.flush(); // fails: the items collection is still associated with session1
transactionManager.commit(transaction);

This is not a problem for the Spring Batch readers if you don’t change the useStatelessSession property. They will not load the associations, and as you are using a stateless session, they will not be loaded if you access them (instead you’ll see a LazyInitializationException). However, this also means that you can’t add data to these collections.
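A small sketch of what that looks like in practice (assuming the Order entity shown further down):

StatelessSession statelessSession = sessionFactory.openStatelessSession();
Order order = (Order) statelessSession.createQuery("FROM Order").list().get(0);
// The collection was never loaded, and there is no session to load it from:
order.getItems().size(); // throws LazyInitializationException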

In an attempt to get around this you might want to use LEFT JOIN FETCH in your query, to make sure your collections are eagerly loaded, but at the same time detached when they are returned from the reader (you are using a stateless session, remember).

Sounds good, but this causes two new problems:

HibernateException: cannot simultaneously fetch multiple bags
If you map two or more collections using java.util.List, you’ll get an exception stating that you cannot simultaneously fetch multiple bags. The reason for this is that Hibernate is unable to detect duplicates in the cartesian product that the query will produce (read Eyal Lupu’s blog post for a more detailed explanation).
The easy way around this is changing the collection type from List to Set (you can leave at most one List-typed mapping in the entire graph you are trying to load).
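Using the Order mapping shown in the next section, that change would look something like this:

@OneToMany(cascade=CascadeType.ALL)
@JoinColumn(name="ORDER_ID")
private Set<Item> items = new LinkedHashSet<Item>(); // a Set is not a bag, so it can be join fetched alongside other collections

That takes care of the bags. But there's another problem: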

Join fetches do not preserve uniqueness
Queries using join fetch to eagerly load data do not necessarily return a list with one unique entry per root object. Say what?? Let’s try to explain that with some code, using the standard Order + Item scenario:
@Entity
public class Order {
 
 @Id
 @GeneratedValue(strategy=GenerationType.AUTO)
 private Long id;
 
 private String customer;
 
 @OneToMany(cascade=CascadeType.ALL)
 @JoinColumn(name="ORDER_ID")
 private Collection<Item> items = new LinkedHashSet<Item>();

//Methods

}

@Entity
public class Item {
 
 @Id
 @GeneratedValue(strategy=GenerationType.AUTO)
 private Long id;
 
 @ManyToOne
 private Order order;

 private String product;
 
 private double price;
 
 private int quantity;
//Methods
}
Assuming the following database content:

Order:
id (PK)   date
1         1/1/2011
2         1/1/2011

Item:
id (PK)   order_id (FK)   product
1         1               DRAM
2         2               SSD
3         2               CPU

So, the database holds two orders; the first order has one order item, the second has two. Execute the following code:
int count = session.createQuery("select o FROM Order AS o LEFT JOIN FETCH o.items").list().size();

What do you think the value of count would be? 2?
No. It will be 3: one entry per row in the joined result set, and order 2 joins to two items.

The list will have 3 entries, however, there will only be two unique Order instances (the same instance represents order 2 twice). Which means, if you do
List orders = session.createQuery("select o FROM Order AS o LEFT JOIN FETCH o.items").list();
int count = new LinkedHashSet(orders).size();
you’ll get 2.

This is a problem for the HibernatePagingItemReader that ships with Spring Batch. Internally it uses query.setFirstResult(page * pageSize).setMaxResults(pageSize).list(). Since the resulting list may contain duplicates, you can end up processing the same row twice.
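To make this concrete with the data above, here is a sketch using a page size of 2 (unrealistically small, just to keep it short):

Query query = session.createQuery("select o FROM Order AS o LEFT JOIN FETCH o.items");
// Page 0 covers rows 0-1 of the joined result: order 1 and order 2
List page0 = query.setFirstResult(0).setMaxResults(2).list();
// Page 1 covers row 2: order 2 again, so order 2 is processed twice
List page1 = query.setFirstResult(2).setMaxResults(2).list();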

I will outline a solution to this problem further down. It also solves some other Hibernate-related problems in Spring Batch, but before I explain the very simple fix, here’s another problem:

Hibernate inserts & retries
Using the same Order + Item Hibernate graph from the earlier examples, let’s say you have a batch that processes all incoming orders, calculates the total of each order and adds a free gift if the total is above a certain amount.

The reader will fetch all orders for which the order date is today’s date. There is no point in writing a custom reader for this; just configure one of the two stock Hibernate readers mentioned earlier in this post.

The processor will be responsible for calculating the order total and adding a gift as an additional Item on the order if the total is greater than the threshold. The code will look something like this:
public class FreeGiftProcessor implements ItemProcessor<Order, Order> {

  public Order process(Order order) {
    int total = calculateTotal(order);
    if (total > THRESHOLD) {
      order.addItem(createFreeGiftItem());
    }
    return order;
  }

  private Item createFreeGiftItem() {
    // Gets a product number for a freebie which is in stock
    Integer productNumber = productDao.getProductNumberForFreebie();
    return new Item(productNumber);
  }
}
Try not to think about the fact that this could just as easily have been done in real time on order completion, or about the quirky DAO call to get a product number. It’s just a simple example that is easy to follow, used to show the issue you’ll soon see.

The Order is set up to cascade changes in the collection of Items associated with it, as shown in the code sample earlier.

Let’s say we are expecting some deadlocks in the database. They are a fact of life, especially if you do parallel processing in your batch. A database deadlock is a perfect example of a non-deterministic exception where we just want to retry the operation. With Spring Batch you can do automatic retries with a tiny configuration change to your step: you just add a retry-limit to your chunk element and list the retryable exceptions, like this:
 <job id="endOfDayOrderJob" xmlns="http://www.springframework.org/schema/batch">
  <step id="freeGiftOnBigOrdersStep">
   <tasklet>
    <chunk reader="hibernateReader" processor="freeGiftProcessor" writer="hibernateWriter"
     commit-interval="10" retry-limit="5">
     <retryable-exception-classes>
      <include class="org.springframework.dao.DeadlockLoserDataAccessException" />
     </retryable-exception-classes>
    </chunk>
   </tasklet>
  </step>
 </job>

(Spring will by default map exceptions caused by deadlocks to org.springframework.dao.DeadlockLoserDataAccessException when using JDBC or Hibernate. If you have switched off the exception translation in Spring, Hibernate will throw org.hibernate.exception.LockAcquisitionException.)

The diagram below shows a sequence of events that is interrupted by a database deadlock.
[Diagram: chunk sequence of reads, processing and writes, interrupted by a database deadlock during the write]
Notice the statements highlighted in red in the diagram. This is Hibernate fetching the auto-generated primary key after the insert and assigning it to the Java object.

Once a deadlock is encountered, Spring Batch will roll back the current chunk. However, it will not throw away the items read in the, now rolled back, transaction. These will be stored in a retry cache. One of the reasons for this is that ItemReaders in Spring Batch are forward-only, much like iterators. There is no stepping back.

By now you might have guessed what happens when we do a retry. The step will now fetch data, not from the ItemReader, but from the retry cache. A retry does the exact same thing as the previous attempt, expecting a different result. But what happens? We get a StaleStateException. Why? In our first attempt order #1 is identified as an order fulfilling the requirements for a free gift, so a free gift item is added to the order. Then, before the deadlock, this new item is flushed in ItemWriter.write(). The flush will write out any changes done to entities attached to the Hibernate session. In this case, that means that the free gift item added to order #1 is written to the database (an insert) and assigned a new primary key (generated by the database). Hibernate sets the ID of the Item representing the free gift to this generated primary key. However, as we saw in the diagram, a deadlock during the write of order #2 will cause a rollback of the transaction.

Now, we do a retry. The free gift item for order #1 is no longer in the database; as far as the database is concerned, it never happened. The ID set on the Java object is not reset, however. Remember how Spring Batch will not read the Order from the database again, but reuse the object from the first attempt? In other words, it will use the same order with the free gift already added. Worse, with an item ID that has since been rolled back in the database. So, this is what happens:
[Diagram: the retried chunk reusing the stale Order instance, and its already-assigned Item ID, from the first attempt]
The business logic is the same. Order #1 still deserves a free gift. Another free gift is added. When an attempt is made to persist these changes in the ItemWriter, Hibernate will see that order #1 has an order line that has not been persisted (nor fetched) by the active session. As it has the primary key set, Hibernate will assume that this object corresponds to an existing database row, which means an UPDATE should be performed. So: UPDATE order_item SET [...] WHERE id = 42. Of course, we now know that there is no order_item with ID 42 in the database. Hibernate will also realize this when it sees that the number of affected rows from this update is zero. This of course makes Hibernate paranoid (and who can blame it, really). It assumes that something is wrong with its state, hence the StaleStateException.

The fix
The fix for all these problems may seem obvious by now: ditch Hibernate! Actually, you don’t have to ditch it altogether. You can still use Hibernate and all your magically mapped Hibernate entities, just not in the ItemReader. Instead, you should use the simplest JDBC-backed ItemReader you can find. Its job is to read the primary key of each order, instead of reading the full Order-Item object graph. That responsibility will now be pushed down the line, to the ItemProcessor. So, instead of looking like this:
public class FreeGiftProcessor implements ItemProcessor<Order, Order> {

  public Order process(Order order) {
    int total = calculateTotal(order);
    if (total > THRESHOLD) {
      order.addItem(createFreeGiftItem());
    }
    return order;
  }
}

the process method should look like this:
public class FreeGiftProcessor implements ItemProcessor<Long, Order> {

  public Order process(Long orderId) {
    Order order = (Order) session.get(Order.class, orderId);
    int total = calculateTotal(order);
    if (total > THRESHOLD) {
      order.addItem(createFreeGiftItem());
    }
    return order;
  }
}
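The reader itself can be a plain JDBC reader returning primary keys. A sketch of such a configuration (the table and column names are made up for this example):

<bean id="orderIdReader" class="org.springframework.batch.item.database.JdbcCursorItemReader">
 <property name="dataSource" ref="dataSource" />
 <property name="sql" value="SELECT id FROM orders WHERE order_date = CURRENT_DATE" />
 <property name="rowMapper">
  <bean class="org.springframework.jdbc.core.SingleColumnRowMapper">
   <property name="requiredType" value="java.lang.Long" />
  </bean>
 </property>
</bean>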

Now when you encounter deadlocks and you want Spring Batch to do automatic retries, Spring Batch will not have the Hibernate object graph in the retry cache. Instead it will have the very stateless and immutable Long in the cache, and the processor will make sure that no stale Order data from the previous (failed) transaction is reused, as the object graph is re-read by the processor on each retry. There are a number of advantages to using a JDBC-based reader to fetch primary keys (as opposed to objects):

  • You will only need one Hibernate session (per thread) in your batch, and it can be bound to the transaction.
  • The primary keys will not change, so there is no issue with reusing the same object instance in a retry.
  • Reading only PKs in the reader makes it much easier to do multi-threaded steps.
  • Longs (primary keys) have a low memory footprint. This means that your reader can read all the PKs in one SQL query and store them in a class member (List<Long>). The reader can then return the next cached key whenever read() is called, as in the sketch below. (This is what the Spring Batch sample StagingItemReader for multi-threaded steps does.)
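Here is a minimal sketch of such a key-caching reader (not the actual StagingItemReader; the SQL is made up and error handling is omitted):

import java.util.List;
import javax.sql.DataSource;
import org.springframework.batch.item.ItemReader;
import org.springframework.jdbc.core.JdbcTemplate;

public class CachingKeyReader implements ItemReader<Long> {

  private final JdbcTemplate jdbcTemplate;
  private List<Long> keys;
  private int next = 0;

  public CachingKeyReader(DataSource dataSource) {
    this.jdbcTemplate = new JdbcTemplate(dataSource);
  }

  // synchronized so that the threads of a multi-threaded step can
  // safely pull keys from the same cached list
  public synchronized Long read() {
    if (keys == null) {
      // Read all the PKs in one go and cache them
      keys = jdbcTemplate.queryForList(
          "SELECT id FROM orders WHERE order_date = CURRENT_DATE", Long.class);
    }
    return next < keys.size() ? keys.get(next++) : null; // null ends the step
  }
}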


A retry when using a JDBC ItemReader (for example JdbcCursorItemReader or JdbcPagingItemReader) will look like this:
[Diagram: retry sequence when using a JDBC-based ItemReader; the Order graph is re-read by the processor on each attempt]
I'll try to follow up this blog post with another post on the parallel-processing-specific issues. This was more of a Hibernate + Spring Batch post, although many of these problems only became apparent when we parallelized our batch.