Monday, 7 November 2011

Spring Batch and Hibernate

At JavaOne 2011 I did a BoF session titled “Parallel processing with Spring Batch: Lessons learned”. Unfortunately (although some may say fortunately), 45 minutes is not enough to go into the details of all lessons learned, which is why I decided to do a write-up of my experiences with Spring Batch and Hibernate.

Using Hibernate in batch jobs has been the subject of some debate. Hibernate is geared towards web applications, and its lazy loading of collections isn't really a recipe for performance (many queries instead of one) when you always end up needing the collection data. However, performance isn't really the biggest issue I have with Hibernate and batch. Hibernate's inherent magic, on the other hand, has caused me to tear my hair out. Having said that, there are perfectly good reasons to use Hibernate in your batch. Chief among them is that your domain and existing services already rely heavily on Hibernate (and its accompanying magic). Sometimes it will just be too time consuming to rewrite the business logic to use JDBC in your batch. In this blog post I'll share my experiences with Spring Batch and Hibernate. I'll explain the issues I've had and how I've solved them, finally providing a recommendation for how Spring Batch and Hibernate can be combined successfully.

I should probably say that by no means am I a Hibernate expert, nor do I claim to be. My experience is that most self-proclaimed Hibernate experts are really just API experts. They know their way around the vast number of mapping options, the user types, the filters and whatnot. Herein lies the problem: they are oblivious to the real value AND intrinsic complexity of Hibernate, namely lazy loading, flushing and transactions. More on that later.

I assume you know some Spring Batch, but here's a summary of the most important parts needed to understand this blog entry.

A typical step in Spring Batch is executed in chunks.



The reader returns a single object per read operation. That object is the input to the process operation. The output from the processor is added to a list that is finally sent to the writer when the number of read items matches the commit interval, or when the reader returns null (indicating that there is no more data left to process). It is also worth mentioning that the processor can return null, effectively filtering out that item (the item is not added to the list passed to the writer).

The transaction is started before the first read and committed after the write. Any exception will cause a rollback, unless otherwise explicitly specified in the configuration.
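
To make this concrete, here is a rough sketch (not the actual framework code; restart handling, listeners, and skip/retry logic are left out) of what a single chunk boils down to:

import java.util.ArrayList;
import java.util.List;

import org.springframework.batch.item.ItemProcessor;
import org.springframework.batch.item.ItemReader;
import org.springframework.batch.item.ItemWriter;

// Simplified illustration of one chunk in a chunk-oriented step.
public class ChunkSketch {

  public static <I, O> void processOneChunk(ItemReader<I> reader, ItemProcessor<I, O> processor,
      ItemWriter<O> writer, int commitInterval) throws Exception {
    // A transaction is started here.
    List<O> outputs = new ArrayList<O>();
    for (int i = 0; i < commitInterval; i++) {
      I item = reader.read();
      if (item == null) {
        break; // no more input, end the chunk early
      }
      O output = processor.process(item);
      if (output != null) { // a null from the processor filters the item out
        outputs.add(output);
      }
    }
    writer.write(outputs);
    // The transaction is committed here; any exception causes a rollback.
  }
}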

You can't use the same Hibernate session in the reader as you do in your writer (and processor)
This is likely one of the first batch/Hibernate mismatches you'll encounter when using Spring Batch with Hibernate. One of the reasons you can't use the same transaction-bound Hibernate session in the reader that you use in the writer is that a Hibernate session does not necessarily survive a rollback; some exceptions will invalidate the session. That is why both of Spring Batch's out-of-the-box Hibernate-based readers (HibernateCursorItemReader and HibernatePagingItemReader) create a separate session that is not associated with the current transaction (in other words, sessionFactory.openSession() instead of sessionFactory.getCurrentSession()). This allows the session to span several chunks, something which is pivotal if you are using the HibernateCursorItemReader, which holds a cursor open over the length of the step execution (which consists of several transactions).
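
For reference, here is roughly how such a reader can be wired up programmatically (in a real job it would typically be declared as a Spring bean; the query string is a placeholder):

import org.hibernate.SessionFactory;
import org.springframework.batch.item.database.HibernateCursorItemReader;

// Minimal setup of the cursor-based reader.
public class ReaderSetup {

  public static HibernateCursorItemReader<Order> orderReader(SessionFactory sessionFactory) {
    HibernateCursorItemReader<Order> reader = new HibernateCursorItemReader<Order>();
    reader.setSessionFactory(sessionFactory); // the reader opens its own session internally
    reader.setQueryString("FROM Order");      // placeholder query
    reader.setUseStatelessSession(true);      // the default; more on this below
    return reader;
  }
}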

Something you may or may not know is that a mapped association from one session cannot be associated with another. The code snippet below will result in org.hibernate.HibernateException: Illegal attempt to associate a collection with two open sessions when session2.flush() is called. (There is CascadeType.ALL set on the Order > Item association.)

TransactionStatus transaction = transactionManager.getTransaction(new DefaultTransactionDefinition());
Session session1 = sessionFactory.openSession();       // session NOT bound to the transaction
Query query = session1.createQuery("FROM Order");
Order order = (Order) query.list().get(0);             // order and its collection belong to session1
Session session2 = sessionFactory.getCurrentSession(); // transaction-bound session
order.getItems().add(createItem("Product 4"));
session2.saveOrUpdate(order);                          // tries to associate session1's collection with session2
session2.flush();                                      // throws: the collection is still tied to session1
transactionManager.commit(transaction);

This is not a problem for Spring Batch readers if you don't change the useStatelessSession property. They will not load the associations, and since you are using a stateless session, the associations will not be loaded lazily if you access them either (instead you'll see a LazyInitializationException). However, this also means that you can't add data to these collections.

In an attempt to get around this you might want to use LEFT JOIN FETCH in your query, to make sure your collections are eagerly loaded but at the same time in a detached state when they are returned from the reader (you are using a stateless session, remember).

Sounds good, but this causes two new problems:

HibernateException: cannot simultaneously fetch multiple bags
If you map two or more collections using java.util.List, you'll get an exception stating that you cannot simultaneously fetch multiple bags. The reason for this is that Hibernate is unable to detect duplicates in the cartesian product that the query will produce (read Eyal Lupu's blog post for a more detailed explanation).
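
To illustrate, here is a hypothetical variant of the Order entity from further down with two List collections. Payment is a made-up entity; only the collection types matter here:

import java.util.ArrayList;
import java.util.List;
import javax.persistence.*;

// Two java.util.List (bag) collections on the same entity.
@Entity
public class Order {

  @Id
  @GeneratedValue(strategy = GenerationType.AUTO)
  private Long id;

  @OneToMany(cascade = CascadeType.ALL)
  @JoinColumn(name = "ORDER_ID")
  private List<Item> items = new ArrayList<Item>();          // bag #1

  @OneToMany(cascade = CascadeType.ALL)
  @JoinColumn(name = "ORDER_ID")
  private List<Payment> payments = new ArrayList<Payment>(); // bag #2
}

// Eagerly fetching both bags in one query fails:
// session.createQuery("FROM Order o LEFT JOIN FETCH o.items LEFT JOIN FETCH o.payments").list();
// => org.hibernate.HibernateException: cannot simultaneously fetch multiple bags
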
The easy way around this is changing the collection type from List to Set (you can keep at most one List-typed mapping in the entire graph you are trying to load). But there's another problem:

Join fetches do not preserve uniqueness
Queries using join fetch to eagerly load data do not necessarily return a list with one unique entry per root object. Say what?? Let's try to explain that with some code, using the standard Order + Item scenario:
@Entity
public class Order {
 
 @Id
 @GeneratedValue(strategy=GenerationType.AUTO)
 private Long id;
 
 private String customer;
 
 @OneToMany(cascade=CascadeType.ALL)
 @JoinColumn(name="ORDER_ID")
 private Collection<Item> items = new LinkedHashSet<Item>();

//Methods

}

@Entity
public class Item {
 
 @Id
 @GeneratedValue(strategy=GenerationType.AUTO)
 private Long id;
 
 @ManyToOne
 private Order order;

 private String product;
 
 private double price;
 
 private int quantity;
//Methods
}
Assuming the following database content:

Order:
id (PK)   date
1         1/1/2011
2         1/1/2011

Item:
id (PK)   order_id (FK)   product
1         1               DRAM
2         2               SSD
3         2               CPU

So, the database holds two orders; the first order has one order item, the second has two. Executing the following code:
int count = session.createQuery("SELECT o FROM Order AS o LEFT JOIN FETCH o.items").list().size();

What do you think the value of count would be? 2?
No. It will be 3: one entry per row in the joined result (one row for order #1 plus two rows for order #2).

The list will have 3 entries for Order; however, there will only be two Order instances. Which means, if you do
List<Order> orders = session.createQuery("SELECT o FROM Order AS o LEFT JOIN FETCH o.items").list();
int count = new HashSet<Order>(orders).size();
you'll get 2.

This is a problem for the HibernatePagingItemReader that ships with Spring Batch. Internally it uses query.setFirstResult(page * pageSize).setMaxResults(pageSize).list(). Since the resulting list may contain duplicates, you can end up processing the same row twice, as the sketch below shows.
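
Here is what paging over the example data from above would return (page size 2 over the 3-row join-fetch result):

Query query = session.createQuery("SELECT o FROM Order AS o LEFT JOIN FETCH o.items");
List page0 = query.setFirstResult(0).setMaxResults(2).list(); // orders #1 and #2
List page1 = query.setFirstResult(2).setMaxResults(2).list(); // order #2 again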

I will outline a solution for this problem further down. This solution will also solve some other Hibernate-related problems in Spring Batch, so before I explain the very simple solution, here’s another problem:

Hibernate inserts & retries
Using the same Order + Item Hibernate graph from the earlier examples, let's say you have a batch that will process all incoming orders, calculate the sum of each order and add a free gift if the sum is above a certain amount.

The reader will fetch all orders for which the order date is today's date. There is no point in writing a custom reader for this; just configure one of the two stock Hibernate readers mentioned earlier in this post.

The processor will be responsible for calculating the order total and adding a gift as an additional Item on the order if the order total is greater than the threshold. The code will look something like this:
public class FreeGiftProcessor implements ItemProcessor<Order, Order> {

  public Order process(Order order) {
    int total = calculateTotal(order);
    if (total > THRESHOLD) {
      order.addItem(createFreeGiftItem());
    }
    return order;
  }

  private Item createFreeGiftItem() {
    // Gets a product number for a freebie that is in stock
    Integer productNumber = productDao.getProductNumberForFreebie();
    return new Item(productNumber);
  }
}
Try not to think about the fact that this could just as easily have been done in real time on order completion, or about the quirky DAO call to get a product number. It's just a simple example that is easy to follow, used to show the issue you'll soon see.

The Order is set up to cascade changes in the list of Items associated with the Order, as shown in the code sample earlier.

Let's say we are expecting some deadlocks in the database. They are a fact of life, especially if you do parallel processing in your batch. A database deadlock is a perfect example of a non-deterministic exception where we just want to retry the operation. With Spring Batch you can do automatic retries with a tiny configuration change to your step: you just add a retry-limit to your chunk element and list the retryable exceptions, like this:
 <job id="endOfDayOrderJob" xmlns="http://www.springframework.org/schema/batch">
  <step id="freeGiftOnBigOrdersStep">
   <tasklet>
    <chunk reader="hibernateReader" processor="freeGiftProcessor" writer="hibernateWriter"
     commit-interval="10" retry-limit="5">
     <retryable-exception-classes>
      <include class="org.springframework.dao.DeadlockLoserDataAccessException" />
     </retryable-exception-classes>
    </chunk>
   </tasklet>
  </step>
 </job>

(Spring will map exceptions caused by deadlocks to an org.springframework.dao.DeadlockLoserDataAccessException by default when using JDBC or Hibernate. If you have switched off the exception translation in Spring, Hibernate will throw org.hibernate.exception.LockAcquisitionException.)

The diagram below shows a sequence of events that is interrupted by a database deadlock.


Notice the statements highlighted in red in the diagram. This is Hibernate fetching the auto-generated primary key after the insert and assigning it to the Java object.

Once a deadlock is encountered, Spring Batch will roll back the current chunk. However, it will not throw away the items read in the now rolled-back transaction. These will be stored in a retry cache. One of the reasons for this is that ItemReaders in Spring Batch are forward-only, much like iterators. There is no stepping back.

By now you might have guessed what happens when you do a retry. The step will now fetch data, not from the ItemReader, but from the retry cache. A retry does the exact same thing as the previous attempt, expecting a different result. But what happens? We get a StaleStateException. Why? In our first attempt, order #1 is identified as an order fulfilling the requirements for a free gift, so a free gift item is added to the order. Then, before the deadlock, this new line is flushed in ItemWriter.write(). This flush will write out any changes done to entities attached to the Hibernate session. In this case, that means that the free gift item added to order #1 is written to the database (an insert) and assigned a new primary key (generated by the database). Hibernate sets the ID of the Item representing the free gift to the generated primary key. However, as we saw in the diagram, a deadlock during the write of order #2 will cause a rollback of the transaction.

Now, we do a retry. The free gift item for order #1 is no longer in the database; as far as the database is concerned, it never happened. The ID set on the Java object is not reset, however. Remember how Spring Batch will not read the Order from the database again, but will reuse the object from the first attempt? In other words, it will use the same order with that free gift already added. Worse, with an item ID that is now rolled back in the database. So, this is what happens:


The business logic is the same. Order #1 still deserves a free gift. Another free gift is added. When an attempt is made to persist these changes in the ItemWriter, Hibernate will see that order #1 has an order line that has not been persisted (nor fetched) by the active session. As it has the primary key set, Hibernate will assume that this object corresponds to an existing database row, which means an UPDATE should be performed. So: update order_item set [...] where id=42. Of course, we now know that there is no order_item with ID 42 in the database. Hibernate will also realize this when it sees that the number of affected rows from this update is zero. This of course makes Hibernate paranoid (and who can blame it, really). It assumes that something is wrong with its state, hence the StaleStateException.

The fix
The fix for all these problems may seem obvious by now: ditch Hibernate! Actually, you don't have to ditch it altogether. You can still use Hibernate and all your magically mapped Hibernate entities, just not in the ItemReader. Instead, you should use the simplest JDBC-backed ItemReader. Its job is to read the primary key of each order, instead of reading the full Order-Item object graph. That responsibility will now be pushed down the line, to the ItemProcessor. So, instead of looking like this:
public class FreeGiftProcessor implements ItemProcessor<Order, Order> {

  public Order process(Order order) {
    int total = calculateTotal(order);
    if (total > THRESHOLD) {
      order.addItem(createFreeGiftItem());
    }
    return order;
  }
}

the process method should look like this:
public class FreeGiftProcessor implements ItemProcessor<Long, Order> {

  public Order process(Long orderId) {
    // Re-read the object graph from the transaction-bound session on every attempt
    Order order = (Order) session.get(Order.class, orderId);
    int total = calculateTotal(order);
    if (total > THRESHOLD) {
      order.addItem(createFreeGiftItem());
    }
    return order;
  }
}

Now, when you encounter deadlocks and you want Spring Batch to do automatic retries, Spring Batch will not have the Hibernate object graph in its retry cache. Instead it will have the very stateless and immutable Long in the cache. And the processor will make sure that no stale Order data from the previous (failed) transaction is reused, as the object graph is re-read by the processor on each retry. There are a number of advantages to using a JDBC-based reader to fetch primary keys (as opposed to objects):

  • You will only need one Hibernate session (per thread) in your batch, and it can be bound to the transaction.
  • The primary keys will not change, so there is no issue with reusing the same object instance in a retry.
  • Reading PKs only in the reader makes it much easier to do multi-threaded steps.
  • Longs (primary keys) have a low memory footprint. This means that your reader can read all the PKs in one SQL query and store them in a class member (List<Long>); the reader can then return the next cached key whenever read() is called. (This is what the Spring Batch sample StagingItemReader for multi-threaded steps does; a sketch follows below.)
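
To make the last point concrete, here is a minimal sketch of such a key-only reader. The table and column names are assumptions, and a production version would also handle restart state (for example by implementing ItemStream):

import java.util.Iterator;
import javax.sql.DataSource;

import org.springframework.batch.item.ItemReader;
import org.springframework.jdbc.core.JdbcTemplate;

// Reads all order primary keys up front and hands them out one by one.
public class OrderKeyReader implements ItemReader<Long> {

  private final JdbcTemplate jdbcTemplate;
  private Iterator<Long> keys;

  public OrderKeyReader(DataSource dataSource) {
    this.jdbcTemplate = new JdbcTemplate(dataSource);
  }

  public synchronized Long read() {
    if (keys == null) {
      // One SQL statement fetches every key; Longs are cheap to keep in memory.
      keys = jdbcTemplate.queryForList(
          "SELECT id FROM orders WHERE order_date = CURRENT_DATE", Long.class).iterator();
    }
    return keys.hasNext() ? keys.next() : null; // null signals end of input
  }
}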


A retry when using a JDBC-based ItemReader (for example JdbcCursorItemReader or JdbcPagingItemReader) will look like this:



I'll try to follow up this blog post with another post on the parallel-processing-specific issues. This was more of a Hibernate + Spring Batch post, although many of these problems only became apparent when we parallelized our batch.

Thursday, 3 November 2011

JavaOne summary


This was my second time at JavaOne, the first time being in 2006. Obviously, much has changed since then, one of which is the location. Instead of Moscone Center, JavaOne is now hosted at three hotels near Union Square. This is the second time it is done this way, but it’s my first experience with it.

Location
People who attended JavaOne 2010 said that this year was a great improvement over last year when it came to locations. That probably says more about the situation last year than this year.
A recurring problem for me was figuring out where the session I wanted to attend was held. I had the printed agenda, but it was huge and I often forgot to bring it. The mobile app required me to log in all the time and I couldn't remember my password; besides, the wireless network was spotty and wasn't set up to handle the number of attendees. I often found myself knowing which hotel I was supposed to be at, but unsure of which room. A suggestion for the organizers would be to a) equip the people helping you find the way with an agenda, so they could tell you which room you were going to and not only where the room was, and b) put up boards like the ones you see in airports listing departing flights and their gate numbers. They could show something similar: upcoming presentations, their hotel and room. That would help a lot.

Agenda
Listening to the Java Spotlight Podcast I noticed that the organizers said that there were a lot of non-Oracle speakers, but that is not how I saw it. Doing a find for 'Oracle' in the speaker catalog clearly shows the impact Oracle has on the agenda. The agenda was full of speakers from Oracle or Oracle partners. For an Oracle OpenWorld attendee this is perhaps not surprising, but I think the Java community would rather have unbiased and interesting talks than a polished agenda with a unison message.

One thing I found quite curious is the way Oracle (and other JEE implementers) keep trying to pick a fight with SpringSource. There was a talk titled Java EE and Spring/MVC Shoot-out, presented by Chris Vignola from IBM. So, there's actually a guy employed by one of the largest commercial JEE implementers comparing Spring and JEE? I didn't attend the session, but no prizes for guessing who won that shoot-out. I also heard there were talks where Spring 1.2 was compared to JEE 6. I didn't see this myself, but I wouldn't be surprised.

All in all, my opinion is that the agenda was too heavily influenced by the overall message Oracle wanted to convey, at the expense of really good and cool talks. There should be more speakers from the No Fluff Just Stuff circuit out there. If JavaOne is to be something more than Oracle OpenWorld with Java flavor, the agenda must allow more diverse talks that aren't necessarily part of Oracle's overall (marketing) message.

I don't know, it might be that I'm just spoiled by attending JavaZone every year. (I should probably point out that I'm part of the JavaZone program committee, which some might see as a bias ;-) )

There were two talks I really liked. One was Jevgeni Kabanov's “Do you really get Class Loaders?”. This talk was an eye-opener, explaining why we are seeing NoClassDefFoundError, ClassCastException, etc. when running apps in servlet containers (or full-blown JEE containers).

The other one was Ken Sipe’s “Rocking the Gradle”.

The party
Before I came to JavaOne I heard that the one big improvement Oracle had brought to JavaOne was the party. The party was on Treasure Island outside San Francisco, and ~50,000 people were shipped there in buses (Oracle OpenWorld and JavaOne have a joint party; OOW outnumbers J1 by about 10 to 1).

It's no easy job shipping 50k people by bus, and I'm sure the organizers did a good job. Still, waiting in line for a bus and then crawling through traffic (which I guess the number of buses contributed to) was a drag; I almost fell asleep on the bus. When we got there, there was free food and drinks. The food was actually quite good, a selection of BBQ. The beer was bland, at best. There were concerts with Sting and Tom Petty and the Heartbreakers, but that isn't really my thing. We strolled around for about an hour and then headed home.

I'd say the party was a disappointment. I'd rather have something smaller at the Moscone with different companies doing happy hours in surrounding bars (which was what it was like in 2006) than this big and time-consuming thing.

Thanks for having me
I do want to thank Oracle for accepting my BoF. I really enjoyed speaking at JavaOne. I'd also like to thank the people who attended my BoF. It was the last BoF on Tuesday, so it means a lot to me that you chose to turn up at my presentation instead of staying at the bar. I also appreciate all the questions during and after the talk.


Presenting..
Got my Speaker badge