Hibernate Best Practices – Parameterising Criteria Queries

Recently one of our database team pointed out some queries that were not running efficiently. We looked into the queries and they were Hibernate criteria queries. Consider a simple helper method that is creating a predicate for use in a criteria query:
private Predicate restrictToAcceptedTxWith(Long policyTermId, DateTime at, Root<PolicyTransaction> ptxRoot, CriteriaBuilder cb) {
    return cb.and(
            cb.isNotNull(ptxRoot.get("acceptedDate")),
            cb.equal(ptxRoot.get("policyTerm").get("id"), policyTermId),
            cb.greaterThanOrEqualTo(ptxRoot.get("transactionPeriodEndDate"), at),
            cb.lessThanOrEqualTo(ptxRoot.get("transactionPeriodStartDate"), at)
    );
}
There doesn’t seem anything wrong with this, it looks like a regular bit of Hibernate code, so what is the problem? Well, it takes a parameter – in this case policyTermId, but when it passes that into the criteria builder equal() method, it doesn’t explicitly create it as a parameterised expression. Why is this a problem? To understand, you need to understand how a database evaluates and runs a SQL statement. To evaluate a query, a database must do the following:
  1. Construct one or more potential query plans
  2. Choose a plan
  3. Execute the query using the chosen plan

The reason SQL statements can result in multiple query plans is because SQL is a declarative language, rather than an imperative language like Java. You are telling the database conceptually what you want, without specifying exactly how it needs to be done. If you want to join tables A, B, C and D, you could do that by joining A to B, then the result of that to C, then the result of that to D. Or you could join A to D, then the result of that to B, then the result of that to C, and so on. In fact you could even join A to B, then separately join C to D, then join those two results together. (This is usually called a bushy query plan.) Choosing a query plan takes time, so once one is chosen, it is cached, so that if the same SQL query is run again, just with different parameters, the query plan choice does not have to be repeated.

This is the problem here. All databases understand the concept of parameterised SQL. Consider a simple query:

select * from policy where policy_number = ?
The database constructs the query plan and caches it based on the parameterised SQL. However, if you send in SQL where a parameter has actually been inlined into the SQL, like this:
select * from policy where policy_number = 123
The database has no understanding that “123” is meant to be a parameter, so the query plan (if cached at all) would be cached on the full query string. Then when you send in:
select * from policy where policy_number = 456
the database will not pick up the cached plan. It has to reevaluate the query plan options.

In the case of the Hibernate criteria, it can be parameterised as follows:

private Predicate restrictToAcceptedTxWith(DateTime at, Root<PolicyTransaction> ptxRoot, CriteriaBuilder cb) {
        final ParameterExpression<Long> policyTermIdParam = cb.parameter(Long.class, POLICY_TERM_ID_PARAM_NAME);
        return cb.and(
                cb.isNotNull(ptxRoot.get("acceptedDate")),
                cb.equal(ptxRoot.get("policyTerm").get("id"), policyTermIdParam),
                cb.greaterThanOrEqualTo(ptxRoot.get("transactionPeriodEndDate"), at),
                cb.lessThanOrEqualTo(ptxRoot.get("transactionPeriodStartDate"), at)
        );
    }

// calling code then calls
 getCurrentSession().createQuery(cq).setParameter(POLICY_TERM_ID_PARAM_NAME, policyTermId).list();
Now the generated SQL will be parameterised, and the database will be able to cache the query plan.
Posted in Hibernate | Tagged , | Leave a comment

Hibernate internals – how does pagination work?

Hibernate supports pagination for all query types – HQL, Criteria and SQL, via the query setMaxResults() and setFirstResult() methods, but how does this work under the covers? Well, assuming your database supports it, it will add a row_number() to your query, then wrap this inside another query to pick out the desired rows. You can see the SQL for yourself if you turn on Hibernate SQL logging (by setting the org.hibernate.SQL log level to DEBUG):
2021-05-17 16:22:00,452 http-nio-8081-exec-5 DEBUG [org.hibernate.SQL] 
WITH query AS (SELECT inner_query.*, ROW_NUMBER() OVER (ORDER BY CURRENT_TIMESTAMP) 
as __hibernate_row_nr__ 
FROM ( select
  clm.claim_id claimId, 
  clm.claim_number claimNumber,
  clmst.code claimStatusCode, 
  clmst.description claimStatusDescription,
  ppb.companyname policyholderCompanyName
You can see what is going on if you look inside the Hibernate Loader class:
protected SqlStatementWrapper executeQueryStatement(
      String sqlStatement,
      QueryParameters queryParameters,
      boolean scroll,
      List<AfterLoadAction> afterLoadActions,
      SharedSessionContractImplementor session) throws SQLException {
   // Processing query filters.
   queryParameters.processFilters( sqlStatement, session );
   // Applying LIMIT clause.
   final LimitHandler limitHandler = getLimitHandler(
         queryParameters.getRowSelection()
   );
   String sql = limitHandler.processSql( queryParameters.getFilteredSQL(), queryParameters.getRowSelection() );
In my case I’m running on SQL Server and the limit handler is the class SQLServer2005LimitHandler. If you look inside this class you can see the code that parses the SQL statement, adds a row_number() and wraps it in an outer query:
/**
 * Add a LIMIT clause to the given SQL SELECT (HHH-2655: ROW_NUMBER for Paging)
 *
 * The LIMIT SQL will look like:
 *
 * <pre>
 * WITH query AS (
 *   SELECT inner_query.*
 *        , ROW_NUMBER() OVER (ORDER BY CURRENT_TIMESTAMP) as __hibernate_row_nr__
 *     FROM ( original_query_with_top_if_order_by_present_and_all_aliased_columns ) inner_query
 * )
 * SELECT alias_list FROM query WHERE __hibernate_row_nr__ >= offset AND __hibernate_row_nr__ < offset + last
 * </pre>
 *
 * When offset equals {@literal 0}, only <code>TOP(?)</code> expression is added to the original query.
 *
 * @return A new SQL statement with the LIMIT clause applied.
 */
@Override
public String processSql(String sql, RowSelection selection) {
   final StringBuilder sb = new StringBuilder( sql );
   if ( sb.charAt( sb.length() - 1 ) == ';' ) {
      sb.setLength( sb.length() - 1 );
   }
   if ( LimitHelper.hasFirstRow( selection ) ) {
      final String selectClause = fillAliasInSelectClause( sb );
      final int orderByIndex = shallowIndexOfPattern( sb, ORDER_BY_PATTERN, 0 );
      if ( orderByIndex > 0 ) {
         // ORDER BY requires using TOP.
         addTopExpression( sb );
      }
      encloseWithOuterQuery( sb );
      // Wrap the query within a with statement:
      sb.insert( 0, "WITH query AS (" ).append( ") SELECT " ).append( selectClause ).append( " FROM query " );
      sb.append( "WHERE __hibernate_row_nr__ >= ? AND __hibernate_row_nr__ < ?" );
   }
   else {
      addTopExpression( sb );
   }
   return sb.toString();
}
Posted in Hibernate, Java | Tagged , | Leave a comment

Memory leak with TestNG and Spring DirtiesContext annotation

We recently added some new integration tests to our test suite. All three tests were Spring integration tests. Normally in a Spring integration test, the entire code for a test method operates in a single transaction, and at the end of that test method, all database changes are rolled back, so your Spring context is safe to be cached and reused for the next test. However sometimes you are invoking code that starts and commits transactions, meaning data will be still in the database after the test has finished. To help deal with this scenario, Spring provides the @DirtiesContext annotation, which tells Spring to remove the context from its cache so a new context will be created for the next test. All three new tests used the @DirtiesContext annotation. However after they were checked in, the build failed with an out of memory error. We configured Java to generate a heapdump on out of memory and I looked at the heapdump with the Eclipse Memory Analyser Tool (MAT). This is what the results showed:

Yikes! There were eight separate instances of the Hibernate SessionFactoryImpl! It transpires that TestNG has a known bug in that it keeps references to your tests even after each test is completed. This means that even though Spring was removing the context from its cache and a new context was being created, most of the old context was not eligible for garbage collection.

TestNG memory leak bug 1461

My colleague Elko actually wrote an annotation for us that we use in our test suite, which clears the entire database after a test method. Not only does this avoid the TestNG memory leak, but is it massively faster, as it avoids reconstructing the test context, which takes around 30-45 seconds to construct. It is a TestNG listener which does the following:

  • Turns off referential integrity
  • Queries the information_schema table to get all table names
  • Iterates over all tables and truncates each one
  • Turns referential integrity back on
Posted in Java, Testing, TestNG | Tagged , , | Leave a comment

Jetson Nano and TensorFlow

I got a Jetson Nano for Christmas from my brother-in-law. This is a Single Board Computer (SBC), similar to a Raspberry Pi, but more oriented to machine learning as it has a powerful nVidia GPU on it. Neural networks like TensorFlow run most efficiently on GPUs.

The components I’m using are:

  • Jetson Nano 2Gb
  • Raspberry Pi USB-C power supply
  • Logitech wireless USB keyboard
  • 64Gb SD card
  • PiHut USB switch (so you can switch the Nano power on and off without having to use the plug socket

Setup instructions from nVidia are pretty good:

https://developer.nvidia.com/embedded/learn/get-started-jetson-nano-devkit

Note that the Nano comes in two versions – 4Gb and 2Gb – the first download link for the operating system is the 4Gb version, so if you are using the 2Gb version like me, make sure you click the second download link. It is a 6Gb zip which took around an hour to download. After that it needs to be unzipped, which produces a 14Gb disc image file, and written to your SD card. I’m using an Apple Mac, so as per the nVidia instructions, I used Etcher to copy the file to the SD card, which took around 10 minutes to write, and another 3 minutes to validate. The operating system is an Ubuntu variant, called Linux4Tigra, which runs the Lightweight Desktop Environment (http://www.lxde.org/):

Note that the Jetson Nano does not include a wireless ethernet card. In my case, I have a wireless range extender plugged in, and I’m using a wired connection to that. Alternatively, the latest versions of the Nano support wireless via USB. For example:

https://learn.sparkfun.com/tutorials/adding-wifi-to-the-nvidia-jetson/all

After getting the Nano up and running, I installed Python 3 and TensorFlow by following these instructions:

https://docs.nvidia.com/deeplearning/frameworks/install-tf-jetson-platform/index.html

I added an alias to .bash_aliases so that “python” is aliased to “python3”. (If you add an alias like this, it is recommended you add it for your user, not as a system wide alias, as this could break operating system functionality that requires python 2.)

Then I ran through the Tensorflow Quickstart tutorial, which is an image classification neural network, using the MNIST dataset 1, which is the digits from 0-9:

https://www.tensorflow.org/tutorials/quickstart/beginner

Although the code is all included in the above link, I want to inline it here to show just how short it is:

#!/usr/bin/python3

import tensorflow as tf

# load standard sample data from the mnist dataset 1
# dataset 1 is images of the digits 0-9
mnist = tf.keras.datasets.mnist

# load both training and test data
(x_train, y_train), (x_test, y_test) = mnist.load_data()
# the images have colour values from 0-255. these need to be scaled to be from 0-1
x_train, x_test = x_train / 255.0, x_test / 255.0

# now define our model
model = tf.keras.models.Sequential([
  tf.keras.layers.Flatten(input_shape=(28, 28)),
  tf.keras.layers.Dense(128, activation='relu'),
  tf.keras.layers.Dropout(0.2),
  tf.keras.layers.Dense(10)
])

predictions = model(x_train[:1]).numpy()

tf.nn.softmax(predictions).numpy()

loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)

loss_fn(y_train[:1], predictions).numpy()

model.compile(optimizer='adam',
              loss=loss_fn,
              metrics=['accuracy'])

model.fit(x_train, y_train, epochs=5)

model.evaluate(x_test,  y_test, verbose=2)

model(x_test[:5])

I think it is amazing that in around 20 lines of python, we can load a dataset, define a neural network, train it and test it! Of course, the python is orchestrating the process, not doing the heavy lifting. The hard work is done by C code running on the GPU. Running this on the Nano took a couple of minutes. The output shows the work being sent to the GPU, as we expect. Here is an edited excerpt from the output:

2020-12-30 11:46:56.368109: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1884] Adding visible gpu devices: 0
2020-12-30 11:48:03.136928: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1428] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 22 MB memory) -> physical GPU (device: 0, name: NVIDIA Tegra X1, pci bus id: 0000:00:00.0, compute capability: 5.3)
Epoch 1/5
1875/1875 [==============================] - 17s 9ms/step - loss: 0.2918 - accuracy: 0.9158
Epoch 2/5
1875/1875 [==============================] - 17s 9ms/step - loss: 0.1385 - accuracy: 0.9584
Epoch 3/5
1875/1875 [==============================] - 16s 9ms/step - loss: 0.1049 - accuracy: 0.9678
Epoch 4/5
1875/1875 [==============================] - 17s 9ms/step - loss: 0.0864 - accuracy: 0.9733
Epoch 5/5
1875/1875 [==============================] - 16s 9ms/step - loss: 0.0738 - accuracy: 0.9774
2020-12-30 11:51:36.564930: W tensorflow/core/framework/cpu_allocator_impl.cc:81] Allocation of 31360000 exceeds 10% of free system memory.
313/313 - 2s - loss: 0.0733 - accuracy: 0.9774
Posted in Machine Learning, Python | Tagged , | Leave a comment

Using live templates in IntelliJ

If you need to write repeated text in IntelliJ then you can use its live templating function to help you. Suppose I’m writing a Liquibase script that will be composed of many similar changesets:

<changeSet id="MapSalvageCategoryAToMIAFTR" author="proctorh">
    <preConditions onFail="MARK_RAN">
        <and>
            <sqlCheck expectedResult="0">SELECT COUNT(*) FROM EXTERNAL_SYSTEM_MAPPED_PROP WHERE REF_DATA_TABLE =
                'rd_salvage_category' AND REF_DATA_CODE='CATEGORYA' AND EXTERNAL_SYSTEM_ID = (SELECT ID FROM
                RD_EXTERNAL_SYSTEM WHERE CODE='MIAFTR')
            </sqlCheck>
        </and>
    </preConditions>
    <sql>INSERT INTO EXTERNAL_SYSTEM_MAPPED_PROP       (EXTERNAL_SYSTEM_ID,REF_DATA_TABLE,REF_DATA_CODE,EXT_SYS_PROPERTY_VALUE) (SELECT
        ID,'RD_SALVAGE_CATEGORY','CATEGORYA','A' FROM RD_EXTERNAL_SYSTEM WHERE CODE='MIAFTR')
    </sql>
</changeSet>

I want to repeat these inserts, but with different values for the ref data code, and what it is being mapped to.

  • Select the code and go to Tools -> Save as live template.
  • Choose the abbreviation for the template.
  • Edit the template to insert variables where required.
<changeSet id="MapSalvageCategoryAToMIAFTR" author="proctorh">
    <preConditions onFail="MARK_RAN">
        <and>
            <sqlCheck expectedResult="0">SELECT COUNT(*) FROM EXTERNAL_SYSTEM_MAPPED_PROP WHERE REF_DATA_TABLE =
                'rd_salvage_category' AND REF_DATA_CODE='$refdata$' AND EXTERNAL_SYSTEM_ID = (SELECT ID FROM
                RD_EXTERNAL_SYSTEM WHERE CODE='MIAFTR')
            </sqlCheck>
        </and>
    </preConditions>
    <sql>INSERT INTO EXTERNAL_SYSTEM_MAPPED_PROP
        (EXTERNAL_SYSTEM_ID,REF_DATA_TABLE,REF_DATA_CODE,EXT_SYS_PROPERTY_VALUE) (SELECT
        ID,'RD_SALVAGE_CATEGORY','$refdata$','$value$' FROM RD_EXTERNAL_SYSTEM WHERE CODE='MIAFTR')
    </sql>
</changeSet>

Now you can repeat the code block by doing:

  • Ctrl / Command + J to bring up the Insert template menu. The carat will be positioned on the first variable. Type the variable value, then press return to go to the next variable.

Posted in IntelliJ | Tagged | Leave a comment

IntelliJ Hints and Tips

Most useful keyboard shortcuts (these are for Mac):

  • CMD + O Open class
  • CMD + SHIFT + OOpen file
  • CMD + F Find in file
  • CMD + R Replace in file
  • CMD + SHIFT + F Find in path
  • CMD + SHIFT + R Replace in path
  • ALT + ENTER Quick fix (for problem under cursor)
  • CTRL + ALT + O Organise imports (remove unused)
  • CMD + N Generate (getters, setters, toString etc)
  • CTRL + I Implement interfaces
  • ALT + CMD + L Format file

Full list:
https://resources.jetbrains.com/storage/products/intellij-idea/docs/IntelliJIDEA_ReferenceCard.pdf

Also:

https://www.jetbrains.com/help/idea/mastering-keyboard-shortcuts.html

For efficient editing, also useful to be aware of:

Right click -> Refactoring options, such as rename, extract constant, extract method
New -> Scratch file – allows you to create a temporary file of any kind – text, xml, sql etc.
Regular expression search and replace
Block / multi column editing. Hold down ALT and drag to select an area with the mouse. You’ll get a cursor on each line and anything you type will be repeated on all lines. Or hold down SHIFT + ALT to click and put multiple cursors anywhere.
String manipulation plugin: https://plugins.jetbrains.com/plugin/2162-string-manipulation
Zero width character plugin: https://plugins.jetbrains.com/plugin/7448-zero-width-characters-locator

Code Swapping

When deploying a WAR file, deploy the exploded version to make it easier to recompile and repackage changes into it.
Configure “Build artifact” in the run config.
While debugging and changing a single class, right-click and choose “Compile XXX.java”. This causes the class to be reloaded.
While debugging and changing multiple classes use the Build Project button (ctrl-F9 / Cmd-F9). This causes all affected classes to be reloaded.
For non Java classes such as xml files, right click and select “Package file”. This will move the file over to the exploded target location.
Standard hot code swap supports changing code within a single method. However it will not support many other changes. e.g. Changing method signatures. Updating a spring context. e.g. adding, changing or removing beans

Debugger tips

Evaluate expression – allows you to evaluate an expression in the current context. i.e. access variables, collections etc. You can write the expression on one line or switch to the multi line option.
Catch on Exception – if you don’t know where in the code you need to stop, but you know that an exception is being thrown, this will pause execution at that point. You can specify which sort of exception you are interested in.
Conditional breakpoint. e.g. if you are in a loop which is executing 5000 times, this will allow you to stop when a specific expression is true. Make sure your expression is null safe though!
“Drop Frame”. You can right-click on any stack frame in the debugger and drop back to that point in the execution, no need to rerun.
“Disabled until the selected breakpoint is hit”, which means that you can have one breakpoint depend on another.

Posted in IntelliJ, Java | Tagged , | Leave a comment

Data migration in SQL Server

Recently I’ve had to write a data migration for SQL Server to split a large table (28 million rows) into separate tables. Some notes here on my thoughts…

Firstly, SQL Server has INSERT…SELECT syntax which allows you to copy from one table to another. It seems like any solution will be based around using this.

Secondly, my assumption is that for a large migration, we’ll need to run in batches, with a transaction for each batch, as it will take too long to run in a single transaction.

One first idea was to write something like this, and run it inside a loop, breaking out when no more rows were being copied:

BEGIN TRANSACTION X
SET IDENTITY_INSERT TARGET_TABLE ON
INSERT INTO TARGET_TABLE
  (ID, 
   VERSION, 
   TSTAMP, 
  ...other fields here
)
select top 100000 
  source.ID, 
  source.VERSION,
  source.TSTAMP, 
  ...other fields here
from source_db.dbo.source_table source
left outer join TARGET_TABLE x
    on source.id = x.id
where source.item_type = 'REQUIRED_TYPE'
and x.id is null
 
SELECT @rows_processed = @@rowcount
SET IDENTITY_INSERT TARGET_TABLE OFF
COMMIT TRANSACTION X
However, testing this with millions of rows suggests it is taking too long to perform the left join, as the time to do the join increases with every batch, as we add rows to the target table. As the target table is a new table, and hence has no rows to begin with, and we have an integer primary key, I ended up changing the where condition on the INSERT..SELECT to the following:
 WHERE source.ITEM_TYPE = 'REQUIRED_TYPE'
     AND source.ID > (SELECT COALESCE(MAX(ID),0) FROM TARGET_TABLE)
     ORDER BY source.ID
This means there is no join, just identifying the max id. Because we are ordering by the id and that is the primary key, there is no sorting required. In my testing, this took around 1 min 20 seconds to copy 3 million rows, compared to around 20 minutes for the join based approach.

I also had to migrate audit table data. This is interesting for two reasons. Firstly the audit tables don’t have a single primary key, but rather a composite. Secondly, the target tables already have some data, so this is more of a merge than just a copy. For this reason I ended up still using a left join for the audit data migration. I experimented with the batch size. A size of 10,000 took 28 minutes to copy 3 million rows, whereas a batch size of 100,000 took 18 minutes. This makes sense if the join is taking a long time, as we’ll be reducing the number if times we do the join by a factor of 10. I suspect the performance of the audit migration could be improved by changing the implementation so that the join is only performed once. e.g. perform initial join to check what rows need to be copied over, and store that either in a physical table or table variable, ordered by the same composite primary key. Then use a cursor or offset fetch to get the next batch of ids that need to be copied and use that to join with the source table to get the rows to copy. However my current implementation is fast enough, so this is one to investigate in a future migration perhaps.

Finally it is interesting to note that in my first implementation, I used some count(*) queries so that I could report progress in the logs. However with millions of rows, the time taken to perform these queries becomes non-trivial if you are doing a count with every batch. In my testing it could take 5-6 seconds to count with around 3 million rows, so doing that for 100 batches would mean 10 minutes just performing a count.

Links

Insert select: https://www.w3schools.com/sql/sql_insert_into_select.asp
Posted in Databases and SQL, SQL Server | Tagged , | Leave a comment

log4j in maven tests

If you want to specify log4j configuration when running tests via Maven, you can do so by updating the Maven surefire plugin configuration to point to a specific log4j configuration file. For log4j v2 the appropriate parameter is called log4j.configuration. e.g.
<plugin>
    <groupId>org.apache.maven.plugins</groupId>
    <artifactId>maven-surefire-plugin</artifactId>
    <configuration>
        <systemProperties>
            <property>    
                <name>log4j.configurationFile</name>
                <value>src/test/resources/log4j2-module-name.xml</value>
            </property>
        </systemProperties>
    </configuration>
</plugin>
You can see from the snippet that, since the tests run from the top level of the current module, the path to the file needs to include the path to the test resources folder. Alternatively, the log4j.configurationFile name does support a fully qualified URL in the form file://. The other thing you will notice is that I haven’t used the default name of “log4j2.xml” as the file name. This is because in a large maven project it can be confusing to have many files with the same name, so I prefer to put the module name into the file name to make things clearer.
Posted in Java, Maven | Tagged , | Leave a comment

Generating code with JavaPoet

Why write code when you can generate it? There are lots of situations when it makes more sense to generate. In this article I’m going to work through an example of how to use JavaPoet and Apache BeanUtils to write a class that will generate domain to DTO conversion code.

In our app, due to the gradual removal of our old way of doing things, we have a lot of code that does the following:

domain object -> legacy DTO object -> new DTO object

The legacy DTO objects are no longer needed, so now we would like to delete them. Really we want the code to convert from the domain object directly to the new DTO object. When doing this sort of conversion, you always face a choice – you just code a generic converter class, which understands all of the data conversions that you need to perform, and uses runtime reflection, simply iterating over all of the properties, and converting each one. However, one major problem with this is that it is very fragile – you cannot search for usages of getters or setters in your IDE, and if someone changes or removes a property, you will end up with a runtime failure, not a build or test failure. For this reason, we want to use plain old java code to do the conversion. However, we don’t want to write it by hand, so it makes sense to use JavaPoet to generate it. JavaPoet is a very easy way to do code generation. Let me show how I used it in this scenario.

Firstly, download JavaPoet or add to your Maven dependencies: https://github.com/square/javapoet In my case, both the domain and DTO classes are java beans (i.e. they have properties, and each property has a getter and setter) so rather than just using reflection, I can use the Apache BeanUtils classes to make it easier to read these properties, so my Maven setup includes both JavaPoet and Apache BeanUtils:

<dependency>
  <groupId>commons-beanutils</groupId>
  <artifactId>commons-beanutils</artifactId>
  <version>1.8.3</version>
</dependency>
<dependency>
    <groupId>com.squareup</groupId>
    <artifactId>javapoet</artifactId>
    <version>1.9.0</version>
</dependency>
Now let’s start solving the problem at hand. Firstly, we need to map between the properties in the DTO and the domain class, and also keep a record of any properties that exist in the DTO, but cannot be found in the domain class, so we can put warnings in the generated code to say that the properties need to be manually checked. To begin with, I’ll create a mini helper class to return a map of the properties, and any missing ones:
class PropertyInfo {
    Map<PropertyDescriptor, PropertyDescriptor> propertyDescriptorMap;
    List<String> missingProperties;

    public PropertyInfo(Map<PropertyDescriptor, PropertyDescriptor> propertyDescriptorMap, List<String> missingProperties) {
        this.propertyDescriptorMap = propertyDescriptorMap;
        this.missingProperties = missingProperties;
    }
}
Now we can write a method using Apache BeanUtils that iterates over the properties and matches them on their names:
PropertyInfo getPropertyMapping(Class source, Class target) {
    // iterate over each property / field to generate a list of properties we can deal with, and ones we cannot
    Map<PropertyDescriptor, PropertyDescriptor> propertyDescriptorMap = new HashMap<>();
    // store properties needing to be populated in target, not found in source
    List<String> missingProperties = new ArrayList<>();

    Map<String, PropertyDescriptor> sourcePropertiesByName
            = Arrays.stream(PropertyUtils.getPropertyDescriptors(source))
            .collect(toMap(PropertyDescriptor::getName, Function.<PropertyDescriptor>identity()));
    System.out.println("Source class has: " + sourcePropertiesByName.size() + " properties");

    PropertyDescriptor[] targetProperties = PropertyUtils.getPropertyDescriptors(target);
    System.out.println("Target class has: " + targetProperties.length + " properties");

    // only do declared properties for now i.e. don't go up to superclasses.
    // navigating up to superclasses would create problems as it would go all the way up to java.lang.Object
    Set<String> declaredTargetFields = new HashSet<>();
    for (Field declaredField : target.getDeclaredFields()) {
        declaredTargetFields.add(declaredField.getName());
    }
    System.out.println("Target has: " + declaredTargetFields.size() + " fields declared in class itself");

    for (PropertyDescriptor targetPropertyDescriptor : targetProperties) {
        String targetPropertyName = targetPropertyDescriptor.getName();
        System.out.println("Processing property: " + targetPropertyName);

        if (declaredTargetFields.contains(targetPropertyName)) {
            PropertyDescriptor sourcePropertyDescriptor = sourcePropertiesByName.get(targetPropertyName);
            if (sourcePropertyDescriptor != null) {
                System.out.println("Found mapping for " + targetPropertyName);
                propertyDescriptorMap.put(sourcePropertyDescriptor, targetPropertyDescriptor);
            } else {
                System.out.println("WARNING - cannot find property " + targetPropertyName + " in source");
                missingProperties.add(targetPropertyName);
            }
        } else {
            System.out.println("Skipping property: " + targetPropertyName + " as declared in superclass");
        }
    }
    return new PropertyInfo(propertyDescriptorMap, missingProperties);
}
Great, now we have enough info to generate our converter. Our conversion method will accept a domain object, and return a DTO, so the method signature will look like this:
public DTOClassName toDTO(DomainClassName domainClassParameter)
How do we do this in JavaPoet? Well, firstly, let’s work out the parameter name. For some domain class names, we just need to take the class name and convert the first letter to lowercase. For some of the domain classes I am using, the class name ends in “Impl”, which I’d like to remove. So my logic to work out the parameter name is this:
String domainClassName = domainClass.getSimpleName();
String domainClassParameterName = domainClassName.substring(0, 1).toLowerCase() + domainClassName.substring(1);
if (domainClassParameterName.endsWith("Impl")) {
     domainClassParameterName = domainClassParameterName.substring(0, domainClassParameterName.length() - 4);
}
Now we can use JavaPoet to generate the method signature, using the MethodSpec.Builder class:
MethodSpec.Builder toDTOMethodBuilder = MethodSpec.methodBuilder("toDTO")
    .addModifiers(Modifier.PUBLIC)
    .addParameter(domainClass, domainClassParameterName)
    .returns(dtoClass);
Next, we need to create a new instance of our DTO object, like this:
DTOClass dto = new DTOClass();
In JavaPoet, you use $T to indicate a type, then supply that type, like this:
toDTOMethodBuilder.addStatement("$T dto = new $T()", dtoClass, dtoClass);
Note that we have to supply the class twice here, as we have used the $T type marker twice in our statement. Why bother using this $T marker? What is wrong with just manually inserting the class name? Well, by using $T, JavaPoet understands that we are giving it a reference to a class, and it can then take care of the import for you! No need to manually keep track of what classes you need to import in your generated code, and whether you have already added an import, JavaPoet will do all that for you!

Now we can simply iterate over the sets of matched properties, and write the conversion code. The easiest case is of course where the property type is the same in both source and target. If every property type was the same, the code would be:

for (PropertyDescriptor domainClassProperty : domainToDTOPropertyMap.keySet()) {

   String domainClassPropertyName = domainClassProperty.getName();
   System.out.println("Processing property: " + domainClassPropertyName);

   PropertyDescriptor dtoPropertyDescriptor = domainToDTOPropertyMap.get(domainClassProperty);
   Method domainClassReadMethod = domainClassProperty.getReadMethod();
   String dtoWriteMethodName = dtoPropertyDescriptor.getWriteMethod().getName();
   final String getProperty = domainClassParameterName + "." + domainClassReadMethod.getName() + "()";

   toDTOMethodBuilder.addStatement("dto." + dtoWriteMethodName + "(" + getProperty + ")");
}
In the more general case, you need to map between different types for the properties, so you will end up with a series of if statements checking the types:
   if (Some.class.equals(domainClassProperty.getPropertyType()) && 
           Other.class.equals(dtoPropertyDescriptor.getPropertyType()) {
        // write code to convert from Some.class to Other.class
   } 
   // if you have properties that might a subclass, or implementation of an interface, use "isAssignableFrom"
   else if (Some2.class.isAssignableFrom(domainClassProperty.getPropertyType()) &&
           Other2.class.isAssignableFrom(dtoPropertyDescriptor.getPropertyType()) {
       // write code to convert from Some2 class (or subclass) to Other2.class (or subclass)
   } 
   else {
       toDTOMethodBuilder.addStatement("dto." + dtoWriteMethodName + "(" + getProperty + ")");
   }
}
Having done all the properties, we should report on any properties that couldn’t be mapped, so a developer can check these manually:
for (String property : missingProperties) {
    // in early versions of JavaPoet, use addStatement. In later versions, use addComment
    toDTOMethodBuilder.addStatement("// TODO deal with property: " + property);
}
Now we simply add the return statement, and call build() on the builder to generate the method spec:
toDTOMethodBuilder.addStatement("return dto");
return toDTOMethodBuilder.build();
To put the conversion method into a Java class and write it out, we do the following:
TypeSpec converterClass = TypeSpec.classBuilder(converterClassName)
    .addModifiers(Modifier.PUBLIC)
    .addMethod(toDTOMethod)
    .build();

JavaFile javaFile = JavaFile.builder(converterPackage, converterClass).indent("    ").build();
javaFile.writeTo(new File("/path/to/chosen/directory));
All done! Now a developer can simply run this code to generate the conversion code, whenever they need to convert from a domain object to DTO. If desired, you can write similar code to generate an accompanying unit test.

For more examples of JavaPoet syntax, check out the readme on the github project: https://github.com/square/javapoet

Posted in Java, Uncategorized | Tagged | Leave a comment

Writing a custom spliterator in Java 8

In this article I’m going to give two examples of writing a custom spliterator in Java 8. What is a spliterator and why would you need to write your own? Well, a spliterator is used by the Java streams code when you call stream() on a collection or other object. The two most important methods in the Spliterator interface are as follows:
  • boolean tryAdvance(Consumer action);
  • Spliterator trySplit();
Any custom spliterator must implement tryAdvance. It is this method which is invoked to get each element of a stream to process. The trySplit method only needs to be implemented if you are going to create a parallel stream. It is invoked to split the stream into sections which can be safely processed in parallel.

Now let’s run through a few examples. I’m going to include all of the code snippets inline, but if you want a working example, all of the source code is available on my github: https://github.com/hedleyproctor/java8-examples

To begin with, let’s consider a case where you don’t need a parallel stream, but you do have a custom class for which you want to write a spliterator. Suppose I work on an application that processes html data, and I decide that to get test data for my application, I could just scrap random pages off the web. I could use a library like jsoup to get pages, and for each page, put any links on a list, so that if I need to get another page, I can just retrieve the next link. A simple implementation could look like this:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

import java.io.IOException;
import java.util.LinkedList;
import java.util.Queue;

public class WebPageProvider {

    private Queue<String> urls = new LinkedList<String>();

    public WebPageProvider() {
        urls.add("https://en.wikipedia.org/wiki/Main_Page");
    }

    public Document getPage() {
        org.jsoup.nodes.Document doc = null;

        while (doc == null) {
            String nextPageURL = urls.remove();
            System.out.println("Next page: " + nextPageURL);
            try {
                doc = Jsoup.connect(nextPageURL).get();
            } catch (IOException e) {
                // we'll try the next one on our list
            }
        }

        // get links and put on our queue
        Elements links = doc.select("a[href]");
        for (Element link : links) {
            String newURL = link.attr("abs:href");
            // System.out.println(newURL);
            urls.add(newURL);
        }
        return new Document(doc);
    }
}

Now, what I’d really like to be able to do is to use all of the useful methods in streams to be able to provide different sorts of test data. For example, suppose I just wanted images, I could map each web page to get the list of images on the page, then call flatMap to flatten the stream of List objects back to a stream of Image objects, like this:

 StreamSupport.stream( new WebPageSpliterator(new WebPageProvider()), false)
                                        .map(Document::getImages)
                                        .flatMap(List::stream)
                                        .limit(10);

Or perhaps filter to only include documents with five or more images:

StreamSupport.stream(new WebPageSpliterator(new WebPageProvider()), false)
                                                        .filter(doc -> doc.getImages().size() >= 5)
                                                        .limit(10);

Seems useful, so how do we implement the spliterator? Well, it’s pretty trivial:

import java.util.Spliterator;
import java.util.function.Consumer;

public class WebPageSpliterator implements Spliterator<Document> {
    private WebPageProvider webPageProvider;

    public WebPageSpliterator(WebPageProvider webPageProvider) {
        this.webPageProvider = webPageProvider;
    }

    @Override
    public boolean tryAdvance(Consumer<? super Document> action) {
        action.accept(webPageProvider.getPage());
        return true;
    }

    @Override
    public Spliterator<Document> trySplit() {
        return null;
    }

    @Override
    public long estimateSize() {
        return 0;
    }

    @Override
    public int characteristics() {
        return 0;
    }
}

You can see that all we’ve had to do is implement the tryAdvance method. Since the backing provider can provide an infinite number of web pages (assuming pages keep linking to other pages) there is no complex logic needed inside this method. It simply calls the accept method of the Consumer code passed into it (Consumer is a Java 8 functional interface, allowing callers to pass in a lambda) and then returns true, to signify that more pages can be returned if required.

Now let’s consider a more complex example involving parallel processing. When would you need to write a custom spliterator for parallel processing? Well, one situation is when you have a stream of objects, but the stream has an internal ordering or structure, meaning that a naive split of the stream at a random point might not produce sections that can validly be processed in parallel. In my github repo, I’ve given two separate examples of this type of scenario. In one, you have a character stream, which actually represents a custom record format. i.e. you need to split the stream at the record boundaries. In the other, you have a stream of Payment objects, but really these are grouped into payment batches, and you must split the stream at a payment batch boundary. Let’s look at this example. The payment batch test data is created like this:

    private List<Payment> createSampleData() {
        List<Payment> paymentList = new ArrayList<>();
        for (int i=0; i<1000; i++) {
            paymentList.add(new Payment(10,"A"));
            paymentList.add(new Payment(20,"A"));
            paymentList.add(new Payment(30,"A"));
            // total = 60

            paymentList.add(new Payment(20,"B"));
            paymentList.add(new Payment(30,"B"));
            paymentList.add(new Payment(40,"B"));
            paymentList.add(new Payment(50,"B"));
            paymentList.add(new Payment(60,"B"));
            // total = 200

            paymentList.add(new Payment(30,"C"));
            paymentList.add(new Payment(30,"C"));
            paymentList.add(new Payment(20,"C"));
            // total = 80
        }
        return paymentList;
    }

We want to total each batch. You can see that if you did this in parallel, but didn’t split on the batch boundaries, you would get the wrong totals, because you would count more batches than actually exist. e.g. by splitting the second batch into two. We can verify this, and then implement a custom spliterator and check that with the custom spliterator, the totals are correct. First, let’s create a collector to count up the totals:

import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.function.BiConsumer;
import java.util.function.BinaryOperator;
import java.util.function.Function;
import java.util.function.Supplier;
import java.util.stream.Collector;

public class PaymentBatchTotaller 
    implements Collector<Payment,PaymentBatchTotaller.Accumulator,Map<String,Double>> {

    public class Total {
        public double amount;
        public int numberOfBatches;
    }

    public class Accumulator {
        Map<String,Total> totalsByCategory = new HashMap<>();
        String currentPaymentCategory;
    }

    @Override
    public Supplier<Accumulator> supplier() {
        return Accumulator::new;
    }

    @Override
    public BiConsumer<Accumulator,Payment> accumulator() {
        return (accumulator,payment) -> {
            // store this amount
            Total batchTotalForThisCategory = accumulator.totalsByCategory.get(payment.getCategory());
            if (batchTotalForThisCategory == null) {
                batchTotalForThisCategory = new Total();
                accumulator.totalsByCategory.put(payment.getCategory(),batchTotalForThisCategory);
            }
            batchTotalForThisCategory.amount += payment.getAmount();

            // if this was start of a new batch, increment the counter
            if (!payment.getCategory().equals(accumulator.currentPaymentCategory)) {
                batchTotalForThisCategory.numberOfBatches += 1;
                accumulator.currentPaymentCategory = payment.getCategory();
            }
        };
    }

    @Override
    public BinaryOperator<Accumulator> combiner() {
        return (accumulator1,accumulator2) -> {
            for (String category : accumulator1.totalsByCategory.keySet()) {
                Total total2 = accumulator2.totalsByCategory.get(category);
                if (total2 == null) {
                    accumulator2.totalsByCategory.put(category,accumulator1.totalsByCategory.get(category));
                } else {
                    Total total1 = accumulator1.totalsByCategory.get(category);
                    total2.amount += total1.amount;
                    total2.numberOfBatches += total1.numberOfBatches;
                }
            }
            return accumulator2;
        };
    }

    @Override
    public Function<Accumulator, Map<String, Double>> finisher() {
        return (accumulator) -> {
            Map<String,Double> results = new HashMap<>();
            for (Map.Entry<String,Total> entry : accumulator.totalsByCategory.entrySet()) {
                String category = entry.getKey();
                Total total = entry.getValue();
                double averageForBatchInThisCategory = total.amount / total.numberOfBatches;
                results.put(category,averageForBatchInThisCategory);
            }
            return results;
        };
    }

    @Override
    public Set<Characteristics> characteristics() {
        return Collections.EMPTY_SET;
    }
}

You can see that this collector keeps totals for each payment batch category, along with the number of batches in that category, then the finisher method divides each total by the number of batches in that category to get the average batch size. (If you aren’t familiar with custom collectors, you might like to read my previous article Yet another Java 8 custom collector example.)

If we run a test with a naive split of the stream, the totals will be wrong:

List<Payment> payments = createSampleData();

// won't work in parallel!
Map<String,Double> averageTotalsPerBatchAndCategory = payments.parallelStream().collect(new PaymentBatchTotaller());

Set<Map.Entry<String,Double>> entrySet = averageTotalsPerBatchAndCategory.entrySet();
assertEquals(entrySet.size(),3);
for (Map.Entry<String,Double> total : averageTotalsPerBatchAndCategory.entrySet()) {
    if (total.getKey().equals("A")) {
       assertEquals(60d,total.getValue());
    } else if (total.getKey().equals("B")) {
       assertEquals(200d,total.getValue());
    } else {
       assertEquals(80d,total.getValue());
    }
}

To begin with, our spliterator must keep hold of its backing list, and will need to keep track of its current and end positions in the list:

public class PaymentBatchSpliterator implements Spliterator<Payment> {

    private List<Payment> paymentList;
    private int current;
    private int last;  // inclusive

    public PaymentBatchSpliterator(List<Payment> payments) {
        this.paymentList = payments;
        last = paymentList.size() - 1;
    }

The implementation of tryAdvance is fairly simple. Providing we aren’t at the end of the list yet, we need to call accept on the Consumer code passed in, then increment our current counter and return true:

@Override
public boolean tryAdvance(Consumer<? super Payment> action) {
    if (current <= last) {
        action.accept(paymentList.get(current));
        current++;
        return true
    }
    return false;
}

Now we come to the real logic, the implementation of trySplit. We can implement this by saying: generate a possible split position, half way along the list, then check if it is a boundary between payment batches, if not, move forward until it is. The code looks like this:

    @Override
    public Spliterator<Payment> trySplit() {
        if ((last - current) < 100) {
            return null;
        }

        // first stab at finding a split position
        int splitPosition = current + (last - current) / 2;
        // if the categories are the same, we can't split here, as we are in the middle of a batch
        String categoryBeforeSplit = paymentList.get(splitPosition-1).getCategory();
        String categoryAfterSplit = paymentList.get(splitPosition).getCategory();

        // keep moving forward until we reach a split between categories
        while (categoryBeforeSplit.equals(categoryAfterSplit)) {
            splitPosition++;
            categoryBeforeSplit = categoryAfterSplit;
            categoryAfterSplit = paymentList.get(splitPosition).getCategory();
        }

        // safe to create a new spliterator
        PaymentBatchSpliterator secondHalf = new PaymentBatchSpliterator(paymentList,splitPosition,last);
        // reset our own last value
        last = splitPosition - 1;

        return secondHalf;
    }

Finally there is one little detail not to be missed. We must implement the estimateSize() method. Why? Well, this is called internally by the stream code to check if it needs to do any more splitting – if you don’t implement it, your stream will never be split! The implementation is trivial:

    @Override
    public long estimateSize() {
        return last - current;
    }

Finally we can test this by using the spliterator in our test code when we count the totals:

        Map<String,Double> averageTotalsPerBatchAndCategory =
                StreamSupport.stream(new PaymentBatchSpliterator(payments),true).collect(new PaymentBatchTotaller());

This will generate the correct totals. If you want to look at the character stream example, please check the github repo. You might also be interested in some of my other blog posts on Java 8: Streams tutorial Using Optional in Java 8

Posted in Java, Uncategorized | Leave a comment