Java XML processing with JAXB and special characters

When processing XML, it is important to understand that there are two kinds of special characters:
  • Characters that are permitted, but must either be escaped or inside a CDATA block.
  • Control characters that are not permitted at all.

Characters that fall into the first category are:

  • & – can be replaced by &amp;
  • < – can be replaced by &lt;
  • > – can be replaced by &gt;
  • " – can be replaced by &quot;
  • ' – can be replaced by &apos;

If you are creating XML with Java and JAXB, you do not need to do anything to escape these characters; it is done for you automatically when marshalling.
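As a quick illustration, here is a minimal sketch using a hypothetical Message bean (the class and field are made up for the example). Marshalling a value containing & and < produces correctly escaped output without any extra work:

import javax.xml.bind.JAXBContext;
import javax.xml.bind.Marshaller;
import javax.xml.bind.annotation.XmlRootElement;
import java.io.StringWriter;

@XmlRootElement
public class Message {

    public String text = "Fish & chips < pie";

    public static void main(String[] args) throws Exception {
        Marshaller marshaller = JAXBContext.newInstance(Message.class).createMarshaller();
        StringWriter out = new StringWriter();
        marshaller.marshal(new Message(), out);
        // the text element comes out as: Fish &amp; chips &lt; pie
        System.out.println(out);
    }
}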

What about control characters that are not allowed at all? The specification is here: https://www.w3.org/TR/REC-xml/#NT-Char

The range of valid characters is defined as:

Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF] 
This means that all of the first 32 ASCII characters are prohibited, except:
  • x9 = 9 = horizontal tab
  • xA = 10 = line feed
  • xD = 13 = carriage return

Bizarrely, JAXB will allow you to generate XML even when your input contains control characters. Any such character will hit the default escape mechanism and be converted to a numeric character reference, e.g. control character 2 becomes:

&#2;

This means you can generate XML with JAXB that you cannot read back in with JAXB, as these characters will cause a hard failure! How can you deal with this? If you Google this problem, you will find a number of posts about writing your own escape handler. For example: https://stackoverflow.com/questions/4435934/handling-xml-escape-characters-e-g-quotes-using-jaxb-marshaller

I investigated this and found two important points:

Firstly, I found two implementations of JAXB on my classpath, both in the com.sun package. One was the “internal” one, the other was the Reference Implementation (RI). This is important because when trying to override the escape handler, you need to specify the correct package in the property name, and obviously extend the correct class. If you aren’t picking up the RI by default, you can force this by setting a startup property:

javax.xml.bind.context.factory=com.sun.xml.bind.v2.ContextFactory
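This is normally passed as a -D argument to the JVM. As a minimal sketch, it can also be set programmatically, provided it runs before the first JAXBContext is created (note that newer JAXB versions also recognise the javax.xml.bind.JAXBContextFactory property name):

// assumption: this must run before any JAXBContext.newInstance(...) call
System.setProperty("javax.xml.bind.context.factory", "com.sun.xml.bind.v2.ContextFactory");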

Secondly, custom escape handlers won’t be invoked at all if you call the method

marshal( Object jaxbElement, javax.xml.transform.Result result )
While the most common use case for creating XML involves calling marshal, we are also applying an XSLT, so we call the above method. This uses a completely different writer which does not use the standard escape handler! Hence I had to look for another place to intercept the character stream. We were creating a transformer like this:
SAXTransformerFactory stf = new com.sun.org.apache.xalan.internal.xsltc.trax.TransformerFactoryImpl();
TransformerHandler transformerHandler = stf.newTransformerHandler();

We cannot alter the real transformer, but we can wrap the TransformerHandler in another handler that strips the control characters, so what I wrote is below. In this implementation, I have only blocked the low-value ASCII control characters, as those are the ones actually appearing in our input data. It would be easy to extend the method to block the high range characters if you want to.

import org.xml.sax.*;

import javax.xml.transform.Result;
import javax.xml.transform.Transformer;
import javax.xml.transform.sax.TransformerHandler;
import java.util.HashSet;
import java.util.Set;

public class ControlStrippingTransformerHandler implements TransformerHandler {

    private TransformerHandler transformerHandler;

    public ControlStrippingTransformerHandler(TransformerHandler transformerHandler) {
        this.transformerHandler = transformerHandler;
    }

    // this is the only method we need to override!!
    @Override
    public void characters(char[] ch, int start, int length) throws SAXException {
        // remove dodgy control characters
        // for performance, we only want to create a new array if there are characters to strip
        // we only want to initialise the set of characters if there are any
        Set<Integer> charactersToStrip = null;
        for (int i = start; i < start + length; i++) {
            if (isControlCharacter(ch[i])) {
                if (charactersToStrip == null) {
                    charactersToStrip = new HashSet<>();
                }
                charactersToStrip.add(i);
            }
        }
        if (charactersToStrip != null && charactersToStrip.size() > 0) {
            // this array only needs to be the specific section of the input string
            // at most this array will be 255 characters in length, since the
            // input SAX class has a 256 character array
            char[] newArray = new char[length - charactersToStrip.size()];
            int newArrayIndex = 0;
            for (int i = start; i < start + length; i++) {
                if (!charactersToStrip.contains(i)) {
                    newArray[newArrayIndex] = ch[i];
                    newArrayIndex++;
                }
            }
            ch = newArray;
            start = 0;
            length = length - charactersToStrip.size();
        }

        transformerHandler.characters(ch, start, length);
    }

    private boolean isControlCharacter(char c) {
        if (c < 9 || c == 11 || c == 12 || (c > 13 && c < 32)) {
            return true;
        }
        return false;
    }

    // other methods just call the wrapped transformer methods
}
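For completeness, here is a rough sketch of how the wrapper might be plugged in. This is my own illustration rather than our exact production code: it assumes the delegate methods are implemented as described above, and the marshaller, XSLT file and output file are placeholders:

import javax.xml.bind.Marshaller;
import javax.xml.transform.sax.SAXResult;
import javax.xml.transform.sax.SAXTransformerFactory;
import javax.xml.transform.sax.TransformerHandler;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;
import java.io.File;

public class SafeMarshalling {

    // hypothetical helper: marshal a JAXB element through the XSLT with control characters stripped
    public static void marshalSafely(Marshaller marshaller, Object jaxbElement, File xslt, File output) throws Exception {
        SAXTransformerFactory stf = (SAXTransformerFactory) SAXTransformerFactory.newInstance();
        TransformerHandler realHandler = stf.newTransformerHandler(new StreamSource(xslt));
        realHandler.setResult(new StreamResult(output));

        // wrap the real handler so control characters never reach it
        TransformerHandler safeHandler = new ControlStrippingTransformerHandler(realHandler);

        // a TransformerHandler is a SAX ContentHandler, so JAXB can write to it via a SAXResult
        marshaller.marshal(jaxbElement, new SAXResult(safeHandler));
    }
}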
Posted in Java, XML

Hibernate Best Practices – Parameterising Criteria Queries

Recently one of our database team pointed out some queries that were not running efficiently. We looked into the queries and they were Hibernate criteria queries. Consider a simple helper method that is creating a predicate for use in a criteria query:
private Predicate restrictToAcceptedTxWith(Long policyTermId, DateTime at, Root<PolicyTransaction> ptxRoot, CriteriaBuilder cb) {
    return cb.and(
            cb.isNotNull(ptxRoot.get("acceptedDate")),
            cb.equal(ptxRoot.get("policyTerm").get("id"), policyTermId),
            cb.greaterThanOrEqualTo(ptxRoot.get("transactionPeriodEndDate"), at),
            cb.lessThanOrEqualTo(ptxRoot.get("transactionPeriodStartDate"), at)
    );
}
There doesn’t seem to be anything wrong with this; it looks like a regular bit of Hibernate code, so what is the problem? Well, it takes a parameter – in this case policyTermId – but when it passes that into the criteria builder equal() method, it doesn’t explicitly create it as a parameterised expression. Why is this a problem? To understand, you need to know how a database evaluates and runs a SQL statement. To evaluate a query, a database must do the following:
  1. Construct one or more potential query plans
  2. Choose a plan
  3. Execute the query using the chosen plan

The reason SQL statements can result in multiple query plans is that SQL is a declarative language, rather than an imperative language like Java. You are telling the database conceptually what you want, without specifying exactly how it needs to be done. If you want to join tables A, B, C and D, you could do that by joining A to B, then the result of that to C, then the result of that to D. Or you could join A to D, then the result of that to B, then the result of that to C, and so on. In fact you could even join A to B, then separately join C to D, then join those two results together. (This is usually called a bushy query plan.)

Choosing a query plan takes time, so once one is chosen, it is cached: if the same SQL query is run again, just with different parameters, the query plan choice does not have to be repeated.

This is the problem here. All databases understand the concept of parameterised SQL. Consider a simple query:

select * from policy where policy_number = ?
The database constructs the query plan and caches it based on the parameterised SQL. However, if you send in SQL where a parameter has actually been inlined into the SQL, like this:
select * from policy where policy_number = 123
The database has no understanding that “123” is meant to be a parameter, so the query plan (if cached at all) would be cached on the full query string. Then when you send in:
select * from policy where policy_number = 456
the database will not pick up the cached plan. It has to reevaluate the query plan options.
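The same distinction is visible one level down, at the JDBC level. Here is a minimal sketch of my own (assuming a policy table with a numeric policy_number column), showing the parameterised and inlined variants side by side:

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

class PlanCacheExample {

    // parameterised: the statement text is always the same, so the database can reuse the cached plan
    ResultSet parameterised(Connection connection, long policyNumber) throws SQLException {
        PreparedStatement ps = connection.prepareStatement(
                "select * from policy where policy_number = ?");
        ps.setLong(1, policyNumber);
        return ps.executeQuery();
    }

    // inlined: every distinct policy number produces a different statement text,
    // so a plan cached for one value is useless for the next
    ResultSet inlined(Connection connection, long policyNumber) throws SQLException {
        Statement st = connection.createStatement();
        return st.executeQuery("select * from policy where policy_number = " + policyNumber);
    }
}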

In the case of the Hibernate criteria, it can be parameterised as follows:

private Predicate restrictToAcceptedTxWith(DateTime at, Root<PolicyTransaction> ptxRoot, CriteriaBuilder cb) {
    final ParameterExpression<Long> policyTermIdParam = cb.parameter(Long.class, POLICY_TERM_ID_PARAM_NAME);
    return cb.and(
            cb.isNotNull(ptxRoot.get("acceptedDate")),
            cb.equal(ptxRoot.get("policyTerm").get("id"), policyTermIdParam),
            cb.greaterThanOrEqualTo(ptxRoot.get("transactionPeriodEndDate"), at),
            cb.lessThanOrEqualTo(ptxRoot.get("transactionPeriodStartDate"), at)
    );
}

// calling code then calls
 getCurrentSession().createQuery(cq).setParameter(POLICY_TERM_ID_PARAM_NAME, policyTermId).list();
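Putting it together, a hedged sketch of what the full calling code might look like (the getCurrentSession() helper, the POLICY_TERM_ID_PARAM_NAME constant and the at and policyTermId values are assumed from the surrounding DAO):

CriteriaBuilder cb = getCurrentSession().getCriteriaBuilder();
CriteriaQuery<PolicyTransaction> cq = cb.createQuery(PolicyTransaction.class);
Root<PolicyTransaction> ptxRoot = cq.from(PolicyTransaction.class);
cq.select(ptxRoot).where(restrictToAcceptedTxWith(at, ptxRoot, cb));

List<PolicyTransaction> transactions = getCurrentSession()
        .createQuery(cq)
        .setParameter(POLICY_TERM_ID_PARAM_NAME, policyTermId)
        .list();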
Now the generated SQL will be parameterised, and the database will be able to cache the query plan.
Posted in Hibernate

Hibernate internals – how does pagination work?

Hibernate supports pagination for all query types – HQL, Criteria and SQL, via the query setMaxResults() and setFirstResult() methods, but how does this work under the covers? Well, assuming your database supports it, it will add a row_number() to your query, then wrap this inside another query to pick out the desired rows. You can see the SQL for yourself if you turn on Hibernate SQL logging (by setting the org.hibernate.SQL log level to DEBUG):
2021-05-17 16:22:00,452 http-nio-8081-exec-5 DEBUG [org.hibernate.SQL] 
WITH query AS (SELECT inner_query.*, ROW_NUMBER() OVER (ORDER BY CURRENT_TIMESTAMP) 
as __hibernate_row_nr__ 
FROM ( select
  clm.claim_id claimId, 
  clm.claim_number claimNumber,
  clmst.code claimStatusCode, 
  clmst.description claimStatusDescription,
  ppb.companyname policyholderCompanyName
You can see what is going on if you look inside the Hibernate Loader class:
protected SqlStatementWrapper executeQueryStatement(
      String sqlStatement,
      QueryParameters queryParameters,
      boolean scroll,
      List<AfterLoadAction> afterLoadActions,
      SharedSessionContractImplementor session) throws SQLException {
   // Processing query filters.
   queryParameters.processFilters( sqlStatement, session );
   // Applying LIMIT clause.
   final LimitHandler limitHandler = getLimitHandler(
         queryParameters.getRowSelection()
   );
   String sql = limitHandler.processSql( queryParameters.getFilteredSQL(), queryParameters.getRowSelection() );
In my case I’m running on SQL Server and the limit handler is the class SQLServer2005LimitHandler. If you look inside this class you can see the code that parses the SQL statement, adds a row_number() and wraps it in an outer query:
/**
 * Add a LIMIT clause to the given SQL SELECT (HHH-2655: ROW_NUMBER for Paging)
 *
 * The LIMIT SQL will look like:
 *
 * <pre>
 * WITH query AS (
 *   SELECT inner_query.*
 *        , ROW_NUMBER() OVER (ORDER BY CURRENT_TIMESTAMP) as __hibernate_row_nr__
 *     FROM ( original_query_with_top_if_order_by_present_and_all_aliased_columns ) inner_query
 * )
 * SELECT alias_list FROM query WHERE __hibernate_row_nr__ >= offset AND __hibernate_row_nr__ < offset + last
 * </pre>
 *
 * When offset equals {@literal 0}, only <code>TOP(?)</code> expression is added to the original query.
 *
 * @return A new SQL statement with the LIMIT clause applied.
 */
@Override
public String processSql(String sql, RowSelection selection) {
   final StringBuilder sb = new StringBuilder( sql );
   if ( sb.charAt( sb.length() - 1 ) == ';' ) {
      sb.setLength( sb.length() - 1 );
   }
   if ( LimitHelper.hasFirstRow( selection ) ) {
      final String selectClause = fillAliasInSelectClause( sb );
      final int orderByIndex = shallowIndexOfPattern( sb, ORDER_BY_PATTERN, 0 );
      if ( orderByIndex > 0 ) {
         // ORDER BY requires using TOP.
         addTopExpression( sb );
      }
      encloseWithOuterQuery( sb );
      // Wrap the query within a with statement:
      sb.insert( 0, "WITH query AS (" ).append( ") SELECT " ).append( selectClause ).append( " FROM query " );
      sb.append( "WHERE __hibernate_row_nr__ >= ? AND __hibernate_row_nr__ < ?" );
   }
   else {
      addTopExpression( sb );
   }
   return sb.toString();
}
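For reference, the kind of call that drives all of this is just setFirstResult() and setMaxResults() on the query. A minimal sketch (the Claim entity and page size are hypothetical):

import org.hibernate.Session;
import java.util.List;

class PaginationExample {

    // fetch the third page of 50 claims; against SQL Server, Hibernate rewrites this
    // into the wrapped ROW_NUMBER() query shown above
    List<Claim> thirdPage(Session session) {
        return session.createQuery("from Claim order by claimNumber", Claim.class)
                .setFirstResult(100)   // offset = pageNumber * pageSize = 2 * 50
                .setMaxResults(50)     // page size
                .list();
    }
}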
Posted in Hibernate, Java

Memory leak with TestNG and Spring DirtiesContext annotation

We recently added three new integration tests to our test suite, all of them Spring integration tests. Normally in a Spring integration test, the entire code for a test method operates in a single transaction, and at the end of that test method all database changes are rolled back, so your Spring context is safe to be cached and reused for the next test. However, sometimes you are invoking code that starts and commits its own transactions, meaning data will still be in the database after the test has finished. To help deal with this scenario, Spring provides the @DirtiesContext annotation, which tells Spring to remove the context from its cache so a new context will be created for the next test. All three new tests used the @DirtiesContext annotation.

However, after they were checked in, the build failed with an out of memory error. We configured Java to generate a heap dump on out of memory, and I looked at the heap dump with the Eclipse Memory Analyser Tool (MAT). This is what the results showed:

Yikes! There were eight separate instances of the Hibernate SessionFactoryImpl! It transpires that TestNG has a known bug in that it keeps references to your tests even after each test is completed. This means that even though Spring was removing the context from its cache and a new context was being created, most of the old context was not eligible for garbage collection.

TestNG memory leak bug 1461

My colleague Elko actually wrote an annotation for us that we use in our test suite, which clears the entire database after a test method. Not only does this avoid the TestNG memory leak, but it is massively faster, as it avoids rebuilding the test context, which takes around 30-45 seconds. It is a TestNG listener which does the following (a rough sketch follows the list):

  • Turns off referential integrity
  • Queries the information_schema table to get all table names
  • Iterates over all tables and truncates each one
  • Turns referential integrity back on
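A rough sketch of what such a listener could look like is below. This is my own illustration rather than Elko’s actual code: it assumes H2-style referential integrity commands and information_schema layout, and it opens a JDBC connection directly (with a made-up URL) rather than borrowing one from the Spring context. In practice you would register it via the @Listeners annotation or the TestNG suite XML, and adapt the SQL for your database.

import org.testng.IInvokedMethod;
import org.testng.IInvokedMethodListener;
import org.testng.ITestResult;

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;
import java.util.ArrayList;
import java.util.List;

public class DatabaseCleanupListener implements IInvokedMethodListener {

    @Override
    public void afterInvocation(IInvokedMethod method, ITestResult result) {
        if (!method.isTestMethod()) {
            return;
        }
        try (Connection conn = DriverManager.getConnection("jdbc:h2:mem:testdb", "sa", "");
             Statement stmt = conn.createStatement()) {
            // turn off referential integrity (H2 syntax; SQL Server would use ALTER TABLE ... NOCHECK CONSTRAINT)
            stmt.execute("SET REFERENTIAL_INTEGRITY FALSE");

            // query the information_schema to get all table names
            List<String> tables = new ArrayList<>();
            try (ResultSet rs = stmt.executeQuery(
                    "SELECT table_name FROM information_schema.tables WHERE table_schema = 'PUBLIC'")) {
                while (rs.next()) {
                    tables.add(rs.getString(1));
                }
            }

            // iterate over all tables and truncate each one
            for (String table : tables) {
                stmt.execute("TRUNCATE TABLE " + table);
            }

            // turn referential integrity back on
            stmt.execute("SET REFERENTIAL_INTEGRITY TRUE");
        } catch (Exception e) {
            throw new RuntimeException("Database cleanup failed", e);
        }
    }

    @Override
    public void beforeInvocation(IInvokedMethod method, ITestResult result) {
        // nothing to do before the test
    }
}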
Posted in Java, Testing, TestNG

Jetson Nano and TensorFlow

I got a Jetson Nano for Christmas from my brother-in-law. This is a Single Board Computer (SBC), similar to a Raspberry Pi, but more oriented towards machine learning as it has a powerful nVidia GPU on it. Neural network frameworks like TensorFlow run most efficiently on GPUs.

The components I’m using are:

  • Jetson Nano 2GB
  • Raspberry Pi USB-C power supply
  • Logitech wireless USB keyboard
  • 64GB SD card
  • PiHut USB switch (so you can switch the Nano power on and off without having to use the plug socket)

Setup instructions from nVidia are pretty good:

https://developer.nvidia.com/embedded/learn/get-started-jetson-nano-devkit

Note that the Nano comes in two versions – 4GB and 2GB. The first download link for the operating system is the 4GB version, so if you are using the 2GB version like me, make sure you click the second download link. It is a 6GB zip which took around an hour to download. After that it needs to be unzipped, which produces a 14GB disc image file, which is then written to your SD card. I’m using an Apple Mac, so as per the nVidia instructions, I used Etcher to copy the file to the SD card, which took around 10 minutes to write and another 3 minutes to validate. The operating system is an Ubuntu variant called Linux4Tegra, which runs the Lightweight Desktop Environment (http://www.lxde.org/).

Note that the Jetson Nano does not include a built-in wireless network card. In my case, I have a wireless range extender plugged in, and I’m using a wired connection to that. Alternatively, the latest versions of the Nano support wireless via a USB adapter. For example:

https://learn.sparkfun.com/tutorials/adding-wifi-to-the-nvidia-jetson/all

After getting the Nano up and running, I installed Python 3 and TensorFlow by following these instructions:

https://docs.nvidia.com/deeplearning/frameworks/install-tf-jetson-platform/index.html

I added an alias to .bash_aliases so that “python” is aliased to “python3”. (If you add an alias like this, it is recommended you add it for your user, not as a system wide alias, as this could break operating system functionality that requires python 2.)

Then I ran through the TensorFlow quickstart tutorial, which builds an image classification neural network using the MNIST dataset of handwritten digits from 0-9:

https://www.tensorflow.org/tutorials/quickstart/beginner

Although the code is all included in the above link, I want to inline it here to show just how short it is:

#!/usr/bin/python3

import tensorflow as tf

# load the standard sample data from the MNIST dataset,
# which is images of the handwritten digits 0-9
mnist = tf.keras.datasets.mnist

# load both training and test data
(x_train, y_train), (x_test, y_test) = mnist.load_data()
# the images have pixel values from 0-255. these need to be scaled to be from 0-1
x_train, x_test = x_train / 255.0, x_test / 255.0

# now define our model
model = tf.keras.models.Sequential([
  tf.keras.layers.Flatten(input_shape=(28, 28)),
  tf.keras.layers.Dense(128, activation='relu'),
  tf.keras.layers.Dropout(0.2),
  tf.keras.layers.Dense(10)
])

predictions = model(x_train[:1]).numpy()

tf.nn.softmax(predictions).numpy()

loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)

loss_fn(y_train[:1], predictions).numpy()

model.compile(optimizer='adam',
              loss=loss_fn,
              metrics=['accuracy'])

model.fit(x_train, y_train, epochs=5)

model.evaluate(x_test,  y_test, verbose=2)

model(x_test[:5])

I think it is amazing that in around 20 lines of Python, we can load a dataset, define a neural network, train it and test it! Of course, the Python is orchestrating the process, not doing the heavy lifting. The hard work is done by compiled C++ and CUDA code running on the GPU. Running this on the Nano took a couple of minutes. The output shows the work being sent to the GPU, as we expect. Here is an edited excerpt from the output:

2020-12-30 11:46:56.368109: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1884] Adding visible gpu devices: 0
2020-12-30 11:48:03.136928: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1428] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 22 MB memory) -> physical GPU (device: 0, name: NVIDIA Tegra X1, pci bus id: 0000:00:00.0, compute capability: 5.3)
Epoch 1/5
1875/1875 [==============================] - 17s 9ms/step - loss: 0.2918 - accuracy: 0.9158
Epoch 2/5
1875/1875 [==============================] - 17s 9ms/step - loss: 0.1385 - accuracy: 0.9584
Epoch 3/5
1875/1875 [==============================] - 16s 9ms/step - loss: 0.1049 - accuracy: 0.9678
Epoch 4/5
1875/1875 [==============================] - 17s 9ms/step - loss: 0.0864 - accuracy: 0.9733
Epoch 5/5
1875/1875 [==============================] - 16s 9ms/step - loss: 0.0738 - accuracy: 0.9774
2020-12-30 11:51:36.564930: W tensorflow/core/framework/cpu_allocator_impl.cc:81] Allocation of 31360000 exceeds 10% of free system memory.
313/313 - 2s - loss: 0.0733 - accuracy: 0.9774
Posted in Machine Learning, Python

Using live templates in IntelliJ

If you need to write repeated text in IntelliJ then you can use its live templating function to help you. Suppose I’m writing a Liquibase script that will be composed of many similar changesets:

<changeSet id="MapSalvageCategoryAToMIAFTR" author="proctorh">
    <preConditions onFail="MARK_RAN">
        <and>
            <sqlCheck expectedResult="0">SELECT COUNT(*) FROM EXTERNAL_SYSTEM_MAPPED_PROP WHERE REF_DATA_TABLE =
                'rd_salvage_category' AND REF_DATA_CODE='CATEGORYA' AND EXTERNAL_SYSTEM_ID = (SELECT ID FROM
                RD_EXTERNAL_SYSTEM WHERE CODE='MIAFTR')
            </sqlCheck>
        </and>
    </preConditions>
    <sql>INSERT INTO EXTERNAL_SYSTEM_MAPPED_PROP       (EXTERNAL_SYSTEM_ID,REF_DATA_TABLE,REF_DATA_CODE,EXT_SYS_PROPERTY_VALUE) (SELECT
        ID,'RD_SALVAGE_CATEGORY','CATEGORYA','A' FROM RD_EXTERNAL_SYSTEM WHERE CODE='MIAFTR')
    </sql>
</changeSet>

I want to repeat these inserts, but with different values for the ref data code, and what it is being mapped to.

  • Select the code and go to Tools -> Save as live template.
  • Choose the abbreviation for the template.
  • Edit the template to insert variables where required.
<changeSet id="MapSalvageCategoryAToMIAFTR" author="proctorh">
    <preConditions onFail="MARK_RAN">
        <and>
            <sqlCheck expectedResult="0">SELECT COUNT(*) FROM EXTERNAL_SYSTEM_MAPPED_PROP WHERE REF_DATA_TABLE =
                'rd_salvage_category' AND REF_DATA_CODE='$refdata$' AND EXTERNAL_SYSTEM_ID = (SELECT ID FROM
                RD_EXTERNAL_SYSTEM WHERE CODE='MIAFTR')
            </sqlCheck>
        </and>
    </preConditions>
    <sql>INSERT INTO EXTERNAL_SYSTEM_MAPPED_PROP
        (EXTERNAL_SYSTEM_ID,REF_DATA_TABLE,REF_DATA_CODE,EXT_SYS_PROPERTY_VALUE) (SELECT
        ID,'RD_SALVAGE_CATEGORY','$refdata$','$value$' FROM RD_EXTERNAL_SYSTEM WHERE CODE='MIAFTR')
    </sql>
</changeSet>

Now you can repeat the code block by doing:

  • Ctrl / Command + J to bring up the Insert template menu. The caret will be positioned on the first variable. Type the variable value, then press return to go to the next variable.

Posted in IntelliJ

IntelliJ Hints and Tips

Most useful keyboard shortcuts (these are for Mac):

  • CMD + O Open class
  • CMD + SHIFT + O Open file
  • CMD + F Find in file
  • CMD + R Replace in file
  • CMD + SHIFT + F Find in path
  • CMD + SHIFT + R Replace in path
  • ALT + ENTER Quick fix (for problem under cursor)
  • CTRL + ALT + O Organise imports (remove unused)
  • CMD + N Generate (getters, setters, toString etc)
  • CTRL + I Implement interfaces
  • ALT + CMD + L Format file

Full list:
https://resources.jetbrains.com/storage/products/intellij-idea/docs/IntelliJIDEA_ReferenceCard.pdf

Also:

https://www.jetbrains.com/help/idea/mastering-keyboard-shortcuts.html

For efficient editing, also useful to be aware of:

  • Right click -> Refactoring options, such as rename, extract constant, extract method
  • New -> Scratch file – allows you to create a temporary file of any kind – text, xml, sql etc.
  • Regular expression search and replace
  • Block / multi column editing. Hold down ALT and drag to select an area with the mouse. You’ll get a cursor on each line and anything you type will be repeated on all lines. Or hold down SHIFT + ALT to click and put multiple cursors anywhere.
  • String manipulation plugin: https://plugins.jetbrains.com/plugin/2162-string-manipulation
  • Zero width character plugin: https://plugins.jetbrains.com/plugin/7448-zero-width-characters-locator

Code Swapping

  • When deploying a WAR file, deploy the exploded version to make it easier to recompile and repackage changes into it.
  • Configure “Build artifact” in the run config.
  • While debugging and changing a single class, right-click and choose “Compile XXX.java”. This causes the class to be reloaded.
  • While debugging and changing multiple classes, use the Build Project button (Ctrl-F9 / Cmd-F9). This causes all affected classes to be reloaded.
  • For non-Java files such as XML files, right click and select “Package file”. This will move the file over to the exploded target location.
  • Standard hot code swap supports changing code within a single method. However, it will not support many other changes, e.g. changing method signatures, or updating a Spring context (adding, changing or removing beans).

Debugger tips

  • Evaluate expression – allows you to evaluate an expression in the current context, i.e. access variables, collections etc. You can write the expression on one line or switch to the multi line option.
  • Catch on Exception – if you don’t know where in the code you need to stop, but you know that an exception is being thrown, this will pause execution at that point. You can specify which sort of exception you are interested in.
  • Conditional breakpoint – e.g. if you are in a loop which is executing 5000 times, this will allow you to stop when a specific expression is true. Make sure your expression is null safe though!
  • “Drop Frame” – you can right-click on any stack frame in the debugger and drop back to that point in the execution, no need to rerun.
  • “Disabled until the selected breakpoint is hit” – which means that you can have one breakpoint depend on another.

Posted in IntelliJ, Java

Data migration in SQL Server

Recently I’ve had to write a data migration for SQL Server to split a large table (28 million rows) into separate tables. Some notes here on my thoughts…

Firstly, SQL Server has INSERT…SELECT syntax which allows you to copy from one table to another. It seems like any solution will be based around using this.

Secondly, my assumption is that for a large migration, we’ll need to run in batches, with a transaction for each batch, as it will take too long to run in a single transaction.

My first idea was to write something like this and run it inside a loop, breaking out when no more rows were being copied:

BEGIN TRANSACTION X
SET IDENTITY_INSERT TARGET_TABLE ON
INSERT INTO TARGET_TABLE
  (ID, 
   VERSION, 
   TSTAMP, 
  ...other fields here
)
select top 100000 
  source.ID, 
  source.VERSION,
  source.TSTAMP, 
  ...other fields here
from source_db.dbo.source_table source
left outer join TARGET_TABLE x
    on source.id = x.id
where source.item_type = 'REQUIRED_TYPE'
and x.id is null
 
SELECT @rows_processed = @@rowcount
SET IDENTITY_INSERT TARGET_TABLE OFF
COMMIT TRANSACTION X
However, testing this with millions of rows suggested it was taking too long to perform the left join: the time to do the join increases with every batch, as we keep adding rows to the target table. Since the target table is new, and hence has no rows to begin with, and we have an integer primary key, I ended up changing the where condition on the INSERT..SELECT to the following:
 WHERE source.ITEM_TYPE = 'REQUIRED_TYPE'
     AND source.ID > (SELECT COALESCE(MAX(ID),0) FROM TARGET_TABLE)
     ORDER BY source.ID
This means there is no join, just identifying the max id. Because we are ordering by the id and that is the primary key, there is no sorting required. In my testing, this took around 1 min 20 seconds to copy 3 million rows, compared to around 20 minutes for the join based approach.

I also had to migrate audit table data. This is interesting for two reasons. Firstly, the audit tables don’t have a single primary key, but rather a composite one. Secondly, the target tables already have some data, so this is more of a merge than just a copy. For this reason I ended up still using a left join for the audit data migration. I experimented with the batch size: a size of 10,000 took 28 minutes to copy 3 million rows, whereas a batch size of 100,000 took 18 minutes. This makes sense if the join is taking a long time, as we’ll be reducing the number of times we do the join by a factor of 10.

I suspect the performance of the audit migration could be improved by changing the implementation so that the join is only performed once, e.g. perform an initial join to work out which rows need to be copied, and store that either in a physical table or a table variable, ordered by the same composite primary key. Then use a cursor or OFFSET FETCH to get the next batch of ids to be copied, and use that to join with the source table to get the rows to copy. However my current implementation is fast enough, so this is one to investigate in a future migration perhaps.

Finally, it is interesting to note that in my first implementation, I used some count(*) queries so that I could report progress in the logs. However, with millions of rows, the time taken to perform these queries becomes non-trivial if you are doing a count with every batch. In my testing it could take 5-6 seconds to count around 3 million rows, so doing that for 100 batches would mean 10 minutes spent just performing counts.

Links

Insert select: https://www.w3schools.com/sql/sql_insert_into_select.asp
Posted in Databases and SQL, SQL Server

log4j in maven tests

If you want to specify the log4j configuration when running tests via Maven, you can do so by updating the Maven Surefire plugin configuration to point to a specific log4j configuration file. For log4j 2 the appropriate system property is called log4j.configurationFile, e.g.
<plugin>
    <groupId>org.apache.maven.plugins</groupId>
    <artifactId>maven-surefire-plugin</artifactId>
    <configuration>
        <systemProperties>
            <property>    
                <name>log4j.configurationFile</name>
                <value>src/test/resources/log4j2-module-name.xml</value>
            </property>
        </systemProperties>
    </configuration>
</plugin>
You can see from the snippet that, since the tests run from the top level of the current module, the path to the file needs to include the path to the test resources folder. Alternatively, the log4j.configurationFile name does support a fully qualified URL in the form file://. The other thing you will notice is that I haven’t used the default name of “log4j2.xml” as the file name. This is because in a large maven project it can be confusing to have many files with the same name, so I prefer to put the module name into the file name to make things clearer.
Posted in Java, Maven

Generating code with JavaPoet

Why write code when you can generate it? There are lots of situations when it makes more sense to generate. In this article I’m going to work through an example of how to use JavaPoet and Apache BeanUtils to write a class that will generate domain to DTO conversion code.

In our app, due to the gradual removal of our old way of doing things, we have a lot of code that does the following:

domain object -> legacy DTO object -> new DTO object

The legacy DTO objects are no longer needed, so now we would like to delete them. Really we want the code to convert from the domain object directly to the new DTO object. When doing this sort of conversion, you always face a choice. You could write a generic converter class, which understands all of the data conversions that you need to perform, and uses runtime reflection to simply iterate over all of the properties and convert each one. However, one major problem with this is that it is very fragile – you cannot search for usages of getters or setters in your IDE, and if someone changes or removes a property, you will end up with a runtime failure, not a build or test failure. For this reason, we want plain old Java code to do the conversion. However, we don’t want to write it by hand, so it makes sense to use JavaPoet to generate it. JavaPoet makes code generation very easy. Let me show how I used it in this scenario.

Firstly, download JavaPoet or add it to your Maven dependencies: https://github.com/square/javapoet In my case, both the domain and DTO classes are Java beans (i.e. they have properties, and each property has a getter and setter), so rather than using raw reflection, I can use the Apache BeanUtils classes to make it easier to read these properties. My Maven setup therefore includes both JavaPoet and Apache BeanUtils:

<dependency>
    <groupId>commons-beanutils</groupId>
    <artifactId>commons-beanutils</artifactId>
    <version>1.8.3</version>
</dependency>
<dependency>
    <groupId>com.squareup</groupId>
    <artifactId>javapoet</artifactId>
    <version>1.9.0</version>
</dependency>
Now let’s start solving the problem at hand. Firstly, we need to map between the properties in the DTO and the domain class, and also keep a record of any properties that exist in the DTO, but cannot be found in the domain class, so we can put warnings in the generated code to say that the properties need to be manually checked. To begin with, I’ll create a mini helper class to return a map of the properties, and any missing ones:
class PropertyInfo {
    Map<PropertyDescriptor, PropertyDescriptor> propertyDescriptorMap;
    List<String> missingProperties;

    public PropertyInfo(Map<PropertyDescriptor, PropertyDescriptor> propertyDescriptorMap, List<String> missingProperties) {
        this.propertyDescriptorMap = propertyDescriptorMap;
        this.missingProperties = missingProperties;
    }
}
Now we can write a method using Apache BeanUtils that iterates over the properties and matches them on their names:
PropertyInfo getPropertyMapping(Class source, Class target) {
    // iterate over each property / field to generate a list of properties we can deal with, and ones we cannot
    Map<PropertyDescriptor, PropertyDescriptor> propertyDescriptorMap = new HashMap<>();
    // store properties needing to be populated in target, not found in source
    List<String> missingProperties = new ArrayList<>();

    Map<String, PropertyDescriptor> sourcePropertiesByName
            = Arrays.stream(PropertyUtils.getPropertyDescriptors(source))
            .collect(toMap(PropertyDescriptor::getName, Function.<PropertyDescriptor>identity()));
    System.out.println("Source class has: " + sourcePropertiesByName.size() + " properties");

    PropertyDescriptor[] targetProperties = PropertyUtils.getPropertyDescriptors(target);
    System.out.println("Target class has: " + targetProperties.length + " properties");

    // only do declared properties for now i.e. don't go up to superclasses.
    // navigating up to superclasses would create problems as it would go all the way up to java.lang.Object
    Set<String> declaredTargetFields = new HashSet<>();
    for (Field declaredField : target.getDeclaredFields()) {
        declaredTargetFields.add(declaredField.getName());
    }
    System.out.println("Target has: " + declaredTargetFields.size() + " fields declared in class itself");

    for (PropertyDescriptor targetPropertyDescriptor : targetProperties) {
        String targetPropertyName = targetPropertyDescriptor.getName();
        System.out.println("Processing property: " + targetPropertyName);

        if (declaredTargetFields.contains(targetPropertyName)) {
            PropertyDescriptor sourcePropertyDescriptor = sourcePropertiesByName.get(targetPropertyName);
            if (sourcePropertyDescriptor != null) {
                System.out.println("Found mapping for " + targetPropertyName);
                propertyDescriptorMap.put(sourcePropertyDescriptor, targetPropertyDescriptor);
            } else {
                System.out.println("WARNING - cannot find property " + targetPropertyName + " in source");
                missingProperties.add(targetPropertyName);
            }
        } else {
            System.out.println("Skipping property: " + targetPropertyName + " as declared in superclass");
        }
    }
    return new PropertyInfo(propertyDescriptorMap, missingProperties);
}
Great, now we have enough info to generate our converter. Our conversion method will accept a domain object, and return a DTO, so the method signature will look like this:
public DTOClassName toDTO(DomainClassName domainClassParameter)
How do we do this in JavaPoet? Well, firstly, let’s work out the parameter name. For some domain class names, we just need to take the class name and convert the first letter to lowercase. For some of the domain classes I am using, the class name ends in “Impl”, which I’d like to remove. So my logic to work out the parameter name is this:
String domainClassName = domainClass.getSimpleName();
String domainClassParameterName = domainClassName.substring(0, 1).toLowerCase() + domainClassName.substring(1);
if (domainClassParameterName.endsWith("Impl")) {
     domainClassParameterName = domainClassParameterName.substring(0, domainClassParameterName.length() - 4);
}
Now we can use JavaPoet to generate the method signature, using the MethodSpec.Builder class:
MethodSpec.Builder toDTOMethodBuilder = MethodSpec.methodBuilder("toDTO")
    .addModifiers(Modifier.PUBLIC)
    .addParameter(domainClass, domainClassParameterName)
    .returns(dtoClass);
Next, we need to create a new instance of our DTO object, like this:
DTOClass dto = new DTOClass();
In JavaPoet, you use $T to indicate a type, then supply that type, like this:
toDTOMethodBuilder.addStatement("$T dto = new $T()", dtoClass, dtoClass);
Note that we have to supply the class twice here, as we have used the $T type marker twice in our statement. Why bother using this $T marker? What is wrong with just manually inserting the class name? Well, by using $T, JavaPoet understands that we are giving it a reference to a class, and it can then take care of the import for you! No need to manually keep track of what classes you need to import in your generated code, and whether you have already added an import, JavaPoet will do all that for you!
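For example, if the DTO class were a hypothetical ClaimDTO, the statement above would come out in the generated file as:

ClaimDTO dto = new ClaimDTO();

with the import for ClaimDTO added at the top of the generated file automatically.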

Now we can simply iterate over the sets of matched properties, and write the conversion code. The easiest case is of course where the property type is the same in both source and target. If every property type was the same, the code would be:

for (PropertyDescriptor domainClassProperty : domainToDTOPropertyMap.keySet()) {

   String domainClassPropertyName = domainClassProperty.getName();
   System.out.println("Processing property: " + domainClassPropertyName);

   PropertyDescriptor dtoPropertyDescriptor = domainToDTOPropertyMap.get(domainClassProperty);
   Method domainClassReadMethod = domainClassProperty.getReadMethod();
   String dtoWriteMethodName = dtoPropertyDescriptor.getWriteMethod().getName();
   final String getProperty = domainClassParameterName + "." + domainClassReadMethod.getName() + "()";

   toDTOMethodBuilder.addStatement("dto." + dtoWriteMethodName + "(" + getProperty + ")");
}
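For a hypothetical policyNumber property on a policyTransaction parameter, each pass through the loop above adds a generated statement of the form:

dto.setPolicyNumber(policyTransaction.getPolicyNumber());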
In the more general case, you need to map between different types for the properties, so you will end up with a series of if statements checking the types:
   if (Some.class.equals(domainClassProperty.getPropertyType()) &&
           Other.class.equals(dtoPropertyDescriptor.getPropertyType())) {
        // write code to convert from Some.class to Other.class
   }
   // if you have properties that might be a subclass, or an implementation of an interface, use "isAssignableFrom"
   else if (Some2.class.isAssignableFrom(domainClassProperty.getPropertyType()) &&
           Other2.class.isAssignableFrom(dtoPropertyDescriptor.getPropertyType())) {
       // write code to convert from Some2 class (or subclass) to Other2 class (or subclass)
   }
   else {
       toDTOMethodBuilder.addStatement("dto." + dtoWriteMethodName + "(" + getProperty + ")");
   }
}
Having done all the properties, we should report on any properties that couldn’t be mapped, so a developer can check these manually:
for (String property : missingProperties) {
    // in early versions of JavaPoet, use addStatement. In later versions, use addComment
    toDTOMethodBuilder.addStatement("// TODO deal with property: " + property);
}
Now we simply add the return statement, and call build() on the builder to generate the method spec:
toDTOMethodBuilder.addStatement("return dto");
return toDTOMethodBuilder.build();
To put the conversion method into a Java class and write it out, we do the following:
TypeSpec converterClass = TypeSpec.classBuilder(converterClassName)
    .addModifiers(Modifier.PUBLIC)
    .addMethod(toDTOMethod)
    .build();

JavaFile javaFile = JavaFile.builder(converterPackage, converterClass).indent("    ").build();
javaFile.writeTo(new File("/path/to/chosen/directory"));
All done! Now a developer can simply run this code to generate the conversion code, whenever they need to convert from a domain object to DTO. If desired, you can write similar code to generate an accompanying unit test.

For more examples of JavaPoet syntax, check out the readme on the github project: https://github.com/square/javapoet

Posted in Java, Uncategorized