My colleague S barged into my cabin with an interesting problem in hand. S was working on a data migration project for a huge Insurance firm. The migration consists export of data from a legacy system and importing it back into a new system.

In the export process, the rows that qualify to be exported is found by running a fairly complex query. The query has around 12 joins. This query filters rows from the legacy database that are worthy to be imported into the new system.

The Customer expects certain data in the legacy to be available in the new system. But for reason, the query does not filter it and hence is not available in the new system. The Customer is surprised at this and my colleague S has to do some fact finding to explain why ( and where in the 12-way join) the data got missed out.

This fact finding turned out to be tedious exercise. S has a way to solve this. Write a diagnostic query, which is same as the 12-way join but split into, say 12 different queries. At each stage, the output is analyzed for the data of interest. The diagnostic program prints out the query at which the output did not contain the data in question.

The question put to me was if there can be any improvements to this approach. Because this involved quite a lot of queries, S was a little concerned about performance.

We thought for a while and we came up with a minor enhancement that would possibly speed up things a bit.

Let the queries be Q1, Q2, … Qn. The current approach is

Fire Q1 -> Examine output ->
If data available -> Fire Q2 … and so on.
Else printDiagnostics and exit.

A little thought suggests that, If the output of, say , Q6 does not contain the data, then we need look below Q6 ie., Queries Q1 thru Q5. Again, if we fire Q3 next and observe that the data is available in the output, then the “search” for the Query narrows down to Q4-> Q6.

This is nothing more than the “divide-and-conquer” approach used in search/sort.

Start from the middle and based on whether data is available in the output, choose which half to investigate.

This solution gives us the confidence that we indeed fire lesser number of queries than the earlier approach.