Right tools at the Wrong Hands

July 24, 2009


In this post, I wish to share my cribs on how ORM (Object Relational Mapping) tools get abused by naive developers.
The impact is such that one tends to think something is perhaps not right with ORMs.

Object Relational Mapping tools (ORM) map the Object world on to the RDBMS world.
This is a huge productivity benefit. OO developers are comfortable with the notion of Objects and dealing with objects throughout makes it all the more easier. Popular ORM implementations (in the Java world) are Hibernate, iBatis and TopLink.

The single most drawback I see with ORM in the hands of unseasoned developers is that “ORM cultivates in them a disregard for databases and DB concepts“. This isn’t an issue with ORM per se. It is the unseasoned developer who is to be blamed and all. Fine … But then, it results in badly written applications whose performance suck.

ORM’s treat a row in the database as an object (more or less). It is indeed a huge productivity gain for developers to work with objects without bothering about how it came all the way from database. Whereas there is indeed a productivity gain, the down side is the developer’s insensitivity towards the database concepts.

For example, The specification document reads, “…Get all the orders for the given customer. For each order, set the is_delivered flag to TRUE”.

The unseasoned developer translates the specification faithfully in this way

Collection myorders = OrderDAO.findbyID(“customerID”);
for (order : myorders){
order.setIsDelivered(“TRUE”);
order.save();
}

Let us look at the above trivial code from a database perspective.

Collection myorders = OrderDAO.findbyID(“someID”);

This results in the Query like “SELECT * FROM ORDERS WHERE CUSTID = ?”

Inside the loop, the followign statments
order.setIsDelivered(“TRUE”);
order.save();
translate to
UPDATE ORDER SET IS_DELV=1 WHERE ORDERID = ?

Please note that the UPDATE statement is executed for EACH order.

Instead, what we actually need is

“UPDATE ORDER SET IS_DELV=1 WHERE CUSTID = ?”

If there are N orders for a Customer, then N+1 Queries are executed in place of just One Query.

The fundamental difference between a language like Java and SQL is that while Java is a programming language
SQL is a DECLARATIVE language. SQL tells what needs to be done and omits the how. Java programming is mostly about the how.

Another classic example is … the specification reads “If there are records meeting this criteria, then update them with this value…”

The unseasoned ORM-er would do it this way

Collection myorders = order.findbyParams(x,y,z,p,q,r);
if (myorders.size > 0){
order.setSomeParam(somevalue);
order.save();
}

I cannot blame ORM for such abuse of SQL. Even developers using plain jdbc (and no ORM)
would translate the above spec as

select count(*) from ORDER WHERE
if (rowcount > 0) {
//perform the update
}

The SQL which is actually needed is

UPDATE ORDER SET SOME_PARAM = :somevalue WHERE

Obviously, if the criteria matches no records, the update would not happen anyway.

ORMs make it easier for naive developers to overlook the basics of databases.
ORM works best at the hands of those who have solid DB fundamentals and use ORM the right way.

The issues get magnified when say, 5000 users are connected, and thus there is a potential for each query to be fired 5000 times ! This means, a single query that is found to be unnecessary, amounts to firing 5000 queries. What a colossal waste of resources. Add to this any missing index etc and the problem gets compounded.

Also ORMs prevent the naive developer from appreciating joins – both complex and simple.
ORMs make it easier for developers to follow the specifications to the dot without any database considerations.

For example, A specifcation reads

“Get the f_id from Table1. Using this query on Table2 to obtain px_y. Use px_y from Table3 to get the m_imp_data.”

The ORM savvy developer faithfully follows the specifications.

String str_f_id = Table1DAO.find(“someid”).getfid();
String str_px_y = Table2DAO.findByBlah(str_f_id).getpxy();
String str_imp_data = Table3.DAO.findByBlah(str_px_y).getimpdata();

Well, as I said before, a naive developer using plain jdbc could as well do the same

String Qry1 = “select f_id from Table1 where blahID = ?”
String Qry2 = “select px_y from Table2 where x_f_id = ?”
String Qry3 = “select m_imp_data from Table3 where x_px_y = ?”

But the point is that ORMs make it a lot easier as the developer is out of touch with SQL (almost).
What is needed here is a simple inner join. You can read more about Joins here.

In a future post, I shall provide examples on the right usage of ORM and how ORM and plain jdbc should co-exist for better performance of the application.


Troubleshooting a join

May 31, 2009


My colleague S barged into my cabin with an interesting problem in hand. S was working on a data migration project for a huge Insurance firm. The migration consists export of data from a legacy system and importing it back into a new system.

In the export process, the rows that qualify to be exported is found by running a fairly complex query. The query has around 12 joins. This query filters rows from the legacy database that are worthy to be imported into the new system.

The Customer expects certain data in the legacy to be available in the new system. But for reason, the query does not filter it and hence is not available in the new system. The Customer is surprised at this and my colleague S has to do some fact finding to explain why ( and where in the 12-way join) the data got missed out.

This fact finding turned out to be tedious exercise. S has a way to solve this. Write a diagnostic query, which is same as the 12-way join but split into, say 12 different queries. At each stage, the output is analyzed for the data of interest. The diagnostic program prints out the query at which the output did not contain the data in question.

The question put to me was if there can be any improvements to this approach. Because this involved quite a lot of queries, S was a little concerned about performance.

We thought for a while and we came up with a minor enhancement that would possibly speed up things a bit.

Let the queries be Q1, Q2, … Qn. The current approach is

Fire Q1 -> Examine output ->
If data available -> Fire Q2 … and so on.
Else printDiagnostics and exit.

A little thought suggests that, If the output of, say , Q6 does not contain the data, then we need look below Q6 ie., Queries Q1 thru Q5. Again, if we fire Q3 next and observe that the data is available in the output, then the “search” for the Query narrows down to Q4-> Q6.

This is nothing more than the “divide-and-conquer” approach used in search/sort.

Start from the middle and based on whether data is available in the output, choose which half to investigate.

This solution gives us the confidence that we indeed fire lesser number of queries than the earlier approach.


A ‘relay’ blog post

August 20, 2008


When I come across blog posts that merely link to something else and not much meat from the author, I used to feel that it is a blog post for the sake of it. Such blog posts serve as a ‘relay’ or ‘repeater’ for the original post.

However, relaying-blogposts do have some advantages like for example
1) If the original post is a very helpful one ( for eg., a technical tip) and not very popular, ‘relay’ing increases the reach.
2) search engines rate these original posts higher and so better chances of a good post reaching the needy through a search.

Now, why am I saying all this ? Only because I too want to relay a particular piece of information. In this case, It is not a blog post but a mailing list posting.

We use tomcat clustering (behind Apache) with MS SQL. There occurs strange problems at the strangest of times. No connections to the database becomes available thereby raising availability concerns. Search Engines did not prove to be very efficient. Searching for “tomcat MS SQL” does indeed give (too many) results, but it is the case of finding the needle in a haystack. Most of the results are the case where they are not able to connect to MS SQL even for the first time. But, my problem is that everything works fine for sometime and THEN one fine day it conks. And this “sometime” varies. The time between failure varies as much as 1 day to 3 months.

Finally, I could nail down a post that *exactly* described my problem and some good Samaritan has provided some suggestions too. Here it is !


handling user uploaded hierarchical data

August 14, 2008

My colleague U developed an application feature that would enable our users to upload documents. Users can also organize their uploaded documents as a tree of files and folders.

We then discussed how to store this information in the database.

Method 1 : Create a folder for the user on the server side file system and create the required files and folders as how the user needs it. The user’s display of his tree structure is exactly mimicked on the server side.

Method 2 : Do not use file systems. Put all the stuff discussed above in the database, including the files.

method 3 : Create a folder for the user. All the uploaded files reside in this directory. The user’s tree structure is maintained in the database (yes, we have to handle files with duplicate names. but this is not a big deal.).

We implemented Method (3). Method (1) was rejected straightaway. The difference between Method(2) an Method(3) is not much. It is about where to store the file contents ?

I would like to know which among Methods (2) and (3) is better ? I’ve heard contradicting views on file IO Vs database SELECTs of files (blob). It was quite simple to code the Method(3) way (may be, more of a perception issue).

The advantages of Method (3) are
1. The files can be backed up easily.
2. The server on which these files reside can be moved around, for performance reasons, easily.
3. Easier to index these files using search engines like lucene.

One might argue that all these are possible with the database approach as well.
These are my counter points.
1. The dbs can be backed up — But DB size bloats up. Certainly an issue for me.
2. server can be moved around — Certainly not as easily as moving files around. I need to set up a database cluster or have some sort of sharding or have a separate db for just the file storing tables but then it IS complicated than handling files, isnt it ?
3. Index – full text index of databases … again not as elegant. It is OS specific or DB specific etc.

Method (3) wins hands down right ?
I am open to hear other’s opinion on this. Any (!null) pointers would be just great !


left outer joins

August 5, 2008

today we faced an interesting problem regarding left outer joins. i will skip the details and come to the learning part directly. more so because it was more our misunderstanding of joins rather than the problem that made the whole experience interesting.

as we know, left outer join would fill up the non matching records with null values. So when we have filter conditions on the right hand side table (whose values are capable of being null), we should be careful in choosing whether we need those null rows as well.

for example :

select A.x, B,y

from A left outer join B on A.x = B.x

where

A.y = B.y and

A.y = ‘abc’ and

B.y = ‘abc’

pls note that in some cases you might have to change the last line of the query as

( B.y = ‘abc’ OR B.y is null)