A name by any spelling …

June 19, 2009

My non-Tamil-speaking friends often take a dig at me over typical Tamil spellings of non-Tamil words.

Tamil does not distinguish phonologically between voiced and unvoiced consonants;
phonetically, voice is assigned depending on a consonant’s position in a word.
Of the 22 official (scheduled) languages in India, Tamil stands out for this peculiarity: its script cannot represent all of these sounds.

Let us take an example.
The name Padma is pronounced with ‘Pad’ as in pud-dle and ‘ma’ as in Ma-ll.
Typical Tamil spellings range over padma -> padhma -> badhma -> bathma -> badma.

This is because Tamil has just one letter to represent ‘Pa’, ‘Pha’, ‘Ba’ and ‘Bha’.
So it is not surprising to find the name Brinda transformed into

Brinda -> Birundha -> Pirundha -> Piruntha.

But wait, though the spellings are indeed different, isn't the sound “more-or-less” the same?

If we can attach some value to the “phonetics”, then we can perhaps determine whether Brinda and Piruntha indeed sound the same!

Here is where soundex comes in!

Let us look at the names and their soundex values. The first character of a code is the starting letter of the name.
The three digits that follow encode the consonant sounds in the rest of the name. The more of the code two names share, the more alike they sound.

Brinda = B653
Birundha = B653
Pirundha = P653
Piruntha = P653

padhma = P350
badma = B350
bathma = B350
badhma = B350

Here is the javascript implementation of soundex.
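
For readers who prefer Python, here is a rough sketch that reproduces the codes listed in this post. It is not the linked JavaScript version. It assumes a simplified variant of soundex in which vowels (and h, w, y) are simply dropped, so two letters with the same digit collapse into one even when a vowel sits between them. The stricter American Soundex rule codes such a pair twice and would give K455 for Kuhlmann and T655 for Thurman instead of the values shown further down. The names `CODES` and `soundex` are my own.

```python
# Letter-to-digit table used by Soundex; vowels and h, w, y get no digit.
CODES = {}
for letters, digit in [("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                       ("l", "4"), ("mn", "5"), ("r", "6")]:
    for ch in letters:
        CODES[ch] = digit


def soundex(name: str) -> str:
    """Simplified Soundex: uncoded letters do not separate repeated digits."""
    letters = [ch for ch in name.lower() if ch.isalpha()]
    if not letters:
        return ""
    first = letters[0]
    digits = []
    previous = CODES.get(first, "")  # digit of the first letter, if it has one
    for ch in letters[1:]:
        digit = CODES.get(ch)
        if digit is None:
            continue  # vowel, h, w or y: skipped, and 'previous' is kept
        if digit != previous:
            digits.append(digit)
        previous = digit
    # First letter plus three digits, padded with zeros when too short.
    return (first.upper() + "".join(digits) + "000")[:4]


for name in ["Brinda", "Piruntha", "padhma", "Lakshmi", "Letchumi"]:
    print(name, "=", soundex(name))
```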

Let us take names from different cultures.

(1) Kuhlmann and Kulamagal

Kuhlmann = K450
Kulamagal = K452

(2) Thurman and Duraimurugan

Thurman = T650
Duraimurugan = D656

A name by any spelling sounds just as sweet, doesn't it?

But some variations are too much to ask for,
like Lakshmi -> Letchumi.

Lakshmi = L250
Letchumi = L325

Applications use soundex to overcome spelling differences in names and the like. To find out whether an applicant has any previous insurance policy, the database search is often performed using soundex to get the possible matches. Many databases provide this function out of the box.
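
To make the database lookup concrete, here is a small sketch using Python's built-in `sqlite3` module. Since not every SQLite build ships a soundex function, the sketch registers its own with `create_function`. The table `applicants`, the function name `simple_soundex` and the sample rows are all made up for illustration, and the soundex variant is the simplified one assumed for the codes in this post.

```python
import sqlite3

_GROUPS = [("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
           ("l", "4"), ("mn", "5"), ("r", "6")]
_CODES = {ch: digit for letters, digit in _GROUPS for ch in letters}


def simple_soundex(name):
    """Simplified Soundex (assumed variant): vowels never separate digits."""
    letters = [ch for ch in (name or "").lower() if ch.isalpha()]
    if not letters:
        return ""
    digits, previous = [], _CODES.get(letters[0], "")
    for ch in letters[1:]:
        digit = _CODES.get(ch)
        if digit is None:
            continue  # vowel, h, w or y
        if digit != previous:
            digits.append(digit)
        previous = digit
    return (letters[0].upper() + "".join(digits) + "000")[:4]


def find_similar_names(conn, name):
    """Return stored names whose soundex code equals that of `name`."""
    rows = conn.execute(
        "SELECT name FROM applicants "
        "WHERE simple_soundex(name) = simple_soundex(?) ORDER BY name",
        (name,),
    )
    return [row[0] for row in rows]


# Hypothetical in-memory table standing in for the policy database.
conn = sqlite3.connect(":memory:")
conn.create_function("simple_soundex", 1, simple_soundex)
conn.execute("CREATE TABLE applicants (name TEXT)")
conn.executemany(
    "INSERT INTO applicants (name) VALUES (?)",
    [("Brinda",), ("Birundha",), ("Lakshmi",), ("Padma",)],
)
print(find_similar_names(conn, "Birunda"))
```

Note that the first letter is part of the code, so a search for Piruntha (P653) would not find Brinda (B653). A real application would probably also store the code in an indexed column rather than compute it for every row.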


Troubleshooting a join

May 31, 2009


My colleague S barged into my cabin with an interesting problem in hand. S was working on a data migration project for a huge insurance firm. The migration consists of exporting data from a legacy system and importing it into a new system.

In the export process, the rows that qualify for export are found by running a fairly complex query with around 12 joins. This query selects the rows from the legacy database that are worth importing into the new system.

The customer expects certain data in the legacy system to be available in the new system. But for some reason the query does not select it, and hence it is not available in the new system. The customer is surprised at this, and my colleague S has to do some fact finding to explain why (and where in the 12-way join) the data got dropped.

This fact finding turned out to be a tedious exercise. S has a way to solve this: write a diagnostic program that splits the 12-way join into, say, 12 different queries, each adding one more join to the previous one. At each stage, the output is checked for the data of interest. The diagnostic program prints the query at which the output first fails to contain the data in question.

The question put to me was whether this approach could be improved. Because it involved quite a lot of queries, S was a little concerned about performance.

We thought for a while and came up with a minor enhancement that could speed things up a bit.

Let the queries be Q1, Q2, … Qn. The current approach is

Fire Q1 -> Examine output ->
If data available -> Fire Q2 … and so on.
Else printDiagnostics and exit.

A little thought suggests that if the output of, say, Q6 does not contain the data, then the query that dropped it must be Q6 or one before it, so there is no need to look at Q7 onwards. Again, if we fire Q3 next and observe that the data is available in its output, then the “search” for the query narrows down to Q4 -> Q6.

This is nothing more than the “divide-and-conquer” approach used in searching and sorting; here it takes the form of a binary search.

Start from the middle and, based on whether the data is available in the output, choose which half to investigate.

This solution gives us the confidence that we fire fewer queries than with the earlier approach: for 12 queries, about four in the worst case instead of up to twelve.
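
The idea can be sketched in a few lines of Python. This is not S's diagnostic program, only an illustration. The function `has_data(k)` is a hypothetical stand-in for "fire Qk and check whether the row of interest is in the output", and the sketch assumes that once the row goes missing at some stage it stays missing at every later stage (true when each Qk only adds one more inner join to the previous query).

```python
def first_failing_stage(n, has_data):
    """Return the smallest k in 1..n for which has_data(k) is False, or None.

    Assumes that a row missing at stage k is also missing at every later
    stage, so the stages form a run of True followed by a run of False.
    """
    low, high = 1, n
    answer = None
    while low <= high:
        mid = (low + high) // 2
        if has_data(mid):
            low = mid + 1   # row survived up to Qmid, so look in the later half
        else:
            answer = mid    # row is gone here; an earlier stage may be the culprit
            high = mid - 1
    return answer


# Simulated run: pretend the row is dropped by the join added in Q5.
fired = []

def has_data(k):
    fired.append(k)         # record which queries were actually fired
    return k < 5

print(first_failing_stage(12, has_data), fired)   # prints: 5 [6, 3, 4, 5]
```

In this simulated run the culprit is found after four queries, and the first two probes are Q6 and Q3, just as in the example above.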


Is Java the new COBOL?

April 5, 2009


When I started my career in 1996, I was put through COBOL training on Unisys A-series mainframes.
I was desperate to get out of mainframes because that's what my friends told me to do: don't work on outdated, legacy stuff. I was lucky to get a break, moved on to the more modern languages of the day and eventually to Java.

I often wondered what would be considered legacy a decade down the line. Of course, COBOL would still be around, the mainframes would still be there and they would still be called legacy, but who else would join this so-called ‘legacy club’?

I feel it will be Java, within a decade if not sooner.

Why?

Java changed the way business applications were built. It was instrumental in moving applications from the desktop to the web, while at the same time supporting desktop applications like never before. “Write once, run anywhere” was a dream realized. Sun's lawsuit against Microsoft over Java ended in Sun's favour, and this was a landmark moment for Java. The dawn of a new era.

Java/J2EE has dominated the enterprise world. In fact, it has gone beyond that, into search engines, AI and
distributed system infrastructure. The progress has been impressive. From being a poor performer in its 1.1 days, it is now comparable to C++ and in some cases has even overtaken it in the performance game.

With Java’s popularity and the open source movement, the Java/J2EE community enjoyed all the benefits. JBoss, the Eclipse IDE, the Spring Framework, Hibernate, Hadoop and Lucene are some of the ALL TIME GREATs in their respective areas. No other language enjoys such richness in open source frameworks of the highest quality. There are thousands more, but that isn't the focus of this post, so I stop here.

The JVM architecture was a master stroke. JIT compilers and the JVM specification encouraged the emergence of JVM-targeted languages. Every language now wants to find its place on the JVM: JRuby, Jython (JPython) and of course Groovy. No one wants to miss the Java bandwagon. With everything getting compiled into bytecode, these languages get the JVM's runtime performance for free. In addition, one can call these programs from within Java.
This paradigm is immensely powerful.

JVM-targeted languages open up a host of possibilities. Look at Jython as a rapid prototyping language. Take Peter Norvig's 21-line spell checker. Compile it with Jython, add the generated Java class to your existing modules and you get quite a performant spell checker in Java, without having to write all that verbose Java code. Enter JRuby, and there are no more Ruby vs Java wars.

Then why, why do I think Java is the next COBOL?

All is well for Java in the single-core world. Its multithreading capabilities were ahead of its time, and it wins hands down against other languages of its time. But is Java the right choice for multicore systems?

We are entering the age of multicore. Having hit the limits of physics, we do not expect single-processor speed to go up in leaps and bounds. Instead, the innovation is going to be in how many cores can be packed into a machine.

Sadly, Java does not seem to look that good in the Multicore world.

It isn't about Java per se. It is about hitting the limitations of OO languages: the limitations, or disadvantages, of writing programs with side effects. A new terrain is emerging where the OO laws don't hold good, just as Newtonian physics could not explain the quantum world.

It's time for functional programming to take over. It is time to write side-effect-free programs.
Erlang is a language with parallelization sewn into its core. It is time for such languages to take over.

Let me explain a bit more.

Multicore systems can be exploited by writing programs that parallelize the parts of the program that are parallelizable.
In the Java world, this translates to writing multithreaded programs. With multithreading comes the need to synchronize, lock and so on. This is because Java programs are stateful. Put simply, stateful programs are difficult to multithread, especially if there is shared global data.

Functional programming advocates side-effect-free programming. For example, once a variable is set, its value can never change. In a single-assignment language like Erlang, X = X + 1 is not an update at all: once X is bound, it fails with a badmatch error.

When there is no state change, there is no need for synchronization, locks etc. This makes it easier to parallelize programs.
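
A tiny Python sketch of the same point, under my own assumptions: `checksum` is a made-up pure function (its result depends only on its argument and it touches no shared data), so it can be handed to a pool of workers without a single lock. A thread pool is used only to keep the sketch short and portable; in CPython a thread pool does not by itself give CPU parallelism, and a process pool would be the usual swap for CPU-bound work.

```python
from concurrent.futures import ThreadPoolExecutor


def checksum(record: str) -> int:
    """Pure function: reads only its argument, changes nothing outside itself."""
    return sum(ord(ch) for ch in record) % 97


def checksum_all(records, workers=4):
    """Apply checksum to every record using a pool of workers.

    No locks are needed because the workers share no mutable state;
    map() returns the results in the same order as the inputs.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(checksum, records))


records = ["Brinda", "Padma", "Lakshmi", "Thurman"]
# The parallel result is the same as the plain sequential one.
print(checksum_all(records) == [checksum(r) for r in records])
```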

In my own experience, multithreaded Java programs have not been able to use multiple cores efficiently. The only way out has been to run multiple processes with a service-based approach (i.e., a cluster approach) to utilise the cores.

Yet another learning curve. I underwent the transition from procedural to object-oriented programming, and it took a while to understand classes, objects and so on. For a generation that has known only classes, functional programming will be a good mental exercise. It's worth the effort though.

Functional programming isn't new at all (XSLT is a functional programming language). It is older than OO. Academics have always favoured functional programming as it has strong mathematical underpinnings. Academics embraced relational database theory because that too has strong roots in mathematics (set theory). Academics frowned upon OO because it didn't have a firm mathematical backing. With functional programming making a comeback, I am sure they are going to love it.

If you are the kind who would not want to get tagged a legacy guy, then you would do well to practise functional programming. Pick up an Erlang book and get started. New languages will emerge along the lines of Erlang.

But wait, people are talking about having a JVM-targeted Erlang? The terms jErlang and Erlang4J are getting more hits on Google.

Hey … don't get distracted by all these. Learn to write side-effect-free programs instead.


A Developer’s itch

February 17, 2009

My son can get quite cranky when it comes to having food. So we need to be adequately stocked with animal stories, mythological stories or any random stories that he might ask for. For example, one day he demanded a “Train story”, and I had to cook one up!

To make my stories effective, I go to images.google.com and look for relevant pictures. If the story is about trains, I search for trains. At the rate of one-spoon-per-image, his food intake is quite fast!

But every time I click on a Google image result, it loads a page with many images, whereas I am interested only in the “See full size image” link. I need to click again to see the image (in full size).

This was an obvious itch, but quite an easy one to get rid of. A simple Greasemonkey script would do that. You can find my script here.

I am sure much of the cool software out there was born out of someone’s itch. Can anyone suggest a website that collects such “itches”? If there is none, then it's a good one to start.


split kitty revisited

January 15, 2009

Some time back I had posted this stuff. It was about partitioning a sequence into N more or less equal parts.

You can see my JavaScript solution here and my Python solution here. Suggestions and criticisms are most welcome.

Update: In the JavaScript version, instead of bothering to initialize the “min” variable with a BIG value, it is enough to initialize it with the first element of the input array.
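
Here is a small Python sketch of that update. It is not the linked solution. The helper `index_of_smallest` seeds its running minimum with the first element instead of a BIG sentinel value. The greedy `split_into_parts` around it is only my guess at one way such a split could work, assuming "equal" means roughly equal totals: take the amounts largest first and drop each into the part that is currently lightest.

```python
def index_of_smallest(values):
    """Return the index of the smallest element of a non-empty list."""
    if not values:
        raise ValueError("values must not be empty")
    # Seed the running minimum with the first element: no BIG constant needed.
    best_index, best = 0, values[0]
    for i in range(1, len(values)):
        if values[i] < best:
            best_index, best = i, values[i]
    return best_index


def split_into_parts(amounts, n):
    """Greedy split (an assumed approach): largest amounts first, each one
    going into the part whose running total is currently the smallest."""
    parts = [[] for _ in range(n)]
    totals = [0] * n
    for amount in sorted(amounts, reverse=True):
        i = index_of_smallest(totals)
        parts[i].append(amount)
        totals[i] += amount
    return parts


print(split_into_parts([8, 7, 6, 5, 4], 2))
```

Seeding with the first element also keeps the helper correct for any range of values, whereas a BIG constant silently breaks if every input happens to be larger than it.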