Simple benchmarking

September 27, 2008


Any e-learning system holds data about its assessments and the students’ responses to (or submissions for) those assessments.
Such historical data is of tremendous use in providing value-added services to end users.

An assessment can be defined as a set of questions.
If testid represents a test and qid represents a question, then

a test is defined by the set testid = {qid1, qid2, qid3 …}

For each question, a student attempting such a test can either
1. answer the question correctly
2. answer it wrong
3. not attempt it

If students are identified by stuid, then a student’s performance in a test can be defined by the set

Performance(stuid) = {testid, stuid, {qids answered correctly}, {qids answered wrong}, {qids unattempted}}

For example,

Given a test

test1 = {q1,q2,q3,q4,q5,q6}

We can have a possible performance snapshot as

{test1, stu1, {q1}, {q3,q4}, {q2,q5,q6}}
{test1, stu2, {q1,q2}, {q5,q6}, {q3,q4}}
{test1, stu3, {q1,q5}, {q4,q6}, {q2,q3}}
{test1, stu4, {q1,q6,q2,q4,q5}, {q3}, {}}
{test1, stu5, {}, {}, {q1,q2,q3,q4,q5,q6}}

What can be inferred from this data?

For each question, we can find the triplet (how many answered correctly, n(c); how many answered wrong, n(w); how many did not attempt, n(u)).

Quite simple, isn’t it? From this triplet one can derive two useful measures.

For a given question,

Difficulty index (DI) = 1 - ( n(c) / (n(c) + n(w) + n(u)) )     (1)

% answered correctly (PA) = ( n(c) / (n(c) + n(w) + n(u)) ) * 100     (2)
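As a minimal sketch, the triplet and the two measures can be computed from the performance snapshot as follows. The names `question_stats`, `TEST1` and `PERFORMANCES` are made up for illustration, and stu3's wrong set is assumed to be {q4, q6} so that each question falls into exactly one bucket per student.

```python
from collections import Counter

# Performance records: (test, student, correct, wrong, unattempted).
# Assumption: stu3's wrong set is taken as {q4, q6} so that every question
# falls into exactly one bucket per student.
TEST1 = ["q1", "q2", "q3", "q4", "q5", "q6"]
PERFORMANCES = [
    ("test1", "stu1", {"q1"}, {"q3", "q4"}, {"q2", "q5", "q6"}),
    ("test1", "stu2", {"q1", "q2"}, {"q5", "q6"}, {"q3", "q4"}),
    ("test1", "stu3", {"q1", "q5"}, {"q4", "q6"}, {"q2", "q3"}),
    ("test1", "stu4", {"q1", "q2", "q4", "q5", "q6"}, {"q3"}, set()),
    ("test1", "stu5", set(), set(), {"q1", "q2", "q3", "q4", "q5", "q6"}),
]


def question_stats(questions, performances):
    """Return {qid: (n_c, n_w, n_u, DI, PA)} following equations (1) and (2)."""
    n_c, n_w, n_u = Counter(), Counter(), Counter()
    for _test, _stu, correct, wrong, unattempted in performances:
        n_c.update(correct)
        n_w.update(wrong)
        n_u.update(unattempted)
    stats = {}
    for qid in questions:
        total = n_c[qid] + n_w[qid] + n_u[qid]
        if total == 0:
            continue  # nobody has been given this question yet
        fraction_correct = n_c[qid] / total
        stats[qid] = (n_c[qid], n_w[qid], n_u[qid],
                      1 - fraction_correct, fraction_correct * 100)
    return stats


STATS = question_stats(TEST1, PERFORMANCES)
for qid, (c, w, u, di, pa) in STATS.items():
    print(f"{qid}: n(c)={c} n(w)={w} n(u)={u} DI={di:.2f} PA={pa:.0f}%")
```

With this data, q3 comes out with PA = 0 and DI = 1, since nobody answered it correctly.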

Even though (1) and (2) aren’t much different from each other, it makes sense to keep both.
Let us see why.

If the pool of questions used to generate assessments is subjected to the analysis above,
the difficulty index (which can be further classified as easy, medium or difficult) can form part of the question metadata.
This information helps
1. teachers to set assessments of varying difficulty levels
2. tool builders to create tools that generate assessments of varying difficulty levels.
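A rough sketch of how the DI could be bucketed and used to pick questions is given here. The cut-off values 0.35 and 0.70 and the names `difficulty_label` and `pick_questions` are assumptions made for illustration; the right thresholds would have to come from real data.

```python
def difficulty_label(di, easy_below=0.35, difficult_from=0.70):
    """Map a difficulty index in [0, 1] to 'easy', 'medium' or 'difficult'.

    The two thresholds are illustrative assumptions, not calibrated values.
    """
    if not 0.0 <= di <= 1.0:
        raise ValueError("difficulty index must lie between 0 and 1")
    if di < easy_below:
        return "easy"
    if di < difficult_from:
        return "medium"
    return "difficult"


def pick_questions(pool, label, count):
    """Return up to `count` question ids from `pool` ({qid: DI}) with the given label."""
    matching = sorted(qid for qid, di in pool.items()
                      if difficulty_label(di) == label)
    return matching[:count]
```

A generator tool could call `pick_questions` once per difficulty level to assemble a test with the mix a teacher asks for.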

From a student’s point of view, PA is more useful, as it gives the student instant benchmarking.
In the above example, q3 hasn’t been answered correctly by anyone. A student who then answers q3 correctly learns that
0% of the earlier students were able to solve it. This instant benchmarking is of immense value.
However, it is worth noting that such benchmarking of questions is an ongoing process and the DI varies over time.

Parameters like age group, expertise level and country make such a universal classification of questions highly debatable.
For instance, a 12th grader could find a 6th grade problem of level “Difficult” quite easy.
There isn’t a simple solution to this problem, except that the question metadata could indicate the age group for which the
classification (as ‘Difficult’ or ‘Easy’) holds good.

I shall post my experiences after I implement and roll it out.


Stupid … Keep it simple !

September 4, 2008


We did a small analysis to find out how much of the content free trial users view.
To our surprise, we found that a sizable chunk, around 52% of the users who sign up for the free trial, don’t even log in to the site and so end up viewing nothing.

We were dismayed at the results. Given that we did not have a nosy registration form, that we did not force it on anyone, and that it was the user’s conscious choice to visit us and try out stuff, why … why should 52% of the users go away without even logging in once?

We concluded that the data must be wrong, or that the way we got it was incorrect. We immediately looked at the code that records user logins: it was fine. We looked at the database records: they were fine. We looked at the query used to fetch the data for the analysis: it was fine. We re-ran the query and got the same results. We had to swallow the lump … indeed 52% of the users who voluntarily signed up did not bother to log in even once!

Given this bitter truth, we set about figuring out how to reduce this number. We looked into the existing process … the user gets into the site, clicks on the free trial option, fills in the form and, upon successful registration, receives an email. Now we EXPECT the USER to check the email, note the login and password details and then log back in to our site. The login name is not chosen by the user; it is a lousy alphanumeric string concocted by a piece of legacy code.
That is quite a lot that we expected the user to do.

These 52% did not bother to check the email and log back in. Perhaps some checked and had not received the email, or it went into the spam folder. There could be many reasons. But it highlights a very interesting point about user behaviour.
On the web:
Make your product as sticky as possible, as quickly as possible.
Presume all users are lazy and want the easiest and quickest way out.

Since the application generates the login ID as soon as registration is complete, we now log the user in automatically the very first time (subsequent access requires the login and password to be keyed in). Of course, we still email them as well. Next, we hid the concocted alphanumeric ID from the user and instead used their email ID as their login ID.
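The changed flow can be sketched roughly as below. This is not our actual code: `register`, `USERS` and `SESSIONS` are hypothetical names, storage is a pair of in-memory dictionaries, and sending the email is left out.

```python
import secrets

USERS = {}     # login ID (the user's email) -> password; a real system stores a salted hash
SESSIONS = {}  # session token -> login ID


def register(email):
    """Register a free-trial user and log them in straight away.

    Returns (session_token, password). The password is what would be
    emailed to the user for later logins; emailing is not shown here.
    """
    login_id = email.strip().lower()  # the email doubles as the login ID
    if login_id in USERS:
        raise ValueError("this email is already registered")
    password = secrets.token_urlsafe(9)
    USERS[login_id] = password
    # Auto-login: create the session now instead of waiting for the user
    # to read the email and come back.
    token = secrets.token_urlsafe(16)
    SESSIONS[token] = login_id
    return token, password
```

The point of the sketch is only the ordering: the session exists before the user ever has to open their inbox.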

With these changes, the figure dropped from 52% to around 15%, as almost all users clicked on the auto login option. This time, however, we were also interested in knowing how many bothered to read some content. To our satisfaction, this figure was more than 80%. This has been a fantastic lesson for us and has influenced our designs a lot since then.

Update : The stats have been updated to reflect our latest analysis.


Reacting to Requirements

August 19, 2008


The team got a request to change the login ID of an existing user. This was indeed very strange. The login ID percolates through all the database tables, and a change in the ID means moving all the data from the old ID to the new one. This was indeed an uphill (read boring, time-consuming and mundane) task.

But one wonders what could have given rise to such a requirement. More than being downbeat about doing the data migration, we were curious to know the reason behind it. It isn’t standard practice anywhere to change a login ID AND move all the data along. So I spoke to the person in the field, who had passed on this “want” of a customer to us, to get more clarity on what triggered this requirement.

The discussion went like this
us : got your mail regarding login ID change. i have some questions.
he : shoot ‘em.
us : what is wrong with the current login ID ?
he : login IDs are like employee numbers of teachers. it is confidential. they feel students could misuse it.
us : but we don’t display the login ID anywhere. not to mention that a login ID is not confidential information.
he : err … it’s like this … teacher starts the projector and the login screen comes up. now teacher logs in by typing the login ID. All the students can now see the login ID, as the teacher types it. This is what the teacher fears. Instead they would like to use a login ID that is not in any way linked to anything else.
us : but this should be a problem for all teachers. all of them have their employment number as login ID. Has anyone else complained ?
he : No. But we can’t say NO to this customer.
us : how do the others manage ? I mean, shouldn’t they log in first and then start the projector ? Or henceforth let us not encourage creation of login IDs that can represent sensitive information like employment number or bank account number etc.
he : may be you are right.
us : ok. bye.

The lessons for us have been
0. We often seek solutions by expecting the application to be extremely resilient. While this is the power of software, it can sometimes be detrimental. Solutions can be implemented at various points and through various channels (like pre-sales, sales, registration, non-technical support). Good co-operation between all these teams is crucial.
1. Let IDs not represent sensitive information. In other words, inform the user that the login ID may be seen by others as well, just like a person’s name.
2. Distinguish between users’ ‘needs’ and ‘wants’. In this case the user ‘wants’ a change in login ID. This isn’t actually ‘needed’ if they know how to use the projector OR if they had chosen a harmless login ID.
3. Developer frustration tends to reduce when the root cause is known and a proposal is made to prevent such occurrences in future.
4. Be alert. Interesting problems disguise themselves as seemingly boring maintenance activities. Ask the right questions and explore. It can’t get any more interesting!


Handling user-uploaded hierarchical data

August 14, 2008

My colleague U developed an application feature that would enable our users to upload documents. Users can also organize their uploaded documents as a tree of files and folders.

We then discussed how to store this information in the database.

Method 1 : Create a folder for the user on the server-side file system and create the files and folders exactly as the user arranges them. The user’s tree structure is mirrored exactly on the server side.

Method 2 : Do not use file systems. Put all the stuff discussed above in the database, including the files.

Method 3 : Create a folder for the user. All the uploaded files reside directly in this directory. The user’s tree structure is maintained in the database (yes, we have to handle files with duplicate names, but this is not a big deal).
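Method 3 can be sketched along these lines, assuming SQLite for the tree and a random on-disk name per file to sidestep the duplicate-name problem. The table layout and the names `open_tree_db`, `add_folder`, `add_file` and `list_children` are illustrative assumptions, not the code we shipped.

```python
import sqlite3
import uuid
from pathlib import Path


def open_tree_db(path=":memory:"):
    """Open the database that holds each user's folder/file tree."""
    db = sqlite3.connect(path)
    db.execute(
        """CREATE TABLE IF NOT EXISTS node (
               id INTEGER PRIMARY KEY,
               user_id TEXT NOT NULL,
               parent_id INTEGER REFERENCES node(id),  -- NULL for top-level entries
               name TEXT NOT NULL,                     -- name shown to the user
               stored_name TEXT                        -- file on disk; NULL for folders
           )"""
    )
    return db


def add_folder(db, user_id, parent_id, name):
    """Folders exist only as rows in the database, never on disk."""
    cur = db.execute(
        "INSERT INTO node (user_id, parent_id, name) VALUES (?, ?, ?)",
        (user_id, parent_id, name),
    )
    return cur.lastrowid


def add_file(db, storage_root, user_id, parent_id, name, data):
    """Write the bytes into the user's single flat directory and record the node."""
    user_dir = Path(storage_root) / user_id  # assumes user_id is a safe directory name
    user_dir.mkdir(parents=True, exist_ok=True)
    stored_name = uuid.uuid4().hex  # unique on disk, so duplicate display names are fine
    (user_dir / stored_name).write_bytes(data)
    cur = db.execute(
        "INSERT INTO node (user_id, parent_id, name, stored_name) VALUES (?, ?, ?, ?)",
        (user_id, parent_id, name, stored_name),
    )
    return cur.lastrowid


def list_children(db, user_id, parent_id):
    """Return [(name, is_folder)] for one level of the user's tree."""
    rows = db.execute(
        "SELECT name, stored_name IS NULL FROM node "
        "WHERE user_id = ? AND parent_id IS ? ORDER BY name",
        (user_id, parent_id),
    )
    return [(name, bool(is_folder)) for name, is_folder in rows]
```

Moving or renaming an entry in the user's tree then touches only a database row; the file on disk stays where it is.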

We implemented Method (3). Method (1) was rejected straightaway. There is not much difference between Method (2) and Method (3): it comes down to where the file contents are stored.

I would like to know which of Methods (2) and (3) is better. I’ve heard contradicting views on file IO vs database SELECTs of files (blobs). It was quite simple to code the Method (3) way (maybe more of a perception issue).

The advantages of Method (3) are
1. The files can be backed up easily.
2. The files can easily be moved to another server for performance reasons.
3. It is easier to index these files using search engines like Lucene.

One might argue that all these are possible with the database approach as well.
These are my counter points.
1. Databases can be backed up: but the DB size bloats up. Certainly an issue for me.
2. The server can be moved around: certainly not as easily as moving files around. I need to set up a database cluster, have some sort of sharding, or have a separate DB just for the file-storing tables, but then it IS more complicated than handling files, isn’t it?
3. Indexing: full-text indexing in databases is again not as elegant. It is OS-specific or DB-specific, etc.

Method (3) wins hands down, right?
I am open to hearing others’ opinions on this. Any (!null) pointers would be just great!


Whom to throw out ?

August 11, 2008

This is yet another interesting encounter at work from some time back: which user to throw out!
Let me explain.

Let us say that a web app allows a user to be logged in only once at any given time. Now if the same userID tries to log in again, should this attempt fail and the already logged-in user be retained, OR should the already logged-in user be thrown out so that the last one wins?

Common sense suggests that subsequent login attempts with a userID that is already logged in should fail. This assumes that the first login is genuine and subsequent attempts are deliberate.

Another reason why subsequent attempts should fail: assume it is not that way. Then the first login succeeds -> the second attempt succeeds, throwing out the first one -> the third attempt succeeds, throwing out the second one -> … But why would a user keep throwing himself out like this? Either way, it seems better that subsequent login attempts fail.

Time to make some more of the context clear. The web app is used by school-going kids. Their usage pattern is that they log in at school and do some work online, then come home and continue online from there. Since school kids need to work out problems, the idle timeout of the web app is set to a rather high value (2 to 3 hrs).

Given the above context, the last-one-wins approach is more appropriate. This is because kids forget to log out at school, and do not even close the browser, so when they are back home they are unable to log in until the idle timeout of the session created at school expires.
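A minimal sketch of last-one-wins, assuming an in-memory map from userID to the one session token that is currently valid. `SessionRegistry` and its methods are hypothetical names, and the idle timeout is not modelled here.

```python
import secrets


class SessionRegistry:
    """Allow one live session per user; a new login evicts the previous one."""

    def __init__(self):
        self._current = {}  # user_id -> token of the only valid session

    def login(self, user_id):
        """Start a new session; any earlier session for this user stops being valid."""
        token = secrets.token_urlsafe(16)
        self._current[user_id] = token  # overwrites the older token, if any
        return token

    def is_valid(self, user_id, token):
        """Check on every request whether this token is still the user's current one."""
        current = self._current.get(user_id)
        return current is not None and secrets.compare_digest(current, token)

    def logout(self, user_id, token):
        """Only the current session may log the user out."""
        if self.is_valid(user_id, token):
            del self._current[user_id]
```

In the school scenario, the browser left open at school holds a stale token, so its next request fails `is_valid`, while the login from home goes through immediately.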

Quite a nice experience, where a slight twist in the context made a seemingly inappropriate solution the preferred one.