
Thursday, March 25, 2021

150,094,635,296,999,121

To me, the number 150,094,635,296,999,121 means 27 ^ 12.
It has something to do with a Data Migration project I did recently. My customer wanted to migrate all the legacy data stored in various Excel sheets, a legacy SAP system and a legacy Oracle EBS system to a new Enterprise System written in Ruby, using Ruby on Rails.
At a casual glance, migrating data from one system to another is just reading different pieces of data from different columns in different tables of the old system, then writing them to different columns in different tables of the new system.
Read from A, write to B, a piece of cake, you might think.

However, Data Migration is not only about reading data from A and writing it to B. The Migration Process must also ensure that the data written into the new system is consistent and valid. Otherwise, you will have a pile of gibberish shit instead of meaningful data. So the process of Data Validation and Sanitization is much more complicated and important than the actual process of reading and writing data.

I will not explain in detail why Data Validation and Sanitization is important and complicated, or why big corporations around the world pay hundreds of millions of dollars every year for Data Migration projects. If you still have doubts about it, you can take any one of these 3 actions:
1) Take my word for it.
2) Go study Computer Science yourself, and work on some integration projects until you know why.
3) Get the hell out of my Note, go read something else.

Anyway, sometimes complications come from a seemingly innocent validation rule similar to this:
Business Rule:
You will receive a bunch of data that represents elevators' maintenance histories.
Each elevator part that is sold and built at the customer site will have a serial number. So if the data you receive contains a Part Serial Number, you can find the Job Site Address in the paperwork. If the data contains no Serial Number but does contain a Job Site Address, you can find the Serial Number in the paperwork. If neither of them exists, the data is invalid.

A person with a more technical background will translate it into something like this:
Technical Spec:
- Serial Number: If you receive an empty Serial Number, it is not valid. If you have some data in Serial Number, check whether it exists in the database. If it does, use the existing one. If it doesn't, create a new one.

- Job Site Address: If it is empty, it is not valid. If you have some data in Job Site Address, check whether it exists in the database. If it does, use the existing one. If it doesn't, create a new one.

- However, you must combine the existence of the Serial Number with that of the Job Site Address to see if you can find one from the other, so the validation result will be the combination of those 2 attributes, as specified in the Business Rule.
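One way to read the spec above in Ruby is sketched below. This is only an illustration under assumptions, not the implementation used in the project: the `KNOWN_SERIALS` and `KNOWN_ADDRESSES` sets are hypothetical stand-ins for real database lookups, and the method names `status_of` and `group_valid?` are made up for this example.

```ruby
require "set"

# Hypothetical stand-ins for lookups against the new system's database.
KNOWN_SERIALS   = Set["SN-001", "SN-002"]
KNOWN_ADDRESSES = Set["12 Main St"]

# Classify one attribute value as :missing, :existing or :new.
def status_of(value, known_values)
  cleaned = value.to_s.strip
  return :missing if cleaned.empty?

  known_values.include?(cleaned) ? :existing : :new
end

# Business Rule for the group: the data is invalid only when
# both the Serial Number and the Job Site Address are missing.
def group_valid?(serial, address)
  serial_status  = status_of(serial, KNOWN_SERIALS)
  address_status = status_of(address, KNOWN_ADDRESSES)
  !(serial_status == :missing && address_status == :missing)
end
```

With these assumed sets, `group_valid?("SN-999", nil)` is true (the address can be looked up from the new serial number), while `group_valid?(nil, "  ")` is false.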

The Technical Spec above is easy, simple and completely understandable to most human beings, except for a few billion stupid ones at the bottom of human evolution (but most of them are, surprisingly, rich and successful human beings. Ironic, huh?)

Computer Understanding:
But computers cannot understand the Technical Spec.
As human beings, we can see that each attribute (Serial Number or Job Site Address) above can have 1 of 3 statuses: Missing (Empty Data), Existing in the Database, Not Existing in the Database.
The combination of 2 attributes gives a total of 3 ^ 2 = 9 cases, and in each case the computer must perform the relevant action.

So, if you want the computer to do the validation correctly for all the possible combinations of data, you have to specify all those 9 cases explicitly, i.e. you have to use if-else with 9 different branches, or case-when (or switch-case) with 9 explicit cases. You can choose to nest the if-else or flatten it out into combinations of AND and OR logical operators. No matter what, it still consists of 9 different flows of execution.
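To make the 9 cases concrete, here is a small Ruby sketch that lists them. The constant names are invented for this illustration, and the valid/invalid verdict simply applies the Business Rule (only "both missing" is invalid).

```ruby
ATTRIBUTE_STATUSES = %i[missing existing new].freeze

# All ordered pairs: [serial number status, job site address status].
GROUP_CASES = ATTRIBUTE_STATUSES.product(ATTRIBUTE_STATUSES)

GROUP_CASES.each_with_index do |(serial, address), index|
  both_missing = (serial == :missing && address == :missing)
  verdict = both_missing ? "invalid" : "valid"
  puts "#{index + 1}. serial=#{serial}, address=#{address} -> #{verdict}"
end

puts GROUP_CASES.size # 3 ** 2 pairs
```

Each printed line corresponds to one branch that an explicit if-else or case-when would have to handle.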

If you still have some doubts, again, you have 3 courses of action: 1) Take my word for it; 2) If you know some programming, write some PSEUDO CODE for it; 3) Get the hell out of my Note.

Even when validation fails for one attribute, you still need to go on to the next, because you will want to tell the user something like "Attribute m1, attribute m2, attribute m3, etc., attribute mN are missing" after the whole validation process has run once.

It is much easier to stop the whole process as soon as one attribute's validation fails. But consider the case when you submit some paperwork, for example for a Job Application. HR tells you that you missed your Resume. When you submit the Resume, HR tells you that you missed the Cover Letter. When you submit the Cover Letter, HR tells you that you missed the References, etc. Of course, if that happens, you will consider the HR guy super stupid. He should have told you at once that you had missed the Resume and the Cover Letter and the References, instead of telling you one-by-one each time you moved your miserable ass from your house to his office to submit the paper.

So it is the same here. If you receive a bunch of 60 attributes, you should tell the user at once which ones are missing (empty), which ones are in the database and which ones must be created, without forcing them to submit the data dozens of times over.
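The "report everything at once" idea can be sketched in Ruby as follows. This is a minimal illustration, not the project's code: `validate_all`, the `record` hash and the `known` hash (a hypothetical stand-in for database lookups) are all assumptions made for the example.

```ruby
# Check every attribute and group the attribute names by status,
# instead of stopping at the first failure.
def validate_all(record, known)
  report = { missing: [], existing: [], new: [] }

  record.each do |name, value|
    cleaned = value.to_s.strip
    status =
      if cleaned.empty?
        :missing
      elsif known.fetch(name, []).include?(cleaned)
        :existing
      else
        :new
      end
    report[status] << name
  end

  report
end

# Hypothetical input row and hypothetical known database values.
record = { serial_number: "", job_site_address: "12 Main St", part_name: "Door motor" }
known  = { job_site_address: ["12 Main St"] }

report = validate_all(record, known)
puts "Missing: #{report[:missing].join(', ')}"
puts "Already in database: #{report[:existing].join(', ')}"
puts "To be created: #{report[:new].join(', ')}"
```

The user gets one complete report per submission, whatever the number of attributes.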

Coming back to the point: if the validation of one attribute fails, you must still go on to the next ones.
In our simplified Business Rule up there, it means that you must validate all 9 cases.

For the sake of discussion, let's call the 2 attributes described above a Group.
Now, the actual Business Rules I received for the validations consist of about 12 Groups similar to the one described above. Each Group has 3 attributes on average, and each attribute has 3 possible states.

In English, writing such Business Rules for all 12 Groups takes about 2 pages of A4 paper, as you can deduce from the simplified Business Rule above. (If you cannot see it, just get the hell out of my Note.)

However, each Group has 3 attributes and each attribute has 3 possible outcomes, so each Group has 3 ^ 3 = 27 cases. If you want to solve this Validation programmatically by spelling out every combination across all 12 Groups, the number of if-else branches you have to write is, well, 27 ^ 12, or the very fucking big number 150,094,635,296,999,121.
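The arithmetic is easy to check in Ruby, which handles integers of this size natively. The constant names below are only labels chosen for this sketch.

```ruby
STATES_PER_ATTRIBUTE = 3
ATTRIBUTES_PER_GROUP = 3
GROUP_COUNT          = 12

cases_per_group = STATES_PER_ATTRIBUTE**ATTRIBUTES_PER_GROUP # 27
total_cases     = cases_per_group**GROUP_COUNT               # 27 ^ 12

# Insert thousands separators for readability.
formatted = total_cases.to_s.reverse.scan(/\d{1,3}/).join(",").reverse

puts cases_per_group
puts formatted
```

The last line prints the number in the title of this Note.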

If you had to type that many if-else branches, all 7 billion people in the world could type until the end of the Universe without completing it. If you use a computer to generate those if-else branches, the time will be a little shorter, but what size of hard disk will you need to store just the "if" and the "else"?

Yet the requirements look simple enough, 2 pages of A4 paper, and if you don't validate the data, you will have a very real possibility of ending up with a shitty useless pile of gibberish in your shiny new system with the newest technology.

How have people solved this problem in real life? Or, more accurately, how had people solved this problem in real life up until 2 weeks ago?

Well, the answer is very simple: They don't solve it. They go with one of the 3 following approaches:
Ordinary Solutions:
1) Don't validate at all, or just validate some simple cases. Let the fucking miserable users find and fix the data inconsistencies when they realize that they have some pieces of shitty data in their system.

2) Validate the painful way: If one attribute fails, stop the whole process, and force the migration team and/or users to resubmit data as many times as needed.

This is why many migration projects are so painful. There are many migration teams and users threatening to kill each other with machine guns after migration projects, and many executives have to go to mental hospitals afterward.

3) Hire tens of thousands of Indian or Chinese workers (in the future, they could very well be Vietnamese) to do the Data Entry jobs, reading from the legacy system's screens and typing the data into the new system.
Because they are human beings, they can understand very damn well the Business Rules written on 2 pages of A4. Nobody needs to spell out all 27 ^ 12 cases for them, as you must for the fucking stupid computers.

It is good for the economy of third world countries. But it is also a huge insult to the human intellect.

So, by now, you might ask: What is my Note about? Did I write my Note just to complain about the huge number 27 ^ 12?

Well, let's say I solved this problem, mathematically and programmatically. You want to know how I solved it? He..he.. How much will corporations pay me to save them the time, money and stupid work mentioned in the Ordinary Solutions above? How much can you pay me? If you can pay me nothing, you will receive no answer.

But let me tell you: The implementation is less than 100 lines of Ruby code, including the definitions of the mathematical principles, blank lines and comments.

This Note also serves as an illustration for naive people who naively claim that programming languages are just languages. If you use a static language, for example Java, there is no fucking way you can solve this problem. (Unless you use Java to write a compiler/interpreter for a dynamic language :-)) )


Written by Chau Hong Linh
Category: Computer Science
