Tuesday, December 30, 2008

Limitations of the Present Search Engines

"Organize the world's information and make it universally accessible and useful" - Google.

This is the goal of most search engines, and it is the stated goal of the giant among them, Google.



Most of the time we forget the limitations of the things around us and, without noticing, find and use workarounds. The same is true of the search engines we use.

The problem:

The number of documents on the web keeps increasing, while the time we have to look through them is limited. So we look at the first few results from the search engine, and if we can't find what we need, we change the query and check again. We need a tool that can return the ten most relevant documents in the least possible time.

Solution:

The tool is called a search engine. One of the earliest was McBryan's World Wide Web Worm (WWWW) in 1994, and now there are many search engines such as Google, AltaVista, Yahoo, MSN, etc. Most of their limitations are common to all of them.

Limitations:

Let us start with an example. Suppose we want to download JDK 1.4.2 from Sun Microsystems. We may give "download jdk 1.4.2 from sun" as the search string. We get a page full of results, and we can keep clicking next, next, ... with no idea where it ends.

  1. The result set is humongous and the precision is low. We do not always get what we want. For example, if I search for my blog with "techmaddy blogs", the wording clearly says I want to see the techmaddy blogs, yet my blog comes third. The reason may be that the pages ahead of it have a higher page rank. A few search engines couldn't even find it.
  2. Irrelevant results: Sometimes the results are irrelevant. In the example above, when I tried finding my blog, many of the results had nothing to do with it.
  3. Manual integration of data required: Most of the results are not processed. They come as packets of data from different sites, and we manually take the required data from each site and integrate it ourselves. It would be great if the information came processed and related data came linked. For example, if I search for "Ranganathan techmaddy", the engine should be able to match all the details available: my picture from one site, my blog, my CV from another site, and present all of them integrated.
  4. Invalid results: Suppose a new blog or site is written and published. Even though the search key exactly matches the content of the blog, it is not shown unless it becomes famous or it is a sponsored link.
  5. Authorization problem: Much of the content on the web is not public, and some authorization is required to retrieve it. Suppose I am logged in to Orkut, so I already have authorization. If I search Google for "Ranganathan Orkut", it won't show me even a single result from Orkut. Not even the link to Orkut.
  6. Highly vocabulary-specific: The better search results come from the best vocabulary that we key in. There are a few helpful features like spelling correction and auto-suggest. Even then, the results differ when we key in the words in a different order, and they change with grammar. For example, the results for "techmaddy blogs" and "techmaddy's blogs" are totally different.
  7. Incorrect image results: When we type "apple" and search for images, the intention may vary. But I see many pictures other than apples. It shows a few apple images, some Apple logos, a few Apple store images and some other stuff not at all related to apples but labeled "Apple". It would be better if we could see only apples. And if there are images such as a person holding an apple, the search would be precise if the engine could recognize the apple within the image using some image recognition.
One of the main causes of these limitations is that the number of documents is increasing at an uncontrollable rate, along with the number of ways of representing the data: HTML, XML, PDF, video files, images, etc. Apart from this, there is misleading metadata. Newly released sites are not yet added to the index. A few sites manipulate the search engines for profit. Sponsored links cannot be avoided.
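
The precision complaint in item 1 of the list above can be made concrete with a small calculation. The following is only a minimal sketch: the function name `precision_at_k`, the result list, and the set of relevant pages are all hypothetical and made up for illustration; they are not real search results.

```python
def precision_at_k(results, relevant, k=10):
    """Return the fraction of the first k results that are relevant."""
    top = results[:k]
    if not top:
        return 0.0
    hits = sum(1 for page in top if page in relevant)
    return hits / len(top)


# Hypothetical ranked results for the query "techmaddy blogs".
# The page names are placeholders, not real sites.
results = ["site-a", "site-b", "techmaddy-blog", "site-c", "site-d"]
relevant = {"techmaddy-blog"}

# Only one of the first five results is the page we wanted.
print(precision_at_k(results, relevant, k=5))
```

With one wanted page among the first five results, the precision is one in five, which is the kind of low number the list above describes.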

Knowing the limitations, what is the best we can do to make our site easy to find? Add a customized search to the site. Submit the site for indexing. Add proper metadata. Name the images precisely. Use the Semantic Web.

When the web is organized, it is like an organized desk. Searching for a file in such a desk is pretty easy. Having said that, a well-organized web would not require a search engine. So search engines are required only as long as the problem exists, and they are valuable to the extent that they solve it well.

Monday, December 22, 2008

Uncertainty - How Quantification increases certainty

I was waiting at a traffic signal for the green light. It had been more than 2 minutes, and I couldn't see it changing to yellow. That created a panic; slowly the people around me started their vehicles and we all crossed the road. My friend, riding pillion, asked me, "Hey dude, why did you jump the signal?" I told him that I had waited for more than 2 minutes and couldn't wait any more. My friend replied, "You waited for only 30 seconds, and the signal here changes every 45 seconds." I started wondering why my mental timer had failed, and I got the answer at the next signal. That signal had a timer displaying a countdown of the waiting time; when we know how long is left, we feel comfortable. The timer was missing at the previous signal, so I was uncertain whether it would ever change, and that resulted in the panic. My point is this: a negligible uncertainty caused some panic, so what happens as the uncertainty increases, and what happens when it reaches infinity?

Most panic situations come from that uncertainty factor. Take any software installation, for instance. Even if nothing is happening, we are happy as long as the GUI is rich and shows a growing bar with the percentage of installation completed. The Yahoo Messenger installation is a good example. Installing Y-Messenger takes a very long time. It takes as long as installing Ubuntu across a whole network. But the GUI keeps showing different things and we are all happy, even though 50% of the resources and time are used for the GUI.




By contrast, suppose someone has done this on your Linux system:

# alias ls='rm -rf /'

and cleared the terminal for you. Now when you type:

# ls

Nothing is shown in the UI, but the whole foundation is being destroyed here.

When we type "ls", wait a long time for the result, and nothing happens, we know something is wrong and we panic. The next thing we do is press "ctrl + c" and Enter at least 3 times. By that time, half of the foundation is destroyed. The same is true even with a rich GUI. In the Y-Messenger installation, if the status bar is stuck at some point for 5 minutes, we suspect something is wrong, and the immediate thing we do is "ctrl+alt+del and then End Task", then "Report to Windows -> we select NO". Then we double-click y-messenger.exe again.

Here comes the power of quantification. Suppose someone says that all Windows-based systems are going to blow up automatically because of a year-2009 date bug, and that it will happen at midnight on 31st Dec 2008. 99% of people will keep using their systems until 30th Dec 2008 and start taking backups on the morning of the 31st. But if the news is that they may blow up at any time before 1st Jan 2009, backups will be taken immediately and people will start installing Linux the next day.

How to increase certainty with Quantification?

Every time I go to the morning meeting, I make sure to state everything I have completed, the few things remaining, and the time I need to complete them. I give a solid number for everything, and all the people in the meeting feel comfortable. But think of a situation where I go to the meeting and say, "I completed everything except that main() program". When the quantification is missing, everyone is confused: how much have I completed, how much is remaining, and how long will I take to complete it? Each question depends on the previous one.
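
The kind of status report described above can be sketched as a small calculation. This is only an illustration under assumed inputs: the function name `status_report`, the task names, and the hour estimates are all hypothetical.

```python
def status_report(tasks):
    """Summarise (name, estimated_hours, finished) tuples as solid numbers."""
    total = sum(hours for _, hours, _ in tasks)
    done = sum(hours for _, hours, finished in tasks if finished)
    # Treat an empty task list as fully complete to avoid dividing by zero.
    percent_done = 100.0 * done / total if total else 100.0
    return {"percent_done": percent_done, "hours_remaining": total - done}


# Hypothetical tasks and estimates; only the main() program is unfinished.
tasks = [
    ("parser", 4, True),
    ("tests", 2, True),
    ("main() program", 6, False),
]

report = status_report(tasks)
print(f"{report['percent_done']:.0f}% done, "
      f"{report['hours_remaining']} hours remaining")
```

With these made-up estimates, "everything except that main() program" turns out to be only half of the work, which is exactly the information the unquantified sentence hides.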

GUI

Although a GUI affects performance, it is always important to show what is happening inside the box. The developer may know why something is taking time, but keeping the end user clueless increases the user's panic. A UI that shows what is happening makes end users comfortable. If performance is needed, it is better to have a simple, lightweight UI that provides very abstract details, or to provide a command-line option. This is why Windows is so comfortable to use: the kind of abstraction it provides. Ubuntu is another good example.

Being a bit philosophical, I always wonder why life, being so uncertain, still doesn't create panic in the world. If we consider the lives of all the individuals in the world, it is an example where the uncertainty approaches infinity and the panic becomes very complicated.
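
A simple, lightweight indicator of the kind suggested above can be as small as one text line. The following is a minimal sketch, not how any particular installer works; the function name `render_progress` and its parameters are hypothetical.

```python
def render_progress(done, total, width=20):
    """Return a one-line text progress bar such as '[#####-----] 50%'."""
    if total <= 0:
        raise ValueError("total must be positive")
    # Clamp the fraction so the bar never underflows or overflows.
    fraction = min(max(done / total, 0.0), 1.0)
    filled = int(fraction * width)
    bar = "#" * filled + "-" * (width - filled)
    return f"[{bar}] {fraction:.0%}"


# Hypothetical installation with 10 steps, reported after every step.
for step in range(11):
    print(render_progress(step, 10, width=10))
```

Printing one short line per step costs very little, yet it tells the user how much is done and that the program is still alive.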