Tuesday, December 30, 2008

Limitations of the Present Search Engines

"Organize the World's Information and make it Universally accessible and useful" - Google.

This is the goal of most search engines, and the giant search engine Google is no exception.



Most of the time we forget the limitations of the things around us and, without noticing, find and use workarounds. The same is true of the search engines we use.

The problem:

The number of documents on the web keeps increasing, and our ability to look through all that content is limited by time. So we look at the first few results from the search engine, and if we cannot find what we need, we change the query and check again. We need a tool that can return the ten most relevant documents in the least time possible.
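To make the "top ten" idea concrete, here is a minimal sketch in Python. It assumes every document already has a relevance score, which is the hard part and is not shown. The function name `top_k`, the document ids, and the scores are all made up for illustration.

```python
import heapq


def top_k(scores, k=10):
    """Return the k highest-scoring (doc_id, score) pairs, best first."""
    # heapq.nlargest avoids sorting the whole collection when k is small.
    return heapq.nlargest(k, scores.items(), key=lambda item: item[1])


# Hypothetical relevance scores for a handful of documents.
scores = {"doc-a": 0.91, "doc-b": 0.15, "doc-c": 0.78, "doc-d": 0.42}
best = top_k(scores, k=2)
print(best)
```

A real search engine has to do this over billions of documents, so picking the best few without ordering everything matters a lot.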

Solution:

The tool is called a search engine. One of the first was the World Wide Web Worm (WWWW) by McBryan in 1994, and now there are many search engines such as Google, AltaVista, Yahoo, MSN, etc. Most of their limitations are common to all of them.

Limitations:

Let us start with an example. Suppose we want to download JDK 1.4.2 from Sun Microsystems. We might enter "download jdk 1.4.2 from sun" as the search string. We get a page full of results and can keep clicking next, next ... without knowing when it ends.

  1. The result set is humongous and the precision is low. We do not always get what we want. For example, if I search for my blog with "techmaddy blogs", the wording clearly says I want to see the techmaddy blogs, yet my blog comes third. The reason may be that other pages have a higher page rank than mine. A few search engines couldn't even find it.
  2. Irrelevant results: Sometimes the results are irrelevant. In the example above, when I tried to find my blog, many of the results had nothing to do with it.
  3. Manual integration of data required: Most results are not processed. They come as separate pieces of data from different sites, and we have to manually pick out the required data from each site and integrate it. It would be great if the information came processed and related data came linked. For example, if I search for "Ranganathan techmaddy", it should be able to bring together all the available details, such as my picture from one site, my blog, and my CV from another, and present them integrated.
  4. Invalid results: Suppose a new blog or site is written and published. Even though the search key exactly matches the content of the blog, it is not shown unless it becomes popular or is a sponsored link.
  5. Authorization problem: Much of the content on the web is not public, and some authorization is required to retrieve it. Suppose I am logged in to Orkut, so I am authorized to see that content. If I search Google for "Ranganathan Orkut", it won't show me even a single result from Orkut. Not even a link to Orkut.
  6. Highly vocabulary specific: The better search results come from choosing the best words to key in. There are a few helpful features such as spell correction and auto-suggest. Even so, the results differ when we key in the words in a different order, and they change with grammar. For example, the results for "techmaddy blogs" and "techmaddy's blogs" are totally different.
  7. Incorrect image results: When we search for "apple", the intention may vary, but I see pictures of things other than apples. The results show a few apple images, some Apple logos, a few Apple store images, and other material not related to apples at all but labeled "Apple". It would be better if we could see only apples. And for images such as a person holding an apple, the search would be more precise if the results could recognize the apple within the image using image recognition.

One of the main reasons for these limitations is that the number of documents is growing at an uncontrollable rate, along with the ways of representing the data: HTML, XML, PDF, video files, images, etc. Apart from this, there is misleading meta data. Newly released sites are not added to the index. A few sites manipulate the search engines for profit. Sponsored links cannot be avoided.
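The "low precision" in point 1 can be put into a number. The sketch below computes precision over the first k results, meaning the fraction of them that are actually relevant. The result list is invented to mirror the case where my blog comes third, and the name `precision_at_k` is just my choice for illustration.

```python
def precision_at_k(results, relevant, k=10):
    """Fraction of the first k results that are relevant."""
    top = results[:k]
    if not top:
        return 0.0
    hits = sum(1 for doc in top if doc in relevant)
    # Dividing by len(top) rather than k is a simplification for short lists.
    return hits / len(top)


# Hypothetical ranked results for the query "techmaddy blogs".
results = ["other-1", "other-2", "techmaddy-blog", "other-3", "other-4"]
relevant = {"techmaddy-blog"}
p = precision_at_k(results, relevant, k=5)
print(p)
```

With only one relevant page among the first five, four out of five clicks are wasted, which is what the first two limitations feel like in practice.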

Knowing the limitations, what is the best we can do to make our site easy to find? Add a customized search for the site. Submit the site for indexing. Add accurate meta data. Name the images precisely. Use the Semantic Web.
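As a rough idea of what a customized site search could look like, here is a small inverted index in Python. This is only a sketch under my own assumptions: the page paths and text are made up, and the tokenizer simply lowercases the text and drops a possessive "'s", so that "techmaddy blogs" and "techmaddy's blogs" behave the same and word order does not matter.

```python
import re
from collections import defaultdict


def tokenize(text):
    # Lowercase, drop possessive "'s", and keep only alphanumeric words.
    text = text.lower().replace("'s", "")
    return re.findall(r"[a-z0-9]+", text)


def build_index(pages):
    """Map each word to the set of pages that contain it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for token in tokenize(text):
            index[token].add(url)
    return index


def search(index, query):
    """Return the pages that contain every word of the query."""
    tokens = tokenize(query)
    if not tokens:
        return set()
    result = set(index.get(tokens[0], set()))
    for token in tokens[1:]:
        result &= index.get(token, set())
    return result


# Hypothetical pages of a small site.
pages = {
    "/posts/search-limits": "Limitations of the present search engines",
    "/about": "About techmaddy blogs",
}
index = build_index(pages)
print(search(index, "techmaddy's blogs"))
```

It does no ranking at all, so it only suits a site small enough that every matching page can be shown.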

When the web is organized, it is like an organized desk: searching for a file on the desk is pretty easy. That said, a well-organized web would not require a search engine. So search engines are required only because the problem exists, and they are the better way to solve it.

7 comments:

  1. The purpose of the blog was excellent...
    We will try to follow it and develop our own site...
    Thanks Maddy.....

  2. You have done nice research on this to find the flaws of present search engines.... There are different searching techniques available in data mining, which are used by search engines... These techniques have their own limitations, as you mentioned here. No technique is perfect right now... Keep posting

  3. Yes, true... the problems we generally face with search engines are bulleted correctly...

  4. Very concisely explained. Now I'm thinking about what the solution should be for this. I just indexed my professional blog on Google with your help.

    So I'm thinking that if these are the problems of the search engines, then there should be more features in search engines. Sites like cuil.com have introduced the concept of categorizing your search. There are many more search engines that come with different ideas. I thought categorization was interesting and useful, as several times you're searching for a particular category and the results are from another. Say I'm looking for a movie named Macy, but there is a bigger and more popular shopping mall named Macy's, so the results are not always from the correct category and hence become irrelevant. There are obviously a lot of things search engines can do to be better.

  5. The content is very informative.
    I could understand the limitations, but is there a way to avoid them?
