FIRST, Full Information Retrieval System Thesaurus Methodology

Juan Chamero, from Intelligent Agents Internet Corp, Miami USA, August 2001

 

 

Abstract

 

FIRST, Full Information Retrieval System Thesaurus, is a methodology for creating evolutionary HKM’s, Maps of the Human Knowledge hosted on the Web. FIRST points towards an acceptable “kernel” of the HK, estimated at nearly 500,000 basic documents selected from an exponentially growing universe that doubles in size yearly and currently holds nearly 1,400 million sites. Many laudable and enormous scientific efforts have been made to build an accurate taxonomy of the Web and to define that kernel precisely. At the moment the only tools we as users have to locate knowledge on the Web are the search engines and directories, which deliver answer lists ranging from hundreds to millions of documents, with the supposed “authorities” hidden in a rather chaotic distribution within those lists. That means exhausting search processes with thousands of “clicks” in order to locate something valuable, let’s say an authority.

 

FIRST creates evolutionary search engines that deliver reasonably good answers with only one click from the beginning. We use reasonable as a synonym of mediocre because the first kernel is only a mediocre solution, to be optimized from then on via its interactions with users. FIRST could also be considered an Expert System able to learn, mainly from the mismatches in those interactions. So the kernels FIRST initially generates could be considered mediocre one-click solutions, for a given culture and for a given language, but able to learn and to converge to a consensual kernel. To accomplish that, the only thing FIRST kernels need is interaction with users. The better the users represent the whole, the more the kernel will tend to represent the knowledge of that whole. For that reason, we imagine a network of HKM’s implemented via FIRST or some other equivalent evolutionary tools. As each node of this semantic network will serve a given population (or market), we could easily implement something like a DIAN, Distributed Intelligent Agents Network, to coordinate the efforts made by each local staff of Intelligent Agents (coopbots). Each node will have a kernel in a different stage of evolution, depending on its age, measured in interactivity, and on its population profile.

 

What mainly differentiates FIRST from most present knowledge classification and representation projects is the hybrid procedure for building the mediocre starting solution: a staff of human experts aided by IA’s and IR algorithms. The reason for this approach is the current “state of the art” of Artificial Intelligence, AI. The best current robots are unable to detect general authorities accurately and are easily deceived, unfortunately, by millions of document owners who, either unethically or out of ignorance, try to present their sites as authorities. Another flaw is the primitiveness of even the most advanced robots, which are unable to write comprehensible syntheses of sites. Human beings, on the other hand, are extremely good at those tasks, by far more accurate and more efficient.

 

The map itself consists of I-URL’s, Intelligent URL’s: brief documents, from half a page to two pages long, that describe the referenced sites like pieces of tutorials, are classified along a set of taxonomy variables, and are tagged with a set of Intelligent Tags, some of them used to manage and track their evolutionary process. For each Major Subject of the HK, a Tutorial, a Thesaurus, a Semantic Network and a Logical Tree are provided and bound to the virtual evolutionary process of the users playing a sort of “knowledge game” versus the kernel.

 

FIRST is presented here within the context of the IR-AI “state of the art”. The methodology has been tested by building an HKM in 120 days. Time is a very important engineering factor because of the explosive expansion of the Web and its inherent high volatility. The task performed by the staff of human experts is similar to that of providing a Knowledge Expert System with the basic knowledge to “play” a Game of Knowledge reasonably well versus average Web users. It resembles the beginnings of Deep Blue, the IBM machine that beat Kasparov: initially it should have been able to beat not a master but at least a second category chess player (with a reasonably good Elo rating), and from there follow the evolutionary path through the higher levels: first category, master, international master, grandmaster, champion.

 


Content Index

 

 

1- The Future of Cyberspace – The Noosphere

Introduction

The Web space Regions

Region Volumes Estimations

The Web space looks like the Sky at night

How the Search Engines illuminate the Resources

The Cyberspace as a Global Market

Websites are built to match users

Mismatch reasons

The solution

What does Intelligent mean

Some examples about actual general search inefficiency

Human Knowledge Shells

 

2- About a New Approach to Internet Communications

Internet is a very particular net

Information Offer versus Information Demand

What people need

Jargons Evolution

 

3- FIRST, Full Information Retrieval System Thesaurus

 

The actual Information Retrieval process in the Cyberspace

The main reasons of that uselessness

Uselessness Measure

Using Search Engines

Searching in Databases

Our Approach to this problem: FIRST

Internet Drawbacks: Internet the realm of mismatch

The solution in Theory

Are the Search Engines really useless?

WOO_1: Architecture and meaning of the first Virtual Library

Virtual Library

Volume of “sufficient” Virtual Libraries

The Two Key Components for Retrieval

The Thesaurus

The power of the right statistics

 

 

4- i-URL’s and Intelligent Databases

 

i-URL’s Databases

The inefficiency of actual Search Engines and Directories

First Step: Valuable Comments – Virtual Libraries

Second Step: How to build Virtual Libraries

The Thesaurus concept

How we combine the virtues of Thesaurus and Indexes

How an I-URL looks like

The egg-chicken problem

Advantages of our i-Virtual Libraries

Notes


5- Evolutionary Process - Some Program Analysis Considerations

 

User Track Mechanics

Tracking “zoom”

Thesaurus Evolution Mechanics

Analysis of some other types of user interactions

Another crucial events: users’ feedback

Path keyword <=> string correspondences

 

6- Noosphere Mechanics – Evolutionary Sequence

 

7- An Approach to Website Taxonomy

 

How to browse a site to measure its structure

A first raw approach

 

8- FIRST within the vast world of AI – IR

 

FIRST niche

Navigation by AI-IR Authorities

KR

DAI and KK

Key behavior of some IA’s

DAI, network

WEB scenario

HKM complementary Information

Operational Hints

Some Subtle Hints

HITS

OGS

TAPER

Web Sizing

KR - One outstanding IR “authority”

 

New and Old ideas in action now

Clustering: Vivisimo and Teoma

More about KR

 

 


1- The Future of Cyberspace

The Web space and the Noosphere[1]

Introduction

You may find 30,136 pages dealing with “noosphere” in Altavista at 2:22 PM Eastern Time (USA and Canada) on Thursday, 12 April 2001. This is a rather strange word for many people, and it has not yet earned an entry in the Merriam-Webster online dictionary. However, we all know, use and enjoy Cyberspace, a concept that at nearly the same time has as many as 777,290 entries in the same Altavista and, unlike noosphere, has had an entry in Merriam-Webster since 1986, with the following meaning: the on-line world of computer networks. Web space is another neologism not yet included in that dictionary, but it has 485,805 entries in Altavista.

 

The Web space grows at a fantastic pace and today holds nearly one and a half billion documents, ranging from Virtual Libraries and virtual reference e-books dealing with the Major Subjects of human knowledge to ephemeral news and trivial virtual flyers generated “on the fly” continuously. On the Web we may find documents belonging to any of the three major Internet resources or categories: Information, Knowledge and Entertainment.

 

 

The Web space Regions

 

 

In the above figure the black crown represents the Web space and the green circle the users. The gray crown represents an intermediate net, to be built in the near future, of intelligent summaries of the Human Knowledge pointing to the basic documents and e-books of the Web. One user is shown extracting a “cone” of what he/she needs in terms of information and knowledge. The intelligent summaries must be engineered to be good enough as introductory guides/tutorials, with a set of essential hyperlinks inside. A user who wants more detail then goes directly to the right sources within the black region. Depending on the Major Subject dealt with, the user may go from summary to summary, or jump to higher level guides inside the gray region, going to the black region only to look for specific themes. Moreover, many users will be satisfied browsing within the gray region without even venturing into the black region.

 

 

 

Another user goes directly to the black region with the aid of classical search engines, as happens now. The black region will always be necessary and its size will grow fast as time passes. The gray region, on the contrary, will fluctuate around a medium volume, growing at a very low rate. In effect, the Human Knowledge “kernel” of basic documents is almost bounded: its content changes, but always around the same set of Major Subjects. The growth of the gray region is extremely low in comparison to that of the black region. Some Major Subjects die and others are born over time, but slowly.

 

 

Region Volumes Estimations

For more Web sizing information see our Chapter 8 about The Vast World of AI-IR

 

As a science fiction exercise we invite you to make some calculations resembling some of Isaac Asimov’s stories and Carl Sagan’s speculations. Let the current Human Knowledge be bounded to, let’s say, 250 Major Subjects or Disciplines. If for each of them we define a Virtual Library with 2,000 non-redundant e-books on average, we will have a volume of 500,000 e-books. Now we could design a methodology to synthesize an intelligent text summary of each e-book in no more than 2,000 characters on average, totaling 1,000 MB or 1 GB when one character is stored in a single byte. That would be the volume of the gray region: not too much really!

 

Let’s then compare this volume to the volume of the black region and to the volume of the resources of the Human Knowledge. Once upon a time, there was a Web space with one and a half billion documents with an average volume estimated at 2.5 MB. Documents range from 10 KB and less to 100 MB and more; to get that figure we supposed the arbitrary size series 1, 10, 100, 1,000, 10,000, 100,000 KB and assigned to its terms the arbitrary weights .64, .32, .16, .08, .04, .02 respectively, whose weighted sum is about 2,500 KB. Then we have a volume of nearly 3,750,000,000 MB! Within that giant space float, dispersed, the basic e-books, the resources of the Human Knowledge, with an estimated volume of nearly 500,000 MB, assigning 1 MB to each one: half a million characters of text and 100 images of 5 KB on average.

 

Black Region: ~3,750,000 GB => HK ~ 500 GB => Grey Region ~ 1 GB
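The estimates above can be checked with a few lines of arithmetic. The sketch below is only an illustration: it uses decimal units (1 GB = 1,000 MB), takes the figures stated in the text, and assumes that the last two weights of the size series are .04 and .02, continuing the halving pattern, since those values reproduce the 2.5 MB average as an unnormalized weighted sum. All variable names are ours.

```python
# Back-of-the-envelope check of the region volumes (decimal units).

# Gray region: 250 Major Subjects x 2,000 e-books x 2,000-character summaries.
major_subjects = 250
ebooks_per_subject = 2_000
summary_bytes = 2_000  # one character stored in one byte

ebooks = major_subjects * ebooks_per_subject
gray_gb = ebooks * summary_bytes / 1e9

# Average document size: unnormalized weighted sum of the size series.
# ASSUMPTION: the last two weights are .04 and .02 (halving pattern).
sizes_kb = [1, 10, 100, 1_000, 10_000, 100_000]
weights = [0.64, 0.32, 0.16, 0.08, 0.04, 0.02]
avg_doc_mb = sum(s * w for s, w in zip(sizes_kb, weights)) / 1_000

# Black region: one and a half billion documents of about 2.5 MB each.
documents = 1_500_000_000
black_gb = documents * 2.5 / 1_000

# Human Knowledge resources: 1 MB per basic e-book.
hk_gb = ebooks * 1 / 1_000

print(f"e-books: {ebooks:,}")
print(f"average document: {avg_doc_mb:.2f} MB")
print(f"black: {black_gb:,.0f} GB, HK: {hk_gb:,.0f} GB, gray: {gray_gb:,.0f} GB")
```

Under these assumptions the three volumes come out in the same proportions as the summary line: the black region is thousands of times larger than the HK resources, which are in turn hundreds of times larger than the gray region.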

 

An incredible result, which demonstrates how easy it will be to compile a rather stable HKIS, Human Knowledge Intelligent Summary, in relation to the unstable, noisy, bubbling, fizzy and always growing black region. Once the effort is done, upgrading will be facilitated via Expert Systems and a set of specialized Intelligent Agents that will detect and extract from the black region only the “necessary” changes.

 

The Web space looks like the Sky at night

 

In the figure above we depict the current Web space in black, resembling the physical space of the Universe. No doubt the information we need as users is up there, but where? That virtual space is really almost black for us. Some members of Cyberspace that provide searching services, known as Search Engines and/or World Wide Web Directories, are like stars that radiate light all over the space to make sites indirectly visible. We may find a few sites with their own light, like stars, activated by publicity in conventional media, but the rest are only illuminated by those services at users’ request. Let’s look a little deeper into the nature of this singular searching process.

 

For each resource (body) located in the Web space at a URL, which stands for Uniform Resource Locator, the robots of those lighting services prepare a brief summary, no more than a paragraph, with some information extracted from it, and all the information collected then goes to their databases. The summaries have attached to them some keywords extracted from the resources visited and are consequently indexed under as many keywords as they have attached.
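The indexing step just described amounts to what is usually called an inverted index: each keyword points to the summaries that carry it. A minimal sketch follows, with made-up URLs, summaries and keywords; none of it is taken from a real search engine.

```python
from collections import defaultdict

# Hypothetical summaries prepared by a robot: URL -> (summary, keywords).
summaries = {
    "http://example.com/tires": ("Tire store catalog.", ["tires", "wheels"]),
    "http://example.org/rims": ("Sport car rims.", ["wheels", "sport cars"]),
}

def build_index(summaries):
    """Index every summary under each keyword attached to it."""
    index = defaultdict(list)
    for url, (_summary, keywords) in summaries.items():
        for keyword in keywords:
            index[keyword].append(url)
    return index

index = build_index(summaries)
print(index["wheels"])
```

A summary with three keywords therefore appears three times in the index, once under each keyword.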

How the Search Engines illuminate the Resources

 

The current robots are very “clever” but extremely primitive compared to human beings. They do their best, but they also have to work fast, in fractions of a millisecond per resource, so it would be impractical to make them more sophisticated, because the time of “evaluation” grows exponentially with the level of cleverness. To facilitate the robots’ work, Website programmers and developers have wise tools at hand, with which they can communicate to the robots the essential information the site owners wish to be known about the site. Many of them, however, overuse those facilities so badly that the tools become unwise.

 

Those wise gateways are now noisy because most people try to deceive the robots by overselling what should be the essential information. Why do they do that? Because the Search Engines must present the listed sites hierarchically: the first is the best! Something similar occurs in the Classified Section of newspapers: people wishing to be listed first unethically make nonsense use of the first letter of the alphabet, so that AAAAAAA Home Services goes before, for instance, AA Home Services. The Search Engines do not have much room to design a “fair” methodology to rank the sites with equity and, besides, the Internet is an unpoliced realm.

 

One trivial criterion would be to count how many times a keyword is cited within the resource, but that proved to be misleading: the robots browse the resource only partially, so it is practically impossible for them to differentiate a sound academic treatise from a student’s homework on the same subject. To make things worse, programmers, developers, and content experts know all those tricks and consequently overuse the keywords they believe are significant.

 

The Search Engines have improved a great deal over the last two years, but the searching process continues to be highly inefficient and tends to collapse. To help site owners gain positions within the lists (in fact, to get more light), ethical and unethical techniques and programs proliferate, most of them apt to deceive the “enemy”, namely the Search Engines. Even in a “Bona Fide” utopia it is impossible for a robot to differentiate between a complex site and a humble site dealing with the same subject. Complex site architectures could even make the sites invisible to robots, which are only well suited to evaluate flat and simple sites. Moreover, search engines like Google also need to break even commercially, and start selling disguised ways of improving their score to desperate site owners who need traffic to subsist.

 

We emphasize again that the “light” a Search Engine provides to each URL is indirect, as the Moon reflects the Sun’s light. Our conclusion, then, is that most of the information and knowledge is hidden in the darkness of Cyberspace.

 

The Cyberspace as a Global Market

The Matchmaking Realm

Now that we know the meaning of the HK, Human Knowledge, we may define HKIS, the Human Knowledge Intelligent Summaries, a set of summaries (we will soon explain why we call them intelligent), and NHKIS, a Network of Human Knowledge Intelligent Summaries, which corresponds to the gray crown of the above figures. We are now going to enter into the problem of the languages and jargons spoken in the Black Region, in the Gray Region and mainly in the Green Region.

 

 

Websites are built to match users

Internet the Realm of Mismatch

 

Websites are built to match users; they are like lighthouses in the darkness, broadcasting information, knowledge and, in the case of e-Commerce, some kind of attractive information presented as “opportunities”. What really happens is that at present the Internet is more the Realm of Mismatch than of Matching. The lighthouse owners cannot find the users, and the users can neither find the alleged opportunities nor understand the broadcast messages. This mismatching scenario is dramatic in the case of Portals, huge lighthouses created to attract as many people as possible via general interest “attractions”.

 

Something similar occurs with the databases that store millions of units of supposedly useful information, such as catalogs, services, manufacturers, professionals, job opportunities, commercial firms, etc.: users cannot find what they need. When we talk of mismatch we mean figures well over 95%, and in some databases matching efficiencies lower than 0.1%.

 

In the figure above we depict this dramatic mismatch. The yellow point is a Website, with its offer represented by the cone emerging from it, let’s say the Offer expressed in its language and in its particular jargon. A black point within the green circle represents a user, and the cone emerging from it his/her Demand, also expressed in his/her language and particular jargon.

 

 

Mismatch reasons

Websites and users speak and think differently

 

What we discovered is that both sides speak approximately the same language but certainly different jargons and, more than that, they think differently! We have depicted the gray crown because the portion corresponding to the site’s Major Subject virtually exists: that is the portion in dark gray within its cone. The sites have the “truth” expressed in their particular jargon, and sometimes in the “official” and standard jargon. If the Website were, for instance, a “Vertical” of the Chemical Industry, its jargon would of course be within the Chemical Industry Standards and its menu should be technically correct, resembling the Index of a Manual for that particular Major Subject: Chemical Industry.

 

So our conclusion, after two years of research into the causes of mismatch, was that the lighthouses speak, or intend to speak, official jargons, certified by the establishment of their particular Major Subjects. They are supposed to have the truth and they think as “teachers”, expressing their truth in their menus, which are in fact “logical trees”. They may claim to be e-books, and they behave, think, and look pretty much the same as physical books.

 

Now let’s analyze how users act, express themselves and behave. If a user comes to the site to learn, the cones are obliged to converge: the user is forced to think in terms of the concepts of the menu, which for him/her resembles a program of study, and we have a match scenario. If the user comes to the site to search for something, that is different. When we search for something we tend to think in terms of keywords instead, keywords that belong to our own jargon and, at large, to our own Thesaurus. So, whether out of ignorance or, on the contrary, out of expertise, the users’ cones diverge substantially from the site’s cone. One of the main reasons for this divergence is that site owners do not know what their target markets need. Many of them are migrating from conventional businesses to e-Commerce approaches and extrapolate their market know-how as it is. They worked hard for decades to match their markets and to establish agreed jargons, and now they have to face unknown users coming from virtually all over the world.

 

The solution

Evidently the solution will be the evolution from mismatch to match in the most efficient way. To accomplish that, the Offer and the Demand have to approach each other until both share a win-win scenario and a common jargon.

In the figure above we depict a mismatch condition in which we may distinguish three zones: the red zone represents the idle and/or useless Knowledge; the gray zone corresponds to the common section, with an agreed Thesaurus concordance; and the blue zone corresponds to what the users need and want and apparently does not exist within the site. So the site owners and administrators have three lines of action: a) reduce the red zones to zero, for instance by adapting and/or eliminating supposed “attractions”; b) learn as much as possible about the blue zone; and c) combine both strategies.
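If the Offer and the Demand are both reduced to sets of keywords, the three zones correspond to plain set operations. The sketch below assumes exactly that simplification; the keyword sets are invented for illustration.

```python
# Hypothetical keyword sets for one site and its users.
offer = {"tires", "rims", "ergaston", "alignment"}
demand = {"tires", "wheels", "rims", "brakes"}

red_zone = offer - demand    # offered but never asked for (idle knowledge)
gray_zone = offer & demand   # common section, agreed Thesaurus
blue_zone = demand - offer   # wanted by users but missing from the site

print(sorted(red_zone), sorted(gray_zone), sorted(blue_zone))
```

Line of action a) shrinks the first set, line b) studies the third, and the match grows as the second set takes over both.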

 

At this moment the dark green zones are extremely tiny, less than 5%, the Internet being the Realm of Mismatch between Users’ Demand and Sites’ Offer. The big effort to be made consists of minimizing costs by eliminating useless attractions and learning from users’ unsatisfied needs. To accomplish both purposes the site owners need intelligent tools, agents that warn them about red and blue events.

 

 

What does Intelligent mean

 

Let’s analyze the basic process of user-Internet interactions. A user comes to a site to interact in one of three forms, sometimes concurrently: investing time, clicking on a link, or filling a form or a box with some text, for instance to query a database. Site statistics are well prepared to account for clicks, telling what “paths” were browsed by each user, but they are not well suited to account for textual interactions. Of course, you may record the queries and even the answers, but that is not enough to learn from mismatching. To accomplish that we may create programs and/or intelligent agents that account for the different uses made of the components of each answer, but they then have to do rather heavy accounting.

 

If we query a commercial database for tires, the answer would be a list of tire stores. To have statistics about how frequently users ask for this specific keyword we need to account for it; to know about the “presence” of each store as a potential seller we need to account for it; and if we want to know about the popularity of each store we need to go further, accounting for it as well, and so forth. That accounting process involves a tremendous burden even when done on the site’s server side.

 

An intelligent approach would be to have all the counters needed to detect document popularity and user behavior built into the data to be queried. That is the beginning of the idea: to provide, within the data to be queried by users, a set of counters for each type of statistic. So when a piece of data is requested a counter is activated, accounting for its presence; when it is selected by a click another counter is activated; and when the user, after reading the “intelligent summary” received, decides to click through to the original site or to one of its inner hyperlinks, another counter is activated.

 

 

 

Here a typical track of user-site interaction is represented. The user makes a query for “tires”. The i-Intelligent Database answers by sending all the data it has indexed under tire, adding a list of the synonyms and related keywords it has for tire. Each activated i-URL accounts for its presence in that answer by adding one to the corresponding counter in the i-Tags zone. If the user clicks on a specific i-URL, the system presents it to the user and accounts for this preference in another counter of the i-Tags zone.

 

Finally, if the user decides to access the commented site located in the black crown, he/she clicks and another counter is activated within the i-Tags zone. At the same time the counter corresponding to the keyword tire is incremented by one, and the same happens if the user activates a synonym or a related keyword. If the answer contains no data, it means a mismatch, due either to an error or to a resource that does not exist within the database. In both cases the system has to activate different counters for the wrong or non-existent keyword in order to account for the popularity of this specific mismatch. If the popularity is high, it is a warning signal to the site’s Chief Editor (either human or virtual) about the potential acceptance of the keyword, either as a synonym or as a related keyword. At the same time, the system may urge a search for additional data within the black region. From time to time the system could also suggest a review of the i-URL summaries database in order to assign data to the new keywords. We will see how to work with a network of these Expert Systems at different stages of evolution.
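The counting scheme described above can be sketched as a small data structure. This is only an illustration under our own assumptions: the class, method and field names (IUrl, IntelligentDatabase, presence, selected, visited and so on) are hypothetical and not part of any published FIRST implementation, and synonyms and related keywords are left out for brevity.

```python
from collections import Counter

class IUrl:
    """An intelligent summary with its i-Tags counters (names hypothetical)."""
    def __init__(self, url, summary, keywords):
        self.url = url
        self.summary = summary
        self.keywords = set(keywords)
        self.presence = 0  # times listed in an answer
        self.selected = 0  # times the summary was opened by a click
        self.visited = 0   # times the user went on to the original site

class IntelligentDatabase:
    def __init__(self, iurls):
        self.iurls = list(iurls)
        self.keyword_hits = Counter()  # popularity of known keywords
        self.mismatches = Counter()    # popularity of keywords with no data

    def query(self, keyword):
        """Return the i-URLs indexed by keyword, counting each presence."""
        answer = [u for u in self.iurls if keyword in u.keywords]
        if not answer:
            self.mismatches[keyword] += 1  # zero data: record the mismatch
        for iurl in answer:
            iurl.presence += 1
        return answer

    def select(self, iurl):
        """The user clicks on one summary in the answer."""
        iurl.selected += 1

    def visit(self, iurl, keyword):
        """The user goes on to the commented site in the black region."""
        iurl.visited += 1
        self.keyword_hits[keyword] += 1

db = IntelligentDatabase([IUrl("http://example.com", "Tire store.", ["tire"])])
hits = db.query("tire")
db.select(hits[0])
db.visit(hits[0], "tire")
db.query("tyre")  # no data: recorded as a mismatch
print(hits[0].presence, hits[0].selected, hits[0].visited, db.mismatches["tyre"])
```

Because the counters live with the data, a Chief Editor could read the most frequent entries of `mismatches` directly to find candidate synonyms, without a separate accounting pass over query logs.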

 

 

As part of the intelligent feature we consider registering the IP address of each user’s interactions and the sequence of queries, which is normally related to something not found. The users’ keyword strings are in turn related to specific subjects within the Major Subject of the site. So, statistically, the analysis of keyword strings tells us about the popularity of the current menu items and suggests new items to be considered.

 

 

 

 

Some examples about actual general search inefficiency

 

Let’s try to search for something apparently trivial like “Internet statistics”, for instance using one of the best search engines, Google: more than 1,500,000 sites! Do not dig too far down that list; only check what the first 20 or 30 sites offer. Most of the content shown by the sites of that sample is obsolete, and where it is up to date you are harassed by a myriad of sales offers for particular statistics, market research studies and the like, priced in the thousands and up. And if this scenario occurs with supposed authorities (Library of Congress, Cyberatlas, About.com statistics sites, Internet Index, Data Quest, InternetStats), what then of the remaining 1,500,000?

 

What if that noisy cluster were replaced by a brief comment made by a statistician, describing the state of the art of Internet Statistics and suggesting alternative ways to compile statistics from free, updated authorities that surely exist on the Web? That is very easy to do, and economical too: it should take no more than one hour of that specialist’s time. Of course, that would be feasible as a permanent solution only if the cost of updating that kind of report were relatively insignificant. Concerning this problem, we estimated that the global cost of updating a given HKM is of the order of 3% to 5% per annum of the cost of its creation. So the HKM’s will be updated in two ways: evolutionarily, through their interaction with users, and authoritatively, through updates by human experts.

 

Let’s see other examples with “sex” and “games”. Sex has more than 48,000,000 sites, and it is well known that the sources of sexual and pornographic content number fewer than 100. The rest are speculators, repetitions, transfers, and commuting sites of only one click per user, playing the ingenuous role of useful idiots. Something similar occurs with games, with more than 35,000,000 sites, where again the world providers of game machines, solutions, and software number no more than 100!

 

 

Human Knowledge Shells

 

For a given culture and for a given moment we have the following regions in the Web space:

 

 

Red: a given HKM

Black Blue: HK Virtual Library

Regular Navy Blue: Ideal HK

Blue: Ideal HK plus New Research

Light Blue: Ideal HK plus NR plus Knowledge Movements

Deep Light Blue: Ideal HK plus NR plus KM plus Information

 

Everything works within an expanding universe of Human Intellectual Activity. It takes a great deal of time and effort for new ideas and concepts to become part of the Ideal HK. We as humans have two kinds of memory, semantic and episodic, and any culture at a given moment has its semantic memory, conscious and unconscious, intuitive and rational, as well as its episodic memory.

 

Throughout human history the dominant cultures have controlled the inflow of the Human Intellectual Activity in explicit and implicit ways, for instance by discouraging dissent. The Internet allows us as users to dissent from any form of “established” HK and to influence, on an equal basis, the allegedly ideal HK. This feature will accelerate the enrichment of the ideal HK in an unprecedented way. For that reason FIRST emphasizes the mismatch between the HKM and users’ thoughts, questions and expectations, and is oriented to satisfying users, that is, the human being as a whole and as a unit.

 


2- About a New Approach to Internet Communications

Linguistic Approach

 

 

Internet is a very particular net

 

We make specific reference to Internet Data Management because the “Big Net” differs substantially from most nets. Internet deals with all possible groups of people and all possible groups of interest. Internet users belong to all possible markets from kids to old people in all possible economic, social and political levels and cultures. This Universality makes the Internet man-machine interactions extremely varied.

 

In any other network, on the contrary, we may define a “jargon”, an ethic and rules. When we build a new Internet Website we really do not know who our potential users will be, and consequently what they want or what they need; we do not even know their jargons. We imagine a target market, and for that specific market we design the site content, in fact the “Information Offer” to that market.

 

 

 

 

 

The figure above depicts the matchmaking process within the Internet “noosphere”. The users, in green, express what they want and even think in terms of “keywords” expressed in their own jargon, and are open and flexible. The Website owners, on the contrary, believe through their sites that they have the truth, the whole truth and nothing but the truth. In that sense, whether or not they are an authority, they resemble “The Law” of the establishment of the Human Knowledge. The law for each Major Subject is expressed in Indexes of the main branches of that Major Subject, resembling a “Logical Tree”, depicted in gray over the yellow truth. They imagine their sites as universal facilitators, but always following the pattern of the logical tree and expressed in their jargons.

 

The Websites have their own Thesaurus, a set of “official” keywords, depicted in white over a black background, within the darkness of the Web space. A correspondence exists between the logical tree and the Thesaurus. The Website owners are shown with the Truth Staff in yellow. The user-Internet interactions are depicted as a progressive matchmaking process, going from green to black and vice versa, each side learning from the other through match and mismatch. Both sides strive to know by interchanging knowledge.

 

 

Information Offer versus Information Demand

 

Paradoxically, even though the Web is so well suited to adding, generating and managing intelligence, most people ignore this fantastic possibility. If we define our Information Offer as WOO, which stands for What Owners Offer, and what the users want as WUW, which stands for What Users Want, the Web architecture permits the continuous matching of the two and, as a byproduct, the intelligence emerging from any mismatch.

 

That possibility means the following: WUW is what users want, expressed in their specific jargon/s, while WOO is the Website’s information offer, expressed in, let’s say, the “official/legal” jargon, the one we choose to communicate with our target market. The continuous mismatch between WOO and WUW would permit us to know the following five crucial things:

 

·         What the Market wants

·         The Market major characteristics

·         The Market homogeneity and/or its segmentation

·         The Market jargon/s

·         The Market needs.

 

Knowledge of the market jargon/s permits us to optimize our offer: for instance, a negative answer to a user query could mean either that we don’t have what he/she wants or that the name of what he/she is looking for in his/her jargon differs from the name in our jargon.
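The two readings of a negative answer can be told apart once a table relating the users’ jargon to the official jargon is available. The following is a minimal sketch, assuming such a table exists; the catalog, the `jargon_map` and the function name are hypothetical.

```python
# Hypothetical official catalog and a jargon-to-official-term map.
catalog = {"rim", "tire", "hubcap"}
jargon_map = {"wheel": "rim", "tyre": "tire"}

def answer(query):
    """Classify a query as found, found under another name, or missing."""
    if query in catalog:
        return ("found", query)
    official = jargon_map.get(query)
    if official in catalog:
        return ("jargon mismatch", official)
    return ("not offered", None)

for q in ("tire", "wheel", "brakes"):
    print(q, "->", answer(q))
```

Only the last case means that we truly do not have what the user wants; the middle case is a naming problem that the site can fix on its own side.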

 

 

What people need

 

What we learn directly from user queries is what users want, not what they need. The difference between WUW and WUN, What Users Need, is substantial. People generally know what they need but adjust what they ask for to the supposed or alleged capabilities of the Website. We learn what our users need as time passes, if we make use of the intelligence byproducts and/or of surveys.

 

 

How the Information Offer and the Queries are normally organized

 

The IO is normally presented as ordered sets in the form of Catalogs, Indexed Lists and Indexes, whereas the queries, in which the users express their particular needs (WUN), are expressed by keywords. The two communication systems are completely different, yet they can complement each other, and we could make them work together towards the ideal match between WUN and WOO.

 

As we will see soon, the users communicate with the different Websites via their subjective jargons, at least as many jargons as the MS, “major subjects”, they are interested in. For instance, if I am an entrepreneur who manufactures sport car wheels, I am going to query B2B sites for subjects related to sport car wheels, expressing myself in “my” jargon, which differs from the “official” jargons used in the B2B sites; of course, the query outcomes will depend strongly on those jargon differences.

 

 

Jargons Evolution

 

In the same way that official languages change from time to time, influenced at large by the pressure of the people's jargons, with both coexisting at any time, we may endow the Websites of the Cyberspace with an extremely efficient evolutionary feature via Expert Systems that learn from the man-Internet interactions. We dare to call this feature extremely efficient because in the Cyberspace every transaction can be easily and precisely accounted for. So each time a user uses a keyword belonging to his/her jargon, this event could and should be accounted for.

 

Let’s then imagine what kind of intelligent byproduct we could extract from this simple but astonishing feature. Within a homogeneous market the keywords tend to be the same among its members. So in our last example, if the majority of users make queries asking for wheels and the word-product wheel does not exist in our database, a trivial byproduct takes the form of the following suggestion: add wheels to the database as soon as possible. On the other hand, if the word-product “ergaston” has never been asked for over a considerable amount of time, another trivial message should be: take ergaston out of the database.
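
The keyword accounting described above can be sketched in a few lines. This is an illustration only, not the FIRST implementation: the function name `suggest_catalog_changes`, the threshold `min_demand` and the sample log are assumptions introduced here to show how “add” and “remove” suggestions could fall out of simple counting.

```python
from collections import Counter


def suggest_catalog_changes(query_log, catalog, min_demand=3):
    """Return (terms_to_add, terms_to_remove) from a log of user keywords.

    query_log  -- iterable of keywords typed by users over the observation period
    catalog    -- set of word-products currently in the database
    min_demand -- hypothetical threshold: how many requests justify adding a term
    """
    counts = Counter(word.strip().lower() for word in query_log)
    # Asked for often enough, but missing from the database.
    to_add = sorted(w for w, n in counts.items()
                    if n >= min_demand and w not in catalog)
    # Present in the database, but never asked for during the period.
    to_remove = sorted(w for w in catalog if counts[w] == 0)
    return to_add, to_remove


log = ["wheels", "Wheels", "tires", "wheels", "wheels", "tires"]
catalog = {"tires", "ergaston"}
to_add, to_remove = suggest_catalog_changes(log, catalog)
print("add:", to_add)
print("remove:", to_remove)
```

With this sample log the sketch suggests adding “wheels” and removing “ergaston”, mirroring the two trivial messages in the text. How long the observation period should be, and how high the threshold, are left open here.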

 

 

The figure above depicts the evolution of the matchmaking process. In the beginning, the Website owners had the oval green-gray target, where one user is shown as a black dot. But that user really belongs to a users' affinity market, depicted as a dark green oval, with a cone of Internet interest that differs too much from the ideal initial target. The Website owners need an intelligent process to shift towards the bigger potential market in dark green. The cone with the yellow border depicts the final “stable” matchmaking.


 

3- FIRST, Full Information Retrieval System Thesaurus

 

 

The current Information Retrieval process in the Cyberspace

 

The Cyberspace currently has about 1,500 million documents, ranging from reference to trivial, from true e-books dealing with the major subjects of human knowledge to daily news and even minute-to-minute information on human interactions, as in the pages generated “on the fly” by Newsgroups, Chats and Forums. This information mass grows continuously at an exponential rate, rather chaotically, as its production rate by far exceeds the human capacity for filtering, qualifying and classifying it.

 

To help retrieve information from the Cyberspace we make use of Search Engines and Directories, which are unable to attain WUN, What Users (We the Humans) Need. From all that information mass the search engines offer us “summaries” telling what kind of information we could get at each location of the Cyberspace (the URL, Uniform Resource Locator). So for each URL we as users obtain its summary. Those summaries are normally written by the Search Engines' robots, which try to do their best extracting pieces of “intelligence” from each Cyberspace location.

 

In the figure we depict some sites within the darkness of the Cyberspace. We may find anything from huge sites storing millions of documents in hundreds of sections to tiny sites with a flat design storing a few pages. One Search Engine, shown as a yellow crown, sends its robots to visit existing sites from time to time, making a brief “robotic” summary of each. As we will see soon, those brief reports are noisy and deceive the users (green circle). The Search Engine assigns priorities, which act in turn as a measure of the site magnitude (like the brightness of a star). As depicted, the priorities (the navy blue dots) have nothing to do with the real magnitude of the site (depicted as the diameter of the white circle). So the yellow crown is a severe distortion of the Web. These priorities, defined for the keyword set of a given site, resemble the “light” that illuminates it: a high priority means a powerful beam of light reflecting off the site and highlighting it to the users' sight.

 

The information currently provided by the search engines is as primitive as the map of the sky we had one thousand years ago. The robots only detect some of the keywords the site content has, equivalent to the chemical elements of the celestial bodies, but tell us nothing about its structure, type of body and magnitude. Today we may have for each celestial body the following data:

 

Among many others: diameter, density, the spectral distribution of its constituent elements, brightness, radiation and albedo. For each of these variables there is a site equivalent that must be known before we can say that we have a comprehensible map of the Cyberspace. For instance, we need to know something resembling magnitude, density, distribution of elements and brightness.

 

Since the bodies of this cultural and intellectual space (noosphere) are intellectual creatures, we need an intellectual summary of each, what is known as the abstract in essays and research papers. For instance, a site could be camouflaged to appear attractive by emphasizing the importance of a given element, let’s say “climate”, deceiving a robot into treating it as a specialized climate site when in reality it has no climate content. The same happens with information: Portal news, for instance, is presented as content, which is true only of a specific type of information resource known as “news”, with an extremely ephemeral life of hours. On the contrary, content on philosophy or mathematics is by far denser and heavier, with lives lasting centuries on average. So we could distinguish all kinds of bodies, from fizzy (news) to rocky (academic).

 

Another complementary source of information is the Databases hosted as collateral of the Websites, as huge stores of organized and structured data. The content and quality of these databases are normally a subjective “bona fide” declaration made by the Website owners. So far, for the users, the Cyberspace, particularly the Web Cyberspace, looks like a net of information resources with some “Indexes” to facilitate their retrieval task. Those robot-made indexes are so noisy as to be practically useless. Below we attach a well-known graphic sample of this uselessness.

 

 

The figure depicts the finding of useful information (black spots) while navigating with a search program.

 

 

The main reasons for that uselessness

 

The main reasons are, among many:

 

Increasing Website Complexity: Robots cannot cope with the increasing complexity of Websites. They are unable to properly evaluate sites like those of NASA, the World Trade Organization and the Library of Congress, to mention only some institutional ones, concerning Aerospace, Commerce and General Knowledge respectively, and cannot differentiate them from trivial sites dealing with similar subjects.

 

Inability to cope with Human Stratagems: Robots are unable to detect and block some subtle overselling stratagems used by the Website owners to position themselves high in the Search Engines' answers to user queries.

 

Linguistic Problems: Robots cannot cope with the increasing number and complexity of the languages and jargons used on the net. They do their work using rather naïve Thesauri, modified and enriched only via the Website owners’ declarations and not, as it should be, via the users' feedback. As a consequence of that bias, the Search Engines speak the owners' jargons instead of the users' jargons.

 

In brief, the shadows of content that search engines offer to the users have almost nothing to do with the real content of the Cyberspace and present a distorted vision of it. The problem is the contagious spread of this distortion, as long as the Website owners use that summary information as a “bona fide” vision of their world. As a corollary, the Internet today speaks the Website owners' jargons, pointing to a globally distorted vision of the real markets!

 

 

Uselessness Measure

Using Search Engines

 

Measuring the mismatch between WUN, What Users Need, and WSO, What Search-Engines Offer, should be one of the first priorities of scientific institutions interested in the health of the Internet. However, almost everybody is well acquainted with this abysmal mismatch, and you may check it yourself very easily by making random queries about any subject. We, as a private research group, made our own investigations into that global mismatch and found the following figure:

Mismatch of WSO versus WUN is within the order of 6,000 to 1

This means that we, as ordinary users searching the Cyberspace with the help of outstanding search engines, have on average to browse through 6,000 summaries to find 1 potentially matching our needs.
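
To put the 6,000 to 1 figure on the same percentage scale as other mismatch measures, it can be converted into a hit rate and a mismatch rate. The snippet below is plain arithmetic on the reported ratio; the function names are illustrative only.

```python
def hit_rate(summaries_per_match):
    """Fraction of browsed summaries that match the user's need."""
    return 1.0 / summaries_per_match


def mismatch_rate(summaries_per_match):
    """Fraction of browsed summaries that do not match the user's need."""
    return 1.0 - hit_rate(summaries_per_match)


ratio = 6000  # figure reported in the text: 6,000 summaries per useful one
print(f"hit rate: {hit_rate(ratio):.4%}")
print(f"mismatch rate: {mismatch_rate(ratio):.4%}")
```

A ratio of 6,000 to 1 thus corresponds to a hit rate of about 0.017% of the summaries browsed.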

 

Searching in Databases

 

Searching for information stored in Databases proved to be a tough task as well. Students of Systems Engineering in the final year of their degree at the Instituto Tecnológico de Monterrey, Mexico, were invited to freely query a tested (2) commercial Database; the mismatch was greater than 99.9%, that is, they needed on average more than 100 queries to match a product/service stored within the database. The main reason for the mismatch was not missing information in the database but linguistic problems. That was a warning sign, and we investigated some other commercial databases belonging to well-known B2B sites, with similar results.

 

Note 2: By “tested” we mean that the content was checked before the trial. The information existed, but the students were unable to find what they were searching for because of linguistic problems.

 

The abysmal and chaotic mismatch enables forms of e-Commerce delinquency: When you as a user