FIRST,
Full Information Retrieval System Thesaurus Methodology
Juan Chamero, from Intelligent Agents Internet Corp
Abstract
FIRST, Full Information Retrieval System Thesaurus, is a methodology to create evolutionary HKM’s, Maps of the Human Knowledge hosted in the Web. FIRST points towards an acceptable “kernel” of the HK, estimated at nearly 500,000 basic documents selected from an exponentially growing universe that doubles in size yearly and currently holds nearly 1,400 million sites. There are many laudable and enormous scientific efforts devoted to building an accurate taxonomy of the Web and to defining that kernel precisely. At the moment the only tools we as users have to locate knowledge in the Web are the search engines and directories, which deliver answer lists ranging from hundreds to millions of documents, with the supposed “authorities” hidden in a rather chaotic distribution within those lists. That means exhausting search processes with thousands of “clicks” in order to locate something valuable, let’s say an authority.
FIRST creates evolutionary search engines that deliver reasonably good answers with only one click from the beginning. We speak of reasonability as a synonym of mediocrity because the first kernel is only a mediocre solution, to be optimized from then on via its interactions with users. FIRST could also be considered an Expert System able to learn mainly from the mismatches in those interactions. So initially the kernels FIRST generates could be considered mediocre one-click solutions for a given culture and a given language, but able to learn and to converge to a consensual kernel. To accomplish that, the only thing FIRST kernels need is interaction with users. The better the users represent the whole, the more the kernel will tend to represent the knowledge of that whole. For that reason, we imagine a network of HKM’s implemented via our FIRST or some other equivalent evolutionary tools. As each node of this semantic network will serve a given population (or market), we could easily implement something like a DIAN, Distributed Intelligent Agents Network, to coordinate the efforts made by each local staff of Intelligent Agents (coopbots). Each node will have a kernel in a different stage of evolution depending on its age, measured in interactivity, and on its population profile.
The main difference between FIRST and most present knowledge classification and representation projects rests on the hybrid procedure for building the mediocre starting solution: a staff of human experts aided by IA’s and IR algorithms. The reason for this approach is the current “state of the art” of Artificial Intelligence, AI. The best current robots are unable to detect general authorities accurately and are easily deceived, unfortunately, by millions of document owners who either unethically or out of ignorance try to present their sites as authorities. Another flaw is the primitiveness of even the most advanced robots, which are unable to write comprehensible syntheses of sites. Human beings, on the other hand, are extremely good at those tasks, by far more accurate and more efficient.
The map itself consists of I-URL’s, Intelligent URL’s: brief documents of half a page to two pages that describe the referenced sites like pieces of tutorials, classified along a set of taxonomy variables and tagged with a set of Intelligent Tags, some of them used to manage and track their evolutionary process. For each Major Subject of the HK, a Tutorial, a Thesaurus, a Semantic Network and a Logical Tree are provided and bound to the virtual evolutionary process of the users playing a sort of “knowledge game” versus the kernel.
FIRST is presented here within the context of the IR-AI “state of the art”. The methodology has been tested by building a HKM in 120 days. Time is a very important engineering factor due to the explosive expansion of the Web and its inherent high volatility. The task performed by the staff of human experts is similar to the task of providing a Knowledge Expert System with the basic knowledge to “play” a Game of Knowledge reasonably well versus average Web users. It resembles the beginnings of Deep Blue, the machine that beat Kasparov: initially it had to be able to beat not a master but at least a second category chess player (with a reasonably good ELO rating), and from there follow the evolutionary path through the further levels: first category, master, international master, grand master, championship.
Content Index
2- About a New Approach to Internet Communications
3- FIRST, Full Information Retrieval System Thesaurus
The power of the right statistics
4- i-URL’s and Intelligent Databases
Advantages of our i-Virtual Libraries
5- Evolutionary Process - Some Program Analysis Considerations
6- Noosphere Mechanics – Evolutionary Sequence
7- An Approach to Website Taxonomy
8- FIRST within the vast world of AI – IR
New and Old ideas in action now
Clustering: Vivisimo and Teoma
The Web space and the Noosphere[1]
You may find 30,136 pages dealing
with “noosphere” in Altavista at 2.22 PM Eastern Time for
The Web space grows at a fantastic pace, holding today nearly one and a half billion documents, ranging from Virtual Libraries and virtual reference e-books dealing with the Major Subjects of the human knowledge to ephemeral news and trivial virtual flyers generated “on the fly” continuously. We may find in the Web documents belonging to any of the three major Internet resources or categories: Information, Knowledge and Entertainment.

In the above figure the black crown represents the Web space and the green circle the users. The gray crown represents an intermediate net to be built in the near future with intelligent summaries of the Human Knowledge, pointing to the basic documents and e-books of the Web. One user is shown extracting a “cone” of what he/she needs in terms of information and knowledge. The intelligent summaries must be engineered to be good enough as introductory guides/tutorials with a set of essential hyperlinks inside. A user who wants more detail then goes directly to the right sources within the black region. Depending on the Major Subject dealt with, the user may go from summary to summary, or jump to higher level guides inside the gray region, going to the black region only to look for specific themes. Moreover, many users will be satisfied browsing within the gray region without even venturing into the black region.
Another user goes directly to the black region guided by classical search engines, as now. The black region will always be necessary and its size will grow fast as time passes. On the contrary, the gray region will fluctuate around a medium volume, growing at a relatively very low rate. In effect, the Human Knowledge “kernel” of basic documents is almost bounded, changing its content but always around the same set of Major Subjects. The growth of the gray region is extremely low in comparison to that of the black region. Some Major Subjects die and some others are born over time, but slowly.
For more Web sizing information see our Chapter 8 about The
Vast World of AI-IR
As a science fiction exercise we invite you to make some calculations resembling some of Isaac Asimov’s stories and Carl Sagan’s speculations. Suppose the current Human Knowledge is bounded to, let’s say, 250 Major Subjects or Disciplines, and for each of them we define a Virtual Library with 2,000 non-redundant e-books on average; we will then have a volume of 500,000 e-books. Now we could design a methodology to synthesize an intelligent text summary for each e-book in no more than 2,000 characters on average, totaling 1,000 MB or 1 GB when storing one character in a single byte. That would be the volume of the gray region: not too much, really!
Let’s then compare this volume to the volume of the black region and to the volume of the resources of the Human Knowledge. Once upon a time, there was a Web space with one and a half billion documents with an average volume estimated at 2.5 MB (we have documents ranging from 10 KB and less to 100 MB and more; to get that figure we supposed the arbitrary size series 1, 10, 100, 1,000, 10,000, 100,000 in KB and assigned to the terms the arbitrary weights .64, .32, .16, .08, .04, .02 respectively, whose weighted sum is about 2,500 KB). Then we have a volume of nearly 3,750,000,000 MB! Within that giant space float, dispersed, the basic e-books, the resources of the Human Knowledge, with an estimated volume of nearly 500,000 MB, assigning 1 MB to each one: half a million characters of text and 100 images of 5 KB on average.
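The sizing exercise can be checked with a few lines of arithmetic. The sketch below only reproduces the estimates given in the text; it assumes that the last two weights continue the halving series (.04 and .02), which is what yields the 2.5 MB figure, and it uses the raw weighted sum because the weights do not add up to 1. All variable names are ours.

```python
# Back-of-the-envelope sizing of the gray region, the black region and the
# Human Knowledge resources, using only the estimates quoted in the text.
# Assumption: the weights follow the halving series .64 ... .02.

# Gray region: 250 Major Subjects x 2,000 e-books x 2,000 one-byte characters
major_subjects = 250
ebooks_per_subject = 2_000
summary_bytes = 2_000
ebooks = major_subjects * ebooks_per_subject
gray_mb = ebooks * summary_bytes / 1_000_000

# Black region: raw weighted sum of the size series (the weights sum to 1.26)
sizes_kb = [1, 10, 100, 1_000, 10_000, 100_000]
weights = [0.64, 0.32, 0.16, 0.08, 0.04, 0.02]
avg_doc_kb = sum(size * weight for size, weight in zip(sizes_kb, weights))
documents = 1_500_000_000
black_mb = documents * 2.5  # the text rounds the average document to 2.5 MB

# Human Knowledge resources: 1 MB per basic e-book
hk_mb = ebooks * 1

print(f"e-books in the kernel: {ebooks:,}")
print(f"gray region:           {gray_mb:,.0f} MB")
print(f"average document:      {avg_doc_kb:,.2f} KB")
print(f"black region:          {black_mb:,.0f} MB")
print(f"HK resources:          {hk_mb:,} MB")
print(f"black / gray ratio:    {black_mb / gray_mb:,.0f}")
```

Under these assumptions the black region is several million times larger than the gray region, which is the point of the comparison.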
This is an incredible result: it demonstrates how easy it will be to compile a rather stable HKIS, Human Knowledge Intelligent Summary, in comparison with the unstable, noisy, bubbling, fizzy and always growing black region. Once the effort is done, the upgrade will be facilitated via Expert Systems and a set of specialized Intelligent Agents that will detect and extract from the black region only the “necessary” changes.
In the figure above we depict the current Web space in black, resembling the physical space of the Universe. No doubt the information we need as users is out there, but where? That virtual space is really almost black for us. Some members of the Cyberspace that provide searching services, known as Search Engines and/or World Wide Web Directories, are like stars that radiate light all over the space to make sites indirectly visible. Sometimes we may find quite a few sites with their own light, like stars, activated by publicity in conventional media, but the rest are only illuminated by those services at the users’ request. Let’s go a little deeper into the nature of this singular searching process.
For each resource (body) located in the Web space at a URL, which stands for Uniform Resource Locator, the robots of those lighting services prepare a brief summary with some information extracted from it, no more than a paragraph, and then all the information collected goes to their databases. The summaries have attached to them some keywords extracted from the resources visited and consequently are indexed under as many keywords as they have attached.
The current robots are very “clever” but extremely primitive compared to human beings. They do their best, but they also have to perform their work fast, in fractions of a millisecond per resource, so it would be impractical for them to be more sophisticated because the time of “evaluation” grows exponentially with the level of cleverness. To facilitate the robots’ work, Website programmers and developers have wise tools at hand, but many of them overuse those facilities so badly as to make them unwise. In fact, with those tools the programmers can communicate to the robots some essential information that the site owners wish to be known about the site.
Those wise gateways are now noisy because most people try to deceive the robots by overselling what should be the essential information. Why do they do that? Because the Search Engines must present the sites listed hierarchically: the first is the best! Something similar occurs in the Classified Section of the newspapers: the people wishing to be listed first unethically make nonsensical use of the first letter of the alphabet, so that AAAAAAA Home Services goes before, for instance, AA Home Services. The Search Engines do not have much room to design a “fair” methodology to rank the sites with equity and, besides, the Internet is an unpoliced realm.
One trivial criterion would be to count how many times a keyword is cited within the resource, but that proved to be misleading because the robots only browse the resource partially, making it practically impossible to differentiate a sound academic treatise from a student’s homework on the same subject. To make things worse, programmers, developers, and content experts know all those tricks and consequently overuse the keywords they believe are significant.
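A toy illustration of why plain keyword counting misleads: the two page texts below are invented for the example, and the ranking function is a deliberately naive sketch, not the method of any actual search engine.

```python
import re


def keyword_count(text: str, keyword: str) -> int:
    """Count whole-word, case-insensitive occurrences of keyword in text."""
    pattern = rf"\b{re.escape(keyword)}\b"
    return len(re.findall(pattern, text, flags=re.IGNORECASE))


# Hypothetical pages: a sober treatise and a keyword-stuffed sales page.
pages = {
    "treatise": "A study of tire wear under load, with measurements and references.",
    "stuffed": "Tire tire tire! Best tire, cheap tire, tire deals, tire tire.",
}

# Rank pages by raw keyword frequency, highest first.
ranking = sorted(pages, key=lambda name: keyword_count(pages[name], "tire"), reverse=True)
print(ranking)
```

The stuffed page wins simply because it repeats the keyword, which is exactly the overuse described above.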
The Search Engines have improved a great deal over the last two years, but the searching process continues to be highly inefficient and tends to collapse. To help site owners gain positions within the lists (in fact, to get more light), ethical and unethical techniques and programs proliferate, most of them apt to deceive the “enemy”, namely the Search Engines. Even in a “Bona Fide” utopia it is impossible for a robot to differentiate between a complex site and a humble site dealing with the same subject. Complex site architectures could even make the sites invisible to the robots, because they are only well suited to evaluate flat and simple sites. Moreover, search engines like Google also need to break even commercially, and start selling pseudo forms of score boosting to desperate site owners who need traffic to subsist.
We
emphasize again the fact that the “light” that a Search Engine provides to each
URL is indirect like the Moon reflects the Sun’s light. Then our conclusion is
that most of the information and the knowledge is
hidden in the darkness of the Cyberspace.

Now that we know the meaning of the HK, Human Knowledge, we may define HKIS, the Human Knowledge Intelligent Summaries, a set of summaries (we will explain soon why we call them intelligent), and NHKIS, a Network of Human Knowledge Intelligent Summaries, which corresponds to the gray crown of the above figures. Now we are going to enter into the problem of the languages and jargons spoken in the Black Region, in the Gray Region and mainly in the Green Region.
The Websites are built to match users; they are like lighthouses in the darkness, broadcasting information, knowledge and, in the case of e-Commerce, some kind of attracting information such as “opportunities”. What really happens is that at present the Internet is more the Realm of Mismatch than of Matching. The lighthouse owners cannot find the users, and the users can neither find the alleged opportunities nor understand the broadcast messages. This mismatching scenario is dramatic in the case of Portals, huge lighthouses created to attract as many people as possible via general interest “attractions”.
Something similar occurs with the databases that store millions of units of supposedly useful information such as catalogs, services, manufacturers, professionals, job opportunities, commercial firms, etc.: users cannot find what they need. When we talk of mismatch we mean figures well over 95%, and in some databases matching efficiencies lower than 0.1%.
In the figure above we depicted this dramatic mismatch. The yellow point is a Website with its offer represented by the cone emerging from it, let’s say the Offer expressed in its language and in its particular jargon. A black point within the green circle represents a user, and the cone emerging from it his/her Demand, also expressed in his/her language and particular jargon.
What we discovered is that both sides speak approximately the same language but certainly different jargons and, more than that, they think differently! We have depicted the gray crown because the portion corresponding to the site’s Major Subject virtually exists: that is the portion in dark gray within its cone. The sites have the “truth” expressed in their particular jargon, and sometimes in the “official” and standard jargon. If the Website were for instance a “Vertical” of the Chemical Industry, its jargon would of course be within the Chemical Industry Standards and its menu should be expressed in a technically correct way, resembling the Index of a Manual for that particular Major Subject: Chemical Industry.
So the conclusion of two years of research studying the causes of the mismatch was that the lighthouses speak, or intend to speak, official jargons certified by the establishment of their particular Major Subjects. They are supposed to have the truth and they think as “teachers”, expressing their truth in their menus, which are in fact “logical trees”. They may claim to be e-books, and they behave, think, and look pretty much the same as physical books.
Now let’s analyze how the users act, express themselves and behave. If a user comes to the site to learn, the convergence of the cones is forced: the user has to think in terms of the concepts of the menu, which for him/her resembles a program of study, and we have a match scenario. If the user comes to the site to search for something, that is different. When one searches for something one tends to think in terms of keywords instead, keywords that belong to one’s own jargon and, at large, to one’s own Thesaurus. So, either out of ignorance or, on the contrary, out of expertise, the users’ cones diverge substantially from the site’s cone. One of the main reasons for this divergence is that the site owners do not know what their target markets need. Many of them are migrating from conventional businesses to e-Commerce approaches and extrapolate their market know-how as it is. They worked hard for decades to match their markets and to establish agreed jargons, and now they have to face unknown users coming from virtually all over the world.
Evidently the solution will be the evolution from mismatch to match in the most efficient way. To accomplish that, both the Offer and the Demand, have to approximate each other until both share a win-win scenario and a common jargon.

In the figure above we depict a mismatch condition where we can distinguish three zones: the red zone represents the idle and/or useless Knowledge; the gray zone corresponds to the common section with an agreed Thesaurus concordance; and the blue zone corresponds to what the users need and want and which apparently does not exist within the site. So the site owners and administrators have three lines of action: a) reduce the red zones to zero, for instance by adapting and/or eliminating supposed “attractions”; b) learn as much as possible about the blue zone; and c) combine both strategies.
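If we reduce the Offer and the Demand to plain sets of keywords, the three zones become simple set operations. This is only a simplified sketch: the function name and the sample keywords are hypothetical, and a real site would weigh keywords by usage instead of treating them all alike.

```python
def mismatch_zones(offer: set[str], demand: set[str]) -> dict[str, set[str]]:
    """Split two keyword sets into the red, gray and blue zones."""
    return {
        "red": offer - demand,   # offered but never asked for: idle knowledge
        "gray": offer & demand,  # common section: agreed Thesaurus concordance
        "blue": demand - offer,  # asked for but apparently not in the site
    }


# Hypothetical keyword sets for a site and for its users.
site_offer = {"tires", "rims", "ergaston"}
user_demand = {"tires", "wheels"}

zones = mismatch_zones(site_offer, user_demand)
for color, keywords in zones.items():
    print(color, sorted(keywords))
```

Line of action a) shrinks the red set, line b) studies the blue set, and c) does both.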
At this moment the dark green zones are extremely tiny, less than 5%, the Internet being the Realm of Mismatch between Users’ Demand and Sites’ Offer. The big efforts to be made consist of minimizing costs by eliminating useless attractions and of learning from the Users’ unsatisfied needs. To accomplish both purposes the site owners need intelligent tools: agents that warn them about red and blue events.
Let’s analyze the basic process of user-Internet interactions. A user comes to a site to interact in one of three forms, sometimes concurrently: investing time, clicking on a link, or filling a form or a box with some text, for instance to query a database. Site statistics are well prepared to account for clicks, telling what “paths” were browsed by each user, but they are not well suited to account for textual interactions. Of course, you may record the queries and even the answers, but that is not enough to learn from mismatching. To accomplish that we may create programs and/or intelligent agents that account for the different uses of the components of each answer, but they then have to do rather heavy accounting.
If we query a commercial database for tires, the answer would be a list of tire stores. To have statistics about how frequently the users ask for this specific keyword we need to account for it; to know about the “presence” of each store as a potential seller we need to account for it; and if we want to know about the popularity of each store we need to go further, accounting for it, and so forth. That accounting process involves a tremendous burden even when done on the site server’s side.
An intelligent approach would be to have all the counters needed to detect document popularity and users’ behavior built into the data to be queried. That is the beginning of the idea: to provide, within the data to be queried by users, a set of counters for each type of statistic. So when a data item is requested a counter is activated accounting for its presence; when it is selected by a click another counter is activated; and when the user, after reading the “intelligent summary” received, decides to click on the original site or on one of its inner hyperlinks, another counter is activated.

Here a typical track of user-site interaction is represented. The user makes a query for “tires”. The i-Intelligent Database answers by sending all the data it has indexed under tire, adding a list of the synonyms and related keywords it has for tire. Each activated i-URL accounts for its presence in that answer by adding one to the corresponding counter in the i-Tags zone. If the user clicks on a specific i-URL, the system presents it to the user, accounting for this preference in another counter of the i-Tags zone.
Finally, if the user decides to access the commented site located in the black crown, he/she makes a click and another counter is activated within the i-Tags zone. At the same time the counter corresponding to the keyword tire is incremented by one, and likewise if the user activates some synonym or related keyword. If the answer contains no data, it means a mismatch, caused either by an error or by a resource that does not exist within the database. In both cases the system has to activate different counters for the wrong or non-existing keyword in order to account for the popularity of this specific mismatch. If the popularity is high, it is a warning signal to the site Chief Editor (either human or virtual) about the potential acceptance of the keyword, either as a synonym or as a related keyword. At the same time, the system may urge a search for additional data within the black region. From time to time the system could suggest a review of the i-URL summaries database in order to assign data to the new keywords as well. We will see how to work with a network of these Expert Systems at different stages of evolution.
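The track just described can be sketched as a small data structure in which every i-URL carries its own counters. This is a minimal illustration under our own assumptions, not the FIRST implementation: the class names, the counter names (presence, selected, visited) and the sample URLs are hypothetical, and for simplicity the keyword counter is incremented at query time.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class IUrl:
    """An i-URL: a summary of a site plus its built-in i-Tags counters."""
    url: str
    summary: str
    keywords: set[str]
    tags: Counter = field(default_factory=Counter)


class IDatabase:
    """Keyword-indexed collection of i-URLs that accounts for every interaction."""

    def __init__(self, records, synonyms=None):
        self.records = list(records)
        self.synonyms = synonyms or {}    # keyword -> synonyms and related keywords
        self.keyword_hits = Counter()     # popularity of keywords that matched
        self.keyword_misses = Counter()   # popularity of each mismatch

    def query(self, keyword):
        keyword = keyword.lower()
        hits = [r for r in self.records if keyword in r.keywords]
        if not hits:
            self.keyword_misses[keyword] += 1  # wrong or non-existing keyword
            return [], []
        self.keyword_hits[keyword] += 1
        for record in hits:
            record.tags["presence"] += 1       # the i-URL appeared in an answer
        return hits, self.synonyms.get(keyword, [])

    def select(self, record):
        record.tags["selected"] += 1           # the user clicked on the i-URL
        return record.summary

    def visit(self, record):
        record.tags["visited"] += 1            # the user went on to the site
        return record.url

    def popular_misses(self, threshold):
        """Keywords to report to the Chief Editor as candidate additions."""
        return [k for k, n in self.keyword_misses.items() if n >= threshold]


db = IDatabase(
    [
        IUrl("http://example.com/a", "Summary of store A", {"tire"}),
        IUrl("http://example.com/b", "Summary of store B", {"tire", "wheel"}),
    ],
    synonyms={"tire": ["tyre", "wheel"]},
)
hits, related = db.query("tire")
db.select(hits[0])
db.visit(hits[0])
db.query("rim")
db.query("rim")
print(hits[0].tags, related, db.popular_misses(threshold=2))
```

Because the counters live inside the records, the statistics come as a byproduct of answering the queries, without a separate accounting pass over server logs.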
As part of the intelligent features we consider registering the IP of the users’ interactions and the sequence of queries, normally related to something not found. The users’ keyword strings are in turn related to specific subjects within the Major Subject of the site. So, statistically, the analysis of the keyword strings tells us about the popularity of the current menu items and suggests new items to be considered.
Let’s try to search for something apparently trivial like “Internet statistics”, for instance using one of the best search engines, Google: more than 1,500,000 sites! Do not dig too deep into that list; only check what the first 20 or 30 sites offer. Most of the content shown by the sites of that sample is obsolete, and when it is updated you are harassed by a myriad of sales offers for particular statistics, market research studies and the like, priced in the thousands and up. And if this scenario occurs with supposed authorities such as Library of Congress, Cyberatlas, About.com statistics sites, Internet Index, Data Quest and InternetStats, what then of the remaining 1,500,000?
What if that noisy cluster were replaced by a brief comment made by a statistician, telling the state of the art of Internet Statistics and suggesting alternative ways to compile statistics from free, updated authorities that surely exist in the Web? That is very easy to do and economical too: it should take no more than one hour of that specialist’s time. Of course, that would be feasible as a permanent solution only if the cost of updating that kind of report were relatively insignificant. Concerning this problem, we estimated that the global cost of updating a given HKM is of the order of 3% to 5% per annum of the cost of its creation. So the HKM’s will be updated in two ways: evolutionarily, through their interaction with users, and authoritatively, through updates by human experts.
Let’s see other examples with “sex” and “games”. Sex has more than 48,000,000 sites, and it is well known that the sources of sexual and pornographic content are fewer than 100. The rest are speculators, repetitions, transfers, and commuting sites of only one click per user, playing the ingenuous role of useful idiots. Something similar occurs with games, with more than 35,000,000 sites, and again the world providers of game machines, solutions, and software are no more than 100!
For a given culture and for a given moment we have the following regions in the Web space:

Red: a given HKM
Black Blue: HK Virtual Library
Regular Navy Blue: Ideal HK
Blue: Ideal HK plus New Research
Light Blue: Ideal HK plus NR plus Knowledge Movements
Deep Light Blue: Ideal HK plus NR plus KM plus Information
Everything works within an expanding universe of Human Intellectual Activity. It takes a great deal of time and effort for new ideas and concepts to become part of the Ideal HK. We as humans have two kinds of memory, semantic and episodic, and any culture at a given moment has its semantic memory, conscious and unconscious, intuitive and rational, as well as its episodic memory.
Throughout human history the dominant cultures have controlled the inflow of the Human Intellectual Activity in explicit and implicit ways, for instance by discouraging dissent. The Internet allows us as users to dissent from any form of “established” HK and to influence, on an equal basis, the allegedly ideal HK. This feature will accelerate the enrichment of the ideal HK in an unprecedented way. For that reason we emphasize in FIRST the mismatch between the HKM and the users’ thoughts, questions and expectations, oriented to satisfying users, that is, the human being as a whole and as a unit.
We make
specific reference to Internet Data Management because the “Big Net” differs
substantially from most nets. Internet deals with all possible groups of people
and all possible groups of interest. Internet users belong to all possible
markets from kids to old people in all possible economic, social and political
levels and cultures. This Universality makes the Internet man-machine
interactions extremely varied.
On the contrary, in any other network we may define a “jargon”, an ethic and rules. When we build a new Internet Website we really do not know who our potential users will be, and consequently what they want or what they need; we do not even know their jargons. We imagine a target market, and for that specific market we design the site content, in fact the “Information Offer” to that market.

The figure above depicts the matchmaking process within the Internet “noosphere”. The users, in green, express what they want and even think in terms of “keywords” expressed in their own jargon; they are open and flexible. On the contrary, the Website owners believe, through their sites, that they have the truth and nothing but the truth. In that sense, whether or not they are an authority, they resemble “The Law” of the establishment of the Human Knowledge. The law for each Major Subject is expressed in Indexes of the main branches of that Major Subject, resembling a “Logical Tree”, depicted in gray over the yellow truth. They imagine their sites as universal facilitators, but always following the pattern of the logical tree and expressed in their jargons.
The Websites have their own Thesaurus, a set of “official” keywords, depicted in white over a black background, within the darkness of the Web space. A correspondence exists between the logical tree and the Thesaurus. The Website owners are shown with the Truth Staff in yellow. The user-Internet interactions are depicted as a progressive matchmaking process, going from green to black and vice versa, each side learning from the other through match and mismatch. Both sides strive to know each other by interchanging knowledge.
Paradoxically, even though the Web is so well suited to adding, generating and managing intelligence, most people ignore this fantastic possibility. If we define our Information Offer as WOO, which stands for What Owners Offer, and what the users want as WUW, which stands for What Users Want, the Web Architecture permits the continuous matching between them and, as a byproduct, the intelligence emerging from any mismatch.
That possibility means the following: WUW is what users want, expressed in their specific jargon/s, while WOO is the Website information offer expressed in, let’s say, the “official/legal” jargon, the one we choose to communicate with our target market. The continuous mismatch between WOO and WUW would permit us to know the following five crucial things:
· What the Market wants
· The Market major characteristics
· The Market homogeneity and/or its segmentation
· The Market jargon/s
· The Market needs.
The knowledge of the market jargon/s permits us to optimize our offer: for instance, a negative answer to a user query could mean either that we don’t have what he/she wants or that the name of what he/she is looking for in his/her jargon differs from the name in our jargon.
What we
know directly from users queries is what they want, not what they need. The
difference between WUW and WUN, What Users Need is substantial.
People generally know what they need but adjust their needs to the supposed or
alleged Website capabilities. We learn what our users need as time passes by if
we make use of the intelligence byproducts and/or from surveys.
The IO (Information Offer) is normally presented as ordered sets in the form of Catalogs, Indexed Lists and Indexes, but the queries, where the users express their particular needs (WUN), are expressed by keywords. The two communication systems are completely different, yet they can complement each other, and we could make them work together towards the ideal match between WUN and WOO.
As we will see soon, the users communicate with the different Websites via their subjective jargons, at least as many jargons as the MS, “major subjects”, they are interested in. For instance, if I am an entrepreneur who manufactures sport car wheels, I am going to query B2B sites for subjects related to sport car wheels, expressing myself in “my” jargon, which differs from the “official” jargons used in the B2B sites, and of course the query outcomes will strongly depend on the differences between the jargons.
In the same way that official languages change from time to time, influenced at large by the pressure of the people’s jargons, with both coexisting at any time, we may endow the Websites of the Cyberspace with an extremely efficient evolutionary feature via Expert Systems that learn from the man-Internet interactions. We dare to qualify this feature as extremely efficient because in the Cyberspace every transaction can be easily and precisely accounted for. So, each time a user uses a keyword belonging to his/her jargon, this event could and should be accounted for.
Let’s then imagine what kind of intelligent byproduct we could extract from this simple but astonishing feature. Within a homogeneous market the keywords tend to be the same among its members. So, in our last example, if the majority of users make queries asking for wheels and the word-product wheel does not exist in our database, a trivial byproduct takes the form of the following suggestion: add wheels to the database as soon as possible. On the other hand, if the word-product “ergaston” was never asked for over a considerable amount of time, another trivial message would be: take ergaston out of the database.
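The two trivial byproducts can be derived from nothing more than a log of queried keywords. The following is a sketch under stated assumptions: the function name, the threshold of three queries and the sample log are ours, and the catalog is assumed to hold lower-case keywords.

```python
from collections import Counter


def suggest_changes(catalog, query_log, add_threshold=3):
    """Return (keywords to add, keywords to remove) from a query log.

    A keyword is suggested for addition when it is missing from the catalog
    and was asked for at least add_threshold times; a catalog keyword is
    suggested for removal when it was never asked for in the log period.
    """
    asked = Counter(q.lower() for q in query_log)
    to_add = sorted(k for k, n in asked.items()
                    if k not in catalog and n >= add_threshold)
    to_remove = sorted(k for k in catalog if asked[k] == 0)
    return to_add, to_remove


# Hypothetical catalog and query log for the observation period.
catalog = {"tire", "rim", "ergaston"}
query_log = ["wheel", "Wheel", "tire", "wheel", "rim"]

to_add, to_remove = suggest_changes(catalog, query_log)
print("add:", to_add)
print("remove:", to_remove)
```

In practice such suggestions would go to the Chief Editor rather than being applied automatically, since a popular unknown keyword may be only a synonym of an existing one.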

The figure above depicts the evolution of the matchmaking process. In the beginning, the Website owners had the oval green-gray target, where one user is shown as a black dot. But that user really belongs to a user affinity market, depicted as a dark green oval, with a cone of Internet interest that differs greatly from the ideal initial target. The Website owners need an intelligent process to shift towards the bigger potential market in dark green. With a yellow-bordered cone we depict the final “stable” matchmaking.
3- FIRST, Full Information Retrieval System Thesaurus
The Cyberspace currently has about 1,500 million documents ranging from reference to trivial, from true e-books dealing with the major subjects of the human knowledge to daily news and even minute-to-minute information on human interactions, as in the case of pages generated “on the fly” by Newsgroups, Chats and Forums. This information mass grows continuously at an exponential rate, rather chaotically, as its production rate far exceeds the human capacity for filtering, qualifying and classifying it.
To help the retrieval of information from the Cyberspace we make use of
Search Engines and Directories that are unable to attain WUN, What Users (We
the Humans) Need. From all that information mass the search engines offer us
“summaries”, telling what kind of information we could get in each location of
the Cyberspace (the URL, Uniform Resource Locator). So for each URL we as users
obtain its summary. Those summaries are normally written by the Search Engines’
robots, which try to do their best extracting pieces of “intelligence” from
each Cyberspace location.

In the figure we depict some sites within the darkness of the Cyberspace. We may find anything from huge sites storing millions of documents in hundreds of sections to tiny sites with a flat design storing a few pages. One Search Engine, shown as a yellow crown, sends its robots to visit existing sites from time to time, making a brief “robotic” summary of them. As we will soon see, those brief reports are noisy, deceiving the users (green circle). The Search Engine assigns priorities, which act in turn as a measure of the site magnitude (like the brightness of a star). As depicted, the priorities (the navy blue dots) have nothing to do with the real magnitude of the site (depicted as the white circle diameter). So the yellow crown is a severe distortion of the Web. These priorities, defined for the keyword set of a given site, resemble the “light” that illuminates it: a high priority means a powerful beam of light reflecting off the site, highlighting it to the users’ sight.
The information currently provided by the search engines is as primitive as
the map of the sky we had one thousand years ago. The robots only detect some
of the keywords the site content has, equivalent to the chemical elements of
the celestial bodies, but tell us nothing about its structure, type of body and
magnitude. Today we may have for each celestial body the following data, among
many others: diameter, density, the spectral distribution of its constitutive
elements, brightness, radiation, and albedo. For each of these variables we
have site equivalents that must be known in order to say that we have a
comprehensible Cyberspace map. For instance, we need to know something that
resembles magnitude, density, elements distribution and brightness.
Since the bodies of this cultural and intellectual space (noosphere) are
intellectual creatures, we need an intellectual summary of each of them, what
is known as the abstract in essays and research papers. For instance, a site
could be camouflaged to appear attractive by emphasizing the importance of a
given element, let’s say “climate”, to deceive a robot into treating it as a
specialized climate site while in reality having no climate content. The same
happens with information: Portals’ news, for instance, are presented as content
sites, which is true only concerning a specific type of information resource
known as “news”, with an extremely ephemeral life of hours. On the contrary,
content on philosophy or mathematics is by far denser and heavier, with lives
lasting centuries on average. So we could distinguish all kinds of bodies, from
fizzy (news) to rocky (academic).
Another complementary source of information is the Databases hosted as
collateral of the Websites, as huge stores of organized and structured data.
The content and quality of these databases are normally a subjective “bona
fide” declaration made by the Website owners. So far, for the users, the
Cyberspace, particularly the Web Cyberspace, looks like a net of information
resources with some “Indexes” to facilitate their retrieval task. Those
robot-made indexes are too noisy, being practically useless. Below we attach a
well-known graphic sample of this uselessness.

The figure depicts the finding of useful information (black spots) while
navigating along a search program.
The main reasons for this uselessness are, among many others:
Increasing Website Complexity: Robots cannot cope with the increasing
complexity of Websites. Robots are unable to evaluate properly sites like the
ones belonging to NASA, the World Trade Organization, and the Library of
Congress, to mention only some institutional ones, concerning Aerospace,
Commerce, and General Knowledge respectively, and cannot differentiate them
from trivial sites dealing with similar subjects.
Inability to cope with Human Stratagems: Robots are unable to detect and block some subtle
overselling stratagems used by Website owners to position themselves high
in the Search Engines’ answers to users’ queries.
Linguistic Problems: Robots cannot cope with the increasing number and
complexity of the languages and jargons used on the net. They do their work
using a rather naïve Thesaurus, only modified and enriched via the Website
owners’ declarations and not, as it should be, via the users’ feedback. As a
consequence of that bias the Search Engines speak the owners’ jargons instead
of the users’ jargons.
In brief, the shadows of content that search engines offer to the users have
almost nothing to do with the real content of the Cyberspace, presenting a
distorted vision of it. The problem is the contagious spread of this
distortion, as long as the Website owners use that summary information as a
“bona fide” vision of their world. As a corollary, the Internet today speaks
the Website owners’ jargons, pointing to a globally distorted vision of the
real markets!
Measuring the mismatch between WUN, What Users Need, and WSO, What
Search-Engines Offer, should be one of the first priorities of scientific
institutions interested in the health of the Internet. However, almost
everybody is well acquainted with this abysmal mismatch, and you may check it
yourself very easily by making random queries about any subject. We, as a
private research group, made our own investigations into that global mismatch,
finding the following figures:
This means that we, as ordinary users searching through the Cyberspace with
the help of outstanding search engines, have to browse on average through
6,000 summaries to find 1 potentially matching our needs.
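To put the 6,000-to-1 figure on the same scale as a percentage mismatch, one possible reading (an assumption made here, since the text does not define the measure formally) is to take the mismatch as the share of browsed summaries that do not match. The helper name `mismatch_ratio` is hypothetical.

```python
def mismatch_ratio(items_browsed_per_hit):
    """Share of browsed items that do not match the user's need.

    Assumes (hypothetically) that exactly one item out of every
    `items_browsed_per_hit` browsed is a potential match.
    """
    if items_browsed_per_hit < 1:
        raise ValueError("at least one item must be browsed per hit")
    return 1.0 - 1.0 / items_browsed_per_hit


ratio = mismatch_ratio(6000)
print(f"{ratio:.4%}")  # 99.9833%
```

Under this reading, browsing 6,000 summaries per useful one corresponds to a mismatch of roughly 99.98%.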
Searching for information stored in Databases proved to be a tough task as
well. Students of Systems Engineering in the final year of their degree at the
Instituto Tecnológico de Monterrey, Mexico, were invited to freely query a
tested (2) commercial Database. The mismatch was greater than 99.9%, that is,
they needed on average more than 100 queries to match a product/service stored
within the database. The main reason for the mismatch was not missing
information in the database but linguistic problems. That was a warning sign,
and we investigated some other commercial databases belonging to well-known B2B
sites, with similar results.
Note 2: By “tested” we mean that the content was checked before the trial.
The information existed but the students were unable to find what they were
searching for because of linguistic problems.
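The kind of linguistic mismatch described here can be illustrated with a toy lookup. This is only a sketch under stated assumptions: the database terms, the user terms and the `lookup` function are all hypothetical, and the small thesaurus stands in for a mapping from users' jargon to the owners' stored vocabulary.

```python
def lookup(database, query, thesaurus=None):
    """Return the stored term matching the query, or None if nothing matches.

    database  -- set of terms as written by the database owner
    query     -- keyword as typed by the user
    thesaurus -- optional dict mapping user jargon to stored terms
    """
    term = query.strip().lower()
    if term in database:
        return term
    # Fall back to the user-jargon mapping, if one is supplied.
    if thesaurus:
        mapped = thesaurus.get(term)
        if mapped in database:
            return mapped
    return None


database = {"rim", "tyre"}                          # owner's vocabulary (hypothetical)
user_thesaurus = {"wheel": "rim", "tire": "tyre"}   # users' jargon (hypothetical)

print(lookup(database, "wheel"))                  # None
print(lookup(database, "wheel", user_thesaurus))  # rim
```

The product is present in both calls; only the second one finds it, because the query is first translated from the user's jargon into the owner's.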
The abysmal and chaotic mismatch enables forms of e-Commerce delinquency: When you as a user