The Future of Cyberspace
The Webspace and the Noosphere
By Juan Chamero, jach@aunmas.com
Note: The Noosphere is the part of the world of life that is
created by man's thought and culture. Pierre Teilhard de Chardin, Vladimir
Ivanovich Vernadsky and Édouard Le Roy distinguish the noosphere from the
geosphere, the non-living world, and from the biosphere, the living world.
Introduction
A search for “noosphere” in Altavista returned 30,136 pages at 2:22 PM
Eastern Time (USA and Canada) on Thursday, 12 April 2001. It is a rather
strange word for many people, and it has not yet earned an entry in the
Merriam-Webster online dictionary. However, we all know, use and enjoy
Cyberspace, a concept that at nearly the same time returned as many as
777,290 entries in the same Altavista and that, by contrast, has had an
entry in Merriam-Webster since 1986, with the following meaning:
the on-line world of computer networks. Webspace is another
neologism not yet included in that dictionary, but it returns 485,805 entries
in Altavista.
The Webspace is growing at a fantastic
pace and today holds nearly half a billion documents, ranging from Virtual
Libraries and virtual reference e-books dealing with the Major Subjects of
human knowledge to ephemeral news and trivial virtual flyers generated “on
the fly” at any moment. On the Web we may find documents belonging
to any of the three major Internet resources or categories: Information,
Knowledge and Entertainment.
The Webspace Regions

In the above figure the black crown (ring)
represents the Webspace and the green circle the users. The gray crown
represents an intermediate net to be built in the near future with intelligent
summaries of the Human Knowledge, pointing to its basic documents and
e-books. One user is shown extracting a “cone” of what he/she needs in terms
of information and knowledge. The intelligent summaries must be engineered to
be good enough to serve as introductory guides/tutorials with a set of
essential hyperlinks inside. If the user wants more detail, he/she then goes
directly to the right sources within the black region. Depending on the Major
Subject dealt with, the user may go from summary to summary or jump to higher
level guides inside the gray region, going to the black region only to look
for specific themes.
Another user goes directly to the
black region guided by classical search engines, as happens now. The black
region will always be necessary and will grow fast in volume as time passes.
By contrast, the gray region will fluctuate around a medium volume, growing at
a relatively very slow rate. In effect, the Human Knowledge is almost bounded:
its content changes, but always around the same set of Major Subjects. The
growth of the gray region is extremely slow in comparison to that of the black
region. Some Major Subjects die and some others are born, but slowly.
Region Volume Estimates
As a science fiction exercise, we
invite you to make some calculations in the spirit of some Isaac Asimov
stories. If the current Human Knowledge is bounded to, let’s say, 250 Major
Subjects or Disciplines, and if for each of them we define a Virtual Library
with 2,000 non-redundant e-books on average, we will have a volume of 500,000
e-books.
Now we could design a methodology to synthesize an intelligent summary
for each e-book in no more than 2,000 characters on average, totaling
1,000 MB, that is 1 GB, storing one character in a single byte. That would be
the volume of the gray region: not too much, really!
Let’s then compare this volume to the volume of the black region and to the
volume of the resources of the Human Knowledge. Once upon a time, there was a
Webspace with half a billion documents with an average size estimated at
2.5 MB (we have documents ranging
from 10 KB or less to 100 MB or more; to get that figure we assumed the
arbitrary size series 1, 10, 100, 1,000, 10,000, 100,000 in KB and we
assigned to each term the arbitrary weights .64, .32, .16, .08,
.04, .02 respectively). Then we have a volume of nearly 1,250,000,000 MB!
Within that giant space float, dispersed, the basic e-books, the resources of
the Human Knowledge, with an estimated volume of nearly 500,000 MB, assigning
1 MB to each one: half a million characters of text and 100 images of 5 KB
each on average.
Black Region: ~1,250,000 GB → HK: ~500 GB → Gray Region: ~1 GB
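The estimates above can be reproduced with a few lines of arithmetic. The
following is only a sketch under stated assumptions: decimal units (1 MB =
1,000 KB, 1 GB = 1,000 MB), and a halving weight series ending in .04 and .02,
which is the series that reproduces the 2.5 MB figure. The weights do not sum
to 1, so the result is the weighted sum used in the text rather than a
normalized mean. All variable names are illustrative.

```python
# Back-of-the-envelope volumes for the three regions, in decimal units.
KB_PER_MB = 1_000
MB_PER_GB = 1_000

# Gray region: one 2,000-character summary per e-book, one byte per character.
major_subjects = 250
ebooks_per_subject = 2_000
summary_chars = 2_000
ebooks = major_subjects * ebooks_per_subject             # 500,000 e-books
gray_mb = ebooks * summary_chars // 1_000_000            # bytes -> MB
gray_gb = gray_mb / MB_PER_GB

# Average document size: weighted sum of the arbitrary size series (in KB).
sizes_kb = [1, 10, 100, 1_000, 10_000, 100_000]
weights = [0.64, 0.32, 0.16, 0.08, 0.04, 0.02]           # assumed halving series
avg_doc_mb = sum(s * w for s, w in zip(sizes_kb, weights)) / KB_PER_MB

# Black region: half a billion documents at the rounded 2.5 MB average.
documents = 500_000_000
black_gb = documents * 2.5 / MB_PER_GB

# Human Knowledge: 500 KB of text plus 100 images of 5 KB per e-book.
ebook_kb = 500 + 100 * 5
hk_gb = ebooks * ebook_kb // KB_PER_MB / MB_PER_GB

print(f"average document: {avg_doc_mb:.2f} MB")
print(f"black region: {black_gb:,.0f} GB, HK: {hk_gb:,.0f} GB, gray region: {gray_gb:,.0f} GB")
```

The three totals differ by roughly three orders of magnitude each, which is
the whole point of the exercise.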
An incredible result, which demonstrates
how easy it will be to compile a rather stable HKIS (Human Knowledge
Intelligent Summary) in relation to the unstable, noisy, bubbling, fizzy and
always growing black region. Once the effort is done, upgrades will be
facilitated by Expert Systems that will take out of the black region only
the changes.
The Webspace looks like the Sky at night
In the figure above we depict the
current Webspace in black, resembling the physical space of the Universe. No
doubt the information we need as users is up there, but where? That virtual
space is really almost black for us. Some members of Cyberspace that provide
searching services, known as Search Engines and/or World Wide Web Directories,
are like stars that radiate light all over the space to make sites indirectly
visible. Sometimes we may find a few sites with their own light, like
stars, activated by publicity in conventional media, but the rest is only
illuminated by those services on request. Let’s go a little deeper into the
nature of this singular Webspace searching process.
For each resource (body) located in
the space at a URL (Uniform Resource Locator), the robots of those
lighting services prepare a brief summary with some information extracted from
it, no more than a paragraph, and then all the information collected goes to
the services’ databases. The summaries have attached to them some keywords
extracted from the resources visited, and consequently they are indexed under
as many keywords as they have attached.
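The indexing step just described, where each summary is filed under every
keyword attached to it, is what is usually called an inverted index. A minimal
sketch follows; the URLs, summaries and keywords are made up for illustration,
and real search services certainly store much more than this.

```python
from collections import defaultdict

def build_index(summaries):
    """Map each keyword to the set of URLs whose summary carries it."""
    index = defaultdict(set)
    for url, record in summaries.items():
        for keyword in record["keywords"]:
            index[keyword.lower()].add(url)
    return index

# Hypothetical robot output: one short summary plus keywords per URL.
summaries = {
    "http://example.com/tires": {
        "summary": "A store selling car tires.",
        "keywords": ["tires", "cars"],
    },
    "http://example.org/noosphere": {
        "summary": "Notes on the noosphere concept.",
        "keywords": ["noosphere", "knowledge"],
    },
    "http://example.net/garage": {
        "summary": "A garage that repairs cars.",
        "keywords": ["cars", "repair"],
    },
}

index = build_index(summaries)
print(sorted(index["cars"]))
```

A summary with three keywords appears three times in the index, once under
each keyword, which is why a resource is only "lit" for the words its robot
happened to extract.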
How the Search Engines illuminate the Resources
Today’s robots are very “clever”
but extremely primitive compared to human beings. They do their best, and
they also have to perform their work fast, in fractions of a millisecond per
resource, so it would be impractical to make them more sophisticated, because
the time of “evaluation” grows exponentially with the level of cleverness. To
facilitate the robots’ work, Website programmers and developers have wise
tools at hand, but many of them overuse those facilities so badly that they
render them unwise. With those tools, the programmers could communicate to the
robots some essential information the site owners wish to be known about the
site.
Those wise gateways are now noisy
because most people try to deceive the robots by overselling what should be
the essential information. Why do they do that? Because the Search Engines
must present the sites listed hierarchically: the first is the best!
Something similar occurs in the Classified Section of the newspapers: people
wishing to be listed first unethically make nonsensical use of the first
letter of the alphabet, so that AAAAAAA Home Services goes before, for
instance, AA Home Services. The Search Engines do not have much room to design
a “fair” methodology to rank the sites with equity.
One trivial criterion would be to
count how many times a keyword is cited within the resource, but that proved
to be misleading because the robots only browse the resource partially, which
makes it practically impossible to differentiate a sound academic treatise
from a student’s homework concerning the same subject. To make things worse,
programmers, developers, and content experts know all those tricks, and
consequently they overuse the keywords they believe are significant.
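To make the weakness of the keyword-count criterion concrete, here is a small
sketch of such a ranking. It is not how any particular Search Engine works;
the function names, the `scan_limit` parameter (modeling a robot that only
reads the beginning of a resource) and the two sample pages are all
assumptions made for illustration.

```python
import re

def keyword_count(text, keyword, scan_limit=None):
    """Count occurrences of keyword among the first scan_limit words."""
    words = re.findall(r"[a-z]+", text.lower())
    return words[:scan_limit].count(keyword.lower())

def rank(pages, keyword, scan_limit=None):
    """Order page names by descending keyword count (the naive criterion)."""
    return sorted(
        pages,
        key=lambda name: keyword_count(pages[name], keyword, scan_limit),
        reverse=True,
    )

pages = {
    "treatise": (
        "The noosphere, as Teilhard de Chardin and Vernadsky describe it, "
        "is the sphere of human thought and culture."
    ),
    "stuffed_page": "noosphere noosphere noosphere noosphere cheap offers",
}

ranking = rank(pages, "noosphere")
print(ranking)
```

The page that merely repeats the keyword outranks the one that actually
discusses the subject, which is exactly the overuse described above.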
The Search Engines have improved
a great deal over the last two years, but the searching process continues to
be highly inefficient and tends to collapse. To help site owners gain
positions within the lists (in fact, to get more light), ethical and unethical
techniques and programs proliferate, most of them apt to deceive the “enemy”,
namely the Search Engines. Even in a “bona fide” utopia it is impossible for a
robot to differentiate between a complex site and a humble site dealing with
the same subject. Complex site architectures could even make the sites
invisible to the robots, because they are only well suited to evaluate flat
and simple sites.
We emphasize again the fact that
the “light” that a Search Engine provides to each URL is indirect, like the
Moon reflecting the Sun’s light. Our conclusion, then, is that most of the
information and knowledge is hidden in the darkness of Cyberspace.
The Cyberspace as a Global Market
The matchmaking Realm

Now that we know the meaning of
HK, the Human Knowledge, we may define HKIS, the Human Knowledge Intelligent
Summaries, a set of summaries (we will soon explain why we call them
intelligent), and NHKIS, a Network of Human Knowledge Intelligent Summaries,
which corresponds to the gray crown of the above figures. Now we are going to
enter into the problem of the languages and jargons spoken in the Black Region,
in the Gray Region and mainly in the Green Region.
Websites are built to match users
Internet the Realm of Mismatch
Websites are built to match
users; they are like lighthouses in the darkness, built to broadcast
information, knowledge and, in the case of e-Commerce, a kind of information
we could call “opportunities”. What really happens is that at present the
Internet is more the Realm of Mismatch than of Matching. The lighthouse owners
cannot find the users, and the users can neither find the alleged
opportunities nor understand the messages. This mismatching scenario is
dramatic in the case of Portals, huge lighthouses created to attract as many
people as possible via general interest “attractions”.
Something similar occurs with the
databases that store millions of units of supposedly useful information
such as catalogs, services, manufacturers, professionals, job opportunities,
commercial firms, etc.: users cannot find what they need. When we talk
of mismatch we mean figures well over 95%, with matching in some databases
below 0.1%.
In the figure above we depict
this dramatic mismatch. The yellow point is a Website with its offer
represented by the cone emerging from it, let’s say the Offer expressed in its
language and in its particular jargon. A black point within the green circle
represents a user, and the cone emerging from it represents his/her Demand,
also expressed in his/her language and particular jargon.
Mismatch reasons
Websites and users speak and think differently
What we discovered is that both
sides speak approximately the same language but certainly different jargons
and, more than that, they think differently! We have depicted the gray crown
because the portion corresponding to the Website’s Major Subject virtually
exists: that is the portion in dark gray within its cone. Websites have the
“truth” expressed in their particular jargon, and sometimes in
the “official” and standard jargon. If the Website were, for instance, a
“Vertical” of the Chemical Industry, its jargon would of course be within
the Chemical Industry Standards and its menu should be technically
correct, resembling the Index of a Manual for that particular Major Subject:
Chemical Industry.
So the conclusion of our research, done
over two years studying the causes of mismatch, was that the lighthouses
speak, or intend to speak, official jargons certified by the establishment of
their particular Major Subjects. They are supposed to hold the truth and they
think as “teachers”, expressing their truth in their menus, which are in fact
“logical trees”. They may claim to be e-books, and they behave, think, and
look pretty much the same as physical books.
Now let’s analyze how users
act, express themselves and behave. If a user comes to the site to learn, the
convergence of the cones is forced: the user thinks in terms of the concepts
of the menu, which for him/her resembles a program of study, and we have a
match scenario. If the user comes to the site to search for something, that is
different. When we search for something, we tend to think in terms of keywords
instead, keywords that belong to our own jargon and, at large, to our own
Thesaurus.
So, either out of
ignorance or, on the contrary, because they are experts, the users’ cones
diverge substantially from the site’s cone. One of the main reasons for this
divergence is that the site owners do not know what their target market needs.
Many of them are migrating from conventional businesses to e-Commerce
approaches and extrapolate their market know-how as it is. They worked hard
for decades to match their markets and to establish agreed jargons, and now
they have to face unknown users coming from virtually all over the world.
The solution
Evidently the solution will be the
evolution from mismatch to match in the most efficient way. To accomplish
that, both the Offer and the Demand have to approach each other until they
share a win-win scenario and a common jargon.

In the figure above we depict a mismatch condition in which
we may distinguish three zones: the red zone represents the idle and/or
useless Knowledge; the gray zone corresponds to the common section with an
agreed Thesaurus concordance; and the blue zone corresponds to what the users
need and want, and which apparently does not exist within the site. So the
site owners and administrators have two lines of action: a) reduce the red
zones to zero, for instance by adapting and/or eliminating supposed
“attractions”, and b) learn as much as possible about the blue zone.
At this moment the dark green zones are extremely tiny,
less than 5%, the Internet being the Realm of Mismatch between Users’ Demand
and Sites’ Offer. The big effort to be made consists in minimizing costs by
eliminating useless attractions and in learning from Users’ unsatisfied needs.
To accomplish both purposes the site owners need intelligent tools, agents
that warn them about red and blue events.
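If the Offer and the Demand are each reduced to a set of keywords, the three
zones described above become plain set operations. The sketch below assumes
that reduction; the `match_ratio` measure (common keywords over all keywords)
is an illustrative choice of ours, not a figure defined in the text, and the
sample vocabularies are invented.

```python
def mismatch_zones(offer, demand):
    """Split two keyword sets into the red, gray and blue zones."""
    offer, demand = set(offer), set(demand)
    return {
        "red": offer - demand,    # offered but never asked for
        "gray": offer & demand,   # common section: the actual match
        "blue": demand - offer,   # asked for but not found in the site
    }

def match_ratio(offer, demand):
    """Share of all keywords that both sides have in common (assumed measure)."""
    offer, demand = set(offer), set(demand)
    union = offer | demand
    return len(offer & demand) / len(union) if union else 0.0

# Hypothetical site menu vocabulary versus what users actually type.
site_offer = {"tyres", "rims", "alignment", "portal news"}
user_demand = {"tires", "rims", "cheap tires"}

zones = mismatch_zones(site_offer, user_demand)
print(zones["red"], zones["gray"], zones["blue"])
print(f"match ratio: {match_ratio(site_offer, user_demand):.2f}")
```

Note that "tyres" and "tires" land in different zones even though they mean
the same thing: a jargon difference shows up as both a red and a blue event
until a Thesaurus links the two words.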
What does Intelligent mean
Let’s analyze the basic process of user-Internet
interactions. A user interacts with a site in only two ways: clicking on a
link, or filling in a form or a box with some text, for instance to
query a database. Site statistics are well prepared to account for
clicks, telling what “paths” were browsed by each user, but they are not well
suited to account for textual interactions. Of course,
you may record the queries and even the answers, but that is not enough to
learn from mismatching. To accomplish that we may create intelligent agents
that account for the components of each answer, for instance documents, but
they then have to do rather heavy accounting.
If we query a commercial database for tires, the answer
would be a list of tire stores. To have statistics about how frequently
users ask for this specific keyword, we need to account for it; to know
about the “presence” of each store as a potential seller, we need to account
for it; and if we want to know about the popularity of each store we need to
go further, accounting for it as well, and so forth. That accounting process
involves a terrific burden, even when done on the site’s server side.
An intelligent approach would be to have all possible
counters built into the data to be queried. That is the beginning of the idea:
to provide, within the data to be queried by users, a set of counters, one for
each type of statistic. So when a piece of data is returned in an answer, a
counter is activated accounting for its presence; when it
is selected by a click, another counter
is activated; and when the user, after reading the “intelligent summary”
received, decides to click through to the original
site or to one of its inner hyperlinks, yet another counter is activated.
A typical track of user-Site interaction is represented
here. The user makes a query for “tires”. The I-database (Intelligent
Database) answers by sending all the data it has indexed under “tire”, adding
a list of the synonyms and related keywords it has for “tire”. Each activated
I-URL accounts for its presence in that answer by adding one to the
corresponding counter in the I-Tags zone. If the user clicks on a specific
I-URL, the system presents it to the user, accounting for this preference in
another counter of the I-Tags zone.
Finally, if the user decides to access the commented site
located in the black crown, he/she makes a click and another counter is
activated within the I-Tags zone. At the same time the counter corresponding
to the keyword “tire” is incremented by one, and the same happens if the user
activates some synonym or related keyword. If the answer contains no data, it
means a mismatch, due either to an error (a wrong keyword) or to a resource
that does not exist within the database.
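The track described above can be sketched as a small data structure in which
every record carries its own counters. This is only one possible reading of
the idea: the class names `IUrl` and `IDatabase`, the counter names, the
sample store and the choice to count the keyword at query time are all
assumptions made for illustration.

```python
from collections import Counter
from dataclasses import dataclass, field

@dataclass
class IUrl:
    """An intelligent summary record with its own counters (the I-Tags zone)."""
    url: str
    summary: str
    tags: Counter = field(default_factory=Counter)

class IDatabase:
    def __init__(self):
        self.index = {}                # keyword -> list of IUrl records
        self.related = {}              # keyword -> synonyms and related keywords
        self.keyword_hits = Counter()  # how often each keyword was asked for

    def add(self, keyword, record, related=()):
        self.index.setdefault(keyword, []).append(record)
        self.related.setdefault(keyword, []).extend(related)

    def query(self, keyword):
        """Return matching records plus related keywords, counting presence."""
        records = self.index.get(keyword, [])
        if not records:
            return [], []              # zero data: a mismatch
        self.keyword_hits[keyword] += 1
        for record in records:
            record.tags["presence"] += 1
        return records, self.related.get(keyword, [])

    def select(self, record):
        """The user clicks an I-URL to read its summary."""
        record.tags["selected"] += 1
        return record.summary

    def visit(self, record):
        """The user clicks through to the original site in the black region."""
        record.tags["visited"] += 1
        return record.url

# Hypothetical usage: one store indexed under "tire".
db = IDatabase()
store = IUrl("http://example.com/tire-store", "A store selling tires.")
db.add("tire", store, related=["tyre", "wheel"])

records, related = db.query("tire")
db.select(records[0])
db.visit(records[0])
print(dict(store.tags), dict(db.keyword_hits), related)
```

Because the counters live inside the queried data, no separate log analysis
is needed to read off presence, preference and click-through for each store.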
In both cases the system has to activate different counters for the
wrong or non-existent keyword in order to account for the popularity of this
specific mismatch. If the popularity is high, it is a warning signal to the
site’s Chief Editor about the potential acceptance of the keyword, either as a
synonym or as a related keyword. At the same time, the system may urge a
search for additional data within the black region. From time to time the
system could suggest a review of the I-URL summaries database in order to
assign data to the new keywords as well.
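A minimal sketch of this mismatch accounting follows. It counts every keyword
that returns no data and reports those whose popularity reaches a threshold.
The class name, the threshold value and the sample queries are assumptions,
and the sketch does not attempt to tell a misspelled keyword apart from a
genuinely missing resource.

```python
from collections import Counter

class MismatchLog:
    def __init__(self, known_keywords, alert_threshold=3):
        self.known = {k.lower() for k in known_keywords}
        self.misses = Counter()            # unmatched keyword -> popularity
        self.alert_threshold = alert_threshold

    def record_query(self, keyword):
        """Return True if the keyword matched; otherwise count the miss."""
        keyword = keyword.strip().lower()
        if keyword in self.known:
            return True
        self.misses[keyword] += 1
        return False

    def candidates(self):
        """Unmatched keywords popular enough to warn the Chief Editor about."""
        return [
            keyword
            for keyword, count in self.misses.most_common()
            if count >= self.alert_threshold
        ]

# Hypothetical queries against a site that only knows "tires" and "rims".
log = MismatchLog(["tires", "rims"])
for query in ["tyres", "tires", "Tyres", "wheels", "tyres "]:
    log.record_query(query)

print(log.candidates())
```

Here "tyres" reaches the threshold and would be proposed as a synonym or
related keyword, while the single query for "wheels" stays below it.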
As part of the intelligent feature we consider registering the
IP of the users’ interactions and the sequence of their queries, normally
related to something not found. The users’ keyword strings are in turn related
to specific subjects within the Major Subject of the site. So, statistically,
the analysis of keyword strings tells us about the popularity of the current
menu items and suggests new items to be considered.
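As a rough illustration of this last step, the sketch below groups logged
queries by IP into keyword strings and then counts how many distinct users
asked for each keyword. The log format, the function names and the IP
addresses (taken from the range reserved for documentation) are assumptions,
not part of the system described.

```python
from collections import Counter, defaultdict

def sessions_by_ip(query_log):
    """Group (ip, keyword) events into one ordered keyword string per IP."""
    sessions = defaultdict(list)
    for ip, keyword in query_log:
        sessions[ip].append(keyword)
    return dict(sessions)

def keyword_popularity(sessions):
    """Count each keyword once per user, however often that user repeats it."""
    counts = Counter()
    for keywords in sessions.values():
        counts.update(set(keywords))
    return counts

# Hypothetical log of (ip, keyword) pairs in arrival order.
query_log = [
    ("192.0.2.1", "tires"),
    ("192.0.2.1", "tyres"),
    ("192.0.2.1", "tires"),
    ("192.0.2.2", "tires"),
    ("192.0.2.2", "rims"),
]

sessions = sessions_by_ip(query_log)
popularity = keyword_popularity(sessions)
print(sessions)
print(popularity.most_common())
```

Keywords that keep appearing together in the same user's string, such as
"tires" and "tyres" here, are natural candidates for a shared menu item or a
Thesaurus entry.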