FIRST, Full Information Retrieval System Thesaurus Methodology
Juan Chamero, from Intelligent Agents Internet Corp, Miami USA, August 2001



Abstract


FIRST, Full Information Retrieval System Thesaurus, is a methodology to create evolutionary HKM's, Maps of the Human Knowledge hosted on the Web. FIRST points toward an acceptable "kernel" of the HK, estimated at nearly 500,000 basic documents selected from an exponentially growing universe that doubles in size yearly and currently holds nearly 1,400 million sites. There are many laudable and enormous scientific efforts aimed at building an accurate taxonomy of the Web and at defining that kernel precisely. At the moment, the only tools we as users have to locate knowledge on the Web are the search engines and directories, which deliver answer lists ranging from hundreds to millions of documents, with the supposed "authorities" hidden in a rather chaotic distribution within those lists. That means exhausting searching processes with thousands of "clicks" in order to locate something valuable, let's say an authority.

FIRST creates evolutionary search engines that deliver reasonably good answers with only one click from the beginning. We use reasonable as a synonym of mediocre because the first kernel is only a mediocre solution, thenceforth to be optimized via its interactions with users. FIRST could also be considered an Expert System able to learn mainly from the mismatches in those interactions. So the initially generated FIRST kernels could be considered mediocre one-click solutions, for a given culture and a given language, but able to learn and converge toward a consensual kernel. To accomplish that, the only thing FIRST kernels need is interaction with users. The better the users represent the whole, the more the kernel will tend to represent the knowledge of that whole. For that reason, we imagine a network of HKM's implemented via FIRST or some other equivalent evolutionary tool. As each node of this semantic network will serve a given population (or market), we could easily implement something like a DIAN, Distributed Intelligent Agents Network, to coordinate the efforts made by each local staff of Intelligent Agents (coopbots). Each node will have a kernel at a different stage of evolution, depending on its age, measured in interactivity, and on its population profile.

The main differentiation of FIRST from most present knowledge classification and representation projects rests on the hybrid procedure used to build the mediocre starting solution: a staff of human experts aided by Intelligent Agents and IR algorithms. The reason for this approach is the current "state of the art" of Artificial Intelligence, AI. The best current robots are unable to accurately detect general authorities and are easy to deceive, unfortunately, by the millions of document owners who, either unethically or out of ignorance, try to present their sites as authorities. Another flaw is the primitiveness of even the most advanced robots, which are unable to edit comprehensible syntheses of sites. Human beings, on the other hand, are extremely good at those tasks, by far more accurate and more efficient.

The map itself consists of i-URL's, Intelligent URL's: brief documents, from half a page to two pages, describing the referenced sites like pieces of tutorials, classified along a set of taxonomy variables and tagged with a set of Intelligent Tags, some of them used to manage and track their evolutionary process. For each Major Subject of the HK, a Tutorial, a Thesaurus, a Semantic Network and a Logical Tree are provided and bound to the virtual evolutionary process of users playing a sort of "knowledge game" against the kernel.

FIRST is presented here within the context of the IR-AI "state of the art". The methodology has been tested to build an HKM in 120 days. Time is a very important engineering factor due to the explosive expansion of the Web and its inherent high volatility. The task performed by the human experts staff is similar to that of providing a Knowledge Expert System with the basic knowledge to "play" a Game of Knowledge reasonably well against average Web users. It resembles the beginnings of Deep Blue, which beat Kasparov: initially it had to be able to beat, not a master, but at least a second-category chess player (with a reasonably good ELO rating), and from there follow the evolutionary path through the levels above: first category, master, international master, grandmaster, championship.


Content Index

1- The Future of Cyberspace – The Noosphere
2- About a New Approach to Internet Communications
3- FIRST, Full Information Retrieval System Thesaurus
4- i-URL's and Intelligent Databases
5- Evolutionary Process - Some Program Analysis Considerations
6- Noosphere Mechanics – Evolutionary Sequence
7- An Approach to Website Taxonomy
8- FIRST within the vast world of AI – IR

1- The Future of Cyberspace
The Web space and the Noosphere

Introduction

You could find 30,136 pages dealing with "noosphere" in AltaVista at 2:22 PM Eastern Time (USA and Canada) on Thursday, April 12th, 2001. This is a rather strange word for many people, one that does not yet deserve an entry in the Merriam-Webster online dictionary. However, we know, use and enjoy Cyberspace, a concept that at nearly the same time deserved as many as 777,290 entries in the same AltaVista and that, on the contrary, has had an entry in Merriam-Webster since 1986, with the following meaning: the online world of computer networks. Web space is another neologism not yet included in that dictionary, but it deserves 485,805 entries in AltaVista.

The Web space grows at a fantastic pace, holding today nearly one and a half billion documents, ranging from Virtual Libraries and virtual reference e-books dealing with the Major Subjects of the human knowledge to ephemeral news and trivial virtual flyers generated "on the fly" at any moment, continuously. We may find on the Web documents belonging to any of the three major Internet resources or categories: Information, Knowledge and Entertainment.


The Web space Regions


In the figure above, the black crown represents the Web space and the green circle the users. The gray crown represents an intermediate net to be built in the near future with intelligent summaries of the Human Knowledge, pointing to the Web's basic documents and e-books. One user is shown extracting a "cone" of what he/she needs in terms of information and knowledge. The intelligent summaries must be engineered to be good enough as introductory guides/tutorials, with a set of essential hyperlinks inside. If the user wants more detail, he/she then goes directly to the right sources within the black region. Depending on the Major Subject dealt with, the user may go from summary to summary or jump to higher-level guides inside the gray region, going to the black region only to look for specific themes. Moreover, many users will be satisfied browsing within the gray region without even venturing into the black region.


Another user goes directly to the black region guided, as now, by classical search engines. The black region will always be necessary and its size will grow fast as time passes. On the contrary, the gray region will fluctuate around a medium volume, growing at a relatively very low rhythm. Effectively, the Human Knowledge "kernel" of basic documents is almost bounded, changing its content but always revolving around the same set of Major Subjects. The growth of the gray region is extremely low in comparison with that of the black region. Some Major Subjects die and others are born over time, but slowly.


Region Volume Estimations
For more Web sizing information, see Chapter 8, The Vast World of AI-IR

As a science-fiction exercise, we invite you to make some calculations resembling Isaac Asimov's stories and Carl Sagan's speculations. If the actual Human Knowledge is bounded to, let's say, 250 Major Subjects or Disciplines, and if for each of them we define a Virtual Library with 2,000 non-redundant e-books on average, we will have a volume of 500,000 e-books. Now we could design a methodology to synthesize an intelligent text summary of each e-book in no more than 2,000 characters on average, totaling 1,000 MB, or 1 GB, storing one character in a single byte. That would be the volume of the gray region! Not too much, really.

Let's then compare this volume with the volume of the black region and with the volume of the resources of the Human Knowledge. Once upon a time, there was a Web space with one and a half billion documents with an average volume estimated at 2.5 MB (we have documents ranging from 10 KB and less to 100 MB and more; to get that figure we supposed the following arbitrary size series, in KB: 1, 10, 100, 1,000, 10,000, 100,000, and we assigned to each term the following arbitrary weights: .64, .32, .16, .08, .004, .002 respectively). Then we have a volume of nearly 3,750,000,000 MB! Within that giant space float, dispersed, the basic e-books, the resources of the Human Knowledge, with an estimated volume of nearly 500,000 MB, assigning 1 MB to each one: half a million characters of text and 100 images of 5 KB each, on average.

Black Region: ~3,750,000 GB => HK ~ 500 GB => Grey Region ~ 1 GB
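
For illustration, the arithmetic behind these estimates can be reproduced from the figures stated above; the sketch below uses only those figures, and the variable names are ours.

    # Back-of-envelope sizing of the three regions, using the figures stated above.
    MB = 1                      # work in megabytes
    GB = 1000 * MB              # decimal units, as in the text

    web_documents  = 1_500_000_000       # black region: ~1.5 billion documents
    avg_doc_size   = 2.5 * MB            # assumed average document size

    major_subjects = 250                 # assumed bound of the Human Knowledge
    ebooks_per_ms  = 2000                # non-redundant e-books per Major Subject
    avg_ebook_size = 1 * MB              # ~0.5 MB of text plus 100 images of 5 KB

    summary_chars  = 2000                # intelligent summary, 1 byte per character
    summary_size   = summary_chars / 1_000_000 * MB

    black_region = web_documents * avg_doc_size
    hk_kernel    = major_subjects * ebooks_per_ms * avg_ebook_size
    gray_region  = major_subjects * ebooks_per_ms * summary_size

    print(f"Black region : ~{black_region / GB:,.0f} GB")   # ~3,750,000 GB
    print(f"HK kernel    : ~{hk_kernel / GB:,.0f} GB")      # ~500 GB
    print(f"Gray region  : ~{gray_region / GB:,.0f} GB")    # ~1 GB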

An incredible result that demonstrates how easy it will be to compile a rather stable HKIS, Human Knowledge Intelligent Summary, relative to the unstable, noisy, bubbling, fizzy and ever-growing black region. Once the effort is done, upgrades will be facilitated via Expert Systems and a set of specialized Intelligent Agents that will detect and extract from the black region only the "necessary" changes.


The Web space looks like the Sky at night

In the figure above we depict the actual Web space in black, resembling the physical space of the Universe. No doubt the information we need as users is up there, but where? That virtual space is really almost black for us. Some members of the Cyberspace that provide searching services, titled Search Engines and/or World Wide Web Directories, are like stars that irradiate light all over the space to make sites indirectly visible. Sometimes we may find quite a few sites with their own light, like stars, activated by publicity in conventional media, but the rest are only illuminated by those services at users' request. Let's dig a little deeper into the nature of this singular searching process.

For each resource (body) located in the Web space at a URL, which stands for Uniform Resource Locator, the robots of those lighting services prepare a brief summary with some information extracted from it, no more than a paragraph, and then all the collected information goes to their databases. The summaries have some keywords attached, extracted from the visited resources, and consequently they are indexed under as many keywords as they have attached.
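
A minimal sketch of the kind of keyword indexing just described, assuming a toy summary store and an inverted index; the URLs, summaries and keywords are invented.

    # Toy model of a search-engine database: per-URL summaries plus an inverted
    # keyword index, as described above. Data and names are illustrative.
    from collections import defaultdict

    summaries = {}                      # URL -> brief robot-made summary
    index = defaultdict(set)            # keyword -> set of URLs

    def register_resource(url, summary, keywords):
        """Store the robot summary and index the URL under each extracted keyword."""
        summaries[url] = summary
        for kw in keywords:
            index[kw.lower()].add(url)

    def lookup(keyword):
        """Return the summaries of every URL indexed under the keyword."""
        return {url: summaries[url] for url in index.get(keyword.lower(), set())}

    register_resource("http://example.org/climate-report",
                      "Annual report on climate statistics.",
                      ["climate", "statistics", "report"])
    print(lookup("climate"))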


How the Search Engines illuminate the Resources

The actual robots are very "clever" but extremely primitive compared to human beings. They do their best, and they have to perform their work fast, in fractions of a millisecond per resource, so it would be impractical to be more sophisticated, because the time of "evaluation" grows exponentially with the level of cleverness. To facilitate the robots' work, Website programmers and developers have wise tools at hand, but many of them overuse those facilities so badly as to make them unwise. In fact, with those tools the programmers can communicate to the robots the essential information the site owners wish to be known about the site.

Those wise gateways are now noisy because most people try to deceive the robots, overselling what should be the essential information. Why do they do that? Because the Search Engines must present the sites listed hierarchically, the first the best! Something similar occurs in the Classified Section of newspapers: people wishing to be listed first unethically make nonsensical use of the first letter of the alphabet, so that AAAAAAA Home Services goes before, for instance, AA Home Services. The Search Engines do not have much room to design a "fair" methodology to rank the sites with equity, and besides, the Internet is a non-policed realm.

One trivial criterion would be to count how many times a keyword is cited within the resource, but that proved to be misleading because the robots only browse the resource partially, making it practically impossible to differentiate a sound academic treatise from a student's homework on the same subject. To make things worse, programmers, developers and content experts know all those tricks and consequently overuse the keywords they believe are significant.

The Search Engines have improved greatly over the last two years, but the searching process continues to be highly inefficient and tends to collapse. To help site owners gain positions within the lists (in fact, to get more light), ethical and unethical techniques and programs proliferate, most of them apt to deceive the "enemy", namely the Search Engines. Even in a "bona fide" utopia it is impossible for a robot to differentiate between a complex site and a humble site dealing with the same subject. Complex site architectures could even make sites invisible to them, because robots are only well suited to evaluate flat and simple sites. Moreover, search engines like Google also need to break even commercially and start selling pseudo forms of score enforcement to desperate site owners that need traffic to subsist.

We emphasize again that the "light" a Search Engine provides to each URL is indirect, as the Moon reflects the Sun's light. Our conclusion, then, is that most of the information and the knowledge is hidden in the darkness of the Cyberspace.


The Cyberspace as a Global Market
The Matchmaking Realm

Now that we know the meaning of HK, Human Knowledge, we may define HKIS, the Human Knowledge Intelligent Summaries, a set of summaries (we will soon explain why we call them intelligent), and NHKIS, a Network of Human Knowledge Intelligent Summaries, corresponding to the gray crown of the figures above. Now we are going to enter into the problem of the languages and jargons spoken in the Black Region, in the Gray Region and, mainly, in the Green Region.


Websites are built to match users
Internet the Realm of Mismatch

Websites are built to match users; they are like lighthouses in the darkness, broadcasting information, knowledge and, in the case of e-Commerce, some kind of attracting information presented as "opportunities". What really happens is that at present the Internet is more the Realm of Mismatch than of Matching. The lighthouse owners cannot find the users, and the users can neither find the alleged opportunities nor understand the broadcast messages. This mismatching scenario is dramatic in the case of Portals, huge lighthouses created to attract as many people as possible via general-interest "attractions".

Something similar occurs with the databases where millions of units of supposedly useful information are stored, such as catalogs, services, manufacturers, professionals, job opportunities, commercial firms, etc.: users cannot find what they need. When we talk of mismatch we mean figures well over 95%, and in some databases matching efficiencies lower than 0.1%.

In the figure above we depicted this dramatic mismatch. The yellow point is a Website with its offer represented by the cone emerging from it, let's say the Offer expressed in its language and its particular jargon. A black point within the green circle represents a user, and the cone emerging from it his/her Demand, also expressed in his/her language and particular jargon.

Mismatch reasons
Websites and users speak and think differently

What we discovered is that both sides speak approximately the same language but surely different jargons and, more than that, they think differently! We have depicted the gray crown because the portion corresponding to its Major Subject virtually exists: that is the portion in dark gray within its cone. Websites have the "truth" expressed in their particular jargon, and sometimes in the "official" and standard jargon. If the Website were, for instance, a "Vertical" of the Chemical Industry, its jargon would then be within the Chemical Industry standards and its menu should be expressed in technically correct terms, resembling the Index of a Manual for that particular Major Subject: Chemical Industry.

So the conclusion of a research effort carried out over two years studying the mismatch causes was that the lighthouses speak, or intend to speak, official jargons, certified by the establishment of their particular Major Subjects. They are supposed to have the truth and they think as "teachers", expressing their truth in their menus, which are in fact "logical trees". They may claim to be e-books, and they behave, think and look pretty much the same as physical books.

Now let's analyze how the users act, express themselves and behave. If a user visits the site to learn, the convergence of the cones is forced: the user is obliged to think in terms of the concepts of the menu, which for him/her resembles a program of study, and we have a match scenario. If the user visits the site to search for something, that's different. When one goes to search for something, one tends to think in terms of keywords instead, keywords that belong to one's own jargon and, at large, to one's own Thesaurus. So, either by ignorance or, on the contrary, by being an expert, the users' cones diverge substantially from the site's cone. One of the main reasons for this divergence is that the site owners ignore what their market targets need. Many of them are migrating from conventional businesses to e-Commerce approaches and extrapolate their market know-how as is. They worked hard for decades to match their markets and to establish agreed jargons, and now they have to face unknown users coming virtually from all over the world.


The solution

Evidently the solution will be the evolution from mismatch to match in the most efficient way. To accomplish that, both the Offer and the Demand have to approximate each other until both share a win-win scenario and a common jargon.

In the figure above we depict a mismatch condition where we might distinguish three zones: the red zone represents the idle and/or useless Knowledge; the gray zone corresponds to the common section with an agreed Thesaurus concordance; and the blue zone corresponds to what the users need and want and apparently does not exist within the site. So the site owners and administrators have three lines of action: a) reduce the red zones to zero, for instance adapting and/or eliminating supposed "attractions"; b) learn as much as possible about the blue zone; and c) combine both strategies.

At this moment the dark green zones are extremely tiny, less than 5%, the Internet being the Realm of Mismatch between Users' Demand and Sites' Offer. The big efforts to be made consist in minimizing costs by eliminating useless attractions and in learning from unsatisfied users' needs. To accomplish both purposes the site owners need intelligent tools, agents that warn them about red and blue events.


What does Intelligent mean

Let's analyze the basic process of user-Internet interactions. A user visits a site to interact in one of three forms, sometimes concurrently: investing time, clicking a link, or filling a form or a box with some text, for instance to make a query to a database. Site statistics are well prepared to account for clicks, telling what "paths" were browsed by each user, but they are not well suited to account for interactions derived from text. Of course, you may record the queries and even the answers, but that's not enough to learn from mismatching. To accomplish that we may create programs and/or intelligent agents that account for the different uses of the components of each answer, but they then have to do a rather heavy accounting.

If we query a commercial database for tires, the answer will be a list of tire stores; to have statistics about how frequently users ask for this specific keyword we need to account for it; to know about the "presence" of each store as a potential seller we need to account for it; and if we want to know about the popularity of each store we need to go farther, accounting for it, and so forth. That accounting process involves a terrific burden, even when done on the site server's side.

An intelligent approach would be to have all the counters needed to detect document popularity and users' behavior built into the data to be queried. That's the beginning of the idea: to provide a set of counters within the data to be queried by users, one for each type of statistic. So when a data item is requested, a counter is activated accounting for the presence; when it is selected by a click, another counter is activated; and when the user, after reading the "intelligent summary" received, decides to click through to the original site or to one of its inner hyperlinks, yet another counter is activated.


Here is represented a typical track of user-site interaction. The user makes a query for "tires". The Intelligent Database answers, sending all the data it has indexed under tire and adding the list of synonyms and related keywords it has for tire. Each activated i-URL accounts for its presence in that answer by adding one to the corresponding counter in the i-Tags zone. If the user clicks a specific i-URL, the system presents it to the user, accounting for this preference in another counter of the i-Tags zone.

Finally, if the user decides to access the commented site located in the black crown, he/she makes a click and another counter is activated within the i-Tags zone. At the same time the counter corresponding to the keyword tire is incremented by one, and the same happens if the user activates some synonym or related keyword. If the answer contains zero data, it means a mismatch, due to either an error or a non-existent resource within the database. In both cases the system has to activate different counters for the wrong or non-existent keyword in order to account for the popularity of this specific mismatch. If the popularity is high, it is a warning signal to the site Chief Editor (either human or virtual) about the potential acceptance of the keyword, either as a synonym or as a related keyword. At the same time, the system may urge a search for additional data within the black region. From time to time the system could suggest a revision of the i-URL summaries database in order to assign data to the new keywords as well. We will see how to work with a network of these Expert Systems at different stages of evolution.
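
The counter mechanics just described could be sketched as follows; the field names (presence, clicks, follow_ups) and the missed-keyword log are illustrative assumptions, not part of the FIRST specification.

    # Sketch of i-URL records carrying their own i-Tags counters, plus the
    # mismatch accounting for unknown keywords. Field names are illustrative.
    from collections import defaultdict

    class IUrl:
        def __init__(self, url, brief):
            self.url = url
            self.brief = brief
            self.presence = 0      # times shown in an answer list
            self.clicks = 0        # times its brief was opened
            self.follow_ups = 0    # times the user went on to the original site

    keyword_index = defaultdict(list)   # keyword -> list of IUrl
    keyword_hits = defaultdict(int)     # popularity of each known keyword
    missed_keywords = defaultdict(int)  # popularity of each unknown keyword

    def query(keyword):
        hits = keyword_index.get(keyword)
        if not hits:                        # mismatch: reported to the Chief Editor later
            missed_keywords[keyword] += 1
            return []
        keyword_hits[keyword] += 1
        for iurl in hits:                   # presence counter for every answer
            iurl.presence += 1
        return hits

    def open_brief(iurl):
        iurl.clicks += 1

    def follow_link(iurl):
        iurl.follow_ups += 1

    tire_store = IUrl("http://example.com/tires", "Tire store directory.")
    keyword_index["tire"].append(tire_store)
    for iurl in query("tire"):
        open_brief(iurl)
        follow_link(iurl)
    print(tire_store.presence, tire_store.clicks, tire_store.follow_ups, dict(keyword_hits))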

Among the intelligent features we consider registering the IP of the users' interactions and the sequence of queries, normally related to something not found. The users' keyword strings are in their turn related to specific subjects within the Major Subject of the site. So, statistically, the analysis of keyword strings tells us about the popularity of the actual menu items and suggests new items to be considered.


Some examples of the actual general search inefficiency

Let's try to search for something apparently trivial like "Internet statistics", for instance using one of the best search engines, Google: more than 1,500,000 sites! Do not dip too far into that list; only check what the first 20 or 30 sites offer. Most of the content shown by the sites of that sample is obsolete and, when updated, you are harassed by a myriad of sales offers for particular statistics, market research studies and the like, priced in the thousands and up. And if this scenario occurs with supposed authorities (Library of Congress, Cyberatlas, About.com statistics sites, Internet Index, Data Quest, InternetStats), what then with the remaining 1,500,000?

What if that noisy cluster were replaced by a brief comment made by a statistician, telling the state of the art of Internet Statistics and suggesting alternative ways to compile statistics from free, updated authorities that surely exist on the Web? That is very easy to do and economical as well; it should take no more than one hour of that specialist's time. Of course, that would be feasible as a permanent solution if the cost of updating that kind of report were relatively insignificant. Concerning this problem, we estimated that the global cost of updating a given HKM is on the order of 3% to 5% per annum of the cost of its creation. So the HKM's will be updated in two ways: evolutionarily, through their interaction with users, and authoritatively, by human experts' updates.

Let's see other examples with "sex" and "games". Sex has more than 48,000,000 sites, and it is well known that the sources of sexual and pornographic content are fewer than 100. The rest are speculators, repetitions, transfers and commuting sites of only one click per user, playing the ingenuous role of useful idiots. Something similar occurs with games, with more than 35,000,000 sites, and again the world providers of game machines, solutions and software number no more than 100!


Human Knowledge Shells

For a given culture and a given moment we have the following regions in the Web space:



Red: a given HKM
Black Blue: HK Virtual Library
Regular Navy Blue: Ideal HK
Blue: Ideal HK plus New Research
Light Blue: Ideal HK plus NR plus Knowledge Movements
Deep Light Blue: Ideal HK plus NR plus KM plus Information


Everything works within an expanding universe of Human Intellectual Activity. It takes a great deal of time and effort for new ideas and concepts to become part of the Ideal HK. We as humans have two kinds of memory, semantic and episodic, and any culture at a given moment has its semantic memory, conscious and unconscious, intuitive and rational, as well as its episodic memory.

Throughout human history the dominant cultures have controlled the inflow of the Human Intellectual Activity in explicit and implicit ways, for instance by discouraging dissension. The Internet allows us as users to dissent from any form of "established" HK and to influence, on an equal basis, the allegedly ideal HK. This feature will accelerate in an unprecedented way the enrichment of the ideal HK. For that reason we emphasize in FIRST the mismatch between the HKM and users' thoughts, questions and expectations, oriented to satisfy users, that is, the human being as a whole and as a unit.
 


2- About a New Approach to Internet Communications
Linguistic Approach


Internet is a very particular net

We make specific reference to Internet Data Management because the "Big Net" differs substantially from most nets. The Internet deals with all possible groups of people and all possible groups of interest. Internet users belong to all possible markets, from kids to old people, in all possible economic, social and political levels and cultures. This universality makes Internet man-machine interactions extremely varied.

On the contrary, in any other network we may define a "jargon", an ethic and rules. When we build a new Internet Website we really do not know who our potential users will be, and consequently what they want and what they need; we even ignore their jargons. We imagine a target market and for that specific market we design the site content, in fact, the "Information Offer" to that market.



The figure above depicts the matchmaking process within the Internet "noosphere". The users, in green, express what they want and even think in terms of "keywords", expressed in their own jargon; they are open and flexible. On the contrary, the Website owners, through their sites, believe they have the truth, the whole truth and nothing but the truth. In that sense, whether authorities or not, they resemble "The Law" of the establishment of the Human Knowledge. The law, for each Major Subject, is expressed in Indexes of the main branches of that Major Subject, resembling a "Logical Tree", depicted in gray over the yellow truth. They imagine their sites as universal facilitators, but always following the pattern of the logical tree and expressed in their jargons.

The Websites have their own Thesaurus, a set of "official" keywords, depicted in white over a black background, within the darkness of the Web space. Between the logical tree and the Thesaurus there exists a correspondence. The Website owners are shown with the Truth Staff in yellow. The user-Internet interactions are depicted as a progressive matchmaking process, going from green to black and vice versa, each side learning from the other's matches and mismatches. Both sides strive to know each other, interchanging knowledge.


Information Offer versus Information Demand

Paradoxically, even though the Web is so well suited to add, generate and manage intelligence, most people ignore this fantastic possibility. If we define our Information Offer as WOO, which stands for What Owners Offer, and what the users want as WUW, which stands for What Users Want, the Web architecture permits the continuous match between them and, as a byproduct, the intelligence emerging from any mismatch.

That possibility means the following: WUW is what users want expressed in their specific jargon/s, while WOO is the Website information offer expressed in, let's say, the "official/legal" jargon, the one we choose to communicate with our target market. The continuous matching of WOO against WUW would permit us to know the following five crucial things:


The knowledge of the market jargon/s permits us to optimize our offer: for instance, a negative answer to a user query could mean either that we don't have what he/she wants or that the name of what he/she is looking for in his/her jargon differs in our jargon.


What people need

What we know directly from users' queries is what they want, not what they need. The difference between WUW and WUN, What Users Need, is substantial. People generally know what they need but adjust their needs to the supposed or alleged Website capabilities. We learn what our users need as time passes if we make use of the intelligence byproducts and/or of surveys.


How the Information Offer is normally organized, and how the Queries are

The IO is normally presented as ordered sets in the form of Catalogs, Indexed Lists and Indexes, but the queries, where the users express their particular needs (WUN), are expressed by keywords. The two communication systems are completely different, even though they can complement each other, and we could make them work together toward the ideal match between WUN and WOO.

As we will soon see, the users communicate with the different Websites via their subjective jargons, at least as many jargons as the MS, "major subjects", they are interested in. For instance, if I am an entrepreneur who manufactures sport car wheels, I am going to query B2B sites to look for subjects related to sport car wheels, expressing myself in "my" jargon, which differs from the "official" jargons used in the B2B sites, and of course the query outcomes will strongly depend on the differences between the jargons.


Jargons Evolution

In a similar way to how official languages change from time to time, influenced at large by the pressure of people's jargons, both coexisting at any time, we may endow the Websites of the Cyberspace with an extremely efficient evolutionary feature via Expert Systems that learn from the man-Internet interactions. We dare to qualify this feature as extremely efficient because in the Cyberspace every transaction can be easily and precisely accounted for. So, each time a user uses a keyword belonging to his/her jargon, this event can and should be accounted for.

Let's then imagine what kind of intelligent byproduct we could extract from this simple but astonishing feature. Within a homogeneous market the keywords tend to be the same among its members. So, in our last example, if the majority of users make queries asking for wheels and the word-product wheel does not exist in our database, a trivial byproduct takes the form of the following suggestion: add wheels to the database as soon as possible. On the other hand, if the word-product "ergaston" was never asked for over a considerable amount of time, another trivial message would be: take ergaston out of the database.
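
A minimal sketch of how such suggestions could be derived from the query counters; the thresholds, the counter layout and the sample data are assumptions of ours.

    # Derive trivial jargon-evolution suggestions from query counters.
    # Thresholds are arbitrary assumptions for illustration.

    def jargon_suggestions(known_hits, missed_hits, add_threshold=50, drop_threshold=0):
        """known_hits: {keyword: query count} for keywords present in the database.
           missed_hits: {keyword: query count} for keywords users asked for in vain."""
        suggestions = []
        for kw, count in missed_hits.items():
            if count >= add_threshold:
                suggestions.append(f"ADD '{kw}' to the database (asked {count} times, not found)")
        for kw, count in known_hits.items():
            if count <= drop_threshold:
                suggestions.append(f"DROP '{kw}' from the database (never asked for)")
        return suggestions

    print(jargon_suggestions({"tire": 420, "ergaston": 0}, {"wheels": 87}))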




The figure above depicts the evolution of the matchmaking process. In the beginning, the Website owners had the oval green-gray target, where one user is shown as a black dot. But that user really belongs to a users' affinity market, depicted as a dark green oval, with a cone of Internet interest that differs greatly from the ideal initial target. The Website owners need an intelligent process to shift towards the bigger potential dark-green market. With a yellow cone border we depict the final "stable" matchmaking.



3. FIRST, Full Information Retrieval System Thesaurus


The actual Information Retrieval process in the Cyberspace

The Cyberspace currently has about 1,500 million documents, ranging from reference to trivial, from true e-books dealing with the major subjects of the human knowledge to daily news and even minute-to-minute human interaction information, as in the case of Newsgroups, Chat and Forum "on the fly" page generation. This information mass grows continuously at an exponential rate, rather chaotically, as its production rate far exceeds the human capacity for filtering, qualifying and classifying it.

To help retrieve information from the Cyberspace we make use of Search Engines and Directories that are unable to attain WUN, What Users (We, the Humans) Need. From all that information mass the search engines offer us "summaries", telling what kind of information we could get at each location of the Cyberspace (the URL, Uniform Resource Locator). So for each URL we as users obtain its summary. Those summaries are normally written by the Search Engines' robots, which try to do their best extracting pieces of "intelligence" from each Cyberspace location.


In the figure we depict some sites within the darkness of the Cyberspace. We may find anything from huge sites storing millions of documents and hundreds of sections to tiny sites with a flat design storing a few pages. One Search Engine, shown as a yellow crown, sends its robots to visit existing sites from time to time, making a brief "robotic" summary of them. As we will soon see, those brief reports are noisy, deceiving the users (green circle).

The Search Engine assigns priorities, which act in turn as a measure of the site's magnitude (like the brightness of a star). As depicted, the priorities (the navy blue dots) have nothing to do with the real magnitude of the site (depicted as the white circle diameter). So the yellow crown is a severe distortion of the Web. These priorities, defined for the keyword set of a given site, resemble the "light" that illuminates it: a high priority means a powerful beam of light reflecting over the site, highlighting it to the users' sight.

The actual information provided by the search engines is as primitive as the map of the sky we had one thousand years ago. The robots only detect some keywords the site content has, equivalent to the chemical elements of the celestial bodies, but they tell us nothing about its structure, type of body and magnitude. Today we may have for each celestial body the following data:

Among many others: diameter, density, the spectral distribution of its constitutive elements, brightness, radiation and albedo. For each of these variables we have site equivalents that must be known in order to say that we have a comprehensible Cyberspace map. For instance, we need to know something that resembles magnitude, density, element distribution and brightness.

Since the bodies of this cultural and intellectual space (noosphere) are intellectual creatures, we need an intellectual summary of each of them, what is known as the abstract in essays and research papers. For instance, a site could be camouflaged to appear attractive by emphasizing the importance of a given element, let's say "climate", deceiving a robot into classifying it as a specialized climate site while in reality having no climate content at all. The same happens with information: Portals' news, for instance, are presented as content sites, which is true only for a specific type of information resource known as "news", with an extremely ephemeral life of hours. On the contrary, content on philosophy or mathematics is by far denser, heavier, with lives lasting centuries on average. So we could distinguish all kinds of bodies, from fizzy (news) to rocky (academic).

Another complementary source of information is the Databases hosted alongside the Websites, as huge stores of organized and structured data. The content and quality of these databases are normally a subjective "bona fide" declaration made by the Website owners. So far, for the users, the Cyberspace, particularly the Web Cyberspace, looks like a net of information resources with some "Indexes" to facilitate their retrieval task. Those robot-made indexes are too noisy, being practically useless. Below we attach a well-known graphic sample of this uselessness.




The figure depicts the finding of useful information (black spots) while navigating along a searching program



The main reasons for that uselessness

The main reasons are, among many others:


In brief, the shadows of content that search engines offer to the users have almost nothing to do with the real content of the Cyberspace, presenting a distorted vision of it. The problem is the contagious spread of this distortion, as long as the Website owners use that summary information as a "bona fide" vision of their world. As a corollary, the Internet today speaks the Website owners' jargons, pointing to a globally distorted vision of the real markets!


Uselessness Measure
The abysmal and chaotic mismatch enables forms of e-Commerce delinquency: when you as a user face such a finding, your first reaction could be to become suspicious of the declared content of the database. On the owners' side, they could allege that those mismatches are due to the linguistic ignorance of the users. Unfortunately, there is not yet anything like an official audit to detect deceptions, but we have found many databases that are really empty, betting on growth via user membership, with cynical declarations such as:

Come join us! We already have one million firms like yours!



Our Approach to this problem: FIRST

Our methodology started as an effort to solve some Internet drawbacks that Website owners and users experienced, mainly within the dot-com domain. Concerning that, our Systems Engineering background warned us, and we were aware, that the crisis was the "Internet answer as a system" to the wrong approaches of most Internet newcomers. At large, the Internet is a net of computers and servers obeying the rules of IT and Communications. What happened over the last two years within the dot-com domain should have been a sort of science fiction for traditional IT and Communications companies. But finally the waters will find their natural courses.

Along that reasoning we were confident that the solutions to some of the Internet drawbacks would be found within classical systems engineering wisdom. Within that wisdom were classical concepts like Information Retrieval Systems, Selective Dissemination of Information and Expert Systems. Firms like IBM have a long history on those milestones. As I can remember, KWIC (Keywords In Context), SDI (Selective Dissemination of Information), more recently the TAPER Web semantic methodology, and Deep Blue, which beat Kasparov, all run along these lines of research.

The first two were, respectively, a tool and a methodology to retrieve and disseminate information efficiently, taking into account the different "jargons" of the Information Offer and of the Information Demand, the latter belonging to the users' realm. That was a subtle differentiation that defies the passage of time. In fact, the Internet is, among many other things, an open World Market that tries to captivate as many people as possible, speaking different tongues and different jargons.

A jargon is a practical subset of a language used to communicate among people, for instance between buyers and sellers, but it takes many years to reach a tacit agreement concerning definitions. For instance, the equivalent of "tires" in Spanish could be neumáticos, gomas, cubiertas, ruedas or hules, with an agreement to consider only neumáticos as the formal equivalent of "tires" and the rest as synonyms.


Internet Drawbacks: Internet the realm of mismatch

The mismatch between offer and demand could be depicted as follows:

WOO ↔ WUN

Which stands for the match/mismatch between WOO, What Owners Offer, and WUN, What Users Need. The Internet will be commercially useful as long as WOO approaches the always-changing WUN as closely as possible.


Let's advance a little on the user side. We may differentiate among the following user satisfaction levels:

WUN > WUW > WUS > WUG > WUL

Where:




The solution in Theory

So, being submerged in the mismatch, we must learn as much as possible from it! Information Theory tells us that mismatching delivers by far more information about the "other side" than matching, in our case information about the markets. By studying the mismatch carefully we could attain a convergent solution to our mismatch problem as well.


In order to accomplish that aim we need systems that learn as much as possible from mismatching. With this idea in mind, the whole problem could be stated as follows:

If our first offer to the market is WOO_1, we must find a convergent process such that

WOO_1 - WUN_1 > WOO_2 - WUN_2 > ... > WOO_i - WUN_i > ...

where the inequalities converge to zero, exponentially if possible. That is what an Expert System does, provided we can find a reasonable first approach to the market needs, WOO_1, the first iteration of a continuous evolutionary process. We were talking about learning, but we still have to define what we are going to learn from. We are going to learn from the user-Website interactions.

Additionally, we must create a methodology and programs able to interpret what the (-) minus sign means in those inequalities and how we step up from iteration to iteration.
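
As an illustration only, one possible reading of the minus sign is the set of needed keywords not yet covered by the offer; in the sketch below (with invented keyword sets, and WUN held fixed for simplicity) the mismatch shrinks as the offer absorbs part of the unmet demand at each iteration.

    # One hedged interpretation of WOO_i - WUN_i: the keywords users need that the
    # offer does not yet cover. The update rule below is illustrative only.

    def mismatch(woo, wun):
        """Size of the unmet demand: needed keywords not present in the offer."""
        return len(wun - woo)

    def iterate(woo, wun, rounds=4):
        for i in range(1, rounds + 1):
            gap = wun - woo
            print(f"iteration {i}: |WOO_{i} - WUN_{i}| = {mismatch(woo, wun)}")
            # learn from mismatching: absorb part of the unmet keywords per round
            woo = woo | set(sorted(gap)[: max(1, len(gap) // 2)])
        return woo

    woo_1 = {"neumaticos", "financing", "insurance"}
    wun_1 = {"wheels", "tires", "rims", "financing", "leasing", "hubcaps"}
    iterate(woo_1, wun_1)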


Are the Search Engines really useless?

No, definitely not! The search engines are extremely useful, and this will remain so in the future. We are going to need search engines that cover the whole Cyberspace, as a virtual summary of the Noosphere (3), the World Sphere of the Human Knowledge. These World Summaries Databases will be, as now, the best Indexes of the Human Knowledge on the Internet, not appropriate for direct use by ordinary users but for Website Engineers and Architects.

Note 3: the sphere of human consciousness and mental activity, especially in regard to its influence on the biosphere and in relation to evolution.

For each major subject of the Human Knowledge we are going to need specialized Websites with almost 100% proprietary content, where ordinary users, looking for subjects within a given major subject, will be able to navigate in an "only one click, YGWYW, You Get What You Want" scenario. That is, they will find exactly what they are looking for in only one click of their mouse. To accomplish that, the Content Engineers must provide for each major subject a satisfactory initial information offer, WOO_1. And we have to ask ourselves: where are we going to get that initial content from? The answer is trivial: from the search engines' databases.

Once we implement this satisfactory initial offer, our FIRST methodology, via its Expert System, will start to learn from mismatching, adjusting the site offer to the user needs and querying the Search Engines' databases only by exception, when new content is needed. The exceptions are triggered by unsatisfied users' demand. We will see next how to create intelligent summaries and how we could obtain a progressive independence from the Search Engines.


WOO_1: Architecture and meaning of the first Virtual Library





4 - i-URLs


i-URL’s Databases

i-URL stands for an Intelligent Comment about a given Website located at the URL address. Everybody knows what those comments look like when delivered by search engines, and everybody knows how frequently useless they are!


The inefficiency of actual Search Engines and Directories

In fact, you may spend hours looking for something useful, even being an expert Web navigator. Some confidential estimates about this unfruitful and heavy task tell of efficiencies below 1:5,000, meaning that to find what we are looking for (the 1) we have to browse at least 5,000 of those comments, on average. Concerning databases, we talk about query efficiency, namely how many queries, on average, we have to perform in order to find exactly what we are looking for. That efficiency, as found in commercial databases (1), was extremely low: less than 0.1%!

That general inefficiency is one of the big problems the Internet has to overcome in the near future. We are not going to discuss here the reasons for this inefficiency, but only say that it is mainly due to the Website owners' lack of responsibility. Most people do not respect the netiquette, the Internet etiquette rules: lying, exaggerating their sites' worth, trying to deceive navigators and robots, in fact, trying to oversell themselves through their Websites.

To make things worse, the search engines oversimplify the process, adding their proprietary noise to the site owners' noise, resulting in a squared noisy medium, that is, a power-two noisy environment.


First Step: Valuable Comments – Virtual Libraries

A first step is then to build databases with professional and "true" comments. For a given major subject, for instance "women's health", the first milestone should be to have a credible documents database concerning that specific subject. In that case we have to ask ourselves: how many basic documents must that database have to deserve the title of "Virtual Library"? The exact answer is almost impossible to give, but we can talk about boundaries instead. When we talk about a library we mean a collection of books, and in this case we have to locate a sort of e-books, Websites resembling classical books. Turning then to define boundaries, we may talk about a library with a volume ranging from 2,000 to 4,000 books (2), and in our case of a Virtual Library, the location and clever summaries of an equal number of Websites resembling e-books. That's not too much indeed (3), talking now in terms of Cyberspace!

We then have to ask ourselves the next two questions: Can we find those kinds of e-books on the Web? Is it possible to select that specific library efficiently out of the Web? The answer is yes in both cases for most of the major subjects of our human activity.

Now our problem is bounded to locating those crucial Websites efficiently. However, once they are located, we have to face another problem: how to search fast and efficiently within a Virtual Library of, let's say, 3,000 Websites or e-books, complemented by an Auxiliary Library of 10,000 to 100,000 technical and scientific documents (Reviews, Journals, Proceedings, Communications, etc.).


Second Step: How to build Virtual Libraries

The problem could be stated as follows: how do we efficiently build an efficient Virtual Library? Let's face the second part first: how to build efficient Virtual Libraries? Let's suppose we have to design a Cancer Virtual Library (AltaVista found 3,709,165 pages as the search outcome for "cancer" at 6:00 PM on 03-07-01). Of course, in our Virtual Library we are not going to search among more than 3 million Websites but only among 3,000, but that number is still big enough in terms of searching time.

Let's imagine ourselves within a real library with those 3,000 books filling the space of three walls from ceiling to floor. If we are interested in finding all the literature available for a specific query, surely we are going to need some indexing system to locate all the books dealing, to some extent and depth, with the question posed, and to review them afterwards. Even with an adequate indexing system and a file of "intelligent summaries" of all the books, we will spend a couple of hours selecting the set of books supposedly covering the whole spectrum of the query.

Fine! We are getting to the point of discovering a better methodology to design an efficient e-library:

  1. Select the 1,000 to 3,000 basic e-books;
  2. Design an indexing system with an intelligent summary (i-Comment) of each e-book, depicting the main subjects dealt with within it.


The Thesaurus concept
Keywords versus subjects

The summaries must be true, objective and cover all matters dealt with within their corresponding e-books. To be true and objective we only need adequately trained professionals. To cover all the matters, the trained professionals must browse the whole e-book and know what "matters" means.

In that interpretation we introduce some subtle details derived from our searching experience. People really look for "keywords", that is, meaningful words and sequences of words that trigger our memory and our awareness. Many of these keywords become knowledge items within a hierarchy of concepts for a given major subject. The keywords are important for us depending on the circumstances, not on their hierarchical importance within a given major subject.

When an author makes the index of his book, he thinks in terms of rationality and, as a member of society, respects the established order. The index resembles a conventional, sequential, step-by-step recommended teaching and learning procedure. On the contrary, whoever is searching makes queries looking for what he/she needs as a function of the circumstances. The index resembles the Law.

So the Thesaurus that collects all the possible keywords of a given discipline is not a hierarchical logical "tree". Each keyword is generally associated with many others within the Thesaurus, as a transient closed system, and sometimes a bunch of them can be matched to specific items of a tree-like logical structure. The Thesaurus is the maximum possible order within the chaos of the circumstances.

The logical trees, all the indexes we could imagine, are only "statistical" and conventional rules at a given moment of knowledge. Knowledge, along its evolutionary process, takes the form of a subjective Thesaurus, because each person has his/her own Thesaurus for each major subject of his/her interest.
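
As an illustration of this distinction, here is a minimal sketch, with invented keywords and tree items, of a Thesaurus as an association graph that is only partially mapped onto a logical tree.

    # A Thesaurus as a keyword association graph, only partially mapped onto a
    # logical tree (index). All keywords and tree items are invented examples.

    thesaurus = {                       # keyword -> associated keywords
        "tire": {"wheel", "rim", "rubber", "pressure"},
        "wheel": {"tire", "rim", "hubcap"},
        "pressure": {"tire", "gauge"},
    }

    logical_tree = {                    # index item -> sub-items (the "Law")
        "Car maintenance": ["Wheels and tires", "Engine", "Brakes"],
    }

    tree_mapping = {                    # only some keywords match a tree item
        "tire": "Wheels and tires",
        "wheel": "Wheels and tires",
    }

    def related(keyword):
        """Circumstantial navigation: follow associations, not the hierarchy."""
        return thesaurus.get(keyword, set())

    def tree_item(keyword):
        """Return the index item the keyword maps to, if any."""
        return tree_mapping.get(keyword)

    print(related("tire"))        # {'wheel', 'rim', 'rubber', 'pressure'}
    print(tree_item("pressure"))  # None: in the Thesaurus but outside the tree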


How we combine the virtues of Thesaurus and Indexes
The Law and the Circumstances

Notwithstanding, we can make both concepts work together for the sake of searching efficiency. The logical structures are good as starting procedures, in the learning stages. Besides, as the trees come out of statistics, the use of a given Thesaurus could give rise to new and more up-to-date logical tree indexes via man-machine interaction along an evolutionary process. The indexes alone are too rigid and become obsolete easily.

Now we can enter the core of our new methodology to build Intelligent Virtual Libraries, the ones we title i-URL's, in the sense that each URL hosts a basic e-book, a crucial document, a hub, an authority.



What an i-URL looks like



In the figure above, the yellow dot represents a reference site for a given Major Subject of the Human Knowledge, let's say Personal Financing. The dark green dot within the green Users' region represents a set of users interested in that Major Subject, let's say the target market. Represented as a gray crown is the Map of the Human Knowledge, currently nonexistent. A group of people interested in capturing this potential market decides to build a reference site about it, let's say a Personal Financing Portal. So, first of all, they need something equivalent to the Personal Financing sector of the Virtual Library of e-books or reference sites actually existing on the Internet. To accomplish that, they proceed along the steps described at the beginning of this document.

The i-URL Septuplet: For each reference site they create an i-URL as an information septuplet, as follows:


The chicken-and-egg problem
How do we get the Initial Virtual Library


Fine! We have defined what an efficient Virtual Library about a specific major subject means. It is straightforwardly conceivable that this system works, but a problem still remains:

As we build Expert Systems that learn from the users' man-machine interactions, our main problem is then how to get our first i-URL Database, how to locate the first 3,000 e-books. Once this problem is solved, the Expert System will improve and tune up the Virtual Library along an evolutionary path.

This is a typical chicken-and-egg problem: what comes first, an initial Thesaurus or an initial Subjects Index? As each one brings about and positively feeds back the other, it does not matter how we start. For instance, we may start with an initial index provided by some expert as our seed. From this initial index we may select the first keywords to start our searching process or, on the contrary, we may start with an arbitrary collection of keywords as our seed, also provided by an expert. In any case we must behave as head-egg hunters trying to catch our first e-book, let's say the first full-content Website authority concerning our major subject.

This first candidate to become an e-book will provide us either a subject index or a tool to improve our initial Thesaurus. Surely within this Website we will find more reference links that will open our panorama, driving us to find better Websites, or complementary sites, or both. This is a sort of scientific artisan methodology, well suited to deepen our knowledge about something with no precise rules but general criteria. We will see that for all these tasks we design specialized Intelligent Agents that act as general utilities to make the process efficient.

One criterion is to try to fill all the items covered by the best index we have at hand at any moment. That is, we investigate each milestone e-book as much as we can, until the dominant items dealt with are fully covered, and then we continue looking for more e-books that cover the remaining items of the index, until full coverage has been attained.
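
The coverage criterion just described could be sketched roughly as follows; the functions that supply candidate e-books and report which index items each one covers stand in for the work of the infobots and the human experts, so they are placeholders.

    # Coverage-driven bootstrap of the initial Virtual Library: keep adding
    # milestone e-books until every item of the working index is covered.
    # fetch_candidates() and items_covered_by() are placeholders for the work
    # done by the infobots and the human experts.

    def bootstrap_library(index_items, fetch_candidates, items_covered_by):
        library = []
        uncovered = set(index_items)
        for candidate in fetch_candidates():        # e.g. links found in milestones
            if not uncovered:
                break                               # full coverage attained
            covered = items_covered_by(candidate) & uncovered
            if covered:                             # keep only e-books that add coverage
                library.append(candidate)
                uncovered -= covered
        return library, uncovered                   # uncovered items need more searching

    # Illustrative run with toy data
    index_items = {"prevention", "diagnosis", "treatment", "statistics"}
    candidates = [("site-A", {"prevention", "diagnosis"}),
                  ("site-B", {"diagnosis"}),
                  ("site-C", {"treatment", "statistics"})]
    lib, missing = bootstrap_library(
        index_items,
        lambda: (name for name, _ in candidates),
        lambda name: dict(candidates)[name])
    print(lib, missing)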

To accomplish this task we need searching experts with a high cultural level, trained to switch fast from intuitive to rational contexts and vice versa and, within rational tracks, able to switch fast between deductive and inductive processes as well.

First Round of Integration: Once a Thesaurus covering all the items of the first index has been built (this index has probably evolved along the search, with new items and amendments), we must begin the integration of the basic e-books, pivoting on the milestone e-books and complementing them with new searches using the most "popular" keywords (the ones that have more milestones indexed). The "exploration" of the milestones' neighborhood is accomplished at high speed via pure intuition, along a process we titled the "first round". To select Websites in this first round we follow a "new rich" criterion: if the Website looks nice for our purpose, we select it. To give some facts and figures, we are talking about 30 to 40 milestones, mainly authorities and hubs, and from each milestone we select 100 to 200 Websites, totaling 6,000 to 8,000 Websites as the outcome of the first round. This first round works over a first raw selection made via infobots that query and gather Websites taken from search engines, so the human experts really work over a rather small universe.

Second Round of Integration: Once this "redundant" Virtual Library has been built, we must tune it up, keeping the 1,500 to 2,000 sites best suited to our purposes; these will be the e-books collection of our initial Virtual Library. To select them we use a logical template, screening the most important Website attributes, such as: type, traffic, design, Internet niche, universality, bandwidth, depth, etc.

See in our section about Hints how we check the database completeness and redundancy via Intelligent Agents.

With this template we proceed to build our i-URL's, that is, the intelligent summaries of the e-books of our initial Virtual Library. We must emphasize here that the e-books remain at their original URL locations. The only data we record in our i-URL Database are the i-URL's.
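
A rough sketch of the second-round screening via the template just mentioned; the attribute weights, the scoring rule and the sample values below are assumptions of ours, not part of FIRST.

    # Second-round tune-up: score each first-round Website against a logical
    # template of attributes and keep the best 1,500 to 2,000. Weights and the
    # attribute values are illustrative assumptions.

    TEMPLATE_WEIGHTS = {          # attribute -> weight in the screening template
        "traffic": 0.30, "depth": 0.25, "design": 0.15,
        "universality": 0.15, "bandwidth": 0.15,
    }

    def template_score(site):
        """site: dict of attribute -> value normalized to 0..1."""
        return sum(w * site.get(attr, 0.0) for attr, w in TEMPLATE_WEIGHTS.items())

    def second_round(first_round_sites, keep=2000):
        """Keep the highest-scoring sites out of the redundant first-round library."""
        ranked = sorted(first_round_sites, key=template_score, reverse=True)
        return ranked[:keep]

    first_round = [
        {"url": "http://example.org/a", "traffic": 0.9, "depth": 0.8, "design": 0.6,
         "universality": 0.7, "bandwidth": 0.5},
        {"url": "http://example.org/b", "traffic": 0.2, "depth": 0.3, "design": 0.4,
         "universality": 0.1, "bandwidth": 0.9},
    ]
    for site in second_round(first_round, keep=1):
        print(site["url"], round(template_score(site), 2))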


Advantages of our i-Virtual Libraries
Versus non-intelligent Virtual Libraries and versus classical Search Engines

This is a rather sophisticated and heavy "only once" task, but the advantages are enormous compared with the use of the classical search engines (5):


Notes

Note 1: Along this line we made a joint research study with the Mexican university Instituto Tecnológico de Monterrey, analyzing e-Commerce database efficiency, with the following astonishing results: a group of students of the Systems Engineering program queried an Industrial Database with 200,000 Latin American firms. They were trained in how to search by keywords, for instance by product, and the positive matches were lower than 0.1%!

Note 2: We are talking about basic books. Of course this information basement must be complemented with thousands of technical and scientific publications as well.

Note 3: Considering only Web documents we are talking about one and a half billion documents, and we also have to consider the other Internet resources, such as newsgroups and the millions of "pages on the fly" generated in chats and forums.

Note 4: All our Expert Systems work under the control of a Virtual Integrator, which integrates the Expert System with all kinds of system extensions such as front-ends, back-ends, Intranets, Extranets, etc.

Note 5: To remedy search engine inefficiency, some sites decide to build proprietary content, that is, a collection of critical documents that try to answer a reasonable FAQ. This is extremely useful and necessary, and we recommend it, but it is not enough. Indeed, the sum of real knowledge dispersed through Cyberspace is so big for any Major Subject that any particular effort is like a drop of water in the ocean. Of course we may strategically design our "drop of water" in order to demonstrate that we are alive as referents and not mere passive Internet mediators.



5- Some Program Analysis Considerations
1- Thesaurus Evolution, Keywords Popularity and Something More


User Track Mechanics




The figure above depicts a typical user track. We may define in each track the following significant events:

Warning: We are talking about existing keywords, that is, the users query the HKM with keywords already present in it. Perhaps the most crucial event occurs whenever a non-existent keyword is queried, provided it is correctly written. In this case FIRST must investigate several things: a) test whether the keyword is non-existent within a specific subject but present in the HKM database for the queried Major Subject; b) test whether the keyword is non-existent in the HKM database for the queried Major Subject but could be present in some others; c) test whether it is absolutely outside the HKM.

See below the different groups of keywords. FIRST must analyze the existence or non-existence of not-recognized keywords for all those groups. The FIRST Chief Editor must carefully review these cases once they are properly reported.
 

Tracking “zoom”

We could improve our insight by going deeper into each incident, namely:

Over c: Once a couple [keyword, subject] is keyed and properly checked for all the programmed types of consistency, FIRST answers with a hierarchical list of either the selected i-URL's or their corresponding briefs. The latter procedure invites the user to mark the most appropriate one with a click. The user can even navigate within the same list, that is, within the same couple.

Over error: Eventually a user could get a wrong URL address (although this kind of error must be avoided as much as possible). The system must make the most of these opportunities, offering the user some alternatives: similar URL's (once it is checked that the link works properly!) and/or advice to consult related tutorials within the system. Independently, these events must trigger one of the searching intelligent agents, either to locate where the URL could have migrated (the most probable condition) or, in the extreme, to proceed to look for new documents. The potential documents to replace the lost one must be sent to the FIRST Chief Editor, who finally approves or disapproves the new document once the corresponding i-URL is edited. Once it is finally approved, the announcement of the new document must be emailed to the users who previously authorized the system to warn them.

Note: An internal clock measures the duration of each user session: once the user leaves to review something, the system waits a reasonable time and still counts the user as working within the same session. A user may change subjects along one session.

Possible strings are:

[k1, c, C, k2, k3, c, k4, c, C, leave] subject i
[k1, k2, c, c, c, k3, c, C, c, C, k4, k5, leave] subject j

In the first string, for subject i, the user decided to click on an URL after reviewing its i-URL, then returned, searched for k2 and k3 but just peeped without being interested in reading the lists of i-URL's provided, then tried k4, clicked on another URL, and finally left the system.

In the second string, for subject j, the user swept over k1 but reviewed the k2 list extensively, made two more searches with k3, then swept over k4 and k5 and finally left the system.

As our purpose is to keep only keyword strings, those strings could be summarized as follows:

[ k1 , k2 , k3 , k4 ] subject i

[k1 , k2 , k3 , k4 , k5 ] subject j

In the original figure each keyword in those summarized strings is colored from a cold color (blue, barely swept) to a very hot and active one (red, heavily used). For each session and for each subject the keyword strings are saved for statistical purposes. Statistics are made both by string, as they stand, and alphabetically.
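As a sketch only, and assuming the event codes used in the examples above (keywords; c for a click on an i-URL list entry; C for a click through to the external URL; leave to end the session), the summarization and a rough per-keyword "heat" count could look as follows. The function names and the heat measure are illustrative assumptions.

    from collections import Counter

    def summarize_track(track: list[str]) -> list[str]:
        """Reduce a user track to its keyword-only string."""
        return [event for event in track if event not in ("c", "C", "leave")]

    def keyword_heat(track: list[str]) -> Counter:
        """Count, per keyword, how many clicks followed it (a rough 'heat' measure)."""
        heat = Counter()
        current = None
        for event in track:
            if event in ("c", "C"):
                if current is not None:
                    heat[current] += 1
            elif event != "leave":
                current = event
                heat.setdefault(current, 0)
        return heat

    track_i = ["k1", "c", "C", "k2", "k3", "c", "k4", "c", "C", "leave"]
    print(summarize_track(track_i))   # ['k1', 'k2', 'k3', 'k4']
    print(keyword_heat(track_i))      # k1 and k4 come out hotter than k2 and k3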


Thesaurus Evolution Mechanics

All keyword and i-URL traffic is statistically analyzed from time to time. Let's see how the Thesaurus evolves. For each keyword we have at each moment two variables: its quantitative presence within the Logical Tree structure and its popularity. We may define within the Thesaurus the following groups:


a versus b, and their respective popularities, tell us how well the synonymies are designed
a versus c, and their respective popularities, tell us about some semantic irregularities
a versus d, and their respective popularities, tell us about searching patterns that must be deeply investigated

For instance, if in politics we detect a high popularity of peace and conversely a low popularity of war, it means that people are changing their attitude concerning the crucial problem of peace versus war. We may also investigate all the other possible combinations: b versus c, b versus d, and c versus d.
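As an illustrative sketch, assuming each keyword carries a popularity count and one of the group labels (a, b, c, d) referred to above, the pairwise group comparisons could be computed as follows; the record layout and the sample values are invented.

    from collections import defaultdict
    from itertools import combinations

    # keyword -> (group label, popularity count); the layout is an assumption.
    keywords = {
        "peace": ("a", 950),
        "war": ("a", 120),
        "armistice": ("b", 300),
    }

    def group_popularity(kw: dict[str, tuple[str, int]]) -> dict[str, int]:
        """Total popularity per keyword group."""
        totals = defaultdict(int)
        for group, count in kw.values():
            totals[group] += count
        return dict(totals)

    def pairwise_report(totals: dict[str, int]) -> None:
        """Print every 'x versus y' comparison between groups."""
        for x, y in combinations(sorted(totals), 2):
            print(f"{x} versus {y}: {totals[x]} vs {totals[y]}")

    pairwise_report(group_popularity(keywords))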


Analysis of some other types of user interactions

We may save all user logins and from time to time depersonalize them, defining common behaviors and common searching patterns. We are going to find all imaginable kinds of searching patterns, namely:
All these and many other categories are further divided into frequent and eventual users as well.


Other crucial events: users' feedback

A user could give feedback to FIRST in the following ways:

We may design the user interface to warn users when they are about to abandon the system, and to welcome them back when they return from C-type inspections.
Eventually, as we commented above, users could get a wrong address.


Path ↔ Keyword String Correspondences

We said that for each path of the initial logical tree of a given Major Subject of the HKM we define a string of keywords, where possible with priorities, let's say from left to right. After the three-stage procedure depicted in the FIRST white papers, we have an initial set of correspondences between paths and strings, both related to specific subjects under each Major Subject.
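A minimal sketch of such a correspondence table follows; the paths and keywords are invented purely for illustration.

    # Hypothetical path <-> keyword-string correspondences for one Major Subject.
    # Keywords are ordered left to right by priority, as described above.
    correspondences: dict[str, list[str]] = {
        "politics/international relations/peace studies": ["peace", "conflict resolution", "treaty"],
        "politics/international relations/war studies": ["war", "armed conflict", "strategy"],
    }

    def keywords_for_path(path: str) -> list[str]:
        """Return the prioritized keyword string bound to a logical-tree path."""
        return correspondences.get(path, [])

    def paths_for_keyword(keyword: str) -> list[str]:
        """Inverse lookup: all paths whose string contains the keyword."""
        return [p for p, ks in correspondences.items() if keyword in ks]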


After a measurable evolutionary change for a given users' market, the initial World Virtual Library of HK changes. In the figures above we depict such a change. Some documents (central figure) will be considered "useless" (light yellow regions) and some were added to the system, extracted from the HK as_it_should_be region (reddish regions). Finally, the third figure shows how the actual World Library of the HK and its related HKM will look. Topologically, for the next evolutionary step we consider the situation to be like the initial one, with a red circle within a larger yellow circle but leaving a smaller yellow crown.

However, if we do not change the logical tree and the Thesaurus accordingly, the procedure will fail: instead of making the red region converge to cover as much as possible of the yellow region, the procedure will enter a vicious circle.



6- Noosphere Mechanics


Red: The HKN model,
a Human Knowledge Network sample, a cultural model constituted by a set of selected Websites

Yellow: The HKN as_it_should_be,
depicting the whole Human Culture without dominant cultural biases

Orange: The pre HKN model,
the set of documents, articles and essays that establish the “formal” HK basement at a given time for a given culture

Blue: The opinions, thinking movements,
drafts, tests, communications that feed the orange crown

From Light Blue to Black, the massive Noosphere,
a continuum of "bodies" (Websites) hosting and broadcasting information and knowledge



↔ Red: ~ 500,000 sites
↔ Yellow: ~ 1,000,000 sites
↑ Orange: ~ 5,000,000 documents
↑↑ Blue: ~ 100,000,000 documents
↑↑↑ Massive Noosphere: ~ 1,250,000,000 sites

↔ approximately stable in volume.   ↑ high rate of increase: the more arrows, the higher the rate.





Red points: worthy Websites dispersed and extremely diluted through the Web space. The worth is a function of the culture and, of course, of time: for instance, a site showing how a 4-month-old baby swims could be considered unworthy today, but could perhaps be a fundamental document within 200 years for some disciplines of the Human Knowledge.





The figure above shows the "discovery" of a worthy site made by FIRST, the Expert System that manages the HKM. FIRST is continuously searching for new sites that deserve to be filed in the HKM. It is not shown here how the HKM detaches itself from references to "useless", obsolete and incomplete sites.





In the figure above, a set of "red points" forms a net, substantially augmenting its worth: each node grows by mutual inter-nurturing and the set grows collectively as well. Some primitive examples are the "Virtual Communities" and "Web Rings".



7- An approach to Website Taxonomy


Parameter template: browsing a site to measure its structure

For the implementation of our first Expert System, to administer the matchmaking process of a B2B site, we searched the Web along a two-month journey, visiting more than 6,000 sites dealing with e-Commerce. Some of them were Verticals, some Hubs, trying to encompass the main industrial activities and services of a highly industrialized nation like the USA. For each of them we tried to take into consideration the following set of factors:

Type of site

To identify all these variables we first designed a Utopian Universe of Authorities endowed with everything imaginable, for instance the USA Library of Congress, http://www.loc.org, NASA, the National Aeronautics and Space Administration, http://www.nasa.gov, or the WTO, the World Trade Organization, http://www.wto.org/: huge and complex sites that supposedly deal with their Major Subject in an integral way. Browsing carefully within some "clusters" of authorities at that level, just comparing "pound for pound" among them and with other minor sites dealing with the same Major Subjects, we tested the main variables of our template.

Type of site: We found many types of sites, sometimes not easy to define because many of them were a combination of several types: specialized Websites with proprietary content, Portals, Directories, Facilitators, Portals of Portals, Vortexes, Vortexes of Vortexes, Platforms, Yellow Pages, etc.

Design has changed substantially lately. Up to now we have witnessed a Web evolution with designs made to attract traffic and to maintain a reasonable loyalty to the site. Every detail was exhaustively considered: speed, readability, sequencing, layout, colors, wording, flow, login, customer support, error handling, etc.

Web user behavior is not yet well known, but we could identify some classical characteristics needed to produce both a nice first impression and a durable membership.


Deepness: By deepness we understand the average depth of the site's tree, that is, how many clicks, on average, we can go inward while still finding valuable information.

Bandwidth: The width of the announced subject spectrum actually covered, measured from high to low, from rich to poor, or as a percentage.

These two concepts, deepness and bandwidth, proved to be extremely important for defining the potentiality and quality of a site. They let us differentiate, for example, a site with a wide bandwidth and a deepness of 3 from a poor site with a wide bandwidth but almost empty, and thus orient and facilitate the task of users, taking into consideration that the average user will not be able to appreciate such subtle differences from the beginning.

Verticality (Functionality) was another concept we were not yet used to appreciating easily. By it we mean how well integrated the Major Subject is along the site as a whole. We found verticality to be inversely related to bandwidth. This concept was particularly useful for comparing e-Commerce "Verticals".

Use of the site: the institutional, professional, academic, religious, communitarian, or commercial use of the site: to enhance its institutional image, to work for human welfare, to fight against something, to virtually behave as a community center, etc.

Type of users: we defined three types of users looking for three types of resources, namely: beginner, medium, and expert, looking for information, knowledge, and entertainment.

Universality refers to taking into consideration all possible users' needs, expectations and cultural differences, and all possible users' jargons, from beginners to experts, from small enterprises to big corporations, and across geographic origins.

Finally, traffic is a very important factor, but very difficult to evaluate accurately.


A first raw approach

A fast and straightforward way to compare Websites could be structured taking into consideration only three of those variables, namely:

Accordingly, we could define a qualitative-quantitative three-dimensional metric. Deepness is decisive for judging the seriousness of a site. Most of the investigated sample of nearly 6,000 e-Commerce sites had an average deepness factor lower than 2! For instance, a well-known B2B site claimed to have more than 50 vortexes implemented, but only 10 had a deepness of 4; the rest offered one click and then nothingness concerning real proprietary content!
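A minimal sketch of such a three-dimensional comparison follows. The choice of deepness, bandwidth and verticality as the three axes, and the way they are combined, are assumptions made only to illustrate the idea.

    import math

    def site_metric(deepness: int, bandwidth: float, verticality: float) -> float:
        """Hypothetical score combining three axes: deepness (click levels with valuable
        content), bandwidth (0..1 coverage of the announced spectrum) and verticality
        (0..1 integration of the Major Subject)."""
        # Euclidean norm in a normalized 3-D space; deepness capped at 5 levels.
        d = min(deepness, 5) / 5
        return math.sqrt(d ** 2 + bandwidth ** 2 + verticality ** 2) / math.sqrt(3)

    # A wide-but-shallow site versus a narrower but deep, well-integrated one.
    print(site_metric(deepness=1, bandwidth=0.9, verticality=0.2))  # ~0.54
    print(site_metric(deepness=4, bandwidth=0.5, verticality=0.8))  # ~0.71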



8 - FIRST within the vast world of AI – IR
Some contextual ideas and hints to improve its implementation


FIRST niche

FIRST is a methodology to create a basic knowledge index of the Web with some self-learning capabilities, allowing Web users to find relevant information in only one click of their mouse. That's all. FIRST is an Information Retrieval methodology that has little in common with KR methodologies, languages and algorithms that try to represent the Human Knowledge in a true form, and something in common with knowledge taxonomy. I believe that FIRST could be considered a primitive AI application that emphasizes the role of human experts in starting its evolutionary process.

Even though the purpose of FIRST is so humble, the task of starting to run as an almost autonomous Expert System, capable of learning from user-Web interactions, is immense. It was therefore conceived to be aided by two communities of IA's (generally "knowbots"): one to optimize the work of the initial staff of human experts, and one that, once the system is running, progressively replaces human intervention up to full autonomy. The first community was conceived to speed up the completion of the initial or "mediocre" solution, and the second to make that mediocre solution evolve over time. We may then imagine, operating within the always exponentially expanding Web, a network of these cells of HK, perhaps clones of one initial mediocre solution, but evolving differently depending on the users' communities (human beings) and on the general policies that control the behavior of the HK administrators, be they humans or IA's.

FIRST is completely defined by a set of 6 "white papers". In this section our aim is to present FIRST to the scientific community. The Human Knowledge could be depicted as an infinite semantic (and why not emotive?) network with a complexity not yet known. Much relevant work has been done in that direction, with contributions that range from metaphysics and philosophy to mathematics and logic, as we will review below, but many contributions are common-sense findings. FIRST falls within this last category of thinking, as did many IR tools of the past, like KWIC, and as do many search engine approaches of the present, like Yahoo, Altavista and Google.


Navigation by AI-IR Authorities

We are then going to navigate the Web, stepping on some "authorities" concerning our aim. Our first step will be the Research Index of NECI, the Scientific Literature Digital Library (the site was moving to a new place; check it!). As a sound proof of the expansion and mobility of the Web, when I was rehearsing the sites and documents of this section the NECI "old" site moved, and not only that: most of the references I was consulting disappeared, replaced by new foci of interest such as learning, vision and intelligence.

One of the "rules of thumb" advised by our experience is the cardinality that determines the taxonomic size of our HKM, fixed at 250 Major Subjects. As explained in our white papers, 250 is a common upper limit in important Hubs. But let's take a look at the index of the AI taxonomy as depicted on this site. Concerning our own research about Human Knowledge, it is another example of knowledge itemization: the whole literature dealing with this Major Subject (Digital Library), which behaves as a Hub for almost every other human Major Subject, encompasses 17 subjects and nearly 100 sub-subjects, Knowledge Representation under the Artificial Intelligence subject being one of them.

KR: The Knowledge Representation literature talks about the concept of hubs and authorities, two polar kinds of "nodes" within our HK subspace: nodes that act as hubs point towards the basic and most popular authorities. We are going to consider the following essays:


Camarero applies his methodology to a piece of archeology, an object extremely bound to the past: an amphora, while Baral applies Prolog (a promising LP language created by Robert A. Kowalski; see "The Early Years of Logic Programming", CACM, January 1988, pages 38-43) to depict two classical problems, each resembling pretty well a piece of human logic: the Flying Birds and the Yale Shooting Problem. Both essays feed our hope for HK classification made by robots without human intervention, or at least with negligible human intervention, in the near future. However, the problem of the exponential expansion of the volume of the Web still remains.


DAI and KK: While the HK in the Web space is a gigantic and living entity continuously expanding within the "noosphere", what is really important for our practical purposes is its "kernel" (FIRST points to the kernel of basic knowledge needs of average Web users). Concerning the intelligent outcomes of actual human beings, only successful pieces of data and intelligence remain in the kernel. Of course, to accomplish that evolutionary task of filtering and fusion we need something like a short-range memory registering every potentially valuable human intelligence outcome. What we need are procedures to optimize the process of selecting data and intelligence to be added to the kernel or, in some instances, to replace parts of it, because the HK follows a model of "non-monotonic logic" within an "open world context".


Key behavior of some IA's: We have seen in FIRST how easy it is to build manually, using human experts, the first approximation to such a kernel. The problem then is how to improve this kernel henceforth by means of intelligent agents. Our first intuitive approach was to trust the mismatch process. When a user queries the kernel there are two possible outcomes:

found - not found

When found, the users are attracted by the massive power of the kernel, like a gravitational force, and we may suppose that they found what they were looking for. When not found, we are in the presence of a real fight_for_living scenario in terms of intelligent behavior. When users cannot find something, they try to do their best to "win", either beating the kernel or finding an open back door to enter. What is really important is the track of those fights. Let's imagine millions of users trying to access myriads of such kernels from all over the world, in different languages, in different jargons, and belonging to different types of marketing (in a broad sense) behavior.


DAI, network:

We need local agents to record those tracks, agents to typify them, "knowbots" (specific intelligent agents that deal with tiny intelligent pieces) to suggest courses of action, and "mediatorbots", negotiator agents that solve conflicts, for instance those due to errors. Once kernels are managed locally, we are going to have kernel clones by the thousands and we are going to need "coopbots", cooperative agents that join the efforts made in different Websites of a knowledge network and behave socially. Very probably, on users' behalf, we are going to need "reactbots" as well, reactive agents open to known and unknown stimuli that try to react (mostly in friendly manners) to user actions, for instance detecting wandering and disorientation. And finally we are going to need "learnbots", learning agents, and even "smartbots", smart agents, to substitute for human intervention in a process initially controlled by humans.


Web scenario (based on July 1998 data): at that time we accounted for laudable Web Knowledge Classification projects like TAPER from IBM and Grid/OPD. Concerning growth, a rate of 6% monthly was projected, which compounds to roughly doubling each year ((1.06)^12 ≈ 2.01). Now we have nearly 1,300 million documents! However, besides growing, the Web scenario presents significant differences concerning IR approaches, namely:


Top search engines: Google uses an innovative algorithm, the PageRank algorithm, created by L. Page and S. Brin for page ranking; Altavista has one of the most complete databases and, with its Advanced Search facility, equals Google's features; Infoseek offers an interesting search-among-results feature. Other directories analyzed were Yahoo, Infomine, Britannica, Galaxy and Librarians' Index.


HKM complementary information: In our white papers we talk about the second priority for common Web users, the noosphere shell of Technical and Scientific Information. This shell could be implemented via the Northern Light search engine services, which provide queries over thousands of Journals, Reviews and Proceedings!


Lexicons: Google, for instance, had at that time nearly 14 million words! FIRST will work with an initial Thesaurus of 500,000 keywords. Most actual lexicon words are references: names, titles, toponyms, brands. FIRST should prevent this, pointing users to the classical search engines.


Operational hints: As any search engine has three parts, a Crawler, an Index System and a Database, we must learn as much as possible from these heavy-duty components in order to implement FIRST. For instance, to optimize the FIRST IR task we must take care of working with portions of the DNS in order to look at one server at a time, caching query results to browse sites economically. Another problem will arise from HKM updates: it is highly recommended to do them incrementally rather than totally, as is normally done in today's search engines. The FIRST architecture considers that feature. Even being a "one click" engine, FIRST answers to queries could be weighted via a relevance-popularity algorithm, something like the PageRank algorithm, namely:

Given a page (a) and a set T of pages linking to it, we may define PR, the PageRank of (a), as

PR(a) = (1 - d) + d [ PR(T1)/C(T1) + PR(T2)/C(T2) + ... + PR(Tn)/C(Tn) ]

Where:

d is a damping factor and the C(Ti)'s are the numbers of links going out from the corresponding pages Ti.

Google sets d to 0.85, with 0 < d < 1. We are talking here about "popularity" in terms of Website "owners", not in terms of Website "users": PR(a) is then the probability that a random surfer selects that page, and d the probability that the random surfer gets bored before requesting some other page.

Warning: we may design a d factor for our first HKM and calculate all PR's. FIRST will then compute d and the PR's as users' factors instead, which is by far more realistic!
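As a sketch only, an iterative computation of the PR formula above could be written as follows; the three-page link graph is invented for illustration.

    def pagerank(links: dict[str, list[str]], d: float = 0.85, iterations: int = 50) -> dict[str, float]:
        """Iteratively compute PR(a) = (1 - d) + d * sum(PR(Ti) / C(Ti)) over pages Ti linking to a."""
        pages = list(links)
        pr = {p: 1.0 for p in pages}                       # initial guess
        out_degree = {p: len(targets) or 1 for p, targets in links.items()}
        for _ in range(iterations):
            new_pr = {}
            for page in pages:
                incoming = [q for q in pages if page in links[q]]
                new_pr[page] = (1 - d) + d * sum(pr[q] / out_degree[q] for q in incoming)
            pr = new_pr
        return pr

    # Invented three-page graph: A and B link to C, C links back to A.
    graph = {"A": ["C"], "B": ["C"], "C": ["A"]}
    print(pagerank(graph))   # C collects the highest rank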


Some Subtle Hints


HITS (published in French), which stands for Hypertext Induced Topic Search, was created by Jon Kleinberg at IBM to identify sources of authority. We find very useful the concepts of hubs (good sources of links) and authorities (good sources of content). I think most of our URL's will correspond to authorities. A good hub is one that points to many authorities and, conversely, a good authority is one that is pointed to by many hubs.

As in the work of Kleinberg, we are going to work on a small but extremely select subset S(K), related to HK basic documents, with the following properties:

Kleinberg states that "by keeping it small, one is able to afford the computational cost of applying non trivial algorithms" and that "by insuring it is rich in relevant pages it is made easier to find good authorities, as these are likely to be heavily reinforced within S(k)."

Kleinberg then suggests the following solution to find such a collection of pages. For a parameter t, typically set to about 200, the HITS algorithm collects the t highest-ranked pages for the query K (for instance, for a Major Subject of the HKM) from an engine such as Google or Altavista. These t pages are referred to as the root set R(K). HITS then increases the number of strong authorities in the sub-graph by expanding R(K) along the links that enter and leave it. HITS moreover works on transverse links, that is, links that go out to external domain names.

Warning: it will be interesting to compute weights for both hubs and authorities.
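Below is a minimal sketch of that weight iteration under the standard HITS update rules (an authority's weight is the sum of the hub weights of the pages pointing to it; a hub's weight is the sum of the authority weights of the pages it points to). The toy graph and the normalization details are illustrative.

    import math

    def hits(links: dict[str, list[str]], iterations: int = 50) -> tuple[dict[str, float], dict[str, float]]:
        """Compute hub and authority weights over a small link graph."""
        pages = list(links)
        hub = {p: 1.0 for p in pages}
        auth = {p: 1.0 for p in pages}
        for _ in range(iterations):
            # Authority: sum of hub scores of pages linking to it.
            auth = {p: sum(hub[q] for q in pages if p in links[q]) for p in pages}
            # Hub: sum of authority scores of pages it links to.
            hub = {p: sum(auth[t] for t in links[p] if t in auth) for p in pages}
            # Normalize so the weights do not blow up.
            na = math.sqrt(sum(v * v for v in auth.values())) or 1.0
            nh = math.sqrt(sum(v * v for v in hub.values())) or 1.0
            auth = {p: v / na for p, v in auth.items()}
            hub = {p: v / nh for p, v in hub.items()}
        return hub, auth

    # Invented graph: H1 and H2 behave as hubs, A1 and A2 as authorities.
    graph = {"H1": ["A1", "A2"], "H2": ["A1", "A2"], "A1": [], "A2": []}
    hub, auth = hits(graph)
    print(auth)   # A1 and A2 get the highest authority weights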


OGS: OGS, the Open Global Ranking Search Engine and Directory, is a distributed concept that tries to use the opinions embedded in all search facilities. They propose to add some extra tags to the HTML standard (a similar approach to the one we used in our i-URL's). Warning: do they still trust Website owners to behave honestly? For example they suggest:

<a href="............" cat="/news/computers" rank="80%">

stating that the author considers that document a serious and valuable one (80 out of 100!). OGS is still an open proposal, to be collectively managed by us as users. The proposal is naïve but intelligent, and within the Internet utopia of fairness, openness, democracy and freedom.

Technically, what is proposed is a small change to the HTML standard that lets people easily state their opinions about the information on the sites they link to. These opinions include the category to which the site belongs and the rank of the site in that category according to the person making the link. The opinions are then weighted according to the author's reputation in the field, which in turn is also determined by the weighted opinions of all the people that have expressed them. We are going to study these concepts carefully when implementing FIRST.
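As OGS is only an open proposal, the following is merely a sketch of how such reputation-weighted opinions might be aggregated; the opinion records, reputation values and function name are all invented for illustration.

    # Each opinion: (author, category, rank 0..100); reputations are assumed values 0..1.
    opinions = [
        ("alice", "/news/computers", 80),
        ("bob", "/news/computers", 40),
    ]
    reputation = {"alice": 0.9, "bob": 0.3}

    def weighted_rank(ops, rep, category):
        """Reputation-weighted average rank for one category."""
        relevant = [(author, rank) for author, cat, rank in ops if cat == category]
        total_weight = sum(rep.get(a, 0.0) for a, _ in relevant) or 1.0
        return sum(rep.get(a, 0.0) * r for a, r in relevant) / total_weight

    print(weighted_rank(opinions, reputation, "/news/computers"))  # 70.0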


TAPER: TAPER, which stands for Taxonomy And Path Enhanced Retrieval, was developed by Soumen Chakrabarti in collaboration with Byron Dom and Piotr Indyk at the IBM Santa Teresa Research Lab in 1997. You may also find the related document (in PDF) "Using Taxonomy, Discriminants and Signatures for Navigating in Text Databases", written by Soumen Chakrabarti, Byron Dom, Rakesh Agrawal and Prabhakar Raghavan of the IBM Almaden Research Center, 1997.

Basically, TAPER is a hierarchical topic analyzer that achieves high speed and accuracy by means of two techniques. First, at each node in the topic directory, TAPER identifies a few words that, statistically, are the best indicators of the subject of a document; it then 'tunes in' to only those words in new documents and ignores 'noise' words. Second, it guesses the topic of a page based not only on its content but also on the contents of pages in its hyperlink neighborhood. We use a similar approach in the first step, when building the mediocre solution of FIRST. You may find a nice document dealing broadly with what we are discussing here in the doctoral thesis on how we find information on the Web by Kiduk Yang, March 29, 2001, School of Information and Library Sciences, University of North Carolina at Chapel Hill.
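The sketch below illustrates only the flavor of the first of those two techniques, picking a few contrast words per topic by a crude frequency difference; TAPER's actual statistical feature selection is more sophisticated, and the sample documents here are invented.

    from collections import Counter

    def discriminant_words(docs_by_topic: dict[str, list[str]], top_n: int = 3) -> dict[str, list[str]]:
        """For each topic, keep the words most frequent there relative to all other topics."""
        topic_counts = {t: Counter(w for doc in docs for w in doc.lower().split())
                        for t, docs in docs_by_topic.items()}
        result = {}
        for topic, counts in topic_counts.items():
            others = Counter()
            for t, c in topic_counts.items():
                if t != topic:
                    others.update(c)
            # Score = frequency in this topic minus frequency elsewhere (a crude contrast).
            scored = {w: counts[w] - others.get(w, 0) for w in counts}
            result[topic] = [w for w, _ in sorted(scored.items(), key=lambda x: -x[1])[:top_n]]
        return result

    docs = {
        "clustering": ["k-means joins similar objects", "cluster similarity and splitting methods"],
        "ranking": ["pagerank weights links between pages", "hubs and authorities rank pages"],
    }
    print(discriminant_words(docs))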


Web sizing: the idea suggested in the paper by Krishna Bharat and Andrei Broder (you may access their papers at the DBLP Bibliography) is straightforward and largely common sense: to sample search engine universes statistically. Related to this issue of measuring the Cyberspace, such as Web mining and Web metrics, you may find some other works by Bharat and Broder in Cybermetrics.

They designed a sampling procedure to pick pages uniformly at random from the index systems of the major search engines. The conclusions were that 80% of the total universe (200 million documents occupying 300 GB) was indexed at any given moment. The biggest engine at that moment, Altavista, registered 50% of that universe, and the intersection of all engines proved to be extremely poor: 4%! Concerning deeper investigations about how the Web evolves, we recommend Web Archeology, by the Research Group of Compaq, where one of the archeologists is Andrei Broder. One of the collateral outstanding findings of these investigations was that almost one third of the documents hosted on the Web are copies, and that of the nearly 1 million words of a full English dictionary only 20,000 are normally used by navigators. Concerning that, I am convinced that most Web users make their queries with an extremely poor vocabulary of no more than 3,000 words. By the way, this is very easy to investigate, and I did my own personal estimation in Spanish with a sample of 250 Systems Engineering students at an Argentine university and found that they used nearly that.
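A hedged sketch of the underlying estimation idea follows: if we sample pages uniformly from each engine and measure how often each sample is contained in the other engine's index, the ratio of the two overlap fractions estimates the ratio of the index sizes. The figures below are invented.

    def size_ratio(frac_a_in_b: float, frac_b_in_a: float) -> float:
        """Estimate |A| / |B| from overlap fractions measured on random samples:
        frac_a_in_b = share of A's sample found in B's index, and vice versa."""
        return frac_b_in_a / frac_a_in_b

    # Invented figures: 30% of A's sample found in B, 60% of B's sample found in A.
    print(size_ratio(0.30, 0.60))   # A is estimated to be twice as large as B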


KR - One outstanding IR “authority”

We were reviewing the book Knowledge Representation by John F. Sowa, August 1999, commented on at BestWeb, which covers everything we are talking about. We comment on some parts here to appreciate globally where we are. Sowa distinguishes three necessary components of any IR study, listed in the subtitle of his work: Logical (logic), Philosophical (ontology) and Computational foundations. We are going to use it as a mathematical background to implement FIRST programming, together with the tutorial Sowa prepared for that purpose.

The level of our white papers is a first global step, easily understood by everybody with a minimum IT/Internet background. The second level must be the general directives, expressed using the glossary implicit in this section. The third level must be expressed algorithmically, either using mathematical IR notation as used in the mentioned tutorial or some logic programming language, and the fourth level enters the software realm proper.


New and Old ideas in action now


Clustering

Clustering is a relatively "old" technology: once we get an answer to a query, it can be organized in meaningful clusters, so if it works, and if it does not take significant processing time, it always adds and never subtracts from a better understanding (there is no ranking). You may see it in action in the new search engine Vivísimo, originated at Carnegie Mellon University and launched in February 2001. It works fine on scientific literature, web pages, patent abstracts, newswires, meeting transcripts and television transcripts. It is a meta search engine because it works over several search engines at a time, applying the clustering process to all the answers. It is especially apt when users do not know how to make accurate queries; for well-formed queries they advise using the regular search engines. As they work directly on the pipeline of answers, the procedure is called "just in time clustering".

Clustering is, to some extent, along our idea of working more on keywords than on categories or any other type of classification. The heuristic algorithm works freely over the answers, without any preconception. The process is rather fast: clustering 200 answers of nearly three lines each takes 100 ms on a 1 GHz Pentium III. Let's try what happens with "clustering". It gives 194 results (documents) and a set of branches of the tree whose root is clustering. It shows only part of the branches, which helps us select clusters. For instance, if we select data, it delivers 22 documents along the path clustering > data, where we find some documents related to clustering definitions and data mining. One of the documents in this cluster explains what a cluster is, namely:

In general, a cluster is defined as a set of similar objects (p.9 Hartigan). This "similarity" in a given set may vary according to data, because clustering is used in various fields such as numerical taxonomy, morphometrics, systematics, etc. Thus, a clustering algorithm that fits the numerical measure of optimization in a data may not optimize another set of data (for example, depending on the units selected). There are many algorithms to solve a clustering problem. The algorithms used in our applet concentrate on "joining", "splitting", and "switching" search methods (also called bottom up, top down, and interchange, respectively). They are shown by their representative methods: minimum-cost spanning tree algorithm, maximum-cut, and k-means algorithm.

Good enough to start knowing something concrete about clustering. To search for methods we use the branch clustering > methods, where we find only 5 documents, but all valuable. We are going to study carefully how to implement clustering in FIRST, mostly because we believe that the general user does not have an accurate idea of what he or she is looking for. Usually users have some keywords more or less related to their needs and sometimes they have an idea about the name of the subject.
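Since the quoted passage mentions the k-means algorithm among the "switching" methods, a self-contained sketch of it over toy two-dimensional points follows; the data and the choice of k are invented.

    import random

    def kmeans(points: list[tuple[float, float]], k: int = 2, iterations: int = 20):
        """Plain k-means: assign each point to its nearest centroid, then recompute centroids."""
        centroids = random.sample(points, k)
        for _ in range(iterations):
            clusters = [[] for _ in range(k)]
            for p in points:
                nearest = min(range(k), key=lambda i: (p[0] - centroids[i][0]) ** 2
                                                      + (p[1] - centroids[i][1]) ** 2)
                clusters[nearest].append(p)
            for i, cluster in enumerate(clusters):
                if cluster:
                    centroids[i] = (sum(x for x, _ in cluster) / len(cluster),
                                    sum(y for _, y in cluster) / len(cluster))
        return centroids, clusters

    data = [(0.1, 0.2), (0.2, 0.1), (0.15, 0.15), (0.9, 0.8), (0.8, 0.9), (0.85, 0.85)]
    centroids, clusters = kmeans(data, k=2)
    print(centroids)   # two centroids, one near each group of points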

You may see a recent paper, Clustering and Identifying Temporal Trends in Document Databases (also downloadable from our site), by Alexandrin Popescul, Gary William Flake, Steve Lawrence, Lyle H. Ungar and C. Lee Giles, IEEE Advances in Digital Libraries, ADL 2000, Washington, DC, May 22-24, pp. 173-182, 2000. To check results they used the CiteSeer database, available at http://csindex.com, which consists of 250,000 articles on Computer Science, of which they used 150,000. Their algorithm works on the ideas of co-citation and previously determined influential papers.


Teoma

Teoma is a project of the Computer Labs at Rutgers University, launched in May 2001 and trying to excel Google. Teoma calculates relevance using link analysis to identify "communities" on the Web, and then determines which sites are the authorities within those communities to find the best pages. Whereas Google uses the collective wisdom of the entire Web to determine relevance, Teoma tries to identify "local" authorities to help identify the best pages for a particular topic.

Collection of Glossaries is another laudable effort, made by Aussie Slang (that is not a woman's name, it stands for Australian Slang!), to facilitate users' navigation. It is only a directory of glossaries and dictionaries, but it could be useful for the initial tasks of building the HKM, when gathering trees, paths and keywords of the Major Subjects of the HK (they say they have catalogued more than 3,200 glossaries, really an upper limit to the volume of our Thesaurus).


More about KR

We comment here on some parts of Towards Knowledge Representation: The State of Data and Processing Modeling Standards, by Anthony K. Sarris of Ontek Corporation, 1996. It is another source to fully depict the state of the art related to KR. In the Web domain we are dealing with conventional knowledge, in essence forms of the classical written knowledge complemented with some images. There are other forms of knowledge, related to social structures like social groups, organizations and enterprises, that are not easy to represent. For instance, we may write a resume of the Library of Congress site, trying to describe it with words, but it would be extremely difficult to provide a map of its built-in knowledge as an institution. Perhaps in the near future those maps could form part of organizations as a default image. This paper deals with languages and models to depict organizations so that they can be universally understood. It is something along a similar line to XML and XQL but referred to organizations, standards under the control of ISO, the International Organization for Standardization. We have to allocate room in the i-URL's of FIRST to take that near-future possibility into consideration.
 
In the same way that we are talking about HKM networks, we have to anticipate heterogeneity, that is, other forms of HKM and, of course, different forms of KR. So, to preserve the future of FIRST, we must try to make it compatible with all imaginable forms of maps and representations. In that sense we must take into account even compatibility with actual NPL's, Natural Programming Languages. The implementation of FIRST will itself need an organization, and that will be the opportunity to use well-proven methodologies, for instance CASE, to model it and, redundantly, to serve as a model.

As there are many lines of development concerning knowledge, we have to put FIRST on the right track from the beginning. Knowledge should be represented, giving rise to KR tools and methodologies; knowledge must be organized in order to access it, giving rise to libraries; and finally it must be administered for human welfare, giving rise to knowledge management. Knowledge is the result of the socialization of humans and, once generated, is shared by a community. Shannon found the equivalence between energy and information, and now we need to go a step further to find the equivalent for knowledge. Intuitively, the Internet pioneers always talked about three main resources: information, knowledge and entertainment. On the Web we may find specific sites for each resource, and there are many, like Portals, offering all three.

Before the Internet, what we call the "noosphere" did not exist: knowledge was stored in libraries, books and our minds, and only pieces of information and knowledge were the matter of communications, an agreed traffic of messages among people. The Internet, and specifically the Web, brings a universally open and free noosphere where people are absolutely free to obtain what they need in terms of information, knowledge and entertainment from their e-sources. For the first time the knowledge is up there, up to you, when you need it, anywhere. The only problems to be solved, even with it being free, open and universal, are how to find it and how to understand it or, more technically, how to retrieve it and how to decode it. To decode it properly, all documents in the noosphere must be standardized, and for this reason the programs, tutorials and data of FIRST must be implemented via standards, for instance DSSSL for data.

To know precisely the state of the art of the realm where FIRST is going to operate, we must take a look at the efforts made concerning Digital Libraries. In our white papers we mentioned that the amount of "basic documents" that represents the HK for a given culture at a given time tends to be rather constant, moving upwards or downwards at a very small pace and sometimes fluctuating around rather constant averages. For instance, we cited the famous Alexandria Library, which when destroyed stored about 300,000 basic documents. We are now talking about 500,000 reference e-books, not too much more.

The Alexandria Digital Library Project, in Santa Barbara, California, USA, is focused on Earth data. The central idea is that, once finally implemented, it will be the origin of a world-distributed network of mirror-clones, like the network we imagine for our project. For us this library will be an authority of the type earth sciences => geo systems => image libraries. Normally Website authorities lead to related authorities, and in this case the rule holds:

It informs us that the National Science Foundation (NSF), the Department of Defense's Advanced Research Projects Agency (ARPA) and the National Aeronautics and Space Administration (NASA), three Internet "big ones" and pioneers, sponsor all those libraries and are part of the leading project, the Digital Libraries Initiative, where they add the cooperation of the extremely worthy National Library of Medicine, the Library of Congress, the NEH, National Endowment for the Humanities, and recently, last but not least, the FBI, the Federal Bureau of Investigation.

With such a demonstration of power, we wonder about the freedom-utopia future of the Internet. No doubt all of them are super-authorities, but be careful: the Big Ones among the medical-drug labs are also very well intentioned and extremely powerful, concentrating too much I_am_the_truth power. We were talking about 500,000 basic documents out of an expanding Web universe that now has some 1,300 million documents. Will the HK be concentrated in quite a few reference houses or dispersed among 500,000? Or will it perhaps show us a harmonious combination of dependence versus independence in matters of knowledge?

The Noosphere is the part of the world of life that is created by man's thought and culture. Pierre Teilhard de Chardin, Vladimir Ivanovich Vernadsky and Édouard Le Roy distinguish the noosphere from the geosphere, the non-living world, and from the biosphere, the living world.