Hypertextual Data Mining and Information Retrieval

 

By Sebastian Harvey

 

Edition II

 

OmegaSoft Research,

www.omegasoft.co.uk, UK

seb@omegasoft.co.uk

 

10 July 2004

 

Abstract

 

Over the last ten years, the internet has seen a huge unprecedented growth in public documents, also known as websites.  These range vastly between large commercial sites to personal homepages, and with it, a wide variety of information.  This has – and will continue to – challenge the way search engines are built.

 

While the industry in search engines has yet to apply third-generation ideas to practice, there is a growing need for more accurate and compensative search results.  In order to understand the transformation to third-generation search engines, it is important to visualise the paths to it.

 

First-generation search engines employed the use of meta tags such as Metacrawler, which contained information relating to the page it was linked too.  These meta tags were then read by so-called bots, which crawled the structure of the internet looking for this information and storing it into a centralised database.  Second-generation engines, such as Google, Yahoo and Overture look a glimpse at the whole website and indexed its content, they also took into account webpage popularity, other sites referencing them and included parsing facilities to allow multiword searching, which was previously unavailable to meta searching techniques.

 

The future of search engines, will relate to users behaviour while navigating the internet.  They will keep records of where users have been, what they have searched and what social background they belong too.

 

This project presents PathSearch, a prototype of a fully-functioning third-generation search engine.

 

Keywords: PathSearch, Search Engine, Third-Generation (also referred to as: G3), Leon, FifoBot.


1.0 The Modern & Growing Internet

 

1.1 The Theory of Third-Generation Searching

 

Third-generation searching focuses on one objective; and that is to deliver relevant high-quality results from a search query by taking into account a wide assortment of information relating to the user.

 

Why would we want to do this?  Why is it advantageous over current searching features as provided by the market leaders?

 

Say you are an American citizen, and you searched for the term “international news”, a third-generation search would rank your results by factors (take for this example, geographic location).  The results returned would be the CNN.com website.  However, should you be a British citizen and do the exact same search, the results returned would be BBC.co.uk/news.

 

The third-generation principles would return (but not limited to) results specific to your location, thus improving the quality of your search.  Conversely, factors such as age or sex would play a major role.  With the information filtered and tailored to match the users attributes.

 

Another workable example would be to account for occupation.  Where as a student searching for “source code” may find planet-source-code.com (library of completed applications in a variety of languages) a computer programmer conducting the same search would instead find the MSDN.Microsoft.com website (more specific and abstract, not so many completed applications and requiring much more background knowledge in the field to understand the examples).

 

We could expand this practice to include results already searched.  Should a user search for “travel” and find no results, and then come back and do the same search a week later, third-generation engines would take a step further and possibly add another query to the string, resulting in “travel insurance”.

 

By building a profile relating to a user, we would be able to tailor information much closely.

 

1.2 Immediate Challenge… the Web

 

The internet is a heterogeneous assortment of totally uncontrolled and unrelated websites.  The first obstacle we face, is bringing order to this vast domain.

 

It was early to identify that many aspect of the system would have to take into account the unreliable human input.  Whether this is from searching for a word or term, or developing a website, which our system could read.

 

In reality, we had already established that a great deal of our search engine would have to be automated.  Which means humans should be able to communicate with our system with very little help.

 

Estimations have concluded that there are around 5.5 billion websites on the internet, which are accessible from the public.  PathSearch – yet relevantly small – has been designed to index around 10,000 - 20,000 of these websites.  While we appreciate the small numbers, the principles behind our work mean that (hardware permitting) we should be able to increase this considerably.

 

Even so, with human measures, 20,000 websites contributes a huge effort and drain on non-automotive resources to index the internet.  Hence the need of so-called web bots or crawlers. Preset computer programs designed to move through the link structure of the internet and store any information it comes across.

 

2.0 Design Goals

 

2.1 Intuitive Justification

 

For this prototype, we had to set limited, yet precise and ample amount of websites in which we would aim to crawl, process and search.  We also had to conjecture a possible ranking algorithm to which our results would be subject.

 

Without any prior knowledge in building such distributed system, we had to devise calculations manually.  While we appreciate that refinements or extensive improvements could be made to the system or its algorithms at a later date, a stepping stone to this development process would require some well-thought out initial intuition.

 

Based on how market leaders, and similar projects work, PathSearch had been pencilled out to rank its results by; how many users click [those] results, compared to how many websites link to a specific result, subject to the domain type and keyword match.

 

All of this information would be subject to one another to create a hierarchical structure of importance and relevance.

 

We also noted that all keywords in a multiword search should be included in all of its results.  Therefore, should users search for “UK Universities”; they would only receive results based on universities in the United Kingdom – and not general universities abroad.

 

3.0 The System Architecture

 

3.1 How PathSearch Works

 

Like most search engines, PathSearch adopts the accepted standard MVC architecture across a distributed network, which is outlined below.  This allows for greater functionality and purpose when the system is in full flow.  Conflicts between tasks and processes are closely channelled, so that they can be monitored and observed, giving administrators the ability to make changes where necessary.

 

This is – and has been – the benchmark architecture of both first and second-generation search engines.  The introduction of third-generation does not alter or change the core layout, but does compliment it by including a second set of indices, which solely monitor user’s actions and provide extra information to the search queries when appropriate.

 

The PathSearch architecture has two distinct processes;

 

  1. A back-end “loop”, which continuously checks the link map of the internet, and ensures that all data in our content repository is accurate and updated
  2. A front-end interface, for the (potentially millions of) users that could use the system, this main process deals with analysing queries sent to the engine and returning the relevant results

 

Figure 1, Overview of the PathSearch system architecture

 

The back-end process comprises of the fetchers, webDB and fetch lists.  These are the processes that the everyday end-user will never interact with.  In many cases, this whole process could be entirely automated with no – or very little – interaction from a system administrator.

 

The fetchers are a series of automated web crawler programs.  We named our program, FifoBot (Fifo relating to the way in which we store all our information in its data structure).  When the FifoBot boots up, it takes a list of websites from the webDB via the Fetch List procedure and starting with the first one in the list (fifo – first in, first out structure) the bot navigates via the inet control to that particular website.

 

At this stage, FifoBot takes two separate actions; firstly, it scans the document for information, relating to meta, title, and content keywords.  With this information temporarily held in a series of data stores, it passes the values directly to the content repository in real-time, where it is able to be searched by and end-user.  The next stage of the FifoBot is to extract as many website links as it can find (for this prototype, we are only dealing with external links to root domains, but in practice, we would want to store every website link we came across) these links are then passed to the webDB database via the update procedure, ready to be searched when FifoBot next calls websites.

 

The content repository is a real-time database, with a management program working alongside.  In order to keep search times at a minimum, this database is cached in short-term computer memory, RAM.  As hard-disks are the only mechanical device on a mainframe, it takes a minimum of 0.01 seconds to move the read/write head to the appropriate location on the hard-disk (should it not already be used by another process).  Therefore, the advantages of storing the database dynamically are primarily the speed in which it can be read.

 

From an end-user’s point of view, they access the PathSearch system through a web server, which will greet them with a simple to use interface, from which they are able to search for a word or terms.

 

The searcher program provides such an interface.  When the user submits the query to the server, it will reference all occurrences of the word or term in the indices.  (If the user submits a term, consisting of two or more words, than these words are split into a data structure and analysed each – more on this later in the document.)  Using the word(s), the indices (also referred to as the Vector Term Database, VTD) will return two key pieces of information; firstly, the location of where the website is in the content repository, and second, a string of values relating to the popularity of the website.

 

The searcher program then performs the ranking formula (later described) which reorders all of the keys into a meaningful order and stores in a new multi-dimensional data structure.  This data structure is then analysed, and returns the keys for the first 10 results (10 being how many we display on any one page, in the first instance, the searcher will only return the first 10 values, but the user is then free – assuming we have over 10 results – to click “page 2” or so on and display 10 results from a mid point value in the data structure).  With these 10 ordered keys, the searcher then queries the content repository to provide previews and snippets of websites, to give the results some visual meaning.

 

This process is known as the ‘Query Lifecycle’ and can be visually depicted using the following diagram, which outlines the paths a query takes.

 

 

Figure 2, Query life cycle through the system

 

3.2  Data Duplication

 

Earlier in this paper, we mentioned that the content repository ran side-by-side with a program.  This is the main communication gateway between the end-user side of the system and the back-end side.  This program not only submits data to the repository and its collateral indices, but it manages the occurrence of duplicated data which could be sent back from the FifoBot.

 

We stated that PathSearch would exist based on providing results which were indicative of the link structure of the internet.  Therefore specific web results will have a higher ranking if more websites link to it.

 

Should FifoBot send the repository a website which already exists within the indices, then PathSearch simply increments the websites popularity rating variables.   These variables are later used in calculating the result’s rank.

 

This is done automatically, and the FifoBot does not need to be informed that it has stumbled on a previously indexed website.  In fact, it is the essence that FifoBot doubles back on itself which creates the spirt of PathSearch.

 

4.0 Running & Intention

 

4.1 Populating an Accurate Database Using FifoBot

 

In order to keep all results precise and specific to a particular user, it is important to first populate the core data-stores with good quality website information.  How the PathSearch search engine went about this was through the use of a computer program, which would in essence crawl the structure of the internet indexing any readable information as it came across it.

 

More information relating to this specific program is detailed later on; however, in order to populate our database, we pointed our indexing program to a website which contained a link to every UK and US university.

 

As predicted, the bot wondered through the links that it found, indexing websites that it first came across.

 

5.0 Intricate Operation

 

5.1 Dealing and Parsing Queries

 

The user has two main ways of searching for information; firstly, they can submit a single word (also known as keyword) to the database, or they can submit a term, which consists of multiple keywords.

 

It is – as always – important to recognise how PathSearch parses and uses these strings to generate results which are accurate and specific to the users intentions.  PathSearch also has to take into account users who are unfamiliar with how to search.

 

Dealing with single keyword queries; PathSearch will pull a listing of all of the entries from the indices where that keyword was found and rank pages accordingly to the ranking algorithm.  This is a fairly straightforward method which easy to implement and maintain.  The first version of the PathSearch prototype, codenamed Leon could only handle single keyword queries.  It would – to some extent – work with multiple words, should the two or more words appear side by side in the indices.

 

For example, a user could search for “Prime Minister Blair”, and this would return a positive result, should the Downing Street website list that term – word for word – in their keywords meta tag in the same order as the query string.

 

However, the limitations would be obvious from the start.  “Prime Minister Tony Blair” would not appear, because of how the meta-tag was written, IE: not including the word “Tony”.

 

PathSearch now takes these values and parses them into an array, and analyses each value of the data structure to the indices.

 

First of all, the user submits “Prime Minister Tony Blair” to the searcher.  The first task is to identify whether the string submitted is a single word query or a multiple word.  This is identifiable by searching for the occurrence of a space mark in the string.

 

From here, it parses each word into a value in an array.

 

Array[0] = “Prime”

Array[1] = “Minister”

Array[2] = “Tony”

Array[3] = “Blair”

 

The searcher will then proceed to check whether the query string matches the first value of the array.  If it does, than the string was a single word query, and therefore will simply pull out all results in the indices where the indexed value equates to array[0].  Should the values not correspond, then the searcher becomes aware at this point that a data structure with N values exist in memory.  From here, it is important to upshift the value of the original search string to the front of the data structure and then move all subsequent values back one cell.

 

This produces the following array of words to search.

 

Array[0] = “Prime Minister Tony Blair”

Array[1] = “Prime”

Array[2] = “Minister”

Array[3] = “Tony

Array[4] = “Blair”

 

Despite the 0th value of the array equating to a multiple worded string as its value, PathSearch will still handle this value as a single word – in the same spirit as the Leon engine did.

 

At this stage, PathSearch will then find all references to websites in the indices where their stored keyword value equals array[0] OR array[1] AND array[2] .. array[N]

 

Results will only be returned after the OR clause where all of their values are found – thus the use of AND.  This stops random websites by an unrelated Tony being displayed.

 

Any values which are found to relate with an exact match of the 0th value of the array will be ranked higher when the results are displayed.

 

To summarise, if PathSearch is going to conduct a search using any of the values stored in array[1] to array[N] then all of these values in the respective array cells must all be present in the indices – no matter what order they are written.

 

5.2 The PathSearch Ranking Algorithm

 

Information given in this section maybe altered at anytime in an attempt to improve the order of results, all information accurate to 8 July 2004.

 

Despite information on the internet being diverse and more than one website could have equal relevance to the user, it is just as important to guide them into clicking the most appropriate links.

 

And while this total process is complete automated (therefore, system administrators being unable to “fix” results) it is important for the searcher to understand what the user actually wants.

 

We do this, by creating a popularity ranking for each fetched result from the indices.  This popularity ranking only exists for the users query lifetime, and is not a permanent fixture.  In-fact, due to the character of PathSearch, a results listing page could well be different from the same query a day later.

 

There are four main variables which are used in order to gain a popularity rank on the website.  Two of which are dependant on automotive processes, while the other two are flexible to end-user interaction.  It’s this mix of automotive and user ranking which we are able to use, so that we can find an average result between statistics and the immeasurable challenges humans can make.

 

The four pieces of useable information which the searcher analyses are as follows.

 

Automotive information;

n         How many websites link to that specific result

n         What type of website it is (IE; root domain or sub directory within a domain)

 

User dependant information;

n         How many hits that specific result has had from PathSearch

n         Whether the query string matches identically to the result keywords

 

The algorithm used for calculating popularity is as follows;

 

 

R[1] = DT ( M  ( L + H ) )    R[N] = DT ( M  ( L + H ) )

 

The page Result R is calculated by the sum of the number of websites linking to that particular page L and the number of outgoing hits that site has received from other searches H.  Multiplied by they keyword match M (1 for no exact match, 2 for exact match) multiplied by the domain type of the website DT (1 for subdirectory or nested file, 2 for root location).

 

This calculation is performed on every fetched result from the indices, up to the R[N] value.

 

While this algorithm is still effective, during the growth of PathSearch’s content repository, I have became aware that a lot of results can partially be manipulated by user interaction (via clicking the links).  Future work on a new algorithm should look at increasing the value of H over L, perhaps making H an exponent over L.

 

5.3 Viewing Results

 

This part of the system is moderately straight forward.  The ranking algorithm will pass an ordered data structure to the display procedures.  This procedure then moves a pointer to a value in the data structure.  When you first view the page results (first page) the pointer will be set at the 0th value.

 

The next 10 result keys from this structure are read and appropriate snippets and viewable previews are pulled from out content repository and displayed in the form of a results page.

 


6.0 Data Collection & Process

 

6.1 Complete Overview - FifoBot

 

The main back-end process of the search engine is the necessity to crawl the internet for information automatically.  This is a looped process which can run for an infinite amount of time, reading and checking information from the web.

 

We highlighted how the FifoBot reads and stores information in the webDB and content repository earlier in this paper.

 

However, as it stands, the FifoBot is a complex and intricate program(s) which reads information from a given website.  It has been adapted to account for changes in input.  As the bot cannot view websites as humans can, it therefore can only read the source code behind.

 

Early in the project, we became aware of different coding styles that varied from website-to-website.  Not all developers of websites abided by the WC3 standards, which meant pages could not be read.  FifoBot was adapted to take into account these differences in styles.

 

We also abided by the exclusion protocols, which can be inserted into meta-tags to either prevent the bot from indexing the website, or crawling its links – or both.

 

While the bot itself was carefully planned out, there were a number of problems which was not immediately identifiable.  These included the occurrence of the ampersand (&), which websites were using in the required information.  When our bot was sending a response to our server, the character was splitting the query, therefore information was being illegally parsed before it was stored.

 

In the short term, the searcher program had been modified to account for these problems and store appropriate information.  However, this “quick fix” would only be useful to a point.

 

Such problems only started to occur during the live-run of the bot itself and were subsequently fixed.  This highlights the need for continuous error checking and ensuring that all results are accurate.

 

More information, and the status of our bot can be found at its homepage: http://pathsearch.omegasoft.co.uk/fifobot/

 

 

7.0 Large-Scale Operation

 

7.1 Supporting a Large-Scale Search Engine

 

The PathSearch engine has been designed for at least two classes of machines; one for the back-end indexing and web crawling routines, and the other which will perform the searching and results pages.

 

For a back-end machine, PathSearch would peak with 1 gigabyte of RAM, a RAID controller and a 20 gigabyte hard-disk.  Note: should the FifoBot be configured to search subdirectories of domains and individual pages on a website, then the disk size may need to increase to accommodate the extra information.  The file-system mirrored (RAID level 1) would provide at least 1 terabyte of reliable storage.  A machine of this specification could easily be assembled for around £2,500 - £3,000.  Other expenditure would include an uplink to the internet.  This would be dictated by speed alone, but initial recommendations would suggest that the minimum bandwidth should be 3 megabyte.

 

On a large-scale industry, a server-mainframe would be better suited with the webDB housed on a centralised server (in a star-network configuration) with N number of client machines connected, each running a version of the FifoBot, allowing for multiple websites to be indexed simultaneously.  Increasing the processing power would mean the internet uplink bandwidth would have to amplify ten-fold.  This could be set-up on an industrial scale for about £25,000.

 

Meanwhile, the front-end machine(s) would operate at optimum efficiency for around 500,000 websites on a typical multiprocessor configuration, with 10 gigabytes of RAM and multiple hard drives.  Such computers can be assembled for £3000.  With the repository and indices stored in RAM, the main bottleneck would be the central-processor-units.  Again, this is overcome solely on upgrading the clocking times.

 

In an industrial mainframe, PathSearch would operate more effectively should the front-end computers be connected to a centralised repository on an immediate computer setup.

 

Again, an internet connection from the front-end to the internet should be hooked directly to the internet backbone.

 

It’s important to note, that while traffic to the search engine may increase over time, only the front-end setup would be affected and not the web crawling bot connection.

 

While developers can attempt to speed up the speed in which indices are read and maintained, the primary fix to a growing repository and increase on CPU usage will always be governed by the hardware speeds.

 

7.2 Future Work on PathSearch

 

PathSearch is just the start of a chain of planned and premeditated programme of development and improvement.  Such system wide improvements would take into account the following;

 

Improving query time.  While there are some restrictions on hardware, developers can always try improving and refining as much as possible the source code they write.  While PathSearch’s ranking algorithm and data structures have been written with speed in mind, revisiting the project a second time may yield better coding practices which can help speed up the query time.

 

One early recommendation would be to redevelop the whole front-end using C or C++, to further more speed up the execution process, as this development environment specifically deals with large data structures.

 

Compressing the data stored in the indices and content repository.  While there are some leading tools readily available on the market to compress and “zip” files into a smaller size, developers could aid this process by cutting the amount of data being send to the server.  HTML pages stored in the content repository could be compressed with irrelevant information, such as unseen comments removed from source code.

 

FifoBot, would benefit from being split into two separate programs, one primarily for data gathering from the internet, and the other for analysing results which it has found.  This would cut the fetching cycle time and improve all-round efficiently of the system.  In such cases, these to separate processes would be done on two different machines to ensure speed and accuracy.

 

8.0 Biography of the Author

Sebastian (Seb) Harvey currently studies Computer Science at Aston University, UK.  He has a keen interest in human-computer interaction, data-mining, data structures and development of server-side applications.

 

He first developed the Leon model (a scaled first attempt at a search engine) in late 2003, and later revisited his early findings to build PathSearch as part of a personal data-mining project.

 

More information can be found online at OmegaSoft, www.omegasoft.co.uk.

 

While this paper outlines most of the theory behind standard searching and applying the third-generation principles behind our ranking algorithm, Mr Harvey is already looking at rescaling the current PathSearch prototype with the end goal of roughly 100,000 indexed websites, connected to a series of refined web crawlers.

 

9.0 Conclusion

 

PathSearch demonstrates the capabilities of third-generation searching, and now to better manage a growing repository of data.

 

While the prototype demonstrated here gives an insight into how the industry deals with the same process, it is still far away from delivering a package which would rival current market leaders.

 

It is however hoped that this prototype engine can be used to exhibit what the future of searching can be like, in an environment where both end-user interaction and computer generated algorithms can work together to deliver tailored results.

 

We invite you to experiment with our search engine online at, http://pathsearch.omegasoft.co.uk/ (July 2004).

 

10.0 References

 

Google, July 2004, http://www.google.com/

Yahoo, July 2004, http://www.yahoo.com/

Overture’s Altavista, July 2004, http://www.altavista.com/

The PathSearch Search Engine, July 2004, http://pathsearch.omegasoft.co.uk/

FifoBot’s website, July 2004, http://pathsearch.omegasoft.co.uk/fifobot/

 

11.0 Glossary

 

FifoBot – Web crawler developed for searching the internet for information.

Third-Generation (G3) – Next level of search engines, using profiles to create bespoke results.

Leon – Preliminary search engine before PathSearch.

PathSearch – Large-scale G3 prototype.

Content Repository – Store of searchable websites on the PathSearch localhost.

Data Structures – Short term (ordered) data stores for information in computer memory.

 

 

 

www.omegasoft.co.uk

July 2004