Scaling Internet Search Engines - Methods and Analysis
Doctoral thesis
Permanent lenke
http://hdl.handle.net/11250/249870Utgivelsesdato
2004Metadata
Vis full innførselSamlinger
Sammendrag
This thesis focuses on methods and analysis for building scalable Internet Search Engines. In this work, we have developed a search kernel, an architecture framework and applications that are being used in industrial and commercial products. Furthermore, we present both analysis and design of key elements.
Essential to building a large-scale search engine is to understand the dynamics of the content in which we are searching. For the challenging case of searching the web, there are multiple dimensions of dynamics that should ideally be handled. In this thesis we start by examining some of these dimensions and the implications they have on search engine design.
When designing a search engine kernel, the focus has been on selection of algorithms and datastructures in the general case. Also, and even more important, we design worst-case characteristics into the search kernel that are very decisive from a scaling standpoint. A performance model to analyze the behaviour of the kernel is also developed.
The designed search engine kernel was realized as a predecessor of the current FAST Search kernel (the FMS kernel), and practical experiments and benchmarking demonstrate the correctness of the assumptions from the design of the kernel.
Then a framework for scaling shared-nothing systems based upon nodes working on separate portions of the data is introduced. The design of the framework is based on the general principles of replication and distribution. A performance model and an algorithm for cluster design are provided. This is in turn applied to construct a larger-scale web search engine and benchmarking of clusters indicate that the assumptions and models for the distributed architecture hold.
The scaling aspect of search engine is further studied in the context of the application itself. Query locality is explored and used to create an architecture that is a generalized type of caching (through partial replication) using the application behaviour and a configurable correctness trade-off to design super-linear scalable search engines.
Finally, a discussion of how linguistics are being used in web search engines is provided, focusing on the constraints that apply to ensure the desired scalability.
Består av
Risvik, Knut Magne; Michelsen, Rolf. Search Engines and Web Dynamics. .Risvik, Knut Magne; Egge, Tor. The FMS Search Kernel and its Performance Characteristics. .
Risvik, Knut Magne; Svingen, Børge; Egge, Tor; Halaas, Arne. The FAST Distributed Processing Architecture (DPA) and its Application for a Larger-Scale Search Engine. .
Risvik, Knut Magne; Aasheim, Yngve; Lidal, Mathias. Multi-tier Architecture for web Search Engines. .
Gulla, Jon Atle; Auran, Per Gunnar; Risvik, Knut Magne. Linguistics in Large-Scale Web Search. .