UK market researcher ButlerBloor Ltd claims its 535-page tome, Parallel Database Technology: An Evaluation And Comparison Of Scalable Systems, offers the first in-depth look under the hood of parallel database processing. And it depicts benchmark comparisons as simple marketing tools. The company took 11 parallel databases and 14 hardware offerings, and examined each against eight key criteria: market presence and continuity; complex query; simple query; update; hybrid workload; performance accessibility; and portability. Database management systems evaluated were Adabas D, CA-OpenIngres, DB2/6000 Parallel Edition, Informix DSA, Sybase Navigation Server, Tandem NonStop SQL/MP, Oracle7, Oracle Rdb, Red Brick Warehouse, Teradata and WX9000 RDS. Perhaps its most important finding, however, is the yawning gap between the technical reality of the parallel database market and the hype that marketing departments generate. This gap provides the opportunity for database and hardware marketing people to confuse the market with technical mumbo-jumbo that they don’t fully understand themselves, warn the report’s authors, Dr Mike Norman and Dr Peter Thanisch of the Edinburgh Parallel Computing Centre.
An explicit example of this – what the authors call dis-information – is the distinction often drawn between shared-disk architectures, where each request from the server(s) can obtain any data item, regardless of where it is stored, and shared-nothing architectures, where the data in the database is partitioned between disks, giving servers or agents exclusive control of access to the data in their own disk partitions. Typically, massively parallel vendors might suggest shared-disk system configurations are flawed, while in reality there seems to be little clear difference in the scalability of the two approaches, the report says. Other features lead to far greater differences in both scalability and performance, including throughput, response time and what the report calls the chemistry between the hardware, the database management system, the volume and structure of the data in the database, and the workload. Another myth exposed by the report is that performance problems are easily solved by throwing more processors at the system. This approach ignores database management system design, the way applications have been developed, and the processor interconnects, which together are far more important, it says. In the same breath the authors turn on the widely held notion that symmetric multiprocessing systems are suitable only for medium-sized applications, and massively parallel processing for the high end. Adding chips does not mean applications are able to exploit them, the report suggests, finding that some massively parallel processing systems it looked at, such as the nCube 2 and nCube 3, were no match for the performance of many symmetric multiprocessing systems. The report describes symmetric multiprocessing as a system in which processors share memory, and massively parallel processing as one in which each processor has private memory.
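The shared-disk and shared-nothing distinction described above can be illustrated with a toy sketch. Everything in it is an assumption made for illustration: the class names, the modulo partitioning scheme and the in-memory dictionaries standing in for disks are hypothetical and are not taken from the report or from any vendor's product.

```python
# Illustrative sketch only: all names and the partitioning scheme are hypothetical.

class SharedNothingCluster:
    """Each node has exclusive control of one partition of the data."""

    def __init__(self, node_count):
        # One private dictionary per node stands in for that node's disk partition.
        self.partitions = [{} for _ in range(node_count)]

    def owner(self, key):
        # Simple modulo partitioning on an integer key (an assumed scheme).
        return key % len(self.partitions)

    def put(self, key, value):
        self.partitions[self.owner(key)][key] = value

    def get(self, key):
        # The request has to be routed to the single node that owns the key.
        return self.partitions[self.owner(key)].get(key)


class SharedDiskCluster:
    """Every node can obtain any data item from the common storage."""

    def __init__(self, node_count):
        self.node_count = node_count
        self.storage = {}  # one store reachable from all nodes

    def put(self, key, value):
        self.storage[key] = value

    def get(self, key, node=0):
        # Any node may serve the request, regardless of the key.
        if not 0 <= node < self.node_count:
            raise ValueError("no such node")
        return self.storage.get(key)
```

The sketch shows only where data lives and who may touch it. It says nothing about scalability, which the report argues depends far more on other factors.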
By Ray Hegerty
The report votes Informix Corp’s Dynamic Scalable Architecture, DSA – which features the as-yet unavailable OnLine Extended Parallel Server 8.0, OnLine XPS – the best all-round parallel database technology, saying it has significant benefits over its competitors on all applications, and in particular, data warehousing. One criticism is Informix’s weak concurrency model. Informix DSA is a shared-nothing system, which means problems can also occur when the same transaction involves multiple processes, since all processes must know that all locks have been acquired before updates can be made. With a uniprocessor, or symmetric multiprocessing system, where coherent access to memory exists, it is possible to run a single-phase commit protocol because all processes can read data in shared memory and know exactly what is happening at any time. In massively parallel systems, with distributed memory, it is necessary to run some other commitment protocol. Most database systems employ a variant of the two-phase commit protocol. Informix has not implemented two-phase commit and the authors were not given details of its mechanism, but the company told us it is using what it calls a one-and-a-half phase commit technique, which it is in the process of patenting; it declined to give further details. In the massively parallel camp the report champions Tandem Computers Inc’s NonStop SQL/MP, giving Informix, DB2 and White Cross special mention. It highlights Tandem’s long track record in the massively parallel market but warns against complacency over the competitive advantage it has traditionally enjoyed, as IBM Corp and others are hard on its heels. The authors consider both Informix and Oracle7 competent across the board in the massively parallel arena; Oracle Rdb might be with them were it not for its weakness in parallelising complex queries.
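For readers unfamiliar with the two-phase commit protocol mentioned above, the following is a minimal textbook-style sketch. It is not Informix's undisclosed one-and-a-half phase technique, nor any vendor's implementation; the `Participant` class and `two_phase_commit` function are hypothetical names, and logging, timeouts and failure recovery are left out.

```python
# Textbook-style sketch of two-phase commit; all names are hypothetical.

class Participant:
    """One process (or node) taking part in a distributed transaction."""

    def __init__(self, name, can_commit=True):
        self.name = name
        self.can_commit = can_commit  # whether this node can acquire its locks
        self.state = "idle"

    def prepare(self):
        # Phase 1: try to acquire locks, then vote yes or no.
        self.state = "prepared" if self.can_commit else "aborted"
        return self.can_commit

    def commit(self):
        self.state = "committed"

    def abort(self):
        self.state = "aborted"


def two_phase_commit(participants):
    """Return True if the transaction committed on every participant."""
    # Phase 1: the co-ordinator collects a vote from every participant.
    votes = [p.prepare() for p in participants]
    # Phase 2: commit only if every vote was yes; otherwise abort everywhere.
    if all(votes):
        for p in participants:
            p.commit()
        return True
    for p in participants:
        p.abort()
    return False
```

The point of the first phase is the one the report makes: without shared memory, no process can see that all locks have been acquired, so the co-ordinator has to ask each one before any update is made permanent.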
Sybase Navigation Server, now Sybase MPP (CI No 2,770), disappointed the authors, although they concede that the System 11 release addresses many of the scalability issues. In particular, the report criticises Sybase’s complex query rating because it does not separate planned queries from ad hoc queries (although it accepts it is not targeted at this function). Other weaknesses include high overhead of inter-process communication, leading to performance problems in the workload, and a weak pipeline. One important difference between Informix DSA and Sybase MPP, the report notes, is the bottleneck generated around the co-ordinating process which partitions queries. With Informix, partitioning can be dealt with by any virtual processor. This can be done because the metadata that specifies how data is partitioned is held in a table that is visible – via a caching mechanism – to all virtual processors on all nodes. Sybase enables only one layer of each query to be active at any one time.
This means the results of intermediate joins are stored as temporary tables which are used in subsequent joins, enabling these tables, as well as the tables to which they are joined, to be redistributed between joins. This approach may be valid where the database is tightly coupled to the hardware, and where the disparity between the relative speeds of disk and memory is not high, it says. But most database vendors now take the view that the most efficient option is to use pipeline parallelism and flow control inside the pipeline to avoid having large intermediate results present inside the system, thus avoiding the overheads of using disk. Informix enables join partitions to be made independently at each layer of the pipeline and independently of the distribution of the tables to the disk. This enables processing to be load-balanced even if data is skewed on the disk. John Spiers, Sybase European marketing director, said the report is fundamentally flawed, based as it is on a subjective, academic analysis of theoretical technology and not on users’ experiences or the authors’ own hands-on experience.
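The contrast the report draws between storing intermediate join results and pipelining them can be sketched in a few lines. This is an illustration under stated assumptions, not either vendor's engine: the function names, the dictionary row format and the column names `cust` and `region` are hypothetical, and an in-memory list stands in for a temporary table on disk.

```python
# Illustrative sketch only: names, row format and columns are hypothetical.

def scan(rows):
    """Yield rows one at a time, as a table scan would."""
    for row in rows:
        yield row


def hash_join(left, right_rows, key):
    """Stream the left input against a hash table built on the right input."""
    index = {}
    for r in right_rows:
        index.setdefault(r[key], []).append(r)
    for l in left:
        for r in index.get(l[key], []):
            yield {**l, **r}


def materialised_plan(orders, customers, regions):
    # One layer active at a time: the first join is stored in full
    # (a list standing in for a temporary table) before the second starts.
    temp = list(hash_join(scan(orders), customers, "cust"))
    return list(hash_join(scan(temp), regions, "region"))


def pipelined_plan(orders, customers, regions):
    # Both joins are active together: each row from the first join flows
    # straight into the second, so no full intermediate result is held.
    first = hash_join(scan(orders), customers, "cust")
    return hash_join(first, regions, "region")
```

Both plans return the same rows. The difference is that `pipelined_plan` returns a generator that produces results on demand, whereas `materialised_plan` holds the whole intermediate join before the next layer begins.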