Here is the summary (written by me) of this paper.
Summary of Paper
Storage and Querying of E-Commerce Data
Muhammad Yahya
Computer and Information System Engineering Department
NED University of Engineering and Technology, Karachi.
muhammad.yahya@gmail.com
1 Abstract
New generation of e-commerce applications require data schemas that are constantly evolving and sparsely populated. The conventional horizontal row representation fails to meet these requirements. The Horizontal Data can be transformed in to Vertical representation, in which each row has Object Identifier and attribute name-value pair.
2 Introduction
In Large e-commerce systems, attributes in some table keeps on increasing with new items. And so the performance was becoming bottle-neck as most of the columns had no values. This problem was also giving rise to change the schema again and again to accommodate new commodities.
2.1 Issues
The following problems were faced in horizontal representation
· Large number of columns: As many columns required as the attributes of objects, and this count keeps on increasing with new objects
· Sparsity: Nulls in most of the fields
· Schema Evolution: Frequent alteration would be required in the design
· Performance: Only few columns are required from wide records
2.2 Vertical Representation
As a solution of above problem, Vertical representation was proposed to be used in large e-commerce systems. And the table will have the following columns in it.
Oid: Object Identifier
Key: Attribute Name
Val: Attribute Name
The vertical table contains tuples for only those attributes that are present in an object. Writing SQL queries against this vertical representation was difficult and error-prone. Also most of the tools written are for Horizontal representation.
The basic idea behind this research was to develop a logical layer over Vertical data, i.e. horizontal data would be visible to the user, and there would be no need to re-design previously built systems. Few alternatives to this design are given in paper.
Horizontal Oid | A1 | A2 | A3 | 1 | a | b | - | 2 | - | c | d | 3 | - | - | a | 4 | b | - | d | - corresponds to null value | Vertical Oid | Key | Val | 1 | A1 | a | 1 | A2 | b | 2 | A2 | c | 2 | A3 | d | 3 | A3 | a | 4 | A1 | b | 4 | A3 | d | |
The approach is to define a horizontal view; over a vertical table. The user poses regular SQL queries over these views, which are translated into queries that run against the underlying vertical table. Transformations are given in the form of Relational Algebra. Some standard algebraic operations are needed to be re-written. All the transformations and Re-writings are done without assumptions, so they all are valid and proved.
2.3 Alternative Representations
Following were the alternatives present
· Split a horizontal table into as many 2-ary tables as the number of columns i.e. as many tables as the number of columns
· Create one table for each new category
· Create one table for common attributes and per category separate tables for non-common attributes.
· Represent you data in 3-ary table i.e. the Vertical Representation
3 Transformations
The view to user is Horizontal, and all the queries made by user are against that view, behind this logical view is the vertical representation. This vertical view is transparent to the user.
A Transformation is made to these queries that allow fetching of data from vertical table. In this manner the user and the old running systems are saved from re-writing the queries.
4 Implementation
A non-intrusive enablement layer is built on top of the database engine.
The enablement layer parses the SQL query, validates it, and transforms it to another SQL query that runs against the underlying vertical table.
Three types of implementations are proposed that work with Vertical representation.
· Vertical SQL
· Vertical UDF
· Schema SQL
5 Performance Experiments
Very Large numbers of experiments were performed on different alternatives, and results for projection, selection, join and aggregation operations are given in [1], and the results are shown with the help of graphs.
The performance results showed that VerticalSQL uniformly outperforms HorizontalSQL but slightly under performs Binary. The vertical representation uniformly outperforms horizontal representation for sparse data
The performance of the vertical representation is sensitive to the choice made for clustering the data. Clustering on Key has much higher performance than clustering on Oid.
The Vertical representation can outperform binary representations for other operations as well, if table function could provide some extra functionality.
6 Conclusion
Emerging applications such as e-commerce and portals are creating new threats and opportunities for database technology.
The new generation of applications requires data schemas that are rapidly evolving and sparsely populated.
The flexible design of vertical representation makes it easily manageable as compared to sparsely populated horizontal table. Design of an enablement layer hides the complexity of the queries over the vertical table and gives a horizontal view of the vertical representation to the user (application).
The performance of Vertical System also depends on the type of clustering of data.
Comparison of the three representations
| Horizontal | Vertical | Binary |
Manageability | + | + | - |
Flexibility | - | + | - |
Performance | - | + | + |
The following enhancements can improve the performance of Vertical representation.
· Partial indices
· Enhanced table functions
· First class treatment of table functions
· Native support for v2h and h2v operations
7 References
[1] Storage and Querying of E-Commerce Data
Rakesh Agrawal Amit Somani Yirong Xu
IBM Almaden Research Center
650 Harry Road, San Jose, CA 95120
http://www.almaden.ibm.com/software/quest/Publications/papers/vldb01_ecom.pdf