Optimizing Redis for Locality and Capacity
Students: Kevin Chang, Lavanya Subramanian, Yoongu Kim
Source Code: https://github.com/kevincha/15799_project
Traditionally, databases have been housed in disks, while using main memory only as a cache. Correspondingly, there has been a significant amount work on techniques to exploit characteristics unique to disk (e.g., avoiding disk seeks, coalescing writes). More recently, with increasing DRAM densities, main memory has been used to house entire databases. For in-memory databases, it is important to suchoptimize how data in main memory is laid out and accessed. In this project, we employed a two-pronged approach to Redis: (i) data-mapping and (ii) data-compression. Our results show that there is significant potential for improving performance with the first approach. On the other hand, we saw little room for improvement beyond existing mechanisms to compress each key-value pair separately, at least for the text corpus from Wikipedia.
Is Graph Database truly the Best for Graph Data? A Comparison Among Different Databases
Students: Atreyee Maiti, Qing Zheng
Source Code: https://bitbucket.org/Atreyee/big_data_project
Perhaps the biggest common problem both application developers and database administrators are facing right now is big data. While big data certainly means big opportunities and big money, it also implies choosing a set of appropriate database management systems that are able to accommodate emerging application domains and are capable of handling huge and dynamic datasets in a scalable and cost-effective manner. Another important aspect within big data is data analysis. Being able to collect and serve data is perhaps just a beginning, how to intelligently leverage existing data to produce value-added services is the key to win market and show differences. As such, big data requires each organization to give solutions to two questions: 1) how to design algorithms to effectively process data to transform low-level bits into high-level insights? and 2) how to choose the right data model and the right data management system for data analysis?
Incremental Computation
Students: Thomas Marshall, Alex Degtiar, Ram Raghunathan
Source Code: http://github.com/twmarshall/incremental
In recent years, big data processing has become a hot topic in computer science as the result of technology reaching the point in terms of both power and price to make computation on extremely large data sets feasible. Many different distributed systems have been created to ease application development for processing these data sets, but most of these systems make the fundamental assumption that the data will not change. As a result, applications built using these frameworks cannot efficiently respond to changes to their input. To address this issue, we built ThomasDB, which supports efficient updates to computations in response to changing input through the concept of incremental computation.
Non-Volatile Memory Database Systems
Students: Joy Arulraj
Source Code: https://github.com/jarulraj/h-store
Database design is significantly influenced by the properties of the underlying storage system as these properties have significant impact on performance of the database system. In the last decade, main-memory database systems have become a viable alternative to traditional disk-oriented database systems due to the availability of affordable high capacity main-memory devices. The trend seems to be repeating again with the emergence of non-volatile memory (NVM) devices. To better understand the impact of the unique properties of these storage devices on the performance of a database system, we modified the storage layer of a main-memory database system and evaluated the system on an NVM device hardware emulator. Our experiments suggest that NVM devices can have substantial impact on the performance of main-memory database systems. We also observed that the performance of disk-oriented database systems does not vary substantially with NVM latency. We believe that it would be interesting to build a system that
can provide instantaneous recovery with no or minimal logging and plan to explore this in future work.
IterativenessâAware Optimization for Big Data Analytics
Students: Henggang Cui, Lianghong Xu
Source Code: Unavailable
Parallel big data analytical tasks often involve cross-machine communication, since computation instances running on different machines might concurrently read and update some shared global states. This project proposes a mechanism to reduce this cross-machine communication overhead by exploiting the iterative nature of big data analytical algorithms. We find many of these algorithms are iterative convergent, and they iteratively access the same set of shared states in the whole execution, and the access patterns are only dependent on the input dataset. We use a mechanism called virtual iteration to collect these access patterns, and we co-locate each shard of the shared states with the appropriate client based on the known access patterns. In order to support explicitly parameter data partitioning, we add a cluster of distributed metadata servers to an existing parameter server implementation. We implement and evaluate different data placement policies, and experiment results show that the partitioning
policy that uses the collected access information can significantly reduce the fraction of remote accesses while placing balanced loads on tablet servers.
Graph Mining on Big Data Systems
Students: Rui Zhang, Hefu Chai, Jian Fang
Source Code: https://github.com/richardzhangrui/GraphMining
Although data mining and machine learning has delivered the notion of graph mining for a long time, the increasing complex data with explosion size has imposed a new challenge on computation of data mining. Existing computing platforms and computational models like MapReduce spare no efforts on exploiting parallelism; however, graph mining algorithms are characterized as iteration-intensive computation, where data are supposed to be highly interactive. So, models like Pregel are designed to meet the requirements of such graph mining algorithms. Also, we find that many parallel database systems are powerful enough to solve these problems. In this paper, we compare several parallel database systems and computation platforms; furthermore, we evaluate the performance of these systems on typical graph mining algorithms with both distributed and local mode. Our results reveal that: Hadoop is fastest in loading data, taking advantage of simple data layout optimization in Hadoop Distributed File System (HDFS).
Vertica performs much better than other systems in computation by using hybrid store model and fully exploits memory use. SciDB takes the notion of array-based logic and physical store, which enables it to be efficient in matrix operations. PostgreSQL is stable in its performance. Apache Hama takes the notion of thinking like a vertex, thus it is expressive in Graph Mining algorithms than other platforms and easy to get start.
Quilting for Distributed Machine Learning System
Students: Mu Li, Jinliang Wei
Source Code: https://github.com/jinliangwei/quiltdb
Distributed machine learning applications produces a huge amount of network traffic. However, network bandwidth is one of the most scarce resources of datacenters. In this paper we proposal QuiltDB, a distributed database execution engine optimized for network topology. QuiltDB trades off latency for high bandwidth. It executes as a discrete streaming system but with the difference that users can specify desirable network topologies. We evaluate this system on several machine learning applications with real datasets.