MIT scientists prepare paradigm shift in data processing
New York, 22 August 2020
By 2025, all the data in the world is expected to add up to an estimated 175 trillion gigabytes. Stored on DVDs, that much data would form a stack long enough to circle the Earth 222 times.
Storing and processing this data efficiently is one of the greatest challenges ahead. A team at the MIT Computer Science and Artificial Intelligence Laboratory (CSAIL) believes that this will only be possible with so-called “instance-optimized systems”.
Conventional storage and database systems are designed for a wide range of applications. Creating them takes months, often even several years.
In contrast, the goal of instance-optimized systems is to build systems that optimize and, in part, reorganize themselves for the data they store and the workload they serve. “It’s like building a database system from scratch for each application, which is not economically feasible with traditional system designs,” says MIT Professor Tim Kraska, explaining the difference.
As a first step towards this vision, Kraska and his colleagues developed Tsunami and Bao. Tsunami ( https://arxiv.org/pdf/2006.13282.pdf ) uses machine learning to automatically reorganize the storage layout of a data set based on the types of queries its users make. Tests show that it can execute queries up to 10 times faster than state-of-the-art systems. In addition, its data sets can be organized through a set of “learned indexes” that are up to 100 times smaller than the indexes used in traditional systems.
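The idea behind a learned index can be illustrated with a small sketch. The Python below is a toy, not Tsunami's implementation: the class name `LearnedIndex`, the single linear model, and the error-bounded search are all assumptions chosen for illustration. Instead of a tree, the index keeps only a slope, an intercept, and a worst-case error, which hints at why such structures can be far smaller than traditional indexes.

```python
from bisect import bisect_left


class LearnedIndex:
    """Toy learned index (hypothetical): a linear model maps a key to a position."""

    def __init__(self, sorted_keys):
        self.keys = list(sorted_keys)
        n = len(self.keys)
        if n == 0:
            raise ValueError("need at least one key")
        # Fit position ~ slope * key + intercept by ordinary least squares.
        mean_k = sum(self.keys) / n
        mean_p = (n - 1) / 2
        var_k = sum((k - mean_k) ** 2 for k in self.keys)
        cov = sum((k - mean_k) * (p - mean_p) for p, k in enumerate(self.keys))
        self.slope = cov / var_k if var_k else 0.0
        self.intercept = mean_p - self.slope * mean_k
        # Remember the worst prediction error so lookups can bound their search.
        self.max_err = max(abs(self._predict(k) - p) for p, k in enumerate(self.keys))

    def _predict(self, key):
        return round(self.slope * key + self.intercept)

    def lookup(self, key):
        """Return the position of key in the sorted data, or None if it is absent."""
        n = len(self.keys)
        guess = self._predict(key)
        # Only search the small window the model's error bound allows.
        lo = min(max(0, guess - self.max_err), n)
        hi = max(lo, min(n, guess + self.max_err + 1))
        i = bisect_left(self.keys, key, lo, hi)
        if i < n and self.keys[i] == key:
            return i
        return None


keys = [2, 3, 5, 8, 13, 21, 34, 55, 89, 144]
index = LearnedIndex(keys)
positions = [index.lookup(k) for k in keys]
```

The more closely the model follows the data's distribution, the smaller the error bound becomes and the fewer entries each lookup has to inspect.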
Kraska has been working on “learned indexes” for several years, part of that time with colleagues at Google. Harvard University professor Stratos Idreos, who was not involved in the Tsunami project, sees the unique advantage of learned indexes in their small size. In addition to saving space, they also enable significant performance improvements.
“I think this kind of work represents a paradigm shift that will have a long-term impact on system design,” says Idreos. “I believe that model-based approaches will be one of the key components at the heart of a new wave of adaptive systems.”
Bao ( https://arxiv.org/abs/2004.03814 ), meanwhile, focuses on improving the efficiency of query optimization through machine learning. A query optimizer rewrites a high-level declarative query into a query plan that can actually be executed on the data to compute the query's result. Often, however, more than one query plan can answer the same query. If a poor plan is selected, computing the answer can take days instead of seconds.
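To see why the choice of plan matters, consider a minimal Python sketch of two plans for the same join. The tables, column layout, and the `work` counter are hypothetical and only stand in for real execution cost. Both plans return identical rows, but one does far more work than the other.

```python
# Hypothetical query: SELECT o.id, c.name FROM orders o JOIN customers c ON o.customer_id = c.id

def nested_loop_join(orders, customers):
    """Plan A: compare every order with every customer."""
    rows, work = [], 0
    for order_id, cust_id in orders:
        for c_id, name in customers:
            work += 1  # one comparison
            if cust_id == c_id:
                rows.append((order_id, name))
    return rows, work


def hash_join(orders, customers):
    """Plan B: build a hash table on customers, then probe it once per order."""
    table, work = {}, 0
    for c_id, name in customers:
        table[c_id] = name
        work += 1  # one insert
    rows = []
    for order_id, cust_id in orders:
        work += 1  # one probe
        if cust_id in table:
            rows.append((order_id, table[cust_id]))
    return rows, work


customers = [(i, f"customer-{i}") for i in range(200)]
orders = [(i, i % 200) for i in range(2000)]

rows_a, work_a = nested_loop_join(orders, customers)
rows_b, work_b = hash_join(orders, customers)
```

With these toy sizes, Plan A performs 400,000 comparisons while Plan B needs 2,200 inserts and probes. The gap widens as the tables grow, which is how a bad plan choice can turn seconds into days.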
Traditional query optimizers have significant drawbacks: they take years to build, are very difficult to maintain, and, most importantly, do not learn from their mistakes. Bao is the first learning-based approach to query optimization fully integrated into the popular PostgreSQL database management system. Lead author Ryan Marcus, a postdoc in Kraska’s group, says that Bao creates query plans that run up to 50 percent faster than those created by the PostgreSQL optimizer. That means it could help significantly reduce the cost of cloud services such as Amazon’s Redshift that are based on PostgreSQL.
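What “learning from mistakes” might look like can be hinted at with a toy feedback loop. The sketch below is not Bao's model, which the article does not describe: the `PlanChooser` class, the strategy names, and the fixed latencies are all assumptions for illustration. It tries each candidate strategy once, records how long it took, and afterwards prefers whichever has the lowest observed average.

```python
class PlanChooser:
    """Toy feedback loop (hypothetical): prefer the strategy with the lowest mean latency."""

    def __init__(self, strategies):
        # Per strategy: [total observed latency, number of runs].
        self.stats = {s: [0.0, 0] for s in strategies}

    def choose(self):
        # Try every strategy once before trusting the averages.
        for strategy, (_, runs) in self.stats.items():
            if runs == 0:
                return strategy
        return min(self.stats, key=lambda s: self.stats[s][0] / self.stats[s][1])

    def record(self, strategy, latency):
        self.stats[strategy][0] += latency
        self.stats[strategy][1] += 1


# Made-up, fixed latencies in seconds; a real system would measure each execution.
simulated_latency = {"nested_loop": 9.0, "hash_join": 0.4, "merge_join": 0.7}

chooser = PlanChooser(simulated_latency)
history = []
for _ in range(10):
    strategy = chooser.choose()
    chooser.record(strategy, simulated_latency[strategy])
    history.append(strategy)
```

After one trial of each strategy, the loop settles on `hash_join` for the remaining queries. A production system would also need to keep exploring, since data and workloads change over time, and would take the features of each query into account rather than using one global average.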
By merging the two systems, Kraska hopes to build the first instance-optimized database system that can deliver the best possible performance for each application without manual tuning. The goal is not only to relieve developers of the daunting and tedious process of tuning database systems, but also to achieve performance and cost advantages that are not possible with conventional systems.