WORKSHOP (INTERNATIONAL) Yosegi: Columnar format for efficient nested data processing by schema conversion
Yasunori Oto, Kouji Ijima, Kotaro Terada, Makoto Onizuka (Osaka University)
The 1st Workshop on Distributed Infrastructure, Systems, Programming and AI (DISPA 2020)
August 31, 2020
In the big data era, data-intensive computing frameworks are in great demand. Many web companies save nested data from their services in columnar formats and query it on the processing layer for analysis. Conventional formats have two problems: (1) it is difficult to cope with changes in the input data structure, because the schema must be defined before the data is loaded; and (2) the processing layer cannot apply efficient columnar methods, because it treats array-type nested data row-wise. We propose a new columnar format, Yosegi, which solves both problems. To solve problem (1), Yosegi provides two functions: the first saves input data without a predefined schema; the second is schema conversion, which transforms data into any schema given an output schema and a column name mapping. To solve problem (2), Yosegi presents a flat schema to the processing layer by converting nested structures with index operations, so it can skip reading unnecessary data by pushing down predicates. We evaluate the efficiency of Yosegi using a nested version of TPC-H and a real data set. The results validate the efficiency of Yosegi.
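The idea behind problem (2) can be illustrated with a small sketch. It is not Yosegi's implementation or API: the function names (`flatten_array_column`, `scan_with_predicate`) and the sample records are hypothetical, and plain Python lists stand in for the on-disk columns. It assumes an array-of-structs field is expanded into flat columns plus an index back to the parent row, so that a predicate can be evaluated on one column before any other column is read.

```python
def flatten_array_column(records, array_field, child_fields):
    """Expand an array-of-structs field into flat columns.

    Returns (parent_index, columns), where parent_index[i] is the row
    number of the record that the i-th flattened element came from.
    Records that lack the field contribute nothing, so no schema has
    to be declared up front.
    """
    parent_index = []
    columns = {name: [] for name in child_fields}
    for row_id, record in enumerate(records):
        for item in record.get(array_field, []):
            parent_index.append(row_id)
            for name in child_fields:
                columns[name].append(item.get(name))
    return parent_index, columns


def scan_with_predicate(parent_index, columns, filter_field, predicate, projected):
    """Evaluate the predicate on a single column, then read the projected
    columns only at the matching positions (a stand-in for pushdown)."""
    matches = [
        i for i, value in enumerate(columns[filter_field])
        if value is not None and predicate(value)
    ]
    return [
        {"row": parent_index[i], **{name: columns[name][i] for name in projected}}
        for i in matches
    ]


# Hypothetical input: nested records whose structure varies between rows.
records = [
    {"order_id": 1, "items": [{"sku": "a", "qty": 2}, {"sku": "b", "qty": 7}]},
    {"order_id": 2},
    {"order_id": 3, "items": [{"sku": "c", "qty": 9, "note": "gift"}]},
]

parent_index, columns = flatten_array_column(records, "items", ["sku", "qty"])
result = scan_with_predicate(parent_index, columns, "qty", lambda q: q > 5, ["sku"])
print(result)
```

In this sketch only the `qty` column is scanned in full; `sku` is touched at the two matching positions, and the parent index maps each hit back to its original record.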
Paper: Yosegi: Columnar format for efficient nested data processing by schema conversion
Software: https://github.com/yahoojapan/yosegi