Overview

Oushu Database (OushuDB for short) is a new-generation cloud-native data warehouse created by the founding team of Apache HAWQ. It adopts an architecture that separates storage from compute. In addition to all the advantages of MPP, it is elastic and highly scalable, and it supports mixed workloads. Both public cloud and on-premise deployments are supported. OushuDB follows the ANSI SQL standard and has a blazing fast execution engine that supports interactive queries on petabyte-scale data. It also supports descriptive analysis and advanced machine learning, and integrates smoothly with widely used BI tools. Compatible with Oracle, GPDB and PostgreSQL, OushuDB is a good candidate for replacing traditional data warehouses such as Teradata, Oracle, DB2 and Greenplum, as well as SQL-on-Hadoop engines. Targeting cloud environments, OushuDB natively supports Kubernetes platforms so that enterprises can migrate to the latest cloud computing platforms seamlessly. OushuDB has been widely deployed in many industries, including finance, telecommunications, manufacturing, healthcare and the Internet.

Improvements to Apache HAWQ

  • Brand new executor, fully utilizing all features of underlying hardware: 5 - 10 times higher performance than Apache HAWQ

  • Supports Update, Delete, and Index

  • C++ pluggable external storage

    • A replacement for the Java-based PXF. It is several times faster, and no additional PXF components need to be installed or deployed, which greatly simplifies installation, deployment, operation and maintenance.
    • Natively supports CSV/TEXT formats
    • Can be used to share and transfer data among clusters, such as a data warehouse and a data mart
    • Can be used for high-speed data importing and exporting
    • Provides high-speed backup and recovery
    • Supports pluggable file systems: such as S3, Ceph, etc.
    • Supports pluggable file formats: such as ORC, Parquet, etc.
  • Supports ORC/TEXT/CSV as internal table formats, and supports ORC as an external format (via the C++ pluggable storage interface)

  • Supports PaaS/CaaS cloud platforms natively

    • The world’s first MPP++ analytic database that can run in native PaaS container cloud platforms
    • Supports Kubernetes cluster container orchestration and deployment
  • For CSV and TEXT file formats, non-ASCII and multi-character delimiters are supported

  • Some critical bug fixes
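
As a rough illustration of what the multi-character and non-ASCII delimiter support listed above means for TEXT/CSV data, the sketch below splits sample lines in plain Python. It is not OushuDB's parser; the helper name `split_fields`, the delimiters `|~|` and `§`, and the sample rows are all made up for the example, and quoting and escaping are ignored.

```python
def split_fields(line, delimiter):
    """Split one text line on an arbitrary, possibly multi-character, delimiter.

    Hypothetical helper for illustration only; it ignores quoting and escaping.
    """
    if not delimiter:
        raise ValueError("delimiter must be non-empty")
    # Drop the trailing newline, then split on the full delimiter string.
    return line.rstrip("\n").split(delimiter)


# Sample rows using a three-character delimiter and non-ASCII field values.
rows = ["1|~|Zürich|~|42.5\n", "2|~|北京|~|17.0\n"]
parsed = [split_fields(r, "|~|") for r in rows]

# A single non-ASCII character can serve as the delimiter as well.
non_ascii = split_fields("a§b§c", "§")

print(parsed)
print(non_ascii)
```

The point of the sketch is only that the delimiter is treated as a whole string rather than as a single ASCII byte.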

Main features

  • Blazing fast new executor: 5 - 10 times faster than traditional data warehouses/MPP and 5 - 30 times faster than Hadoop SQL engines
  • On-premise or cloud deployment. Supports both Amazon and AliCloud, as well as deployment on popular PaaS container cloud platforms (such as Kubernetes) and Docker.
  • Robust SQL standard compliance: ANSI SQL, OLAP extensions, and standard connectivity through JDBC/ODBC. More comprehensive than Hadoop SQL engines.
  • Extremely mature parallel optimizer. The optimizer is a key component of a parallel SQL engine and has a large impact on performance, especially for complex queries.
  • ACID transaction support: many existing Hadoop-based SQL engines don’t have transaction support, which is critical to ensure data consistency. Transaction support effectively reduces the burden on development and operations.
  • Dynamic data flow engine through a high-speed UDP-based interconnect network
  • Elastic scheduling execution: on-demand allocation of virtual segments, based on the size of each query
  • Supports multi-level partitioning and List/Range based partitioned tables. Partitioned tables can greatly improve performance: if users only need the hot data of the last month, a query scans only the partition that holds that month's data.
  • Multiple compression method support: snappy, gzip, zlib, zstd, lz4, RLE, etc.
  • Multiple procedural language support for user defined procedures: Python, C/C++, Perl, etc.
  • Dynamic node expansion: on-demand, adding nodes in seconds according to storage or computing needs
  • Multi-level resource and load management: integrates with YARN, manages CPU and memory, etc., and provides hierarchical resource queues with a convenient DDL management interface
  • Native machine learning and data mining functionality through MADlib: high performance and easy to use
  • Seamless integration with Hadoop systems: storage (HDFS), resource management (YARN), deployment (Ambari), data format and access, etc.
  • Comprehensive authentication and permission control: Kerberos, database level and table level authorization
  • Supports most third-party tools: Tableau, SAS, Apache Zeppelin, etc.
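
The partitioning benefit described in the feature list (a query for last month's hot data scans only that month's partition) can be sketched in a few lines of Python. This is a conceptual model under stated assumptions, not OushuDB's implementation: the class `RangePartition`, the functions `month_partitions`, `insert_row` and `query`, and the `sales_YYYY_MM` partition names are all hypothetical.

```python
from datetime import date


class RangePartition:
    """Hypothetical range partition covering [start, end) on a date column."""

    def __init__(self, name, start, end):
        self.name = name
        self.start = start  # inclusive lower bound
        self.end = end      # exclusive upper bound
        self.rows = []

    def overlaps(self, lo, hi):
        # True when the partition's range intersects the query range [lo, hi).
        return self.start < hi and lo < self.end


def month_partitions(year):
    """Create one partition per calendar month of the given year."""
    parts = []
    for m in range(1, 13):
        start = date(year, m, 1)
        end = date(year + 1, 1, 1) if m == 12 else date(year, m + 1, 1)
        parts.append(RangePartition(f"sales_{year}_{m:02d}", start, end))
    return parts


def insert_row(parts, row_date, payload):
    """Route a row to the single partition whose range contains its date."""
    for p in parts:
        if p.start <= row_date < p.end:
            p.rows.append((row_date, payload))
            return p.name
    raise ValueError(f"no partition covers {row_date}")


def query(parts, lo, hi):
    """Scan only partitions overlapping [lo, hi); the rest are pruned."""
    scanned = [p for p in parts if p.overlaps(lo, hi)]
    rows = [r for p in scanned for r in p.rows if lo <= r[0] < hi]
    return [p.name for p in scanned], rows


parts = month_partitions(2023)
for m in range(1, 13):
    insert_row(parts, date(2023, m, 15), f"order-{m}")

# Ask only for December: 11 of the 12 partitions are never touched.
scanned, rows = query(parts, date(2023, 12, 1), date(2024, 1, 1))
print(scanned)
print(rows)
```

In this toy model the December query touches one partition out of twelve, which is the effect the partitioning bullet describes; a real engine makes the same decision from partition metadata during query planning.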