Apache Spark 2.0.0 Released with API Updates

Posted by snake_l on 2016-10-26 09:07:57

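　　Apache Spark 2.0.0 has been released. Apache Spark is an open-source cluster computing environment similar to Hadoop, but the two differ in ways that make Spark perform better on certain workloads: Spark supports in-memory distributed datasets, which lets it serve interactive queries and also speed up iterative workloads.
　　This release mainly updates the APIs, adds SQL 2003 support and R UDF support, and improves performance. 300 developers contributed 2,500 patches.
　　The API updates in Apache Spark 2.0.0 are as follows: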

[*]　　Unifying DataFrame and Dataset: In Scala and Java, DataFrame and Dataset have been unified, i.e. DataFrame is just a type alias for Dataset of Row. In Python and R, given the lack of type safety, DataFrame is the main programming interface.
[*]　　SparkSession: new entry point that replaces the old SQLContext and HiveContext for DataFrame and Dataset APIs. SQLContext and HiveContext are kept for backward compatibility.
[*]　　A new, streamlined configuration API for SparkSession
[*]　　Simpler, more performant accumulator API
[*]　　A new, improved Aggregator API for typed aggregation in Datasets
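　　As a rough illustration of the SparkSession entry point and its configuration API, the PySpark sketch below builds a session, changes a runtime option through `spark.conf`, and runs a SQL query through the same object. The function name `session_demo`, the app name, the view name `events`, and the sample rows are made up for illustration. It assumes PySpark 2.0 or later is installed; the import sits inside the function so the module still loads where PySpark is missing.

```python
def session_demo():
    """Build a local SparkSession, run one SQL query, return (key, total) pairs."""
    # Imported lazily: requires the pyspark package (assumed to be installed).
    from pyspark.sql import SparkSession

    # The builder takes the place of constructing SQLContext / HiveContext by hand.
    spark = (
        SparkSession.builder
        .master("local[1]")                           # assumption: one local thread
        .appName("session-demo")                      # hypothetical app name
        .config("spark.sql.shuffle.partitions", "4")  # set at build time
        .getOrCreate()
    )
    try:
        # Runtime options can also be changed on the session's conf object.
        spark.conf.set("spark.sql.shuffle.partitions", "2")

        df = spark.createDataFrame(
            [("a", 1), ("b", 2), ("a", 3)], ["key", "value"]
        )
        df.createOrReplaceTempView("events")

        rows = spark.sql(
            "SELECT key, SUM(value) AS total FROM events GROUP BY key ORDER BY key"
        ).collect()
        return [(row["key"], row["total"]) for row in rows]
    finally:
        spark.stop()
```

　　Calling `session_demo()` on a machine with PySpark available should return the per-key totals for the sample rows.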
　　The SQL updates in Apache Spark 2.0.0 are as follows:

[*]　　A native SQL parser that supports both ANSI SQL and HiveQL
[*]　　Native DDL command implementations
[*]　　Subquery support, including

[*]　　Uncorrelated Scalar Subqueries
[*]　　Correlated Scalar Subqueries
[*]　　NOT IN predicate Subqueries (in WHERE/HAVING clauses)
[*]　　IN predicate subqueries (in WHERE/HAVING clauses)
[*]　　(NOT) EXISTS predicate subqueries (in WHERE/HAVING clauses)

[*]　　View canonicalization support
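　　The subquery forms listed above are standard SQL, so their shape can be shown without a cluster. The sketch below runs one query per form in SQLite through Python's `sqlite3` module, purely as a stand-in: it is not Spark, and Spark may apply its own restrictions to each form. The tables `customers` and `orders`, their rows, and the function name are hypothetical.

```python
import sqlite3


def run_subquery_examples():
    """Run one query per subquery form against a tiny in-memory database."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL);
        INSERT INTO customers VALUES (1, 'ann'), (2, 'bob'), (3, 'cat');
        INSERT INTO orders VALUES (10, 1, 50.0), (11, 1, 20.0), (12, 2, 5.0);
        """
    )
    queries = {
        # Inner query does not reference the outer row.
        "uncorrelated_scalar":
            "SELECT name FROM customers "
            "WHERE id = (SELECT MAX(customer_id) FROM orders) ORDER BY name",
        # Inner query is re-evaluated per outer row via c.id.
        "correlated_scalar":
            "SELECT name FROM customers c "
            "WHERE (SELECT SUM(o.amount) FROM orders o "
            "       WHERE o.customer_id = c.id) > 10 ORDER BY name",
        "in":
            "SELECT name FROM customers "
            "WHERE id IN (SELECT customer_id FROM orders) ORDER BY name",
        "not_in":
            "SELECT name FROM customers "
            "WHERE id NOT IN (SELECT customer_id FROM orders) ORDER BY name",
        "exists":
            "SELECT name FROM customers c WHERE EXISTS "
            "(SELECT 1 FROM orders o WHERE o.customer_id = c.id AND o.amount > 10) "
            "ORDER BY name",
        "not_exists":
            "SELECT name FROM customers c WHERE NOT EXISTS "
            "(SELECT 1 FROM orders o WHERE o.customer_id = c.id) ORDER BY name",
    }
    results = {
        label: [row[0] for row in conn.execute(sql)]
        for label, sql in queries.items()
    }
    conn.close()
    return results
```

　　Note that `NOT IN` behaves differently from `NOT EXISTS` once the subquery can return NULL; the sample data avoids NULLs so both forms select the customer without orders.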
　　Some new features:

[*]　　Native CSV data source, based on Databricks’ spark-csv module
[*]　　Off-heap memory management for both caching and runtime execution
[*]　　Hive style bucketing support
[*]　　Approximate summary statistics using sketches, including approximate quantile, Bloom filter, and count-min sketch.
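　　To give a feel for two of the sketches named above, here is a minimal pure-Python version of a count-min sketch and a Bloom filter. This is a teaching sketch, not Spark's implementation: the class names, the SHA-256 based hashing, and the default sizes are all assumptions chosen for readability.

```python
import hashlib


def _hash(item, seed, modulus):
    """Deterministic hash of `item` under `seed`, reduced to [0, modulus)."""
    digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % modulus


class CountMinSketch:
    """Frequency estimates that may overcount but never undercount."""

    def __init__(self, width=256, depth=4):
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def add(self, item, count=1):
        # Each row uses its own hash seed and bumps exactly one counter.
        for row in range(self.depth):
            self.table[row][_hash(item, row, self.width)] += count

    def estimate(self, item):
        # Collisions only inflate counters, so the minimum is the tightest bound.
        return min(
            self.table[row][_hash(item, row, self.width)]
            for row in range(self.depth)
        )


class BloomFilter:
    """Set membership with possible false positives and no false negatives."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = [False] * num_bits

    def add(self, item):
        for seed in range(self.num_hashes):
            self.bits[_hash(item, seed, self.num_bits)] = True

    def might_contain(self, item):
        # True means "possibly present"; False means "definitely absent".
        return all(
            self.bits[_hash(item, seed, self.num_bits)]
            for seed in range(self.num_hashes)
        )
```

　　Both structures trade exactness for fixed memory: a larger `width` or `num_bits` lowers the error, at the cost of more space.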
　　Performance enhancements:

[*]　　Substantial (2 - 10X) performance speedups for common operators in SQL and DataFrames via a new technique called whole-stage code generation.
[*]　　Improved Parquet scan throughput through vectorization
[*]　　Improved ORC performance
[*]　　Many improvements in the Catalyst query optimizer for common workloads
[*]　　Improved window function performance via native implementations for all window functions
[*]　　Automatic file coalescing for native data sources
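　　Whole-stage code generation is usually described as collapsing a pipeline of query operators into a single generated function, so that rows no longer pass through a chain of per-operator iterator calls. The hand-written Python below only illustrates that shape: Spark generates code at runtime rather than relying on functions like these, and the operator names, the `price`/`qty` columns, and the query itself are hypothetical.

```python
def scan(rows):
    """Leaf operator: yield rows one at a time."""
    for row in rows:
        yield row


def filter_op(child, predicate):
    """Pass through only the rows that satisfy the predicate."""
    for row in child:
        if predicate(row):
            yield row


def project_op(child, fn):
    """Compute one output value per input row."""
    for row in child:
        yield fn(row)


def sum_op(child):
    """Aggregate all incoming values into a single total."""
    total = 0
    for value in child:
        total += value
    return total


def iterator_plan(rows):
    """One operator per step; each row crosses several generator boundaries."""
    filtered = filter_op(scan(rows), lambda r: r["price"] > 10)
    projected = project_op(filtered, lambda r: r["price"] * r["qty"])
    return sum_op(projected)


def fused_plan(rows):
    """The same query collapsed into one loop, the shape code generation aims for."""
    total = 0
    for row in rows:
        if row["price"] > 10:
            total += row["price"] * row["qty"]
    return total
```

　　Both functions compute the same total for any input; the fused version simply does it without intermediate operators, which is where the reported speedup is said to come from.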
　　For more release information, see the release notes.
　　Download: http://spark.apache.org/downloads.html
　　
