Apache Spark 1.5.0 Released

Posted by 想你了的他他 on 2015-9-22 12:01:54

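　　Spark 1.5.0 is the sixth release in the 1.x line. More than 230 contributors from more than 80 institutions took part, contributing over 1,400 patches in total. Notable improvements: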
　　

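[*]　　APIs: RDD, DataFrame, and SQL
[*]　　Backend execution: DataFrame and SQL
[*]　　Integrations: data sources, Hive, Hadoop, Mesos, and cluster management
[*]　　R language
[*]　　Machine learning and advanced analytics
[*]　　Spark Streaming
[*]　　Deprecations, removals, configs, and behavior changes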

[*]　　Spark Core
[*]　　Spark SQL & DataFrames
[*]　　Spark Streaming
[*]　　MLlib

[*]　　Known issues
　　

[*]　　SQL/DataFrame
[*]　　Streaming

[*]　　Credits
　　Download: spark-1.5.0.tgz
　　For the full list of improvements, see the release notes and the changelog.
　　
　　New features:
　　

[*]　　 - Provide memory-and-local-disk RDD checkpointing
[*]　　 - Support decimals with precision > 18 in Parquet
[*]　　 - Support dynamic allocation for standalone mode
[*]　　 - Feature Importance for Random Forests
[*]　　 - Python API for MQTT streaming
[*]　　 - Python support for Power Iteration Clustering
[*]　　 - Create MLlib metrics user guide with algorithm definitions and complete code examples.
[*]　　 - Add MatrixUDT in PySpark
[*]　　 - Add sequential pattern mining algorithm PrefixSpan to Spark MLlib
[*]　　 - SparkR style guide
[*]　　 - Convert NAs to null type in SparkR DataFrames
[*]　　 - Extend `addPackage` so that any given R file can be sourced in the worker before functions are run.
[*]　　 - Support Cancellation in the Thrift Server
[*]　　 - Binary processing dimensional join
[*]　　 - Extend PIC to handle Graphs directly
[*]　　 - Report memory used in aggregations and joins
[*]　　 - add QR decomposition for RowMatrix
[*]　　 - CrossValidator example code in Python
[*]　　 - Add argmax to Vector, SparseVector
[*]　　 - Remove physical Distinct operator in favor of Aggregate
[*]　　 - Example code for ElasticNet
[*]　　 - Python API for PCA and PCAModel
[*]　　 - Python API for ElementwiseProduct
[*]　　 - Add Python API for Statistics.kernelDensity
[*]　　 - MulticlassClassificationEvaluator for tuning Multiclass ...
[*]　　 - KMeans API for spark.ml Pipelines
[*]　　 - Be able to disable intercept in Linear Regression in ML package
[*]　　 - Mechanism to control receiver scheduling
[*]　　 - Create worker R processes with a command other than Rscript
[*]　　 - Created more examples on SparkR DataFrames
[*]　　 - Securely pass auth secrets to executors in standalone cluster mode
[*]　　 - Add StopWordsRemover as a transformer
[*]　　 - Support heterogeneous cluster nodes on YARN
[*]　　 - Support Spark Packages containing R code with --packages
[*]　　 - Add internal metrics / logging for DAGScheduler to detect long pauses / blocking
[*]　　 - Add `in` operator to DataFrame Column
[*]　　 - Add crosstab to SparkR DataFrames
[*]　　 - Add `in` operator to DataFrame Column in SparkR
[*]　　 - Add helper functions for testing physical SparkPlan operators
[*]　　 - Python API for N-Gram Feature Transformer
[*]　　 - Add numNonzeros and numActives to linalg.Matrices
[*]　　 - Add TrainValidationSplit to ml.tuning
[*]　　 - Disable feature scaling in Linear and Logistic Regression
[*]　　 - LinearRegressionResults ...
[*]　　 - LinearRegressionSummary ...
[*]　　 - Python example code for elastic net
[*]　　 - Add the Python API for Kinesis
[*]　　 - Support arbitrary object in UnsafeRow
[*]　　 - Implementation of 1-sample, two-sided, Kolmogorov Smirnov Test for RDDs
[*]　　 - Naive Bayes API for spark.ml Pipelines
[*]　　 - Add isotonic regression to the pipeline API
[*]　　 - Add missing methods in StandardScaler (ML and PySpark)
[*]　　 - Implement Pylint / Prospector checks for PySpark
[*]　　 - Add additional methods to JavaModel wrappers in trees
[*]　　 - Add R model formula with basic support as a transformer
[*]　　 - Add random data generation test utilities to Spark SQL
[*]　　 - GenerateOrdering fails for NullType (i.e. ORDER BY NULL crashes)
[*]　　 - Allow additional uris to be fetched with mesos
[*]　　 - Add between operator in SparkR
[*]　　 - String concatenation with column in SparkR
[*]　　 - Show the UDF usage for user.
[*]　　 - Add missing methods in Word2Vec ML
[*]　　 - A New Receiver Scheduling Mechanism
[*]　　 - Hyperparameter estimation in LDA
[*]　　 - Implement @since as an annotation
[*]　　 - Add Python API for Kolmogorov-Smirnov Test
[*]　　 - UnsafeProject
[*]　　 - UnsafeExchange
[*]　　 - Unsafe HashJoin
[*]　　 - Add CountVectorizer as an estimator to generate CountVectorizerModel
[*]　　 - Implement LogisticRegressionSummary similar to LinearRegressionSummary
[*]　　 - date/time function: dayInYear
[*]　　 - Add planner rule for automatically inserting UnsafeSafe row format converters
[*]　　 - UTF8String empty string method
[*]　　 - Integrate MLlib with SparkR using RFormula
[*]　　 - SparkR RFormula should support StringType features
[*]　　 - DistributedLDAModel method for top topics per document
[*]　　 - DistributedLDAModel predict top topic per doc-term instance
[*]　　 - DistributedLDAModel predict top docs per topic
[*]　　 - Add Spark Submit flag to exclude dependencies when using --packages
[*]　　 - Migrate JSON data source to the new partitioning data source
[*]　　 - Support minus, dot, and intercept operators in SparkR RFormula
[*]　　 - LocalLDAModel should save docConcentration, topicConcentration, and gammaShape
[*]　　 - Add property-based tests for UTF8String
[*]　　 - Multilayer perceptron ...
[*]　　 - RFormula in Python
[*]　　 - PrefixSpan getMaxPatternLength should return an Int
[*]　　 - Add `ifelse` Column function to SparkR
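　　Apache Spark is an open-source cluster computing environment similar to Hadoop, but the two differ in ways that make Spark perform better for certain workloads: Spark provides in-memory distributed datasets, which, in addition to supporting interactive queries, speed up iterative workloads.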
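
　　To make the point about iterative workloads concrete, here is a minimal sketch in plain Python. It is not the Spark API: `CountingSource` and `run_iterations` are hypothetical names made up for this illustration, and the "expensive load" is just a counter. It only shows why keeping a dataset in memory helps when the same data is read on every iteration.

```python
class CountingSource:
    """Hypothetical stand-in for an expensive read (e.g. from disk)."""

    def __init__(self):
        self.loads = 0  # how many times the data was (re)loaded

    def load(self):
        self.loads += 1
        return [float(x) for x in range(1, 11)]


def run_iterations(source, steps, cache):
    """Run a simple iterative update, optionally keeping the data in memory."""
    cached = source.load() if cache else None
    estimate = 0.0
    for _ in range(steps):
        # Without caching, every iteration pays for a fresh load.
        data = cached if cache else source.load()
        mean = sum(data) / len(data)
        estimate += 0.5 * (mean - estimate)  # move halfway toward the mean
    return estimate


uncached_source = CountingSource()
cached_source = CountingSource()
result_uncached = run_iterations(uncached_source, steps=5, cache=False)
result_cached = run_iterations(cached_source, steps=5, cache=True)

# Same answer either way; only the number of loads differs.
print(result_uncached == result_cached)
print(uncached_source.loads, cached_source.loads)
```

　　Both runs compute the same estimate, but the uncached run loads the data once per iteration while the cached run loads it once in total. In a real cluster the saved work would be disk and network I/O rather than a counter.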
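　　Spark is implemented in Scala and uses Scala as its application framework. Unlike Hadoop, Spark is tightly integrated with Scala, so Scala code can manipulate distributed datasets as easily as local collection objects.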
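
　　The "like a local collection" idea can be sketched as follows. This is a toy in plain Python rather than Scala, and it does not use Spark: `PartitionedDataset` is a hypothetical class invented for this example, with partitions held as ordinary lists in one process. It only illustrates how collection-style calls (`map`, `filter`, `reduce`) can be applied partition by partition behind the same interface.

```python
from functools import reduce


class PartitionedDataset:
    """Toy stand-in for a distributed dataset: a list of partitions."""

    def __init__(self, partitions):
        self.partitions = [list(p) for p in partitions]

    @classmethod
    def from_list(cls, items, num_partitions):
        # Deal the items round-robin into num_partitions lists.
        items = list(items)
        return cls([items[i::num_partitions] for i in range(num_partitions)])

    def map(self, fn):
        # Apply fn to each element, one partition at a time.
        return PartitionedDataset([[fn(x) for x in p] for p in self.partitions])

    def filter(self, pred):
        return PartitionedDataset(
            [[x for x in p if pred(x)] for p in self.partitions]
        )

    def reduce(self, fn):
        # Reduce inside each non-empty partition, then combine the partials.
        # fn should be associative and commutative for this to be well defined.
        partials = [reduce(fn, p) for p in self.partitions if p]
        return reduce(fn, partials)


numbers = PartitionedDataset.from_list(range(1, 11), num_partitions=3)
total = (
    numbers.map(lambda x: x * x)
    .filter(lambda x: x % 2 == 0)
    .reduce(lambda a, b: a + b)
)
print(total)  # sum of the even squares of 1..10
```

　　The calling code reads the same as it would for a local list; the split into partitions stays hidden inside the class. That separation is the design point the paragraph above describes.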
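　　Although Spark was created to support iterative jobs on distributed datasets, it is in fact a complement to Hadoop and can run in parallel on the Hadoop file system. This is supported through a third-party cluster framework called Mesos. Spark was developed by the AMP Lab (Algorithms, Machines, and People Lab) at the University of California, Berkeley, and can be used to build large-scale, low-latency data analytics applications.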
