Apache Spark 1.5.0 Officially Released
Spark 1.5.0 is the sixth release in the 1.x series. It reflects the work of 230+ contributors and 80+ institutions, with 1400+ patches in total. The notable improvements are:
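[*] APIs: RDD, DataFrame, and SQL
[*] Backend execution: DataFrame and SQL
[*] Integrations: data sources, Hive, Hadoop, Mesos, and cluster management
[*] R language
[*] Machine learning and advanced analytics
[*] Spark Streaming
[*] Deprecations, removals, configs, and behavior changes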
[*] Spark Core
[*] Spark SQL & DataFrames
[*] Spark Streaming
[*] MLlib
[*] Known issues
[*] SQL/DataFrame
[*] Streaming
[*] Credits
Download: spark-1.5.0.tgz
For the detailed list of improvements, see the release notes and the changelog.
New features:
[*] - Provide memory-and-local-disk RDD checkpointing
[*] - Support decimals with precision > 18 in Parquet
[*] - Support dynamic allocation for standalone mode
[*] - Feature Importance for Random Forests
[*] - Python API for MQTT streaming
[*] - Python support for Power Iteration Clustering
[*] - Create MLlib metrics user guide with algorithm definitions and complete code examples.
[*] - Add MatrixUDT in PySpark
[*] - Add sequential pattern mining algorithm PrefixSpan to Spark MLlib
[*] - SparkR style guide
[*] - Convert NAs to null type in SparkR DataFrames
[*] - Extend `addPackage` so that any given R file can be sourced in the worker before functions are run.
[*] - Support Cancellation in the Thrift Server
[*] - Binary processing dimensional join
[*] - Extend PIC to handle Graphs directly
[*] - Report memory used in aggregations and joins
[*] - Add QR decomposition for RowMatrix
[*] - CrossValidator example code in Python
[*] - Add argmax to Vector, SparseVector
[*] - Remove physical Distinct operator in favor of Aggregate
[*] - Example code for ElasticNet
[*] - Python API for PCA and PCAModel
[*] - Python API for ElementwiseProduct
[*] - Add Python API for Statistics.kernelDensity
[*] - MulticlassClassificationEvaluator for tuning Multiclass
[*] - KMeans API for spark.ml Pipelines
[*] - Be able to disable intercept in Linear Regression in ML package
[*] - Mechanism to control receiver scheduling
[*] - Create worker R processes with a command other than Rscript
[*] - Created more examples on SparkR DataFrames
[*] - Securely pass auth secrets to executors in standalone cluster mode
[*] - Add StopWordsRemover as a transformer
[*] - Support heterogeneous cluster nodes on YARN
[*] - Support Spark Packages containing R code with --packages
[*] - Add internal metrics / logging for DAGScheduler to detect long pauses / blocking
[*] - Add `in` operator to DataFrame Column
[*] - Add crosstab to SparkR DataFrames
[*] - Add `in` operator to DataFrame Column in SparkR
[*] - Add helper functions for testing physical SparkPlan operators
[*] - Python API for N-Gram Feature Transformer
[*] - Add numNonzeros and numActives to linalg.Matrices
[*] - Add TrainValidationSplit to ml.tuning
[*] - Disable feature scaling in Linear and Logistic Regression
[*] - LinearRegressionResults
[*] - LinearRegressionSummary
[*] - Python example code for elastic net
[*] - Add the Python API for Kinesis
[*] - Support arbitrary object in UnsafeRow
[*] - Implementation of 1-sample, two-sided, Kolmogorov Smirnov Test for RDDs
[*] - Naive Bayes API for spark.ml Pipelines
[*] - Add isotonic regression to the pipeline API
[*] - Add missing methods in StandardScaler (ML and PySpark)
[*] - Implement Pylint / Prospector checks for PySpark
[*] - Add additional methods to JavaModel wrappers in trees
[*] - Add R model formula with basic support as a transformer
[*] - Add random data generation test utilities to Spark SQL
[*] - GenerateOrdering fails for NullType (i.e. ORDER BY NULL crashes)
[*] - Allow additional uris to be fetched with mesos
[*] - Add between operator in SparkR
[*] - String concatenation with column in SparkR
[*] - Show the UDF usage for user.
[*] - Add missing methods in Word2Vec ML
[*] - A New Receiver Scheduling Mechanism
[*] - Hyperparameter estimation in LDA
[*] - Implement @since as an annotation
[*] - Add Python API for Kolmogorov-Smirnov Test
[*] - UnsafeProject
[*] - UnsafeExchange
[*] - Unsafe HashJoin
[*] - Add CountVectorizer as an estimator to generate CountVectorizerModel
[*] - Implement LogisticRegressionSummary similar to LinearRegressionSummary
[*] - date/time function: dayInYear
[*] - Add planner rule for automatically inserting Unsafe <-> Safe row format converters
[*] - UTF8String empty string method
[*] - Integrate MLlib with SparkR using RFormula
[*] - SparkR RFormula should support StringType features
[*] - DistributedLDAModel method for top topics per document
[*] - DistributedLDAModel predict top topic per doc-term instance
[*] - DistributedLDAModel predict top docs per topic
[*] - Add Spark Submit flag to exclude dependencies when using --packages
[*] - Migrate JSON data source to the new partitioning data source
[*] - Support minus, dot, and intercept operators in SparkR RFormula
[*] - LocalLDAModel should save docConcentration, topicConcentration, and gammaShape
[*] - Add property-based tests for UTF8String
[*] - Multilayer perceptron
[*] - RFormula in Python
[*] - PrefixSpan getMaxPatternLength should return an Int
[*] - Add `ifelse` Column function to SparkR
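To make one entry in the list above more concrete, here is a minimal sketch of what a one-sample, two-sided Kolmogorov-Smirnov statistic measures. It is plain Python on a local list, not Spark's RDD-based implementation; the names `ks_statistic` and `uniform_cdf` are made up for this illustration, and the sketch computes only the statistic, not a p-value.

```python
def ks_statistic(sample, cdf):
    """Two-sided one-sample KS statistic D for `sample` against the CDF `cdf`."""
    xs = sorted(sample)
    n = len(xs)
    if n == 0:
        raise ValueError("sample must not be empty")
    d = 0.0
    for i, x in enumerate(xs, start=1):
        f = cdf(x)
        # The empirical CDF jumps from (i - 1)/n to i/n at x, so the largest
        # gap to the theoretical CDF can occur on either side of the jump.
        d = max(d, i / n - f, f - (i - 1) / n)
    return d


def uniform_cdf(x):
    # CDF of the uniform distribution on [0, 1] (hypothetical reference distribution).
    return min(1.0, max(0.0, x))


d = ks_statistic([0.1, 0.4, 0.7], uniform_cdf)
print(round(d, 4))
```

A small D means the sample's empirical distribution stays close to the reference distribution; a distributed version would have to sort and rank the data across partitions before taking the same maximum.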
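Apache Spark is an open-source cluster computing environment similar to Hadoop, but the two differ in ways that make Spark perform better on certain workloads. In particular, Spark supports in-memory distributed datasets, which enable interactive queries and also speed up iterative workloads.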
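Spark is implemented in Scala and uses Scala as its application framework. Unlike Hadoop, Spark is tightly integrated with Scala, so Scala code can manipulate distributed datasets as easily as local collection objects.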
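The "distributed dataset that feels like a local collection" idea can be sketched with a toy model. The example below is written in plain Python rather than Scala and does not use Spark at all; `ToyDataset` and its methods are hypothetical names chosen for this illustration, and the "partitions" are just in-process lists standing in for data spread across a cluster.

```python
from functools import reduce


class ToyDataset:
    """Hypothetical stand-in for a distributed dataset: a list of partitions."""

    def __init__(self, partitions):
        self.partitions = [list(p) for p in partitions]

    @classmethod
    def parallelize(cls, items, num_partitions=2):
        # Deal items into partitions round-robin, roughly as a cluster might spread them.
        items = list(items)
        return cls(items[i::num_partitions] for i in range(num_partitions))

    def map(self, fn):
        # Apply fn inside every partition independently.
        return ToyDataset([fn(x) for x in p] for p in self.partitions)

    def filter(self, pred):
        # Keep only the elements of each partition that satisfy pred.
        return ToyDataset([x for x in p if pred(x)] for p in self.partitions)

    def reduce(self, fn):
        # Combine within each non-empty partition first, then merge the partial results.
        # fn is assumed to be associative, otherwise the grouping would change the answer.
        partials = [reduce(fn, p) for p in self.partitions if p]
        return reduce(fn, partials)


numbers = ToyDataset.parallelize(range(1, 11), num_partitions=3)
total = (numbers.filter(lambda x: x % 2 == 0)
                .map(lambda x: x * x)
                .reduce(lambda a, b: a + b))
print(total)
```

The calling code reads like ordinary collection code (`filter`, `map`, `reduce`), while the per-partition work is hidden inside the dataset object; that separation is the point the paragraph above makes about Spark's programming model.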
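Although Spark was created to support iterative jobs on distributed datasets, it is in fact a complement to Hadoop and can run in parallel on the Hadoop file system. This is supported through a third-party cluster framework called Mesos. Spark was developed by the AMP Lab (Algorithms, Machines, and People Lab) at the University of California, Berkeley, and can be used to build large-scale, low-latency data analytics applications.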