Apache Spark 1.5.0 Released

想你了的他他 · Posted 2015-9-22 12:01:54


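　　Spark 1.5.0 is the sixth release in the 1.x line. It is the work of more than 230 contributors from more than 80 institutions and includes over 1,400 patches. The notable areas of improvement are: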
　　

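　　APIs: RDD, DataFrame, and SQL
　　Backend execution: DataFrame and SQL
　　Integrations: data sources, Hive, Hadoop, Mesos, and cluster management
　　R language
　　Machine learning and advanced analytics
　　Spark Streaming
　　Deprecations, removals, configs, and behavior changes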
- 　　Spark Core
- 　　Spark SQL & DataFrames
- 　　Spark Streaming
- 　　MLlib
　　Known issues
　　
- 　　SQL/DataFrame
- 　　Streaming
　　Credits

　　Download: spark-1.5.0.tgz
　　For the full details, see the release notes and the changelog.
　　
　　New features:
　　

　　[SPARK-1855] - Provide memory-and-local-disk RDD checkpointing
　　[SPARK-4176] - Support decimals with precision > 18 in Parquet
　　[SPARK-4751] - Support dynamic allocation for standalone mode
　　[SPARK-4752] ->
　　[SPARK-5133] - Feature Importance for Random Forests
　　[SPARK-5155] - Python API for MQTT streaming
　　[SPARK-5962] - [MLLIB] Python support for Power Iteration Clustering
　　[SPARK-6129] - Create MLlib metrics user guide with algorithm definitions and complete code examples.
　　[SPARK-6390] - Add MatrixUDT in PySpark
　　[SPARK-6487] - Add sequential pattern mining algorithm PrefixSpan to Spark MLlib
　　[SPARK-6813] - SparkR style guide
　　[SPARK-6820] - Convert NAs to null type in SparkR DataFrames
　　[SPARK-6833] - Extend `addPackage` so that any given R file can be sourced in the worker before functions are run.
　　[SPARK-6964] - Support Cancellation in the Thrift Server
　　[SPARK-7083] - Binary processing dimensional join
　　[SPARK-7254] - Extend PIC to handle Graphs directly
　　[SPARK-7293] - Report memory used in aggregations and joins
　　[SPARK-7368] - add QR decomposition for RowMatrix
　　[SPARK-7387] - CrossValidator example code in Python
　　[SPARK-7422] - Add argmax to Vector, SparseVector
　　[SPARK-7440] - Remove physical Distinct operator in favor of Aggregate
　　[SPARK-7547] - Example code for ElasticNet
　　[SPARK-7604] - Python API for PCA and PCAModel
　　[SPARK-7605] - Python API for ElementwiseProduct
　　[SPARK-7639] - Add Python API for Statistics.kernelDensity
　　[SPARK-7690] - MulticlassClassificationEvaluator for tuning Multiclass>
　　[SPARK-7879] - KMeans API for spark.ml Pipelines
　　[SPARK-7888] - Be able to disable intercept in Linear Regression in ML package
　　[SPARK-7988] - Mechanism to control receiver scheduling
　　[SPARK-8019] - [SparkR] Create worker R processes with a command other then Rscript
　　[SPARK-8124] - Created more examples on SparkR DataFrames
　　[SPARK-8129] - Securely pass auth secrets to executors in standalone cluster mode
　　[SPARK-8169] - Add StopWordsRemover as a transformer
　　[SPARK-8302] - Support heterogeneous cluster nodes on YARN
　　[SPARK-8313] - Support Spark Packages containing R code with --packages
　　[SPARK-8344] - Add internal metrics / logging for DAGScheduler to detect long pauses / blocking
　　[SPARK-8348] - Add in operator to DataFrame Column
　　[SPARK-8364] - Add crosstab to SparkR DataFrames
　　[SPARK-8431] - Add in operator to DataFrame Column in SparkR
　　[SPARK-8446] - Add helper functions for testing physical SparkPlan operators
　　[SPARK-8456] - Python API for N-Gram Feature Transformer
　　[SPARK-8479] - Add numNonzeros and numActives to linalg.Matrices
　　[SPARK-8484] - Add TrainValidationSplit to ml.tuning
　　[SPARK-8522] - Disable feature scaling in Linear and Logistic Regression
　　[SPARK-8538] - LinearRegressionResults>
　　[SPARK-8539] - LinearRegressionSummary>
　　[SPARK-8551] - Python example code for elastic net
　　[SPARK-8564] - Add the Python API for Kinesis
　　[SPARK-8579] - Support arbitrary object in UnsafeRow
　　[SPARK-8598] - Implementation of 1-sample, two-sided, Kolmogorov Smirnov Test for RDDs
　　[SPARK-8600] - Naive Bayes API for spark.ml Pipelines
　　[SPARK-8671] - Add isotonic regression to the pipeline API
　　[SPARK-8704] - Add missing methods in StandardScaler (ML and PySpark)
　　[SPARK-8706] - Implement Pylint / Prospector checks for PySpark
　　[SPARK-8711] - Add additional methods to JavaModel wrappers in trees
　　[SPARK-8774] - Add R model formula with basic support as a transformer
　　[SPARK-8777] - Add random data generation test utilities to Spark SQL
　　[SPARK-8782] - GenerateOrdering fails for NullType (i.e. ORDER BY NULL crashes)
　　[SPARK-8798] - Allow additional uris to be fetched with mesos
　　[SPARK-8807] - Add between operator in SparkR
　　[SPARK-8847] - String concatination with column in SparkR
　　[SPARK-8867] - Show the UDF usage for user.
　　[SPARK-8874] - Add missing methods in Word2Vec ML
　　[SPARK-8882] - A New Receiver Scheduling Mechanism
　　[SPARK-8936] - Hyperparameter estimation in LDA
　　[SPARK-8967] - Implement @since as an annotation
　　[SPARK-8996] - Add Python API for Kolmogorov-Smirnov Test
　　[SPARK-9022] - UnsafeProject
　　[SPARK-9023] - UnsafeExchange
　　[SPARK-9024] - Unsafe HashJoin
　　[SPARK-9028] - Add CountVectorizer as an estimator to generate CountVectorizerModel
　　[SPARK-9112] - Implement LogisticRegressionSummary similar to LinearRegressionSummary
　　[SPARK-9115] - date/time function: dayInYear
　　[SPARK-9143] - Add planner rule for automatically inserting Unsafe Safe row format converters
　　[SPARK-9178] - UTF8String empty string method
　　[SPARK-9201] - Integrate MLlib with SparkR using RFormula
　　[SPARK-9230] - SparkR RFormula should support StringType features
　　[SPARK-9231] - DistributedLDAModel method for top topics per document
　　[SPARK-9245] - DistributedLDAModel predict top topic per doc-term instance
　　[SPARK-9246] - DistributedLDAModel predict top docs per topic
　　[SPARK-9263] - Add Spark Submit flag to exclude dependencies when using --packages
　　[SPARK-9381] - Migrate JSON data source to the new partitioning data source
　　[SPARK-9391] - Support minus, dot, and intercept operators in SparkR RFormula
　　[SPARK-9440] - LocalLDAModel should save docConcentration, topicConcentration, and gammaShape
　　[SPARK-9464] - Add property-based tests for UTF8String
　　[SPARK-9471] - Multilayer perceptron>
　　[SPARK-9544] - RFormula in Python
　　[SPARK-9657] - PrefixSpan getMaxPatternLength should return an Int
　　[SPARK-10106] - Add `ifelse` Column function to SparkR

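　　Apache Spark is an open-source cluster computing environment similar to Hadoop, but the two differ in ways that make Spark perform better on certain workloads. Specifically, Spark provides in-memory distributed datasets, which support interactive queries and also speed up iterative workloads.
　　Spark is implemented in Scala and uses Scala as its application framework. Unlike Hadoop, Spark is tightly integrated with Scala, so Scala code can manipulate distributed datasets as easily as it manipulates local collection objects.
　　Although Spark was created to support iterative jobs on distributed datasets, it is in fact complementary to Hadoop and can run in parallel over the Hadoop file system. This is supported through a third-party cluster framework named Mesos. Spark was developed by the AMP Lab (Algorithms, Machines, and People Lab) at the University of California, Berkeley, and can be used to build large-scale, low-latency data analytics applications.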
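
　　The idea of working with a distributed dataset as if it were a local collection can be illustrated with a small toy model. The sketch below is plain Python, not Spark: `ToyDataset`, `parallelize`, and the round-robin partitioning are hypothetical names and choices made up for illustration. It only shows how `map`, `filter`, and `reduce` can run partition by partition and still agree with the equivalent local-collection code, assuming the reduce operator is associative and commutative.

```python
from functools import reduce


class ToyDataset:
    """Hypothetical stand-in for a distributed dataset (NOT the Spark API).

    The data is held as a list of partitions; every operation works on
    each partition independently, the way separate workers would.
    """

    def __init__(self, partitions):
        self.partitions = partitions

    @classmethod
    def parallelize(cls, data, num_partitions=2):
        items = list(data)
        # Round-robin split: element i lands in partition i % num_partitions.
        return cls([items[i::num_partitions] for i in range(num_partitions)])

    def map(self, f):
        return ToyDataset([[f(x) for x in part] for part in self.partitions])

    def filter(self, pred):
        return ToyDataset([[x for x in part if pred(x)] for part in self.partitions])

    def reduce(self, op):
        # Reduce every non-empty partition first, then combine the partial
        # results. This matches a sequential reduce only when `op` is
        # associative and commutative (addition is).
        partials = [reduce(op, part) for part in self.partitions if part]
        return reduce(op, partials)


# Local collection: sum of the squares of the even numbers in 1..10.
local_total = sum(x * x for x in range(1, 11) if x % 2 == 0)

# The same pipeline over the partitioned toy dataset.
dist_total = (
    ToyDataset.parallelize(range(1, 11), num_partitions=3)
    .filter(lambda x: x % 2 == 0)
    .map(lambda x: x * x)
    .reduce(lambda a, b: a + b)
)

print(local_total, dist_total)
```

　　Both pipelines use the same vocabulary of operations; the only difference is that the toy dataset applies them per partition and then merges the partial results.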
