Apache Spark 2.2.0 Released: All 2.x Users Are Advised to Upgrade

Posted on 2017-7-14 13:38:54

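Apache Spark 2.2.0 is the third release on the 2.x line. This release removes the experimental tag from Structured Streaming and resolves more than 1,100 tickets, with a focus on usability, stability, and performance optimization.
All 2.x users are advised to upgrade to 2.2.0. The release is available from the download page, and more details can be found in JIRA. The changes below are grouped by major module: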
  • Core & Spark SQL
  • Structured Streaming
  • MLlib
  • SparkR
  • GraphX
  • Deprecations
  • Changes of Behavior
  • Known Issues
Core & Spark SQL
API Updates
  • SPARK-19107: Support creating hive table with DataFrameWriter and Catalog
  • SPARK-13721: Add support for LATERAL VIEW OUTER explode()
  • SPARK-18885: Unify CREATE TABLE syntax for data source and hive serde tables
  • SPARK-16475: Added Broadcast Hints BROADCAST, BROADCASTJOIN, and MAPJOIN, for SQL Queries
  • SPARK-18350: Support session local timezone
  • SPARK-19261: Support ALTER TABLE table_name ADD COLUMNS
  • SPARK-20420: Add events to the external catalog
  • SPARK-18127: Add hooks and extension points to Spark
  • SPARK-20576: Support generic hint function in Dataset/DataFrame
  • SPARK-17203: Data source options should always be case insensitive
  • SPARK-19139: AES-based authentication mechanism for Spark
Performance and Stability
  • Cost-Based Optimizer
    • SPARK-17075 SPARK-17076 SPARK-19020 SPARK-17077 SPARK-19350: Cardinality estimation for filter, join, aggregate, project and limit/sample operators
    • SPARK-17080: Cost-based join re-ordering
    • SPARK-17626: TPC-DS performance improvements using star-schema heuristics
  • SPARK-17949: Introduce a JVM object based aggregate operator
  • SPARK-18186: Partial aggregation support of HiveUDAFFunction
  • SPARK-18362 SPARK-19918: File listing/IO improvements for CSV and JSON
  • SPARK-18775: Limit the max number of records written per file
  • SPARK-18761: Uncancellable / unkillable tasks shouldn’t starve jobs of resources
  • SPARK-15352: Topology aware block replication
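The idea behind cost-based join re-ordering (SPARK-17080) can be sketched with a toy model: estimate the size of each intermediate result from table row counts and join selectivities, then pick the order with the smallest total. This is only an illustration in plain Python, not Spark's optimizer. The table names, row counts, selectivities, and the helpers `plan_cost` and `best_order` are all hypothetical.

```python
from itertools import permutations

# Hypothetical post-filter row counts (illustrative numbers only).
ROWS = {"sales": 1_000_000, "customer": 1_000, "date_dim": 30}

# Hypothetical selectivity of each join predicate, keyed by the table pair.
SELECTIVITY = {
    frozenset(("sales", "customer")): 1 / 10_000,
    frozenset(("sales", "date_dim")): 1 / 365,
}


def plan_cost(order):
    """Sum of estimated intermediate result sizes for a left-deep join order.

    Orders that would need a cross join (no predicate linking the next
    table to anything already joined) get an infinite cost.
    """
    joined = [order[0]]
    card = float(ROWS[order[0]])
    cost = 0.0
    for table in order[1:]:
        sels = [
            SELECTIVITY[frozenset((other, table))]
            for other in joined
            if frozenset((other, table)) in SELECTIVITY
        ]
        if not sels:
            return float("inf")
        card *= ROWS[table]
        for sel in sels:
            card *= sel
        cost += card
        joined.append(table)
    return cost


def best_order(tables):
    """Exhaustively pick the cheapest order; fine for a handful of tables."""
    return min(permutations(tables), key=plan_cost)


if __name__ == "__main__":
    for order in permutations(ROWS):
        print(order, plan_cost(order))
    print("cheapest:", best_order(list(ROWS)))
```

In this toy model the more selective `date_dim` join is applied before the `customer` join, because that keeps the first intermediate result smaller. Exhaustive enumeration is used here only for brevity; it does not scale beyond a few tables.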
Other Notable Changes
  • SPARK-18352: Support for parsing multi-line JSON files
  • SPARK-19610: Support for parsing multi-line CSV files
  • SPARK-21079: Analyze Table Command on partitioned tables
  • SPARK-18703: Drop Staging Directories and Data Files after completion of Insertion/CTAS against Hive-serde Tables
  • SPARK-18209: More robust view canonicalization without full SQL expansion
  • SPARK-13446 SPARK-18112: Support reading data from Hive metastore 2.0/2.1
  • SPARK-18191: Port RDD API to use commit protocol
  • SPARK-8425: Add blacklist mechanism for task scheduling
  • SPARK-19464: Remove support for hadoop 2.5 and earlier
  • SPARK-19493: Remove Java 7 support
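To see why multi-line JSON parsing (SPARK-18352 above) needs dedicated support, compare a one-record-per-line file with a pretty-printed document. The sketch below uses only Python's standard `json` module and does not touch Spark's reader. The sample data and the helpers `parse_per_line` and `parse_whole` are hypothetical.

```python
import json

# One record per line (JSON Lines): each line is a complete JSON value.
json_lines = '{"id": 1, "name": "a"}\n{"id": 2, "name": "b"}\n'

# A single pretty-printed document: individual lines are not valid JSON.
multi_line = """[
  {"id": 1,
   "name": "a"},
  {"id": 2,
   "name": "b"}
]"""


def parse_per_line(text):
    """Parse each non-empty line as its own JSON record."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


def parse_whole(text):
    """Parse the whole text as one JSON document and return its records."""
    doc = json.loads(text)
    return doc if isinstance(doc, list) else [doc]


if __name__ == "__main__":
    print(parse_per_line(json_lines))
    print(parse_whole(multi_line))
    try:
        parse_per_line(multi_line)
    except json.JSONDecodeError as err:
        print("per-line parsing fails on a multi-line document:", err)
```

Per-line parsing lets a file be split at any newline and processed in parallel, whereas a multi-line document has to be read as a unit. That trade-off is the reason the two modes are handled separately.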
Structured Streaming
General Availability
  • SPARK-20844: The Structured Streaming APIs are now GA and are no longer labeled experimental
Kafka Improvements
  • SPARK-19719: Support for reading and writing data in streaming or batch to/from Apache Kafka
  • SPARK-19968: Cached producer for lower-latency Kafka-to-Kafka streams
API Updates
  • SPARK-19067: Support for complex stateful processing and timeouts using [flat]MapGroupsWithState
  • SPARK-19876: Support for one time triggers
Other Notable Changes
  • SPARK-20979: Rate source for testing and benchmarks
MLlib
New Algorithms in the DataFrame-Based API
  • SPARK-14709: LinearSVC (Linear SVM Classifier) (Scala/Java/Python/R)
  • SPARK-19635: ChiSquare test in DataFrame-based API (Scala/Java/Python)
  • SPARK-19636: Correlation in DataFrame-based API (Scala/Java/Python)
  • SPARK-13568: Imputer feature transformer for imputing missing values (Scala/Java/Python)
  • SPARK-18929: Add Tweedie distribution for GLMs (Scala/Java/Python/R)
  • SPARK-14503: FPGrowth frequent pattern mining and AssociationRules (Scala/Java/Python/R)
Existing Algorithms Added to Python & R APIs
  • SPARK-18239: Gradient Boosted Trees (R)
  • SPARK-18821: Bisecting K-Means (R)
  • SPARK-18080: Locality Sensitive Hashing (LSH) (Python)
  • SPARK-6227: Distributed PCA and SVD for PySpark (in RDD-based API)
Major Bug Fixes
  • SPARK-19110: DistributedLDAModel.logPrior correctness fix
  • SPARK-17975: EMLDAOptimizer fails with ClassCastException (caused by GraphX checkpointing bug)
  • SPARK-18715: Fix wrong AIC calculation in Binomial GLM
  • SPARK-16473: BisectingKMeans failing during training with “java.util.NoSuchElementException: key not found” for certain inputs
  • SPARK-19348: pyspark.ml.Pipeline gets corrupted under multi-threaded use
  • SPARK-20047: Box-constrained Logistic Regression
SparkR
The main focus of SparkR in the 2.2.0 release was adding extensive support for existing Spark SQL features:
Major Features
  • SPARK-19654: Structured Streaming API for R
  • SPARK-20159: Support complete Catalog API in R
  • SPARK-19795: column functions to_json, from_json
  • SPARK-19399: Coalesce on DataFrame and coalesce on column
  • SPARK-20020: Support DataFrame checkpointing
  • SPARK-18285: Multi-column approxQuantile in R
Programming guide: SparkR (R on Spark)
GraphX
Bug Fixes
  • SPARK-18847: PageRank gives incorrect results for graphs with sinks
  • SPARK-14804: Graph vertexRDD/EdgeRDD checkpoint results in ClassCastException
Optimizations
  • SPARK-18845: PageRank initial value improvement for faster convergence
  • SPARK-5484: Pregel should checkpoint periodically to avoid StackOverflowError
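The sink problem named in SPARK-18847 can be reproduced with a small textbook PageRank: a vertex with no outgoing edges receives rank but passes none on, so total rank leaks away unless that mass is redistributed. The sketch below is plain Python using the probability formulation (ranks sum to 1). It is not the GraphX implementation and does not claim to match how GraphX fixed the issue. The function name `pagerank` and the `redistribute_sinks` parameter are hypothetical.

```python
def pagerank(edges, num_nodes, damping=0.85, iterations=50,
             redistribute_sinks=True):
    """Power-iteration PageRank over nodes 0..num_nodes-1.

    When redistribute_sinks is True, the rank held by sink nodes (no
    outgoing edges) is spread evenly over all nodes on every iteration.
    """
    out_links = {n: [] for n in range(num_nodes)}
    for src, dst in edges:
        out_links[src].append(dst)

    ranks = [1.0 / num_nodes] * num_nodes
    for _ in range(iterations):
        sink_mass = sum(ranks[n] for n in range(num_nodes) if not out_links[n])
        base = (1.0 - damping) / num_nodes
        if redistribute_sinks:
            base += damping * sink_mass / num_nodes
        new_ranks = [base] * num_nodes
        for src, targets in out_links.items():
            for dst in targets:
                new_ranks[dst] += damping * ranks[src] / len(targets)
        ranks = new_ranks
    return ranks


if __name__ == "__main__":
    chain = [(0, 1), (1, 2)]  # node 2 is a sink
    print(sum(pagerank(chain, 3)))
    print(sum(pagerank(chain, 3, redistribute_sinks=False)))
```

With redistribution the ranks keep summing to 1. Without it the total drops below 1 on a graph that contains a sink, which is the kind of incorrect result the ticket title describes.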
Deprecations
MLlib
  • SPARK-18613: spark.ml LDA classes should not expose spark.mllib in APIs. In spark.ml.LDAModel, deprecated oldLocalModel and getModel.
SparkR
  • SPARK-20195: deprecate createExternalTable
Changes of Behavior
MLlib
  • SPARK-19787: DeveloperApi ALS.train() uses default regParam value 0.1 instead of 1.0, in order to match regular ALS API’s default regParam setting.
SparkR
  • SPARK-19291: This added log-likelihood for SparkR Gaussian Mixture Models, but doing so introduced a SparkR model persistence incompatibility: Gaussian Mixture Models saved from SparkR 2.1 may not be loaded into SparkR 2.2. We plan to put in place backwards compatibility guarantees for SparkR in the future.
Known Issues

