[Basics] Large-Scale Distributed Machine Learning on Hadoop Clusters


 By Cyprien Noel, Jun Shi and Andy Feng (@afeng76), Yahoo Big ML Team

  Introduction

  In the last 10 years, Yahoo has progressively invested in building and scaling Apache Hadoop clusters, with a current footprint of more than 40,000 servers and 600 petabytes of storage spread across 19 clusters. As discussed at the 2015 Hadoop Summit, we have developed scalable machine learning algorithms on these clusters for classification, ranking, and word embedding, based on a home-grown parameter server. Hadoop clusters have now become the preferred platform for large-scale machine learning at Yahoo.

  Deep learning (DL) is a critical capability demanded by many Yahoo products. At the 2015 RE.WORK Deep Learning Summit, the Yahoo Flickr team (Simon Osindero and Pierre Garrigues) explained how deep learning is being applied to scene detection, object recognition, and computational aesthetics. Deep learning empowers Flickr to automatically tag all user photos, enabling Flickr end users to organize and find photos easily.

  To enable more Yahoo products to benefit from the promise of deep learning, we have recently introduced this capability natively into our Hadoop clusters. Deep learning on Hadoop provides the following major benefits:

  Deep learning can be directly conducted on Hadoop clusters, where Yahoo stores most of its data. We avoid unnecessary data movement between Hadoop clusters and separate deep learning clusters.

  Deep learning can be defined as first-class steps in Apache Oozie workflows with Hadoop for data processing and Spark pipelines for machine learning.

  YARN works well for deep learning. Multiple deep learning experiments can be conducted concurrently on a single cluster, which makes deep learning extremely cost-effective compared with conventional approaches. In the past, teams scheduled GPU resources manually via a shared "notepad", which was painful and worked only for a small number of users.

  Deep learning on Hadoop is a novel approach. Existing approaches in the industry require dedicated clusters, whereas deep learning on Hadoop delivers the same level of performance as dedicated clusters while simultaneously providing all the benefits listed above.

  Enhancing Hadoop Clusters

  To enable deep learning, we added GPU nodes into our Hadoop clusters (illustrated below). Each of these nodes has four Nvidia Tesla K80 cards, each card with two GK210 GPUs. These nodes have roughly 10x the processing power of the traditional commodity CPU nodes we generally use in our Hadoop clusters.

  In a Hadoop cluster, GPU nodes have two separate network interfaces, Ethernet and Infiniband. While Ethernet acts as the primary interface for external communication, Infiniband provides 10X faster connectivity among the GPU nodes in the cluster and supports direct access to GPU memories over RDMA.

  By leveraging YARN's recently introduced node label capabilities (YARN-796), we enable jobs to state whether containers should be launched on CPU or GPU nodes. Containers on GPU nodes use Infiniband to exchange data at very high speed.
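  To make the node-label idea concrete, the sketch below shows how a Spark-on-YARN job could ask for its containers to land on GPU-labeled nodes. The label name "gpu", the placeholder application, and the use of the spark.yarn.am.nodeLabelExpression / spark.yarn.executor.nodeLabelExpression properties are illustrative assumptions about cluster setup, not details taken from this post.

  # Hypothetical sketch: request that the application master and executors run on
  # YARN nodes carrying the "gpu" label. The label name and the jar/class below
  # are placeholders; property support depends on the Spark and YARN versions in use.
  spark-submit --master yarn --deploy-mode cluster \
      --conf spark.yarn.am.nodeLabelExpression=gpu \
      --conf spark.yarn.executor.nodeLabelExpression=gpu \
      --num-executors 4 \
      --class com.example.MyGpuJob my-gpu-job.jar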

  Distributed Deep Learning: Caffe-on-Spark

  To enable deep learning on these enhanced Hadoop clusters, we developed a comprehensive distributed solution based upon the open source software libraries Apache Spark and Caffe. One can now submit deep learning jobs to a cluster of GPU nodes via a command as illustrated below.

  spark-submit --master yarn --deploy-mode cluster \
      --files solver.prototxt,net.prototxt \
      --num-executors <# of EXECUTORS> \
      --archives caffe_on_grid.tgz \
      --conf spark.executorEnv.LD_LIBRARY_PATH="./caffe_on_grid.tgz/lib64" \
      --class com.yahoo.ml.CaffeOnSpark caffe-on-spark-1.0-jar-with-dependencies.jar \
      -devices <# of GPUs PER EXECUTOR> \
      -conf solver.prototxt \
      -input hdfs:// \
      -model hdfs://

  In the command above, users specify the number of Spark executor processes to be launched (--num-executors), the number of GPUs to be allocated to each executor (-devices), the location of training data on HDFS (-input), and the HDFS path where the model should be saved (-model). Users use standard Caffe configuration files to specify their Caffe solver and deep network topology (e.g., solver.prototxt, net.prototxt).
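  For readers new to Caffe, a minimal solver.prototxt might look like the sketch below. The values are illustrative assumptions rather than settings from this post, and the network topology itself would be defined separately in the referenced net.prototxt.

  # Minimal illustrative Caffe solver configuration (all values are assumptions).
  net: "net.prototxt"      # network topology, defined in a separate file
  base_lr: 0.01            # initial learning rate
  lr_policy: "step"        # decay the learning rate in steps
  gamma: 0.1               # multiply the learning rate by this factor at each step
  stepsize: 100000         # apply the decay every 100,000 iterations
  momentum: 0.9
  weight_decay: 0.0005
  max_iter: 450000         # total number of training iterations
  solver_mode: GPU         # run the solver on GPU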

  As illustrated above, Spark on YARN launches a number of executors. Each executor is given a partition of the HDFS-based training data and launches multiple Caffe-based training threads. Each training thread is executed by a particular GPU. After back-propagation processing of a batch of training examples, these training threads exchange the gradients of the model parameters. The gradient exchange is carried out in an MPI Allreduce fashion across all GPUs on multiple servers. We have enhanced Caffe to use multiple GPUs on a server and to benefit from RDMA when synchronizing DL models.
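  The post does not spell out the exact update rule, but the description matches the usual synchronous data-parallel formulation, sketched here as an assumption rather than a statement of Caffe-on-Spark internals: each of the K training threads computes a local gradient on its slice B_k of the mini-batch, the allreduce makes the averaged gradient available to every thread, and all threads then apply the identical SGD step.

  g_k = \frac{1}{|B_k|} \sum_{(x,y) \in B_k} \nabla_w \, \ell(f_w(x), y), \qquad
  \bar{g} = \frac{1}{K} \sum_{k=1}^{K} g_k, \qquad
  w \leftarrow w - \eta \, \bar{g}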

  Caffe-on-Spark enables us to use the best of Spark and Caffe for large-scale deep learning. DL tasks are launched as easily as any other Spark application, and multiple GPUs in a cluster of machines are used to train models from large HDFS-based datasets.

  Benchmarks

  Caffe-on-Spark enables (a) multiple GPUs, and (b) multiple machines to be used for deep learning. To understand the benefits of our approach, we performed benchmarks on the ImageNet 2012 dataset.

  First, we looked into the progress of deep learning for AlexNet with 1, 2, 4, and 8 GPUs within a single Spark executor. As illustrated in the diagram below, training time decreases as we add more GPUs: with 4 GPUs, we reached 50% accuracy in about 35% (15/43) of the time required by a single GPU. All of these runs used an identical total batch size of 256. The setup with 8 GPUs didn't show significant improvement over 4 GPUs, as the per-GPU batch size became too small to use the hardware efficiently.

  Next, we conducted a distributed benchmark with GoogLeNet, which is much deeper and uses more convolutions than AlexNet, and thus requires more computation power. In every run, each GPU handles batches of size 32, for an effective batch size of 32n when n GPUs are used. Our distributed algorithm is designed to produce models and end-result precision equivalent to running on a single GPU. 80% top-5 accuracy (20% error) was reached in 10 hours of training with 4 servers (4x8 GPUs). Notice that training with 1 GPU reached only 60% top-5 accuracy (40% error) after 40 hours.

  GoogLeNet scales further with the number of GPUs. For 60% top-5 accuracy (40% error), 8 GPUs achieved a 680% speedup over 1 GPU. The table below also shows the speedups for 70% and 80% top-5 accuracy. The speedup could be larger if the batch size were tuned carefully (instead of using a total batch size of 32n).
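  To be explicit about what "speedup" means here (our reading, since the table itself is not reproduced in this post), it is the ratio of wall-clock training times to a fixed target accuracy:

  \text{speedup}(n) = \frac{T_{1\,\text{GPU}}(\text{target accuracy})}{T_{n\,\text{GPU}}(\text{target accuracy})}

  Under that reading, a 680% speedup at 60% top-5 accuracy means the 8-GPU run reached that accuracy in roughly 1/6.8 of the single-GPU training time.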

  Open Source

  Continuing Yahoo's commitment to open source, we have released some of our code into github.com/BVLC/caffe:

  #2114 … Allow Caffe to use multiple GPUs within a computer

  #1148 … RDMA transfers across computers

  #2386 … Improved Caffe’s data pipeline and prefetching

  #2395 … Added timing information

  #2402 … Make Caffe’s IO dependencies optional

  #2397 … Refactored Caffe solvers code

  In a follow-up post in the coming weeks, we will share the detailed design and implementation of Caffe-on-Spark. If there is enough interest from the community, we may open source our implementation. Please let us know what you think at bigdata@yahoo-inc.com.

  Conclusion


  The post describes early steps in bringing the Apache Hadoop ecosystem and deep learning together on the same heterogeneous (GPU+CPU) cluster. We are encouraged by the early benchmark results and plan to invest further in Hadoop, Spark, and Caffe to make deep learning more effective on our clusters. We look forward to working closely with the open source communities in related areas.


