设为首页 收藏本站
查看: 399|回复: 0

[经验分享] [转]hadoop at ebay

[复制链接]

尚未签到

发表于 2016-12-7 07:16:35 | 显示全部楼层 |阅读模式
  http://www.ebaytechblog.com/2010/10/29/hadoop-the-power-of-the-elephant/
  Hadoop – The Power of the Elephant
by Anil Madan on 10/29/2010
  in Machine Learning
  In a previous post, Junling discussed data mining and our need to process petabytes of data to gain insights from information. We use several tools and systems to help us with this task; the one I’ll discuss here is Apache Hadoop.
  Created by Doug Cutting in 2006 who named it after his son’s stuffed yellow elephant, and based on Google’s MapReduce paper in 2004, Hadoop is an open source framework for fault tolerant, scalable, distributed computing on commodity hardware.
  MapReduce is a flexible programming model for processing large data sets:
Map takes key/value pairs as input and generates an intermediate output of another type of key/value pairs, while Reduce takes the keys produced in the Map step along with a list of values associated with the same key to produce the final output of key/value pairs.
  Map (key1, value1) -> list (key2, value2)
Reduce (key2, list (value2)) -> list (key3, value3)
  Ecosystem
  
Athena, our first large cluster was put in use earlier this year.
Let’s look at the stack from bottom to top:
  Core – The Hadoop runtime, some common utilities, and the Hadoop Distributed File System (HDFS). The File System is optimized for reading and writing big blocks of data (128 MB to 256 MB).
MapReduce – provides the APIs and components to develop and execute jobs.
Data Access – the most prominent data access frameworks today are HBase, Pig and Hive.
HBase – Column oriented multidimensional spatial database inspired by Google’s BigTable. HBase provides sorted data access by maintaining partitions or regions of data. The underlying storage is HDFS.
Pig (Latin) – A procedural language which provides capabilities to load, filter, transform, extract, aggregate, join and group data. Developers use Pig for building data pipelines and factories.
Hive – A declarative language with SQL syntax used to build data warehouse. The SQL interface makes Hive an attractive choice for developers to quickly validate data, for product managers and for analysts.
Tools & Libraries – UC4 is an enterprise scheduler used by eBay to automate data loading from multiple sources.
Libraries: Statistical (R), machine learning (Mahout), and mathematical libraries (Hama), and eBay’s homegrown library for parsing web logs (Mobius).
Monitoring & Alerting – Ganglia is a distributed monitoring system for clusters. Nagios is used for alerting on key events like servers being unreachable or disks being full.
Infrastructure
Our enterprise servers run 64-bit RedHat Linux.
  NameNode is the master server responsible for managing the HDFS.
JobTracker is responsible for coordination of the Jobs and Tasks associated to the Jobs.
HBaseMaster stores the root storage for HBase and facilitates the coordination with blocks or regions of storage.
Zookeeper is a distributed lock coordinator providing consistency for HBase.
The storage and compute nodes are 1U units running Cent OS with 2 quad core machines and storage space of 12 to 24TB. We pack our racks with 38 to 42 of these units to have a highly dense grid.
  On the networking side, we use top of rack switches with a node bandwidth of 1Gbps. The rack switches uplink to the core switches with a line rate of 40Gpbs to support the high bandwidth necessary for data to be shuffled around.
  Scheduling
Our cluster is used by many teams within eBay, for production as well as one-time jobs. We use Hadoop’s Fair Scheduler to manage allocations, define job pools for teams, assign weights, limit concurrent jobs per user and team, set preemption timeouts and delayed scheduling.
  Data Sourcing
  
On a daily basis we ingest about 8 to 10 TB of new data.
  Road Ahead
Here are some of the challenges we are working on as we build out our infrastructure:
  Scalability
In its current incarnation, the master server NameNode has scalability issues. As the file system of the cluster grows, so does the memory footprint as it keeps the entire metadata in memory. For 1 PB of storage approximately 1 GB of memory is needed. Possible solutions are hierarchical namespace partitioning or leveraging Zookeeper in conjunction with HBase for metadata management.
Availability
NameNode’s availability is critical for production workloads. The open source community is working on several cold, warm, and hot standby options like Checkpoint and Backup nodes; Avatar nodes switching avatar from the Secondary NameNode; journal metadata replication techniques. We are evaluating these to build our production clusters.
Data Discovery
Support data stewardship, discovery, and schema management on top of a system which inherently does not support structure. A new project is proposing to combine Hive’s metadata store and Owl into a new system, called Howl. Our effort is to tie this into our analytics platform so that our users can easily discover data across the different data systems.
Data Movement
We are working on publish/subscription data movement tools to support data copy and reconciliation across our different subsystems like the Data Warehouse and HDFS.
Policies
Enable good Retention, Archival, and Backup policies with storage capacity management through quotas (the current Hadoop quotas need some work). We are working on defining these across our different clusters based on the workload and the characteristics of the clusters.
Metrics, Metrics, Metrics
We are building robust tools which generate metrics for data sourcing, consumption, budgeting, and utilization. The existing metrics exposed by some of the Hadoop enterprise servers are either not enough, or transient which make patterns of cluster usage hard to see.
eBay is changing how it collects, transforms, and uses data to generate business intelligence. We’re hiring, and we’d love to have you come help.
  Anil Madan
Director of Engineering, Analytics Platform Development

运维网声明 1、欢迎大家加入本站运维交流群:群②:261659950 群⑤:202807635 群⑦870801961 群⑧679858003
2、本站所有主题由该帖子作者发表,该帖子作者与运维网享有帖子相关版权
3、所有作品的著作权均归原作者享有,请您和我们一样尊重他人的著作权等合法权益。如果您对作品感到满意,请购买正版
4、禁止制作、复制、发布和传播具有反动、淫秽、色情、暴力、凶杀等内容的信息,一经发现立即删除。若您因此触犯法律,一切后果自负,我们对此不承担任何责任
5、所有资源均系网友上传或者通过网络收集,我们仅提供一个展示、介绍、观摩学习的平台,我们不对其内容的准确性、可靠性、正当性、安全性、合法性等负责,亦不承担任何法律责任
6、所有作品仅供您个人学习、研究或欣赏,不得用于商业或者其他用途,否则,一切后果均由您自己承担,我们对此不承担任何法律责任
7、如涉及侵犯版权等问题,请您及时通知我们,我们将立即采取措施予以解决
8、联系人Email:admin@iyunv.com 网址:www.yunweiku.com

所有资源均系网友上传或者通过网络收集,我们仅提供一个展示、介绍、观摩学习的平台,我们不对其承担任何法律责任,如涉及侵犯版权等问题,请您及时通知我们,我们将立即处理,联系人Email:kefu@iyunv.com,QQ:1061981298 本贴地址:https://www.iyunv.com/thread-310605-1-1.html 上篇帖子: hadoop自定义Counter 下篇帖子: hadoop知识点整理
您需要登录后才可以回帖 登录 | 立即注册

本版积分规则

扫码加入运维网微信交流群X

扫码加入运维网微信交流群

扫描二维码加入运维网微信交流群,最新一手资源尽在官方微信交流群!快快加入我们吧...

扫描微信二维码查看详情

客服E-mail:kefu@iyunv.com 客服QQ:1061981298


QQ群⑦:运维网交流群⑦ QQ群⑧:运维网交流群⑧ k8s群:运维网kubernetes交流群


提醒:禁止发布任何违反国家法律、法规的言论与图片等内容;本站内容均来自个人观点与网络等信息,非本站认同之观点.


本站大部分资源是网友从网上搜集分享而来,其版权均归原作者及其网站所有,我们尊重他人的合法权益,如有内容侵犯您的合法权益,请及时与我们联系进行核实删除!



合作伙伴: 青云cloud

快速回复 返回顶部 返回列表