Hadoop Reference Notes

Posted by 判官007 on 2018-10-29 11:16:04

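　　1 HDFS
　　1.1 Concept
　　The Hadoop Distributed File System (HDFS) is a distributed file system designed to run on commodity hardware.
　　1.2 Features
　　- Highly fault-tolerant
　　- Low hardware requirements
　　- Provides high-throughput data access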
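　　1.3 File system command line
　　1.3.1 Getting help
hadoop fs -help
　　1.3.2 The ls command
hadoop fs -ls /
hadoop fs -ls -R /user
　　1.3.3 The getconf command
hdfs getconf -help
hdfs getconf -namenodes
　　1.3.4 Version information
hdfs version
　　Note: these commands work much like their Linux counterparts; for details, see the official links at the end of this post.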
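　　2 MapReduce
　　2.1 Introduction to MapReduce
　　MapReduce is a programming model for parallel computation over large data sets (larger than 1 TB).
　　2.2 How it works
　　Suppose a plate holds black beans, yellow beans, green beans, and red beans, and you want to pick out the red ones.
　　The MapReduce approach is:
　　step1 Find a team to do the work (the equivalent of a cluster made up of servers)
　　step2 Divide the beans evenly among the team members (the equivalent of distributing data to the servers in the cluster)
　　step3 Have each team member pick the red beans out of their share (the equivalent of the cluster's machines processing data in parallel)
　　step4 Gather the beans that the team members picked out (the equivalent of the cluster aggregating and outputting the result)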
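　　The four steps above can be sketched in a few lines of Python. This is only a toy model on a single machine, not Hadoop code: the thread pool stands in for the cluster, and the function names (split_evenly, map_pick_red, reduce_counts, pick_red_beans) and the sample plate are made up for illustration.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def split_evenly(items, workers):
    """Step 2: deal the items out into `workers` roughly equal shares."""
    return [items[i::workers] for i in range(workers)]


def map_pick_red(share):
    """Step 3: each worker counts the red beans in its own share."""
    return Counter(bean for bean in share if bean == "red")


def reduce_counts(partials):
    """Step 4: merge the per-worker counts into a single result."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total


def pick_red_beans(plate, workers=4):
    shares = split_evenly(plate, workers)
    # Step 1: the thread pool plays the role of the team (the cluster).
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(map_pick_red, shares))
    return reduce_counts(partials)


# Hypothetical plate of mixed beans.
plate = ["black", "red", "yellow", "green", "red", "black", "red", "green"]
result = pick_red_beans(plate)
print(result["red"])  # 3
```

　　In a real cluster the shares would live on different machines and the merge in step 4 would run as reduce tasks, but the split / work in parallel / combine shape is the same.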
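　　3 Hive
　　3.1 Introduction to Hive
　　3.1.1 Concept
　　Hive is a data warehouse platform built on Hadoop.
　　3.1.2 What Hive is for
　　Hive makes ETL work convenient.
　　Hive defines a SQL-like query language, HQL.
　　Hive translates the queries a user writes in HQL into corresponding MapReduce programs that run on Hadoop.
　　3.1.3 History of the Hive project
　　Hive is a data warehouse framework that Facebook open-sourced in August 2008. Its goals are similar to those of Pig, but it offers some mechanisms that Pig did not yet support at the time,
　　such as a richer type system, a more SQL-like query language, and persistent Table/Partition metadata.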
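　　4 Impala
　　4.1 Introduction to Impala
　　Impala is a real-time, interactive SQL query tool for big data, developed by Cloudera and inspired by Google's Dremel. Instead of the slow Hive + MapReduce batch processing, Impala uses a distributed query engine similar to those in commercial parallel relational databases (made up of three parts: Query Planner, Query Coordinator, and Query Exec Engine). It can query data directly from HDFS or HBase using SELECT, JOIN, and aggregate functions, which greatly reduces latency.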
　　4.2 The Impala shell
　　4.2.1 Starting the shell
impala-shell
　　4.2.2 Checking the version
select version();
　　4.3 Database operations
　　4.3.1 Listing databases
show databases;
　　4.3.2 Creating databases
create database testdb;
create database testdb2;
　　Database storage path:
hdfs dfs -ls /user/hive/warehouse/
　　4.3.3 Selecting a database
use testdb;
　　4.3.4 Showing the current database
select current_database();
　　4.3.5 Dropping a database
drop database testdb;
　　4.4 Table operations
　　4.4.1 Creating tables
create table t1 (x int);
create table t3 (id int, word string);
create table city (id int, name string, countrycode string, district string, population int);
　　4.4.2 Listing the tables in a database
show tables;
show tables in testdb;
show tables in testdb like 't*';
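　　To see what the 't*' pattern in the last command selects, here is a small Python sketch that applies the same kind of "*" wildcard to the three table names created above. It is only an approximation of the matching, done with Python's fnmatch module rather than Impala itself, and show_tables_like is a made-up helper name.

```python
from fnmatch import fnmatchcase


def show_tables_like(tables, pattern):
    """Return the table names matching a '*'-style wildcard pattern."""
    return sorted(name for name in tables if fnmatchcase(name, pattern))


# The tables created in 4.4.1.
tables = ["t1", "t3", "city"]
matches = show_tables_like(tables, "t*")
print(matches)  # ['t1', 't3']
```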
　　4.4.3 Describing a table's structure
describe city;
　　4.4.4 Renaming a table
alter table t3 rename to t2;
　　4.4.5 Inserting data
insert into t1 values (1),(3),(2),(4);
insert into t2 values (1, 'one'), (3, 'three'), (5, 'five');
　　4.4.6 Querying data
select min(x), max(x), sum(x), avg(x) from t1;
select word from t1 join t2 on (t1.x = t2.id);
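　　If you have no Impala cluster at hand, the two queries above can be tried against the same sample rows using Python's built-in sqlite3 module. Note the assumptions: SQLite is only a stand-in here, not Impala, so the string type is written as text, and an order by is added so that the join result comes back in a fixed order.

```python
import sqlite3

# In-memory SQLite database used as a stand-in for Impala.
conn = sqlite3.connect(":memory:")
conn.execute("create table t1 (x int)")
conn.execute("create table t2 (id int, word text)")
conn.execute("insert into t1 values (1),(3),(2),(4)")
conn.execute("insert into t2 values (1, 'one'), (3, 'three'), (5, 'five')")

# Aggregates over t1: min, max, sum, avg.
stats = conn.execute(
    "select min(x), max(x), sum(x), avg(x) from t1"
).fetchone()

# Inner join: only x values that also appear as t2.id survive.
words = [
    row[0]
    for row in conn.execute(
        "select word from t1 join t2 on (t1.x = t2.id) order by t2.id"
    )
]

print(stats)  # (1, 4, 10, 2.5)
print(words)  # ['one', 'three']
conn.close()
```

　　The join returns only 'one' and 'three': the values 2 and 4 in t1 have no matching id in t2, and id 5 in t2 has no matching x in t1.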
　　5 Sentry
　　5.1 Enabling authorization
　　5.1.1 Turning on the Sentry service
　　Hive/Impala > Configuration > Service-Wide > Sentry Service > select "sentry"
　　5.1.2 Specifying the Sentry server
　　Hive > Configuration > Service-Wide > Advanced > Server Name for Sentry Authorization (hive.sentry.server) > enter the name or IP address of the Sentry server
　　5.1.3 Setting the privileged users
　　Hive > Configuration > Service-Wide > Security > Bypass Sentry Authorization Users (sentry.metastore.service.users) > enter the Linux user names that bypass Sentry authorization (hive, impala, hue, hdfs, etc.)
　　5.1.4 Configuring the Hive proxy user
　　HDFS > Configuration > Service-Wide > Proxy > Hive Proxy User Groups (hadoop.proxyuser.hive.groups) > enter the Linux groups whose users Hive may act on behalf of (hive, impala, hue, hdfs, etc.)
　　5.1.5 Restarting the services
　　Restart the Hive and Impala services.
　　5.2 Granting privileges
　　5.2.1 Creating the users and group
groupadd gp1
useradd user1 -G gp1
useradd user2 -G gp1
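　　5.2.2 Switching the executing user
su - impala
　　5.2.3 Creating the database
　　Enter the Hive shell:
hive
　　Create the database:
create database testdb;
　　Exit the Hive shell:
quit;
　　5.2.4 Creating a role
　　Enter the Impala shell:
impala-shell
　　Create the role:
create role ro1;
　　5.2.5 Verifying the created role
show roles;
　　5.2.6 Associating the group with the role
grant role ro1 to group gp1;
　　5.2.7 Granting privileges to the role
grant all on database testdb to role ro1;
　　References: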
　　==================================================
　　Docs:
　　----------------
　　http://hadoop.apache.org/docs/current/
　　HDFS Commands Guide:
　　---------------------
　　http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/HDFSCommands.html
　　File System Shell Guide:
　　http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/FileSystemShell.html#Overview
　　MapReduce Common Guide:
　　------------------------
　　http://hadoop.apache.org/docs/current/hadoop-mapreduce-client/hadoop-mapreduce-client-core/MapredCommands.html
　　Hive Docs
　　-------------------------
　　http://hive.apache.org
　　LanguageManual:
　　https://cwiki.apache.org/confluence/display/Hive/LanguageManual
　　GettingStarted:
　　https://cwiki.apache.org/confluence/display/Hive/GettingStarted
　　User Documentation:
　　https://cwiki.apache.org/confluence/display/Hive/Home#Home-UserDocumentation
　　Impala Docs
　　--------------------------
　　Impala SQL
　　http://www.cloudera.com/documentation/enterprise/5-6-x/topics/impala_langref_sql.html#langref_sql
　　Impala Tutorials
　　http://www.cloudera.com/documentation/enterprise/latest/topics/impala_tutorial.html
　　Impala Explore
　　http://www.cloudera.com/documentation/enterprise/latest/topics/impala_tutorial.html#tutorial_explore
　　Sentry Docs
　　----------------------------------
　　Overview of Impala Security
　　http://www.cloudera.com/documentation/enterprise/5-7-x/topics/impala_security.html#security
　　Enabling Sentry Authorization for Impala
　　http://www.cloudera.com/documentation/enterprise/5-7-x/topics/impala_authorization.html#authorization
　　Impala Grant
　　http://www.cloudera.com/documentation/enterprise/5-6-x/topics/impala_grant.html#grant
　　Hive Grant
　　http://www.cloudera.com/documentation/enterprise/5-6-x/topics/sg_hive_sql.html#concept_c2q_4qx_p4__col_level_auth_sentry
　　Disabling Hive CLI
　　http://www.cloudera.com/documentation/enterprise/5-6-x/topics/sg_sentry_overview.html
　　======================================
　　Other references:
　　======================================
　　The concept of ETL:
　　----------
　　http://www.cnblogs.com/elaron/archive/2012/04/09/2438372.html
　　Introduction to the Apache Sentry architecture
　　http://blog.javachen.com/2015/04/29/apache-sentry-architecture.html
　　Enabling Kerberos authentication
　　http://www.cloudera.com/documentation/enterprise/latest/topics/cm_sg_intro_kerb.html#xd_583c10bfdbd326ba--6eed2fb8-14349d04bee--76dd
　　Introduction to the Impala architecture
　　http://www.mutouxiaogui.cn/blog/?p=319
