Python + Hadoop (a worked example)
How do you connect Python to Hadoop and make use of Hadoop's resources? This post walks through a simple example.
1. The Python map/reduce code
Assuming you already have a reasonable grasp of Hadoop, the first step is to write a mapper and a reducer:
1) mapper.py
#!/usr/bin/env python
import sys

# Read lines from standard input, split each line into words,
# and emit one "word<TAB>1" pair per word.
for line in sys.stdin:
    line = line.strip()
    words = line.split()
    for word in words:
        print '%s\t%s' % (word, 1)
2) reducer.py
#!/usr/bin/env python
import sys

current_word = None
current_count = 0
word = None

# The input arrives sorted by word (Hadoop's shuffle/sort phase, or the
# local `sort` in the test below), so counts for the same word are adjacent.
for line in sys.stdin:
    line = line.strip()
    word, count = line.split('\t')
    try:
        count = int(count)
    except ValueError:
        # Skip malformed lines whose count is not a number.
        continue
    if current_word == word:
        current_count += count
    else:
        if current_word:
            print '%s\t%s' % (current_word, current_count)
        current_count = count
        current_word = word

# Emit the count for the last word, if there was any input.
if current_word == word:
    print '%s\t%s' % (current_word, current_count)
With both scripts in place (and made executable with chmod +x mapper.py reducer.py), test them locally; the shell's sort stands in for Hadoop's shuffle phase:
$ echo "I like python hadoop , hadoop very good" | ./mapper.py | sort -k 1,1 | ./reducer.py
,	1
good	1
hadoop	2
I	1
like	1
python	1
very	1
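The same check can be done entirely in Python, without the shell pipeline. The snippet below is only an illustrative sketch (it is not part of the original mapper/reducer code): it maps the sample sentence, sorts the intermediate pairs the way sort -k 1,1 would, and reduces them with itertools.groupby. The ordering of the output may differ slightly from the shell's locale-dependent sort.
#!/usr/bin/env python
# Illustrative sketch: simulate mapper | sort | reducer in a single process.
from itertools import groupby

sentence = "I like python hadoop , hadoop very good"

# Map phase: emit a (word, 1) pair for every word.
pairs = [(word, 1) for word in sentence.split()]

# Shuffle/sort phase: bring identical keys together, like sort -k 1,1.
pairs.sort(key=lambda pair: pair[0])

# Reduce phase: sum the counts of each group of identical words.
for word, group in groupby(pairs, key=lambda pair: pair[0]):
    print('%s\t%d' % (word, sum(count for _, count in group)))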
2. Uploading files
The local test looks fine, so we are halfway there. Next, upload a few files to Hadoop for a larger test. I grabbed a few public-domain texts online:
wget http://www.gutenberg.org/ebooks/20417.txt.utf-8
wget http://www.gutenberg.org/files/5000/5000-8.txt
wget http://www.gutenberg.org/ebooks/4300.txt.utf-8
Check the downloaded files:
$ ls
20417.txt.utf-8  4300.txt.utf-8  5000-8.txt  mapper.py  reducer.py  run.sh
Upload the text files to Hadoop with the following command (the Hadoop cluster is already configured and the target directory has already been created):
hadoop dfs -put ./*.txt /user/ticketdev/tmp
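If you prefer to drive the upload from Python, a minimal sketch like the one below also works. It is an illustration rather than part of the original post: it assumes the hadoop client is on PATH and that /user/ticketdev/tmp already exists, and it uses the newer hadoop fs form of the command (hadoop dfs prints a deprecation warning on newer releases).
#!/usr/bin/env python
# Illustrative sketch: upload the downloaded *.txt files to HDFS from Python.
# Assumes the hadoop client is on PATH and the target directory already exists.
import glob
import subprocess

target_dir = '/user/ticketdev/tmp'
for path in glob.glob('./*.txt'):
    subprocess.check_call(['hadoop', 'fs', '-put', path, target_dir])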
Create run.sh; here $STREAM is assumed to point to the hadoop-streaming jar that ships with your Hadoop distribution (for example, something like $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-*.jar):
hadoop jar $STREAM \
    -files ./mapper.py,./reducer.py \
    -mapper ./mapper.py \
    -reducer ./reducer.py \
    -input /user/ticketdev/tmp/*.txt \
    -output /user/ticketdev/tmp/output
Run the script, then inspect the result:
$ hadoop dfs -cat /user/ticketdev/tmp/output/part-00000 | sort -nk 2 | tail
DEPRECATED: Use of this script to execute hdfs command is deprecated.
Instead use the hdfs command for it.
it	2387
which	2387
that	2668
a	3797
is	4097
to	5079
in	5226
and	7611
of	10388
the	20583
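The sort -nk 2 | tail part of the pipeline can also be reproduced in Python. The sketch below is illustrative only and assumes the output file has first been copied to the local working directory, for example with hadoop fs -get /user/ticketdev/tmp/output/part-00000 .
#!/usr/bin/env python
# Illustrative sketch: print the ten most frequent words from the job output.
# Assumes part-00000 has been fetched locally, e.g. with:
#   hadoop fs -get /user/ticketdev/tmp/output/part-00000 .
counts = []
with open('part-00000') as output:
    for line in output:
        word, count = line.rstrip('\n').split('\t')
        counts.append((int(count), word))

# Equivalent of sort -nk 2 | tail: sort numerically by count, keep the last ten.
for count, word in sorted(counts)[-10:]:
    print('%s\t%d' % (word, count))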
3. References:
http://www.cnblogs.com/wing1995/p/hadoop.html?utm_source=tuicool&utm_medium=referral