Basic Hadoop Cluster Operations

This article is licensed under BY-SA. Please include a link to the original when reposting.


Author: 黑伴白

Article link: http://heibanbai.com.cn/posts/2cfc941a/


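Viewing basic Hadoop cluster information

Viewing cluster storage information

Open the HDFS monitoring web UI to check the running status and storage information. The default port is 50070 in Hadoop 2.x (9870 from Hadoop 3.0 onward); the actual address is whatever is configured in hdfs-site.xml: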

<!-- NameNode web UI address (dfs.http.address is the deprecated name of dfs.namenode.http-address) -->
<property>
<name>dfs.http.address</name>
<value>node1:50070</value>
</property>
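
If you are not sure which address is actually in effect, the configuration can be queried instead of read by eye. A minimal sketch, assuming the hdfs client is on the PATH and that the value set above is visible under the current key name dfs.namenode.http-address; namenode_web_url is a made-up helper name, not part of Hadoop:

```shell
# Print the NameNode web UI URL as resolved from the loaded Hadoop configuration.
# namenode_web_url is a hypothetical helper name used only for illustration.
namenode_web_url() {
    # `hdfs getconf -confKey KEY` prints the effective value of a configuration key.
    addr=$(hdfs getconf -confKey dfs.namenode.http-address) || return 1
    printf 'http://%s\n' "$addr"
}

# Usage: namenode_web_url
```

With the configuration shown above this should print http://node1:50070, which is the address to open in a browser.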

image-20220512104801237

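The same information is available from the command line on the server by running hdfs dfsadmin -report. The relevant usage is: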

Usage: hdfs dfsadmin
Note: Administrative commands can only be run as the HDFS superuser.
[-report [-live] [-dead] [-decommissioning] [-enteringmaintenance] [-inmaintenance]]
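
For scripting, the summary at the top of the report can be picked apart with standard text tools. A rough sketch, assuming the report contains a cluster-wide line of the form "DFS Used%: <value>" ahead of the per-DataNode sections (check the output of your own version first); dfs_used_percent is a hypothetical helper:

```shell
# Read an `hdfs dfsadmin -report` dump on stdin and print the first "DFS Used%" value.
# Assumption: the first occurrence is the cluster-wide figure, because the
# per-DataNode sections that follow repeat the same label.
dfs_used_percent() {
    awk -F': ' '/^DFS Used%/ { print $2; exit }'
}

# Usage (run as the HDFS superuser): hdfs dfsadmin -report | dfs_used_percent
```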

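Viewing cluster compute resources

Open the ResourceManager web UI on port 8088 (the default) to view the cluster's compute resources. The actual address is whatever is configured in yarn-site.xml: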

<!-- Hostname of the YARN ResourceManager -->
<property>
<name>yarn.resourcemanager.hostname</name>
<value>node1</value>
</property>
<!-- Web UI address of YARN -->
<property>
<description>
The http address of the RM web application.
If only a host is provided as the value,
the webapp will be served on a random port.
</description>
<name>yarn.resourcemanager.webapp.address</name>
<value>${yarn.resourcemanager.hostname}:8088</value>
</property>
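
The figures shown on that page are also exposed by the ResourceManager REST API under /ws/v1/cluster/metrics. A small sketch, assuming curl is installed, the host and port match the configuration above, and the response is compact single-line JSON; rm_metric is a hypothetical helper, and the sed extraction is a shortcut rather than a real JSON parser:

```shell
# Extract one numeric field (for example availableMB) from cluster-metrics JSON on stdin.
# rm_metric is a hypothetical helper; it only handles plain non-negative integers.
rm_metric() {
    sed -n "s/.*\"$1\":\([0-9][0-9]*\).*/\1/p"
}

# Usage, assuming the ResourceManager web address is node1:8088:
#   curl -s http://node1:8088/ws/v1/cluster/metrics | rm_metric availableMB
```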

image-20220512112122750

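Open port 8042 (the default) on a NodeManager host to view that node's resource information. The actual address is whatever is configured in yarn-site.xml: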

<property>
<description>NM Webapp address.</description>
<name>yarn.nodemanager.webapp.address</name>
<value>${yarn.nodemanager.hostname}:8042</value>
</property>
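
To find the NodeManager web addresses without checking each host by hand, yarn node -list reports them. A sketch, assuming the usual four-column output (Node-Id, Node-State, Node-Http-Address, Number-of-Running-Containers); running_nm_urls is a hypothetical helper:

```shell
# Print the web UI URL of every NodeManager that YARN reports as RUNNING.
# Assumption: column 2 is the node state and column 3 the HTTP address.
running_nm_urls() {
    yarn node -list 2>/dev/null | awk '$2 == "RUNNING" { print "http://" $3 }'
}

# Usage: running_nm_urls
```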

image-20220512112933576

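HDFS file system operations

Browsing the HDFS file system

You can browse the HDFS file system through the web UI on port 50070. Its directory structure looks much like that of an ordinary Linux file system: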

image-20220512113458371

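Basic HDFS operations

Most management operations on HDFS can be carried out with the hdfs dfs command. Running it without arguments prints the available subcommands: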

[hadoop@node1 ~]$ hdfs dfs
Usage: hadoop fs [generic options]
[-appendToFile <localsrc> ... <dst>]
[-cat [-ignoreCrc] <src> ...]
[-checksum <src> ...]
[-chgrp [-R] GROUP PATH...]
[-chmod [-R] <MODE[,MODE]... | OCTALMODE> PATH...]
[-chown [-R] [OWNER][:[GROUP]] PATH...]
[-copyFromLocal [-f] [-p] [-l] [-d] <localsrc> ... <dst>]
[-copyToLocal [-f] [-p] [-ignoreCrc] [-crc] <src> ... <localdst>]
[-count [-q] [-h] [-v] [-t [<storage type>]] [-u] [-x] <path> ...]
[-cp [-f] [-p | -p[topax]] [-d] <src> ... <dst>]
[-createSnapshot <snapshotDir> [<snapshotName>]]
[-deleteSnapshot <snapshotDir> <snapshotName>]
[-df [-h] [<path> ...]]
[-du [-s] [-h] [-x] <path> ...]
[-expunge]
[-find <path> ... <expression> ...]
[-get [-f] [-p] [-ignoreCrc] [-crc] <src> ... <localdst>]
[-getfacl [-R] <path>]
[-getfattr [-R] {-n name | -d} [-e en] <path>]
[-getmerge [-nl] [-skip-empty-file] <src> <localdst>]
[-help [cmd ...]]
[-ls [-C] [-d] [-h] [-q] [-R] [-t] [-S] [-r] [-u] [<path> ...]]
[-mkdir [-p] <path> ...]
[-moveFromLocal <localsrc> ... <dst>]
[-moveToLocal <src> <localdst>]
[-mv <src> ... <dst>]
[-put [-f] [-p] [-l] [-d] <localsrc> ... <dst>]
[-renameSnapshot <snapshotDir> <oldName> <newName>]
[-rm [-f] [-r|-R] [-skipTrash] [-safely] <src> ...]
[-rmdir [--ignore-fail-on-non-empty] <dir> ...]
[-setfacl [-R] [{-b|-k} {-m|-x <acl_spec>} <path>]|[--set <acl_spec> <path>]]
[-setfattr {-n name [-v value] | -x name} <path>]
[-setrep [-R] [-w] <rep> <path> ...]
[-stat [format] <path> ...]
[-tail [-f] <file>]
[-test -[defsz] <path>]
[-text [-ignoreCrc] <src> ...]
[-touchz <path> ...]
[-truncate [-w] <length> <path> ...]
[-usage [cmd ...]]

Generic options supported are:
-conf <configuration file> specify an application configuration file
-D <property=value> define a value for a given property
-fs <file:///|hdfs://namenode:port> specify default filesystem URL to use, overrides 'fs.defaultFS' property from configurations.
-jt <local|resourcemanager:port> specify a ResourceManager
-files <file1,...> specify a comma-separated list of files to be copied to the map reduce cluster
-libjars <jar1,...> specify a comma-separated list of jar files to be included in the classpath
-archives <archive1,...> specify a comma-separated list of archives to be unarchived on the compute machines

The general command line syntax is:
command [genericOptions] [commandOptions]

Commonly used commands:

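# Create directories
hdfs dfs -mkdir /tmp # create a tmp directory under the HDFS root (/)
hdfs dfs -mkdir -p /test1/test2 # create nested directories with -p

# List files
hdfs dfs -ls / # list the contents of the HDFS root directory (add -R to recurse)

# Show file contents
hdfs dfs -cat /tmp/test.txt # print the contents of the file

# Upload files to HDFS
hdfs dfs -put /home/hadoop/test.txt /tmp # upload the local file /home/hadoop/test.txt to /tmp in HDFS
hdfs dfs -appendToFile /home/hadoop/test.txt /tmp/test.txt # if the HDFS file exists, append the local file to its end
hdfs dfs -copyFromLocal -f /home/hadoop/test.txt /tmp # -f overwrites the HDFS file if it already exists

# Download files from HDFS
hdfs dfs -get /tmp/test.txt /home/hadoop # download test.txt from HDFS to the local /home/hadoop directory
hdfs dfs -copyToLocal /tmp/test.txt /home/hadoop/test1.txt # save under a different name if test.txt already exists locally

# Move files within HDFS
hdfs dfs -mv /tmp/test.txt /test/ # move test.txt into the /test directory (which must already exist)

# Delete a file in HDFS
hdfs dfs -rm /tmp/test.txt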
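
Put together, a typical upload looks like the sketch below. It assumes the hdfs client is on the PATH and that the caller may write to the target directory; hdfs_upload is a hypothetical helper, and the paths in the usage line are the ones used above:

```shell
# Upload a local file into an HDFS directory and confirm that it arrived.
# hdfs_upload is a hypothetical helper name used only for illustration.
hdfs_upload() {
    src=$1
    dst_dir=$2
    hdfs dfs -mkdir -p "$dst_dir" || return 1        # create the target directory if needed
    hdfs dfs -put -f "$src" "$dst_dir" || return 1   # -f overwrites an existing copy
    # -test -e returns 0 only if the path exists in HDFS
    hdfs dfs -test -e "$dst_dir/$(basename "$src")"
}

# Usage: hdfs_upload /home/hadoop/test.txt /tmp && hdfs dfs -ls /tmp
```

Checking with -test -e at the end gives the function a meaningful exit status, which makes it easy to chain with && in larger scripts.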

File details can also be viewed in the web UI:

image-20220512140332957

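Running MapReduce jobs

The official examples package

The $HADOOP_HOME/share/hadoop/mapreduce/ directory contains the official examples jar, hadoop-mapreduce-examples-2.10.1.jar, which bundles a number of commonly used test programs: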

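Program             Purpose
aggregatewordcount  An aggregate-based map/reduce program that counts the words in the input files.
aggregatewordhist   An aggregate-based map/reduce program that computes the histogram of the words in the input files.
bbp                 A map/reduce program that uses Bailey-Borwein-Plouffe to compute exact digits of Pi.
dbcount             An example job that counts pageviews from a database.
distbbp             A map/reduce program that uses a BBP-type formula to compute exact bits of Pi.
grep                A map/reduce program that counts the matches of a regular expression in the input.
join                A job that performs a join over sorted, equally partitioned datasets.
multifilewc         A job that counts words from several files.
pentomino           A map/reduce tile-laying program that finds solutions to pentomino problems.
pi                  A map/reduce program that estimates Pi using a quasi-Monte Carlo method.
randomtextwriter    A map/reduce program that writes 10GB of random textual data per node.
randomwriter        A map/reduce program that writes 10GB of random data per node.
secondarysort       An example that defines a secondary sort for the reduce phase.
sort                A map/reduce program that sorts the data written by the random writer.
sudoku              A sudoku solver.
teragen             Generates data for terasort.
terasort            Runs terasort.
teravalidate        Checks the results of terasort.
wordcount           A map/reduce program that counts the words in the input files.
wordmean            A map/reduce program that computes the average length of the words in the input files.
wordmedian          A map/reduce program that computes the median length of the words in the input files.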
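
Because the jar name includes the Hadoop version, scripts that hard-code the path tend to break after an upgrade. One way around this is sketched below, assuming HADOOP_HOME points at the installation root (/app/hadoop-2.10.1 in this article); examples_jar is a hypothetical helper:

```shell
# Print the path of the MapReduce examples jar under $HADOOP_HOME, whatever its version.
# examples_jar is a hypothetical helper name used only for illustration.
examples_jar() {
    for jar in "$HADOOP_HOME"/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar; do
        # If nothing matches, the unexpanded pattern stays in $jar and fails this check.
        [ -f "$jar" ] && { printf '%s\n' "$jar"; return 0; }
    done
    return 1
}

# Usage: running the jar without a program name prints the list of valid program names.
#   hadoop jar "$(examples_jar)"
```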

Submitting MapReduce jobs

Example 1: wordcount

The command and its log output are as follows:

[hadoop@node1 ~]$ hadoop jar /app/hadoop-2.10.1/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.10.1.jar wordcount /tmp/test.txt /tmp/output
22/05/11 23:19:17 INFO client.RMProxy: Connecting to ResourceManager at node1/199.188.166.111:8032
22/05/11 23:19:19 INFO input.FileInputFormat: Total input files to process : 1
22/05/11 23:19:19 INFO mapreduce.JobSubmitter: number of splits:1
22/05/11 23:19:20 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1652322858586_0001
22/05/11 23:19:20 INFO conf.Configuration: resource-types.xml not found
22/05/11 23:19:20 INFO resource.ResourceUtils: Unable to find 'resource-types.xml'.
22/05/11 23:19:20 INFO resource.ResourceUtils: Adding resource type - name = memory-mb, units = Mi, type = COUNTABLE
22/05/11 23:19:20 INFO resource.ResourceUtils: Adding resource type - name = vcores, units = , type = COUNTABLE
22/05/11 23:19:21 INFO impl.YarnClientImpl: Submitted application application_1652322858586_0001
22/05/11 23:19:21 INFO mapreduce.Job: The url to track the job: http://node1:8088/proxy/application_1652322858586_0001/
22/05/11 23:19:21 INFO mapreduce.Job: Running job: job_1652322858586_0001
22/05/11 23:19:38 INFO mapreduce.Job: Job job_1652322858586_0001 running in uber mode : false
22/05/11 23:19:38 INFO mapreduce.Job: map 0% reduce 0%
22/05/11 23:19:50 INFO mapreduce.Job: map 100% reduce 0%
22/05/11 23:20:02 INFO mapreduce.Job: map 100% reduce 100%
22/05/11 23:20:03 INFO mapreduce.Job: Job job_1652322858586_0001 completed successfully
22/05/11 23:20:03 INFO mapreduce.Job: Counters: 49
File System Counters
FILE: Number of bytes read=2274
FILE: Number of bytes written=421473
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=3213
HDFS: Number of bytes written=1928
HDFS: Number of read operations=6
HDFS: Number of large read operations=0
HDFS: Number of write operations=2
Job Counters
Launched map tasks=1
Launched reduce tasks=1
Data-local map tasks=1
Total time spent by all maps in occupied slots (ms)=8053
Total time spent by all reduces in occupied slots (ms)=8758
Total time spent by all map tasks (ms)=8053
Total time spent by all reduce tasks (ms)=8758
Total vcore-milliseconds taken by all map tasks=8053
Total vcore-milliseconds taken by all reduce tasks=8758
Total megabyte-milliseconds taken by all map tasks=8246272
Total megabyte-milliseconds taken by all reduce tasks=8968192
Map-Reduce Framework
Map input records=38
Map output records=335
Map output bytes=4379
Map output materialized bytes=2274
Input split bytes=95
Combine input records=335
Combine output records=87
Reduce input groups=87
Reduce shuffle bytes=2274
Reduce input records=87
Reduce output records=87
Spilled Records=174
Shuffled Maps =1
Failed Shuffles=0
Merged Map outputs=1
GC time elapsed (ms)=222
CPU time spent (ms)=2630
Physical memory (bytes) snapshot=396931072
Virtual memory (bytes) snapshot=3804737536
Total committed heap usage (bytes)=194383872
Shuffle Errors
BAD_ID=0
CONNECTION=0
IO_ERROR=0
WRONG_LENGTH=0
WRONG_MAP=0
WRONG_REDUCE=0
File Input Format Counters
Bytes Read=3118
File Output Format Counters
Bytes Written=1928
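
One thing the log does not show: MapReduce refuses to start a job whose output directory already exists, so rerunning the command above unchanged fails. A hedged sketch of a wrapper that clears the old output first, assuming the Hadoop 2.10.1 install path used in this article; run_wordcount and EXAMPLES_JAR are hypothetical names:

```shell
# Path taken from the command above; adjust it to your installation.
EXAMPLES_JAR=/app/hadoop-2.10.1/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.10.1.jar

# Run the wordcount example, removing any previous output directory beforehand.
run_wordcount() {
    input=$1
    output=$2
    # -f suppresses the error when the directory does not exist yet
    hdfs dfs -rm -r -f "$output" || return 1
    hadoop jar "$EXAMPLES_JAR" wordcount "$input" "$output" || return 1
    # A job that completed successfully leaves a _SUCCESS marker in the output directory
    hdfs dfs -test -e "$output/_SUCCESS"
}

# Usage: run_wordcount /tmp/test.txt /tmp/output
```

Deleting the output directory is destructive, so a wrapper like this is best pointed only at scratch locations such as the /tmp/output path used here.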

Once the job finishes, the results can be found in HDFS:

image-20220512142722893

image-20220512142955562

Two new files are created in the output directory: _SUCCESS, an empty marker file indicating that the job completed successfully, and part-r-00000, which holds the job's results:

[hadoop@node1 ~]$ hdfs dfs -cat /tmp/output/part-r-00000
-rw-rw-r--. 4
00:15 1
00:20 1
00:26 1
00:29 1
00:39 1
00:47 1
00:51 1
00:54 1
01:00 1
01:04 1
01:07 1
01:09 1
01:16 1
07:55 1
08:04 1
08:15 1
09:32 1
1 4
11 9
16 1
17 5
18:59 6
19:33 4
19:34 4
2 8
20:24 1
20:33 1
20:46 1
21 18
23:05 1
23:06 2
28 1
29 1
3 23
30 2
35 2
4 7
5 17
54 1
6 7
9 6
Apr 4
Jetty_0_0_0_0_50070_hdfs____w2cu08 1
Jetty_0_0_0_0_50090_secondary____y6aanv 1
Jetty_0_0_0_0_8042_node____19tj0x 1
Jetty_localhost_32873_datanode____t7p7lo 1
Jetty_localhost_33735_datanode____jksu74 1
Jetty_localhost_34961_datanode____.fpendy 1
Jetty_localhost_36015_datanode____.lhrbt4 1
Jetty_localhost_38151_datanode____.rhd829 1
Jetty_localhost_39677_datanode____.s4r2y1 1
Jetty_localhost_40461_datanode____.d6iqau 1
Jetty_localhost_40969_datanode____1moe5j 1
Jetty_localhost_41457_datanode____snit9c 1
Jetty_localhost_42109_datanode____.mhhtgd 1
Jetty_localhost_42315_datanode____.wlr1a8 1
Jetty_localhost_42845_datanode____.422dr2 1
Jetty_localhost_43529_datanode____.iybvi4 1
Jetty_localhost_43811_datanode____vzpazk 1
Jetty_localhost_44775_datanode____2kxto 1
Jetty_node1_50070_hdfs____.8fa0c 1
Jetty_node1_8088_cluster____uqk9cr 1
May 33
drwx------. 11
drwxr-xr-x. 2
drwxrwxr-x. 20
hadoop 52
hadoop-hadoop-datanode.pid 1
hadoop-hadoop-namenode.pid 1
hsperfdata_hadoop 1
hsperfdata_root 1
root 22
systemd-private-0abe12489c264785bd8088f6e33eeb83-ModemManager.service-aaM0Jf 1
systemd-private-0abe12489c264785bd8088f6e33eeb83-bluetooth.service-7X05Qi 1
systemd-private-0abe12489c264785bd8088f6e33eeb83-chronyd.service-HmOY5i 1
systemd-private-0abe12489c264785bd8088f6e33eeb83-colord.service-VyCTLg 1
systemd-private-0abe12489c264785bd8088f6e33eeb83-rtkit-daemon.service-3U58pj 1
total 1
tracker-extract-files.1000 1
vmware-root_916-2689078442 1
vmware-root_918-2697532712 1
vmware-root_921-3980298495 1
vmware-root_925-3988621690 1
vmware-root_927-3980167416 1
yarn-hadoop-nodemanager.pid 1
yarn-hadoop-resourcemanager.pid 1
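
For a small input like this one, the counts can be cross-checked locally with standard Unix tools. The pipeline below is a sketch of the same computation (split on whitespace, then count identical tokens), not the code the examples jar runs; local_wordcount is a hypothetical helper:

```shell
# Count whitespace-separated tokens on stdin and print "token<TAB>count",
# sorted in byte order like the keys in part-r-00000.
local_wordcount() {
    tr -s '[:space:]' '\n' |
        grep -v '^$' |
        LC_ALL=C sort |
        uniq -c |
        awk '{ print $2 "\t" $1 }'
}

# Usage: hdfs dfs -cat /tmp/test.txt | local_wordcount
```

Comparing this output with part-r-00000 using diff is a quick sanity check before trusting a job on larger data.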

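Example 2: estimating the value of π

The command and its log output are shown below. The two arguments are the number of map tasks (10) and the number of samples per map (100):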

[hadoop@node1 ~]$ hadoop jar /app/hadoop-2.10.1/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.10.1.jar pi 10 100
Number of Maps = 10
Samples per Map = 100
Wrote input for Map #0
Wrote input for Map #1
Wrote input for Map #2
Wrote input for Map #3
Wrote input for Map #4
Wrote input for Map #5
Wrote input for Map #6
Wrote input for Map #7
Wrote input for Map #8
Wrote input for Map #9
Starting Job
22/05/11 23:35:11 INFO client.RMProxy: Connecting to ResourceManager at node1/199.188.166.111:8032
22/05/11 23:35:12 INFO input.FileInputFormat: Total input files to process : 10
22/05/11 23:35:12 INFO mapreduce.JobSubmitter: number of splits:10
22/05/11 23:35:13 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1652322858586_0002
22/05/11 23:35:13 INFO conf.Configuration: resource-types.xml not found
22/05/11 23:35:13 INFO resource.ResourceUtils: Unable to find 'resource-types.xml'.
22/05/11 23:35:13 INFO resource.ResourceUtils: Adding resource type - name = memory-mb, units = Mi, type = COUNTABLE
22/05/11 23:35:13 INFO resource.ResourceUtils: Adding resource type - name = vcores, units = , type = COUNTABLE
22/05/11 23:35:13 INFO impl.YarnClientImpl: Submitted application application_1652322858586_0002
22/05/11 23:35:13 INFO mapreduce.Job: The url to track the job: http://node1:8088/proxy/application_1652322858586_0002/
22/05/11 23:35:13 INFO mapreduce.Job: Running job: job_1652322858586_0002
22/05/11 23:35:24 INFO mapreduce.Job: Job job_1652322858586_0002 running in uber mode : false
22/05/11 23:35:24 INFO mapreduce.Job: map 0% reduce 0%
22/05/11 23:35:42 INFO mapreduce.Job: map 20% reduce 0%
22/05/11 23:36:00 INFO mapreduce.Job: map 20% reduce 7%
22/05/11 23:36:28 INFO mapreduce.Job: map 30% reduce 7%
22/05/11 23:36:29 INFO mapreduce.Job: map 50% reduce 7%
22/05/11 23:36:30 INFO mapreduce.Job: map 70% reduce 7%
22/05/11 23:36:31 INFO mapreduce.Job: map 100% reduce 7%
22/05/11 23:36:32 INFO mapreduce.Job: map 100% reduce 100%
22/05/11 23:36:33 INFO mapreduce.Job: Job job_1652322858586_0002 completed successfully
22/05/11 23:36:33 INFO mapreduce.Job: Counters: 49
File System Counters
FILE: Number of bytes read=226
FILE: Number of bytes written=2297625
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=2620
HDFS: Number of bytes written=215
HDFS: Number of read operations=43
HDFS: Number of large read operations=0
HDFS: Number of write operations=3
Job Counters
Launched map tasks=10
Launched reduce tasks=1
Data-local map tasks=10
Total time spent by all maps in occupied slots (ms)=528176
Total time spent by all reduces in occupied slots (ms)=48013
Total time spent by all map tasks (ms)=528176
Total time spent by all reduce tasks (ms)=48013
Total vcore-milliseconds taken by all map tasks=528176
Total vcore-milliseconds taken by all reduce tasks=48013
Total megabyte-milliseconds taken by all map tasks=540852224
Total megabyte-milliseconds taken by all reduce tasks=49165312
Map-Reduce Framework
Map input records=10
Map output records=20
Map output bytes=180
Map output materialized bytes=280
Input split bytes=1440
Combine input records=0
Combine output records=0
Reduce input groups=2
Reduce shuffle bytes=280
Reduce input records=20
Reduce output records=0
Spilled Records=40
Shuffled Maps =10
Failed Shuffles=0
Merged Map outputs=10
GC time elapsed (ms)=9894
CPU time spent (ms)=13350
Physical memory (bytes) snapshot=2232963072
Virtual memory (bytes) snapshot=20908384256
Total committed heap usage (bytes)=1540988928
Shuffle Errors
BAD_ID=0
CONNECTION=0
IO_ERROR=0
WRONG_LENGTH=0
WRONG_MAP=0
WRONG_REDUCE=0
File Input Format Counters
Bytes Read=1180
File Output Format Counters
Bytes Written=97
Job Finished in 82.602 seconds
Estimated value of Pi is 3.14800000000000000000
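
To see where such a number comes from: the estimator scatters points over a unit square and counts how many fall inside the inscribed circle, so that π ≈ 4 × inside / total. The awk sketch below does this on a single machine with a two-dimensional Halton sequence (bases 2 and 3), a common low-discrepancy sequence for quasi-Monte Carlo methods. It illustrates the idea only and is not the example's source code; estimate_pi is a hypothetical helper:

```shell
# Estimate pi from n quasi-random points in the unit square.
# estimate_pi is a hypothetical helper name used only for illustration.
estimate_pi() {
    awk -v n="$1" '
        # i-th element of the Halton sequence in the given base (radical inverse of i)
        function halton(i, base,    f, r) {
            f = 1; r = 0
            while (i > 0) {
                f = f / base
                r = r + f * (i % base)
                i = int(i / base)
            }
            return r
        }
        BEGIN {
            inside = 0
            for (i = 1; i <= n; i++) {
                x = halton(i, 2) - 0.5      # shift so the circle is centred at the origin
                y = halton(i, 3) - 0.5
                if (x * x + y * y <= 0.25)  # radius 0.5, so radius squared is 0.25
                    inside++
            }
            printf "%.4f\n", 4 * inside / n
        }'
}

# Usage: 10 maps x 100 samples in the run above correspond to 1000 points in total.
#   estimate_pi 1000
```

Raising the number of points should bring the estimate closer to π, which is also why larger arguments to the pi example give more accurate results at the cost of a longer run.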

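Viewing compute resource usage of MapReduce jobs

  • The page below shows cluster resource usage in real time (the jobs have already finished, so the values are back at their initial state)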

image-20220512144327517

  • The list of MapReduce jobs

image-20220512144447933

  • Details of a single job

image-20220512144548355
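
The same information is available from the command line. Comparing the job and application IDs in the logs above, the two differ only in their prefix, which the sketch below relies on; job_to_app_id is a hypothetical helper, while yarn application -status and -list are standard YARN CLI subcommands:

```shell
# Turn a MapReduce job ID into the matching YARN application ID.
# job_to_app_id is a hypothetical helper name used only for illustration.
job_to_app_id() {
    printf '%s\n' "$1" | sed 's/^job_/application_/'
}

# Usage:
#   yarn application -status "$(job_to_app_id job_1652322858586_0001)"
#   yarn application -list -appStates FINISHED
```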


Published on May 1, 2022