Hadoop (2): YARN Deployment


Posted by PsycheLee on 2015-11-08


Configuration changes

mapred-site.xml

[bigdata@hadoop001 hadoop]$ pwd
/home/bigdata/app/hadoop/etc/hadoop
[bigdata@hadoop001 hadoop]$ cp mapred-site.xml.template mapred-site.xml
[bigdata@hadoop001 hadoop]$ vi mapred-site.xml
<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
</configuration>

yarn-site.xml

[bigdata@hadoop001 hadoop]$ vi yarn-site.xml 
<configuration>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
  <property>
    <name>yarn.resourcemanager.hostname</name>
    <value>hadoop001</value>
  </property>
  <property>
    <name>yarn.resourcemanager.webapp.address</name>
    <value>${yarn.resourcemanager.hostname}:7776</value>
  </property>
</configuration>

## If the default port 8088 is exposed to the public internet, the server gets infected with crypto-mining malware (mining Bitcoin)

Open: add port 7776 to the security group in advance
http://hadoop001:7776/cluster
## Configure the Windows hosts file with the public IP
106.14.23.129 hadoop001
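
Once the security group rule is in place, a quick reachability check can be run from a machine with access (a sketch using curl; expect an HTTP 200, or a 30x redirect, from the ResourceManager web UI):

curl -s -o /dev/null -w "%{http_code}\n" http://hadoop001:7776/cluster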

Startup

[bigdata@hadoop001 hadoop]$ start-yarn.sh
starting yarn daemons
starting resourcemanager, logging to /home/bigdata/app/hadoop-2.6.0-cdh5.16.2/logs/yarn-bigdata-resourcemanager-hadoop001.out
hadoop001: starting nodemanager, logging to /home/bigdata/app/hadoop-2.6.0-cdh5.16.2/logs/yarn-bigdata-nodemanager-hadoop001.out
[bigdata@hadoop001 hadoop]$ jps
5216 NodeManager
3430 SecondaryNameNode
3271 DataNode
3144 NameNode
5112 ResourceManager
5517 Jps

Word count

[bigdata@hadoop001 hadoop]$ hadoop jar \
./share/hadoop/mapreduce2/hadoop-mapreduce-examples-2.6.0-cdh5.16.2.jar \
wordcount input output2
20/11/28 13:27:54 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
20/11/28 13:27:54 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
20/11/28 13:27:55 INFO input.FileInputFormat: Total input paths to process : 2
20/11/28 13:27:55 INFO mapreduce.JobSubmitter: number of splits:2
#### number of splits
20/11/28 13:27:56 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1606541061047_0001
20/11/28 13:27:56 INFO impl.YarnClientImpl: Submitted application application_1606541061047_0001
20/11/28 13:27:56 INFO mapreduce.Job: The url to track the job: http://hadoop001:7776/proxy/application_1606541061047_0001/
20/11/28 13:27:56 INFO mapreduce.Job: Running job: job_1606541061047_0001
20/11/28 13:28:02 INFO mapreduce.Job: Job job_1606541061047_0001 running in uber mode : false
20/11/28 13:28:02 INFO mapreduce.Job: map 0% reduce 0%
20/11/28 13:28:08 INFO mapreduce.Job: map 50% reduce 0%
20/11/28 13:28:09 INFO mapreduce.Job: map 100% reduce 0%
20/11/28 13:28:12 INFO mapreduce.Job: map 100% reduce 100%
20/11/28 13:28:13 INFO mapreduce.Job: Job job_1606541061047_0001 completed successfully
20/11/28 13:28:13 INFO mapreduce.Job: Counters: 49
File System Counters
FILE: Number of bytes read=138
FILE: Number of bytes written=429308
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=336
HDFS: Number of bytes written=88
HDFS: Number of read operations=9
HDFS: Number of large read operations=0
HDFS: Number of write operations=2
Job Counters
Launched map tasks=2
Launched reduce tasks=1
Data-local map tasks=2
Total time spent by all maps in occupied slots (ms)=5087
Total time spent by all reduces in occupied slots (ms)=2057
Total time spent by all map tasks (ms)=5087
Total time spent by all reduce tasks (ms)=2057
Total vcore-milliseconds taken by all map tasks=5087
Total vcore-milliseconds taken by all reduce tasks=2057
Total megabyte-milliseconds taken by all map tasks=5209088
Total megabyte-milliseconds taken by all reduce tasks=2106368
Map-Reduce Framework
Map input records=15
Map output records=18
Map output bytes=184
Map output materialized bytes=144
Input split bytes=222
Combine input records=18
Combine output records=11
Reduce input groups=11
Reduce shuffle bytes=144
Reduce input records=11
Reduce output records=11
Spilled Records=22
Shuffled Maps =2
Failed Shuffles=0
Merged Map outputs=2
GC time elapsed (ms)=179
CPU time spent (ms)=1230
Physical memory (bytes) snapshot=774582272
Virtual memory (bytes) snapshot=8328560640
Total committed heap usage (bytes)=759169024
Shuffle Errors
BAD_ID=0
CONNECTION=0
IO_ERROR=0
WRONG_LENGTH=0
WRONG_MAP=0
WRONG_REDUCE=0
File Input Format Counters
Bytes Read=114
File Output Format Counters
Bytes Written=88
[bigdata@hadoop001 hadoop]$ hadoop fs -ls
20/11/28 13:30:17 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Found 3 items
drwxr-xr-x - bigdata supergroup 0 2020-11-28 12:14 input
drwxr-xr-x - bigdata supergroup 0 2020-11-28 12:15 output
drwxr-xr-x - bigdata supergroup 0 2020-11-28 13:28 output2
[bigdata@hadoop001 hadoop]$ hadoop fs -get output2 output2
20/11/28 13:30:33 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
[bigdata@hadoop001 hadoop]$ cd output2
[bigdata@hadoop001 output2]$ cat part-r-00000
daa 1
dfs 1
dfs1 1
dfs2 1
dfs444 3
dfs5 1
dfs5dfsssss 2
dfsssss 1
fa 3
faifa 1
fdsadf 3

Before data is read in the map phase, FileInputFormat divides the input files into splits, and the number of splits determines the number of map tasks. The main factors that influence the number of maps (i.e. the number of splits) are:

  1. File size. With a block size (dfs.block.size) of 128 MB, a 128 MB input file is divided into 1 split, while a 256 MB file is divided into 2 splits.

  2. Number of files. FileInputFormat creates splits per file and only splits large files, i.e. those larger than the HDFS block size. If dfs.block.size is set to 128 MB and the input directory contains 100 files, there will be at least 100 splits.

  3. The split size. Input is divided according to splitSize; when not configured, a split defaults to the size of an HDFS block. An application can tune splitSize through two parameters:

splitSize = Math.max(minSize, Math.min(maxSize, blockSize))

where:

minSize = mapred.min.split.size

maxSize = mapred.max.split.size

We can add the following code to the driver part of a MapReduce program:

TextInputFormat.setMinInputSplitSize(job, 1024L); // set the minimum split size
TextInputFormat.setMaxInputSplitSize(job, 1024 * 1024 * 10L); // set the maximum split size

For example, with minSize = 1024 bytes, maxSize = 10 MB and a 128 MB block, splitSize = max(1024, min(10 MB, 128 MB)) = 10 MB, so a single 128 MB file produces about 13 splits (and 13 map tasks) instead of 1.

Uber mode can be loosely understood as JVM reuse; it was introduced in 2.x. When an MR job runs in Uber mode, all of its map tasks and reduce tasks run inside the ApplicationMaster's container, i.e. the whole job only launches the AM container. Because no separate mapper or reducer containers are started, the AM does not need to communicate with remote containers, and the whole process becomes simpler.

Not every MR job should run in Uber mode. If a job's input data is so small that launching a map or reduce container takes longer than processing the data itself, that job is a good candidate for Uber mode. In general, running such small jobs in Uber mode gives a 2x-3x performance improvement.
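
As a sketch of how to try it per job (the output directory name output_uber is made up; the mapreduce.job.ubertask.* properties are the standard Hadoop 2.x switches, and the bundled wordcount example accepts generic -D options):

[bigdata@hadoop001 hadoop]$ hadoop jar \
./share/hadoop/mapreduce2/hadoop-mapreduce-examples-2.6.0-cdh5.16.2.jar \
wordcount \
-D mapreduce.job.ubertask.enable=true \
-D mapreduce.job.ubertask.maxmaps=9 \
-D mapreduce.job.ubertask.maxreduces=1 \
input output_uber

If the job qualifies (small enough input, few enough tasks, fits within the AM's resources), the job log shows "running in uber mode : true" instead of false as in the run above.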

Crypto mining

Monitoring showed the server's CPU load was far too high.
Use the top command to look at the processes:

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
1065 root 20 0 289744 15084 6696 S 300% 0.2 14:50.30 jdog-monitor.1.

ps -ef | grep 1065    # inspect what the process is
kill -9 1065          # kill the process; a little later the mining program is started again automatically
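
Since the miner keeps coming back, the next step is usually to find whatever relaunches it. A hedged troubleshooting sketch (paths vary by system, run as root):

crontab -l                          # current user's cron entries
ls /var/spool/cron /etc/cron.d      # other users' and system-wide cron entries
systemctl list-timers               # systemd timers, if systemd is in use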

The jps command

Location

[bigdata@hadoop001 hadoop]$ which jps
/usr/java/jdk1.8.0_181/bin/jps

Usage

[bigdata@hadoop001 hadoop]$ jps
8320 Jps
3430 SecondaryNameNode
3271 DataNode
3144 NameNode
7852 NodeManager
7741 ResourceManager

Where the corresponding identifier files are stored

[bigdata@hadoop001 hadoop]$ cd /tmp/hsperfdata_bigdata/ 
[bigdata@hadoop001 hsperfdata_bigdata]$ ll
total 160
-rw------- 1 bigdata bigdata 32768 Nov 28 14:17 3144
-rw------- 1 bigdata bigdata 32768 Nov 28 14:17 3271
-rw------- 1 bigdata bigdata 32768 Nov 28 14:17 3430
-rw------- 1 bigdata bigdata 32768 Nov 28 14:17 7741
-rw------- 1 bigdata bigdata 32768 Nov 28 14:17 7852

Which users can see the processes

[root@hadoop001 ~]# jps
3430 SecondaryNameNode
3271 DataNode
8248 Jps
3144 NameNode
7852 NodeManager
7741 ResourceManager
[root@hadoop001 ~]# su - tom
[tom@hadoop001 ~]$ jps
8301 Jps

About "process information unavailable"

[root@hadoop001 ~]# kill -9 3430
[root@hadoop001 ~]# jps
3430 -- process information unavailable
3271 DataNode
3144 NameNode
8346 Jps
7852 NodeManager
7741 ResourceManager

When you see "process information unavailable",
it does not tell you whether the process exists or not. Be careful, especially if you use jps for status checks in scripts.
The classic ps -ef | grep xxx is what is normally used to check whether a process exists;
that is a real status check.

However: for example, Spark Thrift Server + Hive starts a driver process (say PID 110)
listening on port 10001 by default. Due to a memory leak or some other bug in the program,
the process may still show up in ps while port 10001 has gone offline, so the service can no longer serve requests.

Summary: from now on, any status check for a program must go through its port.
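
A minimal port-based check sketch (the host and port are just the example values above):

nc -z hadoop001 10001 && echo "port 10001 is up" || echo "port 10001 is down"
ss -lntp | grep 10001    # or inspect listening sockets on the server itself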

On CDH, running jps as root shows many "process information unavailable" entries,
while ps -ef | grep xxx shows the correct state.
To see the normal output you need to switch to the corresponding user,
e.g. su - hdfs (you may not be able to switch; the entry in /etc/passwd may need fixing),
and then run jps.

Deleting the jps identifier files does not affect restarting the services.

PID files

[bigdata@hadoop001 ~]$ cd /tmp
[bigdata@hadoop001 tmp]$ ll
total 60
srwxr-xr-x 1 root root 0 Nov 28 08:52 Aegis-<Guid(5A2C30A2-A87D-490A-9281-6765EDAD7CBA)>
-rw-rw-r-- 1 bigdata bigdata 5 Nov 28 11:36 hadoop-bigdata-datanode.pid
-rw-rw-r-- 1 bigdata bigdata 5 Nov 28 11:36 hadoop-bigdata-namenode.pid
-rw-rw-r-- 1 bigdata bigdata 5 Nov 28 11:36 hadoop-bigdata-secondarynamenode.pid
-rw-rw-r-- 1 bigdata bigdata 5 Nov 28 14:01 yarn-bigdata-nodemanager.pid
-rw-rw-r-- 1 bigdata bigdata 5 Nov 28 14:01 yarn-bigdata-resourcemanager.pid

Test: deleting the PID file affects restart

[bigdata@hadoop001 tmp]$ mv hadoop-bigdata-datanode.pid hadoop-bigdata-datanode.pid.bak
[bigdata@hadoop001 tmp]$ jps
3271 DataNode
3144 NameNode
7852 NodeManager
7741 ResourceManager
8365 Jps
[bigdata@hadoop001 tmp]$ cd
[bigdata@hadoop001 ~]$ sh stop-dfs.sh
20/11/28 14:25:32 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Stopping namenodes on [hadoop001]
hadoop001: stopping namenode
hadoop001: no datanode to stop
Stopping secondary namenodes [hadoop001]
hadoop001: no secondarynamenode to stop
20/11/28 14:25:39 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
[bigdata@hadoop001 ~]$ jps
3271 DataNode
8680 Jps
7852 NodeManager
7741 ResourceManager
[bigdata@hadoop001 ~]$ ps -ef|grep 3271
bigdata 3271 1 0 11:36 ? 00:00:14 /usr/java/jdk1.8.0_181/bin/java -Dproc_datanode -Xmx1000m -Djava.net.preferIPv4Stack=true -Dhadoop.log.dir=/home/bigdata/app/hadoop-2.6.0-cdh5.16.2/logs -Dhadoop.log.file=hadoop.log -Dhadoop.home.dir=/home/bigdata/app/hadoop-2.6.0-cdh5.16.2 -Dhadoop.id.str=bigdata -Dhadoop.root.logger=INFO,console -Djava.library.path=/home/bigdata/app/hadoop-2.6.0-cdh5.16.2/lib/native -Dhadoop.policy.file=hadoop-policy.xml -Djava.net.preferIPv4Stack=true -Djava.net.preferIPv4Stack=true -Djava.net.preferIPv4Stack=true -Dhadoop.log.dir=/home/bigdata/app/hadoop-2.6.0-cdh5.16.2/logs -Dhadoop.log.file=hadoop-bigdata-datanode-hadoop001.log -Dhadoop.home.dir=/home/bigdata/app/hadoop-2.6.0-cdh5.16.2 -Dhadoop.id.str=bigdata -Dhadoop.root.logger=INFO,RFA -Djava.library.path=/home/bigdata/app/hadoop-2.6.0-cdh5.16.2/lib/native -Dhadoop.policy.file=hadoop-policy.xml -Djava.net.preferIPv4Stack=true -server -Dhadoop.security.logger=ERROR,RFAS -Dhadoop.security.logger=ERROR,RFAS -Dhadoop.security.logger=ERROR,RFAS -Dhadoop.security.logger=INFO,RFAS org.apache.hadoop.hdfs.server.datanode.DataNode
bigdata 8693 3680 0 14:26 pts/1 00:00:00 grep --color=auto 3271
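
One way to recover (a hedged sketch, not part of the original transcript): put the PID file back so the stop script can find the still-running DataNode, then stop and start HDFS normally.

[bigdata@hadoop001 tmp]$ mv hadoop-bigdata-datanode.pid.bak hadoop-bigdata-datanode.pid   # the file still contains PID 3271
[bigdata@hadoop001 tmp]$ stop-dfs.sh    # the orphaned DataNode can now be stopped cleanly (NameNode/SNN were already stopped above)
[bigdata@hadoop001 tmp]$ start-dfs.sh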

Changing the PID file location

[bigdata@hadoop001 hadoop]$ vi hadoop-env.sh
export HADOOP_PID_DIR=/home/hadoop/tmp
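
A hedged note on applying this: the stop scripts look for PID files in HADOOP_PID_DIR, so stop the daemons before changing the setting (otherwise you hit the "no ... to stop" problem from the previous section), and the new directory must exist and be writable by the service user. In Hadoop 2.x the YARN daemons use YARN_PID_DIR (set in yarn-env.sh) rather than HADOOP_PID_DIR. A sketch:

[bigdata@hadoop001 hadoop]$ mkdir -p /home/hadoop/tmp    # new PID directory must exist and be writable
[bigdata@hadoop001 hadoop]$ stop-dfs.sh                  # stop while the old /tmp PID files are still valid
[bigdata@hadoop001 hadoop]$ vi hadoop-env.sh             # export HADOOP_PID_DIR=/home/hadoop/tmp
[bigdata@hadoop001 hadoop]$ start-dfs.sh                 # new *.pid files are written under /home/hadoop/tmp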

Blocks

dfs.blocksize 128M
Large files vs. small files

A tank holding 260 ml of water
Bottle size 128 ml ==> dfs.blocksize
260 / 128 = 2 remainder 4 ml
bottle 1: 128 ml
bottle 2: 128 ml
bottle 3: 4 ml

A large file of 260 MB:
block 1: 128 MB
block 2: 128 MB
block 3: 4 MB

Stored on pseudo-distributed HDFS this is 3 blocks; actual storage 260 MB * 1 = 260 MB.
On a cluster HDFS with >= 3 nodes and the 3-replica mechanism,
each block exists as 3 copies, counting the block itself,
so 3 * 3 = 9 block replicas; actual storage 260 MB * 3 = 780 MB.
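
To see how a file is actually laid out in blocks and replicas, hdfs fsck can be used (a sketch; the path here is the wordcount input directory from earlier, assuming the default /user/<user> home):

[bigdata@hadoop001 hadoop]$ hdfs fsck /user/bigdata/input -files -blocks -locations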

10 small files of 10 MB each, on pseudo-distributed HDFS:
10 blocks

The NameNode maintains, for every file, which blocks the file was split into and which machines those blocks are stored on.

10 metadata entries

The 10 small files merged into 1 file of 100 MB, on pseudo-distributed HDFS:
1 block

1 metadata entry

10 small files of 10 MB vs. 1 large file of 100 MB.
Result: the single large file puts less storage pressure on the NameNode.

Another example:
Suppose there are 100 million small files of 10 KB each; with a 3-replica cluster that is 300 million blocks and 300 million metadata entries.
If those 100 million small files are merged into 10 million 100 MB files, with a 3-replica cluster that is 30 million blocks and 30 million metadata entries.

Which is less pressure for the NameNode to maintain, 300 million metadata entries or 30 million?
30 million, obviously.

Metadata lives in the NameNode process's memory, and that memory is fixed, say 8 GB.

So in production:
avoid storing small files on HDFS as much as possible
a. Merge the data before it is transferred to HDFS.
b. If the data is already on HDFS, merge cold files on a schedule during off-peak business hours.
Write a script that merges one day at a time (see the sketch below), e.g.:
on 11-01, merge the data of 09-30
on 11-02, merge the data of 10-01

processing one day behind, one day at a time
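
A minimal daily-merge sketch, assuming a date-partitioned layout like /data/logs/<yyyy-MM-dd>; the paths, layout and script name are hypothetical, and streaming with hadoop fs -cat | -put only makes sense for plain-text files:

#!/bin/bash
# merge_cold_day.sh -- merge one day's small files into a single HDFS file (sketch)
DAY=$1                              # e.g. 2020-09-30
SRC=/data/logs/${DAY}               # hypothetical directory full of small files
DST=/data/logs_merged/${DAY}.txt    # hypothetical merged output file

hadoop fs -mkdir -p /data/logs_merged
# stream all of that day's small files into one HDFS file
hadoop fs -cat ${SRC}/* | hadoop fs -put - ${DST}
# remove the originals only after the merged file is confirmed to exist
hadoop fs -test -e ${DST} && hadoop fs -rm -r ${SRC}

Scheduled from cron during off-peak hours and pointed at the day that is due, this keeps each cold partition down to one file.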