Finding out the disk usage difference in % across Datanodes
# su - hdfs
$ hdfs dfsadmin -report > dfsadmin.out
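The report prints a "DFS Used%" line for the cluster as a whole and for each Datanode, so pulling those lines out shows at a glance how far apart the nodes are:
$ grep 'DFS Used%' dfsadmin.out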
Setting the balancer bandwidth to 50 MB/s (1024 x 1024 x 50 = 52428800 bytes), down from 100 MB/s (104857600 bytes):
$ hdfs dfsadmin -setBalancerBandwidth 52428800
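On recent Hadoop releases you can verify that the new value reached a given Datanode (the hostname and IPC port below are placeholders):
$ hdfs dfsadmin -getBalancerBandwidth <datanode_host>:<ipc_port>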
# ethtool em1
Settings for em1:
As a rule of thumb, the balancer should get about 10% of the network bandwidth, which here works out to roughly 100 MB/s; above we set half of that, 50 MB/s.
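As a quick sanity check, assuming the interface reports 10 Gb/s (the Speed line of the ethtool output, not shown above):
$ echo $(( 10000 / 8 / 10 ))    # 10000 Mb/s, 8 bits per byte, 10% share
125
That is in the same ballpark as the ~100 MB/s figure above.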
The balancer ran but wrapped up quickly, because it found that all nodes in the cluster have a usage within the threshold value. (The threshold is the maximum allowed difference, in percentage points, between each Datanode’s utilization and the cluster’s average utilization; the default is 10.)
The cluster is already balanced!
In our case, for any balancing to occur, you must specify a threshold value that’s <= 2.
Here’s one way to run:
$ nohup hdfs balancer -threshold 2 > balance.log 2>&1 &
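Because nohup detaches the run from the terminal, follow its progress in the log:
$ tail -f balance.log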
When to Run the Balancer
A couple of guidelines apply to when you should run the balancer. In a large
cluster, run the balancer regularly. You can schedule a cron job to perform the
balancing instead of running it manually; a sample entry is shown below. If a
scheduled balancer job is still running when the next one is due to start, no
harm is done: the second balancer instance simply won’t start.
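For example, a weekly entry in the hdfs user's crontab could look like this (the schedule, threshold, and log path are illustrative and should be adapted):
# Run the balancer every Sunday at 01:00
0 1 * * 0 hdfs balancer -threshold 5 >> /var/log/hdfs-balancer.log 2>&1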
How to increase HDFS Balancer network bandwidth for faster movement
On all the Datanodes and on the client, we ran the command below:
$ hdfs balancer -Dfs.defaultFS=hdfs://<NN_HOSTNAME>:8020 \
    -Ddfs.balancer.movedWinWidth=5400000 \
    -Ddfs.balancer.moverThreads=1000 \
    -Ddfs.balancer.dispatcherThreads=200 \
    -Ddfs.datanode.balance.max.concurrent.moves=5 \
    -Ddfs.datanode.balance.bandwidthPerSec=100000000 \
    -Ddfs.balancer.max-size-to-move=10737418240 \
    -threshold 5
This balances your HDFS data across Datanodes faster; run it when the cluster is not heavily used.
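Once the fast run finishes, you may want to drop the per-Datanode balancer bandwidth back down (here to the 50 MB/s set earlier); this takes effect without a restart:
$ hdfs dfsadmin -setBalancerBandwidth 52428800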
dfs.datanode.balance.max.concurrent.moves, default is 5
This configuration limits the maximum number of concurrent block moves that a
Datanode is allowed to make for balancing the cluster. If it is set on a
Datanode, the Datanode will throw an exception when the limit is exceeded. If it
is set on the Balancer, the Balancer will schedule concurrent block movements
within the specified limit.
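For instance, to let the Balancer schedule more concurrent moves per Datanode for a run (an illustrative value; the Datanode-side limit must be at least as large, or the Datanodes will reject the extra moves):
$ hdfs balancer -Ddfs.datanode.balance.max.concurrent.moves=10 -threshold 5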
dfs.datanode.balance.bandwidthPerSec, default is 1048576 (=1MB/s)
This configuration limits the bandwidth each Datanode may use for balancing.
dfs.balancer.moverThreads, default is 1000
This is the number of threads in the Balancer for moving blocks. Each block move
requires a thread, so this configuration limits the total number of concurrent
moves for balancing across the entire cluster.
New Configurations
Allow Balancer to run faster:
dfs.balancer.max-size-to-move, default is 10737418240 (=10GB)
In each iteration, the Balancer chooses datanodes in pairs and then moves data
between the datanode pairs. This configuration limits the maximum size of data
that the Balancer will move between a chosen datanode pair. When the network and
disks are not saturated, increasing this value increases the data transferred
between a datanode pair in each iteration while the duration of an iteration
remains about the same.
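When there is headroom, you might raise this for a run, e.g. doubling it to 20 GB (an illustrative value):
$ hdfs balancer -Ddfs.balancer.max-size-to-move=21474836480 -threshold 5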
Do not use small blocks for balancing the cluster:
dfs.balancer.getBlocks.size, default is 2147483648 (=2GB)
dfs.balancer.getBlocks.min-block-size, default is 10485760 (=10MB)
After the Balancer has decided to move a certain amount of data between two
datanodes (a source and a destination), it repeatedly invokes the getBlocks(..)
RPC on the Namenode to get lists of blocks from the source datanode until the
required amount of data is scheduled. dfs.balancer.getBlocks.size is the total
data size of the block list returned by a single getBlocks(..) RPC.
dfs.balancer.getBlocks.min-block-size is the minimum size a block must have to
be used for balancing the cluster.
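For example, to ignore blocks smaller than 100 MB during balancing (an illustrative value):
$ hdfs balancer -Ddfs.balancer.getBlocks.min-block-size=104857600 -threshold 5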
dfs.datanode.block-pinning.enabled, default is false
When creating a file, a user application may specify a list of favored datanodes
via the file creation API in DistributedFileSystem. The Namenode makes a best
effort to allocate blocks to those favored datanodes. When
dfs.datanode.block-pinning.enabled is set to true, a block replica written to a
favored datanode is “pinned” to that datanode. Pinned replicas are not moved
during cluster balancing, so that they stay on the specified favored datanodes.
This feature is useful for block-distribution-aware applications such as HBase.
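To check whether pinning is in effect on a node, query the live configuration (the output shown assumes the default):
$ hdfs getconf -confKey dfs.datanode.block-pinning.enabled
false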