Lustre

From UF HPC Wiki

Description

Monitoring

Debugging

MDT Problems

When mounting MDT filesystem, kernel crashes -- The Dilger Procedure

  • First, try mounting the filesystem with "-o abort_recov" as an option (the mount option is spelled abort_recov; abort_recovery is the name of the corresponding lctl command).
  • If this does not work, check whether you can mount the filesystem with "-t ldiskfs". If that works, you can try truncating the last_rcvd file:
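# Mount the MDT device as a plain ldiskfs filesystem
mount -t ldiskfs /dev/MDSDEV /mnt/mds
# Keep two backups of last_rcvd, one of them off the MDT
cp /mnt/mds/last_rcvd /mnt/mds/last_rcvd.sav
cp /mnt/mds/last_rcvd /tmp/last_rcvd.sav
# Overwrite last_rcvd with only the first 8 KB of the saved copy
dd if=/mnt/mds/last_rcvd.sav of=/mnt/mds/last_rcvd bs=8k count=1
umount /mnt/mds

# Remount as Lustre
mount -t lustre /dev/MDSDEV /mnt/mds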

Apparently this problem has been fixed in versions of Lustre newer than 1.6.5.

- Thanks to Andreas Dilger for this solution.

Once the procedure has been completed, the recovery process should begin. There appears to be a delay before recovery actually starts, so it may not happen immediately.

Full OST

One of the problems we have run into with the Lustre filesystem is that an OST can fill up, at which point some files simply fail to write. This is a real problem on a cluster where many machines are running automated jobs.

Unfortunately, Lustre does not yet have tools for easily moving files from one OST to another, so the only real solution is to deactivate an OST for new file creation when it gets close to full and wait for deletions to drain it to a reasonable level.

Let us say that we are getting out-of-space errors on a Lustre filesystem. We do a df on the client, and see that there is still PLENTY of space on the filesystem:

10.13.24.40@o2ib:/ufhpc   29966190744 22033440760 7932294688  74% /ufhpc/scratch

Just 74% utilization, plenty of space! And the inodes on that filesystem (df -i) are just fine as well:

10.13.24.40@o2ib:/ufhpc   61049728 18347646 42702082   31% /ufhpc/scratch

So what gives? Well... if you look at the individual OSTs for that filesystem, you will see a different story:

[root@submit ~]# lfs df
UUID                 1K-blocks      Used Available  Use% Mounted on
ufhpc-MDT0000_UUID   213655168  15047008 198608160    7% /ufhpc/scratch[MDT:0]
ufhpc-OST0004_UUID   1426961464 870523164 556438300   61% /ufhpc/scratch[OST:4]
ufhpc-OST0005_UUID   1426961464 887714308 539247156   62% /ufhpc/scratch[OST:5]
ufhpc-OST0006_UUID   1426961464 1226810216 200151248   85% /ufhpc/scratch[OST:6]
ufhpc-OST0007_UUID   1426961464 818768276 608193188   57% /ufhpc/scratch[OST:7]
ufhpc-OST0008_UUID   1426961464 1413423592  13537872   99% /ufhpc/scratch[OST:8]
ufhpc-OST0009_UUID   1426961464 989720988 437240476   69% /ufhpc/scratch[OST:9]
ufhpc-OST000a_UUID   1426961464 969032156 457929308   67% /ufhpc/scratch[OST:10]
ufhpc-OST000b_UUID   1426961464 1304376908 122584556   91% /ufhpc/scratch[OST:11]
ufhpc-OST000c_UUID   1426961464 976593032 450368432   68% /ufhpc/scratch[OST:12]
ufhpc-OST000d_UUID   1426961464 1038331812 388629652   72% /ufhpc/scratch[OST:13]
ufhpc-OST000e_UUID   1426961464 1361363496  65597968   95% /ufhpc/scratch[OST:14]
ufhpc-OST000f_UUID   1426961464 948013064 478948400   66% /ufhpc/scratch[OST:15]
ufhpc-OST0010_UUID   1426961464 928503504 498457960   65% /ufhpc/scratch[OST:16]
ufhpc-OST0011_UUID   1426961464 895868424 531093040   62% /ufhpc/scratch[OST:17]
ufhpc-OST0012_UUID   1426961464 834059576 592901888   58% /ufhpc/scratch[OST:18]
ufhpc-OST0013_UUID   1426961464 862286124 564675340   60% /ufhpc/scratch[OST:19]
ufhpc-OST0014_UUID   1426961464 999123524 427837940   70% /ufhpc/scratch[OST:20]
ufhpc-OST0015_UUID   1426961464 798103228 628858236   55% /ufhpc/scratch[OST:21]
ufhpc-OST0016_UUID   1426961464 889373700 537587764   62% /ufhpc/scratch[OST:22]
ufhpc-OST0017_UUID   1426961464 979535156 447426308   68% /ufhpc/scratch[OST:23]
ufhpc-OST0018_UUID   1426961464 935184472 491776992   65% /ufhpc/scratch[OST:24]

Uh-oh! Look at that! OST:8 is at 99%! (This listing was taken from a filesystem that had not yet run into the problem, but was darned close.) That is why one file fails to write while others succeed: the failing file has objects on that OST, and the OST is running out of space. So what do we do? The answer is to deactivate the OST for a while until its utilization gets back in line with the others.
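
Picking the full OST out of a long listing by eye is easy to get wrong. The following is a minimal sketch of a filter for the lfs df output shown above. The function name full_osts and the default threshold of 90% are our own choices for illustration, not part of Lustre.

```shell
# Print "percent UUID" for every OST at or above a usage threshold,
# fullest first. Reads `lfs df` output on stdin.
# full_osts and the default of 90 are hypothetical, chosen for this example.
full_osts() {
    threshold="${1:-90}"
    awk -v t="$threshold" '
        # Only OST rows match; the header and the MDT row are skipped.
        $1 ~ /-OST[0-9a-fA-F]+_UUID$/ {
            pct = $5
            sub(/%/, "", pct)          # strip the trailing percent sign
            if (pct + 0 >= t) print pct, $1
        }' | sort -rn
}

# On a client this would be run as:
#   lfs df /ufhpc/scratch | full_osts 90
```

Run against the listing above with a threshold of 90, this would report OST0008, OST000e and OST000b, in that order.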

What we would like to do in this case is to disable the OST for new-file creation, but allow reads, writing to current files, and deletions. Particularly deletions. We can do this with the following procedure:

  • Log in to the MDS of the filesystem.
  • Go into the lctl facility:
[root@mds ~]# lctl
lctl > 
  • Get a listing of the current devices. We need this list because the lctl device number does not necessarily correspond one-to-one to the OST number.
lctl > dl
  0 UP mgs MGS MGS 849
  1 UP mgc MGC10.13.24.40@o2ib 
  2 UP mdt MDS MDS_uuid 3
  3 UP lov ufhpc-mdtlov ufhpc-mdtlov_UUID 4
  4 UP mds ufhpc-MDT0000 ufhpc-MDT0000_UUID 835
  5 UP osc ufhpc-OST0004-osc ufhpc-mdtlov_UUID 5
  6 UP osc ufhpc-OST0005-osc ufhpc-mdtlov_UUID 5
  7 UP osc ufhpc-OST0006-osc ufhpc-mdtlov_UUID 5
  8 UP osc ufhpc-OST0007-osc ufhpc-mdtlov_UUID 5
  9 UP osc ufhpc-OST0008-osc ufhpc-mdtlov_UUID 5
 10 UP osc ufhpc-OST0009-osc ufhpc-mdtlov_UUID 5
 11 UP osc ufhpc-OST000a-osc ufhpc-mdtlov_UUID 5
 12 UP osc ufhpc-OST000b-osc ufhpc-mdtlov_UUID 5
 13 UP osc ufhpc-OST000c-osc ufhpc-mdtlov_UUID 5
 14 UP osc ufhpc-OST000d-osc ufhpc-mdtlov_UUID 5
 15 UP osc ufhpc-OST000e-osc ufhpc-mdtlov_UUID 5
 16 UP osc ufhpc-OST000f-osc ufhpc-mdtlov_UUID 5
 17 UP osc ufhpc-OST0010-osc ufhpc-mdtlov_UUID 5
 18 UP osc ufhpc-OST0011-osc ufhpc-mdtlov_UUID 5
 19 UP osc ufhpc-OST0012-osc ufhpc-mdtlov_UUID 5
 20 UP osc ufhpc-OST0013-osc ufhpc-mdtlov_UUID 5
 21 UP osc ufhpc-OST0014-osc ufhpc-mdtlov_UUID 5
 22 UP osc ufhpc-OST0015-osc ufhpc-mdtlov_UUID 5
 23 UP osc ufhpc-OST0016-osc ufhpc-mdtlov_UUID 5
 24 UP osc ufhpc-OST0017-osc ufhpc-mdtlov_UUID 5
 25 UP osc ufhpc-OST0018-osc ufhpc-mdtlov_UUID 5
  • Now that we have the ID for OST 8 (it's 9 in this case) we can disable it. It actually takes two commands. First you have to designate which device you are going to work on, then you have to deactivate it.
lctl > device 9
lctl > deactivate
  • Now we want to check that the OST is in fact deactivated. Log out of lctl, then take a look at the active file in /proc for that particular OST:
lctl > exit
[root@hpcmds ~]# cat /proc/fs/lustre/osc/ufhpc-OST0008-osc/active 
0
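
The same steps can be done without the interactive prompt, which avoids reading the device number off the dl listing by hand. This is a sketch only: osc_devno is a helper name we made up, and it assumes the dl output has the column layout shown above. The lctl --device form is the non-interactive equivalent of the device and deactivate commands.

```shell
# Print the lctl device number of the osc device for a given OST.
# Reads `lctl dl` output on stdin. osc_devno is a hypothetical helper.
osc_devno() {
    # Columns: devno, state, type, name, uuid, refcount
    awk -v name="$1-osc" '$3 == "osc" && $4 == name { print $1 }'
}

# On the MDS this would be used as follows (not executed here):
#   dev=$(lctl dl | osc_devno ufhpc-OST0008)
#   lctl --device "$dev" deactivate
#   cat /proc/fs/lustre/osc/ufhpc-OST0008-osc/active    # expect 0
```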

At this point, we are done with the deactivation portion of things. Next we would take periodic looks at the fullness of the OST, waiting until it comes down to a reasonable level. Once it does, we basically reverse this process by logging into the MDS node again, going into lctl, and invoking the "activate" command instead of the "deactivate" command.
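
The periodic check lends itself to a small script. The sketch below tests whether one OST has dropped below a chosen usage level. The name ost_below and the 75% figure are assumptions made for this example, and what counts as a "reasonable level" is a local judgment call.

```shell
# Succeed (exit 0) if the named OST is below the given Use% in `lfs df`
# output read from stdin; fail if it is at or above it, or not listed.
# ost_below is a hypothetical helper.
ost_below() {
    awk -v uuid="${1}_UUID" -v t="$2" '
        $1 == uuid {
            pct = $5
            sub(/%/, "", pct)
            found = 1
            ok = (pct + 0 < t)
        }
        END { exit !(found && ok) }'
}

# On a client with lctl access to the MDS, or split across the two hosts:
#   if lfs df /ufhpc/scratch | ost_below ufhpc-OST0008 75; then
#       lctl --device 9 activate     # run on the MDS; 9 is from the dl listing
#   fi
```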

OST File usage stats

There are not yet any good tools for per-OST file usage statistics (something like du for a single OST). You can, however, get a list of the largest files on an OST somewhat manually by running a couple of scripts: one to list all files in the filesystem with their stripe information, and another to get the file sizes for those that match the OST.
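
As a rough sketch of the second half of that approach, the function below takes a list of file paths and prints the largest ones. The name largest_files is made up for this example, and it assumes GNU stat and file names without embedded newlines. The list of paths could come from lfs find with its --obd option, if your Lustre version provides it.

```shell
# Read file paths (one per line) on stdin and print "bytes path" for the
# N largest regular files, biggest first. largest_files is hypothetical.
largest_files() {
    n="${1:-20}"
    while IFS= read -r f; do
        # Skip anything that has vanished or is not a regular file.
        if [ -f "$f" ]; then
            stat -c '%s %n' "$f"
        fi
    done | sort -rn | head -n "$n"
}

# Possible use on a client (not executed here):
#   lfs find --obd ufhpc-OST0008_UUID /ufhpc/scratch | largest_files 20
```

Note that the sizes reported are whole-file sizes. For a file striped across several OSTs, only part of that space is on the OST in question.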
