Lustre
From UF HPC Wiki
Description
Monitoring
- Collectl_docs - collectl general howto
- Lustre_proc - lustre internals for lustre /proc info
- Lustre OST Watch - monitoring script for Lustre OSTs.
Debugging
MDT Problems
When mounting MDT filesystem, kernel crashes -- The Dilger Procedure
- First, try mounting the filesystem with "-o abort_recovery" as an option.
- If this does not work, check whether you can mount the filesystem with "-t ldiskfs". If that works, you can try truncating the last_rcvd file:
# Mount the MDS device as plain ldiskfs
mount -t ldiskfs /dev/MDSDEV /mnt/mds
# Save two copies of last_rcvd before touching it
cp /mnt/mds/last_rcvd /mnt/mds/last_rcvd.sav
cp /mnt/mds/last_rcvd /tmp/last_rcvd.sav
# Rewrite last_rcvd from the saved copy, keeping only the first 8k
dd if=/mnt/mds/last_rcvd.sav of=/mnt/mds/last_rcvd bs=8k count=1
umount /mnt/mds
# Remount as Lustre
mount -t lustre /dev/MDSDEV /mnt/mds
Apparently this problem has been fixed in newer versions of Lustre (>1.6.5).
- Thanks to Andreas Dilger for this solution.
Once the procedure is complete, recovery should begin. There appears to be a delay before it actually starts, so allow it some time.
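One way to see whether recovery has started is to read the MDS recovery status file. The sketch below assumes the 1.6-era /proc layout (/proc/fs/lustre/mds/*/recovery_status) and a "status:" line inside that file. Neither is stated above, so check both on your own MDS. The helper name recovery_state is made up for illustration.

```shell
#!/bin/sh
# Print the value of the first "status:" line read on stdin.
# recovery_state is a hypothetical helper; the input format is an
# assumption based on the 1.6-era recovery_status file.
recovery_state() {
    awk '$1 == "status:" { print $2; exit }'
}

# Example (on the MDS; the path is an assumption):
#   cat /proc/fs/lustre/mds/*/recovery_status | recovery_state
```

Running the example every minute or so should show when the delay is over and recovery is actually in progress.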
Full OST
One of the problems we have run into with the Lustre filesystem is that an OST can fill up, at which point some files simply fail to write. This is a real problem on a cluster where many machines are running automated jobs.
Unfortunately, Lustre does not yet have the tools to move files from one OST to another easily, so the only real solution is to deactivate an OST when it gets fairly full and wait for it to drain to some reasonable level.
Let us say that we are getting out-of-space errors on a Lustre filesystem. We run df on the client and see that there is still PLENTY of space on the filesystem:
10.13.24.40@o2ib:/ufhpc 29966190744 22033440760 7932294688 74% /ufhpc/scratch
Just 74% utilization, plenty of space! And the inodes on that filesystem are just fine as well:
10.13.24.40@o2ib:/ufhpc 61049728 18347646 42702082 31% /ufhpc/scratch
So what gives? Well... if you look at the individual OSTs for that filesystem, you will see a different story:
[root@submit ~]# lfs df
UUID 1K-blocks Used Available Use% Mounted on
ufhpc-MDT0000_UUID 213655168 15047008 198608160 7% /ufhpc/scratch[MDT:0]
ufhpc-OST0004_UUID 1426961464 870523164 556438300 61% /ufhpc/scratch[OST:4]
ufhpc-OST0005_UUID 1426961464 887714308 539247156 62% /ufhpc/scratch[OST:5]
ufhpc-OST0006_UUID 1426961464 1226810216 200151248 85% /ufhpc/scratch[OST:6]
ufhpc-OST0007_UUID 1426961464 818768276 608193188 57% /ufhpc/scratch[OST:7]
ufhpc-OST0008_UUID 1426961464 1413423592 13537872 99% /ufhpc/scratch[OST:8]
ufhpc-OST0009_UUID 1426961464 989720988 437240476 69% /ufhpc/scratch[OST:9]
ufhpc-OST000a_UUID 1426961464 969032156 457929308 67% /ufhpc/scratch[OST:10]
ufhpc-OST000b_UUID 1426961464 1304376908 122584556 91% /ufhpc/scratch[OST:11]
ufhpc-OST000c_UUID 1426961464 976593032 450368432 68% /ufhpc/scratch[OST:12]
ufhpc-OST000d_UUID 1426961464 1038331812 388629652 72% /ufhpc/scratch[OST:13]
ufhpc-OST000e_UUID 1426961464 1361363496 65597968 95% /ufhpc/scratch[OST:14]
ufhpc-OST000f_UUID 1426961464 948013064 478948400 66% /ufhpc/scratch[OST:15]
ufhpc-OST0010_UUID 1426961464 928503504 498457960 65% /ufhpc/scratch[OST:16]
ufhpc-OST0011_UUID 1426961464 895868424 531093040 62% /ufhpc/scratch[OST:17]
ufhpc-OST0012_UUID 1426961464 834059576 592901888 58% /ufhpc/scratch[OST:18]
ufhpc-OST0013_UUID 1426961464 862286124 564675340 60% /ufhpc/scratch[OST:19]
ufhpc-OST0014_UUID 1426961464 999123524 427837940 70% /ufhpc/scratch[OST:20]
ufhpc-OST0015_UUID 1426961464 798103228 628858236 55% /ufhpc/scratch[OST:21]
ufhpc-OST0016_UUID 1426961464 889373700 537587764 62% /ufhpc/scratch[OST:22]
ufhpc-OST0017_UUID 1426961464 979535156 447426308 68% /ufhpc/scratch[OST:23]
ufhpc-OST0018_UUID 1426961464 935184472 491776992 65% /ufhpc/scratch[OST:24]
Uh-oh! Look at that! OST:8 is at 99%! (This listing was taken from a filesystem that had not yet run into the problem, but was darned close.) That is why one file fails to write while other files succeed: the failing file is being written to that OST, which is running out of space. So what do we do? The answer is simply to turn the OST off for a while until its utilization gets back in line with the others.
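Spotting the overfull OST by eye gets tedious, so the check can be scripted. The following is a minimal sketch that filters lfs df output for OSTs at or above a usage threshold. The function name full_osts and the default threshold of 90 are illustrative choices, and the column layout is assumed to match the listing above.

```shell
#!/bin/sh
# Print the UUID and Use% of every OST at or above a usage threshold.
# Reads `lfs df` output on stdin. full_osts and the default of 90 are
# illustrative, not part of Lustre.
full_osts() {
    threshold="${1:-90}"
    awk -v limit="$threshold" '
        # OST rows end in "[OST:n]"; Use% is the fifth column.
        /\[OST:[0-9]+\]/ {
            pct = $5
            sub(/%/, "", pct)
            if (pct + 0 >= limit) print $1, $5
        }'
}

# Example (on a Lustre client):
#   lfs df /ufhpc/scratch | full_osts 90
```

Fed the listing above with a threshold of 90, this reports ufhpc-OST0008_UUID (99%), ufhpc-OST000b_UUID (91%) and ufhpc-OST000e_UUID (95%).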
What we would like to do in this case is to disable the OST for new-file creation, but allow reads, writing to current files, and deletions. Particularly deletions. We can do this with the following procedure:
- Log in to the MDS of the filesystem.
- Go into the lctl facility:
[root@mds ~]# lctl
lctl >
- Get a listing of the current OSTs. We need this list because the device numbering does not necessarily correspond to the OST number one-to-one.
lctl > dl
0 UP mgs MGS MGS 849
1 UP mgc MGC10.13.24.40@o2ib
2 UP mdt MDS MDS_uuid 3
3 UP lov ufhpc-mdtlov ufhpc-mdtlov_UUID 4
4 UP mds ufhpc-MDT0000 ufhpc-MDT0000_UUID 835
5 UP osc ufhpc-OST0004-osc ufhpc-mdtlov_UUID 5
6 UP osc ufhpc-OST0005-osc ufhpc-mdtlov_UUID 5
7 UP osc ufhpc-OST0006-osc ufhpc-mdtlov_UUID 5
8 UP osc ufhpc-OST0007-osc ufhpc-mdtlov_UUID 5
9 UP osc ufhpc-OST0008-osc ufhpc-mdtlov_UUID 5
10 UP osc ufhpc-OST0009-osc ufhpc-mdtlov_UUID 5
11 UP osc ufhpc-OST000a-osc ufhpc-mdtlov_UUID 5
12 UP osc ufhpc-OST000b-osc ufhpc-mdtlov_UUID 5
13 UP osc ufhpc-OST000c-osc ufhpc-mdtlov_UUID 5
14 UP osc ufhpc-OST000d-osc ufhpc-mdtlov_UUID 5
15 UP osc ufhpc-OST000e-osc ufhpc-mdtlov_UUID 5
16 UP osc ufhpc-OST000f-osc ufhpc-mdtlov_UUID 5
17 UP osc ufhpc-OST0010-osc ufhpc-mdtlov_UUID 5
18 UP osc ufhpc-OST0011-osc ufhpc-mdtlov_UUID 5
19 UP osc ufhpc-OST0012-osc ufhpc-mdtlov_UUID 5
20 UP osc ufhpc-OST0013-osc ufhpc-mdtlov_UUID 5
21 UP osc ufhpc-OST0014-osc ufhpc-mdtlov_UUID 5
22 UP osc ufhpc-OST0015-osc ufhpc-mdtlov_UUID 5
23 UP osc ufhpc-OST0016-osc ufhpc-mdtlov_UUID 5
24 UP osc ufhpc-OST0017-osc ufhpc-mdtlov_UUID 5
25 UP osc ufhpc-OST0018-osc ufhpc-mdtlov_UUID 5
- Now that we have the device number for OST 8 (it's 9 in this case), we can disable it. This takes two commands: first select the device you are going to work on, then deactivate it.
lctl > device 9
lctl > deactivate
- Now we want to check that the OST is in fact deactivated. Log out of lctl, then take a look at the active file in /proc for that particular OST:
lctl > exit
[root@hpcmds ~]# cat /proc/fs/lustre/osc/ufhpc-OST0008-osc/active
0
At this point, we are done with the deactivation. Next we check the fullness of the OST periodically, waiting until it comes down to a reasonable level. Once it does, we reverse the process: log in to the MDS node again, go into lctl, select the same device, and invoke the "activate" command instead of the "deactivate" command.
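The device lookup in the procedure above can be sketched as a small helper, so the device number does not have to be read off the listing by hand. The function name osc_devno is made up for illustration. The helper assumes the dl column layout shown above (device number, state, type, name). The example comments use the non-interactive lctl --device form in place of the interactive session; verify that it is available in your Lustre version before relying on it.

```shell
#!/bin/sh
# Print the lctl device number of the osc device with the given name.
# Reads `lctl dl` output on stdin. osc_devno is a hypothetical helper.
osc_devno() {
    awk -v name="$1" '$3 == "osc" && $4 == name { print $1 }'
}

# Example (on the MDS):
#   dev=$(lctl dl | osc_devno ufhpc-OST0008-osc)
#   lctl --device "$dev" deactivate   # stop new file creation on the OST
#   lctl --device "$dev" activate     # later, once the OST has drained
```

Looking the number up each time matters because, as noted above, device numbers do not map one-to-one onto OST numbers.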
OST File usage stats
There are not yet any good tools for file usage statistics (something like du for a single OST), but you can get a list of the largest files on an OST, somewhat manually, by running two scripts: one to list all files and their stripe info in the filesystem, and another to get the file sizes for those on the matching OST.
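The scripts themselves are not shown here, so the following is only a rough sketch of the second half of the idea: given a list of file paths on stdin, print the largest ones. The function name largest_files and the default count of 20 are illustrative. The example feeds it from lfs find --obd, which is an assumption about how the per-OST file list might be produced, not the method the original scripts use.

```shell
#!/bin/sh
# Print the N largest regular files from a list of paths on stdin,
# as "size-in-bytes path", largest first.
# largest_files and the default of 20 are illustrative choices.
largest_files() {
    count="${1:-20}"
    while IFS= read -r path; do
        # Skip paths that have vanished or are not regular files.
        [ -f "$path" ] || continue
        # The fifth column of `ls -ln` is the size in bytes.
        size=$(ls -ln -- "$path" | awk '{ print $5 }')
        printf '%s %s\n' "$size" "$path"
    done | sort -rn | head -n "$count"
}

# Example (on a client; the lfs find usage is an assumption):
#   lfs find --obd ufhpc-OST0008_UUID /ufhpc/scratch | largest_files 20
```

Note that the sizes are whole-file sizes. For a file striped across several OSTs, only part of that size lives on the OST in question.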
