[vdr] mdadm software raid5 arrays?

Simon Baxter linuxtv at nzbaxters.com
Tue Nov 10 09:46:52 CET 2009


What about a simple raid 1 mirror set?

----- Original Message ----- 
From: "H. Langos" <henrik-vdr at prak.org>
To: "VDR Mailing List" <vdr at linuxtv.org>
Sent: Tuesday, November 10, 2009 6:49 AM
Subject: Re: [vdr] mdadm software raid5 arrays?


> Hi Simon,
>
> On Sat, Nov 07, 2009 at 07:38:03AM +1300, Simon Baxter wrote:
>> Hi
>>
>> I've been running logical volume management (LVMs) on my production VDR
>> box for years, but recently had a drive failure.  To be honest, in the
>> ~20 years I've had PCs in the house, this is the first time a drive
>> failed!
>>
>> Anyway, I've bought 3x 1.5 TB SATA disks which I'd like to put into a
>> software (mdadm) raid 5 array.
>>
> ...
>>
>> I regularly record 3 and sometimes 4 channels simultaneously, while
>> watching a recording.  Under regular LVM, this sometimes seemed to cause
>> some "slow downs".
>
> I know I risk a flame war here but I feel obliged to say it:
> Avoid raid5 if you can! It is fun to play with, but if you care about
> your data, buy a fourth drive and do raid1+0 (mirroring and striping)
> instead.
>
> Raid 5 is very fast on linear read operations because the load is
> spread across all the available drives.
> But if you are going to run vdr on that array, you are going to do a
> lot of write operations, and raid5 handles heavy writing badly for a
> very simple reason.
>
> Take a raid5 array with X devices. If you want to write just one
> block, you need to read 2 blocks (the old data that you are
> going to overwrite and the old parity) and you need to write 2
> blocks (one with the actual data and one with the new parity).
>
> In the best case, the disk block that you are going to overwrite is
> already in ram, but the parity block almost never will be. Only if you
> keep writing the same block over and over will you have both the data
> and the parity block cached. In most cases (and certainly when writing
> data streams to disk) you'll need to read two blocks before you can
> calculate the new parity and write it back to the disks along with
> your data.
>
> So in short you do two reads and two writes for every write operation.
> There goes your performance...
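>
> To make that concrete, here is a tiny Python sketch of the incremental
> parity update (the XOR identity is standard raid5 math; the drive
> layout and block values are just toy numbers):
>
>     # Why raid5 needs the two reads before it can write:
>     #   new_parity = old_parity XOR old_data XOR new_data
>     # so the old data block and the old parity block have to come
>     # off the disks first (unless they happen to be cached).
>     def raid5_update_parity(old_data, old_parity, new_data):
>         return old_parity ^ old_data ^ new_data
>
>     # toy stripe with one-byte "blocks": data on drives A and B,
>     # parity on drive C
>     a, b = 0x3C, 0x55
>     parity = a ^ b
>     new_a = 0xF0                   # overwrite the block on drive A
>     parity = raid5_update_parity(a, parity, new_a)
>     assert parity == new_a ^ b     # the stripe is still consistent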
>
> Now about drive failures... if one of the X disks fails, you can still
> read blocks on the surviving drives with a single read operation, but
> you need X-1 read operations for every read that targets the failed
> drive. Writes to surviving drives still cost the same two reads/two
> writes as before (only if the failed drive held the parity for that
> block can you skip the two extra reads and one of the writes).
> If, however, you need to write to the failed drive, you first have to
> read the other X-1 drives to reconstruct the missing data, and only
> then can you calculate and write the new parity. (And then you throw
> away the actual data you were going to write, because the drive you
> would have written it to is gone...)
>
> Example: You have your three 1.5TB drives A, B, C in an array
> and C fails. In this situation you'd want to treat your drives as
> carefully as possible, because one more failure and all your data
> is gone. Unfortunately, continued operation in the failed state will
> put your remaining drives under much more stress than usual.
>
> Reading will cause twice the read operations on your remaining
> drives (each letter below names a drive that has to service the
> request for that block):
>
> block     : n   n+1 n+2
> OK state  : a   b   c
> Fail state: a   b   ab
>
> Writing (on a small array) will produce the same average load per
> remaining drive as before (lowercase = block read, uppercase = block
> written).
>
> block: n     n+1    n+2
> OK:    acAC  baBA   cbCB
> FAIL:  A     baBA   baB
>
>
> Perhaps surprisingly, this does not get better with more than three
> drives in the array: in the failed state, reads still produce on
> average double the load on each remaining drive, whatever the size.
>
> Writes on a failed array seem to produce the same per-drive load as on
> a healthy array. But this is only true for very small arrays. If you
> add more disks you'll see that the "read penalty" grows for writes to
> blocks whose data disk is missing, because you need to read all the
> other drives in order to update the parity.
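>
> Here is a rough Python model of both effects (the doubled per-drive
> read load and the write penalty that grows with the drive count). It
> is purely illustrative and assumes uniform access, no caching, and the
> reconstruct-then-update write strategy described above:
>
>     # Degraded raid5 with x drives, one of them failed. All figures
>     # are averages per logical block access, assuming data and parity
>     # are spread evenly over the drives.
>
>     def degraded_read_ops(x):
>         # block on a surviving drive: 1 read; block on the failed
>         # drive: read the other x-1 drives and reconstruct it
>         return (x - 1.0) / x * 1 + 1.0 / x * (x - 1)
>
>     def per_drive_read_factor(x):
>         healthy = 1.0 / x              # reads per drive, healthy array
>         failed = degraded_read_ops(x) / (x - 1)   # shared by survivors
>         return failed / healthy        # comes out as 2.0 for every x
>
>     def degraded_write_ops(x):
>         # average (reads, writes) per logical block write
>         p_data_dead = p_parity_dead = 1.0 / x
>         p_both_alive = (x - 2.0) / x
>         reads = p_both_alive * 2 + p_data_dead * (x - 1)
>         writes = p_both_alive * 2 + p_parity_dead * 1 + p_data_dead * 1
>         return reads, writes
>
>     for x in (3, 4, 6, 8):
>         r, w = degraded_write_ops(x)
>         print("x=%d: read load per surviving drive x%.1f, "
>               "a write costs %.2f reads + %.2f writes"
>               % (x, per_drive_read_factor(x), r, w))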
>
>
> Reconstruction of your array after adding a new drive will take a long
> time, and most complete array failures (i.e. data lost forever) occur
> during the rebuilding phase, not in the degraded state. That's simply
> because you put a lot of stress on your drives (which probably come
> from the same batch as the one that already failed).
>
> Depending on the number and nature of your drives and the host
> connection they have, the limiting factor can be read performance (you
> need to read X-1 drives completely) or it can be write performance, if
> your disk is slower on sustained writing than on reading.
>
> Remember that you need to read and write a whole disk's worth
> of data, not just the used parts.
>
> Example: Your drives hold 1.5TB each and we assume a whopping 100 MB/s
> on read as well as on write (pretty much the fastest there currently
> is).
>
> You need to read 3TB as well as write 1.5TB. If your system can handle
> the load in parallel you can treat it as just writing one 1.5TB drive:
> 1,500,000 MB / 100 MB/s = 15,000 s, i.e. 250 minutes or about 4 hours
> and 10 minutes. I am curious whether you can still use the system
> under such an I/O load. Anybody with experience on this? Anyway, the
> reconstruction rate can be tuned via the proc fs.
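>
> (On Linux software raid the relevant knobs are, as far as I remember,
> /proc/sys/dev/raid/speed_limit_min and speed_limit_max.) The
> back-of-the-envelope arithmetic as a quick Python sketch, using the
> assumed figures from above:
>
>     # raid5 rebuild after replacing the failed drive, assuming 1.5 TB
>     # per drive and a sustained 100 MB/s for both reads and writes.
>     drive_mb = 1500000        # ~1.5 TB per drive
>     rate_mb_s = 100           # assumed sustained transfer rate
>
>     read_mb = 2 * drive_mb    # read both surviving drives completely
>     write_mb = drive_mb       # write the whole replacement drive
>     # if reads and writes overlap, the write side is the bottleneck
>     minutes = drive_mb / rate_mb_s / 60.0
>     print("read %d MB, write %d MB, roughly %d minutes"
>           % (read_mb, write_mb, minutes))      # ~250 minutes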
>
>
> Now for the raid 1+0 alternative: for the same resulting storage
> capacity you'll need 4 drives instead of 3.
>
> In the healthy state one read command results in one read operation,
> but that operation can be completed by either drive of the mirror set.
> So seek performance will be much better, as the io-scheduler will
> select the drive that is currently not busy and/or whose head is
> closer to the requested block. And since you do mirroring and striping
> you can use all four drives' performance for linear reading. You end
> up with 33% more read performance than with the raid5 setup (but hey,
> you paid 33% more as well :-) )
>
> Writing one block requires two write operations instead of two reads
> and two writes. And since you don't need to read the old data before
> writing the new stuff, you don't have to wait for the heads to move,
> the platters to rotate to the right place, and the read to bring the
> data from disk into ram first. You can simply write to the disk and
> let the disk's controller handle the rest. In other words: your write
> performance will be much better than with raid5.
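>
> A crude way to put numbers on that difference; the per-drive IOPS
> budget is an assumption, and this ignores write caching and
> full-stripe writes:
>
>     # Rough ceiling on random block writes per second.
>     iops_per_drive = 80       # assumed small random ops per drive
>
>     # raid5, 3 drives: each logical write = 2 reads + 2 writes = 4 ops
>     raid5_ceiling = 3 * iops_per_drive / 4.0
>
>     # raid1+0, 4 drives: each logical write = 2 writes = 2 ops
>     raid10_ceiling = 4 * iops_per_drive / 2.0
>
>     print(raid5_ceiling, raid10_ceiling)   # 60.0 vs 160.0 writes/s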
>
>
> In the failed state (let's assume drive C of the mirror sets A=B and
> C=D fails), reading performance drops because one spindle is gone, and
> the mirror partner of C has to handle that set's read load by itself:
>
> block: n   n+1 n+2 n+3
> ok:    a   c   b   d
> fail:  a   d   b   d
>
> This again assumes that the load is shared equally between the drives
> of a mirror set, which is probably true for long sustained reads. In
> reality the scheduler will select the drive that is currently not busy
> and/or whose head is closer to the region you want to read. So if you
> are reading two streams of data that are stored in different regions
> of the disk, the disks in a raid5 array would have to do a lot of seek
> operations, while the raid1+0 array would keep one head on each
> stream's location and only quietly jump from one track to the next
> (assuming your disks are not heavily fragmented). If one of the two
> disks in a mirror set fails you'll have the heads jumping again.
>
> Writing to an array with a failed drive keeps the load on each
> individual surviving drive the same, so write performance also stays
> the same.
>
> block: n   n+1 n+2 n+3
> ok:    AB  CD  AB  CD
> fail:  AB  D   AB  D
>
> Rebuilding only requires reading the surviving mirror drive and
> writing the new one. So you'll read 1.5TB and write 1.5TB. It takes
> about the same time as the raid5 rebuild but produces far less system
> load, and only one old disk is put under a lot of stress instead of
> all remaining drives.
>
> Btw: your raid 1+0 array can handle two drive failures as long as they
> don't occur in the same mirror set. So A and C, or B and D, could fail
> and you'd still have all your data. Naturally, Murphy's law applies:
> if you continue reading from that array you will stress the single
> remaining drive of the broken mirror set more than the others, and its
> chances of failing will increase.
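>
> For completeness, a short Python enumeration of which two-drive
> failures the 4-drive raid1+0 layout survives (drive names as above):
>
>     from itertools import combinations
>
>     mirror_sets = [("A", "B"), ("C", "D")]
>     drives = [d for pair in mirror_sets for d in pair]
>     for dead in combinations(drives, 2):
>         lost = any(set(pair) <= set(dead) for pair in mirror_sets)
>         print(dead, "data lost" if lost else "still readable")
>     # 4 of the 6 possible two-drive failures are survivable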
>
> But if you are worried about double faults you might as well run raid6
> on those 4 drives ... but don't ask for performance there.
>
> In all this I assume that you have a backup on another drive of all data
> that you care about. If you don't, WHAT THE F*** ARE YOU DOING? You are
> trusting your data to microscopic particles of rotating rust...
>
> Use two of the three drives as a raid1 device that will quickly get
> your data in and out, and use the third as a backup device that holds
> copies of the data you care about. That way you are safe against a
> single drive failure and against stupid users/software, assuming that
> your backup drive is not mounted/accessible all the time.
>
> If you have a lot of data that you don't really care about, you can
> use two of the three drives as a raid0 device and use the third to
> back up only the data that is important to you.
>
>
> I know you could use LVM to create one big volume group to manage all
> three disks and create the logical volumes that hold important data
> with a "--mirrors" argument proportional to your paranoia, but this
> would still only protect you from hardware failures. For protection
> against software/user failures you'd need to do snapshots as well, and
> I don't like the way you have to grow and shrink them manually; plus
> it would still all be "online" and vulnerable to typos in "dd"
> commands...
>
> Enough time wasted... just one more thing: all those RAID thingies
> assume that you trust your disks to fail cleanly, i.e. to return an
> error (or nothing at all) instead of returning wrong data. If you
> wanted to protect against silent corruption you'd have to forget about
> improved performance and instead be content with the performance of
> your slowest drive: for each read you'd have to read a block from each
> of your X drives in a raid5 array and compare the computed parity with
> the one read from disk, or in the simple raid1+0 case you'd have to
> read both copies and compare them.
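>
> For what it's worth, here is a toy Python sketch of what such paranoid
> reads would mean on a parity array: every data block of the stripe
> plus the parity block has to come off the disks before anything can be
> trusted.
>
>     def xor_blocks(blocks):
>         # byte-wise XOR of equally sized blocks
>         out = bytearray(len(blocks[0]))
>         for block in blocks:
>             for i, byte in enumerate(block):
>                 out[i] ^= byte
>         return bytes(out)
>
>     def paranoid_read(data_blocks, parity_block):
>         # return the stripe data only if the stored parity matches
>         if xor_blocks(data_blocks) != parity_block:
>             raise IOError("silent corruption detected in this stripe")
>         return data_blocks
>
>     # toy stripe: two data blocks and their parity
>     d1, d2 = b"\x3c\x55", b"\xf0\x0f"
>     p = xor_blocks([d1, d2])
>     assert paranoid_read([d1, d2], p) == [d1, d2]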
>
> cheers
> -henrik
>
>
> _______________________________________________
> vdr mailing list
> vdr at linuxtv.org
> http://www.linuxtv.org/cgi-bin/mailman/listinfo/vdr
> 



