[linux-dvb] CX18 Oops
hverkuil at xs4all.nl
Sun Aug 17 22:01:57 CEST 2008
On Sunday 17 August 2008 21:12:50 Andy Walls wrote:
> On Sun, 2008-08-17 at 11:41 +0200, Hans Verkuil wrote:
> > On Sunday 17 August 2008 04:13:24 Andy Walls wrote:
> > > On Mon, 2008-08-11 at 17:33 -0400, Brandon Jenkins wrote:
> > > > On Fri, Aug 8, 2008 at 10:18 AM, Andy Walls <awalls at radix.net>
> > wrote:
> > > I have updated my repo at
> > >
> > > http://linuxtv.org/hg/~awalls/v4l-dvb
> > >
> > > with 3 changes:
> > >
> > > 1. Back out the original band aid fix
> > > 2. Simplify the queue flush routines (you will not see that oops
> > > again)
> > Nice! Cleans it up considerably.
> > > 3. Fix the interrupt handler to obtain a queue lock (prevents
> > > queue corruption)
> > No, that is not the bug.
> Yeah, it is not *the* bug I was after, but pending further discussion
> below I'll maintain it is a bug.
> I really want to do a full audit of all the queue manipulations in
> the driver. I'll try to find some time when I'm offline this week.
> (Thanks for all the comments BTW!)
> > I'm pretty sure that the real bug is that the
> > old cx18_queue_move() function didn't use a spin_lock_irqsave(). I
> > think that it was possible for an interrupt to arrive when the CPU
> > was in the middle of a cx18_queue_move().
> On the surface it did look OK since all the interrupts for that
> stream *should* have been stopped. But after some hypothetical
> thought about how the encoder might not stop right away and looking
> at Jeff's logs where a new capture may have been started before the
> queue flush was done, it's safer just to have the queue flush routine
> to acquire the lock. Since the queue flush routine in question is
> called very infrequently, there's not much of a time penalty.
> > A spinlock in an interrupt
> > handler is usually bogus (and that includes the one in the ivtv irq
> > handler, I've just realized).
> Chapter 5 of _Understanding_the_Linux_Kernel_ (2nd edition) on page
> 185 & 188 mentions that to protect data structures, being accessed by
> both exceptions (like open/close/read/write/poll induced INT 0x80
> exceptions on Intel) and interrupt handlers on a MP system, a
> spinlock needs to be used. The book also mentions that a semaphore
> (now a mutex) is sometimes preferable in this case, where the
> interrupt routine polls the semaphore in a tight loop like a
> spinlock, but the system calls use the semaphore normally and are
> allowed to sleep.
> > What I am uneasy about, though, is why an interrupt could arrive
> > while in the cx18_queue_move() function. In principle this function
> > should only be called when the capture has stopped. I think it
> > might be a good idea to debug this: is it possible for interrupts
> > to arrive after the capture was stopped? Or is it possible for
> > cx18_queue_move() to be called when a capture is still in progress?
> Right. I should really test to see if this actually happens. Since
> spinlocks are supposedly optimized for the case of the lock being
> available and the cx18_queue_move() is called infrequently, leaving
> it in for now, should be OK.
> I also have difficulty reproducing the original oops, so my test
> results could be misleading. I have no good criteria for terminating
> the experiment/testing and declaring "interrupts can't happen when we
> decide to flush queues".
> > I do think it is a good idea to rename cx18_queue_find_buf to
> > cx18_queue_get_buf_irq to denote that it 1) not just finds a buffer
> > but also removes it from the queue, and 2) it can only be called
> > safely from interrupt context.
> Agree. I'll make that part of the final change.
> > > >From most of the output you provided, it was pretty obvious that
> > > > q_full
> > >
> > > was always claiming to have more buffers that it actually did. I
> > > hypothesized this could come about at the end of a capture when
> > > the encoder hadn't really stopped transferring buffers yet (after
> > > we told it to stop) and then we try to clear q_full while the
> > > interrupt handler is still trying to add buffers. This could
> > > happen because the interrupt handler never (ever) properly
> > > obtained a lock for manipulating the queues. This could have
> > > been causing the queue corruption.
> > >
> > > Please test. I need feedback that I haven't introduced a
> > > deadlock.
> > >
> > > It also appears that the last change requiring the interrupt
> > > handler to obtain a lock, completely mitigates me having to use
> > > the "-cache 8192" option to mplayer for digital captures, and
> > > greatly reduces the amount of cache I need to have mplayer use
> > > for analog captures.
> > I suspect that it is the change before that one: adding a spinlock
> > to cx18_queue_move().
> I have a recollection that in my incremental testing that this was
> not the case. It was actually the spinlock in the irq handler that
> made things better. But it was late and I was tired. I'll retest
> and run my blocking read timing test to see if I can see a difference
> in the numbers for a few cases.
> > The spinlock in the interrupt handler doesn't do
> > anything. It would only be useful if you could have two independent
> > interrupt handlers that both needed access to that resource. But
> > that is not the case here.
> I agree that 2+ interrupt handlers could not access that resource
> concurrently. But, AFAICT, a system call on one processor and the
> interrupt handler on the other processor can access the queues
> It's my understanding that spin_lock_irqsave() and spin_lock_irq()
> only disable local interrupts for the particular CPU and not
> globally. So here's the case I think needs a spinlock lock in the
> irq handler:
> 1. A capture is in progress
> 2. Application on CPU #0 issues a read(). cx18_dequeue() is
> ultimately called on q_full. cx18_dequeue() calls
> spin_lock_irqsave(), disabling preemption, disabling local
> interrupts, and acquiring the lock, and begins manipulating q_full
> 3. At an inopportune time, an interrupt arrives from the encoder and
> is sent to CPU #1 for servicing. In the cx18 driver, in an
> interrupt context, epu_dma_done() is eventually invoked. Without
> obtaining a lock, epu_dma_done() calls cx18_queue_find_buf() and
> manipulates q_full. This manipulation of q_full happens while the
> system call on CPU #0 holds the lock and thinks things are safe.
> Given the above situation I think the interrupt handler does need to
> acquire the spinlock. So, here's where people get to hit me with the
> clue-stick: in the above case, what do I have wrong?
> > > Hans,
> > > (or anyone else with expertise in using spinlocks withing an
> > > interrupt handler),
> > >
> > > Could you please provide comments on if I'm doing something wrong
> > > with the way I obtain the spinlock in the interrupt handler?
> > See above :-)
> > > http://linuxtv.org/hg/~awalls/v4l-dvb/rev/f3ada35200c0
> > >
> > > >From reading Bovet and Cesati's _Understanding_the_Linux_Kernel_
> > > > and
> > >
> > > Corbet, Rubini, and Kroah-Hartman's _Linux_Device_Drivers_ I
> > > think I've got it right.
> > >
> > > When the stream queues (q_full, q_io, and q_free) are accessed
> > > from the system call exception handler, I need to do a
> > > spin_lock_irqsave() to disable local CPU interrupts and protect
> > > access to the queues by kernel control paths on other CPU's.
> > > When they stream queues are accessed by the interrupt handler on
> > > any CPU, the interrupt handler is serialized with respect to
> > > itself and need not disable any interrupts and simply obtain the
> > > lock via spin_lock() to protect against access from system call
> > > exceptions.
> > System call exceptions? Not sure what you mean.
> As I understand it, on Intel platforms, when a user land application
> invokes read(), write(), or some other system call, said system call
> eventually invokes an INT 0x80 software exception to make the
> transition to kernel code and data space with the proper privileges.
Ah, I just call them 'system calls' :-)
> > AFAIK the interrupt
> > handler doesn't have to protect against anything.
> See my concern above. In brief, AFAICT, a system call on one
> processor concurrent with interrupt service on another processor
> requires the irq handler to obtain the proper lock before mucking
> with the shared data structure.
You are completely right and I stand corrected. cx18_queue_find_buf (aka
cx18_queue_get_buf_irq) must have a spin_lock. So that spin_lock in
ivtv wasn't bogus either :-)
Damn, it's so easy to get confused with locking, even you've implemented
it several times already.
That's a serious bug which needs to go into 2.6.27 (and probably to the
2.6.26 stable series as well).
Andy, can you make a patch that adds the spin_lock to
cx18_queue_find_buf(). It's better to do it there rather than in the
Then that patch can go into v4l-dvb and from there to 2.6.27. The other
changes can come later.
Apologies for probably confusing you. I certainly confused myself.
More information about the linux-dvb