[pulseaudio-discuss] alsa sink latency - how to account for startup delay

Georg Chini georg at chini.tk
Mon Apr 11 21:07:26 UTC 2016


On 11.04.2016 21:13, Tanu Kaskinen wrote:
> On Fri, 2016-04-08 at 20:23 +0200, Georg Chini wrote:
>> On 08.04.2016 18:01, Tanu Kaskinen wrote:
>>> On Wed, 2016-04-06 at 21:17 +0200, Georg Chini wrote:
>>>> On 06.04.2016 15:55, Tanu Kaskinen wrote:
>>>>> On Tue, 2016-04-05 at 21:54 +0200, Georg Chini wrote:
>>>>>> On 05.04.2016 17:42, Tanu Kaskinen wrote:
>>>>>>> I can't follow that line of reasoning. In the beginning the ring buffer
>>>>>>> is filled to max, and once you call snd_pcm_start(), data starts to
>>>>>>> move from the ring buffer to other buffers (I'll call the other buffers
>>>>>>> the "not-ring-buffer"). Apparently the driver "sees" the not-ring-
>>>>>>> buffer only partially, since it reports a larger latency than just the
>>>>>>> ring buffer fill level, but it still doesn't report the full latency.
>>>>>>> The time between snd_pcm_start() and the point where the reported delay
>>>>>>> does not any more equal the written amount tells the size of the
>>>>>>> visible part of the not-ring-buffer - it's the time it took for the
>>>>>>> first sample to travel from the ring buffer to the invisible part of
>>>>>>> the not-ring-buffer. I don't understand how the time could say anything
>>>>>>> about the size of the invisible part of the not-ring-buffer. Your logic
>>>>>>> "works" only if the visible and invisible parts happen to be of the
>>>>>>> same size.
>>>>>>>
>>>>>>> You should get the same results by calculating
>>>>>>>
>>>>>>>       adjusted delay = ring buffer fill level + 2 * (reported delay - ring buffer fill level)
>>>>>>>
>>>>>>> That formula doesn't make sense, but that's how I understand your logic
>>>>>>> works, with the difference that your fix is based on one measurement
>>>>>>> only, so it's constant over time, while my formula recalculates the
>>>>>>> adjustment every time the delay is queried, so the adjustment size
>>>>>>> varies somewhat depending on the granularity at which audio moves to
>>>>>>> and from the visible part of the not-ring-buffer.
>>>>>>>
>>>>>>> In any case, even if your logic actually makes sense and I'm just
>>>>>>> misunderstanding something, I don't see why the correction should be
>>>>>>> done in pulseaudio instead of the alsa driver.
>>>>>>    
>>>>>> Well, now I don't understand what you mean. The logic is very simple:
>>>>>> If there is a not reported delay between the time snd_pcm_start() is
>>>>>> called and the time when the first sample is delivered to the DAC, then
>>>>>> this delay will persist and become part of the continuous latency.
>>>>>> That's all, what causes the delay is completely irrelevant.
>>>>>    
>>>>> The code can't know when the first sample hits the DAC. The delay
>>>>> reported by alsa is supposed to tell that, but if the reported delay is
>>>>> wrong, I don't think you have any way to know the real delay.
>>>>    
>>>> Yes, the code can know when the first sample hits the DAC. I explained it
>>>> already. Before the first sample hits the DAC, the delay is growing and
>>>> is larger than or equal to the number of samples you have written to
>>>> the buffer.
>>>> At the moment the delay is smaller than the write count, you can be
>>>> sure that at least some audio has been delivered. Since the delay is
>>>> decreased by the amount of audio that has been delivered to the DAC,
>>>> you can work back in time to the moment when the first sample has been
>>>> played.
>>>   
>>> Yes, you explained that already, but you didn't give a convincing
>>> explanation of why the point in time when the delay stops growing would
>>> indicate the point when the first sample hit the DAC.
>>   
>> See below. The precondition for my thoughts naturally is that no
>> samples vanish from the latency reports, maybe that is where
>> we are thinking differently.
>>
>>>>>> Maybe what I said above was not complete. At the point in time when
>>>>>> the first audio is played, there are two delays: First the one that is
>>>>>> reported
>>>>>> by alsa and the other is the difference between the time stamps minus
>>>>>> the played audio. If these two delays don't match, then there is an
>>>>>> "extra delay" that has to be taken into account.
>>>>>    
>>>>> The difference between the time stamps is not related to how big the
>>>>> invisible part of the buffer is. I'll try to illustrate:
>>>>>
>>>>> In the beginning, pulseaudio has written 10 ms of audio to the ring
>>>>> buffer, and snd_pcm_start() hasn't been called:
>>>>>
>>>>> DAC <- ssssssssss|sss|dddddddddd <- pulseaudio
>>>>>
>>>>> Here "ssssssssss|sss|dddddddddd" is the whole buffer between the DAC and
>>>>> pulseaudio. It's divided into three parts; the pipe characters separate
>>>>> the different parts. Each letter represents 1 ms of data. "s" stands
>>>>> for silence and "d" stands for data. The first part of the buffer is
>>>>> the invisible part that is not included in the delay reports. I've put
>>>>> 10 ms of data there, but it's unknown to the driver how big the
>>>>> invisible part is. The middle part of the buffer is the "send buffer"
>>>>> that the driver maintains, its size is 3 ms in this example. It's
>>>>> filled with silence in the beginning. The third part is the ring
>>>>> buffer, containing 10 ms of data from pulseaudio.
>>>>>
>>>>> At this point the driver reports 10 ms latency. It knows it has 3 ms of
>>>>> silence buffered too, which it should include in its latency report,
>>>>> but it's stupid, so it only reports the data in the ring buffer. The
>>>>> driver has no idea how big the invisible part is, so it doesn't include
>>>>> it in the report.
>>>>>
>>>>> Now pulseaudio calls snd_pcm_start(), which causes data to start moving
>>>>> from the ring buffer to the send buffer. After 1 ms the situation looks
>>>>> like this:
>>>>>
>>>>> DAC <- ssssssssss|ssd|ddddddddd  <- pulseaudio
>>>>>
>>>>> There's 2 ms of silence in the send buffer and 1 ms of data. The driver
>>>>> again ignores the silence in the send buffer, and reports that the
>>>>> delay is 10 ms, which consists of 1 ms of data in the send buffer and 9
>>>>> ms of data in the ring buffer.
>>>>>
>>>>> After 2 ms:
>>>>>
>>>>> DAC <- ssssssssss|sdd|dddddddd   <- pulseaudio
>>>>>
>>>>> Reported delay: 10 ms
>>>>>
>>>>> After 3 ms:
>>>>>
>>>>> DAC <- ssssssssss|ddd|ddddddd    <- pulseaudio
>>>>>
>>>>> Reported delay: 10 ms
>>>>>
>>>>> Let's say pulseaudio refills the ring buffer now.
>>>>>
>>>>> DAC <- ssssssssss|ddd|dddddddddd <- pulseaudio
>>>>>
>>>>> Reported delay: 13 ms
>>>>>
>>>>> After 4 ms:
>>>>>
>>>>> DAC <- sssssssssd|ddd|ddddddddd  <- pulseaudio
>>>>>
>>>>> The first data chunk has now entered the invisible part of the buffer,
>>>>> but it will still take 9 ms before it hits the DAC. At this point
>>>>> pulseaudio has written 13 ms of audio, and the reported delay is 12 ms.
>>>>> According to your logic, the adjusted delay is 12 + (4 - 1) = 15 ms,
>>>>> while in reality the latency is 22 ms.
>>>>    
>>>> At this point, no audio has been played yet. You still have silence in the
>>>> buffer, so alsa would not report back that samples have been played.
>>>   
>>> But the reported delay stopped growing! That's the point where you
>>> claim the first sample hits the DAC, but as my example illustrates,
>>> that doesn't seem to be true.
>>   
>> In your example it is not true, that's right. But for the USB devices it is.
>> They only start decreasing the delay when real audio has been played,
>> and they would increase the delay when you write to the buffer,
>> I have checked that in the code.
>> And I think any driver that makes samples vanish is so severely screwed,
>> that we can't do anything about it. If the driver reports complete moonshine
>> numbers, you can't fix it, I agree with you in that respect.
>>
>> But that is not the case with USB. There is only some missing latency
>> that is not reported - call it transport delay or whatever and I suspect a
>> similar delay can be found in other alsa drivers. There is no need to figure
>> out the reason for it, it just takes some time after snd_pcm_start() was
>> called until the first sample is played - without making samples vanish.
>> And in that case the delay can be detected and used by the code.
> I don't understand how it could be possible to have a "transport delay"
> that doesn't get reported, if the delay is in the part of the pipeline
> that the driver can see. If there's no extra latency reporting
> mechanism in the USB audio spec, the driver can see up to the point
> where URBs get released. If the driver counts the submitted-but-not-
> released URBs, which it appears to do if releasing the first URB
> triggers the first decrease in the previously-growing delay reports,
> then any transport delay before that point should be automatically
> visible in the reported latency. If there's a 10 ms "transport delay",
> the driver has to submit at least 10 ms worth of URBs before the first
> URB gets released.
>
> I can't imagine a system where the full transport delay doesn't get
> included in the delay report, if at the same time the release of the
> first URB is the trigger for stopping the delay growth. Either the full
> URB queue is taken into account, or none of it is. The transport delay
> can't be larger than the URB queue, otherwise there are holes in the
> stream.
>
> You make very confident-sounding claims of the URB release notification
> happening only after the data in the URB has hit the DAC. But have you
> considered that the notification might actually happen immediately
> after the USB host controller has sent the data packet to the USB
> device? The USB device may then have an additional buffer before the
> DAC, so the received audio gets queued and doesn't immediately hit the
> DAC. That would explain the permanently too small delay reports, since
> the buffer in the USB device isn't included in the reports. In this
> case there's no way to know the USB device buffer size without a
> separate latency reporting mechanism in the USB audio spec.

You still don't understand. I do not care at all where the delay comes
from. It may, for example, come from pulseaudio starting and stopping
the sink because of a sample rate change, or from any other real-time
event that delays playback after snd_pcm_start() is called.
The point is that the reported delay does not drop below the write
count before the sample is either played or vanishes into your
invisible buffer. So with my code you can at least catch any delays
that have happened up to that time.
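To make this concrete, the detection described above could be sketched like this (hypothetical names and structure, not the actual patch; it only assumes that the reported delay keeps covering everything written until real playback begins):

```c
/* Sketch of detecting the unreported startup delay described above.
 * Hypothetical illustration; not the actual pulseaudio code. */
#include <stdint.h>
#include <stdbool.h>

typedef struct {
    uint64_t start_time_us;   /* time snd_pcm_start() was called */
    uint64_t frames_written;  /* total frames written since start */
    bool     playback_seen;   /* first real decrease observed? */
    int64_t  extra_delay_us;  /* detected unreported startup delay */
} start_delay_detector;

/* Call on every delay query until playback_seen becomes true.
 * now_us: current monotonic time, delay_frames: reported delay. */
static void detector_update(start_delay_detector *d, uint64_t now_us,
                            uint64_t delay_frames, unsigned rate) {
    if (d->playback_seen)
        return;
    if (delay_frames >= d->frames_written)
        return; /* nothing played yet: the delay still covers everything written */

    /* Frames actually played so far */
    uint64_t played = d->frames_written - delay_frames;
    uint64_t played_us = played * 1000000ULL / rate;

    /* Elapsed wall clock minus the time accounted for by played audio
     * gives the delay between snd_pcm_start() and the first sample
     * being played (or vanishing into an invisible buffer). */
    d->extra_delay_us = (int64_t)(now_us - d->start_time_us)
                        - (int64_t)played_us;
    d->playback_seen = true;
}
```

With 480 frames (10 ms at 48 kHz) written at time 0 and the first decrease to 384 frames observed at 12 ms, this yields 10 ms of extra delay: 2 ms of audio were played, but 12 ms of wall clock elapsed.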

Your invisible buffer would, by the way, match my "small but constant
remaining latency offset", so in your initial example I would see an
additional latency offset of 9 ms that my code does not account for,
but I would still have a better latency estimate than the current code
delivers.

>>>> I choose the point where the first d hits the DAC and that is reported
>>>> back by alsa (see above). I've tried to put it all together in a document.
>>>> I hope I can finish the part that deals with the smoother code today.
>>>> If so, I will send it to you privately because the part about
>>>> module-loopback
>>>> is still missing.
>>>> Anyway, even if you think it is wrong I am still measuring the correct
>>>> end-to-end latency with my code, so something I am doing must be
>>>> right ...
>>>   
>>>  From what I can tell, that's a coincidence.
>>   
>> No, it definitely isn't. If you accept the precondition that samples
>> do not simply vanish from the latency reports, it's physics.
> I'm not sure if you understood what I meant, if you talk about
> "vanishing samples" as something crazy that can be assumed to not
> happen.

At the end of the pipeline samples always vanish - I already said that
there is a point where your access ends, and every delay that happens
beyond that point can only be corrected manually as a latency offset.
Again, it does not matter where they vanish to - whether into the DAC
or into an invisible buffer - at least we can try to look as far ahead
as possible.

> I tried to illustrate that the driver may see only up to some
> point in the full audio pipeline, so before the DAC there may be some
> buffer that the driver has no idea of. Samples don't vanish, they just
> move from the visible part to the invisible part. A clear example of
> this kind of situation is the bluetooth implementation in pulseaudio:
> we don't have visibility to the audio pipeline after the socket to
> which we write data. We can't query the amount of audio buffered
> between the socket and the bluetooth device's DAC (or if there's a
> mechanism for that, I'm not aware of it), and we don't get any
> notifications of played chunks of audio. Therefore, we can't report
> bluetooth latency accurately.
>
>> I would tend to agree that I have overlooked something, if the "extra
>> delay" would be the same every time and if I could not write down
>> the math for it.
>> But it isn't completely constant (just in the same range) and I can
>> write down the math and it matches my measurements. So I am
>> fairly sure that I am right. Did you have a look at my document?
> I've only glanced it so far. It's a big undertaking to understand all
> the math, and I haven't had time for that yet.
>
>>>>> I don't know how well this model reflects the reality of how the usb
>>>>> audio driver works, but this model seems like a plausible explanation
>>>>> for why the driver reports delays equalling the amount of written data
>>>>> in the beginning, and why the real latency is higher than the reported
>>>>> latency at later times.
>>>>>
>>>>> I hope this also clarifies why I don't buy your argument that the time
>>>>> stamp difference is somehow related to the unreported latency.
>>>>    
>>>> No, in fact it doesn't.
>>>>
>>>>>> Trying to fix up that delay on every iteration does not make any sense
>>>>>> at all, it is there from the start and it is constant.
>>>>>    
>>>>> Commenting on "it is constant": The playback latency is the sum of data
>>>>> in various buffers. The DAC consumes one sample at a time from the very
>>>>> last buffer, but I presume that all other places move data in bigger
>>>>> chunks than one sample. The unreported delay can only be constant if
>>>>> data moves to the invisible part of the buffering in one sample chunks.
>>>>> Otherwise the latency goes down every time the DAC reads a sample, and
>>>>> then when the buffer is refilled at the other end, the latency jumps up
>>>>> by the refill amount.
>>>>    
>>>> I only said the "extra latency" is constant, not the latency as such.
>>>> See your own example above that your argument is wrong. Even
>>>> if the audio is moved in chunks through your invisible buffer part,
>>>> that part still has the same length all the time. When one "d" is
>>>> moved forward another one will replace it.
>>>   
>>> No, the invisible part is not constant, even though my presentation
>>> didn't show the variance. The DAC consumes data from the invisible
>>> buffer one sample at a time, and each time it does that, the extra
>>> latency decreases by one sample. Data moves from the visible part of
>>> the buffer to the invisible part in bigger chunks. I didn't specify the
>>> chunk size, but if we assume 1 ms chunks, the extra latency grows by 1
>>> ms every time a chunk is transferred from the visible part to the
>>> invisible part.
>>   
>> Then take any part of the buffer but the last or the first bit. All the
>> chunks are always full, so it's constant. The moving bit is dealt with
>> elsewhere, (in the smoother) but there is a lot of buffer that is always
>> full.
>> And when you take USB, the driver sees only chunks. The sample
>> by sample consuming of the DAC is never seen by the driver, it gets
>> the notification from USB that a chunk has been played.
>> I'm not sure how it is with HDA, but probably similar.
> You are probably not talking about my model where there's an invisible
> buffer at the end of the pipeline, and since I don't understand what
> model you have in mind (see the text earlier in this mail), I can't
> really comment here, other than to point out that even when talking
> about a buffer that deals with equally sized chunks at both read and
> write ends, the buffered amount varies if the reads and writes don't
> happen exactly at the same time (each read reduces the buffered amount
> and each write increases the buffered amount).

Surely the amount in the buffers varies; that's why I said to leave out
the first and the last bit. Everything in between is always full and
makes a constant contribution to the latency.
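The point about the interior of the pipeline can be put numerically. Here is a sketch (hypothetical chunked pipeline, not real driver code): while the stream runs, every interior chunk is full - as soon as one is handed forward, the one behind it is refilled - so only the chunk being drained and the chunk being filled vary:

```c
/* Sketch: why the interior of a chunked pipeline contributes a constant
 * latency. Hypothetical chunk sizes and counts, not real driver code. */

#define CHUNK_MS 1
#define N_CHUNKS 13  /* total chunks between writer and DAC */

/* Only the first chunk (being filled by the writer) and the last chunk
 * (being drained by the DAC) have a varying fill level; the N_CHUNKS - 2
 * interior chunks are always full. */
static int steady_state_latency_ms(int head_fill_ms, int tail_fill_ms) {
    int interior_ms = (N_CHUNKS - 2) * CHUNK_MS; /* constant contribution */
    return head_fill_ms + interior_ms + tail_fill_ms;
}
```

The latency therefore varies only by the fill level of the two edge chunks; the 11 ms interior contribution never changes while the stream runs.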


>
>>>>>> This is not a negative delay reported by alsa, but my "extra latency"
>>>>>> is getting negative, which means playback must have started
>>>>>> before snd_pcm_start().
>>>>>> According to Raymond Yau playback seems in fact to be started
>>>>>> before snd_pcm_start() for HDA devices, at least if I read his last
>>>>>> mail on that topic right. Then the negative delays would even make
>>>>>> sense, since data is written to the buffer before snd_pcm_start().
>>>>>    
>>>>> I had a look at the code to verify the claim that we configure alsa to
>>>>> start playback already before we call snd_pcm_start(). If we really do
>>>>> that intentionally, then it doesn't make sense to call snd_pcm_start()
>>>>> explicitly.
>>>>>
>>>>> This is what we do:
>>>>> snd_pcm_sw_params_set_start_threshold(pcm, swparams, (snd_pcm_uframes_t) -1)
>>>>>
>>>>> Note the casting of -1 to an unsigned integer. It seems that the
>>>>> intention is to set as high threshold as possible to avoid automatic
>>>>> starting. However, alsa-lib casts the threshold back to a signed value
>>>>> when it's used, and I believe the end result is indeed that playback
>>>>> starts immediately after the first write. I don't know if that matters,
>>>>> since we do the manual snd_pcm_start() call immediately after the first
>>>>> write anyway, but it seems like a bug in any case.
>>>   
>>> Not very important, but I'll clarify one thing: I had another look, and
>>> I'm not any more sure that the code where I saw the casting back to a
>>> signed integer is actually used by pulseaudio. The function
>>> is snd_pcm_write_areas(), but pulseaudio doesn't call that at least
>>> directly, and I did some searching in alsa-lib too, and I didn't find a
>>> call path that would cause snd_pcm_write_areas() to be used by
>>> pulseaudio. Even if snd_pcm_write_areas() isn't used, though, it's
>>> entirely possible that there's some other code that does a similar
>>> cast. I don't know what code it is that triggers the snd_pcm_start() call
>>> when the ring buffer fill level exceeds the configured threshold. It
>>> might be in the kernel.
>>>
>>>> OK, this it why I measure an "extra latency" of -60 to -20 usec.
>>>> So again, if I can measure it and even detect a bug that way,
>>>> don't you think there must be some truth in what I'm saying?
>>>   
>>> Do I understand correctly that your "extra latency" is affected by
>>> whether snd_pcm_start() is called implicitly in mmap_write() or
>>> explicitly after mmap_write()? The time when mmap_write() is called
>>> doesn't affect the latency in the long term.
>>   
>> It does. It isn't much, but if playback starts earlier, the delay
>> will be exactly that amount less even after 10 hours of playback.
>> Let's assume you have 10ms of audio to write to the buffer.
>> During the time, when you write, samples are coming in.
>> Let's say it takes 100 usec to write the buffer. If you start
>> playback after the write, this will be 100 usec additional delay.
>> 5 samples have accumulated.
>> If you start playback immediately after the first bit of data is
>> written this might take much less time, say 20 usec.
>> So your delay is four samples less and it will remain that way
>> until the sink is stopped. There is nothing that would take away
>> the delay.
> Those 5 samples come from a source and accumulate in module-loopback's
> memblockq, right? That has no effect on the sink latency, and module-
> loopback is able to get rid of excess stuff in the memblockq, so the
> initial accumulation has no long-term effects on latency.

That is exactly my point. If the fact that playback started earlier
than expected were correctly reported, then module-loopback could
react to it. But with the current code it isn't, because alsa
calculates the delay relative to the start of playback, while
playback was in fact started earlier than pulseaudio assumes.
The same is valid in the other direction, if the start of playback
is delayed for any reason.
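A minimal sketch of the kind of correction meant here (hypothetical names, not the actual patch): once the extra startup delay is known, the assumed start time is shifted by it, so that anything calculated relative to the start becomes accurate. A negative extra delay means playback in fact began before snd_pcm_start():

```c
/* Sketch (hypothetical names): shift the assumed playback start time
 * by the measured extra delay. Negative values mean playback started
 * before snd_pcm_start() was called. */
#include <stdint.h>

static uint64_t corrected_start_time_us(uint64_t snd_pcm_start_time_us,
                                        int64_t extra_delay_us) {
    /* Playback really began extra_delay_us after (or, if negative,
     * before) the snd_pcm_start() call. */
    return (uint64_t)((int64_t)snd_pcm_start_time_us + extra_delay_us);
}
```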

>
>>> The smoother will produce
>>> wrong values if it's not started at the same time as snd_pcm_start() is
>>> called, but I presume the smoother is able to fix such inaccuracies
>>> over time, so it doesn't matter that much when the snd_pcm_start() is
>>> called. So isn't it a bad thing if your "extra latency" permanently
>>> includes something that doesn't have any real effect after some time?
>>   
>> Yes, it is affected by it and it should be, because the "extra delay"
>> is the time between snd_pcm_start() and the first sample being
>> played. So if the first samples are played before snd_pcm_start()
>> the "extra latency" will become negative. And as explained above,
>> it has permanent effect. Somehow you seem to be of the opinion
>> that all delays that are not controlled by the pulseaudio code
>> vanish magically, but they don't.
> A permanent difference in delay means that there's some buffer
> somewhere whose fill level is permanently higher or lower. I don't
> think there's any place in the alsa driver where that could happen. The
> loopback memblockq could be a place where this could happen, if module-
> loopback didn't do adaptive resampling to handle the case where the
> memblockq has too much or too little audio buffered.

I don't understand your comment, sorry.

>
>> For the reported latency, it just means, that it will become slightly
>> smaller. As I said, the smoother does not use the "extra delay"
>> for anything, it is only calculated once when the origin for the
>> smoother is set and added later as an offset, when get_latency()
>> is called.
> If you ever make the extra delay negative, that means that you think
> that alsa reports playback latencies that are higher than reality. Do
> you really think that it's possible for the delay values that alsa
> reports to be permanently higher than reality?

As explained above, the "higher than reality" or "lower than reality"
latencies may not be caused by wrong delay reports from alsa, but by
the assumption pulseaudio makes regarding the time when playback was
started. And that is what my code tries to do - get the best possible
value for the "real" start time.
It does not claim to cover all possible latencies anywhere between
source and sink, and it would not catch an "invisible buffer", but it
is much better than relying blindly on the reported values like the
current code does.
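As a sketch of what "added later as an offset" means here (hypothetical names, not the actual module-alsa-sink code), the extra delay is measured once and then simply added whenever the latency is queried:

```c
/* Sketch of applying a one-time measured "extra latency" as a constant
 * offset at latency query time. Hypothetical names, not real code. */
#include <stdint.h>

typedef struct {
    int64_t extra_delay_us; /* measured once, when the smoother origin is set */
} sink_userdata;

static int64_t get_latency_us(const sink_userdata *u,
                              int64_t smoother_estimate_us) {
    /* The smoother tracks the reported delay; the startup delay that
     * alsa never reported is simply added on top (it may be negative). */
    return smoother_estimate_us + u->extra_delay_us;
}
```

With the example figures from earlier in the thread (12 ms reported, 3 ms measured extra delay), this yields the 15 ms adjusted estimate.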

Just one more point regarding those negative latencies - if I allow
the "extra latency" to become negative, the resulting end-to-end
latency for HDA -> HDA will be exactly the same if you try it a couple
of times, while if I truncate, the resulting latency will vary by
20 - 60 usec. I can't believe that is a coincidence ...

I can only point you again to my document - I sent you the full version
including the loopback part earlier today, and the math is unambiguous.



More information about the pulseaudio-discuss mailing list