i915 regression in kernel 4.10

Mon Dec 19 14:16:44 UTC 2016

On 19/12/16 13:29, Chris Wilson wrote:
> On Mon, Dec 19, 2016 at 12:39:16PM +0100, Juergen Gross wrote:
>> With recent 4.10 kernel the graphics isn't coming up under Xen. First
>> failure message is:
>>
>> [   46.656649] i915 0000:00:02.0: swiotlb buffer is full (sz: 1630208 bytes)
> 
> Do we get a silent failure? i915_gem_gtt_prepare_pages() is where we
> call dma_map_sg() and pass the sg to swiotlb (in this case) for
> remapping, and we do check for an error value of 0. After that error,
> SWIOTLB_MAP_ERROR is propagated back and converted to 0 for
> dma_map_sg(). That looks valid, and we should report ENOMEM back to the
> caller.
> 
>> Later I see splats like:
>>
>> [   49.393583] general protection fault: 0000 [#1] SMP
> 
> What was the faulting address? RAX is particularly non-pointer-like so I
> wonder if we walked onto an uninitialised portion of the sgtable. We may
> have tripped over a bug in our sg_page iterator.

During the bisect process there have been either GP or NULL pointer
dereferences or other page faults. Typical addresses where:

xen_swiotlb_unmap_sg_attrs+0x1f/0x50: access to 0000000000000018
xen_swiotlb_unmap_sg_attrs+0x1f/0x50: access to 0000000003020118

> 
> The attached patch should prevent an early ENOMEM following the swiotlb
> allocation failure. But I suspect that we will still be tripping up the
> failure in the sg walker when binding to the GPU.
> -Chris
> 

The patch is working not too bad. :-)

Still several "swiotlb buffer is full" messages (some with sz:, most
without), but no faults any more (neither GP nor NULL pointer
dereference). Graphical login is working now.

What I do see, however, is (no idea whether this is related):

[  735.826492] INFO: task systemd-udevd:484 blocked for more than 120
seconds.
[  735.826497]       Tainted: G        W       4.9.0-pv+ #767
[  735.826499] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.
[  735.826501] systemd-udevd   D    0   484    443 0x00000000
[  735.826507] Call Trace:
[  735.826522]  ? __schedule+0x192/0x640
[  735.826530]  ? kmem_cache_free+0x45/0x150
[  735.826535]  ? schedule+0x2d/0x80
[  735.826539]  ? schedule_timeout+0x1f3/0x380
[  735.826545]  ? error_exit+0x9/0x20
[  735.826555]  ? sg_pool_index.part.0+0x2/0x2
[  735.826561]  ? wait_for_completion+0xa4/0x110
[  735.826569]  ? wake_up_q+0x70/0x70
[  735.826577]  ? cpufreq_boost_online+0x10/0x10 [acpi_cpufreq]
[  735.826585]  ? cpuhp_issue_call+0x9c/0xe0
[  735.826590]  ? __cpuhp_setup_state+0xd5/0x1d0
[  735.826599]  ? acpi_cpufreq_init+0x1cd/0x1000 [acpi_cpufreq]
[  735.826601]  ? 0xffffffffa00b1000
[  735.826607]  ? do_one_initcall+0x38/0x180
[  735.826611]  ? kmem_cache_alloc_trace+0x98/0x1e0
[  735.826620]  ? do_init_module+0x55/0x1e5
[  735.826629]  ? load_module+0x2088/0x26b0
[  735.826633]  ? __symbol_put+0x30/0x30
[  735.826639]  ? SYSC_finit_module+0x80/0xb0
[  735.826644]  ? entry_SYSCALL_64_fastpath+0x1e/0xad

I guess it is _not_ related, OTOH there is sg_pool_index() involved...

Juergen