[Intel-gfx] [RFC PATCH 1/4] drm/i915: Drop user contexts on driver remove
Tvrtko Ursulin
tvrtko.ursulin at linux.intel.com
Thu May 28 13:34:42 UTC 2020
On 28/05/2020 13:10, Janusz Krzysztofik wrote:
> Hi Tvrtko,
>
> On Thu, 2020-05-28 at 11:14 +0100, Tvrtko Ursulin wrote:
>> On 18/05/2020 19:17, Janusz Krzysztofik wrote:
>>> Contexts associated with open device file descriptors together with
>>> their assigned address spaces are now closed on device file close. On
>>
>> i915_gem_driver_remove looks like module unload to me, not device file
>> close. So..
>
> Not only module unload ...
>
>>
>>> address space closure its associated DMA mappings are revoked. If the
>>> device is removed while being open, subsequent attempts to revoke
>>> those mappings while closing the device file descriptor may may be
>>> judged by intel-iommu code as a bug and result in kernel panic.
>>>
>>> Since user contexts become useless after the device is no longer
>>> available, drop them on device removal.
>>>
>>> <4> [36.900985] ------------[ cut here ]------------
>>> <2> [36.901005] kernel BUG at drivers/iommu/intel-iommu.c:3717!
>>> <4> [36.901105] invalid opcode: 0000 [#1] PREEMPT SMP NOPTI
>>> <4> [36.901117] CPU: 0 PID: 39 Comm: kworker/u8:1 Tainted: G U W 5.7.0-rc5-CI-CI_DRM_8485+ #1
>>> <4> [36.901133] Hardware name: Intel Corporation Elkhart Lake Embedded Platform/ElkhartLake LPDDR4x T3 CRB, BIOS EHLSFWI1.R00.1484.A00.1911290833 11/29/2019
>>> <4> [36.901250] Workqueue: i915 __i915_vm_release [i915]
>>> <4> [36.901264] RIP: 0010:intel_unmap+0x1f5/0x230
>>> <4> [36.901274] Code: 01 e8 9f bc a9 ff 85 c0 74 09 80 3d df 60 09 01 00 74 19 65 ff 0d 13 12 97 7e 0f 85 fc fe ff ff e8 82 b0 95 ff e9 f2 fe ff ff <0f> 0b e8 d4 bd a9 ff 85 c0 75 de 48 c7 c2 10 84 2c 82 be 54 00 00
>>> <4> [36.901302] RSP: 0018:ffffc900001ebdc0 EFLAGS: 00010246
>>> <4> [36.901313] RAX: 0000000000000000 RBX: ffff8882561dd000 RCX: 0000000000000000
>>> <4> [36.901324] RDX: 0000000000001000 RSI: 00000000ffd9c000 RDI: ffff888274c94000
>>> <4> [36.901336] RBP: ffff888274c940b0 R08: 0000000000000000 R09: 0000000000000001
>>> <4> [36.901348] R10: 000000000a25d812 R11: 00000000112af2d4 R12: ffff888252c70200
>>> <4> [36.901360] R13: 00000000ffd9c000 R14: 0000000000001000 R15: ffff8882561dd010
>>> <4> [36.901372] FS: 0000000000000000(0000) GS:ffff888278000000(0000) knlGS:0000000000000000
>>> <4> [36.901386] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>> <4> [36.901396] CR2: 00007f06def54950 CR3: 0000000255844000 CR4: 0000000000340ef0
>>> <4> [36.901408] Call Trace:
>>> <4> [36.901418] ? process_one_work+0x1de/0x600
>>> <4> [36.901494] cleanup_page_dma+0x37/0x70 [i915]
>>> <4> [36.901573] free_pd+0x9/0x20 [i915]
>>> <4> [36.901644] gen8_ppgtt_cleanup+0x59/0xc0 [i915]
>>> <4> [36.901721] __i915_vm_release+0x14/0x30 [i915]
>>> <4> [36.901733] process_one_work+0x268/0x600
>>> <4> [36.901744] ? __schedule+0x307/0x8d0
>>> <4> [36.901756] worker_thread+0x37/0x380
>>> <4> [36.901766] ? process_one_work+0x600/0x600
>>> <4> [36.901775] kthread+0x140/0x160
>>> <4> [36.901783] ? kthread_park+0x80/0x80
>>> <4> [36.901792] ret_from_fork+0x24/0x50
>>> <4> [36.901804] Modules linked in: mei_hdcp i915 x86_pkg_temp_thermal coretemp crct10dif_pclmul crc32_pclmul ghash_clmulni_intel ax88179_178a usbnet mii mei_me mei prime_numbers intel_lpss_pci
>>> <4> [36.901857] ---[ end trace 52d1b4d81f8d1ea7 ]---
>>>
>>> Signed-off-by: Janusz Krzysztofik <janusz.krzysztofik at linux.intel.com>
>>> ---
>>> drivers/gpu/drm/i915/gem/i915_gem_context.c | 38 +++++++++++++++++++++
>>> drivers/gpu/drm/i915/gem/i915_gem_context.h | 1 +
>>> drivers/gpu/drm/i915/i915_gem.c | 2 ++
>>> 3 files changed, 41 insertions(+)
>>>
>>> diff --git a/drivers/gpu/drm/i915/gem/i915_gem_context.c b/drivers/gpu/drm/i915/gem/i915_gem_context.c
>>> index 900ea8b7fc8f..0096a69fbfd3 100644
>>> --- a/drivers/gpu/drm/i915/gem/i915_gem_context.c
>>> +++ b/drivers/gpu/drm/i915/gem/i915_gem_context.c
>>> @@ -927,6 +927,44 @@ void i915_gem_driver_release__contexts(struct drm_i915_private *i915)
>>> rcu_barrier(); /* and flush the left over RCU frees */
>>> }
>>>
>>> +void i915_gem_driver_remove__contexts(struct drm_i915_private *i915)
>>> +{
>>> + struct i915_gem_context *ctx, *cn;
>>> +
>>> + list_for_each_entry_safe(ctx, cn, &i915->gem.contexts.list, link) {
>>> + struct drm_i915_file_private *file_priv = ctx->file_priv;
>>> + struct i915_gem_context *entry;
>>> + unsigned long int id;
>>> +
>>> + if (i915_gem_context_is_closed(ctx) || IS_ERR(file_priv))
>>> + continue;
>>> +
>>> + xa_for_each(&file_priv->context_xa, id, entry) {
>>
>> ... how is driver unload possible with open drm file descriptors, or
>> active contexts?
>
> ... but also PCI driver unbind or PCI device remove, with the module
> still loaded. That may perfectly happen even if a device file
> descriptor is still kept open.
I see. What do we do, or plan to do, with those left open drm fds after
the driver is unbound from the device? Is there a path connected to keep
saying -ENODEV (Or is a different errno standard for this case, like
ENXIO?) from that point onward for everything done with that fd. So
userspace couldn't do anything more with it, attempt to create a new
context etc. Is the DRM core handling this?
Regards,
Tvrtko
>
>> If something is going wrong sounds like something else.
>
> I think we might consider that "something" as intel-iommu code, but see
> also the last paragraph of my response below.
>
>>
>> drm postclose -> i915_gem_context_close -> closes all contexts and puts
>> all vm. What can remain dangling? An active context? But there is idling
>> via i915_gem_driver_remove -> i915_gem_suspend_late.
>
> Idling doesn't release DMA mappings which may then be released as late
> as on i915_gem_driver_release. If that doesn't happen on device remove
> only later on last device close, intel-iommu triggers a bug.
>
> I reported that issue to intel-iommu maintainers several months ago[1].
> They even agreed that was something that might need to be fixed[2], I
> provided them with information how the issue could be reproduced
> easily[3] but no progress has been reported since then. If you think
> the issue should be definitely fixed on intel-iommu side, please help
> me to convince intel-iommu maintainers to take care of that.
>
> However, please note DMA-API also complains about DMA mappings still
> left active when device is removed. That may indicate we should rather
> fix that on our side.
>
> Thanks,
> Janusz
>
> [1] https://marc.info/?l=linux-kernel&m=156689857510234&w=2
> [2] https://marc.info/?l=linux-kernel&m=156706978013967&w=2
> [3] https://marc.info/?l=linux-kernel&m=156749651025416&w=2
>
>>
>> Regards,
>>
>> Tvrtko
>>
>>> + struct i915_address_space *vm;
>>> + unsigned long int idx;
>>> +
>>> + if (entry != ctx)
>>> + continue;
>>> +
>>> + xa_erase(&file_priv->context_xa, id);
>>> +
>>> + if (id)
>>> + break;
>>> +
>>> + xa_for_each(&file_priv->vm_xa, idx, vm) {
>>> + xa_erase(&file_priv->vm_xa, idx);
>>> + i915_vm_put(vm);
>>> + }
>>> +
>>> + break;
>>> + }
>>> +
>>> + context_close(ctx);
>>> + }
>>> +
>>> + i915_gem_driver_release__contexts(i915);
>>> +}
>>> +
>>> static int gem_context_register(struct i915_gem_context *ctx,
>>> struct drm_i915_file_private *fpriv,
>>> u32 *id)
>>> diff --git a/drivers/gpu/drm/i915/gem/i915_gem_context.h b/drivers/gpu/drm/i915/gem/i915_gem_context.h
>>> index 3702b2fb27ab..62808bea9239 100644
>>> --- a/drivers/gpu/drm/i915/gem/i915_gem_context.h
>>> +++ b/drivers/gpu/drm/i915/gem/i915_gem_context.h
>>> @@ -110,6 +110,7 @@ i915_gem_context_clear_user_engines(struct i915_gem_context *ctx)
>>>
>>> /* i915_gem_context.c */
>>> void i915_gem_init__contexts(struct drm_i915_private *i915);
>>> +void i915_gem_driver_remove__contexts(struct drm_i915_private *i915);
>>> void i915_gem_driver_release__contexts(struct drm_i915_private *i915);
>>>
>>> int i915_gem_context_open(struct drm_i915_private *i915,
>>> diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
>>> index 0cbcb9f54e7d..87d3c4f5b6c6 100644
>>> --- a/drivers/gpu/drm/i915/i915_gem.c
>>> +++ b/drivers/gpu/drm/i915/i915_gem.c
>>> @@ -1189,6 +1189,8 @@ void i915_gem_driver_remove(struct drm_i915_private *dev_priv)
>>> /* Flush any outstanding unpin_work. */
>>> i915_gem_drain_workqueue(dev_priv);
>>>
>>> + i915_gem_driver_remove__contexts(dev_priv);
>>> +
>>> i915_gem_drain_freed_objects(dev_priv);
>>> }
>>>
>>>
>
More information about the Intel-gfx
mailing list