[PATCH] drm/radeon: fix handling of radeon_vm_bo_rmv v3

Fri Jul 18 03:59:04 PDT 2014

Am 18.07.2014 11:44, schrieb Michel Dänzer:
> On 18.07.2014 15:56, Christian König wrote:
>> From: Christian König <christian.koenig at amd.com>
>>
>> v3: completely rewritten. We now just remember which areas
>>      of the PT to clear and do so on the next command submission.
>>
>> Bug: https://bugs.freedesktop.org/show_bug.cgi?id=79980
>>
>> Signed-off-by: Christian König <christian.koenig at amd.com>
> [...]
>
>> diff --git a/drivers/gpu/drm/radeon/radeon_vm.c b/drivers/gpu/drm/radeon/radeon_vm.c
>> index eecff6b..2726b46 100644
>> --- a/drivers/gpu/drm/radeon/radeon_vm.c
>> +++ b/drivers/gpu/drm/radeon/radeon_vm.c
>> @@ -468,6 +469,15 @@ int radeon_vm_bo_set_addr(struct radeon_device *rdev,
>>   		head = &tmp->vm_list;
>>   	}
>>   
>> +	if (bo_va->soffset) {
>> +		/* add a clone of the bo_va to clear the old address */
>> +		tmp = kzalloc(sizeof(struct radeon_bo_va), GFP_KERNEL);
>> +		tmp->soffset = bo_va->soffset;
>> +		tmp->eoffset = bo_va->eoffset;
>> +		tmp->vm = vm;
> +		INIT_LIST_HEAD(&tmp->vm_status);
>
> ?

Unnecessary, would be overwritten by the following list_add anyway.

>> +		list_add(&tmp->vm_status, &vm->freed);
>> +	}
> [...]
>
>> -int radeon_vm_bo_rmv(struct radeon_device *rdev,
>> -		     struct radeon_bo_va *bo_va)
>> +void radeon_vm_bo_rmv(struct radeon_device *rdev,
>> +		      struct radeon_bo_va *bo_va)
>>   {
>> -	int r = 0;
>> +	struct radeon_vm *vm = bo_va->vm;
>>   
>> -	mutex_lock(&bo_va->vm->mutex);
>> -	if (bo_va->soffset)
>> -		r = radeon_vm_bo_update(rdev, bo_va->vm, bo_va->bo, NULL);
>> +	list_del(&bo_va->bo_list);
>>   
>> +	mutex_lock(&vm->mutex);
>>   	list_del(&bo_va->vm_list);
>> -	mutex_unlock(&bo_va->vm->mutex);
>> -	list_del(&bo_va->bo_list);
> Was there any particular reason for moving the list_del(&bo_va->bo_list)
> outside of the VM mutex? I suspect this might be the cause of the
> problem below, which I encountered after a few piglit runs.

Ah! Well it's not the root cause of the problem cause we released that 
lock before as well. But it changed the timing quite a bit and so 
brought the problem to the surface.

The root cause is that we don't reserve the IB-BO on every command 
submission, and so a concurrent creating or tear-down of a VM could 
modify the bo_list without holding a lock.

Thanks for the Info, going to address this in a second patch.

Regards,
Christian.

>
>
> Other than those two issues, the patch looks good to me, and I haven't
> seen any piglit GPU lockups with it on my Bonaire yet.
>
>
>   general protection fault: 0000 [#1] SMP 153
>   Modules linked in: bnep bluetooth rfkill snd_hrtimer snd_seq snd_seq_device cpufreq_stats cpufreq_powersave cpufreq_userspace cpufreq_conservative binfmt_misc nfsd auth_rpcgss oid_registry nfs_acl nfs lockd fscache sunrpc hid_generic usbhid hid snd_hda_codec_hdmi evdev snd_hda_codec_realtek snd_hda_codec_generic nls_utf8 nls_cp437 vfat fat kvm_amd kvm crc32_pclmul crc32c_intel ghash_clmulni_intel aesni_intel aes_x86_64 lrw gf128mul glue_helper ablk_helper cryptd microcode ohci_pci psmouse serio_raw pcspkr edac_mce_amd k10temp edac_core r8169 mii snd_hda_intel snd_hda_controller snd_hda_codec snd_hwdep snd_pcm snd_timer radeon(O) ttm snd sg drm_kms_helper i2c_piix4 ohci_hcd ehci_pci drm ehci_hcd soundcore i2c_algo_bit i2c_core acpi_cpufreq xhci_hcd video usbcore usb_common processor wmi button thermal_sys fuse autofs4 ext4 crc16 mbcache jbd2 dm_mod sd_mod crc_t10dif crct10dif_generic crct10dif_pclmul crct10dif_common ahci libahci libata scsi_mod
>   CPU: 2 PID: 14721 Comm: shader_runner Tainted: G           O  3.16.0-rc5+ #143
>   Hardware name: System manufacturer System Product Name/A88X-PRO, BIOS 1001 04/01/2014
>   task: ffff8800083e4b10 ti: ffff880103bb8000 task.ti: ffff880103bb8000
>   RIP: 0010:[<ffffffffc0573672>]  [<ffffffffc0573672>] radeon_vm_bo_find+0x13/0x22 [radeon]
>   RSP: 0018:ffff880103bbbbf0  EFLAGS: 00010297
>   RAX: dead000000100100 RBX: ffff8800dba24000 RCX: ef7bdef7bdef7bdf
>   RDX: 000000000000a7b3 RSI: ffff8802120a9bb8 RDI: ffff8801f1d4f800
>   RBP: ffff880103bbbbf0 R08: 0000000000000000 R09: ffffffff00000000
>   R10: ffff880103bbba80 R11: ffff8800dba25eb0 R12: ffff880103bbbc18
>   R13: ffff8801f1d4f800 R14: ffff8800dba24000 R15: 0000000000000000
>   FS:  00007f640e4b7700(0000) GS:ffff88021e200000(0000) knlGS:0000000000000000
>   CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>   CR2: 00007f35bc858000 CR3: 00000001024a1000 CR4: 00000000000407e0
>   Stack:
>    ffff880103bbbdc0 ffffffffc04fb879 ffff8801f1d4f848 ffff880103bbbdf0
>    ffff8800dba24018 ffff880215228098 ffff8800dba24000 ffff880003207c00
>    0000000000000003 ffff8801f9c399c0 ffff8801fa08a120 0000001700000000
>   Call Trace:
>    [<ffffffffc04fb879>] radeon_cs_ioctl+0x4eb/0x677 [radeon]
>    [<ffffffffc042638e>] drm_ioctl+0x366/0x42c [drm]
>    [...]
>
>