Screen corruption using radeon kernel driver

Thu Dec 1 14:06:19 UTC 2022

On Thu, Dec 1, 2022 at 9:01 AM Robin Murphy <robin.murphy at arm.com> wrote:
>
> On 2022-11-30 19:59, Mikhail Krylov wrote:
> > On Wed, Nov 30, 2022 at 11:07:32AM -0500, Alex Deucher wrote:
> >> On Wed, Nov 30, 2022 at 10:42 AM Robin Murphy <robin.murphy at arm.com> wrote:
> >>>
> >>> On 2022-11-30 14:28, Alex Deucher wrote:
> >>>> On Wed, Nov 30, 2022 at 7:54 AM Robin Murphy <robin.murphy at arm.com> wrote:
> >>>>>
> >>>>> On 2022-11-29 17:11, Mikhail Krylov wrote:
> >>>>>> On Tue, Nov 29, 2022 at 11:05:28AM -0500, Alex Deucher wrote:
> >>>>>>> On Tue, Nov 29, 2022 at 10:59 AM Mikhail Krylov <sqarert at gmail.com> wrote:
> >>>>>>>>
> >>>>>>>> On Tue, Nov 29, 2022 at 09:44:19AM -0500, Alex Deucher wrote:
> >>>>>>>>> On Mon, Nov 28, 2022 at 3:48 PM Mikhail Krylov <sqarert at gmail.com> wrote:
> >>>>>>>>>>
> >>>>>>>>>> On Mon, Nov 28, 2022 at 09:50:50AM -0500, Alex Deucher wrote:
> >>>>>>>>>>
> >>>>>>>>>>>>> [excessive quoting removed]
> >>>>>>>>>>
> >>>>>>>>>>>> So, is there any progress on this issue? I do understand it's not a high
> >>>>>>>>>>>> priority one, and today I've checked it on 6.0 kernel, and
> >>>>>>>>>>>> unfortunately, it still persists...
> >>>>>>>>>>>>
> >>>>>>>>>>>> I'm considering writing a patch that will allow user to override
> >>>>>>>>>>>> need_dma32/dma_bits setting with a module parameter. I'll have some time
> >>>>>>>>>>>> after the New Year for that.
> >>>>>>>>>>>>
> >>>>>>>>>>>> Is it at all possible that such a patch will be merged into kernel?
> >>>>>>>>>>>>
> >>>>>>>>>>> On Mon, Nov 28, 2022 at 9:31 AM Mikhail Krylov <sqarert at gmail.com> wrote:
> >>>>>>>>>>> Unless someone familiar with HIMEM can figure out what is going wrong
> >>>>>>>>>>> we should just revert the patch.
> >>>>>>>>>>>
> >>>>>>>>>>> Alex
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> Okay, I was suggesting that mostly because
> >>>>>>>>>>
> >>>>>>>>>> a) it works for me with dma_bits = 40 (I understand that's what it is
> >>>>>>>>>> without the original patch applied);
> >>>>>>>>>>
> >>>>>>>>>> b) there's a hint of uncertainty on this line
> >>>>>>>>>> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/gpu/drm/radeon/radeon_device.c#n1359
> >>>>>>>>>> saying that for AGP dma_bits = 32 is the safest option, so apparently there are
> >>>>>>>>>> setups, unlike mine, where dma_bits = 32 is better than 40.
> >>>>>>>>>>
> >>>>>>>>>> But I'm in no position to argue, just wanted to make myself clear.
> >>>>>>>>>> I'm okay with rebuilding the kernel for my machine until the original
> >>>>>>>>>> patch is reverted or any other fix is applied.
> >>>>>>>>>
> >>>>>>>>> What GPU do you have and is it AGP?  If it is AGP, does setting
> >>>>>>>>> radeon.agpmode=-1 also fix it?
> >>>>>>>>>
> >>>>>>>>> Alex
> >>>>>>>>
> >>>>>>>> That is ATI Radeon X1950, and, unfortunately, radeon.agpmode=-1 doesn't
> >>>>>>>> help, it just makes 3D acceleration in games such as OpenArena stop
> >>>>>>>> working.
> >>>>>>>
> >>>>>>> Just to confirm, is the board AGP or PCIe?
> >>>>>>>
> >>>>>>> Alex
> >>>>>>
> >>>>>> It is AGP. That's an old machine.
> >>>>>
> >>>>> Can you check whether dma_addressing_limited() is actually returning the
> >>>>> expected result at the point of radeon_ttm_init()? Disabling highmem is
> >>>>> presumably just hiding whatever problem exists, by throwing away all
> >>>>>    >32-bit RAM such that use_dma32 doesn't matter.
> >>>>
> >>>> The device in question only supports a 32 bit DMA mask so
> >>>> dma_addressing_limited() should return true.  Bounce buffers are not
> >>>> really usable on GPUs because they map so much memory.  If
> >>>> dma_addressing_limited() returns false, that would explain it.
> >>>
> >>> Right, it appears to be the only part of the offending commit that
> >>> *could* reasonably make any difference, so I'm primarily wondering if
> >>> dma_get_required_mask() somehow gets confused.
> >>
> >> Mikhail,
> >>
> >> Can you see what dma_addressing_limited() and dma_get_required_mask()
> >> return in this case?
> >>
> >> Alex
> >>
> >>
> >>>
> >>> Thanks,
> >>> Robin.
> >
> > Unfortunately, right now I don't have enough time for kernel
> > modifications and rebuilds (I will later!), so I did a quick-and-dirty
> > research with kprobe.
> >
> > The problem is that dma_addressing_limited() seems to be inlined and
> > kprobe fails to intercept it.
> >
> > But I managed to get the result of dma_get_required_mask(). It returns
> > 0x7fffffff (!) on the vanilla (with the patch, buggy) kernel:
> >
> > $ sudo kprobe-perf 'r:dma_get_required_mask $retval'
> > Tracing kprobe dma_get_required_mask. Ctrl-C to end.
> >          modprobe-1244    [000] d...   105.582816: dma_get_required_mask: (radeon_ttm_init+0x61/0x240 [radeon] <- dma_get_required_mask) arg1=0x7fffffff
> >
> > This function does not even get called in the kernel without the patch
> > that I built myself. I believe that's because ttm_bo_device_init()
> > doesn't call it without the patch.
> >
> > Hope that helps at least a bit. If not, I'll be able to do more thorough
> > research in a couple of weeks, probably.
>
> Hmm, just to clarify, what's your actual RAM layout? I've been assuming
> that the issue must be caused by unexpected DMA address truncation, but
> double-checking the older threads it seems that might not be the case.
> I just did a quick sanity-check of both HIGHMEM4G and HIGHMEM64G configs
> in a VM with either 2GB or 4GB of RAM assigned, and the
> dma_direct_get_required_mask() calculation seemed to return the
> appropriate result for all combinations.
>
> Otherwise, the only significant difference of use_dma32 seems to be to
> switch TTM's allocation flags from GFP_HIGHUSER to GFP_DMA32. Could it
> just be that the highmem support somewhere between TTM and radeon has
> bitrotted, and it hasn't been noticed until this change because everyone
> still using a 32-bit system with highmem also happens not to be using a
> newer 40-bit-capable GPU? Or perhaps it never worked for AGP at all, in
> which case an explicit special case might be clearer?

With AGP, the driver just sets up an aperture on the GPU to point to
the AGP aperture in the system.  The platform AGP drivers handle the
DMA mappings into their aperture.  It's possible the AGP drivers are
doing something wrong with respect to their DMA masks?

Alex

>
> diff --git a/drivers/gpu/drm/radeon/radeon_ttm.c b/drivers/gpu/drm/radeon/radeon_ttm.c
> index d33fec488713..acb2d534bff5 100644
> --- a/drivers/gpu/drm/radeon/radeon_ttm.c
> +++ b/drivers/gpu/drm/radeon/radeon_ttm.c
> @@ -696,6 +696,7 @@ int radeon_ttm_init(struct radeon_device *rdev)
>                                rdev->ddev->anon_inode->i_mapping,
>                                rdev->ddev->vma_offset_manager,
>                                rdev->need_swiotlb,
> +                              rdev->flags & RADEON_IS_AGP ||
>                                dma_addressing_limited(&rdev->pdev->dev));
>         if (r) {
>                 DRM_ERROR("failed initializing buffer object driver(%d).\n", r);
>
>
> Robin.
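
For reference, here is a rough userspace model of the check being discussed.
It is only a sketch: the helper names (fls64_model, required_mask_model,
addressing_limited_model) are made up for illustration, and the logic is a
simplification of what dma_direct_get_required_mask() and
dma_addressing_limited() do as far as I understand them (it ignores
bus_dma_limit and any DMA offset). It shows why a required mask of
0x7fffffff, as seen in the kprobe output above, would make the check come out
false for a device with a 32-bit mask.

```c
#include <stdint.h>

/* 1-based index of the highest set bit, 0 if x == 0
 * (same contract as the kernel's fls64(); this is a plain C stand-in). */
static int fls64_model(uint64_t x)
{
	int n = 0;

	while (x) {
		n++;
		x >>= 1;
	}
	return n;
}

/* Smallest all-ones mask that covers max_addr, the highest address
 * that may need to be DMA-mapped (assumed: no DMA offset). */
static uint64_t required_mask_model(uint64_t max_addr)
{
	if (max_addr == 0)
		return 0;
	return ((uint64_t)1 << (fls64_model(max_addr) - 1)) * 2 - 1;
}

/* Non-zero when the device mask cannot reach all of memory. */
static int addressing_limited_model(uint64_t dev_mask, uint64_t max_addr)
{
	return dev_mask < required_mask_model(max_addr);
}
```

Under these assumptions, addressing_limited_model(0xffffffffULL, 0x7fffffffULL)
is 0, because all memory is reachable with 32 bits, while a highest address
of 0x1ffffffffULL gives 1. If the real check behaves like this, use_dma32
ends up false whenever the reported required mask fits in 32 bits.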