[Intel-gfx] [PATCH 1/3] iosys-map: Add per-word read
Christian König
christian.koenig at amd.com
Sat Jun 11 08:16:51 UTC 2022
On 11.06.22 at 01:21, Lucas De Marchi wrote:
> Instead of always falling back to memcpy_fromio() for any size, prefer
> using read{b,w,l}(). When reading struct members it's common to read
> integer fields one at a time. Going through memcpy_fromio()
> for each of them carries a high penalty.
>
> Employ a similar trick as __seqprop() by using _Generic() to generate
> only the specific call based on a type-compatible variable.
>
> For a particular i915 workload producing GPU context switches,
> __get_engine_usage_record() is particularly hot since the engine usage
> is read from device local memory with dgfx, possibly multiple times
> since it's racy. Test execution time for this test shows a ~12.5%
> improvement with DG2:
>
> Before:
> nrepeats = 1000; min = 7.63243e+06; max = 1.01817e+07;
> median = 9.52548e+06; var = 526149;
> After:
> nrepeats = 1000; min = 7.03402e+06; max = 8.8832e+06;
> median = 8.33955e+06; var = 333113;
>
> Other things attempted that didn't prove very useful:
> 1) Change the _Generic() on x86 to just dereference the memory address
> 2) Change __get_engine_usage_record() to do just 1 read per loop,
> comparing with the previous value read
> 3) Change __get_engine_usage_record() to access the fields directly as it
> was before the conversion to iosys-map
>
> (3) did give a small improvement (~3%), but doesn't seem to scale well
> to other similar cases in the driver.
>
> Chris Wilson ran an additional test using gem_create from igt, with some
> changes to track object creation time. This happens to
> stress this code path:
>
> Pre iosys_map conversion of engine busyness:
> lmem0: Creating 262144 4KiB objects took 59274.2ms
>
> Unpatched:
> lmem0: Creating 262144 4KiB objects took 108830.2ms
>
> With readl (this patch):
> lmem0: Creating 262144 4KiB objects took 61348.6ms
>
> s/readl/READ_ONCE/
> lmem0: Creating 262144 4KiB objects took 61333.2ms
>
> So we do take a little bit more time than before the conversion, but
> that is due to other factors: bringing the READ_ONCE back would be as
> good as just doing this conversion.
>
> Signed-off-by: Lucas De Marchi <lucas.demarchi at intel.com>
Reviewed-by: Christian König <christian.koenig at amd.com> for the entire
series.
> ---
> include/linux/iosys-map.h | 26 ++++++++++++++++++++++----
> 1 file changed, 22 insertions(+), 4 deletions(-)
>
> diff --git a/include/linux/iosys-map.h b/include/linux/iosys-map.h
> index e69a002d5aa4..cd28c7a1b79c 100644
> --- a/include/linux/iosys-map.h
> +++ b/include/linux/iosys-map.h
> @@ -333,6 +333,20 @@ static inline void iosys_map_memset(struct iosys_map *dst, size_t offset,
> memset(dst->vaddr + offset, value, len);
> }
>
> +#ifdef CONFIG_64BIT
> +#define __iosys_map_rd_io_u64_case(val_, vaddr_iomem_) \
> +	u64: val_ = readq(vaddr_iomem_),
> +#else
> +#define __iosys_map_rd_io_u64_case(val_, vaddr_iomem_)
> +#endif
> +
> +#define __iosys_map_rd_io(val__, vaddr_iomem__, type__) _Generic(val__, \
> +	u8: val__ = readb(vaddr_iomem__), \
> +	u16: val__ = readw(vaddr_iomem__), \
> +	u32: val__ = readl(vaddr_iomem__), \
> +	__iosys_map_rd_io_u64_case(val__, vaddr_iomem__) \
> +	default: memcpy_fromio(&(val__), vaddr_iomem__, sizeof(val__)))
> +
> /**
> * iosys_map_rd - Read a C-type value from the iosys_map
> *
> @@ -346,10 +360,14 @@ static inline void iosys_map_memset(struct iosys_map *dst, size_t offset,
> * Returns:
> * The value read from the mapping.
> */
> -#define iosys_map_rd(map__, offset__, type__) ({ \
> -	type__ val; \
> -	iosys_map_memcpy_from(&val, map__, offset__, sizeof(val)); \
> -	val; \
> +#define iosys_map_rd(map__, offset__, type__) ({ \
> +	type__ val; \
> +	if ((map__)->is_iomem) { \
> +		__iosys_map_rd_io(val, (map__)->vaddr_iomem + offset__, type__); \
> +	} else { \
> +		memcpy(&val, (map__)->vaddr + offset__, sizeof(val)); \
> +	} \
> +	val; \
>  })
>
> /**
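
For anyone who wants to play with the _Generic() dispatch outside the
kernel, a minimal userspace sketch follows. Everything prefixed with
sketch_ is hypothetical and not part of the patch: the sketch_rd*()
helpers are plain volatile loads standing in for readb()/readw()/readl(),
sketch_rd_bytes() stands in for memcpy_fromio(), and struct sketch_map is
a cut-down stand-in for struct iosys_map. It uses do/while instead of a
GCC statement expression, so the value is stored into a variable rather
than returned.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-ins for readb()/readw()/readl(): one volatile load. */
static inline uint8_t sketch_rd8(const volatile void *p)
{
	return *(const volatile uint8_t *)p;
}

static inline uint16_t sketch_rd16(const volatile void *p)
{
	return *(const volatile uint16_t *)p;
}

static inline uint32_t sketch_rd32(const volatile void *p)
{
	return *(const volatile uint32_t *)p;
}

/* Hypothetical stand-in for memcpy_fromio(): bytewise copy. */
static inline void sketch_rd_bytes(void *dst, const volatile void *src, size_t n)
{
	unsigned char *d = dst;
	const volatile unsigned char *s = src;

	while (n--)
		*d++ = *s++;
}

/*
 * Only the association matching the type of val is evaluated, but the
 * others must still compile, so val has to be an integer type here.
 */
#define sketch_rd_io(val, addr) _Generic((val),				\
	uint8_t:  (val) = sketch_rd8(addr),				\
	uint16_t: (val) = sketch_rd16(addr),				\
	uint32_t: (val) = sketch_rd32(addr),				\
	default:  sketch_rd_bytes(&(val), (addr), sizeof(val)))

/* Hypothetical, simplified mapping descriptor (not struct iosys_map). */
struct sketch_map {
	void *vaddr;
	bool is_iomem;
};

/* Sized access for "I/O" mappings, plain memcpy() for system memory. */
#define sketch_map_rd(val, map, offset) do {				\
	unsigned char *p_ = (unsigned char *)(map)->vaddr + (offset);	\
	if ((map)->is_iomem)						\
		sketch_rd_io(val, p_);					\
	else								\
		memcpy(&(val), p_, sizeof(val));			\
} while (0)

/* Hypothetical record layout, used only for the usage example below. */
struct sketch_record {
	uint32_t ticks;
	uint16_t id;
	uint8_t flags;
	uint64_t stamp;
};

/* Usage example: read each field back through both paths. */
static inline int sketch_selftest(void)
{
	struct sketch_record rec = { 0xdeadbeefu, 0x1234, 0x5a, 0x0123456789abcdefULL };
	struct sketch_map map = { &rec, true };
	uint32_t ticks;
	uint16_t id;
	uint8_t flags;
	uint64_t stamp;

	sketch_map_rd(ticks, &map, offsetof(struct sketch_record, ticks));
	sketch_map_rd(id, &map, offsetof(struct sketch_record, id));
	sketch_map_rd(flags, &map, offsetof(struct sketch_record, flags));
	/* uint64_t has no sized case in this sketch: falls to the bytewise copy. */
	sketch_map_rd(stamp, &map, offsetof(struct sketch_record, stamp));
	assert(ticks == rec.ticks && id == rec.id);
	assert(flags == rec.flags && stamp == rec.stamp);

	map.is_iomem = false;	/* system-memory path */
	sketch_map_rd(ticks, &map, offsetof(struct sketch_record, ticks));
	assert(ticks == rec.ticks);
	return 0;
}
```

The 64-bit case is left on the default branch on purpose, which roughly
mirrors what the patch does when CONFIG_64BIT is not set.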