[Mesa-dev] [PATCH] gallium/swr: update rasterizer (532172)
Tom Stellard
tom at stellard.net
Tue Mar 22 20:04:58 UTC 2016
On Tue, Mar 22, 2016 at 02:45:48PM -0500, Tim Rowley wrote:
> Highlights include:
> * code style fixes
> * start removing win32 types
> * switch DC/DS rings to a ring-buffer data structure
> * rdtsc bucket support for shaders
> * address some coverity issues
> * user clip planes
> * global arena
> * support llvm-svn
Is there some reason why all these changes are squashed into a single
patch?
-Tom
> ---
> src/gallium/docs/source/drivers/openswr/knobs.rst | 18 +-
> src/gallium/drivers/swr/Makefile.sources-arch | 2 +-
> .../drivers/swr/rasterizer/common/containers.hpp | 270 +++----
> src/gallium/drivers/swr/rasterizer/common/os.h | 44 +-
> .../swr/rasterizer/common/rdtsc_buckets.cpp | 18 +-
> .../drivers/swr/rasterizer/common/rdtsc_buckets.h | 9 +-
> .../swr/rasterizer/common/rdtsc_buckets_shared.h | 4 +-
> .../drivers/swr/rasterizer/common/simdintrin.h | 805 +++++++++++++--------
> src/gallium/drivers/swr/rasterizer/core/api.cpp | 308 ++++----
> src/gallium/drivers/swr/rasterizer/core/api.h | 56 +-
> src/gallium/drivers/swr/rasterizer/core/arena.cpp | 166 -----
> src/gallium/drivers/swr/rasterizer/core/arena.h | 310 +++++++-
> .../drivers/swr/rasterizer/core/backend.cpp | 241 ++----
> src/gallium/drivers/swr/rasterizer/core/backend.h | 173 ++++-
> src/gallium/drivers/swr/rasterizer/core/clip.cpp | 3 +
> src/gallium/drivers/swr/rasterizer/core/clip.h | 98 ++-
> src/gallium/drivers/swr/rasterizer/core/context.h | 45 +-
> .../drivers/swr/rasterizer/core/depthstencil.h | 6 +-
> src/gallium/drivers/swr/rasterizer/core/fifo.hpp | 6 +-
> .../swr/rasterizer/core/format_conversion.h | 4 +-
> .../drivers/swr/rasterizer/core/format_types.h | 32 +-
> .../drivers/swr/rasterizer/core/frontend.cpp | 95 ++-
> src/gallium/drivers/swr/rasterizer/core/frontend.h | 13 +-
> .../drivers/swr/rasterizer/core/knobs_init.h | 5 +
> src/gallium/drivers/swr/rasterizer/core/pa.h | 92 +--
> .../drivers/swr/rasterizer/core/rasterizer.cpp | 78 +-
> .../drivers/swr/rasterizer/core/ringbuffer.h | 102 +++
> src/gallium/drivers/swr/rasterizer/core/state.h | 10 +-
> .../drivers/swr/rasterizer/core/threads.cpp | 222 +-----
> src/gallium/drivers/swr/rasterizer/core/threads.h | 6 +-
> .../drivers/swr/rasterizer/core/tilemgr.cpp | 298 +++++++-
> src/gallium/drivers/swr/rasterizer/core/tilemgr.h | 121 +---
> src/gallium/drivers/swr/rasterizer/core/utils.cpp | 5 +
> src/gallium/drivers/swr/rasterizer/core/utils.h | 51 +-
> .../drivers/swr/rasterizer/jitter/JitManager.cpp | 4 +
> .../drivers/swr/rasterizer/jitter/JitManager.h | 8 +-
> .../drivers/swr/rasterizer/jitter/blend_jit.cpp | 8 +-
> .../drivers/swr/rasterizer/jitter/builder.cpp | 16 +-
> .../drivers/swr/rasterizer/jitter/builder.h | 6 +
> .../drivers/swr/rasterizer/jitter/builder_misc.cpp | 172 ++++-
> .../drivers/swr/rasterizer/jitter/builder_misc.h | 8 +-
> .../drivers/swr/rasterizer/jitter/fetch_jit.cpp | 72 +-
> .../jitter/scripts/gen_llvm_ir_macros.py | 21 +-
> .../rasterizer/jitter/scripts/gen_llvm_types.py | 2 +-
> .../swr/rasterizer/jitter/streamout_jit.cpp | 8 +-
> .../drivers/swr/rasterizer/memory/ClearTile.cpp | 14 +-
> .../drivers/swr/rasterizer/memory/Convert.h | 14 +-
> .../drivers/swr/rasterizer/memory/tilingtraits.h | 58 +-
> .../drivers/swr/rasterizer/scripts/gen_knobs.py | 2 +-
> .../drivers/swr/rasterizer/scripts/knob_defs.py | 73 +-
> .../rasterizer/scripts/templates/knobs.template | 8 +-
> src/gallium/drivers/swr/swr_context.cpp | 1 -
> 52 files changed, 2464 insertions(+), 1747 deletions(-)
> delete mode 100644 src/gallium/drivers/swr/rasterizer/core/arena.cpp
> create mode 100644 src/gallium/drivers/swr/rasterizer/core/ringbuffer.h
>
> diff --git a/src/gallium/docs/source/drivers/openswr/knobs.rst b/src/gallium/docs/source/drivers/openswr/knobs.rst
> index 06f228a..c26581d 100644
> --- a/src/gallium/docs/source/drivers/openswr/knobs.rst
> +++ b/src/gallium/docs/source/drivers/openswr/knobs.rst
> @@ -4,10 +4,6 @@
> OpenSWR has a number of environment variables which control its
> operation, in addition to the normal Mesa and gallium controls.
>
> -.. envvar:: KNOB_ENABLE_ASSERT_DIALOGS <bool> (true)
> -
> -Use dialogs when asserts fire. Asserts are only enabled in debug builds
> -
> .. envvar:: KNOB_SINGLE_THREADED <bool> (false)
>
> If enabled will perform all rendering on the API thread. This is useful mainly for debugging purposes.
> @@ -52,7 +48,7 @@ Frame at which to stop saving buckets data. NOTE: KNOB_ENABLE_RDTSC must be ena
>
> Number of spin-loop iterations worker threads will perform before going to sleep when waiting for work
>
> -.. envvar:: KNOB_MAX_DRAWS_IN_FLIGHT <uint32_t> (160)
> +.. envvar:: KNOB_MAX_DRAWS_IN_FLIGHT <uint32_t> (96)
>
> Maximum number of draws outstanding before API thread blocks.
>
> @@ -64,18 +60,6 @@ Maximum primitives in a single Draw(). Larger primitives are split into smaller
>
> Maximum primitives in a single Draw() with tessellation enabled. Larger primitives are split into smaller Draw calls. Should be a multiple of (vectorWidth).
>
> -.. envvar:: KNOB_MAX_FRAC_ODD_TESS_FACTOR <float> (63.0f)
> -
> -(DEBUG) Maximum tessellation factor for fractional-odd partitioning.
> -
> -.. envvar:: KNOB_MAX_FRAC_EVEN_TESS_FACTOR <float> (64.0f)
> -
> -(DEBUG) Maximum tessellation factor for fractional-even partitioning.
> -
> -.. envvar:: KNOB_MAX_INTEGER_TESS_FACTOR <uint32_t> (64)
> -
> -(DEBUG) Maximum tessellation factor for integer partitioning.
> -
> .. envvar:: KNOB_BUCKETS_ENABLE_THREADVIZ <bool> (false)
>
> Enable threadviz output.
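
For readers new to the knob mechanism: these variables become globals
generated by gen_knobs.py (touched later in this patch) and are read from
the environment at startup. A minimal sketch of that pattern, with a
hypothetical InitKnobU32 helper standing in for the generated code:

    #include <cstdlib>
    #include <cstdint>

    // Hypothetical sketch: override a knob's documented default from the
    // environment; gen_knobs.py generates the real equivalent.
    static uint32_t InitKnobU32(const char* name, uint32_t defaultValue)
    {
        const char* env = std::getenv(name);
        return env ? static_cast<uint32_t>(std::strtoul(env, nullptr, 0))
                   : defaultValue;
    }

    uint32_t KNOB_MAX_DRAWS_IN_FLIGHT = InitKnobU32("KNOB_MAX_DRAWS_IN_FLIGHT", 96);
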
> diff --git a/src/gallium/drivers/swr/Makefile.sources-arch b/src/gallium/drivers/swr/Makefile.sources-arch
> index 6c105f4..a04b120 100644
> --- a/src/gallium/drivers/swr/Makefile.sources-arch
> +++ b/src/gallium/drivers/swr/Makefile.sources-arch
> @@ -59,7 +59,6 @@ COMMON_CXX_SOURCES := \
> CORE_CXX_SOURCES := \
> rasterizer/core/api.cpp \
> rasterizer/core/api.h \
> - rasterizer/core/arena.cpp \
> rasterizer/core/arena.h \
> rasterizer/core/backend.cpp \
> rasterizer/core/backend.h \
> @@ -83,6 +82,7 @@ CORE_CXX_SOURCES := \
> rasterizer/core/rasterizer.h \
> rasterizer/core/rdtsc_core.cpp \
> rasterizer/core/rdtsc_core.h \
> + rasterizer/core/ringbuffer.h \
> rasterizer/core/state.h \
> rasterizer/core/threads.cpp \
> rasterizer/core/threads.h \
> diff --git a/src/gallium/drivers/swr/rasterizer/common/containers.hpp b/src/gallium/drivers/swr/rasterizer/common/containers.hpp
> index bc96c5f..f3c0597 100644
> --- a/src/gallium/drivers/swr/rasterizer/common/containers.hpp
> +++ b/src/gallium/drivers/swr/rasterizer/common/containers.hpp
> @@ -33,137 +33,137 @@ namespace SWRL
> template <typename T, int NUM_ELEMENTS>
> struct UncheckedFixedVector
> {
> - UncheckedFixedVector() : mSize(0)
> - {
> - }
> -
> - UncheckedFixedVector(std::size_t size, T const& exemplar)
> - {
> - this->mSize = 0;
> - for (std::size_t i = 0; i < size; ++i)
> - this->push_back(exemplar);
> - }
> -
> - template <typename Iter>
> - UncheckedFixedVector(Iter fst, Iter lst)
> - {
> - this->mSize = 0;
> - for ( ; fst != lst; ++fst)
> - this->push_back(*fst);
> - }
> -
> - UncheckedFixedVector(UncheckedFixedVector const& UFV)
> - {
> - this->mSize = 0;
> - for (std::size_t i = 0, N = UFV.size(); i < N; ++i)
> - (*this)[i] = UFV[i];
> - this->mSize = UFV.size();
> - }
> -
> - UncheckedFixedVector& operator=(UncheckedFixedVector const& UFV)
> - {
> - for (std::size_t i = 0, N = UFV.size(); i < N; ++i)
> - (*this)[i] = UFV[i];
> - this->mSize = UFV.size();
> - return *this;
> - }
> -
> - T* begin() { return &this->mElements[0]; }
> - T* end() { return &this->mElements[0] + this->mSize; }
> - T const* begin() const { return &this->mElements[0]; }
> - T const* end() const { return &this->mElements[0] + this->mSize; }
> -
> - friend bool operator==(UncheckedFixedVector const& L, UncheckedFixedVector const& R)
> - {
> - if (L.size() != R.size()) return false;
> - for (std::size_t i = 0, N = L.size(); i < N; ++i)
> - {
> - if (L[i] != R[i]) return false;
> - }
> - return true;
> - }
> -
> - friend bool operator!=(UncheckedFixedVector const& L, UncheckedFixedVector const& R)
> - {
> - if (L.size() != R.size()) return true;
> - for (std::size_t i = 0, N = L.size(); i < N; ++i)
> - {
> - if (L[i] != R[i]) return true;
> - }
> - return false;
> - }
> -
> - T& operator[](std::size_t idx)
> - {
> - return this->mElements[idx];
> - }
> - T const& operator[](std::size_t idx) const
> - {
> - return this->mElements[idx];
> - }
> - void push_back(T const& t)
> - {
> - this->mElements[this->mSize] = t;
> - ++this->mSize;
> - }
> - void pop_back()
> - {
> - SWR_ASSERT(this->mSize > 0);
> - --this->mSize;
> - }
> - T& back()
> - {
> - return this->mElements[this->mSize-1];
> - }
> - T const& back() const
> - {
> - return this->mElements[this->mSize-1];
> - }
> - bool empty() const
> - {
> - return this->mSize == 0;
> - }
> - std::size_t size() const
> - {
> - return this->mSize;
> - }
> - void resize(std::size_t sz)
> - {
> - this->mSize = sz;
> - }
> - void clear()
> - {
> - this->resize(0);
> - }
> + UncheckedFixedVector() : mSize(0)
> + {
> + }
> +
> + UncheckedFixedVector(std::size_t size, T const& exemplar)
> + {
> + this->mSize = 0;
> + for (std::size_t i = 0; i < size; ++i)
> + this->push_back(exemplar);
> + }
> +
> + template <typename Iter>
> + UncheckedFixedVector(Iter fst, Iter lst)
> + {
> + this->mSize = 0;
> + for ( ; fst != lst; ++fst)
> + this->push_back(*fst);
> + }
> +
> + UncheckedFixedVector(UncheckedFixedVector const& UFV)
> + {
> + this->mSize = 0;
> + for (std::size_t i = 0, N = UFV.size(); i < N; ++i)
> + (*this)[i] = UFV[i];
> + this->mSize = UFV.size();
> + }
> +
> + UncheckedFixedVector& operator=(UncheckedFixedVector const& UFV)
> + {
> + for (std::size_t i = 0, N = UFV.size(); i < N; ++i)
> + (*this)[i] = UFV[i];
> + this->mSize = UFV.size();
> + return *this;
> + }
> +
> + T* begin() { return &this->mElements[0]; }
> + T* end() { return &this->mElements[0] + this->mSize; }
> + T const* begin() const { return &this->mElements[0]; }
> + T const* end() const { return &this->mElements[0] + this->mSize; }
> +
> + friend bool operator==(UncheckedFixedVector const& L, UncheckedFixedVector const& R)
> + {
> + if (L.size() != R.size()) return false;
> + for (std::size_t i = 0, N = L.size(); i < N; ++i)
> + {
> + if (L[i] != R[i]) return false;
> + }
> + return true;
> + }
> +
> + friend bool operator!=(UncheckedFixedVector const& L, UncheckedFixedVector const& R)
> + {
> + if (L.size() != R.size()) return true;
> + for (std::size_t i = 0, N = L.size(); i < N; ++i)
> + {
> + if (L[i] != R[i]) return true;
> + }
> + return false;
> + }
> +
> + T& operator[](std::size_t idx)
> + {
> + return this->mElements[idx];
> + }
> + T const& operator[](std::size_t idx) const
> + {
> + return this->mElements[idx];
> + }
> + void push_back(T const& t)
> + {
> + this->mElements[this->mSize] = t;
> + ++this->mSize;
> + }
> + void pop_back()
> + {
> + SWR_ASSERT(this->mSize > 0);
> + --this->mSize;
> + }
> + T& back()
> + {
> + return this->mElements[this->mSize-1];
> + }
> + T const& back() const
> + {
> + return this->mElements[this->mSize-1];
> + }
> + bool empty() const
> + {
> + return this->mSize == 0;
> + }
> + std::size_t size() const
> + {
> + return this->mSize;
> + }
> + void resize(std::size_t sz)
> + {
> + this->mSize = sz;
> + }
> + void clear()
> + {
> + this->resize(0);
> + }
> private:
> - std::size_t mSize;
> - T mElements[NUM_ELEMENTS];
> + std::size_t mSize{ 0 };
> + T mElements[NUM_ELEMENTS];
> };
>
> template <typename T, int NUM_ELEMENTS>
> struct FixedStack : UncheckedFixedVector<T, NUM_ELEMENTS>
> {
> - FixedStack() {}
> -
> - void push(T const& t)
> - {
> - this->push_back(t);
> - }
> -
> - void pop()
> - {
> - this->pop_back();
> - }
> -
> - T& top()
> - {
> - return this->back();
> - }
> -
> - T const& top() const
> - {
> - return this->back();
> - }
> + FixedStack() {}
> +
> + void push(T const& t)
> + {
> + this->push_back(t);
> + }
> +
> + void pop()
> + {
> + this->pop_back();
> + }
> +
> + T& top()
> + {
> + return this->back();
> + }
> +
> + T const& top() const
> + {
> + return this->back();
> + }
> };
>
> template <typename T>
> @@ -190,16 +190,16 @@ namespace std
> template <typename T, int N>
> struct hash<SWRL::UncheckedFixedVector<T, N>>
> {
> - size_t operator() (SWRL::UncheckedFixedVector<T, N> const& v) const
> - {
> - if (v.size() == 0) return 0;
> - std::hash<T> H;
> - size_t x = H(v[0]);
> - if (v.size() == 1) return x;
> - for (size_t i = 1; i < v.size(); ++i)
> - x ^= H(v[i]) + 0x9e3779b9 + (x<<6) + (x>>2);
> - return x;
> - }
> + size_t operator() (SWRL::UncheckedFixedVector<T, N> const& v) const
> + {
> + if (v.size() == 0) return 0;
> + std::hash<T> H;
> + size_t x = H(v[0]);
> + if (v.size() == 1) return x;
> + for (size_t i = 1; i < v.size(); ++i)
> + x ^= H(v[i]) + 0x9e3779b9 + (x<<6) + (x>>2);
> + return x;
> + }
> };
>
>
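The mixing step kept above, x ^= H(v[i]) + 0x9e3779b9 + (x<<6) + (x>>2),
is the usual boost-style hash_combine (this hunk is whitespace-only). As a
standalone reference:

    #include <cstddef>
    #include <functional>

    // Boost-style hash_combine; 0x9e3779b9 is the 32-bit golden-ratio
    // constant, and the shifts spread entropy across the word.
    template <typename T>
    inline void hash_combine(std::size_t& seed, const T& value)
    {
        seed ^= std::hash<T>{}(value) + 0x9e3779b9 + (seed << 6) + (seed >> 2);
    }
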
> diff --git a/src/gallium/drivers/swr/rasterizer/common/os.h b/src/gallium/drivers/swr/rasterizer/common/os.h
> index 522ae0d..5794f3f 100644
> --- a/src/gallium/drivers/swr/rasterizer/common/os.h
> +++ b/src/gallium/drivers/swr/rasterizer/common/os.h
> @@ -47,16 +47,18 @@
> #define DEBUGBREAK __debugbreak()
>
> #define PRAGMA_WARNING_PUSH_DISABLE(...) \
> - __pragma(warning(push));\
> - __pragma(warning(disable:__VA_ARGS__));
> + __pragma(warning(push));\
> + __pragma(warning(disable:__VA_ARGS__));
>
> #define PRAGMA_WARNING_POP() __pragma(warning(pop))
>
> #if defined(_WIN32)
> #if defined(_WIN64)
> +#define BitScanReverseSizeT BitScanReverse64
> #define BitScanForwardSizeT BitScanForward64
> #define _mm_popcount_sizeT _mm_popcnt_u64
> #else
> +#define BitScanReverseSizeT BitScanReverse
> #define BitScanForwardSizeT BitScanForward
> #define _mm_popcount_sizeT _mm_popcnt_u32
> #endif
> @@ -68,29 +70,20 @@
>
> #include <stdlib.h>
> #include <string.h>
> -#include <X11/Xmd.h>
> #include <x86intrin.h>
> #include <stdint.h>
> #include <sys/types.h>
> #include <unistd.h>
> #include <sys/stat.h>
> +#include <stdio.h>
>
> -typedef void VOID;
> +typedef void VOID;
> typedef void* LPVOID;
> -typedef CARD8 BOOL;
> -typedef wchar_t WCHAR;
> -typedef uint16_t UINT16;
> -typedef int INT;
> -typedef unsigned int UINT;
> -typedef uint32_t UINT32;
> -typedef uint64_t UINT64;
> -typedef int64_t INT64;
> -typedef void* HANDLE;
> -typedef float FLOAT;
> -typedef int LONG;
> -typedef CARD8 BYTE;
> -typedef unsigned char UCHAR;
> -typedef unsigned int DWORD;
> +typedef int INT;
> +typedef unsigned int UINT;
> +typedef void* HANDLE;
> +typedef int LONG;
> +typedef unsigned int DWORD;
>
> #undef FALSE
> #define FALSE 0
> @@ -104,8 +97,11 @@ typedef unsigned int DWORD;
> #define INLINE __inline
> #endif
> #define DEBUGBREAK asm ("int $3")
> +#if !defined(__CYGWIN__)
> #define __cdecl
> +#define __stdcall
> #define __declspec(X)
> +#endif
>
> #define GCC_VERSION (__GNUC__ * 10000 \
> + __GNUC_MINOR__ * 100 \
> @@ -180,21 +176,13 @@ unsigned char _bittest(const LONG *a, LONG b)
>
> #define CreateDirectory(name, pSecurity) mkdir(name, 0777)
>
> -#if defined(_WIN32)
> -static inline
> -unsigned int _mm_popcnt_u32(unsigned int v)
> -{
> - return __builtin_popcount(v);
> -}
> -#endif
> -
> #define _aligned_free free
> #define InterlockedCompareExchange(Dest, Exchange, Comparand) __sync_val_compare_and_swap(Dest, Comparand, Exchange)
> #define InterlockedExchangeAdd(Addend, Value) __sync_fetch_and_add(Addend, Value)
> #define InterlockedDecrement(Append) __sync_sub_and_fetch(Append, 1)
> +#define InterlockedDecrement64(Append) __sync_sub_and_fetch(Append, 1)
> #define InterlockedIncrement(Append) __sync_add_and_fetch(Append, 1)
> #define _ReadWriteBarrier() asm volatile("" ::: "memory")
> -#define __stdcall
>
> #define PRAGMA_WARNING_PUSH_DISABLE(...)
> #define PRAGMA_WARNING_POP()
> @@ -206,7 +194,7 @@ unsigned int _mm_popcnt_u32(unsigned int v)
> #endif
>
> // Universal types
> -typedef BYTE KILOBYTE[1024];
> +typedef uint8_t KILOBYTE[1024];
> typedef KILOBYTE MEGABYTE[1024];
> typedef MEGABYTE GIGABYTE[1024];
>
> diff --git a/src/gallium/drivers/swr/rasterizer/common/rdtsc_buckets.cpp b/src/gallium/drivers/swr/rasterizer/common/rdtsc_buckets.cpp
> index 454641b..c6768b4 100644
> --- a/src/gallium/drivers/swr/rasterizer/common/rdtsc_buckets.cpp
> +++ b/src/gallium/drivers/swr/rasterizer/common/rdtsc_buckets.cpp
> @@ -64,12 +64,14 @@ void BucketManager::RegisterThread(const std::string& name)
>
> UINT BucketManager::RegisterBucket(const BUCKET_DESC& desc)
> {
> + mThreadMutex.lock();
> size_t id = mBuckets.size();
> mBuckets.push_back(desc);
> + mThreadMutex.unlock();
> return (UINT)id;
> }
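
Since this grabs a mutex around a container op anyway, a scoped lock would
stay correct across early returns or a throwing push_back -- assuming
mThreadMutex is a std::mutex:

    UINT BucketManager::RegisterBucket(const BUCKET_DESC& desc)
    {
        std::lock_guard<std::mutex> lock(mThreadMutex); // released on all paths
        size_t id = mBuckets.size();
        mBuckets.push_back(desc);
        return (UINT)id;
    }
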
>
> -void BucketManager::PrintBucket(FILE* f, UINT level, UINT64 threadCycles, UINT64 parentCycles, const BUCKET& bucket)
> +void BucketManager::PrintBucket(FILE* f, UINT level, uint64_t threadCycles, uint64_t parentCycles, const BUCKET& bucket)
> {
> const char *arrows[] = {
> "",
> @@ -88,7 +90,7 @@ void BucketManager::PrintBucket(FILE* f, UINT level, UINT64 threadCycles, UINT64
> float percentParent = (float)((double)bucket.elapsed / (double)parentCycles * 100.0);
>
> // compute average cycle count per invocation
> - UINT64 CPE = bucket.elapsed / bucket.count;
> + uint64_t CPE = bucket.elapsed / bucket.count;
>
> BUCKET_DESC &desc = mBuckets[bucket.id];
>
> @@ -127,7 +129,7 @@ void BucketManager::PrintThread(FILE* f, const BUCKET_THREAD& thread)
>
> // compute thread level total cycle counts across all buckets from root
> const BUCKET& root = thread.root;
> - UINT64 totalCycles = 0;
> + uint64_t totalCycles = 0;
> for (const BUCKET& child : root.children)
> {
> totalCycles += child.elapsed;
> @@ -186,3 +188,13 @@ void BucketManager::PrintReport(const std::string& filename)
> fclose(f);
> }
> }
> +
> +void BucketManager_StartBucket(BucketManager* pBucketMgr, uint32_t id)
> +{
> + pBucketMgr->StartBucket(id);
> +}
> +
> +void BucketManager_StopBucket(BucketManager* pBucketMgr, uint32_t id)
> +{
> + pBucketMgr->StopBucket(id);
> +}
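
These free functions give jitted shader code a plain C entry point into
the C++ bucket manager; generated code can call them through raw function
pointers rather than member functions. A hypothetical caller, purely for
illustration:

    // Hypothetical illustration (not from this patch): jitted code is
    // handed raw addresses, so member functions won't do.
    typedef void (*PFN_BUCKET)(BucketManager*, uint32_t);

    void timedRegion(BucketManager* pMgr, uint32_t id,
                     PFN_BUCKET pfnStart, PFN_BUCKET pfnStop)
    {
        pfnStart(pMgr, id);  // BucketManager_StartBucket
        // ... instrumented shader body ...
        pfnStop(pMgr, id);   // BucketManager_StopBucket
    }
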
> diff --git a/src/gallium/drivers/swr/rasterizer/common/rdtsc_buckets.h b/src/gallium/drivers/swr/rasterizer/common/rdtsc_buckets.h
> index 99cb10e..9dfa7f6 100644
> --- a/src/gallium/drivers/swr/rasterizer/common/rdtsc_buckets.h
> +++ b/src/gallium/drivers/swr/rasterizer/common/rdtsc_buckets.h
> @@ -70,7 +70,9 @@ public:
> // removes all registered buckets
> void ClearBuckets()
> {
> + mThreadMutex.lock();
> mBuckets.clear();
> + mThreadMutex.unlock();
> }
>
> /// Registers a new thread with the manager.
> @@ -209,7 +211,7 @@ public:
> }
>
> private:
> - void PrintBucket(FILE* f, UINT level, UINT64 threadCycles, UINT64 parentCycles, const BUCKET& bucket);
> + void PrintBucket(FILE* f, UINT level, uint64_t threadCycles, uint64_t parentCycles, const BUCKET& bucket);
> void PrintThread(FILE* f, const BUCKET_THREAD& thread);
>
> // list of active threads that have registered with this manager
> @@ -227,3 +229,8 @@ private:
> bool mThreadViz{ false };
> std::string mThreadVizDir;
> };
> +
> +
> +// C helpers for jitter
> +void BucketManager_StartBucket(BucketManager* pBucketMgr, uint32_t id);
> +void BucketManager_StopBucket(BucketManager* pBucketMgr, uint32_t id);
> diff --git a/src/gallium/drivers/swr/rasterizer/common/rdtsc_buckets_shared.h b/src/gallium/drivers/swr/rasterizer/common/rdtsc_buckets_shared.h
> index 41c6d5d..34c322e 100644
> --- a/src/gallium/drivers/swr/rasterizer/common/rdtsc_buckets_shared.h
> +++ b/src/gallium/drivers/swr/rasterizer/common/rdtsc_buckets_shared.h
> @@ -64,13 +64,13 @@ struct BUCKET_THREAD
> std::string name;
>
> // id for this thread, assigned by the thread manager
> - uint32_t id;
> + uint32_t id{ 0 };
>
> // root of the bucket hierarchy for this thread
> BUCKET root;
>
> // currently executing bucket somewhere in the hierarchy
> - BUCKET* pCurrent;
> + BUCKET* pCurrent{ nullptr };
>
> // currently executing hierarchy level
> uint32_t level{ 0 };
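
The { 0 } / { nullptr } initializers are C++11 non-static data member
initializers: every constructor that doesn't set id or pCurrent explicitly
now zeroes them, which presumably addresses some of the Coverity
uninitialized-member reports mentioned in the changelog. In miniature:

    struct Example
    {
        uint32_t id{ 0 };            // zeroed even by the implicit Example()
        void*    pCurrent{ nullptr };
    };
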
> diff --git a/src/gallium/drivers/swr/rasterizer/common/simdintrin.h b/src/gallium/drivers/swr/rasterizer/common/simdintrin.h
> index 8fa6d9e..fa792b4 100644
> --- a/src/gallium/drivers/swr/rasterizer/common/simdintrin.h
> +++ b/src/gallium/drivers/swr/rasterizer/common/simdintrin.h
> @@ -43,14 +43,14 @@ typedef uint8_t simdmask;
> // simd vector
> OSALIGNSIMD(union) simdvector
> {
> - simdscalar v[4];
> - struct
> - {
> - simdscalar x, y, z, w;
> - };
> -
> - simdscalar& operator[] (const int i) { return v[i]; }
> - const simdscalar& operator[] (const int i) const { return v[i]; }
> + simdscalar v[4];
> + struct
> + {
> + simdscalar x, y, z, w;
> + };
> +
> + simdscalar& operator[] (const int i) { return v[i]; }
> + const simdscalar& operator[] (const int i) const { return v[i]; }
> };
>
> #if KNOB_SIMD_WIDTH == 8
> @@ -59,8 +59,8 @@ OSALIGNSIMD(union) simdvector
> #define _simd_load1_ps _mm256_broadcast_ss
> #define _simd_loadu_ps _mm256_loadu_ps
> #define _simd_setzero_ps _mm256_setzero_ps
> -#define _simd_set1_ps _mm256_set1_ps
> -#define _simd_blend_ps _mm256_blend_ps
> +#define _simd_set1_ps _mm256_set1_ps
> +#define _simd_blend_ps _mm256_blend_ps
> #define _simd_blendv_ps _mm256_blendv_ps
> #define _simd_store_ps _mm256_store_ps
> #define _simd_mul_ps _mm256_mul_ps
> @@ -100,21 +100,156 @@ OSALIGNSIMD(union) simdvector
> INLINE \
> __m256i func(__m256i a, __m256i b)\
> {\
> - __m128i aHi = _mm256_extractf128_si256(a, 1);\
> - __m128i bHi = _mm256_extractf128_si256(b, 1);\
> - __m128i aLo = _mm256_castsi256_si128(a);\
> - __m128i bLo = _mm256_castsi256_si128(b);\
> + __m128i aHi = _mm256_extractf128_si256(a, 1);\
> + __m128i bHi = _mm256_extractf128_si256(b, 1);\
> + __m128i aLo = _mm256_castsi256_si128(a);\
> + __m128i bLo = _mm256_castsi256_si128(b);\
> \
> - __m128i subLo = intrin(aLo, bLo);\
> - __m128i subHi = intrin(aHi, bHi);\
> + __m128i subLo = intrin(aLo, bLo);\
> + __m128i subHi = intrin(aHi, bHi);\
> \
> - __m256i result = _mm256_castsi128_si256(subLo);\
> - result = _mm256_insertf128_si256(result, subHi, 1);\
> + __m256i result = _mm256_castsi128_si256(subLo);\
> + result = _mm256_insertf128_si256(result, subHi, 1);\
> \
> - return result;\
> + return result;\
> }
>
> #if (KNOB_ARCH == KNOB_ARCH_AVX)
> +INLINE
> +__m256 _simdemu_permute_ps(__m256 a, __m256i b)
> +{
> + __m128 aHi = _mm256_extractf128_ps(a, 1);
> + __m128i bHi = _mm256_extractf128_si256(b, 1);
> + __m128 aLo = _mm256_castps256_ps128(a);
> + __m128i bLo = _mm256_castsi256_si128(b);
> +
> + __m128i indexHi = _mm_cmpgt_epi32(bLo, _mm_set1_epi32(3));
> + __m128 resLow = _mm_permutevar_ps(aLo, _mm_and_si128(bLo, _mm_set1_epi32(0x3)));
> + __m128 resHi = _mm_permutevar_ps(aHi, _mm_and_si128(bLo, _mm_set1_epi32(0x3)));
> + __m128 blendLowRes = _mm_blendv_ps(resLow, resHi, _mm_castsi128_ps(indexHi));
> +
> + indexHi = _mm_cmpgt_epi32(bHi, _mm_set1_epi32(3));
> + resLow = _mm_permutevar_ps(aLo, _mm_and_si128(bHi, _mm_set1_epi32(0x3)));
> + resHi = _mm_permutevar_ps(aHi, _mm_and_si128(bHi, _mm_set1_epi32(0x3)));
> + __m128 blendHiRes = _mm_blendv_ps(resLow, resHi, _mm_castsi128_ps(indexHi));
> +
> + __m256 result = _mm256_castps128_ps256(blendLowRes);
> + result = _mm256_insertf128_ps(result, blendHiRes, 1);
> +
> + return result;
> +}
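
For reference, this emulates an 8-lane cross-lane permute (out[i] = a[b[i]]
for indices 0..7, matching AVX2's _mm256_permutevar8x32_ps). Since
_mm_permutevar_ps can only select within a 128-bit half, both halves are
permuted with the index masked to 0..3 and the results blended on whether
the index exceeds 3. Scalar model:

    // Scalar model of _simd_permute_ps for indices in [0, 7]:
    void permute8_ref(float out[8], const float a[8], const int b[8])
    {
        for (int i = 0; i < 8; ++i)
            out[i] = a[b[i]];
    }
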
> +
> +INLINE
> +__m256i _simdemu_srlv_epi32(__m256i vA, __m256i vCount)
> +{
> + int32_t aHi, aLow, countHi, countLow;
> + __m128i vAHi = _mm_castps_si128(_mm256_extractf128_ps(_mm256_castsi256_ps(vA), 1));
> + __m128i vALow = _mm_castps_si128(_mm256_extractf128_ps(_mm256_castsi256_ps(vA), 0));
> + __m128i vCountHi = _mm_castps_si128(_mm256_extractf128_ps(_mm256_castsi256_ps(vCount), 1));
> + __m128i vCountLow = _mm_castps_si128(_mm256_extractf128_ps(_mm256_castsi256_ps(vCount), 0));
> +
> + aHi = _mm_extract_epi32(vAHi, 0);
> + countHi = _mm_extract_epi32(vCountHi, 0);
> + aHi >>= countHi;
> + vAHi = _mm_insert_epi32(vAHi, aHi, 0);
> +
> + aLow = _mm_extract_epi32(vALow, 0);
> + countLow = _mm_extract_epi32(vCountLow, 0);
> + aLow >>= countLow;
> + vALow = _mm_insert_epi32(vALow, aLow, 0);
> +
> + aHi = _mm_extract_epi32(vAHi, 1);
> + countHi = _mm_extract_epi32(vCountHi, 1);
> + aHi >>= countHi;
> + vAHi = _mm_insert_epi32(vAHi, aHi, 1);
> +
> + aLow = _mm_extract_epi32(vALow, 1);
> + countLow = _mm_extract_epi32(vCountLow, 1);
> + aLow >>= countLow;
> + vALow = _mm_insert_epi32(vALow, aLow, 1);
> +
> + aHi = _mm_extract_epi32(vAHi, 2);
> + countHi = _mm_extract_epi32(vCountHi, 2);
> + aHi >>= countHi;
> + vAHi = _mm_insert_epi32(vAHi, aHi, 2);
> +
> + aLow = _mm_extract_epi32(vALow, 2);
> + countLow = _mm_extract_epi32(vCountLow, 2);
> + aLow >>= countLow;
> + vALow = _mm_insert_epi32(vALow, aLow, 2);
> +
> + aHi = _mm_extract_epi32(vAHi, 3);
> + countHi = _mm_extract_epi32(vCountHi, 3);
> + aHi >>= countHi;
> + vAHi = _mm_insert_epi32(vAHi, aHi, 3);
> +
> + aLow = _mm_extract_epi32(vALow, 3);
> + countLow = _mm_extract_epi32(vCountLow, 3);
> + aLow >>= countLow;
> + vALow = _mm_insert_epi32(vALow, aLow, 3);
> +
> + __m256i ret = _mm256_set1_epi32(0);
> + ret = _mm256_insertf128_si256(ret, vAHi, 1);
> + ret = _mm256_insertf128_si256(ret, vALow, 0);
> + return ret;
> +}
> +
> +
> +INLINE
> +__m256i _simdemu_sllv_epi32(__m256i vA, __m256i vCount)
> +{
> + int32_t aHi, aLow, countHi, countLow;
> + __m128i vAHi = _mm_castps_si128(_mm256_extractf128_ps(_mm256_castsi256_ps(vA), 1));
> + __m128i vALow = _mm_castps_si128(_mm256_extractf128_ps(_mm256_castsi256_ps(vA), 0));
> + __m128i vCountHi = _mm_castps_si128(_mm256_extractf128_ps(_mm256_castsi256_ps(vCount), 1));
> + __m128i vCountLow = _mm_castps_si128(_mm256_extractf128_ps(_mm256_castsi256_ps(vCount), 0));
> +
> + aHi = _mm_extract_epi32(vAHi, 0);
> + countHi = _mm_extract_epi32(vCountHi, 0);
> + aHi <<= countHi;
> + vAHi = _mm_insert_epi32(vAHi, aHi, 0);
> +
> + aLow = _mm_extract_epi32(vALow, 0);
> + countLow = _mm_extract_epi32(vCountLow, 0);
> + aLow <<= countLow;
> + vALow = _mm_insert_epi32(vALow, aLow, 0);
> +
> + aHi = _mm_extract_epi32(vAHi, 1);
> + countHi = _mm_extract_epi32(vCountHi, 1);
> + aHi <<= countHi;
> + vAHi = _mm_insert_epi32(vAHi, aHi, 1);
> +
> + aLow = _mm_extract_epi32(vALow, 1);
> + countLow = _mm_extract_epi32(vCountLow, 1);
> + aLow <<= countLow;
> + vALow = _mm_insert_epi32(vALow, aLow, 1);
> +
> + aHi = _mm_extract_epi32(vAHi, 2);
> + countHi = _mm_extract_epi32(vCountHi, 2);
> + aHi <<= countHi;
> + vAHi = _mm_insert_epi32(vAHi, aHi, 2);
> +
> + aLow = _mm_extract_epi32(vALow, 2);
> + countLow = _mm_extract_epi32(vCountLow, 2);
> + aLow <<= countLow;
> + vALow = _mm_insert_epi32(vALow, aLow, 2);
> +
> + aHi = _mm_extract_epi32(vAHi, 3);
> + countHi = _mm_extract_epi32(vCountHi, 3);
> + aHi <<= countHi;
> + vAHi = _mm_insert_epi32(vAHi, aHi, 3);
> +
> + aLow = _mm_extract_epi32(vALow, 3);
> + countLow = _mm_extract_epi32(vCountLow, 3);
> + aLow <<= countLow;
> + vALow = _mm_insert_epi32(vALow, aLow, 3);
> +
> + __m256i ret = _mm256_set1_epi32(0);
> + ret = _mm256_insertf128_si256(ret, vAHi, 1);
> + ret = _mm256_insertf128_si256(ret, vALow, 0);
> + return ret;
> +}
> +
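Both helpers emulate AVX2's per-lane variable shifts (vpsrlvd/vpsllvd) by
extracting each 32-bit lane, shifting in scalar code, and reinserting. One
observation: _simdemu_srlv_epi32 shifts a signed int32_t, i.e. an
arithmetic shift, while _mm256_srlv_epi32 is a logical shift and yields 0
for counts above 31 (scalar >> with such counts is undefined). That's fine
if callers guarantee small non-negative inputs; for reference, the AVX2
per-lane semantics are:

    // What _mm256_srlv_epi32 computes in each 32-bit lane:
    uint32_t srlv_ref(uint32_t a, uint32_t count)
    {
        return (count > 31) ? 0u : (a >> count);  // logical shift
    }
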
> #define _simd_mul_epi32 _simdemu_mul_epi32
> #define _simd_mullo_epi32 _simdemu_mullo_epi32
> #define _simd_sub_epi32 _simdemu_sub_epi32
> @@ -136,7 +271,14 @@ __m256i func(__m256i a, __m256i b)\
> #define _simd_add_epi8 _simdemu_add_epi8
> #define _simd_cmpeq_epi64 _simdemu_cmpeq_epi64
> #define _simd_cmpgt_epi64 _simdemu_cmpgt_epi64
> +#define _simd_cmpgt_epi8 _simdemu_cmpgt_epi8
> +#define _simd_cmpeq_epi8 _simdemu_cmpeq_epi8
> +#define _simd_cmpgt_epi16 _simdemu_cmpgt_epi16
> +#define _simd_cmpeq_epi16 _simdemu_cmpeq_epi16
> #define _simd_movemask_epi8 _simdemu_movemask_epi8
> +#define _simd_permute_ps _simdemu_permute_ps
> +#define _simd_srlv_epi32 _simdemu_srlv_epi32
> +#define _simd_sllv_epi32 _simdemu_sllv_epi32
>
> SIMD_EMU_EPI(_simdemu_mul_epi32, _mm_mul_epi32)
> SIMD_EMU_EPI(_simdemu_mullo_epi32, _mm_mullo_epi32)
> @@ -158,6 +300,10 @@ SIMD_EMU_EPI(_simdemu_subs_epu8, _mm_subs_epu8)
> SIMD_EMU_EPI(_simdemu_add_epi8, _mm_add_epi8)
> SIMD_EMU_EPI(_simdemu_cmpeq_epi64, _mm_cmpeq_epi64)
> SIMD_EMU_EPI(_simdemu_cmpgt_epi64, _mm_cmpgt_epi64)
> +SIMD_EMU_EPI(_simdemu_cmpgt_epi8, _mm_cmpgt_epi8)
> +SIMD_EMU_EPI(_simdemu_cmpeq_epi8, _mm_cmpeq_epi8)
> +SIMD_EMU_EPI(_simdemu_cmpgt_epi16, _mm_cmpgt_epi16)
> +SIMD_EMU_EPI(_simdemu_cmpeq_epi16, _mm_cmpeq_epi16)
>
> #define _simd_unpacklo_epi32(a, b) _mm256_castps_si256(_mm256_unpacklo_ps(_mm256_castsi256_ps(a), _mm256_castsi256_ps(b)))
> #define _simd_unpackhi_epi32(a, b) _mm256_castps_si256(_mm256_unpackhi_ps(_mm256_castsi256_ps(a), _mm256_castsi256_ps(b)))
> @@ -176,25 +322,25 @@ SIMD_EMU_EPI(_simdemu_shuffle_epi8, _mm_shuffle_epi8)
> INLINE
> __m128 _mm_fmaddemu_ps(__m128 a, __m128 b, __m128 c)
> {
> - __m128 res = _mm_mul_ps(a, b);
> - res = _mm_add_ps(res, c);
> - return res;
> + __m128 res = _mm_mul_ps(a, b);
> + res = _mm_add_ps(res, c);
> + return res;
> }
>
> INLINE
> __m256 _mm_fmaddemu256_ps(__m256 a, __m256 b, __m256 c)
> {
> - __m256 res = _mm256_mul_ps(a, b);
> - res = _mm256_add_ps(res, c);
> - return res;
> + __m256 res = _mm256_mul_ps(a, b);
> + res = _mm256_add_ps(res, c);
> + return res;
> }
>
> INLINE
> __m256 _mm_fmsubemu256_ps(__m256 a, __m256 b, __m256 c)
> {
> - __m256 res = _mm256_mul_ps(a, b);
> - res = _mm256_sub_ps(res, c);
> - return res;
> + __m256 res = _mm256_mul_ps(a, b);
> + res = _mm256_sub_ps(res, c);
> + return res;
> }
>
> INLINE
> @@ -295,7 +441,14 @@ int _simdemu_movemask_epi8(__m256i a)
>
> #define _simd_cmpeq_epi64 _mm256_cmpeq_epi64
> #define _simd_cmpgt_epi64 _mm256_cmpgt_epi64
> +#define _simd_cmpgt_epi8 _mm256_cmpgt_epi8
> +#define _simd_cmpeq_epi8 _mm256_cmpeq_epi8
> +#define _simd_cmpgt_epi16 _mm256_cmpgt_epi16
> +#define _simd_cmpeq_epi16 _mm256_cmpeq_epi16
> #define _simd_movemask_epi8 _mm256_movemask_epi8
> +#define _simd_permute_ps _mm256_permutevar8x32_ps
> +#define _simd_srlv_epi32 _mm256_srlv_epi32
> +#define _simd_sllv_epi32 _mm256_sllv_epi32
> #endif
>
> #define _simd_shuffleps_epi32(vA, vB, imm) _mm256_castps_si256(_mm256_shuffle_ps(_mm256_castsi256_ps(vA), _mm256_castsi256_ps(vB), imm))
> @@ -343,30 +496,30 @@ void _simd_mov(simdscalar &r, unsigned int rlane, simdscalar& s, unsigned int sl
>
> INLINE __m256i _simdemu_slli_epi32(__m256i a, uint32_t i)
> {
> - __m128i aHi = _mm256_extractf128_si256(a, 1);
> - __m128i aLo = _mm256_castsi256_si128(a);
> + __m128i aHi = _mm256_extractf128_si256(a, 1);
> + __m128i aLo = _mm256_castsi256_si128(a);
>
> - __m128i resHi = _mm_slli_epi32(aHi, i);
> - __m128i resLo = _mm_slli_epi32(aLo, i);
> + __m128i resHi = _mm_slli_epi32(aHi, i);
> + __m128i resLo = _mm_slli_epi32(aLo, i);
>
> - __m256i result = _mm256_castsi128_si256(resLo);
> - result = _mm256_insertf128_si256(result, resHi, 1);
> + __m256i result = _mm256_castsi128_si256(resLo);
> + result = _mm256_insertf128_si256(result, resHi, 1);
>
> - return result;
> + return result;
> }
>
> INLINE __m256i _simdemu_srai_epi32(__m256i a, uint32_t i)
> {
> - __m128i aHi = _mm256_extractf128_si256(a, 1);
> - __m128i aLo = _mm256_castsi256_si128(a);
> + __m128i aHi = _mm256_extractf128_si256(a, 1);
> + __m128i aLo = _mm256_castsi256_si128(a);
>
> - __m128i resHi = _mm_srai_epi32(aHi, i);
> - __m128i resLo = _mm_srai_epi32(aLo, i);
> + __m128i resHi = _mm_srai_epi32(aHi, i);
> + __m128i resLo = _mm_srai_epi32(aLo, i);
>
> - __m256i result = _mm256_castsi128_si256(resLo);
> - result = _mm256_insertf128_si256(result, resHi, 1);
> + __m256i result = _mm256_castsi128_si256(resLo);
> + result = _mm256_insertf128_si256(result, resHi, 1);
>
> - return result;
> + return result;
> }
>
> INLINE __m256i _simdemu_srli_epi32(__m256i a, uint32_t i)
> @@ -386,7 +539,7 @@ INLINE __m256i _simdemu_srli_epi32(__m256i a, uint32_t i)
> INLINE
> void _simdvec_transpose(simdvector &v)
> {
> - SWR_ASSERT(false, "Need to implement 8 wide version");
> + SWR_ASSERT(false, "Need to implement 8 wide version");
> }
>
> #else
> @@ -397,132 +550,132 @@ void _simdvec_transpose(simdvector &v)
> INLINE
> void _simdvec_load_ps(simdvector& r, const float *p)
> {
> - r[0] = _simd_set1_ps(p[0]);
> - r[1] = _simd_set1_ps(p[1]);
> - r[2] = _simd_set1_ps(p[2]);
> - r[3] = _simd_set1_ps(p[3]);
> + r[0] = _simd_set1_ps(p[0]);
> + r[1] = _simd_set1_ps(p[1]);
> + r[2] = _simd_set1_ps(p[2]);
> + r[3] = _simd_set1_ps(p[3]);
> }
>
> INLINE
> void _simdvec_mov(simdvector& r, const simdscalar& s)
> {
> - r[0] = s;
> - r[1] = s;
> - r[2] = s;
> - r[3] = s;
> + r[0] = s;
> + r[1] = s;
> + r[2] = s;
> + r[3] = s;
> }
>
> INLINE
> void _simdvec_mov(simdvector& r, const simdvector& v)
> {
> - r[0] = v[0];
> - r[1] = v[1];
> - r[2] = v[2];
> - r[3] = v[3];
> + r[0] = v[0];
> + r[1] = v[1];
> + r[2] = v[2];
> + r[3] = v[3];
> }
>
> // just move a lane from the source simdvector to dest simdvector
> INLINE
> void _simdvec_mov(simdvector &r, unsigned int rlane, simdvector& s, unsigned int slane)
> {
> - _simd_mov(r[0], rlane, s[0], slane);
> - _simd_mov(r[1], rlane, s[1], slane);
> - _simd_mov(r[2], rlane, s[2], slane);
> - _simd_mov(r[3], rlane, s[3], slane);
> + _simd_mov(r[0], rlane, s[0], slane);
> + _simd_mov(r[1], rlane, s[1], slane);
> + _simd_mov(r[2], rlane, s[2], slane);
> + _simd_mov(r[3], rlane, s[3], slane);
> }
>
> INLINE
> void _simdvec_dp3_ps(simdscalar& r, const simdvector& v0, const simdvector& v1)
> {
> - simdscalar tmp;
> - r = _simd_mul_ps(v0[0], v1[0]); // (v0.x*v1.x)
> + simdscalar tmp;
> + r = _simd_mul_ps(v0[0], v1[0]); // (v0.x*v1.x)
>
> - tmp = _simd_mul_ps(v0[1], v1[1]); // (v0.y*v1.y)
> - r = _simd_add_ps(r, tmp); // (v0.x*v1.x) + (v0.y*v1.y)
> + tmp = _simd_mul_ps(v0[1], v1[1]); // (v0.y*v1.y)
> + r = _simd_add_ps(r, tmp); // (v0.x*v1.x) + (v0.y*v1.y)
>
> - tmp = _simd_mul_ps(v0[2], v1[2]); // (v0.z*v1.z)
> - r = _simd_add_ps(r, tmp); // (v0.x*v1.x) + (v0.y*v1.y) + (v0.z*v1.z)
> + tmp = _simd_mul_ps(v0[2], v1[2]); // (v0.z*v1.z)
> + r = _simd_add_ps(r, tmp); // (v0.x*v1.x) + (v0.y*v1.y) + (v0.z*v1.z)
> }
>
> INLINE
> void _simdvec_dp4_ps(simdscalar& r, const simdvector& v0, const simdvector& v1)
> {
> - simdscalar tmp;
> - r = _simd_mul_ps(v0[0], v1[0]); // (v0.x*v1.x)
> + simdscalar tmp;
> + r = _simd_mul_ps(v0[0], v1[0]); // (v0.x*v1.x)
>
> - tmp = _simd_mul_ps(v0[1], v1[1]); // (v0.y*v1.y)
> - r = _simd_add_ps(r, tmp); // (v0.x*v1.x) + (v0.y*v1.y)
> + tmp = _simd_mul_ps(v0[1], v1[1]); // (v0.y*v1.y)
> + r = _simd_add_ps(r, tmp); // (v0.x*v1.x) + (v0.y*v1.y)
>
> - tmp = _simd_mul_ps(v0[2], v1[2]); // (v0.z*v1.z)
> - r = _simd_add_ps(r, tmp); // (v0.x*v1.x) + (v0.y*v1.y) + (v0.z*v1.z)
> + tmp = _simd_mul_ps(v0[2], v1[2]); // (v0.z*v1.z)
> + r = _simd_add_ps(r, tmp); // (v0.x*v1.x) + (v0.y*v1.y) + (v0.z*v1.z)
>
> - tmp = _simd_mul_ps(v0[3], v1[3]); // (v0.w*v1.w)
> - r = _simd_add_ps(r, tmp); // (v0.x*v1.x) + (v0.y*v1.y) + (v0.z*v1.z)
> + tmp = _simd_mul_ps(v0[3], v1[3]); // (v0.w*v1.w)
> + r = _simd_add_ps(r, tmp); // (v0.x*v1.x) + (v0.y*v1.y) + (v0.z*v1.z)
> }
>
> INLINE
> simdscalar _simdvec_rcp_length_ps(const simdvector& v)
> {
> - simdscalar length;
> - _simdvec_dp4_ps(length, v, v);
> - return _simd_rsqrt_ps(length);
> + simdscalar length;
> + _simdvec_dp4_ps(length, v, v);
> + return _simd_rsqrt_ps(length);
> }
>
> INLINE
> void _simdvec_normalize_ps(simdvector& r, const simdvector& v)
> {
> - simdscalar vecLength;
> - vecLength = _simdvec_rcp_length_ps(v);
> + simdscalar vecLength;
> + vecLength = _simdvec_rcp_length_ps(v);
>
> - r[0] = _simd_mul_ps(v[0], vecLength);
> - r[1] = _simd_mul_ps(v[1], vecLength);
> - r[2] = _simd_mul_ps(v[2], vecLength);
> - r[3] = _simd_mul_ps(v[3], vecLength);
> + r[0] = _simd_mul_ps(v[0], vecLength);
> + r[1] = _simd_mul_ps(v[1], vecLength);
> + r[2] = _simd_mul_ps(v[2], vecLength);
> + r[3] = _simd_mul_ps(v[3], vecLength);
> }
>
> INLINE
> void _simdvec_mul_ps(simdvector& r, const simdvector& v, const simdscalar& s)
> {
> - r[0] = _simd_mul_ps(v[0], s);
> - r[1] = _simd_mul_ps(v[1], s);
> - r[2] = _simd_mul_ps(v[2], s);
> - r[3] = _simd_mul_ps(v[3], s);
> + r[0] = _simd_mul_ps(v[0], s);
> + r[1] = _simd_mul_ps(v[1], s);
> + r[2] = _simd_mul_ps(v[2], s);
> + r[3] = _simd_mul_ps(v[3], s);
> }
>
> INLINE
> void _simdvec_mul_ps(simdvector& r, const simdvector& v0, const simdvector& v1)
> {
> - r[0] = _simd_mul_ps(v0[0], v1[0]);
> - r[1] = _simd_mul_ps(v0[1], v1[1]);
> - r[2] = _simd_mul_ps(v0[2], v1[2]);
> - r[3] = _simd_mul_ps(v0[3], v1[3]);
> + r[0] = _simd_mul_ps(v0[0], v1[0]);
> + r[1] = _simd_mul_ps(v0[1], v1[1]);
> + r[2] = _simd_mul_ps(v0[2], v1[2]);
> + r[3] = _simd_mul_ps(v0[3], v1[3]);
> }
>
> INLINE
> void _simdvec_add_ps(simdvector& r, const simdvector& v0, const simdvector& v1)
> {
> - r[0] = _simd_add_ps(v0[0], v1[0]);
> - r[1] = _simd_add_ps(v0[1], v1[1]);
> - r[2] = _simd_add_ps(v0[2], v1[2]);
> - r[3] = _simd_add_ps(v0[3], v1[3]);
> + r[0] = _simd_add_ps(v0[0], v1[0]);
> + r[1] = _simd_add_ps(v0[1], v1[1]);
> + r[2] = _simd_add_ps(v0[2], v1[2]);
> + r[3] = _simd_add_ps(v0[3], v1[3]);
> }
>
> INLINE
> void _simdvec_min_ps(simdvector& r, const simdvector& v0, const simdscalar& s)
> {
> - r[0] = _simd_min_ps(v0[0], s);
> - r[1] = _simd_min_ps(v0[1], s);
> - r[2] = _simd_min_ps(v0[2], s);
> - r[3] = _simd_min_ps(v0[3], s);
> + r[0] = _simd_min_ps(v0[0], s);
> + r[1] = _simd_min_ps(v0[1], s);
> + r[2] = _simd_min_ps(v0[2], s);
> + r[3] = _simd_min_ps(v0[3], s);
> }
>
> INLINE
> void _simdvec_max_ps(simdvector& r, const simdvector& v0, const simdscalar& s)
> {
> - r[0] = _simd_max_ps(v0[0], s);
> - r[1] = _simd_max_ps(v0[1], s);
> - r[2] = _simd_max_ps(v0[2], s);
> - r[3] = _simd_max_ps(v0[3], s);
> + r[0] = _simd_max_ps(v0[0], s);
> + r[1] = _simd_max_ps(v0[1], s);
> + r[2] = _simd_max_ps(v0[2], s);
> + r[3] = _simd_max_ps(v0[3], s);
> }
>
> // Matrix4x4 * Vector4
> @@ -532,65 +685,65 @@ void _simdvec_max_ps(simdvector& r, const simdvector& v0, const simdscalar& s)
> // outVec.w = (m30 * v.x) + (m31 * v.y) + (m32 * v.z) + (m33 * v.w)
> INLINE
> void _simd_mat4x4_vec4_multiply(
> - simdvector& result,
> - const float *pMatrix,
> - const simdvector& v)
> -{
> - simdscalar m;
> - simdscalar r0;
> - simdscalar r1;
> -
> - m = _simd_load1_ps(pMatrix + 0*4 + 0); // m[row][0]
> - r0 = _simd_mul_ps(m, v[0]); // (m00 * v.x)
> - m = _simd_load1_ps(pMatrix + 0*4 + 1); // m[row][1]
> - r1 = _simd_mul_ps(m, v[1]); // (m1 * v.y)
> - r0 = _simd_add_ps(r0, r1); // (m0 * v.x) + (m1 * v.y)
> - m = _simd_load1_ps(pMatrix + 0*4 + 2); // m[row][2]
> - r1 = _simd_mul_ps(m, v[2]); // (m2 * v.z)
> - r0 = _simd_add_ps(r0, r1); // (m0 * v.x) + (m1 * v.y) + (m2 * v.z)
> - m = _simd_load1_ps(pMatrix + 0*4 + 3); // m[row][3]
> - r1 = _simd_mul_ps(m, v[3]); // (m3 * v.z)
> - r0 = _simd_add_ps(r0, r1); // (m0 * v.x) + (m1 * v.y) + (m2 * v.z) + (m2 * v.w)
> - result[0] = r0;
> -
> - m = _simd_load1_ps(pMatrix + 1*4 + 0); // m[row][0]
> - r0 = _simd_mul_ps(m, v[0]); // (m00 * v.x)
> - m = _simd_load1_ps(pMatrix + 1*4 + 1); // m[row][1]
> - r1 = _simd_mul_ps(m, v[1]); // (m1 * v.y)
> - r0 = _simd_add_ps(r0, r1); // (m0 * v.x) + (m1 * v.y)
> - m = _simd_load1_ps(pMatrix + 1*4 + 2); // m[row][2]
> - r1 = _simd_mul_ps(m, v[2]); // (m2 * v.z)
> - r0 = _simd_add_ps(r0, r1); // (m0 * v.x) + (m1 * v.y) + (m2 * v.z)
> - m = _simd_load1_ps(pMatrix + 1*4 + 3); // m[row][3]
> - r1 = _simd_mul_ps(m, v[3]); // (m3 * v.z)
> - r0 = _simd_add_ps(r0, r1); // (m0 * v.x) + (m1 * v.y) + (m2 * v.z) + (m2 * v.w)
> - result[1] = r0;
> -
> - m = _simd_load1_ps(pMatrix + 2*4 + 0); // m[row][0]
> - r0 = _simd_mul_ps(m, v[0]); // (m00 * v.x)
> - m = _simd_load1_ps(pMatrix + 2*4 + 1); // m[row][1]
> - r1 = _simd_mul_ps(m, v[1]); // (m1 * v.y)
> - r0 = _simd_add_ps(r0, r1); // (m0 * v.x) + (m1 * v.y)
> - m = _simd_load1_ps(pMatrix + 2*4 + 2); // m[row][2]
> - r1 = _simd_mul_ps(m, v[2]); // (m2 * v.z)
> - r0 = _simd_add_ps(r0, r1); // (m0 * v.x) + (m1 * v.y) + (m2 * v.z)
> - m = _simd_load1_ps(pMatrix + 2*4 + 3); // m[row][3]
> - r1 = _simd_mul_ps(m, v[3]); // (m3 * v.z)
> - r0 = _simd_add_ps(r0, r1); // (m0 * v.x) + (m1 * v.y) + (m2 * v.z) + (m2 * v.w)
> - result[2] = r0;
> -
> - m = _simd_load1_ps(pMatrix + 3*4 + 0); // m[row][0]
> - r0 = _simd_mul_ps(m, v[0]); // (m00 * v.x)
> - m = _simd_load1_ps(pMatrix + 3*4 + 1); // m[row][1]
> - r1 = _simd_mul_ps(m, v[1]); // (m1 * v.y)
> - r0 = _simd_add_ps(r0, r1); // (m0 * v.x) + (m1 * v.y)
> - m = _simd_load1_ps(pMatrix + 3*4 + 2); // m[row][2]
> - r1 = _simd_mul_ps(m, v[2]); // (m2 * v.z)
> - r0 = _simd_add_ps(r0, r1); // (m0 * v.x) + (m1 * v.y) + (m2 * v.z)
> - m = _simd_load1_ps(pMatrix + 3*4 + 3); // m[row][3]
> - r1 = _simd_mul_ps(m, v[3]); // (m3 * v.z)
> - r0 = _simd_add_ps(r0, r1); // (m0 * v.x) + (m1 * v.y) + (m2 * v.z) + (m2 * v.w)
> - result[3] = r0;
> + simdvector& result,
> + const float *pMatrix,
> + const simdvector& v)
> +{
> + simdscalar m;
> + simdscalar r0;
> + simdscalar r1;
> +
> + m = _simd_load1_ps(pMatrix + 0*4 + 0); // m[row][0]
> + r0 = _simd_mul_ps(m, v[0]); // (m00 * v.x)
> + m = _simd_load1_ps(pMatrix + 0*4 + 1); // m[row][1]
> + r1 = _simd_mul_ps(m, v[1]); // (m1 * v.y)
> + r0 = _simd_add_ps(r0, r1); // (m0 * v.x) + (m1 * v.y)
> + m = _simd_load1_ps(pMatrix + 0*4 + 2); // m[row][2]
> + r1 = _simd_mul_ps(m, v[2]); // (m2 * v.z)
> + r0 = _simd_add_ps(r0, r1); // (m0 * v.x) + (m1 * v.y) + (m2 * v.z)
> + m = _simd_load1_ps(pMatrix + 0*4 + 3); // m[row][3]
> + r1 = _simd_mul_ps(m, v[3]); // (m3 * v.z)
> + r0 = _simd_add_ps(r0, r1); // (m0 * v.x) + (m1 * v.y) + (m2 * v.z) + (m2 * v.w)
> + result[0] = r0;
> +
> + m = _simd_load1_ps(pMatrix + 1*4 + 0); // m[row][0]
> + r0 = _simd_mul_ps(m, v[0]); // (m00 * v.x)
> + m = _simd_load1_ps(pMatrix + 1*4 + 1); // m[row][1]
> + r1 = _simd_mul_ps(m, v[1]); // (m1 * v.y)
> + r0 = _simd_add_ps(r0, r1); // (m0 * v.x) + (m1 * v.y)
> + m = _simd_load1_ps(pMatrix + 1*4 + 2); // m[row][2]
> + r1 = _simd_mul_ps(m, v[2]); // (m2 * v.z)
> + r0 = _simd_add_ps(r0, r1); // (m0 * v.x) + (m1 * v.y) + (m2 * v.z)
> + m = _simd_load1_ps(pMatrix + 1*4 + 3); // m[row][3]
> + r1 = _simd_mul_ps(m, v[3]); // (m3 * v.z)
> + r0 = _simd_add_ps(r0, r1); // (m0 * v.x) + (m1 * v.y) + (m2 * v.z) + (m2 * v.w)
> + result[1] = r0;
> +
> + m = _simd_load1_ps(pMatrix + 2*4 + 0); // m[row][0]
> + r0 = _simd_mul_ps(m, v[0]); // (m00 * v.x)
> + m = _simd_load1_ps(pMatrix + 2*4 + 1); // m[row][1]
> + r1 = _simd_mul_ps(m, v[1]); // (m1 * v.y)
> + r0 = _simd_add_ps(r0, r1); // (m0 * v.x) + (m1 * v.y)
> + m = _simd_load1_ps(pMatrix + 2*4 + 2); // m[row][2]
> + r1 = _simd_mul_ps(m, v[2]); // (m2 * v.z)
> + r0 = _simd_add_ps(r0, r1); // (m0 * v.x) + (m1 * v.y) + (m2 * v.z)
> + m = _simd_load1_ps(pMatrix + 2*4 + 3); // m[row][3]
> + r1 = _simd_mul_ps(m, v[3]); // (m3 * v.z)
> + r0 = _simd_add_ps(r0, r1); // (m0 * v.x) + (m1 * v.y) + (m2 * v.z) + (m2 * v.w)
> + result[2] = r0;
> +
> + m = _simd_load1_ps(pMatrix + 3*4 + 0); // m[row][0]
> + r0 = _simd_mul_ps(m, v[0]); // (m00 * v.x)
> + m = _simd_load1_ps(pMatrix + 3*4 + 1); // m[row][1]
> + r1 = _simd_mul_ps(m, v[1]); // (m1 * v.y)
> + r0 = _simd_add_ps(r0, r1); // (m0 * v.x) + (m1 * v.y)
> + m = _simd_load1_ps(pMatrix + 3*4 + 2); // m[row][2]
> + r1 = _simd_mul_ps(m, v[2]); // (m2 * v.z)
> + r0 = _simd_add_ps(r0, r1); // (m0 * v.x) + (m1 * v.y) + (m2 * v.z)
> + m = _simd_load1_ps(pMatrix + 3*4 + 3); // m[row][3]
> + r1 = _simd_mul_ps(m, v[3]); // (m3 * v.z)
> + r0 = _simd_add_ps(r0, r1); // (m0 * v.x) + (m1 * v.y) + (m2 * v.z) + (m2 * v.w)
> + result[3] = r0;
> }
>
> // Matrix4x4 * Vector3 - Direction Vector where w = 0.
> @@ -600,45 +753,45 @@ void _simd_mat4x4_vec4_multiply(
> // outVec.w = (m30 * v.x) + (m31 * v.y) + (m32 * v.z) + (m33 * 0)
> INLINE
> void _simd_mat3x3_vec3_w0_multiply(
> - simdvector& result,
> - const float *pMatrix,
> - const simdvector& v)
> -{
> - simdscalar m;
> - simdscalar r0;
> - simdscalar r1;
> -
> - m = _simd_load1_ps(pMatrix + 0*4 + 0); // m[row][0]
> - r0 = _simd_mul_ps(m, v[0]); // (m00 * v.x)
> - m = _simd_load1_ps(pMatrix + 0*4 + 1); // m[row][1]
> - r1 = _simd_mul_ps(m, v[1]); // (m1 * v.y)
> - r0 = _simd_add_ps(r0, r1); // (m0 * v.x) + (m1 * v.y)
> - m = _simd_load1_ps(pMatrix + 0*4 + 2); // m[row][2]
> - r1 = _simd_mul_ps(m, v[2]); // (m2 * v.z)
> - r0 = _simd_add_ps(r0, r1); // (m0 * v.x) + (m1 * v.y) + (m2 * v.z)
> - result[0] = r0;
> -
> - m = _simd_load1_ps(pMatrix + 1*4 + 0); // m[row][0]
> - r0 = _simd_mul_ps(m, v[0]); // (m00 * v.x)
> - m = _simd_load1_ps(pMatrix + 1*4 + 1); // m[row][1]
> - r1 = _simd_mul_ps(m, v[1]); // (m1 * v.y)
> - r0 = _simd_add_ps(r0, r1); // (m0 * v.x) + (m1 * v.y)
> - m = _simd_load1_ps(pMatrix + 1*4 + 2); // m[row][2]
> - r1 = _simd_mul_ps(m, v[2]); // (m2 * v.z)
> - r0 = _simd_add_ps(r0, r1); // (m0 * v.x) + (m1 * v.y) + (m2 * v.z)
> - result[1] = r0;
> -
> - m = _simd_load1_ps(pMatrix + 2*4 + 0); // m[row][0]
> - r0 = _simd_mul_ps(m, v[0]); // (m00 * v.x)
> - m = _simd_load1_ps(pMatrix + 2*4 + 1); // m[row][1]
> - r1 = _simd_mul_ps(m, v[1]); // (m1 * v.y)
> - r0 = _simd_add_ps(r0, r1); // (m0 * v.x) + (m1 * v.y)
> - m = _simd_load1_ps(pMatrix + 2*4 + 2); // m[row][2]
> - r1 = _simd_mul_ps(m, v[2]); // (m2 * v.z)
> - r0 = _simd_add_ps(r0, r1); // (m0 * v.x) + (m1 * v.y) + (m2 * v.z)
> - result[2] = r0;
> -
> - result[3] = _simd_setzero_ps();
> + simdvector& result,
> + const float *pMatrix,
> + const simdvector& v)
> +{
> + simdscalar m;
> + simdscalar r0;
> + simdscalar r1;
> +
> + m = _simd_load1_ps(pMatrix + 0*4 + 0); // m[row][0]
> + r0 = _simd_mul_ps(m, v[0]); // (m00 * v.x)
> + m = _simd_load1_ps(pMatrix + 0*4 + 1); // m[row][1]
> + r1 = _simd_mul_ps(m, v[1]); // (m1 * v.y)
> + r0 = _simd_add_ps(r0, r1); // (m0 * v.x) + (m1 * v.y)
> + m = _simd_load1_ps(pMatrix + 0*4 + 2); // m[row][2]
> + r1 = _simd_mul_ps(m, v[2]); // (m2 * v.z)
> + r0 = _simd_add_ps(r0, r1); // (m0 * v.x) + (m1 * v.y) + (m2 * v.z)
> + result[0] = r0;
> +
> + m = _simd_load1_ps(pMatrix + 1*4 + 0); // m[row][0]
> + r0 = _simd_mul_ps(m, v[0]); // (m00 * v.x)
> + m = _simd_load1_ps(pMatrix + 1*4 + 1); // m[row][1]
> + r1 = _simd_mul_ps(m, v[1]); // (m1 * v.y)
> + r0 = _simd_add_ps(r0, r1); // (m0 * v.x) + (m1 * v.y)
> + m = _simd_load1_ps(pMatrix + 1*4 + 2); // m[row][2]
> + r1 = _simd_mul_ps(m, v[2]); // (m2 * v.z)
> + r0 = _simd_add_ps(r0, r1); // (m0 * v.x) + (m1 * v.y) + (m2 * v.z)
> + result[1] = r0;
> +
> + m = _simd_load1_ps(pMatrix + 2*4 + 0); // m[row][0]
> + r0 = _simd_mul_ps(m, v[0]); // (m00 * v.x)
> + m = _simd_load1_ps(pMatrix + 2*4 + 1); // m[row][1]
> + r1 = _simd_mul_ps(m, v[1]); // (m1 * v.y)
> + r0 = _simd_add_ps(r0, r1); // (m0 * v.x) + (m1 * v.y)
> + m = _simd_load1_ps(pMatrix + 2*4 + 2); // m[row][2]
> + r1 = _simd_mul_ps(m, v[2]); // (m2 * v.z)
> + r0 = _simd_add_ps(r0, r1); // (m0 * v.x) + (m1 * v.y) + (m2 * v.z)
> + result[2] = r0;
> +
> + result[3] = _simd_setzero_ps();
> }
>
> // Matrix4x4 * Vector3 - Position vector where w = 1.
> @@ -648,108 +801,108 @@ void _simd_mat3x3_vec3_w0_multiply(
> // outVec.w = (m30 * v.x) + (m31 * v.y) + (m32 * v.z) + (m33 * 1)
> INLINE
> void _simd_mat4x4_vec3_w1_multiply(
> - simdvector& result,
> - const float *pMatrix,
> - const simdvector& v)
> -{
> - simdscalar m;
> - simdscalar r0;
> - simdscalar r1;
> -
> - m = _simd_load1_ps(pMatrix + 0*4 + 0); // m[row][0]
> - r0 = _simd_mul_ps(m, v[0]); // (m00 * v.x)
> - m = _simd_load1_ps(pMatrix + 0*4 + 1); // m[row][1]
> - r1 = _simd_mul_ps(m, v[1]); // (m1 * v.y)
> - r0 = _simd_add_ps(r0, r1); // (m0 * v.x) + (m1 * v.y)
> - m = _simd_load1_ps(pMatrix + 0*4 + 2); // m[row][2]
> - r1 = _simd_mul_ps(m, v[2]); // (m2 * v.z)
> - r0 = _simd_add_ps(r0, r1); // (m0 * v.x) + (m1 * v.y) + (m2 * v.z)
> - m = _simd_load1_ps(pMatrix + 0*4 + 3); // m[row][3]
> - r0 = _simd_add_ps(r0, m); // (m0 * v.x) + (m1 * v.y) + (m2 * v.z) + (m2 * 1)
> - result[0] = r0;
> -
> - m = _simd_load1_ps(pMatrix + 1*4 + 0); // m[row][0]
> - r0 = _simd_mul_ps(m, v[0]); // (m00 * v.x)
> - m = _simd_load1_ps(pMatrix + 1*4 + 1); // m[row][1]
> - r1 = _simd_mul_ps(m, v[1]); // (m1 * v.y)
> - r0 = _simd_add_ps(r0, r1); // (m0 * v.x) + (m1 * v.y)
> - m = _simd_load1_ps(pMatrix + 1*4 + 2); // m[row][2]
> - r1 = _simd_mul_ps(m, v[2]); // (m2 * v.z)
> - r0 = _simd_add_ps(r0, r1); // (m0 * v.x) + (m1 * v.y) + (m2 * v.z)
> - m = _simd_load1_ps(pMatrix + 1*4 + 3); // m[row][3]
> - r0 = _simd_add_ps(r0, m); // (m0 * v.x) + (m1 * v.y) + (m2 * v.z) + (m2 * 1)
> - result[1] = r0;
> -
> - m = _simd_load1_ps(pMatrix + 2*4 + 0); // m[row][0]
> - r0 = _simd_mul_ps(m, v[0]); // (m00 * v.x)
> - m = _simd_load1_ps(pMatrix + 2*4 + 1); // m[row][1]
> - r1 = _simd_mul_ps(m, v[1]); // (m1 * v.y)
> - r0 = _simd_add_ps(r0, r1); // (m0 * v.x) + (m1 * v.y)
> - m = _simd_load1_ps(pMatrix + 2*4 + 2); // m[row][2]
> - r1 = _simd_mul_ps(m, v[2]); // (m2 * v.z)
> - r0 = _simd_add_ps(r0, r1); // (m0 * v.x) + (m1 * v.y) + (m2 * v.z)
> - m = _simd_load1_ps(pMatrix + 2*4 + 3); // m[row][3]
> - r0 = _simd_add_ps(r0, m); // (m0 * v.x) + (m1 * v.y) + (m2 * v.z) + (m2 * 1)
> - result[2] = r0;
> -
> - m = _simd_load1_ps(pMatrix + 3*4 + 0); // m[row][0]
> - r0 = _simd_mul_ps(m, v[0]); // (m00 * v.x)
> - m = _simd_load1_ps(pMatrix + 3*4 + 1); // m[row][1]
> - r1 = _simd_mul_ps(m, v[1]); // (m1 * v.y)
> - r0 = _simd_add_ps(r0, r1); // (m0 * v.x) + (m1 * v.y)
> - m = _simd_load1_ps(pMatrix + 3*4 + 2); // m[row][2]
> - r1 = _simd_mul_ps(m, v[2]); // (m2 * v.z)
> - r0 = _simd_add_ps(r0, r1); // (m0 * v.x) + (m1 * v.y) + (m2 * v.z)
> - m = _simd_load1_ps(pMatrix + 3*4 + 3); // m[row][3]
> - result[3] = _simd_add_ps(r0, m); // (m0 * v.x) + (m1 * v.y) + (m2 * v.z) + (m2 * 1)
> + simdvector& result,
> + const float *pMatrix,
> + const simdvector& v)
> +{
> + simdscalar m;
> + simdscalar r0;
> + simdscalar r1;
> +
> + m = _simd_load1_ps(pMatrix + 0*4 + 0); // m[row][0]
> + r0 = _simd_mul_ps(m, v[0]); // (m00 * v.x)
> + m = _simd_load1_ps(pMatrix + 0*4 + 1); // m[row][1]
> + r1 = _simd_mul_ps(m, v[1]); // (m1 * v.y)
> + r0 = _simd_add_ps(r0, r1); // (m0 * v.x) + (m1 * v.y)
> + m = _simd_load1_ps(pMatrix + 0*4 + 2); // m[row][2]
> + r1 = _simd_mul_ps(m, v[2]); // (m2 * v.z)
> + r0 = _simd_add_ps(r0, r1); // (m0 * v.x) + (m1 * v.y) + (m2 * v.z)
> + m = _simd_load1_ps(pMatrix + 0*4 + 3); // m[row][3]
> + r0 = _simd_add_ps(r0, m); // (m0 * v.x) + (m1 * v.y) + (m2 * v.z) + (m2 * 1)
> + result[0] = r0;
> +
> + m = _simd_load1_ps(pMatrix + 1*4 + 0); // m[row][0]
> + r0 = _simd_mul_ps(m, v[0]); // (m00 * v.x)
> + m = _simd_load1_ps(pMatrix + 1*4 + 1); // m[row][1]
> + r1 = _simd_mul_ps(m, v[1]); // (m1 * v.y)
> + r0 = _simd_add_ps(r0, r1); // (m0 * v.x) + (m1 * v.y)
> + m = _simd_load1_ps(pMatrix + 1*4 + 2); // m[row][2]
> + r1 = _simd_mul_ps(m, v[2]); // (m2 * v.z)
> + r0 = _simd_add_ps(r0, r1); // (m0 * v.x) + (m1 * v.y) + (m2 * v.z)
> + m = _simd_load1_ps(pMatrix + 1*4 + 3); // m[row][3]
> + r0 = _simd_add_ps(r0, m); // (m0 * v.x) + (m1 * v.y) + (m2 * v.z) + (m2 * 1)
> + result[1] = r0;
> +
> + m = _simd_load1_ps(pMatrix + 2*4 + 0); // m[row][0]
> + r0 = _simd_mul_ps(m, v[0]); // (m00 * v.x)
> + m = _simd_load1_ps(pMatrix + 2*4 + 1); // m[row][1]
> + r1 = _simd_mul_ps(m, v[1]); // (m1 * v.y)
> + r0 = _simd_add_ps(r0, r1); // (m0 * v.x) + (m1 * v.y)
> + m = _simd_load1_ps(pMatrix + 2*4 + 2); // m[row][2]
> + r1 = _simd_mul_ps(m, v[2]); // (m2 * v.z)
> + r0 = _simd_add_ps(r0, r1); // (m0 * v.x) + (m1 * v.y) + (m2 * v.z)
> + m = _simd_load1_ps(pMatrix + 2*4 + 3); // m[row][3]
> + r0 = _simd_add_ps(r0, m); // (m0 * v.x) + (m1 * v.y) + (m2 * v.z) + (m2 * 1)
> + result[2] = r0;
> +
> + m = _simd_load1_ps(pMatrix + 3*4 + 0); // m[row][0]
> + r0 = _simd_mul_ps(m, v[0]); // (m00 * v.x)
> + m = _simd_load1_ps(pMatrix + 3*4 + 1); // m[row][1]
> + r1 = _simd_mul_ps(m, v[1]); // (m1 * v.y)
> + r0 = _simd_add_ps(r0, r1); // (m0 * v.x) + (m1 * v.y)
> + m = _simd_load1_ps(pMatrix + 3*4 + 2); // m[row][2]
> + r1 = _simd_mul_ps(m, v[2]); // (m2 * v.z)
> + r0 = _simd_add_ps(r0, r1); // (m0 * v.x) + (m1 * v.y) + (m2 * v.z)
> + m = _simd_load1_ps(pMatrix + 3*4 + 3); // m[row][3]
> + result[3] = _simd_add_ps(r0, m); // (m0 * v.x) + (m1 * v.y) + (m2 * v.z) + (m2 * 1)
> }
>
> INLINE
> void _simd_mat4x3_vec3_w1_multiply(
> - simdvector& result,
> - const float *pMatrix,
> - const simdvector& v)
> -{
> - simdscalar m;
> - simdscalar r0;
> - simdscalar r1;
> -
> - m = _simd_load1_ps(pMatrix + 0*4 + 0); // m[row][0]
> - r0 = _simd_mul_ps(m, v[0]); // (m00 * v.x)
> - m = _simd_load1_ps(pMatrix + 0*4 + 1); // m[row][1]
> - r1 = _simd_mul_ps(m, v[1]); // (m1 * v.y)
> - r0 = _simd_add_ps(r0, r1); // (m0 * v.x) + (m1 * v.y)
> - m = _simd_load1_ps(pMatrix + 0*4 + 2); // m[row][2]
> - r1 = _simd_mul_ps(m, v[2]); // (m2 * v.z)
> - r0 = _simd_add_ps(r0, r1); // (m0 * v.x) + (m1 * v.y) + (m2 * v.z)
> - m = _simd_load1_ps(pMatrix + 0*4 + 3); // m[row][3]
> - r0 = _simd_add_ps(r0, m); // (m0 * v.x) + (m1 * v.y) + (m2 * v.z) + (m2 * 1)
> - result[0] = r0;
> -
> - m = _simd_load1_ps(pMatrix + 1*4 + 0); // m[row][0]
> - r0 = _simd_mul_ps(m, v[0]); // (m00 * v.x)
> - m = _simd_load1_ps(pMatrix + 1*4 + 1); // m[row][1]
> - r1 = _simd_mul_ps(m, v[1]); // (m1 * v.y)
> - r0 = _simd_add_ps(r0, r1); // (m0 * v.x) + (m1 * v.y)
> - m = _simd_load1_ps(pMatrix + 1*4 + 2); // m[row][2]
> - r1 = _simd_mul_ps(m, v[2]); // (m2 * v.z)
> - r0 = _simd_add_ps(r0, r1); // (m0 * v.x) + (m1 * v.y) + (m2 * v.z)
> - m = _simd_load1_ps(pMatrix + 1*4 + 3); // m[row][3]
> - r0 = _simd_add_ps(r0, m); // (m0 * v.x) + (m1 * v.y) + (m2 * v.z) + (m2 * 1)
> - result[1] = r0;
> -
> - m = _simd_load1_ps(pMatrix + 2*4 + 0); // m[row][0]
> - r0 = _simd_mul_ps(m, v[0]); // (m00 * v.x)
> - m = _simd_load1_ps(pMatrix + 2*4 + 1); // m[row][1]
> - r1 = _simd_mul_ps(m, v[1]); // (m1 * v.y)
> - r0 = _simd_add_ps(r0, r1); // (m0 * v.x) + (m1 * v.y)
> - m = _simd_load1_ps(pMatrix + 2*4 + 2); // m[row][2]
> - r1 = _simd_mul_ps(m, v[2]); // (m2 * v.z)
> - r0 = _simd_add_ps(r0, r1); // (m0 * v.x) + (m1 * v.y) + (m2 * v.z)
> - m = _simd_load1_ps(pMatrix + 2*4 + 3); // m[row][3]
> - r0 = _simd_add_ps(r0, m); // (m0 * v.x) + (m1 * v.y) + (m2 * v.z) + (m2 * 1)
> - result[2] = r0;
> - result[3] = _simd_set1_ps(1.0f);
> + simdvector& result,
> + const float *pMatrix,
> + const simdvector& v)
> +{
> + simdscalar m;
> + simdscalar r0;
> + simdscalar r1;
> +
> + m = _simd_load1_ps(pMatrix + 0*4 + 0); // m[row][0]
> + r0 = _simd_mul_ps(m, v[0]); // (m00 * v.x)
> + m = _simd_load1_ps(pMatrix + 0*4 + 1); // m[row][1]
> + r1 = _simd_mul_ps(m, v[1]); // (m1 * v.y)
> + r0 = _simd_add_ps(r0, r1); // (m0 * v.x) + (m1 * v.y)
> + m = _simd_load1_ps(pMatrix + 0*4 + 2); // m[row][2]
> + r1 = _simd_mul_ps(m, v[2]); // (m2 * v.z)
> + r0 = _simd_add_ps(r0, r1); // (m0 * v.x) + (m1 * v.y) + (m2 * v.z)
> + m = _simd_load1_ps(pMatrix + 0*4 + 3); // m[row][3]
> + r0 = _simd_add_ps(r0, m); // (m0 * v.x) + (m1 * v.y) + (m2 * v.z) + (m2 * 1)
> + result[0] = r0;
> +
> + m = _simd_load1_ps(pMatrix + 1*4 + 0); // m[row][0]
> + r0 = _simd_mul_ps(m, v[0]); // (m00 * v.x)
> + m = _simd_load1_ps(pMatrix + 1*4 + 1); // m[row][1]
> + r1 = _simd_mul_ps(m, v[1]); // (m1 * v.y)
> + r0 = _simd_add_ps(r0, r1); // (m0 * v.x) + (m1 * v.y)
> + m = _simd_load1_ps(pMatrix + 1*4 + 2); // m[row][2]
> + r1 = _simd_mul_ps(m, v[2]); // (m2 * v.z)
> + r0 = _simd_add_ps(r0, r1); // (m0 * v.x) + (m1 * v.y) + (m2 * v.z)
> + m = _simd_load1_ps(pMatrix + 1*4 + 3); // m[row][3]
> + r0 = _simd_add_ps(r0, m); // (m0 * v.x) + (m1 * v.y) + (m2 * v.z) + (m2 * 1)
> + result[1] = r0;
> +
> + m = _simd_load1_ps(pMatrix + 2*4 + 0); // m[row][0]
> + r0 = _simd_mul_ps(m, v[0]); // (m00 * v.x)
> + m = _simd_load1_ps(pMatrix + 2*4 + 1); // m[row][1]
> + r1 = _simd_mul_ps(m, v[1]); // (m1 * v.y)
> + r0 = _simd_add_ps(r0, r1); // (m0 * v.x) + (m1 * v.y)
> + m = _simd_load1_ps(pMatrix + 2*4 + 2); // m[row][2]
> + r1 = _simd_mul_ps(m, v[2]); // (m2 * v.z)
> + r0 = _simd_add_ps(r0, r1); // (m0 * v.x) + (m1 * v.y) + (m2 * v.z)
> + m = _simd_load1_ps(pMatrix + 2*4 + 3); // m[row][3]
> + r0 = _simd_add_ps(r0, m); // (m0 * v.x) + (m1 * v.y) + (m2 * v.z) + (m2 * 1)
> + result[2] = r0;
> + result[3] = _simd_set1_ps(1.0f);
> }
>
> //////////////////////////////////////////////////////////////////////////
> @@ -783,5 +936,61 @@ static INLINE simdscalar InterpolateComponent(simdscalar vI, simdscalar vJ, cons
> return vplaneps(vA, vB, vC, vI, vJ);
> }
>
> +INLINE
> +UINT pdep_u32(UINT a, UINT mask)
> +{
> +#if KNOB_ARCH==KNOB_ARCH_AVX2
> + return _pdep_u32(a, mask);
> +#else
> + UINT result = 0;
> +
> + // copied from http://wm.ite.pl/articles/pdep-soft-emu.html
> + // using bsf instead of funky loop
> + DWORD maskIndex;
> + while (_BitScanForward(&maskIndex, mask))
> + {
> + // 1. isolate lowest set bit of mask
> + const UINT lowest = 1 << maskIndex;
> +
> + // 2. populate LSB from src
> + const UINT LSB = (UINT)((int)(a << 31) >> 31);
> +
> + // 3. copy bit from mask
> + result |= LSB & lowest;
> +
> + // 4. clear lowest bit
> + mask &= ~lowest;
> +
> + // 5. prepare for next iteration
> + a >>= 1;
> + }
> +
> + return result;
> +#endif
> +}
> +
> +INLINE
> +UINT pext_u32(UINT a, UINT mask)
> +{
> +#if KNOB_ARCH==KNOB_ARCH_AVX2
> + return _pext_u32(a, mask);
> +#else
> + UINT result = 0;
> + DWORD maskIndex;
> + uint32_t currentBit = 0;
> + while (_BitScanForward(&maskIndex, mask))
> + {
> + // 1. isolate lowest set bit of mask
> + const UINT lowest = 1 << maskIndex;
> +
> + // 2. copy bit from mask
> + result |= ((a & lowest) > 0) << currentBit++;
> +
> + // 3. clear lowest bit
> + mask &= ~lowest;
> + }
> + return result;
> +#endif
> +}
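
A quick sanity check of the scalar fallbacks against the BMI2 semantics
(illustrative values only; pdep scatters the low bits of `a` into the set-bit
positions of `mask`, and pext is the inverse gather):

    // mask = 0x1A (0b11010), a = 0x5 (0b101):
    // pdep places a's bits 0,1,2 at mask's set positions 1,3,4 -> 0b10010
    assert(pdep_u32(0x5, 0x1A) == 0x12);
    // pext gathers bits 1,3,4 of 0x12 back into the low bits -> 0b101
    assert(pext_u32(0x12, 0x1A) == 0x5);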
>
> #endif//__SWR_SIMDINTRIN_H__
> diff --git a/src/gallium/drivers/swr/rasterizer/core/api.cpp b/src/gallium/drivers/swr/rasterizer/core/api.cpp
> index fccccab..6ebb3f8 100644
> --- a/src/gallium/drivers/swr/rasterizer/core/api.cpp
> +++ b/src/gallium/drivers/swr/rasterizer/core/api.cpp
> @@ -49,7 +49,7 @@ void SetupDefaultState(SWR_CONTEXT *pContext);
> /// @brief Create SWR Context.
> /// @param pCreateInfo - pointer to creation info.
> HANDLE SwrCreateContext(
> - const SWR_CREATECONTEXT_INFO* pCreateInfo)
> + SWR_CREATECONTEXT_INFO* pCreateInfo)
> {
> RDTSC_RESET();
> RDTSC_INIT(0);
> @@ -61,27 +61,16 @@ HANDLE SwrCreateContext(
> pContext->driverType = pCreateInfo->driver;
> pContext->privateStateSize = pCreateInfo->privateStateSize;
>
> - pContext->dcRing = (DRAW_CONTEXT*)_aligned_malloc(sizeof(DRAW_CONTEXT)*KNOB_MAX_DRAWS_IN_FLIGHT, 64);
> - memset(pContext->dcRing, 0, sizeof(DRAW_CONTEXT)*KNOB_MAX_DRAWS_IN_FLIGHT);
> -
> - pContext->dsRing = (DRAW_STATE*)_aligned_malloc(sizeof(DRAW_STATE)*KNOB_MAX_DRAWS_IN_FLIGHT, 64);
> - memset(pContext->dsRing, 0, sizeof(DRAW_STATE)*KNOB_MAX_DRAWS_IN_FLIGHT);
> -
> - pContext->numSubContexts = pCreateInfo->maxSubContexts;
> - if (pContext->numSubContexts > 1)
> - {
> - pContext->subCtxSave = (DRAW_STATE*)_aligned_malloc(sizeof(DRAW_STATE) * pContext->numSubContexts, 64);
> - memset(pContext->subCtxSave, 0, sizeof(DRAW_STATE) * pContext->numSubContexts);
> - }
> + pContext->dcRing.Init(KNOB_MAX_DRAWS_IN_FLIGHT);
> + pContext->dsRing.Init(KNOB_MAX_DRAWS_IN_FLIGHT);
>
> for (uint32_t dc = 0; dc < KNOB_MAX_DRAWS_IN_FLIGHT; ++dc)
> {
> - pContext->dcRing[dc].pArena = new Arena();
> - pContext->dcRing[dc].inUse = false;
> + pContext->dcRing[dc].pArena = new CachingArena(pContext->cachingArenaAllocator);
> pContext->dcRing[dc].pTileMgr = new MacroTileMgr(*(pContext->dcRing[dc].pArena));
> pContext->dcRing[dc].pDispatch = new DispatchQueue(); /// @todo Could lazily allocate this if Dispatch seen.
>
> - pContext->dsRing[dc].pArena = new Arena();
> + pContext->dsRing[dc].pArena = new CachingArena(pContext->cachingArenaAllocator);
> }
>
> if (!KNOB_SINGLE_THREADED)
> @@ -108,9 +97,6 @@ HANDLE SwrCreateContext(
> pContext->pScratch[i] = (uint8_t*)_aligned_malloc((32 * 1024), KNOB_SIMD_WIDTH * 4);
> }
>
> - pContext->nextDrawId = 1;
> - pContext->DrawEnqueued = 1;
> -
> // State setup AFTER context is fully initialized
> SetupDefaultState(pContext);
>
> @@ -125,6 +111,13 @@ HANDLE SwrCreateContext(
> pContext->pfnStoreTile = pCreateInfo->pfnStoreTile;
> pContext->pfnClearTile = pCreateInfo->pfnClearTile;
>
> + // pass pointer to bucket manager back to caller
> +#ifdef KNOB_ENABLE_RDTSC
> + pCreateInfo->pBucketMgr = &gBucketMgr;
> +#endif
> +
> + pCreateInfo->contextSaveSize = sizeof(API_STATE);
> +
> return (HANDLE)pContext;
> }
>
> @@ -148,10 +141,6 @@ void SwrDestroyContext(HANDLE hContext)
> _aligned_free(pContext->pScratch[i]);
> }
>
> - _aligned_free(pContext->dcRing);
> - _aligned_free(pContext->dsRing);
> - _aligned_free(pContext->subCtxSave);
> -
> delete(pContext->pHotTileMgr);
>
> pContext->~SWR_CONTEXT();
> @@ -168,49 +157,28 @@ void WakeAllThreads(SWR_CONTEXT *pContext)
> pContext->FifosNotEmpty.notify_all();
> }
>
> -bool StillDrawing(SWR_CONTEXT *pContext, DRAW_CONTEXT *pDC)
> +template<bool IsDraw>
> +void QueueWork(SWR_CONTEXT *pContext)
> {
> - // For single thread nothing should still be drawing.
> - if (KNOB_SINGLE_THREADED) { return false; }
> -
> - if (pDC->isCompute)
> + if (IsDraw)
> {
> - if (pDC->doneCompute)
> - {
> - pDC->inUse = false;
> - return false;
> - }
> +        // Each worker thread looks at a DC for both FE and BE work at different times, so we
> +        // initialize threadsDone to twice the worker count. When the threadsDone counter reaches
> +        // 0, every worker has moved past this DC (i.e. each worker has checked this DC for both
> +        // FE and BE work and moved on once all work was done).
> + pContext->pCurDrawContext->threadsDone =
> + pContext->NumWorkerThreads ? pContext->NumWorkerThreads * 2 : 2;
> }
> -
> - // Check if backend work is done. First make sure all triangles have been binned.
> - if (pDC->doneFE == true)
> + else
> {
> - // ensure workers have all moved passed this draw
> - if (pDC->threadsDoneFE != pContext->NumWorkerThreads)
> - {
> - return true;
> - }
> -
> - if (pDC->threadsDoneBE != pContext->NumWorkerThreads)
> - {
> - return true;
> - }
> -
> - pDC->inUse = false; // all work is done.
> + pContext->pCurDrawContext->threadsDone =
> + pContext->NumWorkerThreads ? pContext->NumWorkerThreads : 1;
> }
>
> - return pDC->inUse;
> -}
> -
> -void QueueDraw(SWR_CONTEXT *pContext)
> -{
> - SWR_ASSERT(pContext->pCurDrawContext->inUse == false);
> - pContext->pCurDrawContext->inUse = true;
> -
> _ReadWriteBarrier();
> {
> std::unique_lock<std::mutex> lock(pContext->WaitLock);
> - pContext->DrawEnqueued++;
> + pContext->dcRing.Enqueue();
> }
>
> if (KNOB_SINGLE_THREADED)
> @@ -219,10 +187,24 @@ void QueueDraw(SWR_CONTEXT *pContext)
> uint32_t mxcsr = _mm_getcsr();
> _mm_setcsr(mxcsr | _MM_FLUSH_ZERO_ON | _MM_DENORMALS_ZERO_ON);
>
> - std::unordered_set<uint32_t> lockedTiles;
> - uint64_t curDraw[2] = { pContext->pCurDrawContext->drawId, pContext->pCurDrawContext->drawId };
> - WorkOnFifoFE(pContext, 0, curDraw[0], 0);
> - WorkOnFifoBE(pContext, 0, curDraw[1], lockedTiles);
> + if (IsDraw)
> + {
> + static TileSet lockedTiles;
> + uint64_t curDraw[2] = { pContext->pCurDrawContext->drawId, pContext->pCurDrawContext->drawId };
> + WorkOnFifoFE(pContext, 0, curDraw[0], 0);
> + WorkOnFifoBE(pContext, 0, curDraw[1], lockedTiles);
> + }
> + else
> + {
> + uint64_t curDispatch = pContext->pCurDrawContext->drawId;
> + WorkOnCompute(pContext, 0, curDispatch);
> + }
> +
> + // Dequeue the work here, if not already done, since we're single threaded (i.e. no workers).
> + if (!pContext->dcRing.IsEmpty())
> + {
> + pContext->dcRing.Dequeue();
> + }
>
> // restore csr
> _mm_setcsr(mxcsr);
> @@ -239,40 +221,14 @@ void QueueDraw(SWR_CONTEXT *pContext)
> pContext->pCurDrawContext = nullptr;
> }
>
> -///@todo Combine this with QueueDraw
> -void QueueDispatch(SWR_CONTEXT *pContext)
> +INLINE void QueueDraw(SWR_CONTEXT* pContext)
> {
> - SWR_ASSERT(pContext->pCurDrawContext->inUse == false);
> - pContext->pCurDrawContext->inUse = true;
> -
> - _ReadWriteBarrier();
> - {
> - std::unique_lock<std::mutex> lock(pContext->WaitLock);
> - pContext->DrawEnqueued++;
> - }
> -
> - if (KNOB_SINGLE_THREADED)
> - {
> - // flush denormals to 0
> - uint32_t mxcsr = _mm_getcsr();
> - _mm_setcsr(mxcsr | _MM_FLUSH_ZERO_ON | _MM_DENORMALS_ZERO_ON);
> -
> - uint64_t curDispatch = pContext->pCurDrawContext->drawId;
> - WorkOnCompute(pContext, 0, curDispatch);
> -
> - // restore csr
> - _mm_setcsr(mxcsr);
> - }
> - else
> - {
> - RDTSC_START(APIDrawWakeAllThreads);
> - WakeAllThreads(pContext);
> - RDTSC_STOP(APIDrawWakeAllThreads, 1, 0);
> - }
> + QueueWork<true>(pContext);
> +}
>
> - // Set current draw context to NULL so that next state call forces a new draw context to be created and populated.
> - pContext->pPrevDrawContext = pContext->pCurDrawContext;
> - pContext->pCurDrawContext = nullptr;
> +INLINE void QueueDispatch(SWR_CONTEXT* pContext)
> +{
> + QueueWork<false>(pContext);
> }
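
The Enqueue/Dequeue/IsFull/IsEmpty/GetHead calls above come from the new
ringbuffer datastructure the DC/DS rings were switched to. Since
core/ringbuffer.h isn't quoted here, this is only a sketch of the contract the
code appears to rely on (monotonic head/tail counters, entries addressed
modulo the ring size), not the actual implementation:

    #include <atomic>
    #include <cstdint>

    template <typename T>
    struct RingBufferSketch
    {
        void Init(uint32_t numEntries)
        {
            mNumEntries = numEntries;
            mpRing = new T[numEntries]();
        }
        ~RingBufferSketch() { delete[] mpRing; }

        T& operator[](uint32_t index) { return mpRing[index % mNumEntries]; }

        void Enqueue() { ++mHead; }     // API thread publishes a DC
        void Dequeue() { ++mTail; }     // last worker retires a DC
        bool IsEmpty() const { return mHead == mTail; }
        bool IsFull() const { return (mHead - mTail) == mNumEntries; }
        uint64_t GetHead() const { return mHead; }  // also used as the drawId

        T* mpRing = nullptr;
        uint32_t mNumEntries = 0;
        std::atomic<uint64_t> mHead{0}, mTail{0};
    };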
>
> DRAW_CONTEXT* GetDrawContext(SWR_CONTEXT *pContext, bool isSplitDraw = false)
> @@ -281,22 +237,22 @@ DRAW_CONTEXT* GetDrawContext(SWR_CONTEXT *pContext, bool isSplitDraw = false)
> // If current draw context is null then need to obtain a new draw context to use from ring.
> if (pContext->pCurDrawContext == nullptr)
> {
> - uint32_t dcIndex = pContext->nextDrawId % KNOB_MAX_DRAWS_IN_FLIGHT;
> -
> - DRAW_CONTEXT* pCurDrawContext = &pContext->dcRing[dcIndex];
> - pContext->pCurDrawContext = pCurDrawContext;
> -
> - // Need to wait until this draw context is available to use.
> - while (StillDrawing(pContext, pCurDrawContext))
> + // Need to wait for a free entry.
> + while (pContext->dcRing.IsFull())
> {
> _mm_pause();
> }
>
> + uint32_t dcIndex = pContext->dcRing.GetHead() % KNOB_MAX_DRAWS_IN_FLIGHT;
> +
> + DRAW_CONTEXT* pCurDrawContext = &pContext->dcRing[dcIndex];
> + pContext->pCurDrawContext = pCurDrawContext;
> +
> // Assign next available entry in DS ring to this DC.
> uint32_t dsIndex = pContext->curStateId % KNOB_MAX_DRAWS_IN_FLIGHT;
> pCurDrawContext->pState = &pContext->dsRing[dsIndex];
>
> - Arena& stateArena = *(pCurDrawContext->pState->pArena);
> + auto& stateArena = *(pCurDrawContext->pState->pArena);
>
> // Copy previous state to current state.
> if (pContext->pPrevDrawContext)
> @@ -332,18 +288,15 @@ DRAW_CONTEXT* GetDrawContext(SWR_CONTEXT *pContext, bool isSplitDraw = false)
> pCurDrawContext->pArena->Reset();
> pCurDrawContext->pContext = pContext;
> pCurDrawContext->isCompute = false; // Dispatch has to set this to true.
> - pCurDrawContext->inUse = false;
>
> - pCurDrawContext->doneCompute = false;
> pCurDrawContext->doneFE = false;
> pCurDrawContext->FeLock = 0;
> - pCurDrawContext->threadsDoneFE = 0;
> - pCurDrawContext->threadsDoneBE = 0;
> + pCurDrawContext->threadsDone = 0;
>
> pCurDrawContext->pTileMgr->initialize();
>
> // Assign unique drawId for this DC
> - pCurDrawContext->drawId = pContext->nextDrawId++;
> + pCurDrawContext->drawId = pContext->dcRing.GetHead();
> }
> else
> {
> @@ -354,38 +307,36 @@ DRAW_CONTEXT* GetDrawContext(SWR_CONTEXT *pContext, bool isSplitDraw = false)
> return pContext->pCurDrawContext;
> }
>
> -void SWR_API SwrSetActiveSubContext(
> - HANDLE hContext,
> - uint32_t subContextIndex)
> +API_STATE* GetDrawState(SWR_CONTEXT *pContext)
> {
> - SWR_CONTEXT *pContext = (SWR_CONTEXT*)hContext;
> - if (subContextIndex >= pContext->numSubContexts)
> - {
> - return;
> - }
> + DRAW_CONTEXT* pDC = GetDrawContext(pContext);
> + SWR_ASSERT(pDC->pState != nullptr);
>
> - if (subContextIndex != pContext->curSubCtxId)
> - {
> - // Save and restore draw state
> - DRAW_CONTEXT* pDC = GetDrawContext(pContext);
> - CopyState(
> - pContext->subCtxSave[pContext->curSubCtxId],
> - *(pDC->pState));
> + return &pDC->pState->state;
> +}
>
> - CopyState(
> - *(pDC->pState),
> - pContext->subCtxSave[subContextIndex]);
> +void SWR_API SwrSaveState(
> + HANDLE hContext,
> + void* pOutputStateBlock,
> + size_t memSize)
> +{
> + SWR_CONTEXT *pContext = (SWR_CONTEXT*)hContext;
> + auto pSrc = GetDrawState(pContext);
> + SWR_ASSERT(pOutputStateBlock && memSize >= sizeof(*pSrc));
>
> - pContext->curSubCtxId = subContextIndex;
> - }
> + memcpy(pOutputStateBlock, pSrc, sizeof(*pSrc));
> }
>
> -API_STATE* GetDrawState(SWR_CONTEXT *pContext)
> +void SWR_API SwrRestoreState(
> + HANDLE hContext,
> + const void* pStateBlock,
> + size_t memSize)
> {
> - DRAW_CONTEXT* pDC = GetDrawContext(pContext);
> - SWR_ASSERT(pDC->pState != nullptr);
> + SWR_CONTEXT *pContext = (SWR_CONTEXT*)hContext;
> + auto pDst = GetDrawState(pContext);
> + SWR_ASSERT(pStateBlock && memSize >= sizeof(*pDst));
>
> - return &pDC->pState->state;
> + memcpy(pDst, pStateBlock, sizeof(*pDst));
> }
>
> void SetupDefaultState(SWR_CONTEXT *pContext)
> @@ -431,16 +382,12 @@ void SwrWaitForIdle(HANDLE hContext)
> SWR_CONTEXT *pContext = GetContext(hContext);
>
> RDTSC_START(APIWaitForIdle);
> - // Wait for all work to complete.
> - for (uint32_t dc = 0; dc < KNOB_MAX_DRAWS_IN_FLIGHT; ++dc)
> - {
> - DRAW_CONTEXT *pDC = &pContext->dcRing[dc];
>
> - while (StillDrawing(pContext, pDC))
> - {
> - _mm_pause();
> - }
> + while (!pContext->dcRing.IsEmpty())
> + {
> + _mm_pause();
> }
> +
> RDTSC_STOP(APIWaitForIdle, 1, 0);
> }
>
> @@ -770,16 +717,25 @@ void SetupMacroTileScissors(DRAW_CONTEXT *pDC)
> pState->scissorInFixedPoint.bottom = bottom * FIXED_POINT_SCALE - 1;
> }
> }
> -
> +// templated backend function tables
> +extern PFN_BACKEND_FUNC gBackendNullPs[SWR_MULTISAMPLE_TYPE_MAX];
> +extern PFN_BACKEND_FUNC gBackendSingleSample[2][2];
> +extern PFN_BACKEND_FUNC gBackendPixelRateTable[SWR_MULTISAMPLE_TYPE_MAX][SWR_MSAA_SAMPLE_PATTERN_MAX][SWR_INPUT_COVERAGE_MAX][2][2];
> +extern PFN_BACKEND_FUNC gBackendSampleRateTable[SWR_MULTISAMPLE_TYPE_MAX][SWR_INPUT_COVERAGE_MAX][2];
> +extern PFN_OUTPUT_MERGER gBackendOutputMergerTable[SWR_NUM_RENDERTARGETS + 1][SWR_MULTISAMPLE_TYPE_MAX];
> +extern PFN_CALC_PIXEL_BARYCENTRICS gPixelBarycentricTable[2];
> +extern PFN_CALC_SAMPLE_BARYCENTRICS gSampleBarycentricTable[2];
> +extern PFN_CALC_CENTROID_BARYCENTRICS gCentroidBarycentricTable[SWR_MULTISAMPLE_TYPE_MAX][2][2][2];
> void SetupPipeline(DRAW_CONTEXT *pDC)
> {
> DRAW_STATE* pState = pDC->pState;
> const SWR_RASTSTATE &rastState = pState->state.rastState;
> + const SWR_PS_STATE &psState = pState->state.psState;
> BACKEND_FUNCS& backendFuncs = pState->backendFuncs;
> const uint32_t forcedSampleCount = (rastState.bForcedSampleCount) ? 1 : 0;
>
> // setup backend
> - if (pState->state.psState.pfnPixelShader == nullptr)
> + if (psState.pfnPixelShader == nullptr)
> {
> backendFuncs.pfnBackend = gBackendNullPs[pState->state.rastState.sampleCount];
> // always need to generate I & J per sample for Z interpolation
> @@ -788,41 +744,40 @@ void SetupPipeline(DRAW_CONTEXT *pDC)
> else
> {
> const bool bMultisampleEnable = ((rastState.sampleCount > SWR_MULTISAMPLE_1X) || rastState.bForcedSampleCount) ? 1 : 0;
> - const uint32_t centroid = ((pState->state.psState.barycentricsMask & SWR_BARYCENTRIC_CENTROID_MASK) > 0) ? 1 : 0;
> + const uint32_t centroid = ((psState.barycentricsMask & SWR_BARYCENTRIC_CENTROID_MASK) > 0) ? 1 : 0;
>
> // currently only support 'normal' input coverage
> - SWR_ASSERT(pState->state.psState.inputCoverage == SWR_INPUT_COVERAGE_NORMAL ||
> - pState->state.psState.inputCoverage == SWR_INPUT_COVERAGE_NONE);
> + SWR_ASSERT(psState.inputCoverage == SWR_INPUT_COVERAGE_NORMAL ||
> + psState.inputCoverage == SWR_INPUT_COVERAGE_NONE);
>
> - SWR_BARYCENTRICS_MASK barycentricsMask = (SWR_BARYCENTRICS_MASK)pState->state.psState.barycentricsMask;
> + SWR_BARYCENTRICS_MASK barycentricsMask = (SWR_BARYCENTRICS_MASK)psState.barycentricsMask;
>
> // select backend function
> - switch(pState->state.psState.shadingRate)
> + switch(psState.shadingRate)
> {
> case SWR_SHADING_RATE_PIXEL:
> if(bMultisampleEnable)
> {
> // always need to generate I & J per sample for Z interpolation
> barycentricsMask = (SWR_BARYCENTRICS_MASK)(barycentricsMask | SWR_BARYCENTRIC_PER_SAMPLE_MASK);
> - backendFuncs.pfnBackend = gBackendPixelRateTable[rastState.sampleCount][rastState.samplePattern][pState->state.psState.inputCoverage][centroid][forcedSampleCount];
> - backendFuncs.pfnOutputMerger = gBackendOutputMergerTable[pState->state.psState.numRenderTargets][pState->state.blendState.sampleCount];
> + backendFuncs.pfnBackend = gBackendPixelRateTable[rastState.sampleCount][rastState.samplePattern][psState.inputCoverage][centroid][forcedSampleCount];
> + backendFuncs.pfnOutputMerger = gBackendOutputMergerTable[psState.numRenderTargets][pState->state.blendState.sampleCount];
> }
> else
> {
> // always need to generate I & J per pixel for Z interpolation
> barycentricsMask = (SWR_BARYCENTRICS_MASK)(barycentricsMask | SWR_BARYCENTRIC_PER_PIXEL_MASK);
> - backendFuncs.pfnBackend = gBackendSingleSample[pState->state.psState.inputCoverage][centroid];
> - backendFuncs.pfnOutputMerger = gBackendOutputMergerTable[pState->state.psState.numRenderTargets][SWR_MULTISAMPLE_1X];
> + backendFuncs.pfnBackend = gBackendSingleSample[psState.inputCoverage][centroid];
> + backendFuncs.pfnOutputMerger = gBackendOutputMergerTable[psState.numRenderTargets][SWR_MULTISAMPLE_1X];
> }
> break;
> case SWR_SHADING_RATE_SAMPLE:
> SWR_ASSERT(rastState.samplePattern == SWR_MSAA_STANDARD_PATTERN);
> // always need to generate I & J per sample for Z interpolation
> barycentricsMask = (SWR_BARYCENTRICS_MASK)(barycentricsMask | SWR_BARYCENTRIC_PER_SAMPLE_MASK);
> - backendFuncs.pfnBackend = gBackendSampleRateTable[rastState.sampleCount][pState->state.psState.inputCoverage][centroid];
> - backendFuncs.pfnOutputMerger = gBackendOutputMergerTable[pState->state.psState.numRenderTargets][pState->state.blendState.sampleCount];
> + backendFuncs.pfnBackend = gBackendSampleRateTable[rastState.sampleCount][psState.inputCoverage][centroid];
> + backendFuncs.pfnOutputMerger = gBackendOutputMergerTable[psState.numRenderTargets][pState->state.blendState.sampleCount];
> break;
> - case SWR_SHADING_RATE_COARSE:
> default:
> SWR_ASSERT(0 && "Invalid shading rate");
> break;
> @@ -913,7 +868,7 @@ void SetupPipeline(DRAW_CONTEXT *pDC)
>
> uint32_t numRTs = pState->state.psState.numRenderTargets;
> pState->state.colorHottileEnable = 0;
> - if(pState->state.psState.pfnPixelShader != nullptr)
> + if (psState.pfnPixelShader != nullptr)
> {
> for (uint32_t rt = 0; rt < numRTs; ++rt)
> {
> @@ -1005,6 +960,11 @@ uint32_t MaxVertsPerDraw(
> }
> break;
>
> + // The Primitive Assembly code can only handle 1 RECT at a time.
> + case TOP_RECT_LIST:
> + vertsPerDraw = 3;
> + break;
> +
> default:
> // We are not splitting up draws for other topologies.
> break;
> @@ -1305,7 +1265,10 @@ void SwrDrawIndexedInstanced(
> DrawIndexedInstance(hContext, topology, numIndices, indexOffset, baseVertex, numInstances, startInstance);
> }
>
> -// Attach surfaces to pipeline
> +//////////////////////////////////////////////////////////////////////////
> +/// @brief SwrInvalidateTiles
> +/// @param hContext - Handle passed back from SwrCreateContext
> +/// @param attachmentMask - Mask specifying which hottile attachments to invalidate.
> void SwrInvalidateTiles(
> HANDLE hContext,
> uint32_t attachmentMask)
> @@ -1313,10 +1276,39 @@ void SwrInvalidateTiles(
> SWR_CONTEXT *pContext = (SWR_CONTEXT*)hContext;
> DRAW_CONTEXT* pDC = GetDrawContext(pContext);
>
> + pDC->FeWork.type = DISCARDINVALIDATETILES;
> + pDC->FeWork.pfnWork = ProcessDiscardInvalidateTiles;
> + pDC->FeWork.desc.discardInvalidateTiles.attachmentMask = attachmentMask;
> + memset(&pDC->FeWork.desc.discardInvalidateTiles.rect, 0, sizeof(SWR_RECT));
> + pDC->FeWork.desc.discardInvalidateTiles.newTileState = SWR_TILE_INVALID;
> + pDC->FeWork.desc.discardInvalidateTiles.createNewTiles = false;
> + pDC->FeWork.desc.discardInvalidateTiles.fullTilesOnly = false;
> +
> + //enqueue
> + QueueDraw(pContext);
> +}
> +
> +//////////////////////////////////////////////////////////////////////////
> +/// @brief SwrDiscardRect
> +/// @param hContext - Handle passed back from SwrCreateContext
> +/// @param attachmentMask - Mask specifying which hottile attachments to discard.
> +/// @param rect - If rect is all zeros, the entire attachment surface is discarded.
> +void SwrDiscardRect(
> + HANDLE hContext,
> + uint32_t attachmentMask,
> + SWR_RECT rect)
> +{
> + SWR_CONTEXT *pContext = (SWR_CONTEXT*)hContext;
> + DRAW_CONTEXT* pDC = GetDrawContext(pContext);
> +
> // Queue a load to the hottile
> - pDC->FeWork.type = INVALIDATETILES;
> - pDC->FeWork.pfnWork = ProcessInvalidateTiles;
> - pDC->FeWork.desc.invalidateTiles.attachmentMask = attachmentMask;
> + pDC->FeWork.type = DISCARDINVALIDATETILES;
> + pDC->FeWork.pfnWork = ProcessDiscardInvalidateTiles;
> + pDC->FeWork.desc.discardInvalidateTiles.attachmentMask = attachmentMask;
> + pDC->FeWork.desc.discardInvalidateTiles.rect = rect;
> + pDC->FeWork.desc.discardInvalidateTiles.newTileState = SWR_TILE_RESOLVED;
> + pDC->FeWork.desc.discardInvalidateTiles.createNewTiles = true;
> + pDC->FeWork.desc.discardInvalidateTiles.fullTilesOnly = true;
>
> //enqueue
> QueueDraw(pContext);
> @@ -1391,7 +1383,7 @@ void SwrClearRenderTarget(
> uint32_t clearMask,
> const float clearColor[4],
> float z,
> - BYTE stencil)
> + uint8_t stencil)
> {
> RDTSC_START(APIClearRenderTarget);
>
> diff --git a/src/gallium/drivers/swr/rasterizer/core/api.h b/src/gallium/drivers/swr/rasterizer/core/api.h
> index 72fae8b..90c2f03 100644
> --- a/src/gallium/drivers/swr/rasterizer/core/api.h
> +++ b/src/gallium/drivers/swr/rasterizer/core/api.h
> @@ -53,7 +53,7 @@ typedef void(SWR_API *PFN_CALLBACK_FUNC)(uint64_t data, uint64_t data2, uint64_t
> /// @param pDstHotTile - pointer to the hot tile surface
> typedef void(SWR_API *PFN_LOAD_TILE)(HANDLE hPrivateContext, SWR_FORMAT dstFormat,
> SWR_RENDERTARGET_ATTACHMENT renderTargetIndex,
> - uint32_t x, uint32_t y, uint32_t renderTargetArrayIndex, BYTE *pDstHotTile);
> + uint32_t x, uint32_t y, uint32_t renderTargetArrayIndex, uint8_t *pDstHotTile);
>
> //////////////////////////////////////////////////////////////////////////
> /// @brief Function signature for store hot tiles
> @@ -65,7 +65,7 @@ typedef void(SWR_API *PFN_LOAD_TILE)(HANDLE hPrivateContext, SWR_FORMAT dstForma
> /// @param pSrcHotTile - pointer to the hot tile surface
> typedef void(SWR_API *PFN_STORE_TILE)(HANDLE hPrivateContext, SWR_FORMAT srcFormat,
> SWR_RENDERTARGET_ATTACHMENT renderTargetIndex,
> - uint32_t x, uint32_t y, uint32_t renderTargetArrayIndex, BYTE *pSrcHotTile);
> + uint32_t x, uint32_t y, uint32_t renderTargetArrayIndex, uint8_t *pSrcHotTile);
>
> /// @brief Function signature for clearing from the hot tiles clear value
> /// @param hPrivateContext - handle to private data
> @@ -77,6 +77,8 @@ typedef void(SWR_API *PFN_CLEAR_TILE)(HANDLE hPrivateContext,
> SWR_RENDERTARGET_ATTACHMENT rtIndex,
> uint32_t x, uint32_t y, const float* pClearColor);
>
> +class BucketManager;
> +
> //////////////////////////////////////////////////////////////////////////
> /// SWR_CREATECONTEXT_INFO
> /////////////////////////////////////////////////////////////////////////
> @@ -88,13 +90,17 @@ struct SWR_CREATECONTEXT_INFO
> // Use SwrGetPrivateContextState() to access private state.
> uint32_t privateStateSize;
>
> - // Each SWR context can have multiple sets of active state
> - uint32_t maxSubContexts;
> -
> - // tile manipulation functions
> + // Tile manipulation functions
> PFN_LOAD_TILE pfnLoadTile;
> PFN_STORE_TILE pfnStoreTile;
> PFN_CLEAR_TILE pfnClearTile;
> +
> +    // Pointer to the rdtsc bucket manager, returned to the caller.
> +    // Only populated when KNOB_ENABLE_RDTSC is set.
> + BucketManager* pBucketMgr;
> +
> +    // Output: size of memory required for SwrSaveState / SwrRestoreState
> + size_t contextSaveSize;
> };
>
> //////////////////////////////////////////////////////////////////////////
> @@ -112,7 +118,7 @@ struct SWR_RECT
> /// @brief Create SWR Context.
> /// @param pCreateInfo - pointer to creation info.
> HANDLE SWR_API SwrCreateContext(
> - const SWR_CREATECONTEXT_INFO* pCreateInfo);
> + SWR_CREATECONTEXT_INFO* pCreateInfo);
>
> //////////////////////////////////////////////////////////////////////////
> /// @brief Destroys SWR Context.
> @@ -121,12 +127,24 @@ void SWR_API SwrDestroyContext(
> HANDLE hContext);
>
> //////////////////////////////////////////////////////////////////////////
> -/// @brief Set currently active state context
> -/// @param subContextIndex - value from 0 to
> -/// SWR_CREATECONTEXT_INFO.maxSubContexts. Defaults to 0.
> -void SWR_API SwrSetActiveSubContext(
> +/// @brief Saves API state associated with hContext
> +/// @param hContext - Handle passed back from SwrCreateContext
> +/// @param pOutputStateBlock - Memory block to receive API state data
> +/// @param memSize - Size of memory pointed to by pOutputStateBlock
> +void SWR_API SwrSaveState(
> HANDLE hContext,
> - uint32_t subContextIndex);
> + void* pOutputStateBlock,
> + size_t memSize);
> +
> +//////////////////////////////////////////////////////////////////////////
> +/// @brief Restores API state to hContext previously saved with SwrSaveState
> +/// @param hContext - Handle passed back from SwrCreateContext
> +/// @param pStateBlock - Memory block to read API state data from
> +/// @param memSize - Size of memory pointed to by pStateBlock
> +void SWR_API SwrRestoreState(
> + HANDLE hContext,
> + const void* pStateBlock,
> + size_t memSize);
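
With maxSubContexts gone, drivers now save and restore state explicitly. A
minimal usage sketch, assuming the caller sizes the block from the
contextSaveSize that SwrCreateContext writes back:

    #include <cstdlib>

    SWR_CREATECONTEXT_INFO info = {};
    // ... fill in driver, privateStateSize and the tile callbacks ...
    HANDLE hContext = SwrCreateContext(&info);

    void* pSave = malloc(info.contextSaveSize);  // contextSaveSize is an output
    SwrSaveState(hContext, pSave, info.contextSaveSize);
    // ... change state, draw ...
    SwrRestoreState(hContext, pSave, info.contextSaveSize);
    free(pSave);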
>
> //////////////////////////////////////////////////////////////////////////
> /// @brief Sync cmd. Executes the callback func when all rendering up to this sync
> @@ -391,6 +409,16 @@ void SWR_API SwrInvalidateTiles(
> uint32_t attachmentMask);
>
> //////////////////////////////////////////////////////////////////////////
> +/// @brief SwrDiscardRect
> +/// @param hContext - Handle passed back from SwrCreateContext
> +/// @param attachmentMask - Mask specifying which hottile attachments to discard.
> +/// @param rect - If rect is all zeros, the entire attachment surface is discarded.
> +void SWR_API SwrDiscardRect(
> + HANDLE hContext,
> + uint32_t attachmentMask,
> + SWR_RECT rect);
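
For example, discarding all of color attachment 0 (the attachment enum value
is assumed from the SWR state headers):

    SWR_RECT rect = {};  // all zeros: discard the entire surface
    SwrDiscardRect(hContext, 1 << SWR_ATTACHMENT_COLOR0, rect);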
> +
> +//////////////////////////////////////////////////////////////////////////
> /// @brief SwrDispatch
> /// @param hContext - Handle passed back from SwrCreateContext
> /// @param threadGroupCountX - Number of thread groups dispatched in X direction
> @@ -419,9 +447,9 @@ void SWR_API SwrStoreTiles(
> void SWR_API SwrClearRenderTarget(
> HANDLE hContext,
> uint32_t clearMask,
> - const FLOAT clearColor[4],
> + const float clearColor[4],
> float z,
> - BYTE stencil);
> + uint8_t stencil);
>
> void SWR_API SwrSetRastState(
> HANDLE hContext,
> diff --git a/src/gallium/drivers/swr/rasterizer/core/arena.cpp b/src/gallium/drivers/swr/rasterizer/core/arena.cpp
> deleted file mode 100644
> index 8184c8d..0000000
> --- a/src/gallium/drivers/swr/rasterizer/core/arena.cpp
> +++ /dev/null
> @@ -1,166 +0,0 @@
> -/****************************************************************************
> -* Copyright (C) 2014-2015 Intel Corporation. All Rights Reserved.
> -*
> -* Permission is hereby granted, free of charge, to any person obtaining a
> -* copy of this software and associated documentation files (the "Software"),
> -* to deal in the Software without restriction, including without limitation
> -* the rights to use, copy, modify, merge, publish, distribute, sublicense,
> -* and/or sell copies of the Software, and to permit persons to whom the
> -* Software is furnished to do so, subject to the following conditions:
> -*
> -* The above copyright notice and this permission notice (including the next
> -* paragraph) shall be included in all copies or substantial portions of the
> -* Software.
> -*
> -* THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
> -* IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
> -* FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL
> -* THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
> -* LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
> -* FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS
> -* IN THE SOFTWARE.
> -*
> -* @file arena.cpp
> -*
> -* @brief Arena memory manager
> -* The arena is convenient and fast for managing allocations for any of
> -* our allocations that are associated with operations and can all be freed
> -* once when their operation has completed. Allocations are cheap since
> -* most of the time its simply an increment of an offset. Also, no need to
> -* free individual allocations. All of the arena memory can be freed at once.
> -*
> -******************************************************************************/
> -
> -#include "context.h"
> -#include "arena.h"
> -
> -#include <cmath>
> -
> -Arena::Arena()
> - : m_pCurBlock(nullptr), m_size(0)
> -{
> - m_pMutex = new std::mutex();
> -}
> -
> -Arena::~Arena()
> -{
> - Reset(); // Reset just in case to avoid leaking memory.
> -
> - if (m_pCurBlock)
> - {
> - _aligned_free(m_pCurBlock->pMem);
> - delete m_pCurBlock;
> - }
> -
> - delete m_pMutex;
> -}
> -
> -///@todo Remove this when all users have stopped using this.
> -void Arena::Init()
> -{
> - m_size = 0;
> - m_pCurBlock = nullptr;
> -
> - m_pMutex = new std::mutex();
> -}
> -
> -void* Arena::AllocAligned(size_t size, size_t align)
> -{
> - if (m_pCurBlock)
> - {
> - ArenaBlock* pCurBlock = m_pCurBlock;
> - pCurBlock->offset = AlignUp(pCurBlock->offset, align);
> -
> - if ((pCurBlock->offset + size) <= pCurBlock->blockSize)
> - {
> - void* pMem = PtrAdd(pCurBlock->pMem, pCurBlock->offset);
> - pCurBlock->offset += size;
> - m_size += size;
> - return pMem;
> - }
> -
> - // Not enough memory in this block, fall through to allocate
> - // a new block
> - }
> -
> - static const size_t ArenaBlockSize = 1024*1024;
> - size_t blockSize = std::max(m_size + ArenaBlockSize, std::max(size, ArenaBlockSize));
> - blockSize = AlignUp(blockSize, KNOB_SIMD_WIDTH*4);
> -
> - void *pMem = _aligned_malloc(blockSize, KNOB_SIMD_WIDTH*4); // Arena blocks are always simd byte aligned.
> - SWR_ASSERT(pMem != nullptr);
> -
> - ArenaBlock* pNewBlock = new (std::nothrow) ArenaBlock();
> - SWR_ASSERT(pNewBlock != nullptr);
> -
> - if (pNewBlock != nullptr)
> - {
> - pNewBlock->pNext = m_pCurBlock;
> -
> - m_pCurBlock = pNewBlock;
> - m_pCurBlock->pMem = pMem;
> - m_pCurBlock->blockSize = blockSize;
> -
> - }
> -
> - return AllocAligned(size, align);
> -}
> -
> -void* Arena::Alloc(size_t size)
> -{
> - return AllocAligned(size, 1);
> -}
> -
> -void* Arena::AllocAlignedSync(size_t size, size_t align)
> -{
> - void* pAlloc = nullptr;
> -
> - SWR_ASSERT(m_pMutex != nullptr);
> -
> - m_pMutex->lock();
> - pAlloc = AllocAligned(size, align);
> - m_pMutex->unlock();
> -
> - return pAlloc;
> -}
> -
> -void* Arena::AllocSync(size_t size)
> -{
> - void* pAlloc = nullptr;
> -
> - SWR_ASSERT(m_pMutex != nullptr);
> -
> - m_pMutex->lock();
> - pAlloc = Alloc(size);
> - m_pMutex->unlock();
> -
> - return pAlloc;
> -}
> -
> -void Arena::Reset(bool removeAll)
> -{
> - if (m_pCurBlock)
> - {
> - m_pCurBlock->offset = 0;
> -
> - ArenaBlock *pUsedBlocks = m_pCurBlock->pNext;
> - m_pCurBlock->pNext = nullptr;
> - while(pUsedBlocks)
> - {
> - ArenaBlock* pBlock = pUsedBlocks;
> - pUsedBlocks = pBlock->pNext;
> -
> - _aligned_free(pBlock->pMem);
> - delete pBlock;
> - }
> -
> - if (removeAll)
> - {
> - _aligned_free(m_pCurBlock->pMem);
> - delete m_pCurBlock;
> - m_pCurBlock = nullptr;
> - }
> - }
> -
> - m_size = 0;
> -}
> diff --git a/src/gallium/drivers/swr/rasterizer/core/arena.h b/src/gallium/drivers/swr/rasterizer/core/arena.h
> index 76eee11..7f635b8 100644
> --- a/src/gallium/drivers/swr/rasterizer/core/arena.h
> +++ b/src/gallium/drivers/swr/rasterizer/core/arena.h
> @@ -33,37 +33,307 @@
> #pragma once
>
> #include <mutex>
> +#include <algorithm>
> +#include <atomic>
> +#include "core/utils.h"
>
> -class Arena
> +class DefaultAllocator
> {
> public:
> - Arena();
> - ~Arena();
> + void* AllocateAligned(size_t size, size_t align)
> + {
> + void* p = _aligned_malloc(size, align);
> + return p;
> + }
> + void Free(void* pMem)
> + {
> + _aligned_free(pMem);
> + }
> +};
>
> - void Init();
> +static const size_t ARENA_BLOCK_SHIFT = 5;
> +static const size_t ARENA_BLOCK_ALIGN = KNOB_SIMD_WIDTH * 4;
> +static_assert((1U << ARENA_BLOCK_SHIFT) == ARENA_BLOCK_ALIGN,
> + "Invalid value for ARENA_BLOCK_ALIGN/SHIFT");
>
> - void* AllocAligned(size_t size, size_t align);
> - void* Alloc(size_t size);
> +struct ArenaBlock
> +{
> + size_t blockSize = 0;
> + ArenaBlock* pNext = nullptr;
> +};
> +static_assert(sizeof(ArenaBlock) <= ARENA_BLOCK_ALIGN,
> + "Increase BLOCK_ALIGN size");
>
> - void* AllocAlignedSync(size_t size, size_t align);
> - void* AllocSync(size_t size);
> +// Caching Allocator for Arena
> +template<uint32_t NumBucketsT = 1, uint32_t StartBucketBitT = 20>
> +struct CachingAllocatorT : DefaultAllocator
> +{
> + static uint32_t GetBucketId(size_t blockSize)
> + {
> + uint32_t bucketId = 0;
>
> - void Reset(bool removeAll = false);
> - size_t Size() { return m_size; }
> +#if defined(BitScanReverseSizeT)
> + BitScanReverseSizeT((unsigned long*)&bucketId, blockSize >> CACHE_START_BUCKET_BIT);
> + bucketId = std::min<uint32_t>(bucketId, CACHE_NUM_BUCKETS - 1);
> +#endif
>
> -private:
> + return bucketId;
> + }
> +
> + void* AllocateAligned(size_t size, size_t align)
> + {
> + SWR_ASSERT(size >= sizeof(ArenaBlock));
> + SWR_ASSERT(size <= uint32_t(-1));
> +
> + size_t blockSize = size - ARENA_BLOCK_ALIGN;
> +
> + {
> + // search cached blocks
> + std::lock_guard<std::mutex> l(m_mutex);
> + ArenaBlock* pPrevBlock = &m_cachedBlocks[GetBucketId(blockSize)];
> + ArenaBlock* pBlock = pPrevBlock->pNext;
> + ArenaBlock* pPotentialBlock = nullptr;
> + ArenaBlock* pPotentialPrev = nullptr;
> +
> + while (pBlock)
> + {
> + if (pBlock->blockSize >= blockSize)
> + {
> + if (pBlock == AlignUp(pBlock, align))
> + {
> + if (pBlock->blockSize == blockSize)
> + {
> + // Won't find a better match
> + break;
> + }
> +
> + // We could use this as it is larger than we wanted, but
> + // continue to search for a better match
> + pPotentialBlock = pBlock;
> + pPotentialPrev = pPrevBlock;
> + }
> + }
> + else
> + {
> + // Blocks are sorted by size (biggest first)
> + // So, if we get here, there are no blocks
> + // large enough, fall through to allocation.
> + pBlock = nullptr;
> + break;
> + }
> +
> + pPrevBlock = pBlock;
> + pBlock = pBlock->pNext;
> + }
> +
> + if (!pBlock)
> + {
> + // Couldn't find an exact match, use next biggest size
> + pBlock = pPotentialBlock;
> + pPrevBlock = pPotentialPrev;
> + }
> +
> + if (pBlock)
> + {
> + SWR_ASSERT(pPrevBlock && pPrevBlock->pNext == pBlock);
> + pPrevBlock->pNext = pBlock->pNext;
> + pBlock->pNext = nullptr;
>
> - struct ArenaBlock
> + return pBlock;
> + }
> +
> + m_totalAllocated += size;
> +
> +#if 0
> + {
> + static uint32_t count = 0;
> + char buf[128];
> + sprintf_s(buf, "Arena Alloc %d 0x%llx bytes - 0x%llx total\n", ++count, uint64_t(size), uint64_t(m_totalAllocated));
> + OutputDebugStringA(buf);
> + }
> +#endif
> + }
> +
> + return this->DefaultAllocator::AllocateAligned(size, align);
> + }
> +
> + void Free(void* pMem)
> + {
> + if (pMem)
> + {
> + ArenaBlock* pNewBlock = reinterpret_cast<ArenaBlock*>(pMem);
> +            SWR_ASSERT(pNewBlock->blockSize > 0); // blockSize is unsigned; >= 0 would be vacuous
> +
> + std::unique_lock<std::mutex> l(m_mutex);
> + ArenaBlock* pPrevBlock = &m_cachedBlocks[GetBucketId(pNewBlock->blockSize)];
> + ArenaBlock* pBlock = pPrevBlock->pNext;
> +
> + while (pBlock)
> + {
> + if (pNewBlock->blockSize >= pBlock->blockSize)
> + {
> + // Insert here
> + break;
> + }
> + pPrevBlock = pBlock;
> + pBlock = pBlock->pNext;
> + }
> +
> + // Insert into list
> + SWR_ASSERT(pPrevBlock);
> + pPrevBlock->pNext = pNewBlock;
> + pNewBlock->pNext = pBlock;
> + }
> + }
> +
> + ~CachingAllocatorT()
> {
> - void* pMem = nullptr;
> - size_t blockSize = 0;
> - size_t offset = 0;
> - ArenaBlock* pNext = nullptr;
> - };
> + // Free all cached blocks
> + for (uint32_t i = 0; i < CACHE_NUM_BUCKETS; ++i)
> + {
> + ArenaBlock* pBlock = m_cachedBlocks[i].pNext;
> + while (pBlock)
> + {
> + ArenaBlock* pNext = pBlock->pNext;
> + this->DefaultAllocator::Free(pBlock);
> + pBlock = pNext;
> + }
> + }
> + }
>
> - ArenaBlock* m_pCurBlock = nullptr;
> - size_t m_size = 0;
> + // buckets, for block sizes < (1 << (start+1)), < (1 << (start+2)), ...
> + static const uint32_t CACHE_NUM_BUCKETS = NumBucketsT;
> + static const uint32_t CACHE_START_BUCKET_BIT = StartBucketBitT;
> +
> + ArenaBlock m_cachedBlocks[CACHE_NUM_BUCKETS];
> + std::mutex m_mutex;
> +
> + size_t m_totalAllocated = 0;
> +};
> +typedef CachingAllocatorT<> CachingAllocator;
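
GetBucketId is effectively floor(log2(blockSize >> StartBucketBitT)), clamped
to the last bucket. Worked numbers, assuming StartBucketBitT = 20 and four
buckets (with the shipped single-bucket default, everything maps to bucket 0):

    // blockSize = 1 MiB -> 1 MiB >> 20 = 1 -> bsr = 0 -> bucket 0
    // blockSize = 3 MiB ->               3 -> bsr = 1 -> bucket 1
    // blockSize = 9 MiB ->               9 -> bsr = 3 -> bucket 3 (clamped to NumBuckets - 1)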
> +
> +template<typename T = DefaultAllocator, size_t BlockSizeT = (128 * 1024)>
> +class TArena
> +{
> +public:
> + TArena(T& in_allocator) : m_allocator(in_allocator) {}
> + TArena() : m_allocator(m_defAllocator) {}
> + ~TArena()
> + {
> + Reset(true);
> + }
> +
> + void* AllocAligned(size_t size, size_t align)
> + {
> + SWR_ASSERT(size);
> + SWR_ASSERT(align <= ARENA_BLOCK_ALIGN);
> +
> + if (m_pCurBlock)
> + {
> + ArenaBlock* pCurBlock = m_pCurBlock;
> + size_t offset = AlignUp(m_offset, align);
> +
> + if ((offset + size) <= pCurBlock->blockSize)
> + {
> + void* pMem = PtrAdd(pCurBlock, offset + ARENA_BLOCK_ALIGN);
> + m_offset = offset + size;
> + return pMem;
> + }
> +
> + // Not enough memory in this block, fall through to allocate
> + // a new block
> + }
> +
> + static const size_t ArenaBlockSize = BlockSizeT - ARENA_BLOCK_ALIGN;
> + size_t blockSize = std::max(size, ArenaBlockSize);
> +
> +        // Add one ARENA_BLOCK_ALIGN unit up front to store the ArenaBlock header in.
> + blockSize = AlignUp(blockSize, ARENA_BLOCK_ALIGN);
> +
> + void *pMem = m_allocator.AllocateAligned(blockSize + ARENA_BLOCK_ALIGN, ARENA_BLOCK_ALIGN); // Arena blocks are always simd byte aligned.
> + SWR_ASSERT(pMem != nullptr);
> +
> + ArenaBlock* pNewBlock = new (pMem) ArenaBlock();
> +
> + if (pNewBlock != nullptr)
> + {
> + m_offset = 0;
> + pNewBlock->pNext = m_pCurBlock;
> +
> + m_pCurBlock = pNewBlock;
> + m_pCurBlock->blockSize = blockSize;
> + }
> +
> + return AllocAligned(size, align);
> + }
> +
> + void* Alloc(size_t size)
> + {
> + return AllocAligned(size, 1);
> + }
> +
> + void* AllocAlignedSync(size_t size, size_t align)
> + {
> + void* pAlloc = nullptr;
> +
> + m_mutex.lock();
> + pAlloc = AllocAligned(size, align);
> + m_mutex.unlock();
> +
> + return pAlloc;
> + }
> +
> + void* AllocSync(size_t size)
> + {
> + void* pAlloc = nullptr;
> +
> + m_mutex.lock();
> + pAlloc = Alloc(size);
> + m_mutex.unlock();
> +
> + return pAlloc;
> + }
> +
> + void Reset(bool removeAll = false)
> + {
> + m_offset = 0;
> +
> + if (m_pCurBlock)
> + {
> + ArenaBlock *pUsedBlocks = m_pCurBlock->pNext;
> + m_pCurBlock->pNext = nullptr;
> + while (pUsedBlocks)
> + {
> + ArenaBlock* pBlock = pUsedBlocks;
> + pUsedBlocks = pBlock->pNext;
> +
> + m_allocator.Free(pBlock);
> + }
> +
> + if (removeAll)
> + {
> + m_allocator.Free(m_pCurBlock);
> + m_pCurBlock = nullptr;
> + }
> + }
> + }
> +
> + bool IsEmpty()
> + {
> + return (m_pCurBlock == nullptr) || (m_offset == 0 && m_pCurBlock->pNext == nullptr);
> + }
> +
> +private:
> +
> + ArenaBlock* m_pCurBlock = nullptr;
> + size_t m_offset = 0;
>
> /// @note Mutex is only used by sync allocation functions.
> - std::mutex* m_pMutex;
> + std::mutex m_mutex;
> +
> + DefaultAllocator m_defAllocator;
> + T& m_allocator;
> };
> +
> +using StdArena = TArena<DefaultAllocator>;
> +using CachingArena = TArena<CachingAllocator>;
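
Putting the pieces together: one shared CachingAllocator backs many arenas,
matching how api.cpp hands pContext->cachingArenaAllocator to each DC's
CachingArena. A small usage sketch:

    CachingAllocator cache;        // shared, mutex-protected block cache
    CachingArena arena(cache);

    void* p = arena.AllocAligned(256, 16);  // bump-pointer alloc; align <= ARENA_BLOCK_ALIGN
    arena.Reset();      // keeps the current block, returns the rest to the cache
    arena.Reset(true);  // returns every block; the arena is empty again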
> diff --git a/src/gallium/drivers/swr/rasterizer/core/backend.cpp b/src/gallium/drivers/swr/rasterizer/core/backend.cpp
> index 4a472bc..95110af 100644
> --- a/src/gallium/drivers/swr/rasterizer/core/backend.cpp
> +++ b/src/gallium/drivers/swr/rasterizer/core/backend.cpp
> @@ -156,7 +156,7 @@ void ProcessQueryStatsBE(DRAW_CONTEXT *pDC, uint32_t workerId, uint32_t macroTil
> }
>
> template<SWR_FORMAT format>
> -void ClearRasterTile(BYTE *pTileBuffer, simdvector &value)
> +void ClearRasterTile(uint8_t *pTileBuffer, simdvector &value)
> {
> auto lambda = [&](int comp)
> {
> @@ -299,10 +299,10 @@ void ProcessClearBE(DRAW_CONTEXT *pDC, uint32_t workerId, uint32_t macroTile, vo
> /// @todo clear data should come in as RGBA32_FLOAT
> DWORD clearData[4];
> float clearFloat[4];
> - clearFloat[0] = ((BYTE*)(&pClear->clearRTColor))[0] / 255.0f;
> - clearFloat[1] = ((BYTE*)(&pClear->clearRTColor))[1] / 255.0f;
> - clearFloat[2] = ((BYTE*)(&pClear->clearRTColor))[2] / 255.0f;
> - clearFloat[3] = ((BYTE*)(&pClear->clearRTColor))[3] / 255.0f;
> + clearFloat[0] = ((uint8_t*)(&pClear->clearRTColor))[0] / 255.0f;
> + clearFloat[1] = ((uint8_t*)(&pClear->clearRTColor))[1] / 255.0f;
> + clearFloat[2] = ((uint8_t*)(&pClear->clearRTColor))[2] / 255.0f;
> + clearFloat[3] = ((uint8_t*)(&pClear->clearRTColor))[3] / 255.0f;
> clearData[0] = *(DWORD*)&clearFloat[0];
> clearData[1] = *(DWORD*)&clearFloat[1];
> clearData[2] = *(DWORD*)&clearFloat[2];
> @@ -399,30 +399,32 @@ void ProcessStoreTileBE(DRAW_CONTEXT *pDC, uint32_t workerId, uint32_t macroTile
> }
>
>
> -void ProcessInvalidateTilesBE(DRAW_CONTEXT *pDC, uint32_t workerId, uint32_t macroTile, void *pData)
> +void ProcessDiscardInvalidateTilesBE(DRAW_CONTEXT *pDC, uint32_t workerId, uint32_t macroTile, void *pData)
> {
> - INVALIDATE_TILES_DESC *pDesc = (INVALIDATE_TILES_DESC*)pData;
> + DISCARD_INVALIDATE_TILES_DESC *pDesc = (DISCARD_INVALIDATE_TILES_DESC *)pData;
> SWR_CONTEXT *pContext = pDC->pContext;
>
> + const int numSamples = GetNumSamples(pDC->pState->state.rastState.sampleCount);
> +
> for (uint32_t i = 0; i < SWR_NUM_ATTACHMENTS; ++i)
> {
> if (pDesc->attachmentMask & (1 << i))
> {
> - HOTTILE *pHotTile = pContext->pHotTileMgr->GetHotTile(pContext, pDC, macroTile, (SWR_RENDERTARGET_ATTACHMENT)i, false);
> + HOTTILE *pHotTile = pContext->pHotTileMgr->GetHotTileNoLoad(
> + pContext, pDC, macroTile, (SWR_RENDERTARGET_ATTACHMENT)i, pDesc->createNewTiles, numSamples);
> if (pHotTile)
> {
> - pHotTile->state = HOTTILE_INVALID;
> + pHotTile->state = (HOTTILE_STATE)pDesc->newTileState;
> }
> }
> }
> }
>
> #if KNOB_SIMD_WIDTH == 8
> -const __m256 vQuadCenterOffsetsX = { 0.5, 1.5, 0.5, 1.5, 2.5, 3.5, 2.5, 3.5 };
> -const __m256 vQuadCenterOffsetsY = { 0.5, 0.5, 1.5, 1.5, 0.5, 0.5, 1.5, 1.5 };
> -const __m256 vQuadULOffsetsX ={0.0, 1.0, 0.0, 1.0, 2.0, 3.0, 2.0, 3.0};
> -const __m256 vQuadULOffsetsY ={0.0, 0.0, 1.0, 1.0, 0.0, 0.0, 1.0, 1.0};
> -#define MASK 0xff
> +const __m256 vCenterOffsetsX = {0.5, 1.5, 0.5, 1.5, 2.5, 3.5, 2.5, 3.5};
> +const __m256 vCenterOffsetsY = {0.5, 0.5, 1.5, 1.5, 0.5, 0.5, 1.5, 1.5};
> +const __m256 vULOffsetsX = {0.0, 1.0, 0.0, 1.0, 2.0, 3.0, 2.0, 3.0};
> +const __m256 vULOffsetsY = {0.0, 0.0, 1.0, 1.0, 0.0, 0.0, 1.0, 1.0};
> #else
> #error Unsupported vector width
> #endif
> @@ -457,155 +459,6 @@ simdmask ComputeUserClipMask(uint8_t clipMask, float* pUserClipBuffer, simdscala
> return _simd_movemask_ps(vClipMask);
> }
>
> -template<SWR_MULTISAMPLE_COUNT sampleCountT, bool bIsStandardPattern, bool bForcedSampleCount>
> -INLINE void generateInputCoverage(const uint64_t *const coverageMask, uint32_t (&inputMask)[KNOB_SIMD_WIDTH], const uint32_t sampleMask)
> -{
> -
> - // will need to update for avx512
> - assert(KNOB_SIMD_WIDTH == 8);
> -
> - __m256i mask[2];
> - __m256i sampleCoverage[2];
> - if(bIsStandardPattern)
> - {
> - __m256i src = _mm256_set1_epi32(0);
> - __m256i index0 = _mm256_set_epi32(7, 6, 5, 4, 3, 2, 1, 0), index1;
> -
> - if(MultisampleTraits<sampleCountT>::numSamples == 1)
> - {
> - mask[0] = _mm256_set_epi32(0, 0, 0, 0, 0, 0, 0, -1);
> - }
> - else if(MultisampleTraits<sampleCountT>::numSamples == 2)
> - {
> - mask[0] = _mm256_set_epi32(0, 0, 0, 0, 0, 0, -1, -1);
> - }
> - else if(MultisampleTraits<sampleCountT>::numSamples == 4)
> - {
> - mask[0] = _mm256_set_epi32(0, 0, 0, 0, -1, -1, -1, -1);
> - }
> - else if(MultisampleTraits<sampleCountT>::numSamples == 8)
> - {
> - mask[0] = _mm256_set1_epi32(-1);
> - }
> - else if(MultisampleTraits<sampleCountT>::numSamples == 16)
> - {
> - mask[0] = _mm256_set1_epi32(-1);
> - mask[1] = _mm256_set1_epi32(-1);
> - index1 = _mm256_set_epi32(15, 14, 13, 12, 11, 10, 9, 8);
> - }
> -
> - // gather coverage for samples 0-7
> - sampleCoverage[0] = _mm256_castps_si256(_simd_mask_i32gather_ps(_mm256_castsi256_ps(src), (const float*)coverageMask, index0, _mm256_castsi256_ps(mask[0]), 8));
> - if(MultisampleTraits<sampleCountT>::numSamples > 8)
> - {
> - // gather coverage for samples 8-15
> - sampleCoverage[1] = _mm256_castps_si256(_simd_mask_i32gather_ps(_mm256_castsi256_ps(src), (const float*)coverageMask, index1, _mm256_castsi256_ps(mask[1]), 8));
> - }
> - }
> - else
> - {
> - // center coverage is the same for all samples; just broadcast to the sample slots
> - uint32_t centerCoverage = ((uint32_t)(*coverageMask) & MASK);
> - if(MultisampleTraits<sampleCountT>::numSamples == 1)
> - {
> - sampleCoverage[0] = _mm256_set_epi32(0, 0, 0, 0, 0, 0, 0, centerCoverage);
> - }
> - else if(MultisampleTraits<sampleCountT>::numSamples == 2)
> - {
> - sampleCoverage[0] = _mm256_set_epi32(0, 0, 0, 0, 0, 0, centerCoverage, centerCoverage);
> - }
> - else if(MultisampleTraits<sampleCountT>::numSamples == 4)
> - {
> - sampleCoverage[0] = _mm256_set_epi32(0, 0, 0, 0, centerCoverage, centerCoverage, centerCoverage, centerCoverage);
> - }
> - else if(MultisampleTraits<sampleCountT>::numSamples == 8)
> - {
> - sampleCoverage[0] = _mm256_set1_epi32(centerCoverage);
> - }
> - else if(MultisampleTraits<sampleCountT>::numSamples == 16)
> - {
> - sampleCoverage[0] = _mm256_set1_epi32(centerCoverage);
> - sampleCoverage[1] = _mm256_set1_epi32(centerCoverage);
> - }
> - }
> -
> - mask[0] = _mm256_set_epi8(-1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, 0xC, 0x8, 0x4, 0x0,
> - -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, 0xC, 0x8, 0x4, 0x0);
> - // pull out the the 8bit 4x2 coverage for samples 0-7 into the lower 32 bits of each 128bit lane
> - __m256i packedCoverage0 = _simd_shuffle_epi8(sampleCoverage[0], mask[0]);
> -
> - __m256i packedCoverage1;
> - if(MultisampleTraits<sampleCountT>::numSamples > 8)
> - {
> - // pull out the the 8bit 4x2 coverage for samples 8-15 into the lower 32 bits of each 128bit lane
> - packedCoverage1 = _simd_shuffle_epi8(sampleCoverage[1], mask[0]);
> - }
> -
> -#if (KNOB_ARCH == KNOB_ARCH_AVX)
> - // pack lower 32 bits of each 128 bit lane into lower 64 bits of single 128 bit lane
> - __m256i hiToLow = _mm256_permute2f128_si256(packedCoverage0, packedCoverage0, 0x83);
> - __m256 shufRes = _mm256_shuffle_ps(_mm256_castsi256_ps(hiToLow), _mm256_castsi256_ps(hiToLow), _MM_SHUFFLE(1, 1, 0, 1));
> - packedCoverage0 = _mm256_castps_si256(_mm256_blend_ps(_mm256_castsi256_ps(packedCoverage0), shufRes, 0xFE));
> -
> - __m256i packedSampleCoverage;
> - if(MultisampleTraits<sampleCountT>::numSamples > 8)
> - {
> - // pack lower 32 bits of each 128 bit lane into upper 64 bits of single 128 bit lane
> - hiToLow = _mm256_permute2f128_si256(packedCoverage1, packedCoverage1, 0x83);
> - shufRes = _mm256_shuffle_ps(_mm256_castsi256_ps(hiToLow), _mm256_castsi256_ps(hiToLow), _MM_SHUFFLE(1, 1, 0, 1));
> - shufRes = _mm256_blend_ps(_mm256_castsi256_ps(packedCoverage1), shufRes, 0xFE);
> - packedCoverage1 = _mm256_castps_si256(_mm256_castpd_ps(_mm256_shuffle_pd(_mm256_castps_pd(shufRes), _mm256_castps_pd(shufRes), 0x01)));
> - packedSampleCoverage = _mm256_castps_si256(_mm256_blend_ps(_mm256_castsi256_ps(packedCoverage0), _mm256_castsi256_ps(packedCoverage1), 0xFC));
> - }
> - else
> - {
> - packedSampleCoverage = packedCoverage0;
> - }
> -#else
> - __m256i permMask = _mm256_set_epi32(0x7, 0x7, 0x7, 0x7, 0x7, 0x7, 0x4, 0x0);
> - // pack lower 32 bits of each 128 bit lane into lower 64 bits of single 128 bit lane
> - packedCoverage0 = _mm256_permutevar8x32_epi32(packedCoverage0, permMask);
> -
> - __m256i packedSampleCoverage;
> - if(MultisampleTraits<sampleCountT>::numSamples > 8)
> - {
> - permMask = _mm256_set_epi32(0x7, 0x7, 0x7, 0x7, 0x4, 0x0, 0x7, 0x7);
> - // pack lower 32 bits of each 128 bit lane into upper 64 bits of single 128 bit lane
> - packedCoverage1 = _mm256_permutevar8x32_epi32(packedCoverage1, permMask);
> -
> - // blend coverage masks for samples 0-7 and samples 8-15 into single 128 bit lane
> - packedSampleCoverage = _mm256_blend_epi32(packedCoverage0, packedCoverage1, 0x0C);
> - }
> - else
> - {
> - packedSampleCoverage = packedCoverage0;
> - }
> -#endif
> -
> - for(int32_t i = KNOB_SIMD_WIDTH - 1; i >= 0; i--)
> - {
> - // convert packed sample coverage masks into single coverage masks for all samples for each pixel in the 4x2
> - inputMask[i] = _simd_movemask_epi8(packedSampleCoverage);
> -
> - if(!bForcedSampleCount)
> - {
> - // input coverage has to be anded with sample mask if MSAA isn't forced on
> - inputMask[i] &= sampleMask;
> - }
> -
> - // shift to the next pixel in the 4x2
> - packedSampleCoverage = _simd_slli_epi32(packedSampleCoverage, 1);
> - }
> -}
> -
> -template<SWR_MULTISAMPLE_COUNT sampleCountT, bool bIsStandardPattern, bool bForcedSampleCount>
> -INLINE void generateInputCoverage(const uint64_t *const coverageMask, __m256 &inputCoverage, const uint32_t sampleMask)
> -{
> - uint32_t inputMask[KNOB_SIMD_WIDTH];
> - generateInputCoverage<sampleCountT, bIsStandardPattern, bForcedSampleCount>(coverageMask, inputMask, sampleMask);
> - inputCoverage = _simd_castsi_ps(_mm256_set_epi32(inputMask[7], inputMask[6], inputMask[5], inputMask[4], inputMask[3], inputMask[2], inputMask[1], inputMask[0]));
> -}
> -
> template<bool perspMask>
> INLINE void CalcPixelBarycentrics(const BarycentricCoeffs& coeffs, SWR_PS_CONTEXT &psContext)
> {
> @@ -766,6 +619,8 @@ void OutputMerger(SWR_PS_CONTEXT &psContext, uint8_t* (&pColorBase)[SWR_NUM_REND
> // type safety guaranteed from template instantiation in BEChooser<>::GetFunc
> static const SWR_MULTISAMPLE_COUNT sampleCount = (SWR_MULTISAMPLE_COUNT)sampleCountT;
> uint32_t rasterTileColorOffset = MultisampleTraits<sampleCount>::RasterTileColorOffset(sample);
> + simdvector blendOut;
> +
> for(uint32_t rt = 0; rt < NumRT; ++rt)
> {
> uint8_t *pColorSample;
> @@ -779,6 +634,9 @@ void OutputMerger(SWR_PS_CONTEXT &psContext, uint8_t* (&pColorBase)[SWR_NUM_REND
> }
>
> const SWR_RENDER_TARGET_BLEND_STATE *pRTBlend = &pBlendState->renderTarget[rt];
> + // pfnBlendFunc may not update all channels. Initialize with PS output.
> + /// TODO: move this into the blend JIT.
> + blendOut = psContext.shaded[rt];
>
> // Blend outputs and update coverage mask for alpha test
> if(pfnBlendFunc[rt] != nullptr)
> @@ -789,7 +647,7 @@ void OutputMerger(SWR_PS_CONTEXT &psContext, uint8_t* (&pColorBase)[SWR_NUM_REND
> psContext.shaded[1],
> sample,
> pColorSample,
> - psContext.shaded[rt],
> + blendOut,
> &psContext.oMask,
> (simdscalari*)&coverageMask);
> }
> @@ -805,19 +663,19 @@ void OutputMerger(SWR_PS_CONTEXT &psContext, uint8_t* (&pColorBase)[SWR_NUM_REND
> // store with color mask
> if(!pRTBlend->writeDisableRed)
> {
> - _simd_maskstore_ps((float*)pColorSample, outputMask, psContext.shaded[rt].x);
> + _simd_maskstore_ps((float*)pColorSample, outputMask, blendOut.x);
> }
> if(!pRTBlend->writeDisableGreen)
> {
> - _simd_maskstore_ps((float*)(pColorSample + simd), outputMask, psContext.shaded[rt].y);
> + _simd_maskstore_ps((float*)(pColorSample + simd), outputMask, blendOut.y);
> }
> if(!pRTBlend->writeDisableBlue)
> {
> - _simd_maskstore_ps((float*)(pColorSample + simd * 2), outputMask, psContext.shaded[rt].z);
> + _simd_maskstore_ps((float*)(pColorSample + simd * 2), outputMask, blendOut.z);
> }
> if(!pRTBlend->writeDisableAlpha)
> {
> - _simd_maskstore_ps((float*)(pColorSample + simd * 3), outputMask, psContext.shaded[rt].w);
> + _simd_maskstore_ps((float*)(pColorSample + simd * 3), outputMask, blendOut.w);
> }
> }
> }
> @@ -884,9 +742,9 @@ void BackendSingleSample(DRAW_CONTEXT *pDC, uint32_t workerId, uint32_t x, uint3
> for(uint32_t yy = y; yy < y + KNOB_TILE_Y_DIM; yy += SIMD_TILE_Y_DIM)
> {
> // UL pixel corner
> - psContext.vY.UL = _simd_add_ps(vQuadULOffsetsY, _simd_set1_ps((float)yy));
> + psContext.vY.UL = _simd_add_ps(vULOffsetsY, _simd_set1_ps((float)yy));
> // pixel center
> - psContext.vY.center = _simd_add_ps(vQuadCenterOffsetsY, _simd_set1_ps((float)yy));
> + psContext.vY.center = _simd_add_ps(vCenterOffsetsY, _simd_set1_ps((float)yy));
>
> for(uint32_t xx = x; xx < x + KNOB_TILE_X_DIM; xx += SIMD_TILE_X_DIM)
> {
> @@ -898,9 +756,9 @@ void BackendSingleSample(DRAW_CONTEXT *pDC, uint32_t workerId, uint32_t x, uint3
> if(coverageMask & MASK)
> {
> RDTSC_START(BEBarycentric);
> - psContext.vX.UL = _simd_add_ps(vQuadULOffsetsX, _simd_set1_ps((float)xx));
> + psContext.vX.UL = _simd_add_ps(vULOffsetsX, _simd_set1_ps((float)xx));
> // pixel center
> - psContext.vX.center = _simd_add_ps(vQuadCenterOffsetsX, _simd_set1_ps((float)xx));
> + psContext.vX.center = _simd_add_ps(vCenterOffsetsX, _simd_set1_ps((float)xx));
>
> backendFuncs.pfnCalcPixelBarycentrics(coeffs, psContext);
>
> @@ -1077,15 +935,15 @@ void BackendSampleRate(DRAW_CONTEXT *pDC, uint32_t workerId, uint32_t x, uint32_
> for (uint32_t yy = y; yy < y + KNOB_TILE_Y_DIM; yy += SIMD_TILE_Y_DIM)
> {
> // UL pixel corner
> - psContext.vY.UL = _simd_add_ps(vQuadULOffsetsY, _simd_set1_ps((float)yy));
> + psContext.vY.UL = _simd_add_ps(vULOffsetsY, _simd_set1_ps((float)yy));
> // pixel center
> - psContext.vY.center = _simd_add_ps(vQuadCenterOffsetsY, _simd_set1_ps((float)yy));
> + psContext.vY.center = _simd_add_ps(vCenterOffsetsY, _simd_set1_ps((float)yy));
>
> for (uint32_t xx = x; xx < x + KNOB_TILE_X_DIM; xx += SIMD_TILE_X_DIM)
> {
> - psContext.vX.UL = _simd_add_ps(vQuadULOffsetsX, _simd_set1_ps((float)xx));
> + psContext.vX.UL = _simd_add_ps(vULOffsetsX, _simd_set1_ps((float)xx));
> // pixel center
> - psContext.vX.center = _simd_add_ps(vQuadCenterOffsetsX, _simd_set1_ps((float)xx));
> + psContext.vX.center = _simd_add_ps(vCenterOffsetsX, _simd_set1_ps((float)xx));
>
> RDTSC_START(BEBarycentric);
> backendFuncs.pfnCalcPixelBarycentrics(coeffs, psContext);
> @@ -1313,14 +1171,14 @@ void BackendPixelRate(DRAW_CONTEXT *pDC, uint32_t workerId, uint32_t x, uint32_t
>
> for(uint32_t yy = y; yy < y + KNOB_TILE_Y_DIM; yy += SIMD_TILE_Y_DIM)
> {
> - psContext.vY.UL = _simd_add_ps(vQuadULOffsetsY, _simd_set1_ps((float)yy));
> - psContext.vY.center = _simd_add_ps(vQuadCenterOffsetsY, _simd_set1_ps((float)yy));
> + psContext.vY.UL = _simd_add_ps(vULOffsetsY, _simd_set1_ps((float)yy));
> + psContext.vY.center = _simd_add_ps(vCenterOffsetsY, _simd_set1_ps((float)yy));
> for(uint32_t xx = x; xx < x + KNOB_TILE_X_DIM; xx += SIMD_TILE_X_DIM)
> {
> - simdscalar vZ[MultisampleTraits<sampleCount>::numSamples];
> - psContext.vX.UL = _simd_add_ps(vQuadULOffsetsX, _simd_set1_ps((float)xx));
> + simdscalar vZ[MultisampleTraits<sampleCount>::numSamples]{ 0 };
> + psContext.vX.UL = _simd_add_ps(vULOffsetsX, _simd_set1_ps((float)xx));
> // set pixel center positions
> - psContext.vX.center = _simd_add_ps(vQuadCenterOffsetsX, _simd_set1_ps((float)xx));
> + psContext.vX.center = _simd_add_ps(vCenterOffsetsX, _simd_set1_ps((float)xx));
>
> if (bInputCoverage)
> {
> @@ -1353,7 +1211,7 @@ void BackendPixelRate(DRAW_CONTEXT *pDC, uint32_t workerId, uint32_t x, uint32_t
> }
> else
> {
> - psContext.activeMask = _simd_set1_epi32(-1);
> + psContext.activeMask = _simd_set1_epi32(-1);
> }
>
> // need to declare enough space for all samples
> @@ -1555,6 +1413,7 @@ void BackendNullPS(DRAW_CONTEXT *pDC, uint32_t workerId, uint32_t x, uint32_t y,
> SWR_CONTEXT *pContext = pDC->pContext;
> const API_STATE& state = GetApiState(pDC);
> const BACKEND_FUNCS& backendFuncs = pDC->pState->backendFuncs;
> + const SWR_RASTSTATE& rastState = pDC->pState->state.rastState;
>
> // broadcast scalars
> BarycentricCoeffs coeffs;
> @@ -1572,7 +1431,7 @@ void BackendNullPS(DRAW_CONTEXT *pDC, uint32_t workerId, uint32_t x, uint32_t y,
>
> coeffs.vRecipDet = _simd_broadcast_ss(&work.recipDet);
>
> - BYTE *pDepthBase = renderBuffers.pDepth, *pStencilBase = renderBuffers.pStencil;
> + uint8_t *pDepthBase = renderBuffers.pDepth, *pStencilBase = renderBuffers.pStencil;
>
> RDTSC_STOP(BESetup, 0, 0);
>
> @@ -1580,12 +1439,12 @@ void BackendNullPS(DRAW_CONTEXT *pDC, uint32_t workerId, uint32_t x, uint32_t y,
> for (uint32_t yy = y; yy < y + KNOB_TILE_Y_DIM; yy += SIMD_TILE_Y_DIM)
> {
> // UL pixel corner
> - simdscalar vYSamplePosUL = _simd_add_ps(vQuadULOffsetsY, _simd_set1_ps((float)yy));
> + simdscalar vYSamplePosUL = _simd_add_ps(vULOffsetsY, _simd_set1_ps((float)yy));
>
> for (uint32_t xx = x; xx < x + KNOB_TILE_X_DIM; xx += SIMD_TILE_X_DIM)
> {
> // UL pixel corners
> - simdscalar vXSamplePosUL = _simd_add_ps(vQuadULOffsetsX, _simd_set1_ps((float)xx));
> + simdscalar vXSamplePosUL = _simd_add_ps(vULOffsetsX, _simd_set1_ps((float)xx));
>
> // iterate over active samples
> unsigned long sample = 0;
> @@ -1593,7 +1452,8 @@ void BackendNullPS(DRAW_CONTEXT *pDC, uint32_t workerId, uint32_t x, uint32_t y,
> while (_BitScanForward(&sample, sampleMask))
> {
> sampleMask &= ~(1 << sample);
> - if (work.coverageMask[sample] & MASK)
> + simdmask coverageMask = work.coverageMask[sample] & MASK;
> + if (coverageMask)
> {
> RDTSC_START(BEBarycentric);
> // calculate per sample positions
> @@ -1607,7 +1467,14 @@ void BackendNullPS(DRAW_CONTEXT *pDC, uint32_t workerId, uint32_t x, uint32_t y,
>
> RDTSC_STOP(BEBarycentric, 0, 0);
>
> - simdscalar vCoverageMask = vMask(work.coverageMask[sample] & MASK);
> + // interpolate user clip distance if available
> + if (rastState.clipDistanceMask)
> + {
> + coverageMask &= ~ComputeUserClipMask(rastState.clipDistanceMask, work.pUserClipBuffer,
> + psContext.vI.sample, psContext.vJ.sample);
> + }
> +
> + simdscalar vCoverageMask = vMask(coverageMask);
> simdscalar stencilPassMask = vCoverageMask;
>
> // offset depth/stencil buffers current sample
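The user clip path above folds plane culling into the pixel coverage mask
before depth/stencil runs. A scalar model of what ComputeUserClipMask
contributes (illustrative only; interpDistance is a hypothetical stand-in
for evaluating the stored distances at the sample):

    // One bit per pixel of the 4x2: set when any enabled clip plane has a
    // negative interpolated distance there, i.e. the pixel is clipped away.
    uint8_t clippedMask = 0;
    for (uint32_t plane = 0; plane < 8; ++plane)
    {
        if (rastState.clipDistanceMask & (1 << plane))
        {
            for (uint32_t pixel = 0; pixel < 8; ++pixel)
            {
                if (interpDistance(plane, pixel) < 0.0f) // hypothetical helper
                    clippedMask |= (1u << pixel);
            }
        }
    }
    coverageMask &= ~clippedMask;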
> diff --git a/src/gallium/drivers/swr/rasterizer/core/backend.h b/src/gallium/drivers/swr/rasterizer/core/backend.h
> index 53089e5..2fa1895 100644
> --- a/src/gallium/drivers/swr/rasterizer/core/backend.h
> +++ b/src/gallium/drivers/swr/rasterizer/core/backend.h
> @@ -29,16 +29,20 @@
> #pragma once
>
> #include "common/os.h"
> -#include "core/context.h"
> +#include "core/context.h"
> +#include "core/multisample.h"
>
> void ProcessComputeBE(DRAW_CONTEXT* pDC, uint32_t workerId, uint32_t threadGroupId);
> void ProcessSyncBE(DRAW_CONTEXT *pDC, uint32_t workerId, uint32_t macroTile, void *pUserData);
> void ProcessQueryStatsBE(DRAW_CONTEXT *pDC, uint32_t workerId, uint32_t macroTile, void *pUserData);
> void ProcessClearBE(DRAW_CONTEXT *pDC, uint32_t workerId, uint32_t macroTile, void *pUserData);
> void ProcessStoreTileBE(DRAW_CONTEXT *pDC, uint32_t workerId, uint32_t macroTile, void *pData);
> -void ProcessInvalidateTilesBE(DRAW_CONTEXT *pDC, uint32_t workerId, uint32_t macroTile, void *pData);
> +void ProcessDiscardInvalidateTilesBE(DRAW_CONTEXT *pDC, uint32_t workerId, uint32_t macroTile, void *pData);
> void BackendNullPS(DRAW_CONTEXT *pDC, uint32_t workerId, uint32_t x, uint32_t y, SWR_TRIANGLE_DESC &work, RenderOutputBuffers &renderBuffers);
> void InitClearTilesTable();
> +simdmask ComputeUserClipMask(uint8_t clipMask, float* pUserClipBuffer, simdscalar vI, simdscalar vJ);
> +void InitBackendFuncTables();
> +void InitCPSFuncTables();
>
> enum SWR_BACKEND_FUNCS
> {
> @@ -47,13 +51,160 @@ enum SWR_BACKEND_FUNCS
> SWR_BACKEND_MSAA_SAMPLE_RATE,
> SWR_BACKEND_FUNCS_MAX,
> };
> -void InitBackendFuncTables();
>
> -extern PFN_BACKEND_FUNC gBackendNullPs[SWR_MULTISAMPLE_TYPE_MAX];
> -extern PFN_BACKEND_FUNC gBackendSingleSample[2][2];
> -extern PFN_BACKEND_FUNC gBackendPixelRateTable[SWR_MULTISAMPLE_TYPE_MAX][SWR_MSAA_SAMPLE_PATTERN_MAX][SWR_INPUT_COVERAGE_MAX][2][2];
> -extern PFN_BACKEND_FUNC gBackendSampleRateTable[SWR_MULTISAMPLE_TYPE_MAX][SWR_INPUT_COVERAGE_MAX][2];
> -extern PFN_OUTPUT_MERGER gBackendOutputMergerTable[SWR_NUM_RENDERTARGETS+1][SWR_MULTISAMPLE_TYPE_MAX];
> -extern PFN_CALC_PIXEL_BARYCENTRICS gPixelBarycentricTable[2];
> -extern PFN_CALC_SAMPLE_BARYCENTRICS gSampleBarycentricTable[2];
> -extern PFN_CALC_CENTROID_BARYCENTRICS gCentroidBarycentricTable[SWR_MULTISAMPLE_TYPE_MAX][2][2][2];
> +#if KNOB_SIMD_WIDTH == 8
> +extern const __m256 vCenterOffsetsX;
> +extern const __m256 vCenterOffsetsY;
> +extern const __m256 vULOffsetsX;
> +extern const __m256 vULOffsetsY;
> +#define MASK 0xff
> +#endif
> +
> +template<SWR_MULTISAMPLE_COUNT sampleCountT, bool bIsStandardPattern, bool bForcedSampleCount>
> +INLINE void generateInputCoverage(const uint64_t *const coverageMask, uint32_t (&inputMask)[KNOB_SIMD_WIDTH], const uint32_t sampleMask)
> +{
> +
> + // will need to update for avx512
> + assert(KNOB_SIMD_WIDTH == 8);
> +
> + __m256i mask[2];
> + __m256i sampleCoverage[2];
> + if(bIsStandardPattern)
> + {
> + __m256i src = _mm256_set1_epi32(0);
> + __m256i index0 = _mm256_set_epi32(7, 6, 5, 4, 3, 2, 1, 0), index1;
> +
> + if(MultisampleTraits<sampleCountT>::numSamples == 1)
> + {
> + mask[0] = _mm256_set_epi32(0, 0, 0, 0, 0, 0, 0, -1);
> + }
> + else if(MultisampleTraits<sampleCountT>::numSamples == 2)
> + {
> + mask[0] = _mm256_set_epi32(0, 0, 0, 0, 0, 0, -1, -1);
> + }
> + else if(MultisampleTraits<sampleCountT>::numSamples == 4)
> + {
> + mask[0] = _mm256_set_epi32(0, 0, 0, 0, -1, -1, -1, -1);
> + }
> + else if(MultisampleTraits<sampleCountT>::numSamples == 8)
> + {
> + mask[0] = _mm256_set1_epi32(-1);
> + }
> + else if(MultisampleTraits<sampleCountT>::numSamples == 16)
> + {
> + mask[0] = _mm256_set1_epi32(-1);
> + mask[1] = _mm256_set1_epi32(-1);
> + index1 = _mm256_set_epi32(15, 14, 13, 12, 11, 10, 9, 8);
> + }
> +
> + // gather coverage for samples 0-7
> + sampleCoverage[0] = _mm256_castps_si256(_simd_mask_i32gather_ps(_mm256_castsi256_ps(src), (const float*)coverageMask, index0, _mm256_castsi256_ps(mask[0]), 8));
> + if(MultisampleTraits<sampleCountT>::numSamples > 8)
> + {
> + // gather coverage for samples 8-15
> + sampleCoverage[1] = _mm256_castps_si256(_simd_mask_i32gather_ps(_mm256_castsi256_ps(src), (const float*)coverageMask, index1, _mm256_castsi256_ps(mask[1]), 8));
> + }
> + }
> + else
> + {
> + // center coverage is the same for all samples; just broadcast to the sample slots
> + uint32_t centerCoverage = ((uint32_t)(*coverageMask) & MASK);
> + if(MultisampleTraits<sampleCountT>::numSamples == 1)
> + {
> + sampleCoverage[0] = _mm256_set_epi32(0, 0, 0, 0, 0, 0, 0, centerCoverage);
> + }
> + else if(MultisampleTraits<sampleCountT>::numSamples == 2)
> + {
> + sampleCoverage[0] = _mm256_set_epi32(0, 0, 0, 0, 0, 0, centerCoverage, centerCoverage);
> + }
> + else if(MultisampleTraits<sampleCountT>::numSamples == 4)
> + {
> + sampleCoverage[0] = _mm256_set_epi32(0, 0, 0, 0, centerCoverage, centerCoverage, centerCoverage, centerCoverage);
> + }
> + else if(MultisampleTraits<sampleCountT>::numSamples == 8)
> + {
> + sampleCoverage[0] = _mm256_set1_epi32(centerCoverage);
> + }
> + else if(MultisampleTraits<sampleCountT>::numSamples == 16)
> + {
> + sampleCoverage[0] = _mm256_set1_epi32(centerCoverage);
> + sampleCoverage[1] = _mm256_set1_epi32(centerCoverage);
> + }
> + }
> +
> + mask[0] = _mm256_set_epi8(-1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, 0xC, 0x8, 0x4, 0x0,
> + -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, 0xC, 0x8, 0x4, 0x0);
> +    // pull out the 8-bit 4x2 coverage for samples 0-7 into the lower 32 bits of each 128-bit lane
> + __m256i packedCoverage0 = _simd_shuffle_epi8(sampleCoverage[0], mask[0]);
> +
> + __m256i packedCoverage1;
> + if(MultisampleTraits<sampleCountT>::numSamples > 8)
> + {
> +        // pull out the 8-bit 4x2 coverage for samples 8-15 into the lower 32 bits of each 128-bit lane
> + packedCoverage1 = _simd_shuffle_epi8(sampleCoverage[1], mask[0]);
> + }
> +
> +#if (KNOB_ARCH == KNOB_ARCH_AVX)
> + // pack lower 32 bits of each 128 bit lane into lower 64 bits of single 128 bit lane
> + __m256i hiToLow = _mm256_permute2f128_si256(packedCoverage0, packedCoverage0, 0x83);
> + __m256 shufRes = _mm256_shuffle_ps(_mm256_castsi256_ps(hiToLow), _mm256_castsi256_ps(hiToLow), _MM_SHUFFLE(1, 1, 0, 1));
> + packedCoverage0 = _mm256_castps_si256(_mm256_blend_ps(_mm256_castsi256_ps(packedCoverage0), shufRes, 0xFE));
> +
> + __m256i packedSampleCoverage;
> + if(MultisampleTraits<sampleCountT>::numSamples > 8)
> + {
> + // pack lower 32 bits of each 128 bit lane into upper 64 bits of single 128 bit lane
> + hiToLow = _mm256_permute2f128_si256(packedCoverage1, packedCoverage1, 0x83);
> + shufRes = _mm256_shuffle_ps(_mm256_castsi256_ps(hiToLow), _mm256_castsi256_ps(hiToLow), _MM_SHUFFLE(1, 1, 0, 1));
> + shufRes = _mm256_blend_ps(_mm256_castsi256_ps(packedCoverage1), shufRes, 0xFE);
> + packedCoverage1 = _mm256_castps_si256(_mm256_castpd_ps(_mm256_shuffle_pd(_mm256_castps_pd(shufRes), _mm256_castps_pd(shufRes), 0x01)));
> + packedSampleCoverage = _mm256_castps_si256(_mm256_blend_ps(_mm256_castsi256_ps(packedCoverage0), _mm256_castsi256_ps(packedCoverage1), 0xFC));
> + }
> + else
> + {
> + packedSampleCoverage = packedCoverage0;
> + }
> +#else
> + __m256i permMask = _mm256_set_epi32(0x7, 0x7, 0x7, 0x7, 0x7, 0x7, 0x4, 0x0);
> + // pack lower 32 bits of each 128 bit lane into lower 64 bits of single 128 bit lane
> + packedCoverage0 = _mm256_permutevar8x32_epi32(packedCoverage0, permMask);
> +
> + __m256i packedSampleCoverage;
> + if(MultisampleTraits<sampleCountT>::numSamples > 8)
> + {
> + permMask = _mm256_set_epi32(0x7, 0x7, 0x7, 0x7, 0x4, 0x0, 0x7, 0x7);
> + // pack lower 32 bits of each 128 bit lane into upper 64 bits of single 128 bit lane
> + packedCoverage1 = _mm256_permutevar8x32_epi32(packedCoverage1, permMask);
> +
> + // blend coverage masks for samples 0-7 and samples 8-15 into single 128 bit lane
> + packedSampleCoverage = _mm256_blend_epi32(packedCoverage0, packedCoverage1, 0x0C);
> + }
> + else
> + {
> + packedSampleCoverage = packedCoverage0;
> + }
> +#endif
> +
> + for(int32_t i = KNOB_SIMD_WIDTH - 1; i >= 0; i--)
> + {
> + // convert packed sample coverage masks into single coverage masks for all samples for each pixel in the 4x2
> + inputMask[i] = _simd_movemask_epi8(packedSampleCoverage);
> +
> + if(!bForcedSampleCount)
> + {
> + // input coverage has to be anded with sample mask if MSAA isn't forced on
> + inputMask[i] &= sampleMask;
> + }
> +
> + // shift to the next pixel in the 4x2
> + packedSampleCoverage = _simd_slli_epi32(packedSampleCoverage, 1);
> + }
> +}
> +
> +template<SWR_MULTISAMPLE_COUNT sampleCountT, bool bIsStandardPattern, bool bForcedSampleCount>
> +INLINE void generateInputCoverage(const uint64_t *const coverageMask, __m256 &inputCoverage, const uint32_t sampleMask)
> +{
> + uint32_t inputMask[KNOB_SIMD_WIDTH];
> + generateInputCoverage<sampleCountT, bIsStandardPattern, bForcedSampleCount>(coverageMask, inputMask, sampleMask);
> + inputCoverage = _simd_castsi_ps(_mm256_set_epi32(inputMask[7], inputMask[6], inputMask[5], inputMask[4], inputMask[3], inputMask[2], inputMask[1], inputMask[0]));
> +}
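For review purposes, a scalar model of the result this gather/shuffle/movemask
sequence is after (illustrative only, ignoring the exact lane ordering):
coverageMask[s] carries one bit per pixel of the 4x2 (hence MASK == 0xff), and
the output is one mask per pixel with one bit per covered sample.

    #include <cstdint>

    void generateInputCoverageScalar(const uint64_t* coverageMask,
                                     uint32_t (&inputMask)[8],
                                     uint32_t numSamples, uint32_t sampleMask,
                                     bool bForcedSampleCount)
    {
        for (uint32_t pixel = 0; pixel < 8; ++pixel)
        {
            inputMask[pixel] = 0;
            for (uint32_t s = 0; s < numSamples; ++s)
            {
                if (coverageMask[s] & (1ull << pixel))
                    inputMask[pixel] |= (1u << s);
            }
            if (!bForcedSampleCount)
                inputMask[pixel] &= sampleMask; // api sample mask applies
        }
    }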
> diff --git a/src/gallium/drivers/swr/rasterizer/core/clip.cpp b/src/gallium/drivers/swr/rasterizer/core/clip.cpp
> index ce27bf7..3a2a8b3 100644
> --- a/src/gallium/drivers/swr/rasterizer/core/clip.cpp
> +++ b/src/gallium/drivers/swr/rasterizer/core/clip.cpp
> @@ -31,6 +31,9 @@
> #include "common/os.h"
> #include "core/clip.h"
>
> +// Temp storage used by the clipper
> +THREAD simdvertex tlsTempVertices[7];
> +
> float ComputeInterpFactor(float boundaryCoord0, float boundaryCoord1)
> {
> return (boundaryCoord0 / (boundaryCoord0 - boundaryCoord1));
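Small worked example, since the sign convention matters here: with boundary
distances d0 = 1.0 (inside) and d1 = -3.0 (outside), the edge crosses the
plane at t = 1.0 / (1.0 - (-3.0)) = 0.25, and the clipper feeds that t to the
fmadd interpolation (v0 + t * (v1 - v0)) for attributes and, with this patch,
for the user clip distances as well.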
> diff --git a/src/gallium/drivers/swr/rasterizer/core/clip.h b/src/gallium/drivers/swr/rasterizer/core/clip.h
> index 49494a4..ba5870a 100644
> --- a/src/gallium/drivers/swr/rasterizer/core/clip.h
> +++ b/src/gallium/drivers/swr/rasterizer/core/clip.h
> @@ -32,6 +32,9 @@
> #include "core/pa.h"
> #include "rdtsc_core.h"
>
> +// Temp storage used by the clipper
> +extern THREAD simdvertex tlsTempVertices[7];
> +
> enum SWR_CLIPCODES
> {
> // Shift clip codes out of the mantissa to prevent denormalized values when used in float compare.
> @@ -354,6 +357,25 @@ public:
> }
> }
>
> + // assemble user clip distances if enabled
> + if (this->state.rastState.clipDistanceMask & 0xf)
> + {
> + pa.Assemble(VERTEX_CLIPCULL_DIST_LO_SLOT, tmpVector);
> + for (uint32_t i = 0; i < NumVertsPerPrim; ++i)
> + {
> + vertices[i].attrib[VERTEX_CLIPCULL_DIST_LO_SLOT] = tmpVector[i];
> + }
> + }
> +
> + if (this->state.rastState.clipDistanceMask & 0xf0)
> + {
> + pa.Assemble(VERTEX_CLIPCULL_DIST_HI_SLOT, tmpVector);
> + for (uint32_t i = 0; i < NumVertsPerPrim; ++i)
> + {
> + vertices[i].attrib[VERTEX_CLIPCULL_DIST_HI_SLOT] = tmpVector[i];
> + }
> + }
> +
> uint32_t numAttribs = maxSlot + 1;
>
> simdscalari vNumClippedVerts = ClipPrims((float*)&vertices[0], vPrimMask, vClipMask, numAttribs);
> @@ -436,6 +458,27 @@ public:
> }
> }
>
> + // transpose user clip distances if enabled
> + if (this->state.rastState.clipDistanceMask & 0xf)
> + {
> + pBase = (uint8_t*)(&vertices[0].attrib[VERTEX_CLIPCULL_DIST_LO_SLOT]) + sizeof(float) * inputPrim;
> + for (uint32_t c = 0; c < 4; ++c)
> + {
> + transposedPrims[0].attrib[VERTEX_CLIPCULL_DIST_LO_SLOT][c] = _simd_mask_i32gather_ps(_mm256_undefined_ps(), (const float*)pBase, vOffsets, vMask, 1);
> + pBase += sizeof(simdscalar);
> + }
> + }
> +
> + if (this->state.rastState.clipDistanceMask & 0xf0)
> + {
> + pBase = (uint8_t*)(&vertices[0].attrib[VERTEX_CLIPCULL_DIST_HI_SLOT]) + sizeof(float) * inputPrim;
> + for (uint32_t c = 0; c < 4; ++c)
> + {
> + transposedPrims[0].attrib[VERTEX_CLIPCULL_DIST_HI_SLOT][c] = _simd_mask_i32gather_ps(_mm256_undefined_ps(), (const float*)pBase, vOffsets, vMask, 1);
> + pBase += sizeof(simdscalar);
> + }
> + }
> +
> PA_STATE_OPT clipPa(this->pDC, numEmittedPrims, (uint8_t*)&transposedPrims[0], numEmittedVerts, true, clipTopology);
>
> while (clipPa.GetNextStreamOutput())
> @@ -630,6 +673,31 @@ private:
> ScatterComponent(pOutVerts, attribSlot, vActiveMask, outIndex, c, vOutAttrib);
> }
> }
> +
> + // interpolate clip distance if enabled
> + if (this->state.rastState.clipDistanceMask & 0xf)
> + {
> + uint32_t attribSlot = VERTEX_CLIPCULL_DIST_LO_SLOT;
> + for (uint32_t c = 0; c < 4; ++c)
> + {
> + simdscalar vAttrib0 = GatherComponent(pInVerts, attribSlot, vActiveMask, s, c);
> + simdscalar vAttrib1 = GatherComponent(pInVerts, attribSlot, vActiveMask, p, c);
> + simdscalar vOutAttrib = _simd_fmadd_ps(_simd_sub_ps(vAttrib1, vAttrib0), t, vAttrib0);
> + ScatterComponent(pOutVerts, attribSlot, vActiveMask, outIndex, c, vOutAttrib);
> + }
> + }
> +
> + if (this->state.rastState.clipDistanceMask & 0xf0)
> + {
> + uint32_t attribSlot = VERTEX_CLIPCULL_DIST_HI_SLOT;
> + for (uint32_t c = 0; c < 4; ++c)
> + {
> + simdscalar vAttrib0 = GatherComponent(pInVerts, attribSlot, vActiveMask, s, c);
> + simdscalar vAttrib1 = GatherComponent(pInVerts, attribSlot, vActiveMask, p, c);
> + simdscalar vOutAttrib = _simd_fmadd_ps(_simd_sub_ps(vAttrib1, vAttrib0), t, vAttrib0);
> + ScatterComponent(pOutVerts, attribSlot, vActiveMask, outIndex, c, vOutAttrib);
> + }
> + }
> }
>
> template<SWR_CLIPCODES ClippingPlane>
> @@ -700,6 +768,27 @@ private:
> }
> }
>
> + // store clip distance if enabled
> + if (this->state.rastState.clipDistanceMask & 0xf)
> + {
> + uint32_t attribSlot = VERTEX_CLIPCULL_DIST_LO_SLOT;
> + for (uint32_t c = 0; c < 4; ++c)
> + {
> + simdscalar vAttrib = GatherComponent(pInVerts, attribSlot, s_in, s, c);
> + ScatterComponent(pOutVerts, attribSlot, s_in, vOutIndex, c, vAttrib);
> + }
> + }
> +
> + if (this->state.rastState.clipDistanceMask & 0xf0)
> + {
> + uint32_t attribSlot = VERTEX_CLIPCULL_DIST_HI_SLOT;
> + for (uint32_t c = 0; c < 4; ++c)
> + {
> + simdscalar vAttrib = GatherComponent(pInVerts, attribSlot, s_in, s, c);
> + ScatterComponent(pOutVerts, attribSlot, s_in, vOutIndex, c, vAttrib);
> + }
> + }
> +
> // increment outIndex
> vOutIndex = _simd_blendv_epi32(vOutIndex, _simd_add_epi32(vOutIndex, _simd_set1_epi32(1)), s_in);
> }
> @@ -818,8 +907,7 @@ private:
> simdscalari ClipPrims(float* pVertices, const simdscalar& vPrimMask, const simdscalar& vClipMask, int numAttribs)
> {
> // temp storage
> - simdvertex tempVertices[7];
> - float* pTempVerts = (float*)&tempVertices[0];
> + float* pTempVerts = (float*)&tlsTempVertices[0];
>
> // zero out num input verts for non-active lanes
> simdscalari vNumInPts = _simd_set1_epi32(NumVertsPerPrim);
> @@ -854,9 +942,9 @@ private:
> return vNumOutPts;
> }
>
> - const uint32_t workerId;
> - const DRIVER_TYPE driverType;
> - DRAW_CONTEXT* pDC;
> + const uint32_t workerId{ 0 };
> + const DRIVER_TYPE driverType{ DX };
> + DRAW_CONTEXT* pDC{ nullptr };
> const API_STATE& state;
> simdscalar clipCodes[NumVertsPerPrim];
> };
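The 0xf / 0xf0 handling is duplicated four times across this file; if this
ever gets unsquashed and respun, a small table could cut that down. Sketch
only, the names are mine:

    // Low 4 bits of clipDistanceMask -> distances 0..3 in the LO slot,
    // high 4 bits -> distances 4..7 in the HI slot.
    struct ClipDistSlot { uint32_t mask; uint32_t slot; };
    static const ClipDistSlot kClipDistSlots[] = {
        { 0x0f, VERTEX_CLIPCULL_DIST_LO_SLOT },
        { 0xf0, VERTEX_CLIPCULL_DIST_HI_SLOT },
    };
    // for (const auto &cds : kClipDistSlots)
    //     if (state.rastState.clipDistanceMask & cds.mask) { /* cds.slot */ }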
> diff --git a/src/gallium/drivers/swr/rasterizer/core/context.h b/src/gallium/drivers/swr/rasterizer/core/context.h
> index 4a214af..b8f15ca 100644
> --- a/src/gallium/drivers/swr/rasterizer/core/context.h
> +++ b/src/gallium/drivers/swr/rasterizer/core/context.h
> @@ -41,6 +41,7 @@
> #include "core/knobs.h"
> #include "common/simdintrin.h"
> #include "core/threads.h"
> +#include "ringbuffer.h"
>
> // x.8 fixed point precision values
> #define FIXED_POINT_SHIFT 8
> @@ -82,6 +83,7 @@ struct SWR_TRIANGLE_DESC
> float *pUserClipBuffer;
>
> uint64_t coverageMask[SWR_MAX_NUM_MULTISAMPLES];
> + uint64_t anyCoveredSamples;
>
> TRI_FLAGS triFlags;
> };
> @@ -109,12 +111,16 @@ struct CLEAR_DESC
> CLEAR_FLAGS flags;
> float clearRTColor[4]; // RGBA_32F
> float clearDepth; // [0..1]
> - BYTE clearStencil;
> + uint8_t clearStencil;
> };
>
> -struct INVALIDATE_TILES_DESC
> +struct DISCARD_INVALIDATE_TILES_DESC
> {
> uint32_t attachmentMask;
> + SWR_RECT rect;
> + SWR_TILE_STATE newTileState;
> + bool createNewTiles;
> + bool fullTilesOnly;
> };
>
> struct SYNC_DESC
> @@ -150,7 +156,7 @@ enum WORK_TYPE
> SYNC,
> DRAW,
> CLEAR,
> - INVALIDATETILES,
> + DISCARDINVALIDATETILES,
> STORETILES,
> QUERYSTATS,
> };
> @@ -164,7 +170,7 @@ struct BE_WORK
> SYNC_DESC sync;
> TRIANGLE_WORK_DESC tri;
> CLEAR_DESC clear;
> - INVALIDATE_TILES_DESC invalidateTiles;
> + DISCARD_INVALIDATE_TILES_DESC discardInvalidateTiles;
> STORE_TILES_DESC storeTiles;
> QUERY_DESC queryStats;
> } desc;
> @@ -201,7 +207,7 @@ struct FE_WORK
> SYNC_DESC sync;
> DRAW_WORK draw;
> CLEAR_DESC clear;
> - INVALIDATE_TILES_DESC invalidateTiles;
> + DISCARD_INVALIDATE_TILES_DESC discardInvalidateTiles;
> STORE_TILES_DESC storeTiles;
> QUERY_DESC queryStats;
> } desc;
> @@ -354,6 +360,7 @@ struct BACKEND_FUNCS
> PFN_OUTPUT_MERGER pfnOutputMerger;
> };
>
> +
> // Draw State
> struct DRAW_STATE
> {
> @@ -365,7 +372,7 @@ struct DRAW_STATE
> BACKEND_FUNCS backendFuncs;
> PFN_PROCESS_PRIMS pfnProcessPrims;
>
> - Arena* pArena; // This should only be used by API thread.
> + CachingArena* pArena; // This should only be used by API thread.
> };
>
> // Draw Context
> @@ -381,23 +388,18 @@ struct DRAW_CONTEXT
>
> FE_WORK FeWork;
> volatile OSALIGNLINE(uint32_t) FeLock;
> - volatile OSALIGNLINE(bool) inUse;
> volatile OSALIGNLINE(bool) doneFE; // Is FE work done for this draw?
> -
> - // Have all worker threads moved past draw in DC ring?
> - volatile OSALIGNLINE(uint32_t) threadsDoneFE;
> - volatile OSALIGNLINE(uint32_t) threadsDoneBE;
> + volatile OSALIGNLINE(int64_t) threadsDone;
>
> uint64_t dependency;
>
> MacroTileMgr* pTileMgr;
>
> // The following fields are valid if isCompute is true.
> - volatile OSALIGNLINE(bool) doneCompute; // Is this dispatch done? (isCompute)
> DispatchQueue* pDispatch; // Queue for thread groups. (isCompute)
>
> DRAW_STATE* pState;
> - Arena* pArena;
> + CachingArena* pArena;
>
> uint8_t* pSpillFill[KNOB_MAX_NUM_THREADS]; // Scratch space used for spill fills.
> };
> @@ -438,7 +440,7 @@ struct SWR_CONTEXT
> // 3. State - When an applications sets state after draw
> // a. Same as step 1.
> // b. State is copied from prev draw context to current.
> - DRAW_CONTEXT* dcRing;
> + RingBuffer<DRAW_CONTEXT> dcRing;
>
> DRAW_CONTEXT *pCurDrawContext; // This points to DC entry in ring for an unsubmitted draw.
> DRAW_CONTEXT *pPrevDrawContext; // This points to DC entry for the previous context submitted that we can copy state from.
> @@ -448,14 +450,10 @@ struct SWR_CONTEXT
> // These split draws all have identical state. So instead of storing the state directly
> // in the Draw Context (DC) we instead store it in a Draw State (DS). This allows multiple DCs
> // to reference a single entry in the DS ring.
> - DRAW_STATE* dsRing;
> + RingBuffer<DRAW_STATE> dsRing;
>
> uint32_t curStateId; // Current index to the next available entry in the DS ring.
>
> - DRAW_STATE* subCtxSave; // Save area for inactive contexts.
> - uint32_t curSubCtxId; // Current index for active state subcontext.
> - uint32_t numSubContexts; // Number of available subcontexts
> -
> uint32_t NumWorkerThreads;
>
> THREAD_POOL threadPool; // Thread pool associated with this context
> @@ -463,13 +461,6 @@ struct SWR_CONTEXT
> std::condition_variable FifosNotEmpty;
> std::mutex WaitLock;
>
> - // Draw Contexts will get a unique drawId generated from this
> - uint64_t nextDrawId;
> -
> - // most recent draw id enqueued by the API thread
> - // written by api thread, read by multiple workers
> - OSALIGNLINE(volatile uint64_t) DrawEnqueued;
> -
> DRIVER_TYPE driverType;
>
> uint32_t privateStateSize;
> @@ -486,6 +477,8 @@ struct SWR_CONTEXT
>
> // Scratch space for workers.
> uint8_t* pScratch[KNOB_MAX_NUM_THREADS];
> +
> + CachingAllocator cachingArenaAllocator;
> };
>
> void WaitForDependencies(SWR_CONTEXT *pContext, uint64_t drawId);
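To make the new lifecycle concrete, a sketch of how split draws share state
under the two rings (inferred from the comments above; the indexing is my
assumption, operator[] takes a pre-wrapped index):

    // Many DCs may reference one DS entry; state is only copied on change.
    DRAW_STATE&   ds = pContext->dsRing[pContext->curStateId];
    DRAW_CONTEXT& dc = pContext->dcRing[dcIndex]; // caller-managed slot
    dc.pState = &ds;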
> diff --git a/src/gallium/drivers/swr/rasterizer/core/depthstencil.h b/src/gallium/drivers/swr/rasterizer/core/depthstencil.h
> index 4f245c8..2cc9d40 100644
> --- a/src/gallium/drivers/swr/rasterizer/core/depthstencil.h
> +++ b/src/gallium/drivers/swr/rasterizer/core/depthstencil.h
> @@ -82,7 +82,7 @@ void StencilOp(SWR_STENCILOP op, simdscalar mask, simdscalar stencilRefps, simds
>
> INLINE
> simdscalar DepthStencilTest(const SWR_VIEWPORT* pViewport, const SWR_DEPTH_STENCIL_STATE* pDSState,
> - bool frontFacing, simdscalar interpZ, BYTE* pDepthBase, simdscalar coverageMask, BYTE *pStencilBase,
> + bool frontFacing, simdscalar interpZ, uint8_t* pDepthBase, simdscalar coverageMask, uint8_t *pStencilBase,
> simdscalar* pStencilMask)
> {
> static_assert(KNOB_DEPTH_HOT_TILE_FORMAT == R32_FLOAT, "Unsupported depth hot tile format");
> @@ -177,8 +177,8 @@ simdscalar DepthStencilTest(const SWR_VIEWPORT* pViewport, const SWR_DEPTH_STENC
>
> INLINE
> void DepthStencilWrite(const SWR_VIEWPORT* pViewport, const SWR_DEPTH_STENCIL_STATE* pDSState,
> - bool frontFacing, simdscalar interpZ, BYTE* pDepthBase, const simdscalar& depthMask, const simdscalar& coverageMask,
> - BYTE *pStencilBase, const simdscalar& stencilMask)
> + bool frontFacing, simdscalar interpZ, uint8_t* pDepthBase, const simdscalar& depthMask, const simdscalar& coverageMask,
> + uint8_t *pStencilBase, const simdscalar& stencilMask)
> {
> if (pDSState->depthWriteEnable)
> {
> diff --git a/src/gallium/drivers/swr/rasterizer/core/fifo.hpp b/src/gallium/drivers/swr/rasterizer/core/fifo.hpp
> index 7e55601..ccf0b70 100644
> --- a/src/gallium/drivers/swr/rasterizer/core/fifo.hpp
> +++ b/src/gallium/drivers/swr/rasterizer/core/fifo.hpp
> @@ -49,7 +49,8 @@ struct QUEUE
> static const uint32_t mBlockSizeShift = 6;
> static const uint32_t mBlockSize = 1 << mBlockSizeShift;
>
> - void clear(Arena& arena)
> + template <typename ArenaT>
> + void clear(ArenaT& arena)
> {
> mHead = 0;
> mTail = 0;
> @@ -102,7 +103,8 @@ struct QUEUE
> mNumEntries --;
> }
>
> - bool enqueue_try_nosync(Arena& arena, const T* entry)
> + template <typename ArenaT>
> + bool enqueue_try_nosync(ArenaT& arena, const T* entry)
> {
> memcpy(&mCurBlock[mTail], entry, sizeof(T));
>
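Templating on ArenaT is duck typing: QUEUE no longer cares whether it is
handed the old Arena or the new CachingArena, only that the type provides
the allocation members the queue body calls (elided in this hunk). Sketch of
the idea, with an assumed method name:

    struct MyArena { void* AllocAligned(size_t size, size_t align); };
    // QUEUE<BE_WORK> q; MyArena a; q.clear(a);  // compiles as long as the
    // members QUEUE actually uses are present, no common base class needed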
> diff --git a/src/gallium/drivers/swr/rasterizer/core/format_conversion.h b/src/gallium/drivers/swr/rasterizer/core/format_conversion.h
> index 83d85fc..344758e 100644
> --- a/src/gallium/drivers/swr/rasterizer/core/format_conversion.h
> +++ b/src/gallium/drivers/swr/rasterizer/core/format_conversion.h
> @@ -34,7 +34,7 @@
> /// @param pSrc - source data in SOA form
> /// @param dst - output data in SOA form
> template<SWR_FORMAT SrcFormat>
> -INLINE void LoadSOA(const BYTE *pSrc, simdvector &dst)
> +INLINE void LoadSOA(const uint8_t *pSrc, simdvector &dst)
> {
> // fast path for float32
> if ((FormatTraits<SrcFormat>::GetType(0) == SWR_TYPE_FLOAT) && (FormatTraits<SrcFormat>::GetBPC(0) == 32))
> @@ -141,7 +141,7 @@ INLINE simdscalar Normalize(simdscalar vComp, uint32_t Component)
> /// @param src - source data in SOA form
> /// @param dst - output data in SOA form
> template<SWR_FORMAT DstFormat>
> -INLINE void StoreSOA(const simdvector &src, BYTE *pDst)
> +INLINE void StoreSOA(const simdvector &src, uint8_t *pDst)
> {
> // fast path for float32
> if ((FormatTraits<DstFormat>::GetType(0) == SWR_TYPE_FLOAT) && (FormatTraits<DstFormat>::GetBPC(0) == 32))
> diff --git a/src/gallium/drivers/swr/rasterizer/core/format_types.h b/src/gallium/drivers/swr/rasterizer/core/format_types.h
> index aa35025..9acf846 100644
> --- a/src/gallium/drivers/swr/rasterizer/core/format_types.h
> +++ b/src/gallium/drivers/swr/rasterizer/core/format_types.h
> @@ -34,8 +34,8 @@ template <uint32_t NumBits, bool Signed = false>
> struct PackTraits
> {
> static const uint32_t MyNumBits = NumBits;
> - static simdscalar loadSOA(const BYTE *pSrc) = delete;
> - static void storeSOA(BYTE *pDst, simdscalar src) = delete;
> + static simdscalar loadSOA(const uint8_t *pSrc) = delete;
> + static void storeSOA(uint8_t *pDst, simdscalar src) = delete;
> static simdscalar unpack(simdscalar &in) = delete;
> static simdscalar pack(simdscalar &in) = delete;
> };
> @@ -48,8 +48,8 @@ struct PackTraits<0, false>
> {
> static const uint32_t MyNumBits = 0;
>
> - static simdscalar loadSOA(const BYTE *pSrc) { return _simd_setzero_ps(); }
> - static void storeSOA(BYTE *pDst, simdscalar src) { return; }
> + static simdscalar loadSOA(const uint8_t *pSrc) { return _simd_setzero_ps(); }
> + static void storeSOA(uint8_t *pDst, simdscalar src) { return; }
> static simdscalar unpack(simdscalar &in) { return _simd_setzero_ps(); }
> static simdscalar pack(simdscalar &in) { return _simd_setzero_ps(); }
> };
> @@ -63,7 +63,7 @@ struct PackTraits<8, false>
> {
> static const uint32_t MyNumBits = 8;
>
> - static simdscalar loadSOA(const BYTE *pSrc)
> + static simdscalar loadSOA(const uint8_t *pSrc)
> {
> #if KNOB_SIMD_WIDTH == 8
> __m256 result = _mm256_setzero_ps();
> @@ -74,7 +74,7 @@ struct PackTraits<8, false>
> #endif
> }
>
> - static void storeSOA(BYTE *pDst, simdscalar src)
> + static void storeSOA(uint8_t *pDst, simdscalar src)
> {
> // store simd bytes
> #if KNOB_SIMD_WIDTH == 8
> @@ -125,7 +125,7 @@ struct PackTraits<8, true>
> {
> static const uint32_t MyNumBits = 8;
>
> - static simdscalar loadSOA(const BYTE *pSrc)
> + static simdscalar loadSOA(const uint8_t *pSrc)
> {
> #if KNOB_SIMD_WIDTH == 8
> __m256 result = _mm256_setzero_ps();
> @@ -136,7 +136,7 @@ struct PackTraits<8, true>
> #endif
> }
>
> - static void storeSOA(BYTE *pDst, simdscalar src)
> + static void storeSOA(uint8_t *pDst, simdscalar src)
> {
> // store simd bytes
> #if KNOB_SIMD_WIDTH == 8
> @@ -188,7 +188,7 @@ struct PackTraits<16, false>
> {
> static const uint32_t MyNumBits = 16;
>
> - static simdscalar loadSOA(const BYTE *pSrc)
> + static simdscalar loadSOA(const uint8_t *pSrc)
> {
> #if KNOB_SIMD_WIDTH == 8
> __m256 result = _mm256_setzero_ps();
> @@ -199,7 +199,7 @@ struct PackTraits<16, false>
> #endif
> }
>
> - static void storeSOA(BYTE *pDst, simdscalar src)
> + static void storeSOA(uint8_t *pDst, simdscalar src)
> {
> #if KNOB_SIMD_WIDTH == 8
> // store 16B (2B * 8)
> @@ -249,7 +249,7 @@ struct PackTraits<16, true>
> {
> static const uint32_t MyNumBits = 16;
>
> - static simdscalar loadSOA(const BYTE *pSrc)
> + static simdscalar loadSOA(const uint8_t *pSrc)
> {
> #if KNOB_SIMD_WIDTH == 8
> __m256 result = _mm256_setzero_ps();
> @@ -260,7 +260,7 @@ struct PackTraits<16, true>
> #endif
> }
>
> - static void storeSOA(BYTE *pDst, simdscalar src)
> + static void storeSOA(uint8_t *pDst, simdscalar src)
> {
> #if KNOB_SIMD_WIDTH == 8
> // store 16B (2B * 8)
> @@ -311,8 +311,8 @@ struct PackTraits<32, false>
> {
> static const uint32_t MyNumBits = 32;
>
> - static simdscalar loadSOA(const BYTE *pSrc) { return _simd_load_ps((const float*)pSrc); }
> - static void storeSOA(BYTE *pDst, simdscalar src) { _simd_store_ps((float*)pDst, src); }
> + static simdscalar loadSOA(const uint8_t *pSrc) { return _simd_load_ps((const float*)pSrc); }
> + static void storeSOA(uint8_t *pDst, simdscalar src) { _simd_store_ps((float*)pDst, src); }
> static simdscalar unpack(simdscalar &in) { return in; }
> static simdscalar pack(simdscalar &in) { return in; }
> };
> @@ -984,7 +984,7 @@ struct ComponentTraits
> return TypeTraits<X, NumBitsX>::fromFloat();
> }
>
> - INLINE static simdscalar loadSOA(uint32_t comp, const BYTE* pSrc)
> + INLINE static simdscalar loadSOA(uint32_t comp, const uint8_t* pSrc)
> {
> switch (comp)
> {
> @@ -1001,7 +1001,7 @@ struct ComponentTraits
> return TypeTraits<X, NumBitsX>::loadSOA(pSrc);
> }
>
> - INLINE static void storeSOA(uint32_t comp, BYTE *pDst, simdscalar src)
> + INLINE static void storeSOA(uint32_t comp, uint8_t *pDst, simdscalar src)
> {
> switch (comp)
> {
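For anyone new to this file, the SOA layout these loaders consume, sketched
for 8 RGBA8 pixels (illustrative):

    struct AOS { uint8_t rgba[8][4]; };             // r,g,b,a per pixel
    struct SOA { uint8_t r[8], g[8], b[8], a[8]; }; // channel-major

    // PackTraits<8>::loadSOA(soa.r) reads the 8 bytes of one channel and
    // widens them into a simdscalar, one float per SIMD lane.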
> diff --git a/src/gallium/drivers/swr/rasterizer/core/frontend.cpp b/src/gallium/drivers/swr/rasterizer/core/frontend.cpp
> index f43a672..36721e0 100644
> --- a/src/gallium/drivers/swr/rasterizer/core/frontend.cpp
> +++ b/src/gallium/drivers/swr/rasterizer/core/frontend.cpp
> @@ -193,35 +193,71 @@ void ProcessStoreTiles(
> /// @param workerId - thread's worker id. Every thread has a unique id.
> /// @param pUserData - Pointer to user data passed back to callback.
> /// @todo This should go away when we switch this to use compute threading.
> -void ProcessInvalidateTiles(
> +void ProcessDiscardInvalidateTiles(
> SWR_CONTEXT *pContext,
> DRAW_CONTEXT *pDC,
> uint32_t workerId,
> void *pUserData)
> {
> RDTSC_START(FEProcessInvalidateTiles);
> - INVALIDATE_TILES_DESC *pInv = (INVALIDATE_TILES_DESC*)pUserData;
> + DISCARD_INVALIDATE_TILES_DESC *pInv = (DISCARD_INVALIDATE_TILES_DESC*)pUserData;
> MacroTileMgr *pTileMgr = pDC->pTileMgr;
>
> - const API_STATE& state = GetApiState(pDC);
> + SWR_RECT rect;
> +
> + if (pInv->rect.top | pInv->rect.bottom | pInv->rect.right | pInv->rect.left)
> + {
> + // Valid rect
> + rect = pInv->rect;
> + }
> + else
> + {
> + // Use viewport dimensions
> + const API_STATE& state = GetApiState(pDC);
> +
> + rect.left = (uint32_t)state.vp[0].x;
> + rect.right = (uint32_t)(state.vp[0].x + state.vp[0].width);
> + rect.top = (uint32_t)state.vp[0].y;
> + rect.bottom = (uint32_t)(state.vp[0].y + state.vp[0].height);
> + }
>
> // queue a store to each macro tile
> // compute macro tile bounds for the current render target
> uint32_t macroWidth = KNOB_MACROTILE_X_DIM;
> uint32_t macroHeight = KNOB_MACROTILE_Y_DIM;
>
> - uint32_t numMacroTilesX = ((uint32_t)state.vp[0].width + (uint32_t)state.vp[0].x + (macroWidth - 1)) / macroWidth;
> - uint32_t numMacroTilesY = ((uint32_t)state.vp[0].height + (uint32_t)state.vp[0].y + (macroHeight - 1)) / macroHeight;
> + // Setup region assuming full tiles
> + uint32_t macroTileStartX = (rect.left + (macroWidth - 1)) / macroWidth;
> + uint32_t macroTileStartY = (rect.top + (macroHeight - 1)) / macroHeight;
> +
> + uint32_t macroTileEndX = rect.right / macroWidth;
> + uint32_t macroTileEndY = rect.bottom / macroHeight;
> +
> + if (pInv->fullTilesOnly == false)
> + {
> + // include partial tiles
> + macroTileStartX = rect.left / macroWidth;
> + macroTileStartY = rect.top / macroHeight;
> +
> + macroTileEndX = (rect.right + macroWidth - 1) / macroWidth;
> + macroTileEndY = (rect.bottom + macroHeight - 1) / macroHeight;
> + }
> +
> + SWR_ASSERT(macroTileEndX <= KNOB_NUM_HOT_TILES_X);
> + SWR_ASSERT(macroTileEndY <= KNOB_NUM_HOT_TILES_Y);
> +
> + macroTileEndX = std::min<uint32_t>(macroTileEndX, KNOB_NUM_HOT_TILES_X);
> + macroTileEndY = std::min<uint32_t>(macroTileEndY, KNOB_NUM_HOT_TILES_Y);
>
> // load tiles
> BE_WORK work;
> - work.type = INVALIDATETILES;
> - work.pfnWork = ProcessInvalidateTilesBE;
> - work.desc.invalidateTiles = *pInv;
> + work.type = DISCARDINVALIDATETILES;
> + work.pfnWork = ProcessDiscardInvalidateTilesBE;
> + work.desc.discardInvalidateTiles = *pInv;
>
> - for (uint32_t x = 0; x < numMacroTilesX; ++x)
> + for (uint32_t x = macroTileStartX; x < macroTileEndX; ++x)
> {
> - for (uint32_t y = 0; y < numMacroTilesY; ++y)
> + for (uint32_t y = macroTileStartY; y < macroTileEndY; ++y)
> {
> pTileMgr->enqueue(x, y, &work);
> }
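The full/partial rounding could use a comment in the code; worked example
with an illustrative KNOB_MACROTILE_X_DIM of 64:

    // rect.left = 10, rect.right = 200, macroWidth = 64
    // fullTilesOnly:    start = (10 + 63) / 64 = 1,  end =  200 / 64       = 3
    // include partial:  start =  10 / 64       = 0,  end = (200 + 63) / 64 = 4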
> @@ -630,6 +666,8 @@ void ProcessStreamIdBuffer(uint32_t stream, uint8_t* pStreamIdBase, uint32_t num
> }
> }
>
> +THREAD SWR_GS_CONTEXT tlsGsContext;
> +
> //////////////////////////////////////////////////////////////////////////
> /// @brief Implements GS stage.
> /// @param pDC - pointer to draw context.
> @@ -651,7 +689,6 @@ static void GeometryShaderStage(
> {
> RDTSC_START(FEGeometryShader);
>
> - SWR_GS_CONTEXT gsContext;
> SWR_CONTEXT* pContext = pDC->pContext;
>
> const API_STATE& state = GetApiState(pDC);
> @@ -660,9 +697,9 @@ static void GeometryShaderStage(
> SWR_ASSERT(pGsOut != nullptr, "GS output buffer should be initialized");
> SWR_ASSERT(pCutBuffer != nullptr, "GS output cut buffer should be initialized");
>
> - gsContext.pStream = (uint8_t*)pGsOut;
> - gsContext.pCutOrStreamIdBuffer = (uint8_t*)pCutBuffer;
> - gsContext.PrimitiveID = primID;
> + tlsGsContext.pStream = (uint8_t*)pGsOut;
> + tlsGsContext.pCutOrStreamIdBuffer = (uint8_t*)pCutBuffer;
> + tlsGsContext.PrimitiveID = primID;
>
> uint32_t numVertsPerPrim = NumVertsPerPrim(pa.binTopology, true);
> simdvector attrib[MAX_ATTRIBUTES];
> @@ -675,7 +712,7 @@ static void GeometryShaderStage(
>
> for (uint32_t i = 0; i < numVertsPerPrim; ++i)
> {
> - gsContext.vert[i].attrib[attribSlot] = attrib[i];
> + tlsGsContext.vert[i].attrib[attribSlot] = attrib[i];
> }
> }
>
> @@ -683,7 +720,7 @@ static void GeometryShaderStage(
> pa.Assemble(VERTEX_POSITION_SLOT, attrib);
> for (uint32_t i = 0; i < numVertsPerPrim; ++i)
> {
> - gsContext.vert[i].attrib[VERTEX_POSITION_SLOT] = attrib[i];
> + tlsGsContext.vert[i].attrib[VERTEX_POSITION_SLOT] = attrib[i];
> }
>
> const uint32_t vertexStride = sizeof(simdvertex);
> @@ -710,14 +747,14 @@ static void GeometryShaderStage(
>
> for (uint32_t instance = 0; instance < pState->instanceCount; ++instance)
> {
> - gsContext.InstanceID = instance;
> - gsContext.mask = GenerateMask(numInputPrims);
> + tlsGsContext.InstanceID = instance;
> + tlsGsContext.mask = GenerateMask(numInputPrims);
>
> // execute the geometry shader
> - state.pfnGsFunc(GetPrivateState(pDC), &gsContext);
> + state.pfnGsFunc(GetPrivateState(pDC), &tlsGsContext);
>
> - gsContext.pStream += instanceStride;
> - gsContext.pCutOrStreamIdBuffer += cutInstanceStride;
> + tlsGsContext.pStream += instanceStride;
> + tlsGsContext.pCutOrStreamIdBuffer += cutInstanceStride;
> }
>
> // set up new binner and state for the GS output topology
> @@ -736,7 +773,7 @@ static void GeometryShaderStage(
> // foreach input prim:
> // - setup a new PA based on the emitted verts for that prim
> // - loop over the new verts, calling PA to assemble each prim
> - uint32_t* pVertexCount = (uint32_t*)&gsContext.vertexCount;
> + uint32_t* pVertexCount = (uint32_t*)&tlsGsContext.vertexCount;
> uint32_t* pPrimitiveId = (uint32_t*)&primID;
>
> uint32_t totalPrimsGenerated = 0;
> @@ -844,7 +881,7 @@ static void GeometryShaderStage(
> static INLINE void AllocateGsBuffers(DRAW_CONTEXT* pDC, const API_STATE& state, void** ppGsOut, void** ppCutBuffer,
> void **ppStreamCutBuffer)
> {
> - Arena* pArena = pDC->pArena;
> + auto pArena = pDC->pArena;
> SWR_ASSERT(pArena != nullptr);
> SWR_ASSERT(state.gsState.gsEnable);
> // allocate arena space to hold GS output verts
> @@ -1186,7 +1223,7 @@ void ProcessDraw(
>
> // if the entire index buffer isn't being consumed, set the last index
> // so that fetches < a SIMD wide will be masked off
> - fetchInfo.pLastIndex = (const int32_t*)(((BYTE*)state.indexBuffer.pIndices) + state.indexBuffer.size);
> + fetchInfo.pLastIndex = (const int32_t*)(((uint8_t*)state.indexBuffer.pIndices) + state.indexBuffer.size);
> if (pLastRequestedIndex < fetchInfo.pLastIndex)
> {
> fetchInfo.pLastIndex = pLastRequestedIndex;
> @@ -1362,7 +1399,7 @@ void ProcessDraw(
> i += KNOB_SIMD_WIDTH;
> if (IsIndexedT)
> {
> - fetchInfo.pIndices = (int*)((BYTE*)fetchInfo.pIndices + KNOB_SIMD_WIDTH * indexSize);
> + fetchInfo.pIndices = (int*)((uint8_t*)fetchInfo.pIndices + KNOB_SIMD_WIDTH * indexSize);
> }
> else
> {
> @@ -1776,7 +1813,7 @@ void BinTriangles(
> work.pfnWork = gRasterizerTable[rastState.scissorEnable][SWR_MULTISAMPLE_1X];
> }
>
> - Arena* pArena = pDC->pArena;
> + auto pArena = pDC->pArena;
> SWR_ASSERT(pArena != nullptr);
>
> // store active attribs
> @@ -1948,7 +1985,7 @@ void BinPoints(
>
> work.pfnWork = RasterizeSimplePoint;
>
> - Arena* pArena = pDC->pArena;
> + auto pArena = pDC->pArena;
> SWR_ASSERT(pArena != nullptr);
>
> // store attributes
> @@ -2082,7 +2119,7 @@ void BinPoints(
>
> work.pfnWork = RasterizeTriPoint;
>
> - Arena* pArena = pDC->pArena;
> + auto pArena = pDC->pArena;
> SWR_ASSERT(pArena != nullptr);
>
> // store active attribs
> @@ -2299,7 +2336,7 @@ void BinLines(
>
> work.pfnWork = RasterizeLine;
>
> - Arena* pArena = pDC->pArena;
> + auto pArena = pDC->pArena;
> SWR_ASSERT(pArena != nullptr);
>
> // store active attribs
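Moving the GS context into THREAD storage trades a per-call stack allocation
of a large struct for one persistent instance per worker, at the cost of
making GeometryShaderStage non-reentrant on a thread, which looks fine given
the one-worker-per-thread model. Minimal model of the pattern (sketch, not
the actual os.h definition):

    #if defined(_WIN32)
    #define THREAD __declspec(thread)
    #else
    #define THREAD __thread
    #endif
    THREAD BigContext tlsCtx; // hypothetical type; one instance per thread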
> diff --git a/src/gallium/drivers/swr/rasterizer/core/frontend.h b/src/gallium/drivers/swr/rasterizer/core/frontend.h
> index acb935f..f92f88c 100644
> --- a/src/gallium/drivers/swr/rasterizer/core/frontend.h
> +++ b/src/gallium/drivers/swr/rasterizer/core/frontend.h
> @@ -146,14 +146,13 @@ float calcDeterminantInt(const __m128i vA, const __m128i vB)
> //vMul = [A1*B2 - B1*A2]
> vMul = _mm_sub_epi64(vMul, vMul2);
>
> - // According to emmintrin.h __mm_store1_pd(), address must be 16-byte aligned
> - OSALIGN(int64_t, 16) result;
> - _mm_store1_pd((double*)&result, _mm_castsi128_pd(vMul));
> + int64_t result;
> + _mm_store_sd((double*)&result, _mm_castsi128_pd(vMul));
>
> - double fResult = (double)result;
> - fResult = fResult * (1.0 / FIXED_POINT16_SCALE);
> + double dResult = (double)result;
> + dResult = dResult * (1.0 / FIXED_POINT16_SCALE);
>
> - return (float)fResult;
> + return (float)dResult;
> }
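Nice cleanup, _mm_store_sd only touches the low 64 bits so the 16-byte
alignment dance goes away. On x86-64 the memory round-trip could go away too
(illustrative alternative, same result):

    int64_t result = _mm_cvtsi128_si64(vMul); // low 64 bits of the vector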
>
> INLINE
> @@ -316,7 +315,7 @@ void ProcessDraw(SWR_CONTEXT *pContext, DRAW_CONTEXT *pDC, uint32_t workerId, vo
>
> void ProcessClear(SWR_CONTEXT *pContext, DRAW_CONTEXT *pDC, uint32_t workerId, void *pUserData);
> void ProcessStoreTiles(SWR_CONTEXT *pContext, DRAW_CONTEXT *pDC, uint32_t workerId, void *pUserData);
> -void ProcessInvalidateTiles(SWR_CONTEXT *pContext, DRAW_CONTEXT *pDC, uint32_t workerId, void *pUserData);
> +void ProcessDiscardInvalidateTiles(SWR_CONTEXT *pContext, DRAW_CONTEXT *pDC, uint32_t workerId, void *pUserData);
> void ProcessSync(SWR_CONTEXT *pContext, DRAW_CONTEXT *pDC, uint32_t workerId, void *pUserData);
> void ProcessQueryStats(SWR_CONTEXT *pContext, DRAW_CONTEXT *pDC, uint32_t workerId, void *pUserData);
>
> diff --git a/src/gallium/drivers/swr/rasterizer/core/knobs_init.h b/src/gallium/drivers/swr/rasterizer/core/knobs_init.h
> index 3f19555..adf738c 100644
> --- a/src/gallium/drivers/swr/rasterizer/core/knobs_init.h
> +++ b/src/gallium/drivers/swr/rasterizer/core/knobs_init.h
> @@ -80,6 +80,11 @@ static inline void ConvertEnvToKnob(const char* pOverride, float& knobValue)
> }
> }
>
> +static inline void ConvertEnvToKnob(const char* pOverride, std::string& knobValue)
> +{
> + knobValue = pOverride;
> +}
> +
> template <typename T>
> static inline void InitKnob(T& knob)
> {
> diff --git a/src/gallium/drivers/swr/rasterizer/core/pa.h b/src/gallium/drivers/swr/rasterizer/core/pa.h
> index 2028d9f..f8f1a33 100644
> --- a/src/gallium/drivers/swr/rasterizer/core/pa.h
> +++ b/src/gallium/drivers/swr/rasterizer/core/pa.h
> @@ -34,12 +34,12 @@
>
> struct PA_STATE
> {
> - DRAW_CONTEXT *pDC; // draw context
> - uint8_t* pStreamBase; // vertex stream
> - uint32_t streamSizeInVerts; // total size of the input stream in verts
> + DRAW_CONTEXT *pDC{ nullptr }; // draw context
> + uint8_t* pStreamBase{ nullptr }; // vertex stream
> + uint32_t streamSizeInVerts{ 0 }; // total size of the input stream in verts
>
> // The topology the binner will use. In some cases the FE changes the topology from the api state.
> - PRIMITIVE_TOPOLOGY binTopology;
> + PRIMITIVE_TOPOLOGY binTopology{ TOP_UNKNOWN };
>
> PA_STATE() {}
> PA_STATE(DRAW_CONTEXT *in_pDC, uint8_t* in_pStreamBase, uint32_t in_streamSizeInVerts) :
> @@ -76,37 +76,37 @@ struct PA_STATE
> // cuts
> struct PA_STATE_OPT : public PA_STATE
> {
> - simdvertex leadingVertex; // For tri-fan
> - uint32_t numPrims; // Total number of primitives for draw.
> - uint32_t numPrimsComplete; // Total number of complete primitives.
> + simdvertex leadingVertex; // For tri-fan
> + uint32_t numPrims{ 0 }; // Total number of primitives for draw.
> + uint32_t numPrimsComplete{ 0 }; // Total number of complete primitives.
>
> - uint32_t numSimdPrims; // Number of prims in current simd.
> + uint32_t numSimdPrims{ 0 }; // Number of prims in current simd.
>
> - uint32_t cur; // index to current VS output.
> - uint32_t prev; // index to prev VS output. Not really needed in the state.
> - uint32_t first; // index to first VS output. Used for trifan.
> + uint32_t cur{ 0 }; // index to current VS output.
> + uint32_t prev{ 0 }; // index to prev VS output. Not really needed in the state.
> + uint32_t first{ 0 }; // index to first VS output. Used for trifan.
>
> - uint32_t counter; // state counter
> - bool reset; // reset state
> + uint32_t counter{ 0 }; // state counter
> + bool reset{ false }; // reset state
>
> - uint32_t primIDIncr; // how much to increment for each vector (typically vector / {1, 2})
> + uint32_t primIDIncr{ 0 }; // how much to increment for each vector (typically vector / {1, 2})
> simdscalari primID;
>
> typedef bool(*PFN_PA_FUNC)(PA_STATE_OPT& state, uint32_t slot, simdvector verts[]);
> typedef void(*PFN_PA_SINGLE_FUNC)(PA_STATE_OPT& pa, uint32_t slot, uint32_t primIndex, __m128 verts[]);
>
> - PFN_PA_FUNC pfnPaFunc; // PA state machine function for assembling 4 triangles.
> - PFN_PA_SINGLE_FUNC pfnPaSingleFunc; // PA state machine function for assembling single triangle.
> - PFN_PA_FUNC pfnPaFuncReset; // initial state to set on reset
> + PFN_PA_FUNC pfnPaFunc{ nullptr }; // PA state machine function for assembling 4 triangles.
> + PFN_PA_SINGLE_FUNC pfnPaSingleFunc{ nullptr }; // PA state machine function for assembling single triangle.
> + PFN_PA_FUNC pfnPaFuncReset{ nullptr }; // initial state to set on reset
>
> // state used to advance the PA when Next is called
> - PFN_PA_FUNC pfnPaNextFunc;
> - uint32_t nextNumSimdPrims;
> - uint32_t nextNumPrimsIncrement;
> - bool nextReset;
> - bool isStreaming;
> + PFN_PA_FUNC pfnPaNextFunc{ nullptr };
> + uint32_t nextNumSimdPrims{ 0 };
> + uint32_t nextNumPrimsIncrement{ 0 };
> + bool nextReset{ false };
> + bool isStreaming{ false };
>
> - simdmask tmpIndices; // temporary index store for unused virtual function
> + simdmask tmpIndices{ 0 }; // temporary index store for unused virtual function
>
> PA_STATE_OPT() {}
> PA_STATE_OPT(DRAW_CONTEXT* pDC, uint32_t numPrims, uint8_t* pStream, uint32_t streamSizeInVerts,
> @@ -333,33 +333,33 @@ INLINE __m128 swizzleLaneN(const simdvector &a, int lane)
> // Cut-aware primitive assembler.
> struct PA_STATE_CUT : public PA_STATE
> {
> - simdmask* pCutIndices; // cut indices buffer, 1 bit per vertex
> - uint32_t numVerts; // number of vertices available in buffer store
> - uint32_t numAttribs; // number of attributes
> - int32_t numRemainingVerts; // number of verts remaining to be assembled
> - uint32_t numVertsToAssemble; // total number of verts to assemble for the draw
> + simdmask* pCutIndices{ nullptr }; // cut indices buffer, 1 bit per vertex
> + uint32_t numVerts{ 0 }; // number of vertices available in buffer store
> + uint32_t numAttribs{ 0 }; // number of attributes
> + int32_t numRemainingVerts{ 0 }; // number of verts remaining to be assembled
> + uint32_t numVertsToAssemble{ 0 }; // total number of verts to assemble for the draw
> OSALIGNSIMD(uint32_t) indices[MAX_NUM_VERTS_PER_PRIM][KNOB_SIMD_WIDTH]; // current index buffer for gather
> simdscalari vOffsets[MAX_NUM_VERTS_PER_PRIM]; // byte offsets for currently assembling simd
> - uint32_t numPrimsAssembled; // number of primitives that are fully assembled
> - uint32_t headVertex; // current unused vertex slot in vertex buffer store
> - uint32_t tailVertex; // beginning vertex currently assembling
> - uint32_t curVertex; // current unprocessed vertex
> - uint32_t startPrimId; // starting prim id
> - simdscalari vPrimId; // vector of prim ID
> - bool needOffsets; // need to compute gather offsets for current SIMD
> - uint32_t vertsPerPrim;
> - simdvertex tmpVertex; // temporary simdvertex for unimplemented API
> - bool processCutVerts; // vertex indices with cuts should be processed as normal, otherwise they
> - // are ignored. Fetch shader sends invalid verts on cuts that should be ignored
> - // while the GS sends valid verts for every index
> + uint32_t numPrimsAssembled{ 0 }; // number of primitives that are fully assembled
> + uint32_t headVertex{ 0 }; // current unused vertex slot in vertex buffer store
> + uint32_t tailVertex{ 0 }; // beginning vertex currently assembling
> + uint32_t curVertex{ 0 }; // current unprocessed vertex
> + uint32_t startPrimId{ 0 }; // starting prim id
> + simdscalari vPrimId; // vector of prim ID
> + bool needOffsets{ false }; // need to compute gather offsets for current SIMD
> + uint32_t vertsPerPrim{ 0 };
> + simdvertex tmpVertex; // temporary simdvertex for unimplemented API
> + bool processCutVerts{ false }; // vertex indices with cuts should be processed as normal, otherwise they
> + // are ignored. Fetch shader sends invalid verts on cuts that should be ignored
> + // while the GS sends valid verts for every index
> // Topology state tracking
> uint32_t vert[MAX_NUM_VERTS_PER_PRIM];
> - uint32_t curIndex;
> - bool reverseWinding; // indicates reverse winding for strips
> - int32_t adjExtraVert; // extra vert uses for tristrip w/ adj
> + uint32_t curIndex{ 0 };
> + bool reverseWinding{ false }; // indicates reverse winding for strips
> +    int32_t adjExtraVert{ 0 };          // extra vert used for tristrip w/ adj
>
> typedef void(PA_STATE_CUT::* PFN_PA_FUNC)(uint32_t vert, bool finish);
> - PFN_PA_FUNC pfnPa; // per-topology function that processes a single vert
> + PFN_PA_FUNC pfnPa{ nullptr }; // per-topology function that processes a single vert
>
> PA_STATE_CUT() {}
> PA_STATE_CUT(DRAW_CONTEXT* pDC, uint8_t* in_pStream, uint32_t in_streamSizeInVerts, simdmask* in_pIndices, uint32_t in_numVerts,
> @@ -1199,9 +1199,9 @@ struct PA_FACTORY
>
> PA_STATE_OPT paOpt;
> PA_STATE_CUT paCut;
> - bool cutPA;
> + bool cutPA{ false };
>
> - PRIMITIVE_TOPOLOGY topo;
> + PRIMITIVE_TOPOLOGY topo{ TOP_UNKNOWN };
>
> simdvertex vertexStore[MAX_NUM_VERTS_PER_PRIM];
> simdmask indexStore[MAX_NUM_VERTS_PER_PRIM];
> diff --git a/src/gallium/drivers/swr/rasterizer/core/rasterizer.cpp b/src/gallium/drivers/swr/rasterizer/core/rasterizer.cpp
> index 587e336..52fb7c8 100644
> --- a/src/gallium/drivers/swr/rasterizer/core/rasterizer.cpp
> +++ b/src/gallium/drivers/swr/rasterizer/core/rasterizer.cpp
> @@ -690,9 +690,10 @@ void RasterizeTriangle(DRAW_CONTEXT* pDC, uint32_t workerId, uint32_t macroTile,
>
> // Evaluate edge equations at sample positions of each of the 4 corners of a raster tile
> // used for testing if the entire raster tile is inside the triangle
> - vEdgeFix16[0] = _mm256_add_pd(vEdgeFix16[0], rastEdges[0].vRasterTileOffsets);
> - vEdgeFix16[1] = _mm256_add_pd(vEdgeFix16[1], rastEdges[1].vRasterTileOffsets);
> - vEdgeFix16[2] = _mm256_add_pd(vEdgeFix16[2], rastEdges[2].vRasterTileOffsets);
> + for (uint32_t e = 0; e < numEdges; ++e)
> + {
> + vEdgeFix16[e] = _mm256_add_pd(vEdgeFix16[e], rastEdges[e].vRasterTileOffsets);
> + }
>
> // at this point vEdge has been evaluated at the UL pixel corners of raster tile bbox
> // step sample positions to the raster tile bbox of multisample points
> @@ -700,7 +701,7 @@ void RasterizeTriangle(DRAW_CONTEXT* pDC, uint32_t workerId, uint32_t macroTile,
> // | |
> // | |
> // min(xSamples),max(ySamples) ------ max(xSamples),max(ySamples)
> - __m256d vEdge0TileBbox, vEdge1TileBbox, vEdge2TileBbox;
> + __m256d vEdgeTileBbox[3];
> if (sampleCount > SWR_MULTISAMPLE_1X)
> {
> __m128i vTileSampleBBoxXh = MultisampleTraits<sampleCount>::TileSampleOffsetsX();
> @@ -711,17 +712,12 @@ void RasterizeTriangle(DRAW_CONTEXT* pDC, uint32_t workerId, uint32_t macroTile,
>
> // step edge equation tests from Tile
> // used for testing if the entire raster tile is inside the triangle
> - __m256d vResultAxFix16 = _mm256_mul_pd(_mm256_set1_pd(rastEdges[0].a), vTileSampleBBoxXFix8);
> - __m256d vResultByFix16 = _mm256_mul_pd(_mm256_set1_pd(rastEdges[0].b), vTileSampleBBoxYFix8);
> - vEdge0TileBbox = _mm256_add_pd(vResultAxFix16, vResultByFix16);
> -
> - vResultAxFix16 = _mm256_mul_pd(_mm256_set1_pd(rastEdges[1].a), vTileSampleBBoxXFix8);
> - vResultByFix16 = _mm256_mul_pd(_mm256_set1_pd(rastEdges[1].b), vTileSampleBBoxYFix8);
> - vEdge1TileBbox = _mm256_add_pd(vResultAxFix16, vResultByFix16);
> -
> - vResultAxFix16 = _mm256_mul_pd(_mm256_set1_pd(rastEdges[2].a), vTileSampleBBoxXFix8);
> - vResultByFix16 = _mm256_mul_pd(_mm256_set1_pd(rastEdges[2].b), vTileSampleBBoxYFix8);
> - vEdge2TileBbox = _mm256_add_pd(vResultAxFix16, vResultByFix16);
> + for (uint32_t e = 0; e < 3; ++e)
> + {
> + __m256d vResultAxFix16 = _mm256_mul_pd(_mm256_set1_pd(rastEdges[e].a), vTileSampleBBoxXFix8);
> + __m256d vResultByFix16 = _mm256_mul_pd(_mm256_set1_pd(rastEdges[e].b), vTileSampleBBoxYFix8);
> + vEdgeTileBbox[e] = _mm256_add_pd(vResultAxFix16, vResultByFix16);
> + }
> }
>
> RDTSC_STOP(BEStepSetup, 0, pDC->drawId);
> @@ -756,7 +752,7 @@ void RasterizeTriangle(DRAW_CONTEXT* pDC, uint32_t workerId, uint32_t macroTile,
>
> for (uint32_t tileX = tX; tileX <= maxX; ++tileX)
> {
> - uint64_t anyCoveredSamples = 0;
> + triDesc.anyCoveredSamples = 0;
>
> // is the corner of the edge outside of the raster tile? (vEdge < 0)
> int mask0, mask1, mask2;
> @@ -770,9 +766,9 @@ void RasterizeTriangle(DRAW_CONTEXT* pDC, uint32_t workerId, uint32_t macroTile,
> {
> __m256d vSampleBboxTest0, vSampleBboxTest1, vSampleBboxTest2;
> // evaluate edge equations at the tile multisample bounding box
> - vSampleBboxTest0 = _mm256_add_pd(vEdge0TileBbox, vEdgeFix16[0]);
> - vSampleBboxTest1 = _mm256_add_pd(vEdge1TileBbox, vEdgeFix16[1]);
> - vSampleBboxTest2 = _mm256_add_pd(vEdge2TileBbox, vEdgeFix16[2]);
> + vSampleBboxTest0 = _mm256_add_pd(vEdgeTileBbox[0], vEdgeFix16[0]);
> + vSampleBboxTest1 = _mm256_add_pd(vEdgeTileBbox[1], vEdgeFix16[1]);
> + vSampleBboxTest2 = _mm256_add_pd(vEdgeTileBbox[2], vEdgeFix16[2]);
> mask0 = _mm256_movemask_pd(vSampleBboxTest0);
> mask1 = _mm256_movemask_pd(vSampleBboxTest1);
> mask2 = _mm256_movemask_pd(vSampleBboxTest2);
> @@ -789,20 +785,21 @@ void RasterizeTriangle(DRAW_CONTEXT* pDC, uint32_t workerId, uint32_t macroTile,
> triDesc.coverageMask[sampleNum] = 0xffffffffffffffffULL;
> if ((mask0 & mask1 & mask2) == 0xf)
> {
> - anyCoveredSamples = triDesc.coverageMask[sampleNum];
> + triDesc.anyCoveredSamples = triDesc.coverageMask[sampleNum];
> // trivial accept, all 4 corners of all 3 edges are negative
> // i.e. raster tile completely inside triangle
> RDTSC_EVENT(BETrivialAccept, 1, 0);
> }
> else
> {
> - __m256d vEdge0AtSample, vEdge1AtSample, vEdge2AtSample;
> + __m256d vEdgeAtSample[numEdges];
> if(sampleCount == SWR_MULTISAMPLE_1X)
> {
> // should get optimized out for single sample case (global value numbering or copy propagation)
> - vEdge0AtSample = vEdgeFix16[0];
> - vEdge1AtSample = vEdgeFix16[1];
> - vEdge2AtSample = vEdgeFix16[2];
> + for (uint32_t e = 0; e < numEdges; ++e)
> + {
> + vEdgeAtSample[e] = vEdgeFix16[e];
> + }
> }
> else
> {
> @@ -815,31 +812,20 @@ void RasterizeTriangle(DRAW_CONTEXT* pDC, uint32_t workerId, uint32_t macroTile,
> // for each edge and broadcasts it before offsetting to individual pixel quads
>
> // step edge equation tests from UL tile corner to pixel sample position
> - __m256d vResultAxFix16 = _mm256_mul_pd(_mm256_set1_pd(rastEdges[0].a), vSampleOffsetX);
> - __m256d vResultByFix16 = _mm256_mul_pd(_mm256_set1_pd(rastEdges[0].b), vSampleOffsetY);
> - vEdge0AtSample = _mm256_add_pd(vResultAxFix16, vResultByFix16);
> - vEdge0AtSample = _mm256_add_pd(vEdgeFix16[0], vEdge0AtSample);
> -
> - vResultAxFix16 = _mm256_mul_pd(_mm256_set1_pd(rastEdges[1].a), vSampleOffsetX);
> - vResultByFix16 = _mm256_mul_pd(_mm256_set1_pd(rastEdges[1].b), vSampleOffsetY);
> - vEdge1AtSample = _mm256_add_pd(vResultAxFix16, vResultByFix16);
> - vEdge1AtSample = _mm256_add_pd(vEdgeFix16[1], vEdge1AtSample);
> -
> - vResultAxFix16 = _mm256_mul_pd(_mm256_set1_pd(rastEdges[2].a), vSampleOffsetX);
> - vResultByFix16 = _mm256_mul_pd(_mm256_set1_pd(rastEdges[2].b), vSampleOffsetY);
> - vEdge2AtSample = _mm256_add_pd(vResultAxFix16, vResultByFix16);
> - vEdge2AtSample = _mm256_add_pd(vEdgeFix16[2], vEdge2AtSample);
> + for (uint32_t e = 0; e < numEdges; ++e)
> + {
> + __m256d vResultAxFix16 = _mm256_mul_pd(_mm256_set1_pd(rastEdges[e].a), vSampleOffsetX);
> + __m256d vResultByFix16 = _mm256_mul_pd(_mm256_set1_pd(rastEdges[e].b), vSampleOffsetY);
> + vEdgeAtSample[e] = _mm256_add_pd(vResultAxFix16, vResultByFix16);
> + vEdgeAtSample[e] = _mm256_add_pd(vEdgeFix16[e], vEdgeAtSample[e]);
> + }
> }
>
> double startQuadEdges[numEdges];
> const __m256i vLane0Mask = _mm256_set_epi32(0, 0, 0, 0, 0, 0, -1, -1);
> - _mm256_maskstore_pd(&startQuadEdges[0], vLane0Mask, vEdge0AtSample);
> - _mm256_maskstore_pd(&startQuadEdges[1], vLane0Mask, vEdge1AtSample);
> - _mm256_maskstore_pd(&startQuadEdges[2], vLane0Mask, vEdge2AtSample);
> -
> - for (uint32_t e = 3; e < numEdges; ++e)
> + for (uint32_t e = 0; e < numEdges; ++e)
> {
> - _mm256_maskstore_pd(&startQuadEdges[e], vLane0Mask, vEdgeFix16[e]);
> + _mm256_maskstore_pd(&startQuadEdges[e], vLane0Mask, vEdgeAtSample[e]);
> }
>
> // not trivial accept or reject, must rasterize full tile
> @@ -854,7 +840,7 @@ void RasterizeTriangle(DRAW_CONTEXT* pDC, uint32_t workerId, uint32_t macroTile,
> }
> RDTSC_STOP(BERasterizePartial, 0, 0);
>
> - anyCoveredSamples |= triDesc.coverageMask[sampleNum];
> + triDesc.anyCoveredSamples |= triDesc.coverageMask[sampleNum];
> }
> }
> else
> @@ -875,7 +861,7 @@ void RasterizeTriangle(DRAW_CONTEXT* pDC, uint32_t workerId, uint32_t macroTile,
> }
> else
> #endif
> - if(anyCoveredSamples)
> + if(triDesc.anyCoveredSamples)
> {
> RDTSC_START(BEPixelBackend);
> backendFuncs.pfnBackend(pDC, workerId, tileX << KNOB_TILE_X_DIM_SHIFT, tileY << KNOB_TILE_Y_DIM_SHIFT, triDesc, renderBuffers);
> diff --git a/src/gallium/drivers/swr/rasterizer/core/ringbuffer.h b/src/gallium/drivers/swr/rasterizer/core/ringbuffer.h
> new file mode 100644
> index 0000000..7ff109d
> --- /dev/null
> +++ b/src/gallium/drivers/swr/rasterizer/core/ringbuffer.h
> @@ -0,0 +1,102 @@
> +/****************************************************************************
> +* Copyright (C) 2016 Intel Corporation. All Rights Reserved.
> +*
> +* Permission is hereby granted, free of charge, to any person obtaining a
> +* copy of this software and associated documentation files (the "Software"),
> +* to deal in the Software without restriction, including without limitation
> +* the rights to use, copy, modify, merge, publish, distribute, sublicense,
> +* and/or sell copies of the Software, and to permit persons to whom the
> +* Software is furnished to do so, subject to the following conditions:
> +*
> +* The above copyright notice and this permission notice (including the next
> +* paragraph) shall be included in all copies or substantial portions of the
> +* Software.
> +*
> +* THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
> +* IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
> +* FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL
> +* THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
> +* LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
> +* FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS
> +* IN THE SOFTWARE.
> +*
> +* @file ringbuffer.h
> +*
> +* @brief RingBuffer
> +* The RingBuffer class manages all aspects of the ring buffer including
> +* the head/tail indices, etc.
> +*
> +******************************************************************************/
> +#pragma once
> +
> +template<typename T>
> +class RingBuffer
> +{
> +public:
> + RingBuffer()
> + : mpRingBuffer(nullptr), mNumEntries(0), mRingHead(0), mRingTail(0)
> + {
> + }
> +
> + ~RingBuffer()
> + {
> + Destroy();
> + }
> +
> + void Init(uint32_t numEntries)
> + {
> + SWR_ASSERT(numEntries > 0);
> + mNumEntries = numEntries;
> + mpRingBuffer = (T*)_aligned_malloc(sizeof(T)*numEntries, 64);
> + SWR_ASSERT(mpRingBuffer != nullptr);
> + memset(mpRingBuffer, 0, sizeof(T)*numEntries);
> + }
> +
> + void Destroy()
> + {
> + _aligned_free(mpRingBuffer);
> + mpRingBuffer = nullptr;
> + }
> +
> + T& operator[](const uint32_t index)
> + {
> + SWR_ASSERT(index < mNumEntries);
> + return mpRingBuffer[index];
> + }
> +
> + INLINE void Enqueue()
> + {
> + mRingHead++; // There's only one producer.
> + }
> +
> + INLINE void Dequeue()
> + {
> + InterlockedIncrement(&mRingTail); // There are multiple consumers.
> + }
> +
> + INLINE bool IsEmpty()
> + {
> + return (GetHead() == GetTail());
> + }
> +
> + INLINE bool IsFull()
> + {
> + ///@note We don't handle wrap case due to using 64-bit indices.
> + /// It would take 11 million years to wrap at 50,000 DCs per sec.
> +        /// If we used 32-bit indices then it's about 23 hours to wrap.
> + uint64_t numEnqueued = GetHead() - GetTail();
> + SWR_ASSERT(numEnqueued <= mNumEntries);
> +
> + return (numEnqueued == mNumEntries);
> + }
> +
> + INLINE volatile uint64_t GetTail() { return mRingTail; }
> + INLINE volatile uint64_t GetHead() { return mRingHead; }
> +
> +protected:
> + T* mpRingBuffer;
> + uint32_t mNumEntries;
> +
> +    OSALIGNLINE(volatile uint64_t) mRingHead;  // Producer Counter
> +    OSALIGNLINE(volatile uint64_t) mRingTail;  // Consumer Counter
> +};
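
The single-producer/multi-consumer protocol here deserves a note, since
the head/tail counters are free-running: a slot is addressed by taking
the index modulo the entry count (the dcRing[dcSlot] lookups further
down do exactly this), and fullness is simply head minus tail. A
minimal standalone sketch of the same protocol -- std::atomic is my
stand-in for the volatile counter plus InterlockedIncrement, and the
names are mine, not the patch's:

  #include <atomic>
  #include <cassert>
  #include <cstdint>

  template <typename T, uint32_t N>
  struct RingSketch
  {
      T slots[N] = {};
      uint64_t head = 0;                  // bumped by the one producer
      std::atomic<uint64_t> tail{0};      // bumped by any consumer

      bool IsFull() const  { return head - tail.load() == N; }
      bool IsEmpty() const { return head == tail.load(); }

      T& ProducerSlot() { assert(!IsFull());  return slots[head % N]; }
      T& ConsumerSlot() { assert(!IsEmpty()); return slots[tail.load() % N]; }
  };

The wrap estimate in IsFull() checks out: 2^64 draws at 50,000 per
second is ~3.7e14 seconds, roughly 11.7 million years, while 2^32 at
the same rate is ~86,000 seconds, just under 24 hours.
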
> diff --git a/src/gallium/drivers/swr/rasterizer/core/state.h b/src/gallium/drivers/swr/rasterizer/core/state.h
> index 2758555..5752094 100644
> --- a/src/gallium/drivers/swr/rasterizer/core/state.h
> +++ b/src/gallium/drivers/swr/rasterizer/core/state.h
> @@ -307,6 +307,8 @@ struct PixelPositions
> simdscalar centroid;
> };
>
> +#define SWR_MAX_NUM_MULTISAMPLES 16
> +
> //////////////////////////////////////////////////////////////////////////
> /// SWR_PS_CONTEXT
> /// @brief Input to pixel shader.
> @@ -338,6 +340,7 @@ struct SWR_PS_CONTEXT
> uint32_t frontFace; // IN: front- 1, back- 0
> uint32_t primID; // IN: primitive ID
> uint32_t sampleIndex; // IN: sampleIndex
> +
> };
>
> //////////////////////////////////////////////////////////////////////////
> @@ -748,7 +751,6 @@ struct SWR_RENDER_TARGET_BLEND_STATE
> };
> static_assert(sizeof(SWR_RENDER_TARGET_BLEND_STATE) == 1, "Invalid SWR_RENDER_TARGET_BLEND_STATE size");
>
> -#define SWR_MAX_NUM_MULTISAMPLES 16
> enum SWR_MULTISAMPLE_COUNT
> {
> SWR_MULTISAMPLE_1X = 0,
> @@ -786,7 +788,8 @@ typedef void(__cdecl *PFN_GS_FUNC)(HANDLE hPrivateData, SWR_GS_CONTEXT* pGsConte
> typedef void(__cdecl *PFN_CS_FUNC)(HANDLE hPrivateData, SWR_CS_CONTEXT* pCsContext);
> typedef void(__cdecl *PFN_SO_FUNC)(SWR_STREAMOUT_CONTEXT& soContext);
> typedef void(__cdecl *PFN_PIXEL_KERNEL)(HANDLE hPrivateData, SWR_PS_CONTEXT *pContext);
> -typedef void(__cdecl *PFN_BLEND_JIT_FUNC)(const SWR_BLEND_STATE*, simdvector&, simdvector&, uint32_t, BYTE*, simdvector&, simdscalari*, simdscalari*);
> +typedef void(__cdecl *PFN_CPIXEL_KERNEL)(HANDLE hPrivateData, SWR_PS_CONTEXT *pContext);
> +typedef void(__cdecl *PFN_BLEND_JIT_FUNC)(const SWR_BLEND_STATE*, simdvector&, simdvector&, uint32_t, uint8_t*, simdvector&, simdscalari*, simdscalari*);
>
> //////////////////////////////////////////////////////////////////////////
> /// FRONTEND_STATE
> @@ -941,6 +944,7 @@ struct SWR_BACKEND_STATE
> uint8_t numComponents[KNOB_NUM_ATTRIBUTES];
> };
>
> +
> union SWR_DEPTH_STENCIL_STATE
> {
> struct
> @@ -980,7 +984,6 @@ enum SWR_SHADING_RATE
> {
> SWR_SHADING_RATE_PIXEL,
> SWR_SHADING_RATE_SAMPLE,
> - SWR_SHADING_RATE_COARSE,
> SWR_SHADING_RATE_MAX,
> };
>
> @@ -1024,4 +1027,5 @@ struct SWR_PS_STATE
> uint32_t barycentricsMask : 3; // which type(s) of barycentric coords does the PS interpolate attributes with
> uint32_t usesUAV : 1; // pixel shader accesses UAV
> uint32_t forceEarlyZ : 1; // force execution of early depth/stencil test
> +
> };
> diff --git a/src/gallium/drivers/swr/rasterizer/core/threads.cpp b/src/gallium/drivers/swr/rasterizer/core/threads.cpp
> index 24c5588..ce8646f 100644
> --- a/src/gallium/drivers/swr/rasterizer/core/threads.cpp
> +++ b/src/gallium/drivers/swr/rasterizer/core/threads.cpp
> @@ -24,7 +24,6 @@
> #include <stdio.h>
> #include <thread>
> #include <algorithm>
> -#include <unordered_set>
> #include <float.h>
> #include <vector>
> #include <utility>
> @@ -44,7 +43,6 @@
> #include "rasterizer.h"
> #include "rdtsc_core.h"
> #include "tilemgr.h"
> -#include "core/multisample.h"
>
>
>
> @@ -265,9 +263,7 @@ void bindThread(uint32_t threadId, uint32_t procGroupId = 0, bool bindProcGroup=
> INLINE
> uint64_t GetEnqueuedDraw(SWR_CONTEXT *pContext)
> {
> - //uint64_t result = _InterlockedCompareExchange64((volatile __int64*)&pContext->DrawEnqueued, 0, 0);
> - //return result;
> - return pContext->DrawEnqueued;
> + return pContext->dcRing.GetHead();
> }
>
> INLINE
> @@ -283,169 +279,21 @@ bool CheckDependency(SWR_CONTEXT *pContext, DRAW_CONTEXT *pDC, uint64_t lastReti
> return (pDC->dependency > lastRetiredDraw);
> }
>
> -void ClearColorHotTile(const HOTTILE* pHotTile) // clear a macro tile from float4 clear data.
> -{
> - // Load clear color into SIMD register...
> - float *pClearData = (float*)(pHotTile->clearData);
> - simdscalar valR = _simd_broadcast_ss(&pClearData[0]);
> - simdscalar valG = _simd_broadcast_ss(&pClearData[1]);
> - simdscalar valB = _simd_broadcast_ss(&pClearData[2]);
> - simdscalar valA = _simd_broadcast_ss(&pClearData[3]);
> -
> - float *pfBuf = (float*)pHotTile->pBuffer;
> - uint32_t numSamples = pHotTile->numSamples;
>
> - for (uint32_t row = 0; row < KNOB_MACROTILE_Y_DIM; row += KNOB_TILE_Y_DIM)
> - {
> - for (uint32_t col = 0; col < KNOB_MACROTILE_X_DIM; col += KNOB_TILE_X_DIM)
> - {
> - for (uint32_t si = 0; si < (KNOB_TILE_X_DIM * KNOB_TILE_Y_DIM * numSamples); si += SIMD_TILE_X_DIM * SIMD_TILE_Y_DIM) //SIMD_TILE_X_DIM * SIMD_TILE_Y_DIM); si++)
> - {
> - _simd_store_ps(pfBuf, valR);
> - pfBuf += KNOB_SIMD_WIDTH;
> - _simd_store_ps(pfBuf, valG);
> - pfBuf += KNOB_SIMD_WIDTH;
> - _simd_store_ps(pfBuf, valB);
> - pfBuf += KNOB_SIMD_WIDTH;
> - _simd_store_ps(pfBuf, valA);
> - pfBuf += KNOB_SIMD_WIDTH;
> - }
> - }
> - }
> -}
>
> -void ClearDepthHotTile(const HOTTILE* pHotTile) // clear a macro tile from float4 clear data.
> +INLINE void CompleteDrawContext(SWR_CONTEXT* pContext, DRAW_CONTEXT* pDC)
> {
> - // Load clear color into SIMD register...
> - float *pClearData = (float*)(pHotTile->clearData);
> - simdscalar valZ = _simd_broadcast_ss(&pClearData[0]);
> + int64_t result = InterlockedDecrement64(&pDC->threadsDone);
>
> - float *pfBuf = (float*)pHotTile->pBuffer;
> - uint32_t numSamples = pHotTile->numSamples;
> -
> - for (uint32_t row = 0; row < KNOB_MACROTILE_Y_DIM; row += KNOB_TILE_Y_DIM)
> + if (result == 0)
> {
> - for (uint32_t col = 0; col < KNOB_MACROTILE_X_DIM; col += KNOB_TILE_X_DIM)
> - {
> - for (uint32_t si = 0; si < (KNOB_TILE_X_DIM * KNOB_TILE_Y_DIM * numSamples); si += SIMD_TILE_X_DIM * SIMD_TILE_Y_DIM)
> - {
> - _simd_store_ps(pfBuf, valZ);
> - pfBuf += KNOB_SIMD_WIDTH;
> - }
> - }
> - }
> -}
> -
> -void ClearStencilHotTile(const HOTTILE* pHotTile)
> -{
> - // convert from F32 to U8.
> - uint8_t clearVal = (uint8_t)(pHotTile->clearData[0]);
> - //broadcast 32x into __m256i...
> - simdscalari valS = _simd_set1_epi8(clearVal);
> -
> - simdscalari* pBuf = (simdscalari*)pHotTile->pBuffer;
> - uint32_t numSamples = pHotTile->numSamples;
> -
> - for (uint32_t row = 0; row < KNOB_MACROTILE_Y_DIM; row += KNOB_TILE_Y_DIM)
> - {
> - for (uint32_t col = 0; col < KNOB_MACROTILE_X_DIM; col += KNOB_TILE_X_DIM)
> - {
> - // We're putting 4 pixels in each of the 32-bit slots, so increment 4 times as quickly.
> - for (uint32_t si = 0; si < (KNOB_TILE_X_DIM * KNOB_TILE_Y_DIM * numSamples); si += SIMD_TILE_X_DIM * SIMD_TILE_Y_DIM * 4)
> - {
> - _simd_store_si(pBuf, valS);
> - pBuf += 1;
> - }
> - }
> - }
> -}
> -
> -// for draw calls, we initialize the active hot tiles and perform deferred
> -// load on them if tile is in invalid state. we do this in the outer thread loop instead of inside
> -// the draw routine itself mainly for performance, to avoid unnecessary setup
> -// every triangle
> -// @todo support deferred clear
> -INLINE
> -void InitializeHotTiles(SWR_CONTEXT* pContext, DRAW_CONTEXT* pDC, uint32_t macroID, const TRIANGLE_WORK_DESC* pWork)
> -{
> - const API_STATE& state = GetApiState(pDC);
> - HotTileMgr *pHotTileMgr = pContext->pHotTileMgr;
> -
> - uint32_t x, y;
> - MacroTileMgr::getTileIndices(macroID, x, y);
> - x *= KNOB_MACROTILE_X_DIM;
> - y *= KNOB_MACROTILE_Y_DIM;
> -
> - uint32_t numSamples = GetNumSamples(state.rastState.sampleCount);
> -
> - // check RT if enabled
> - unsigned long rtSlot = 0;
> - uint32_t colorHottileEnableMask = state.colorHottileEnable;
> - while(_BitScanForward(&rtSlot, colorHottileEnableMask))
> - {
> - HOTTILE* pHotTile = pHotTileMgr->GetHotTile(pContext, pDC, macroID, (SWR_RENDERTARGET_ATTACHMENT)(SWR_ATTACHMENT_COLOR0 + rtSlot), true, numSamples);
> -
> - if (pHotTile->state == HOTTILE_INVALID)
> - {
> - RDTSC_START(BELoadTiles);
> - // invalid hottile before draw requires a load from surface before we can draw to it
> - pContext->pfnLoadTile(GetPrivateState(pDC), KNOB_COLOR_HOT_TILE_FORMAT, (SWR_RENDERTARGET_ATTACHMENT)(SWR_ATTACHMENT_COLOR0 + rtSlot), x, y, pHotTile->renderTargetArrayIndex, pHotTile->pBuffer);
> - pHotTile->state = HOTTILE_DIRTY;
> - RDTSC_STOP(BELoadTiles, 0, 0);
> - }
> - else if (pHotTile->state == HOTTILE_CLEAR)
> - {
> - RDTSC_START(BELoadTiles);
> - // Clear the tile.
> - ClearColorHotTile(pHotTile);
> - pHotTile->state = HOTTILE_DIRTY;
> - RDTSC_STOP(BELoadTiles, 0, 0);
> - }
> - colorHottileEnableMask &= ~(1 << rtSlot);
> - }
> + _ReadWriteBarrier();
>
> - // check depth if enabled
> - if (state.depthHottileEnable)
> - {
> - HOTTILE* pHotTile = pHotTileMgr->GetHotTile(pContext, pDC, macroID, SWR_ATTACHMENT_DEPTH, true, numSamples);
> - if (pHotTile->state == HOTTILE_INVALID)
> - {
> - RDTSC_START(BELoadTiles);
> - // invalid hottile before draw requires a load from surface before we can draw to it
> - pContext->pfnLoadTile(GetPrivateState(pDC), KNOB_DEPTH_HOT_TILE_FORMAT, SWR_ATTACHMENT_DEPTH, x, y, pHotTile->renderTargetArrayIndex, pHotTile->pBuffer);
> - pHotTile->state = HOTTILE_DIRTY;
> - RDTSC_STOP(BELoadTiles, 0, 0);
> - }
> - else if (pHotTile->state == HOTTILE_CLEAR)
> - {
> - RDTSC_START(BELoadTiles);
> - // Clear the tile.
> - ClearDepthHotTile(pHotTile);
> - pHotTile->state = HOTTILE_DIRTY;
> - RDTSC_STOP(BELoadTiles, 0, 0);
> - }
> - }
> + // Cleanup memory allocations
> + pDC->pArena->Reset(true);
> + pDC->pTileMgr->initialize();
>
> - // check stencil if enabled
> - if (state.stencilHottileEnable)
> - {
> - HOTTILE* pHotTile = pHotTileMgr->GetHotTile(pContext, pDC, macroID, SWR_ATTACHMENT_STENCIL, true, numSamples);
> - if (pHotTile->state == HOTTILE_INVALID)
> - {
> - RDTSC_START(BELoadTiles);
> - // invalid hottile before draw requires a load from surface before we can draw to it
> - pContext->pfnLoadTile(GetPrivateState(pDC), KNOB_STENCIL_HOT_TILE_FORMAT, SWR_ATTACHMENT_STENCIL, x, y, pHotTile->renderTargetArrayIndex, pHotTile->pBuffer);
> - pHotTile->state = HOTTILE_DIRTY;
> - RDTSC_STOP(BELoadTiles, 0, 0);
> - }
> - else if (pHotTile->state == HOTTILE_CLEAR)
> - {
> - RDTSC_START(BELoadTiles);
> - // Clear the tile.
> - ClearStencilHotTile(pHotTile);
> - pHotTile->state = HOTTILE_DIRTY;
> - RDTSC_STOP(BELoadTiles, 0, 0);
> - }
> + pContext->dcRing.Dequeue(); // Remove from tail
> }
> }
>
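
Nice simplification: the per-DC bookkeeping collapses into a single
threadsDone count and the last worker through (FE or BE) recycles the
draw context and releases its ring slot. A hedged miniature of the
hand-off, with std::atomic standing in for InterlockedDecrement64 and
the reset steps left as comments:

  #include <atomic>
  #include <cstdint>

  struct DrawContextSketch
  {
      std::atomic<int64_t> threadsDone{0};  // re-seeded when the DC is handed out

      void Complete()                       // one call per finishing worker
      {
          if (threadsDone.fetch_sub(1) == 1)   // fetch_sub returns the old
          {                                    // value, so old == 1 means we
              // pArena->Reset(true);          // just hit zero: reset the
              // pTileMgr->initialize();       // per-draw allocations and
              // dcRing.Dequeue();             // free the ring slot
          }
      }
  };
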
> @@ -466,7 +314,7 @@ INLINE bool FindFirstIncompleteDraw(SWR_CONTEXT* pContext, uint64_t& curDrawBE)
> if (isWorkComplete)
> {
> curDrawBE++;
> - InterlockedIncrement(&pDC->threadsDoneBE);
> + CompleteDrawContext(pContext, pDC);
> }
> else
> {
> @@ -496,7 +344,7 @@ void WorkOnFifoBE(
> SWR_CONTEXT *pContext,
> uint32_t workerId,
> uint64_t &curDrawBE,
> - std::unordered_set<uint32_t>& lockedTiles)
> + TileSet& lockedTiles)
> {
> // Find the first incomplete draw that has pending work. If no such draw is found then
> // return. FindFirstIncompleteDraw is responsible for incrementing the curDrawBE.
> @@ -558,7 +406,7 @@ void WorkOnFifoBE(
> SWR_ASSERT(pWork);
> if (pWork->type == DRAW)
> {
> - InitializeHotTiles(pContext, pDC, tileID, (const TRIANGLE_WORK_DESC*)&pWork->desc);
> + pContext->pHotTileMgr->InitializeHotTiles(pContext, pDC, tileID);
> }
> }
>
> @@ -579,7 +427,7 @@ void WorkOnFifoBE(
> {
> // We can increment the current BE and safely move to next draw since we know this draw is complete.
> curDrawBE++;
> - InterlockedIncrement(&pDC->threadsDoneBE);
> + CompleteDrawContext(pContext, pDC);
>
> lastRetiredDraw++;
>
> @@ -598,7 +446,7 @@ void WorkOnFifoBE(
> }
> }
>
> -void WorkOnFifoFE(SWR_CONTEXT *pContext, uint32_t workerId, uint64_t &curDrawFE, UCHAR numaNode)
> +void WorkOnFifoFE(SWR_CONTEXT *pContext, uint32_t workerId, uint64_t &curDrawFE, int numaNode)
> {
> // Try to grab the next DC from the ring
> uint64_t drawEnqueued = GetEnqueuedDraw(pContext);
> @@ -608,8 +456,8 @@ void WorkOnFifoFE(SWR_CONTEXT *pContext, uint32_t workerId, uint64_t &curDrawFE,
> DRAW_CONTEXT *pDC = &pContext->dcRing[dcSlot];
> if (pDC->isCompute || pDC->doneFE || pDC->FeLock)
> {
> + CompleteDrawContext(pContext, pDC);
> curDrawFE++;
> - InterlockedIncrement(&pDC->threadsDoneFE);
> }
> else
> {
> @@ -673,22 +521,12 @@ void WorkOnCompute(
> // Is there any work remaining?
> if (queue.getNumQueued() > 0)
> {
> - bool lastToComplete = false;
> -
> uint32_t threadGroupId = 0;
> while (queue.getWork(threadGroupId))
> {
> ProcessComputeBE(pDC, workerId, threadGroupId);
>
> - lastToComplete = queue.finishedWork();
> - }
> -
> - _ReadWriteBarrier();
> -
> - if (lastToComplete)
> - {
> - SWR_ASSERT(queue.isWorkComplete() == true);
> - pDC->doneCompute = true;
> + queue.finishedWork();
> }
> }
> }
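
The removal of the lastToComplete/doneCompute dance reads as
finishedWork() now tracking retirement on its own. For reference, a
hedged model of the DispatchQueue contract as I understand it (the
internals are my guesses, not the actual class):

  #include <atomic>
  #include <cstdint>

  struct DispatchQueueSketch
  {
      uint32_t totalGroups = 0;             // set when work is dispatched
      std::atomic<uint32_t> next{0};        // next thread group to hand out
      std::atomic<uint32_t> retired{0};     // groups fully processed

      bool getWork(uint32_t& groupId)
      {
          groupId = next.fetch_add(1);
          return groupId < totalGroups;     // false once exhausted
      }
      void finishedWork()         { retired.fetch_add(1); }
      bool isWorkComplete() const { return retired.load() >= totalGroups; }
  };
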
> @@ -711,7 +549,7 @@ DWORD workerThreadMain(LPVOID pData)
>
> // Track tiles locked by other threads. If we try to lock a macrotile and find its already
> // locked then we'll add it to this list so that we don't try and lock it again.
> - std::unordered_set<uint32_t> lockedTiles;
> + TileSet lockedTiles;
>
> // each worker has the ability to work on any of the queued draws as long as certain
> // conditions are met. the data associated
> @@ -732,10 +570,10 @@ DWORD workerThreadMain(LPVOID pData)
> // the worker can safely increment its oldestDraw counter and move on to the next draw.
> std::unique_lock<std::mutex> lock(pContext->WaitLock, std::defer_lock);
>
> - auto threadHasWork = [&](uint64_t curDraw) { return curDraw != pContext->DrawEnqueued; };
> + auto threadHasWork = [&](uint64_t curDraw) { return curDraw != pContext->dcRing.GetHead(); };
>
> - uint64_t curDrawBE = 1;
> - uint64_t curDrawFE = 1;
> + uint64_t curDrawBE = 0;
> + uint64_t curDrawFE = 0;
>
> while (pContext->threadPool.inThreadShutdown == false)
> {
> @@ -853,9 +691,12 @@ void CreateThreadPool(SWR_CONTEXT *pContext, THREAD_POOL *pPool)
> numThreads, KNOB_MAX_NUM_THREADS);
> }
>
> + uint32_t numAPIReservedThreads = 1;
> +
> +
> if (numThreads == 1)
> {
> - // If only 1 worker thread, try to move it to an available
> +        // If only 1 worker thread, try to move it to an available
> // HW thread. If that fails, use the API thread.
> if (numCoresPerNode < numHWCoresPerNode)
> {
> @@ -878,8 +719,15 @@ void CreateThreadPool(SWR_CONTEXT *pContext, THREAD_POOL *pPool)
> }
> else
> {
> - // Save a HW thread for the API thread.
> - numThreads--;
> + // Save HW threads for the API if we can
> + if (numThreads > numAPIReservedThreads)
> + {
> + numThreads -= numAPIReservedThreads;
> + }
> + else
> + {
> + numAPIReservedThreads = 0;
> + }
> }
>
> pPool->numThreads = numThreads;
> @@ -918,9 +766,9 @@ void CreateThreadPool(SWR_CONTEXT *pContext, THREAD_POOL *pPool)
> auto& core = node.cores[c];
> for (uint32_t t = 0; t < numHyperThreads; ++t)
> {
> - if (c == 0 && n == 0 && t == 0)
> + if (numAPIReservedThreads)
> {
> - // Skip core 0, thread0 on node 0 to reserve for API thread
> + --numAPIReservedThreads;
> continue;
> }
>
> diff --git a/src/gallium/drivers/swr/rasterizer/core/threads.h b/src/gallium/drivers/swr/rasterizer/core/threads.h
> index 0fa7196..6b37e3a 100644
> --- a/src/gallium/drivers/swr/rasterizer/core/threads.h
> +++ b/src/gallium/drivers/swr/rasterizer/core/threads.h
> @@ -54,10 +54,12 @@ struct THREAD_POOL
> THREAD_DATA *pThreadData;
> };
>
> +typedef std::unordered_set<uint32_t> TileSet;
> +
> void CreateThreadPool(SWR_CONTEXT *pContext, THREAD_POOL *pPool);
> void DestroyThreadPool(SWR_CONTEXT *pContext, THREAD_POOL *pPool);
>
> // Expose FE and BE worker functions to the API thread if single threaded
> -void WorkOnFifoFE(SWR_CONTEXT *pContext, uint32_t workerId, uint64_t &curDrawFE, UCHAR numaNode);
> -void WorkOnFifoBE(SWR_CONTEXT *pContext, uint32_t workerId, uint64_t &curDrawBE, std::unordered_set<uint32_t> &usedTiles);
> +void WorkOnFifoFE(SWR_CONTEXT *pContext, uint32_t workerId, uint64_t &curDrawFE, int numaNode);
> +void WorkOnFifoBE(SWR_CONTEXT *pContext, uint32_t workerId, uint64_t &curDrawBE, TileSet &usedTiles);
> void WorkOnCompute(SWR_CONTEXT *pContext, uint32_t workerId, uint64_t &curDrawBE);
> diff --git a/src/gallium/drivers/swr/rasterizer/core/tilemgr.cpp b/src/gallium/drivers/swr/rasterizer/core/tilemgr.cpp
> index 8603936..89c779e 100644
> --- a/src/gallium/drivers/swr/rasterizer/core/tilemgr.cpp
> +++ b/src/gallium/drivers/swr/rasterizer/core/tilemgr.cpp
> @@ -29,7 +29,9 @@
> #include <unordered_map>
>
> #include "fifo.hpp"
> -#include "tilemgr.h"
> +#include "core/tilemgr.h"
> +#include "core/multisample.h"
> +#include "rdtsc_core.h"
>
> #define TILE_ID(x,y) ((x << 16 | y))
>
> @@ -54,24 +56,21 @@ void DispatchQueue::operator delete(void *p)
> _aligned_free(p);
> }
>
> -MacroTileMgr::MacroTileMgr(Arena& arena) : mArena(arena)
> +MacroTileMgr::MacroTileMgr(CachingArena& arena) : mArena(arena)
> {
> }
>
> -void MacroTileMgr::initialize()
> -{
> - mWorkItemsProduced = 0;
> - mWorkItemsConsumed = 0;
> -
> - mDirtyTiles.clear();
> -}
> -
> void MacroTileMgr::enqueue(uint32_t x, uint32_t y, BE_WORK *pWork)
> {
>     // Should not enqueue more than what we have backing for in the hot tile manager.
> SWR_ASSERT(x < KNOB_NUM_HOT_TILES_X);
> SWR_ASSERT(y < KNOB_NUM_HOT_TILES_Y);
>
> + if ((x & ~(KNOB_NUM_HOT_TILES_X-1)) | (y & ~(KNOB_NUM_HOT_TILES_Y-1)))
> + {
> + return;
> + }
> +
> uint32_t id = TILE_ID(x, y);
>
> MacroTileQueue &tile = mTiles[id];
> @@ -103,3 +102,282 @@ void MacroTileMgr::markTileComplete(uint32_t id)
> tile.mWorkItemsFE = 0;
> tile.mWorkItemsBE = 0;
> }
> +
> +HOTTILE* HotTileMgr::GetHotTile(SWR_CONTEXT* pContext, DRAW_CONTEXT* pDC, uint32_t macroID, SWR_RENDERTARGET_ATTACHMENT attachment, bool create, uint32_t numSamples,
> + uint32_t renderTargetArrayIndex)
> +{
> + uint32_t x, y;
> + MacroTileMgr::getTileIndices(macroID, x, y);
> +
> + SWR_ASSERT(x < KNOB_NUM_HOT_TILES_X);
> + SWR_ASSERT(y < KNOB_NUM_HOT_TILES_Y);
> +
> + HotTileSet &tile = mHotTiles[x][y];
> + HOTTILE& hotTile = tile.Attachment[attachment];
> + if (hotTile.pBuffer == NULL)
> + {
> + if (create)
> + {
> + uint32_t size = numSamples * mHotTileSize[attachment];
> + hotTile.pBuffer = (uint8_t*)_aligned_malloc(size, KNOB_SIMD_WIDTH * 4);
> + hotTile.state = HOTTILE_INVALID;
> + hotTile.numSamples = numSamples;
> + hotTile.renderTargetArrayIndex = renderTargetArrayIndex;
> + }
> + else
> + {
> + return NULL;
> + }
> + }
> + else
> + {
> + // free the old tile and create a new one with enough space to hold all samples
> + if (numSamples > hotTile.numSamples)
> + {
> + // tile should be either uninitialized or resolved if we're deleting and switching to a
> + // new sample count
> + SWR_ASSERT((hotTile.state == HOTTILE_INVALID) ||
> + (hotTile.state == HOTTILE_RESOLVED) ||
> + (hotTile.state == HOTTILE_CLEAR));
> + _aligned_free(hotTile.pBuffer);
> +
> + uint32_t size = numSamples * mHotTileSize[attachment];
> + hotTile.pBuffer = (uint8_t*)_aligned_malloc(size, KNOB_SIMD_WIDTH * 4);
> + hotTile.state = HOTTILE_INVALID;
> + hotTile.numSamples = numSamples;
> + }
> +
> + // if requested render target array index isn't currently loaded, need to store out the current hottile
> + // and load the requested array slice
> + if (renderTargetArrayIndex != hotTile.renderTargetArrayIndex)
> + {
> + SWR_FORMAT format;
> + switch (attachment)
> + {
> + case SWR_ATTACHMENT_COLOR0:
> + case SWR_ATTACHMENT_COLOR1:
> + case SWR_ATTACHMENT_COLOR2:
> + case SWR_ATTACHMENT_COLOR3:
> + case SWR_ATTACHMENT_COLOR4:
> + case SWR_ATTACHMENT_COLOR5:
> + case SWR_ATTACHMENT_COLOR6:
> + case SWR_ATTACHMENT_COLOR7: format = KNOB_COLOR_HOT_TILE_FORMAT; break;
> + case SWR_ATTACHMENT_DEPTH: format = KNOB_DEPTH_HOT_TILE_FORMAT; break;
> + case SWR_ATTACHMENT_STENCIL: format = KNOB_STENCIL_HOT_TILE_FORMAT; break;
> + default: SWR_ASSERT(false, "Unknown attachment: %d", attachment); format = KNOB_COLOR_HOT_TILE_FORMAT; break;
> + }
> +
> + if (hotTile.state == HOTTILE_DIRTY)
> + {
> + pContext->pfnStoreTile(GetPrivateState(pDC), format, attachment,
> + x * KNOB_MACROTILE_X_DIM, y * KNOB_MACROTILE_Y_DIM, hotTile.renderTargetArrayIndex, hotTile.pBuffer);
> + }
> +
> + pContext->pfnLoadTile(GetPrivateState(pDC), format, attachment,
> + x * KNOB_MACROTILE_X_DIM, y * KNOB_MACROTILE_Y_DIM, renderTargetArrayIndex, hotTile.pBuffer);
> +
> + hotTile.renderTargetArrayIndex = renderTargetArrayIndex;
> + hotTile.state = HOTTILE_DIRTY;
> + }
> + }
> + return &tile.Attachment[attachment];
> +}
> +
> +HOTTILE* HotTileMgr::GetHotTileNoLoad(
> + SWR_CONTEXT* pContext, DRAW_CONTEXT* pDC, uint32_t macroID,
> + SWR_RENDERTARGET_ATTACHMENT attachment, bool create, uint32_t numSamples)
> +{
> + uint32_t x, y;
> + MacroTileMgr::getTileIndices(macroID, x, y);
> +
> + SWR_ASSERT(x < KNOB_NUM_HOT_TILES_X);
> + SWR_ASSERT(y < KNOB_NUM_HOT_TILES_Y);
> +
> + HotTileSet &tile = mHotTiles[x][y];
> + HOTTILE& hotTile = tile.Attachment[attachment];
> + if (hotTile.pBuffer == NULL)
> + {
> + if (create)
> + {
> + uint32_t size = numSamples * mHotTileSize[attachment];
> + hotTile.pBuffer = (uint8_t*)_aligned_malloc(size, KNOB_SIMD_WIDTH * 4);
> + hotTile.state = HOTTILE_INVALID;
> + hotTile.numSamples = numSamples;
> + hotTile.renderTargetArrayIndex = 0;
> + }
> + else
> + {
> + return NULL;
> + }
> + }
> +
> + return &hotTile;
> +}
> +
> +void HotTileMgr::ClearColorHotTile(const HOTTILE* pHotTile) // clear a macro tile from float4 clear data.
> +{
> + // Load clear color into SIMD register...
> + float *pClearData = (float*)(pHotTile->clearData);
> + simdscalar valR = _simd_broadcast_ss(&pClearData[0]);
> + simdscalar valG = _simd_broadcast_ss(&pClearData[1]);
> + simdscalar valB = _simd_broadcast_ss(&pClearData[2]);
> + simdscalar valA = _simd_broadcast_ss(&pClearData[3]);
> +
> + float *pfBuf = (float*)pHotTile->pBuffer;
> + uint32_t numSamples = pHotTile->numSamples;
> +
> + for (uint32_t row = 0; row < KNOB_MACROTILE_Y_DIM; row += KNOB_TILE_Y_DIM)
> + {
> + for (uint32_t col = 0; col < KNOB_MACROTILE_X_DIM; col += KNOB_TILE_X_DIM)
> + {
> +            for (uint32_t si = 0; si < (KNOB_TILE_X_DIM * KNOB_TILE_Y_DIM * numSamples); si += SIMD_TILE_X_DIM * SIMD_TILE_Y_DIM)
> + {
> + _simd_store_ps(pfBuf, valR);
> + pfBuf += KNOB_SIMD_WIDTH;
> + _simd_store_ps(pfBuf, valG);
> + pfBuf += KNOB_SIMD_WIDTH;
> + _simd_store_ps(pfBuf, valB);
> + pfBuf += KNOB_SIMD_WIDTH;
> + _simd_store_ps(pfBuf, valA);
> + pfBuf += KNOB_SIMD_WIDTH;
> + }
> + }
> + }
> +}
> +
> +void HotTileMgr::ClearDepthHotTile(const HOTTILE* pHotTile) // clear a macro tile from float4 clear data.
> +{
> +    // Load clear value into SIMD register...
> + float *pClearData = (float*)(pHotTile->clearData);
> + simdscalar valZ = _simd_broadcast_ss(&pClearData[0]);
> +
> + float *pfBuf = (float*)pHotTile->pBuffer;
> + uint32_t numSamples = pHotTile->numSamples;
> +
> + for (uint32_t row = 0; row < KNOB_MACROTILE_Y_DIM; row += KNOB_TILE_Y_DIM)
> + {
> + for (uint32_t col = 0; col < KNOB_MACROTILE_X_DIM; col += KNOB_TILE_X_DIM)
> + {
> + for (uint32_t si = 0; si < (KNOB_TILE_X_DIM * KNOB_TILE_Y_DIM * numSamples); si += SIMD_TILE_X_DIM * SIMD_TILE_Y_DIM)
> + {
> + _simd_store_ps(pfBuf, valZ);
> + pfBuf += KNOB_SIMD_WIDTH;
> + }
> + }
> + }
> +}
> +
> +void HotTileMgr::ClearStencilHotTile(const HOTTILE* pHotTile)
> +{
> + // convert from F32 to U8.
> + uint8_t clearVal = (uint8_t)(pHotTile->clearData[0]);
> + //broadcast 32x into __m256i...
> + simdscalari valS = _simd_set1_epi8(clearVal);
> +
> + simdscalari* pBuf = (simdscalari*)pHotTile->pBuffer;
> + uint32_t numSamples = pHotTile->numSamples;
> +
> + for (uint32_t row = 0; row < KNOB_MACROTILE_Y_DIM; row += KNOB_TILE_Y_DIM)
> + {
> + for (uint32_t col = 0; col < KNOB_MACROTILE_X_DIM; col += KNOB_TILE_X_DIM)
> + {
> + // We're putting 4 pixels in each of the 32-bit slots, so increment 4 times as quickly.
> + for (uint32_t si = 0; si < (KNOB_TILE_X_DIM * KNOB_TILE_Y_DIM * numSamples); si += SIMD_TILE_X_DIM * SIMD_TILE_Y_DIM * 4)
> + {
> + _simd_store_si(pBuf, valS);
> + pBuf += 1;
> + }
> + }
> + }
> +}
> +
> +//////////////////////////////////////////////////////////////////////////
> +/// @brief InitializeHotTiles
> +/// for draw calls, we initialize the active hot tiles and perform deferred
> +/// load on them if tile is in invalid state. we do this in the outer thread
> +/// loop instead of inside the draw routine itself mainly for performance,
> +/// to avoid unnecessary setup every triangle
> +/// @todo support deferred clear
> +/// @param macroID - id of the macrotile whose hot tiles to initialize.
> +void HotTileMgr::InitializeHotTiles(SWR_CONTEXT* pContext, DRAW_CONTEXT* pDC, uint32_t macroID)
> +{
> + const API_STATE& state = GetApiState(pDC);
> + HotTileMgr *pHotTileMgr = pContext->pHotTileMgr;
> +
> + uint32_t x, y;
> + MacroTileMgr::getTileIndices(macroID, x, y);
> + x *= KNOB_MACROTILE_X_DIM;
> + y *= KNOB_MACROTILE_Y_DIM;
> +
> + uint32_t numSamples = GetNumSamples(state.rastState.sampleCount);
> +
> + // check RT if enabled
> + unsigned long rtSlot = 0;
> + uint32_t colorHottileEnableMask = state.colorHottileEnable;
> + while (_BitScanForward(&rtSlot, colorHottileEnableMask))
> + {
> + HOTTILE* pHotTile = GetHotTile(pContext, pDC, macroID, (SWR_RENDERTARGET_ATTACHMENT)(SWR_ATTACHMENT_COLOR0 + rtSlot), true, numSamples);
> +
> + if (pHotTile->state == HOTTILE_INVALID)
> + {
> + RDTSC_START(BELoadTiles);
> + // invalid hottile before draw requires a load from surface before we can draw to it
> + pContext->pfnLoadTile(GetPrivateState(pDC), KNOB_COLOR_HOT_TILE_FORMAT, (SWR_RENDERTARGET_ATTACHMENT)(SWR_ATTACHMENT_COLOR0 + rtSlot), x, y, pHotTile->renderTargetArrayIndex, pHotTile->pBuffer);
> + pHotTile->state = HOTTILE_DIRTY;
> + RDTSC_STOP(BELoadTiles, 0, 0);
> + }
> + else if (pHotTile->state == HOTTILE_CLEAR)
> + {
> + RDTSC_START(BELoadTiles);
> + // Clear the tile.
> + ClearColorHotTile(pHotTile);
> + pHotTile->state = HOTTILE_DIRTY;
> + RDTSC_STOP(BELoadTiles, 0, 0);
> + }
> + colorHottileEnableMask &= ~(1 << rtSlot);
> + }
> +
> + // check depth if enabled
> + if (state.depthHottileEnable)
> + {
> + HOTTILE* pHotTile = GetHotTile(pContext, pDC, macroID, SWR_ATTACHMENT_DEPTH, true, numSamples);
> + if (pHotTile->state == HOTTILE_INVALID)
> + {
> + RDTSC_START(BELoadTiles);
> + // invalid hottile before draw requires a load from surface before we can draw to it
> + pContext->pfnLoadTile(GetPrivateState(pDC), KNOB_DEPTH_HOT_TILE_FORMAT, SWR_ATTACHMENT_DEPTH, x, y, pHotTile->renderTargetArrayIndex, pHotTile->pBuffer);
> + pHotTile->state = HOTTILE_DIRTY;
> + RDTSC_STOP(BELoadTiles, 0, 0);
> + }
> + else if (pHotTile->state == HOTTILE_CLEAR)
> + {
> + RDTSC_START(BELoadTiles);
> + // Clear the tile.
> + ClearDepthHotTile(pHotTile);
> + pHotTile->state = HOTTILE_DIRTY;
> + RDTSC_STOP(BELoadTiles, 0, 0);
> + }
> + }
> +
> + // check stencil if enabled
> + if (state.stencilHottileEnable)
> + {
> + HOTTILE* pHotTile = GetHotTile(pContext, pDC, macroID, SWR_ATTACHMENT_STENCIL, true, numSamples);
> + if (pHotTile->state == HOTTILE_INVALID)
> + {
> + RDTSC_START(BELoadTiles);
> + // invalid hottile before draw requires a load from surface before we can draw to it
> + pContext->pfnLoadTile(GetPrivateState(pDC), KNOB_STENCIL_HOT_TILE_FORMAT, SWR_ATTACHMENT_STENCIL, x, y, pHotTile->renderTargetArrayIndex, pHotTile->pBuffer);
> + pHotTile->state = HOTTILE_DIRTY;
> + RDTSC_STOP(BELoadTiles, 0, 0);
> + }
> + else if (pHotTile->state == HOTTILE_CLEAR)
> + {
> + RDTSC_START(BELoadTiles);
> + // Clear the tile.
> + ClearStencilHotTile(pHotTile);
> + pHotTile->state = HOTTILE_DIRTY;
> + RDTSC_STOP(BELoadTiles, 0, 0);
> + }
> + }
> +}
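
With everything gathered into HotTileMgr, the tile lifecycle is easier
to audit. The state machine InitializeHotTiles drives is small; a
self-contained summary (enum values mirror HOTTILE_STATE, the function
is my illustration, not SWR API):

  enum HotTileState { INVALID, RESOLVED, CLEAR, DIRTY };

  HotTileState PrepareForDraw(HotTileState s)
  {
      // INVALID: contents undefined -> pfnLoadTile from the surface first.
      // CLEAR:   a deferred clear is pending -> splat clearData into the tile.
      // Either way rendering follows, so the tile ends up DIRTY.
      if (s == INVALID) return DIRTY;   // after the load
      if (s == CLEAR)   return DIRTY;   // after Clear{Color,Depth,Stencil}HotTile
      return s;                         // RESOLVED or already DIRTY: no work
  }
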
> diff --git a/src/gallium/drivers/swr/rasterizer/core/tilemgr.h b/src/gallium/drivers/swr/rasterizer/core/tilemgr.h
> index 9137941..cf9d2fe 100644
> --- a/src/gallium/drivers/swr/rasterizer/core/tilemgr.h
> +++ b/src/gallium/drivers/swr/rasterizer/core/tilemgr.h
> @@ -59,7 +59,8 @@ struct MacroTileQueue
>
> //////////////////////////////////////////////////////////////////////////
> /// @brief Clear fifo and unlock it.
> - void clear(Arena& arena)
> + template <typename ArenaT>
> + void clear(ArenaT& arena)
> {
> mFifo.clear(arena);
> }
> @@ -71,7 +72,8 @@ struct MacroTileQueue
> return mFifo.peek();
> }
>
> - bool enqueue_try_nosync(Arena& arena, const BE_WORK* entry)
> + template <typename ArenaT>
> + bool enqueue_try_nosync(ArenaT& arena, const BE_WORK* entry)
> {
> return mFifo.enqueue_try_nosync(arena, entry);
> }
> @@ -104,7 +106,7 @@ private:
> class MacroTileMgr
> {
> public:
> - MacroTileMgr(Arena& arena);
> + MacroTileMgr(CachingArena& arena);
> ~MacroTileMgr()
> {
> for (auto &tile : mTiles)
> @@ -113,7 +115,14 @@ public:
> }
> }
>
> - void initialize();
> + INLINE void initialize()
> + {
> + mWorkItemsProduced = 0;
> + mWorkItemsConsumed = 0;
> +
> + mDirtyTiles.clear();
> + }
> +
> INLINE std::vector<uint32_t>& getDirtyTiles() { return mDirtyTiles; }
> INLINE MacroTileQueue& getMacroTileQueue(uint32_t id) { return mTiles[id]; }
> void markTileComplete(uint32_t id);
> @@ -135,15 +144,14 @@ public:
> void operator delete (void *p);
>
> private:
> - Arena& mArena;
> - SWR_FORMAT mFormat;
> + CachingArena& mArena;
> std::unordered_map<uint32_t, MacroTileQueue> mTiles;
>
> // Any tile that has work queued to it is a dirty tile.
> std::vector<uint32_t> mDirtyTiles;
>
> - OSALIGNLINE(LONG) mWorkItemsProduced;
> - OSALIGNLINE(volatile LONG) mWorkItemsConsumed;
> + OSALIGNLINE(LONG) mWorkItemsProduced { 0 };
> + OSALIGNLINE(volatile LONG) mWorkItemsConsumed { 0 };
> };
>
> //////////////////////////////////////////////////////////////////////////
> @@ -224,7 +232,7 @@ public:
> void *operator new(size_t size);
> void operator delete (void *p);
>
> - void* mpTaskData; // The API thread will set this up and the callback task function will interpet this.
> +    void* mpTaskData{ nullptr }; // The API thread will set this up and the callback task function will interpret this.
>
> OSALIGNLINE(volatile LONG) mTasksAvailable{ 0 };
> OSALIGNLINE(volatile LONG) mTasksOutstanding{ 0 };
> @@ -241,7 +249,7 @@ enum HOTTILE_STATE
>
> struct HOTTILE
> {
> - BYTE *pBuffer;
> + uint8_t *pBuffer;
> HOTTILE_STATE state;
> DWORD clearData[4]; // May need to change based on pfnClearTile implementation. Reorder for alignment?
> uint32_t numSamples;
> @@ -293,95 +301,16 @@ public:
> }
> }
>
> - HOTTILE *GetHotTile(SWR_CONTEXT* pContext, DRAW_CONTEXT* pDC, uint32_t macroID, SWR_RENDERTARGET_ATTACHMENT attachment, bool create, uint32_t numSamples = 1,
> - uint32_t renderTargetArrayIndex = 0)
> - {
> - uint32_t x, y;
> - MacroTileMgr::getTileIndices(macroID, x, y);
> + void InitializeHotTiles(SWR_CONTEXT* pContext, DRAW_CONTEXT* pDC, uint32_t macroID);
>
> - assert(x < KNOB_NUM_HOT_TILES_X);
> - assert(y < KNOB_NUM_HOT_TILES_Y);
> + HOTTILE *GetHotTile(SWR_CONTEXT* pContext, DRAW_CONTEXT* pDC, uint32_t macroID, SWR_RENDERTARGET_ATTACHMENT attachment, bool create, uint32_t numSamples = 1,
> + uint32_t renderTargetArrayIndex = 0);
>
> - HotTileSet &tile = mHotTiles[x][y];
> - HOTTILE& hotTile = tile.Attachment[attachment];
> - if (hotTile.pBuffer == NULL)
> - {
> - if (create)
> - {
> - uint32_t size = numSamples * mHotTileSize[attachment];
> - hotTile.pBuffer = (BYTE*)_aligned_malloc(size, KNOB_SIMD_WIDTH * 4);
> - hotTile.state = HOTTILE_INVALID;
> - hotTile.numSamples = numSamples;
> - hotTile.renderTargetArrayIndex = renderTargetArrayIndex;
> - }
> - else
> - {
> - return NULL;
> - }
> - }
> - else
> - {
> - // free the old tile and create a new one with enough space to hold all samples
> - if (numSamples > hotTile.numSamples)
> - {
> - // tile should be either uninitialized or resolved if we're deleting and switching to a
> - // new sample count
> - assert((hotTile.state == HOTTILE_INVALID) ||
> - (hotTile.state == HOTTILE_RESOLVED) ||
> - (hotTile.state == HOTTILE_CLEAR));
> - _aligned_free(hotTile.pBuffer);
> -
> - uint32_t size = numSamples * mHotTileSize[attachment];
> - hotTile.pBuffer = (BYTE*)_aligned_malloc(size, KNOB_SIMD_WIDTH * 4);
> - hotTile.state = HOTTILE_INVALID;
> - hotTile.numSamples = numSamples;
> - }
> + HOTTILE *GetHotTileNoLoad(SWR_CONTEXT* pContext, DRAW_CONTEXT* pDC, uint32_t macroID, SWR_RENDERTARGET_ATTACHMENT attachment, bool create, uint32_t numSamples = 1);
>
> - // if requested render target array index isn't currently loaded, need to store out the current hottile
> - // and load the requested array slice
> - if (renderTargetArrayIndex != hotTile.renderTargetArrayIndex)
> - {
> - SWR_FORMAT format;
> - switch (attachment)
> - {
> - case SWR_ATTACHMENT_COLOR0:
> - case SWR_ATTACHMENT_COLOR1:
> - case SWR_ATTACHMENT_COLOR2:
> - case SWR_ATTACHMENT_COLOR3:
> - case SWR_ATTACHMENT_COLOR4:
> - case SWR_ATTACHMENT_COLOR5:
> - case SWR_ATTACHMENT_COLOR6:
> - case SWR_ATTACHMENT_COLOR7: format = KNOB_COLOR_HOT_TILE_FORMAT; break;
> - case SWR_ATTACHMENT_DEPTH: format = KNOB_DEPTH_HOT_TILE_FORMAT; break;
> - case SWR_ATTACHMENT_STENCIL: format = KNOB_STENCIL_HOT_TILE_FORMAT; break;
> - default: SWR_ASSERT(false, "Unknown attachment: %d", attachment); format = KNOB_COLOR_HOT_TILE_FORMAT; break;
> - }
> -
> - if (hotTile.state == HOTTILE_DIRTY)
> - {
> - pContext->pfnStoreTile(GetPrivateState(pDC), format, attachment,
> - x * KNOB_MACROTILE_X_DIM, y * KNOB_MACROTILE_Y_DIM, hotTile.renderTargetArrayIndex, hotTile.pBuffer);
> - }
> -
> - pContext->pfnLoadTile(GetPrivateState(pDC), format, attachment,
> - x * KNOB_MACROTILE_X_DIM, y * KNOB_MACROTILE_Y_DIM, renderTargetArrayIndex, hotTile.pBuffer);
> -
> - hotTile.renderTargetArrayIndex = renderTargetArrayIndex;
> - hotTile.state = HOTTILE_DIRTY;
> - }
> - }
> - return &tile.Attachment[attachment];
> - }
> -
> - HotTileSet &GetHotTile(uint32_t macroID)
> - {
> - uint32_t x, y;
> - MacroTileMgr::getTileIndices(macroID, x, y);
> - assert(x < KNOB_NUM_HOT_TILES_X);
> - assert(y < KNOB_NUM_HOT_TILES_Y);
> -
> - return mHotTiles[x][y];
> - }
> + static void ClearColorHotTile(const HOTTILE* pHotTile);
> + static void ClearDepthHotTile(const HOTTILE* pHotTile);
> + static void ClearStencilHotTile(const HOTTILE* pHotTile);
>
> private:
> HotTileSet mHotTiles[KNOB_NUM_HOT_TILES_X][KNOB_NUM_HOT_TILES_Y];
> diff --git a/src/gallium/drivers/swr/rasterizer/core/utils.cpp b/src/gallium/drivers/swr/rasterizer/core/utils.cpp
> index f36452f..a1d665e 100644
> --- a/src/gallium/drivers/swr/rasterizer/core/utils.cpp
> +++ b/src/gallium/drivers/swr/rasterizer/core/utils.cpp
> @@ -27,6 +27,11 @@
> ******************************************************************************/
> #if defined(_WIN32)
>
> +#if defined(NOMINMAX)
> +// GDI Plus requires non-std min / max macros be defined :(
> +#undef NOMINMAX
> +#endif
> +
> #include<Windows.h>
> #include <Gdiplus.h>
> #include <Gdiplusheaders.h>
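
For anyone puzzled by the #undef: windows.h only defines the min/max
macros when NOMINMAX is absent, and the GDI+ headers use min()/max()
unguarded, so this file cannot build with NOMINMAX in effect. An
alternative (untested here) would be to keep NOMINMAX and supply the
two macros locally before the GDI+ includes:

  #ifndef min
  #define min(a, b) (((a) < (b)) ? (a) : (b))
  #endif
  #ifndef max
  #define max(a, b) (((a) > (b)) ? (a) : (b))
  #endif
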
> diff --git a/src/gallium/drivers/swr/rasterizer/core/utils.h b/src/gallium/drivers/swr/rasterizer/core/utils.h
> index b9dc48c..60a3a6a 100644
> --- a/src/gallium/drivers/swr/rasterizer/core/utils.h
> +++ b/src/gallium/drivers/swr/rasterizer/core/utils.h
> @@ -46,8 +46,7 @@ void OpenBitmapFromFile(
> uint32_t *height);
> #endif
>
> -/// @todo assume linux is always 64 bit
> -#if defined(_WIN64) || defined(__linux__) || defined(__gnu_linux__)
> +#if defined(_WIN64) || defined(__x86_64__)
> #define _MM_INSERT_EPI64 _mm_insert_epi64
> #define _MM_EXTRACT_EPI64 _mm_extract_epi64
> #else
> @@ -89,7 +88,10 @@ INLINE __m128i _MM_INSERT_EPI64(__m128i a, INT64 b, const int32_t ndx)
>
> OSALIGNLINE(struct) BBOX
> {
> - int top, bottom, left, right;
> + int top{ 0 };
> + int bottom{ 0 };
> + int left{ 0 };
> + int right{ 0 };
>
> BBOX() {}
> BBOX(int t, int b, int l, int r) : top(t), bottom(b), left(l), right(r) {}
> @@ -110,7 +112,10 @@ OSALIGNLINE(struct) BBOX
>
> struct simdBBox
> {
> - simdscalari top, bottom, left, right;
> + simdscalari top;
> + simdscalari bottom;
> + simdscalari left;
> + simdscalari right;
> };
>
> INLINE
> @@ -271,7 +276,7 @@ struct TransposeSingleComponent
> /// @brief Pass-thru for single component.
> /// @param pSrc - source data in SOA form
> /// @param pDst - output data in AOS form
> - INLINE static void Transpose(const BYTE* pSrc, BYTE* pDst)
> + INLINE static void Transpose(const uint8_t* pSrc, uint8_t* pDst)
> {
> memcpy(pDst, pSrc, (bpp * KNOB_SIMD_WIDTH) / 8);
> }
> @@ -286,7 +291,7 @@ struct Transpose8_8_8_8
> /// @brief Performs an SOA to AOS conversion for packed 8_8_8_8 data.
> /// @param pSrc - source data in SOA form
> /// @param pDst - output data in AOS form
> - INLINE static void Transpose(const BYTE* pSrc, BYTE* pDst)
> + INLINE static void Transpose(const uint8_t* pSrc, uint8_t* pDst)
> {
> simdscalari src = _simd_load_si((const simdscalari*)pSrc);
> #if KNOB_SIMD_WIDTH == 8
> @@ -325,7 +330,7 @@ struct Transpose8_8_8
> /// @brief Performs an SOA to AOS conversion for packed 8_8_8 data.
> /// @param pSrc - source data in SOA form
> /// @param pDst - output data in AOS form
> - INLINE static void Transpose(const BYTE* pSrc, BYTE* pDst) = delete;
> + INLINE static void Transpose(const uint8_t* pSrc, uint8_t* pDst) = delete;
> };
>
> //////////////////////////////////////////////////////////////////////////
> @@ -337,7 +342,7 @@ struct Transpose8_8
> /// @brief Performs an SOA to AOS conversion for packed 8_8 data.
> /// @param pSrc - source data in SOA form
> /// @param pDst - output data in AOS form
> - INLINE static void Transpose(const BYTE* pSrc, BYTE* pDst)
> + INLINE static void Transpose(const uint8_t* pSrc, uint8_t* pDst)
> {
> simdscalari src = _simd_load_si((const simdscalari*)pSrc);
>
> @@ -361,7 +366,7 @@ struct Transpose32_32_32_32
> /// @brief Performs an SOA to AOS conversion for packed 32_32_32_32 data.
> /// @param pSrc - source data in SOA form
> /// @param pDst - output data in AOS form
> - INLINE static void Transpose(const BYTE* pSrc, BYTE* pDst)
> + INLINE static void Transpose(const uint8_t* pSrc, uint8_t* pDst)
> {
> #if KNOB_SIMD_WIDTH == 8
> simdscalar src0 = _simd_load_ps((const float*)pSrc);
> @@ -394,7 +399,7 @@ struct Transpose32_32_32
> /// @brief Performs an SOA to AOS conversion for packed 32_32_32 data.
> /// @param pSrc - source data in SOA form
> /// @param pDst - output data in AOS form
> - INLINE static void Transpose(const BYTE* pSrc, BYTE* pDst)
> + INLINE static void Transpose(const uint8_t* pSrc, uint8_t* pDst)
> {
> #if KNOB_SIMD_WIDTH == 8
> simdscalar src0 = _simd_load_ps((const float*)pSrc);
> @@ -426,7 +431,7 @@ struct Transpose32_32
> /// @brief Performs an SOA to AOS conversion for packed 32_32 data.
> /// @param pSrc - source data in SOA form
> /// @param pDst - output data in AOS form
> - INLINE static void Transpose(const BYTE* pSrc, BYTE* pDst)
> + INLINE static void Transpose(const uint8_t* pSrc, uint8_t* pDst)
> {
> const float* pfSrc = (const float*)pSrc;
> __m128 src_r0 = _mm_load_ps(pfSrc + 0);
> @@ -456,7 +461,7 @@ struct Transpose16_16_16_16
> /// @brief Performs an SOA to AOS conversion for packed 16_16_16_16 data.
> /// @param pSrc - source data in SOA form
> /// @param pDst - output data in AOS form
> - INLINE static void Transpose(const BYTE* pSrc, BYTE* pDst)
> + INLINE static void Transpose(const uint8_t* pSrc, uint8_t* pDst)
> {
> #if KNOB_SIMD_WIDTH == 8
> simdscalari src_rg = _simd_load_si((const simdscalari*)pSrc);
> @@ -496,7 +501,7 @@ struct Transpose16_16_16
> /// @brief Performs an SOA to AOS conversion for packed 16_16_16 data.
> /// @param pSrc - source data in SOA form
> /// @param pDst - output data in AOS form
> - INLINE static void Transpose(const BYTE* pSrc, BYTE* pDst)
> + INLINE static void Transpose(const uint8_t* pSrc, uint8_t* pDst)
> {
> #if KNOB_SIMD_WIDTH == 8
> simdscalari src_rg = _simd_load_si((const simdscalari*)pSrc);
> @@ -535,7 +540,7 @@ struct Transpose16_16
> /// @brief Performs an SOA to AOS conversion for packed 16_16 data.
> /// @param pSrc - source data in SOA form
> /// @param pDst - output data in AOS form
> - INLINE static void Transpose(const BYTE* pSrc, BYTE* pDst)
> + INLINE static void Transpose(const uint8_t* pSrc, uint8_t* pDst)
> {
> simdscalar src = _simd_load_ps((const float*)pSrc);
>
> @@ -566,7 +571,7 @@ struct Transpose24_8
> /// @brief Performs an SOA to AOS conversion for packed 24_8 data.
> /// @param pSrc - source data in SOA form
> /// @param pDst - output data in AOS form
> - static void Transpose(const BYTE* pSrc, BYTE* pDst) = delete;
> + static void Transpose(const uint8_t* pSrc, uint8_t* pDst) = delete;
> };
>
> //////////////////////////////////////////////////////////////////////////
> @@ -578,7 +583,7 @@ struct Transpose32_8_24
> /// @brief Performs an SOA to AOS conversion for packed 32_8_24 data.
> /// @param pSrc - source data in SOA form
> /// @param pDst - output data in AOS form
> - static void Transpose(const BYTE* pSrc, BYTE* pDst) = delete;
> + static void Transpose(const uint8_t* pSrc, uint8_t* pDst) = delete;
> };
>
>
> @@ -592,7 +597,7 @@ struct Transpose4_4_4_4
> /// @brief Performs an SOA to AOS conversion for packed 4_4_4_4 data.
> /// @param pSrc - source data in SOA form
> /// @param pDst - output data in AOS form
> - static void Transpose(const BYTE* pSrc, BYTE* pDst) = delete;
> + static void Transpose(const uint8_t* pSrc, uint8_t* pDst) = delete;
> };
>
> //////////////////////////////////////////////////////////////////////////
> @@ -604,7 +609,7 @@ struct Transpose5_6_5
> /// @brief Performs an SOA to AOS conversion for packed 5_6_5 data.
> /// @param pSrc - source data in SOA form
> /// @param pDst - output data in AOS form
> - static void Transpose(const BYTE* pSrc, BYTE* pDst) = delete;
> + static void Transpose(const uint8_t* pSrc, uint8_t* pDst) = delete;
> };
>
> //////////////////////////////////////////////////////////////////////////
> @@ -616,7 +621,7 @@ struct Transpose9_9_9_5
> /// @brief Performs an SOA to AOS conversion for packed 9_9_9_5 data.
> /// @param pSrc - source data in SOA form
> /// @param pDst - output data in AOS form
> - static void Transpose(const BYTE* pSrc, BYTE* pDst) = delete;
> + static void Transpose(const uint8_t* pSrc, uint8_t* pDst) = delete;
> };
>
> //////////////////////////////////////////////////////////////////////////
> @@ -628,7 +633,7 @@ struct Transpose5_5_5_1
> /// @brief Performs an SOA to AOS conversion for packed 5_5_5_1 data.
> /// @param pSrc - source data in SOA form
> /// @param pDst - output data in AOS form
> - static void Transpose(const BYTE* pSrc, BYTE* pDst) = delete;
> + static void Transpose(const uint8_t* pSrc, uint8_t* pDst) = delete;
> };
>
> //////////////////////////////////////////////////////////////////////////
> @@ -640,7 +645,7 @@ struct Transpose10_10_10_2
> /// @brief Performs an SOA to AOS conversion for packed 10_10_10_2 data.
> /// @param pSrc - source data in SOA form
> /// @param pDst - output data in AOS form
> - static void Transpose(const BYTE* pSrc, BYTE* pDst) = delete;
> + static void Transpose(const uint8_t* pSrc, uint8_t* pDst) = delete;
> };
>
> //////////////////////////////////////////////////////////////////////////
> @@ -652,7 +657,7 @@ struct Transpose11_11_10
> /// @brief Performs an SOA to AOS conversion for packed 11_11_10 data.
> /// @param pSrc - source data in SOA form
> /// @param pDst - output data in AOS form
> - static void Transpose(const BYTE* pSrc, BYTE* pDst) = delete;
> + static void Transpose(const uint8_t* pSrc, uint8_t* pDst) = delete;
> };
>
> // helper function to unroll loops
> @@ -694,7 +699,7 @@ uint32_t ComputeCRC(uint32_t crc, const void *pData, uint32_t size)
> }
> #endif
>
> - BYTE* pRemainderBytes = (BYTE*)pDataWords;
> + uint8_t* pRemainderBytes = (uint8_t*)pDataWords;
> for (uint32_t i = 0; i < sizeRemainderBytes; ++i)
> {
> crc = _mm_crc32_u8(crc, *pRemainderBytes++);
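
On the BYTE -> uint8_t sweep through the Transpose* helpers: the job
these routines do is easiest to see in scalar form. A reference model
of Transpose8_8_8_8 -- eight reds, then eight greens, blues and alphas
in, eight interleaved RGBA pixels out (kSimdWidth is my stand-in for
KNOB_SIMD_WIDTH on AVX):

  #include <cstdint>

  constexpr int kSimdWidth = 8;

  void Transpose8888_Ref(const uint8_t* pSrc, uint8_t* pDst)
  {
      for (int lane = 0; lane < kSimdWidth; ++lane)       // output pixel
          for (int chan = 0; chan < 4; ++chan)            // R, G, B, A
              pDst[lane * 4 + chan] = pSrc[chan * kSimdWidth + lane];
  }

The vector versions above perform the same reshuffle with a couple of
shuffle/permute instructions per register instead of 32 byte moves.
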
> diff --git a/src/gallium/drivers/swr/rasterizer/jitter/JitManager.cpp b/src/gallium/drivers/swr/rasterizer/jitter/JitManager.cpp
> index 734c897..de856c4 100644
> --- a/src/gallium/drivers/swr/rasterizer/jitter/JitManager.cpp
> +++ b/src/gallium/drivers/swr/rasterizer/jitter/JitManager.cpp
> @@ -47,6 +47,10 @@
> #include "llvm/Analysis/CFGPrinter.h"
> #include "llvm/IRReader/IRReader.h"
>
> +#if LLVM_USE_INTEL_JITEVENTS
> +#include "llvm/ExecutionEngine/JITEventListener.h"
> +#endif
> +
> #include "core/state.h"
> #include "common/containers.hpp"
>
> diff --git a/src/gallium/drivers/swr/rasterizer/jitter/JitManager.h b/src/gallium/drivers/swr/rasterizer/jitter/JitManager.h
> index c974a61..4ffb0fb 100644
> --- a/src/gallium/drivers/swr/rasterizer/jitter/JitManager.h
> +++ b/src/gallium/drivers/swr/rasterizer/jitter/JitManager.h
> @@ -53,6 +53,10 @@
> #include "llvm/Config/config.h"
> #endif
>
> +#ifndef HAVE_LLVM
> +#define HAVE_LLVM ((LLVM_VERSION_MAJOR << 8) | LLVM_VERSION_MINOR)
> +#endif
> +
> #include "llvm/IR/Verifier.h"
> #include "llvm/ExecutionEngine/MCJIT.h"
> #include "llvm/Support/FileSystem.h"
> @@ -60,11 +64,10 @@
>
> #include "llvm/Analysis/Passes.h"
>
> -#if LLVM_VERSION_MAJOR == 3 && LLVM_VERSION_MINOR == 6
> +#if HAVE_LLVM == 0x306
> #include "llvm/PassManager.h"
> #else
> #include "llvm/IR/LegacyPassManager.h"
> -using namespace llvm::legacy;
> #endif
>
> #include "llvm/CodeGen/Passes.h"
> @@ -166,7 +169,6 @@ struct JitManager
> FunctionType* mTrinaryFPTy;
> FunctionType* mUnaryIntTy;
> FunctionType* mBinaryIntTy;
> - FunctionType* mTrinaryIntTy;
>
> Type* mSimtFP32Ty;
> Type* mSimtInt32Ty;
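
The HAVE_LLVM fallback packs major.minor into one integer so the
version guards below reduce to plain comparisons; note it has to be the
bitwise form, since a logical || would collapse the expression to 0 or
1 and 'HAVE_LLVM == 0x306' could never match. A quick sanity check of
the packing:

  static_assert(((3 << 8) | 6) == 0x306, "LLVM 3.6 packs to 0x306");
  static_assert(((3 << 8) | 7) == 0x307, "LLVM 3.7 packs to 0x307");
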
> diff --git a/src/gallium/drivers/swr/rasterizer/jitter/blend_jit.cpp b/src/gallium/drivers/swr/rasterizer/jitter/blend_jit.cpp
> index 954524a..2fed2bf 100644
> --- a/src/gallium/drivers/swr/rasterizer/jitter/blend_jit.cpp
> +++ b/src/gallium/drivers/swr/rasterizer/jitter/blend_jit.cpp
> @@ -717,7 +717,13 @@ struct BlendJit : public Builder
>
> JitManager::DumpToFile(blendFunc, "");
>
> - FunctionPassManager passes(JM()->mpCurrentModule);
> +#if HAVE_LLVM == 0x306
> + FunctionPassManager
> +#else
> + llvm::legacy::FunctionPassManager
> +#endif
> + passes(JM()->mpCurrentModule);
> +
> passes.add(createBreakCriticalEdgesPass());
> passes.add(createCFGSimplificationPass());
> passes.add(createEarlyCSEPass());
> diff --git a/src/gallium/drivers/swr/rasterizer/jitter/builder.cpp b/src/gallium/drivers/swr/rasterizer/jitter/builder.cpp
> index c15bdf1..757ea3f 100644
> --- a/src/gallium/drivers/swr/rasterizer/jitter/builder.cpp
> +++ b/src/gallium/drivers/swr/rasterizer/jitter/builder.cpp
> @@ -38,6 +38,8 @@ using namespace llvm;
> Builder::Builder(JitManager *pJitMgr)
> : mpJitMgr(pJitMgr)
> {
> + mVWidth = pJitMgr->mVWidth;
> +
> mpIRBuilder = &pJitMgr->mBuilder;
>
> mVoidTy = Type::getVoidTy(pJitMgr->mContext);
> @@ -48,14 +50,18 @@ Builder::Builder(JitManager *pJitMgr)
> mInt8Ty = Type::getInt8Ty(pJitMgr->mContext);
> mInt16Ty = Type::getInt16Ty(pJitMgr->mContext);
> mInt32Ty = Type::getInt32Ty(pJitMgr->mContext);
> + mInt8PtrTy = PointerType::get(mInt8Ty, 0);
> + mInt16PtrTy = PointerType::get(mInt16Ty, 0);
> + mInt32PtrTy = PointerType::get(mInt32Ty, 0);
> mInt64Ty = Type::getInt64Ty(pJitMgr->mContext);
> mV4FP32Ty = StructType::get(pJitMgr->mContext, std::vector<Type*>(4, mFP32Ty), false); // vector4 float type (represented as structure)
> mV4Int32Ty = StructType::get(pJitMgr->mContext, std::vector<Type*>(4, mInt32Ty), false); // vector4 int type
> - mSimdInt16Ty = VectorType::get(mInt16Ty, mpJitMgr->mVWidth);
> - mSimdInt32Ty = VectorType::get(mInt32Ty, mpJitMgr->mVWidth);
> - mSimdInt64Ty = VectorType::get(mInt64Ty, mpJitMgr->mVWidth);
> - mSimdFP16Ty = VectorType::get(mFP16Ty, mpJitMgr->mVWidth);
> - mSimdFP32Ty = VectorType::get(mFP32Ty, mpJitMgr->mVWidth);
> + mSimdInt16Ty = VectorType::get(mInt16Ty, mVWidth);
> + mSimdInt32Ty = VectorType::get(mInt32Ty, mVWidth);
> + mSimdInt64Ty = VectorType::get(mInt64Ty, mVWidth);
> + mSimdFP16Ty = VectorType::get(mFP16Ty, mVWidth);
> + mSimdFP32Ty = VectorType::get(mFP32Ty, mVWidth);
> + mSimdVectorTy = StructType::get(pJitMgr->mContext, std::vector<Type*>(4, mSimdFP32Ty), false);
>
> if (sizeof(uint32_t*) == 4)
> {
> diff --git a/src/gallium/drivers/swr/rasterizer/jitter/builder.h b/src/gallium/drivers/swr/rasterizer/jitter/builder.h
> index 4921661..239ef2a 100644
> --- a/src/gallium/drivers/swr/rasterizer/jitter/builder.h
> +++ b/src/gallium/drivers/swr/rasterizer/jitter/builder.h
> @@ -43,6 +43,8 @@ struct Builder
> JitManager* mpJitMgr;
> IRBuilder<>* mpIRBuilder;
>
> + uint32_t mVWidth;
> +
> // Built in types.
> Type* mVoidTy;
> Type* mInt1Ty;
> @@ -54,12 +56,16 @@ struct Builder
> Type* mFP16Ty;
> Type* mFP32Ty;
> Type* mDoubleTy;
> + Type* mInt8PtrTy;
> + Type* mInt16PtrTy;
> + Type* mInt32PtrTy;
> Type* mSimdFP16Ty;
> Type* mSimdFP32Ty;
> Type* mSimdInt16Ty;
> Type* mSimdInt32Ty;
> Type* mSimdInt64Ty;
> Type* mSimdIntPtrTy;
> + Type* mSimdVectorTy;
> StructType* mV4FP32Ty;
> StructType* mV4Int32Ty;
>
> diff --git a/src/gallium/drivers/swr/rasterizer/jitter/builder_misc.cpp b/src/gallium/drivers/swr/rasterizer/jitter/builder_misc.cpp
> index 5394fc7..c6cf793 100644
> --- a/src/gallium/drivers/swr/rasterizer/jitter/builder_misc.cpp
> +++ b/src/gallium/drivers/swr/rasterizer/jitter/builder_misc.cpp
> @@ -28,6 +28,8 @@
> *
> ******************************************************************************/
> #include "builder.h"
> +#include "common/rdtsc_buckets.h"
> +
> #include "llvm/Support/DynamicLibrary.h"
>
> void __cdecl CallPrint(const char* fmt, ...);
> @@ -189,32 +191,32 @@ Constant *Builder::PRED(bool pred)
>
> Value *Builder::VIMMED1(int i)
> {
> - return ConstantVector::getSplat(JM()->mVWidth, cast<ConstantInt>(C(i)));
> + return ConstantVector::getSplat(mVWidth, cast<ConstantInt>(C(i)));
> }
>
> Value *Builder::VIMMED1(uint32_t i)
> {
> - return ConstantVector::getSplat(JM()->mVWidth, cast<ConstantInt>(C(i)));
> + return ConstantVector::getSplat(mVWidth, cast<ConstantInt>(C(i)));
> }
>
> Value *Builder::VIMMED1(float i)
> {
> - return ConstantVector::getSplat(JM()->mVWidth, cast<ConstantFP>(C(i)));
> + return ConstantVector::getSplat(mVWidth, cast<ConstantFP>(C(i)));
> }
>
> Value *Builder::VIMMED1(bool i)
> {
> - return ConstantVector::getSplat(JM()->mVWidth, cast<ConstantInt>(C(i)));
> + return ConstantVector::getSplat(mVWidth, cast<ConstantInt>(C(i)));
> }
>
> Value *Builder::VUNDEF_IPTR()
> {
> - return UndefValue::get(VectorType::get(PointerType::get(mInt32Ty, 0),JM()->mVWidth));
> + return UndefValue::get(VectorType::get(mInt32PtrTy,mVWidth));
> }
>
> Value *Builder::VUNDEF_I()
> {
> - return UndefValue::get(VectorType::get(mInt32Ty, JM()->mVWidth));
> + return UndefValue::get(VectorType::get(mInt32Ty, mVWidth));
> }
>
> Value *Builder::VUNDEF(Type *ty, uint32_t size)
> @@ -224,15 +226,15 @@ Value *Builder::VUNDEF(Type *ty, uint32_t size)
>
> Value *Builder::VUNDEF_F()
> {
> - return UndefValue::get(VectorType::get(mFP32Ty, JM()->mVWidth));
> + return UndefValue::get(VectorType::get(mFP32Ty, mVWidth));
> }
>
> Value *Builder::VUNDEF(Type* t)
> {
> - return UndefValue::get(VectorType::get(t, JM()->mVWidth));
> + return UndefValue::get(VectorType::get(t, mVWidth));
> }
>
> -#if LLVM_VERSION_MAJOR == 3 && LLVM_VERSION_MINOR == 6
> +#if HAVE_LLVM == 0x306
> Value *Builder::VINSERT(Value *vec, Value *val, uint64_t index)
> {
> return VINSERT(vec, val, C((int64_t)index));
> @@ -247,7 +249,7 @@ Value *Builder::VBROADCAST(Value *src)
> return src;
> }
>
> - return VECTOR_SPLAT(JM()->mVWidth, src);
> + return VECTOR_SPLAT(mVWidth, src);
> }
>
> uint32_t Builder::IMMED(Value* v)
> @@ -257,6 +259,13 @@ uint32_t Builder::IMMED(Value* v)
> return pValConst->getZExtValue();
> }
>
> +int32_t Builder::S_IMMED(Value* v)
> +{
> + SWR_ASSERT(isa<ConstantInt>(v));
> + ConstantInt *pValConst = cast<ConstantInt>(v);
> + return pValConst->getSExtValue();
> +}
> +
> Value *Builder::GEP(Value* ptr, const std::initializer_list<Value*> &indexList)
> {
> std::vector<Value*> indices;
> @@ -342,8 +351,8 @@ Value *Builder::MASKLOADD(Value* src,Value* mask)
> else
> {
> Function *func = Intrinsic::getDeclaration(JM()->mpCurrentModule,Intrinsic::x86_avx_maskload_ps_256);
> - Value* fMask = BITCAST(mask,VectorType::get(mFP32Ty,JM()->mVWidth));
> - vResult = BITCAST(CALL(func,{src,fMask}), VectorType::get(mInt32Ty,JM()->mVWidth));
> + Value* fMask = BITCAST(mask,VectorType::get(mFP32Ty,mVWidth));
> + vResult = BITCAST(CALL(func,{src,fMask}), VectorType::get(mInt32Ty,mVWidth));
> }
> return vResult;
> }
> @@ -512,7 +521,7 @@ CallInst *Builder::PRINT(const std::string &printStr,const std::initializer_list
>
> // get a pointer to the first character in the constant string array
> std::vector<Constant*> geplist{C(0),C(0)};
> -#if LLVM_VERSION_MAJOR == 3 && LLVM_VERSION_MINOR == 6
> +#if HAVE_LLVM == 0x306
> Constant *strGEP = ConstantExpr::getGetElementPtr(gvPtr,geplist,false);
> #else
> Constant *strGEP = ConstantExpr::getGetElementPtr(nullptr, gvPtr,geplist,false);
> @@ -575,7 +584,7 @@ Value *Builder::GATHERPS(Value* vSrc, Value* pBase, Value* vIndices, Value* vMas
> Value *vScaleVec = VBROADCAST(Z_EXT(scale,mInt32Ty));
> Value *vOffsets = MUL(vIndices,vScaleVec);
> Value *mask = MASK(vMask);
> - for(uint32_t i = 0; i < JM()->mVWidth; ++i)
> + for(uint32_t i = 0; i < mVWidth; ++i)
> {
> // single component byte index
> Value *offset = VEXTRACT(vOffsets,C(i));
> @@ -625,7 +634,7 @@ Value *Builder::GATHERDD(Value* vSrc, Value* pBase, Value* vIndices, Value* vMas
> Value *vScaleVec = VBROADCAST(Z_EXT(scale, mInt32Ty));
> Value *vOffsets = MUL(vIndices, vScaleVec);
> Value *mask = MASK(vMask);
> - for(uint32_t i = 0; i < JM()->mVWidth; ++i)
> + for(uint32_t i = 0; i < mVWidth; ++i)
> {
> // single component byte index
> Value *offset = VEXTRACT(vOffsets, C(i));
> @@ -774,12 +783,61 @@ Value *Builder::PERMD(Value* a, Value* idx)
> }
> else
> {
> - res = VSHUFFLE(a, a, idx);
> + if (isa<Constant>(idx))
> + {
> + res = VSHUFFLE(a, a, idx);
> + }
> + else
> + {
> + res = VUNDEF_I();
> + for (uint32_t l = 0; l < JM()->mVWidth; ++l)
> + {
> + Value* pIndex = VEXTRACT(idx, C(l));
> + Value* pVal = VEXTRACT(a, pIndex);
> + res = VINSERT(res, pVal, C(l));
> + }
> + }
> }
> return res;
> }
>
> //////////////////////////////////////////////////////////////////////////
> +/// @brief Generate a VPERMPS operation (shuffle 32 bit float values
> +/// across 128 bit lanes) in LLVM IR. If not supported on the underlying
> +/// platform, emulate it
> +/// @param a - 256bit SIMD lane(8x32bit) of float values.
> +/// @param idx - 256bit SIMD lane(8x32bit) of 3 bit lane index values
> +Value *Builder::PERMPS(Value* a, Value* idx)
> +{
> + Value* res;
> + // use avx2 permute instruction if available
> + if (JM()->mArch.AVX2())
> + {
> +        // llvm 3.6.0 swapped the order of the args to vpermps
> + res = VPERMPS(idx, a);
> + }
> + else
> + {
> + if (isa<Constant>(idx))
> + {
> + res = VSHUFFLE(a, a, idx);
> + }
> + else
> + {
> + res = VUNDEF_F();
> + for (uint32_t l = 0; l < JM()->mVWidth; ++l)
> + {
> + Value* pIndex = VEXTRACT(idx, C(l));
> + Value* pVal = VEXTRACT(a, pIndex);
> + res = VINSERT(res, pVal, C(l));
> + }
> + }
> + }
> +
> + return res;
> +}
> +
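
For anyone checking the emulation loops: PERMD/PERMPS semantics in
scalar form are simply "output lane l takes input lane idx[l]", where
hardware looks only at the low 3 bits of each index (the VEXTRACT-based
fallback relies on the caller keeping indices in range). A standalone
model:

  #include <array>
  #include <cstdint>

  template <typename T>
  std::array<T, 8> Permute8(const std::array<T, 8>& a,
                            const std::array<uint32_t, 8>& idx)
  {
      std::array<T, 8> r{};
      for (int lane = 0; lane < 8; ++lane)
          r[lane] = a[idx[lane] & 7];   // & 7 mirrors vpermd/vpermps
      return r;
  }
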
> +//////////////////////////////////////////////////////////////////////////
> /// @brief Generate a VCVTPH2PS operation (float16->float32 conversion)
> /// in LLVM IR. If not supported on the underlying platform, emulate it
> /// @param a - 128bit SIMD lane(8x16bit) of float16 in int16 format.
> @@ -800,7 +858,7 @@ Value *Builder::CVTPH2PS(Value* a)
> }
>
> Value* pResult = UndefValue::get(mSimdFP32Ty);
> - for (uint32_t i = 0; i < JM()->mVWidth; ++i)
> + for (uint32_t i = 0; i < mVWidth; ++i)
> {
> Value* pSrc = VEXTRACT(a, C(i));
> Value* pConv = CALL(pCvtPh2Ps, std::initializer_list<Value*>{pSrc});
> @@ -833,7 +891,7 @@ Value *Builder::CVTPS2PH(Value* a, Value* rounding)
> }
>
> Value* pResult = UndefValue::get(mSimdInt16Ty);
> - for (uint32_t i = 0; i < JM()->mVWidth; ++i)
> + for (uint32_t i = 0; i < mVWidth; ++i)
> {
> Value* pSrc = VEXTRACT(a, C(i));
> Value* pConv = CALL(pCvtPs2Ph, std::initializer_list<Value*>{pSrc});
> @@ -1085,8 +1143,8 @@ void Builder::GATHER4DD(const SWR_FORMAT_INFO &info, Value* pSrcBase, Value* byt
> void Builder::Shuffle16bpcGather4(const SWR_FORMAT_INFO &info, Value* vGatherInput[2], Value* vGatherOutput[4], bool bPackedOutput)
> {
> // cast types
> - Type* vGatherTy = VectorType::get(IntegerType::getInt32Ty(JM()->mContext), JM()->mVWidth);
> - Type* v32x8Ty = VectorType::get(mInt8Ty, JM()->mVWidth * 4); // vwidth is units of 32 bits
> + Type* vGatherTy = VectorType::get(IntegerType::getInt32Ty(JM()->mContext), mVWidth);
> + Type* v32x8Ty = VectorType::get(mInt8Ty, mVWidth * 4); // vwidth is units of 32 bits
>
> // input could either be float or int vector; do shuffle work in int
> vGatherInput[0] = BITCAST(vGatherInput[0], mSimdInt32Ty);
> @@ -1094,7 +1152,7 @@ void Builder::Shuffle16bpcGather4(const SWR_FORMAT_INFO &info, Value* vGatherInp
>
> if(bPackedOutput)
> {
> - Type* v128bitTy = VectorType::get(IntegerType::getIntNTy(JM()->mContext, 128), JM()->mVWidth / 4); // vwidth is units of 32 bits
> + Type* v128bitTy = VectorType::get(IntegerType::getIntNTy(JM()->mContext, 128), mVWidth / 4); // vwidth is units of 32 bits
>
> // shuffle mask
> Value* vConstMask = C<char>({0, 1, 4, 5, 8, 9, 12, 13, 2, 3, 6, 7, 10, 11, 14, 15,
> @@ -1179,12 +1237,12 @@ void Builder::Shuffle16bpcGather4(const SWR_FORMAT_INFO &info, Value* vGatherInp
> void Builder::Shuffle8bpcGather4(const SWR_FORMAT_INFO &info, Value* vGatherInput, Value* vGatherOutput[], bool bPackedOutput)
> {
> // cast types
> - Type* vGatherTy = VectorType::get(IntegerType::getInt32Ty(JM()->mContext), JM()->mVWidth);
> - Type* v32x8Ty = VectorType::get(mInt8Ty, JM()->mVWidth * 4 ); // vwidth is units of 32 bits
> + Type* vGatherTy = VectorType::get(IntegerType::getInt32Ty(JM()->mContext), mVWidth);
> + Type* v32x8Ty = VectorType::get(mInt8Ty, mVWidth * 4 ); // vwidth is units of 32 bits
>
> if(bPackedOutput)
> {
> - Type* v128Ty = VectorType::get(IntegerType::getIntNTy(JM()->mContext, 128), JM()->mVWidth / 4); // vwidth is units of 32 bits
> + Type* v128Ty = VectorType::get(IntegerType::getIntNTy(JM()->mContext, 128), mVWidth / 4); // vwidth is units of 32 bits
> // shuffle mask
> Value* vConstMask = C<char>({0, 4, 8, 12, 1, 5, 9, 13, 2, 6, 10, 14, 3, 7, 11, 15,
> 0, 4, 8, 12, 1, 5, 9, 13, 2, 6, 10, 14, 3, 7, 11, 15});
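This byte mask (presumably fed to PSHUFB, going by the rest of the
function) packs each 8-bit component of the four gathered 32-bit
elements together. Since pshufb shuffles within each 128-bit half
independently, a scalar model of the behavior relied on here (sketch):

    /* per-128-bit-lane byte shuffle; a set high bit in a mask
       byte would zero the output, but these masks never set it */
    void pshufb_ref(uint8_t dst[32], const uint8_t src[32],
                    const uint8_t mask[32])
    {
        for (int i = 0; i < 32; ++i) {
            int lane = (i < 16) ? 0 : 16;
            dst[i] = src[lane + (mask[i] & 0x0F)];
        }
    }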
> @@ -1286,16 +1344,18 @@ void Builder::SCATTERPS(Value* pDst, Value* vSrc, Value* vOffsets, Value* vMask)
> {
> Value* pStack = STACKSAVE();
>
> + Type* pSrcTy = vSrc->getType()->getVectorElementType();
> +
> // allocate tmp stack for masked off lanes
> - Value* vTmpPtr = ALLOCA(vSrc->getType()->getVectorElementType());
> + Value* vTmpPtr = ALLOCA(pSrcTy);
>
> Value *mask = MASK(vMask);
> - for (uint32_t i = 0; i < JM()->mVWidth; ++i)
> + for (uint32_t i = 0; i < mVWidth; ++i)
> {
> Value *offset = VEXTRACT(vOffsets, C(i));
> // byte pointer to component
> Value *storeAddress = GEP(pDst, offset);
> - storeAddress = BITCAST(storeAddress, PointerType::get(mFP32Ty, 0));
> + storeAddress = BITCAST(storeAddress, PointerType::get(pSrcTy, 0));
> Value *selMask = VEXTRACT(mask, C(i));
> Value *srcElem = VEXTRACT(vSrc, C(i));
> // switch in a safe address to load if we're trying to access a vertex
> @@ -1349,7 +1409,7 @@ Value *Builder::FCLAMP(Value* src, float low, float high)
> Value* Builder::STACKSAVE()
> {
> Function* pfnStackSave = Intrinsic::getDeclaration(JM()->mpCurrentModule, Intrinsic::stacksave);
> -#if LLVM_VERSION_MAJOR == 3 && LLVM_VERSION_MINOR == 6
> +#if HAVE_LLVM == 0x306
> return CALL(pfnStackSave);
> #else
> return CALLA(pfnStackSave);
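Switching to HAVE_LLVM matches what the rest of Mesa checks; it packs
the version as (major << 8) | minor, so the test reads:

    #if HAVE_LLVM == 0x306   /* i.e. LLVM 3.6.x exactly */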
> @@ -1401,11 +1461,13 @@ void __cdecl CallPrint(const char* fmt, ...)
> vsnprintf_s(strBuf, _TRUNCATE, fmt, args);
> OutputDebugString(strBuf);
> #endif
> +
> + va_end(args);
> }
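Good catch adding the va_end() -- every va_start() must be paired with
a va_end() before the function returns, and on some ABIs it is not a
no-op. The fixed shape, for reference:

    #include <cstdarg>
    #include <cstdio>

    void print_like(const char* fmt, ...)
    {
        va_list args;
        va_start(args, fmt);
        vprintf(fmt, args);
        va_end(args);   /* required pairing with va_start */
    }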
>
> Value *Builder::VEXTRACTI128(Value* a, Constant* imm8)
> {
> -#if LLVM_VERSION_MAJOR == 3 && LLVM_VERSION_MINOR == 6
> +#if HAVE_LLVM == 0x306
> Function *func =
> Intrinsic::getDeclaration(JM()->mpCurrentModule,
> Intrinsic::x86_avx_vextractf128_si_256);
> @@ -1413,8 +1475,8 @@ Value *Builder::VEXTRACTI128(Value* a, Constant* imm8)
> #else
> bool flag = !imm8->isZeroValue();
> SmallVector<Constant*,8> idx;
> - for (unsigned i = 0; i < JM()->mVWidth / 2; i++) {
> - idx.push_back(C(flag ? i + JM()->mVWidth / 2 : i));
> + for (unsigned i = 0; i < mVWidth / 2; i++) {
> + idx.push_back(C(flag ? i + mVWidth / 2 : i));
> }
> return VSHUFFLE(a, VUNDEF_I(), ConstantVector::get(idx));
> #endif
> @@ -1422,7 +1484,7 @@ Value *Builder::VEXTRACTI128(Value* a, Constant* imm8)
>
> Value *Builder::VINSERTI128(Value* a, Value* b, Constant* imm8)
> {
> -#if LLVM_VERSION_MAJOR == 3 && LLVM_VERSION_MINOR == 6
> +#if HAVE_LLVM == 0x306
> Function *func =
> Intrinsic::getDeclaration(JM()->mpCurrentModule,
> Intrinsic::x86_avx_vinsertf128_si_256);
> @@ -1430,18 +1492,54 @@ Value *Builder::VINSERTI128(Value* a, Value* b, Constant* imm8)
> #else
> bool flag = !imm8->isZeroValue();
> SmallVector<Constant*,8> idx;
> - for (unsigned i = 0; i < JM()->mVWidth; i++) {
> + for (unsigned i = 0; i < mVWidth; i++) {
> idx.push_back(C(i));
> }
> Value *inter = VSHUFFLE(b, VUNDEF_I(), ConstantVector::get(idx));
>
> SmallVector<Constant*,8> idx2;
> - for (unsigned i = 0; i < JM()->mVWidth / 2; i++) {
> - idx2.push_back(C(flag ? i : i + JM()->mVWidth));
> + for (unsigned i = 0; i < mVWidth / 2; i++) {
> + idx2.push_back(C(flag ? i : i + mVWidth));
> }
> - for (unsigned i = JM()->mVWidth / 2; i < JM()->mVWidth; i++) {
> - idx2.push_back(C(flag ? i + JM()->mVWidth / 2 : i));
> + for (unsigned i = mVWidth / 2; i < mVWidth; i++) {
> + idx2.push_back(C(flag ? i + mVWidth / 2 : i));
> }
> return VSHUFFLE(a, inter, ConstantVector::get(idx2));
> #endif
> }
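The shuffle-based fallback checks out; for the 8x32-bit case it reduces
to the following scalar behavior (sketch, illustration only):

    void vextracti128_ref(uint32_t dst[4], const uint32_t a[8], int imm8)
    {
        for (int i = 0; i < 4; ++i)
            dst[i] = a[imm8 ? 4 + i : i];      /* upper or lower half */
    }

    void vinserti128_ref(uint32_t dst[8], const uint32_t a[8],
                         const uint32_t b[4], int imm8)
    {
        for (int i = 0; i < 8; ++i)
            dst[i] = a[i];                     /* start from a */
        for (int i = 0; i < 4; ++i)
            dst[(imm8 ? 4 : 0) + i] = b[i];    /* overwrite one half with b */
    }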
> +
> +// rdtsc buckets macros
> +void Builder::RDTSC_START(Value* pBucketMgr, Value* pId)
> +{
> + std::vector<Type*> args{
> + PointerType::get(mInt32Ty, 0), // pBucketMgr
> + mInt32Ty // id
> + };
> +
> + FunctionType* pFuncTy = FunctionType::get(Type::getVoidTy(JM()->mContext), args, false);
> + Function* pFunc = cast<Function>(JM()->mpCurrentModule->getOrInsertFunction("BucketManager_StartBucket", pFuncTy));
> + if (sys::DynamicLibrary::SearchForAddressOfSymbol("BucketManager_StartBucket") == nullptr)
> + {
> + sys::DynamicLibrary::AddSymbol("BucketManager_StartBucket", (void*)&BucketManager_StartBucket);
> + }
> +
> + CALL(pFunc, { pBucketMgr, pId });
> +}
> +
> +void Builder::RDTSC_STOP(Value* pBucketMgr, Value* pId)
> +{
> + std::vector<Type*> args{
> + PointerType::get(mInt32Ty, 0), // pBucketMgr
> + mInt32Ty // id
> + };
> +
> + FunctionType* pFuncTy = FunctionType::get(Type::getVoidTy(JM()->mContext), args, false);
> + Function* pFunc = cast<Function>(JM()->mpCurrentModule->getOrInsertFunction("BucketManager_StopBucket", pFuncTy));
> + if (sys::DynamicLibrary::SearchForAddressOfSymbol("BucketManager_StopBucket") == nullptr)
> + {
> + sys::DynamicLibrary::AddSymbol("BucketManager_StopBucket", (void*)&BucketManager_StopBucket);
> + }
> +
> + CALL(pFunc, { pBucketMgr, pId });
> +}
> +
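Minor nit: RDTSC_START and RDTSC_STOP differ only in the symbol they
bind; a small helper taking the name and host pointer would halve this.
For reviewers unfamiliar with the pattern: getOrInsertFunction declares
the external in the module, and DynamicLibrary::AddSymbol is what lets
the JIT resolve it to the host function at finalization. Minimal sketch
of the same wiring (MyHostHook is a made-up name):

    extern "C" void MyHostHook(void* pMgr, uint32_t id);  /* host side */

    /* expose the host address to the JIT's symbol resolver */
    if (sys::DynamicLibrary::SearchForAddressOfSymbol("MyHostHook") == nullptr)
    {
        sys::DynamicLibrary::AddSymbol("MyHostHook", (void*)&MyHostHook);
    }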
> diff --git a/src/gallium/drivers/swr/rasterizer/jitter/builder_misc.h b/src/gallium/drivers/swr/rasterizer/jitter/builder_misc.h
> index 48e0558..f43ef69 100644
> --- a/src/gallium/drivers/swr/rasterizer/jitter/builder_misc.h
> +++ b/src/gallium/drivers/swr/rasterizer/jitter/builder_misc.h
> @@ -59,7 +59,7 @@ Value *VUNDEF_F();
> Value *VUNDEF_I();
> Value *VUNDEF(Type* ty, uint32_t size);
> Value *VUNDEF_IPTR();
> -#if LLVM_VERSION_MAJOR == 3 && LLVM_VERSION_MINOR == 6
> +#if HAVE_LLVM == 0x306
> Value *VINSERT(Value *vec, Value *val, uint64_t index);
> #endif
> Value *VBROADCAST(Value *src);
> @@ -67,6 +67,7 @@ Value *VRCP(Value *va);
> Value *VPLANEPS(Value* vA, Value* vB, Value* vC, Value* &vX, Value* &vY);
>
> uint32_t IMMED(Value* i);
> +int32_t S_IMMED(Value* i);
>
> Value *GEP(Value* ptr, const std::initializer_list<Value*> &indexList);
> Value *GEP(Value* ptr, const std::initializer_list<uint32_t> &indexList);
> @@ -115,6 +116,7 @@ Value *PSHUFB(Value* a, Value* b);
> Value *PMOVSXBD(Value* a);
> Value *PMOVSXWD(Value* a);
> Value *PERMD(Value* a, Value* idx);
> +Value *PERMPS(Value* a, Value* idx);
> Value *CVTPH2PS(Value* a);
> Value *CVTPS2PH(Value* a, Value* rounding);
> Value *PMAXSD(Value* a, Value* b);
> @@ -147,3 +149,7 @@ Value* INT3() { return INTERRUPT(C((uint8_t)3)); }
>
> Value *VEXTRACTI128(Value* a, Constant* imm8);
> Value *VINSERTI128(Value* a, Value* b, Constant* imm8);
> +
> +// rdtsc buckets macros
> +void RDTSC_START(Value* pBucketMgr, Value* pId);
> +void RDTSC_STOP(Value* pBucketMgr, Value* pId);
> diff --git a/src/gallium/drivers/swr/rasterizer/jitter/fetch_jit.cpp b/src/gallium/drivers/swr/rasterizer/jitter/fetch_jit.cpp
> index c5a180e..2c2c56b 100644
> --- a/src/gallium/drivers/swr/rasterizer/jitter/fetch_jit.cpp
> +++ b/src/gallium/drivers/swr/rasterizer/jitter/fetch_jit.cpp
> @@ -105,7 +105,7 @@ Function* FetchJit::Create(const FETCH_COMPILE_STATE& fetchState)
> std::vector<Value*> vtxInputIndices(2, C(0));
> // GEP
> pVtxOut = GEP(pVtxOut, C(0));
> - pVtxOut = BITCAST(pVtxOut, PointerType::get(VectorType::get(mFP32Ty, JM()->mVWidth), 0));
> + pVtxOut = BITCAST(pVtxOut, PointerType::get(VectorType::get(mFP32Ty, mVWidth), 0));
>
> // SWR_FETCH_CONTEXT::pStreams
> Value* streams = LOAD(fetchInfo,{0, SWR_FETCH_CONTEXT_pStreams});
> @@ -174,7 +174,12 @@ Function* FetchJit::Create(const FETCH_COMPILE_STATE& fetchState)
>
> verifyFunction(*fetch);
>
> - FunctionPassManager setupPasses(JM()->mpCurrentModule);
> +#if HAVE_LLVM == 0x306
> + FunctionPassManager
> +#else
> + llvm::legacy::FunctionPassManager
> +#endif
> + setupPasses(JM()->mpCurrentModule);
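This #if/#else around the declaration appears three times in the patch
(twice in this file, once in streamout_jit.cpp); from 3.7 on only the
llvm::legacy:: spelling exists. A possible cleanup, sketched here with a
hypothetical alias name, would be a single typedef next to the other
LLVM version shims:

    #if HAVE_LLVM == 0x306
    typedef FunctionPassManager SwrFunctionPassManager;
    #else
    typedef llvm::legacy::FunctionPassManager SwrFunctionPassManager;
    #endif

    /* call sites then shrink to: */
    SwrFunctionPassManager setupPasses(JM()->mpCurrentModule);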
>
> ///@todo We don't need the CFG passes for fetch. (e.g. BreakCriticalEdges and CFGSimplification)
> setupPasses.add(createBreakCriticalEdgesPass());
> @@ -186,7 +191,12 @@ Function* FetchJit::Create(const FETCH_COMPILE_STATE& fetchState)
>
> JitManager::DumpToFile(fetch, "se");
>
> - FunctionPassManager optPasses(JM()->mpCurrentModule);
> +#if HAVE_LLVM == 0x306
> + FunctionPassManager
> +#else
> + llvm::legacy::FunctionPassManager
> +#endif
> + optPasses(JM()->mpCurrentModule);
>
> ///@todo Haven't touched these either. Need to remove some of these and add others.
> optPasses.add(createCFGSimplificationPass());
> @@ -220,8 +230,8 @@ void FetchJit::JitLoadVertices(const FETCH_COMPILE_STATE &fetchState, Value* fet
>
> SWRL::UncheckedFixedVector<Value*, 16> vectors;
>
> - std::vector<Constant*> pMask(JM()->mVWidth);
> - for(uint32_t i = 0; i < JM()->mVWidth; ++i)
> + std::vector<Constant*> pMask(mVWidth);
> + for(uint32_t i = 0; i < mVWidth; ++i)
> {
> pMask[i] = (C(i < 4 ? i : 4));
> }
> @@ -254,7 +264,7 @@ void FetchJit::JitLoadVertices(const FETCH_COMPILE_STATE &fetchState, Value* fet
> Value* startVertexOffset = MUL(Z_EXT(startVertex, mInt64Ty), stride);
>
> // Load from the stream.
> - for(uint32_t lane = 0; lane < JM()->mVWidth; ++lane)
> + for(uint32_t lane = 0; lane < mVWidth; ++lane)
> {
> // Get index
> Value* index = VEXTRACT(vIndices, C(lane));
> @@ -380,44 +390,44 @@ void FetchJit::JitLoadVertices(const FETCH_COMPILE_STATE &fetchState, Value* fet
> vectors.push_back(wvec);
> }
>
> - std::vector<Constant*> v01Mask(JM()->mVWidth);
> - std::vector<Constant*> v23Mask(JM()->mVWidth);
> - std::vector<Constant*> v02Mask(JM()->mVWidth);
> - std::vector<Constant*> v13Mask(JM()->mVWidth);
> + std::vector<Constant*> v01Mask(mVWidth);
> + std::vector<Constant*> v23Mask(mVWidth);
> + std::vector<Constant*> v02Mask(mVWidth);
> + std::vector<Constant*> v13Mask(mVWidth);
>
> // Concatenate the vectors together.
> elements[0] = VUNDEF_F();
> elements[1] = VUNDEF_F();
> elements[2] = VUNDEF_F();
> elements[3] = VUNDEF_F();
> - for(uint32_t b = 0, num4Wide = JM()->mVWidth / 4; b < num4Wide; ++b)
> + for(uint32_t b = 0, num4Wide = mVWidth / 4; b < num4Wide; ++b)
> {
> v01Mask[4 * b + 0] = C(0 + 4 * b);
> v01Mask[4 * b + 1] = C(1 + 4 * b);
> - v01Mask[4 * b + 2] = C(0 + 4 * b + JM()->mVWidth);
> - v01Mask[4 * b + 3] = C(1 + 4 * b + JM()->mVWidth);
> + v01Mask[4 * b + 2] = C(0 + 4 * b + mVWidth);
> + v01Mask[4 * b + 3] = C(1 + 4 * b + mVWidth);
>
> v23Mask[4 * b + 0] = C(2 + 4 * b);
> v23Mask[4 * b + 1] = C(3 + 4 * b);
> - v23Mask[4 * b + 2] = C(2 + 4 * b + JM()->mVWidth);
> - v23Mask[4 * b + 3] = C(3 + 4 * b + JM()->mVWidth);
> + v23Mask[4 * b + 2] = C(2 + 4 * b + mVWidth);
> + v23Mask[4 * b + 3] = C(3 + 4 * b + mVWidth);
>
> v02Mask[4 * b + 0] = C(0 + 4 * b);
> v02Mask[4 * b + 1] = C(2 + 4 * b);
> - v02Mask[4 * b + 2] = C(0 + 4 * b + JM()->mVWidth);
> - v02Mask[4 * b + 3] = C(2 + 4 * b + JM()->mVWidth);
> + v02Mask[4 * b + 2] = C(0 + 4 * b + mVWidth);
> + v02Mask[4 * b + 3] = C(2 + 4 * b + mVWidth);
>
> v13Mask[4 * b + 0] = C(1 + 4 * b);
> v13Mask[4 * b + 1] = C(3 + 4 * b);
> - v13Mask[4 * b + 2] = C(1 + 4 * b + JM()->mVWidth);
> - v13Mask[4 * b + 3] = C(3 + 4 * b + JM()->mVWidth);
> + v13Mask[4 * b + 2] = C(1 + 4 * b + mVWidth);
> + v13Mask[4 * b + 3] = C(3 + 4 * b + mVWidth);
>
> - std::vector<Constant*> iMask(JM()->mVWidth);
> - for(uint32_t i = 0; i < JM()->mVWidth; ++i)
> + std::vector<Constant*> iMask(mVWidth);
> + for(uint32_t i = 0; i < mVWidth; ++i)
> {
> if(((4 * b) <= i) && (i < (4 * (b + 1))))
> {
> - iMask[i] = C(i % 4 + JM()->mVWidth);
> + iMask[i] = C(i % 4 + mVWidth);
> }
> else
> {
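These mask tables implement a 4-wide AoS-to-SoA transpose per 128-bit
block. Since shufflevector index semantics are easy to misread: indices
below the vector width select from the first operand, the rest from the
second. Scalar model for the 8-wide case (sketch):

    void vshuffle_ref(float dst[8], const float a[8], const float b[8],
                      const int idx[8])
    {
        for (int i = 0; i < 8; ++i)
            dst[i] = (idx[i] < 8) ? a[idx[i]] : b[idx[i] - 8];
    }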
> @@ -805,7 +815,7 @@ Value* FetchJit::GetSimdValid8bitIndices(Value* pIndices, Value* pLastIndex)
> STORE(C((uint8_t)0), pZeroIndex);
>
> // Load a SIMD of index pointers
> - for(int64_t lane = 0; lane < JM()->mVWidth; lane++)
> + for(int64_t lane = 0; lane < mVWidth; lane++)
> {
> // Calculate the address of the requested index
> Value *pIndex = GEP(pIndices, C(lane));
> @@ -840,7 +850,7 @@ Value* FetchJit::GetSimdValid16bitIndices(Value* pIndices, Value* pLastIndex)
> STORE(C((uint16_t)0), pZeroIndex);
>
> // Load a SIMD of index pointers
> - for(int64_t lane = 0; lane < JM()->mVWidth; lane++)
> + for(int64_t lane = 0; lane < mVWidth; lane++)
> {
> // Calculate the address of the requested index
> Value *pIndex = GEP(pIndices, C(lane));
> @@ -925,13 +935,13 @@ void FetchJit::Shuffle8bpcGatherd(Shuffle8bpcArgs &args)
> const uint32_t (&swizzle)[4] = std::get<9>(args);
>
> // cast types
> - Type* vGatherTy = VectorType::get(IntegerType::getInt32Ty(JM()->mContext), JM()->mVWidth);
> - Type* v32x8Ty = VectorType::get(mInt8Ty, JM()->mVWidth * 4 ); // vwidth is units of 32 bits
> + Type* vGatherTy = mSimdInt32Ty;
> + Type* v32x8Ty = VectorType::get(mInt8Ty, mVWidth * 4 ); // vwidth is units of 32 bits
>
> // have to do extra work for sign extending
> if ((extendType == Instruction::CastOps::SExt) || (extendType == Instruction::CastOps::SIToFP)){
> - Type* v16x8Ty = VectorType::get(mInt8Ty, JM()->mVWidth * 2); // 8x16bit ints in a 128bit lane
> - Type* v128Ty = VectorType::get(IntegerType::getIntNTy(JM()->mContext, 128), JM()->mVWidth / 4); // vwidth is units of 32 bits
> + Type* v16x8Ty = VectorType::get(mInt8Ty, mVWidth * 2); // 8x16bit ints in a 128bit lane
> + Type* v128Ty = VectorType::get(IntegerType::getIntNTy(JM()->mContext, 128), mVWidth / 4); // vwidth is units of 32 bits
>
> // shuffle mask, including any swizzling
> const char x = (char)swizzle[0]; const char y = (char)swizzle[1];
> @@ -1138,8 +1148,8 @@ void FetchJit::Shuffle16bpcGather(Shuffle16bpcArgs &args)
> Value* (&vVertexElements)[4] = std::get<8>(args);
>
> // cast types
> - Type* vGatherTy = VectorType::get(IntegerType::getInt32Ty(JM()->mContext), JM()->mVWidth);
> - Type* v32x8Ty = VectorType::get(mInt8Ty, JM()->mVWidth * 4); // vwidth is units of 32 bits
> + Type* vGatherTy = VectorType::get(IntegerType::getInt32Ty(JM()->mContext), mVWidth);
> + Type* v32x8Ty = VectorType::get(mInt8Ty, mVWidth * 4); // vwidth is units of 32 bits
>
> // have to do extra work for sign extending
> if ((extendType == Instruction::CastOps::SExt) || (extendType == Instruction::CastOps::SIToFP)||
> @@ -1149,7 +1159,7 @@ void FetchJit::Shuffle16bpcGather(Shuffle16bpcArgs &args)
> bool bFP = (extendType == Instruction::CastOps::FPExt) ? true : false;
>
> Type* v8x16Ty = VectorType::get(mInt16Ty, 8); // 8x16bit in a 128bit lane
> - Type* v128bitTy = VectorType::get(IntegerType::getIntNTy(JM()->mContext, 128), JM()->mVWidth / 4); // vwidth is units of 32 bits
> + Type* v128bitTy = VectorType::get(IntegerType::getIntNTy(JM()->mContext, 128), mVWidth / 4); // vwidth is units of 32 bits
>
> // shuffle mask
> Value* vConstMask = C<char>({0, 1, 4, 5, 8, 9, 12, 13, 2, 3, 6, 7, 10, 11, 14, 15,
> diff --git a/src/gallium/drivers/swr/rasterizer/jitter/scripts/gen_llvm_ir_macros.py b/src/gallium/drivers/swr/rasterizer/jitter/scripts/gen_llvm_ir_macros.py
> index 1814b7c..e73b232 100644
> --- a/src/gallium/drivers/swr/rasterizer/jitter/scripts/gen_llvm_ir_macros.py
> +++ b/src/gallium/drivers/swr/rasterizer/jitter/scripts/gen_llvm_ir_macros.py
> @@ -27,7 +27,7 @@ import json as JSON
> import operator
>
> header = r"""/****************************************************************************
> -* Copyright (C) 2014-2015 Intel Corporation. All Rights Reserved.
> +* Copyright (C) 2014-2016 Intel Corporation. All Rights Reserved.
> *
> * Permission is hereby granted, free of charge, to any person obtaining a
> * copy of this software and associated documentation files (the "Software"),
> @@ -84,16 +84,16 @@ inst_aliases = {
> }
>
> intrinsics = [
> - ["VGATHERPS", "x86_avx2_gather_d_ps_256", ["src", "pBase", "indices", "mask", "scale"]],
> + ["VGATHERPS", "x86_avx2_gather_d_ps_256", ["src", "pBase", "indices", "mask", "scale"]],
> ["VGATHERDD", "x86_avx2_gather_d_d_256", ["src", "pBase", "indices", "mask", "scale"]],
> - ["VSQRTPS", "x86_avx_sqrt_ps_256", ["a"]],
> - ["VRSQRTPS", "x86_avx_rsqrt_ps_256", ["a"]],
> - ["VRCPPS", "x86_avx_rcp_ps_256", ["a"]],
> - ["VMINPS", "x86_avx_min_ps_256", ["a", "b"]],
> - ["VMAXPS", "x86_avx_max_ps_256", ["a", "b"]],
> - ["VPMINSD", "x86_avx2_pmins_d", ["a", "b"]],
> - ["VPMAXSD", "x86_avx2_pmaxs_d", ["a", "b"]],
> - ["VROUND", "x86_avx_round_ps_256", ["a", "rounding"]],
> + ["VSQRTPS", "x86_avx_sqrt_ps_256", ["a"]],
> + ["VRSQRTPS", "x86_avx_rsqrt_ps_256", ["a"]],
> + ["VRCPPS", "x86_avx_rcp_ps_256", ["a"]],
> + ["VMINPS", "x86_avx_min_ps_256", ["a", "b"]],
> + ["VMAXPS", "x86_avx_max_ps_256", ["a", "b"]],
> + ["VPMINSD", "x86_avx2_pmins_d", ["a", "b"]],
> + ["VPMAXSD", "x86_avx2_pmaxs_d", ["a", "b"]],
> + ["VROUND", "x86_avx_round_ps_256", ["a", "rounding"]],
> ["VCMPPS", "x86_avx_cmp_ps_256", ["a", "b", "cmpop"]],
> ["VBLENDVPS", "x86_avx_blendv_ps_256", ["a", "b", "mask"]],
> ["BEXTR_32", "x86_bmi_bextr_32", ["src", "control"]],
> @@ -103,6 +103,7 @@ intrinsics = [
> ["VPMOVSXBD", "x86_avx2_pmovsxbd", ["a"]], # sign extend packed 8bit components
> ["VPMOVSXWD", "x86_avx2_pmovsxwd", ["a"]], # sign extend packed 16bit components
> ["VPERMD", "x86_avx2_permd", ["idx", "a"]],
> + ["VPERMPS", "x86_avx2_permps", ["idx", "a"]],
> ["VCVTPH2PS", "x86_vcvtph2ps_256", ["a"]],
> ["VCVTPS2PH", "x86_vcvtps2ph_256", ["a", "round"]],
> ["VHSUBPS", "x86_avx_hsub_ps_256", ["a", "b"]],
> diff --git a/src/gallium/drivers/swr/rasterizer/jitter/scripts/gen_llvm_types.py b/src/gallium/drivers/swr/rasterizer/jitter/scripts/gen_llvm_types.py
> index 7bba435..0b53a92 100644
> --- a/src/gallium/drivers/swr/rasterizer/jitter/scripts/gen_llvm_types.py
> +++ b/src/gallium/drivers/swr/rasterizer/jitter/scripts/gen_llvm_types.py
> @@ -28,7 +28,7 @@ import operator
>
> header = r"""
> /****************************************************************************
> -* Copyright (C) 2014-2015 Intel Corporation. All Rights Reserved.
> +* Copyright (C) 2014-2016 Intel Corporation. All Rights Reserved.
> *
> * Permission is hereby granted, free of charge, to any person obtaining a
> * copy of this software and associated documentation files (the "Software"),
> diff --git a/src/gallium/drivers/swr/rasterizer/jitter/streamout_jit.cpp b/src/gallium/drivers/swr/rasterizer/jitter/streamout_jit.cpp
> index 6c5f22b..36baa8d 100644
> --- a/src/gallium/drivers/swr/rasterizer/jitter/streamout_jit.cpp
> +++ b/src/gallium/drivers/swr/rasterizer/jitter/streamout_jit.cpp
> @@ -293,7 +293,13 @@ struct StreamOutJit : public Builder
>
> JitManager::DumpToFile(soFunc, "SoFunc");
>
> - FunctionPassManager passes(JM()->mpCurrentModule);
> +#if HAVE_LLVM == 0x306
> + FunctionPassManager
> +#else
> + llvm::legacy::FunctionPassManager
> +#endif
> + passes(JM()->mpCurrentModule);
> +
> passes.add(createBreakCriticalEdgesPass());
> passes.add(createCFGSimplificationPass());
> passes.add(createEarlyCSEPass());
> diff --git a/src/gallium/drivers/swr/rasterizer/memory/ClearTile.cpp b/src/gallium/drivers/swr/rasterizer/memory/ClearTile.cpp
> index ad73cd8..d001cb6 100644
> --- a/src/gallium/drivers/swr/rasterizer/memory/ClearTile.cpp
> +++ b/src/gallium/drivers/swr/rasterizer/memory/ClearTile.cpp
> @@ -33,7 +33,7 @@
> #include "memory/tilingtraits.h"
> #include "memory/Convert.h"
>
> -typedef void(*PFN_STORE_TILES_CLEAR)(const FLOAT*, SWR_SURFACE_STATE*, UINT, UINT);
> +typedef void(*PFN_STORE_TILES_CLEAR)(const float*, SWR_SURFACE_STATE*, UINT, UINT);
>
> //////////////////////////////////////////////////////////////////////////
> /// Clear Raster Tile Function Tables.
> @@ -54,17 +54,17 @@ struct StoreRasterTileClear
> /// @param pDstSurface - Destination surface state
> /// @param x, y - Coordinates to raster tile.
> INLINE static void StoreClear(
> - const BYTE* dstFormattedColor,
> + const uint8_t* dstFormattedColor,
> UINT dstBytesPerPixel,
> SWR_SURFACE_STATE* pDstSurface,
> UINT x, UINT y) // (x, y) pixel coordinate to start of raster tile.
> {
> // Compute destination address for raster tile.
> - BYTE* pDstTile = (BYTE*)pDstSurface->pBaseAddress +
> + uint8_t* pDstTile = (uint8_t*)pDstSurface->pBaseAddress +
> (y * pDstSurface->pitch) + (x * dstBytesPerPixel);
>
> // start of first row
> - BYTE* pDst = pDstTile;
> + uint8_t* pDst = pDstTile;
> UINT dstBytesPerRow = 0;
>
> // For each raster tile pixel in row 0 (rx, 0)
> @@ -104,15 +104,15 @@ struct StoreMacroTileClear
> /// @param pDstSurface - Destination surface state
> /// @param x, y - Coordinates to macro tile
> static void StoreClear(
> - const FLOAT *pColor,
> + const float *pColor,
> SWR_SURFACE_STATE* pDstSurface,
> UINT x, UINT y)
> {
> UINT dstBytesPerPixel = (FormatTraits<DstFormat>::bpp / 8);
>
> - BYTE dstFormattedColor[16]; // max bpp is 128, so 16 is all we need here for one pixel
> + uint8_t dstFormattedColor[16]; // max bpp is 128, so 16 is all we need here for one pixel
>
> - FLOAT srcColor[4];
> + float srcColor[4];
>
> for (UINT comp = 0; comp < FormatTraits<DstFormat>::numComps; ++comp)
> {
> diff --git a/src/gallium/drivers/swr/rasterizer/memory/Convert.h b/src/gallium/drivers/swr/rasterizer/memory/Convert.h
> index 0f9e0ad..7c185e5 100644
> --- a/src/gallium/drivers/swr/rasterizer/memory/Convert.h
> +++ b/src/gallium/drivers/swr/rasterizer/memory/Convert.h
> @@ -227,10 +227,10 @@ static uint16_t Convert32To16Float(float val)
> /// @param srcPixel - Pointer to source pixel (pre-swizzled according to dest).
> template<SWR_FORMAT DstFormat>
> static void ConvertPixelFromFloat(
> - BYTE* pDstPixel,
> + uint8_t* pDstPixel,
> const float srcPixel[4])
> {
> - UINT outColor[4]; // typeless bits
> + uint32_t outColor[4] = { 0 }; // typeless bits
>
> // Store component
> for (UINT comp = 0; comp < FormatTraits<DstFormat>::numComps; ++comp)
> @@ -390,9 +390,9 @@ static void ConvertPixelFromFloat(
> template<SWR_FORMAT SrcFormat>
> INLINE static void ConvertPixelToFloat(
> float dstPixel[4],
> - const BYTE* pSrc)
> + const uint8_t* pSrc)
> {
> - UINT srcColor[4]; // typeless bits
> + uint32_t srcColor[4]; // typeless bits
>
> // unpack src pixel
> typename FormatTraits<SrcFormat>::FormatT* pPixel = (typename FormatTraits<SrcFormat>::FormatT*)pSrc;
> @@ -421,11 +421,11 @@ INLINE static void ConvertPixelToFloat(
> }
>
> // Convert components
> - for (UINT comp = 0; comp < FormatTraits<SrcFormat>::numComps; ++comp)
> + for (uint32_t comp = 0; comp < FormatTraits<SrcFormat>::numComps; ++comp)
> {
> SWR_TYPE type = FormatTraits<SrcFormat>::GetType(comp);
>
> - UINT src = srcColor[comp];
> + uint32_t src = srcColor[comp];
>
> switch (type)
> {
> @@ -486,7 +486,7 @@ INLINE static void ConvertPixelToFloat(
> }
> case SWR_TYPE_UINT:
> {
> - UINT dst = (UINT)src;
> + uint32_t dst = (uint32_t)src;
> dstPixel[FormatTraits<SrcFormat>::swizzle(comp)] = *(float*)&dst;
> break;
> }
> diff --git a/src/gallium/drivers/swr/rasterizer/memory/tilingtraits.h b/src/gallium/drivers/swr/rasterizer/memory/tilingtraits.h
> index 50f8e57..381ac89 100644
> --- a/src/gallium/drivers/swr/rasterizer/memory/tilingtraits.h
> +++ b/src/gallium/drivers/swr/rasterizer/memory/tilingtraits.h
> @@ -28,6 +28,7 @@
> #pragma once
>
> #include "core/state.h"
> +#include "common/simdintrin.h"
>
> template<SWR_TILE_MODE mode, int>
> struct TilingTraits
> @@ -130,63 +131,6 @@ template<int X> struct TilingTraits <SWR_TILE_MODE_WMAJOR, X>
> static UINT GetPdepY() { return 0x1ea; }
> };
>
> -INLINE
> -UINT pdep_u32(UINT a, UINT mask)
> -{
> -#if KNOB_ARCH==KNOB_ARCH_AVX2
> - return _pdep_u32(a, mask);
> -#else
> - UINT result = 0;
> -
> - // copied from http://wm.ite.pl/articles/pdep-soft-emu.html
> - // using bsf instead of funky loop
> - DWORD maskIndex;
> - while (_BitScanForward(&maskIndex, mask))
> - {
> - // 1. isolate lowest set bit of mask
> - const UINT lowest = 1 << maskIndex;
> -
> - // 2. populate LSB from src
> - const UINT LSB = (UINT)((int)(a << 31) >> 31);
> -
> - // 3. copy bit from mask
> - result |= LSB & lowest;
> -
> - // 4. clear lowest bit
> - mask &= ~lowest;
> -
> - // 5. prepare for next iteration
> - a >>= 1;
> - }
> -
> - return result;
> -#endif
> -}
> -
> -INLINE
> -UINT pext_u32(UINT a, UINT mask)
> -{
> -#if KNOB_ARCH==KNOB_ARCH_AVX2
> - return _pext_u32(a, mask);
> -#else
> - UINT result = 0;
> - DWORD maskIndex;
> - uint32_t currentBit = 0;
> - while (_BitScanForward(&maskIndex, mask))
> - {
> - // 1. isolate lowest set bit of mask
> - const UINT lowest = 1 << maskIndex;
> -
> - // 2. copy bit from mask
> - result |= ((a & lowest) > 0) << currentBit++;
> -
> - // 3. clear lowest bit
> - mask &= ~lowest;
> - }
> - return result;
> -#endif
> -}
> -
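Presumably these moved into common/simdintrin.h along with the new
include above rather than being dropped outright -- worth confirming,
since the pdep-based tile-address math below still depends on them. For
anyone reading the emulation for the first time, pdep deposits
successive low bits of 'a' into the set bit positions of 'mask':

    #include <cassert>
    /* mask 0b11010 has bits set at positions 1, 3, 4;
       a = 0b101 deposits 1, 0, 1 into them (LSB first) */
    assert(pdep_u32(0x5, 0x1A) == 0x12);   /* 0b10010 */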
> //////////////////////////////////////////////////////////////////////////
> /// @brief Computes the tileID for 2D tiled surfaces
> /// @param pitch - surface pitch in bytes
> diff --git a/src/gallium/drivers/swr/rasterizer/scripts/gen_knobs.py b/src/gallium/drivers/swr/rasterizer/scripts/gen_knobs.py
> index 44ab698..3d003fb 100644
> --- a/src/gallium/drivers/swr/rasterizer/scripts/gen_knobs.py
> +++ b/src/gallium/drivers/swr/rasterizer/scripts/gen_knobs.py
> @@ -1,4 +1,4 @@
> -# Copyright (C) 2014-2015 Intel Corporation. All Rights Reserved.
> +# Copyright (C) 2014-2016 Intel Corporation. All Rights Reserved.
> #
> # Permission is hereby granted, free of charge, to any person obtaining a
> # copy of this software and associated documentation files (the "Software"),
> diff --git a/src/gallium/drivers/swr/rasterizer/scripts/knob_defs.py b/src/gallium/drivers/swr/rasterizer/scripts/knob_defs.py
> index 8c51e1e..0f3ded6 100644
> --- a/src/gallium/drivers/swr/rasterizer/scripts/knob_defs.py
> +++ b/src/gallium/drivers/swr/rasterizer/scripts/knob_defs.py
> @@ -1,4 +1,4 @@
> -# Copyright (C) 2014-2015 Intel Corporation. All Rights Reserved.
> +# Copyright (C) 2014-2016 Intel Corporation. All Rights Reserved.
> #
> # Permission is hereby granted, free of charge, to any person obtaining a
> # copy of this software and associated documentation files (the "Software"),
> @@ -21,24 +21,20 @@
>
> # Python source
> KNOBS = [
> - ['ENABLE_ASSERT_DIALOGS', {
> - 'type' : 'bool',
> - 'default' : 'true',
> - 'desc' : ['Use dialogs when asserts fire.',
> - 'Asserts are only enabled in debug builds'],
> - }],
>
> ['SINGLE_THREADED', {
> 'type' : 'bool',
> 'default' : 'false',
> 'desc' : ['If enabled will perform all rendering on the API thread.',
> 'This is useful mainly for debugging purposes.'],
> + 'category' : 'debug',
> }],
>
> ['DUMP_SHADER_IR', {
> - 'type' : 'bool',
> - 'default' : 'false',
> - 'desc' : ['Dumps shader LLVM IR at various stages of jit compilation.'],
> + 'type' : 'bool',
> + 'default' : 'false',
> + 'desc' : ['Dumps shader LLVM IR at various stages of jit compilation.'],
> + 'category' : 'debug',
> }],
>
> ['USE_GENERIC_STORETILE', {
> @@ -46,6 +42,7 @@ KNOBS = [
> 'default' : 'false',
> 'desc' : ['Always use generic function for performing StoreTile.',
> 'Will be slightly slower than using optimized (jitted) path'],
> + 'category' : 'debug',
> }],
>
> ['FAST_CLEAR', {
> @@ -53,6 +50,7 @@ KNOBS = [
> 'default' : 'true',
> 'desc' : ['Replace 3D primitive execute with a SWRClearRT operation and',
> 'defer clear execution to first backend op on hottile, or hottile store'],
> + 'category' : 'perf',
> }],
>
> ['MAX_NUMA_NODES', {
> @@ -61,6 +59,7 @@ KNOBS = [
> 'desc' : ['Maximum # of NUMA-nodes per system used for worker threads',
> ' 0 == ALL NUMA-nodes in the system',
> ' N == Use at most N NUMA-nodes for rendering'],
> + 'category' : 'perf',
> }],
>
> ['MAX_CORES_PER_NUMA_NODE', {
> @@ -69,6 +68,7 @@ KNOBS = [
> 'desc' : ['Maximum # of cores per NUMA-node used for worker threads.',
> ' 0 == ALL non-API thread cores per NUMA-node',
> ' N == Use at most N cores per NUMA-node'],
> + 'category' : 'perf',
> }],
>
> ['MAX_THREADS_PER_CORE', {
> @@ -77,6 +77,7 @@ KNOBS = [
> 'desc' : ['Maximum # of (hyper)threads per physical core used for worker threads.',
> ' 0 == ALL hyper-threads per core',
> ' N == Use at most N hyper-threads per physical core'],
> + 'category' : 'perf',
> }],
>
> ['MAX_WORKER_THREADS', {
> @@ -87,6 +88,7 @@ KNOBS = [
> 'IMPORTANT: If this is non-zero, no worker threads will be bound to',
> 'specific HW threads. They will all be "floating" SW threads.',
> 'In this case, the above 3 KNOBS will be ignored.'],
> + 'category' : 'perf',
> }],
>
> ['BUCKETS_START_FRAME', {
> @@ -96,6 +98,7 @@ KNOBS = [
> '',
> 'NOTE: KNOB_ENABLE_RDTSC must be enabled in core/knobs.h',
> 'for this to have an effect.'],
> + 'category' : 'perf',
> }],
>
> ['BUCKETS_END_FRAME', {
> @@ -105,6 +108,7 @@ KNOBS = [
> '',
> 'NOTE: KNOB_ENABLE_RDTSC must be enabled in core/knobs.h',
> 'for this to have an effect.'],
> + 'category' : 'perf',
> }],
>
> ['WORKER_SPIN_LOOP_COUNT', {
> @@ -112,46 +116,32 @@ KNOBS = [
> 'default' : '5000',
> 'desc' : ['Number of spin-loop iterations worker threads will perform',
> 'before going to sleep when waiting for work'],
> + 'category' : 'perf',
> }],
>
> ['MAX_DRAWS_IN_FLIGHT', {
> 'type' : 'uint32_t',
> - 'default' : '160',
> + 'default' : '96',
> 'desc' : ['Maximum number of draws outstanding before API thread blocks.'],
> + 'category' : 'perf',
> }],
>
> ['MAX_PRIMS_PER_DRAW', {
> - 'type' : 'uint32_t',
> - 'default' : '2040',
> - 'desc' : ['Maximum primitives in a single Draw().',
> + 'type' : 'uint32_t',
> + 'default' : '2040',
> + 'desc' : ['Maximum primitives in a single Draw().',
> 'Larger primitives are split into smaller Draw calls.',
> 'Should be a multiple of (3 * vectorWidth).'],
> + 'category' : 'perf',
> }],
>
> ['MAX_TESS_PRIMS_PER_DRAW', {
> - 'type' : 'uint32_t',
> - 'default' : '16',
> - 'desc' : ['Maximum primitives in a single Draw() with tessellation enabled.',
> + 'type' : 'uint32_t',
> + 'default' : '16',
> + 'desc' : ['Maximum primitives in a single Draw() with tessellation enabled.',
> 'Larger primitives are split into smaller Draw calls.',
> 'Should be a multiple of (vectorWidth).'],
> - }],
> -
> - ['MAX_FRAC_ODD_TESS_FACTOR', {
> - 'type' : 'float',
> - 'default' : '63.0f',
> - 'desc' : ['(DEBUG) Maximum tessellation factor for fractional-odd partitioning.'],
> - }],
> -
> - ['MAX_FRAC_EVEN_TESS_FACTOR', {
> - 'type' : 'float',
> - 'default' : '64.0f',
> - 'desc' : ['(DEBUG) Maximum tessellation factor for fractional-even partitioning.'],
> - }],
> -
> - ['MAX_INTEGER_TESS_FACTOR', {
> - 'type' : 'uint32_t',
> - 'default' : '64',
> - 'desc' : ['(DEBUG) Maximum tessellation factor for integer partitioning.'],
> + 'category' : 'perf',
> }],
>
>
> @@ -159,12 +149,14 @@ KNOBS = [
> 'type' : 'bool',
> 'default' : 'false',
> 'desc' : ['Enable threadviz output.'],
> + 'category' : 'perf',
> }],
>
> ['TOSS_DRAW', {
> 'type' : 'bool',
> 'default' : 'false',
> 'desc' : ['Disable per-draw/dispatch execution'],
> + 'category' : 'perf',
> }],
>
> ['TOSS_QUEUE_FE', {
> @@ -173,6 +165,7 @@ KNOBS = [
> 'desc' : ['Stop per-draw execution at worker FE',
> '',
> 'NOTE: Requires KNOB_ENABLE_TOSS_POINTS to be enabled in core/knobs.h'],
> + 'category' : 'perf',
> }],
>
> ['TOSS_FETCH', {
> @@ -181,6 +174,7 @@ KNOBS = [
> 'desc' : ['Stop per-draw execution at vertex fetch',
> '',
> 'NOTE: Requires KNOB_ENABLE_TOSS_POINTS to be enabled in core/knobs.h'],
> + 'category' : 'perf',
> }],
>
> ['TOSS_IA', {
> @@ -189,6 +183,7 @@ KNOBS = [
> 'desc' : ['Stop per-draw execution at input assembler',
> '',
> 'NOTE: Requires KNOB_ENABLE_TOSS_POINTS to be enabled in core/knobs.h'],
> + 'category' : 'perf',
> }],
>
> ['TOSS_VS', {
> @@ -197,6 +192,7 @@ KNOBS = [
> 'desc' : ['Stop per-draw execution at vertex shader',
> '',
> 'NOTE: Requires KNOB_ENABLE_TOSS_POINTS to be enabled in core/knobs.h'],
> + 'category' : 'perf',
> }],
>
> ['TOSS_SETUP_TRIS', {
> @@ -205,6 +201,7 @@ KNOBS = [
> 'desc' : ['Stop per-draw execution at primitive setup',
> '',
> 'NOTE: Requires KNOB_ENABLE_TOSS_POINTS to be enabled in core/knobs.h'],
> + 'category' : 'perf',
> }],
>
> ['TOSS_BIN_TRIS', {
> @@ -213,6 +210,7 @@ KNOBS = [
> 'desc' : ['Stop per-draw execution at primitive binning',
> '',
> 'NOTE: Requires KNOB_ENABLE_TOSS_POINTS to be enabled in core/knobs.h'],
> + 'category' : 'perf',
> }],
>
> ['TOSS_RS', {
> @@ -221,6 +219,5 @@ KNOBS = [
> 'desc' : ['Stop per-draw execution at rasterizer',
> '',
> 'NOTE: Requires KNOB_ENABLE_TOSS_POINTS to be enabled in core/knobs.h'],
> - }],
> -
> -]
> + 'category' : 'perf',
> + }],]
> diff --git a/src/gallium/drivers/swr/rasterizer/scripts/templates/knobs.template b/src/gallium/drivers/swr/rasterizer/scripts/templates/knobs.template
> index 922117e..521346c 100644
> --- a/src/gallium/drivers/swr/rasterizer/scripts/templates/knobs.template
> +++ b/src/gallium/drivers/swr/rasterizer/scripts/templates/knobs.template
> @@ -10,7 +10,7 @@
> return ' '*(max_len - knob_len)
> %>/******************************************************************************
> *
> -* Copyright 2015
> +* Copyright 2015-2016
> * Intel Corporation
> *
> * Licensed under the Apache License, Version 2.0 (the "License");
> @@ -77,7 +77,11 @@ struct GlobalKnobs
> % for line in knob[1]['desc']:
> // ${line}
> % endfor
> + % if knob[1]['type'] == 'std::string':
> + DEFINE_KNOB(${knob[0]}, ${knob[1]['type']}, "${repr(knob[1]['default'])[1:-1]}");
> + % else:
> DEFINE_KNOB(${knob[0]}, ${knob[1]['type']}, ${knob[1]['default']});
> + % endif
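One thing to double-check in the std::string branch: repr(...)[1:-1]
escapes backslashes, but Python's repr prefers single quotes, so a
default value containing a double quote would pass through unescaped
and break the emitted C string literal. Fine for the current knob
defaults; a comment (or an explicit escape) might save someone a
confusing build error later.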
>
> % endfor
> GlobalKnobs();
> @@ -125,7 +129,7 @@ std::string GlobalKnobs::ToString(const char* optPerLinePrefix)
> str << optPerLinePrefix << "KNOB_${knob[0]}:${space_knob(knob[0])}";
> % if knob[1]['type'] == 'bool':
> str << (KNOB_${knob[0]} ? "+\n" : "-\n");
> - % elif knob[1]['type'] != 'float':
> + % elif knob[1]['type'] != 'float' and knob[1]['type'] != 'std::string':
> str << std::hex << std::setw(11) << std::left << KNOB_${knob[0]};
> str << std::dec << KNOB_${knob[0]} << "\n";
> % else:
> diff --git a/src/gallium/drivers/swr/swr_context.cpp b/src/gallium/drivers/swr/swr_context.cpp
> index 78b8fdf..46c79a1 100644
> --- a/src/gallium/drivers/swr/swr_context.cpp
> +++ b/src/gallium/drivers/swr/swr_context.cpp
> @@ -338,7 +338,6 @@ swr_create_context(struct pipe_screen *p_screen, void *priv, unsigned flags)
> SWR_CREATECONTEXT_INFO createInfo;
> createInfo.driver = GL;
> createInfo.privateStateSize = sizeof(swr_draw_context);
> - createInfo.maxSubContexts = 0;
> createInfo.pfnLoadTile = swr_LoadHotTile;
> createInfo.pfnStoreTile = swr_StoreHotTile;
> createInfo.pfnClearTile = swr_StoreHotTileClear;
> --
> 1.9.1
>