Friday 16 August 2013

Dota 2 Wine optimization for Intel GPUs

Dota 2 for Linux implements it's 3D engine by using a Direct3D to OpenGL translation layer called ToGL. I assume that this layer can be used in different ways, but for Dota 2 it seems to be used in a less than ideal way as documented previously here. In short, Dota for Linux compiles 11000 shaders on startup, compared to just 220 the Wine version does. This causes much higher memory usage (1.2 GB vs 2.6 GB) and start-up time (35 seconds vs 1:15 min).


With Wine we actually do get the source of their Direct3D to OpenGL layer called wined3d, since Wine is open source. It's funny, the stack used to run Windows version of Dota 2 is actually more open.

Since Dota 2 for Windows when run on Wine actually outperforms native Linux version in some important aspects, and it's framerate is just slightly less, I decided to take a look on improving its performance.

I've used a tool called apitrace to record a trace of a Dota 2 session with wine so I can analyze the OpenGL calls and look at driver performance warnings (INTEL_DEBUG=perf) with qapitrace.

I optimized two things:

1. Reduce the number of vs and ps constants checked

There were many calls to check values of VS (vertex shader) and PS (pixel shader, also called fragment shaders in OpenGL) constants each frame like this:

532550 glGetUniformLocationARB(programObj = 152, name = "vs_c[4095]") = -1

This function is called from shader_glsl_init_vs_uniform_locations() in glsl_shader.c in
wined3d.

It uses GL_MAX_VERTEX_UNIFORM_COMPONENTS_ARB, defined to be 4096 in #define MAX_UNIFORMS in Mesa source.

Dota 2 doesn't need so many uniforms, most checks return -1, and wined3d checks all of values for both VS
and PS uniforms.

I reduced this number to 256, just enough for Dota 2. This saved thousands of calls per frame.

2. Use fast clear depth more often

Intel driver complains about not being able to use fast depth clears because of scissor being enabled. Turns out that device_clear_render_targets() in wined3d device.c doesn't really need to do glScissor for Dota 2, it's probably an optimization that maps better to Direct3D driver.


A small patch including both optimizations is here:
https://gist.github.com/vrodic/6437312

This patch is a hack, and glScissor part probably breaks other apps, so this is just for Dota 2. It maybe could be made in a better way so it could be merged in Wine, but I'm not wined3d expert.

So how faster is it? A solo mid hero on a setup described in the previous blog post used to get 41 FPS. Now it gets 46-49 FPS. Native version is similar to optimized Wine, but in some situations it gets worse than Wine optimized.

Ideas for improvement:

Dota 2 for Linux needs  ~7500 calls per frame. Wine version, even after my optimizations needs 37000 (EDIT: just as I was writing this post, there were some improvements, now its about 22000).

There is probably a way to optimize this even more, but it's outside of the scope of an afternoon project,  like this was. I'd like to keep on digging though.

Wednesday 14 August 2013

Dota 2 performance: Linux/native vs Linux/Wine vs Windows 7 on Intel GPU

So how well does the mega popular game Dota 2 work on Linux? I've had some time to make detailed tests on my Intel IvyBridge GPU laptop (Lenovo ThinkPad X230). The graphics settings are the same on all versions.


Dota 2 Windows binary under Wine 1.6
Startup: 35 secs (over ntfs3g userspace fs driver that is not that fast)
Mem: 1.0GB
FPS: 37.5 FPS
Power usage LP mode (patched wine): 34W

Dota 2 Linux native
Startup: 1 min 14 secs (native ext4)
Mem: 2.6 GB
FPS: 40 FPS

Dota 2 Windows native - Windows 7
Startup: 25 secs
Mem: 1.2GB (measured by Windows Task manager)
FPS: 80 FPS
Power usage LP mode: 24W

Test setup:

CPU: Core i5 3320M

Resolution: 1366x768

GPU settings (same on all Dota 2 versions): shadows MEDIUM, textures HIGH, render quality: HIGHEST, all other: OFF, vsync: disabled

GPU settings LP mode (for power measurments above): Shadowd low, effects OFF, textures MED, render quality: LOWEST, fpx_max: 30

Mesa version: git-8b5b5fe (with rendering regression fix from here: https://bugs.freedesktop.org/show_bug.cgi?id=67887)

Linux distro: Ubuntu 13.10, kernel 3.11 drm-intel-nightly, running LXDE

FPS Benchmark method:
in "dota 2 beta/dota/cfg/autoexec.cfg"
cl_showfps 2
playdemo test

For FPS measurement number: look at the last 240 frames when the demo is ending

Memory measuring: RES column with `top`
Startup time measuring: stopwatch until the map is loaded

Analysis of the apitrace trace file:

I've made a trace of Dota 2 with apitrace, revealing possible performance issues.

Before the first frame of the game is drawn 11038 shaders are compiled. That is most likely why the load time is so slow and memory usage is so high. In addition a lot of the shaders being used seem to be recompiled by the Intel driver when rendering frames.

There are 162 frames in the trace I've analyzed, 193 shader recompiles, and 643 different shader programs (each program has 1 VS and 1 FS) used.

In contrast, Wine version of Dota 2 compiles only 220 shaders.

Performance feedback from the Intel driver:

glretrace of apitrace prints driver performance warnings. A sample of some that repeat every frame. These include shader recompile warnings.

575332: glDebugOutputCallback: Medium severity API performance issue 13, Clear color unsupported by fast color clear. Falling back to slow clear.
576094: glDebugOutputCallback: Medium severity API performance issue 14, Failed to fast clear depth due to scissor being enabled. Possible 5% performance win if avoided.
577739: glDebugOutputCallback: Medium severity API performance issue 4, Using a blit copy to avoid stalling on 480b glBufferSubData() to a busy buffer object.
577801: glDebugOutputCallback: Medium severity API performance issue 8, Recompiling vertex shader for program 7901
577801: glDebugOutputCallback: Medium severity API performance issue 9, Didn't find previous compile in the shader cache for debug
577801: glDebugOutputCallback: Medium severity API performance issue 1, Recompiling fragment shader for program 7901
577801: glDebugOutputCallback: Medium severity API performance issue 10, Didn't find previous compile in the shader cache for debug
577801: glDebugOutputCallback: Medium severity API performance issue 3, FS compile took 2.266 ms and stalled the GPU

Warnings in sequence in URL below:
https://gist.github.com/vrodic/6235313

It's interesting that most of this warnings were added by request of Valve back in 2012:
http://lists.freedesktop.org/archives/mesa-dev/2012-August/025288.html




Conclusion:

If you have a memory constrained machine and want to run under Linux, maybe using Wine is a better choice.

I hope Valve cares enough about Linux to fix what they can on their side and work with folks from Intel to fix their performance problems.

You will probably be luckier if you run it on nVidia GPU, since people in general are claiming performance very similar to Windows. Though probably still with slower startup times and higher mem usage.

Some info on how to compile Mesa 32 bit on 64bit Ubuntu:

Unfortunately this doesn't work with old Ubuntu versions, just with 13.10. For old version you must also remove 64 bit versions of the compiler, which was a bit too much of a requirement for me.

apt-get install gcc-multilib g++-multilib binutils-multiarch libx11-dev:i386 libdrm-dev:i386 

and as configure complains install others

apt-get build-dep mesa -ai386 

wants to install too much and remove some 64 bit stuff I need, and we don't actually need llvm i386 dev stuff for  compiling just the Intel driver.

I use this script to compile just the i965 driver for mesa:

sh autogen.sh
make clean
export CFLAGS="-m32 -O3 -mtune=native -march=native -fstack-protector --param=ssp-buffer-size=4 -Wformat -Werror=format-security"
export CXXFLAGS="-m32 -O3 -mtune=native -march=native -fstack-protector --param=ssp-buffer-size=4 -Wformat -Werror=format-security"
export PKG_CONFIG_PATH=/usr/lib/i386-linux-gnu/pkgconfig
./configure --disable-egl --enable-glx-tls --with-gallium-drivers= --with-dri-drivers=i965 --enable-32-bit
make -j4