YAGU – Yet another global update

– Ps3/Cell: I eventually fixed binutils so that it builds for spu-elf. I’m about to unbreak gcc properly, since someone thought NOTE_INSN_EXPECTED_VALUE wouldn’t be of any use, while spu.md is using it for the cmp instructions… (needless to say my workaround isn’t working that well…). Tomorrow, hopefully, I’ll update the patchset either with a revert of that patch or with a proper solution (hard, since I’m not that proficient in gcc internals =/). I’m afraid of glibc…

– Fenice/lscube: trac + git is a no go. Trac itself, or setuptools on the Fedora server that does the streaming, ranges from bad to idiotic, or maybe it’s just me not being proficient. Add that gitweb seems unavailable on Fedora and you get an interesting picture. Importing is easy, working with it almost is (once everybody figures out the commands), so there is some uncertainty about a full transition.

– Personal life: You can use anything that happens to you to improve; still, I’m afraid I won’t develop the ability to teleport and/or timeshift in order to improve the situation. Anyway, I’m feeling better.

– University: I’m forced to do something in Mono for an exam, and obviously I have about 30h to learn enough gtk# to do that. Luckily glade is always glade…

gcc vs glibc vs ppc64

I eventually got a shell on a power5 box to test some stuff.

I’m trying to discover what’s broken in glibc and/or gcc when you try to build for ppc64. Currently busybox has some problems with setjmp symbols if you try to build it static; while trying snapshots I ended up with other issues, namely the gcc-4.2 snapshot not building glibc-2.5, and that looks like yet another pain =/

Luckily the box is fast enough that building and rebuilding glibc doesn’t take that long =)

One thing there

After making some sense of memcpy and h264 (ok, my sample was short enough that optimizations applying only to codec init looked relevant, so they were meaningless), I eventually got something in that seems relevant enough and took very few lines: I enabled prefetch.

It is pretty much a single asm line, and in certain cases it shaved 10% off the overall decoding time. Earlier I had tried the altivec prefetch and it didn’t show a great result, so I just removed it; 2 days ago I implemented it with the generic instruction and the result was pleasant enough.

If you happen to have non-G4 systems, please try to benchmark mpegvideo and h264 decoding for me and report the results; the commit revision is 6669.

Hopefully I’ll try to provide a snapshot for gentoo this weekend.
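To give an idea of what a one-line prefetch hint looks like, here is a minimal sketch. It is not the ffmpeg commit: the function `sum_block_prefetch` and its parameters are made up for illustration, and it uses the compiler builtin `__builtin_prefetch` (GCC and clang) instead of inline asm. As far as I know gcc turns the builtin into a data cache touch instruction on ppc, but take that as an assumption.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: sums a block of `height` rows of `width` bytes,
 * where consecutive rows are `stride` bytes apart (the usual layout of
 * a video plane). */
uint32_t sum_block_prefetch(const uint8_t *src, ptrdiff_t stride,
                            int width, int height)
{
    uint32_t sum = 0;

    for (int y = 0; y < height; y++) {
        const uint8_t *row = src + (ptrdiff_t)y * stride;

        /* The "single line": hint that the next row will be read soon.
         * Arguments: address, 0 = read access, 3 = keep in all cache
         * levels. It is only a hint and never changes the result. */
        if (y + 1 < height)
            __builtin_prefetch(row + stride, 0, 3);

        for (int x = 0; x < width; x++)
            sum += row[x];
    }
    return sum;
}
```

Whether the hint helps depends on the access pattern and on the cpu, which is why benchmarks on different machines matter.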

ffmpeg, what’s missing?

Ok, the title is misleading on purpose. As you can see from the previous post, I got some requests about ffmpeg+ppc (power, cell, plain ppc). In the case of h264, I’m afraid all the useful bits are already vectorized, and the little that is left will be useful but isn’t really top priority (obviously I’ll try to be on par with the i386 optimizations, so the weight and loop filter funcs will get their respective altivec versions sooner or later). Other codecs are missing a lot vectorization-wise, say the vp{3,5,6} family, which many could like/need because of flash embedded videos; and some quick asm bits could be quite useful for our lil embedded ppcs, the same way they are already implemented and useful for arm.

My plan for the next week is to keep reordering code and putting it back in arch-specific dirs so it can be implemented in a more agile way (see what I did for the mathops or what I’m about to try for the bitstream read/write functions). Hopefully I’ll complete and commit some altivec optimizations like the mdct (even if I should check whether altivec makes a difference there or not), the vp idct variant or the latest h264 bits.

I’ll be unavailable for the weekend, see you Monday =)
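The idea of arch-specific overrides for small math helpers can be sketched like this. The names `MULH_SKETCH` and `mulh_c` are hypothetical and this is not the actual ffmpeg header: an arch-specific file would define the macro first (for example with a line of inline asm), and the generic C version is only used when nothing has claimed it.

```c
#include <stdint.h>

/* An arch-specific header, included before this point, could define
 * MULH_SKETCH itself. On ppc a single mulhw instruction returns the
 * high 32 bits of a signed 32x32 multiply, so it is a natural
 * candidate for such an override. */
#ifndef MULH_SKETCH
/* Generic fallback: high 32 bits of the 64-bit product of a and b. */
static inline int mulh_c(int a, int b)
{
    return (int)(((int64_t)a * (int64_t)b) >> 32);
}
#define MULH_SKETCH mulh_c
#endif
```

The callers only ever use the macro, so moving an implementation into or out of an arch dir doesn’t touch the codec code.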

oprofiling ffh264

Recently I got some inquiries about h264 and altivec. Just testing the decode time was disappointing for one user.

I did my own test, and on my G4 1.6 I got about double the speed he experienced on his G5 2.4.

time nice --20 ./ffmpeg -i ~ryan/bluesky_HD_CAVLC_JM93_217f.264 -f rawvideo - > /dev/null
real 0m47.685s
user 0m44.304s
sys 0m3.220s

cat /proc/cpuinfo
processor : 0
cpu : ppc970, altivec supported
clock : 2400.000000MHz
revision : 4.0 (pvr 0070 0400)

time nice --20 ./ffmpeg -i /tmp/bluesky_HD_CAVLC_JM93_217f.264 -f rawvideo - > /dev/null
real 0m25.877s
user 0m23.768s
sys 0m1.904s

cat /proc/cpuinfo
processor : 0
cpu : 7447A, altivec supported
clock : 1666.666000MHz
revision : 0.5 (pvr 8003 0105)

The ffmpeg code is the same, and I hadn’t used anything but the stock cflags, same for him.
I was expecting quite a different result, time to hunt the slow gear!
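To put the two runs side by side, here is a small sketch of the arithmetic. The helper names `speedup` and `mcycles` are made up for illustration; the inputs are just the `real` times and the clock values from the two listings above.

```c
#include <assert.h>

/* How many times faster the quicker run was, from wall-clock seconds. */
double speedup(double slow_seconds, double fast_seconds)
{
    assert(slow_seconds > 0.0 && fast_seconds > 0.0);
    return slow_seconds / fast_seconds;
}

/* Rough number of clock cycles spent, in millions:
 * clock in MHz times elapsed seconds. */
double mcycles(double clock_mhz, double seconds)
{
    return clock_mhz * seconds;
}
```

With the numbers above, `speedup(47.685, 25.877)` comes out at about 1.84, and `mcycles(2400.0, 47.685) / mcycles(1666.666, 25.877)` at roughly 2.65: the G5 burned well over twice the cycles for the same job.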

I used oprofile

I just started and stopped it around the ffmpeg call, and then asked opreport to compute some statistics about symbols.

an excerpt

CPU: PowerPC G4, speed 1666.67 MHz (estimated)
Counted CYCLES events (Cycles) with a unit mask of 0x00 (No unit mask) count 100000
samples  %        image name   symbol name
60355    23.2602  libc-2.4.so  _wordcopy_fwd_aligned
13572    5.2305   ffmpeg_g     put_h264_chroma_mc8_altivec
13417    5.1708   ffmpeg_g     filter_mb
11379    4.3853   ffmpeg_g     put_h264_qpel16_h_lowpass_altivec
9700     3.7383   ffmpeg_g     fill_caches
9332     3.5965   ffmpeg_g     hl_decode_mb
8201     3.1606   vmlinux      __flush_dcache_icache

Looks like I’ll have to replace something… or start thinking about optimized glibc…
(mine is built targeting my cpu and is pretty recent, I wonder if the G5 isn’t running on an older or generic built glibc…)
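For context, `_wordcopy_fwd_aligned` is, as far as I know, part of the generic C word-copy code that glibc’s generic memcpy falls back on. The following is only a sketch of that kind of loop, not glibc’s code; the name `wordcopy_fwd_sketch` and the unroll factor are assumptions made for illustration.

```c
#include <stddef.h>

/* Copies nwords machine words forward from src to dst. Both pointers
 * are assumed to be word aligned and the regions must not overlap. */
void wordcopy_fwd_sketch(unsigned long *dst, const unsigned long *src,
                         size_t nwords)
{
    /* Main loop, unrolled four words at a time. */
    while (nwords >= 4) {
        dst[0] = src[0];
        dst[1] = src[1];
        dst[2] = src[2];
        dst[3] = src[3];
        dst += 4;
        src += 4;
        nwords -= 4;
    }
    /* Leftover words. */
    while (nwords > 0) {
        *dst++ = *src++;
        nwords--;
    }
}
```

A loop like this moves one general purpose register per load/store, which is where a copy routine tuned for the cpu could do better.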

Poking in Mesa…

Yesterday I had a look at the current mesa sources…

And I found out that x86 and amd64 had plenty of optimized code, mostly hand made assembly!
What about ppc you’d ask? Well, there is an empty dir with the code to know if the cpu has altivec or not =_=…
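Just to show how little such a check needs, here is a sketch of runtime altivec detection on Linux. This is not Mesa’s code: the function name `cpu_has_altivec` is made up, and it relies on glibc’s `getauxval`, which newer glibc versions provide. On anything that isn’t Linux/ppc it simply reports 0.

```c
#if defined(__linux__) && (defined(__powerpc__) || defined(__powerpc64__))
#include <sys/auxv.h>

/* Normally provided by the system headers; the fallback value is the
 * altivec bit of the hardware capability word on Linux/ppc. */
#ifndef PPC_FEATURE_HAS_ALTIVEC
#define PPC_FEATURE_HAS_ALTIVEC 0x10000000
#endif

/* Returns 1 if the kernel reports altivec support, 0 otherwise. */
int cpu_has_altivec(void)
{
    return (getauxval(AT_HWCAP) & PPC_FEATURE_HAS_ALTIVEC) != 0;
}
#else
/* Not Linux/ppc: no altivec to speak of. */
int cpu_has_altivec(void)
{
    return 0;
}
#endif
```

Detection is the easy part; the optimized routines it would select are what is missing.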

I should study for my last exams, so I won’t do much about it in the short term =/