Gentoo-Stats – Marius Mauch

Time to say goodbye

So, time has come for me to realize that my time with Gentoo is over. I
haven’t actually been doing much Gentoo work over the last months due
to personal reasons (nothing Gentoo related), and I don’t see that
situation changing in the near future. In fact I’ve already reassigned
or dropped most of my responsibilities in Gentoo a while ago, so there
are just a few pet projects left to give away:
– my gentoo-stats project (in the portage/gentoo-stats svn repository).
I know quite a few people are interested in the idea of collecting
various statistic data from gentoo user systems, and I’d encourage
everyone who wants to implement such a system to at least look at it (I
might even have finished it if I hadn’t wasted my time focusing on
the wrong problems). There is also quite a bit of documentation that
should help you get started.
– a graphical security update tool (see bug #190397)

So if anyone wants to adopt those, complete or just parts, just take
them. As for Portage, Zac has practically already filled my role.

So I guess that wraps it up. It’s been a nice ride most of the time,
but now it’s time for me to leave the Gentoo train.

Gentoo-Stats isn’t dead (yet)

I assume some of you have been wondering what has happened to my gentoo-stats project, as there hasn’t been any news or updates recently. Well, unfortunately there isn’t much going on; I guess I’ve just been a bit too frustrated with it to work on it in the last weeks/months. That frustration mainly comes from the package-filemap module with its crappy performance, and from the conceptual failure of the auth encryption I had planned/implemented. The latter is frustrating simply due to the wasted time, but the former means the lack of a key feature, namely finding which packages provide a given file even for uninstalled packages. I’ve already tried several things to make it faster, but without real success so far.
Now I have two more ideas for how to still get it working: the first is to simply reduce the amount of data to the bare minimum (e.g. just recording executables and libraries), the second is to use a custom storage backend for filenames instead of using MySQL for everything (as the DBMS is the slow part). I really want to avoid the first (as it would reduce functionality and likely just delay the problem a bit) and only use it as a last resort before dropping the module completely, so a while ago I wrote a custom backend for storing filenames efficiently, but I haven’t integrated it into the processing module yet. We’ll see if I can find some time in the coming days/weeks to get this project back on track.
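The post does not say how the custom filename backend works. Purely as an illustration of what "storing filenames efficiently" could mean, here is a minimal sketch of front coding (prefix compression) over a sorted path list. The function names and the sample paths are hypothetical, and this is not the backend from the gentoo-stats repository.

```python
import os


def front_encode(paths):
    """Encode paths as (shared_prefix_length, suffix) pairs.

    Paths are sorted first, so neighbours tend to share long prefixes
    such as "/usr/bin/" and only the differing tail has to be stored.
    """
    encoded = []
    previous = ""
    for path in sorted(paths):
        # commonprefix compares character by character, which is what we want.
        shared = len(os.path.commonprefix([previous, path]))
        encoded.append((shared, path[shared:]))
        previous = path
    return encoded


def front_decode(encoded):
    """Rebuild the sorted path list from (shared_prefix_length, suffix) pairs."""
    paths = []
    previous = ""
    for shared, suffix in encoded:
        path = previous[:shared] + suffix
        paths.append(path)
        previous = path
    return paths


# Hypothetical sample data.
sample = ["/usr/bin/python", "/usr/bin/perl", "/usr/lib/libc.so"]
packed = front_encode(sample)
```

Decoding has to walk the list from the start, so a real store would probably restart the encoding every few hundred entries to keep lookups cheap.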

If you’re interested in helping with it:
– I don’t have any design for the web interface yet, so far it’s just basic HTML-2.0 or so. I’m not a big designer, so this is something where I’d definitely welcome external help
– A GUI for the client would be nice (like for selecting data modules or performing complex queries), but I’m not a big fan of GUI programming (though I could help with any missing backend parts in the client)
– Wouldn’t hurt to have someone else who’s an expert with (My)SQL/mod_python/security have a look at the current code/db schema before this service goes into public testing.

gentoo-stats status

So after slacking for about a week or two due to the crappy weather
here (I guess most people would call >=30

gentoo-stats test request 1

Don’t get too excited about the title; most stuff isn’t usable yet, though I think it doesn’t hurt if a few people start testing the parts of the client that are supposed to work. If you feel brave enough, start reading the little test-howto (work in progress).

how to authenticate

So now I’m at the point where I need to work on the authentication part of the stats server code, and I noticed that my plan to use http digest authentication doesn’t work, as that requires storing the plaintext password of each client on the server, which I’d like to avoid (generally one should only store a hash of the passwords in the authentication backend).
Before going into alternatives let me list a few requirements I have for them:
– don’t require the real password in the auth backend
– don’t transmit the real password unsecured over the network
– must work with only http headers, don’t touch the body in any way
– must be easily scriptable
– preemptive authorization (i.e. send the auth data with the first request)
– should work within a web browser
So, what options do I have now? Well, I can’t see a single alternative that fits all requirements (if you know one, let me know). The closest is http basic auth, but I really don’t want to send the password over the network as almost-plaintext. This led me to the idea of extending it by gpg-encrypting the password, but that’s not transparent when you use the browser (not that important for the current use case) and, more importantly, gpg adds about 600 bytes of protocol overhead for encrypted data (without using --armor); with the base64 encoding required for http that’s almost one kilobyte just for a password that originally only had a few bytes.
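The size estimate can be checked with a few lines of arithmetic. Base64 turns every 3 input bytes into 4 output characters, so the figure of roughly 600 bytes grows to 800 characters before any header name or scheme prefix is added. In this sketch random bytes stand in for real gpg output, and `base64_length` is a hypothetical helper name.

```python
import base64
import os


def base64_length(raw_bytes):
    """Length of the padded base64 encoding of raw_bytes bytes of input."""
    return 4 * ((raw_bytes + 2) // 3)


# Random bytes as a stand-in for an encrypted password of about 600 bytes.
payload = os.urandom(600)
encoded = base64.b64encode(payload)

print(len(payload), "raw bytes ->", len(encoded), "base64 characters")
```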
So, right now I have to select between a rather hackish, inefficient and untested but secure solution and a well-tested, relatively efficient and well-specified but insecure one. What would people prefer here?
Or does anyone know another solution to the problem that satisfies the above requirements? (the first four are hard requirements, the other two I could work around)
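To make the trade-off concrete, here is a minimal sketch of the plain http basic auth option. The client builds the `Authorization` header itself, so it can be sent with the first request, and the server checks it against a salted hash instead of a stored password. The function names, the dict-based backend and the choice of PBKDF2 are assumptions for illustration, not code from gentoo-stats.

```python
import base64
import hashlib
import hmac
import os


def basic_auth_header(client_id, password):
    """Build the Authorization header value a client can send preemptively."""
    token = base64.b64encode(f"{client_id}:{password}".encode("utf-8"))
    return "Basic " + token.decode("ascii")


def hash_password(password, salt=None, iterations=100_000):
    """Return (salt, iterations, digest); only this tuple is stored server-side."""
    salt = os.urandom(16) if salt is None else salt
    digest = hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, iterations)
    return salt, iterations, digest


def check_basic_auth(header_value, backend):
    """Verify a Basic header against backend: client id -> (salt, iterations, digest)."""
    scheme, _, token = header_value.partition(" ")
    if scheme.lower() != "basic":
        return False
    try:
        decoded = base64.b64decode(token, validate=True).decode("utf-8")
    except ValueError:  # covers malformed base64 and invalid UTF-8
        return False
    client_id, separator, password = decoded.partition(":")
    if not separator or client_id not in backend:
        return False
    salt, iterations, expected = backend[client_id]
    _, _, actual = hash_password(password, salt, iterations)
    return hmac.compare_digest(actual, expected)


# Hypothetical client id and password.
backend = {"client-1": hash_password("s3cret")}
header = basic_auth_header("client-1", "s3cret")
```

This covers the backend, header-only, scripting and preemptive requirements, but not the second one: the header is only base64-encoded, so anyone who can read the request can decode the password.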

transport code works, performance issues

Made a big step forward in the last few days as I’ve implemented the basic client ID management and the DRF transport code, in other words I can now register a client and upload stats data to the server 🙂 (in theory at least)
Of course this is very fragile at the moment, partially because the client is a bit too optimistic when generating deltas, and sending a delta record if the server doesn’t have the matching base record isn’t all that useful.
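The delta problem can be sketched as follows: the server only accepts a delta when it holds the base record the delta was computed against, and the client falls back to a full record otherwise. The DRF format is not described in the post, so the revision numbers, the dict payloads and all names here are hypothetical.

```python
class MissingBaseRecord(Exception):
    """Raised when a delta refers to a base record the server does not have."""


class RecordStore:
    """In-memory stand-in for the server side; keeps the latest record per client."""

    def __init__(self):
        self._latest = {}  # client_id -> (revision, data)

    def submit_full(self, client_id, revision, data):
        self._latest[client_id] = (revision, dict(data))

    def submit_delta(self, client_id, base_revision, revision, changed, removed=()):
        current = self._latest.get(client_id)
        if current is None or current[0] != base_revision:
            raise MissingBaseRecord(f"no base revision {base_revision} for {client_id}")
        merged = dict(current[1])
        merged.update(changed)
        for key in removed:
            merged.pop(key, None)
        self._latest[client_id] = (revision, merged)

    def latest(self, client_id):
        return self._latest.get(client_id)


def upload(store, client_id, base_revision, revision, old, new):
    """Try a delta first; send the full record if the server lacks the base."""
    changed = {k: v for k, v in new.items() if k not in old or old[k] != v}
    removed = [k for k in old if k not in new]
    try:
        store.submit_delta(client_id, base_revision, revision, changed, removed)
        return "delta"
    except MissingBaseRecord:
        store.submit_full(client_id, revision, new)
        return "full"
```

A less optimistic client would only generate a delta against a revision the server has acknowledged, which avoids the wasted round trip.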

On another note I’m thinking about disabling the packagefilemap module on the server side, as processing the data for it simply takes way too much time in the current state, at least on my box. All I can hope right now is that a box with better IO and/or CPU is going to make a very large difference.
The main problem is simply that inserting a DRF with the packagefilemap data included can result in several 100k or even a few million inserts (one or two per file) and selects (to look up foreign keys). It’s not going to be a very common operation, but it will happen regularly (at least whenever a client submits the first DRF with packagefilemap data). Now I expected this to take some time, but I didn’t expect it to take over ten minutes, and currently it’s close to 30 (with an almost empty db; it would probably be worse for a populated db).
While I know of a few ways to decrease the number of inserts and selects quite a bit I don’t like them as they require dropping certain functionality (like not being able to associate a filemap entry with package use flags anymore).
Well, we’ll see how it goes once I can test this on a not-so-crappy box.
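For readers unfamiliar with the pattern, the per-file insert plus foreign-key select can be sketched like this. A common general-purpose mitigation is to cache the key lookups in memory and batch the inserts inside one transaction. The example uses Python's built-in `sqlite3` as a stand-in for MySQL and a deliberately simplified, hypothetical two-table schema (no use flags). Whether this would have helped with the timings above is not something the post answers.

```python
import sqlite3

# Hypothetical, simplified schema; the real gentoo-stats schema is not shown in the post.
SCHEMA = """
CREATE TABLE packages (id INTEGER PRIMARY KEY, name TEXT UNIQUE NOT NULL);
CREATE TABLE files (
    id INTEGER PRIMARY KEY,
    package_id INTEGER NOT NULL REFERENCES packages(id),
    path TEXT NOT NULL
);
"""


def load_filemap(conn, filemap):
    """Insert (package_name, path) pairs with cached key lookups and one batch insert."""
    # One SELECT up front instead of one per file.
    package_ids = dict(conn.execute("SELECT name, id FROM packages"))
    rows = []
    with conn:  # a single transaction for the whole upload
        for package, path in filemap:
            package_id = package_ids.get(package)
            if package_id is None:
                cursor = conn.execute(
                    "INSERT INTO packages (name) VALUES (?)", (package,)
                )
                package_id = package_ids[package] = cursor.lastrowid
            rows.append((package_id, path))
        conn.executemany("INSERT INTO files (package_id, path) VALUES (?, ?)", rows)
    return len(rows)


def packages_providing(conn, path):
    """Return the names of all packages that ship the given path."""
    query = (
        "SELECT p.name FROM files f JOIN packages p ON p.id = f.package_id "
        "WHERE f.path = ? ORDER BY p.name"
    )
    return [name for (name,) in conn.execute(query, (path,))]
```

For uploads with millions of entries the `rows` list would need to be flushed in chunks rather than held in memory all at once.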

Oh, and I just got the cheque with the initial payment an hour ago. It’s about 390 Euro (stupid exchange rates) and was delivered by FedEx, but with no tracking mail, so I was quite surprised (as the “surprise” package was delivered by DHL and included a tracking mail).

hardware problems

Why is it that almost every time I start working on something, my server decides to kill itself a little while later? And when it doesn’t, why does my internet connection suddenly drop out for a few hours just when I’m about to commit a change to gentoo-stats?

This is really not helping in getting my SoC project (gentoo-stats for those who don’t know about it) done in the estimated timeframe.

As for the server problem, my current guess is that it’s caused by an overheated graphics card (an old passively cooled GF2MX), but that’s just a wild guess: it was a bit warm when I checked temperatures after the last lockup, and the general symptoms indicate a heat problem. I might replace it if I ever get around to searching for a cheap low-power, low-heat, no-noise AGP card that works with this board (I’d use the even older TNT, but it’s not compatible with the AGP slot on the board due to the 3.3V/5V issue).