Getting rid of KEYWORDS=-*, step 2

After spending the last few months raising awareness of why KEYWORDS="-*" is a bad idea, today I decided to eliminate the remaining reason for using it (you couldn't unmask a package with KEYWORDS="" without editing the ebuild) by adding support for a new token in package.keywords. So once portage-2.1.3 goes live, all these live-cvs-completely-unsupported packages can stop using the broken KEYWORDS="-*" and use KEYWORDS="" instead without losing functionality. And once the tree is clean of the KEYWORDS="-*" abusers, we can also finally fix the -* handling in package.keywords to do what it should do (act like ACCEPT_KEYWORDS).
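If you're wondering how that looks in practice: with the ** token (which accepts a package regardless of its KEYWORDS, even an empty one), unmasking a hypothetical live ebuild (package name made up here) boils down to a line like this in /etc/portage/package.keywords:
dev-util/somepackage-cvs **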

Gentoo-Stats isn’t dead (yet)

I assume some of you have been wondering what has happened to my gentoo-stats project, as there haven't been any news or updates recently. Well, unfortunately there isn't much going on; I guess I've been a bit too frustrated with it to work on it in the last weeks/months. That frustration mainly comes from the package-filemap module with its crappy performance, and from the conceptual failure of the auth encryption I had planned/implemented. The latter is frustrating simply because of the wasted time, but the former means losing a key feature, namely finding which packages provide a given file, even for uninstalled packages. I've already tried several things to make it faster, but without real success so far.
Now I have two more ideas for how to keep it working: the first is to simply reduce the amount of data to the bare minimum (e.g. only recording executables and libraries), the second is to use a custom storage backend for filenames instead of pushing everything through MySQL (as the DBMS is the slow part). I really want to avoid the first option (it would reduce functionality and likely just delay the problem a bit) and only use it as a last resort before dropping the module completely, so a while ago I wrote a custom backend for storing filenames efficiently, but I haven't integrated it into the processing module yet. We'll see if I can find some time in the coming days/weeks to get this project back on track.
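To illustrate the idea (this is just a minimal sketch, not the backend I actually wrote, and the file format is made up): filenames get mapped to integer IDs via an in-memory index backed by a simple append-only file, so the per-file lookups never touch MySQL.

import os

class FilenameStore:
    """Maps filenames to integer IDs without involving the DBMS."""

    def __init__(self, path):
        self.path = path
        self.index = {}                      # filename -> id
        if os.path.exists(path):
            # one "id<TAB>filename" record per line
            for line in open(path):
                fid, name = line.rstrip("\n").split("\t", 1)
                self.index[name] = int(fid)

    def get_id(self, name):
        # return an existing ID or append a new record to the store file
        if name in self.index:
            return self.index[name]
        fid = len(self.index) + 1
        with open(self.path, "a") as store:
            store.write("%d\t%s\n" % (fid, name))
        self.index[name] = fid
        return fid

The point is simply that an ID lookup becomes a dictionary access instead of a SELECT.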

If you’re interested in helping with it:
– I don’t have any design for the web interface yet, so far it’s just basic HTML-2.0 or so. I’m not a big designer, so this is something where I’d definitely welcome external help
– A GUI for the client would be nice (like for selecting data modules or performing complex queries), but I’m not a big fan of GUI programming (though I could help with any missing backend parts in the client)
– Wouldn’t hurt to have someone else who’s an expert with (My)SQL/mod_python/security have a look at the current code/db schema before this service goes into public testing.

how to authenticate

So now I'm at the point where I need to work on the authentication part of the stats server code, and I noticed that my plan to use HTTP digest authentication doesn't work, as it requires the server to store the clients' plaintext passwords (or an equivalent unsalted hash), which I'd like to avoid (generally one should only store a proper hash of the passwords in the authentication backend).
Before going into alternatives let me list a few requirements I have for them:
– don’t require the real password in the auth backend
– don’t transmit the real password unsecured over the network
– must work with only http headers, don’t touch the body in any way
– must be easily scriptable
– preemptive authorization (i.e. send the auth data with the first request)
– should work within a web browser
So, what options do I have? Well, I can't see a single alternative that fits all requirements (if you know one, let me know). The closest is HTTP basic auth, but I really don't want to send the password over the network as almost-plaintext. This led me to the idea of extending basic auth by gpg-encrypting the password, but that isn't transparent when you use a browser (not that important for the current use case), and more importantly gpg adds about 600 bytes of protocol overhead to the encrypted data (without using --armor); with the base64 encoding required for HTTP, that's almost one kilobyte for a password that originally was only a few bytes.
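Just to make the gpg idea concrete, here is a rough client-side sketch (the X-GPG-Basic scheme name, key ID and client ID are all made up; this is not a settled protocol): encrypt the password to the server's public key, base64-encode it and ship it in the Authorization header.

import base64
import subprocess

def gpg_encrypt(data, recipient):
    # call the gpg binary; --batch suppresses interactive prompts
    proc = subprocess.run(
        ["gpg", "--batch", "--encrypt", "--recipient", recipient],
        input=data, stdout=subprocess.PIPE, check=True)
    return proc.stdout

password = b"secret"
blob = gpg_encrypt(password, "stats-server@example.org")   # made-up key ID
token = base64.b64encode(blob).decode("ascii")
print(len(password), "->", len(token))   # a few bytes turn into roughly a kilobyte
auth_header = "Authorization: X-GPG-Basic client42:" + token

The printed sizes show exactly the blowup mentioned above: a few bytes of password become close to a kilobyte of header.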
So, right now I have to select between a rather hackish, inefficient and untested but secure solution and a well-tested, relatively efficient and well-specified but insecure one. What would people prefer here?
Or does anyone know another solution to the problem that satisfies the above requirements? (the first four are hard requirements, the other two I could work around)

transport code works, performance issues

I made a big step forward in the last few days: I've implemented the basic client ID management and the DRF transport code. In other words, I can now register a client and upload stats data to the server 🙂 (in theory at least).
Of course this is very fragile at the moment, partially because the client is a bit too optimistic when generating deltas, and sending a delta record if the server doesn’t have the matching base record isn’t all that useful.

On another note, I'm thinking about disabling the packagefilemap module on the server side, as processing its data simply takes way too much time in its current state, at least on my box. All I can hope right now is that a box with better IO and/or CPU will make a very large difference.
The main problem is simply that inserting a DRF with the packagefilemap data included can result in several hundred thousand or even a few million inserts (one or two per file) and selects (to look up foreign keys). It's not going to be a very common operation, but it will happen regularly (at least whenever a client submits its first DRF with packagefilemap data). Now, I expected this to take some time, but I didn't expect it to take over ten minutes, and currently it's close to 30 (with an almost empty db; it would probably be worse with a populated db).
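To give an idea where the time goes, here is a sketch of the kind of insert loop involved, together with the two obvious mechanical optimizations (an in-memory cache for the filename-to-ID lookups and one bulk executemany() per package). This is not the actual server code, and the table and column names are invented; conn is any DB-API connection using the %s paramstyle (e.g. MySQLdb).

_filename_ids = {}   # shared across requests: filename -> filenames.id

def filename_id(cur, name):
    # look up (or create) the foreign key for a filename, caching the result
    fid = _filename_ids.get(name)
    if fid is None:
        cur.execute("SELECT id FROM filenames WHERE name=%s", (name,))
        row = cur.fetchone()
        if row:
            fid = row[0]
        else:
            cur.execute("INSERT INTO filenames (name) VALUES (%s)", (name,))
            fid = cur.lastrowid
        _filename_ids[name] = fid
    return fid

def insert_filemap(conn, package_id, filenames):
    cur = conn.cursor()
    rows = [(package_id, filename_id(cur, name)) for name in filenames]
    # one bulk statement instead of one INSERT per file
    cur.executemany(
        "INSERT INTO package_files (package_id, filename_id) VALUES (%s, %s)",
        rows)
    conn.commit()

Even with those, every previously unseen filename still costs a SELECT plus an INSERT, which is where the millions of statements come from.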
While I know of a few ways to decrease the number of inserts and selects quite a bit, I don't like them, as they require dropping certain functionality (like no longer being able to associate a filemap entry with package USE flags).
Well, we’ll see how it goes once I can test this on a not-so-crappy box.

Dear users, …

sometimes I just hate you. Why? Because every now and then someone you haven't heard of before comes around and asks a question, then doesn't like the answer and calls you an idiot in one way or another. Like today, when someone joined our portage IRC channel, asked how to test a package on an arch it wasn't keyworded for, and then refused to understand how the keyword system works. He even called using package.keywords for anything other than ~arch a misuse of that feature (which is funny, as I wrote that feature in the first place).
And after trying to explain it three times, he still insisted that we should implement another way because the current one (with ACCEPT_KEYWORDS and package.keywords) wasn't "logical".
Instead he insisted that we add a feature allowing users to change package metadata (like KEYWORDS) through a simple config file, which isn't just redundant but, as any dev will assure you, quite stupid and possibly harmful.
In the end he left after some nasty remarks, without having accomplished anything other than upsetting a couple of devs through simple ignorance. Now, I assume he probably feels the same about us, but sometimes you just have to accept that the other party is right, and in this case we have >99% of the userbase and several years of experience backing up our approach.

Now, I know such incidents are the rare exception and in most cases dealing with users is a nice experience, but they can leave you pretty upset for a while.
So what's the moral of the story? If you're in a (heated) discussion, always consider that you may be wrong, and try to look at the other party's arguments from a different angle. If you notice that you're hitting a wall, step away for a while so everyone can cool down (and try the first piece of advice again), and maybe resume the discussion at a later date. And if you're talking with someone who's an expert in a domain while you're not, don't try to beat him in that domain (e.g. don't claim that you change package metadata with package.keywords when you aren't sure what package metadata is in the first place). That doesn't mean you can't talk about the topic, but realize that the other person likely knows a lot more about it (and maybe doesn't want to explain every little detail when rejecting an idea).

In the end it’s just for your own benefit (and nobody likes grumpy devs ;))

more elog goodness

After venting about annoying users last week, I'll try to post something a bit more useful today; maybe this will even evolve into a series about what's new in portage land.

Well, today I added a little extension to the elog subsystem to make multi-target logging more useful by extending the PORTAGE_ELOG_SYSTEM syntax. Now you can override PORTAGE_ELOG_CLASSES per module, so for example one might send all messages to a file (using the save-summary module added in 2.1.2) and additionally send the important ones by mail. A related extension is that you can now use a * wildcard wherever a loglevel is expected to include all loglevels.
So using the example above you would put the following in your make.conf:
PORTAGE_ELOG_SYSTEM="save-summary:* mail:log,warn,error"

what to do with deleted ebuilds

So today I wrote a little script to store the ebuilds and associated files of installed packages in a separate overlay, as a first measure towards solving bug #126059. It's still far from perfect: it only adds to the overlay, isn't limited to ebuilds deleted by an `emerge --sync` (which can be seen as a benefit as well), and most importantly doesn't deal with Manifests yet.
The last point makes it rather useless for general usage, but it’s just a prototype script to see if this is a viable solution to the problem of disappearing ebuilds.
So any feedback about this would be appreciated.
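For anyone curious about the general approach (this is not the actual script, just a minimal sketch): every installed package keeps a copy of its ebuild in the VDB under /var/db/pkg, so preserving it mostly means copying that file into an overlay with the usual category/package layout.

import os
import re
import shutil

VDB = "/var/db/pkg"
OVERLAY = "/usr/local/preserved-overlay"     # made-up location

# strip the version part of "foo-1.2.3-r1" to get the package name;
# portage's own pkgsplit() would be the robust way to do this
version_re = re.compile(r"-\d.*$")

for category in os.listdir(VDB):
    catdir = os.path.join(VDB, category)
    if not os.path.isdir(catdir):
        continue
    for pf in os.listdir(catdir):             # e.g. "portage-2.1.2"
        ebuild = os.path.join(catdir, pf, pf + ".ebuild")
        if not os.path.isfile(ebuild):
            continue
        pn = version_re.sub("", pf)
        dest = os.path.join(OVERLAY, category, pn)
        os.makedirs(dest, exist_ok=True)
        shutil.copy2(ebuild, os.path.join(dest, pf + ".ebuild"))

A real version would also have to (re)generate Manifests and digests for the copied ebuilds, which is exactly the part the prototype doesn't handle yet.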

Oh, and I just got the cheque with the initial payment an hour ago. It's about 390 Euro (stupid exchange rates) and was delivered by FedEx, but without a tracking mail, so I was quite surprised (the "surprise" package was delivered by DHL and did include a tracking mail).