diet for portage/__init__.py

So, as I said earlier, I’ve now moved the dbapi stuff into its own subpackage, and portage/__init__.py (formerly portage.py) has now shrunk to 5k lines. However, that’s still way too much for me, so I’ll see what I can remove from it next; likely candidates are config() and/or the doebuild stuff.
Hopefully at some point no module will have more than 1k lines, so things get manageable again and we can start working again without getting lost in files that span hundreds of pages, and maybe even break some of the larger functions/classes (config, fetch, treewalk, …) down into smaller pieces. Now what’s the point of breaking things up? Well, one thing is that the smaller a code block is, the easier it usually is to reuse it. The same goes for replacing it with something better. Also, as I have to determine which symbols each new module actually uses in order to rewrite the import statements, it might give us a better view of which symbols are actually used and of the dependencies between modules, and eventually give us a clue how to group them better (so that semantically related symbols are in the same namespace).

Namespace sanitizing and splitting up the tree

Something that’s bugged me for a while in portage was the crappy namespace handling we’ve had ever since we moved the python modules to /usr/lib/portage/pym. Originally there was no real problem as we only had a single module portage.py, so all you needed was an ‘import portage’, but over time more modules were created, which Nick started to name portage_foo.py due to the lack of a “portage” python package to use as a container. Also there were a number of modules without any “portage” part in the name, such as xpak, cvstree, output or the cache package, which could potentially cause a namespace collision with other packages in site-packages or even the standard library, not a very pleasant thought.
But as of today that’s history: I finally fixed this annoyance and moved all the portage related code into the new “portage” package (so portage.py is now portage/__init__.py and portage_foo.py is now portage/foo.py). For now the code is mostly a 1:1 translation, but over time it hopefully gets a bit cleaner by removing redundant qualifiers. Also this now allows us to split the big portage.py (or now __init__.py) up further without fearing namespace collisions. I’ll probably move the dbapi classes into their own package later this week.
But what does this all mean to you? If you’re just a normal user it shouldn’t affect you in any way (assuming I didn’t screw up anything and Zac updates the ebuild accordingly). If you have some custom scripts or are a developer of a tool using the portage API you should prepare for updating it after portage-2.1.3 is released, though for the time being the old names should just continue to work as I’ve also added some symlinks to avoid a large-scale API breakage.
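For tool authors who want their scripts to run against both layouts, here is a minimal sketch of an import fallback. The helper name import_first is made up for this example and is not part of portage; the pair portage.util / portage_util is just one instance of the portage/foo.py vs. portage_foo.py renaming described above.

```python
import importlib


def import_first(candidates):
    """Return the first module from `candidates` that can be imported.

    `candidates` is a list of dotted module names, ordered from most to
    least preferred (new-style name first, old-style name last).
    """
    last_error = None
    for name in candidates:
        try:
            return importlib.import_module(name)
        except ImportError as exc:
            # Remember the failure and try the next candidate.
            last_error = exc
    raise ImportError(
        "none of %r could be imported" % (candidates,)
    ) from last_error


# Intended use in a tool (needs portage installed, so left as a comment):
#   portage_util = import_first(["portage.util", "portage_util"])
```

As long as the compatibility symlinks exist this isn’t strictly needed, but it keeps a script working once they go away.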

On another note I fully agree with Diego on the idea of splitting the tree up. I’ve never been a big fan of the recent overlay hype, but at this point it’s still manageable. Also besides any technical problems a tree split would increase the “repo hunting” problem which we’re already starting to see and is IMHO one of the major downsides of most other (rpm-based) distributions, and that’s something I’d like to avoid in Gentoo.

Getting rid of KEYWORDS=-*, step 2

After spending the last few months raising awareness that KEYWORDS=”-*” is a stupid thing to use, today I decided to eliminate the remaining reason for using it (one couldn’t unmask a package that had KEYWORDS=”” without editing it) by adding support for a new token in package.keywords. So once portage-2.1.3 goes live all these live-cvs-completely-unsupported packages can stop using the broken KEYWORDS=”-*” and use KEYWORDS=”” instead without losing functionality. And once we get the tree clean of those KEYWORDS=”-*” abusers we can also finally fix the -* handling for package.keywords to do what it should do (act like ACCEPT_KEYWORDS).

Gentoo-Stats isn’t dead (yet)

I assume some of you have been wondering what has happened to my gentoo-stats project as there haven’t been any news or updates recently. Well, unfortunately there isn’t much going on; I guess I’ve just been a bit too frustrated with it to work on it in the last weeks/months. That frustration mainly comes from the package-filemap module with its crappy performance, and from the conceptual failure of the auth encryption I had planned/implemented. The latter is frustrating simply due to the wasted time, but the former means the lack of a key feature, namely finding which packages provide a given file even for uninstalled packages. I’ve already tried several things to get it faster, but without real success so far.
Now I have two more ideas for how to still get it working: the first is to simply reduce the amount of data to the bare minimum (e.g. just recording executables and libraries), the second is to use a custom storage backend for filenames instead of using MySQL for everything (as the DBMS is the slow part). I really want to avoid the first (as it would reduce functionality and likely just delay the problem a bit) and only use it as a last resort before dropping the module completely, so a while ago I wrote a custom backend for storing filenames efficiently, but I haven’t integrated it into the processing module yet. We’ll see if I can find some time in the coming days/weeks to get this project back on track.

If you’re interested in helping with it:
– I don’t have any design for the web interface yet, so far it’s just basic HTML-2.0 or so. I’m not a big designer, so this is something where I’d definitely welcome external help
– A GUI for the client would be nice (like for selecting data modules or performing complex queries), but I’m not a big fan of GUI programming (though I could help with any missing backend parts in the client)
– Wouldn’t hurt to have someone else who’s an expert with (My)SQL/mod_python/security have a look at the current code/db schema before this service goes into public testing.

how to authenticate

So now I’m at the point where I need to work on the authentication part of the stats server code, and I noticed that my plan to use http digest authentication doesn’t work, as that requires storing the plaintext password of clients on the server, which I’d like to avoid (generally one should only store a hash of the passwords in the authentication backend).
Before going into alternatives let me list a few requirements I have for them:
– don’t require the real password in the auth backend
– don’t transmit the real password unsecured over the network
– must work with only http headers, don’t touch the body in any way
– must be easily scriptable
– preemptive authorization (i.e. send the auth data with the first request)
– should work within a webbrowser
So, what options do I have now? Well, I can’t see a single alternative that fits all requirements (if you know one let me know). The closest is http basic auth, but I really don’t want to send the password over the network as almost-plaintext. This led me to the idea of extending it by gpg-encrypting the password, but that’s not transparent when you use a browser (not that important for the current use case) and, more importantly, gpg adds about 600 bytes of protocol overhead for encrypted data (without using --armor). With the base64 encoding required for http that’s almost one kilobyte just for a password that originally only had a few bytes.
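To make the “almost-plaintext” and “almost one kilobyte” points concrete, here is a small sketch using only the Python standard library. The function names and the sample credentials are made up for illustration; the 600 byte figure is the rough gpg overhead mentioned above, not something measured here.

```python
import base64


def basic_auth_header(user, password):
    """Build the value of an HTTP Basic 'Authorization' header."""
    raw = ("%s:%s" % (user, password)).encode("utf-8")
    return "Basic " + base64.b64encode(raw).decode("ascii")


def decode_basic_auth(header):
    """Recover user and password from a Basic auth header value.

    No key is needed: base64 is an encoding, not encryption, which is
    why anybody who can read the header can read the password.
    """
    _scheme, token = header.split(" ", 1)
    user, password = base64.b64decode(token).decode("utf-8").split(":", 1)
    return user, password


def base64_size(num_bytes):
    """Length of the (padded) base64 encoding of num_bytes bytes."""
    return 4 * ((num_bytes + 2) // 3)


header = basic_auth_header("client42", "secret")
print(decode_basic_auth(header))

# An 8 byte password plus roughly 600 bytes of gpg overhead, base64 encoded:
print(base64_size(8 + 600))
```

The second number is where the “almost one kilobyte” estimate comes from: base64 turns every 3 input bytes into 4 output characters.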
So, right now I have to select between a rather hackish, inefficient and untested but secure solution and a well-tested, relatively efficient and well-specified but insecure one. What would people prefer here?
Or does anyone know another solution to the problem that satisfies the above requirements? (the first four are hard requirements, the other two I could work around)

transport code works, performance issues

Made a big step forward in the last few days as I’ve implemented the basic client ID management and the DRF transport code, in other words I can now register a client and upload stats data to the server 🙂 (in theory at least)
Of course this is very fragile at the moment, partially because the client is a bit too optimistic when generating deltas, and sending a delta record if the server doesn’t have the matching base record isn’t all that useful.
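As a rough illustration of the base record problem (not the actual DRF code), here is a sketch where the server refuses a delta whose base it doesn’t know and the client then falls back to a full record. All names here (StatsServer, submit, upload, the revision numbers) are hypothetical and only exist for this example.

```python
class StatsServer:
    """Toy in-memory stand-in for the stats server (hypothetical)."""

    def __init__(self):
        # client id -> {revision number: stored payload}
        self.records = {}

    def submit(self, client_id, revision, payload, base_revision=None):
        """Store a record; return False for a delta with an unknown base."""
        known = self.records.setdefault(client_id, {})
        if base_revision is not None and base_revision not in known:
            # A delta is useless without the record it was computed against.
            return False
        known[revision] = payload
        return True


def upload(server, client_id, revision, full_data, delta=None, base_revision=None):
    """Try the delta first, fall back to the full record if it is refused."""
    if delta is not None and server.submit(client_id, revision, delta, base_revision):
        return "delta"
    server.submit(client_id, revision, full_data)
    return "full"


server = StatsServer()
# The server has never seen revision 1, so the delta gets refused.
print(upload(server, "c1", 2, {"a": 1}, delta={"a": 1}, base_revision=1))
# Revision 2 is known now, so a delta against it is accepted.
print(upload(server, "c1", 3, {"a": 1, "b": 2}, delta={"b": 2}, base_revision=2))
```

A less optimistic client would only generate deltas against revisions the server has acknowledged.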

On another note I’m thinking about disabling the packagefilemap module on the server side, as processing the data for it simply takes way too much time in the current state, at least on my box. All I can hope right now is that a box with better IO and/or CPU is going to make a very large difference.
The main problem is simply that inserting a DRF with the packagefilemap data included can result in several 100k or even a few million inserts (one or two per file) and selects (to look up foreign keys). It’s not going to be a very common operation but will happen regularly (at least whenever a client submits its first DRF with packagefilemap data). Now I expected this to take some time, but I didn’t expect it to take over ten minutes, and currently it’s close to 30 (with an almost empty db, it would probably be worse for a populated db).
While I know of a few ways to decrease the number of inserts and selects quite a bit I don’t like them as they require dropping certain functionality (like not being able to associate a filemap entry with package use flags anymore).
Well, we’ll see how it goes once I can test this on a not-so-crappy box.

Dear users, …

sometimes I just hate you. Why? Because every now and then someone you haven’t heard of before comes around and asks a question, then doesn’t like the answer and calls you an idiot in one way or another. Like today, when someone joined our portage IRC channel, asked how to test a package on an arch it wasn’t keyworded for and then refused to understand how the keyword system works. He even called using package.keywords for anything other than ~arch a misuse of that feature (which is funny as I wrote that feature initially).
And after trying to explain it three times he still insisted that we should implement another way because the current one (with ACCEPT_KEYWORDS and package.keywords) wasn’t “logical”.
Instead he insisted that we add a feature to allow users to change package metadata (like KEYWORDS) with a simple config file, which isn’t just redundant but as any dev will assure you quite stupid and possibly harmful.
In the end he left after some nasty remarks, without having accomplished anything other than to upset a couple of devs due to simple ignorance. Now I assume he probably feels the same about us, but sometimes you just have to accept that the other party is right, and in this case we have >99% of the userbase and several years of experience backing our way up.

Now I know that such incidents are the rare exception and in most cases dealing with users is a nice experience, but they can make you pretty upset for a while sometimes.
So what’s the moral of the story? If you’re in a (heated) discussion, always consider that you may be wrong, and try to look at the other party’s arguments from a different angle. If you notice that you’re hitting a wall, go away for a while so everyone can cool down (and try the first advice again), and maybe try to resume the discussion at a later date. And if you’re talking with someone who’s an expert in a domain while you’re not, don’t try to beat him in that domain (e.g. don’t say that you change package metadata with package.keywords when you aren’t sure what package metadata is in the first place). That doesn’t mean that you can’t talk about it, but realize that the other person likely knows a lot more about the topic (and maybe doesn’t want to explain every little detail to you when rejecting an idea).

In the end it’s just for your own benefit (and nobody likes grumpy devs ;))

more elog goodness

After venting about annoying users last week I’ll try to post something a bit more useful today, maybe this will even evolve into a series about what’s new in portage land.

Well, today I added a little extension to the elog subsystem to make multi-target logging more useful by extending the PORTAGE_ELOG_SYSTEM syntax a bit. Now you can override PORTAGE_ELOG_CLASSES per module, so for example one might send all messages into a file (using the save-summary module added in 2.1.2) and additionally send the important ones by mail. Another related extension is that you can now use a * wildcard wherever a loglevel is wanted to include all loglevels.
So using the example above you would put the following in your make.conf:
PORTAGE_ELOG_SYSTEM="save-summary:* mail:log,warn,error"
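To show how such a value could be read, here is a simplified sketch of a parser for the extended syntax. This is not the code that went into portage; the function names are made up, and the list of loglevels (info, warn, error, log, qa) is my assumption of what * stands for.

```python
# Assumed set of elog loglevels that the * wildcard expands to.
ALL_LEVELS = ("info", "warn", "error", "log", "qa")


def expand_levels(levels):
    """Replace a * wildcard by the full list of loglevels."""
    if "*" in levels:
        return list(ALL_LEVELS)
    return list(levels)


def parse_elog_system(system, default_classes):
    """Map each module in PORTAGE_ELOG_SYSTEM to the loglevels it receives.

    `system` is the PORTAGE_ELOG_SYSTEM string, `default_classes` the
    PORTAGE_ELOG_CLASSES string used for modules without an override.
    """
    defaults = expand_levels(default_classes.split())
    result = {}
    for token in system.split():
        module, sep, levels = token.partition(":")
        if sep:
            # "module:level1,level2" overrides the defaults for this module.
            result[module] = expand_levels(levels.split(","))
        else:
            result[module] = list(defaults)
    return result


config = parse_elog_system("save-summary:* mail:log,warn,error", "log warn error")
for module, levels in config.items():
    print(module, levels)
```

With the make.conf line from above, save-summary ends up with every loglevel while mail only gets log, warn and error.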