MAKEOPTS="-j${core} +1" is NOT the best optimization

Many times, when setting up make.conf on systems with unusual architectures, I have wondered what the best --jobs value is.
The handbook suggests ${core} + 1, but since I’m curious I wanted to test it myself to be sure this is right.

To make a good test we need a package with a respectable build system that respects make parallelization and takes at least a few minutes to compile. With packages that compile in a few seconds, we cannot measure any meaningful difference.
kde-base/kdelibs is, in my opinion, perfect.

If you are on an architecture where kde-base/kdelibs is unavailable, just switch to another cmake-based package.

Now, download best_makeopts from my overlay. Below is an explanation of what the script does, along with various suggestions.

  • You need to compile the package on a tmpfs filesystem; I’m assuming you have /tmp mounted as a tmpfs too.
  • You need to have the tarball of the package on a tmpfs, because with a slow disk reading it may take more time.
  • You need to switch your CPU frequency governor to performance.
  • You need to be sure you don’t have strange EMERGE_DEFAULT_OPTS.
  • You need to add ‘-B’ (build the package only) because we don’t want to include the installation time.
  • You need to drop the existing cache before each compile.
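
The checklist above could be wired together roughly as follows. This is only a sketch, not the actual best_makeopts script: the function name print_rounds and its arguments are made up for illustration, DISTDIR=/tmp assumes /tmp is a tmpfs, and mounting the tmpfs and switching the governor are left to you. The function only prints the commands for each round, so you can review them before running anything as root.

```shell
#!/bin/sh
# Hypothetical helper, not part of best_makeopts: print the commands for one
# benchmark round per job count, from -j1 up to -j<max_jobs>.
print_rounds() {
    pkg="$1"        # package to build, e.g. kde-base/kdelibs
    max_jobs="$2"   # highest -j value to try
    j=1
    while [ "$j" -le "$max_jobs" ]; do
        # Flush dirty pages, then drop the page cache so every round starts cold.
        echo "sync; echo 1 > /proc/sys/vm/drop_caches"
        # -B builds the package without installing it; the empty
        # EMERGE_DEFAULT_OPTS keeps options from make.conf out of the measurement.
        echo "time env EMERGE_DEFAULT_OPTS= DISTDIR=/tmp MAKEOPTS=-j${j} emerge -B ${pkg}"
        j=$((j + 1))
    done
}

# Print the rounds for -j1 to -j4; nothing is executed here.
print_rounds kde-base/kdelibs 4
```

Once the printed commands look right for your system, they can be piped to a root shell, for example with `print_rounds kde-base/kdelibs 4 | bash`.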

As you can see in the script, the for loop emerges the same package with MAKEOPTS from -j1 to -j10. If you have, for example, a single-core machine, running the loop from 1 to 4 is enough.

Please don’t use the CPU for other purposes during the test and, if you can, stop all services and run the test from a tty; you will see the time for every merge.

The following is an example on my machine:
-j1 : real 29m56.527s
-j2 : real 15m24.287s
-j3 : real 13m57.370s
-j4 : real 12m48.465s
-j5 : real 12m55.894s
-j6 : real 13m5.421s
-j7 : real 13m13.322s
-j8 : real 13m23.414s
-j9 : real 13m26.657s

The hardware is:
Intel(R) Core(TM) i3 CPU 540 @ 3.07GHz, which has 2 cores and 4 threads.
After -j4 you can see the regression.
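
To put numbers on that, the timings can be converted to seconds and compared with the -j1 run. The helper names below (to_seconds, speedup) are made up for this sketch; the two times are the -j1 and -j4 results from the table above.

```shell
#!/bin/sh
# Convert a `time`-style value such as 12m48.465s to seconds.
to_seconds() {
    echo "$1" | awk -F'[ms]' '{ printf "%.3f\n", $1 * 60 + $2 }'
}

# Speedup of a run compared with a baseline run (baseline / run).
speedup() {
    awk -v base="$(to_seconds "$1")" -v run="$(to_seconds "$2")" \
        'BEGIN { printf "%.2f\n", base / run }'
}

to_seconds 12m48.465s            # -j4 on the i3, in seconds
speedup 29m56.527s 12m48.465s    # -j1 versus -j4
```

On these numbers -j4 comes out roughly 2.3 times faster than -j1, which fits a CPU with 2 cores and 4 threads better than one with 4 full cores.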

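Another example, from an Intel Itanium with 4 CPUs (a different package, since kdelibs is not available for ia64, so the absolute times are not comparable with the ones above):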
-j1 : real 4m24.930s
-j2 : real 2m27.854s
-j3 : real 1m47.462s
-j4 : real 1m28.082s
-j5 : real 1m29.497s

I tested this script on ~20 different machines and, in the majority of cases, the best value was ${core} or, more exactly, the number of ${threads} of your CPU.
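
If you want to start from ${threads}, the number of logical CPUs can be read with nproc from coreutils. make.conf does not run commands, so this sketch only prints a line that you can paste into it; the function name suggest_makeopts is made up for the example.

```shell
#!/bin/sh
# Hypothetical helper: print a MAKEOPTS line with one job per logical CPU
# (hardware thread) available to the current process, as reported by nproc.
suggest_makeopts() {
    threads="$(nproc)"
    echo "MAKEOPTS=\"-j${threads}\""
}

suggest_makeopts
```

Treat the printed value as a starting point for your own measurements, not as a result.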

Conclusion:
From the handbook:

A good choice is the number of CPUs (or CPU cores) in your system plus one, but this guideline isn’t always perfect.

I don’t know who, years ago, suggested ${core} + 1 in the handbook, and I don’t want to start a flame war. I’m just saying that ${core} + 1 is not the best optimization for me, and the test confirms the part that says “but this guideline isn’t always perfect”.

In every case I measured, ${threads} + ${X} was slower than ${threads} alone, so don’t use -j20 if you have a dual-core CPU.

Also, I’m not telling you to use ${threads}; I’m just saying feel free to run your own tests to see what the best value is.

If you have suggestions to improve the script, or you think the script is wrong, feel free to comment or send me an email.

33 thoughts on “MAKEOPTS="-j${core} +1" is NOT the best optimization”

  1. George Shapovalov

    Well, see, your suggestions call for everything residing in tmpfs, so I presume this was your test configuration as well. Now, that +1 was suggested exactly for the purpose of dealing with IO. So, yeah, on tmpfs it is unneeded and, as would be expected, you see the regression (due to the overloaded CPU).

    Frankly, your results show a pretty flat plateau around the minimum, so it looks like n_threads + 1 is a safe bet in any case, and may actually be better if you compile real stuff on disk. I would imagine the exact “leader” (there really seems to be not much difference between N and N+1 anyway) would depend on how big the individual compile tasks are.

        1. ago Post author

          Since before every merge there is: echo 1 > /proc/sys/vm/drop_caches, which other cache are you talking about?

          1. WolfWings

            That should be a 3, to drop the inode/directory information as well as the file-data information.

            It can make a surprisingly large performance difference forcing directory re-scans as if from a truly cold disk cache.

  2. Luca Barbato

    Depends on the package. E.g. most recursive make setups can’t use more than a few threads because of how they laid out their system and/or because their tree has complex interactions among modules.

    Libav, Linux and such can use as many cores as you have, as long as you have enough RAM to support it.

    As you said “your mileage will vary”

  3. ssam

    I’d guess that this is due to improvements in the Linux scheduler over the past few years. Con Kolivas used to cite make getting faster past -j${core} as an example of why the existing scheduler was far from optimal ( http://ck.kolivas.org/patches/bfs/bfs-faq.txt ).

    Assuming that you are using a vanilla kernel, this shows that CFS has caught up with BFS in this metric.

  4. Ryan Hill

    Cool, I’ll have to give this a try. It would probably be a good idea to disable ccache as well (set CCACHE_DISABLE to true).

    1. ago Post author

      kdelibs is not available for ia64; I just compiled another package to give an example of the behavior on a different architecture.

    2. Ondrej Grover

      The cpufreq governor is an interesting point. Thus setting it to performance in the script is a good idea, so we have a common reference base.

      I’m testing it on a RAID5 array atm, any writing-bound IO should have some effect there. /tmp will be on that array.

      1. Ondrej Grover

        Done testing, indeed I had the same observation, the difference was about 20 seconds on an old Xeon server with /var/tmp on RAID 5.

        I noticed that you talk about compiling in /tmp on tmpfs, but shouldn’t you set PORTAGE_TMPDIR instead of DISTDIR in the script?

        1. ago Post author

          My PORTAGE_TMPDIR always points to /tmp, so I omitted it in the script. But honestly, the script gives the idea, and then everyone can adapt it to their own system and settings.

  5. Bloody

    Old topic… still, sometimes ${core}+1 is slightly faster than ${core}. It depends on whether it’s C or C++ and a few other things.

    On a single-core, -j2 seems to benefit. On quad-core, j4/j5 is often similar but not always.

    To know for sure, re-emerge the entire world and see how long that takes with ${core} and then again with ${core}+1. Just one certain package as reference won’t do, you need a larger and more diverse sample to come out with a conclusion..

    1. ago Post author

      1) Give me an example _where_ $core + 1 is faster than $core.
      2) I have a single core; -j1 is faster than -j2 here.
      3) Doing this test on world makes no sense, because it includes packages that do not respect parallel make.

      1. Bloody

        Well then build gcc, glibc, coreutils, firefox, QT and a few more hand-picked biggies. Just sayin that only 1 package is a bit too small of a sample.

        single-core: -j2 gains a bit because while one file is reading/writing, another can use the CPU so it’s kept at 100% load for longer periods of time. Not so much noticeable when compiling pure C++, though.

        1. ago Post author

          This may be news to you, but it is not to me. I discovered this a year ago, and in the year before publishing this blog post I did a lot of tests; in all of them $threads worked better.
          If you suggest including firefox, you probably don’t understand the method and the goal, because firefox has a very sad build system.
          Please test packages with a _sane_ build system.

  6. Vyk

    This is interesting. I actually did similar tests on a dual 3.06GHz Prestonia Xeon in March 2010; however, I timed emerge -e world (as Bloody suggested above). ago’s test seems a better test of ideal speed; however, my concern was time to build world, so that is what I chose to test. (I don’t remember much about my methodology other than that.)

    My first discovery was that repeatedly emerging world on a pair of Pentium-4-based Xeons removes the need for either a space heater or a white-noise machine. :) Aside from that, however, I noticed that while user time increased as number of jobs increased, real time stayed approximately constant from -j3 to -j7. I did not test -j8 or higher; however, I did test -j (unlimited jobs) and wall time increased, so specifying a sane -j is worthwhile.

    Note that this machine has two procs, each with a single hyper-threading core; however, HT appears to have been off for these tests. (It appears that I also tested with HT on, but only up to -j4, so it’s not helpful for this discussion and I haven’t included those results.) Thus, we have two cores.

    Times were (real/user/sys):
    -j1: 213m25.234s/163m52.220s/49m58.810s
    -j2: 162m26.425s/177m13.860s/56m59.370s
    -j3: 159m58.587s/222m22.130s/62m16.130s
    -j4: 160m15.666s/256m49.840s/65m16.290s
    -j5: 160m31.982s/262m40.510s/65m53.130s
    -j6: 159m8.536s/263m16.640s/65m49.620s
    -j7: 159m11.622s/264m10.660s/65m42.840s
    -j: 168m34.406s/261m6.200s/65m48.230s

    Things to note: -j2 (${core}) was not the best. -j3 (${core}+1) wasn’t either, but came pretty close. (50 seconds behind the best time on a 160-minute build–about a 0.5% difference.)

    I’m not sure why this differs from ago’s results–there are plenty of potential reasons. However, it does indicate that, at least at one point, “number of CPUs (or CPU cores) in your system plus one” was a reasonable guess. (Note that the suggestion dates back at least to the 2004.2 handbook.) Overall, though, this demonstrates that each person should test things themselves in their environment and with their workload; results from anybody else’s test may not apply.

  7. Roger

    Ditto here.

    ${threads} seemed always best with me, while ${threads}+1 always slowed the build on all of my Intel CPU boxes here.

    Possibly the “NUM of CPUs +1” rule was written by a technician developing the very first hyperthreaded CPUs? Just a guess.

  8. Richard

    How much RAM does your system have? Linux could be caching everything in RAM on your system, which would mean that the only meaningful IO happening would be writes. As George Shapovalov said, the additional job is meant to compensate for IO latencies. If everything fits in Linux’s disk cache, then the IO problem disappears.

    Also, your amd64 CPU has SMT. This partially negates the advantage of an additional job because each physical CPU core can still be doing useful work when one of the jobs has finished. If you disable SMT to simulate a CPU that lacks it and do these tests without a tmpfs, the additional job could become more important because then IO would ensure that you have periods where an entire CPU core is idle. That assumes that everything is not being cached in RAM, which could be the case.

    I am not convinced that there is an appreciable difference between -jN and -j(N + 1) on your system. If you do multiple runs and do error calculations (which could be as simple as the 25th and 75th percentiles), you should see that the two overlap. That effectively makes the two cases the same on your system, but says nothing about other configurations.

    Lastly, please do not mistake this comment as a statement that -j(N + 1) is superior to -jN. Cache likely favors -jN while slow IO likely favors -j(N + 1). When you have a sufficiently large amount of RAM, the former should be true where IO is good while the latter should be true where the IO is awful. If you have a limited amount of memory, -j1 could be better than either of them. What is better likely depends on your configuration, but the difference between -jN and -j(N+1) should be small when you have a large amount of RAM.

  9. Russ

    Thanks for doing your testing and posting it, very informative and thought provoking.

    With a 6 core Phenom II, I use -j7, with “--jobs 3” in my EMERGE_DEFAULT_OPTS, and I have a distcc helper, and ccache. My distcc helper is a 3-core Phenom. I have been wondering how best to set my flags.

    I’ll see if I can run the test for another data point. Would you have any expectations for this type of configuration? I am interested in overall speed of ‘emerge -u @world’, not necessarily a single package.

    1. Arne Babenhauserheide

      I also wanted to ask about ccache :)

      Once you use ccache, a big part of the builds for updates will simply be disk-IO, so having a much higher number of jobs might work well.

      Note, though, that too much IO can clog your system (I know that from experience…).

  10. Mark

    Just tried testing a kernel build on my i7-3930K (six cores/12 threads) with 32GB of memory. Testing was done in a tmpfs, and in addition to the compile, I had two programs chewing up a CPU core each to simulate someone using the system while compiling in the background.

    Times are real/user/sys, in seconds.
    -j1: 638.7/535.3/67.9
    -j6: 148.3/746.2/85.5
    -j10: 122.6/888.6/96.8
    -j12: 119.0/902.6/98.2
    -j16: 116.4/921.1/97.9
    -j32: 116.3/940.6/101.2
    -j999: 117.5/954.4/103.7

    What’s interesting is that compile speed continues to rise well past the point where I’ve got one job per execution unit, presumably because more jobs means that those two non-compiling programs are getting smaller and smaller slices of the CPU. The other interesting thing is that “efficiency” (total CPU-seconds needed to compile the kernel) is highest with just one job, and plummets as the job count goes up.

  11. trueriver

    I’ve done similar experiments, compiling @world on a freshly installed stage3 to give a reasonable selection of apps, and hopefully some kind of overview.

    My findings, which like those of the blogger relate only to my machines, are that the following give best results
    MAKE -j $nthreads -l $((nthreads-1)).8
    EMERGE --jobs $((nthreads+1)) --load-average ${nthreads}.4

    note that the load average can be specified as a real number, and I took advantage of this here.

    In the case of make, the best loadaverage came out at 0.2 less than the number of threads.

    In the case of emerge, the best load average came out as 0.4 more than the number of threads.

    I had iterated in steps of 0.2, and by the time I got to these results the differences were too small for me to be bothered with trying to interpolate further.

    These results were done some years ago on a box with two single core separate CPUs. Whatever the cause of the effect, it is nothing to do with the introduction of hyperthreading.

    My guess, and that is all it is so like the original blogger please don’t flame me, is that the handbook advice became wrong when two optimisations were introduced, both from a long time ago.

    First, the -pipe optimisation in gcc not only saves filespace but also allows more than one thread of the compiler to run at once for small amounts of the time. This means that when you have N compilations ongoing you may well have something like 1.1 * N threads running.

    Second, the use of tmpfs and -pipe also cut out a lot of waiting for i/o to complete. Therefore any given thread will spend less time in i/o wait; which is the main thing we are trying to avoid by over-tasking the CPU.

    Finally, a couple of comments on the tests: emerge is still more i/o intensive, and my tests were including the installs as it is the total time for the overall process that I was interested in. So it is no surprise that for the emerge limits, the N+1 still seems more useful.

    And yes, these tests were useful for heating my living room too – I did the tests during winter and at the time I was in an all-electric house so (assuming a perfect thermostat) they came with zero cost.

    I have not repeated the tests with modern equipment, but have still (rather irrationally) kept to my limits rather than the handbook limits ever since, despite not now using hardware that makes it relevant.

    As with the original blogger, the actual differences between run times were a very small proportion of the overall time when I compared N with N+1, and those for a difference of 0.2 in load-average much lower still.

    In fairness to the manual, I read it to say that the suggested values are reasonably good but may not be the absolute best. On that reading, they are completely fair comment.
