
canonical-ci-engineering team mailing list archive

Re: Adt-cloud concerns (Re: Internet access in tests)

 

Hello Celso,

Celso Providelo [2015-05-26 10:29 -0300]:
> On Tue, May 26, 2015 at 3:30 AM, Martin Pitt <martin.pitt@xxxxxxxxxx> wrote:
> > However, in terms of prioritization this is by far not urgent. The
> > current numerous problems that we have with our CI autopkgtest cloud
> > infrastructure are far more important/urgent: missing support for all
> > architectures except amd64, a *lot* slower, no daily base images, they
> > don't dynamically scale, inefficient controller vs. testbed
> > allocation, not using ScalingStack, the layout of results in swift got
> > totally broken, we still use Jenkins in between, frequent failures to
> > start tests.
> I've just split this thread so we can organise and discuss your
> concerns about adt-cloud separately from the internet-access-for-tests
> endless epic.

Sure, sounds good! Related to that, I already asked twice on IRC, but
let's ask here again: Where does one report bugs against adt-nova?
https://bugs.launchpad.net/adt-cloud-worker does not exist (can we
perhaps enable that?) and it seems that this is unrelated to
lp:uci-engine, so reporting bugs there wouldn't work either?

> I think we should clarify what are actual *problems* (read
> bugs/regressions) and what are missing features and future work. Those
> are very different things with different priorities and I am quite
> certain we agree on this.

Right, of course. I didn't say that all of the above were regressions,
but we should keep in mind that the move to nova was done so that we
actually see some *improvements* over the rather busted situation that
we had (and still have) with the bunch of static and manually
maintained testbed machines and Jenkins. The main advantage is that we
now have roughly twice the x86 bandwidth (as the static machines only
run i386 while nova runs amd64), but the problems with manual
maintenance, Jenkins, etc. still remain.

Some expansion on my quick list above:

 * missing support for all architectures except amd64

   →  as long as we have that, maintenance has actually become worse,
   not better; I'd say the importance of this is rather high, so it's
   kind of a regression now in terms of manpower

 * a *lot* slower

   → regression (see below), but not very urgent

 * no daily base images

   → regression; this probably contributes most to the speed decrease,
   and it also potentially makes tests more unstable

   low urgency right now, but this can quickly become medium
   if the additional overhead of dist-upgrading from an ancient base
   image and then rebooting for each and every test becomes
   unbearable
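
   For illustration, a nightly refresh job along these lines would keep
   the base images current so that testbeds don't have to dist-upgrade
   on every run; this is only a rough sketch, and the image/flavor/key
   names as well as the ssh reachability of the instance are
   assumptions on my part:

     # hypothetical nightly base image refresh; image/flavor/key names
     # are made up, and the instance is assumed to be reachable via ssh
     import datetime
     import subprocess

     def run_on(host, cmd):
         # run a command on the testbed over ssh (assumes routable name/IP)
         subprocess.check_call(['ssh', '-o', 'StrictHostKeyChecking=no',
                                'ubuntu@' + host, cmd])

     def refresh_base_image(release, arch):
         stamp = datetime.date.today().strftime('%Y%m%d')
         name = 'adt-refresh-%s-%s' % (release, arch)
         # boot the current (possibly stale) base image
         subprocess.check_call(['nova', 'boot', '--poll',
                                '--image', 'adt/%s-%s' % (release, arch),
                                '--flavor', 'm1.small',
                                '--key-name', 'adt-worker', name])
         # bring the instance up to date
         run_on(name, 'sudo apt-get update && sudo apt-get -y dist-upgrade')
         # snapshot it as today's base image, then throw the instance away
         subprocess.check_call(['nova', 'image-create', '--poll', name,
                                'adt/%s-%s-%s' % (release, arch, stamp)])
         subprocess.check_call(['nova', 'delete', name])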

 * they don't dynamically scale

   → this seems to be a design problem; testbeds are meant to be
     allocated and dropped as needed, so we shouldn't have to pre-create
     n controllers/workers, let them do nothing most of the time, and
     have large queues whenever gcc/glibc/glib2.0/etc. hit

     urgency: medium; not a regression, but also not quite a
     credible/useful cloud story either :-)

 * inefficient controller vs. testbed allocation

   → probably part of the previous point -- currently (I think) we have
   20 controller nodes which mostly sit idle and just waste cloud
   resources; a controller can easily drive many dozens of parallel
   adt-run runs, as it essentially just does a bunch of "nova boot"
   commands and shovels logs from testbeds into swift

   urgency: medium (see above)
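
   To illustrate the last two points: a single controller essentially
   boils down to a thread pool around "nova boot", roughly like the
   sketch below, where the image/flavor/container names are made up and
   run_adt() stands in for the actual adt-run-over-ssh invocation
   (which I'm not spelling out here):

     # rough sketch of one controller driving many dynamically allocated
     # testbeds; names are made up, run_adt() is a placeholder
     import subprocess
     import uuid
     from multiprocessing.pool import ThreadPool

     def run_adt(package, testbed, logfile):
         # placeholder: run adt-run for <package> against <testbed> over
         # ssh and write the test log to <logfile>
         raise NotImplementedError

     def run_test(package):
         testbed = 'adt-%s-%s' % (package, uuid.uuid4().hex[:8])
         logfile = testbed + '.log'
         subprocess.check_call(['nova', 'boot', '--poll',
                                '--image', 'adt/wily-amd64',
                                '--flavor', 'm1.small', testbed])
         try:
             run_adt(package, testbed, logfile)
             # shovel the log into swift
             subprocess.check_call(['swift', 'upload', 'adt-results',
                                    logfile])
         finally:
             subprocess.check_call(['nova', 'delete', testbed])

     # a single node can easily drive dozens of these in parallel
     ThreadPool(50).map(run_test, ['libpng', 'gem2deb', 'glib2.0'])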

 * not using ScalingStack

   → this is probably the blocker for full arch support? so "high"

 * the layout of results in swift got totally broken

   not sure where that came from; the data structure was designed
   carefully between the Debian CI team, Vincent Ladeuil and me in
   https://wiki.debian.org/debci/DistributedSpec so that we can drop
   all the hideous mechanics on snakefruit and tachash and make
   britney directly poll swift (or perhaps some kind of mirror of it)
   for incoming results. uci-engine got that right, and results were
   in a swift (pseudo-) directory like
   /trusty/amd64/libp/libpng/20140321_130412_adtminion7/log
   but now the results look like
   /adt-0daac672-9baa-4e1c-a4f3-509b1515c507/results.tgz
   which is totally unpredictable, not sortable, and useless for
   efficient evaluation due to the single .tgz.
   I guess this is because it got reimplemented from scratch without
   considering the spec, as Celso took this over from Vincent, and a
   lot of the existing state/knowledge got lost in the handover?

   → also not a regression, but medium in the sense that it keeps
     blocking the move of britney from jenkins to swift
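
   To make the difference concrete, the spec's layout is trivial to
   construct and to poll by prefix; a minimal sketch (the container
   name and auth parameters are made up, and the pool prefix follows
   the usual Debian convention, hence "libp" for "libpng"):

     # sketch of the DistributedSpec-style result layout and how britney
     # (or a mirror) could poll it; container/auth details are made up
     import time
     from swiftclient.client import Connection

     def pool_prefix(source):
         # Debian pool convention: 'lib*' packages use four characters
         return source[:4] if source.startswith('lib') else source[:1]

     def result_path(release, arch, source, host):
         stamp = time.strftime('%Y%m%d_%H%M%S')
         return '%s/%s/%s/%s/%s_%s/log' % (
             release, arch, pool_prefix(source), source, stamp, host)

     # e.g. 'trusty/amd64/libp/libpng/20140321_130412_adtminion7/log'
     print(result_path('trusty', 'amd64', 'libpng', 'adtminion7'))

     # listing new results for a release/arch is a simple prefix query,
     # with no opaque per-run tarballs to download and unpack
     conn = Connection(authurl='https://keystone.example.com/v2.0',
                       user='adt', key='secret', tenant_name='adt',
                       auth_version='2')
     headers, objects = conn.get_container('autopkgtest-results',
                                           prefix='wily/amd64/')
     for obj in objects:
         print(obj['name'])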

 * we still use Jenkins in between

   → same issue -- we originally designed all this to drop Jenkins
     from the picture, and now we keep building even more jobs on it

 * frequent failures to start tests

   → these are usually high urgency, but are being dealt with as part
   of the daily "cihelp:" churn. Thanks to Siva for your timely help
   with those!

So, as you can see, most of these aren't regressions, but from my POV
they are almost all necessary to actually improve upon the situation
that we had before adt-nova. Also, pretty much everything above by far
outranks "disable network access from tests", as that's a lot of work
for little benefit.

> I am particularly interested in your point about adt-cloud being a
> 'lot' slower that qemu-VMs, specially in backing it up with (rough)
> data we already collect in jenkins:
> 
> http://d-jenkins.ubuntu-ci:8080/label/adt&&i386/load-statistics?type=hour
> http://d-jenkins.ubuntu-ci:8080/label/adt&&amd64/load-statistics?type=hour

Consider a recent example:

 http://d-jenkins.ubuntu-ci:8080/job/wily-adt-gem2deb/8/ARCH=amd64,label=adt/
 (25 minutes)
 http://d-jenkins.ubuntu-ci:8080/job/wily-adt-gem2deb/8/ARCH=i386,label=adt/
 (1 minute 39 seconds)

The log doesn't contain the nova setup part. In my experience, "nova
boot" takes < 1 minute (vs. starting a local VM, which takes < 10 s),
so that part can't explain most of the difference. I guess that the
extra 23 minutes are due to dist-upgrading a too-old base image, but
this deserves more detailed logging.

If we had an actually elastic solution right now, this wouldn't
matter that much, but quadrupling (on average) the time of every test
together with a static limit of 20 parallel tests gives us a
noticeable throughput bottleneck.
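
Back of the envelope, with an assumed (purely illustrative) 5 minute
average test duration before the move:

  # illustrative only: the 5-minute baseline is an assumption, the 4x
  # slowdown and the 20-slot limit are from the observations above
  slots = 20
  old_avg_min = 5.0
  new_avg_min = old_avg_min * 4

  for label, avg in [('before', old_avg_min), ('now', new_avg_min)]:
      print('%s: ~%d tests/hour' % (label, slots * 60 / avg))
  # before: ~240 tests/hour
  # now: ~60 tests/hour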

> Can we schedule a hangout/meeting to discuss these in details and
> establish a common view about the current solution status ?

Sure! Just pick a time between 05:00 and 17:00 UTC, but preferably not
today any more as my voice is still rather rough/weak from a cold.

Thanks,

Martin

-- 
Martin Pitt                        | http://www.piware.de
Ubuntu Developer (www.ubuntu.com)  | Debian Developer  (www.debian.org)
