canonical-ci-engineering team mailing list archive

Re: Microservices, the condensed version

To: Evan Dandrea <evan.dandrea@xxxxxxxxxxxxx>
From: Vincent Ladeuil <vila+ci@xxxxxxxxxxxxx>
Date: Wed, 10 Sep 2014 12:09:18 +0200
Cc: canonical-ci-engineering <canonical-ci-engineering@xxxxxxxxxxxxxxxxxxx>
In-reply-to: <CAOe9oG5ryyN=sPJ1aeEJtSE2hqk2MrEs=Wz31kNHm03tYw=Xow@mail.gmail.com> (Evan Dandrea's message of "Thu, 4 Sep 2014 11:52:01 +0100")
User-agent: Gnus/5.13 (Gnus v5.13) Emacs/24.3 (gnu/linux)

Sorry for the delay; this was stuck in an unread mailbox.

>>>>> Evan Dandrea <evan.dandrea@xxxxxxxxxxxxx> writes:

    > On 3 September 2014 14:31, Vincent Ladeuil <vila+ci@xxxxxxxxxxxxx> wrote:
    >> 1) as a rule, workers should only receive requests they can handle,

    > Big +1

    >> 2) if they fail to fulfill the request (or find they *cannot* process
    >> the request) they should reply with an error message and not put the
    >> message back on the queue

    > Assuming this is not a request that could be processed by another
    > worker reading from the same queue or this same worker once a bug
    > causing the failure to process is fixed.

Hmm, tricky one, no good answer for now.

Or rather, I see the point but can't see (in a generic way) how to
make a distinction between a genuine test failure and a failure caused
by a bug in the infra.

In the former case (test failure), a new package version will trigger a
new test request. The same package version won't be tested again. We're
good.

In the latter case (infra bug), I can't see how we can find which test
requests should be retried. We can't blindly retry all failed tests, nor
can we know whether test successes were false positives.
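
To make point 2 concrete, here is a minimal sketch of a worker that
replies with an error result and never puts the request back on the
queue. Everything in it is an assumption for illustration: the in-memory
queue.Queue objects stand in for the real message queues, and handle(),
work() and the dict fields are hypothetical names, not existing code.

```python
import queue


def handle(request):
    # Hypothetical test runner: raises on any failure.
    if request["package"] == "broken":
        raise RuntimeError("testbed could not be set up")
    return "PASS"


def work(requests, results):
    """Drain `requests`; every request yields exactly one result."""
    while True:
        try:
            request = requests.get_nowait()
        except queue.Empty:
            return
        try:
            status, detail = handle(request), ""
        except Exception as exc:
            # Reply with an error; the request is NOT put back on the queue.
            status, detail = "ERROR", str(exc)
        results.put({
            "package": request["package"],
            "version": request["version"],
            "status": status,
            "detail": detail,
        })


requests, results = queue.Queue(), queue.Queue()
requests.put({"package": "good", "version": "1.0"})
requests.put({"package": "broken", "version": "2.0"})
work(requests, results)
```

Note that the sketch does not try to tell a genuine test failure from an
infra bug: both end up as one result per request.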

    > We need to be very careful to ensure that these error messages are
    > processed, and I would argue this should be done asynchronously.

Big +1 on asynchronously. In fact, discussing with pitti I realized that
britney is already separating:

- triggering test requests,
- processing test results to decide if a package can be promoted.

I.e. test requests are fired (and almost forgotten); later on, the
results are polled to decide whether a package can be promoted.

So if test requests fail (for whatever reason), the promotion rules
will decide whether or not they are relevant. As pitti explained, that's
how armhf failures are ignored in various cases.
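
The separation can be sketched roughly as follows. This is not
britney's code: trigger_tests(), can_promote(), the list used as a
request queue and the results dict are all hypothetical, and treating a
missing result as "not promotable" is an assumption of the sketch.

```python
def trigger_tests(package, version, arches, request_queue):
    """Fire one test request per architecture and do not wait for replies."""
    for arch in arches:
        request_queue.append(
            {"package": package, "version": version, "arch": arch}
        )


def can_promote(package, version, results, required_arches, ignored_arches=()):
    """Promotion rule: every non-ignored arch must have a PASS result."""
    for arch in required_arches:
        if arch in ignored_arches:
            continue  # failures on this arch are not relevant
        if results.get((package, version, arch)) != "PASS":
            return False
    return True


request_queue = []
trigger_tests("foo", "1.2", ["amd64", "armhf"], request_queue)

# Later on, the results are polled (here: a dict filled in by hand).
results = {
    ("foo", "1.2", "amd64"): "PASS",
    ("foo", "1.2", "armhf"): "FAIL",
}
strict = can_promote("foo", "1.2", results, ["amd64", "armhf"])
relaxed = can_promote("foo", "1.2", results, ["amd64", "armhf"],
                      ignored_arches=["armhf"])
```

With these inputs, strict is False and relaxed is True: the armhf
failure is still recorded, and only the promotion rule decides whether
it matters.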

    > So britney sends a request to the test runner workers via a
    > queue. A worker picks this up and determines that neither it nor
    > its peers on the same queue can handle it. It sends an error
    > message to a separate queue that britney is reading from.

So, this is already achieved by having a test failure mentioning that
the testbed cannot be set up properly (or some other reason). The point
is: the test failed, we know how to represent that, so we don't need a
specific means to report that.
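
As a rough illustration, assuming a hypothetical Result record and
run_request() helper (neither exists in the current code), a testbed
setup problem can be reported through the same structure as any other
test failure:

```python
from dataclasses import dataclass


@dataclass
class Result:
    package: str
    version: str
    passed: bool
    reason: str = ""


def run_request(package, version, setup_testbed, run_tests):
    """Return a Result whether the tests ran or the testbed failed."""
    try:
        testbed = setup_testbed()
    except Exception as exc:
        # Same representation as a failing test, with the cause as reason.
        return Result(package, version, False, f"testbed setup failed: {exc}")
    return Result(package, version, run_tests(testbed))


def broken_setup():
    raise OSError("no armhf machine available")


result = run_request("foo", "1.2", broken_setup, lambda testbed: True)
```

Here result.passed is False and result.reason carries the setup error,
so the reader of the results needs no second queue or message type.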

    > Sound reasonable?

Yes ;)

        Vincent
