< bitcoin-git>
[bitcoin] Empact opened pull request #16123: Return error information on descriptor parse error (master...descriptor-parse-error) https://github.com/bitcoin/bitcoin/pull/16123
< bitcoin-git>
[bitcoin] Empact closed pull request #16123: Return error information on descriptor parse error (master...descriptor-parse-error) https://github.com/bitcoin/bitcoin/pull/16123
< bitcoin-git>
[bitcoin] practicalswift opened pull request #16124: tests: Limit Python linting to files in the repo (master...lint-inside-repo) https://github.com/bitcoin/bitcoin/pull/16124
< stevenroose>
Any idea what it could mean when Travis gives this error for the feature_notification.py test?
< stevenroose>
Seems like it can't write to the file somehow. Or the directory doesn't exist or something.
< jnewbery>
#proposedmeetingtopic - has travis got worse or have we just added so many builds to our job that it times out?
< jnewbery>
moneyball: ^
< ryanofsky>
MarcoFalke, could drahtbot just automatically restart travis jobs that fail with "Error! Initial build successful, but not enough time remains to run later build stages and tests."?
< wumpus>
ryanofsky: I don't think drahtbot queries/scans the travis logs right now
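A rough sketch of what such a drahtbot check might look like, assuming the Travis v3 API's job-restart endpoint and a token authorised to restart jobs; the job id, token, and helper name are placeholders, and the per-job log URL is the one wumpus gives later in the meeting:

```python
# Hypothetical sketch only: scan a Travis job log for the "not enough
# time remains" marker and, if present, ask Travis to restart that job.
# Assumes the v3 restart endpoint and a token authorised to restart
# jobs; job_id and token are placeholders.
import requests

API = "https://api.travis-ci.org"
MARKER = "not enough time remains to run later build stages and tests"

def restart_if_out_of_time(job_id, token):
    log = requests.get(f"{API}/v3/job/{job_id}/log.txt").text
    if MARKER not in log:
        return False
    requests.post(
        f"{API}/job/{job_id}/restart",
        headers={"Travis-API-Version": "3",
                 "Authorization": f"token {token}"},
    )
    return True
```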
< wumpus>
but it's a good topic I think, it definitely seems that travis is overloaded, I'm not sure what changed to make it like that now
< fanquake>
Are we at the free limit/tier with Travis? I know the idea of adding "another" CI wasn't liked too much, for risk of increasing the size of the slot machine, but if we need more capacity maybe we could look at Circle CI, and be able to add some BSDs while we are at it?
< wumpus>
right, I'd be for adding a CI only if it improves reliability by lowering the load on travis, the reason I'm against it otherwise is exactly that, even more different ways to fail randomly and different buttons to press to respin
< wumpus>
#startmeeting
< lightningbot>
Meeting started Thu May 30 19:00:15 2019 UTC. The chair is wumpus. Information about MeetBot at http://wiki.debian.org/MeetBot.
< wumpus>
"has travis got worse or have we just added so many builds to our job that it times out?"
< wumpus>
I've wondered this, too, travis has been more unreliable (on PRs at least) than it used to be
< jnewbery>
In the last couple of months, *a lot* of travis builds time out
< wumpus>
while I don't notice this for local tests
< jamesob>
hasn't seemed any worse to me recently, though we've had to rekick certain jobs for months
< jnewbery>
I don't know if our builds got slower, travis got slower, or we just added too many jobs for travis to handle
< achow101>
a lot of things have been added recently, maybe it's too much for them to handle?
< wumpus>
at least we should be careful with making the tests even longer now
< fanquake>
Also a lot of depends-related PRs timing out recently, but not much that can be done about that.
< instagibbs>
There is an element of how Travis is feeling that day
< instagibbs>
lots of variance in build times
< wumpus>
right, it's very possible that it's not our fault entirely though and it's simply the infrastructure becoming worse
< jnewbery>
There are currently 10 different test stages. I know it used to be 6 or 7
< wumpus>
I haven't noticed the tests or the builds becoming noticeably slower locally
< wumpus>
jnewbery: hmm might be time to evaluate whether they're really all contributing something useful
< instagibbs>
in elements land we've been having weird issues too that might reflect travis being overly taxed, hard to say
< instagibbs>
failure to write to files, that kind of stuff
< promag>
best case I usually see is around 20min (for longest job 8)
< jnewbery>
I know it runs those test stages in parallel
< wumpus>
yes, weird stuff happens but I don't think we see that often, it's mostly just timeouts
< luke-jr>
instagibbs: does the Elements Travis have caching enabled?
< ryanofsky>
are people seeing travis errors other than "Error! Initial build successful..."? This is the only travis error i see and restarting fixes it 100% of the time
< jnewbery>
ryanofsky: yes, that's the error
< luke-jr>
ryanofsky: I've seen cases where restarting *doesn't* fix it
< instagibbs>
ryanofsky, that's when depends takes "too long"
< promag>
ryanofsky: sometimes I see others and I leave a comment in the PR (before restarting) or create an issue
< instagibbs>
and it early exits
< luke-jr>
ryanofsky: but they mysteriously resolved before I could troubleshoot :/
< instagibbs>
luke-jr, I believe so, the restarting thing fixes that issue
< jnewbery>
The longest running test stage is "sanitizers: address/leak (ASan + LSan) + undefined (UBSan) + integer". I wonder if the same hardware is shared between different test stages and whatever is running at the same time as that one might time out
< wumpus>
it *should* be fixed by restarting, that's the point of that message, it's an ugly hack though
< jnewbery>
yes, travis is supposed to save developer time, not add an extra step to opening a PR!
< luke-jr>
jnewbery: on that note, it's annoying that AppVeyor doesn't use standard build tools, and requires duplicating changes for it
< meshcollider>
Is it possible for a new drahtbot feature to auto-restart builds with that error?
< achow101>
apparently travis switched from docker containers on ec2 to vms on gce late last year, maybe that's related?
< jnewbery>
has anyone raised this with travis? We have a paid account, right? Can we try to get support?
< promag>
does it run multiple "build depends" with the same conf if needed? sounds unnecessary?
< luke-jr>
jnewbery: Travis support is pretty useless in my experience :/
< luke-jr>
jnewbery: I expect we'd at the very least need a very specific concern
< jnewbery>
luke-jr: support is pretty useless in my experience :/
< wumpus>
yes, filing an issue 'it is slow' probably won't do much, I'm sure they get that a lot
< jamesob>
circleci (https://circleci.com/) execution is very good in my experience but I am definitely not volunteering to migrate our .travis.yml :)
< wumpus>
migrating to another CI would definitely be an option if it's better
< meshcollider>
jnewbery: Travis is free for open source projects, we don't pay
< wumpus>
(and then I really mean migrating not adding another one)
< luke-jr>
meshcollider: we do have a paid account, though
< meshcollider>
Oh?
< luke-jr>
meshcollider: afaik the Bitcoin Foundation set it up way back when
< jnewbery>
well, what's the issue exactly? There's some specific job timeout on travis, and so we cancel the build before that timeout to cache whatever has been built already? Can we ask them to increase the job timeout for us?
< jnewbery>
I believe we have a paid account so we get more parallel builds
< jnewbery>
because we were getting a backlog of PR builds a couple of years ago
< luke-jr>
jnewbery: it used to also be required for caches (though I'm not sure if they've since expanded that to free accounts)
< jamesob>
meshcollider: Chaincode started kicking in for Travis a year and change ago
< wumpus>
but it didn't use to run into the timeout so often, so the real issue is that it's become slower, not the timeout itself; increasing the timeout would work up to a point, but doing that indefinitely just makes the testing slower and slower, which isn't good either
< wumpus>
are we somehow doing more travis builds than before? e.g. how often does drahtbot re-kick builds?
< jnewbery>
yeah, someone needs to investigate where the slowness comes from. Is there an API to pull down all the build results from Travis for a project so we can at least get a sense for how often things are failing?
< wumpus>
yes, travis has a quite extensive API
< luke-jr>
jnewbery: there's certainly an API
< jnewbery>
One issue is that restarting a build makes the logs for the failed build unavailable (at least on the website)
< luke-jr>
(including one that lets you SSH into the VM)
< wumpus>
jnewbery: I don't know if that's the case for the API, or the new log will get a new id
< luke-jr>
wumpus: pretty sure it overwrites the existing log
< jamesob>
I think there's some Amdahl's law at work here though - is speeding up travis really going to make the process materially faster? we're pretty review-bound
< wumpus>
ah yes, I'm sure too - this is the URL: https://api.travis-ci.org/v3/job/$1/log.txt (it's per job, and it will have the same job id)
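For the earlier question about pulling build results, a hedged sketch using the same v3 API, assuming the public builds endpoint for bitcoin/bitcoin can be read without authentication (otherwise an Authorization token header would be needed); it also saves a job log to disk before a restart overwrites it, since the job id stays the same:

```python
# Hypothetical sketch: tally recent build outcomes for bitcoin/bitcoin
# via the Travis v3 API, and archive a job log before a restart
# overwrites it (restarts reuse the same job id).
import requests

API = "https://api.travis-ci.org"
HEADERS = {"Travis-API-Version": "3"}

def recent_build_states(limit=100):
    r = requests.get(f"{API}/repo/bitcoin%2Fbitcoin/builds",
                     params={"limit": limit}, headers=HEADERS)
    counts = {}
    for build in r.json()["builds"]:
        counts[build["state"]] = counts.get(build["state"], 0) + 1
    return counts  # e.g. counts of "passed", "failed", "errored"

def archive_job_log(job_id, path):
    log = requests.get(f"{API}/v3/job/{job_id}/log.txt").text
    with open(path, "w") as f:
        f.write(log)
```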
< wumpus>
jamesob: *making it fail less* will lower frustration
< jamesob>
wumpus: yeah, agree spurious failures are annoying
< jnewbery>
yeah, it's not about how long it takes, it's that if you open a PR, most of the time you'll need to log back in an hour or more later to rekick travis
< wumpus>
it's really frustrating if tests fail randomly, if that happens too often people take them less seriously which means actual problems might be ignored
< luke-jr>
maybe a flag to tell Travis "updated caches, please restart"?
< * luke-jr>
ponders if the Travis job can call the API to restart itself
< jnewbery>
wumpus: exactly that - inconsistently failing builds/tests lower confidence in the product/tests and hide real problems
< luke-jr>
wumpus: well, false failures on tests is another issue IMO
< promag>
luke-jr: it needs an auth token, not sure if there's a way to set secrets in travis
< jnewbery>
luke-jr: that'd be nice, or to get drahtbot to do it, but this is just a variation on increasing the timeouts and kicking the can down the road, no?
< luke-jr>
promag: there is, but it may be a dead idea since the job would still be running :/
< luke-jr>
jnewbery: dunno, it might be nice to restart after cache updates regardless
< luke-jr>
just to make it more deterministic for the actual tests
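A sketch of the self-restart idea being pondered here, assuming a token stored as a secure/encrypted environment variable (TRAVIS_API_TOKEN is a placeholder name) plus the TRAVIS_JOB_ID variable Travis sets for each job; as noted above, whether Travis honours a restart request for a job that is still running is unclear:

```python
# Hypothetical sketch: a Travis job asking the API to restart itself,
# e.g. after a cache-only run. TRAVIS_JOB_ID is set by Travis;
# TRAVIS_API_TOKEN is a placeholder for a token kept as a secure env
# var in the repo settings. Restarting a still-running job may simply
# not work, per the discussion above.
import os
import requests

def restart_self():
    job_id = os.environ["TRAVIS_JOB_ID"]
    token = os.environ["TRAVIS_API_TOKEN"]
    requests.post(
        f"https://api.travis-ci.org/job/{job_id}/restart",
        headers={"Travis-API-Version": "3",
                 "Authorization": f"token {token}"},
    )
```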
< fanquake>
Should we move on to another topic? Seems like the conclusion here is to go away and investigate Travis/bot/other CI options and discuss in AMS?