< sipa>
but the progress estimation code was changed significantly in 0.14
< gmaxwell>
reindexing spends something like 20 minutes up front scanning for headers, which might be distorting your numbers.
< pfeerpedr>
who do i need to talk to in order to speed up my transaction?
< bitcoin-git>
[bitcoin] MarcoFalke opened pull request #9846: doc: Small release notes fixups in the list of pulls (0.14...Mf1702-014doc) https://github.com/bitcoin/bitcoin/pull/9846
< fanquake>
achow101 Interestingly, my osx gitian results now match cfields'. Which is weird, because nothing has changed since rc1 that could have fixed gitian issues.
< achow101>
actually, just ran gitian again and it got cfields's results. I'll run it a few more times to make sure it is deterministic
< cfields>
achow101: it'd be really helpful if you could upload the .o files from a non-matching build
< achow101>
I think I can give you the kvm image of the non-matching build. I just need to make sure it is the right one
< cfields>
achow101: "on-target" after the build gives you a shell
< achow101>
cfields: well that build ended a while ago and I have since done other builds. right now I am trying to start the vm with that image of the mismatching build which I saved and then ssh'ing into it, but it doesn't seem to be working now
< cfields>
achow101: ok, let me know if you manage to get them. I'll check back in the morning
< jonasschnelli>
my setup: ./configure --enable-zmq --enable-glibc-back-compat --enable-reduce-exports CPPFLAGS=-DDEBUG_LOCKORDER --with-incompatible-bdb
< jonasschnelli>
(same as the failing travis setup)
< gmaxwell>
oh geesh we have source files with the same name. bet that'll be fun for anyone trying to build with msvc.
< jonasschnelli>
you mean the problem when we removed the rpc_ prefix and moved them into the rpc/ folder?
< gmaxwell>
yea, at least last time I used it MSVC couldn't handle source files having the same name even if they were in different directories. :)
< jonasschnelli>
My IDE's find-by-filename also doesn't like this
< jonasschnelli>
We could have kept the rpc_ prefix even after moving them into the specific folder
< gmaxwell>
so in those rpc tests I don't see anything that sets up the connman object. But if it's executing those objects it's not null. How is g_connman set up in the tests?
< fanquake>
jonasschnelli I can see the same results with valgrind
< jonasschnelli>
TestingSetup() has a g_connman = std::unique_ptr<CConnman>(new CConnman(0x1337, 0x1337)); // Deterministic randomness for tests.
< paveljanik>
FWIW - I'm not able to reproduce test_bitcoin failures on any of my machines (different unices) :-(
< wumpus>
darn
< bitcoin-git>
[bitcoin] laanwj closed pull request #9846: doc: Small release notes fixups in the list of pulls (0.14...Mf1702-014doc) https://github.com/bitcoin/bitcoin/pull/9846
< wumpus>
there seems to be nothing *special* in the config.log posted in #9850
< wumpus>
standard ubuntu 16.04 versions of everything
< wumpus>
no arguments to configure
< paveljanik>
I suspect some travis issue
< paveljanik>
(even if it was reproduced outside of it)
< wumpus>
I forgot something in my testing yesterday; the travis build passes "--enable-glibc-back-compat --enable-reduce-exports" and "LDFLAGS=-static-libstdc++". No difference in reproduction, though
< wumpus>
I also test it faster now, launch test_bitcoin and kill it after a second (after all, the problem happens just before the Running ... line so there's no need to go all the way)
< wumpus>
in any case it just works perfectly, every time, no matter what I do. Almost feels like travis is trolling us
< gmaxwell>
"Why do the patterns of failures seem to be spelling ascii digits? ...'wouldnt want to give yo..'"
< wumpus>
hehe, yes that would be a giveaway
< * wumpus>
threw DEBUG_LOCKORDER into the mix. No, that didn't help either
< wumpus>
never felt so unhappy to see "*** No errors detected"
< wumpus>
well, so much for trying to reproduce locally, going to try set up a trap for this on travis
< wumpus>
ok my gdb script is working, this should work
< bitcoin-git>
[bitcoin] laanwj opened pull request #9851: [do not merge] travis gdb parachute for #9825 (master...2017_02_travisissue) https://github.com/bitcoin/bitcoin/pull/9851
< wumpus>
Everything that can go wrong is going wrong, man, it's hard to think of a more nightmarish way to debug things. Well maybe debugging the kernel for GPU cache issues wins by a bit :/
< wumpus>
I'm going to cancel all other travis builds to give this one priority, sorry
< wumpus>
ah the builds are starting, let's see what surprises await this time
< wumpus>
NOOOOOOO don't start building ccache :(
< wumpus>
cfields: what would be the best way to skip building of dependencies for a PR, for debugging?
< wumpus>
I don't understand why all three builds of #9851 trigger a complete dependency rebuild, but this way it's not going to work, I need a fast iteration time to have any chance of reproducing the issue
< achow101>
did the signed binary detached sigs come out yet?
< BlueMatt>
wumpus: you could do it on your own personal fork?
< jonasschnelli>
Any idea why the LXC gitian initialization takes that long?
< jonasschnelli>
Here it takes >5mins during "Upgrading system, may take a while"... seems to be very long
< jonasschnelli>
(step between "install.log" and starting of "build.log")
< cfields>
wumpus: note that DEBUG=1 is used for the crash case. That adds the extra bounds checking from libstdc++
< cfields>
wumpus: as for rebuilding depends, the travis cache depends on the env vars set. So if you change an env var, it will create a new cache because it looks like a new build that it shouldn't clobber
< cfields>
where "change" also includes adding/removing env vars
< cfields>
gitian builders: sigs for v0.14.0rc2 are pushed
< BlueMatt>
so now that we have named args someone should probably do a pass and fix the million places that we reject args that are null even when they have a default value, I suppose?
< wumpus>
BlueMatt: yes - null should be interpreted as the default value, on a call by call basis
< wumpus>
I intend to get around to that for 0.15
< wumpus>
in most cases it's trivial
< wumpus>
there are a few such as getbalance that have slightly different functionality based on the number of arguments, some discussion will be needed there
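The "null means default" rule discussed above can be sketched as follows. This is a minimal illustration, not the project's RPC code: `MaybeInt`, `ArgOrDefault`, `DescribeQuery` and the default values are hypothetical names and numbers chosen for the example, and `std::optional` stands in for whatever JSON value type the real handlers receive.

```cpp
#include <cassert>
#include <optional>
#include <string>

// Hypothetical stand-in for an RPC parameter that may be omitted or null.
using MaybeInt = std::optional<int>;

// Returns the caller's value, or the documented default when the
// argument was omitted or passed as null.
int ArgOrDefault(const MaybeInt& arg, int default_value) {
    return arg.has_value() ? *arg : default_value;
}

// Hypothetical handler with two optional parameters. With named
// arguments a caller can set the second one while leaving the first
// null, so each parameter is resolved on its own.
std::string DescribeQuery(const MaybeInt& minconf, const MaybeInt& maxconf) {
    const int lo = ArgOrDefault(minconf, 1);        // assumed default
    const int hi = ArgOrDefault(maxconf, 9999999);  // assumed default
    return "minconf=" + std::to_string(lo) + " maxconf=" + std::to_string(hi);
}
```

Calling `DescribeQuery(std::nullopt, 6)` resolves the first parameter to its default instead of rejecting the call. Calls whose behavior depends on the number of arguments, as mentioned for getbalance, would not fit this per-parameter pattern without further discussion.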
< sipa>
i can't file an issue right now, but my RPi3 bitcoind OOMed, and marked a block invalid as a result
< cfields>
sipa: seems i just managed to bring down my dev box while testing a fix (forcing OOM). Hope you're happy :)
< cfields>
woohoo, rescued
< BlueMatt>
cfields: so you have a fix? or should I go look into it?
< cfields>
BlueMatt: yea, i have a patch ready. I'm uneasy about it though, so debate welcome
< cfields>
sec
< BlueMatt>
k
< cfields>
BlueMatt: see 9854
< BlueMatt>
oh
< BlueMatt>
hmmmm, I like that
< BlueMatt>
wait, does this apply to more than bad_alloc?
< cfields>
no
< gmaxwell>
cfields: next time replace malloc with a wrapper. :P
< BlueMatt>
if we can make it apply only to std::bad_alloc then I'm all for it (or is there a list of all the things this could apply to?)
< BlueMatt>
lol
< gmaxwell>
BlueMatt: sipa pointed out the error to me earlier in private, my comment:
< gmaxwell>
11:55 <gmaxwell> God damnit. it really should not reject the block because of a fucking exception!
< cfields>
gmaxwell: you mean new? :)
< gmaxwell>
11:55 <gmaxwell> I hate that we use exceptions for error handling in the serialization.
< gmaxwell>
11:56 <gmaxwell> maybe we can wrap the allocator so that failures kill the process.
< gmaxwell>
cfields: well I mean the underlying libc function new calls, which is malloc. (same way tcmalloc replaces the allocator)
< cfields>
gmaxwell: this isn't our exception. This is a c++ feature.
< cfields>
gmaxwell: right, this overrides what happens when "new" fails. So this is essentially what you're asking for
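One standard C++ mechanism for overriding what happens when "new" fails is `std::set_new_handler`. The sketch below assumes that approach; whether the patch in 9854 is written exactly this way is not stated in this log, and the names `TerminateOnOutOfMemory` and `InstallOutOfMemoryHandler` are hypothetical.

```cpp
#include <cassert>
#include <cstdio>
#include <cstdlib>
#include <new>

// Called by operator new when an allocation fails. Because it never
// returns, callers never observe std::bad_alloc.
[[noreturn]] void TerminateOnOutOfMemory() {
    // Unregister first, so an allocation made while reporting the
    // error cannot re-enter this handler.
    std::set_new_handler(nullptr);
    std::fputs("Error: out of memory. Terminating.\n", stderr);
    std::abort();
}

// Installs the handler and returns the one that was previously set.
std::new_handler InstallOutOfMemoryHandler() {
    return std::set_new_handler(TerminateOnOutOfMemory);
}
```

A program would call `InstallOutOfMemoryHandler()` once, early in startup. The handler covers the standard `operator new` and `operator new[]`; memory obtained directly through `malloc` does not go through it, which is the distinction raised in the new-versus-malloc question.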
< gmaxwell>
cfields: no no: Our mistake is that a var int decode failure is an exception. Because of this we cannot wrap block processing with a catch * {tell user their hardware is befucked or something bad happened}.
< cfields>
gmaxwell: oh, i see what you mean
< gmaxwell>
Which basically means that random programming errors that throw exceptions can cause blocks to be rejected instead of the node shutting down, which is exactly what produced the bdb locks as a fork rather than a brief DOS.
< BlueMatt>
wait, ok, so has someone identified what actually happened here?
< gmaxwell>
There are basically three states for block processing: "I have a valid block", "I have an invalid block.", and "I notice that I am confused." the latter should shut down without marking the block invalid.
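The three states can be written down as an explicit result type, so that "confused" cannot be mistaken for "invalid". This is a minimal sketch; `BlockCheck`, `Action` and `DecideAction` are hypothetical names, not identifiers from the codebase.

```cpp
#include <cassert>

// Outcome of processing a block, per the three states described above.
enum class BlockCheck { Valid, Invalid, Confused };

// What the node should do next.
enum class Action { Connect, MarkInvalid, Shutdown };

// Only a definite validity failure marks the block invalid. An
// internal problem (out of memory, disk error, unexpected exception)
// leads to shutdown and leaves the block's status untouched.
Action DecideAction(BlockCheck result) {
    switch (result) {
    case BlockCheck::Valid:    return Action::Connect;
    case BlockCheck::Invalid:  return Action::MarkInvalid;
    case BlockCheck::Confused: return Action::Shutdown;
    }
    // Not reachable for the enumerators above; an unknown value is
    // treated as "confused".
    return Action::Shutdown;
}
```

A boolean return value cannot carry the third state, which is why a failure that is merely reported as "false" ends up looking like an invalid block.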
< sipa>
gmaxwell: i think you're overgeneralizing
< cfields>
gmaxwell: i completely agree. but this is a specific case that can be easily detected
< sipa>
gmaxwell: problems during deserialization shouldn't _ever_ cause a block to be marked invalid
< gmaxwell>
yes, this one we can work around. But where is the next one? this is the second one of those btw.
< gmaxwell>
Leveldb internal errors also used to do this to us.
< gmaxwell>
Third if you count bdb's internal errors.
< cfields>
gmaxwell: so let's fix that independently :)
< gmaxwell>
I'm fine with your general fix approach for now.
< gmaxwell>
I am lamenting that C++ code randomly calls exceptions without documenting the possibility clearly. And that we make use of exceptions to mark invalidity. Which means that random internal errors can mark invalidity. And you all know I hate exceptions, so that bias is not in question. :)
< gmaxwell>
cfields: why are you replacing new and not malloc? (I don't have a strong opinion, it's just a question)
< sipa>
cfields: it could use new[] instead, i think
< sipa>
gmaxwell: how do you replace malloc?
< gmaxwell>
glibc has a specific override. But perhaps there is no portable way?
< sipa>
you'd need to do it with link-time magic, and hope that libstdc++ doesn't bypass it somehow
< BlueMatt>
I'm still confused, where do we use such exceptions to mark invalidity?
< sipa>
BlueMatt: i don't know!
< sipa>
we shouldn't!
< BlueMatt>
yes, I dont see the specific issue here, yet
< gmaxwell>
sipa: your logs showed we did exactly that.
< cfields>
BlueMatt: my take-away from the above was that if we didn't throw in deserialization, we could just wrap acceptblock and activatebestchain in try/catch(), and abort any time something's caught
< BlueMatt>
that was after the bad_alloc
< cfields>
gmaxwell: memory allocation failed, then the _next block_ was rejected
< BlueMatt>
cfields: yes, and we should do that, probably still
< sipa>
gmaxwell: my assumption is that the error _is_ caught somewhere, not passed up, and as a result a normal "fail" return value is returned, and a higher layer interprets that as invalid block
< sipa>
gmaxwell: i don't think we have anywhere a direct "exception? mark invalid!" logic
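The failure mode assumed here, an exception caught low down and turned into an ordinary "false", can be illustrated with a small sketch. Everything in it is hypothetical: `InvalidData` stands in for a deliberate validity failure, and the two functions are not taken from validation.cpp.

```cpp
#include <cassert>
#include <functional>
#include <new>
#include <stdexcept>

// Hypothetical exception type for "the data really is malformed".
struct InvalidData : std::runtime_error {
    using std::runtime_error::runtime_error;
};

// Collapsing pattern: every exception becomes "false", so a higher
// layer cannot tell out-of-memory apart from a bad block.
bool CheckCollapsing(const std::function<void()>& check) {
    try {
        check();
        return true;
    } catch (...) {
        return false;
    }
}

// Separating pattern: only the validity exception maps to "false".
// Anything else propagates, so an outer layer can shut down instead.
bool CheckSeparating(const std::function<void()>& check) {
    try {
        check();
        return true;
    } catch (const InvalidData&) {
        return false;
    }
}
```

With the collapsing version, a check that throws `std::bad_alloc` and one that throws `InvalidData` both return false. With the separating version, the `std::bad_alloc` reaches the caller, which is the precondition for wrapping block processing in a single outer try/catch that aborts.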
< BlueMatt>
sipa: script interpreter does
< BlueMatt>
but thats it i believe
< BlueMatt>
(in the debug log you posted I do not believe that was the error, either)
< gmaxwell>
we do all over the place! we have a generic catch that returns false on functions that must be true for validity.
< BlueMatt>
gmaxwell: we do?
< gmaxwell>
Open up validation.cpp basically every catch in there does this.
< cfields>
sipa: my take was that the block was accepted, but we didn't switch to the new tip, so the next block failed when looking up inputs
< BlueMatt>
only script interpreter i believe
< gmaxwell>
okay it's not as bad as I thought.
< sipa>
did we in 0.14 introduce the SendRejectsAndCheckIfBanned(pfrom, connman) call in net_processing:2754 ?
< BlueMatt>
well now that i check it is worse than I thought :p
< sipa>
which before used to be inside the catch block?
< BlueMatt>
some disk reads shit that probably should be smarter than it is
< BlueMatt>
sipa: yes, and no, before it didnt exist
< BlueMatt>
(was only in SendMessages)
< cfields>
sipa: it's new, we used to only send rejects+ban from SendMessages()
< sipa>
i see
< cfields>
imo the throw happened somewhere around SetBestChain, it was just caught in ProcessMessages because that's the only place we do a generic catch(...)
< gmaxwell>
BlueMatt: well I thought it was _every_ one of them, but I checked readblockfromdisk and it's not.