< cfields>
NicolasDorier: hmm, I hadn't seen your hash cache stuff yet
< cfields>
I had been working on something very similar for the purpose of making the sigcache lock free
< cfields>
(after seeing morcos's numbers and discussing with jeremyrubin)
< NicolasDorier>
cfields: why is it so important to be lock free ?
< cfields>
I was going to wait to discuss with Jeremy since it was his idea, but I suppose I should push it up now for discussion. It'd be a shame if we collided with each other and missed out on the possible speedup
< cfields>
NicolasDorier: it's not, but morcos reported very significant lock contention with 16 cores. and it's not actually necessary to lock it
< NicolasDorier>
for the cached hash map, the lock is taken very briefly for a lookup in a map
< cfields>
NicolasDorier: sec, I'll push up my (not-fully-vetted) work
< cfields>
NicolasDorier: much of it is quite similar to your work :)
< cfields>
NicolasDorier: if i can convince morcos to bench that and there's no real improvement, I'll drop it. Otherwise, we might want to merge approaches a bit
< sipa>
alternatively, for just this, we can just compute the 3 cached hashes before doing any validation
< NicolasDorier>
oh cool I'll take a look, in my PR I have a benchmark code if you need
< sipa>
and then not have any locking at all
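(A minimal sketch of what sipa is suggesting here, assuming the three hashes are the per-transaction segwit midstate hashes NicolasDorier's cache covers. Every name below — Tx, CachedHashes, the Compute* helpers — is illustrative, and the helpers are stubbed where real code would serialize the relevant fields and double-SHA256 them.)

    #include <vector>

    struct Hash256 { unsigned char data[32] = {}; };
    struct Tx { /* inputs, outputs, ... */ };

    Hash256 ComputeHashPrevouts(const Tx&) { return Hash256{}; }  // stub
    Hash256 ComputeHashSequence(const Tx&) { return Hash256{}; }  // stub
    Hash256 ComputeHashOutputs(const Tx&)  { return Hash256{}; }  // stub

    struct CachedHashes {
        Hash256 hashPrevouts, hashSequence, hashOutputs;
        explicit CachedHashes(const Tx& tx)
            : hashPrevouts(ComputeHashPrevouts(tx)),
              hashSequence(ComputeHashSequence(tx)),
              hashOutputs(ComputeHashOutputs(tx)) {}
    };

    // Done once per block, single-threaded, before validation threads start;
    // workers then take a `const CachedHashes&` and need no lock or lazy fill.
    std::vector<CachedHashes> PrecomputeAll(const std::vector<Tx>& txs)
    {
        std::vector<CachedHashes> out;
        out.reserve(txs.size());
        for (const Tx& tx : txs) out.emplace_back(tx);
        return out;
    }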
< NicolasDorier>
we can compare easily
< cfields>
that also has the advantage of making the sigcache usable by libbitcoinconsensus in a way that's thread-friendly, but I'm not sure how necessary/desirable that is
< NicolasDorier>
cfields: are we talking about the same thing? I was talking about the segwit midhash caches, not the sig cache in fact
< cfields>
NicolasDorier: that's great. Sorry for not poking at it further before chiming in, making dinner atm. I'll review yours more carefully tomorrow
< NicolasDorier>
looking at your PR you are on the sig cache
< cfields>
NicolasDorier: yes, we're talking about different things. But I think we may want to approach them the same way.
< NicolasDorier>
ah yes indeed I'll check that.
< cfields>
like I said, I just took a quick glimpse and wanted to push it up for discussion before it was too late. I'll have a good look at the hash cache tomorrow :)
< cfields>
bbl
< NicolasDorier>
cool thanks !
< morcos>
NicolasDorier: cfields: jeremyrubin: I'm happy to bench anything.
< morcos>
What I've been working with so far is just a very small change to sigcache that does the erases at the end of block validation (since that's the only thing we need the lock for) and i'm using a boost lock free queue for that
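(A minimal sketch of the batched-erase idea morcos describes, not the commit being discussed: an illustrative map-backed cache with a boost::lockfree::queue holding the deferred erase requests.)

    #include <array>
    #include <map>
    #include <mutex>
    #include <boost/lockfree/queue.hpp>

    using CacheKey = std::array<unsigned char, 32>;

    class BatchedSigCache {
        std::map<CacheKey, bool> entries;          // stand-in for the real cache
        std::mutex cs;                              // still guards insert/lookup
        boost::lockfree::queue<CacheKey> pending;   // erase requests from workers

    public:
        BatchedSigCache() : pending(1024) {}

        bool Get(const CacheKey& k) {
            std::lock_guard<std::mutex> lock(cs);   // short lookup lock, as discussed
            return entries.count(k) != 0;
        }

        // Called from worker threads: record the hit for later removal instead
        // of erasing (and contending on the lock) immediately.
        void ScheduleErase(const CacheKey& k) { pending.push(k); }

        // Called once, single-threaded, at the end of block validation.
        void FlushErases() {
            std::lock_guard<std::mutex> lock(cs);
            CacheKey k;
            while (pending.pop(k)) entries.erase(k);
        }
    };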
< morcos>
of course it would be easy'ish to make a custom version
< morcos>
jeremy has been hard at work removing the locking from checkqueue. i know at least that the time to finish connecting all the transactions (and populating the checkqueue) is much slower due to lock contention.. so it's a guess that possibly there is a lot of improvement in the actual validation too
< morcos>
cfields, if your code is ready i'm happy to try that out tomorrow
< morcos>
i've combined this with an approach to keep an always hot CCoinsViewCache on top of pcoinstip instead of creating a new view in ConnectBlock. This both saves a lot of copying if things are already in pcoinstip and serves to keep things loaded from disk.
< morcos>
I do this by running CreateNewBlock once every 30 seconds. and then pulling into that hot cache the inputs that would be needed to validate that block
< morcos>
I think this makes it unnecessary to ever have a dbcache bigger than about 300M (although you need up to about 150M additional for the hot cache, but some of that memory you would have been using anyway in block validation)
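(A rough sketch of the "always hot" overlay cache morcos describes, using plain std::map stand-ins instead of CCoinsViewCache/pcoinsTip; only the warm-every-~30-seconds pattern comes from the discussion, everything else is illustrative.)

    #include <map>
    #include <set>
    #include <string>

    using Outpoint = std::string;   // stand-in for COutPoint
    using Coin     = int;           // stand-in for the cached coin data

    struct DiskView { std::map<Outpoint, Coin> db; };  // stand-in for the base view

    struct HotCache {
        DiskView& base;
        std::map<Outpoint, Coin> entries;              // persists across blocks

        explicit HotCache(DiskView& b) : base(b) {}

        const Coin* Fetch(const Outpoint& o) {
            auto it = entries.find(o);
            if (it != entries.end()) return &it->second;   // already hot: no copy
            auto dit = base.db.find(o);
            if (dit == base.db.end()) return nullptr;
            return &entries.emplace(o, dit->second).first->second;
        }

        // Run periodically with the inputs that the block returned by
        // CreateNewBlock would spend, so they are loaded before the real
        // block arrives.
        void Warm(const std::set<Outpoint>& likely_inputs) {
            for (const auto& o : likely_inputs) Fetch(o);
        }
    };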
< CodeShark>
there seems to be a phishing attack going on where you get an email that seems to come from xapo and you can get 10 bitcoins immediately if you download and install a wallet
< NicolasDorier>
morcos: is there some kind of lock free map so I can get rid of the lock in my hash cache map?
< NicolasDorier>
ah also, those cache objects might find their way into libconsensus; will we be able to reference boost?
< morcos>
NicolasDorier: i don't think boost is the right end decision, and i think lock free coding is beyond the level i feel very comfortable doing myself
< morcos>
so that's why i'm offering to test other people's implementations
< morcos>
but my goal here is to optimize current bitcoin core performance
< morcos>
how we make that work with libconsensus is a bit of a separate issue
< morcos>
without seeing a design document or us all agreeing on a roadmap for libconsensus, i don't feel able to help with that. i've been saying that for a long time now. i think it's a worthy goal, but approaching it piecemeal like this is frustrating.
< cfields>
morcos: mine's a little different....
< cfields>
morcos: jeremy made the observation that we only ever work with the sigcache from one thread at a time. ignoring the fact that _that_ thread spawns threads that trample on each other
< cfields>
so the idea is for each of the workers to return what they would have cached, then batch cache/uncache after each block
< morcos>
cfields: that's what that commit i showed you does
< cfields>
rather, the workers return a list of cache hits/misses. the caller can then add the misses and optionally purge the hits (as in the case of connecting a real block)
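(A minimal sketch of the approach cfields outlines, with illustrative types: workers never touch the cache, each returns the keys it hit and the keys it would have inserted, and the single caller applies them in one batch afterwards.)

    #include <array>
    #include <map>
    #include <vector>

    using Key = std::array<unsigned char, 32>;

    struct CheckResult {
        bool ok = true;
        std::vector<Key> hits;     // found in the (read-only) cache
        std::vector<Key> misses;   // verified from scratch, candidates to insert
    };

    struct SigCache { std::map<Key, bool> entries; };

    // Caller side, single-threaded, after all workers have finished.
    void ApplyResults(SigCache& cache, const std::vector<CheckResult>& results,
                      bool erase_hits /* true when connecting a real block */)
    {
        for (const CheckResult& r : results) {
            for (const Key& k : r.misses) cache.entries.emplace(k, true);
            if (erase_hits)
                for (const Key& k : r.hits) cache.entries.erase(k);
        }
    }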
< morcos>
you don't actually cache anything in validation
< morcos>
it's only uncaching
< morcos>
you don't add misses in validation
< morcos>
only in ATMP, which isn't multithreaded
< morcos>
although good to make that work so we can make ATMP multi threaded
< morcos>
anyway, is that branch you have something that's ready to be tested
< morcos>
that's easy enough for me to do
< morcos>
got to go, will be online tomorrow
< cfields>
morcos: hmm, ok, I'll reread when I'm not preoccupied. I just saw it setting from get()
< cfields>
morcos: that'd be great, thanks. I'll read yours more carefully along with NicolasDorier's
< NicolasDorier>
cfields: wait until tomorrow to review my PR, I'm changing some stuff based on sipa's feedback
< luke-jr>
jtimon: the issue is entirely giving bad advice to miners. regardless of what they might or might not configure in the end, we should not give bad advice
< luke-jr>
(btw, I see no reason the performance "hit" of using maxsize would be greater than the performance hit of using 0.13 over 0.12..)
< fanquake>
Have you noticed quite a lot of output during the Windows build now?
< fanquake>
Along the lines of warning: conflicts with previous declaration ‘void boost::signals2
< wumpus>
good that sigs match!
< wumpus>
no, haven't noticed
< fanquake>
I'll get the logs and open an issue.
< wumpus>
thanks
< morcos>
luke-jr: i was just discussing with sdaftuar yesterday, i'm not sure there is much of a perf hit, it just hasn't been benched yet. i may try to do that today. why does 0.13 have worse performance than 0.12, i don't think that's true.
< michagogo>
The "native_biplist" has a confusing name
< morcos>
luke-jr: ugh, sorry. i keep getting confused. i guess there are two ways it could get slower, and my anecdotal evidence was only that the extra serialization wasn't that bad. but the repeated failing to find a final tx if blockmaxsize is your constraining factor seems like it could be slow, especially on a big mempool
< michagogo>
My first thought was, wtf? Why is a list of BIPs its own package?
< michagogo>
Then I clicked it and saw it was downloading from PyPI, and that was even more confusing... until I looked up the package description
< morcos>
anyway, too much annoyance to go test all these possibilities.. i think it's a clearly correct recommendation that until segwit activates using only blockmaxweight is the desired configuration. miners will be upgrading anyway for segwit, so there is a chance to argue something different for that
< morcos>
let's concentrate on getting 0.13 right.
< sdaftuar>
cfields: i'm looking at your crash from yesterday. seems odd! can you give some more color on what the environment was: mainnet/testnet/regtest? fresh node or already synced? what commit were you running?
< instagibbs>
is it expected behavior for a reindex to not show blocks being processed for a while?
< instagibbs>
in getinfo at least
< instagibbs>
I seem to recall different behavior
< cfields>
sdaftuar: uhmm, from my http-thread branch, which was merged today
< cfields>
sdaftuar: so current master should be basically the same thing
< cfields>
checking my backlog to see how i was running
< cfields>
sdaftuar: ah, a funky config, forgot I was running that way. probably significant...
< cfields>
the -reindex-chainstate was accidental, left in there from testing something before.
< morcos>
cfields: i have some results. here are the average times per block for Connect transactions/Verify txins/Connect block (measures are total time so far as reported in bench)
< morcos>
in ms
< morcos>
master: 26 / 80 / 111
< morcos>
your sigcache: 39 / 46 / 82
< morcos>
batched deletes using boost lock free (commit i showed you): 22 / 32 / 63
< morcos>
for yours and the batched deletes test i also included your copy-move improvements and my hot tipcache, but those don't make much difference on master
< cfields>
ok
< TomMc>
In regards to signature hashtypes, if the hashtype is SIGHASH_ALL, is the last byte given a value of 0x01?
< cfields>
i'm still working on the prevector btw, i've almost got that rewritten and specialized for the unsigned char case
< morcos>
i'll dig into it more, to see if i can understand why your code slowed down Connect transactions
< cfields>
i wonder what throws the connect time off...
< cfields>
morcos: thanks a bunch for testing. that's very interesting. I'll look into the connect as well.
< cfields>
morcos: one more interesting test, if you don't mind
< morcos>
cfields: the connect time can be reduced to as low as 10ms if you eliminate the locking contention in checkqueue
< cfields>
morcos: run with -par=1 as a baseline. I was surprised to see that it's actually faster on some of my machines that way
< morcos>
:) i know! how i found this stuff in the first place was that running in blocks only mode was actually faster than running with tx, implying using the sigcache was slower than verifying the sigs!
< cfields>
morcos: right. I'm looking forward to pairing those two, I should think they would improve eachother
< cfields>
heh, right
< morcos>
will try par=1 as well
< cfields>
morcos: wait. current master actually _is_ faster with -par=1 on 16 cores?!
< morcos>
oh, no i haven't tried that, i was just saying "i know" in the sense that it's not surprising to me that it could be in some configs
< cfields>
oh, whew :)
< cfields>
looks to me like it breaks even around ~3 as-is
< cfields>
but I've tested so many configs now that I can't keep the numbers straight, so that could be wrong
< cfields>
morcos: ah, looks like it's the overhead of creating a copy of the to-test CScriptCheck that slows down the connect. that could be much improved
< morcos>
oh, that was my guess at first but then i didn't see where that happened b/c i thought it was just shared pointers
< sipa>
cfields: before libsecp256k1-based verification, i benchmarked where the break even point was on a large-number-of-cores machine, and set the max value for -par based on that
< morcos>
the method of batching deletes is extremely effective
< cfields>
morcos: yea, those should be cheap, but I'm assuming it's the constant inserts() that suck
< sipa>
when we switched to libsecp validation, the break even point probably dropped a lot
< sipa>
cfields: copy? it should be using swap
< cfields>
sipa: ah yes, that makes sense then.
< morcos>
sipa: depends on whether you are taking advantage of the sigcache a lot or not though
< cfields>
morcos: that whole vector business (including checkqueue) could be much improved with a list i think. Then we could splice all over the place instead.
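(A small sketch of the splice idea, assuming per-worker results are kept in std::list: the master can move every worker's results into one combined list in constant time per worker, with no copying or reallocation. Illustrative only.)

    #include <array>
    #include <list>

    using Key = std::array<unsigned char, 32>;

    std::list<Key> CombineBySplice(std::list<std::list<Key>>& per_worker)
    {
        std::list<Key> combined;
        for (auto& worker_results : per_worker)
            combined.splice(combined.end(), worker_results);  // O(1), moves nodes
        return combined;
    }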
< sipa>
i thought you were comparing with master
< cfields>
sipa: no. discussing different ways to reduce the sigcache lock contention
< sipa>
ok ok
< sipa>
looking forward to the results
< michagogo>
cfields: are you by any chance in the process of building rc2?
< morcos>
i can't remember if i mentioned this on channel before, but i think we made a mistake in the design of BIP 68
< michagogo>
Looks like we have 4 sets of sigs already
< morcos>
we should have designed that as a per input check
< cfields>
morcos: another route would be to maintain a single list rather than a vector of per-txin hits/misses. But I was afraid that would slow down one of the checks. I'll whip that up for another data point
< morcos>
that would have done away with that whole issue of calculating prevheights
< cfields>
michagogo: yes, builds in progress
< morcos>
cfields: keep in mind you never need both hits and misses
< michagogo>
(i.e. are detached sigs imminent? If not, I'll just wait until Saturday night, but if they are I can leave the VM running)
< michagogo>
Cool, thanks
< morcos>
i made a couple comments on your commit
< morcos>
and i actually haven't looked at how this interacts with the segwit style cachehashes that NicolasDorier is working on
< cfields>
morcos: well you at least need each txin's misses shared with each other. for the 2dup..checksig..2drop..checksig case, no?
< morcos>
cfields: hmm.. explain a bit more. you mean if you are checking the same sig multiple times you don't want to have to redo the actual verification?
< morcos>
we don't solve that problem now do we?
< cfields>
morcos: right
< cfields>
morcos: i believe so, as they're all cached on-the-fly
< morcos>
they aren't cached in ConnectBlock
< morcos>
they are only cached in ATMP
< sipa>
morcos: sorry, what is 'they' ?
< morcos>
signature verification results
< morcos>
what's in the existing sigcache
< sipa>
ah, yes
< cfields>
morcos: hmm, yes, you're right
< sipa>
we store results in atmp, delete results in connectblock
< sipa>
if you duplicate sigs i guess that indeed will hurt
< morcos>
cfields: but you were talking about sharing a txin's misses with other txin checks()'s? that's not necessary, right? it's only within a given txin that you'll have a duplicated signature?
< morcos>
seems like if that's a worthwhile performance improvement it's easy enough to add separately from the sigcache.. kind of a separate question
< sipa>
cfields, morcos: with nicolasdorier's cache, we could add some sort of accumulator to the cache, which lists all the signatures that were validated (or even sigcache iterators to them), and then delete them all at once after the whole block is validated
< cfields>
morcos: right. atm, they're per-txin. But since there's the overhead of creating a vector/list/whatever for each, I was thinking (out loud) that it may end up quicker to do per-block instead
< cfields>
sipa: that's exactly what this branch is doing :)
< cfields>
sipa: and morcos's as well
< sipa>
great
< * sipa>
shuts up
< morcos>
sipa: but you're right that it should probably be combined with NicolasDorier's cache too, b/c you'll want to batch the deleting of the hashes as well
< cfields>
sipa: no, it's helpful. The issue is that that means returning a list of hits/misses for each txin, which has a considerable overhead
< morcos>
cfields: yes, but a lock free list is relatively straightforward right... we don't have to use the boost version
< morcos>
but can accomplish the same thing ourselves.. so instead of returning the list, we just append to the shared lists of hits and misses
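(A sketch of the "roll our own" lock-free list morcos mentions, under the assumption that workers only ever push and the list is drained single-threaded after the block, so a simple atomic-head linked list suffices. Names are illustrative.)

    #include <array>
    #include <atomic>

    using Key = std::array<unsigned char, 32>;

    struct Node {
        Key key;
        Node* next;
    };

    class PushOnlyList {
        std::atomic<Node*> head{nullptr};
    public:
        // Safe to call concurrently from any worker thread.
        void Push(const Key& k) {
            Node* n = new Node{k, head.load(std::memory_order_relaxed)};
            // CAS loop: retry until this node is linked in front of the current head.
            while (!head.compare_exchange_weak(n->next, n,
                                               std::memory_order_release,
                                               std::memory_order_relaxed)) {}
        }
        // Called once, after all workers have finished.
        template <typename F>
        void Drain(F&& apply) {
            Node* n = head.exchange(nullptr, std::memory_order_acquire);
            while (n) {
                apply(n->key);
                Node* next = n->next;
                delete n;
                n = next;
            }
        }
    };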
< cfields>
morcos: sure, but i _really_ like the idea of moving the cache out so that it can be passed in without worry of threading semantics. that lets others (libbitcoinconsensus) use it as well
< cfields>
but that's not the strongest argument, so I'd be fine with yours as well
< cfields>
morcos: passing it in also means that ATMP can be threaded as well (as you mentioned yesterday). I suppose you could do the same with your approach by adding a similar batch-to-add structure
< morcos>
right
< sipa>
cfields: one way would be to have a simple preallocated vector in each validation thread
< sipa>
and afterwards, all results are combined
< cfields>
sipa: i started that way, but was hesitant to re-introduce contention in the re-combining. Did you have something in mind to avoid that?
< sipa>
cfields: i meant recombining only after all txn are validated
< morcos>
sipa: yep, that was another approach jeremy and i discussed
< cfields>
sipa: ah, so the threads would set a pointer to their thread_local vector at creation, then master could iterate when finished
< morcos>
cfields: i'm not sure i understand why your branch is safe to remove all the locks...
< cfields>
yes, that would be much better
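(A minimal sketch of the per-thread variant sipa and cfields settle on here, with illustrative names: each validation thread gets its own preallocated buffer, registered once at startup, and the master combines them only after every check for the block has finished.)

    #include <array>
    #include <cstddef>
    #include <deque>
    #include <mutex>
    #include <vector>

    using Key = std::array<unsigned char, 32>;

    class ResultCollector {
        std::mutex reg_mutex;                 // taken once per thread, at startup
        std::deque<std::vector<Key>> slots;   // deque: references stay valid on growth
    public:
        // Each worker calls this once to get its private, preallocated buffer.
        std::vector<Key>& Register(size_t reserve = 4096) {
            std::lock_guard<std::mutex> lock(reg_mutex);
            slots.emplace_back();
            slots.back().reserve(reserve);
            return slots.back();
        }
        // Master calls this only after all workers have finished the block's checks.
        std::vector<Key> Combine() {
            std::vector<Key> all;
            for (auto& v : slots) {
                all.insert(all.end(), v.begin(), v.end());
                v.clear();
            }
            return all;
        }
    };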
< cfields>
morcos: a const cache is passed in, it's not changing during validation
< morcos>
ah, i guess it's protected by cs_main?
< morcos>
yes, but how do you know another thread isn't modifying it while you are reading from it
< cfields>
morcos: right, it needs to be globally protected. atm nothing else touches it. so i suppose it's cs_main :)
< morcos>
right, an rpc call to submit a tx or something
< cfields>
yes
< cfields>
ok, I'll try the per-thread suggestion. should be easy to whip up i think
< luke-jr>
morcos: CPFP requires updating stuff as parents get included in the block, which probably isn't *that* expensive, but a heck of a lot more than merely adding numbers (the overhead of maxsize); AFAIK it doesn't do any extra serialization, just adding sizes up
< morcos>
luke-jr: i believe suhas's benchmarks for CPFP actually showed it negligibly faster than the pre-CPFP code. this is b/c you can add a whole package of txs at a time i think.
< GitHub72>
[bitcoin] laanwj closed pull request #8403: Process "notfound" messages, and safeguard against unreasonably long … (master...ProcessNotfound) https://github.com/bitcoin/bitcoin/pull/8403
< luke-jr>
morcos: benchmarks with what tx sets? I suspect it would vary significantly depending on the input
< morcos>
historical simulation over a default mempool with default policy i think
< luke-jr>
what is a default mempool? anyhow, I guess the slower path is unlikely to be used historically
< morcos>
300M
< luke-jr>
probably not too likely to be used more with CPFP either
< luke-jr>
morcos: oh, but we don't have historical mempool contents
< morcos>
luke-jr: i'm not sure what you mean. sdaftuar wrote code to save p2p messages to disk and be able to replay them back off disk to a node. we've been using that for years to test and bench different changes.
< luke-jr>
morcos: oh, I was not aware; so you have years of data like that? :o
< morcos>
yep
< morcos>
it's not a perfect system by any means, but it's good for benching and it was very useful for evaluating different fee estimation algorithms
< Chris_Stewart_5>
Where do I add a new unit test file to have it included in test_bitcoin? I looked in test_bitcoin.cpp and didn't find an obvious spot, unless i'm blind
< sipa>
Chris_Stewart_5: just add it to the makefile
< sipa>
there is no explicit list of all tests
< Chris_Stewart_5>
Hmm, the README seems a little misleading then
< luke-jr>
then clarify it too while you're at it :p
< Chris_Stewart_5>
but it is so much more fun to complain! :-)
< cfields>
gitian signers: v0.13.0rc2 sigs are pushed
< cfields>
michagogo: ping ^^
< sdaftuar>
cfields: i can reproduce your segfault. kind of an interesting case really, though i believe unrelated to #8096.
< sdaftuar>
cfields: looks like the issue is that if you ctrl-c on startup, while running with -reindex-chainstate, chainActive will not be initialized
< sdaftuar>
because ActivateBestChain returns immediately on ShutdownRequested()
< sdaftuar>
that violates an assumption in RewindBlockIndex