< gmaxwell>
also with no signature validation.. also about three hours.
< gmaxwell>
can't really make use of all the cpus on this host.
< dcousens>
IO bound?
< gmaxwell>
no, it's cpu bound in all the things that aren't signature validation.
< dcousens>
haha
< gmaxwell>
but thats what you get with over 8 cores or so.
< gmaxwell>
(in particular it gets bound on the leveldb, block handling, hashing, etc., and that inserts bubbles in the downloading pipeline)
< phantomcircuit>
oh no signature validation
< phantomcircuit>
heh
< gmaxwell>
phantomcircuit: no, this is with signature validation, specifically I'm testing the libsecp256k1 pull.
< gmaxwell>
But if I bench with none it won't be much faster.
< gmaxwell>
or openssl much slower. (because it'll just manage to use more of the cpus)
< phantomcircuit>
ohh
< phantomcircuit>
gmaxwell, it'll be a bit faster but yeah
< wumpus>
dcousens: the work queue is part of the HTTP server, happens before requests are handed off to the application logic (REST or RPC)
< dcousens>
wumpus: all good :), just hadn't received it before, caught me off guard
< wumpus>
don't know if the default number is high enough, do you hit it under quasi-normal circumstances? at some point under heavy 'abuse' it's better to reject clients instead of adding them to an overlong queue (also to prevent the requests from filling up memory), but the default of 16 may well be too conservative
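For anyone hitting that limit under legitimate load: assuming the queue wumpus is describing is the one controlled by the HTTP server's -rpcworkqueue startup option, the depth can be raised in bitcoin.conf, e.g.

    rpcworkqueue=64

with 16 being the default he mentions.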
< phantomcircuit>
gmaxwell, bleh it's really annoying not having a MAC for the plaintext key
< phantomcircuit>
any opposition to storing one?
< wumpus>
what plaintext key and MAC are you talking about?
< * jonasschnelli>
is also wondering...
< phantomcircuit>
oh the encrypted wallet entries
< phantomcircuit>
the only way to tell which keys go to which master key (technically we support more than one... kind of) is to derive the public key
< wumpus>
why do you want to change wallet encryption? it works fine
< phantomcircuit>
it's very slow and corruption can only be detected when you unlock
< jonasschnelli>
i think phantomcircuit's optimizations make sense
< phantomcircuit>
we're basically using the private key -> public key derivation as a hash function right now
< phantomcircuit>
which doesn't make a lot of sense
< wumpus>
requiring an explicit pubkey derivation was on purpose, although due to the AES padding trick you don't actually need to do it
< wumpus>
what do you want to optimize? what is not fast enough?
< wumpus>
'trying more passphrases per second'? :p
< gmaxwell>
phantomcircuit: why do you need one for the plaintext key? you're checking the ciphertext.
< phantomcircuit>
gmaxwell, the current check will detect whether you have multiple master keys mixed together into a single wallet file
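A rough sketch of the check being described, assuming the usual crypter-style helpers (the function name CheckCryptedKey is illustrative, not actual Bitcoin Core code): validating an encrypted key record today means decrypting it and re-deriving the public key.

    // Illustrative sketch: checking one encrypted wallet key costs a
    // decryption plus an EC pubkey derivation, which is what makes the
    // exhaustive check at unlock slow for wallets with many keys.
    #include "crypter.h"   // CKeyingMaterial, DecryptSecret-style helper (assumed visible here)
    #include "key.h"       // CKey, CPubKey
    #include <vector>

    static bool CheckCryptedKey(const CKeyingMaterial& vMasterKey,
                                const CPubKey& vchPubKey,
                                const std::vector<unsigned char>& vchCryptedSecret)
    {
        CKeyingMaterial vchSecret;
        if (!DecryptSecret(vMasterKey, vchCryptedSecret, vchPubKey.GetHash(), vchSecret))
            return false;   // wrong master key, or corrupt ciphertext
        CKey key;
        key.Set(vchSecret.begin(), vchSecret.end(), vchPubKey.IsCompressed());
        // The expensive part: derive the pubkey and compare it to the stored one.
        return key.IsValid() && key.GetPubKey() == vchPubKey;
    }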
< wumpus>
what are we even trying to accomplish here? what is too slow that needs to be optimized?
< wumpus>
I really don't like aimless optimization in the context of crypto code :)
< gmaxwell>
wumpus: please see his existing PR, the performance is not the main focus (though its quite slow with large wallets)
< gmaxwell>
the existing behavior has no integrity checking at all for keys in an encrypted wallet, which is not very safe.
< phantomcircuit>
wumpus, the main focus is detecting corruption on load instead of at the first unlock
< wumpus>
what I want to avoid is to make it really fast but insecure
< gmaxwell>
Nothing he's talking about would reduce security.
< wumpus>
then hashing database records (e.g. adding a checksum to the data) will work well enough
< wumpus>
no need to mess with the crypto
< gmaxwell>
wumpus: Yes thats what his PR does.
< wumpus>
ok, then what's the problem?
< gmaxwell>
he's trying to also eliminate the need for the really slow check on unlock, it sounds like. (which is problematically slow for wallets with very large numbers of keys)
< gmaxwell>
the same is avoided already for non-encrypted wallets.
< wumpus>
that never used to be the case, we only have to check one key/value pair
< wumpus>
if you can assure the database is correct some other way
< wumpus>
at some point someone wanted to check *all* keys on first unlock, as an integrity check
< wumpus>
but that's not necessary if you add one at the db level
< phantomcircuit>
that's what it does now :)
< wumpus>
yeah, I heavily disagreed with that
< gmaxwell>
wumpus: We _do_ check all of them on first unlock, as there is no other integrity check.
< wumpus>
that indeed makes the first unlock slow on big wallets, that was clear, I even commented that back then
< phantomcircuit>
gmaxwell, if we're OK with saying all the encrypted keys in the wallet have to be from the same master key
< gmaxwell>
Sure, and thats still better than having no integrity check.
< phantomcircuit>
then we dont need the hash on the plaintext key
< wumpus>
anyhow, adding an integrity mechanism at the db level sounds good to me
< phantomcircuit>
but then we should assert if there's more than 1 master key entry
< wumpus>
but please don't change this code unnecessarily
< gmaxwell>
phantomcircuit: does it actually work correctly now if there is more than one?
< phantomcircuit>
gmaxwell, it correctly fails
< phantomcircuit>
with the hashes it will silently not fail
< phantomcircuit>
which would be double plus not good
< gmaxwell>
ah so without the exhaustive test it won't know that some are encrypted but with another key.
< wumpus>
if you just add a checksum to the data, how can it add paths where it will silently work? you just add more checks, so more failures
< gmaxwell>
wumpus: if he adds the checksum and turns off the exhaustive test (now not so important with the checksum)
< gmaxwell>
phantomcircuit: so make it fail the load if there is more than one master key and call it done.
< wumpus>
right
< wumpus>
so that was the same in most actual releases (before the exhaustive check was added)
< wumpus>
agreed
< phantomcircuit>
sounds reasonable to me :)
< wumpus>
at this point having more master keys *means* corruption
< wumpus>
are you protecting all records or just the keys?
< wumpus>
in any case, an assert to only have one master key is a good sanity check
< wumpus>
who knows what happens with more, did anyone ever test that part...
< phantomcircuit>
wumpus, it's unfortunately a huge nuisance to protect all the records
< phantomcircuit>
i should look at doing that more carefully though...
< wumpus>
phantomcircuit: ok, then doing just the keys is a good compromise
< wumpus>
it is the most important data in the wallet
< gmaxwell>
key records are already protected, ckey are not. but yea, most concerning is keys, and I think its a reasonable compromise.
< wumpus>
if this was starting from scratch I'd want to do it for all data, but I can believe you if you say that becomes too ugly now
< phantomcircuit>
hmm maybe it's not as hard as i was thinking
< wumpus>
we should keep in mind that older versions should still be able to open the wallet
< phantomcircuit>
wumpus, yeah i need to look at the serialization format to be sure
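A minimal sketch of the direction agreed on above (record layout and names are illustrative, not the PR's actual serialization): checksum each encrypted-key record on write, verify it on load, and treat more than one master key as corruption.

    // Illustrative sketch only: a cheap integrity check per 'ckey' record so
    // corruption is detected at load time, without decrypting anything.
    #include "hash.h"      // Hash()
    #include "pubkey.h"    // CPubKey
    #include <vector>

    static uint256 CryptedKeyChecksum(const CPubKey& pubkey,
                                      const std::vector<unsigned char>& vchCryptedSecret)
    {
        // Hash over pubkey || ciphertext so a ciphertext paired with the wrong
        // pubkey is also caught, not just bit flips in the ciphertext itself.
        return Hash(pubkey.begin(), pubkey.end(),
                    vchCryptedSecret.begin(), vchCryptedSecret.end());
    }

    // On load (pseudocode of the policy discussed above):
    //   if (checksum stored with the record != CryptedKeyChecksum(...)) -> treat wallet as corrupt
    //   if (more than one 'mkey' record is found)                       -> treat wallet as corrupt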
< jonasschnelli>
does -debug=bench affect the performance in a recognizable way?
< wumpus>
it shouldn't - I think the timings are always done, just not logged normally
< wumpus>
so any performance overhead will be in generating more log data
< gmaxwell>
really chatty log messages do, but the debug bench log data is not that chatty.
< jonasschnelli>
okay. thanks... i do now compare the IBD time of the secp256k1 branch with master (with standard parameters)
< phantomcircuit>
are there windows builds for 6954 ?
< gmaxwell>
As far as protecting all the data goes --- assuming it could be compatible it would be useful, though otoh, it would be nice to keep the checksums for pubkeys around in memory so we could test against them at point of use to reduce the window for corruption further.
< GitHub13>
[bitcoin] MarcoFalke opened pull request #6962: translations: Don't translate markup or force English grammar (master...MarcoFalke-2015-translations) https://github.com/bitcoin/bitcoin/pull/6962
< GitHub30>
bitcoin/0.11 4e895b0 Pieter Wuille: Always flush block and undo when switching to new file...
< sdaftuar>
gmaxwell: sipa: off the top of your head, do you have a sense of the relative speed of calculating a sha256 hash versus performing a signature validation using secp?
< jgarzik>
256x2 itym?
< sdaftuar>
yeah probably that is what i mean
< sdaftuar>
actually it might not be what i mean, i think #6918 (which i'm analyzing performance on now) does a single sha256 if i'm reading the code right
< morcos>
wumpus: re my comment on flushing undo files. i think i didn't realize that a reindex could pick up from where it left off. so it still reads through all the block files, but if the hash is in mapBlockIndex it skips processing?
< wumpus>
yes
< morcos>
i was assuming the corruption happened after a reindex finished, but before the flush
< morcos>
ok, thx
< wumpus>
it used to remember where it was, but this is complicated by block files being possibly out of order, so now it scans from the beginning again after restart (but this goes pretty fast)
< jonasschnelli>
IBD up to 382324 with secp256k1 branch took 9 hours...
< jonasschnelli>
standard parameters
< jonasschnelli>
Quad Core Intel(R) Xeon(R) CPU E31245 @ 3.30GHz
< jonasschnelli>
16GB ram (but i did not change -dbcache)
< jonasschnelli>
(now doing the same with current master [openssl], just curious what the performance benefit is with standard args on a standard system)
< gmaxwell>
sdaftuar: a verify with libsecp256k1 as it's used in bitcoin core is 280 times slower than a single run of the sha256 compression function. That PR ends up calling the sha256 compression function twice.
< gmaxwell>
sdaftuar: I think at some point we should add a faster cryptographic hash to the source code base for these internal non-normative uses. (E.g. blake2 is a bit over 5x faster than sha2-256; though if the user has a not-yet-released cpu with a sha256 instruction, sha256 will be faster again).
< sdaftuar>
gmaxwell: thanks. i was investigating whether the use of sha256 in the PR might cause an unexpected slowdown... after some more investigation i think my concern was misplaced, it looks like any "slowdown" is pretty minuscule, and i think i wasn't hitting the conditions under which the PR would provide expected speedups. will keep testing...
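Rough arithmetic behind that conclusion, using the 280x figure above:

    cost(2 sha256 compression calls) / cost(1 signature verify) ≈ 2 / 280 ≈ 0.7%

so even if every cache lookup's hashing were pure overhead, it would amount to well under a percent of the verification work per signature.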
< gmaxwell>
sdaftuar: figuring out if there was any tradeoff there is part of why pieter was measuring the cache performance with many sizes (because a small amount of hitrate increase offsets any constant cost). Also, the remove operation in the current code ends up being pretty slow compared to the hashing, in fact.
< sdaftuar>
because of map versus unordered_map?
< gmaxwell>
Yes.
< gmaxwell>
sipa: 6954 results--- so reindex with libsecp256k1, without 6954 was 3hr 16 minutes, reindex with it was _2 hours_ 7 minutes. Going to benchmark without signature validation.
< gmaxwell>
UTXO in memory at the end was 4012 MB.
< sipa>
gmaxwell: so from 5486 MB to 4012 MB, because of 6914?
< sipa>
gmaxwell: 6954 is libsecp validation, so i guess you mean 6914 instead?
< morcos>
sipa: is 6914 something you think should be considered for 0.12? i haven't really looked at it much, but i'd happily test it out if you think its a near term consideration
< gmaxwell>
sipa: oops yes.
< sipa>
morcos: i'm not sure what i can do more for it than get review
< morcos>
gmaxwell's test seems to indicate there is a HUGE speed improvement, thats from allocation overhead? i think that would be something nice to test in my Connect Block benching
< morcos>
sipa: understood, but it wasn't clear to me from your previous conversation with gmaxwell whether you even wanted to do 6914 or wanted to wait and do a bigger change
< sipa>
morcos: as far as my reading of the specification is concerned, i think it's fully compatible with std::vector
< sipa>
morcos: i don't think a bigger change is in scope
< sipa>
i mean... i'd like to change allocation of scripts to some pooled storage per block and tx
< gmaxwell>
One third less sync time and 25% less memory consumption is pretty nice. These results don't surprise me -- well, so I'm a little surprised that _just_ fixing the scripts did this, I would be less surprised if the whole cache entry were made a single allocation.
< morcos>
sipa: yes thats what i was referring to, but that doesnt mean you don't want 6914 for now is what i'm clarifying, you still do want 6914
< morcos>
gmaxwell, i was seeing a significant chunk of ConnectBlock time being just destroying the temporary cache at the end.
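To illustrate why removing the per-script heap allocation matters, a toy version of the small-buffer idea (this is not the actual container from the PR, and the names are made up):

    // Toy illustration: keep the first N elements inline in the object itself,
    // so short scripts never touch the heap; only oversized ones fall back to
    // a separately allocated buffer.
    #include <vector>

    template <unsigned int N, typename T>
    class small_buffer_vector {
        T direct[N];              // inline storage for the common, small case
        std::vector<T> indirect;  // heap storage, used only past N elements
        unsigned int n = 0;
    public:
        void push_back(const T& v) {
            if (n < N) direct[n] = v;
            else indirect.push_back(v);
            ++n;
        }
        const T& operator[](unsigned int i) const {
            return i < N ? direct[i] : indirect[i - N];
        }
        unsigned int size() const { return n; }
    };

    // Typical scripts are a few dozen bytes, so something like
    // small_buffer_vector<28, unsigned char> avoids one malloc/free per script
    // and keeps the bytes next to the rest of the cache entry.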
< morcos>
while i have you both here. do you have any thoughts on how we could get rid of the 1 remaining database lookup for the coinbase in 6932
< morcos>
i hate that you have to access the DB for every block just in case you're processing the 2 historical violations of BIP 30
< sipa>
looking
< jgarzik>
morcos, agreed
< jgarzik>
that wants fixing (as you've proposed)
< morcos>
maybe i'm trying to overoptimize, but i wasn't able to eliminate checking the database for preexisting coins when you are creating the new outputs for a new coinbase
< sipa>
morcos: how about writing them and just not marking them as fresh?
< sipa>
make ModifyNewCoins take a boolean fPossiblePreexisting or something
< morcos>
that was my first plan, but it assert fails
< morcos>
the code assumes that if you're not fresh you must exist in the parent
< sipa>
i see
< sipa>
maybe that can be relaxed
< morcos>
i think you could remove that assert, but i didn't like making that change
< morcos>
so the idea i've been hesitant to suggest is to actually check if its one of those two hashes
< sipa>
yuck
< morcos>
hence the hesitancy, but its really more clear as to why you're doing what you're doing
< morcos>
otherwise you have to have a long winded explanation of why you treat a coinbase differently anyway
< morcos>
also i assume way faster than having to not mark all coinbases as fresh or first check the database for them
< morcos>
i don't know, maybe not faster than not-FRESH marking i guess
< morcos>
anyway, lunch.. i'm happy to do whatever you guys are most comfortable with
< sipa>
morcos: the only advantage of fresh is that it's removed from memory altogether before hitting disk, if it's spent before flush
< morcos>
sipa: right. so i guess that doesn't happen unless you have a big cache, with coinbases... but even if it happens a small fraction of the time, a DB write has to be orders of magnitude more expensive than checking 2 hashes on every coinbase
< sipa>
morcos: i think removing the assertion is fine
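A sketch of the resolution being converged on here (the flag name comes from the chat above; the helper name and exact fields are illustrative, not the PR's code): only mark the new outputs FRESH when they cannot already exist in the parent view, so the per-block database lookup goes away.

    // Illustrative sketch: create the cache entry for a new tx's outputs.
    // FRESH means "if this entry is fully spent before the next flush, just
    // drop it instead of writing a deletion to disk"; that is only safe when
    // the coins cannot already exist in the parent cache/database.
    #include "coins.h"     // CCoinsMap, CCoinsCacheEntry

    void AddNewCoinsEntry(CCoinsMap& cacheCoins, const uint256& txid,
                          bool fPossiblePreexisting)
    {
        CCoinsCacheEntry& entry = cacheCoins[txid];
        entry.coins.Clear();
        entry.flags |= CCoinsCacheEntry::DIRTY;
        if (!fPossiblePreexisting)
            entry.flags |= CCoinsCacheEntry::FRESH;
        // With the old assertion removed, a non-FRESH entry no longer has to
        // be backed by an existing entry in the parent view.
    }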
< sipa>
gmaxwell: my sigcache performance at 320 MB (with validating everything) keeps improving (it's an exponentially decaying average with 0.99 factor per block, so ~half a day halflife), down to 6% now
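For reference, the half-day figure follows from the decay factor, with one update per ~10-minute block:

    0.99^n = 0.5  ->  n = ln(0.5) / ln(0.99) ≈ 69 blocks ≈ 69 × 10 min ≈ 11.5 hours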
< sipa>
gmaxwell: feel like commenting on 6914?
< sipa>
morcos: i guess in 6914 i can add more comments that refer to the specification and unit tests for the iterators
< morcos>
sipa: what size in MB would be equivalent to a 1M sigcache of the old version
< morcos>
i tried 80MB which was my guess from the pull, and verification seemed to get a little faster, but overall run time slowed down
< sipa>
morcos: it should be around 3 times smaller
< sipa>
between 70 and 85 bytes per entry now
< sipa>
around 220 before
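Those per-entry figures work out to roughly the 80 MB guess above:

    old cache:  1,000,000 entries × ~220 B/entry  ≈ 220 MB
    new cache:  1,000,000 entries × 70-85 B/entry ≈ 70-85 MB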
< gmaxwell>
sipa: will comment when I'm through testing things with it.
< gmaxwell>
reindex with 6914 and no signature checks, 1h 17m.
< morcos>
sipa: did you benchmark adding things to the sigcache, or only looking them up? I'm wondering if calling GetRand() that often is too expensive. something slowed down for me for sure...
< gmaxwell>
morcos: it GetRand's on eviction (and the initial initialization) but not otherwise.
< morcos>
yeah i guess in the steady state it shouldn't be more than a couple evictions per new tx.. hmm.
< sipa>
morcos: the eviction takes like a microsecond
< morcos>
yeah i misremembered GetRand speed, but its a tad under a micro, i thought it was a few micros. still maybe better to use insecure_rand(), but doesn't explain my performance issue
< sipa>
morcos: in any case, the getrand is certainly the largest cpu time offender in the sigcache behaviour
< sipa>
more than the hash
< morcos>
i'm checking to see if i didn't match cache sizes, but i don't think thats the problem, since verification did speed up
< morcos>
its the rest of the run time that slowed down...
< sipa>
do you measure this by running two nodes in parallel?
< sipa>
or just long period of time averaging?
< sipa>
morcos: where exactly are you seeing slowdown?
< morcos>
total time to run a simulation over 7 days, increased by about 5%
< sipa>
interesting!
< morcos>
time spent in Connect Block had about a 10% decrease (but this number i've found has a lot of noise, i think due to the vagaries of the level db cache and/or whatever else is happening with my hard disk)
< morcos>
btw, have you thought about the effect on reorgs
< morcos>
reorgs are already incredibly slow, and now you've just lost the signature cache for every tx in the block
< morcos>
maybe we can skip signature checking on blocks added back from a reorg?
< morcos>
txs from the disconnected block added to the mempool that is
< morcos>
eh, that won't help since a bunch of them will need to be checked if they are included in the replacement block
< morcos>
sipa: looks like it might be GetRand(). just ran with insecure_rand and it was a 5% overall speedup instead. i'm going to try again removing a bunch of extraneous debugging i have.
< sipa>
morcos: oh, that's an optimization i've thought about before
< sipa>
morcos: we actually store in the block index whether a block was verified already
< morcos>
sipa: are you talking about something different? a re-re-org?
< sipa>
right
< sipa>
morcos: it's strange that getrand makes things slower... the earlier version used way more entropy
< morcos>
i'm just talking about regular reorg, where you add back all the txs from the disconnected block. so now you have to process on the order of a few thousand txs with literally none of their sigs in the cache
< sipa>
by using GetRandBytes()
< sipa>
morcos: one way to mitigate that would be to not remove sigcache entries during validation, but instead keep a list of txids added in each block in the past, and remove entries a few blocks deep
< sipa>
eh, nevermind
< sipa>
that would mean keeping a list of all signature checks
< morcos>
wait why wouldn't that work
< morcos>
just keep a list of entries to remove
< sipa>
it would work, but it's ugly, large, and a layer violation :)
< morcos>
every new block add to the back of the list and remove from the front
< sipa>
we could use greg's idea of annotating the sigcache entries with a sequence number
< sipa>
and then after every block wipe things with a sequence number a few before
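A rough sketch of that sequence-number scheme (illustrative class and constants; nothing here is from an actual PR): entries used by a block are tagged rather than erased, and only swept once the tag is a few blocks old, so a short reorg still hits the cache.

    // Illustrative sketch: delay sigcache eviction by a few blocks instead of
    // erasing entries the moment a block uses them.
    #include <cstdint>
    #include <map>
    #include "uint256.h"

    class DelayedEvictionSigCache {
        // entry hash -> sequence of the block that consumed it (0 = not yet used in a block)
        std::map<uint256, uint64_t> entries;
        uint64_t sequence = 1;
        static const uint64_t DEPTH = 6;   // keep entries ~6 blocks after use
    public:
        void Set(const uint256& hash) { entries.emplace(hash, 0); }
        bool Get(const uint256& hash) const { return entries.count(hash) != 0; }

        // Instead of erasing an entry when a block uses it, remember when it was used.
        void MarkUsedInBlock(const uint256& hash) {
            auto it = entries.find(hash);
            if (it != entries.end()) it->second = sequence;
        }

        // Once per connected block: drop only entries consumed DEPTH+ blocks ago,
        // so transactions from a recently disconnected block are still cached.
        void BlockConnected() {
            ++sequence;
            for (auto it = entries.begin(); it != entries.end();) {
                if (it->second != 0 && it->second + DEPTH <= sequence)
                    it = entries.erase(it);
                else
                    ++it;
            }
        }
    };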
< morcos>
as for GetRand(). you're right, that doesn't make sense that its slower
< morcos>
so maybe the speedup from insecure_rand is real, but that still doesn't explain why it was slower in the first place
< sipa>
can you try with a 5 MB cache?
< sipa>
maybe the slowdown is just from larger working set
< cfields>
morcos: something you might try, i discovered this last week when profiling new network code...
< morcos>
the base case i was comparing against was a 1M entry cache
< sipa>
1M entries with old code?
< morcos>
yes
< sipa>
does it get filled?
< cfields>
morcos: the secure_delete openssl allocator was taking ~25% of the time to process a transaction. Commenting that out made a massive difference in the profile
< cfields>
not sure if that's artificial or if it actually surfaces in the real-world
< cfields>
s/transaction/message/
< morcos>
i didn't check, but over 7 days it accepts 1.3M txs to the mempool, so i would assume thats several million sigs
< sipa>
cfields: wut
< cfields>
sipa: take that with a grain of salt, it was a very artificial benchmark
< sipa>
morcos: so an interesting thing i saw is that with master and a 300 MB mempool, and delete-on-use-in-block sigcache, i never got higher than 944k sigcache entries
< sipa>
cfields: what kind of profiling?
< morcos>
sipa: that seems plausible, but the old code doesn't have delete on use
< cfields>
sipa: gprof of echoing messages back/forth between nodes
< sipa>
morcos: in fact, it stayed between 936k and 944k very consistently
< cfields>
sipa: CDataStream is backed by the zeroafterfree alloc, so i suspect it'd be an issue in many other places
< sipa>
cfields: yeah, that may be unnecessarily much
< cfields>
i'd prefer if we were more conservative with where that was used. for non-sensitive data, it just seems wasteful
< sipa>
indeed
< sipa>
and network data should never be sensitive
< cfields>
sipa: right. very likely that that particular case was highly inflated though. It was a worst-case of bouncing thousands of tiny messages around very quickly. Still goes to show the trend, though.
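For context, a stripped-down sketch of what the allocator being discussed does (modelled loosely on the zero-after-free idea; not copied from Bitcoin Core, which uses a non-elidable cleanse rather than a plain memset):

    // Minimal sketch: behave like the standard allocator, but wipe memory
    // before handing it back, so freed stream buffers don't linger in RAM.
    // That wipe is the per-message cost being measured above.
    #include <cstddef>
    #include <cstring>
    #include <memory>
    #include <vector>

    template <typename T>
    struct zero_after_free_allocator : public std::allocator<T> {
        template <typename U> struct rebind { typedef zero_after_free_allocator<U> other; };
        zero_after_free_allocator() {}
        template <typename U>
        zero_after_free_allocator(const zero_after_free_allocator<U>&) {}

        void deallocate(T* p, std::size_t n) {
            if (p != NULL)
                std::memset(p, 0, sizeof(T) * n);  // real code must stop the
                                                   // compiler from eliding this
            std::allocator<T>::deallocate(p, n);
        }
    };

    // A stream backed by such an allocator pays for the wipe on every buffer,
    // even when the contents (e.g. relayed network messages) are public:
    typedef std::vector<char, zero_after_free_allocator<char> > wiped_charbuf;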
< morcos>
sipa: apologies. that slowdown was an artifact. i've run it a couple times now, and it is faster overall with 6918. i'm still seeing 10% speedup in ConnectBlock, but then also an overall speedup
< morcos>
switching to insecure_rand offered incremental improvement over that, but nothing like what i thought i saw before
< sipa>
morcos: ok, good to know!
< gmaxwell>
18 of the last 101 blocks have been v4.
< phantomcircuit>
gmaxwell, on mainnet?
< phantomcircuit>
that was fast
< gmaxwell>
cfields: the zeroafterfree allocator has the benefit of making some use-after-free or use-of-uninitialized-memory vulnerabilities unexploitable.
< cfields>
gmaxwell: mmm
< GitHub14>
[bitcoin] TheBlueMatt opened pull request #6964: Benchmark sanity checks and fork checks in ConnectBlock (master...verify-commits-fixes) https://github.com/bitcoin/bitcoin/pull/6964
< gmaxwell>
I dunno if its worth the cost, of course-- but it's not totally pointless.
< jgarzik>
perhaps makes valgrind happier
< GitHub124>
[bitcoin] TheBlueMatt opened pull request #6965: Benchmark sanity checks and fork checks in ConnectBlock (master...bench) https://github.com/bitcoin/bitcoin/pull/6965
< gmaxwell>
nah. it's on free, won't make valgrind happier.