< murrayn>
Is there a reason -O2 is specifically enabled in configure with --enable-debug? Should this not be -Og?
< gmaxwell>
sipa: linked on your SHANI pr is an implementation where someone else noticed the throughput/latency relationship that I noticed.. they also do a 4way and it's faster (by a small amount) than 2-way.
< gmaxwell>
they get 18% speedup for 2way over 1way, and 21% for 4-way over two-way.
< gmaxwell>
I'm not sure if that difference is even worth it, though perhaps throughput might increase for later cpus.
< sipa>
interesting, i'll try that too
< gmaxwell>
Their implementation might be interesting to look at to see if they had some smarter way of dealing with register pressure.
< sipa>
another remarkable thing i noticed: the speedup of 64-specialized shani over variable length shani was close to 2x
< sipa>
far higher than the ratio observed elsewhere
< sipa>
gmaxwell: from what i can see it's just interleaving
< gmaxwell>
(presumably register churn is why their attempt at 8-way was slower than 2/4-way)
< gmaxwell>
sipa: The 64-specialized saves expander work, which I guess isn't as fast with shani? or maybe it's just that shani is faster so calling overhead (which the specialized reduces) matters more?
< provoostenator>
Memory management is a pain. I have a device with 1 GB RAM, trying to squeeze as much as possible out of it during IBD. Without swap, if I set it slightly too high, it crashes when dbcache gets too large. With swap, it starts using the swap, which presumably defeats the purpose. Is there any way to _have_ swap but prevent dbcache from using it?
< gmaxwell>
provoostenator: I doubt swapping is actually defeating the purpose, at least if it isn't doing it heavily.
< gmaxwell>
The data that gets swapped is infrequently used stuff first...
< sipa>
gmaxwell: SHANI has special instructions both for expansion and transform
< provoostenator>
It indeed didn't seem very slow, so maybe it's not too bad in practice then. 450 MB dbcache (with maxmempool=5) seems about the max without swap.
< sipa>
gmaxwell: 4-way seems a bit slower here, but that may be due to less than perfectly interleaved code being emitted
< provoostenator>
I have a new theory as to why my aggressive pruning IBD branch is _slower_ than master. Namely that dirty CCoinsCacheEntry read/write doesn't perform well for very large cache sizes. See also https://github.com/bitcoin/bitcoin/pull/12404#issuecomment-395998702
< provoostenator>
(theory, still have to measure this)
< phantomcircuit>
sipa, does flushing the cache still remove everything?
< sipa>
yes
< phantomcircuit>
sipa, and there's no way to flush "up to block x" right?
< sipa>
phantomcircuit: indeed, because there may have been entries created before x, but spent after x, which wouldn't be present on disk
< sipa>
it is possible with the non-atomic flushing since 0.15 (which writes to disk a range of blocks rather than a single up-to-x point)
< sipa>
though it's pretty complicated to reason about
< phantomcircuit>
sipa, so to enable that you'd need to keep around entries that are a record of an entry being deleted?
< sipa>
phantomcircuit: you actually don't
< sipa>
you just need to accurately keep track of (a) the block up to which you've flushed everything and (b) the block up to which effects may be present on disk, and at startup replay the blocks' UTXO effects between those 2
< sipa>
that's already implemented even
< sipa>
however, once you introduce partial flushing during reorgs which may overlap etc... it becomes far more complicated
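The two-marker recovery sipa describes can be sketched as a toy model. This is not Bitcoin Core code: the block layout, `apply_block`, `recover`, and the two height arguments are all hypothetical names made up for illustration, and the sketch ignores reorgs entirely (the case noted above as far more complicated). It only shows why replaying the blocks between "flushed up to here" and "effects may be on disk up to here" repairs a half-written UTXO set, assuming each replayed effect is idempotent.

```python
# Toy model only; every name here is hypothetical, not Bitcoin Core's API.

def apply_block(utxos, block):
    """Apply one block's transactions in order; each step is idempotent."""
    for tx in block:
        for outpoint in tx["spends"]:
            utxos.pop(outpoint, None)   # deleting an absent coin is a no-op
        for outpoint, coin in tx["creates"].items():
            utxos[outpoint] = coin      # rewriting the same coin is a no-op


def recover(disk_utxos, blocks, flushed_height, maybe_height):
    """Replay blocks in (flushed_height, maybe_height] over the on-disk set."""
    for height in range(flushed_height + 1, maybe_height + 1):
        apply_block(disk_utxos, blocks[height])
    return disk_utxos


blocks = {
    1: [{"spends": [], "creates": {"a:0": 50}}],
    2: [{"spends": ["a:0"], "creates": {"b:0": 20, "b:1": 30}}],
    3: [{"spends": ["b:0"], "creates": {"c:0": 20}}],
}

# State a node would have with no crash: apply blocks 1..3 from scratch.
expected = recover({}, blocks, 0, 3)

# Simulated crash: block 1 is fully on disk, and only one effect of
# blocks 2..3 (the creation of c:0) was written before the crash.
crashed_disk = {"a:0": 50, "c:0": 20}
recovered = recover(crashed_disk, blocks, flushed_height=1, maybe_height=3)
```

In this toy, `recovered` ends up equal to `expected` whichever subset of the effects of blocks 2 and 3 had reached disk, because the replay reapplies all of them in order.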
< phantomcircuit>
yeah wasn't thinking about reorgs
< sipa>
all of this is doable, and i think i know all the algorithms necessary to implement it
< sipa>
with the goal of being able to have a background process that just periodically (and asynchronously) flushes the oldest dirty UTXO entries (and wipes the oldest non-dirty ones)
< sipa>
but it's a pretty big amount of work without knowing if it'll actually speed things up :)
< phantomcircuit>
sipa, i had a patch which did this, but broke consensus across reorgs
< phantomcircuit>
it was a substantial speed up
< phantomcircuit>
but that was a while ago, so possibly it wouldn't be as large anymore?
< sipa>
since per-txout in 0.15, performance profiles of such things may have shifted drastically
< sipa>
it could be less or more of a speedup now :)
< phantomcircuit>
yeah
< phantomcircuit>
iirc it was really simple to do
< phantomcircuit>
sipa, the FRESH flag looks a bit confusing
< phantomcircuit>
the idea is that if an entry is added and spent before a flush it's effectively a noop ?
< sipa>
it just means "this entry does not exist in the parent cache, so if it is spent, we can just forget about it"
< sipa>
phantomcircuit: it's *the* major performance gain our cache gives
< phantomcircuit>
ok i get that
< phantomcircuit>
yeah
< sipa>
because it avoids entries ever hitting disk at all
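A minimal sketch of the FRESH idea as described above, using a toy cache layered over a plain dict that stands in for the parent (on-disk) view. `ToyCache`, its methods, and the flag values are hypothetical and much simpler than the real CCoinsViewCache; in particular, this toy decides freshness by looking in the parent directly, which is an assumption made to keep the example short.

```python
# Toy model only; names and flag values are hypothetical.
DIRTY = 1  # cached entry differs from what the parent holds
FRESH = 2  # the parent holds no copy of this entry


class ToyCache:
    def __init__(self, parent):
        self.parent = parent   # dict standing in for the on-disk view
        self.entries = {}      # outpoint -> (coin or None, flags)

    def add(self, outpoint, coin):
        fresh = FRESH if outpoint not in self.parent else 0
        self.entries[outpoint] = (coin, DIRTY | fresh)

    def spend(self, outpoint):
        _, flags = self.entries.get(outpoint, (None, 0))
        if flags & FRESH:
            # The parent never saw this coin, so just forget it.
            del self.entries[outpoint]
        else:
            # The parent may hold the coin: keep a spent marker to flush.
            self.entries[outpoint] = (None, DIRTY)

    def flush(self):
        """Push dirty entries to the parent; return how many were written."""
        writes = 0
        for outpoint, (coin, flags) in self.entries.items():
            if flags & DIRTY:
                if coin is None:
                    self.parent.pop(outpoint, None)
                else:
                    self.parent[outpoint] = coin
                writes += 1
        self.entries.clear()
        return writes


parent = {"old:0": 10}
cache = ToyCache(parent)
cache.add("new:0", 5)
cache.spend("new:0")    # created and spent before any flush: vanishes
cache.spend("old:0")    # existed in parent: needs a deletion on flush
cache.add("keep:0", 7)  # still unspent at flush time: needs a write
writes = cache.flush()
```

Here the coin `new:0` costs zero parent writes, which is the saving sipa refers to: an entry created and spent between flushes never reaches disk.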