< phantomcircuit>
BlueMatt, you have code to walk the cache for stuff nothing depends on?
< BlueMatt>
nope
< BlueMatt>
its not hard, though
< sipa>
there is no "depends"
< sipa>
it's a cache
< BlueMatt>
sipa: you know what he meant
< sipa>
it's expected that gbt increases the cache size, as it will pull in all dependencies
< BlueMatt>
sipa: indeed, not sure what phantomcircuit is seeing is unexpected or not, but the graphs in the ml related to https://github.com/bitcoin/bitcoin/issues/6876 are not expected at all
< sipa>
well i have no idea what the gbt code is all doing
< phantomcircuit>
BlueMatt, what im seeing is 6.3GB of ram being used with dbcache=4096
< BlueMatt>
phantomcircuit: in dbcache? that sounds about right
< phantomcircuit>
and calling gettxoutsetinfo (which calls FlushStateToDisk) not changing that
< BlueMatt>
oh
< BlueMatt>
hmm
< sipa>
GBT copies a significant portion of the cache into its own view
< sipa>
so it can pretty much double the memory usage
< phantomcircuit>
sipa, yeah but that view is destroyed at the end of the CreateNewBlock call
< sipa>
does change your res
< sipa>
*doesn't
< phantomcircuit>
right because of memory allocation stuff
< sipa>
fragmentation etc
< phantomcircuit>
sipa, hmm boost::unordered_map uses buckets, so that makes expanding easier
< phantomcircuit>
im not sure that's it though since memory usage just jumped 700MB more
< phantomcircuit>
i cant imagine that's from fragmentation alone
< sipa>
well if there is 800 MB chaonstatr depended on by the mempool, then GBT will add another 800 MB
< sipa>
fragmentation just makes it not able to release it back
< phantomcircuit>
chainstate?
< sipa>
yes
< sipa>
typing in a driving car on a small keyboard
< phantomcircuit>
oh duh right they're separate
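(Context for the above exchange: CreateNewBlock layers a temporary cache on top of pcoinsTip, so every output the mempool spends gets copied upward for the duration of the call. A rough sketch of the shape, assumed from the miner code of this era rather than quoted from it:)

    // Why GBT can roughly double cache memory (assumed shape, not verbatim):
    {
        LOCK2(cs_main, mempool.cs);
        CCoinsViewCache view(pcoinsTip); // temporary cache on top of the dbcache
        // Evaluating each candidate mempool tx pulls the coins it spends
        // into 'view', e.g. via view.AccessCoins(txin.prevout.hash).
    } // 'view' is destroyed here, but the allocator may keep the pages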
< morcos>
the problem with GBT is it can create that much memory usage in each of the RPC threads
< morcos>
each thread allocates the memory in a separate arena, and even though the objects are destroyed at the end of the call, there tends to be enough fragmentation that the memory isn't entirely freed back to the OS
< morcos>
in addition, if your chainstate expands during an RPC call (such as due to GBT) enough to cause a rehash of the unordered map
< morcos>
then this also will be allocated in a new arena, and possibly all the old chainstate won't be cleaned up
< morcos>
phantomcircuit: sipa: ^ not sure if you followed the earlier conversation i had with wumpus and gmaxwell about this
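(A minimal standalone illustration of the arena effect morcos describes, assuming glibc malloc; nothing here is Bitcoin Core code. glibc gives contending threads their own arenas, freed memory in an arena often stays mapped, the MALLOC_ARENA_MAX environment variable caps the arena count, and malloc_trim() asks glibc to return free pages:)

    #include <malloc.h>   // glibc-specific: malloc_stats(), malloc_trim()
    #include <thread>
    #include <vector>

    int main()
    {
        std::thread t([] {
            // Allocations on a non-main thread typically land in a
            // separate per-thread arena.
            std::vector<char*> blocks;
            for (int i = 0; i < 1000; i++)
                blocks.push_back(new char[64 * 1024]);
            for (char* b : blocks)
                delete[] b;
            // Freed, but the arena's pages may stay mapped: RSS stays high.
        });
        t.join();
        malloc_stats();  // prints per-arena usage to stderr
        malloc_trim(0);  // ask glibc to give free heap pages back to the OS
        malloc_stats();
        return 0;
    }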
< cfields>
gmaxwell: for backlog, i added in Tor's RESOLVE extension to SOCKS5 so that we can query all seeds up front without making actual node connections. Not sure if there's any real benefit, but it was trivial to add.
< wumpus>
cfields: nice
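(For reference, Tor's RESOLVE extension is SOCKS5 command 0xF0, per Tor's socks-extensions.txt; the reply carries the resolved address in the standard BND.ADDR field, ATYP 0x01 for IPv4 or 0x04 for IPv6. A sketch of the request layout only, no I/O shown; this is not cfields' actual patch:)

    #include <cstdint>
    #include <string>
    #include <vector>

    // Build a Tor SOCKS5 RESOLVE request for a hostname (<= 255 bytes).
    std::vector<uint8_t> BuildTorResolveRequest(const std::string& host)
    {
        std::vector<uint8_t> req;
        req.push_back(0x05);                              // SOCKS version 5
        req.push_back(0xF0);                              // Tor extension: RESOLVE
        req.push_back(0x00);                              // reserved
        req.push_back(0x03);                              // ATYP: domain name
        req.push_back(static_cast<uint8_t>(host.size()));
        req.insert(req.end(), host.begin(), host.end());
        req.push_back(0x00);                              // port, ignored
        req.push_back(0x00);
        return req;
    }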
< jgarzik>
jcorgan, btcdrak: 2015_sqlite branch now works, passes tests
< jgarzik>
not yet performance-tuned
< wumpus>
great
< * wumpus>
switches to sqlite
< jgarzik>
no apparent transaction size limit. "I am able to insert 10 million rows in a single transaction"
< jgarzik>
BEGIN...COMMIT maps easily to DBWrapper's batches
< wumpus>
that's good news
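(A minimal sketch of how a DBWrapper-style batch can map onto one BEGIN..COMMIT; the kv table and column names are hypothetical, not taken from the 2015_sqlite branch:)

    #include <sqlite3.h>
    #include <string>
    #include <utility>
    #include <vector>

    // Flush a whole key/value batch inside a single sqlite transaction.
    bool WriteBatchKV(sqlite3* db,
                      const std::vector<std::pair<std::string, std::string>>& batch)
    {
        if (sqlite3_exec(db, "BEGIN", nullptr, nullptr, nullptr) != SQLITE_OK)
            return false;
        sqlite3_stmt* stmt = nullptr;
        if (sqlite3_prepare_v2(db,
                "INSERT OR REPLACE INTO kv (key, value) VALUES (?, ?)",
                -1, &stmt, nullptr) != SQLITE_OK)
            return false;
        for (const auto& kv : batch) {
            sqlite3_bind_blob(stmt, 1, kv.first.data(),
                              static_cast<int>(kv.first.size()), SQLITE_STATIC);
            sqlite3_bind_blob(stmt, 2, kv.second.data(),
                              static_cast<int>(kv.second.size()), SQLITE_STATIC);
            if (sqlite3_step(stmt) != SQLITE_DONE) {
                sqlite3_finalize(stmt);
                sqlite3_exec(db, "ROLLBACK", nullptr, nullptr, nullptr);
                return false;
            }
            sqlite3_reset(stmt);
        }
        sqlite3_finalize(stmt);
        return sqlite3_exec(db, "COMMIT", nullptr, nullptr, nullptr) == SQLITE_OK;
    }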
< wumpus>
one question: will using the sqlite branch blow away my old leveldb database, or can they exist side by side?
< wumpus>
will just create a new datadir to be sure
< jgarzik>
should be able to exist side by side
< jgarzik>
sqlite database is a file named "db"
< jgarzik>
but don't trust me - be certain and create a new datadir :)
< jgarzik>
2) git pull the latest 2015_sqlite branch, it includes some key performance tweaks in dbwrapper.cpp (grep for PRAGMA)
< * jgarzik>
heads back to bed *poof*
< btcdrak>
jgarzik: I'll take a look
< GitHub65>
[bitcoin] giacecco opened pull request #6885: Instructions on how to make the Homebrew OpenSSL headers visible... (master...master) https://github.com/bitcoin/bitcoin/pull/6885
< wumpus>
2015-10-24 12:45:29 UpdateTip: new best=000000000000000bf325356179fb8876fe40e250c9e31082242f70f89ecbcd0b height=240923 log2_work=70.308629 tx=1
< wumpus>
it seems like the flushing takes quite a long time with sqlite (40 seconds in this case), haven't done a direct comparison with leveldb though
< wumpus>
but it blazes past N blocks, then hangs noticeably on the flush, longer than I remember
< wumpus>
at first glance it seems to do a lot of calls to fsync()
< wumpus>
(unscientifically tested by running in gdb and breaking+backtracing a few times)
< wumpus>
yes it's fsyncing - removing the fsync calls (obviously a stupid idea in itself) brings flush time down to <10 seconds. I understand calling fsync, but is it doing so for every single statement?
< jcorgan>
wumpus: confirming same behavior here
< wumpus>
"PRAGMA synchronous=0" works too (make sure it's executed every time the database is opened) instead of commenting out fsyncs - though won't be very resistant to crashes https://www.sqlite.org/pragma.html#pragma_synchronous
< jcorgan>
heh, isn't the whole reason we're testing this is to improve crash resistance? :-)
< wumpus>
yeah...
< jcorgan>
once this reindexes i'll bet steady-state performance won't really be any different
< wumpus>
so sqlite's idea of crash resistance seems to be 'fsync after every file operation'. Or maybe it's after some buffer fills that can be increased, I don't know.
< wumpus>
jcorgan: I'll bet the same
< jcorgan>
hmm, reindexing is actually speeding up as it goes along
< jcorgan>
eyeballing it, it is about 150-200 blocks/sec, with peak disk writes of about 70-80 MB/sec
< wumpus>
jgarzik: the pragma tweaks in sql_db_init are not persistent - they need to be done every time the db opens, not only when it is created
< jgarzik>
wumpus, very odd - I would think page size is persistent
< wumpus>
jgarzik: probably the database is created with one page size, but e.g. cache size isn't remembered
< wumpus>
good news, PRAGMA journal_mode=WAL; seems to solve the excessive-fsync problem too
< wumpus>
"WAL works best with smaller transactions. WAL does not work well for very large transactions. For transactions larger than about 100 megabytes, traditional rollback journal modes will likely be faster" that's strange - we certainly have transactions that big
< jgarzik>
In C++, is there a one-line way to convert an int to a std::string ?
< jgarzik>
i.e. "foo" + numstr(22)
< wumpus>
we have itostr in utilstrencodings.h
< wumpus>
and i64tostr
< jgarzik>
ok, perfect
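(Usage, for reference — both helpers live in utilstrencodings.h:)

    #include "utilstrencodings.h" // itostr(), i64tostr()
    #include <string>

    std::string a = "foo" + itostr(22);             // "foo22"
    std::string b = "height=" + i64tostr(240923LL); // "height=240923"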
< jgarzik>
wumpus, pushed the db init fix to 2015_sqlite (where params like cache_size are configured every time, but 'create table' happens only once)
< jgarzik>
wumpus, WAL should be good for us
< jcorgan>
is that in your update?
< jgarzik>
jcorgan, WAL? no
< jgarzik>
jcorgan, it's pretty obvious where to add configuration lines at the top of dbwrapper.cpp, so it's straightforward
< jgarzik>
Another todo item is making the cache size configurable
< jgarzik>
(via runtime GetArg, I mean)
< jgarzik>
70-80 MB/sec is pretty darned good
< jcorgan>
btrfs on hardware raid 10 :-)
< jgarzik>
I would be interested in the wall clock reindex time
< jgarzik>
of master vs sqlite
< jcorgan>
i'm restarting with WAL, i'll time it
< jcorgan>
btrfs snapshots are the bomb
< wumpus>
with WAL, PRAGMA wal_autocheckpoint has a lot of influence on the number of fsyncs
< * jgarzik>
doesn't see a need for fsync inside a batch
< jgarzik>
as long as post-crash we see a consistent picture, we can lose the WAL-in-progress
< wumpus>
well it *looks* to me that it's this: during a batch it writes to the journal, and it fsyncs the journal. I agree though.
< jgarzik>
changes are batched to std::vector<> internally, and then flooded to the db in a rapid BEGIN..INSERT*..COMMIT sequence.
< wumpus>
no need to checkpoint *ever* during a batch
< jgarzik>
correct
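(The C-level equivalent of PRAGMA wal_autocheckpoint, in case the branch wants to control this from dbwrapper.cpp rather than via a pragma string — a real sqlite API, usage here illustrative:)

    // 0 disables automatic checkpoints entirely, leaving when-to-checkpoint
    // to the application; N > 0 checkpoints once the WAL exceeds N frames.
    sqlite3_wal_autocheckpoint(db, 0);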
< sipa>
cache size is controlled by -dbcache already
< sipa>
part of it is assigned to pcoinsTip cache, part to the database layer itself
< sipa>
using a totally arbitrary formula
< jgarzik>
sipa, the specific need is for sqlite to move the cache size from a static constant embedded in the PRAGMA string to a GetArg-configured value
< sipa>
i'm just saying, no need for a new GetArg, there is already logic for this in init.cpp
< jgarzik>
yep
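(For reference, roughly how init.cpp carves up -dbcache at this time — a from-memory sketch; the exact constants, caps, and variable names may differ from the actual source:)

    size_t nTotalCache = GetArg("-dbcache", 100) << 20; // MiB -> bytes
    size_t nBlockTreeDBCache = nTotalCache / 8;         // block index leveldb cache
    nTotalCache -= nBlockTreeDBCache;
    size_t nCoinDBCache = nTotalCache / 2;              // chainstate leveldb cache
    nTotalCache -= nCoinDBCache;
    nCoinCacheUsage = nTotalCache;                      // in-memory pcoinsTip budget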
< sipa>
note that the chainstate only really has a durability requirement when pruning
< sipa>
otherwise, any old but consistent state is acceptable
< sipa>
though blockindex flushes are hard requirements
< jgarzik>
I would like to separate things out into multiple tables
< jgarzik>
(as sipa mentioned days ago)
< sipa>
i'm surprised you didn't need to yet
< sipa>
can transactions span multiple tables?
< jgarzik>
yes
< sipa>
in that case, you probably want to split it
< jgarzik>
wumpus, One concern with the current implementation is the 'ORDER BY' - a sort - in the CDBIterator class. Once fully sync'd to current bitcoin block height, failing to store in an always-sorted container may create lumpy bitcoind behavior whenever CDBIterator is used... maybe.
< jgarzik>
Needs testing to disprove hypothesis.
< jgarzik>
possibly either the sort is fast enough or smart enough that this is not noticed
< wumpus>
yes, depends on the kind of index
< wumpus>
I suppose the default is a sorted index
< wumpus>
something like a hash index on the key would indeed break the assumption that CDBIterator always returns the results in order
< jgarzik>
well not break - slow down
< jgarzik>
'ORDER BY' provides the ordering guarantee in case the underlying db does not, in this implementation
< wumpus>
yeah... but an in-database sort of a whole table really isn't pretty
< jgarzik>
nod
< wumpus>
I guess it's best to just use a sorted index for now, unless it proves to be a bottleneck
< jgarzik>
it's a partial sort, starting at the base key
< jgarzik>
not a whole-table sort
< wumpus>
don't we only use iterators over the whole table?
< jgarzik>
not needed
< jgarzik>
code has one pattern: seek(key) then next() next() next()
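(The pattern jgarzik means, sketched with the CDBIterator API — shape borrowed from the coins iteration in txdb.cpp of this era, details may differ:)

    std::unique_ptr<CDBIterator> pcursor(db.NewIterator());
    pcursor->Seek(std::make_pair('c', uint256())); // start of the coins prefix
    while (pcursor->Valid()) {
        std::pair<char, uint256> key;
        if (!pcursor->GetKey(key) || key.first != 'c')
            break;                                 // walked past the prefix
        CCoins coins;
        if (pcursor->GetValue(coins)) {
            // ... consume one record ...
        }
        pcursor->Next();
    }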
< wumpus>
then again - without an ordered index, the >= criteria on the key will also cause a full scan
< jgarzik>
yep that is actually an option - assuming results can be unordered - drop 'ORDER BY' and simply scan
< * jgarzik>
wonders how to tune index types
< wumpus>
yeah, that's the same as the database does internally. I suggest just sticking with a sorted index, unless it turns out to really be a problem; this is unwanted ugliness
< jcorgan>
wumpus, what did you end up with for the list of pragmas
< jcorgan>
WAL isn't making any difference for me
< wumpus>
good question though: do any of the CDBIterator clients require the records to be in a defined order?
< jgarzik>
it works if you can do ">= start_key" and "< next_prefix" and know what next_prefix is
< wumpus>
my point is that the sort would not be required if the records were binned in a different way, for example in different tables, instead of using a prefix to distinguish them
< wumpus>
they don't rely on the fact that keys are sorted
< jgarzik>
wumpus, agreed there - hence <jgarzik> I would like to separate things out into multiple tables
< wumpus>
in the case of the UTXO database this doesn't matter that much because it *almost* only contains COINS entries
< wumpus>
for the blockdb on the other hand, it can contain these txindex entries...
< jgarzik>
wumpus, what is your total diff versus 2015_sqlite, WRT pragmas? can you paste that?
< wumpus>
(and a lot of them; the number of block entries is negligible in comparison)
< jgarzik>
I want to put that in 2015_sqlite
< wumpus>
+ "PRAGMA wal_autocheckpoint=0",
< wumpus>
+ "PRAGMA journal_mode=WAL",
< jcorgan>
isn't that supposed to be schema.journal_mode ?
< jgarzik>
global is fine
< wumpus>
I prefer setting the global option
< jcorgan>
ok
< wumpus>
(it's possible to set it per schema)
< wumpus>
(but do we even define those?)
< jgarzik>
no need
< wumpus>
using schema.journal_mode *literally* fails, that was my first try too :)
< jgarzik>
wumpus, well, of course it will fail, first time the table does not exist
< jgarzik>
stmts can be reorders but ... setting the global is best
< wumpus>
but setting the global one works, so yeah...
< jgarzik>
reordered
< jgarzik>
OK, pushed out WAL & improved errors to 2015_sqlite
< jgarzik>
going to task switch, poke me if there are other updates for the branch
< jcorgan>
i'll time the total reindex with the new stuff
< jcorgan>
huge difference with the pragmas
< jgarzik>
Thanks. If someone is so motivated, timing with different page sizes (1024, 4096, 8192) can be useful. Page size goes all the way up to 64k.
< wumpus>
wouldn't the optimal page size depend on the hardware as well? would be good to do some benchmarks, for example do a full reindex with various sets of parameters, but I don't have a system I can use for controlled tests without background noise
< wumpus>
-stopafterblockimport is a great option for timed, batched reindexes, though
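(An illustrative invocation — the flags are real, the numbers arbitrary:)

    time ./src/bitcoind -reindex -stopafterblockimport -dbcache=4096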
< jcorgan>
i suspect it would be very filesystem and storage configuration dependent
< wumpus>
btw it could be that we need to call sqlite3_wal_checkpoint_v2 at some point (eg, when flushing) with autocheckpointing disabled
< wumpus>
"Writers sync the WAL on every transaction commit if PRAGMA synchronous is set to FULL but omit this sync if PRAGMA synchronous is set to NORMAL." or maybe not. Pragma synchronous is FULL by default. I'm not sure how the checkpoints come into it.
< wumpus>
syncing on transaction commit sounds sane
< jcorgan>
yes
< wumpus>
after running a while, bitcoind shutdown is taking a long time in sqlite3_close - possibly due to the lack of checkpoints
< jgarzik>
RE optimal page size - ideally it is an "I/O unit" which is filesystem block size - but large dbs might gain from clustering. Also, modern filesystems are kernel-page-cache based, and so an I/O unit in practice is often 4096 or 8192.
< wumpus>
jgarzik: yes - probably setting the page size to anything less than 4096 is not going to be useful
< wumpus>
(at least not on linux)
< wumpus>
adding "rrc = sqlite3_wal_checkpoint_v2(psql, NULL, SQLITE_CHECKPOINT_PASSIVE, NULL, NULL);" at the end of WriteBatch indeed stops the wal file from growing... but, makes the flush slow (~30 seconds) again :/
< * jgarzik>
reverted 2015_sqlite back to default autocheckpointing behavior... plenty of room for further research & experiments
< jgarzik>
I think there is a background WAL checkpoint mode
< wumpus>
so I'm not sure when to do these checkpoints
< wumpus>
the default autocheckpointing does a checkpoint every 1000 statements or so, that is probably even worse
< jgarzik>
nod - it's a stable base for further research, not an endpoint
< jgarzik>
the main goal is to not-wait for checkpoint, not not-checkpoint
< wumpus>
yes but we want checkpoints only to happen on commit, not in between
< wumpus>
I think the default does so, not sure though... didn't expect manual calls to sqlite3_wal_checkpoint_v2 to be so slow even when done once per transaction
< wumpus>
doing it in the background would be nice
< wumpus>
but then "Checkpointing needs to lock the entire database, so all other readers and writes would have to be blocked. (A passive checkpoint just aborts.)"
< wumpus>
so I'm not sure how much doing the checkpointing in a thread would help in practice. If you do PASSIVE checkpoints they'll never happen at all, and the other modes do waiting in one way or another
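(A sketch of the background-checkpoint idea under discussion, using PASSIVE mode, which copies what it can without ever blocking readers or writers, and may therefore make no progress while the db is busy. Hypothetical code, not from the 2015_sqlite branch; sqlite's default serialized threading mode is what makes sharing the handle across threads safe:)

    #include <sqlite3.h>
    #include <atomic>
    #include <chrono>
    #include <thread>

    void CheckpointThread(sqlite3* db, std::atomic<bool>& stop)
    {
        while (!stop) {
            std::this_thread::sleep_for(std::chrono::seconds(30));
            int logFrames = 0, ckptFrames = 0;
            sqlite3_wal_checkpoint_v2(db, nullptr, SQLITE_CHECKPOINT_PASSIVE,
                                      &logFrames, &ckptFrames);
            // logFrames = WAL size in frames, ckptFrames = frames moved into
            // the db; if ckptFrames lags far behind, a blocking FULL pass at
            // a quiet moment (e.g. FlushStateToDisk) may still be needed.
        }
    }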
< jcorgan>
whouda thunk that getting database operations right would be as hard as getting crypto right? :-)
< wumpus>
well I'm not sure it's as hard as getting crypto right :) but yes it's not trivial
< wumpus>
but seeing all of this it seems leveldb isn't so bad at all
< jgarzik>
switching major database system types, one expects some analysis and bumps in understanding the new system
< sipa>
the only way the wal is cleared is with a checkpoint?
< wumpus>
yes, and this is only an experiment
< wumpus>
sipa: yes, 'checkpoint' seems to be the operation 'incorporate the WAL into the database'
< sipa>
and that operation needs a full exclusive lock on the database?
< wumpus>
yep
< sipa>
that's unusable
< jgarzik>
The underlying system calls are range locks, so I wonder
< jgarzik>
non-WAL updates the db with range locks, I'm pretty sure
< sipa>
range locks are useless on a db ordered by random indexes
< wumpus>
yes there's always the normal journal, though non-WAL seems to do an fsync per statement; its performance was really bad here (and all the fsyncs were making the rest of the system slow, too)
< jgarzik>
not necessarily - modern schemes write index/data updates to new pages, allowing older pages to be read in parallel (presenting older, committed view of the data)
< jgarzik>
so you can be writing new while reading old
< jgarzik>
update-in-place is avoided these days, as it does not produce crash-consistent behavior
< sipa>
yes, sure
< sipa>
but you want the background-written log to be incorporated into the real db without locking the whole thing down
< sipa>
leveldb does that
< jgarzik>
nod - and that's perfectly doable here with range locks - you update the new index, then a quick "flick of a switch" jumps from old consistent state to new consistent state, the latter of which was written in parallel in the background
< sipa>
every set of wal will touch database entries all over the place
< sipa>
every non-trivial set of log entries will need to lock the whole db
< jgarzik>
same is true for every batch, every database system
< jgarzik>
again not true - read the above - you don't need to lock the whole db in theory - I can't speak for sqlite but in database circles the solution is well known here.
< sipa>
oh of course
< sipa>
i'm just observing sqlite's behaviour
< sipa>
leveldb solves it by having different levels :)
< sipa>
and no guarantee in which level particular data is to be found
< sipa>
so changes can "ripple up"
< wumpus>
just read https://www.sqlite.org/c3ref/wal_checkpoint_v2.html - from what I understand from it, checkpointing requires an exclusive lock and needs to wait for all readers and writers to finish. If there is a way to work around that it'd be nice, but theory isn't of much help here :)
< jgarzik>
wumpus, that's tuned by wal_checkpoint=[passive/full/restart/truncate] a bit
< wumpus>
that's mentioned in the link I gave, yes. It looks like that determines who gets to wait (or cancel) for whom, not whether the operation requires an exclusive lock
< sipa>
does issuing a passive checkpoint that gets interrupted make progress?
< wumpus>
it can
< wumpus>
doing e.g. SQLITE_CHECKPOINT_FULL or _TRUNCATE from a thread would make sense I suppose - it blocks all writers and some readers: apparently only those readers not reading from the most recent database snapshot
< wumpus>
at least it waits for all readers of a non-recent database snapshot to complete
< wumpus>
(which makes sense as the old version is effectively discarded)
< wumpus>
blocking writers is bad during initial sync, but not so much during steady-state where not many updates happen
< jcorgan>
fyi, this reindex is with txindex=1
< sipa>
that likely hurts performance
< sipa>
txindex writes are not batched
< sipa>
all other database writes are
< jcorgan>
yeah, it's at height 180K and has slowed down dramatically
< jgarzik>
yeah txindex=1 will slow things down & is not representative
< sipa>
i consider txindex something you need for debugging/diagnostics, not for production use
< wumpus>
agreed
< jcorgan>
let me redo it then
< sipa>
that said, it shouldn't gratuitously kill performance if there is an easy way around it
< sipa>
i also doubt that anything you notice at 180k is due to that
< jcorgan>
i don't know that it started at 180k, just that was where it was when i checked in on it
< wumpus>
don't you mean 280k?
< wumpus>
(that's where it is here, and I started about the same time)
< jcorgan>
no, 180k
< wumpus>
ok
< wumpus>
unbatched? hmm, doing a sqlite transaction per bitcoin transaction is certainly going to kill sqlite performance
< jcorgan>
restarted fresh with txindex=0, no difference in speed at the start, but we'll see how it slows down
< wumpus>
there's almost no transactions at the start :)
< jcorgan>
right :)
< jcorgan>
it's about 200 blocks/sec
< CodeShark>
I don't even count the first 100k blocks :p
< CodeShark>
so are we benchmarking the SQLite stuff?
< jcorgan>
"benchmarking" would be generous
< jcorgan>
more like, wow, it actually works
< wumpus>
CodeShark: if you want: jgarzik repository, 2015_sqlite branch
< CodeShark>
jcorgan: why wouldn't it work? :)
< wumpus>
it does actually work quite well :) although no one has tested yet if it is more resilient to crashes/poweroffs than leveldb, esp. on windows
< jcorgan>
um, with txindex=0, it just hit 120k in the last 10 minutes
< CodeShark>
sqlite is a pretty solid piece of software (albeit I've never tried using it for something that really taxes it)
< sipa>
jcorgan: that sounds very slow
< wumpus>
this is with autocheckpointing reenabled?
< jcorgan>
i don't think so
< wumpus>
well it won't get faster than that
< jcorgan>
wal, autocheckpoint=0
< jcorgan>
it just hit 180k
< wumpus>
doesn't sound too bad to me
< jcorgan>
still at about 160 block/sec
< jgarzik>
Ideally you want a database system that ensures consistency, permits uninterrupted reads and writes, and optimizes in the background (i.e. "optimize" means move journalled data to final db position, updates indices, etc.)
< jgarzik>
app shouldn't have to checkpoint
< sipa>
yes
< jcorgan>
wumpus: should i have made a further change in the pragmas?
< jcorgan>
also, what was that stop after import option, and can it be set in the .conf file?
< jgarzik>
reads locklessly see consistent data always, and writes do not stomp on that
< wumpus>
jgarzik: they do in sqlite, the checkpoint is just administrative, it doesn't affect what readers read
< wumpus>
at the time checkpoint is called, new readers will already be reading the new state; by the time it can start, old readers will be finished reading the old state, by definition
< wumpus>
jcorgan: well with autocheckpoint=0 it effectively appends all changes to the db-wal file instead of incorporating them
< wumpus>
this is a matter of where the data is stored, it doesn't affect how clients perceive the database
< sipa>
but reading data from the wal file instead of the actual db must be slower?
< jgarzik>
there is impact if some operations are blocked by other operations
< wumpus>
could be
< wumpus>
there can certainly be performance impact, just not functionality/consistency impact, that's all I'm saying...
< jgarzik>
nod
< jcorgan>
ok, restarted with txindex=0, wal, and removed the autocheckpoint=0
< wumpus>
db-wal for chainstate should stay at about 440M then
< wumpus>
(1-transaction-sized)
< wumpus>
hmm too big for one transaction.. no I don't know :)
< wumpus>
at least it stays relatively constant around 440mb and doesn't grow to GBs anymore; to really make it truncate one would probably need SQLITE_CHECKPOINT_TRUNCATE
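(To actually shrink the -wal file rather than just mark its frames reusable — TRUNCATE blocks like FULL/RESTART, so it belongs at a quiet point such as a flush or shutdown, and needs a sqlite version new enough to have it:)

    int rc = sqlite3_wal_checkpoint_v2(psql, nullptr,
                                       SQLITE_CHECKPOINT_TRUNCATE,
                                       nullptr, nullptr);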
< jgarzik>
jcorgan, current 2015_sqlite uses autocheckpoint=10000
< jgarzik>
not that I would suggest re-starting yet again :)
< jcorgan>
i have it scripted :)
< jcorgan>
hmm
< jcorgan>
-stopafterblockimport shut it down at 183842
< btcdrak>
looks like i am missing the sqlite party
< CodeShark>
it gets really slow once it passes 230k
< jcorgan>
something strange is happening
< jcorgan>
the node stops reindexing at 183k, then starts up and asks connections for blocks at this point
< jcorgan>
ok, that's weird
< jcorgan>
somehow i had restored a snapshot of bitcoin dir that only had 183k blocks in it
< jcorgan>
i cloned a new snapshot of the original and it is working fine with that