< gmaxwell>
sipa: ah that might have been why I didn't bother trying to make it interpolate-- the seeks are in an index.
< jamesob>
sipa gmaxwell: mem usage increase is for sure the leveldb change (https://i.imgur.com/yMHJI37.png) and sipa's change didn't change the elevated usage
< sipa>
jamesob: so it's still elevated with my patch?
< sipa>
or it's not elevated anymore?
< jamesob>
still elevated :(
< gmaxwell>
might want to make a microbenchmark to figure out if it's mmap behavior or something else in leveldb.
< gmaxwell>
also because it'll be easier to experiment (e.g. see if MAP_NORESERVE does anything) when we don't have to wait 5 hours.
< gmaxwell>
just a loop to mmap all the files in a directory then sleep 30 seconds should test the earlier hypothesis.
< jamesob>
gmaxwell: yup, agree
< jamesob>
will write if I have time tomorrow
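A minimal sketch of the microbenchmark gmaxwell describes (a loop that mmaps every file in a directory, then sleeps so RES/VIRT can be inspected with top or pmap). This is POSIX-only illustrative code, not anything from Bitcoin Core or leveldb; the directory path is a placeholder argument.

    // mmap_bench.cc: map every regular file in a directory read-only, then sleep
    // so resident/virtual memory can be inspected with top, ps, or pmap.
    #include <dirent.h>
    #include <fcntl.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>
    #include <cstdio>
    #include <string>
    #include <utility>
    #include <vector>

    int main(int argc, char** argv) {
        if (argc < 2) { std::fprintf(stderr, "usage: %s <dir>\n", argv[0]); return 1; }
        std::string dir = argv[1];
        std::vector<std::pair<void*, size_t>> maps;
        DIR* d = opendir(dir.c_str());
        if (!d) { perror("opendir"); return 1; }
        while (dirent* ent = readdir(d)) {
            std::string path = dir + "/" + ent->d_name;
            struct stat st;
            if (stat(path.c_str(), &st) != 0 || !S_ISREG(st.st_mode) || st.st_size == 0) continue;
            int fd = open(path.c_str(), O_RDONLY);
            if (fd < 0) continue;
            void* p = mmap(nullptr, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
            close(fd);  // the mapping survives the close, as with leveldb's mmapped tables
            if (p != MAP_FAILED) maps.emplace_back(p, st.st_size);
        }
        closedir(d);
        std::printf("mapped %zu files; sleeping 30s, inspect RES with top/pmap\n", maps.size());
        sleep(30);
        for (auto& m : maps) munmap(m.first, m.second);
        return 0;
    }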
< ossifrage>
It seems like MADV_RANDOM is useful if there is less locality to the reads than the size of the prefetch window, but I was under the impression that the prefetch is triggered by the page fault and not by the mmap(), so you actually have to do reads to get the extra pages
< sipa>
luke-jr: mmap'ed files aren't kept open after creating the mapping
< sipa>
so they don't count towards the 1000 files limit iirc
< luke-jr>
sipa: but we increased the file limit to 1000 *because* mmaps didn't count toward the fd limit
< luke-jr>
the limit I'm talking about is max_open_files, not ulimit
< sipa>
luke-jr: yes, but there is a separate mmap limit it seems
< luke-jr>
sipa: max_open_files effectively limits the number of mmaps too
< luke-jr>
so it seems like the LevelDB mmap limit change is effectively a no-op since we keep the 1000 max_open_files limit :x
< gmaxwell>
luke-jr: it's very much not a no op change.
< luke-jr>
gmaxwell: how does it have an effect?
< gmaxwell>
it causes more mmaps to be open, and fewer files. You can see this easily with lsof.
< luke-jr>
the mmap limit is enforced by NewRandomAccessFile, which is only ever called by TableCache which enforces the max_open_files limit :/
< gmaxwell>
Well I'm not telling you what the source code says, I'm telling you what it does. :P
< luke-jr>
is there a bug in the LRU cache stuff? :/
< luke-jr>
gmaxwell: is there an easy way to reproduce what you observe?
< gmaxwell>
yeah, start a node up with no connections, with and without the change, lsof it, and look at the number of mmapped files. You might need to crank -checkblocks up a bit.
< wumpus>
luke-jr: the only reason that the change to num files was acceptable is because leveldb, at least the version in our tree, doesn't actually keep more files open (on platforms with mmap) see also https://github.com/bitcoin/bitcoin/pull/12495#issuecomment-377228329
< luke-jr>
gmaxwell: cannot reproduce with -checkblocks=10000 :/
< luke-jr>
wumpus: I understand that. I just don't understand why we'd increase the mmap-specific limit without also increasing the num-files limit that any mmap requires
< luke-jr>
we will never have more than 1001 mmaps because the files limit is 1000
< wumpus>
--I don't think mappings into closed files count against the open files ulimit
< sipa>
wumpus: luke's point is that internally in leveldb, the mmapped files count towards the open file limit
< wumpus>
but we increased leveldb's limit
< wumpus>
though to be honest I'm completely confused now with all the different limits, how they interact, and how they're supposed to interact
< sipa>
wumpus: yeah...
< wumpus>
given all the confusion about this, I'm starting to think it might be better to revert these changes, I like understanding things even if it means not getting every last possible % of performance
< sipa>
so there is a configurable leveldb max open files limit (which is 1000), and a mmap limit (which is hardcoded to 1000, but we raised it to 4096)
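For context, a hedged sketch of where the two knobs live: the open-files limit is an ordinary leveldb::Options field set by the embedding application, while the mmap cap is a compile-time constant inside leveldb's util/env_posix.cc (its exact name depends on the leveldb version). This is illustrative only, not Bitcoin Core's actual dbwrapper code.

    // Sketch only: how an embedding application (e.g. Bitcoin Core's dbwrapper)
    // sets the configurable limit. The mmap cap is not exposed through Options;
    // it is a constant inside leveldb's util/env_posix.cc, which the change under
    // discussion raised from 1000 to 4096.
    #include <string>
    #include <leveldb/db.h>
    #include <leveldb/options.h>

    leveldb::DB* OpenDb(const std::string& path) {
        leveldb::Options options;
        options.create_if_missing = true;
        // Table-cache limit; per the discussion, mmapped tables are claimed to
        // count toward this same limit.
        options.max_open_files = 1000;
        leveldb::DB* db = nullptr;
        leveldb::Status status = leveldb::DB::Open(options, path, &db);
        return status.ok() ? db : nullptr;
    }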
< wumpus>
and seemingly it now relies on all kinds of undocumented behavior
< sipa>
luke says (i haven't verified) that mmapped files count towards that limit... which is inconsistent with our observations
< sipa>
i'm just clarifying my understanding as there seems to be some confusion
< wumpus>
thanks
< gmaxwell>
right if we revert though we should revert both changes.
< wumpus>
yes
< gmaxwell>
I wouldn't oppose doing that for 0.17.
< sipa>
both? what other change?
< sipa>
i saw the mmap increase
< gmaxwell>
sipa: the fd count change and the mmap limit change.
< gmaxwell>
first the FD count was increased with the argument that it basically wouldn't use more FDs, due to using mmap. But then we found out that when it hit the mmap limit, it actually started using more FDs.
< gmaxwell>
Since the lack of mmaps is probably also bad for performance (though I don't think anyone benchmarked specifically for that), it seemed like it made sense to increase mmaps rather than revert the fd count change.
< wumpus>
yes
< wumpus>
"By default LevelDB will only mmap() up to 1000 ldb files for reading and then fall back
< wumpus>
to using file descriptors."
< wumpus>
that pull 19 is *supposed* to change that
< wumpus>
(from 1000 to 4096, which is still an arbitrary value) -- maybe luke-jr is using a system leveldb?
< wumpus>
I don't know anymore, sorry, should probably get coffee
< gmaxwell>
wumpus: 12495, as you say.
< sipa>
ah yes
< sipa>
hmm
< sipa>
we were using 64 files only before
< sipa>
but using 1000 there seems pretty standard practice; it's leveldb's own default
< luke-jr>
wumpus: not for this testing
< wumpus>
sipa: correct
< gmaxwell>
sipa: after increasing it, we were using a lot more file descriptors.
< luke-jr>
wumpus: but in any case, for testing this, I'm changing the 4096 back to 1000, and can't get LDB to use a fd
< gmaxwell>
luke-jr: are you using some system leveldb?
< luke-jr>
gmaxwell: like I said, not for this; but in any case, it would be the same 1000 mmaps
< gmaxwell>
luke-jr: dunno why it didn't reproduce for you, I saw leveldb using FDs, after using 1000 maps... and not anymore after increasing the map limit.
< luke-jr>
hmm
< gmaxwell>
sipa also saw it using fd's on his node.
< luke-jr>
testnet or mainnet?
< gmaxwell>
mainnet
< sipa>
mainnet
< luke-jr>
I wonder if that's why. :x
< gmaxwell>
testnet probably doesn't have enough ldb files.
< luke-jr>
ah
< sipa>
how large is testnet's utxo set?
< wumpus>
"disk_size": 1025042445, as of height 1384188 (it is at 1392776 now, don't know how much it changed since then)
< luke-jr>
hopefully I won't "damage" my mainnet .bitcoin trying this.. :x
< luke-jr>
(I still run 0.13 wallets from time to time)
< luke-jr>
looks like the UTXO db upgrade isn't triggering it, at least
< * luke-jr>
facepalms. After UTXO upgrade, it says I need to reindex >_<
< wumpus>
does it say why / give an error message? the whole point of the utxo upgrade would be to not reindex, unless it fails
< wumpus>
provoostenator: nit: you could pass NO_UPNP to the depends build instead of --with-miniupnpc=no
< wumpus>
not that libminiupnp takes long to compile
< provoostenator>
wumpus: thanks, in this case it's probably easier to keep the configure line flexible.
< provoostenator>
I first tried copying bitcoind over and just running it locally but that resulted in a world of pain, probably because the two systems aren't perfectly identical.
< wumpus>
yes copying executables between linux systems is fraught with problems
< provoostenator>
It works fine with c-lightning though.
< wumpus>
though it should work with depends and the glibc compat enabled
< provoostenator>
I tried with depends as well, haven't tried glibc compat, might try that next time. But this also prevents me from accidentally syncing the chain over 4G.
< wumpus>
-static-libstdc++ helps as well
< wumpus>
(but only if all c++ dependencies are static, which is the case with a depends build but not otherwise)
< wumpus>
then you end up with an executable that is as-good-as-static and it should be portable to non-ancient linux versions with the same CPU architecture
< wumpus>
of course, that's the theory, there are still some things that can interfere for example if glibc is built with different options (I vaguely remember redhat and some security/hardening flag), or the target platform has a different libc
< wumpus>
this is one reason they came up with "snap"s which are executables that essentially carry around part of the system dependencies with them in a container
< wumpus>
(along with some security/isolation features)
< provoostenator>
Being able to run locally is especially useful for Qt, and with the multiprocess stuff it would be nice to run bitcoind remotely and the Qt GUI locally.
< wumpus>
anyhow, if your two systems have the same release of the same distro, and are at the same update level, it should work too; that's probably easiest for most...
< luke-jr>
wumpus: I reflink'd the blocks/ and chainstate/ dirs to a new datadir. I'm guessing that simply wasn't sufficient somehow
< luke-jr>
oh well, I'll let it sync overnight and see if it hits the breakpoint
< luke-jr>
env_posix.cc:379 if anyone else who can reproduce it wants to try
< luke-jr>
(if you hit it, maybe see if you can work out why it got there..)
< luke-jr>
wumpus: heh, I was tired last night; the reason it needed a reindex (or I guess re-IBD) seems to be that I named the blocks directory "newblocks" XD
< luke-jr>
err, "tmpblocks"
< Chris_Stewart_5>
BIP157 doesn't currently have an open PR does it?
< jimpo>
Oh, thanks, will open a PR to fix that. Need to add a test block anyway.
< Chris_Stewart_5>
Thanks!
< luke-jr>
gmaxwell: sipa: wumpus: sync'd up to 341411 and still no non-mmap opens from LDB..
< sipa>
luke-jr: ?
< sipa>
what does that mean
< gmaxwell>
341411 is a long time ago.
< gmaxwell>
sipa: he's trying to reproduce the behavior ossifrage, you, and I saw where leveldb would hit the mmap limit and start using more FDs.
< sipa>
you'll only exceed 1000 files/mmaps once you're above ~2GB UTXO set size
< sipa>
as each file is 2MB ish
< luke-jr>
hmm
< ossifrage>
pmap 24199 | grep \.ldb | wc -l == 1894 (for a node up ~26 days)
< sipa>
luke-jr: so you'll need to get to somewhere in 2017 at least
< * luke-jr>
retries the UTXO upgrade, this time with the blocks dir named correctly
< ossifrage>
That includes the txindex and the chainstate
< sipa>
ossifrage: per database; txindex and chainstate are separate
< gmaxwell>
has anyone just tried making a dummy program that mmaps a couple thousand files to see what the RES shows up as?
< ossifrage>
is having a larger resident size really an issue? you only get those pages if memory is available; the OS should evict them right quick under memory pressure
< sipa>
presumably, indeed
< sipa>
it may be just a methodology problem; what RSS measures doesn't exactly correspond to what we care about
< gmaxwell>
ossifrage: if that's actually the case, then things are fine.
< gmaxwell>
and if so we probably need to figure out how to start tracking "total dirty pages" or something.
< ossifrage>
leveldb only uses mmap for reading, so the dirty pages from all the maps should be 0
< gmaxwell>
Right.
< luke-jr>
personally, I only look at ps's "size" metric
< luke-jr>
size SIZE approximate amount of swap space that would be required if the process were to dirty all
< luke-jr>
writable pages and then be swapped out. This number is very rough!
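A rough sketch of what tracking "total dirty pages" could look like: sum the Shared_Dirty and Private_Dirty counters from /proc/<pid>/smaps, which is the per-mapping data that pmap -x summarizes. The helper name below is made up; the field names are as documented in proc(5), and newer kernels also offer /proc/<pid>/smaps_rollup.

    // dirty_kb.cc: sum the dirty-page counters for a process from /proc/<pid>/smaps.
    #include <fstream>
    #include <iostream>
    #include <sstream>
    #include <string>

    long TotalDirtyKb(const std::string& pid) {
        std::ifstream smaps("/proc/" + pid + "/smaps");
        std::string line;
        long total_kb = 0;
        while (std::getline(smaps, line)) {
            if (line.rfind("Private_Dirty:", 0) == 0 || line.rfind("Shared_Dirty:", 0) == 0) {
                std::istringstream iss(line);
                std::string key, unit;
                long kb = 0;
                iss >> key >> kb >> unit;  // e.g. "Private_Dirty:  12 kB"
                total_kb += kb;
            }
        }
        return total_kb;
    }

    int main(int argc, char** argv) {
        std::cout << TotalDirtyKb(argc > 1 ? argv[1] : "self") << " kB dirty\n";
        return 0;
    }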
< gmaxwell>
ossifrage: so to catch you up. We have infrastructure that benchmarks the software regularly and charts performance... so we find out if some commit accidentally hurts performance.
< gmaxwell>
ossifrage: we noticed around the time of the 0.17 feature freeze that the claimed memory usage suddenly went up 20%!
< gmaxwell>
during the week where a lot of last minute stuff got merged.
< gmaxwell>
Bisection turned up your change to increase the leveldb maps count.
< ossifrage>
gmaxwell, where does the claimed memory measurement come from?
< gmaxwell>
That's what we're trying to determine. After finding out that mmap automatically pages in some of the file, we tried madvising DONTNEED on all the maps, but that didn't seem to fix it.
< gmaxwell>
It might well be that the increased usage is totally fictional... that it's just counting paged-in non-dirty pages.... and they're not real memory pressure and we don't need to worry about them.
< ossifrage>
I'm writing a mmap() test case out of morbid curiosity
< gmaxwell>
Or maybe, though we haven't spotted it yet, leveldb is using 200kb of god knows what per open map. :P
< gmaxwell>
Meanwhile, luke-jr went and read more of the leveldb code and believes that it can't have more mmaps open than the FD limit, and so he thinks the mmap increase should be a noop.
< gmaxwell>
(which it clearly isn't, but it would be good if everyone was on the same page about how leveldb works)
< ossifrage>
I currently have 992 chainstate ldb files mapped and 859 txindex and 43 blocks/indexes ldb files, but it took a fairly long uptime for the original problem to show up
< ossifrage>
I don't see much of a change between the htop reported RES size between mapping 1 txindex ldb file and mmaping all 2041 of them
< luke-jr>
gmaxwell: well, I can reproduce it now, but I haven't figured out why yet
< ossifrage>
(but I never read from the map)
< gmaxwell>
ossifrage: k? try reading like, the last byte of them or something?
< ossifrage>
The VIRT line goes up as expected but RES stays fairly small
< luke-jr>
hmm, is everyone reproducing this using txindex?
< ossifrage>
gmaxwell, yeah that caused the RES size to approach the VIRT size (read the last byte)
< sipa>
gmaxwell: "it can't have more mmaps open than the FD limit" -> s/FD limit/max_open_files setting/
< gmaxwell>
now, try MADV_RANDOM ?
< gmaxwell>
ossifrage: ^
< ossifrage>
I read the last byte, so I'm not sure what it would prefetch :-)
< gmaxwell>
I'm guessing that it's prefetching the whole thing.
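A sketch of the experiment under discussion, assuming the hypothesis that readahead on the first fault is what inflates RES: map one file, optionally madvise(MADV_RANDOM), fault in a single byte, then use mincore() to count how many of the file's pages actually became resident. Argument handling is a placeholder; run after dropping the page cache or prior caching will hide the effect.

    // touch_bench.cc: measure how many pages one read faults in, with or
    // without MADV_RANDOM. Best run after `echo 3 > /proc/sys/vm/drop_caches`.
    #include <fcntl.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>
    #include <cstdio>
    #include <vector>

    int main(int argc, char** argv) {
        if (argc < 2) { std::fprintf(stderr, "usage: %s <file> [random]\n", argv[0]); return 1; }
        int fd = open(argv[1], O_RDONLY);
        struct stat st;
        if (fd < 0 || fstat(fd, &st) != 0 || st.st_size == 0) { perror("open/fstat"); return 1; }
        char* p = static_cast<char*>(mmap(nullptr, st.st_size, PROT_READ, MAP_SHARED, fd, 0));
        close(fd);
        if (p == MAP_FAILED) { perror("mmap"); return 1; }
        if (argc > 2) madvise(p, st.st_size, MADV_RANDOM);  // hint: disable readahead
        volatile char c = p[st.st_size - 1];                // fault in the last byte
        (void)c;
        long page = sysconf(_SC_PAGESIZE);
        size_t npages = (st.st_size + page - 1) / page;
        std::vector<unsigned char> vec(npages);
        mincore(p, st.st_size, vec.data());
        size_t resident = 0;
        for (unsigned char v : vec) resident += v & 1;
        std::printf("%zu of %zu pages resident after one read\n", resident, npages);
        munmap(p, st.st_size);
        return 0;
    }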
< gmaxwell>
andytoshi: HI!
< andytoshi>
gmaxwell: hiya, sorry, my server shit out this morning .. i have no idea why, the logs show everything clear, but it was apparently completely locked up even to a local terminal user
< andytoshi>
so i called my dad to hard-boot it and it seems ok now
< ossifrage>
MADV_RANDOM didn't seem to change the bump in RES caused by reading last byte
< gmaxwell>
ossifrage: sadness. ... so run pmap on the process, and see how many dirty pages it says you have, presumably like none.
< sipa>
gmaxwell: mmap is only used in leveldb for read-only files; none should ever be dirty
< gmaxwell>
Interesting, there are only 41.3 BTC stored in bare multisig outputs, but there are 410k of them in total.
< gmaxwell>
sipa: I know but I also would have expected that increasing the mmap limit wouldn't have increased RES. :)
< ossifrage>
One sec, I might have screwed up...
< gmaxwell>
I'm going to suggest we start charting the total dirty pages from pmap... and not worry about RES...
< luke-jr>
it looks like the mmap limit is global
< ossifrage>
Okay, it looks like MADV_RANDOM isn't helping, but just reading the last byte does cause the RES size to jump quite a bit (but I'm not sure how much I expect it to go up)
< ossifrage>
at least 1 page per file?
< sipa>
yes, 1 page per file is expected
< gmaxwell>
I expect that 1 page per file is what you should get, which is a modest amount of memory.
< gmaxwell>
in bitcoin we're seeing an increase of hundreds of meg.
< gmaxwell>
way more than one page per file.
< ossifrage>
When reading the first byte, MADV_RANDOM does decrease the RES usage
< ossifrage>
Even with MADV_RANDOM, it would depend entirely on how many pages are faulted in by ldb
< ossifrage>
Err, it looks like for me it is prefetching ~64K
< gmaxwell>
64K with MADV_RANDOM?
< ossifrage>
I think I might be suffering from previous state problems
< ossifrage>
Okay, dropping the cache and MADV_RANDOM does seem to only read 4K for most of the files mapped (some have more for some odd reason)
< gmaxwell>
oh interesting, so if you cause the files to be paged in, then start a program with MADV_RANDOM, it will reflect higher RES because you mmapped files that were already in cache?
< ossifrage>
Yeah, without MADV_RANDOM all the files have 64K RES when I read the 1st byte
< ossifrage>
echo 3 > /proc/sys/vm/drop_caches between runs
< ossifrage>
Yeah I was getting confusing results until I realized that
< gmaxwell>
Cool so also we might not have seen the effects of MADV_RANDOM when testing with actual leveldb due to that.
< ossifrage>
Depends heavily on the specific hardware yeah
< ossifrage>
But drop_caches is a mighty big hammer
< gmaxwell>
sure, just in terms of "huh why did MADV_RANDOM seem to do nothing"... now we know.
< ossifrage>
I suspect the minimum amount read without MADV_RANDOM depends on things like the filesystem and the underlying block device
< ossifrage>
Like reading 1 page with a SSD is no big deal vs 1 page with a RAID5 md array
< ossifrage>
Any ideas how much leveldb actually reads at a time?
< ossifrage>
It would have to touch at least some of the index and then the actual data it wants to read
< gmaxwell>
it does a bisection search in the index and then I think it only reads the record of interest.
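A simplified illustration of that lookup shape, not leveldb's actual data structures: binary-search a sorted in-memory index of block handles, then read (or fault in via the mmap) only the one data block the matching handle points at.

    // Simplified analogue of an sstable lookup: find the single block that can
    // contain the key via binary search over the index, then touch only it.
    #include <algorithm>
    #include <cstdint>
    #include <string>
    #include <vector>

    struct IndexEntry {
        std::string last_key;  // largest key stored in the block
        uint64_t offset;       // block offset within the .ldb file
        uint64_t size;         // block size in bytes
    };

    // Returns the handle of the only block that can contain `key`, or nullptr.
    const IndexEntry* FindBlock(const std::vector<IndexEntry>& index, const std::string& key) {
        auto it = std::lower_bound(index.begin(), index.end(), key,
            [](const IndexEntry& e, const std::string& k) { return e.last_key < k; });
        return it == index.end() ? nullptr : &*it;
    }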