
> Every programmer should have a basic understanding of latencies.

Why exactly?

Aren't you making the same mistake as the author, confusing "every programmer" with "every low-level programmer whose work is highly dependent on speed optimization"? Those are, from my understanding, a minority of us. I don't even know how the APIs I use are implemented, let alone how many "fetches from L2 cache" they make.

But of course, I know very little about hardware. So I could be very mistaken. I would love to be proven wrong.



There are very few software environments where the cost difference of "Read 1MB sequentially from disk" versus "Read 1MB sequentially from memory" is irrelevant.


If you're having problems with page faulting, you really just need to use less memory, not necessarily use it more intelligently. If you're programming so poorly that you're blowing through all the available system memory, the problem is likely a little simpler than you not knowing how paging, the TLB, etc. work.

Either way, since the disk is so slow, virtual memory and paging are handled by the OS, not the hardware; so there's a lot less that the programmer can specifically control once you get that far down the memory hierarchy, as opposed to (say) avoiding cache misses.


Not true!

A programmer who is aware of the latency issues around disks can figure out how to move data out of memory into files, and how to change algorithms to arrange for sequential access to data. One who is not aware has no way to figure out how to stop the month-end data processing run taking five days. (That is a real world example.)


To avoid disk latency, you're moving data out of memory and onto disk? That's essentially a coping mechanism for having stuff taking up space in memory that doesn't need to be in memory. Knowing how and when to do that is useful, but doesn't require a lot of knowledge of the memory hierarchy beyond "the disk is incredibly freaking slower than memory."

edit: and I just realized I'm in the subthread about knowing about latencies being important, not the subthread about knowing the entire contents of the linked paper being important, so yes, I think we're agreeing.


Actually we are disagreeing.

Your claim was that once you're hitting problems with disk latency, there isn't a lot that a programmer can control. My point was that there actually is a ton that the programmer can control at that point.

In fact a programmer working in high level scripting languages benefits more from knowing about disk issues than from anything else, for the simple reason that they really don't have any control over anything else. Of course you hope that the people implementing the language understand everything else...


No, my point was that if you're running into problems with page faulting, there is less a programmer can do than if they were running into problems with cache misses. Specifically, what can be done is to use less memory, which would be an equally effective strategy for cache misses. The difference is the cache misses can also be avoided by modifying memory access patterns. Such a strategy is far less effective at controlling page faults and as such, the primary strategy a programmer has is to program in a manner that occupies less space in memory. Doing so does not require a detailed and fundamental understanding of the memory hierarchy and only really requires an understanding of the fact that "the disk is incredibly freaking slower than memory."
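To make the cache-miss side of this concrete, here's a minimal sketch (Python, purely illustrative; the effect is far more dramatic in a low-level language): the two functions do identical work, but one walks each inner list in layout order while the other hops between rows on every access.

```python
# Illustrative sketch: same computation, two traversal orders.
# Row-major follows the layout of each inner list; column-major jumps
# between rows on every step. In C this difference alone can change
# cache-miss rates dramatically; in Python the effect is muted, but the
# access-pattern idea is the same.
N = 512
grid = [[1] * N for _ in range(N)]

def sum_row_major(g):
    return sum(g[i][j] for i in range(N) for j in range(N))

def sum_col_major(g):
    return sum(g[i][j] for j in range(N) for i in range(N))

print(sum_row_major(grid) == sum_col_major(grid))  # → True: same answer either way
```

The point being: for cache misses you can keep the same data and the same total memory use, and win purely by reordering accesses.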


I don't see the distinction you are trying to draw.

If I take data that was in RAM, explicitly move it into a file and then make sure that I access it sequentially, I've modified my memory access patterns in a way that controls page faults. I'm using the same amount of actual space in memory and on disk, I've just labeled some of it "file" rather than "RAM".
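As a sketch of that "label some of it file and read it sequentially" idea (names and format are illustrative, Python just for concreteness): spill the records to a temporary file, then make a single sequential pass over them instead of keeping everything resident.

```python
import csv
import tempfile

# Sketch with illustrative names: spill records out to a file, then
# process them in one sequential scan instead of keeping them all in RAM.
def spill_and_sum(records):
    with tempfile.TemporaryFile("w+", newline="") as f:
        writer = csv.writer(f)
        for rec in records:
            writer.writerow(rec)   # write phase: sequential append
        f.seek(0)                  # read phase: one sequential scan
        return sum(int(row[1]) for row in csv.reader(f))

print(spill_and_sum([("a", 1), ("b", 2), ("c", 3)]))  # → 6
```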

In fact if anything I'd argue it the other way. If you're using a high level scripting language, what I just said about page faults still works. However the language gives you so little control over what happens in memory that you can't control the cache misses.


I guess I'm not entirely sure what you're saying. You're avoiding instances where you're accessing data in such a sporadic way that it repeatedly requests data from a page not in memory? Then when that page is brought into memory, it requests data from a different page that is not in memory, and so on? Given the principle of locality, you're going to have to try pretty hard to make your memory accesses that far apart for this to ever happen. If that's what you're saying, then I'll concede that in this unlikely circumstance memory access patterns can help; but really, it's more that your unusual access patterns were hurting you in the first place than that you're fixing a typical problem by tuning them.

The only other reason you're going to be having page faults is because you're trying to work with more data than can fit in memory. In this case, you will inevitably page fault. The only way to avoid that is to use less memory (or install more memory). So much less that all your data can fit in main memory at a given time.


You keep saying "the only way" when there are other ways.

Let me give a concrete example.

You are page faulting because your data is in a large disk-backed hash. By the nature of a hash, you can access any part of it. Convert that hash to a B-tree, and similar requests are more likely to be close in memory. This can be a big win if there is a non-random access pattern (for instance you are more likely to look up recent records than old ones). Performance may magically get much better. You haven't changed how much memory you need, your data set, etc. But page faults are reduced.
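A rough way to see why the B-tree helps (a toy model I'm making up here, not real paging): lay the same records out hash-style (scattered) and B-tree-style (key order), and count how many fixed-size "pages" a skewed batch of lookups touches under each layout.

```python
import random

# Toy model (not real paging): count how many fixed-size "pages" a batch
# of lookups touches under a scattered (hash-like) layout versus a sorted
# (B-tree-like) layout. Numbers are illustrative, not a benchmark.
PAGE_SIZE = 64                   # records per "page"
random.seed(0)

keys = list(range(10_000))
hash_layout = keys[:]
random.shuffle(hash_layout)      # hash: records end up scattered
btree_layout = sorted(keys)      # B-tree: records stored in key order

pos_hash = {k: i for i, k in enumerate(hash_layout)}
pos_btree = {k: i for i, k in enumerate(btree_layout)}

# Skewed access pattern: mostly "recent" (large) keys, as in the comment.
queries = sorted(random.choices(range(9_000, 10_000), k=1_000))

def pages_touched(pos):
    return len({pos[k] // PAGE_SIZE for k in queries})

# The sorted layout clusters the hot keys onto a handful of pages.
print(pages_touched(pos_hash), pages_touched(pos_btree))
```

Same data, same total memory; the sorted layout just touches far fewer pages for the same queries.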

More drastically, rather than going through one data set doing random lookups in another, you can sort both data sets and then process them in parallel. This is a much, much bigger change in the program. However it can result in many orders of magnitude better performance.
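A sketch of that sort-both-then-merge idea (illustrative code; it ignores duplicate keys): once both sides are sorted, each is consumed in a single sequential pass instead of doing a random lookup per record.

```python
# Sketch: replace per-record random lookups of one data set in another
# with a sort-both-then-merge pass, so each side is read sequentially
# exactly once. Assumes unique keys for simplicity.
def merge_join(left, right):
    """Yield (key, left_val, right_val) for keys present in both sorted lists."""
    li = ri = 0
    out = []
    while li < len(left) and ri < len(right):
        lk, lv = left[li]
        rk, rv = right[ri]
        if lk == rk:
            out.append((lk, lv, rv))
            li += 1
            ri += 1
        elif lk < rk:
            li += 1
        else:
            ri += 1
    return out

events = sorted([(3, "click"), (1, "view"), (2, "view")])
users = sorted([(2, "alice"), (3, "bob"), (5, "carol")])
print(merge_join(events, users))  # → [(2, 'view', 'alice'), (3, 'click', 'bob')]
```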

I am drawing these from a concrete example. I was looking at a month-end job that produced summary statistics for a busy website. Substituting a B-tree for a hash made it go from 5 days to something like 3 days. Not bad for under 5 minutes' work. (I just changed the parameters that opened the database file to use a different format.) Later I took the time to completely rewrite it, and that version took about an hour. Through this process there were no hardware improvements.

So you see, understanding disk performance issues is not useless for programmers. You have options for page faults other than asking for better hardware.


Now it's definitely you who is misunderstanding. The scenario you just explained is a very nice example of the caveat that I allowed for. You've found an uncommon case where you're accessing memory in such a way to defy spatial and temporal locality and are experiencing page faults at a rate greater than those that are due to your data set simply being larger than your memory.

However, provided that your hash is greater than the size of your memory, you are still inevitably going to page fault if you expect to access every element of that hash. In that case, "the only way" to avoid page faults is to either increase the size of your memory or to use less of it.

    understanding disk performance issues is not useless for programmers
You're right; that's a fantastic point. The problem is that no one made that claim.


Go back to http://news.ycombinator.com/item?id=3920319 to see the claim that I disagreed with. There is a lot that can be controlled at the page fault level other than just "use less memory". And having a lot of data in a hash is hardly an uncommon use case if someone's processing, for instance, a log file.


I mean, I know what claim you disagreed with. We just had a conversation spanning 9 hours where we hashed out (heh heh) what I actually meant in that post. I don't want to reset back to that post and start explaining from the beginning because I already typed all those words explaining what I meant.

And I dunno, man, you have big log files. I don't think that processing many-gigabyte log files is what most would call a common case, but I could be totally wrong here.



