Hacker Newsnew | past | comments | ask | show | jobs | submitlogin

Sometimes the job of an interview question is to make it clear to the interviewee that this is not, after all, somewhere they want to work.

That would be my reaction to those interview questions.

As for the follow-up question, my response would be to ask why you are optimizing your B tree for a specific architecture instead of using a cache-oblivious B tree. (Yes, I can think of several reasons why not to. But if the interviewer is asking me BS questions, I want to know whether their knowledge is similarly good.)

Of course in practice I'm generally happy to use BerkeleyDB - the B tree has never been a performance bottleneck for me except when it involved disk latency. And at that point a more fundamental reorganization of algorithms becomes necessary.



Cache wasn't mentioned in the question... there might be no cache at all. ;-)

Cache oblivious B-trees aren't exactly oblivious to how DRAM behaves, and sometimes their operation actually can be counter to what works best in DRAM (though I agree that generally they tend to do better than data structures that are "oblivious" in a very different sense).

But let's ignore that for a moment. Okay, you aren't using BerkeleyDB, MySQL, or any other of the common B-trees out there, but instead you've got your hands on a cache oblivious b-tree. What is the approximate optimal fanout rate and what is the approximate number of reads per second you can achieve with the 7-7-7-23 DDR3 memory (or memory of your choice that is commonly found in modern systems)?

The question isn't really about knowing the specifics of a particular memory infrastructure. It's about understanding the cost of going to memory, which is increasingly as important as understanding disk latency (you know how "disk is the new tape"? well, RAM is the new "disk").

Interestingly, a lot of databases these days are CPU bound , particularly when working with big data, but even in other cases. They aren't really CPU bound, it just looks like that because of how accounting is done. When you poke under the covers, you see the CPU pipelines are comparatively idle because the CPU's are all tied up accessing memory.

Particularly if you are using AWS or other virtualization services, most web applications these days rely on servicing a majority of their requests entirely from memory. Being able to understand the performance characteristics of memory at least within an order of magnitude becomes increasingly important, because the average rate that you can service from memory becomes a critical metric for the scaling of your services.

It's more than even that though. As concurrency becomes increasingly important, locks and atomics become central to a lot of work. The last few iterations of Moore's law have changed the principles of how locks & CAS impact performance. It used to be that locks were bad, because you often burned up CPU going down to the kernel, but more than that because suddenly you are wasting CPU while you wait for something else to complete. CAS allowed you to largely sidestep all that CPU wasting and you're a happy man. Then CAS got integrated in to locks so that locks only really sucked when you had contention, but now... The real expensive cost of a lock these days is the memory fence put around it. CAS is better because the fence is scoped to only the object being swapped, but even that can be horribly expensive if there is contention on the object. Having a clear understanding of just how big that cost is relative to burning some more CPU, wasting some RAM, or even changing the constraints for your code, can be very important.

Trust me. This stuff really matters for any context where performance matters, and it is only going to get more important in the coming years. Memory has pretty much already trumped CPU in terms of its importance for most applications, and the number of cases for which it is more important than disk is growing at a rapid rate.


I believe you.

However I've spent most of my life working within scripting languages. And the same is probably true of most programmers in the startup world.

If you're able to get away with using a scripting language, then your problems had better not be performance critical, because your swiss army knife of choice really sucks as a saw...


Yeah, web apps mostly get to ignore stuff like this until they're really big. But then sometimes you're a startup like Akiban Technologies and this is what you do.


First, the notion that using a scripting language means that performance isn't an issue is a canard. It ought to mean your revenue isn't directly correlated to performance, but that's about it. Indeed, if you are using a scripting language, the odds are more important that you understand the trade offs, as your lack of low level options means you need to make smart choices about high level options.

Second, very few startups don't end up making decisions about performance even in the early days. What? You're using a NoSQL store? Is that because you just hate yourself or because you thought it'd be more efficient/scale better than an RDBMS? Why aren't you just storing it all in memory? Wait, you can run out of that? just let it swap then! Actually, why not just store all the data in a flat file? You don't even need to sort it, you can just scan through it whenever you need to, because performance doesn't matter. Why are you using an associative array there? Do you care about having efficient lookups based on a key or something?

You see my point.

I get that some people are playing around in a segment of startup land where programming isn't really that important, but presumably your long term plan isn't for every startup you participate in to never go anywhere. What if you are a success? Do you still want to be on the technical side when you've now got a team of developers and a serious customer base? If not, that's fine and likely a great career plan, but you aren't a programmer in the professional sense (which is what the paper is referring to). If you are planning on staying on that side, either learn this stuff or expect to be cast aside and not grow with the organization, because you aren't ready for the bigger world you'd now be playing in.

To be clear here, I'm not suggesting you'd have to pick up a systems programming language. Heck, when I started my career two decades ago, when CPU's were a lot slower and interpreters weren't as efficient as they are now, there was already this notion that "oh, I'm working in an interpreted language which provides a layer of abstraction so I don't need to know how computers work". I made a good start of my career (and met a lot of people who had been doing so for quite some time) essentially kicking those guys out of their jobs because it turned out doing scripting with an understanding of the computing consequences of your choices is applicable in a very broad set of situations, but scripting without that understanding is barely more useful than knowing how to use Excel.

High level languages actually can kick quite a bit of butt in performance critical situations, because they let you focus on the bigger picture of how you are executing, where the bigger performance wins can be had (I've seen "scripting languages" winning performance bake-offs for exactly this reason), but you have to know what the heck you are doing. People often make the mistake of thinking this won't matter. They tend to have very short careers.


Within the limitations of a scripting language, of course you try to get the best performance that you can.

However in general within a scripting language you are not trying to optimize things at the CPU cache level. You're worried about overall efficiency of your algorithms, avoiding disk, identifying performance bottlenecks, etc. But the pervasive use of hashes everywhere and memory hungry data structures makes careful use of CPU caches pretty much a lost cause.

Furthermore the language does so much behind your back that an analysis of performance from first principles is very unlikely to be right. Instead you need to benchmark.




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: