LemonGraph – A log-based transactional graph

dnomad · on June 17, 2018

At this point I really have to wonder how many people have written graph engines on top of lmdb! In the past month I've seen two from the same bank built on top of lmdbjava. Instead of reinventing this same wheel over and over, it'd probably make sense for somebody to sit down with lmdb and tinkerpop [1] and bang out one decent implementation.

...actually this has been done [2] but the project looks abandoned. So, NSA guys, you should get right on this.

[1] http://tinkerpop.apache.org/docs/current/reference/

[2] https://github.com/pietermartin/thundergraph

adulau · on June 17, 2018

At least, this one is in Python. My dream would be to have a transactional graph database written in Python which is used as back-end for NetworkX (https://networkx.github.io/).

funfunfunction · on June 17, 2018

This looks interesting. I have to say I find it amusing that the NSA has a GitHub.

roryisok · on June 17, 2018

And that they use Adventure Time puns to name things. I guess the NSA techs are just a bunch of nerds with government contracts.

cbcoutinho · on June 17, 2018

A bunch of nerds with government contracts that smoke weed on the way to their interview

https://motherboard.vice.com/en_us/article/d737mx/the-fbi-ca...

EDIT: oops that was 2014, apparently the FBI isn't having that issue anymore

https://motherboard.vice.com/en_us/article/aepj4p/fbi-mariju...

X6S1x6Okd1st · on June 18, 2018

They have more code up than just what is found on their github page:

https://code.nsa.gov/

Huge amount of stuff devoted to managing ETL.

freeduck · on June 17, 2018

They also made geomesa. A geospatial graph database that can run on top of bigtable/accumulo, hbase, cassandra and kafka.

sudhirj · on June 17, 2018

Yeah, my first thought was that it automatically does free streaming replication + backup to NSA servers.

plq · on June 17, 2018

The committers' names are all fake though :)

therealtomsmith · on June 17, 2018

This is what my tax dollars are going to? Another shitty software library. ...I mean, wow this thing is awesome. Good job NSA!

codebeaker · on June 17, 2018

What's a use case for this? Is it like Neo4j, or is there some other niche use case (e.g., a mass surveillance graph, given the source)?

antonvs · on June 18, 2018

Yes, it's like Neo4j but based on the in-memory database LMDB, presumably to provide high performance running on a single machine.

The link mentions that its use case is streaming seed set expansion, which allows you to identify communities based on a set of seeds. I wrote more about that in this comment: https://news.ycombinator.com/item?id=17335873

hyc_symas · on June 18, 2018

Quibble: LMDB is a memory-mapped file database. It is not an in-memory database, although it generally outperforms all other in-memory databases.

antonvs · on June 22, 2018

Thanks, this is the first I had heard of LMDB.

I actually love the idea of a memory-mapped database, I've often thought memory mapping isn't taken advantage of enough.
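For anyone else new to the idea, here is a minimal sketch of memory mapping in general, using only Python's standard `mmap` module. It is not LMDB's API and says nothing about LMDB's internals; the helper name `mmap_roundtrip` is made up for illustration. It shows the distinction above: the data is written like memory but ends up in a file on disk.

```python
import mmap
import os
import tempfile


def mmap_roundtrip(payload: bytes) -> bytes:
    """Write a non-empty payload through a memory map, then read the file back."""
    fd, path = tempfile.mkstemp()
    try:
        # A file needs a non-zero size before it can be mapped.
        os.ftruncate(fd, len(payload))
        with mmap.mmap(fd, len(payload)) as mapped:
            mapped[:] = payload  # looks like an in-memory write
            mapped.flush()       # ask the OS to write dirty pages to the file
        os.close(fd)
        fd = None
        # Read through the ordinary file API to show the data is on disk.
        with open(path, "rb") as fh:
            return fh.read()
    finally:
        if fd is not None:
            os.close(fd)
        os.remove(path)
```

Calling `mmap_roundtrip(b"hello lmdb")` returns the same bytes, read back from the file rather than from the mapping.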

jsumrall · on June 17, 2018

It’s a graph database built on LMDB, so probably has the same use cases as you’d want to use LMDB in, but with some graph-y helpers.

amelius · on June 17, 2018

Why is adding properties to a node significantly faster (153k/s) than adding edges to a node (25k/s)?

jpalomaki · on June 17, 2018

Indexes? There seem to be indexes for fromNode, toNode, and the edges themselves.
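A toy model of that guess, in case it helps. This is not LemonGraph's actual storage layout; `ToyGraph` and all of its fields are hypothetical. It only illustrates why an edge insert could cost several index writes where a property insert costs one.

```python
from collections import defaultdict


class ToyGraph:
    """Hypothetical in-memory model: properties need one write, edges need three."""

    def __init__(self):
        self.props = {}                    # (node, key) -> value
        self.edges = {}                    # edge id -> (source, target)
        self.by_source = defaultdict(set)  # source node -> edge ids
        self.by_target = defaultdict(set)  # target node -> edge ids
        self.writes = 0                    # number of index/table updates so far

    def set_property(self, node, key, value):
        self.props[(node, key)] = value
        self.writes += 1

    def add_edge(self, source, target):
        edge_id = len(self.edges)
        self.edges[edge_id] = (source, target)  # the edge record itself
        self.by_source[source].add(edge_id)     # index on the source node
        self.by_target[target].add(edge_id)     # index on the target node
        self.writes += 3
        return edge_id
```

In this model an edge costs three updates to a property's one. That would not by itself account for the full gap between 153k/s and 25k/s, so treat it as one plausible contributor rather than an explanation.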

wiradikusuma · on June 17, 2018

for laymen like me: what is this, and what are the perfect use-cases for this?

"..log-based transactional graph (nodes/edges/properties).. ..primary use case is to support streaming seed set expansion." -- I'm totally lost.

I know this kind of software is targeted at developers, but it wouldn't hurt to give an analogy like "Uber for XXX", as in startup pitches, e.g. "It's like <put popular product name here e.g. MySQL> but <differentiating factors>".

anigbrowl · on June 17, 2018

Graph databases are a really neat thing that liberate you from the need to figure out your database schema at the outset and also allow much faster searching than traditional table-based queries across huge datasets. They're ideal for sparse data or for collections of data whose structure/relationships you're not sure about, and also allow very fast searches because the number of steps between different nodes typically grows more slowly than the number of records between different table entries.

There are a bunch of them on the market; Neo4j is probably the most popular (and has lots of good-quality introductory text on the website and on YouTube). Graph databases are key to many major internet services, e.g. Google, Twitter, and Facebook are all just really big graph databases.

This particular graph database stores all its data in a single file, trading flexibility for speed and simplicity. I'm not an expert, but it seems like it would work very well for search queries and poorly for tasks involving a lot of contributors, like a chat server.

antonvs · on June 18, 2018

Based on the description on the site, the NSA uses this as a way to identify communities based on their communication patterns. This is a kind of social network analysis: https://en.wikipedia.org/wiki/Social_network_analysis

The particular use case mentioned for LemonGraph is "streaming seed set expansion", which, given a set of "seeds" can expand that set based on their communication patterns to find people that are likely in the same community or overlapping communities.

E.g., if you have a few known terrorists and a database of metadata about phone calls or internet communications (emails etc.), you can analyze who your known subjects (seeds) talk to and who those people talk to, etc., to identify communities. This relies on the fact that there tends to be cross-communication between people in the same community.
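As a rough sketch of the "who do the seeds talk to, and who do those people talk to" step, here is a plain breadth-first expansion over a list of contact pairs. This is a simplification and not LemonGraph's API or the NSA's method; the function name `expand_seeds` and its parameters are made up, and real seed set expansion would presumably score candidates rather than just count hops.

```python
from collections import deque


def expand_seeds(contacts, seeds, max_hops):
    """Return {person: hop distance} for everyone within max_hops of any seed."""
    # Build an undirected adjacency map from (a, b) communication records.
    graph = {}
    for a, b in contacts:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)

    hops = {seed: 0 for seed in seeds}
    queue = deque(seeds)
    while queue:
        person = queue.popleft()
        if hops[person] == max_hops:
            continue  # do not expand past the hop limit
        for other in graph.get(person, ()):
            if other not in hops:
                hops[other] = hops[person] + 1
                queue.append(other)
    return hops
```

With contacts `[("a", "b"), ("b", "c"), ("c", "d"), ("x", "y")]`, seed `"a"`, and a limit of two hops, this picks up `b` and `c` but leaves out `d`, `x`, and `y`.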

This kind of analysis can often reveal the structure of communities, like where the headquarters is, who the boss is, etc.

In industry and law enforcement, the same kinds of approaches can be used to identify fraud of various kinds.

"What it's like" is a graph database, in this case based on a fast in-memory database to support high speed graph analysis on a single machine.

jjeaff · on June 17, 2018

In this case, a "graph" is an organized way to store and represent data, especially in cases where you want to store relationships (edges) between different things (nodes).

Streaming is just referring to its ability to be constantly updated by incoming data.

So it's like a database, but for a much narrower use case, and it can perform much better in those cases.

znpy · on June 17, 2018

The NSA released a graph database. I wonder what they use it for. /s

kapustinsky · on June 17, 2018

Unix commands: awk - AWKWARD. When you write in awk you become awkward. sed - When you write in sed you become extremely sad.

floatboth · on June 17, 2018

Wrong thread?