Hacker News
LemonGraph – A log-based transactional graph (github.com/nationalsecurityagency)
101 points by adulau on June 17, 2018 | 24 comments


At this point I really have to wonder how many people have written graph engines on top of LMDB! In the past month I've seen two from the same bank built on top of lmdbjava. Instead of reinventing this same wheel over and over, it'd probably make sense for somebody to sit down with LMDB and TinkerPop [1] and bang out one decent implementation.

...actually this has been done [2] but the project looks abandoned. So NSA guys you should get right on this.

[1] http://tinkerpop.apache.org/docs/current/reference/

[2] https://github.com/pietermartin/thundergraph


At least, this one is in Python. My dream would be to have a transactional graph database written in Python which is used as back-end for NetworkX (https://networkx.github.io/).


This looks interesting. I have to say I find it amusing that the NSA has a GitHub.


And that they use Adventure Time puns to name things. I guess the NSA techs are just a bunch of nerds with government contracts


A bunch of nerds with government contracts that smoke weed on the way to their interview

https://motherboard.vice.com/en_us/article/d737mx/the-fbi-ca...

EDIT: oops that was 2014, apparently the FBI isn't having that issue anymore

https://motherboard.vice.com/en_us/article/aepj4p/fbi-mariju...


They have more code up than just what is found on their github page:

https://code.nsa.gov/

Huge amount of stuff devoted to managing ETL.


They also made GeoMesa, a geospatial graph database that can run on top of Bigtable/Accumulo, HBase, Cassandra, and Kafka.


Yeah, my first thought was that it automatically does free streaming replication + backup to NSA servers.


The committers' names are all fake though :)


This is what my tax dollars are going to? Another shitty software library. ...I mean, wow this thing is awesome. Good job NSA!


What's a use case for this? Is it like Neo4j, or is there some other niche use case (e.g., a mass surveillance graph, given the source)?


Yes, it's like Neo4j but based on the in-memory database LMDB, presumably to provide high performance running on a single machine.

The link mentions that its use case is streaming seed set expansion, which allows you to identify communities based on a set of seeds. I wrote more about that in this comment: https://news.ycombinator.com/item?id=17335873


Quibble: LMDB is a memory-mapped file database. It is not an in-memory database, although it generally outperforms all other in-memory databases.
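
To make the distinction concrete, here is a minimal sketch using Python's standard `mmap` module. It is not LMDB and not its API; the file name `demo.db` and the helper `mmap_roundtrip` are made up for illustration. It only shows that a memory-mapped file is edited like memory while the data still lives in a file on disk:

```python
import mmap
import os
import tempfile

def mmap_roundtrip(path):
    """Write a file, edit it through a memory map, then re-read it from disk."""
    # The data starts out as an ordinary file on disk.
    with open(path, "wb") as f:
        f.write(b"hello graph")

    # Map the whole file and modify it with plain slice assignment.
    with open(path, "r+b") as f:
        with mmap.mmap(f.fileno(), 0) as m:
            first = bytes(m[:5])   # read through the mapping
            m[:5] = b"HELLO"       # write through the mapping (same length)
            m.flush()              # push the change back to the file

    # Ordinary file I/O sees the change, so it was persisted, not just held in RAM.
    with open(path, "rb") as f:
        return first, f.read()

with tempfile.TemporaryDirectory() as d:
    before, after = mmap_roundtrip(os.path.join(d, "demo.db"))
```

A pure in-memory store would lose the edit once the process exits; here it survives in the file.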


Thanks, this is the first I had heard of LMDB.

I actually love the idea of a memory-mapped database, I've often thought memory mapping isn't taken advantage of enough.


It’s a graph database built on LMDB, so probably has the same use cases as you’d want to use LMDB in, but with some graph-y helpers.


Why is adding properties to a node significantly faster (153k/s) than adding edges to a node (25k/s)?


Indexes? There seem to be indexes for fromNode, toNode, and edges.
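
If that guess is right, the gap would come from write amplification. A toy sketch of the idea, assuming one store write per property and three per edge (the edge record plus a source index and a target index). `ToyGraph` and its layout are hypothetical and not LemonGraph's actual storage format:

```python
from collections import defaultdict

class ToyGraph:
    """Hypothetical store that counts how many writes each operation needs."""

    def __init__(self):
        self.props = {}                  # (node, key) -> value
        self.edges = {}                  # edge_id -> (src, tgt)
        self.by_src = defaultdict(set)   # assumed index: src -> edge ids
        self.by_tgt = defaultdict(set)   # assumed index: tgt -> edge ids
        self.writes = 0

    def set_prop(self, node, key, value):
        # A property is a single record write in this toy model.
        self.props[(node, key)] = value
        self.writes += 1

    def add_edge(self, src, tgt):
        # An edge writes the record itself plus one entry in each index.
        edge_id = len(self.edges)
        self.edges[edge_id] = (src, tgt)
        self.by_src[src].add(edge_id)
        self.by_tgt[tgt].add(edge_id)
        self.writes += 3
        return edge_id

g = ToyGraph()
g.set_prop("a", "color", "red")
prop_writes = g.writes
g.add_edge("a", "b")
edge_writes = g.writes - prop_writes
```

In this model an edge costs three times as many writes as a property, which is in the same ballpark as the reported numbers but proves nothing about the real implementation.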


For laymen like me: what is this, and what are the perfect use cases for it?

"..log-based transactional graph (nodes/edges/properties).. ..primary use case is to support streaming seed set expansion." -- I'm totally lost.

I know this kind of software is targeted at developers, but it wouldn't hurt to give an analogy like "Uber for XXX", as in startup pitches, e.g. "It's like <put popular product name here, e.g. MySQL> but <differentiating factors>".


Graph databases are a really neat thing that liberate you from the need to figure out your database schema at the outset and also allow much faster searching than traditional table-based queries across huge datasets. They're ideal for sparse data or for collections of data whose structure/relationships you're not sure about, and also allow very fast searches because the number of steps between different nodes typically grows more slowly than the number of records between different table entries.

There are a bunch of them on the market; Neo4j is probably the most popular (and has lots of good-quality introductory text on the website and on YouTube). Graph databases are key to many major internet services: Google, Twitter, and Facebook, for example, are all just really big graph databases.

This particular graph database stores all its data in a single file, trading speed and simplicity off against flexibility. I'm not an expert, but it seems like it would work very well for search queries and poorly for tasks involving a lot of contributors, like a chat server.


Based on the description on the site, the NSA uses this as a way to identify communities based on their communication patterns. This is a kind of social network analysis: https://en.wikipedia.org/wiki/Social_network_analysis

The particular use case mentioned for LemonGraph is "streaming seed set expansion", which, given a set of "seeds" can expand that set based on their communication patterns to find people that are likely in the same community or overlapping communities.

E.g., if you have a few known terrorists and a database of metadata about phone calls or internet communications (emails etc.), you can analyze who your known subjects (seeds) talk to and who those people talk to, etc., to identify communities. This relies on the fact that there tends to be cross-communication between people in the same community.
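
A minimal sketch of that expansion step in plain Python, assuming the data is just a list of (caller, callee) pairs. This is not LemonGraph's API; `expand_seeds`, the names, and the hop limit are made up for illustration, and it ignores the streaming part by working on a fixed edge list:

```python
from collections import defaultdict, deque

def expand_seeds(edges, seeds, max_hops):
    """Return {node: hop distance} for every node within max_hops of a seed."""
    # Treat communication as undirected: a call links both parties.
    neighbors = defaultdict(set)
    for a, b in edges:
        neighbors[a].add(b)
        neighbors[b].add(a)

    hops = {s: 0 for s in seeds}
    queue = deque(seeds)
    while queue:
        node = queue.popleft()
        if hops[node] == max_hops:
            continue  # do not expand past the hop limit
        for nxt in neighbors[node]:
            if nxt not in hops:
                hops[nxt] = hops[node] + 1
                queue.append(nxt)
    return hops

# Hypothetical call records: one connected chain plus an unrelated pair.
calls = [("alice", "bob"), ("bob", "carol"), ("carol", "dave"), ("erin", "frank")]
community = expand_seeds(calls, seeds=["alice"], max_hops=2)
```

With one seed and two hops, the result includes bob and carol but not dave (three hops away) or the unrelated erin/frank pair. A streaming version would rerun or incrementally update this as new edges arrive.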

This kind of analysis can often reveal the structure of communities, like where the headquarters is, who the boss is, etc.

In industry and law enforcement, the same kinds of approaches can be used to identify fraud of various kinds.

"What it's like" is a graph database, in this case based on a fast in-memory database to support high speed graph analysis on a single machine.


In this case, a "graph" is an organized way to store and represent data, especially in cases where you want to store relationships (edges) between different things (nodes).

Streaming is just referring to its ability to be constantly updated by incoming data.

So it's like a database but for a much more narrow use case and can perform much better in those cases.


The NSA released a graph database. I wonder what they use it for. /s


Unix commands: awk - AWKWARD. When you write in awk you become awkward. sed - When you write in sed you become extremely sad.


Wrong thread?



