IAM is hard – Thoughts on $80M fine from the Capital One Breach (twitter.com/kmcquade3)
241 points by bharatsb on Aug 8, 2020 | 122 comments


My general experience with crafting IAM policies is very reminiscent of SELinux, in that it's very difficult to work agnostically while adhering to the principle of least privilege. Especially given that this kind of task is often done by admin/ops people, one typically can't know in advance everything that the app might need to be able to access in order to work correctly. The process of discovering this is: try running it -- it fails, you note the permissions error -- look at the docs, talk to the devs, try to give it the most granular permissions it needs to get around that error -- rinse & repeat, probably many, many times. This is onerous and you wind up going all around the mulberry bush trying to understand and satisfy every dependency.

At least, if you're very good and don't mind being perceived as a roadblock, you try to understand things. If you're more typical, you just find the most direct route from logged error to added permission (audit2allow approach). And if you're bad, which is also not uncommon, you just give it the broadest permissions possible and call it a day.

With respect to IAM in particular, I'm finding in the Lambda world that some seemingly straightforward functions wind up needing some sort of access to all kind of other AWS services; these services each have their own funky permissions structures and attendant quirks. Each one is a temptation to the IAM admin to just throw their hands in the air and put a wildcard on it.
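As a concrete illustration of that temptation (a hypothetical sketch, not any particular production policy; the table name, region, and account ID are placeholders):

```python
import json

# The tempting wildcard vs. a scoped statement for a Lambda function that
# only reads one DynamoDB table. All identifiers below are made up.
wildcard_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {"Effect": "Allow", "Action": "dynamodb:*", "Resource": "*"}
    ],
}

scoped_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": ["dynamodb:GetItem", "dynamodb:Query"],
            "Resource": "arn:aws:dynamodb:us-east-1:123456789012:table/example-table",
        }
    ],
}

print(json.dumps(scoped_policy, indent=2))
```

The second version is what the discovery loop above is trying to converge on; the first is what you get when someone throws their hands in the air.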


I've seen this a ton. I have been an on/off security professional so academically I am committed to the principle of least privilege, but holy hell it can be painful or impossible in real life.

Where possible I've started adopting the "run it and see" or audit2allow approach (there are awesome tools that can do this for AWS IAM perms too), but then before applying the policy, somebody needs to put a quick line beside each permission that explains why. If the answer is "I don't know" and the permission is simple/low-risk then maybe let it go. If it's a high-risk permission then don't do it without answers.
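A minimal sketch of what that review step could look like in practice (the action names and the high-risk list here are illustrative assumptions, not any particular team's rules):

```python
# Each proposed permission carries a one-line justification; "I don't know"
# is represented as None. High-risk permissions without an answer block the
# review, while simple ones are allowed through.
HIGH_RISK = {"iam:PassRole", "iam:CreatePolicyVersion", "s3:PutBucketPolicy"}

def review(permissions):
    """permissions: dict mapping action -> justification string or None."""
    return [action for action, why in permissions.items()
            if why is None and action in HIGH_RISK]

perms = {
    "s3:GetObject": "reads report templates from the assets bucket",
    "iam:PassRole": None,  # nobody could say why -- blocks the review
}
print(review(perms))  # -> ['iam:PassRole']
```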

That formula is the only pragmatic one I've seen. Not perfect but sometimes perfect is the enemy of good.


> Not perfect but sometimes perfect is the enemy of good.

This. Add enough hurdles, and people will

a) spend all the energy they're willing to spend on your process, hate it, and as a result never do anything except what you force them to, and in particular, never voluntarily reduce permissions unless forced (because they might need them later and then it'll be a pain). They'll also see you as an enemy, not a partner, which is not the place a security team should be in.

b) optimize for not what is best, but what is least affected by your wall of process, even if it's less secure (e.g. because it's a legacy system that you didn't get around to locking down yet).

c) outsource to less secure vendors, and get it approved because management knows that getting it deployed internally would take forever due to the process

d) in the most extreme case, set up uncontrolled shadow IT with zero security controls and hide it from you - and because they already spent all their energy dealing with "perfect", they don't want to hear the word "security" ever again, and the security posture of their shadow IT shows this.


Thank you for acknowledging this. I worked at a place like that, and it put me off security to the point where I didn't want to ever deny any person or app or service permission to anything and was just happy securing the edge nodes of our network. I got over it, and now that I'm in a position where my neck is on the line if there is a security breach, I feel differently - but I can still relate to myself back then and understand the need to be pragmatic and massage people into following the process rather than forcing them.


Years ago I wrote a program to let various services run their course, query CloudTrail for successful calls made to different AWS services, and attempt to find a minimal set of IAM permissions (not applicable for S3 at the time). The idea was to run an exhaustive test suite with expected allowed actions only and deny anything else. I believe AWS has a similar tool now for IAM, but it's not a problem that'll be resolved satisfactorily for everyone given the combinatorics IAM makes possible. Lateral movement across IAM roles and credentials is tough, and even today not every action that IMO should be flagged is reported (IAM role assumption failures across accounts were silent when I checked early last year).
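A rough sketch of the CloudTrail-mining idea (not the commenter's actual tool, and with the caveat that CloudTrail event names only approximate IAM action names - S3 object-level calls in particular were historically not logged):

```python
# Given CloudTrail event records, collect the set of "service:Action"
# permissions that were actually exercised, as a starting point for a
# least-privilege policy. The sample records below are fabricated.
def used_actions(records):
    actions = set()
    for rec in records:
        # eventSource looks like "dynamodb.amazonaws.com";
        # eventName is the API call that succeeded.
        service = rec["eventSource"].split(".")[0]
        actions.add(f"{service}:{rec['eventName']}")
    return sorted(actions)

sample = [
    {"eventSource": "dynamodb.amazonaws.com", "eventName": "GetItem"},
    {"eventSource": "s3.amazonaws.com", "eventName": "GetObject"},
    {"eventSource": "dynamodb.amazonaws.com", "eventName": "GetItem"},
]
print(used_actions(sample))  # -> ['dynamodb:GetItem', 's3:GetObject']
```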


Did you open source it? If not, you definitely should.


Someone has actually been working on a project like this. While not 100% complete it's the best working one I know of. Can definitely relate to the problem being described here, especially when writing IAM policies for terraform deployments.

https://github.com/flosell/trailscraper


Probably can’t be open sourced given IP under contracts but I could try to re-write it. There’s some new services in IAM that could be leveraged to make it more accurate and cheaper to use, too.


Thank you for putting this idea in my head! I've been trying to get better at expressing infrastructure as code, and one of the big blockers has been that adding new services to e.g. Terraform is tough when you don't know all the permissions they need (see https://github.com/hashicorp/terraform/issues/2834 for example).

Using a test AWS environment to stage and then checking CloudTrail to see what was actually called would be a step forward. Having software to extract it would be even better.


The "I don't know" case is the most painful to ask for since it feels like I am always second guessed.

If I have the permission, I can tell you its value in a few minutes, but instead I have to spend hours doing due diligence to try to justify my answer and still am not confident asking.

I usually end up getting what I need in the end so the admins don't see it as a big issue, but it is a subtle thing that can kill productivity for anything related to AWS for everyone else.


In our company we went the other road. We have the developers write the policies (since, I mean, they know what their app needs) and test them in dev environments. After that, the ops guys step in during code review to check for overly broad allows in the policies. So far it seems to work in an acceptable manner.


This sounds like a good process, but it depends a lot on the relationship between dev and ops... I've seen too many dev shops push against changes requested by ops or security because their main pressure is to ship features fast. And then it turns into a management fight and whoever has the more influential management gets the final say while the other side is forced to grumble.


We have machines to tell people their policy (in dev) is insecure and will be deleted in 30 minutes.

Not really easy to argue with the machine.


Nice, that's just like having bots in Discord, where administrators hide behind the bots' decisions so users cannot retaliate. It also combines nicely with shadow bans: the user thinks his posts are going through, but no one replies.

Unfortunately for development it is better to get tight integration of dev and ops so you could solve it by discussion and cooperation. Not sure if you can build such teams that often but that would be great.


Well, developers can fight all they want to. I was the dev lead at a company that had to be HIPAA compliant. My neck was on the line if we were found to be out of compliance along with the security director and the operations people.

Now as a consultant working with many large customers, “shipping fast” is nowhere near as important as security.


I'm in the same position and we haven't built our infrastructure yet.

Did you use a managed service?

Or did you build it slowly and carefully on AWS?

I'm currently stuck between the two options. Managed service seems the easiest way to be HIPAA compliant but I'd rather we managed our own infrastructure on AWS since it gives us more flexibility for stuff like blue green deploys and it would be cheaper.


Back then, I was building a green field project on prem. It was more about limiting access and auditing. In the middle of the implementation a mandate came from on high to “move to the cloud”.

I didn’t know anything about AWS back then, they hired an MSP who was just a bunch of old school netops people who knew how to click around on the console and gave us a bunch of VMs.

Long story short, I studied for one AWS certification so I could talk the talk. I learned all of the things I could have taken advantage of, saw how much the MSP was making, and that changed my whole m.o.: I decided to get some experience with AWS and go into consulting.

Next company I went to, the founders outsourced everything technical to an outsourcing company - software and infrastructure - and they treated AWS as an overpriced colo. Everything was in one account and everyone had access to it. At first, they were just aggregating publicly available information about doctors for hospitals, so it wasn't a big deal.

They brought a new CTO in and started bringing development in house. I led the charge to first separate out the environment to different accounts, establish a sane CI/CD process and then lock down who had access to prod.

Of course they had secret access keys in config files everywhere. We had to audit the code to make sure that no code was using keys. Locally, every SDK can automatically retrieve the keys from your global config file (that’s nowhere near your git repo) and on AWS it gets permissions based on the attached roles.
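A hypothetical audit helper for that key-hunting step - long-term AWS access key IDs start with "AKIA" (and temporary ones with "ASIA"), which makes a simple scan feasible; the config text below uses AWS's documented example key:

```python
import re

# AWS access key IDs are 20 characters: a 4-character prefix ("AKIA" for
# long-term user keys, "ASIA" for temporary credentials) plus 16 more
# uppercase letters or digits.
KEY_PATTERN = re.compile(r"\b(?:AKIA|ASIA)[0-9A-Z]{16}\b")

def find_hardcoded_keys(text):
    """Return any access key IDs embedded in the given file contents."""
    return KEY_PATTERN.findall(text)

config = 'aws_access_key_id = "AKIAIOSFODNN7EXAMPLE"\nregion = us-east-1\n'
print(find_hardcoded_keys(config))  # -> ['AKIAIOSFODNN7EXAMPLE']
```

This only catches the key ID, not the secret key, but in practice the two travel together in config files.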

Then of course we had to lock down roles. But we couldn't have the granular permissions we needed: even though they had lots of microservices (we sold access to our APIs to businesses), they were all running on two "pet" EC2 instances.

Next step was to move the .Net Core APIs to Docker/Fargate and further restrict the attached roles to those.

Finally, we had to audit all of our AWS dependencies and add encryption where necessary and then sign a BAA with AWS and bring in auditors.

By the time I left a month ago, we could pass the needed certifications and expand our offerings.

It took a lot of upskilling, hiring an internal ops person (I’m a developer who knows AWS) instead of depending heavily on the MSP.

I left for greener pastures - I’m a consultant with AWS.


I'm currently experiencing this at my employer. I have a suspicion that slowing the dev and deployment process is in everyone's perceived best interest, until of course the day we are outcompeted.


And rightly or wrongly you will be outcompeted by places with more risk appetite, until they have a security breach before they are big enough to swallow the cost.


In my company, developers write policies, but they have to be approved by someone with ops expertise (usually me). And the policies are often either too broad, or not sufficiently tested, and missing permissions for things it needed. Sometimes the same policy has both problems. I don't blame the developers. You can hardly expect every developer to become an expert on AWS's IAM system. Especially given how inconsistent it can be.


>or not sufficiently tested

How do you test a policy?


You set up a policy in a test environment and run the code there. Of course, generally, the policy can't be identical across environments, so you can run into errors.


That describes my own personal hell with iam, and it drove me to adopt aws-cdk.

aws-cdk is their infra-as-code product and it's extremely valuable for iam alone.

You can say things like: my_lambda.grantRead(s3_bucket)

And it figures out the least privileges necessary to make all of that work.

Plus it's real code, not some annoying DSL, which means you can easily abstract other iam permissions out. I have a fairly tight lambda policy that I reuse in all sorts of places, and it's as easy to use as the above snippet.


Cloudformation. Ewww.


Sure, CF is garbage, I won't try to argue otherwise.

There's this though: https://www.hashicorp.com/blog/cdk-for-terraform-enabling-py...


Wonder what Pulumi will do to stay relevant now.


Pulumi probably has a better story around multicloud, but yes, it does seem like it's going to be difficult to differentiate.


As one of those "admin/ops" people with experience of fixing terribly set up systems, a common issue is that users just can't tell the difference between systems that work insecurely and systems that work securely, but they will immediately notice if a system does not work because the security policy is too strict.

You get the best security when everyone is involved in the security design from the "ground up", but quite often there's not enough communication between the people developing some application, whose work may be valued by the number of features they ship and "velocity", and the operations side of things whose work is to provide the infrastructure for running the software, and to keep it running. At worst you just get some 3rd-party consultant to set up a thing somehow and then afterwards have to reverse-engineer it to figure out what the hell they did and how to prevent it from going up in flames immediately.


I always find the lack of communication goes the other direction. Security and compliance teams just enforce an IAM policy without talking to application or product teams. It gets rolled out and lots of things break, even things that have legitimate needs to work the way they do and have considered security best practices heavily already, and then security and compliance just throws their hands up and says too bad, refactor it from first principles regardless of the level of effort, staffing requirements, competing priorities, etc.


I have never seen a compliance team make the choice to just break production unless they (and by proxy you) are in hot water with auditors.


That’s weird, because I see it constantly, even for minor systems where the relationship to a compliance requirement is minor / optional. I’ve actually never seen it happen when there is a real security or auditor issue at stake - I’ve only (repeatedly) seen compliance & security teams demand enforcement of a policy that breaks production in circumstances where the whole thing could have been easily prevented if they had gone to product teams and had a conversation first, but they didn’t.

The most recent one I lived through, a few months ago, was when compliance all of a sudden decided to wholesale enforce a bunch of org-wide settings changes to every GitHub repo in the company, and it caused several outages and a huge amount of unplanned triage work because the settings were very sensitive for a bunch of continuous integration systems and jobs.

This was at a Fortune 500 company with a big, well-staffed compliance team. They had to roll back their changes and delay the new settings by several months, because only through breaking production did they realize their proposed settings workflow was not feasible given in-house system requirements.

And of course, no apologies at all.

This is pretty run of the mill. I’ve seen the same thing from compliance and security teams in a few other large, “household name” tech companies, and also in a few mid-range startups.

Compliance teams number one MO is to blame product teams for not partnering with them, but it’s the compliance teams who refuse to do the partnering.


> I’ve only (repeatedly) seen compliance & security teams demand enforcement of a policy that breaks production in circumstances where the whole thing could have been easily prevented if they had gone to product teams and had a conversation first, but they didn’t.

This is exactly backwards. Product devs need to reach out to security early in the design phase. There’s no way for a separate security org to understand the app or use case after the fact.

If you want to do $newthing your product dev management needs to involve security, finance, compliance, legal, etc. That’s their job. Developers don’t get to ignore all the normal business constraints the real world offers.

Building within constraints is what engineering is all about.


You are so right on the SELinux comparison. Of course, in this case, there are way more developers who are required to write them.

Reiterating what was mentioned in the thread - the best way to avoid this wildcard situation and make it easier for developers is to use Policy Sentry[0]

Thought I’d mention this for those who read the title and the comments instead of clicking on the tools. This will solve most of your problems with writing IAM policies for machine roles.

[0] https://github.com/salesforce/policy_sentry


Is there an SELinux equivalent of Policy Sentry?


I wish.


If there happens to be an OPA<->IAM adapter for your AWS resources, OPA allows end to end testing of your policies.

As far as adapters go, I know you can get SQL, Kubernetes, Terraform, Kafka, Envoy, S3 (via MinIO), and EC2/ECS/Lambda (Linux). That would cover most use cases, I think.

https://www.openpolicyagent.org/


For AWS integration with my commercial tool, I am considering having it inspect its own permissions and loudly tell you it's misconfigured if you give it permissions to do anything more than what it minimally needs. I wish more tools did this.


I feel like this is the real base problem here.

There's an incredibly broad set of permissions (at the cloud or OS level). Any app / tool may be written to use any subset of those. And what it uses is rarely documented (because developers don't see IAM security as a primary feature, outside of apps intended for use in regulated environments).

Without automation, this thus requires continual reverse engineering, which is never a healthy, sane long-term solution.

This should be fixed on the product / app side, where folks are much better placed to dump "I need this, and only this" in machine-readable form.


Some steps I take to mitigate:

1) put ec2 servers I can't properly IAM lock down in dedicated accounts separated from all other things

2) don't let users create the things.. let users take actions that result in what they want by a programmed set of commands that is peer reviewed (e.g. spin up an EC2 instance with specified config/user data by an MFA'ed credential.)

3) user accounts are created manually (we have <10 employees with AWS access), but their user accounts can only do two things: a) massage their own account basics (rotate password and keys), and b) assume role. User accounts have MFA at setup and it can't be removed (only changed). The roles they assume are change controlled and checked regularly.

A LOT of advice, like every damn security talk I've ever been to, says to watch CloudTrail and respond to bad actions by writing reactive scripts. We enforce before the action is taken, essentially. This works far better for EC2, which, in my opinion, is HORRENDOUS for least privilege.

Serverless items, while sometimes requiring more permissions than I'd like or expect, are far _less_ likely to require "Create" on resource '*'.

Lastly, I give developers a playground to learn in that is entirely disconnected from user data. It has some mocked false user data and structures, and its own isolated domain, so the infrastructure can be explored.
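One way to express the "enforce before the action is taken" idea is an IAM condition that denies requests made without MFA - a standard AWS pattern, sketched here hypothetically rather than as the commenter's exact setup:

```python
# Deny-unless-MFA statement. BoolIfExists also catches requests where the
# MFA context key is absent entirely (e.g. plain access-key calls). Real
# deployments usually carve out the handful of IAM actions needed to
# enroll an MFA device in the first place.
deny_without_mfa = {
    "Effect": "Deny",
    "Action": "*",
    "Resource": "*",
    "Condition": {
        "BoolIfExists": {"aws:MultiFactorAuthPresent": "false"}
    },
}
```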


> And if you're bad, which is also not uncommon, you just give it the broadest permissions possible and call it a day.

Whether this is "bad" depends on what you're targeting... it's bad for security, but good for getting things done. And from an economic standpoint, right now, unfortunately the "good" approach is often the "bad" one.

The $80M fine may be less than the cost of doing it right. And until that changes, "good" managers will incentivize the bad approach.


Yeah, the SELinux approach reminds me of IAM - both are hard. They need to build in a mode where you run a code path with full permissions and it proposes a minimal policy based on the accesses seen.

The reality - everyone finds it MUCH quicker to give broad admin rights out otherwise.

One good thing - accounts. You can create an account, give admin to the consultant / outsourced IT group, still bill it to the org, and they can do what they need without the endless hassle of IAM. Anyone else using this? It's a pretty rough hammer, but it seems to work OK, so people can get solutions spun up with some efficiency.


PureSec made a fairly useful tool:

https://github.com/puresec/serverless-puresec-cli/

There is little reason that the friction of generating least privilege IAM policies can't be reduced. The same goes for deployment Roles.


Those problems seem to come from the separation of the IAM admin from the developer. I'm coding a server now. My IAM roles are defined in a template, and I just add new permissions to the template as I need them. My code has the bare minimum permissions that it needs, and it doesn't seem at all onerous for the benefit it provides. So I think the problem is less "IAM is hard", and more "coordination is hard".

The one big exception I've run into is that to launch a CloudFormation template, the role practically needs admin access. I'm considering offloading the launch to a minimal Lambda function with the requisite (very broad) permissions. Does anyone have a better approach?


My team approaches this problem by using separate identities with different access policies for doing different things. The identity can be an alternative user account (disconnected from the primary corp domain) or a service principal that does only limited set of things.

For example, an alternative user account with a restricted time window for access is used even to get to performing infrastructure management tasks. Then each cluster has a unique service principal attached to it for pulling containers and retrieving infra-related secrets, but it cannot access application-related secrets. Applications that run on clusters use a completely different managed identity, which doesn't have access to infra-related secrets but has access to application-related secrets. On top of that, wherever possible, we restrict access to secrets to only GET operations, so you need to know the name of the secret beforehand in order to access it. The latter is not always possible, but when it is, we use it.

We use a bunch of scripts that go on and create all necessary identities and set up security policies, which helps a lot with ensuring the process is fully repeatable and risks of user mistakes are mitigated.


And to add to this - it wasn't the case from the very beginning. It took shipping a set of services with ISO and SOC 2 requirements to arrive at the security model we're employing now. It also helps a lot to have robust corp-wide security and compliance teams that drive a security mindset across the company. They create a lot of pain for dev teams, but this ultimately results in a much better security stance across the board.


I agree that "coordination is hard" is the root issue here. For things I develop, it's easier for me to specifically say what permissions the IAM roles need, and I can get them down to least privilege.

Then sometimes I need to onboard an application written in a language I don't know, and it's a beast of an application. I ask the developer what permissions it needs, and they say "I don't know" but give me a full admin role they know works. I'd love to go through the whole process of determining what permissions it needs, except that I have deadlines, and they have deadlines, and our deadlines are visible to management, and I have additional projects that I need to finish, and my team is already too small. It's like the universe is just telling me to give the application full admin.


Time will tell if this works out for me, but lately I've been explaining the risk to the developers' manager: I wouldn't do this, but if they assume the risk, I'll do what they ask. I figure eventually we'll get compromised and/or some misbehaving app will take down production, and then people will pay attention.


At my company I am working to get all of our IAM policies ironed out in AWS-CDK. Through this, any developer can pull down the git repo containing the IAM roles. They can make changes and submit MRs, but the only approvers of those MRs are part of our Access Management team. This way, all of the IAM roles can be crafted by the devs but must be explained and understood by the AM team through a code review process. After that, automation takes over and the policy can be referenced by other CDK/CF templates.


This works if all developers understand IAM and don’t just throw a wildcard in the first time they don’t understand something.


Agreed, there is a large learning curve and it took me a long time to wrap my head around it.

It requires a lot of knowledge and discipline, which sooner or later will create holes. For example, if you need to pass a role to a service, you'll need PassRole; if you grant it on " * ", then oops, you might have just created the opportunity for privilege escalation [1].
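A sketch of what scoping that permission can look like - restricting both which role may be passed and which service it may be passed to (the account ID, role name, and service principal below are placeholders):

```python
# Scoped iam:PassRole statement instead of granting it on "*": only one
# specific role can be passed, and only to ECS tasks. Without both
# restrictions, PassRole is a well-known privilege escalation vector.
pass_role_statement = {
    "Effect": "Allow",
    "Action": "iam:PassRole",
    "Resource": "arn:aws:iam::123456789012:role/app-task-role",
    "Condition": {
        "StringEquals": {"iam:PassedToService": "ecs-tasks.amazonaws.com"}
    },
}
```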

There's also probably issues specific to your company: allowing access to read resource foo in general is not an issue, except your specific company stores sensitive data there. If every developer is expected to be a security expert, the security risks increase, and the productivity overhead may even be worse than having a dedicated security/IAM team that gatekeeps permissions.

[1] https://rhinosecuritylabs.com/aws/aws-privilege-escalation-m...


I agree with this but I think AWS could have designed the interface for IAM policies better. There are a lot of actions where the resource has to be "*". There are also many situations where the Principal is "*" because you're using Conditions to restrict the access (eg by the org id).

The resources are also “typed” despite the UI being json. This leads to confusion when a policy doesn’t work because the string in the resource is of the wrong type (eg S3 bucket vs S3 object). IAM happily lets you create the policy and there might be a small warning in the console that some of your policies somewhere have invalid resources for their actions but if you’re using CloudFormation you’ll never see those warnings. It begs for an automated linter that understands the type system and can fail your merge request or highlight the code in your IDE if the policy is invalid. AFAIK CFN-lint doesn’t do this but it certainly should.
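The S3 case is the classic example of this typing quirk: s3:ListBucket applies to the bucket ARN, while s3:GetObject applies to object ARNs (the bucket ARN plus "/*"), and mixing them up produces a policy that validates but never matches (bucket name is a placeholder):

```python
# Two statements with different "types" of resource behind the same-looking
# JSON string: ListBucket targets the bucket itself, GetObject targets the
# objects inside it. Swapping the ARNs denies everything, silently.
statements = [
    {"Effect": "Allow", "Action": "s3:ListBucket",
     "Resource": "arn:aws:s3:::example-bucket"},
    {"Effect": "Allow", "Action": "s3:GetObject",
     "Resource": "arn:aws:s3:::example-bucket/*"},
]
```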


We've managed to mitigate this by doing cloudformation launches manually with the developer's own credentials, and have each resource take on the IAM role it needs. (Since you need sts:AssumeRole for CFN creation anyway, it's not really useful to limit the cloudformation role, it's going to be essentially admin in any event.)


FYI for anyone in the same situation, Netflix built some open source packages to solve this:

https://netflixtechblog.com/introducing-aardvark-and-repokid...

The idea is that the default policy on new things is deny all, and then it monitors cloudtrail for privilege failures and reconfigures IAM to allow the smallest possible privilege to get rid of that deny message.
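The core of that loop - whichever direction you run it, growing from deny-all or shrinking from what's granted - is a diff between granted and observed permissions. A toy sketch, not Repokid's actual implementation:

```python
# Compare what a role is granted against what monitoring shows it actually
# using, and propose dropping the rest. The permission lists are fabricated.
def unused_permissions(granted, used):
    return sorted(set(granted) - set(used))

granted = ["s3:GetObject", "s3:PutObject", "dynamodb:GetItem"]
used = ["s3:GetObject"]
print(unused_permissions(granted, used))
# -> ['dynamodb:GetItem', 's3:PutObject']
```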


Offering this service can probably be spun into its own SAAS company.


So it gives a service any privilege it asks for? I haven't read the article, but from your description it doesn't sound much better than default allow-all.


It sounds a lot better. Set up your script and run it, and the tool determines the minimum set of permissions it needed. You lock that permission set in for future runs. Read the link.


> You lock that permission set in for future runs

Thanks, that was the important bit missing from OP's description.


I am not a bank. My risks are much lower. My CORS policies are strict and I block merges that are too permissive. I immediately disable and remove keys that people share in Slack or emails or commits. I use IRSA everywhere I can (and net new services since I joined the current org aren't allowed to use user key pairs, ever). We operate on the principle of least privilege and everything is RBAC. CapOne made a mistake and it was known. IAM is hard, but when picking places to cut corners, security can't be one of them. Hopefully this fine sends that message.


Did they fine the management responsible for the decisions around these systems? Incentives matter. If you’re not exposed to the consequences, you’ll optimize for your comp and parachute out somewhere else when the shit hits the fan.


* IAM policies for deployed applications should be kept with those applications

* Use a feature like GitHub's CODEOWNERS to make sure that the service's IAM policies "belong" to your infosec team. Any PRs that attempt to change the IAM policy are then reviewed and approved by infosec.

* Set up monitoring that alerts when IAM policy is deployed that is too broad (i.e. wildcards).

* Eventually, recognize that many of your applications are pretty similar, and move to a smarter deployment model which maps a type of service to the correct IAM policy to be deployed with it. Then, what's stored in the service's code repository is which kind of application it is, and the IAM policies are factored out into a common repository owned by infosec.
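The wildcard alert in the third bullet can start as something very small - a sketch of a pre-deploy check (it only catches bare "*", not service-level wildcards like "s3:*"):

```python
# Flag policy statements whose Action or Resource is a bare wildcard.
# Returns (statement index, field name) pairs for a reviewer to inspect.
def find_wildcards(policy):
    findings = []
    for i, stmt in enumerate(policy.get("Statement", [])):
        for field in ("Action", "Resource"):
            values = stmt.get(field, [])
            if isinstance(values, str):
                values = [values]
            if "*" in values:
                findings.append((i, field))
    return findings

policy = {"Statement": [
    {"Effect": "Allow", "Action": "s3:GetObject", "Resource": "*"},
    {"Effect": "Allow", "Action": ["s3:GetObject"],
     "Resource": "arn:aws:s3:::logs/*"},
]}
print(find_wildcards(policy))  # -> [(0, 'Resource')]
```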

Look, IAM isn't hard. Allow, deny, verbs, resources, it's all pretty simple. Not so different from firewall rules we've had for decades now. What's difficult is managing, not IAM rules specifically, but anything at scale. Managing security at scale is hard, because managing anything at scale is hard. What's more difficult is taking legacy setups which already exist at scale, are poorly secured because they weren't set up with the correct tooling to manage them at scale, and migrating them to a standards-based approach that makes it possible to manage them at scale.

What makes it difficult isn't the technology but the organizational politics that comes with it. If you build it too early, you're over-engineering and focusing on the wrong thing when there's more "important" stuff to focus on. If you build it too late, then you need to migrate stuff onto it, stuff which "just worked", whose stakeholders ascribe too much risk to the transition, in environments which generally undervalue security work. What makes it difficult is politics.


In my experience, the hardest thing about this whole space is the number of developers who don't understand what the problem is until it is costing them $80 million.

Limiting blast radius, for example. Why does your threat model include the possibility of one of your applications being compromised and using its credentials to do undesired things to other applications? That implies your programs are buggy. And clearly, your programs aren't buggy; you're using best practices. How could they be?

Even in professional development in big name companies, this is a surprisingly pervasive default attitude. Appropriate level of paranoia is something that I think has to be taught.


It’s not always about even being malicious. At my last company I was an admin. But I locked myself down so I wouldn’t make a stupid mistake.

In the past, both Apple and Google made mistakes in their installers that if anyone else did it, people would assume it was malicious.

There was a bug in the iTunes installer that erased files on people’s hard disk if there was a space in the name.

There was also a bug in the Chrome installer that made your hard drive unbootable if you had system integrity protection turned off.


> Limiting blast radius, for example. Why does your threat model include the possibility of one of your applications being compromised and using its credentials to do desired things to other applications? That implies your programs are buggy. And clearly, your programs aren't buggy; you're using best practices. How could they be?

I find that a lot these "defence in depth" style policies end up muddling where the actually important boundaries are. You get people slapping on a "security" layer everywhere because "it can't hurt", but then people don't worry about bypassing those layers either because "that part is just in case, it doesn't have to be 100% secure" and then the holes in your Swiss cheese line up and you get hacked.

Realistically you have a budget for how much effort people are willing to put into understanding your security systems and you need to spend it where it will do the most good. For most organisations that means a clear, centralised access control model that the rest of your systems trust to do its job, and no ad-hoc mitigation measures.


All of the AWS services I’ve used are difficult to work with. Documentation is often vague, outdated, incomplete, or nonexistent. The whole system seems designed to create jobs for AWS admins. Yes, you’ve got tons of power and control, but what we often want is transparency and simplicity, and that’s what AWS is worst at doing natively.


I disagree on the documentation bit. I think AWS's docs are really good overall. You have the service FAQs for a high-level overview, plus the AWS docs to get deep into the weeds.


They are good in some areas, but certainly not good in others, e.g. how AWS CodeBuild/CodeDeploy and ECS service/instance/task roles interact.

It seems AWS teams are very vertical, leading to more complex cross-service interactions being badly documented, since "nobody" truly owns them?


Yes, their organizational structure and the lack of trust that sometimes flares up between teams really shows in the integration. Which is, of course, exactly the level where most customers live. IAM used to be exactly that but they centralized that cross-cutting concern and it's improved a lot.


Every time I have to read some AWS docs I get blinded by enterprise buzzwording and just want to run away...


You learn to avoid those pages. Besides, there are solid reference pages behind them. I can't say the same for Azure.


Honestly I’ve found most AWS services hard to work with because of IAM. Though I do agree with you about the docs.


My issue is that most of the examples in the documentation either have or assume lax IAM policies.


IAM wildcards are the new chmod 777.
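
To make the analogy concrete, here's a sketch of the two extremes side by side (the bucket name and path are made up; in a real policy you'd obviously keep only the scoped statement):

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "TheChmod777Equivalent",
      "Effect": "Allow",
      "Action": "s3:*",
      "Resource": "*"
    },
    {
      "Sid": "LeastPrivilegeAlternative",
      "Effect": "Allow",
      "Action": ["s3:GetObject", "s3:PutObject"],
      "Resource": "arn:aws:s3:::example-app-bucket/uploads/*"
    }
  ]
}
```

The point is how little extra typing the scoped version costs relative to the wildcard.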


It doesn't help that AWS IAM is often confusing and inconsistent. And some permissions that should have resource-level granularity and/or support conditions, don't. To be fair, this does seem to be improving somewhat, but even some newer permissions aren't able to be controlled in as granular a way as I would like.


One thing I find interesting is that AWS has added some safeguards to the console to protect against exactly this, since it's presumably a very common issue. As of the last couple of years, when you make any S3 bucket open to the world, you see a big warning about it.

However if you're following the "industry best practices" and using something like Terraform to manage all your resources including IAM policies, you won't ever see the warnings. If you take a step back, it's somewhat bizarre that we've decided that having infra teams manage hundreds or thousands of lines of not-very-human-readable JSON across all the IAM resources they manage is the proper way to do things. I believe there are some linters that have some semantic understanding of IAM rules that could provide the same benefits, but at least when I last looked into it they weren't very mature and didn't match all of AWS' own rules.
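
As a toy illustration of the kind of semantic check such a linter might do (this assumes nothing about any real linter's API, and real tools understand far more than bare wildcards):

```python
import json

def find_wildcard_statements(policy_doc: str):
    """Return Allow statements that use '*' in Action or Resource.

    A toy check only: it catches partial wildcards in actions (e.g. "s3:*")
    but only a bare "*" resource; a real linter would also parse wildcards
    inside ARNs and know AWS's own semantics.
    """
    policy = json.loads(policy_doc)
    statements = policy.get("Statement", [])
    if isinstance(statements, dict):  # a single statement may be a bare object
        statements = [statements]
    flagged = []
    for stmt in statements:
        if stmt.get("Effect") != "Allow":
            continue
        actions = stmt.get("Action", [])
        resources = stmt.get("Resource", [])
        if isinstance(actions, str):
            actions = [actions]
        if isinstance(resources, str):
            resources = [resources]
        if any("*" in a for a in actions) or "*" in resources:
            flagged.append(stmt)
    return flagged

policy = """{
  "Version": "2012-10-17",
  "Statement": [
    {"Effect": "Allow", "Action": "s3:*", "Resource": "*"},
    {"Effect": "Allow", "Action": "s3:GetObject",
     "Resource": "arn:aws:s3:::my-bucket/*"}
  ]
}"""

# Flags only the first, fully-wildcarded statement.
print(len(find_wildcard_statements(policy)))
```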

After experiencing some of the pains of managing a large Terraform configuration in the past, I've definitely started to wonder if we take the idea of "infrastructure as code" too literally. I think manually writing text-only configuration files for infrastructure should start to be seen as an anti-pattern as well, and we should mostly be working with better, more intuitive UIs for creating resources and then outputting the representation in some log format (which can then be read back by the same tool for reviewing/diffing, replayed in additional environments, rolled back, etc.)


>> we've decided that having infra teams manage hundreds or thousands of lines of not-very-human-readable JSON across all the IAM resources they manage is the proper way to do things.

This is exactly why a few of my friends and I started working on a tool that uses a typed language to express IaC. We can leverage and/or relations for AWS objects. One quick example: an S3 resource is PublicWebsite or ForwardOnly or PrivateBucket. The individual resources then have a bunch of mandatory properties (using an and relationship between them). It is much easier to read, and we have significantly reduced the number of lines of code you need to grasp to understand a service. It is also possible to remove options that you do not want to give to developers at all (for example, PublicWebsite is not a required option for most teams using S3).

I really liked Terraform at the beginning, when I thought they were going to improve significantly over the years, but it did not happen. Instead they went down the same rabbit hole as many other projects: let's invent a new language to express IaC. We do not need one. ML languages are perfectly capable of capturing IaC and are a perfect fit, while HCL lacks basic expressive power, resulting in segfaults/exceptions left and right. I still remember the first time we accidentally set both "forward all requests" and "website" on an S3 bucket and had to debug why Terraform just crashed with a meaningless error message. Imagine trying to do something security-related with such a tool. Not fun.


Sounds really interesting, is there a public repo up yet to take a look at?


Could you reach out on Keybase or email?


CapOne sponsors, develops, and runs internally an open-source tool called [CloudCustodian](https://cloudcustodian.io/docs/index.html) (recently accepted into the CNCF) that analyzes cloud resources at runtime and notifies developers of items that are out of compliance. This allows you to identify issues with resources even if they drift from their templated configuration.

It's a pretty good way to manage IT infrastructure at a very large scale rather than just relying on every dev to configure their infra perfectly and check in on the web console regularly to see if AWS raised any possible issues.


When infra isn't code, it's very hard to recreate. You can't possibly keep track of all the knobs changed by humans. It's easy to miss them.

I like your idea of backing manually created resources with a machine log that can be replayed. Even better if it's editable and can be turned into something more concise and documented.

Terraform is hard to read and manage. But it's better than manual, bespoke infra.


The original thread is good, because it states the problem and links to a number of tools which mitigate the problem in various ways.

I'd say that IAM is too low-level an interface (or language). It makes it hard for engineers to think correctly at a convenient level of abstraction. (Imagine writing e.g. a C++ compiler in 6502 assembly.)

An obvious solution would be to introduce a tool / language that lets engineers operate at the level they are used to thinking at, validates the structure, and analyzes the potential breach impact of every resulting piece. (Sort of like, again, a compiler. Or at least something like Terraform as a first step.)

A few steps in that direction have already been taken with the tools mentioned in the original thread. But I suspect that a lot can still be done in this area, bringing fame and potentially money to those who come up with a tool that becomes widespread. (I mean, it could be a worthwhile project / startup idea for those who understand both IAM and formal methods well.)


This is exactly the sort of problem solved by CDK: https://docs.aws.amazon.com/cdk/latest/guide/home.html. With CDK you generally don't have to mess with IAM constructs; you can just use the provided APIs to set up the necessary permissions (and those permissions are always as narrow as possible by default).


Noob question but doesn't having a private VPC at least limit external users from accessing anything since they have to be part of the network?


A bucket policy can restrict access to only Access Points, which can in turn be restricted to VPC endpoints.
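
For the simpler variant (restricting the bucket directly to a VPC endpoint rather than going through Access Points), the policy looks roughly like this; the bucket name and endpoint ID are placeholders, and note that a blanket Deny like this also locks out console and admin access unless you carve out exceptions:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "DenyAllExceptOurVpcEndpoint",
      "Effect": "Deny",
      "Principal": "*",
      "Action": "s3:*",
      "Resource": [
        "arn:aws:s3:::example-bucket",
        "arn:aws:s3:::example-bucket/*"
      ],
      "Condition": {
        "StringNotEquals": {"aws:SourceVpce": "vpce-1a2b3c4d"}
      }
    }
  ]
}
```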


S3 buckets are accessible everywhere generally.


Yes, S3 is the one exception because it's global.


The sooner these cloud companies work to create a common open API, the more secure and better off we'll all be. Spending so much time specializing in a particular cloud provider is stupid: you've built your tech career not on actual technology but on a proprietary overlay that doesn't even resemble anything useful outside the organization.


Apart from IAM complexity, I found it interesting that the author puts the cost of this breach at $80M. Yes, that's the cost of the fine, but the true cost, including reputational damage, is definitely way higher than that.

$80M to a company like Cap One, with almost $30B in revenue, is an easy write-off. The other costs can be really hard to recover from.


You’re missing the development cost to remediate all of the security issues revealed via auditing.


Make IAM part of the design of the application. If you need to use AWS API calls, then you need your app's design architecture docs (you have those, right?) to list the IAM permissions needed to do each thing, stuff that info into ADRs, and link to some IaC that you used to stand up your dev/test environment. All of this creates 1) formal IaC used to apply the permissions, and 2) formal documentation of what functions need what permissions. As a final step during development, write tests that verify the IAM permissions are as expected.
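
One way such a test could look, as a sketch: everything here (the policy, the queue name, the account ID) is a hypothetical stand-in for whatever your build pipeline actually generates.

```python
import json  # handy if the artifact is a policy.json file on disk

# Hypothetical build artifact: the IAM policy your IaC generates for one service.
generated_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": ["sqs:SendMessage", "sqs:ReceiveMessage"],
            "Resource": "arn:aws:sqs:us-east-1:123456789012:orders-queue",
        }
    ],
}

# The permissions the architecture docs / ADRs say this service needs.
EXPECTED_ACTIONS = {"sqs:SendMessage", "sqs:ReceiveMessage"}

def granted_actions(policy):
    """Collect every action granted by Allow statements."""
    actions = set()
    for stmt in policy["Statement"]:
        if stmt.get("Effect") != "Allow":
            continue
        acts = stmt.get("Action", [])
        actions.update([acts] if isinstance(acts, str) else acts)
    return actions

def test_no_permission_drift():
    granted = granted_actions(generated_policy)
    # Symmetric difference shows both extra and missing permissions.
    assert granted == EXPECTED_ACTIONS, f"drift: {granted ^ EXPECTED_ACTIONS}"

test_no_permission_drift()
print("policy matches documented permissions")
```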

You will of course need some way to scale this later, but as long as an artifact of your build pipeline is auto-generated IAM policy jsons, an org-wide security team can analyze them with automated tools and remediate as needed.


IAM is fine; AWS IAM is needlessly complicated, with garbage policies and a lack of explanation.


IAM is a mental model, close to a language. And until you understand that mental model, it might as well be Japanese to an English speaker. But once you understand that model, crafting extremely fine-grained policies becomes a breeze.


This thread reads like an ad for PolicySentry


In practice, I've typically seen access controls that are too loose, creating risk in areas such as admin access or with wildcards, as in the Capital One breach. The alternative is often too tight, in which case developers struggle with ops to get access to the things they need. Just-in-time access controls (and change logs), plus isolating resources through the concept of a project or tenant, is one approach, and, if you will pardon the plug, the approach we’ve chosen with our no-code, infrastructure-as-code platform www.duplocloud.com.


> And honestly, the problem has been so difficult to solve, that I think every AWS customer leveraging Instance Profiles or machine roles is vulnerable to this somewhere. If one app gets compromised and you haven't limited blast radius, you're screwed.

What a shit product. I loved it back when we could discuss IT infrastructure without using AWS product names.

They have infiltrated IT and the prices will continue to rise. If you took the bait, you're screwed.

They are ripping through your data (hosted on THEIR machines) to compete against you.

Here are my thoughts. IT'S A TRAP! STAY AWAY!


I think this is why AWS has gone to recommending the multi-account model with a service per account. That model greatly limits the blast radius of misconfigured IAM: if you lose a service, you lose that service's data, but you almost completely block any cross-application compromise. That being said, multi-account can be just as difficult as IAM if you don’t properly architect for it.


IAM is hard, but deciding that a web proxy shouldn't have access to IAM credentials should be easy. This is why I wrote imds-filterd.


https://github.com/cperciva/imds-filterd

That's clever. The format of the config file looks pretty intuitive as well.


AWS Zelkova is in theory supposed to find these sorts of issues. I haven't used it, so I'm curious what others think about it.


I went to a talk about it at Re:Invent and it does seem to solve the issue in theory but the service based on it (Access Analyzer) seems to only apply to very limited use cases.


Is Re:Invent worth the trip? For me the cost would be formidable.


Aside from the inherent complexity of these systems, there's another layer we don't talk about, and that is arbitrary complexity, particularly in communications, standards, and documentation.

A lot of IAM does not need to be that hard, but the concepts need to be made absolutely clear. It's harder than it needs to be.


I wonder about the complexity and AWS motivations.

What does AWS gain by improving IAM? There are barely any competitors, so they won't be losing people over it. They offer their own AWS professional services, happy to charge you to make it "understandable". And their service agreements largely absolve them of client mistakes, which usually result in larger bills from AWS anyway.


That's pretty cynical.

AWS is a ball of complexity because it grew organically that way, and they don't have a culture of explaining, or, keeping things simple.

Both of those things would require strong strategic guidance, and a real effort to do.

Only if Bezos issued an edict ("Our APIs must remain simple even as they scale, and we must document in a manner that keeps the 80% common path easy to use, while keeping the remaining 20% of arcane functionality available ...") would it happen.

But it won't.

It's reasonably well curated arbitrary complexity, it is what it is.

This is not an issue anyone handles well.


No cynicism meant; my mistake. "Motivations" was the wrong word. I was trying to ask how the business of AWS manifests such a thing, which I think you've described. Thanks!



Are there any good papers out there for anyone crafting IAM systems?

More on theory and overall goals than "here's how you use SELinux"


FaaS helps here.

It's easier to reason about the access one function needs than about all the infra a monolith needs to access.


You’d think that, but Friday I just helped a dev team deploy a Lambda function that had full capability to update any Lambda in the account. The devs didn’t know anything about how the app worked, since it was a drive-by from the architect. He, in turn, just grabbed an AWS blog post, confirmed that it worked in his personal account, and called it a day.

Also, since the devs don’t know IAM, every resource request is a wild card. A function compromise would allow deployment of a new function that could extract every SSM secret, Cognito identity, and S3 object.


IAM stands for Identity and Access Management:

https://en.wikipedia.org/wiki/Identity_management


But still, there are services (e.g. Dome9) that implement alerting and monitoring for cloud infrastructure security.


What got breached?


Too much swearing in the twitter thread. Puts me off.


Another reason why microservices are a good thing: they result in micro-level permissions for individual resources.

In saying that though, k8s in AWS was really shit at limiting which IAM roles containers could assume without it being the instance role. Crap like kube2iam and kiam came out to butcher the AWS metadata/instance networking. Thankfully AWS solved it with their new OIDC IDP.

True DevOps culture workplaces have Devs writing their IAM roles as they know what services the app might talk to.

Keen to try out some tools mentioned by the twitter person


You can just as easily give overly permissive IAM credentials to a microservice. That's what happened here. Some tiny little web service was breached, and that service was provisioned a full-access IAM credential.


If you're deploying a microservice on an individual EC2 instance with an instance role, you're doing it wrong. Not sure where you got the idea it was a microservice. Sounds like a dodgy app server of some kind.


The point I am trying to make is "microservice" does nothing whatsoever to solve the problem of people incorrectly provisioning IAM permissions.


Nor does it make developers better at their jobs.

The point I'm trying to make is that if the scope of what a system can do is limited, its permission boundary/model is easy to define.

Many things led to the incorrect provisioning of the IAM role. Lack of understanding of IAM, for starters, as well as of the consequences around it.

By no means am I saying microservices would solve the problem. But they sure do make it easier to define what permissions your app needs, as well as limit the blast radius of what is exposed, when done correctly.

This is impossible with monoliths on EC2 instances.


Capital One uses microservices with multiple layers for privileging and is "true devops" with devs typically deciding IAM policies.

One of the big issues with microservices is that you need to invest a whole lot more into infrastructure and tooling to be able to run them sanely. Most companies are really hesitant to invest sufficiently, which leads to burnout, low productivity, and various ultimately avoidable bugs. This, compounded with microservices being inherently more complex, means you often end up with tangled and messy products that are hard to understand software-wise, which also impacts security. Having worked at Capital One, I think that's certainly one of the big issues at play.

For devops, I personally don't get it; maybe I just haven't seen it done well, or maybe when it's done well it just disappears into the background. But expecting all developers to be good at application coding and also have an in-depth understanding of AWS security best practices is unreasonable. Something is wrong with the levels of abstraction being worked at.


I don't see how microservices necessarily lead to what you call micro level permissions.

If anything, it's IaaS/PaaS, and the inherent requirement that springs from it to explicitly manage access, that has started to drive this.

Still, some deployment environments make it extremely tedious to actually manage fine-grained, least-necessary-privilege access. Especially for smaller outfits, setting up all the security specs for a quite typical setup at a good granularity feels a lot like writing assembler code for an MCU with a bad datasheet.

Figuring out which rights you actually need is sometimes hilariously convoluted, as examples often use excessively large scopes, and sometimes even figuring out which service to attach them to can be extremely non-obvious.

I hope that in time there'll be tools on top of the k8s specs that take these chores out of the equation; maybe there already are? I haven't tracked k8s closely, as it seems to mostly cater to larger outfits as of now.


If your microservice only updates billing details on a DynamoDB table, it'll never be vulnerable to having someone take it over and steal all the data from it.

It's what I call a micro level permission. The principal/actor can only write to one resource.

Where places get it wrong is when their application writes to DynamoDB, reads from S3, does a scan on another table, etc. It leads devs to making overly permissive permissions while they debug why their app isn't working like it used to.
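
Concretely, the "micro level permission" for that billing writer might look something like this (the table name, region, and account ID are placeholders):

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "BillingWriterOnly",
      "Effect": "Allow",
      "Action": ["dynamodb:UpdateItem", "dynamodb:PutItem"],
      "Resource": "arn:aws:dynamodb:us-east-1:123456789012:table/billing"
    }
  ]
}
```

The principal can write to exactly one table and nothing else, so a compromise of the service exposes only that table.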


I don't see how microservices


Is that a pun? :)



