"this looks about right and has no obvious bugs" is my standard when reviewing human code, and it's my standard for machine-generated code too. no reason to formally verify GPT-4 outputs if I'm not formally verifying my coworker's either.
Well... after fairly long experience, we have discovered that your standard is mostly adequate for human-generated code (as long as it's not going into a critical system). That adequacy may rest on the empirically collected statistics of how human-written code fails: when it's wrong, it usually either "looks" wrong or fails obviously.
GPT-produced code may have different failure statistics, in which case the human heuristic won't carry over. It's too early to tell.
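As a hypothetical illustration of the failure mode at issue (this example is mine, not from the thread), here is the kind of off-by-one that passes a "looks about right and has no obvious bugs" review, whether it came from a human or a model:

```python
def moving_average(xs, window):
    """Return the mean of each full sliding window over xs.

    Looks plausible at a glance, but range(len(xs) - window)
    stops one window early: the final full window is silently
    dropped. The correct bound is len(xs) - window + 1.
    """
    return [sum(xs[i:i + window]) / window
            for i in range(len(xs) - window)]

# The bug only shows up if you check the output length:
result = moving_average([1, 2, 3, 4], 2)
# result == [1.5, 2.5] — the final window [3, 4] -> 3.5 is missing.
```

Whether model-generated code produces more of this class of error, or subtler ones, is exactly the open statistical question.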