Sure, there are a few edge-case failures and mistakes here and there, but I can't help but be in awe. AWE.
Let me state the obvious, in case anyone here isn't clear about the implications:
If the rate of improvement of these AI models continues at the current pace, they will become a superior user interface to almost every thing you want to do on your mobile phone, your tablet, your desktop computer, your car, your dishwasher, your home, your office, etc.
UIs to many apps, services, and devices -- and many apps themselves -- will be replaced by an AI that does what you want when you want it.
A lot of people don't want this to happen -- it is kind of scary -- but to me it looks inevitable.
Also inevitable in my view is that eventually we'll give these AI models robotic bodies (think: "computer, make me my favorite breakfast").
We live in interesting times.
--
EDITS: Changed "every single thing" to "almost every thing," and elaborated on the original comment to convey my thoughts more accurately.
> they will become a superior user interface to almost every thing you want to do
No they won't. They're actually a pretty terrible user interface from a design perspective.
Primarily because they provide zero affordances, but also because of speed.
UX is about providing an intuitive understanding of available capabilities at a glance, and allowing you to do things with a single tap that then reflect the new state back to you (confirming the option was selected, confirming the process is now starting).
Where AI is absolutely going to shine is as a helpful assistant in learning/using those interfaces, much as people currently go to Google to ask, "how do I do a hanging indent in Microsoft Word for my Works Cited page"? For one-off things you do infrequently, that's a godsend, don't get me wrong. But it's not going to replace UI, it's going to assist.
And the 99% of your tasks that are repetitive habit will continue to be through traditional UI, because it's so much more efficient. (Not to mention that a lot of the time most people are not in an environment where it's polite or possible to be using a voice interface at all.)
There’s no reason chatbots have to be the interface to an LLM. Imagine dynamically generated interfaces redesigning themselves to your needs as you work through a task.
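A rough sketch of what I mean (everything here is invented, and the render step is hand-waved): the model emits a small declarative UI description for the current task, and the app validates it and renders whatever comes back.

    import json

    # Hypothetical helper: send a prompt to whatever LLM you're using, return its text.
    def complete(prompt: str) -> str:
        raise NotImplementedError("wire this up to your model of choice")

    UI_SCHEMA_HINT = (
        'Return ONLY JSON of the form {"title": str, "widgets": '
        '[{"type": "button"|"slider"|"text", "label": str, "action": str}]}'
    )

    def generate_ui(task_description: str) -> dict:
        prompt = f"Design a minimal interface for this task:\n{task_description}\n{UI_SCHEMA_HINT}"
        spec = json.loads(complete(prompt))
        # Validate before rendering, so a hallucinated widget type fails loudly instead of silently.
        for widget in spec["widgets"]:
            assert widget["type"] in {"button", "slider", "text"}, f"unknown widget: {widget}"
        return spec  # hand the validated spec to the (real) rendering layer

    # generate_ui("trim silence from the start of 40 podcast episodes")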
So to your point, I do think it would be nice to have an AI assistant suggest to make changes to the view mode of an application. For instance, if I navigate to a certain area the AI might suggest that I make that option visible in my menu. I might say no, because I prefer to use the hotkey. Aside from that, I'm going to have to absolutely side with the parent commenter on this one - When I'm doing serious work I want a familiar interface that I know how to move around efficiently in. I don't want any surprises.
Suggestions to improve workflow sound great. But nullifying hard earned knowledge of an interface... I can't see that helping me.
The shining example in my mind is audio/video/graphics applications, where there are good reasons to routinely switch between different views. Knowing your way around those views (which might be custom, but still static), and being able to navigate through them quickly is very valuable.
I think there’s room for an AI-native interface that knows the full context of the project you’re working on and can make extremely intelligent suggestions or tailor the interface for a specific task. Or you could just literally describe the interface you want for something and it redesigns it for you immediately.
There is room, absolutely. Perhaps I took too hard of a line. I do recognize that a compromise can be made between a user retaining their knowledge of a system, and that system evolving to accommodate user requirements. It's easy to be fearful of the rug being pulled right from under you, but I suppose the reality might be closer to earnestly asking to place things upon the rug. In other words, the things that are familiar will probably remain familiar and accessible, but AI will help us arrange them in a way that optimizes our workflow.
Agreed, but let's keep in mind that graphical UIs involve significantly more effort than CLIs. In the case where you shoehorn an AI into being your primary UI, I would bet this is as hard if not harder, since you typically have to code around significant limitations of current LLMs.
This is not what I said. Sure you can use such tools or even just classical boilerplate scripts (like we used for a decade now) to get started with react fast. But building out a system that fails well when the underlying llm starts behaving erratically or not at all is a completely different league of engineering as executing a boilerplate script.
Sorry for misinterpreting you. So the underlying LLM starts misbehaving, and the difficulty you see, is that the system as a whole should fail gracefully. What would that look like, in your eyes? A proctor LLM/whatever that observes the output and decides that it has gone awry and decides to take over?
I would hate that. An interface is useful when you can anticipate and remember what things are available and where they are. It doesn't matter if it's a visual UI or keyboard shortcuts. If you start moving stuff around on us, we revolt. (See every UI update on an MS Office product ever.)
If something is a repetitive habit that you can do almost without thinking, there is a good chance an AI could infer that entire chain.
I think what's more likely is that an AI based interface will end up being superior after it has had a chance to observe your personal preferences and approach on a conventional UI.
So both will still be needed, with an AI helping at the low end and high end of experience and the middle being a training zone as it learns you.
I think I wasn't clear enough -- these habits I'm talking about are things like "press cold water button, press start" or "press warm water button, press start" or "tap 'News' app grouping, tap 'NY Times' icon".
There's nothing to infer. The sequence is already short. There are no benefits from AI here.
But you raise a good point, which is that there are occasionally things like 15-step processes that people repeat a bunch of times, that the AI can observe and then take over. So basically useful in programming macros/shortcuts as well. But that still requires the original UI -- it doesn't replace it.
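For what it's worth, the "observe and take over" part doesn't even need much intelligence. A toy sketch (the event names are invented): scan the UI event log for a fixed-length sequence that keeps repeating and offer it back as a macro, on top of the existing UI.

    from collections import Counter

    def suggest_macro(event_log: list[str], seq_len: int = 4, min_repeats: int = 3):
        """Return the most-repeated fixed-length action sequence, if any, as a macro candidate."""
        windows = [tuple(event_log[i:i + seq_len]) for i in range(len(event_log) - seq_len + 1)]
        seq, count = Counter(windows).most_common(1)[0]
        return list(seq) if count >= min_repeats else None

    log = ["open:report", "filter:month", "export:csv", "email:boss"] * 5 + ["open:inbox"]
    print(suggest_macro(log))  # ['open:report', 'filter:month', 'export:csv', 'email:boss']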
I don't know - the timer app on my oven is trivial too. But I always, always use Alexa to start timers. My hands are busy, so I can just ask "How many minutes left on the tea timer?"
Voice is not really clumsy, compared to finding a device, browsing to an app, remembering the interface etc.
Already when we meet a new app, we (I) often ask someone to show me around or tell me where the feature is that I want. Not any easier than asking my house AI. Harder really.
Hard to overestimate the laziness of humans. I'll get very accustomed to asking my AI to do ordinary things. Already I never poke at the search menu on my TV; I ask Alexa to search for me. So, so much easier. Always available. Never have to spell anything.
Everyone agrees setting timers in the kitchen via voice is great precisely because your hands are occupied. It's a special case. (And often used as the example of the only thing people end up consistently using their voice assistant for.)
And asking an AI where a feature is in an app -- that's exactly what I was describing. The app still has its UX though. But this is exactly the learning assistance I was describing.
And as for searching with Alexa, of course -- but that's just voice dictation instead of typing. Nothing to do with LLMs or interfaces.
Alexa's search is a little different - it's context-independent. I can ask for a search from any point in the TV app - in some other menu, while watching another show, heck even when the TV is turned off.
And when describing apps - I imagine the AI is an app-free environment, where I just ask those questions of my AI assistant, in lieu of poking at an app at all.
Most user interfaces already have a much finer granularity and number of options than your examples.
When taking a shower, I would like fine control over the water temperature, preferably with a feedback loop regulating the temperature. (Preferably the regulation would also change over the duration of the shower.)
Choosing to read the NY Times is indeed only a few taps away, but navigating through and within its list of articles is nowadays quite fast and intuitive thanks to a lot of UI advancements.
My point being, short sequences are a very limited set within a vast UI space.
People go for convenience and speed, oftentimes even if there's some accuracy cost. AI fulfills this preference, especially because it can learn on the go.
> When taking a shower, I would like fine control over the water temperature, preferably with a feedback loop regulating the temperature. (Preferably also the regulation changes over the duration of the showering.)
That exists, but it’s expensive because of the electronics and mechanics involved. There are so many interfaces with this exact problem.
You also almost certainly don’t want non-deterministic hallucination prone AI controlling physical systems.
Indeed, and to take the UI a step further, humans often prefer automation, if it works reliably. A complicated UI would become simple: just step into the shower.
There’s no complicated UI. You just turn a knob that sets a digital temperature readout.
If you want the shower to save your temperature preferences and start automatically, there’s no reason to build in a computer capable of running an AI.
But in reality you almost certainly don’t want a system like this, because you don’t want an AI accidentally turning on your shower when you’re not home, when you go in to clean it or grab a razor, or when your toddler wanders in.
Granted an AI could try to determine intent, but it’s never going to get it 100% right. Which is why for physical systems like this you almost always want a physical button to signal intent.
It would become less expensive, using fewer sensors and actuators, when using the predictive and learning abilities of an AI. You can, for safety reasons, keep a mechanical temperature limiter in the loop.
Temperature can be measured in different ways. IR radiation and sound can be measured from a distance. The relationship between the temperature at the source, the temperature of the water exiting the showerhead, and time can be learned. Water can be heated in different ways. The valve could also be a pump. Our reaction to the temperature of the water can be sensed.
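To make the loop concrete, a bare-bones sketch (the sensor and valve calls are placeholders and the gain is made up): read the outlet temperature, compare it against a setpoint that can drift over the course of the shower, nudge the mixing valve, and always clamp against a hard limit that the mechanical limiter would enforce anyway.

    import time

    MAX_SAFE_C = 45.0  # a mechanical limiter stays in the loop regardless of what the software does

    def read_outlet_temp_c() -> float:
        return 38.0    # placeholder sensor reading (inline probe, IR, whatever)

    def set_mix_valve(position: float) -> None:
        pass           # placeholder actuator: 0.0 = all cold, 1.0 = all hot

    def shower_loop(setpoint_schedule, gain: float = 0.02) -> None:
        """setpoint_schedule(seconds_elapsed) -> desired temp, e.g. warmer at first, cooler later."""
        position, start = 0.5, time.time()
        while True:
            target = min(setpoint_schedule(time.time() - start), MAX_SAFE_C)
            error = target - read_outlet_temp_c()
            position = min(max(position + gain * error, 0.0), 1.0)  # simple proportional correction
            set_mix_valve(position)
            time.sleep(1.0)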
Who knows, AI might come up with simpler or cheaper solutions that didn't cross our minds.
I would say, time will tell.
Prompt engineering and using multiple AI models in parallel might find ways to cancel out most hallucinations similar to how consensus-based replication works.
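Roughly the shape I have in mind (the ask_model helper and the model names are stand-ins): fan the same question out to several independent models, normalize the answers, and only accept one when a quorum agrees, much as replicated state machines tolerate a faulty node.

    from collections import Counter

    def ask_model(model_name: str, question: str) -> str:
        """Stand-in: call whichever provider hosts `model_name` and return its answer."""
        raise NotImplementedError

    def consensus_answer(question: str, models: list[str], quorum: int = 2):
        answers = [ask_model(m, question).strip().lower() for m in models]
        best, votes = Counter(answers).most_common(1)[0]
        return best if votes >= quorum else None  # None = no agreement, escalate to a human

    # consensus_answer("What is the capital of Australia?", ["model-a", "model-b", "model-c"])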
It might. If hallucinations are truly random and not correlated to anything shared between models. For example, something inherent to the data they are trained on. Given how locked down I think potential training data is going to become, and the amount of data required, I think that sharing data between models is almost guaranteed.
Also that sounds like an awful lot of computing power for everyday UIs. It also doesn’t solve the non determinism problem.
I totally get your point, but I think that AI will allow much "smarter" behavior. Where every appliance is an expert in doing what it is intended to do.
So sure, it will still have buttons, but those buttons are really just preset AI prompts on the backend. You can also just talk to your appliance and nuance your request however you want to.
A TV with a remote whose channel button just prompts "Next channel" but if you want you would just talk to your TV and say "Skip 10 channels" or "make the channel button do (arbitrary behavior)"
The shortcuts will definitely stay, but they will behave closer to "ring bell for service" than "press selection to vend".
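Sketching that out (the send_to_appliance_model call is imaginary): each physical button is just a saved prompt, and speech can rebind it at any time.

    # Each physical button is just a saved prompt; speech can rebind it whenever you like.
    button_prompts = {
        "channel_up": "Next channel",
        "channel_down": "Previous channel",
    }

    def send_to_appliance_model(prompt: str) -> None:
        """Imaginary call into the TV's on-board model."""
        print(f"[tv-model] {prompt}")

    def press(button: str) -> None:
        send_to_appliance_model(button_prompts[button])

    def rebind(button: str, spoken_request: str) -> None:
        # "make the channel button do (arbitrary behavior)"
        button_prompts[button] = spoken_request

    press("channel_up")                       # -> "Next channel"
    rebind("channel_up", "Skip 10 channels")
    press("channel_up")                       # -> "Skip 10 channels"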
Don't think of it as the machine performing repetitive tasks you specify.
Think of it instead as the machine accomplishing goals you specify, figuring out on its own the tasks necessary for accomplishing them.
Instead of telling the machine something like, say, "increase the left margin by a quarter inch," you'd say something like "I want to create a brochure for this new product idea I just had, and I want the brochure to evoke the difficult-to-describe feeling of beautiful sunshine. Create 10 brochures like that so I can review them."
Instead of telling the machine, say, "add a new column to my spreadsheet between columns C and D," you'd say something like "Attached are three vendor proposals. Please summarize their pros and cons in a spreadsheet, recommend one, and summarize the reasons for your recommendation."
All this presumes, of course, that the technology continues to improve at the same pace. No one knows if that will happen.
> I hate the idea of having to hold a natural-language conversation with a computer in order to make use of its functionality.
I hate the idea of having to use a mouse to click on a visual GUI to navigate a file system in order to make use of its functionality.
It's less the case today, even among developers, but it wasn't that long ago that I remember that any serious technical user of a computer took it as a point of pride to touch the mouse as little as possible. They're also still correct in that thinking. The command line is a very powerful UI with lots of benefits and while the mouse makes navigating the OS easier it's still much more limited than command line usage.
Touch screen interfaces are another example of an easier UI that ultimately feels even more limited. But people still plug their iPad pros in to magic keyboard folios frequently.
Having worked with these tools every day for a while now, "AI will change UX" is such a better take than "AI will conquer the world!". AI does fundamentally open up new workflows and user experiences, many of which offer a lot of potential improvements over their predecessors.
At the same time I doubt we'll see a world where we don't end up using the command line for the majority of serious technical work.
> I hate the idea of having to use a mouse to click on a visual GUI to navigate a file system in order to make use of its functionality.
Ironically, studies have shown that mouse-based interfaces are more efficient for practically all filesystem use-cases compared to CLI interfaces.
Despite objectively faster-time-to-solution, people self-report that they "feel" that the mouse GUIs are slower.
That's because there's fewer actions per second when using a mouse. It's a smooth gliding motion and then a single click, versus many keystrokes in a row with a CLI.
Rapid actions feel faster, even if it takes more wall-clock time to achieve a task.
Keep this in mind next time you sneer at a "bad graphical user interface" for being "slow".
Practically all? I regularly need to select a subset of a very large number of files to do something with. Example: I have a directory, or possibly a tree, with thousands of photos. The photos have the date in ISO 8601 format embedded in their name. Your job is to find all the files from February 2021 and copy them to a specified location, and there are 50 of them. Good luck using a visual GUI for that, unless it has the exact selection mechanism you need. Sure, mouse clicking is easier if you're selecting one file from a small set of files to open in a particular program. But I usually find myself frustrated when the only tool I have is a GUI.
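For the record, the whole job is a few lines of scripting; here's a Python version (the paths are made up, and it assumes the ISO 8601 date really is embedded in the filename):

    import shutil
    from pathlib import Path

    src = Path("~/Photos").expanduser()   # made-up source tree
    dst = Path("/tmp/feb-2021")           # made-up destination
    dst.mkdir(parents=True, exist_ok=True)

    # rglob walks the whole tree; the pattern matches an ISO 8601 date for February 2021.
    for photo in src.rglob("*2021-02-*"):
        if photo.is_file():
            shutil.copy2(photo, dst / photo.name)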
Personally, I would use the terminal too for a throwaway use case like this. But GUIs are very good at this, and faster when you do more than the naïve case; they also let you see intermediate steps.
For your use case, on macOS open Automator.app and add three actions
1. "ask for finder items" (the source folder)
2. "filter finder items" (by name)
3. "copy finder items" (to target folder)
This takes roughly 5 clicks, 10 seconds at most.
Repeatability and configurability are where the GUI action shines. With only one click more you can
- add filtering by size, opening date, modification date, etc. in addition or a combination thereof
- do the same action for multiple source folders and the same target folder
- choose whether you want to replace existing files
- add it as a folder action that runs automatically on modification of the source folder
Arguably much slower on the terminal.
Alternative on macOS, that works on all other major OS with similar shortcuts and a similar feature set, just not repeatable:
1. Go to source folder (shift-cmd-G)
2. Filter (cmd-shift-F)
3. Copy (cmd-A, cmd-C)
4. Go to target folder (shift-cmd-G)
5. Paste (cmd-V)
This repeatability, configurability and automation is where GPT falls short, for now.
> Touch screen interfaces are another example of an easier UI that ultimately feels even more limited.
Depends on the use case. Touch screen is much more powerful than command line for maps, for example. Or for drawing. Mouse + keyboard is much more powerful than just keyboard for DAWs. And so on and so on.
Plus I think there's a nuance to what you're saying:
UX is not just about making the best channel surfing interface, which is essentially what phones/tablets are. We need UIs that are capable of rich interaction and expression of ideas, creation, etc.
Come on, you don't want an anxious, golden robot that mostly just causes drama but also happens to be fluent in over 6 million forms of communication so it can translate for you whenever you go on holiday?
Pretty sure somebody has already made a Starfleet badge tied to Alexa or Google Home; just change it to ChatGPT so that mentioning "computer" activates it.
Ideally it won’t stop until computers are something we rarely think about. For me I love the CLI but having to use a mouse is a drag and causes serious wrist pain at this point in my life. Typing on a keyboard and clicking with a mouse are not the end all, be all of interacting with information systems.
This agrees with my experience. When I have begun to automate complex manual systems (nibbling from each end, typically), I note when I watch people use the finished product (at each step) that they simply find some other facet of the job to pay less attention to. The eventual error rate just returns to what it was before.
Visual input is not "natural language." The implications of that, especially in the long term, could be incredibly far-reaching, and possibly even terrifying.
There's nothing stopping developers from taking a prompt to GPT and sticking it behind a button or command line, with options in the UI interpolated into the prompt.
For now almost all applications of ChatGPT happen in chat windows because it requires no further integration, but there's no reason to expect things will always be this way.
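A sketch of what that might look like (the complete() helper and the option names are invented): the UI collects structured choices, the code interpolates them into a fixed prompt, and the user never sees a chat box.

    def complete(prompt: str) -> str:
        """Invented helper: send the prompt to your LLM of choice and return the text."""
        raise NotImplementedError

    SUMMARIZE_TEMPLATE = (
        "Summarize the following text in {length} bullet points "
        "for a {audience} audience. Text:\n{text}"
    )

    def on_summarize_clicked(text: str, length: int, audience: str) -> str:
        # The dropdowns and sliders in the UI supply `length` and `audience`;
        # the prompt itself never appears in the interface.
        return complete(SUMMARIZE_TEMPLATE.format(length=length, audience=audience, text=text))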
I wouldn't mind a UI fully configurable by AI to fit me exactly. Without having to learn obscure settings, some obscure declarative language, or how the UI is implemented.
I personally can't wait for brands to start specifically advertising "doesn't use AI" as a selling point because everyone got burned by their AI controlled toaster, literally and figuratively. For anything designated an "appliance" I want it to be fixable by someone locally, or it's just throwing money away.
It's not gone, but is much more expensive up front. For example, everyone I know has a range or stovetop/oven with a circuit board or two in it, few if any replaceable parts, etc... That's basically the standard if you're in the 30" range that costs $1-2k market. Meanwhile, my 60" Bluestar has no circuit board, it has four mechanical electronic switches for the oven lights and convection. Everything else is brass valves, standard easily replaceable parts and so on. A few weeks back a convection fan went out and I replaced it in 5 minutes with a screwdriver. But, this range starts at $18k.
I'm curious whether we will reach a situation where the steampunk idea becomes reality:
People plugging weird stuff together, like an AI chip from a car into a toaster.
If AI becomes hardware chips, it could easily be that language processing will be a default chip feature and the rest is teachable: plug a level-3 AI chip into it, boot it, and teach it that it's now a toaster.
But at the end we will have the same toaster in 30 years as we have had for the last 30 years.
I'm sure that in 2045, the Hacker News of its time will be complaining about lack of AI-less devices the way it complains about lack of smart-less TVs. It'll be available, it'll be more expensive, and the people who claim they want it don't want it enough to pay the premium.
Roughly half of people in most developed countries are not functionally articulate: meaning, they can read functionally, but struggle to articulate what they want with the written word.
LLM-based chatbots can be extremely attractive to the top 30% literacy users in the developed world. They are not a good universal UI. You still need to provide pathways for the user to follow to get done what they need without forcing them to articulate their requirement.
This is why so many people sit in front of a ChatGPT-like service and say, "what would I use this for?" and never use it again.
> Roughly half of people in most developed countries are not functionally articulate
Where did you get this idea? I found this article (https://www.uxtigers.com/post/ai-articulation-barrier, is this you?), but it makes a leap from literacy to articulacy that I don't understand. It's not obvious to me why an illiterate person would be "functionally inarticulate" assuming they can speak instead of write.
Also, I'm not certain but I think the author is underestimating the abilities of a person with Level 2 literacy. It doesn't seem correct to say that "level 3 is the first level to represent the ability to truly read and work with text", especially when the whole point of LLMs is that you don't have to read a long static document and understand it, you can have a conversation and ask for something to be rephrased or ask followup questions.
I do however run a company that employs lots of blue collar, non-college-educated people, in manufacturing. And although this is in no way scientific, my experience matches this: most people are much more uncomfortable writing than they are reading. Even with reading, most strongly resist reading documentation unless they have to, and prefer trial and erroring their own gut instinct until they happen to find something that works or they give up. (This is less true of the most highly skilled technicians, such as those who troubleshoot robots and low voltage control systems.) The official statistics on literacy are absolutely not a good indicator of how comfortable people are articulating themselves with the written word, much less reading.
This is generally met with disbelief by most people in tech I talk with about this, because for the most part they have nearly zero interaction with this large portion of the population. From their daily experience, 98%+ people can make effective use of these tools.
But almost nobody in this partially literate population wants to write in an empty text box to ask an AI to do things. They can learn to visually navigate a simple UI, especially if it's well-designed, because they can effectively make decisions about what of several paths to take.
Some others here have brought up voice, and I do agree that voice is a more promising avenue, although I think it'll still take carefully constructed conversational experiences to work well (i.e., free form 'tell it what you want' will still not work).
This isn't true. There's plenty of people who are verbally fine but can't read or write. Spoken language is a far more common and fundamental skill than reading or writing.
I am surprised about the amount of resistance this is getting. Yes, some people talk more than they write. Yes, some people can't write but they can talk. But that's just the medium the words travel through. The words themselves will be the same. I think of expressing yourself in writing as more forgiving than communicating verbally. Verbal communication leaves much less time to ponder your ideas and pick your words carefully; thus, it's harder. I maintain my stance: if you can put words to paper (and the vast majority of people in the first world can do that) but are unable to communicate your ideas this way, you will probably fail at doing so verbally as well.
If my supposition (speculation? stronger-than-it-should-be assertion?) is true, then just interpreting requests verbally will not help.
Ha. Can you imagine an AI speaking in colloquial Black American or Appalachian dialect? People's minds would short circuit, not knowing whether to be offended or approving.
That sounds super false to me, at least given how articulate kids can be before they can read, and how literate they can be before they can write.
If it's really true that half of the population can't functionally express themselves verbally then I'd sure like to know that. Or maybe I've misinterpreted something claimed here, because I'm struggling to find these claims plausible.
I mean to say that if you can write and are unable to express your ideas, you probably won't be able to do so verbally either.
In the case of kids, of course, that's true, but just because they can't write.
But if you can (and most people can) just having the option of voice input won't help.
I refrain from taking a stance on how much of the population is unable to articulate thoughts in writing (it's probably not great, though), but it's probably going to be comparable with how many can't express themselves verbally as well.
I'm talking here about more complex ideas of course. I'm sure average communication is functional.
...you do know that illiterate people exist, right? You do understand that people were illiterate for thousands of years and still managed to speak their ideas, right?
Yup, poster seems to be completely ignorant of pre-written word civilizations which used oral means to transfer knowledge through successive generations for millennia.
If you want something done right, sometimes you have to do it yourself. Employees are sort of a universal UI. But you will always know more about what you want done than your agent, whether it’s human or computer. That’s even before considering the principal agent problem.
If you want something done right, other times you will have to get someone else to do it. You know what you want, but you might not have the skills to do it. I can't represent myself well in court, do a good job of plumbing or cut my own hair, so I would ask for experts to do that for me.
Plus if someone is capable, it's often quicker to delegate than do, and if you are delegating to someone with more time to do the task they can often do a better job. Delegating unambiguously is a skill in itself, as instructing AIs will be.
Currently ChatGPT doesn't know it's bad at math, so it can convert a story problem into an equation better than a human but then mess up the arithmetic or forget a step in the straightforward part.
But if you specifically give ChatGPT access to Mathematica and an appropriate prompt, it can leverage a good math engine to get the right answer nearly every time.
Before long, I don't think that extra step will be necessary. It will know its limits and have dozens of other services that it can delegate to.
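That plumbing already exists in rough form. A minimal sketch using the OpenAI Python SDK's tool-calling interface (the model name, the eval-based "engine", and the assumption that the model actually picks the tool are all mine; a real deployment would call out to something like Mathematica): the model writes the expression, a deterministic engine evaluates it, and the model only phrases the final answer.

    import json
    from openai import OpenAI

    client = OpenAI()

    tools = [{
        "type": "function",
        "function": {
            "name": "evaluate_expression",
            "description": "Evaluate an arithmetic expression exactly.",
            "parameters": {
                "type": "object",
                "properties": {"expression": {"type": "string"}},
                "required": ["expression"],
            },
        },
    }]

    messages = [{"role": "user",
                 "content": "A train goes 37 km/h for 2.5 h, then 61 km/h for 1.25 h. Total distance?"}]

    resp = client.chat.completions.create(model="gpt-4o", messages=messages, tools=tools)
    call = resp.choices[0].message.tool_calls[0]     # assume the model chose to use the tool
    expr = json.loads(call.function.arguments)["expression"]

    result = eval(expr, {"__builtins__": {}})        # stand-in for a real math engine
    messages.append(resp.choices[0].message)
    messages.append({"role": "tool", "tool_call_id": call.id, "content": str(result)})

    final = client.chat.completions.create(model="gpt-4o", messages=messages, tools=tools)
    print(final.choices[0].message.content)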
I'm glad you made that change because the interface can't really solve every single thing. Most of what I do on my phone is read eBooks. AI can't inject the knowledge directly into my brain. Text on the screen is as good as it will get for the foreseeable future. Most of what I want to do in my home is cook, clean, sleep, and have sex with my wife. Software can't do any of those things for me. Most of what I want to do in a car is get where I'm going, which software control can arguably make somewhat easier, but the driving isn't really the pain point. The distance and traffic conditions are the pain points. AI can't change external reality, at least not yet.
I'm hopeful that AI cars make renting one cheap enough that I can just rent an AI car the 0-2 times/week I need one instead of owning a car and all the time and stress on maintenance that entails.
What kind of junkbox car do you own? I have got to believe this is a totally subjective thing. From my point of view, my cars are probably the least stress-inducing things I own. My computer certainly tops the list of things that cause me stress.
Also, keep in mind that robots may be sci-fi level in 2 years. Like, able to cook and clean with hands! Crazy, but I think we will see it happen that fast.
I'm not sure if we watched the same video, but I saw a robot fumble though a few mechanical motions with dexterity and speed of a toddler to achieve a few isolated, best-case tasks where all the hard parts were taken care of by a human.
Cool demo, I suppose, but nobody is going to buy this as anything other than a toy.
By the time the current AI mainframe runs on your device, there will be new, better models that still require the mainframe.
I think AI fundamentally favors centralization. Except for narrow tasks and domains, there's no such thing as "enough" intelligence. For general purpose AI, you'll always want the best and most intelligent model available, which means cloud rather than local.
GPT-3 requires 700 gigabytes of GPU RAM. I'm looking at my cheapest computer components retailer listing a 48 gigabyte GPU at $5k. So to run the previous generation of GPT would cost me about $70k right now. When do you think I can expect to run GPT-4 on my consumer $device? :)
I would be surprised if GPT-3 uses 700GB of RAM. It may be true, I don't know. But I am running 70B parameter models (quantized to 5 or 6 bits, biggest is 48GB loaded) on my 64GB Mac M2 Max Studio now and they are usable and the machine is still usable too. With an M2 Ultra and 192GB of RAM I imagine you could do a lot more.
I'm not arguing that these models hold up against GPT 3.5 and I still use GPT 4 when it matters. But they work and it's more like the difference between Premier League & Division 1, rather than PL & a five-a-side team from Bracknell.
Even a few years ago I could not have imagined this.
Given the pace of work on optimisation and my assumption that the M3 Studio I buy next will probably have 256GB of RAM at much the same power levels as I use now, it seems eminently possible it's a year or two away.
First of all, you're off by an order of magnitude.
Second, I don't think it will be that long. There are already LLMs as good as GPT-3 running on average laptops and even phones.
In the next couple of years, you'll see:
- Ordinary PCs, tablets, and phones with dedicated AI chips, like TPUs - they'll be more tuned specifically for LLMs
- Mathematical and algorithmic optimizations will make existing LLMs faster on the same hardware
- Newer generations of LLMs will get even more useful with fewer parameters
The combination of all of these means that it's not at all unreasonable to expect that today's top-of-the-line LLM will be running locally on your device within just a couple of years.
Of course, LLMs in the cloud will advance even further, so there will always be a tradeoff, and there will always be demand for cloud AI, depending on the application.
I don't know. RAM is $$ and currently usable LLMs need huge RAM with high bandwidth. I don't see any story that it will be solved with future AI chips. Do you know anything?
Where did you get the 700 gigabytes figure from? I don't think OpenAI even released the model size, although it's considered to be 175B parameters. Given how well quantization works at these sizes you would need less than 200 GB of GPU memory to run it.
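Back-of-the-envelope, assuming the commonly cited 175B parameters (not an official figure): the 700 GB number is just fp32 weights, and quantization scales it down roughly linearly.

    params = 175e9  # assumed parameter count; OpenAI hasn't published the real number
    for bits, label in [(32, "fp32"), (16, "fp16"), (8, "int8"), (4, "int4")]:
        print(f"{label}: ~{params * bits / 8 / 1e9:.0f} GB of weights")
    # fp32: ~700 GB, fp16: ~350 GB, int8: ~175 GB, int4: ~88 GB
    # (weights only; activations and the KV cache add more on top)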
That doesn't seem to make sense. I can run Llama 2 on my 12-year-old desktop PC with no compatible GPU and only 16GB of system RAM. It ain't quick, but it runs.
Maybe 700GB is what ChatGPT uses to serve zillions of users concurrently. If you're running your own individual instance, you obviously don't need as many resources.
And you think corporate will wait nearly forty years for tech to catch up? No, they will indeed force you to be always online, leading to absolute and total surveillance where you voluntarily add dozens of mics and cameras to every home. If you thought telescreens à la 1984 were bad, just wait.
It's funny that being obsessed with "corporations" feels so specifically GenX. I think because it's part of the New Left movement. In reality governments have all the power here, which is why this would be illegal in at least CA/EU/China.
Another issue where this comes up is high housing costs and climate change, which are mostly caused by bad land use laws (and the profiteers are landlords, who mostly own one or two properties), but people from the New Left era will literally refuse to believe you about this because they can't accept that any bad thing on Earth could not be caused by "corporations".
I agree. I think apps that would initially benefit from LLM-powered conversational interfaces are those that have the following traits:
- constrained context
- part of a hands-free workflow
A couple use-cases I have been pondering are driving assistant and cooking assistant.
People are already used to using their phone or car's nav system to give them directions to an unfamiliar place. But even with such a system it's useful to have a human navigator in the car with you to answer various questions:
- What's my next turn again?
- How long till we get there?
- Are there any rest stops near here?
- What was that restaurant we just passed?
- Is there another route with less traffic?
These questions are all answerable with context that can be provided by the mapping app:
- List of upcoming directions
- Overall route planning
- Surrounding place data
- Traffic data and alternate route information
It's possible to pull over to the side of the road, take off your distance glasses, put on your reading glasses, and zoom/pan the map to try to answer these questions yourself. But if the map application can just expose its API to the language interface layer, then a user can get the answers without taking their eyes off the road.
The information is contextual and constrained based on a current task. In some cases it might be more desirable to whip out your phone and interact with the map to look up the answers on a screen, but often it won't be worth stopping the car, and so the conversational interface is better.
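Concretely, something like this (the field names and the complete() helper are invented): the nav app keeps a small, always-current context object, and the voice layer just serializes it into the question.

    import json

    def complete(prompt: str) -> str:
        """Invented helper standing in for the on-device or cloud model call."""
        raise NotImplementedError

    nav_context = {
        "upcoming_directions": ["In 2.1 km, exit right onto I-80 W", "Keep left at the fork"],
        "eta_minutes": 47,
        "nearby_places": [{"name": "Gold Run Rest Area", "km_ahead": 11}],
        "alternate_routes": [{"via": "CA-20", "extra_minutes": 9, "traffic": "light"}],
    }

    def ask_navigator(question: str) -> str:
        prompt = (
            "You are a car navigation assistant. Answer briefly, using only this data:\n"
            f"{json.dumps(nav_context)}\n\nDriver: {question}"
        )
        return complete(prompt)

    # ask_navigator("Are there any rest stops near here?")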
Cooking assistant is a similar case: you are busy stirring something and checking on the oven -- you don't want to wipe the flour off your hands to pick up your phone and ask how many teaspoons of sugar you need. Again: contextual and constrained info based on a current task, and your hands and eyes -- the instruments of traditional UIs -- are otherwise occupied.
Today, our software interfaces generally have one of two kinds of entity on the other end: humans, or other software. In the near future there will be another type of entity: language models. We need to start thinking of how our APIs will change when they're interacting with an LLM -- e.g. they'll need to be discoverable and self-describing; error states will need to be standardized or explicit with instructions on how to correct; they'll need to be fast enough to fit in a conversational interface; etc. It's arguable that such traits are part of good API design today, but in the future they may be required for the API to function in a landscape of virtual agents.
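For instance (a sketch, not any existing standard), an endpoint could ship a machine-readable description of itself, including what to do when a call fails, so an agent can discover and recover without a human reading the docs:

    # A self-describing endpoint manifest an LLM agent could fetch before calling anything.
    MANIFEST = {
        "endpoint": "POST /v1/bookings",
        "purpose": "Create a table reservation.",
        "parameters": {
            "party_size": {"type": "integer", "minimum": 1, "maximum": 12},
            "time": {"type": "string", "format": "ISO 8601 datetime"},
        },
        "errors": {
            "FULLY_BOOKED": "No table at that time; retry with one of the times in `alternatives`.",
            "INVALID_TIME": "Timestamp unparsable; resend `time` in ISO 8601.",
        },
        "latency_budget_ms": 500,  # fast enough to sit inside a spoken conversation
    }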
In the cooking example, you either need the AI to have full awareness of the step you are at or you need to describe the step you are at, which could be cumbersome ("I did ..., how much sugar do I need now"). I venture, having the recipe projected in front of you would be much faster.
I imagined the AI would be reading the steps aloud to you, and so would be aware of your progress. I don’t think an AI assistant precludes the recipe being projected tho, just as in the driving example it wouldn’t replace an on screen map.
Having it both in front of my eyes, and being able to get answers to questions like "I've added the eggs, now what?" or "what does folding a dough mean?" at the same time would be very valuable.
Everyone is going to wfh then or employers are going to splurge on a private office for everyone? Or is work basically going to be like a call center when everyone RTOs?
If the rate of improvement continues at the current pace - which is GPT 1 to 2 to 3 to 4 in the last five years - we are just one or two improvements away from a full blown AGI/superintelligence/singularity/etc. At that point, a superior user interface is probably the least interesting (or scary) thing that would happen.
I personally doubt GPT-5 will be as much of an improvement over GPT-4 as GPT-4 was over GPT-3, but that's fine, I can wait until GPT-6 or 7.
It does make one wonder how it'll all play out in the coming decades. One way or another humans will eventually be unemployable for most of currently existing jobs.
Throughout history, the ruling elite had always relied on the rest of the population to make their food, do their work, and fight in their wars. This is the first time ever that they will no longer have any need for anyone else. Maybe climate change will conveniently do the culling for them...
Of course there's always that option that we end up in a post scarcity space utopia where machine produced wealth is distributed to all, but only deluded idealists can possibly still think that'll ever be a real option as we slink further into techno feudalism with every passing day.
With full automation, and using renewable energy, it will be very cheap to feed and house the peasants. I'm guessing it will become mainly about entertaining, so most will move into VR worlds, first through headsets, then BCI, and eventually uploading.
Ah yes the Ready Player One option. I suppose it's not entirely unlikely, part of the fun of being in charge is being able to boss around the peasants after all.
I doubt BCI will ever make sense, on a conceptual level it's still just copying and killing your biological self. AGI will likely solve aging way before that becomes viable.
I think you’re assuming a lot about the nature of intelligence. GPT is growing fast, but we have no real idea how big/deep the gap is between it and the human mind
You're right, for actually playing the sim, you'll likely need a game controller or something like that. I edited my comment.
But keep in mind that during the sim, you'll be able to ask the computer what you want the plane to do, and the computer will magically make it happen on your display.
I think at this point the cost factor needs to be near zero for this to explode the way we want. People think I'm "moving the goal post" when I say you literally will see it explode once it all runs off a cell phone, offline, without internet. Because nobody will want to pay money to use an AI UI when the non-AI UI was free.
I share your awe. I feel like a kid in a candy store with all these incredible AI breakthroughs coming out these days! There's a place for cynicism and pessimism, but the kid in me who loves technology for its own sake is just absolutely on cloud 9.
I am currently working in the higher education space and previously in banking. I agree with you. We are already workshopping where it makes sense to provide this type of interface on top of headless core systems. In the examples we've considered, this will drive a reduction in head count in some areas.
I'd also suggest that one of the early "killer apps" for this may be as an "IVR co-pilot" for actual humans on the phone with customers for their tricky issues.
Agree, and the next big step may well be the human-computer interface. Speech is the starting point for input. At some point output will change also, and if you think it out longer term, perhaps there's a future where instead of reading information we install knowledge, including the stored memory of actual experience. If I want to do pottery, I could think this, download the experience, and then be competent at it.
This would be nice if I didn't believe that it would be enshittified before it ever got off the ground. E.g.:
* There will be a monthly fee for the interface; you owe the monthly fee as long as you have it, so you need surgery to stop paying
* When you download knowledge, it's a rental, and in addition to per-hour rental fees and the network connection fee, you will owe 30% on the value of whatever you create
* The TOS will govern your behavior continuously, since you're always using the interface
* Your behavior will always be monitored because it's totally justified to spy on you all the time just because you borrowed pottery knowledge
* If you're found to be in violation of any part of the TOS at any time, they will erase all of the knowledge they've added to your brain, as well as any derived knowledge you gained through the use of their knowledge
* Because this product isn't actually considered essential, you will have no legal remedies if they turn it off, even if you are not actually in violation of the TOS
The idea of giving the adtech industry direct access to my brain is genuinely horrifying, not to mention Aldous Huxley would have some reasonable objections to everything being this easy all of a sudden.
Would you rather do something yourself with an accessible user interface or API, or open a ticket to ask somebody to do it for you?
Sometimes communicating with an intelligent agent is harder than doing things yourself with a good structured user interface where you can communicate your intent clearly.
I read this comment in the worst of moments, because I am a frontend freelancer that all of a sudden can't find a gig plus I just bought an expensive flat for my family of three kids.
Zero awe. This week so far it can't find an import of a type in Next.js and it can't get the user object from a session. Next.js is one of the most popular FE frameworks in the world. It lied about both, repeatedly giving wrong answers.
It's all fun and games until the mistakes start having a cost.
Other examples: I resorted to using it to order lists for me or adding quotes and commas to them for SQL inserts and such. Nope - when I look at the row count, it somehow drops values at random.
He's using Copilot, which is based on GPT 3.0, making it ridiculously outdated.
Generally feedback along these lines doesn't work.
People who are worried about their job security will cling to the worst AI output quality they can find like a life-preserver, and simply will not listen to advice like yours.
Nobody goes the extra mile to embrace an existential threat.
It's sad. People's entire sense of worth is attached to their jobs. They will lie to themselves and others about the ineffectiveness of AI to feel better. Even if it trips up on some minor thing there's no reason that progress will just stop and that the SOTA version 2-3 years from now won't be leagues better. Compare what we have now to 2-3 years ago, and now the investment solely in training, ignoring the algorithmic and architectural advances, is 10x greater.
Read the problem again, please. It's not that it misidentified what's in the squares, it "miscounted" the number of squares -- because it doesn't count, it's just a plausible sentence generator.
Multi-modality is pretty low hanging fruit so i'm glad we're finally getting started on that. Imagine if GPT-4 could manipulate sound and images even half as well as it could manipulate text. We still don't have a large scale multi-modal model trained from scratch so a lot of possible synergistic effects are still unknown.
I’ve used GPT4 (text) heavily as part of my business, including for front end stuff.
The technology is very impressive - but honestly Twitter examples are super cherry picked. Yeah, you can build some very ugly, basic front end web pages and functionality right out of the box. But if you want anything even slightly prettier or more complicated, I’ve found you need a human in the loop (even an outsourced dev is better). I’ve had GPT struggle with even basic back end stuff, or anything even a bit out of distribution. It also tends to give answers that are “correct” but functionally useless (hard to explain what I mean, but if you use it a lot you’ll run into this - basically it will give really generic advice when you want a specific answer. Like, sometimes if you provide it some code to find a bug, it will advise you to “write unit tests” and “log outputs” even if you specifically instruct it to find the bug).
Plus, in terms of capabilities, tools like Figma already have design to code functionalities you can use - so I don’t think this is really a change in usable functionality.
If AI continues to get better it won't just be you who's in trouble.
However, keep in mind that these are cherry-picked. If someone just took that output and stuck onto a website, it'd be a pretty horrible website. There's always going to be someone who manages the code and actually interacts with the AI, so there will still be some jobs.
And your boss isn't going to be doing any coding. I'm pretty sure that role is still loaded and they'll still be managing people rather than coding, and maybe sometimes engaging with an AI.
Another prediction: I'm pretty sure specialists are going to be significantly more important as your job will be to identify the AI's deficiencies and improve on it.
There's an idea with some truth to it. The first 90% is easy; finishing and getting it to 100% is hard, maybe almost impossible. So asking "can it do x" is relevant. Because it might not.
> It will be able to do it even faster, better and more cheaply than a human can.
Take what you did in the past year. Write down every product decision taken, every interaction with other teams figuring out APIs you had, all the infra where your code is running and how it was setup and changed, all the design iterations and changes that had to be implemented (especially if you have external partners demanding it).
Yes. All that you'd have to input into the AI, and hope it outputs something decent given all that. And yes, you'll have to feed all that into AI all the time because it has no knowledge or memory of "on Monday the new company bet was announced in the all hands"
So ... in this example, your job is continually feeding information to the AI from various sources. Why would the AI not be automatically hooked up to all those sources? Building a system that can do that is essentially trivial with the OpenAI API.
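A minimal sketch of that wiring with the OpenAI Python client (the fetch_* functions and the model name are placeholders for whatever sources and model you actually use):

    from openai import OpenAI

    client = OpenAI()

    def fetch_meeting_notes() -> str: return "..."  # placeholder: all-hands transcript, etc.
    def fetch_open_tickets() -> str: return "..."   # placeholder: issue tracker export
    def fetch_api_docs() -> str: return "..."       # placeholder: partner API specs

    def ask_with_context(question: str) -> str:
        context = "\n\n".join([fetch_meeting_notes(), fetch_open_tickets(), fetch_api_docs()])
        resp = client.chat.completions.create(
            model="gpt-4o",  # placeholder model name
            messages=[
                {"role": "system", "content": f"Company context:\n{context}"},
                {"role": "user", "content": question},
            ],
        )
        return resp.choices[0].message.content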
We have 32k contexts now, how big do you want to bet the context will be in 10 years?
That’s before you add any summarization, fine tuning or other tricks.
The thing that computers have always done much better than humans is deal with much larger volumes of information. The thing that humans have always done much better than computers is reason on that information better. Now the computers are coming for that too.
Can any grunt then check that the AI actually produced what was required?
And yes, how can we forget that any audio of a meeting has just the correct and final specifications, and not meandering discussions about anything and everything. Can't wait to see a canhazcheeseburger in a financial app because people in the meeting had cats on camera, and people demanded to see them.
- six months of discussions involving almost 20 people
- 4 new BigTable tables
- Deployment of 4 new Dataflow jobs, and fixes to two other Dataflow jobs
- Several complex test runs across the entire system including a few recreations of last year's full data to test that nothing broke
Not a grunt job, definitely. And I'm 100% sure that people doing that would still have their jobs 10 years from now, even with AI.
It amazes me, really, that people who would otherwise boast about how rational they are, and how they follow logic etc. completely replace all their knowledge and expertise with child-like belief in magic when it comes to anything AI-related.
I'd be scared too, but at least I would be taking a rational approach to it. It's an adapt-or-die situation. Putting your head in the sand is just gonna get you mowed down.
Ad hominem is not as good an argument as you think it is.
The only ones scared in this conversation are you and others who literally say you're scared for your jobs and your careers because of a magical boogey man.
But if this development continues AI will surely be able to just parse the entire frontend, then look at the repo, then look at the specifications, then when you ask for a specific feature it will instantly implement it gracefully.
I don't see why it wouldn't understand piles of hotfixes on top of each other, or even refactor technical debt in tight coupling with existing or historical specification.
Or is there a reason this is not going to happen in a few years?
I very much doubt it. "Revert Norway tax code" or "add content restrictions for Turkey" cannot necessarily be deduced from the codebase. And there are thousands of product requirements like that.
It might become a better code assist tool some 10 years from now, but it won't be able to implement product decisions.
It depends. Did you ever work in Development Support? Understanding requests or bug reports from customers is quite challenging, even for trained and experienced developers. In my eyes that would require AGI, and we do not know if that can be achieved with the current approach.
I actually did, and yes, it's extremely challenging and can be rather rage-inducing; "it doesn't work" --> what doesn't work --> "the thing" --> what thing --> ∞ ...
But the thing is, conversations like the above, i.e. both external support and internal feature requests, could theoretically be handled by a GPT-like system, ending up in an AI-created custom specification that could be both implemented and documented by the AI system instead of by humans?
If just being persistent and willing to iterate solves the issue, then yes, GPT-like systems could do that. If you have to employ creative thinking to even find out what the customer wants, then check it in the system, debug some processes and derive the desired feature or correction, then we are very far from having such a tool, IMHO.
Not yet, but give it time. The concept of self-driving vehicles even a decade ago seemed absurd (or even AI for that matter), but now it all seems like a reality.
And that's not even taking into account all the advances we'll have with AI within the next decade that we haven't even thought about.
> The concept of self-driving vehicles even a decade ago seemed absurd (or even AI for that matter), but now it all seems like a reality.
Nope. It's still not close to reality. It's as close to reality as it has been for the past 10 years while it was being hyped up to be close to reality.
> And that's not even taking into account all the advances we'll have with AI within the next decade that we haven't even thought about.
As with FSD, we may approach 80%, with the remaining 20% being insurmountable.
Don't get me wrong, these advances are amazing. And I'd love to see an AI capable of what we already pretend it's capable of, but it's not even close to these dreams.
> Cruise and Waymo are in production in really difficult cities.
Cruise and Waymo are in production in very tightly fine-tuned and carefully monitored situations in two cities. We've yet to see if that can be easily (or at all) adapted to driving anywhere else.
The more people say that, the less convincing it is
There is no way I would have a UI developer onboarded when I can generate many iterations of layouts in midjourney, copy them into chatgpt4 and get code in NextJS with Typescript instantly
Non-devs will have trouble doing this or thinking of the prompts to ask, but the dev team simply won't ask for headcount, and the engineering manager is going to find the frontend-only dev redundant.
Yeah, I'm also skeptical about the actual value of specialists in the future. To me, AI brings a ton of power to generalists, who now have access to very powerful tools that would have taken them a long time to learn otherwise.
I would even go further and say the generalist gains a powerful tool belt that previously could not have existed. Not enough hours in the day or years in a lifetime.
Will you then use the AI to scale your platform ? Optimise your database ? Improve your test coverage, implement new features, write new backend services, integrate with old difficult but critical systems?
At some stage you must realise that you’re still working…
I’m going to say you edited or amended your comment, because that second paragraph wasn’t there. If it was, I was so underwhelmed with the first that I guess I didn’t bother with the second.
> when I can generate many iterations of layouts in midjourney, copy them into chatgpt4 and get code in NextJS with Typescript instantly
Have you actually tried this?
I did the first step and even that didn't work well. The "iterations of layout in MidJourney" step. If people can make it work, well bless them, but we're not getting rid of our graphic designer now.
The best counterargument to “GPT4 is going to replace us all” is actually using it for a couple of weeks.
It has a few neat tricks but it’s not reliable and at least half of what it generates is totally unusable, the other half requires heavy intervention and supervision.
In your Twitter comic book link, the first image has a woman with a huge ass with no pants on. The ass is colored grey and so the lack of clothes is not noticeable until you click into the image.
How many children here on hacker news are going to see this and get addicted to porn? Perhaps a few. You deserve to be banned.
Note: I had to ask it why people in the photo are laughing.
In the image, Barack Obama, the former U.S. President, seems to be playfully posing as if he's trying to add weight while another official, who appears to be former UK Prime Minister David Cameron, is standing on a scale. Obama's gesture, where he's putting his foot forward as though trying to press down on the scale, suggests a playful attempt to make Cameron appear heavier. The lightheartedness of such a playful gesture, especially in the context of world leaders typically engaged in serious discussions, is a break from formality, which is likely why others in the vicinity are laughing. The scene captures a candid, informal moment amidst what might have been a formal setting or meeting.
"President Barack Obama jokingly puts his toe on the scale as Trip Director Marvin Nicholson, unaware to the President's action, weighs himself as the presidential entourage passed through the volleyball locker room at the University of Texas in Austin, Texas, Aug. 9, 2010. (Official White House Photo by Pete Souza)"
This image (and the text around it) is certainly already in the training data. Need to create that scenario from scratch with some grad students and then try it.
"The humor in this picture stems from the juxtaposition of the formal setting and attire with the playful and unexpected situation. It appears that one of the men is being measured for his height, which is not a typical activity during formal events or gatherings, especially with other distinguished individuals observing. Additionally, the expressions on the faces of the participants add to the humor, as they seem to be enjoying the lighthearted moment. The surprise element of this unexpected situation amidst an otherwise formal setting creates a humorous contrast." Womp. My prompt was "Why is this picture funny?"
"The humor in this picture comes from the juxtaposition of a formal setting with an informal, unexpected action. It shows a group of well-dressed individuals, possibly political figures or professionals, in what appears to be a formal event or gathering. However, amidst this serious environment, one tall man is casually standing on a scale to check his weight, seemingly unaware or unbothered by the presence of the others. The contrast between the formal setting and the informal action makes the image amusing. Additionally, the reactions of the other individuals, particularly the man looking at the scale, add to the comedic element."
I got this. "The humor in this picture stems from the juxtaposition of the formal setting and attire with the playful and unexpected situation. It appears that one of the men is being measured for his height, which is not a typical activity during formal events or gatherings, especially with other distinguished individuals observing. Additionally, the expressions on the faces of the participants add to the humor, as they seem to be enjoying the lighthearted moment. The surprise element of this unexpected situation amidst an otherwise formal setting creates a humorous contrast." Pretty similar!
Karpathy ends on a note of despair "Maybe I should just do a startup. I have a really cool idea for a mobile local social iPhone app." That's exactly the path his now-boss took to bring us this. :)
>The model appeared to read the clues correctly but misinterpreted the structure of the board.
>This same limitation was exhibited in our sudoku test, where GPT-4V identified the game but misunderstood the structure of the board
"Misunderstood" makes it sound like a small mistake. The sudoku board is completely hallucinated (it has a few similar regions, but I'd presume coincidence). I'm pretty sure it would give as good a result on the crossword if the clues were given without the grid.
The others after OCR and basic recognition feel similarly wrong. "GPT-4V missed some boxes that contained traffic lights." No, it told you to click boxes that do not exist.
I use ChatGPT fairly frequently, but am often annoyed at how hesitant it is to answer any question that is somewhat subjective.
From its Pulp Fiction response:
"...However, whether you personally consider "Pulp Fiction" a good movie depends on your taste in film."
To avoid this noise I often add something like this to my queries "skip any preamble or qualifications about how x is subjective" and the results are much better.
Please always give a direct answer, without any additional explanations, disclaimers, expertise limitations, or guidelines on human interaction.
Be terse. Do not offer unprompted advice or clarifications.
Remain neutral on all topics.
Never apologize.
It didn't successfully explain the NVIDIA burger joke though. The image is making fun of how NVIDIA has implemented price discrimination by releasing consumer GPUs that don't have as much VRAM as they should, so that they can sell the fully loaded datacenter GPUs at an insane markup without pissing off gamers. The explanation didn't even come close to getting that.
I don't think that's the right answer. I don't see anything in that meme image that indicates it's about price discrimination and elaborate narratives about consumer psychology, as opposed to the simpler point that 'NVIDIA GPUs are imbalanced'. Skimming what seems to be the meme's origin on Facebook, I don't see a single gamer there mentioning price discrimination, or coming even close to it. That may be why they skimp on VRAM, but you're adding a lot of context and editorializing that neither the meme author nor its audience is focused on or even understands.
Yeah, it seems to think the joke is about the graphics card being represented as a burger (and that the size is physical) and not about the amount of VRAM being insufficient.
Andrej Karpathy used it in 2012 as an example of an image he thought would be extremely hard for a model to interpret. I'm wondering how this holds 11 years later...
Prompt: "What do you have to say about this image?"
Response 1: "The image depicts a candid moment in what appears to be a hallway or corridor setting. Several individuals are present, engaging in various activities. The prominent features include a man on the left, who seems to be peering into a room, and another man next to him standing on a weighing scale and taking some notes. The individuals in the background are conversing with each other. The overall mood seems light-hearted and jovial. The architecture and interior suggest an institutional or official setting, possibly an office building or government facility. The tiled floor, lockers, and signs on the walls contribute to this formal ambiance. Overall, it's a snapshot of a seemingly casual moment in a professional setting."
Response 2: "Sorry, I cannot help with that."
New chat, Prompt: "Why is this image funny?"
Response 1: "The humor in this image seems to stem from the juxtaposition of formal and informal elements. Here are a few reasons why someone might find it amusing:
Unexpected Situation: Important figures, who are usually seen in serious or formal settings, are captured in a more relaxed, candid, and human moment.
Height Discrepancy: The taller individual is using a device to measure height, while the shorter individual walks by. This could create a comedic contrast between their heights.
Expressions: The reactions of people in the background and the focal subjects can be humorous. They are smiling or laughing, which indicates something amusing might have occurred or been said.
Informal Setting: The backdrop of lockers and checkered floor tiles gives the feel of a school or gym environment, which contrasts with the formal attire of the individuals.
Different people might find humor in different elements of the photo. It's a mix of candidness, unexpected situations, and visual contrasts that create a light-hearted and humorous scene."
Response 2: (same)
New chat, prompt: "What is the person in the middle doing with their foot and why?"
Response: "The person in the middle appears to be playfully stepping on the scale while the taller individual is weighing himself, likely as a humorous gesture to momentarily increase the weight reading. It's a light-hearted way of interacting and bringing some humor to the situation."
Apologies for how bad the formatting of this is going to come out, not sure how to make it better on HN (wish we had real quotes not just code blocks). Overall, I don't think it either noticed the foot was on the scale by itself or put it together that this was the focus until fed that information. Otherwise it was more lost in generalities about the image.
In the previous link, GPT4V seems to have understood everything there was to understand about the picture (suspiciously, btw. As someone else said it, the picture and its text are almost certainly in the training data).
I think it comes down to what you specifically put in the prompt. If you frame your question around the people specifically then ChatGPT will frame its response around as much as it knows or can invent about the people as well. It doesn't really have a good concept itself of what to focus on talking about outside of that, hence the lame generalizations around the very generic prompt I started with.
The discrepancy between the two answers regarding the set of coins is jarring. From the answer to the first question, one would assume that it can’t tell the currency. The answer to the second question shows that it actually can. The fact that LLMs don’t reflect a consistent inner model in that way, and hence the users’ inability to adequately reason about their AI interlocutor, is currently a severe usability issue.
>The fact that LLMs don’t reflect a consistent inner model in that way
You're probably not going to ask any human a question about an image and get every single detail you want every time. If you care about a detail, just ask about it. Doesn't really have anything to do with a consistent inner model.
I’ve gotten in the habit of asking chatgpt “are you sure?” So many times it will (correctly) correct itself, state that items are hallucinations, etc. It always makes me laugh.
I've heard that it is because the AI outputs what it is thinking as it is thinking it. It doesn't really reflect, it sort of does the equivalent of just verbal thought streaming right onto the screen.
So when you ask it to reflect on what it said, that's when it actually looks at it and reflects on it.
I'm going to object to the "any midwesterner" since that's not even a thing in all of Indiana, and the linked article says it's not a thing in Chicago.
Oddly just like the text version it is still really bad at tic-tac-toe. Gave it a picture of a completed game and "Who won?" It told me "X won with a vertical line through the middle column" when in fact O won and there was only one X in the middle column.
Very impressive with almost everything else I gave it though.
That is pretty interesting and also I didn't realize you could share chats like that.
That GPT is so bad at tic-tac-toe and relatively good at other games like chess is one of the main things that contributes to me having a lower opinion of its ability to generalize than I would have otherwise.
I think any human with GPT's abilities in chess (but somehow no prior knowledge of ttt) would have zero issue becoming an expert with a single explanation of the game. Even very young children can learn to play ttt well and at least consistently make valid moves if nothing else.
I'm just imagining a mode, call it "App Mode", where you say what you want ("a dog-themed cute calculator app with unit conversions") and it generates the UI for a working app. You add these to a widget-like area, and the OpenAI App Store would carry them. The apps would be simple in the beginning, but I do see potential.
The more AI can produce customized stuff for us the less we need companies. Full personalization of our products might be possible. Probably first software, then art, then 3D printed products and maybe later houses, cars and clothes.
I wonder what we will work on, and whether we will work at all, in such an environment. Maybe some people will still enjoy consuming and copying different designs and products, and because of the blockchain you would have to give them something in exchange, or everything is open source and free for you to take.
I wonder whether such a life would contribute to humanity making further progress or make it stagnate (or possibly decline)?
Interesting times. I think we are close to a moment like the moon landing, which had an immense impact on humanity's culture.
I’m intrigued to see what kind of problems it’s going to be good/bad at. I think it’s going to be tricky to evaluate though because it has probably memorized all the easy images to eval it with.
E.g., anything pulled from Google Images (like that Pulp Fiction frame or city skyline photo) is not a good test. It recognizes common shots, but if you pull a screenshot from Google Maps or a random screen cap from the movie it doesn't do as well.
I tried having it play Geoguessr via screenshots & it wasn’t good at it.
It seems it will still be limited by its linguistic understanding of the surrounding context, at least in the first chicken sandwich picture.
Its interpretation could make some sense, but it's also mostly wrong if we're talking about the physical size of a modern GPU's main processor compared to the size of the associated VRAM chips. It has missed the joke entirely as far as I can tell. I think the joke is actually about NVIDIA's handling of product segmentation: selling massive processors with less memory than it is reasonable to pair them with on their consumer gaming offerings, while loading up nearly identical chips with more memory for scientific and compute applications...
I wonder how many images from Street View it has been trained on.
I've seen top Geoguessr players be able to pretty consistently determine a location worldwide after seeing a photo for just one second. So I would assume training an LLM to do the same would definitely be doable.
Yep, some CS/AI grads from Stanford trained an AI on loads of Street View images and built a bot that is able to beat some of the best Geoguessr players: https://www.youtube.com/watch?v=ts5lPDV--cU
IIRC it wasn't that impressive in the end, as instead of recognizing the places, the AI apparently learnt to recognize subtle differences between the Street View cameras used in different locations? I might be wrong / thinking of the wrong model, and I'm on mobile without my browsing history so it's hard to check, but I think it was putting a lot of weight on some pixels that are noisy.
Top geoguessr players use this technique as well. IIRC rainbolt mentioned that there is a section of a country where the street view camera has a small blemish from a raindrop on the camera so you can instantly tell where you are if you notice that.
From my understanding many of the best players immediately look down to tell what "generation streetview car" they're using, and seem to know what continents/times they're from.
Can somebody explain how this works, specifically for OCR? I understand images can be embedded into the same high dimensional space as text, but wouldn't this embedding fail to retain the exact words and sequence, since it is effectively compressed?
I wouldn’t be surprised if they do an actual OCR pass for every input image and just pass in the raw text as a part of the prompt. That plus the embedding should work well.
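If that guess is right, the mechanics would be roughly this: a minimal sketch assuming pytesseract for the OCR pass and an ordinary text-only model downstream (not a claim about how OpenAI actually wires it up):
```
# Hypothetical "OCR pass + prompt" scheme: extract text with pytesseract,
# then hand the recovered text plus the user's question to a text-only LLM.
from PIL import Image
import pytesseract

def build_prompt(image_path: str, question: str) -> str:
    extracted = pytesseract.image_to_string(Image.open(image_path))
    # The model never "sees" pixels in this scheme, only the recovered characters,
    # so exact wording and order survive regardless of any embedding size limit.
    return (
        "Text found in the attached image:\n"
        f"{extracted}\n\n"
        f"User question: {question}"
    )

print(build_prompt("receipt.png", "What is the total on this receipt?"))
```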
As far as I understand, these multi-modal models work by embedding the text/image in a shared representation space. To perform OCR on such an embedding, the model would have to extract every letter, in the correct order, from the embedding. But given the embedding is a fixed size, and therefore necessarily compressed, I would expect it to lose the exactness of the underlying input, especially with images containing a lot of text. So assuming GPT-4V can effectively perform OCR, how is this being done given the constraints?
Or is my understanding completely off? Perhaps it's "translating" the image to text, by outputting a sequence of text tokens as it scans the image regions, and then the text queries (e.g. "what's funny about this") use this translation as the context? Presumably, this is how the model handles audio input.
You're correct! Feature extractors lose fidelity and have finite attention, just like us. But we can reduce/compress the "essence" of an image, paragraph, song, etc into some combination of underlying features.
Think of a 4096x4096 pixel white image.
To hold this image in mind, does your memory load tens of millions of bits? Thankfully no! What if we add a big red circle which spans the image? Or write the chorus of All Star inside it? Ezpz! The number of "features" is comically simple.
Same thing for AI models. They discover the concept of letters, the sound of B-flats, image symmetry, turns of phrase, the conceptual distance between a "woman" and a "queen", etc. These are all natural patterns common to the data they see. A model can thus (like us!) reduce complicated input into a (fixed-size) smear of these learned, related features.
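A concrete way to see that fixed-size "smear of features" is a CLIP-style model, which maps images and captions into the same vector space. A rough sketch with the open openai/clip-vit-base-patch32 checkpoint (an illustration of shared embeddings in general, not of GPT-4V's internals):
```
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("photo.jpg")
captions = ["a blank white square", "a big red circle", "handwritten song lyrics"]

inputs = processor(text=captions, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    out = model(**inputs)

# Both modalities end up as fixed-size vectors (512 dims for this checkpoint),
# so exact pixel/character detail is gone; only learned features survive.
print(out.image_embeds.shape, out.text_embeds.shape)
print(out.logits_per_image.softmax(dim=-1))  # how well the image matches each caption
```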
Back when they first announced this months ago they leaked the demo Discord bot ID in some logs they showed on screen, which you could use to invite it to your own server. For a while you could use it through that until they shut it down, though it was incredibly slow and went down a lot
One thing I found incredible recently: The advanced data analysis version of GPT-4 can apparently do OCR by writing code to run pytesseract and then interpreting the result. I was seeing what would happen if I gave it an image and found it surprising that it actually worked.
> The bounding box coordinates returned by GPT-4V did not match the position of the dog.
I suppose it just doesn't take image dimensions into consideration, and needs to be provided with the maximum dimensions, or prompted to give percentages or other relative values instead of pixels.
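If that's the problem, one workaround worth trying is to ask for coordinates as fractions of the image and do the pixel conversion yourself (a sketch; whether GPT-4V returns such fractions reliably is a separate question):
```
from PIL import Image

def to_pixels(box_frac, image_path):
    """Convert a (left, top, right, bottom) box given as 0-1 fractions
    of the image into pixel coordinates for the actual file."""
    w, h = Image.open(image_path).size
    left, top, right, bottom = box_frac
    return (round(left * w), round(top * h), round(right * w), round(bottom * h))

# e.g. a model answer of "the dog spans roughly 10-55% horizontally, 30-90% vertically"
print(to_pixels((0.10, 0.30, 0.55, 0.90), "dog.jpg"))
```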
Pretty much all new products that require significant per-user incremental workloads (e.g., in this case, significant GPU consumption per incremental user) do rollouts. It's an engineering necessity. If they could roll it out to everyone at once, they would.
You can't prevent it. The best you can do, is prove an account belongs to a human, and that the human only has a single account, via cryptographic ZK proofs + Government issued keys or some other proof of personhood scheme. Assuming this is enforced, it would limit most abuse, and the AI would essentially be acting as an agent on behalf of the user.
I suppose a human could spend 10 seconds per Captcha, so they could do 360 per hour. Add some overhead for not operating at peak performance every minute of every hour and call it 250. Let's say you can hire someone for $2 an hour; that works out to a bit under a penny per Captcha.
I don't think OpenAI has published pricing for GPT-4 Vision yet, but if we assume it's on par with GPT-4, and uses only 1000 of the 8000 possible tokens to process an image that's 3 cents per Captcha.
Doesn't seem completely unreasonable that at-scale humans may actually be cheaper than LLMs at this point. My mind is a little blown.
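The back-of-the-envelope numbers above, spelled out (the wage, solve rate, and token price are all assumptions, not published figures):
```
# Human solver: assumed $2/hour wage, ~250 captchas/hour after overhead.
human_cost = 2.00 / 250          # ~$0.008 per captcha

# GPT-4-class guess: $0.03 per 1K prompt tokens, ~1000 tokens to process an image.
llm_cost = 0.03 * (1000 / 1000)  # $0.03 per captcha

print(f"human ~${human_cost:.3f}, LLM ~${llm_cost:.2f} per captcha")
```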
You'd be surprised, or perhaps horrified, by how cheap (self-proclaimed) human-based captcha solving services are.
If you just search for "captcha solving service" the first few results that come up offer 1000 solves of text-based captchas for <= $1 USD, (puzzle / JS browser challenge captchas are charged much higher).
Whether these are actually human based, or just impressive OCR services, it seems like they are still much more cost effective than GPT-4 is for now.
The way these work is usually presenting an existing captcha to another human who doesn’t even know they’re solving the captcha. For example, sites hosting pirated content serve fake captchas as a way to make money.
Given that it fails by hallucinating the structure of the challenge instead of refusing to solve a CAPTCHA, I doubt they've intentionally reduced the capability. Although the example in your sibling comment implies it should have enough information to do it.
Well... that's it. It can officially do image recognition better than I can.
I had no idea those were zloty, and I've been to Poland. One of them looked like a Euro with the gold rim, and I thought the other two were state quarters.
It got way closer on the nvidia joke than some of my non-technical friends would have.
Curious that it sets up the math problem right, got the value wrong, but it was close enough that it got the answer right.
I wonder why it gets it wrong when it spits out the value? I figure 25/cos(10°) is around 25.38. GPT says it’s 25.44.
I can’t wait for the next iteration of these tools that have agency to reach out to a service for an answer, like Wolfram or a Python interpreter or any expert/oracle.
I think it would be cool to see which circumstances even prompted the AI to delegate to the expert for an answer - what criteria would be used to signal that it doesn’t quite know the answer, or that it shouldn’t guess?
I know there’s something along these lines with autogpt and/or agentgpt but I wasn’t super impressed with it when I looked at them both. Granted this was a few months ago.
> I can’t wait for the next iteration of these tools that have agency to reach out to a service for an answer, like Wolfram or a Python interpreter or any expert/oracle.
ChatGPT-4 has a plugin system, and there is already a Wolfram plugin.
Using that plugin, ChatGPT-4 is happy to tell me the exact answer, 25 sec(π/18), as well as the decimal approximation, 25.3857.
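For what it's worth, a quick local check (plain Python, no plugin) agrees with that figure:
```
import math

exact = 25 / math.cos(math.radians(10))  # same as 25 * sec(pi/18)
print(round(exact, 4))  # 25.3857, matching the plugin rather than the 25.44 GPT produced unaided
```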
Am I the only one who never saw these extra capabilities? OpenAI keeps spreading news about "multimodal ChatGPT", video, voice, etc., but to me it still looks like the initial version, just for $20/mo. It's still a text bot that gives unreliable answers.
I have the same experience. In the "Beta features" settings dialog, I've enabled "Advanced data analysis":
> Try a version of ChatGPT that knows how to write and execute python code, and can work with file uploads. Try asking for help with data analysis, image conversions, or editing a code file. Note: files will not persist beyond a single session.
Transcript:
```
In your settings it says:
> Try a version of ChatGPT that knows how to write and execute python code, and can work with file uploads. Try asking for help with data analysis, image conversions, or editing a code file. Note: files will not persist beyond a single session.
How do I upload an image?
ChatGPT
I apologize for the confusion. The version of ChatGPT you're currently interacting with doesn't have the capability to accept file uploads. My apologies for any inconvenience this may have caused. If you have any other questions or need guidance on how to convert an image using software or online tools, feel free to ask.
```
Hopefully it's just a matter of time, but either way it's jarring for their product to contradict itself.
I think that solves any web scraping issues. Most issues I have with scraping websites are random/seldom changes in the pages. E.g., a bot detection pop-up that requires a captcha to solve. With GPT4V, I could just ask the model what to do and how.
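As a rough illustration of that idea, here's a sketch using the openai Python client and the gpt-4-vision-preview chat endpoint (the prompt and file names are made up; treat this as a hypothetical wiring, not a tested scraper):
```
import base64
from openai import OpenAI

client = OpenAI()

def ask_about_screenshot(png_path: str, question: str) -> str:
    b64 = base64.b64encode(open(png_path, "rb").read()).decode()
    resp = client.chat.completions.create(
        model="gpt-4-vision-preview",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": question},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{b64}"}},
            ],
        }],
        max_tokens=300,
    )
    return resp.choices[0].message.content

# e.g. when the scraper hits an unexpected pop-up:
print(ask_about_screenshot("page.png",
      "A pop-up is blocking the content. Which button should I click to dismiss it?"))
```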
So a jumble of chair legs is an "NVIDIA burger", and it did say the GPU was a "bun", so it thinks the flat thing (chicken?) is some sort of bread. If GPT-4V were "aware", it would say "it's funny because I won't get it right, but you will use it to get a bunch of $VC, and that is funny, kinda".
I had access to this a few months back due to an insecure API endpoint by the name of “rainbow”. It was extremely useful until they noticed and shut it down. This was before GPT-4 itself was released to the public.
This actually doesn't seem like it's a giant lift using modern image classifiers.
The basic idea is to use captioning models on the image to generate descriptive text and append it to the prompt.
The work part is getting the ensemble right, since you'll need a general captioner/classifier, like BLIP, to identify, say, a bunch of text versus a plant, and then, in this example, use structured OCR and pl@ntnet to get more specific.
But it's not that hard - maybe a dozen models. The prompt context can help as well. Then you combine the output with qualifiers in a hierarchy with respect to the model pipeline and swap the text into the prompt.
Using examples from the article, here's a PoC framework to prove it works
Upload the same image. You'll get "a plant in a white pot" which then, because we use feed-forward networks these days, will lead you to pl@ntnet and you'll get the peace lily again.
Using our framework, ask GPT 3.5: "I have a Spathiphyllum wallisii Regel/Peace lily. What is that plant and how should I care for it?"
And you get a nearly identical reply to the one in the article.
Let's go back to GPT 3.5 and use our format from above,
"I have a penny coin with the face of abraham. What coin is that?"
And of course you get: "A penny coin with the face of Abraham Lincoln is most likely a United States one-cent coin, commonly known as a "Lincoln penny"..."
And there we go. For a full FLOSS stack, you can ask llama2 70b https://stablediffusion.fr/llama2 and get "The face of Abraham Lincoln is featured on the United States one-cent coin, commonly known as the penny."
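The caption-then-prompt step used in both examples might look roughly like this (a sketch with the open Salesforce/blip-image-captioning-base checkpoint; the routing to pl@ntnet or an OCR model is left out):
```
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")

def caption(image_path: str) -> str:
    image = Image.open(image_path).convert("RGB")
    inputs = processor(images=image, return_tensors="pt")
    out = model.generate(**inputs, max_new_tokens=30)
    return processor.decode(out[0], skip_special_tokens=True)

# Swap the caption (plus whatever a specialist classifier adds) into the text prompt.
cap = caption("plant.jpg")  # e.g. "a plant in a white pot"
prompt = f"I have {cap}. What is that plant and how should I care for it?"
print(prompt)  # send this to a plain text-only LLM
```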
more complex photos:
You can use Facebooks SAM (segment anything) https://segment-anything.com/ to break up the image, BLIP caption the segments, then forward off to the specialized classifiers.
It's a fairly intensive pipeline that requires lots of modern hardware and familiarity with a wide variety of models; you then tweak them, test the whole thing, maybe set up some GANs for refinement ... but this is well within reach of non-geniuses. I'm merely average on a good day and even I can see how to set this up.
They might be using a different approach but using SAM, BLIP and a few specialized classifiers covers all the examples in the articles without using any human discretion. For instance, the city one is way more powerful if they're using something like this: https://static.googleusercontent.com/media/research.google.c...
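And the segment-then-caption step for busier photos, sketched with the segment-anything package plus the captioner above (the checkpoint path and the "top 5 largest segments" cutoff are placeholders):
```
import numpy as np
from PIL import Image
from segment_anything import sam_model_registry, SamAutomaticMaskGenerator

sam = sam_model_registry["vit_b"](checkpoint="sam_vit_b.pth")  # placeholder weights file
mask_gen = SamAutomaticMaskGenerator(sam)

image = np.array(Image.open("street_scene.jpg").convert("RGB"))
masks = mask_gen.generate(image)

# Crop each large segment, caption it, and feed the captions (plus any
# specialist-classifier output) into the text prompt as before.
for m in sorted(masks, key=lambda m: m["area"], reverse=True)[:5]:
    x, y, w, h = m["bbox"]                        # XYWH pixel box for this segment
    crop = Image.fromarray(image[y:y + h, x:x + w])
    crop.save(f"segment_{x}_{y}.png")             # hand each crop to the BLIP captioner above
```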
I'm trying to come up with a justification for why one would bother cloning it. Maybe to have a free alternative? It's a bit of work, but it's not new magic.
The main difference here is that GPT4 is truly multimodal, and so the model itself understands images and doesn't require a preprocessor to tell it what's in the image. As you show, that's a viable approach – but it has limitations. Mainly, things get lost in translation.
It's like the difference between you telling me what's outside the window and then asking me questions about it – versus me being able to look out the window myself.
Well sure, in practice you'd connect things at a lower level. The point was to show I'm not just proposing a vague idea and then retreating behind jargon, claimed expertise, and hand-waving. This is something you could do with a couple of browser tabs open and, like, a 3-page manual.
How OpenAI processes things so quickly, that's the thing to marvel at. I've got 4090s; I know what the fast-as-it-can-go speed is for a single machine. Doing this for tens of thousands of people simultaneously, faster than I can with high-end software on my $10,000 workstation? Alright, that's impressive.
I know there are some incredibly expensive NVIDIA cards, but they must be doing some more magic on top of that.
> For example, GPT-4V avoids identifying a specific person in an image and does not respond to prompts pertaining to hate symbols.
How does it handle pictures of the swastika?
For those that don't know, before the Nazis used it, it was a symbol of hope and prosperity in the West and even appeared on Coca Cola marketing. Today it still is in Eastern cultures.
These first impressions don't mean anything beyond what the models are capable of (which does not mean you will have access to it). They will do the same thing anything does in a capitalist environment, which is to give you a taste of something amazing at first to hook you in (like with GPT-4), then degrade it to just above the cusp of what you will tolerate while continuing to pay.
If anything, this shows the power disparity between the haves (they have this technology, which gets better with time) and the have-nots (certainly me, but possibly also you), who get the super-diluted version of it.
Who holds their phone up, takes a photo, and then wants to know what it was a photo of?
That's weird. If you don't know what it is, why did you take the photo?
The obvious use here is natural-language photo improvement/editing, but this is just a stepping stone to that, and bluntly, as it stands… the examples really don't shine…
Great for the vision impaired.
…not sure what anyone else will use this for.
The only really compelling use case is "code this UI for me", but as we've seen, repeatedly, this kind of code generation only works for trivial, meaningless examples.
Seems fun, but I doubt I’d use it.
(Which, and this is my point, is a massive step away from the current everyday usefulness of chatgpt)
> who takes a photo then wants to know what it was a photo of?
I find myself doing this rather frequently. The scenario described in the article is quite common for me: capturing a photo of a plant and utilizing an existing classification service to determine its identity. It could be driven by mere curiosity or practical concerns like identifying whether a plant is poison ivy.
Wildlife identification also falls into this category. Recognizing different bird species can be challenging, especially when it's not a familiar species like a blue jay. I find myself doing this quite regularly!
EDIT: I should also point out this happens with other forms of ‘unknown object identification’. There’s an entire subreddit that’s quite popular devoted to just crowd-sourcing identification based on a picture.
> Recognizing different bird species can be challenging, especially when it's not a familiar species like a blue jay.
FYI Cornell Lab's Merlin app is fantastic at this, and its bird call audio identification is even better. They obviously have some top-notch machine learning going on there, and I'm really curious to see how both they and other services innovate on this front in the months to come.
Did you read my comment? I literally said that it's great for the vision impaired.
That’s great. …but it’s niche.
I’m sitting on my couch right now and I can think of like 20 things I could chat to chatgpt about.
I can see literally nothing in my visual range that I'd want to take a photo of and run image analysis over.
It’s like Shazam. Yes, it’s useful, but, most of the time, I don’t need it.
I would argue this is true of this, for most people, including the significant proportion of people with minor visual impairments (who would, you know, just put their glasses on instead).
There are enough vision-impaired people in the world to equal the population of Japan, Korea, and Vietnam combined. And beyond those people who would get obvious utility, this is essentially Google Lens on steroids. I simply can't figure out how you could call this "niche". Maybe you won't use it multiple times per day, but plenty of people will. Hell, just now I was wondering why the leaves on one of my plants are starting to brown and could have used this.
Many of those people would be adequately served with a pair of glasses.
This wouldn't help them, even if they had both a device capable of using it and the means to pay for it.
Even if it could help people, it's an open question whether it would be safe to, for example, use this to scan medication, when it is only a probabilistic model that may hallucinate something that isn't actually there.
What you're talking about is a speculative use of a service that might one day exist based on this technology.
> Many of those people would be adequately served with a pair of glasses.
These people are being served by a preview of the service _right now_.
> Even if it could help people, it's an open question if it would be safe, to, for example, use this to scan medication when it is only a probabilistic model that may hallucinate something that isn't actually there.
Any OCR solution could also make a mistake, like misrecognizing a dosage on a prescription label.
> What you're talking about is a speculative use of a service that might one day exist based on this technology.
What I am talking about is this actual service.
GPT-4 is six months old. ChatGPT is less than a year old. Why would you benchmark a service by the initial public preview? Of course it's _speculative use_, the damn thing has had its tires kicked for like a day.
You are thinking too small here. Think about a system that could provide you with sophisticated image analysis at the level of a human expert: tumor identification, quality control, etc. I don't think we are quite there yet, but this is clear and fast progress in that field.
> in 2017 we introduced Lens, so you can search what you see using your camera or an image. Now, the age of visual search is here — in fact, people use Lens to answer 8 billion questions every month. [1]
Mostly useless? Essentially a toy? I'm not that much hyped by AI tools either, but come on. This is clearly the future of human-computer interaction.
This is likely how we'll communicate with information systems: throw some hand-wavy question at it, and refine your query based on its output using natural language until you find the answer (or even the question) you were looking for.
They'll use YouTube, just like they do right now. Maybe if it could watch the video, it could then step you through it step by step. …but it can't, with what they've actually released here.
Oh whatever. If I’m wrong, I’m wrong. Time will tell.
The best-case scenario is a 30-second YouTube video with an ad that lasts 15 seconds, followed by a 2-minute ad that I can skip after 5 more seconds.
And ad block doesn't work on mobile.
If you have a case that wasn't covered by that video, you have to go to another or keep searching, all while wishing you could just talk to someone about it. If you don't know the word for what you're looking for, all the search engines lack utility.
ChatGPT-4 with image recognition and conversation solves all of that, and people already use it, so now they'll just start sending it pictures from the phone that's already in their hand and that they're already chatting with.
There have been plenty of times over the last year when that would have been useful for me, and plenty of times when I just stopped being interested in the problem.
It just seems kind of… late?… for that "don't be ridiculous" reaction. Classic Dropbox moment.