How GPT-3, your smartphone and Augmented Reality can disrupt a dinosaur industry.
We all love a good picture. The history of photographic studios dates back to the 19th century and the first cameras. The earliest photographic studios made use of painters’ lighting techniques to create portraits. In my country, generations of Indians would assemble under the studio lights to get that perfect family portrait. We have come a staggering distance since then.
Today, the photo studios that produced portraits for so many families have all but disappeared. The aspiring models, commerce catalogues and even the large families stepping in for passport photographs before heading west have all but dried up. Ironically, we click more photos than ever and share them more often than at any previous moment in history.
The disruption of the industry is hardly a surprise given the changes in technology over the last decade. There are two distinct phases to this shift.
Phase 1: The best camera is the one available in your pocket
When the iPhone launched with a camera and every other manufacturer followed, these small sensors were useful but limited in their ability to produce quality images. Applications such as Instagram in its early days compensated for the lack of image quality with filters, which made the app wildly popular. However, smartphone cameras have improved at a tremendous pace since then. What made the app popular in the early days is no longer a much-used feature, as photos taken on your smartphone have become dramatically higher in quality.
Most photo studios that opened to cater to customers in a pre-digital India are on borrowed time. Globally, these studios are dwindling in number. A photo studio in the age of selfies is fated to be a business where the act of looking and the act of clicking are geared towards a single outcome: getting a photograph to the customer with the click of a button and the speed of a file download on a computer. But what led to the dramatic improvement in the quality of mobile photos?
Phase 2: Computational Photography
In 2015, Google realized how far behind it was in the photography space and decided to ramp up its efforts with an engineering mindset. Marc Levoy, a renowned computer graphics researcher, took over the computational photography team at Google Research and said:
“The notion of a software-defined camera or computational photography camera is a very promising direction and I think we’re just beginning to scratch the surface. I think the excitement is actually just starting in this area, as we move away from single-shot hardware-dominated photography to this new area of software-defined computational photography.”
The most impressive recent advancements in photography have taken place at the software and silicon level rather than the sensor or lens — and that’s largely thanks to AI giving cameras a better understanding of what they’re looking at.
It’s not uncommon these days for phones to take better photos in some situations than a lot of dedicated camera gear, at least before post-processing. That’s because traditional cameras can’t compete on another category of hardware that’s just as profound for photography: the systems-on-chip that contain a CPU, an image signal processor, and, increasingly, a neural processing unit (NPU).
This is the hardware leveraged in what’s come to be known as computational photography, a broad term that covers everything from the fake depth-of-field effects in phones’ portrait modes to the algorithms that help drive the amazing AR effects & filters you have come to demand from your smartphone.
Computational photography is the use of computer processing capabilities in cameras to produce an enhanced image beyond what the lens and sensor pick up in a single shot. Computers in photography are not a new thing by any stretch of the imagination. Every camera of the digital age has required processing power to create the image. Even before the dawn of digital, processors were used in film cameras. They controlled things such as auto exposure modes, autofocus and flash output. The potential for computational photography has been known about for some time. However, the advancements in deep learning over recent years have unlocked an entirely new breed of photographs that smartphones are capable of.
HDR, Bokeh and Stabilization
These are the three staples of computational photography as of today. They have been recently joined by high key black and white and night modes. The latter demonstrate how the power of processors is becoming more and more important in photography.
But how do they work?
Photographers are historically used to one approach. Press the shutter, take one shot then press the shutter again. Even the very fastest continuous modes work in a similar way. They simply continue to take single shots until the photographer releases the shutter button.
In computational photography, when you press the shutter the camera will take multiple images virtually simultaneously. It will then process those images in real time into a single shot. HDR is the simplest form of this and has been around for a while. The camera takes a 5–6 shot bracket and merges them immediately.
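The merge step can be illustrated with a minimal sketch of exposure fusion. This is a simplified stand-in, not what any particular phone ships: it assumes the frames are already aligned, grayscale, and stored as rows of floats in [0, 1]. The function names and the `sigma` parameter are hypothetical. Each output pixel is a weighted average across the bracket, where pixels close to mid-gray (neither blown out nor crushed) get the most weight.

```python
import math

def well_exposedness(value, sigma=0.2):
    """Weight that peaks for mid-gray pixels and falls off near 0.0 and 1.0."""
    return math.exp(-((value - 0.5) ** 2) / (2 * sigma ** 2))

def fuse_bracket(frames):
    """Merge same-sized grayscale frames (rows of floats in [0, 1]) into one image."""
    height, width = len(frames[0]), len(frames[0][0])
    fused = []
    for y in range(height):
        row = []
        for x in range(width):
            pixels = [frame[y][x] for frame in frames]
            weights = [well_exposedness(p) for p in pixels]
            total = sum(weights)  # always > 0 because exp() is never zero here
            row.append(sum(w * p for w, p in zip(weights, pixels)) / total)
        fused.append(row)
    return fused

# Three made-up 2x2 exposures of the same scene: under, normal, over.
under = [[0.05, 0.10], [0.40, 0.02]]
normal = [[0.30, 0.50], [0.80, 0.20]]
over = [[0.60, 0.90], [1.00, 0.55]]
merged = fuse_bracket([under, normal, over])
```

A real pipeline also has to align the frames and deal with motion between them, which this sketch leaves out.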
Step up to Bokeh, however, and we can see how powerful modern smartphones are. Bokeh in physics-based photography requires large sensors and fast, wide-aperture lenses of at least a moderate focal length. Clearly, that is impossible to fit in a phone.
To counter this, the smartphone takes multiple images, each concentrating on a specific technical detail. For example, it might take shots to control exposure, focus, tone, highlights, shadows and face recognition. It will then merge them, analyzing all the data within each shot, and attempt to mask the subject from the background. It will then add a blur to that background to simulate Bokeh. All of this is done virtually in real time.
Night modes and high key filters use similar processor-intensive techniques. And these are really only just the start. The same applies to video. Just look back at video capabilities over the last few years. A while ago, the standard video format for stills cameras was 1080p at 24fps. Now most new cameras shoot 4K at 60fps, and will very soon breach the 120 and 240fps mark. That’s a huge leap in processing power in just a few short years.
This leap is now poised to reach another industry, one that might go extinct the way portrait photo studios did: the product photography studio.
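The final two steps, masking the subject and blurring the background, can be sketched as follows. This is a minimal illustration under loud assumptions: the subject mask is given (a phone would estimate it from depth or segmentation), the image is grayscale, and a plain box blur stands in for a lens-like blur. `box_blur` and `fake_bokeh` are hypothetical names.

```python
def box_blur(image, radius=1):
    """Blur a grayscale image by averaging each pixel with its neighbours."""
    height, width = len(image), len(image[0])
    blurred = []
    for y in range(height):
        row = []
        for x in range(width):
            window = [
                image[ny][nx]
                for ny in range(max(0, y - radius), min(height, y + radius + 1))
                for nx in range(max(0, x - radius), min(width, x + radius + 1))
            ]
            row.append(sum(window) / len(window))
        blurred.append(row)
    return blurred

def fake_bokeh(image, subject_mask, radius=1):
    """Keep pixels where subject_mask is 1 sharp; use the blurred image elsewhere."""
    blurred = box_blur(image, radius)
    return [
        [image[y][x] if subject_mask[y][x] else blurred[y][x]
         for x in range(len(image[0]))]
        for y in range(len(image))
    ]

image = [
    [0.0, 1.0, 0.0],
    [1.0, 0.9, 1.0],
    [0.0, 1.0, 0.0],
]
mask = [
    [0, 0, 0],
    [0, 1, 0],  # only the centre pixel is "subject"
    [0, 0, 0],
]
portrait = fake_bokeh(image, mask)
```

Note that the blur here lets subject pixels bleed into the background around the mask edge; handling that edge well is a large part of what makes real portrait modes hard.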
Photography studios vary greatly from one to the next. Some are quite small and operated by a single person or a handful of people. Others are quite large and have hundreds of employees. Some studios will handle all deliveries, shipping and marketing in house, while others will outsource those requirements. All studios need these resources to some degree, but how they get access to them varies. Typically, a photo studio would have:
- The Photo Studio Staff: The Creatives who bring products to life
- A Studio Proper: The primary artistic workspace
- Makeup & Wardrobe
- Dark Rooms
- Prop Rooms
- Graphic Design Space
- Display Rooms
- Logistics arm: For Shipping and receiving the products
Buying things online needs great images.
Why? Images are the primary way to build confidence with the buyer, and they help convert more customers. Given how critical images are to selling online, businesses leave no stone unturned and spend heavily on the product photography process. However, this process can get quite exhausting.
Product photography has not changed for multiple decades. This means that there are considerable bottlenecks with high costs, limited scale and brittle workflows. For example, if your business or a manufacturer decides to change a detail on the product or adds a new color, the whole process has to be repeated.
Using 3D software, brands can now generate compelling visuals by rendering them rather than through physical photoshoots. While this solves many of the traditional bottlenecks with product photoshoots, 3D rendering still involves meticulous modelling, setting up virtual scenes and generating the images better known as ‘lifestyle shots’.
GPT-3 (and iGPT)
OpenAI, an AI research foundation started by Elon Musk, Sam Altman, Greg Brockman, and a few other leaders in ML, recently released an API and website that allows people to access a new language model called GPT-3. GPT-3 is a truly groundbreaking technology in a few areas.
GPT-3 is essentially a context-based generative AI. What this means is that when the AI is given some sort of context, it then tries to fill in the rest. If you give it the first half of a script, for example, it will continue the script. Give it the first half of an essay, it will generate the rest of the essay. — Delian Asparouhov
Today, GPT-3 is a Machine Learning model that generates text. You give it a bit of text related to what you’re trying to generate, and it does the rest.
Machine Learning models let you make predictions based on past data, and generation (creating text) is a special case of predicting things. GPT-3 is trained to predict the next word across a very large body of text; at use time it can pick up a new task from just a few examples placed in the prompt, an approach known as few-shot learning that is showing promising results in language models. GPT-3 has picked up a lot of buzz for how good it is: it can generate entire published articles, poetry and creative writing, and even code.
The excitement around GPT-3 has primarily been around text or written content. Taking the same approach to images, OpenAI is exploring what would happen if the same algorithm were instead fed part of an image.
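To make the "give it a bit of text and it does the rest" idea concrete, here is a minimal sketch of how a few-shot prompt might be assembled. It only builds the text; it does not call any API. The `Input:`/`Output:` layout, the function name and the example captions are all assumptions made for illustration.

```python
def build_few_shot_prompt(task, examples, query):
    """Assemble a prompt: task description, worked examples, then the open query."""
    lines = [task, ""]
    for source, target in examples:
        lines.append(f"Input: {source}")
        lines.append(f"Output: {target}")
        lines.append("")
    lines.append(f"Input: {query}")
    lines.append("Output:")  # left open for the model to continue
    return "\n".join(lines)

prompt = build_few_shot_prompt(
    task="Write a one-line product caption.",
    examples=[
        ("red running shoes", "Lightweight red runners built for daily miles."),
        ("oak coffee table", "A solid oak table that anchors any living room."),
    ],
    query="blue ceramic mug",
)
print(prompt)
```

The model sees the pattern in the examples and is expected to continue it for the last, unfinished entry.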
Researchers at OpenAI decided to swap the words for pixels and train the same algorithm on images in ImageNet, the most popular image bank for deep learning. Because the algorithm was designed to work with one-dimensional data (i.e., strings of text), they unfurled the images into a single sequence of pixels. They found that the new model, named iGPT, was still able to grasp the two-dimensional structures of the visual world. Given the sequence of pixels for the first half of an image, it could predict the second half in ways that a human would deem sensible.
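The data handling described above can be sketched in a few lines. This is only an illustration of the unrolling and completion loop: the "predictor" below is a placeholder that copies the pixel one row up, NOT a trained model, and all function names are hypothetical.

```python
def flatten(image):
    """Unroll a 2D grid of pixel values into one raster-order sequence."""
    return [pixel for row in image for pixel in row]

def unflatten(sequence, width):
    """Fold a raster-order sequence back into rows of the given width."""
    return [sequence[i:i + width] for i in range(0, len(sequence), width)]

def complete(context, total_length, predict_next):
    """Extend the context one pixel at a time until the image is full."""
    sequence = list(context)
    while len(sequence) < total_length:
        sequence.append(predict_next(sequence))
    return sequence

def repeat_pixel_above(sequence, width=4):
    """Placeholder predictor, NOT a trained model: copy the pixel one row up."""
    return sequence[-width]

image = [
    [0, 1, 2, 3],
    [4, 5, 6, 7],
    [8, 9, 10, 11],
    [12, 13, 14, 15],
]
pixels = flatten(image)
top_half = pixels[: len(pixels) // 2]  # the model only sees the first half
completed = unflatten(complete(top_half, len(pixels), repeat_pixel_above), width=4)
```

In the flattened sequence, the pixel directly above is `width` positions back, which is the kind of two-dimensional regularity a sequence model has to discover on its own.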
The results are startlingly impressive and demonstrate a new path for using unsupervised learning, which trains on unlabeled data, in the development of computer vision systems.
History is repeating itself. The problem is that most people don’t want to let go of the way things are until it’s too late. This fits classic disruption theory, and GPT-3 looks set to upend many spaces such as web development, user-aided design and now, product photoshoots.
Smartphone integrated LiDAR
A big part of the process to enable AI-based product photography is generating 3D files of the object in question. This problem can be solved bottom-up with better standardization of 3D from manufacturers and, more interestingly, with LiDAR sensors that are beginning to be built into devices you carry with you. Apple’s newest iPad Pro already has one, and your next phone might too.
Whilst data from the LiDAR sensor alone is not precise enough to generate a high fidelity 3D model, the field is improving rapidly and so are deep learning models.
Initially, AI was capable of deriving depth information from photographs. Since then, state-of-the-art machine learning algorithms have become able to extract objects from two-dimensional photographs and render them faithfully in three dimensions. It’s a technique that’s applicable to augmented reality apps and robotics as well as navigation, which is why it’s an active area of research for Facebook.
“[Our] research builds on recent advances in using deep learning to predict and localize objects in an image, as well as new tools and architectures for 3D shape understanding, like voxels, point clouds, and meshes. Three-dimensional understanding will play a central role in advancing the ability of AI systems to more closely understand, interpret, and operate in the real world.”
This makes it possible to begin generating good quality 3D objects from a smartphone or tablet in the near future. Once a model is generated, a number of product visualization possibilities open up, such as photorealistic rendering.
Digital design is emerging as a crucial lever for the industry. It allows brands to design items quickly and remotely; once created, 3D assets — which are three-dimensional, photorealistic digital models of products — can be used in myriad situations, from creating marketing materials and virtual showrooms to customer-facing e-commerce pages and augmented reality experiences. A digital supply chain is also seen as a way to decrease waste while increasing production speed, offering a win-win for companies working to become more sustainable while cutting costs.
With advancements such as iGPT and a 3D model in place, AI can take over to help generate stunning images of products. This is faster, cheaper and more flexible than physical photoshoots.
The 3D model can now be placed in any number of virtual backgrounds to generate a render that looks compelling. What’s the advantage?
- Personalization: The renders you see for a product might be completely different from the renders I see for the same product
- Cloud-like Scale: You can render hundreds if not thousands of products and images at the same time, and not wait to clear the physical photoshoot space each time
- Speed: It’s near instantaneous to go from product to 3D model to render
- Flexibility: Any changes to your product at a manufacturing level can reflect in the 3D info and the product renders within minutes
- Costs: Significantly cheaper than traditional photoshoots
- Automation: The possibility to AI enable and automate the entire rendering pipeline
- Creative control: Tweak exactly how your brand’s lifestyle images look, without the creative middleman
Brands can begin to tweak images in real time depending on who is visiting the website. Images can be tested for performance by target segment, and the best-performing visuals can be rolled out across the rest of the brand catalog without any increase in costs.
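As a rough sketch of that testing loop, the snippet below picks the best-converting render for each visitor segment. The segments, render names and numbers are made up, and the function name is hypothetical; a real test would also check that the difference is statistically meaningful before acting on it.

```python
def best_render_per_segment(results):
    """results: list of (segment, render_id, views, purchases) tuples."""
    best = {}
    for segment, render_id, views, purchases in results:
        if views == 0:
            continue  # no traffic yet, nothing to compare
        rate = purchases / views
        if segment not in best or rate > best[segment][1]:
            best[segment] = (render_id, rate)
    return {segment: render_id for segment, (render_id, _) in best.items()}

# Made-up numbers for two renders of the same product.
results = [
    ("mobile", "beach_scene", 1000, 30),
    ("mobile", "studio_white", 1000, 42),
    ("desktop", "beach_scene", 500, 25),
    ("desktop", "studio_white", 500, 20),
]
winners = best_render_per_segment(results)
```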
Building an AI to render Product Shots
At Scapic, we’ve been working on experiments to combine all these elements together. With captured and modeled 3D assets, we tried to build an AI assisted workflow for stylized lifestyle images of products.
Generative code snippets already seem to be useful for creating declarative 3D scenes using ThreeJS and WebGL. We can extend the idea to declare a set of described elements and their parameters, and get a render immediately for the information provided.
For now, the process is still human-intensive and limited to presets rather than completely generative scenes. However, after a couple of attempts, some of the results we saw looked promising.
These aren’t real photoshoots; they are all rendered through Scapic’s AI rendering. It still requires hands on deck, and people to assist the process. However, with more work, the day is not far off when the entire process of digitizing to 3D, generating photorealistic lifestyle shots and enabling Augmented Reality can be achieved right from your smartphone.
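A declarative scene description of this kind might look like the sketch below. The schema is entirely hypothetical (it is not Scapic’s format or a ThreeJS one): the point is only that a scene can be expressed as plain data, which is the sort of structured text a generative model could produce and a renderer could consume.

```python
import json
from dataclasses import dataclass, field, asdict

@dataclass
class Light:
    kind: str          # e.g. "directional" or "ambient"
    intensity: float
    position: tuple    # (x, y, z) in scene units

@dataclass
class Scene:
    product_model: str        # path to the captured or modeled 3D asset
    backdrop: str             # name of a preset environment
    camera_position: tuple
    lights: list = field(default_factory=list)

def describe_scene(scene):
    """Serialize the scene to JSON that a renderer could consume."""
    return json.dumps(asdict(scene), indent=2)

scene = Scene(
    product_model="chair.glb",
    backdrop="sunlit_loft",
    camera_position=(0.0, 1.2, 3.0),
    lights=[Light(kind="directional", intensity=1.5, position=(2.0, 4.0, 1.0))],
)
scene_json = describe_scene(scene)
print(scene_json)
```

Swapping the backdrop or the lighting then becomes a one-line change to the description rather than a new photoshoot.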
It’s still early but the space of computational product photography is evolving fast, and a whole category of immersive experiences can be achieved through the same.
It does not stop with just products, but people too. What if AI can begin to generate all models in the catalog as well?
AI Generated Models
The third big change poised to reshape the industry is the rise of AI-generated models for fashion photography.
A typical photoshoot process involves individual costs for models, photographers, stylists, hair and makeup artists, transportation, photo studio rental and photo equipment, digital tech and post production. Reshoots, which happen around 5% of the time, mean repeating all of these costs.
Another cost is just the amount of time lost — photoshoots are slow. Completing the entire process and uploading the images on to the site can take weeks, if not months. This means that the retailer is losing out on selling time. The gap between procuring products and actually putting them up on the site is significant — and costs retailers potential sales during that period.
The need to reduce photoshoot costs is real. And as with many other high-cost, wasteful activities, the product imagery creation process can be optimized with technology.
AI-powered retail automation offers multiple solutions for optimizing processes, workflows, and experiences across the retail supply chain. Automated on-model fashion imagery is one way to improve efficiency and reduce photoshoot costs in the product imagery creation process.
Digital models and influencers are successfully breaking into the fashion industry from every angle. Some have even been signed to traditional modeling agencies. Take Lil Miquela, a 19-year-old Brazilian American model, influencer, and now musician, who has amassed a loyal following of more than 2 million people on Instagram.
Today, Lil Miquela is a computer-generated image (CGI), not artificial intelligence (A.I.). That means that Miquela or similar characters can’t actually do anything on their own. They can’t think or learn or offer posing variations independently. But that won’t be the case for much longer.
The iGPT method presents a concerning new way to create deepfake images. Generative adversarial networks, the most common category of algorithms used to create deepfakes in the past, must be trained on highly curated data. If you want to get a GAN to generate a face, for example, its training data should only include faces. iGPT, by contrast, simply learns enough of the structure of the visual world across millions of examples to spit out images that could feasibly exist within it.
What does all of this mean for existing product photo studios & human models? It’s safe to say that the space will have to prepare for a changing workforce just like a number of other industries. Models will have to exercise skills such as adaptability and creative intelligence to ensure that they too can sustain the shift to digital.
Ultimately, GPT-3 is still a language predictor. It doesn’t “think”, and it doesn’t have a “mind” of its own. It only generates content based on the input it receives. So while it cannot answer very tough question sequences, GPT-3 could take over mundane tasks such as generating variations of the same design or building simple product images based on common 3D rendering principles. The product photography industry is built on repetitive, time-consuming and technically complex steps that can be made dramatically faster, leaving the artist or the creator to spend more time on the art itself than on the steps needed to get there.
Humans are, at our very core, driven by visuals. AI is helping generate them more convincingly than ever before. GPT-3 and iGPT may not have written this piece or rendered its visuals all by themselves, but the day is not too far off when they might be perfectly capable of it.