Do Amazon Fresh shoppers have a duty to shop in a manner the cameras understand? I simultaneously grabbed three items from a shelf at my new cashierless checkout store while thousands of cameras monitored my movements, but after placing the items in my cart, I knew the cameras would have difficulty with the event. Most people don't grab three bags of frozen vegetables at a time, but I always do. As an engineer exploring what I know to be a store in beta, I went home prepared to report a bug. I knew I was likely stealing vegetables in leaving the store, but I assumed I could correct it after the fact. Only later did I learn that Amazon would not permit me to pay for the stolen items.

Now, in full knowledge of this blind spot, do I change my behavior to conform to the capacity of the cameras? Is my duty to shop like a human, and not like an engineer who can make the task easier on the store? If I continue my normal human behavior next time in the store, am I stealing from Amazon? Contemplating these questions motivated the thorough evaluation that follows -- an examination of how one company solved the design tension between human and machine environments.

A picture of the store from the produce section. You can see the thousands of cameras dropping down from the ceiling around the store. The shelving has no apparent weight sensors, and no items are charged by weight. In contrast to other Amazon Fresh stores (e.g., this one), there is no smart "dash cart" that scans items as they are placed into the cart.

My exploration of the store is both professional and personal. Four years ago I turned down an offer to work on cashierless checkout, and this blog post is a rundown of my thoughts on the road not taken. Please note, I have no insider knowledge of the Amazon Fresh operation. I am working from first principles as someone who has produced computer vision systems and made an extensive study of machine learning system failures. I will have some points of criticism, and some of my guesses may be off the mark, but it is worth saying at the outset that the Amazon team has clearly done amazing work.

The Consumer Perspective

My Amazon Fresh store is one of about 30 that have opened with variations of computer vision-based checkout -- where you pick up items and leave the store without having a discrete checkout phase to the visit. In total, across two shopping trips, I spent close to 2 hours and $300 on groceries. I did not modify my shopping habits beyond taking pictures of the store, which includes sections for produce, meat, fish, dairy, beer and alcohol, dry goods, and more.

The predecessor to Amazon Fresh, Amazon Go, was confounding to many. This was SNL's take on the store concept.

The shopping experience begins and ends with entry and exit gates requiring you to scan a QR code or credit card. During the shopping experience, you browse the store and place any items you intend to purchase into your cart or bag. When you are done, you scan to leave and you subsequently get a receipt notification in your email. Upon exiting, a helpful Amazon employee told me I should receive my receipt within an hour. In my case, it did not work out this way.

This is the receipt I received seven hours after leaving the store. High receipt latency is the first sign of trouble in Amazon's implementation. Amazon has no trouble getting compute time, so slow receipt issuance means Amazon has humans in the process, either for all customers or just for those where the computer vision system is uncertain about the customer's basket. To me, this is an indication that the store is still firmly in beta, and may even post steep operating losses. However, the store is likely aiming to solve a bootstrapping problem: these stores can only make effective use of computer vision after collecting many sample interactions from consumers. Taking losses for months or years is part of the plan in expectation that the machines will eventually take over.

If you are charged for food you did not purchase, then you can contest the charge on Amazon's website. If you make it home and the receipt does not charge you for something that was in your cart, then Amazon has no way of charging you for the product. This is not a bad deal for consumers who are paying attention, but it makes Amazon's effectiveness in monitoring customers in-store critically important to the feasibility of the store concept. And effectiveness requires designing the store for the benefit of the machines...

Machine Affordances

We think of affordances in design as being something that helps the human understand how to move through an environment or program. For instance, a well designed door has a handle that indicates whether you should push or pull to open the door. The Amazon Fresh store is similar, but instead of designing to the human, it designs to make the computer vision task simpler and more consistent.

Here is an exploration of Amazon's "machine affordances," as best I was able to identify them.

There are three things to note in this aisle. First, the light is a solid track light, placed at a consistent position with respect to shelves and the cameras. The continuous nature of the light bar greatly reduces shadows, which makes the computer vision task more consistent across shelving units. Second, the merchandising for all the cereals is curiously basic. None of them have special deals, toys, flavors (the Golden Grahams are labeled "Retro Recipe"), or other variations you would typically see in cereal packaging. Why? The computer vision system is almost certainly not trained to read the text on the box, so it is identifying cereals based on the color, size, and graphics (including lettering). If the boxes change, then it is essentially a completely new product, and fresh data collection and calibration of the system will be necessary. Amazon greatly prefers merchandising that is consistent through time. No small-batch, no local-only brands, no experimental products, and no changes. Finally, you will note that the shelves are incredibly orderly. Everything is pushed to the front. See the next photo for why.
When I worked at Best Buy in high school, it was a constant chore to bring the product to the front of the shelf. Orderly shelves move more product. Most Amazon Fresh product shelves are built with spring-loaded product pushers, which are cheaply constructed and contain no obvious electronics sensing product movements. So why don't most stores use these labor-saving devices? The unstated problem with these in the grocery setting is that grocery products typically move around shelves, go on special end caps, get low on stock, and otherwise make it necessary to move products around. These spring devices make it much more difficult and time consuming to shift stocks. However, if you have a finely tuned computer vision system, you don't care because you are not going to be moving products around. You want the product to be in a constant location and to look the same through time. This penchant for sameness can be seen throughout the product catalog of the store -- I don't recall seeing any products that had special graphics on the packaging. The boxes are meant to be the same for as long as possible so they can avoid expensive re-tuning of the computer vision system.
Every cart in the store is identified by its own large-format barcode, which is likely tied to its shopper when entering the store. Carts are not required in the store; you can also place items directly into your bag (or bags you can buy throughout the store). So why the barcodes? They are not required for the store to function. My best guess is that the cart label is useful for human reviewers looking to resolve uncertain shopping events. The many cameras in the store will be able to see items recently placed into the cart.
Although Amazon could employ sensor fusion (computer vision + weight sensors) to solve the "soup problem," Amazon has elected to charge by container rather than by weight. Consequently, they cannot sell foods that are expensive on a per-gram basis. Only the discrete, per-container version of the problem is solved.
Every camera body has two lenses pointed at an angle along the shelf line across the aisle. It is difficult to know the field of view of each camera, but I believe shelves may be covered by one camera from the left, and another camera from the right. This would allow for both better visual coverage and for running inference twice. If two cameras disagree about the item that was picked, whether the item was returned to the shelf, or some other outcome, then human annotators could step in and manually annotate the outcome. In short, it appears Amazon is first checking for machine agreement, then having a human provide definitive labels only when the models are more likely to have failed.
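The agree-first, escalate-second idea described above can be sketched in a few lines of Python. This is only an illustration of my guess, not Amazon's implementation: the `CameraReading` schema, the `resolve_event` function, and the `min_confidence` threshold are all hypothetical names and values invented for the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class CameraReading:
    """One camera's guess about a single shelf event (hypothetical schema)."""
    item_id: str       # product the model believes was handled
    action: str        # "pick" or "return"
    confidence: float  # model score in [0, 1]


def resolve_event(left: CameraReading, right: CameraReading,
                  min_confidence: float = 0.8) -> Optional[CameraReading]:
    """Return the agreed reading, or None to signal that a human should review.

    The 0.8 threshold is an arbitrary illustrative value, not a known setting.
    """
    agree = left.item_id == right.item_id and left.action == right.action
    confident = min(left.confidence, right.confidence) >= min_confidence
    if agree and confident:
        # Both views tell the same story; keep the higher-scoring reading.
        return left if left.confidence >= right.confidence else right
    # Disagreement or a weak score: escalate to a human annotator.
    return None
```

Under this sketch, a `None` result is what would put an event in a human reviewer's queue, which would be consistent with a receipt arriving hours late.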
Some aisles are more difficult to solve with computer vision than others. For instance, the freezer aisle requires customers to move large glass panes that produce reflections, drop shadows, and occlude cameras. In these aisles Amazon has doubled the number of cameras dropping from the ceiling. My theory is that the increase in camera count likely provides a guarantee that at least two cameras will capture customer events even when freezer doors are open. The multiple cameras can then serve to identify when human annotators may be necessary to resolve what was purchased.
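To see why doubling the cameras helps, here is a back-of-the-envelope sketch. It assumes, purely for illustration, that each camera independently has the same chance of an unobstructed view of an event; real occlusions from a freezer door are surely correlated, and the probability used below is made up, not measured. The function name `prob_at_least_two` is my own.

```python
def prob_at_least_two(n_cameras: int, p_clear: float) -> float:
    """Probability that at least two of n independent cameras see an event.

    p_clear is the assumed chance that any single camera has a clear view.
    Independence between cameras is a simplifying assumption.
    """
    if not 0.0 <= p_clear <= 1.0:
        raise ValueError("p_clear must be between 0 and 1")
    if n_cameras < 2:
        return 0.0
    p_none = (1 - p_clear) ** n_cameras
    p_exactly_one = n_cameras * p_clear * (1 - p_clear) ** (n_cameras - 1)
    return 1 - p_none - p_exactly_one


# Illustrative only: with a 50% chance of a clear view per camera,
# going from 2 to 4 cameras lifts two-view coverage from 25% to about 69%.
print(prob_at_least_two(2, 0.5))  # 0.25
print(prob_at_least_two(4, 0.5))  # 0.6875
```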
The entrance and exit gates both have these QR code scanners that you hold your phone up to. If something goes wrong, like your phone dies in-store, there is a backup register you can visit to make your purchases. So the store is not cashierless at present, but more "cashier optional."

Anatomy of a Theft

Now, back to the "theft." The stolen items include two of three freezer aisle Quinoa and Spinach bags, and one of two bags of muscat grapes. Both of these are non-rigid products, which are more challenging for computer vision. Without weighing the cart or the shelf as items are removed, I would go so far as to say solving these cases may be impossible when several of the items are pulled simultaneously. While Amazon has some mitigations in place, such as the cart labeling, solving this case is likely to be dependent on humans making a careful study of what they can see in the cart.
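For contrast, here is a minimal sketch of how a shelf scale would turn my three-bag grab into simple arithmetic. This is a counterfactual: as far as I could tell, this store has no weight sensors, and the function name `items_removed`, the gram values, and the tolerance are all hypothetical.

```python
from typing import Optional


def items_removed(weight_before_g: float, weight_after_g: float,
                  unit_weight_g: float, tolerance: float = 0.25) -> Optional[int]:
    """Estimate how many identical items left a shelf from its weight change.

    Returns None when the change is not close to a whole number of units,
    which would be the cue to fall back on cameras or a human reviewer.
    tolerance is the allowed residual as a fraction of one unit's weight.
    """
    if unit_weight_g <= 0:
        raise ValueError("unit_weight_g must be positive")
    delta = weight_before_g - weight_after_g
    count = round(delta / unit_weight_g)
    residual = abs(delta - count * unit_weight_g)
    if count < 0 or residual > tolerance * unit_weight_g:
        return None
    return count


# Three hypothetical 340 g bags pulled at once from a 3,400 g shelf.
print(items_removed(3400.0, 2380.0, 340.0))  # 3
```

The count falls out of the weight change no matter how the bags overlap in the camera's view, which is exactly the information a vision-only system lacks when items are grabbed together.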

What about the ethics of leaving the store with the products? Unless and until Amazon tells me to "shop like a human," I now feel a sense of responsibility in knowing a failure mode of my store. I will pick individual items from the shelf and not grab more than one item at a time. If any friends are reading this at Amazon and would like an adversarial shopper, please reach out and be prepared for highly varied test data that will strain both your models and human annotators. Otherwise, I will be among your easiest customers.

Conclusion

Google famously launched a telephone service, GOOG-411, not because they wanted to offer the public an information service, but because they wanted to collect a large and diverse array of voices from the public. After Google collected enough voice data, they shut the service down. Amazon Fresh is much the same -- a data collection operation.

However, Amazon does not intend to stop once they get the data. Amazon has several things legacy grocers lack: compute power, machine learning talent, and deep reserves to operate 30+ stores at a loss while they "solve shopping." Have they solved it yet? I will believe it when I get the receipt before hitting the exit door. Until then, there are unseen people serving as your cashier.

The store as viewed from the outside. It is located just off the main freeway that runs the entire West Coast of North America and is in prime position to serve as a delivery hub for a huge swath of mid-market suburban Southern California homes. The in-person shopping experience may be more of an opportunistic data collection effort built on a delivery program than a business meant to turn a profit on its own.

If one of these stores opens in your neighborhood, I highly recommend the experience. You are witnessing history in the making. Just as early drivers of automobiles would occasionally need to hand crank the engine to start, there will be awkward points that obscure the transformation. Now is the best time to explore how that new world is being built. Any sufficiently advanced technology is indistinguishable from magic, and we have a rare opportunity to see that magic in the making.