Voice Enabled Commerce for Complex Orders

Introduction

IoT voice interaction was once the stuff of sci-fi movies, but now many of us no longer bat an eye. From computers to phones to digital assistants, talking to a device has gone from a futuristic dream to an in-our-homes reality. And it appears this field has only begun to scratch the surface of its widespread potential: according to a recent OC&C Strategy Consultants study, voice shopping could surpass $40 billion across the US and UK by 2022 (up from $2 billion today).

For Elastic Path’s recent Hackdays, our team looked at voice-enabled commerce powered by Cortex. Specifically, we focused on enabling expert users to interact as they normally would when placing complex orders, such as coffee orders. We wanted them to talk to the system rather than through traditional digital interactions. While Cortex ran the commerce side of things, we used Google’s Dialogflow to handle voice recognition and created a small NodeJS server to glue it all together. The conceptual secret sauce, though, was a context-driven approach complementing catalog-driven language processing.

Context Matters

Behind the words, buying things in real life is quite complicated. When a customer says, “I’d like a triple shot espresso, please,” the underlying concepts at play, translated for a commerce system, include the desire for the item, the item variety itself, the intent to order the item, and a desire (or willingness) to pay.

“I’d like a triple espresso, please” <==> “I desire the espresso product, but I want it of the triple shot variety. Also, I’d like to order this configured item and I am ready to pay for it.”

Dialogflow resolves this sentence into structured output like the following:

{
  "queryResult": {
    "queryText": "I’d like a triple espresso, please",
    "parameters": {
      "number": "",
      "size": "triple",
      "product": "expresso-bundle"
    },
    "allRequiredParamsPresent": true,
    "fulfillmentText": "Okay, so you want triple espresso. Would you like to pay?",
    "fulfillmentMessages": [
      {
        "text": {
          "text": [
            "Okay, so you want triple espresso. Would you like to pay?"
          ]
        }
      }
    ],
    "outputContexts": [
      {
        "lifespanCount": 2,
        "parameters": {
          "number.original": "",
          "product.original": "espresso",
          "size.original": "triple",
          "number": "",
          "size": "triple",
          "product": "expresso-bundle"
        }
      }
    ],
    "intent": {
      "displayName": "I want"
    },
    "intentDetectionConfidence": 0.8966336,
    "diagnosticInfo": {
      "webhook_latency_ms": 30
    },
    "languageCode": "en"
  }
}

This is one of the key challenges for eCommerce voice interactions: context sensitivity. The ability to recognize key points from a single command makes transactions smoother, encouraging adoption, reducing friction, and allowing voice interactions to mimic real-world experiences. For a commerce system, “context” roughly translates to “what else” or the “next actions”. This just so happens to be Cortex’s specialty.

From Context to Commerce: Cortex Zooms to Next Actions

When you ask for an espresso, the set of underlying requirements includes identifying the product, ordering, and paying. Cortex, with its flexibility in presenting the client with next actions (adhering to the best practices of a mature REST Level 3 API), provides zoom parameters to link between desired actions (see the Cortex documentation).

Search for the ‘espresso’ product ==> include the ‘triple shot’ option ==> add it to my order ==> purchase the order

This string of actions fulfills a happy-path model for ordering an espresso, and the general actions (“find a product and add it to the cart”) are naturally supported by Cortex. However, a critical piece in providing flexible interactions is the ability to configure products and their add-ons on the fly, creating bundled products which dynamically reflect changing prices and options. Furthermore, we need to provide this functionality in a way that is predictable and consistent enough to establish a programmatic pattern (i.e.
a determined chain of resource calls/zooms that we can use for any queried product), but flexible enough to support different kinds of configurations (additional shots, drink sizes, etc.).

To accomplish this, we used a customized implementation of Dynamic Bundles in our APIs, which provided support for selecting from a list of bundle constituent options and dynamically adjusting the corresponding products. Using this accelerator in concert with out-of-the-box Cortex endpoints provided dynamic product configuration within a predictable pattern for adding all desired options and accessing “next actions”.

Given this translation from context to commerce, the next challenge is recognizing context in voice commands.

The Gift of Gab: Natural Language Processing

Many large technology companies offer NLP services to extract intents and details. We chose Google’s Dialogflow over Facebook’s Wit.ai and IBM’s Watson due to its ease of testing, development, and extensibility. While both Wit.ai and Watson offer powerful language-processing features, Dialogflow’s detailed feedback, deep community support, and streamlined connectivity with Android devices supported our rapid development and eventual demos with minimal additional configuration.

Dialogflow uses “intents” and “entities” to decipher and tag input text. At a high level, intents describe the goal of the input; this aligns very closely with the idea of “context”. By implementing an “I want” intent and training the NLP model with sentences that implied this resolve, we connected the context with various input possibilities. This provided programmatic contextualization of voice input.

Training the model can be done through the Dialogflow GUI by providing sample inputs and assigning them to an intent. As the model receives additional input, it becomes smarter and more accurate at recognizing past inputs as well as novel, similar ones.
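Bridging the two systems, the glue server’s job is to translate a resolved intent into the happy-path chain of Cortex actions described earlier. The following is a minimal sketch of that mapping; the function and action names are hypothetical stand-ins, since the real flow follows Cortex’s zoom-driven links rather than fixed action strings:

```javascript
// Sketch: map a Dialogflow queryResult to an ordered plan of commerce
// actions. "searchProduct", "selectBundleOption", "addToCart", and
// "purchase" are illustrative names, not real Cortex endpoints.
function planOrderActions(queryResult) {
  const { product, size } = queryResult.parameters;
  if (queryResult.intent.displayName !== 'I want' || !product) {
    return []; // nothing actionable was recognized
  }
  const actions = [{ action: 'searchProduct', keyword: product }];
  if (size) {
    // Dynamic-bundle step: select the matching constituent option (e.g. "triple").
    actions.push({ action: 'selectBundleOption', option: size });
  }
  actions.push({ action: 'addToCart' });
  if (queryResult.allRequiredParamsPresent) {
    actions.push({ action: 'purchase' });
  }
  return actions;
}
```

Feeding it the sample queryResult shown earlier yields the four-step happy path: search for the product, select the “triple” option, add to cart, and purchase.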
Behind the scenes, these intents and their trained inputs are represented as JSON (and may even be uploaded in a similar manner). Below is a JSON sample extracted from the “I want” intent’s list of trained inputs. This input associates the sentence “I want one triple espresso” with the desired intent, tagging the various pieces of the sentence.

{
  "data": [
    { "text": "i want ", "userDefined": false },
    { "text": "one", "alias": "number", "meta": "@sys.number", "userDefined": false },
    { "text": " ", "userDefined": false },
    { "text": "triple", "alias": "size", "meta": "@size", "userDefined": false },
    { "text": " ", "userDefined": false },
    { "text": "espresso", "alias": "product", "meta": "@order", "userDefined": false }
  ],
  "isTemplate": false,
  "count": 0,
  "updated": 0,
  "isAuto": false
}

Further, orders are rarely simple, and recognizing variations on an order requires not only context understanding, but also detail recognition and relevancy knowledge. Product variations, like extra shots, different sizes, milk varieties, etc., require the NLP system to know which details to flag. These dynamic pieces of the input commands constitute “entities”, which are also defined through the Dialogflow GUI and associated with the appropriate intents.

This is where the key details of a catalog come into play. With manually imported catalog data and specified, corresponding configuration options, Dialogflow learned to parse specific products and variations from inputs, providing this information in the JSON output as well. In the example above, we see the system tagging things like “size” and “product”; these are predefined entities associated with the “I want” intent. Hence, when we provide training input that resolves to this intent, the system picks up the related, expected entities and validates these for increased specificity and accuracy going forward. Dialogflow also provides tools for testing new inputs and visualizing the output as JSON.
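Because these trained inputs are plain JSON, they can also be generated programmatically rather than typed into the GUI one by one. Here is a small sketch that assembles a training phrase in the shape shown above from a template sentence; the `{alias:@meta=text}` placeholder convention is invented for this example and is not a Dialogflow format:

```javascript
// Sketch: build a Dialogflow-style training phrase from a template string.
// Plain text becomes untagged segments; placeholders like
// "{size:@size=triple}" become tagged entity segments.
function buildTrainingPhrase(template) {
  const parts = template.split(/(\{[^}]+\})/).filter(p => p !== '');
  const data = parts.map(part => {
    const m = part.match(/^\{(\w+):(@[\w.]+)=([^}]*)\}$/);
    if (!m) {
      return { text: part, userDefined: false };
    }
    return { text: m[3], alias: m[1], meta: m[2], userDefined: false };
  });
  return { data, isTemplate: false, count: 0, updated: 0, isAuto: false };
}
```

For instance, `buildTrainingPhrase('i want {size:@size=triple} {product:@order=espresso}')` produces a `data` array tagging “triple” as a `@size` entity and “espresso” as a `@order` entity, matching the structure of the sample above.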
After training the model, the input “I’d like a triple espresso, please” produces structured output like the sample shown earlier. This provides the necessary structural predictability, allowing us to consume, tag, and decipher vocal input data. We then used Dialogflow’s Fulfillment module to pass these details on to our NodeJS server, which parsed the data and kicked off the expected Cortex flow to fulfill these desires.

Taking this a step further, Dialogflow allows users to import detail-recognition knowledge (i.e. entity definitions) as JSON data. For example, the following is a snippet of the JSON definition for a “size” entity:

{
  "id": "123",
  "name": "size",
  "isOverridable": true,
  "entries": [
    { "value": "doppio", "synonyms": [ "doppio", "double" ] },
    { "value": "grande", "synonyms": [ "grande", "large" ] },
    { "value": "quad", "synonyms": [ "quad", "quadruple" ] },
    { "value": "short", "synonyms": [ "short", "small" ] },
    { "value": "tall", "synonyms": [ "regular", "tall" ] },
    { "value": "triple", "synonyms": [ "triple" ] },
    { "value": "venti", "synonyms": [ "vendi", "venti" ] }
  ],
  "isEnum": false,
  "automatedExpansion": true,
  "allowFuzzyExtraction": false,
  "isRegexp": false
}

Given this capability, a user could group catalog-based add-ons under specific entities to provide automated, catalog-driven language processing. For example, if espresso products are linked to a set of SKU options relating to size, we can write a script that parses this source data (in our case, an XML file) and outputs a JSON entity definition for “size”, assigning the parsed SKU options as entity “values” and “synonyms”.
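A minimal sketch of such a script follows. The `<option name="..." synonyms="..."/>` XML shape is a hypothetical stand-in for a real catalog export, and a production script would use a proper XML parser rather than a regular expression:

```javascript
// Sketch: generate a Dialogflow entity definition from catalog SKU options.
// Assumes each option is exported as <option name="..." synonyms="a,b"/>,
// an invented shape for illustration only.
function catalogToEntity(entityName, xml) {
  const entries = [];
  const optionRe = /<option\s+name="([^"]+)"\s+synonyms="([^"]*)"\s*\/>/g;
  let m;
  while ((m = optionRe.exec(xml)) !== null) {
    entries.push({
      value: m[1],
      synonyms: m[2].split(',').map(s => s.trim()),
    });
  }
  return {
    name: entityName,
    isOverridable: true,
    entries,
    isEnum: false,
    automatedExpansion: true,
  };
}
```

Running `catalogToEntity('size', xmlExport)` over a catalog export containing doppio, triple, venti, and the rest would yield an `entries` array matching the hand-written “size” entity above, ready to import into Dialogflow.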