Google is investigating ways AI might be used to ground natural language instructions to smartphone app actions. In a study accepted to the 2020 Association for Computational Linguistics (ACL) conference, researchers at the company propose corpora to train models that would alleviate the need to maneuver through apps, which could be useful for people with visual impairments.
When coordinating efforts and accomplishing tasks involving sequences of actions, for example following a recipe to bake a birthday cake, people provide each other with instructions. With this in mind, the researchers set out to establish a baseline for AI agents that can help with similar interactions. Given a set of instructions, these agents would ideally predict a sequence of app actions as well as the screens and interactive elements produced as the app transitions from one screen to another.
In their paper, the researchers describe a two-step solution comprising an action-phrase extraction step and a grounding step. Action-phrase extraction uses a Transformer model to identify the operation, object, and argument descriptions in multi-step instructions. (An “area attention” module within the model lets it attend to a group of adjacent words in the instruction as a whole when decoding a description.) Grounding then matches the extracted operation and object descriptions with a UI object on the screen, again using a Transformer model, but one that represents UI objects in context and grounds object descriptions to them.
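To make the two-step split concrete, here is a minimal Python sketch. It is not the researchers' implementation: the `ActionPhrase` and `UIObject` types, the `ground` function, and the sample screen are all hypothetical, and a simple word-overlap score stands in for the Transformer-based grounding model. It also assumes the action phrases have already been extracted, and it grounds every phrase against one fixed screen, whereas in the real task the screen changes after each action.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class ActionPhrase:
    """Hypothetical output of the extraction step for one action."""
    operation: str                  # e.g. "click" or "input"
    object_desc: str                # words describing the target UI object
    argument: Optional[str] = None  # e.g. the text to type, if any


@dataclass
class UIObject:
    """Hypothetical, simplified view of one on-screen element."""
    label: str  # visible text or accessibility label
    kind: str   # e.g. "button" or "text_field"


def token_overlap(a: str, b: str) -> int:
    """Count distinct lowercase words shared by two strings."""
    return len(set(a.lower().split()) & set(b.lower().split()))


def ground(phrase: ActionPhrase, screen: List[UIObject]) -> Tuple[str, int, Optional[str]]:
    """Pick the screen object whose label best overlaps the description.

    Word overlap is only a stand-in here; the paper uses a Transformer
    that represents UI objects in context.
    """
    best_index = max(
        range(len(screen)),
        key=lambda i: token_overlap(phrase.object_desc, screen[i].label),
    )
    return (phrase.operation, best_index, phrase.argument)


# Assumed result of extracting "Click the settings button, then type wifi
# in the search box" into per-action phrases.
phrases = [
    ActionPhrase("click", "settings button"),
    ActionPhrase("input", "search box", "wifi"),
]
screen = [
    UIObject("Search box", "text_field"),
    UIObject("Settings", "button"),
    UIObject("Help", "button"),
]

actions = [ground(p, screen) for p in phrases]
print(actions)
```

Each resulting tuple names an operation, the index of the chosen UI object, and an optional argument, which is roughly the shape of an executable action as the article describes it.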
The coauthors created three new data sets to train and evaluate their action-phrase extraction and grounding model:
- The first contains 187 multi-step English instructions for operating Pixel phones along with their corresponding action-screen sequences.
- The second contains English “how-to” instructions from the web and annotated phrases that describe each action.
- The third contains 295,000 single-step commands mapped to UI actions, covering 178,000 UI objects across 25,000 mobile UI screens from a public Android UI corpus.
They report that a Transformer with area attention obtains 85.56% accuracy for predicting span sequences that completely match the ground truth. Meanwhile, the phrase extractor and grounding model together obtain 89.21% partial and 70.59% complete accuracy for matching ground-truth action sequences on the more challenging task of mapping language instructions to executable actions end-to-end.
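The difference between complete and partial accuracy can be illustrated with a short Python sketch. The article does not spell out how partial credit is computed, so the definition below is an assumption: it counts how many leading actions match the ground truth before the first mistake and divides by the ground-truth length. The function names and the sample sequences are hypothetical.

```python
from typing import Sequence, Tuple

# An action is represented here as (operation, UI object index); this
# representation is an illustrative assumption.
Action = Tuple[str, int]


def complete_match(pred: Sequence[Action], truth: Sequence[Action]) -> bool:
    """True only if the predicted sequence equals the ground truth exactly."""
    return list(pred) == list(truth)


def partial_match(pred: Sequence[Action], truth: Sequence[Action]) -> float:
    """Fraction of ground-truth actions matched before the first error.

    Assumed definition: credit stops at the first wrong action.
    """
    if not truth:
        return 0.0
    correct = 0
    for p, t in zip(pred, truth):
        if p != t:
            break
        correct += 1
    return correct / len(truth)


truth = [("click", 1), ("input", 0), ("click", 2)]
pred = [("click", 1), ("input", 0), ("click", 5)]

print(complete_match(pred, truth))
print(partial_match(pred, truth))
```

Under this reading, a prediction that gets the first two of three actions right earns no complete-match credit but two-thirds partial credit, which is consistent with the reported partial accuracy being higher than the complete accuracy.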
The researchers assert that the data sets, models, and results, all of which are available as open source on GitHub, provide an important first step on the challenging problem of grounding natural language instructions to mobile UI actions.
“This research, and language grounding in general, is an important step for translating multi-stage instructions into actions on a graphical user interface. Successful application of task automation to the UI domain has the potential to significantly improve accessibility, where language interfaces might help individuals who are visually impaired perform tasks with interfaces that are predicated on sight,” Google Research scientist Yang Li wrote in a blog post. “This also matters for situational impairment when one cannot access a device easily while encumbered by tasks at hand.”