Researchers at Google have developed an AI capable of predicting which machine learning models will produce the best results. In a newly published paper ("Off-Policy Evaluation via Off-Policy Classification") and accompanying blog post, a team of Google AI researchers proposes what they call "off-policy classification," or OPC, which evaluates the performance of AI-driven agents by treating evaluation as a classification problem.
The team notes that their approach, a variant of reinforcement learning (which uses rewards to drive software policies toward goals), works with image inputs and scales to tasks such as vision-based robotic grasping. "Fully off-policy reinforcement learning is a variant in which an agent learns entirely from older data, which is appealing because it enables model iteration without requiring a physical robot," writes Robotics at Google software engineer Alex Irpan. "With fully off-policy RL, one can train several models on the same fixed dataset collected by previous agents, then select the best one."
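To make that workflow concrete, here is a minimal sketch (not code from the paper or blog post) of fully off-policy model selection: several candidate models are trained on one fixed dataset of logged experience, then ranked by an off-policy evaluation score, with no new robot rollouts needed to choose among them. The names `train_q_function`, `off_policy_score`, `fixed_dataset`, and `hyperparameter_settings` are hypothetical placeholders.

```python
def select_best_model(fixed_dataset, hyperparameter_settings,
                      train_q_function, off_policy_score):
    """Train one model per setting on the same logged data and return
    the model with the highest off-policy evaluation score."""
    best_model, best_score = None, float("-inf")
    for setting in hyperparameter_settings:
        model = train_q_function(fixed_dataset, setting)   # off-policy training
        score = off_policy_score(model, fixed_dataset)     # no real-world trials
        if score > best_score:
            best_model, best_score = model, score
    return best_model
```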
Arriving at OPC was somewhat more challenging than it sounds. As Irpan and his fellow coauthors note, off-policy reinforcement learning makes it possible to train an AI model with data collected by, say, a robot, but not to evaluate it; they point out that ground-truth evaluation is generally too inefficient.
Their solution, OPC, addresses this by assuming that the tasks at hand have little to no randomness in how states change, and by assuming that agents either succeed or fail at the end of experimental trials. The binary nature of the second assumption allows two classification labels ("effective" for success, "catastrophic" for failure) to be assigned to every action.
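As an illustration of that labeling step, the sketch below assigns a binary label to every logged (state, action) pair based on whether its episode ended in success. The episode structure and field names are assumptions made for the example, not the paper's actual data format, and labeling every action by its episode's outcome is a simplification.

```python
def label_transitions(episodes):
    """Return (state, action, label) triples for off-policy classification."""
    labeled = []
    for episode in episodes:
        label = 1 if episode["success"] else 0   # binary success/failure assumption
        for state, action in episode["steps"]:
            labeled.append((state, action, label))
    return labeled
```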
OPC additionally relies on what's called a Q-function (learned with a Q-learning algorithm) to estimate actions' future total rewards. Agents choose the actions with the largest predicted rewards, and their performance is measured by how often the chosen actions are effective, which in turn depends on how well the Q-function correctly classifies actions as effective versus catastrophic. That classification accuracy serves as an off-policy evaluation score.
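A rough sketch of how such a score could be computed from the labeled data above: the Q-function's value for each logged action is thresholded and compared against the binary label, and the resulting accuracy is the evaluation score. The fixed threshold here is an illustrative simplification, not the procedure described in the paper.

```python
def opc_score(q_function, labeled_transitions, threshold=0.5):
    """Classification-accuracy-style evaluation of a learned Q-function."""
    correct = 0
    for state, action, label in labeled_transitions:
        predicted_effective = q_function(state, action) > threshold
        correct += int(predicted_effective == bool(label))
    return correct / len(labeled_transitions)
```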
The team trained machine learning policies in simulation using fully off-policy reinforcement learning, then evaluated them with off-policy scores computed from previous real-world data. They report that one variant of OPC in particular, SoftOPC, performed best at predicting success rates in a robotic grasping task. Given 15 models of varying robustness (seven of which were trained purely in simulation), SoftOPC produced scores that closely correlated with true grasp success and were "considerably" more reliable than baseline methods.
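The article does not spell out how SoftOPC differs from the hard-threshold score; one plausible reading (an assumption here, not a claim about the paper) is that it works with the raw Q-values rather than thresholded ones, for example by comparing the average Q-value on effective transitions against the average over all logged transitions:

```python
def soft_opc_score(q_function, labeled_transitions):
    """Hypothetical 'soft' variant: higher when the Q-function assigns
    larger values to effective transitions than to the dataset overall."""
    all_q, effective_q = [], []
    for state, action, label in labeled_transitions:
        q = q_function(state, action)
        all_q.append(q)
        if label == 1:
            effective_q.append(q)
    # Assumes the dataset contains at least one effective transition.
    return sum(effective_q) / len(effective_q) - sum(all_q) / len(all_q)
```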
In future work, the researchers intend to explore tasks with "noisier" and nonbinary dynamics. "[W]e think the results are promising enough to be applied to many real-world RL problems," wrote Irpan.