A pinch here and there
Maybe a grasp you will find
Promise you won't suck

I realized pretty quickly in grad school that I wasn't cut out for academia, and I'm probably a few job transitions away from bypassing end-effector, grasping, and manipulation work entirely, so I thought I'd jot down my observations on the last half-decade of grasping/manipulation work from a hardware perspective while they're still fresh in my mind. Who knows, maybe this will lead to one final push for innovation from me, as small as it may be.

An Oversimplification of the World 10 Years Ago

As far as physical models go, the grasping/manipulation problem isn't all that crazy or intractable. We're applying a small collection of force vectors onto some object in an attempt to either lock it in place or to give it some relative motion. It's assumed that if we have full control over the force applied at these contact points, the object can move however we wish. That's basic mechanics/dynamics, taught in intro physics courses, so why haven't we figured it out yet?

A couple of demos 10 or so years ago really ruined the field for those of us working in it, in my opinion. The Ishikawa lab showed off the culmination of years of work in a 2009 video demonstrating high-speed manipulation. I feel that's often still the go-to example for what can be achieved when your system is well characterized and you're able to fully track your system state. It looked like our original assumptions were pretty sound. Various other labs would demonstrate in short order that more complex behaviors are possible without significant hardware advancements beyond the state of the art. In parallel, groups were also trying to make strides in implementing "mechanical intelligence" (aka slightly more advantageous/optimized transmissions) for more effective end-effectors. Critics would point out that a lot of the demos during this time period were limited in complexity and utilized only a few object geometries. Google would show a few years later, during that precipitous rise in interest in machine learning, that you could parallelize testing in the real world to further refine your models. Add to that the rise of readily available RGBD sensors, and it sure felt to me by the mid-2010s that we had both a solid tech stack and a defined direction towards autonomous grasping and manipulation.

A typical workflow was/is as follows:

  1. Acquire a point cloud somehow or use a priori knowledge and use that to generate the world/problem state
  2. Adjust your model as appropriate to virtually recreate the problem state
  3. Determine an "optimal" sequence of actuator inputs to accomplish the target task
  4. Run those actuator inputs on the real world system
  5. Hope for the best
  6. Blame poor results on inability of sensors to properly execute step 1
  7. Blame friction
  8. ???
  9. Maybe profit (or graduate)
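
For the sake of concreteness, here's a minimal sketch of what steps 1 through 4 tend to look like in code; every function below is a hypothetical placeholder for whatever perception, planning, and control stack a given lab actually runs.

```python
# Hypothetical sketch of steps 1-4; every function is a placeholder for whatever
# perception, planning, and control stack a given lab actually runs.
import numpy as np

def acquire_point_cloud() -> np.ndarray:
    """Step 1: grab an (N, 3) point cloud from a depth sensor (stubbed out here)."""
    return np.random.rand(1000, 3)  # stand-in for a real sensor driver

def fit_world_state(points: np.ndarray) -> dict:
    """Step 2: recreate the problem state; here, just a centroid and bounding-box extents."""
    return {"centroid": points.mean(axis=0),
            "extents": points.max(axis=0) - points.min(axis=0)}

def plan_actuator_inputs(state: dict) -> list:
    """Step 3: pick an "optimal" input sequence (here: approach, close, lift)."""
    return [("move_above", state["centroid"]), ("close_gripper", 0.8), ("lift", 0.1)]

def execute(plan: list) -> bool:
    """Step 4: run the commands on the real system (stubbed); steps 5-9 are left to fate."""
    for command, arg in plan:
        print(f"executing {command} with {arg}")
    return True

if __name__ == "__main__":
    execute(plan_actuator_inputs(fit_world_state(acquire_point_cloud())))
```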

Manipulation systems got increasingly complex in a bid to apply the Ishikawa approach to less contrived and more useful scenarios. I was in one of many labs that would regularly publish novel end-effector/mechanism designs to better accomplish some manipulation primitive, and I'm sure that to this day, there are still entire sessions dedicated to new end-effector designs at every robotics conference. Incredibly complicated (and fragile) anthropomorphic hands like the Shadow Hand or DLR Awiwi Hand were designed to leverage researchers' inherent know-how of human manipulation motions. We even went as far as to crochet tendons by hand for our hands. Researchers/companies also introduced new tactile sensing technologies (1,2,3) and increased the density of traditional contact/pressure sensors (1,2,3) to better evaluate the state of contact interfaces.

Despite all this work, I can't point to any sort of commercial or research deployment today performing those sorts of tasks of daily living that we apparently were working towards. Academic demos, as 'unstructured' as researchers would like them to be, remain staged and relatively simple. Technical expertise may differ, but all grad students should be adept at manically reducing the dimensionality of the problem (usually in the weeks leading up to a conference deadline) until the model works within the most convenient scope (and there's your daily dose of snark). That said, I think academia has made some significant strides in establishing common benchmarks, like physical object sets and manipulation task boards (1,2), so that different systems from different labs can be evaluated in a similar way. All of this suggests to me that we're still missing a big level-up in our understanding of properly controlling the contact conditions between the robot and object.

In my mind, there are a few not-discussed-frequently-enough issues that remain significant barriers to deployment of autonomous manipulation strategies in production settings:

  1. Friction is unreliable and laughs at you behind your back. The real world doesn't behave like a minimal set of idealized point contacts. As a result, I think efforts have diverged as of late: either make an excessive number of contacts, placing your faith in elastic averaging (redundant constraints), or make the minimal set of contacts and ensure that only that set of contacts gets formed in the way that you require (exact constraints). The former introduces more variability in the end result, and the latter requires more structure in setting up the problem. Work is ongoing on making friction more tractable to model in simulations, but it's not terribly uncommon in academic papers to still see end-effectors with ad-hoc grip pads stuck on, or object geometry selected and sized to be more amenable to the hand's capabilities, all to make grasping more repeatable. (A minimal friction-cone check is sketched just after this list.)
  2. A single scene snapshot isn't sufficient. Not only can the real world shift/morph, it may appear different from alternate viewing angles. A single scan, though often limited by hardware selection, is susceptible to signal noise, occlusions, and variable environmental lighting. I'd argue that scene reconstruction is more complex for manipulation than it is for mobile robotics SLAM applications, in which the robot's primarily looking to avoid boundaries as opposed to trying to create contact at a particular point in a particular way.
  3. Immobilizing and constraining an object is a process, not an instantaneous state change. More on magnets and vacuums in a bit, but when using grippers or our own hands, we don't typically grab onto things that are magically locked in free space until we've secured a hold. Even during a controlled hand-off (passing the baton in a race, for example), we'd expect the object to shift and move as one set of contacts is exchanged for another. Some procedures mitigate this with pre-shaping prior to the grasp, but that's not always sufficient.
  4. Manipulation includes a lot of motions both before and after a grasp. Imagine if people were only allowed to grab objects as they're found, in-place, and if they can't adjust their grasp after picking up the object. Research strives to propose solutions for particular problems and evaluate those solutions' effectiveness, but what if the problem conditions can be changed, and the so-called solution is merely the opening step towards the end goal? There's nothing wrong with a single step if we keep missing the leap up the staircase.
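
To make the friction gripe in item 1 a bit more concrete, the idealized point-contact model amounts to a Coulomb friction-cone check like the sketch below; the friction coefficient and contact normal are exactly the quantities you rarely know well in practice (the values here are made up for illustration).

```python
import numpy as np

def inside_friction_cone(force: np.ndarray, normal: np.ndarray, mu: float) -> bool:
    """Coulomb point-contact model: the tangential load must satisfy
    ||f_t|| <= mu * f_n, or the contact slips."""
    n = normal / np.linalg.norm(normal)
    f_n = float(np.dot(force, n))      # component pressing into the surface
    if f_n <= 0.0:
        return False                   # pulling away from the surface: no contact
    f_t = force - f_n * n              # tangential (shear) component
    return np.linalg.norm(f_t) <= mu * f_n

# A mostly-normal load with a little shear sticks...
print(inside_friction_cone(np.array([0.1, 0.0, 1.0]), np.array([0.0, 0.0, 1.0]), mu=0.3))  # True
# ...while the same normal load with more shear slips.
print(inside_friction_cone(np.array([0.5, 0.0, 1.0]), np.array([0.0, 0.0, 1.0]), mu=0.3))  # False
```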

Key Hardware Tech Advances in Manipulation the Past 10 Years

I will readily admit that my naivete was running full-bore in the middle of my PhD. A part of me definitely believed that we just needed to reduce the dimensionality of the manipulation problem through modeling (and we were close). There had to be some near-optimal coupling in force/position space control of end-effectors that would effectively solve the grasping problem for the bulk of objects. My job should have been to find the subset of linkages and motors in a high-dof hand like the Shadow Hand that would most easily accomplish my target manipulation goals. I've now come to believe that not only is our toolset incomplete, but we still have plenty of combinations/options to explore in terms of how to mix our existing tools.

(3D) Vision is Eating the World

Back in 2013, I was running a Kinect v1 on ROS 1 Fuerte, getting about 1-2 fps with a graphics-card-less desktop chugging away on a Core 2 Duo processor. That was barely enough to get me the geometric principal axis of an object on a clean/clutter-free table, but totally sufficient for some novel-at-the-time demos. I regarded point cloud sensors and cameras as the cute new gizmos that made my job easier, without appreciating just how important perception technology would become. I feel that with every job switch, I get a new education on how important perception is and just how little of it I appreciate and understand. Among many other things, I've been lectured on the importance of controlled lighting, weird edge cases with reflective material/glare, accuracy limitations of non-telecentric setups, and the inevitability of motion blur.
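
For what it's worth, that "geometric principal axis" was never anything fancier than PCA on the segmented object points. A minimal numpy version, assuming you've already cropped an (N, 3) array of object points off the table plane, might look like:

```python
import numpy as np

def principal_axis(points: np.ndarray) -> np.ndarray:
    """Dominant axis of an (N, 3) object point cloud via PCA: the eigenvector
    of the covariance matrix with the largest eigenvalue."""
    centered = points - points.mean(axis=0)
    cov = centered.T @ centered / len(points)
    eigvals, eigvecs = np.linalg.eigh(cov)   # eigh is fine: cov is symmetric
    return eigvecs[:, np.argmax(eigvals)]    # unit vector along the longest extent

# Example: a noisy cloud stretched along x should return roughly [1, 0, 0].
cloud = np.random.randn(500, 3) * np.array([0.10, 0.01, 0.01])
print(principal_axis(cloud))
```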

Led by the insane proliferation/adoption of the Intel Realsense line, it seems to me that structured/coded light is the perception approach of choice for robotic manipulation projects. These setups don't quite need the range of the lidar units used in the automotive space, with the target objects unlikely to be more than 0.5m to 1.0m away from the most distal sensors, and you're unlikely to get better resolution with other technologies. The Realsense D435/D415 itself has pretty much become a comical, one-size-fits-all shotgun solution due to its compact form factor and support, though I think we're now starting to see higher-end solutions (1,2,3) with the resolution to address inspection/metrology needs (where a basic blob just doesn't cut it).

Structured light (in my opinion) "winning out" over time-of-flight and stereovision is yet another example where the direct measurement was too difficult, perhaps due to the wide range of material properties (and the associated interactions with light), and so it made more sense to measure an indirect representation (distortion of the structured/coded light pattern) instead. In particular, we've been seeing this in the past few years with the different variations of the GelSight tactile sensor, where the deformation of an elastic layer is measured by visually tracking the motion of dots (or other key features). Other versions (and I'm sure I'm forgetting some) include TacTip, TRI's Punyo, and Meta's Digit. I guess it's difficult to beat the data density of the ubiquitous camera sensor, especially since we already have such absurd advantages from economies of scale, so why not restructure the problem to better suit the resources at hand instead of vice versa?
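
The core trick in those sensors is simple enough to sketch: find the dot centroids in a no-contact reference frame and in a deformed frame, then match them up to get a displacement field that stands in for local shear/deformation. A rough OpenCV-based sketch (glossing over all of the illumination, calibration, and marker-design work that makes the real sensors usable) might look like this:

```python
import cv2
import numpy as np

def dot_centroids(gray: np.ndarray) -> np.ndarray:
    """Detect dark dot centroids in a grayscale gel image; returns an (M, 2) array."""
    detector = cv2.SimpleBlobDetector_create()
    return np.array([kp.pt for kp in detector.detect(gray)])

def displacement_field(ref_gray: np.ndarray, cur_gray: np.ndarray) -> np.ndarray:
    """Match each reference dot to its nearest dot in the current frame and return
    the per-dot displacement vectors (a crude proxy for local shear/deformation)."""
    ref, cur = dot_centroids(ref_gray), dot_centroids(cur_gray)
    if len(ref) == 0 or len(cur) == 0:
        return np.zeros((0, 2))              # no dots found: nothing to report
    disp = []
    for p in ref:
        d = np.linalg.norm(cur - p, axis=1)
        disp.append(cur[np.argmin(d)] - p)   # crude nearest-neighbor matching
    return np.array(disp)

# Usage sketch: compare a no-contact reference frame against a loaded frame.
# ref = cv2.imread("gel_reference.png", cv2.IMREAD_GRAYSCALE)
# cur = cv2.imread("gel_pressed.png", cv2.IMREAD_GRAYSCALE)
# print(displacement_field(ref, cur).mean(axis=0))  # average shear direction
```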

The naysayer in me would argue that the addition of novel artificial structure in the problem has been the primary advancement. Despite all the advancement in the core hardware technology and software algorithms, the big level-up here comes from enforcing a higher level of repeatability and consistency in the raw camera data output. All pixels are equal, but some pixels are more equal than others.

Brushless Direct and Quasi-Direct Drive

Not to say I was ever particularly strong in this area, but a couple of recent job interviews have made it obvious that roboticists working on hardware are going to need a much better understanding of brushless motor design and control moving forwards (and backwards). Philosophically (if that's even the right word), I've traditionally only worried about the actuator stall torque and the packaging constraints, as those were (arguably) the primary limiting factors in end-effector design. Control was almost exclusively in position space, with force output dictated by some elastic element in series, usually a compliant pad at contact or some variation on underactuation between the actuator and contact. Let's just overdrive the fingers until the object stops slipping, amirite? Now, my own path from academia to industry may be a bit atypical here, but I was surprised to find that material-handling applications in manufacturing, where system function is far more specialized, relied more on pneumatics and limited kinematic stroke/freedom for more consistent force output. Electric end-effector options were far less common, especially with higher payloads, presumably (and I'm guessing broadly here) because running motors at stall in the end-effector's final holding-grasp state ain't all that preferable.

A lot of the actuator work originally developed for legged robotics at MIT and CMU gave us access to what I'd (perhaps inappropriately) call a tunable, digital spring of sorts. There's now an option to more easily get a desired force profile at contact through control alone. Arguably, this has been possible for a while through other implementations: the friction drive setups of the OG cobots via Peshkin and Colgate come to mind, but work on direct-drive and low-reduction drive brushless actuators seems to have made this practice far more accessible. For grasping/manipulation, I'm less interested in the dynamic motions/behaviors and more excited about the applications in proprioceptive sensing through the motor itself (1,2).
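
Part of the appeal of proprioceptive sensing is that, once the reflected friction and inertia are small enough to ignore, the estimate is almost embarrassingly simple: motor current maps to joint torque through the torque constant and (low) gear ratio, and joint torque maps to fingertip force through the finger geometry. A single-joint sketch with made-up constants:

```python
# Rough proprioceptive force estimate for a single-joint, quasi-direct-drive finger.
# Assumes reflected friction/inertia are small enough to ignore (the whole point of a
# low gear ratio); all constants are illustrative, not from any particular actuator.
K_T = 0.05        # motor torque constant [N*m/A]
GEAR_RATIO = 6.0  # low reduction, so external loads still "show up" at the motor
LINK_LEN = 0.04   # distance from joint axis to fingertip contact [m]

def fingertip_force(motor_current_amps: float) -> float:
    """Estimate contact force at the fingertip from measured motor current."""
    joint_torque = K_T * motor_current_amps * GEAR_RATIO  # [N*m]
    return joint_torque / LINK_LEN                        # [N]

print(fingertip_force(1.5))  # ~11 N with these made-up constants
```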

Contact and force sensing has always been very difficult, particularly because it relies on the repeatable and hysteresis-free deformation of some elastic element. With the sensing elements located right at the contact, you also need to compensate for damage from wear and tear. If only to maximize robustness, it makes more sense to move the sensing element further away from contact, though (in my mind) we now have a new conundrum: the component-chain between the sensor and contact should be as stiff as possible, and as that size/offset grows, we increase the potential for damage during inadvertent collisions. In our work for the DARPA ARM-H project, I remember that iRobot insisted on break-away magnetic joints despite the compliance and underactuation already implemented in the system. Low-reduction brushless drive doesn't directly address that issue, but it reduces overall complexity and enables some more interesting behaviors throughout the mechanism's range of motion. This "actuator transparency" allows the actuator to both do work and have work (safely) be done on it. Conceptually, that seems particularly appropriate for end-effectors, as we use our hands as both input and output devices during manipulation.

Stuff that (imo) Didn't Pan Out

  1. Mechanical synergies and underactuation really haven't achieved much adoption outside of the lab space. I often point to Robotiq as the primary example of a company pushing underactuation in industry, and yet, if you were to look up videos of their products in the field, you're unlikely to see any vendor actually utilizing the adaptive behavior of the grippers. I think it's also telling that their website now focuses on full manufacturing solutions (with next to no mention of underactuation/adaptability) as opposed to the piecemeal components themselves.
    The lack of penetration in industry may be more an indictment of the complexity and fragility that comes with increased degrees of freedom than of underactuation itself. The ability to guarantee the execution of a single task trumps the capability to maybe (but maybe not) do a wider selection of tasks. I still remember that the awe at the impressive mechanical artistry of the BarrettHand was accompanied by less-than-enthusiastic stories of arduous maintenance and repair. Empire Robotics, which tried to commercialize the universal jamming gripper, cited wear and (literal) tear of their flexible outer membranes as a significant contributor to commercial failure. To be fair, there are a few soft robotics companies (1,2) that have found some consistent niche in recent years.
  2. Tactile technology, despite the many vision-based tactile efforts of late, seems to remain firmly stuck in the research space. For something that relies on deformation at the contact interface, it's difficult to envision a manifestation capable of withstanding that sort of regular abuse while returning reliable force data, much less doing that for a dense array of points like our fingers do. The best examples of sensing arrays (tablet/smartphone screens) are limited to certain materials and only return contact state, while requiring significant economies of scale to be viable.
  3. The obsession with humanoid hands and anthropomorphism produces some real fun demos, and not much else. The argument for anthropomorphism has always seemed a bit flimsy to me: sacrifice minimalism and robustness to copy the kinematics of our existing hands in order to leverage our know-how in controlling/using such hands, as if the latter is somehow a magical solution. One of my biggest takeaways from grad school was hearing that for all the research in anthropomorphic prosthetic hands, even the most advanced models struggle to exceed the overall utility of the Hosmer hook.

The Tried and True Solutions

The elephant in the room: for all the fancy research, modeling, and prototyping we propose and do in the research space, it's pretty difficult to enter industry and not converge towards a vacuum- or magnet-based solution. I still remember watching the inaugural Amazon Picking Challenge and becoming increasingly frustrated with how well the vacuum-based grasping solutions were doing. Recently, it was particularly demoralizing to hear a stalwart academic in the field of grasping remark that his company's secret sauce to picking was just figuring out vacuums really well. Turns out, vacuums are mechanically simple, (relatively) easy to make compliant and adaptive to different geometries, compact for tight spaces, easy to control, and capable of picking up objects when only a single surface/side is accessible. Truly a difficult combo to beat.
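
Part of why vacuum is so hard to beat is that even the sizing math is trivial: ideal holding force is just gauge vacuum pressure times effective cup area, derated by a healthy safety factor for accelerations, shear, and imperfect seals. A quick back-of-the-envelope sketch with assumed numbers:

```python
import math

def vacuum_holding_force(vacuum_kpa: float, cup_diameter_m: float) -> float:
    """Ideal holding force of a single suction cup: gauge vacuum pressure x cup area."""
    area = math.pi * (cup_diameter_m / 2.0) ** 2   # effective sealing area [m^2]
    return vacuum_kpa * 1000.0 * area              # kPa -> Pa; force in newtons

# One 30 mm cup at 60 kPa of vacuum, derated by an assumed 4x safety factor
# to cover accelerations, shear loading, and imperfect seals:
force = vacuum_holding_force(60.0, 0.030)
print(f"ideal: {force:.1f} N, derated payload: ~{force / 4.0 / 9.81:.2f} kg")
```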

Of course, there are some pretty obvious limitations: magnets only work with ferromagnetic materials, and vacuums typically require a near-airtight seal (though high flow alone can be sufficient). However, even for the cases that don't fit the ideal operating conditions, there are workarounds: porous/loose fabric can be pre-wrapped in plastic (which it often is anyway) or clamped in some carrier device that's more easily picked. Similarly, we could embed removable metal plates/tabs in the target objects. The system can even swap in the optimal suction cup/magnet to maximize performance. Some may call that cheating. Those same people may be trying to publish a paper instead of deploying a solution. You're not necessarily limiting the range of inputs to the problem if you have the flexibility to modify the inputs in a separate preparation step.

Limiting robotic systems to the problem as given seems a bit odd. The core problem has always been to move an object in space, not to move an object in space like how people do in all possible scenarios. In addition, it's not like the typical manual pick and place task is a person rummaging through a loose pile of items. We use all sorts of containers and tools when handling the objects piecemeal either doesn't make sense or would take too long. Functionally, the difference between manipulating the object directly and manipulating something that's rigidly attached to the object or holds the object could be minimal.

Stuff I Hope to See (or Do)

I'm the sort of engineer who would prefer to do something whimsical and clever rather than necessarily impactful, so my insight has always been a bit off-kilter, but I've continued to believe that there are some missed opportunities in the grasping/manipulation space, whether in research or industry, that would be especially worthwhile to explore:

  1. Multi-stage, sequential manipulation w/ regrasping - Why is grasping and manipulation a one-and-done procedure? Researchers often require that dexterous manipulation maintain force closure on the object, which by definition requires redundant degrees of freedom and a whole lot of unnecessary complexity. This is mostly an argument on semantics and scope, but if we're just talking about using a series of contact points to reposition an object in space, it shouldn't matter if those contacts come from the same robotic system, or even from an actuated system at all, or whether we lose and re-gain force/form closure during a lengthy, non-atomic process. There's no need to require a fancy finger-gaiting maneuver within a single end-effector if we can just drop and re-pick the object until the desired contact points are achieved, or if we can hand off to another end-effector or static affordance. It would be fun to see a chain of drop, re-scan, and re-grasp primitives that just continues until the desired grasp is obtained (a skeletal version of that loop is sketched after this list).
  2. More non-prehensile manipulation - Most (if not all) of the demos/deliverables on grasping solutions pick the object as is, as if it can only be moved in a grasp. Pushing/nudging motions to create more advantageous scenarios prior to grasp attempts don't seem to be thoroughly explored in practice, though post-drop nudging (extrinsic manipulation/dexterity for your academics), especially in machine-tending applications, seems to be the norm. Robotic systems should (literally) mix it up to artificially adjust the object states prior to manipulation attempts.
  3. Dynamic extrinsic affordances - My recent work has me over-thinking alignment jigs, chutes, and slides. Robotic systems have long used fixed physical boundaries as reference points, either through direct contact or visual sensing. In manipulation/grasping, this most commonly is manifested as a simple surface/wall or an alignment jig specific to the target object. Varying the geometry of the support structures could enable more effective/precise alignment methods. That said, I hesitate to go too far down the road of super general shape-shifting surfaces that try to solve the problem for all possible geometries. Just a few offset surfaces and a varying angle relative to the level table frame could be more than sufficient.
  4. A cache of swappable tooling - I know tool-changers exist, but that's not necessarily what I'm talking about. I don't think the ATI-style, super rigid, usually-pneumatic wrist couplers are the only way to swap out end-effectors. Various researchers/projects have implemented handles/tooling specific to grippers (1,2,3), but I don't know that there are many systems optimizing the gripper topology and contact geometry as aggressively as, say, a CNC mill swaps tooling for machining operations. As companies offer more bespoke options as opposed to a one-size-fits-most strategy, while maintaining a common actuation base, I could see scenarios where a robot would carry a carousel or bank of what would effectively be robotic gloves, each suited for a particular object or application. We could probably go much farther with a single-dof opposition grasp by just adjusting the contact coverage/geometry.
  5. Graceful failures and finding a path to recovery - Maximizing good is not the same as minimizing the bad, but I think the latter opens up many more worthwhile applications than the former. Not all pick failures are catastrophic and require a system reset to re-attempt the task from scratch. Can the infrastructure around the manipulation task be set up such that the system can make multiple attempts at a particular subtask? Expanding off of proposal #1 above, maybe we can construct an intermediary "safe space" for re-orientation primitives after a less-than-optimal pick attempt (from perhaps less-than-optimal pick conditions). I don't believe I've seen robotic systems that not only make multiple attempts at a task, but also make initial attempts knowing full well that those attempts have a high chance of failure, yet could restructure the scene in a better way for a future re-try.
  6. Leveraging different contact surfaces on the same gripper - Despite having a single degree of freedom, there are (at least) four ways to measure things with a set of calipers by using different sets of contacts on the same mechanism. We're able to do quite a bit with a basic parallel-jaw gripper and a high-dof arm, but systems typically only leverage the same pair of contact points. The Hosmer hook and the Sarcos Guardian Gripper are the primary examples of this that come to mind for me. In the latter case, just adding curvature to the fingertips enables both a lateral pinch and a wrap grasp (when taking into consideration the opposition thumb) through a single degree of actuation.
  7. Safe Cyclic Motions - Modeling is most effective when, in practice, every action has a reliable and repeatable reaction. That's not really the case in manipulation, where contacts may be improperly placed, and friction wakes up every morning choosing violence. As detailed above, integrators can use a variety of hardstops and alignment jigs to mitigate grasping errors where the non-prehensile forces exceed friction, wherever it may arise. In such a way, the same corrective motion profile can adjust for a range of pose errors. In these cases, there's some "safe", forced-slip interaction occurring that doesn't detract from the manipulation objective. Conveyor belt setups do this a lot when funneling and aligning parts between stages. I do joke that the most effective end-effectors capable of in-hand manipulation incorporate miniaturized conveyor belts (1,2,3,4,5), but I think there's something to be said for motion primitives (however they're implemented) that are agnostic to the dastardly antics of friction and produce a net effect on the object pose only in the presence of error.
  8. I see you, SEAs - Series elastic actuators have their limitations, especially in dynamic applications or where the range of external forces is significant, but I still think they're under-leveraged in end-effector design. We're very good at measuring position, and that's the primary mechanism by which we detect forces anyhow, so why not exaggerate the motion from external forces and make the forces easier to detect with cheaper sensors? For sure, the additional degrees of freedom could be problematic, and the elastic element would need to be carefully selected for each use case, but it's always seemed like a cost-effective way to add some form of force control (albeit very simple force control) to positioning subsystems. In its most basic form, I could foresee a swappable elastic element in-line with the transmission that hits a switch when fully compressed, giving a gripper a simple way of reliably hitting a force threshold when closing.
  9. Actuation that's more than a configuration selector - With more dexterous (typically anthropomorphic) end-effectors, I feel that there are a lot of actuators whose sole purpose is setting the relative pose of the finger (or grasping) elements. During the grasp acquisition or in-hand manipulation motion itself, those actuators typically remain fixed. If the primary function of a degree of freedom is to set-it-and-forget-it, I think you're better off setting that mechanically through some sort of externally-triggered switch, as some prosthetics do, rather than dedicating an actuator to the function. To that end, an externally triggered configuration selector (imagine a robot hitting its own end-effector to change its configuration prior to grasping) would also be really cool.
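
Tying items 1 and 5 together, the control flow I keep hoping to see is barely more than a loop: attempt a pick, evaluate it, and if it's bad, deliberately drop/nudge the object into a better configuration and rescan rather than declaring failure. A skeletal sketch, where every helper function is a hypothetical placeholder for real perception, planning, and control:

```python
import random
from dataclasses import dataclass

POSE_TOLERANCE = 0.005  # [m] assumed acceptance threshold on final object pose error

@dataclass
class AttemptResult:
    success: bool
    pose_error: float  # [m]

# --- Hypothetical stubs standing in for real perception/planning/control ---
def plan_best_grasp(scene):
    return {"approach": "top", "width": 0.05}          # may be knowingly low-confidence

def execute_grasp(grasp):
    return AttemptResult(success=random.random() > 0.5,
                         pose_error=random.uniform(0.0, 0.02))

def drop_or_nudge_into_fixture(result):
    pass                                               # non-prehensile "safe space" move

def rescan():
    return {"points": "fresh point cloud"}             # stand-in for a new scene scan

def pick_with_recovery(scene, max_attempts: int = 5) -> bool:
    """Attempt, evaluate, and on a bad grasp deliberately restructure the scene
    (drop/nudge into a fixture) and rescan, rather than declaring failure outright."""
    for _ in range(max_attempts):
        result = execute_grasp(plan_best_grasp(scene))
        if result.success and result.pose_error < POSE_TOLERANCE:
            return True
        drop_or_nudge_into_fixture(result)
        scene = rescan()
    return False

print(pick_with_recovery(rescan()))
```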