One of the interesting parts of robotic learning is defining what can be learned by programming the structure the robot will operate in. I’m not even talking about a robot that has hands or talks or whatever, just an input/output system that somehow “thinks”, and in this case, learns to think new things. However, implementation shapes functional ability: how you make something, not just what you make it do, defines how it works.

One of the projects I’m working on at UW, with Mike Chung of the Neural Systems Lab, is a robot that works with a human to build block structures. It’s a start to something more interesting. Basically, given an input of blocks in a certain working area (the tabletop), how can the robot learn certain structures (the shapes the blocks create) and potentially create them on its own, or with the user? An example might be demonstrating a “square” shape with four blocks to teach the robot, and then, after putting down three blocks, letting the robot place the fourth to finish the square (or make a T shape, whatever). Of course, as an engineer, you could program it to recognize these things by hand, or let a system learn them itself, but what the system measures and how it measures it determines how the robot functions: it simply can’t be very useful (i.e. approaching human complexity) if the underlying system is not complicated. But that’s a butthurt post for another day (of philosophical BS).

Instead, here is a brief overview of making an input system for this kind of robot.

Our first version of this sort of framework (ugh, I hate that word) tracked color blobs (bright foam blocks) and let the system compare a before state and an after state so it could learn some actions. This fell short in a bunch of different ways, but it was able to localize blobs in the 3D space around the table. A note on hardware: Kinect. Yeah, there’s nothing fancy going on, which led to some problems, but again, post for another day.

Version two: the real deal. Thanks to PCL and its development outside of ROS, it’s awesomely easy to do complicated things like grab a frame from the Kinect, crop the point cloud to a specific region of space, detect and remove the table’s plane, and run a Euclidean cluster extraction on the remaining points in the depth point cloud to get a block, all at about 12 frames per second on an average computer. By golly that’s great!
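To make those steps concrete, here is a rough pure-Python sketch of the same three stages: crop, plane removal, clustering. It is not the PCL code from the project; PCL does all of this far faster in C++. The function names, thresholds, and the brute-force neighbor search are all my own assumptions for illustration.

```python
import math
import random

def crop_box(points, lo, hi):
    """Keep only points inside the axis-aligned working volume [lo, hi]."""
    return [p for p in points
            if all(lo[i] <= p[i] <= hi[i] for i in range(3))]

def plane_from_points(a, b, c):
    """Return (unit normal, offset) of the plane through a, b, c, or None if collinear."""
    u = [b[i] - a[i] for i in range(3)]
    v = [c[i] - a[i] for i in range(3)]
    n = [u[1] * v[2] - u[2] * v[1],
         u[2] * v[0] - u[0] * v[2],
         u[0] * v[1] - u[1] * v[0]]
    norm = math.sqrt(sum(x * x for x in n))
    if norm < 1e-9:
        return None
    n = [x / norm for x in n]
    d = -sum(n[i] * a[i] for i in range(3))
    return n, d

def remove_plane(points, threshold=0.01, iterations=100, seed=0):
    """RANSAC: find the plane with the most inliers, return the points NOT on it."""
    rng = random.Random(seed)
    best_inliers = set()
    for _ in range(iterations):
        model = plane_from_points(*rng.sample(points, 3))
        if model is None:
            continue
        n, d = model
        inliers = {i for i, p in enumerate(points)
                   if abs(sum(n[k] * p[k] for k in range(3)) + d) < threshold}
        if len(inliers) > len(best_inliers):
            best_inliers = inliers
    return [p for i, p in enumerate(points) if i not in best_inliers]

def euclidean_clusters(points, tolerance=0.02, min_size=5):
    """Flood-fill points into groups whose neighbors lie within `tolerance`."""
    unvisited = set(range(len(points)))
    clusters = []
    while unvisited:
        start = unvisited.pop()
        queue, cluster = [start], [start]
        while queue:
            i = queue.pop()
            near = [j for j in unvisited
                    if math.dist(points[i], points[j]) <= tolerance]
            for j in near:
                unvisited.remove(j)
            queue.extend(near)
            cluster.extend(near)
        if len(cluster) >= min_size:
            clusters.append([points[i] for i in cluster])
    return clusters
```

Chained together it would look like `euclidean_clusters(remove_plane(crop_box(cloud, lo, hi)))`, with each returned cluster being a candidate block. Units here are assumed to be meters.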

 

Some obvious problems: what happens if the block moves? What about occlusion? What about resource contention? What about giving that data to another process?

1) Being academia, we can make all the assumptions we want => blocks don’t move after they’ve been placed. However, I’m not an academic, so I added a little robustness anyway. A block counts as detected only if a cluster of points persists for a while without moving, and once detected it has a certain stickiness, so that occlusion for a few frames won’t lose sight of the block. Hey, if I wanted a PhD in tracking blocks with CV, this would be a different kind of post.

2) A neat trick with PCL is using octrees. In fact, its change-detection octree is double-buffered: it holds two point clouds at once, so you can compare them and pull out the differences. If a user’s arm moves within the frame, that movement gets picked up as a change and can be ignored. Again, we’re assuming that blocks don’t move.

3) Everything is very nicely resource locked to prevent fighting over specific point clouds / Kinect input, but I think I can optimize this more. In the next version, of course…

4) Apache Thrift makes it super easy to expose this code as a remote procedure call that passes the block data along, so it connects easily to a client program. In our case, it was connected to ROS to do some reference frame transformation (don’t ask why), pushing the data to another ROS node which then called out commands to the MATLAB engine. A separate ROS node, of course, to keep the MATLAB stuff in its own process. MATLAB, of course, because it’s academia, and there are some libraries common to the subject of Bayesian networks.
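For points 1 and 2, here is a minimal Python sketch of the two ideas: frame-to-frame change detection and "sticky" block persistence. Big caveats: the change detection uses a flat set of voxel keys as a stand-in for PCL's octree, and the tracker, its names, and its thresholds (`confirm_frames`, `drop_frames`, `radius`) are assumptions of mine, not the values or code we actually ran.

```python
import math

def voxel_key(point, resolution):
    """Map a 3D point to the integer index of the voxel containing it."""
    return tuple(math.floor(c / resolution) for c in point)

def points_in_new_voxels(previous, current, resolution=0.01):
    """Return points of `current` that fall in voxels empty in `previous`."""
    occupied = {voxel_key(p, resolution) for p in previous}
    return [p for p in current if voxel_key(p, resolution) not in occupied]

class BlockTracker:
    """Confirm a block once it has stayed put; tolerate brief occlusion."""

    def __init__(self, confirm_frames=5, drop_frames=10, radius=0.02):
        self.confirm_frames = confirm_frames  # sightings needed to confirm
        self.drop_frames = drop_frames        # misses a confirmed block survives
        self.radius = radius                  # max drift still "not moving"
        self.tracks = []

    def update(self, centroids):
        """Feed one frame of cluster centroids, get confirmed block positions."""
        unmatched = list(centroids)
        for track in self.tracks:
            match = next((c for c in unmatched
                          if math.dist(c, track["pos"]) <= self.radius), None)
            if match is None:
                track["missed"] += 1
            else:
                unmatched.remove(match)
                track["seen"] += 1
                track["missed"] = 0
                if track["seen"] >= self.confirm_frames:
                    track["confirmed"] = True
        # Unconfirmed candidates die on the first miss; confirmed ones are sticky.
        self.tracks = [t for t in self.tracks
                       if t["missed"] == 0
                       or (t["confirmed"] and t["missed"] <= self.drop_frames)]
        for c in unmatched:
            self.tracks.append({"pos": tuple(c), "seen": 1, "missed": 0,
                                "confirmed": self.confirm_frames <= 1})
        return [t["pos"] for t in self.tracks if t["confirmed"]]
```

The idea is that something like an arm sweeping through the frame shows up as points in new voxels and never sits still long enough to be confirmed, while a placed block survives a few frames of being hidden behind that arm.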

Anyway, here’s a vid of the single layer HMM working to discretize the location of the block (we’re basically just looking for up, down, left, right relative to the last block). The second layer will come soon! I’d like to think that the robot manipulation part of this will come soon too, but that’s a post for another day.

Note the jump when the block is placed to the left.
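To be clear about what "up, down, left, right relative to the last block" means, here is a tiny Python sketch of that labeling. This is not the HMM, just the kind of discrete symbol such a model could work with. The axis convention (+x is right, +y is up, in table coordinates) and the `dead_zone` value are assumptions for illustration.

```python
def relative_direction(previous, current, dead_zone=0.01):
    """Label where `current` sits relative to `previous` on the table plane.

    Assumes +x is "right" and +y is "up"; returns None if the two
    positions are closer than `dead_zone` along both axes.
    """
    dx = current[0] - previous[0]
    dy = current[1] - previous[1]
    if max(abs(dx), abs(dy)) < dead_zone:
        return None
    if abs(dx) >= abs(dy):
        return "right" if dx > 0 else "left"
    return "up" if dy > 0 else "down"
```

A hard threshold like this flips labels whenever the measurement is noisy near a diagonal, which is one reason to put a probabilistic model on top of it.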

 

FYI, on most *nix systems running X11 you can run the following to capture the screen to a video. Bam, simple.

ffmpeg -f x11grab -s 800x600 -r 25 -i :0.0 -qscale 0 ~/out.mpg