最近在调试机器人时发现导航节点经常莫名挂死,排查了很久才最终找到并解决问题,特此记录。
通过gdb打出的挂死时程序的堆栈情况如下:
#0 0xb76edcb0 in ?? () #1#2 0xb5ab8464 in tf2::BufferCore::lookupframeNumber(std::string const&) const () from /opt/ros/indigo/lib/libtf2.so #3 0xb5abc174 in tf2::BufferCore::canTransform(std::string const&, std::string const&, ros::Time const&, std::string*) const () from /opt/ros/indigo/lib/libtf2.so #4 0xb5abc280 in tf2::BufferCore::canTransform(std::string const&, ros::Time const&, std::string const&, ros::Time const&, std::string const&, std::string*) const () from /opt/ros/indigo/lib/libtf2.so #5 0xb5b0f730 in tf2_ros::Buffer::canTransform(std::string const&, ros::Time const&, std::string const&, ros::Time const&, std::string const&, ros::Duration, std::string*) const () from /opt/ros/indigo/lib/libtf2_ros.so #6 0xb69de00f in tf::Transformer::waitForTransform(std::string const&, ros::Time const&, std::string const&, ros::Time const&, std::string const&, ros::Duration const&, ros::Duration const&, std::string*) const () from /opt/ros/indigo/lib/libtf.so #7 0x97152812 in base_local_planner::getGoalPose (tf=..., global_plan=std::vector of length 124, capacity 158 = {...}, global_frame="odom", goal_pose=...)
上面的信息基本确定罪魁祸首是DWA里的base_local_planner的getGoalPose函数(在goal_functions.cpp文件里)
bool getGoalPose(const tf::TransformListener &tf,
const std::vector &global_plan,
const std::string &global_frame, tf::Stamped &goal_pose)
{
if (global_plan.empty())
{
ROS_ERROR("Received plan with zero length");
return false;
}
// for debug
LOG4_INFO_P(gp_dwa_log, "getGoalPose 2");
// 全局参考路径最后路径点
geometry_msgs::PoseStamped plan_goal_pose = global_plan.back();
// for debug
LOG4_INFO_P(gp_dwa_log, "getGoalPose 3");
try
{
// 查询路径坐标系-->global坐标系的tf
tf::StampedTransform transform;
tf.waitForTransform(global_frame, ros::Time(0),
plan_goal_pose.header.frame_id, plan_goal_pose.header.stamp,
plan_goal_pose.header.frame_id, ros::Duration(0.5));
// for debug
LOG4_INFO_P(gp_dwa_log, "waitForTransform frame_id: %s", plan_goal_pose.header.frame_id.c_str());
tf.lookupTransform(global_frame, ros::Time(0),
plan_goal_pose.header.frame_id, plan_goal_pose.header.stamp,
plan_goal_pose.header.frame_id, transform);
// for debug
LOG4_INFO_P(gp_dwa_log, "lookupTransform");
// 变换最后路径点到global坐标系
poseStampedMsgToTF(plan_goal_pose, goal_pose);
goal_pose.setData(transform * goal_pose);
goal_pose.stamp_ = transform.stamp_;
goal_pose.frame_id_ = global_frame;
}
catch (tf::LookupException &ex)
{
ROS_ERROR("No Transform available Error: %sn", ex.what());
return false;
}
catch (tf::ConnectivityException &ex)
{
ROS_ERROR("Connectivity Error: %sn", ex.what());
return false;
}
catch (tf::ExtrapolationException &ex)
{
ROS_ERROR("Extrapolation Error: %sn", ex.what());
if (global_plan.size() > 0)
ROS_ERROR("Global frame: %s Plan frame size %d: %sn", global_frame.c_str(), (unsigned int)global_plan.size(), global_plan[0].header.frame_id.c_str());
return false;
}
return true;
}
刚开始以为是调用waitForTransform()导致查tf超时之类引发的问题,优化后发现问题依旧。最后经过多次努力,终于找到原因(重点来了!)
原因有下:
- 全局参考路径global_plan没有做多线程读写保护,导致getGoalPose在查global_plan的时候(线程定时查)另外的线程在修改;
- geometry_msgs::PoseStamped &plan_goal_pose = global_plan.back();偏偏取全局路径最后一个点用来查tf的时候用的是引用而不是拷贝,导致其他线程一修改global_plan就会导致查tf挂掉;
那为啥原生DWA没有做这个优化呢?难道作者没发现?一看,确实,因为move_base没有多线程读写global_plan,好吧。。。
总结与反思:
- 对于多线程问题一定要注意读写保护,及时加锁,在不影响性能的情况下可以避免使用引用的方式;
- 多线程锁也要注意死锁,要拎的清关系;
- 可以借助代码扫描工具(如SonarCube)提前发现系统漏洞,或者使用动态扫描工具,节省时间成本;



