Linux kernel select/poll and epoll: implementation and differences

Source: Internet. Date: 2017-05-28


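The following notes are what I learned while studying the kernel implementation of select/poll/epoll:
select, poll, and epoll are all I/O multiplexing functions: put simply, they let a single thread handle reads and writes on many file descriptors at the same time.
select and poll are implemented very similarly. epoll grew out of select/poll, mainly to overcome their inherent shortcomings.
epoll is a newer interface that is only available from the 2.6 kernel series onward, yet the in-kernel implementations of all three are quite similar.
All three rely on the device driver providing a poll callback. For sockets these are tcp_poll, udp_poll, and datagram_poll;
for a device driver you write yourself, it is the poll interface function you implement.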
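
To make the idea of multiplexing concrete, here is a minimal user-space sketch of select watching a single pipe. The helper names wait_readable and select_pipe_demo are illustrative assumptions, not something from the kernel or from any library; only select, pipe, write, and close are real calls.

```c
#include <assert.h>
#include <sys/select.h>
#include <unistd.h>

/* Wait up to timeout_ms for fd to become readable.
 * Returns 1 if readable, 0 on timeout, -1 on error.
 * (Hypothetical helper, for illustration only.) */
int wait_readable(int fd, int timeout_ms)
{
    fd_set rfds;
    struct timeval tv;
    int rc;

    /* select cannot represent descriptors >= FD_SETSIZE */
    assert(fd >= 0 && fd < FD_SETSIZE);

    FD_ZERO(&rfds);
    FD_SET(fd, &rfds);
    tv.tv_sec = timeout_ms / 1000;
    tv.tv_usec = (timeout_ms % 1000) * 1000;

    /* The first argument is the highest descriptor plus one. */
    rc = select(fd + 1, &rfds, NULL, NULL, &tv);
    if (rc < 0)
        return -1;
    return (rc > 0 && FD_ISSET(fd, &rfds)) ? 1 : 0;
}

/* Small demo: an empty pipe is not readable, a pipe holding one byte is.
 * Returns 0 if both observations hold, -1 otherwise. */
int select_pipe_demo(void)
{
    int p[2];
    int before, after = -1;

    if (pipe(p) != 0)
        return -1;
    before = wait_readable(p[0], 0);        /* nothing written yet */
    if (write(p[1], "x", 1) == 1)
        after = wait_readable(p[0], 100);   /* one byte is queued */
    close(p[0]);
    close(p[1]);
    return (before == 0 && after == 1) ? 0 : -1;
}
```

Calling select_pipe_demo() from a main function is enough to try it out; the same wait_readable helper works for sockets or any other descriptor whose driver implements poll.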

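select implementation (2.6 kernel; other kernel versions should be much the same)
When an application calls select, it enters the kernel through sys_select, which does some simple initialization and then calls core_sys_select.
The main job of core_sys_select is to copy the descriptor sets from user space into kernel space; it finally calls do_select, which does the real work.
do_select calls poll_initwait, whose main job is to register __pollwait as the callback behind poll_wait.
So when a device driver's poll callback calls poll_wait, it is really calling __pollwait.
The main job of __pollwait is to put the current process on the device's wait queue, so that the process is woken up when the awaited event arrives.
Next comes a for loop. It first walks every file descriptor and calls that descriptor's poll callback to check whether it is ready.
After all descriptors have been visited, the loop exits if any descriptor is ready, a signal is pending, an error occurred, or the timeout has expired.
Otherwise it calls a schedule_xxx function (poll_schedule_timeout in 2.6.31) to put the current process to sleep until the timeout expires or a ready descriptor wakes it up.
It then walks every descriptor again, calling poll to re-check each one.
This repeats until one of the exit conditions is met.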
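
The registration step described above can be modeled in plain user-space C. The sketch below is only a toy model: every toy_ name is hypothetical, and the real kernel structures (poll_table, wait_queue_head_t) carry far more state. It shows how an init function stores a callback in the table, how a driver-style poll function forwards to it through a poll_wait-style helper, and why passing a NULL table skips the queueing step.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_WAITERS 8
#define TOY_POLLIN  0x1u

/* Toy stand-in for a kernel wait queue: just a list of task ids. */
struct toy_wait_queue {
    int waiters[MAX_WAITERS];
    int count;
};

struct toy_poll_table;
typedef void (*toy_queue_proc)(struct toy_wait_queue *q,
                               struct toy_poll_table *pt);

/* Toy stand-in for poll_table: a callback plus the waiting task. */
struct toy_poll_table {
    toy_queue_proc qproc;
    int task_id;
};

/* Plays the role of __pollwait: park the caller on the queue. */
static void toy_pollwait(struct toy_wait_queue *q, struct toy_poll_table *pt)
{
    assert(q->count < MAX_WAITERS);
    q->waiters[q->count++] = pt->task_id;
}

/* Plays the role of poll_initwait: register the callback. */
void toy_poll_initwait(struct toy_poll_table *pt, int task_id)
{
    pt->qproc = toy_pollwait;
    pt->task_id = task_id;
}

/* Plays the role of poll_wait: forward to whatever was registered.
 * A NULL table means "do not queue", as when wait is set to NULL. */
void toy_poll_wait(struct toy_wait_queue *q, struct toy_poll_table *pt)
{
    if (pt != NULL && pt->qproc != NULL && q != NULL)
        pt->qproc(q, pt);
}

/* A toy device with one read queue and a readiness flag. */
struct toy_device {
    struct toy_wait_queue readq;
    int data_ready;
};

/* Plays the role of a driver's poll callback:
 * first offer the wait queue, then report the current state. */
unsigned int toy_device_poll(struct toy_device *dev, struct toy_poll_table *pt)
{
    toy_poll_wait(&dev->readq, pt);
    return dev->data_ready ? TOY_POLLIN : 0u;
}
```

The point of the indirection is that the driver never needs to know who is waiting or why: it only hands its wait queue to poll_wait, and the caller decides what happens with it.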
Below are excerpts of the select-related code from the 2.6.31 kernel.
The call chain is:
select --> sys_select --> core_sys_select --> do_select

int do_select(int n, fd_set_bits *fds, struct timespec *end_time)
{
  ktime_t expire, *to = NULL;
  struct poll_wqueues table;
  poll_table *wait;
  int retval, i, timed_out = 0;
  unsigned long slack = 0;
  
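  /// Find the highest descriptor actually present in the sets, which reduces the number of loop iterations.
  /// This is why the first argument to select matters so much on Linux.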
  rcu_read_lock();
  retval = max_select_fd(n, fds);
  rcu_read_unlock();
  if (retval < 0)
    return retval;
  n = retval;

  //// Initialize the poll_table structure; an important part of this is storing the address of __pollwait in it.
  poll_initwait(&table);
  wait = &table.pt;
  if (end_time && !end_time->tv_sec && !end_time->tv_nsec) {
    wait = NULL;
    timed_out = 1;
  }
  if (end_time && !timed_out)
    slack = estimate_accuracy(end_time);

  retval = 0;
  /// Main loop: the descriptor states are polled here.
  for (;;) {
    unsigned long *rinp, *routp, *rexp, *inp, *outp, *exp;

    inp = fds->in; outp = fds->out; exp = fds->ex;
    rinp = fds->res_in; routp = fds->res_out; rexp = fds->res_ex;

    for (i = 0; i < n; ++rinp, ++routp, ++rexp) {
      unsigned long in, out, ex, all_bits, bit = 1, mask, j;
      unsigned long res_in = 0, res_out = 0, res_ex = 0;
      const struct file_operations *f_op = NULL;
      struct file *file = NULL;
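      /// Both the fd_set used by select and the fd_set_bits argument of do_select store descriptors as bits: for example, in a 1024-bit block of memory,
      /// if bit 28 is set to 1, the set contains descriptor 28.
      in = *inp++; out = *outp++; ex = *exp++;
      all_bits = in | out | ex; // check whether the read, write, or exception set has any descriptor in this word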
      if (all_bits == 0) {
        i += __NFDBITS;
        continue;
      }

      for (j = 0; j < __NFDBITS; ++j, ++i, bit <<= 1) {
        int fput_needed;
        if (i >= n)
          break;
        if (!(bit & all_bits))
          continue;
        file = fget_light(i, &fput_needed); /// look up the struct file pointer from the descriptor index
        if (file) {
          f_op = file->f_op; // get file_operations from struct file; this is the set of callbacks for operating on the file
          mask = DEFAULT_POLLMASK;
          if (f_op && f_op->poll) {
            wait_key_set(wait, in, out, bit);
            mask = (*f_op->poll)(file, wait); // call the poll function implemented by the device;
                                     // so for select to work properly, the device driver must provide a poll implementation
          }
          fput_light(file, fput_needed);
          if ((mask & POLLIN_SET) && (in & bit)) {
            res_in |= bit;
            retval++;
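            wait = NULL; /// Here and in the two cases below, wait is set to NULL because mask = (*f_op->poll)(file, wait) found the descriptor ready:
                       /// there is no need to add the current process to any more wait queues; do_select will exit once it has visited all descriptors.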
          } 
          if ((mask & POLLOUT_SET) && (out & bit)) {
            res_out |= bit;
            retval++;
            wait = NULL;
          }
          if ((mask & POLLEX_SET) && (ex & bit)) {
            res_ex |= bit;
            retval++;
            wait = NULL;
          }
        }
      }
      if (res_in)
        *rinp = res_in;
      if (res_out)
        *routp = res_out;
      if (res_ex)
        *rexp = res_ex;
      cond_resched();
    }
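    wait = NULL; // one full pass is done; every wait queue that needed the process already has it, so set wait to NULL to avoid adding it again
    if (retval || timed_out || signal_pending(current)) // exit the loop if a descriptor is ready, the timeout expired, or a signal is pending
      break;
    if (table.error) { // exit the loop on error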
      retval = table.error;
      break;
    }

    /*
     * If this is the first loop and we have a timeout
     * given, then we convert to ktime_t and set the to
     * pointer to the expiry value.
     */
    if (end_time && !to) {
      expire = timespec_to_ktime(*end_time);
      to = &expire;
    }
    ///// Put the process to sleep until the timeout expires or a ready descriptor wakes it up.
    if (!poll_schedule_timeout(&table, TASK_INTERRUPTIBLE,
            to, slack))
      timed_out = 1;
  }

  poll_freewait(&table);

  return retval;
}
void poll_initwait(struct poll_wqueues *pwq)
{
  init_poll_funcptr(&pwq->pt, __pollwait); // set the poll_table callback to __pollwait, so that calling poll_wait in a driver ends up calling __pollwait
  ........
}
static void __pollwait(struct file *filp, wait_queue_head_t *wait_address,
        poll_table *p)
{
  ...................
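  init_waitqueue_func_entry(&entry->wait, pollwake); // set the callback used to wake the process: when the driver calls wake_up on the queue,
                                          // pollwake is called; it in turn ends up calling the queue's default function, default_wake_function,
                                          // to wake the sleeping process.
  add_wait_queue(wait_address, &entry->wait);     // add the entry to the wait queue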
}

int core_sys_select(int n, fd_set __user *inp, fd_set __user *outp,
        fd_set __user *exp, struct timespec *end_time)
{
  ........
  // copy the descriptor sets from user space to kernel space
  if ((ret = get_fd_set(n, inp, fds.in)) ||
    (ret = get_fd_set(n, outp, fds.out)) ||
    (ret = get_fd_set(n, exp, fds.ex)))
  .........
  ret = do_select(n, &fds, end_time);
  .............
  //// copy the result sets returned by do_select from kernel space back to user space
  if (set_fd_set(n, inp, fds.res_in) ||
    set_fd_set(n, outp, fds.res_out) ||
    set_fd_set(n, exp, fds.res_ex))
    ret = -EFAULT;
   ............
}
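
The nested loops in do_select above can be imitated in user space to show how the bitmap is walked. This is a simplified sketch, not kernel code: bitmap_set and bitmap_scan are made-up names, a single bitmap stands in for the three in/out/ex sets, and recording the descriptor number replaces the call to the poll callback. It keeps the two details that matter: an all-zero word is skipped in one step, and the scan stops at n, the highest descriptor plus one.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

/* Number of descriptors covered by one bitmap word (like __NFDBITS). */
#define BITS_PER_WORD ((int)(sizeof(unsigned long) * CHAR_BIT))

/* Mark descriptor fd as present in the bitmap. */
void bitmap_set(unsigned long *words, int fd)
{
    assert(fd >= 0);
    words[fd / BITS_PER_WORD] |= 1UL << (fd % BITS_PER_WORD);
}

/* Walk descriptors 0..n-1 the way do_select does and store every
 * descriptor whose bit is set into out[]. Returns how many were found. */
size_t bitmap_scan(const unsigned long *words, int n, int *out, size_t out_cap)
{
    size_t found = 0;
    int i = 0;

    while (i < n) {
        unsigned long word = *words++;
        unsigned long bit = 1;
        int j;

        if (word == 0) {
            /* No descriptor in this word: skip all of its bits at once. */
            i += BITS_PER_WORD;
            continue;
        }
        for (j = 0; j < BITS_PER_WORD; ++j, ++i, bit <<= 1) {
            if (i >= n)
                break;              /* past the highest descriptor */
            if (!(word & bit))
                continue;           /* descriptor i is not in the set */
            assert(found < out_cap);
            out[found++] = i;       /* the kernel would call f_op->poll here */
        }
    }
    return found;
}
```

This also shows why a tight first argument helps: with n set just above the highest descriptor, the scan never touches the words beyond it.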


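The implementation of poll is basically the same as that of select. It follows the call sequence
poll --> do_sys_poll --> do_poll --> do_pollfd
where do_pollfd calls the poll callback of each descriptor to check its state.
The advantage of poll over select is that it places no fixed limit on the descriptors: select is limited by FD_SETSIZE (1024 by default), so descriptor values must stay below that, whereas poll has no such limit beyond the process's open-file limit.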
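
For comparison with select, here is a minimal user-space sketch of poll. The names poll_readable and poll_pipe_demo are illustrative assumptions; poll, struct pollfd, and POLLIN are the real interface. Because the caller passes an array of struct pollfd rather than a fixed-size bitmap, the descriptor values are not bounded by FD_SETSIZE.

```c
#include <assert.h>
#include <poll.h>
#include <stdlib.h>
#include <unistd.h>

/* Check nfds descriptors for readability with poll().
 * ready[i] is set to 1 if fds[i] is readable, else 0.
 * Returns poll's result: number of ready descriptors, 0 on timeout, -1 on error.
 * (Hypothetical helper, for illustration only.) */
int poll_readable(const int *fds, size_t nfds, int timeout_ms, int *ready)
{
    struct pollfd *pfds;
    size_t i;
    int rc;

    assert(fds != NULL && ready != NULL && nfds > 0);

    pfds = calloc(nfds, sizeof(*pfds));
    if (pfds == NULL)
        return -1;

    for (i = 0; i < nfds; i++) {
        pfds[i].fd = fds[i];        /* any descriptor value is acceptable */
        pfds[i].events = POLLIN;    /* interested in readability only */
    }

    rc = poll(pfds, (nfds_t)nfds, timeout_ms);

    for (i = 0; i < nfds; i++)
        ready[i] = (rc > 0 && (pfds[i].revents & POLLIN)) ? 1 : 0;

    free(pfds);
    return rc;
}

/* Demo: two pipes, data written only to the second one.
 * Returns 0 if poll reports exactly the second pipe as readable. */
int poll_pipe_demo(void)
{
    int a[2], b[2], fds[2], ready[2] = {0, 0}, rc = -1;

    if (pipe(a) != 0)
        return -1;
    if (pipe(b) != 0) {
        close(a[0]);
        close(a[1]);
        return -1;
    }
    fds[0] = a[0];
    fds[1] = b[0];
    if (write(b[1], "x", 1) == 1)
        rc = poll_readable(fds, 2, 100, ready);
    close(a[0]);
    close(a[1]);
    close(b[0]);
    close(b[1]);
    return (rc == 1 && ready[0] == 0 && ready[1] == 1) ? 0 : -1;
}
```

Note that the whole pollfd array is still handed to the kernel on every call and scanned again afterward, so poll removes the size limit but not the per-call cost.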
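From the code analysis above, we can summarize the inherent shortcomings of select/poll:
1) Every call to select/poll has to copy the descriptor set from user space to kernel space, and once the check is done, the result set has to be copied from kernel space back to user space.
When there are many descriptors and select is woken up frequently, this overhead becomes significant.
2) If copying the descriptor sets back and forth were the only cost, it might be tolerable, but the repeated traversal of every descriptor is far worse.
In the application, every call to select/poll must first walk the descriptors and add them to the fd_set; this is the first traversal, at the application level.
Then, in kernel space, there is at least one traversal that calls the poll callback of every descriptor, and usually two: the first finds no ready descriptor
and adds the process to the wait queues; the second happens after the process is woken up and walks everything again. Back in the application, we then have to traverse all the descriptors yet again to find out which ones are ready.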
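
The application-level traversals in point 2 can be seen in a small user-space sketch. The names select_round and select_round_demo are hypothetical, and the visits counter exists only to make the cost visible. One round over N descriptors touches each of them twice in user space (once to rebuild the fd_set, once to test it with FD_ISSET), in addition to the kernel's own scans.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/select.h>
#include <unistd.h>

/* One non-blocking select() round over nfds descriptors.
 * ready[i] is set to 1 if fds[i] is readable, else 0.
 * *visits counts the per-descriptor steps done in user space.
 * Returns select's result, or -1 on error. */
int select_round(const int *fds, size_t nfds, int *ready, size_t *visits)
{
    fd_set rfds;
    struct timeval tv = {0, 0};
    int maxfd = -1, rc;
    size_t i;

    *visits = 0;

    /* User-space pass 1: rebuild the set, because select() overwrites it. */
    FD_ZERO(&rfds);
    for (i = 0; i < nfds; i++) {
        assert(fds[i] >= 0 && fds[i] < FD_SETSIZE);
        FD_SET(fds[i], &rfds);
        if (fds[i] > maxfd)
            maxfd = fds[i];
        (*visits)++;
    }

    /* Kernel: copy the set in, scan every descriptor, copy the result out. */
    rc = select(maxfd + 1, &rfds, NULL, NULL, &tv);
    if (rc < 0)
        return -1;

    /* User-space pass 2: test every descriptor to find the ready ones. */
    for (i = 0; i < nfds; i++) {
        ready[i] = FD_ISSET(fds[i], &rfds) ? 1 : 0;
        (*visits)++;
    }
    return rc;
}

/* Demo: three pipes, data written only to the middle one.
 * Returns 0 if one descriptor is ready after six user-space visits. */
int select_round_demo(void)
{
    int p[3][2], fds[3], ready[3] = {0, 0, 0}, rc = -1, i, opened = 0;
    size_t visits = 0;

    for (i = 0; i < 3; i++) {
        if (pipe(p[i]) != 0)
            break;
        fds[i] = p[i][0];
        opened++;
    }
    if (opened == 3 && write(p[1][1], "x", 1) == 1)
        rc = select_round(fds, 3, ready, &visits);
    for (i = 0; i < opened; i++) {
        close(p[i][0]);
        close(p[i][1]);
    }
    return (rc == 1 && visits == 6 && ready[0] == 0 &&
            ready[1] == 1 && ready[2] == 0) ? 0 : -1;
}
```

Only one of the three descriptors is ready in the demo, yet all three are visited twice by the application; with thousands of mostly idle descriptors the same pattern repeats on every call.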
