This is my second blog post. It is about another web spider, this time for GitHub. Our teacher described GitHub as a social coding platform and asked me to find a GitHub API that would show how active we are there. I searched both the v3 and v4 APIs, but neither seemed to offer what I needed, so I had to fall back on something like a spider.😥😥

Because GitHub is a social coding site, the spider has to be harmless. It should send fewer than twenty HTTP GET requests a day, which I think is negligible. I wrote the spider in Scala with jsoup. I chose jsoup because the GitHub user profile page is almost entirely static, so the data I need is already in the HTML the server returns.

<g transform="translate(16, 20)">
      <g transform="translate(0, 0)">
          <rect class="day" width="10" height="10" x="13" y="0" fill="#ebedf0" data-count="0" data-date="2016-05-22"/>
          <rect class="day" width="10" height="10" x="13" y="12" fill="#ebedf0" data-count="0" data-date="2016-05-23"/>
          <rect class="day" width="10" height="10" x="13" y="24" fill="#ebedf0" data-count="0" data-date="2016-05-24"/>
          <rect class="day" width="10" height="10" x="13" y="36" fill="#ebedf0" data-count="0" data-date="2016-05-25"/>
          <rect class="day" width="10" height="10" x="13" y="48" fill="#ebedf0" data-count="0" data-date="2016-05-26"/>
          <rect class="day" width="10" height="10" x="13" y="60" fill="#ebedf0" data-count="0" data-date="2016-05-27"/>
          <rect class="day" width="10" height="10" x="13" y="72" fill="#ebedf0" data-count="0" data-date="2016-05-28"/>
      </g>

This is the part I need. The page is static, and the data sits in the rect elements, in their data-count and data-date attributes. So I wrote my code like this.

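  import org.jsoup.Jsoup
  import org.jsoup.nodes.Document
  import org.jsoup.select.Elements
  import scala.collection.mutable.ListBuffer

  // Fetch the profile page and turn every <rect> into a Node(date, count).
  def getLinks(url: String): List[Node] = {
    val doc: Document = Jsoup.connect(url).get()
    val links: Elements = doc.select("rect")
    // The buffer itself is mutable, so a val reference is enough.
    val ret: ListBuffer[Node] = ListBuffer()
    val iterator = links.iterator()
    while (iterator.hasNext) {
      val ne = iterator.next()
      ret += new Node(ne.attr("data-date"), ne.attr("data-count"))
    }
    // Hand back an immutable List to the caller.
    ret.toList
  }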
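To try the attribute extraction without sending any request to GitHub, the same selection can be run on a string. This is only a minimal sketch: the object name RectDemo, the helper dateCounts, and the scala-cli dependency line (with its jsoup version) are assumptions, not part of the project. It also swaps the explicit loop for asScala and map, which is an alternative technique rather than the code above, and it assumes Scala 2.13 or later for scala.jdk.CollectionConverters.

```scala
//> using dep org.jsoup:jsoup:1.17.2
// The line above is a scala-cli directive (assumed build tool and version).
import org.jsoup.Jsoup
import scala.jdk.CollectionConverters._

object RectDemo {
  // Two rect elements copied from the profile markup shown earlier.
  val sample: String =
    """<g transform="translate(0, 0)">
      |  <rect class="day" width="10" height="10" x="13" y="0" fill="#ebedf0" data-count="0" data-date="2016-05-22"/>
      |  <rect class="day" width="10" height="10" x="13" y="12" fill="#ebedf0" data-count="0" data-date="2016-05-23"/>
      |</g>""".stripMargin

  // Parse the markup and return (data-date, data-count) for every rect.
  def dateCounts(html: String): List[(String, String)] =
    Jsoup.parse(html)
      .select("rect")
      .asScala
      .toList
      .map(r => (r.attr("data-date"), r.attr("data-count")))
}
```

Calling RectDemo.dateCounts(RectDemo.sample) returns one pair per rect, in document order.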

The getLinks code has one pitfall, and it concerns the list. A Scala List is immutable, so to append elements one by one I collect them in a ListBuffer and convert it to a List at the end. Here is the Node class:


class Node(date:String,count:String) {
  val Date:String = date
  val Count:String = count

  override def toString: String = "Date="+Date+"\t"+"Count="+Count
}

It just holds two strings, and I override the toString method.
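A holder like this could also be written as a case class, which gives equality and pattern matching for free. The following is a sketch under assumptions, not the project's code: the names Contribution, ContributionDemo, and fromAttrs are hypothetical, and storing the count as an Int is my own choice.

```scala
import scala.util.Try

// Hypothetical alternative to Node; the name and the Int count are assumptions.
final case class Contribution(date: String, count: Int) {
  // Same output format as Node.toString: Date=<date><TAB>Count=<count>
  override def toString: String = s"Date=$date\tCount=$count"
}

object ContributionDemo {
  // Convert the raw attribute strings; a non-numeric count yields None.
  def fromAttrs(date: String, count: String): Option[Contribution] =
    Try(count.toInt).toOption.map(n => Contribution(date, n))
}
```

With this shape, ContributionDemo.fromAttrs(ne.attr("data-date"), ne.attr("data-count")) would skip any rect whose count is missing or malformed instead of carrying an empty string along.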

I have now finished the core of the parser. In the next version I plan to use XML to read which users to parse and to write out their activity.

If you want to see all the code, please see my project on GitHub. It is GitHubUserProfileParser.

Last modification:January 27th, 2020 at 01:04 pm