Scrape: A simple, higher level interface for Go web scraping (github.com/yhat)
79 points by ericchiang on May 24, 2015 | 15 comments



Go has some weird syntactic sugar, including a case where a method invocation is rewritten by the compiler to pass in a value or a pointer depending on what the callee wants(!?!). And yet Go code is still littered with:

    if err != nil {
... rather than some simple, compile-time validated sugar to pass the error value up the call chain. Yes, I've read the justification documents. No, they still don't make a very convincing argument.


> rewritten by the compiler to pass in a value or a pointer

This applies only to method receivers, not to arguments. Essentially, invoking a method with a pointer receiver (like '->' in C/C++) looks no different from invoking one with a value receiver (like '.' in C/C++). But you can't pass 'pointer to int' to a method where 'int' is expected.
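A minimal sketch of that receiver rule, in case it helps. The `Counter` type and the `double` function are made up for illustration; only the rewriting behaviour itself comes from the language.

```go
package main

import "fmt"

// Counter is a hypothetical type used only for this illustration.
type Counter struct{ n int }

// Incr has a pointer receiver, so it can modify the Counter it is called on.
func (c *Counter) Incr() { c.n++ }

// Value has a value receiver and operates on a copy.
func (c Counter) Value() int { return c.n }

// double takes a plain int argument; no automatic conversion applies here.
func double(x int) int { return 2 * x }

func main() {
	c := Counter{} // an addressable value
	c.Incr()       // the compiler treats this as (&c).Incr()

	p := &c
	fmt.Println(p.Value()) // the compiler treats this as (*p).Value()

	n := 21
	q := &n
	// double(q) would not compile: *int is not int for ordinary arguments.
	fmt.Println(double(*q)) // an explicit dereference is required
}
```

The shorthand only kicks in for the receiver, and only when the value is addressable; arguments always have to match the declared type.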

> some simple, compile-time validated sugar to pass the error value up the call chain

This bothers me a bit, too. Sometimes I was able to work around it in a creative way. Take http handlers as an example:

  func standardHandler(w http.ResponseWriter, r *http.Request) {
    // errors may occur here...
  }
You can wrap these in an error handling function like this:

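  // naiveHandler is like http.HandlerFunc, but may return an error.
  type naiveHandler func(w http.ResponseWriter, r *http.Request) error

  // handleErrors adapts a naiveHandler to an http.HandlerFunc.
  func handleErrors(fn naiveHandler) http.HandlerFunc {
    return func(w http.ResponseWriter, r *http.Request) {
      if err := fn(w, r); err != nil {
        // handle errors
      }
    }
  }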

  // Each pattern must be distinct; registering the same one twice panics.
  http.HandleFunc("/one", handleErrors(handler1))
  http.HandleFunc("/two", handleErrors(handler2))
  // ...
This is handy since you can handle different errors differently (e.g. report, log, etc.) but uniformly across all handlers. To be fair, the only syntactic sugar I'd need in most cases is an equivalent of a C/C++ macro:

  RETURN_ERR_UNLESS_NIL(/* Go expression returning only error */)


All I can say is that maintaining that is going to be a pleasure. These in-your-face things are perfect for understanding the structure quickly. Software usually spends most of its lifetime in the maintenance phase.

The less magic, syntactic sugar and tricks, the better.


To me, goquery seems more intuitive than scrape, maybe because I am more familiar with jQuery selector syntax.

Any reason why the yhat guys (ericchiang) created Scrape instead of using, say, goquery?

Can you make the matcher function in main.go go away with a simpler (more intuitive) interface/API/DSL?


This is a very small amount of boilerplate around the golang.org/x/net/html package. If you need the huge feature set of goquery, use that. But I find this pretty suitable for my day-to-day problems.


I agree. Reading the title, I thought scrape would offer something beyond HTML manipulation.

As it stands, goquery provides easier and more familiar HTML manipulation.


I like goquery[1] for doing this type of thing.

[1] https://github.com/PuerkitoBio/goquery


I'd like to introduce htmlutil[1] and cascadia[2] for DOM processing in Go, which are useful for scraping articles.

[1]: https://github.com/thinxer/go-htmlutil

[2]: https://github.com/andybalholm/cascadia


Selfless plug: you may also want to check out Surf for web scraping.

https://github.com/headzoo/surf

Docs: http://www.gosurf.io/

Among other things, goquery is baked in, so you can easily select page elements using CSS selectors.


This is very cool. I'm not much of a front-end guy so I'm struggling with the examples. Would you mind posting a simple example that will scrape, say, the first TD tag of every row of a table? Thanks.


  rows := scrape.FindAll(table, scrape.ByTag(atom.Tr))
  cols := []*html.Node{}
  for _, row := range rows {
      // Find returns the first result
      col, ok := scrape.Find(row, scrape.ByTag(atom.Td))
      if ok {
          cols = append(cols, col)
      }
  }


Thanks!



Does it support XPath?


Those who don't understand XPath are cursed to reinvent it, poorly.



