decode page content

About

Go assumes UTF-8 when it handles strings and text streams. Pages on the web, however, may use other encodings, so you have to convert their content to UTF-8 first; otherwise you will see garbled characters.

    resp, err := http.Get("http://hq.sinajs.cn/list=sh601162")
    if err != nil {
        panic("http request error")
    }
    defer resp.Body.Close()

    body, err := ioutil.ReadAll(resp.Body)
    if err != nil {
        panic("can't read content")
    }

    fmt.Println(string(body))

The code above is a simple way to fetch content from a web page, but if you run it you will see garbled characters on the command line, because the response is not UTF-8 encoded.
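To confirm that garbled output comes from the encoding rather than from the request itself, you can check whether the bytes are valid UTF-8. The following is a minimal sketch using only the standard library; `isValidUTF8` is a hypothetical helper name, and the two bytes `0xD6 0xD0` (the GBK/GB18030 encoding of "中") are an illustrative input, not data from the page above.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// isValidUTF8 is a hypothetical helper: it reports whether b is
// well-formed UTF-8. A false result suggests the data uses another
// encoding and needs to be converted before printing.
func isValidUTF8(b []byte) bool {
	return utf8.Valid(b)
}

func main() {
	gbk := []byte{0xD6, 0xD0} // "中" in GBK/GB18030 (illustrative input)
	utf := []byte("中")        // the same character in UTF-8

	fmt.Println(isValidUTF8(gbk)) // false
	fmt.Println(isValidUTF8(utf)) // true
}
```

Note that a `true` result does not prove the data is UTF-8 (plain ASCII is valid in many encodings), so treat this as a quick diagnostic only.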

How do we handle it? Someone has already provided a solution: github.com/Tang-RoseChild/mahonia

With it, the snippet looks like this.

    resp, err := http.Get("http://hq.sinajs.cn/list=sh601162")
    if err != nil {
        panic("http request error")
    }
    defer resp.Body.Close()

    body, err := ioutil.ReadAll(resp.Body)
    if err != nil {
        panic("can't read content")
    }

    enc := mahonia.NewDecoder("gb18030")
    out := enc.ConvertString(string(body))
    fmt.Println(out)

Now the string prints correctly.

But it looks a little bit ugly. Why? It reads resp.Body into a []byte, converts that to a string, and only then runs the decoder over it. It works, but it is not ideal. How can we do it more efficiently?

	resp, err := http.Get("http://hq.sinajs.cn/list=sh601162")
	if err != nil {
		panic("http request error")
	}
	defer resp.Body.Close()
	enc := mahonia.NewDecoder("gb18030")
	content := enc.NewReader(resp.Body)
	io.Copy(os.Stdout, content)

Using mahonia's reader skips the conversions between string and []byte. You still get the result on the command line, but there is a limitation: being able to write only to os.Stdout is inconvenient.

    ......
    enc := mahonia.NewDecoder("gb18030")
    content := enc.NewReader(resp.Body)
    buffer := new(bytes.Buffer)
    io.Copy(buffer, content)
    fmt.Println(buffer.String())

This is a more general way to handle it. Can we improve it further? Yes: mahonia's reader provides a Read method.

    ......
    enc := mahonia.NewDecoder("gb18030")
    content := enc.NewReader(resp.Body)
    s := make([]byte, 256)
    // A single Read fills at most len(s) bytes; the error is ignored here.
    n, _ := content.Read(s)
    // Print only the n bytes that were actually read.
    fmt.Println(string(s[:n]))

If you understand Go's interfaces, you can easily solve this kind of problem. I leave it to you to make it better. The key point is handling the reader's EOF.