Lost in string representations

by Clement Delafargue on March 4, 2014

Tagged as: haskell, bytestring, text, string.

When you start playing with Haskell, handling text is simple: you use String, which is just an alias for [Char]. You have access to the full goodness of Data.List and for a while, you’re in a happy place. The first version of hammertime used strings everywhere.

Soon enough, you learn many operations on String are O(n) and you use Data.Text as a drop-in replacement, and for a while, you’re in a sort of happy place. Hammertime now uses Data.Text for its internal data representations.

I was there until I started playing more seriously with Snap. Snap is a Haskell web framework with roughly the same vision as the Play framework, plus the nice addition of the Snaplet mechanism and minus typesafe routing. Like Play, it’s focused on HTTP, doesn’t come with a gazillion persistence layers and is quite simple to understand (I’m looking at you, Yesod). Like Play, it’s based on iteratee IO (but this will change in 1.0 to io-streams, which will bring a simpler interface and a performance boost). Its template system, Heist, is really nice (it’s inspired by liftweb’s template system, which was my favourite part of that framework. Yes, I’ve used (and liked) liftweb. Don’t judge me).

Long story short, I kinda like Snap. Try it out, it’s fun.

Back to the main story. Even though I use Data.Text in my models, Snap uses Data.ByteString for the data you extract from requests (path fragments, query string parameters, form data). For more fun, Data.UUID only parses String or Data.ByteString.Lazy. For even more fun, Aeson, the JSON library, only handles Data.Text in its AST (but it can serialize to and deserialize from Data.ByteString).

Why so complicated?

Turns out every representation has its uses (Except String. Don’t use String outside of 1HaskellADay one-liners).

String

String is a naive representation of a string. It’s implemented as a linked list of Unicode characters, so most operations on it have linear complexity and every character costs a list cell. It’s unfit for serious use.
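To make the "just a list" point concrete, here is a minimal sketch using only base. The names greeting and shout are made up for the illustration; the point is that plain list functions work on String as-is, and that something as basic as length has to walk every cell.

```haskell
import Data.Char (toUpper)
import Data.List (intercalate, isPrefixOf)

-- A String literal is sugar for a list of Char.
-- (greeting and shout are hypothetical names for this sketch.)
greeting :: String
greeting = ['h', 'e', 'l', 'l', 'o'] -- same value as "hello"

-- Any list function works on a String, here map.
shout :: String -> String
shout = map toUpper

main :: IO ()
main = do
  putStrLn (shout greeting)
  -- length traverses the whole linked list, one cell per character.
  print (length greeting)
  print ("he" `isPrefixOf` greeting)
  putStrLn (intercalate ", " ["String", "Text", "ByteString"])
```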

Data.Text

Data.Text is a space-efficient, unboxed representation for strings (it also provides a lazy version, which is a list of strict chunks). It has extremely good space and time performance thanks to its internal representation and a powerful loop fusion mechanism. Internally, it’s packed UTF-16 (text 2.0 and later switched to UTF-8).

It’s the representation you should use for your data models.
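Here is roughly what day-to-day Data.Text code looks like, as a small sketch assuming the text package is installed. normalizeTag is a hypothetical helper, not something taken from hammertime; everything else comes from Data.Text and Data.Text.IO.

```haskell
{-# LANGUAGE OverloadedStrings #-}

import qualified Data.Text as T
import qualified Data.Text.IO as TIO

-- Hypothetical helper: trim surrounding whitespace and lowercase a tag.
normalizeTag :: T.Text -> T.Text
normalizeTag = T.toLower . T.strip

main :: IO ()
main = do
  -- With OverloadedStrings, string literals can be used directly as Text.
  let tags = T.splitOn "," "Haskell, ByteString ,text"
  TIO.putStrLn (T.intercalate "|" (map normalizeTag tags))
  -- T.pack and T.unpack convert from and to String at the edges.
  putStrLn (T.unpack (normalizeTag (T.pack "  Snap ")))
```

The module is meant to be imported qualified, since most of its names clash with the Prelude's list functions.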

Data.ByteString and Data.ByteString.Lazy

Data.ByteString is a strict, immutable representation for binary data. It represents sequences of bytes and is suitable for high-performance use. Unlike Data.Text, Data.ByteString does not carry any information about character encoding. That’s why, when converting from ByteString to Text, you need to specify the encoding explicitly (and handle possible failures).
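As a sketch of that conversion, assuming the bytes are supposed to be UTF-8: decodeUtf8' from Data.Text.Encoding returns an Either instead of throwing on invalid input. decodeParam is a hypothetical name for the kind of helper you end up writing around request parameters.

```haskell
import qualified Data.ByteString as B
import qualified Data.Text as T
import qualified Data.Text.Encoding as TE

-- Hypothetical helper: decode raw bytes as UTF-8, reporting failure
-- as a Left instead of throwing an exception.
decodeParam :: B.ByteString -> Either String T.Text
decodeParam bytes =
  case TE.decodeUtf8' bytes of
    Left err   -> Left ("invalid UTF-8: " ++ show err)
    Right text -> Right text

main :: IO ()
main = do
  -- Valid UTF-8 bytes decode to a Right.
  print (decodeParam (TE.encodeUtf8 (T.pack "hammertime")))
  -- 0xff can never appear in well-formed UTF-8, so this yields a Left.
  print (decodeParam (B.pack [0xff, 0xfe]))
```

Going the other way is the easy direction: TE.encodeUtf8 cannot fail, because every Text value has a UTF-8 encoding.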

Data.ByteString.Lazy is the lazy version, a list of strict chunks, for data too big to fit in memory at once.
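Since libraries disagree on strict versus lazy, you end up converting between the two fairly often. A minimal sketch, assuming bytestring 0.10 or later (for fromStrict and toStrict); chunked is a made-up value for the example.

```haskell
import qualified Data.ByteString as B
import qualified Data.ByteString.Char8 as BC
import qualified Data.ByteString.Lazy as BL

-- A lazy ByteString built from three strict chunks.
-- (chunked is a hypothetical name for this sketch.)
chunked :: BL.ByteString
chunked =
  BL.fromChunks
    [BC.pack "lost in ", BC.pack "string ", BC.pack "representations"]

main :: IO ()
main = do
  -- The chunk structure is still visible on the lazy side.
  print (length (BL.toChunks chunked))
  -- toStrict copies every chunk into one contiguous buffer.
  let strict = BL.toStrict chunked
  print (B.length strict)
  -- fromStrict wraps a strict ByteString as a single chunk;
  -- equality compares contents, not chunk boundaries.
  print (BL.fromStrict strict == chunked)
```

Note that toStrict forces the whole value into memory, which defeats the purpose when the data really is too big.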

TL;DR

To sum up: Data.Text for readable text with a known encoding, Data.ByteString for high-performance binary transfer where you don’t care about the encoding.

Many thanks to @yoeight, @lucasdicioccio, and @BeRewt for helping me out. :)

For a more accurate and exhaustive breakdown of string representations, check out “How to pick your string library in Haskell” by Edward Z. Yang.
