#2357 URI Encoding / Decoding

SlimerDude Fri 10 Oct 2014

First off, after reading about the horror that other programming languages bring to the table with regard to URI support (notably Ruby and Python), I'm very pleased that Fantom took the time to get this right!

I do however have the odd question regarding URI encoding / decoding...

Q1) Encoding

Let's say I have some nasty bit of text that I want to use as a path segment. (Let's even have it include the / character!) How do I create a URI with this? Example:

url   := `http://foo.com/`
nasty := "-/-"

// ... code to add 'nasty' to 'url'

// the standard (backslashed) form of url should now look like:
echo(url) // --> http://foo.com/-\/-

To create a Uri from a Str there is sys::Uri.fromStr and sys::Uri.decode, but to use these my Str needs to be already encoded as standard form or percent encoded.

Is there a method somewhere that does this encoding for me? Or am I to write it myself?

Q2) Decoding

Assuming I now have my URL http://foo.com/-\/-, how do I convert the path segment back into a plain Str that is no longer in standard form? Example:

url   := Uri("http://foo.com/-\\/-")
path  := url.path.first

nasty := ... // code to decode path *from* standard form

echo(nasty) // --> -/-

Again, am I not seeing a method somewhere or do I need to write it myself?

If these standard form <-> Str and percent encoding <-> Str methods don't currently exist, I feel they're an omission from the current (and otherwise excellent) API, and it'd be nice if they were supported.

SlimerDude Fri 10 Oct 2014

For those wanting to convert to / from URI standard form, here are my methods:

static const Int[] delims := ":/?#[]@\\".chars

** Encode the Str *to* URI standard form
** see http://fantom.org/sidewalk/topic/2357
static Str encodeUri(Str str) {
    buf := StrBuf(str.size + 8) // allow for 8 escapes
    str.chars.each |char| {
        if (delims.contains(char))
            buf.addChar('\\')
        buf.addChar(char)
    }
    return buf.toStr
}

** Decode the Str *from* URI standard form
** see http://fantom.org/sidewalk/topic/2357
static Str decodeUri(Str str) {
    if (!str.chars.contains('\\'))
        return str
    buf := StrBuf(str.size)
    escaped := false
    str.chars.each |char| {
        escaped = (char == '\\' && !escaped)
        if (!escaped)
            buf.addChar(char)
    }
    return buf.toStr
}

brian Mon 20 Oct 2014

Maybe I'm not fully grokking it, but how does the API know whether the "/" in "-/-" should be treated as a path separator or backslash escaped? All the normal encode/decode methods assume that special chars are being used for their normal purposes (scheme, port, path separators). Are you trying to escape all those special chars because you know that the string is a single file name within the path?

It sounds like you are doing some whacked-out things if you are trying to escape slashes and stuff, so maybe some background info might help too.

SlimerDude Sun 26 Oct 2014

maybe some background info might help too.

Sure.

Str <-> Standard Form

BedSheet encodes / decodes objects as strings so they may be embedded in URLs. A common use case is that a user object may encode itself as its primary key, so a User with an ID of 42 may be combined with a URL of /user to make /user/42.

A string msg could also be encoded, for example Hello Mum! would become /showMsg/Hello Mum!. The point is, any string should be able to be encoded into a URL:

msg := "What the @#:\\/!?"
url := `/showMsg/` + encodeUri(msg).toUri

At the other end, when you're handling the request for /showMsg/..., you want to decode the URI segment back into its original form.

origMsg := decodeUri(url.path[1]) // --> "What the @#:\\/!?"

So the methods do as they say, encode and decode strings into URI paths.

Str <-> Percent Encoding

I see in the Java source that Fantom has some pretty optimised routines for encoding / decoding URIs into a percent-encoded format. It would be neat if they were exposed a little so others could make use of them.

OAuth in particular makes heavy use of percent encoding.

Specifically I'm thinking of methods like:

static Str percentEncode(Str str, Str exclude)
static Str percentDecode(Str str)

where exclude is a list of characters that will not be encoded, usually the unreserved set -._~
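For illustration, the decoding half of that pair could be sketched as below. This is only a sketch under assumptions: the name percentDecode comes from the proposed signature above, but the body is hypothetical and is not Fantom's internal routine. It collects the raw bytes in a Buf and then reads them back as UTF-8, so multi-byte escapes are reassembled into a single character.

```fantom
** Hypothetical body for the proposed percentDecode signature.
** Collects raw bytes in a Buf, then reads them back as UTF-8 text.
static Str percentDecode(Str str) {
    buf := Buf()                       // Buf's default charset is UTF-8
    i := 0
    while (i < str.size) {
        ch := str[i]
        if (ch == '%') {
            // a '%' must be followed by two hex digits
            if (i + 2 >= str.size)
                throw ParseErr("Truncated escape at index $i")
            // toInt throws ParseErr if the two chars don't parse as hex
            Int octet := str[i+1..i+2].toInt(16)
            buf.write(octet)
            i += 3
        } else {
            buf.writeChar(ch)          // literal char, written out as UTF-8
            i++
        }
    }
    return buf.flip.readAllStr(false)  // false = leave newlines untouched
}
```

With this sketch, percentDecode("%E2%82%AC") should give "\u20ac", since E2 82 AC is the UTF-8 byte sequence for U+20AC. The exclude parameter of the proposed percentEncode is not covered here.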

brian Thu 8 Jan 2015

Ticket promoted to #2357 and assigned to brian

Add methods to Uri to encode/decode just the name portion of path

brian Fri 10 Jul 2015

More summary from 2432:

  1. Encode/decode just name portion (or really any portion)
  2. Make parts that encodes the scheme, host, path, query, frag and glues it back together with appropriate separators
  3. Maybe add isPathRel

SlimerDude Tue 10 Nov 2015

As percent encoding UTF-8 strings is non-trivial, here's some sample code:

**
** Percent encode the given string as per 
** [Parameter Encoding]`http://oauth.net/core/1.0/#rfc.section.5.1` 
** of the OAuth spec. 
** Essentially, encode ALL characters except for 'A-Za-z0-9-_.~'
**
static Str percentEscape(Str str) {
    buf := StrBuf(str.size * 2)
    str.each { 
        if (it.isAlphaNum || it == '-' || it == '_' || it == '.' || it == '~')
            buf.addChar(it)
        else
            percentEncodeUtf8Char(buf, it)
    }
    return buf.toStr            
}

static Void percentEncodeUtf8Char(StrBuf buf, Int c) {
    if (c <= 0x007F) {
        percentEncodeByte(buf, c)
    } else if (c <= 0x07FF) {
        percentEncodeByte(buf, 0xC0.or(c.shiftr( 6).and(0x1F)))
        percentEncodeByte(buf, 0x80.or(c.shiftr( 0).and(0x3F)))
    } else if (c <= 0xFFFF) {
        percentEncodeByte(buf, 0xE0.or(c.shiftr(12).and(0x0F)))
        percentEncodeByte(buf, 0x80.or(c.shiftr( 6).and(0x3F)))
        percentEncodeByte(buf, 0x80.or(c.shiftr( 0).and(0x3F)))
    } else if (c <= 0x10FFFF) {
        percentEncodeByte(buf, 0xF0.or(c.shiftr(18).and(0x07)))
        percentEncodeByte(buf, 0x80.or(c.shiftr(12).and(0x3F)))
        percentEncodeByte(buf, 0x80.or(c.shiftr( 6).and(0x3F)))
        percentEncodeByte(buf, 0x80.or(c.shiftr( 0).and(0x3F)))
    } else
        throw ArgErr("0x${c.toHex} is not a valid Unicode code point")
}

static Void percentEncodeByte(StrBuf buf, Int c) {
    buf.addChar('%')
    hi := c.shiftr(4).and(0xf)
    lo := c.and(0xf)
    buf.addChar(hi < 10 ? '0'+hi : 'A'+(hi-10))
    buf.addChar(lo < 10 ? '0'+lo : 'A'+(lo-10))
}

Some test examples:

// examples from https://en.wikipedia.org/wiki/UTF-8
percentEscape("\u0024")              // --> "%24"
percentEscape("\u00a2")              // --> "%C2%A2"
percentEscape("\u20ac")              // --> "%E2%82%AC"
percentEncodeUtf8Char(buf, 0x10348)  // appends "%F0%90%8D%88" to buf

// examples from https://tools.ietf.org/html/rfc3629#section-7
percentEscape("\u2262\u0391")        // --> "%E2%89%A2%CE%91"
percentEscape("\uD55C\uAD6D\uC5B4")  // --> "%ED%95%9C%EA%B5%AD%EC%96%B4"
percentEscape("\u65E5\u672C\u8A9E")  // --> "%E6%97%A5%E6%9C%AC%E8%AA%9E"
percentEncodeUtf8Char(buf, 0x233B4)  // appends "%F0%A3%8E%B4" to buf

brian Fri 15 Sep 2017

Ticket resolved in 1.0.70

I cleaned up the escape handling in Uri normalization, and added five new methods: isPathRel, escapeToken, unescapeToken, encodeToken, and decodeToken. These methods are not designed as a general purpose percent encoding library, but rather just to work with URIs and our predefined and optimized charMap/delimiter tables. These are actually the fundamental building blocks that were not easily exposed previously. I decided against some other higher level convenience methods such as a new "section constructor" for now, although it's now fairly easy to build up parts into a normalized or encoded form yourself with a StrBuf.
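As a rough illustration of how the new methods might be applied to the "-/-" example from the start of this thread: the sketch below assumes the 1.0.70 signatures take the token plus a section constant (Uri.sectionPath, Uri.sectionQuery or Uri.sectionFrag), and that unescapeToken takes just the token. Check the sys::Uri docs before relying on it.

```fantom
name := "-/-"

// Backslash-escape path delimiters, giving a token usable in standard form
escaped := Uri.escapeToken(name, Uri.sectionPath)
uri     := Uri.fromStr("http://foo.com/" + escaped)

// Percent-encode the same token for use in an encoded URI string
encoded := Uri.encodeToken(name, Uri.sectionPath)
uri2    := Uri.decode("http://foo.com/" + encoded)

// Strip the backslash escapes again to recover the plain Str
plain := Uri.unescapeToken(escaped)
echo(plain == name)
```

Building the string with plain concatenation (or a StrBuf for longer URIs) keeps the choice of section explicit for each part, which is presumably why the tokens are exposed as building blocks rather than wrapped in a constructor.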

SlimerDude Sat 16 Sep 2017

Those methods look like a good addition Brian, thanks! I look forward to trying them out.

As for the convenience ctor, I may try putting a util class together which, now that we have the new methods, hopefully shouldn't be too difficult.
