Pseudo-Protocol and Encoding Bypass

Posted on 2021-09-10, edited on 2025-06-14, in Security

This post follows up on my previous one, "Classification and Exploitation of XSS" (XSS 的分类与利用方式). The pseudo-protocols and encoding bypasses covered here are also code injection techniques: a pseudo-protocol provides an additional injection point, and encoding bypass is the matching way to inject through it.

What is a pseudo-protocol

The concept of a pseudo-protocol is fairly simple, but it shows up in many forms.

Pseudo-protocols differ from network protocols widely used on the Internet such as http, https, and ftp: they appear in the scheme position of a URL and make the browser perform a specific function.
For example, the data pseudo-protocol: ‘data:text/html;base64,PHanflajfAFLGLKSJ=’,
and the JavaScript pseudo-protocol: ‘javascript:alert(1);’

<p><a href="javascript:alert(5)">javascript pseudo-protocol</a></p>
<p><a href="data:text/html;base64,PHNjcmlwdD5hbGVydCgxKTs8L3NjcmlwdD4=">data pseudo-protocol, base64</a></p>
<p><a href="data:text/html;base64,PHNjcmlwdD5hbGVydCgiMSIpOzwvc2NyaXB0Pg">data pseudo-protocol, base64, payload containing double quotes</a></p>
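
To see what a base64 data link like the ones above actually carries, the payload can be decoded outside the browser. This is a minimal sketch: `decodeDataUrl` is a hypothetical helper name, and it only handles the `data:<mime>;base64,<payload>` shape used here, not the full data URL syntax.

```javascript
// Hypothetical helper: split a base64 data: URL into MIME type and decoded body.
// Only the "data:<mime>;base64,<payload>" form is handled in this sketch.
function decodeDataUrl(url) {
  const match = /^data:([^;,]*);base64,(.*)$/.exec(url);
  if (!match) {
    throw new Error('not a base64 data: URL');
  }
  // atob is available in browsers and in recent Node.js versions.
  return { mime: match[1], body: atob(match[2]) };
}

const first = decodeDataUrl('data:text/html;base64,PHNjcmlwdD5hbGVydCgxKTs8L3NjcmlwdD4=');
console.log(first.mime); // text/html
console.log(first.body); // <script>alert(1);</script>
```

The decoded body is an ordinary script tag, which is why following such a link can run code even though the markup itself contains no visible script.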

Encoding bypass

The pseudo-protocol gives us an additional code injection point, and encoding bypass is how we inject through it.

This technique is mainly used to get around filters that look for ‘<script>’ tags, parentheses, quotation marks, and similar characters.
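
As a rough illustration of the kind of filter this targets, here is a sketch of a keyword blacklist. `naiveFilter` and its two patterns are assumptions made up for this example, not a filter taken from any real product.

```javascript
// Hypothetical blacklist filter: accepts input only if it contains
// neither "<script" nor parentheses.
function naiveFilter(input) {
  const blocked = [/<\s*script/i, /[()]/];
  return !blocked.some((pattern) => pattern.test(input));
}

// A plain script tag is caught.
console.log(naiveFilter('<script>alert(1)</script>')); // false

// The same script hidden inside a base64 data: URL is not.
console.log(naiveFilter('data:text/html;base64,PHNjcmlwdD5hbGVydCgxKTs8L3NjcmlwdD4=')); // true
```

The filter inspects the raw text only, so anything that the browser will decode later passes through untouched.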

Unicode encoding

Before talking about encoding bypass, let’s first review this well-known character encoding.

Unicode, maintained by the Unicode Consortium and kept in sync with the ISO/IEC 10646 standard, is a character set that aims to cover every writing system and symbol in use. Early versions fit every character in two bytes, but the code space now runs from U+0000 to U+10FFFF, so two bytes are no longer enough for every character.

Unicode is just a character set. It assigns each character a number (its code point), but it does not specify how that number is stored as bytes.

The actual storage is defined by encodings such as UTF-8 and UTF-16.

JavaScript encoding: ‘\u’ introduces a Unicode escape, so the string test can be written as ‘\u0074\u0065\u0073\u0074’. The ‘&#x’ prefix plays the same role in HTML, where test can be written as ‘&#x0074;&#x0065;&#x0073;&#x0074;’ or ‘&#x74;&#x65;&#x73;&#x74;’.
HTML entity encoding: ‘&#’ introduces a numeric character reference. Taking test as the example again: ‘&#116;&#101;&#115;&#116;’.
URL encoding: this is the most familiar one. Each byte is written as a percent sign followed by two hex digits, such as ‘%74%65%73%74’.
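
The encodings of test listed above can be reproduced with a few lines of JavaScript. This is only a sketch for ASCII input; the helper names are illustrative and do not come from any library.

```javascript
// Each helper encodes every character of an ASCII string in one scheme.
// The helper names are illustrative, not part of any library.
const toHex = (ch, width) => ch.charCodeAt(0).toString(16).padStart(width, '0');

const jsUnicodeEscape = (s) => [...s].map((ch) => '\\u' + toHex(ch, 4)).join('');
const htmlHexReference = (s) => [...s].map((ch) => '&#x' + toHex(ch, 2) + ';').join('');
const htmlDecimalReference = (s) => [...s].map((ch) => '&#' + ch.charCodeAt(0) + ';').join('');
const percentEncode = (s) => [...s].map((ch) => '%' + toHex(ch, 2)).join('');

console.log(jsUnicodeEscape('test'));      // \u0074\u0065\u0073\u0074
console.log(htmlHexReference('test'));     // &#x74;&#x65;&#x73;&#x74;
console.log(htmlDecimalReference('test')); // &#116;&#101;&#115;&#116;
console.log(percentEncode('test'));        // %74%65%73%74
```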

Browser decoding

There are three main processes involved in parsing an HTML document: HTML parsing, URL parsing, and JavaScript parsing. Each parser decodes and parses its own part of the document, and the order in which they run depends on where the content appears.

First, let’s look at each parser on its own.

HTML encoding

<p>For the a tag, comparing encodings of javascript:alert(1) in the href attribute</p>
<p>1 <a href="javascript:alert(1)">unencoded</a></p>
<p>2 <a h&#114;ef="javascript:alert(1)">encode r in href</a></p>
<p>3 <a href="javasc&#114;ipt:alert(1)">encode r in javascript</a></p>
<p>4 <a href="javascript:ale&#114;t(1)">encode r in alert(1)</a></p>

Of these, only 2 fails to parse correctly, because it breaks the HTML document structure.
The HTML parser does not decode character references inside tag and attribute names, so it never recognizes 2 as a normal href attribute. The same is true in the next two sections: whichever encoding is applied to the attribute name, case 2 cannot be decoded correctly.
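
The behaviour described above can be modelled roughly as follows. This is a simplified sketch, not the real HTML tokenizer: `decodeNumericReferences` and `parseAttribute` are hypothetical names, and only numeric character references are handled.

```javascript
// Decode decimal and hexadecimal numeric character references, e.g. &#114; or &#x72;.
// Named references such as &amp; are left out of this sketch.
function decodeNumericReferences(text) {
  return text.replace(/&#(x[0-9a-f]+|[0-9]+);/gi, (_, code) => {
    const isHex = code[0] === 'x' || code[0] === 'X';
    return String.fromCodePoint(parseInt(isHex ? code.slice(1) : code, isHex ? 16 : 10));
  });
}

// Model of the HTML stage: the attribute name is taken literally,
// only the attribute value is decoded.
function parseAttribute(rawName, rawValue) {
  return { name: rawName, value: decodeNumericReferences(rawValue) };
}

console.log(parseAttribute('href', 'javasc&#114;ipt:alert(1)'));
// { name: 'href', value: 'javascript:alert(1)' }
console.log(parseAttribute('h&#114;ef', 'javascript:alert(1)').name === 'href'); // false
```

An encoded attribute value is restored before the next stage sees it, while an encoded attribute name simply stays a different name.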

URL encoding

<p>For the a tag, comparing encodings of javascript:alert(1) in the href attribute</p>
<p>1 <a href="javascript:alert(1)">unencoded</a></p>
<p>2 <a h%72ef="javascript:alert(1)">encode r in href</a></p>
<p>3 <a href="javasc%72ipt:alert(1)">encode r in javascript</a></p>
<p>4 <a href="javascript:ale%72t(1)">encode r in alert(1)</a></p>

Here 1 and 4 work. 2 fails for the same reason as above: it is not parsed as normal HTML. 3 is parsed into an a tag, but the encoded part is the javascript scheme itself, and the scheme is part of the URL. The URL parser identifies the protocol by this keyword before decoding anything, so it cannot recognize ‘javasc%72ipt’ and the bypass fails.
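
A rough model of this stage, written as a sketch rather than the browser's actual URL parser: the scheme is read from the raw string first, and only what follows it is percent-decoded. `parseUrlStage` is a hypothetical name.

```javascript
// Model of the URL stage: the scheme is read from the raw string first,
// and only the remainder is percent-decoded.
function parseUrlStage(rawUrl) {
  const match = /^([a-z][a-z0-9+.-]*):(.*)$/i.exec(rawUrl.trim());
  if (!match) {
    return { scheme: null, body: null };
  }
  return { scheme: match[1].toLowerCase(), body: decodeURIComponent(match[2]) };
}

console.log(parseUrlStage('javascript:ale%72t(1)'));
// { scheme: 'javascript', body: 'alert(1)' }
console.log(parseUrlStage('javasc%72ipt:alert(1)'));
// { scheme: null, body: null }
```

Because ‘%’ is not a legal scheme character, the second input never yields a javascript scheme, so nothing after it is treated as code.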

JavaScript encoding

<p>For the a tag, comparing encodings of javascript:alert(1) in the href attribute</p>
<p>1 <a href="javascript:alert(1)">unencoded</a></p>
<p>2 <a h\u0072ef="javascript:alert(1)">encode r in href</a></p>
<p>3 <a href="javasc\u0072ipt:alert(1)">encode r in javascript</a></p>
<p>4 <a href="javascript:ale\u0072t(1)">encode r in alert(1)</a></p>
<p>5 <a href="javascript:alert\u00281)">encode ( in alert(1)</a></p>

Again 1 and 4 work. 2 breaks the DOM structure. In 3 the tag parses normally, but the protocol is broken. 5 breaks JavaScript parsing: a Unicode escape is accepted inside identifiers and string literals, but not in place of punctuation such as ‘(’, so the code cannot run.
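
The difference between cases 4 and 5 can be checked directly with the JavaScript engine. The sketch below uses `new Function` only to compile a snippet; `compiles` is a hypothetical helper name, and the `alert` parameter in the last lines is a stand-in function, not the browser's `alert`.

```javascript
// Compile a snippet without running it; returns true if it is valid JavaScript.
function compiles(source) {
  try {
    new Function(source);
    return true;
  } catch (err) {
    if (err instanceof SyntaxError) return false;
    throw err;
  }
}

console.log(compiles('ale\\u0072t(1)')); // true: the escape is part of an identifier
console.log(compiles('alert\\u00281)')); // false: "(" cannot be written as an escape

// The escaped identifier resolves to the same name as the plain one.
const call = new Function('alert', 'return ale\\u0072t(1);');
console.log(call((x) => x + 1)); // 2
```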

In summary, if a keyword that a parsing stage depends on is encoded, the structure that stage expects is destroyed and parsing fails.

If the href attribute name is encoded, the HTML parsing stage cannot handle it correctly: the parser does not see an a tag with an attribute called ‘href’, but one with an attribute literally called ‘h&#114;ef’.

When the URL parser reaches ‘javasc%72ipt:alert(1)’, it does not treat it as the JavaScript pseudo-protocol at all, so it never decodes it.

<p>4 <a href="javascript:ale\u0072t(1)">encode r in alert(1)</a></p>
<p>5 <a href="javascript:alert\u00281)">encode ( in alert(1)</a></p>

Both of these reach the final JavaScript parsing stage, where the JS code is extracted and parsed. What happens then depends on how the JS parser handles it.

Conversely, if a URL keyword such as ‘javascript’ is encoded in a way that the HTML decoding stage does not restore, the URL parsing stage will not recognize it either.

Multi-layer obfuscation

So far we have looked at each parser separately. As mentioned, the browser processes the document in order: the HTML parser builds the DOM tree, the URL parser decodes the URL attributes in the DOM tree, and finally the JavaScript parser decodes the parts that are treated as JS code.

By applying the encodings in the reverse of this order, we can put two or even three layers of obfuscation on a piece of HTML code.

Two-layer obfuscation, for example:

<p>1 <a href="javascript:ale%5c%75%30%30%37%32t(1)">r in alert(1) is JS encoded, then URL encoded</a></p>
<p>2 <a href="javascript:ale&#92;u0072t(1)">r in alert(1) is JS encoded, then HTML encoded</a></p>
<p>3 <a href="javascript:ale&#37;72t(1)">r in alert(1) is URL encoded, then HTML encoded</a></p>

Three-layer obfuscation

<p>For the a tag, javascript:alert(1) in the href attribute is triggered through three layers of encoding</p>
<p>Three-layer decoding <a href="javascript:ale&#37;5c&#37;75&#37;30&#37;30&#37;37&#37;32t(1)">The r in alert(1) is JS encoded, then URL encoded, then HTML encoded</a></p>
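
Building such a payload means applying the encodings in the reverse of the decoding order. The following sketch does that for the single letter r. The helper names are illustrative, and HTML-encoding only the ‘%’ characters in the last layer is an assumption made for readability; any subset of the characters could be HTML encoded instead.

```javascript
// Encoders for a single layer each; names are illustrative.
const hex2 = (ch) => ch.charCodeAt(0).toString(16).padStart(2, '0');
const jsEscape = (s) => [...s].map((ch) => '\\u' + hex2(ch).padStart(4, '0')).join('');
const urlEncode = (s) => [...s].map((ch) => '%' + hex2(ch)).join('');
const htmlEncode = (s) => [...s].map((ch) => '&#' + ch.charCodeAt(0) + ';').join('');

// Apply the layers in the reverse of the decoding order: JS first, then URL, then HTML.
const jsLayer = jsEscape('r');       // \u0072
const urlLayer = urlEncode(jsLayer); // %5c%75%30%30%37%32
const htmlLayer = urlLayer.replace(/%/g, (ch) => htmlEncode(ch)); // each % becomes &#37;

console.log('javascript:ale' + urlLayer + 't(1)');
// javascript:ale%5c%75%30%30%37%32t(1)
console.log('javascript:ale' + htmlLayer + 't(1)');
// javascript:ale&#37;5c&#37;75&#37;30&#37;30&#37;37&#37;32t(1)
```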

Summary

To sum up: as the previous XSS post showed, developers can filter input by detecting keywords such as script.

However, the browser also supports pseudo-protocols, which gives us another way to inject code.

The advantage of this approach is that it lets us encode those keywords and so slip past that kind of detection.

Doing so requires understanding two things: first, how the browser encodes and decodes, and second, the order in which it decodes.

The first stage is HTML decoding. If a keyword that HTML itself depends on is encoded at this stage, it cannot be parsed correctly. For example, if the href attribute name of an a tag is HTML encoded, the browser assumes you simply want to give the tag an attribute with that literal name, such as ‘h&#114;ef’. Other parts can be HTML decoded at this stage, though. In ‘javascript:alert(1)’, an r written in HTML encoding is decoded here. In other words, the HTML decoding stage only decodes HTML-encoded text that is not an HTML keyword.

The second stage is URL decoding. If the value at this stage is ‘javasc%72ipt:alert(1)’, the URL parser thinks you are using a protocol called ‘javasc%72ipt’, and of course it is not parsed correctly.

The last stage is JavaScript decoding. Reaching this stage means the previous two steps parsed correctly and there is a piece of JS code here, which is then handled by the JS parser.

Here is another way to look at it.

The HTML decoding stage works out what tags you have and what attributes those tags have. If you encode those names, you have simply written different names, and the parser builds the DOM tree exactly as you wrote it. Whether the browser engine can recognize the result later is not the parser’s concern. Whatever is not needed for this step, such as attribute values, can be HTML decoded.

After HTML decoding, the URL parts are parsed and then URL decoded. One purpose of this step is to determine the protocol. If you encode the protocol name, the parser takes the encoded text as the protocol. Whether such a protocol really exists is not the parser’s responsibility.

Finally, whatever the URL parser has identified as JS code is decoded as JS.

In all three steps, parsing comes first, and the parser takes what you wrote at face value: if you say there is an attribute called ‘h&#114;ef’, then there is one. The remainder is decoded with the current stage’s decoding method and handed to the next stage. The next stage again first picks out the part it is responsible for parsing (without decoding it), then decodes the rest with its own method. Whether the result can run in the end is not something these stages care about.

Take this example: ‘<p>3 <a href="javasc%72ipt:alert(1)">encode r in javascript</a></p>’. In the HTML decoding stage, the a tag and the href attribute are parsed normally. In principle ‘javasc%72ipt’ also goes through HTML decoding, but it is not HTML encoded, so nothing changes. At the next stage, the URL parser takes the protocol to be ‘javasc%72ipt’, so the following stage never treats the rest as JS, even though the JS parser could have parsed it correctly.
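
The three stages can be strung together in a small simulation. This is a sketch under simplifying assumptions, not how a browser is implemented: `resolveHref` is a hypothetical name, only numeric character references and ‘\uXXXX’ escapes are handled, and the JS step decodes escapes everywhere, whereas a real engine accepts them only in identifiers and string literals.

```javascript
// Stage 1 (HTML): decode numeric character references in the attribute value.
const htmlDecode = (s) =>
  s.replace(/&#(x[0-9a-f]+|[0-9]+);/gi, (_, c) =>
    String.fromCodePoint(/^x/i.test(c) ? parseInt(c.slice(1), 16) : parseInt(c, 10)));

// Stage 3 (JS): decode \uXXXX escapes. A real engine only accepts them in
// identifiers and string literals; this sketch decodes them everywhere.
const jsDecode = (s) =>
  s.replace(/\\u([0-9a-f]{4})/gi, (_, h) => String.fromCharCode(parseInt(h, 16)));

function resolveHref(rawValue) {
  const afterHtml = htmlDecode(rawValue);
  // Stage 2 (URL): the scheme must already be readable before percent-decoding.
  const match = /^javascript:(.*)$/i.exec(afterHtml);
  if (!match) return { stage: 'url', code: null };
  return { stage: 'js', code: jsDecode(decodeURIComponent(match[1])) };
}

console.log(resolveHref('javascript:ale&#37;5c&#37;75&#37;30&#37;30&#37;37&#37;32t(1)'));
// { stage: 'js', code: 'alert(1)' }
console.log(resolveHref('javasc%72ipt:alert(1)'));
// { stage: 'url', code: null }
```

The first input survives all three stages because each layer is removed by the stage that expects it. The second stops at the URL stage, since the scheme is still encoded when the URL parser looks at it.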