bebraw, 12mo ago

Parsing HTML/XML

Lately I've been researching how to write a simple HTML/XML -> JSON converter. My syntax isn't completely standard HTML (e.g., <Button> is allowed), which makes this a notch harder than it probably should be. Here's an example of how things should work: https://github.com/gustwindjs/gustwind/blob/feat/html-prototype/html-to-breezewind/tests/element_test.ts . The current implementation relies on https://deno.land/x/deno_dom, but the problem is that it loses information in tagName: it uppercases the tag name by default, so I cannot tell a button apart from a Button. As an alternative, I looked into https://deno.land/x/xml, but it loses structural information due to its automatic grouping (i.e., it folds divs into a single array and loses their relative positioning, so I cannot map the structure back to the original later on). Any insight on the issue would be valuable. Maybe I'll have to fork one of them to add the missing functionality, but I hope to avoid that.
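To make the grouping problem concrete, here is a minimal sketch. The `OrderedNode` type and the `groupByTag` helper are hypothetical and only imitate the general "group children by tag name" idea; they are not the actual output format of either library. Two differently ordered child lists become indistinguishable once grouped:

```typescript
// Hypothetical node shape that keeps children in source order.
type OrderedNode = { type: string; children: OrderedNode[] };

// Hypothetical helper: folds siblings into one array per tag name,
// which discards how the different tags were interleaved.
function groupByTag(children: OrderedNode[]): Record<string, OrderedNode[]> {
  const groups: Record<string, OrderedNode[]> = {};
  for (const child of children) {
    if (!groups[child.type]) {
      groups[child.type] = [];
    }
    groups[child.type].push(child);
  }
  return groups;
}

const leaf = (type: string): OrderedNode => ({ type, children: [] });

// <div/><Button/><div/> versus <div/><div/><Button/>
const first = [leaf("div"), leaf("Button"), leaf("div")];
const second = [leaf("div"), leaf("div"), leaf("Button")];

// The ordered arrays differ, but the grouped objects are identical.
const sameWhenOrdered = JSON.stringify(first) === JSON.stringify(second);
const sameWhenGrouped =
  JSON.stringify(groupByTag(first)) === JSON.stringify(groupByTag(second));
```

An ordered `children` array is therefore the representation a converter needs if the original structure has to be rebuilt later.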
2 Replies
marvinh., 12mo ago
Using an existing HTML parser will be difficult because tag names are case-insensitive in HTML. If you don't need to support all the crazy edge cases and could live with a more constrained language like JSX, you could make parsing a lot easier. In htm, for example, which is a mix of HTML + JSX, we hand-rolled our own parser for a similar purpose: https://github.com/developit/htm/blob/master/src/build.mjs
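As a rough illustration of the hand-rolled route, here is a minimal sketch of a parser for a constrained, JSX-like subset. It is not htm's implementation; the `ElementNode` shape and the `parseMarkup` name are assumptions made up for this example. It assumes well-formed input with explicit closing or self-closing tags and double-quoted attribute values, and it does not handle comments, entities, or JS expressions. Tag names are kept exactly as written and children stay in source order:

```typescript
// Hypothetical output shape; not taken from Gustwind or htm.
type ElementNode = {
  type: string; // tag name exactly as written ("Button" stays "Button")
  attributes: Record<string, string>;
  children: (ElementNode | string)[]; // source order is preserved
};

function parseMarkup(input: string): (ElementNode | string)[] {
  const root: ElementNode = { type: "#root", attributes: {}, children: [] };
  const stack: ElementNode[] = [root];
  // Alternatives: closing tag | opening or self-closing tag | text run.
  const token =
    /<\/([A-Za-z][\w.-]*)\s*>|<([A-Za-z][\w.-]*)((?:\s+[\w:-]+(?:="[^"]*")?)*)\s*(\/?)>|([^<]+)/g;

  let match: RegExpExecArray | null;
  while ((match = token.exec(input)) !== null) {
    const closeName = match[1];
    const openName = match[2];
    const text = match[5];
    const parent = stack[stack.length - 1];

    if (closeName !== undefined) {
      // A closing tag must match the innermost open element, case included.
      if (stack.length < 2 || parent.type !== closeName) {
        throw new Error("Unexpected closing tag </" + closeName + ">");
      }
      stack.pop();
    } else if (openName !== undefined) {
      const node: ElementNode = { type: openName, attributes: {}, children: [] };
      const attribute = /([\w:-]+)(?:="([^"]*)")?/g;
      let attr: RegExpExecArray | null;
      while ((attr = attribute.exec(match[3])) !== null) {
        // Valueless attributes such as `disabled` become empty strings.
        node.attributes[attr[1]] = attr[2] === undefined ? "" : attr[2];
      }
      parent.children.push(node);
      if (match[4] !== "/") {
        stack.push(node); // not self-closing, so later tokens are its children
      }
    } else if (text !== undefined && text.trim() !== "") {
      parent.children.push(text.trim());
    }
  }

  if (stack.length !== 1) {
    throw new Error("Unclosed tag <" + stack[stack.length - 1].type + ">");
  }
  return root.children;
}

const tree = parseMarkup(
  '<div class="a"><Button/>hi<button disabled></button></div>',
);
```

Because the regular expression simply skips characters it cannot tokenize, malformed input is not reported reliably; a real converter would want stricter error handling than this sketch provides.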
bebraw, 12mo ago
That's a good point. JSX would perhaps be an option. I could just skip using their templating syntax (or even inject from context later and skip JS expressions that don't map to JSON). I'll try this direction in my next attempt. Thanks! Looking closer, it's definitely going to be some variant of your treeify, so that's a huge help already.