Converting HTML to Portable Text

Migrate HTML content from WordPress to Portable Text in Sanity, gaining presentation-agnostic block content and rich querying and filtering capabilities for your structured content.

Your migration script now handles individual content fields as well as image uploads. This is essential groundwork, but we've danced around what will undoubtedly be the most complicated part—migrating from HTML into Portable Text.

This can be difficult because WordPress stores HTML as a string that could contain literally anything. It is unstructured content, and it is the mess you're trying to get out of. While Sanity provides tooling to smooth this process, there are bound to be rough edges depending on the quality of your existing content.

In this lesson, you'll import post-processed HTML from the content.rendered attribute in the WordPress REST API response. This string of HTML is the result of WordPress running all of your installation's processing over the content, such as translating "shortcodes" into markup.
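As a rough illustration of where that string lives, the sketch below reads content.rendered from an object shaped like a post in the REST API response. The WpPostLike type and the getRenderedHtml helper are hypothetical names used only for this example; in the course project the wp-types package supplies the full WP_REST_API_Post type.

```ts
// Simplified stand-in for a WordPress REST API post (assumption for this sketch)
type WpRenderedField = {rendered: string; protected?: boolean; raw?: string}
type WpPostLike = {id: number; content?: WpRenderedField}

// Returns the post-processed HTML, or undefined when there is nothing to convert
function getRenderedHtml(post: WpPostLike): string | undefined {
  const html = post.content?.rendered
  return html && html.trim().length > 0 ? html : undefined
}

// Example input with made-up values
const samplePost: WpPostLike = {
  id: 1,
  content: {rendered: '<p>Hello from WordPress</p>\n', protected: false},
}
const renderedHtml = getRenderedHtml(samplePost)
```

Guarding against an empty string up front avoids running the conversion for posts that have no body content.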

Once complete, you'll have content stored in Portable Text, a standard allowing you to:

Work in a fully customizable, real-time collaborative editor with custom blocks, marks, styles, comments, and so on
Render block content directly as props in front end frameworks (on web and mobile)
Query documents based on specific block content and apply filters to even the most complex structured block content and rich text
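To make the target format concrete, here is a rough sketch of how a paragraph and an image might look as Portable Text, plus a small helper that flattens the text blocks into a plain string. The type names (Span, TextBlock, CustomBlock) and the blocksToPlainText helper are illustrative assumptions, not part of any Sanity package; the types generated from your own schema will be more detailed.

```ts
// Simplified, illustrative shapes. Your generated schema types will differ.
type Span = {_type: 'span'; _key: string; text: string; marks: string[]}
type TextBlock = {
  _type: 'block'
  _key: string
  style: string
  markDefs: unknown[]
  children: Span[]
}
type CustomBlock = {_type: string; _key: string; [key: string]: unknown}
type PortableTextLike = Array<TextBlock | CustomBlock>

// Made-up example content: one paragraph with bold text, then an image block
const exampleContent: PortableTextLike = [
  {
    _type: 'block',
    _key: 'b1',
    style: 'normal',
    markDefs: [],
    children: [
      {_type: 'span', _key: 's1', text: 'Hello ', marks: []},
      {_type: 'span', _key: 's2', text: 'world', marks: ['strong']},
    ],
  },
  {_type: 'image', _key: 'b2', asset: {_type: 'reference', _ref: 'image-example'}},
]

function isTextBlock(block: TextBlock | CustomBlock): block is TextBlock {
  return block._type === 'block' && Array.isArray((block as TextBlock).children)
}

// Joins the text of every span in every text block, skipping custom blocks
function blocksToPlainText(blocks: PortableTextLike): string {
  return blocks
    .filter(isTextBlock)
    .map((block) => block.children.map((span) => span.text).join(''))
    .join('\n\n')
}
```

Because formatting lives in marks and block types rather than in tags, the same array can feed a web component, a mobile view, or a plain-text excerpt.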

In the next lesson, you'll send an authenticated request to get access to pre-processed HTML in content.raw – preferable for handling content written in the Block Editor (known as Gutenberg).

If you proceed with this lesson and import the content as-is, your final HTML markup will still be imported into Portable Text; however, you will lose much of the control over how those blocks are serialized, forcing you to do icky things like rendering content with React's dangerouslySetInnerHTML.

Unfortunately, if you're using a page builder such as Elementor, Divi, or Beaver Builder, you will have a bad time with the migration. While the Block tools package below can extract content from these, the HTML output can be fairly messy and tricky to navigate.

Unlike the built-in WordPress block editor, these plugins lack a method to extract serialized content. Their output is more challenging to translate into structured content and almost impossible for us to reason about in a lesson. The good thing about migrating to Sanity and a modern stack is that you won't have to deal with this content lock-in again 🤞.

Earlier, you installed @portabletext/block-tools into the project. This package contains a function, htmlToBlocks(), to convert an HTML string into Portable Text.

By default, it will extract some formatting, such as headings, lists, and paragraphs, into corresponding Portable Text blocks. However, if you need to take some of the existing HTML and turn it into custom objects—like taking an image, uploading it, and creating a reference—that will require some customization.

The helper function below wraps htmlToBlocks and contains logic to extract the URL of any <img> tag found inside a <figure> tag. If your content field does not have images inside figure tags, you must update the script to find them.
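If your images are not wrapped in figure tags, one possible adjustment is to also match bare img elements. The sketch below is a standalone version of such a rule body. NodeLike and BlockFactory are stand-in types assumed for this example so that it compiles without jsdom or block-tools, and deserializeImage is a hypothetical name. Images nested inside paragraphs or links may still need extra handling depending on your markup.

```ts
// Stand-in types for this sketch only; the real rule receives a DOM node
// and a block() helper from @portabletext/block-tools.
type NodeLike = {
  nodeName: string
  getAttribute?: (name: string) => string | null
  querySelector?: (selector: string) => NodeLike | null
}
type BlockFactory = (props: Record<string, unknown>) => unknown

function deserializeImage(node: NodeLike, _next: unknown, block: BlockFactory): unknown {
  const name = node.nodeName.toLowerCase()

  // Match either a bare <img> or the first <img> inside a <figure>
  let img: NodeLike | null = null
  if (name === 'img') {
    img = node
  } else if (name === 'figure') {
    img = node.querySelector?.('img') ?? null
  }

  const url = img?.getAttribute?.('src')
  if (!url) {
    // Nothing to capture here, so signal that this rule did not handle the node
    return undefined
  }

  // Same placeholder block type the wrapper function uses for later upload
  return block({_type: 'externalImage', url})
}
```

In the wrapper function, logic like this would take the place of the figure-only check inside the rules array.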

Because the deserialize method is synchronous, the image URL is first stored in a placeholder block of type externalImage.

In the next section, we map over each block to find an external image and use the URL to attempt to search for the image in the WordPress database. Because the image is just a string in the HTML markup, this process is not guaranteed to work, which, while unfortunate, is an excellent demonstration of why structured content and referential integrity are so important!
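The filename-to-slug step can be sketched on its own. This version assumes WordPress media slugs are the lowercased filename without its extension or a trailing size suffix such as "-1024x683", which is common but not guaranteed. The imageUrlToSlug and mediaLookupUrl helpers are hypothetical names, and the base URL in the example is a placeholder.

```ts
// Derive a likely media slug from an image URL found in the HTML
function imageUrlToSlug(url: string): string | undefined {
  // Drop any query string or fragment before looking at the path
  const [path = ''] = url.split(/[?#]/)
  const filename = path.split('/').pop()
  if (!filename) {
    return undefined
  }

  const slug = filename
    .replace(/\.[a-z0-9]+$/i, '') // strip the file extension
    .replace(/-\d+x\d+$/, '') // strip a trailing "-WIDTHxHEIGHT" suffix
    .toLowerCase()

  return slug || undefined
}

// Build the media search URL, encoding the slug so unusual characters are safe
function mediaLookupUrl(baseUrl: string, slug: string): string {
  return `${baseUrl}/media?slug=${encodeURIComponent(slug)}`
}

// Example with made-up values
const exampleSlug = imageUrlToSlug(
  'https://example.com/wp-content/uploads/2023/05/My-Photo-1024x683.jpg',
)
```

A slug that matches nothing, or matches the wrong media item, is exactly the failure mode described above, so treat the result as a best guess.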

In the script below, you'll find that the htmlToBlocks call inside htmlToBlockContent takes an options argument with rules that describe how to deserialize the incoming HTML into structured content in Portable Text. A rule exposes each HTML element as a DOM Node, letting you write fairly fine-grained conditional checks against its structure. This is a low-level API, so be prepared for some troubleshooting.

The script below does the following things:

Accepts a string of HTML
Converts it into Portable Text
If it finds a figure tag containing an image, stores the image URL in an externalImage block
Then, in a throttled array of async functions, searches the WordPress REST API for that image based on its filename
If found, either reuses an existing image from the in-memory cache or uploads the image
Eliminates empty blocks
Returns the Portable Text
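The final cleanup steps can be sketched in isolation. BlockLike and Child are simplified stand-in types, and isNonEmptyBlock and withKeys are hypothetical helper names. One deliberate difference from the full script: this version treats a child without a text string, such as an inline object, as content rather than casting it to a string. The key generator is passed in as a parameter here so the example has no dependencies.

```ts
// Simplified stand-in shapes for this sketch
type Child = {_type: string; text?: unknown}
type BlockLike = {_type: string; _key?: string; children?: Child[]}

// Keep custom blocks (no children) and text blocks with some non-whitespace text
function isNonEmptyBlock(block: BlockLike | null | undefined): block is BlockLike {
  if (!block) {
    return false
  }
  const children = block.children
  if (!children) {
    return true
  }
  return children.some((child) =>
    typeof child.text === 'string' ? child.text.trim().length > 0 : true,
  )
}

// Make sure every block has a _key, generating one only where it is missing
function withKeys(blocks: BlockLike[], makeKey: () => string): BlockLike[] {
  return blocks.map((block) => (block._key ? block : {...block, _key: makeKey()}))
}

// Example with made-up blocks: the whitespace-only paragraph and null are dropped
const rawBlocks: Array<BlockLike | null> = [
  {_type: 'block', _key: 'a', children: [{_type: 'span', text: 'Hello'}]},
  {_type: 'block', _key: 'b', children: [{_type: 'span', text: '   '}]},
  {_type: 'image'},
  null,
]
const cleanedBlocks = withKeys(rawBlocks.filter(isNonEmptyBlock), () => 'generated-key')
```

Portable Text arrays need a unique _key per item, so in real use makeKey should return a fresh value on every call, as uuid() does in the script below.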

Create the new wrapper function to turn an HTML string into Portable Text:

import {htmlToBlocks} from '@portabletext/block-tools'
import {Schema} from '@sanity/schema'
import {uuid} from '@sanity/uuid'
import {JSDOM} from 'jsdom'
import pLimit from 'p-limit'
import type {FieldDefinition, SanityClient} from 'sanity'

import type {Post} from '../../../sanity.types'
import {schemaTypes} from '../../../schemaTypes'
import {BASE_URL} from '../constants'
import {sanityIdToImageReference} from './sanityIdToImageReference'
import {sanityUploadFromUrl} from './sanityUploadFromUrl'
import {wpImageFetch} from './wpImageFetch'

const defaultSchema = Schema.compile({types: schemaTypes})
const blockContentSchema = defaultSchema
  .get('post')
  .fields.find((field: FieldDefinition) => field.name === 'content').type

// https://github.com/portabletext/editor/tree/main/packages/block-tools
export async function htmlToBlockContent(
  html: string,
  client: SanityClient,
  imageCache: Record<number, string>,
): Promise<Post['content']> {
  // Convert HTML to Sanity's Portable Text
  let blocks = htmlToBlocks(html, blockContentSchema, {
    parseHtml: (html) => new JSDOM(html).window.document,
    rules: [
      {
        deserialize(node, next, block) {
          const el = node as HTMLElement

          if (node.nodeName.toLowerCase() === 'figure') {
            const url = el.querySelector('img')?.getAttribute('src')

            if (!url) {
              return undefined
            }

            return block({
              // these attributes may be overwritten by the image upload below
              _type: 'externalImage',
              url,
            })
          }

          return undefined
        },
      },
    ],
  })

  // Note: Multiple documents may be running this same function concurrently
  const limit = pLimit(2)

  const blocksWithUploads = blocks.map((block) =>
    limit(async () => {
      if (block._type !== 'externalImage' || !('url' in block)) {
        return block
      }

      // The filename is usually stored as the "slug" in WordPress media documents
      // Filename may be appended with dimensions like "-1024x683", remove with regex
      const dimensions = /-\d+x\d+$/
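      const slug = (block.url as string)
        .split('/')
        .pop()
        ?.split('.')
        ?.shift()
        ?.replace(dimensions, '')
        .toLocaleLowerCase()

      // Without a usable filename there is nothing to search for
      if (!slug) {
        return block
      }

      const imageId = await fetch(`${BASE_URL}/media?slug=${encodeURIComponent(slug)}`)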
        .then((res) => (res.ok ? res.json() : null))
        .then((data) => (Array.isArray(data) && data.length ? data[0].id : null))

      if (typeof imageId !== 'number' || !imageId) {
        return block
      }

      if (imageCache[imageId]) {
        return {
          _key: block._key,
          ...sanityIdToImageReference(imageCache[imageId]),
        } as Extract<Post['content'], {_type: 'image'}>
      }

      const imageMetadata = await wpImageFetch(imageId)
      if (imageMetadata?.source?.url) {
        const imageDocument = await sanityUploadFromUrl(
          imageMetadata.source.url,
          client,
          imageMetadata,
        )
        if (imageDocument) {
          // Add to in-memory cache if re-used in other documents
          imageCache[imageId] = imageDocument._id

          return {
            _key: block._key,
            ...sanityIdToImageReference(imageCache[imageId]),
          } as Extract<Post['content'], {_type: 'image'}>
        } else {
          return block
        }
      }

      return block
    }),
  )

  blocks = await Promise.all(blocksWithUploads)

  // Eliminate empty blocks
  blocks = blocks.filter((block) => {
    if (!block) {
      return false
    } else if (!('children' in block)) {
      return true
    }

    return block.children.map((c) => (c.text as string).trim()).join('').length > 0
  })

  blocks = blocks.map((block) => (block._key ? block : {...block, _key: uuid()}))

  // TS complains there's no _key in these blocks, but this is corrected in the map above
  // @ts-expect-error
  return blocks
}

Update your transformToPost.ts script to convert HTML to Portable Text and write it to the content field:

import {uuid} from '@sanity/uuid'
import {decode} from 'html-entities'
import type {SanityClient} from 'sanity'
import type {WP_REST_API_Post} from 'wp-types'

import type {Post} from '../../../sanity.types'
import {htmlToBlockContent} from './htmlToBlockContent'
import {sanityIdToImageReference} from './sanityIdToImageReference'
import {sanityUploadFromUrl} from './sanityUploadFromUrl'
import {wpImageFetch} from './wpImageFetch'

// Remove these keys because they'll be created by Content Lake
type StagedPost = Omit<Post, '_createdAt' | '_updatedAt' | '_rev'>

export async function transformToPost(
  wpDoc: WP_REST_API_Post,
  client: SanityClient,
  existingImages: Record<string, string> = {},
): Promise<StagedPost> {
  const doc: StagedPost = {
    _id: `post-${wpDoc.id}`,
    _type: 'post',
  }

  // ...all your other attributes

  if (wpDoc.content) {
    doc.content = await htmlToBlockContent(wpDoc.content.rendered, client, existingImages)
  }

  return doc
}

You can now run the migrations again.

npx sanity@latest migration run import-wp --no-dry-run --type=posts

Once the transactions are committed, you should see documents appear with populated content fields.

You might notice that content that was once rendered in columns or other specific layouts is now rendered in a single column of block content.

If preserving presentation is essential, more work is required. We'll cover this in the next lesson by working with raw content from WordPress.
