Uploading assets performantly

So far, the migration script has staged documents to be created sequentially. Now you need to introduce asynchronous functions to upload assets, which has the potential to slow the import process down.

Your post and page documents likely have a featured_media reference to an image that you should upload and reference in your new Sanity documents. This lesson focuses on the updated migration script and introduces several new helper functions.

Uploading an asset to Sanity while creating documents in migrations is a two-step process:

Fetch the asset file from a URL and use Sanity Client to upload it, which returns an asset document with its _id.
Attach the returned asset document ID to the current document as a reference in an asset field.

Because this operation is asynchronous, creating a document must wait for the upload and its response before the asset reference can be created.

Currently, the migration script requests up to 100 posts (or pages, categories, or tags) and then loops over them with a for...of loop. This loop type would work here, since you can call an asynchronous function inside it and the loop will await completion before proceeding to the next item.

However, this means each document would have to wait in sequence for an image to upload, making the migration script incredibly slow. That's no way to live!

This does not mean we should go to the other extreme—uploading all 100 images simultaneously—as you would likely encounter the issue of rate limits.
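To make the two extremes concrete, here is a minimal sketch that uses a timer as a stand-in for a real upload and records how many "uploads" are in flight at once. The names createFakeUploader, uploadSequentially, and uploadAllAtOnce are hypothetical and exist only for this illustration.

```typescript
// Hypothetical stand-in for an image upload: resolves after a short delay
// and records how many "uploads" were in flight at the same time.
function createFakeUploader(delayMs = 20) {
  let inFlight = 0
  let peak = 0

  async function upload(id: number): Promise<string> {
    inFlight++
    peak = Math.max(peak, inFlight)
    await new Promise((resolve) => setTimeout(resolve, delayMs))
    inFlight--
    return `image-${id}`
  }

  return {upload, getPeak: () => peak}
}

// Sequential: each upload starts only after the previous one has finished
async function uploadSequentially(ids: number[]) {
  const {upload, getPeak} = createFakeUploader()
  const results: string[] = []
  for (const id of ids) {
    results.push(await upload(id))
  }
  return {results, peak: getPeak()}
}

// Unbounded: every upload starts immediately
async function uploadAllAtOnce(ids: number[]) {
  const {upload, getPeak} = createFakeUploader()
  const results = await Promise.all(ids.map((id) => upload(id)))
  return {results, peak: getPeak()}
}
```

Called with five IDs, uploadSequentially reports a peak of one upload in flight and takes roughly five delays to finish, while uploadAllAtOnce reports a peak of five and finishes in roughly one delay. The first is slow; the second is the pattern that risks hitting rate limits with real requests.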

See Technical limits for more information about API rate limits.

One benefit of the migration tooling is its built-in avoidance of rate limits, as it automatically batches mutations into transactions. However, now that you're introducing custom document creation into the script's execution, you need to be a little more careful.

Earlier, you installed p-limit as a dependency. This package allows you to create an array of async functions, which, when placed in a Promise.all() call, will throttle the number of simultaneous function invocations.
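The throttling idea can be sketched without any dependencies. The createLimit function below is a simplified, hypothetical stand-in written only to show the concept; it is not p-limit's actual implementation, and in the migration script you would keep using the package itself.

```typescript
// A simplified, hypothetical stand-in for p-limit, written only to show the idea.
function createLimit(concurrency: number) {
  let active = 0
  const queue: Array<() => void> = []

  // Called when a task settles: free a slot and start the next queued task, if any
  const next = () => {
    active--
    const run = queue.shift()
    if (run) run()
  }

  return function limit<T>(fn: () => Promise<T>): Promise<T> {
    return new Promise<T>((resolve, reject) => {
      const run = () => {
        active++
        Promise.resolve().then(fn).then(resolve, reject).finally(next)
      }
      // Start immediately if a slot is free, otherwise wait in the queue
      if (active < concurrency) run()
      else queue.push(run)
    })
  }
}

// Run six simulated uploads, at most two at a time, and record the peak concurrency
async function demo() {
  const limit = createLimit(2)
  let inFlight = 0
  let peak = 0

  const tasks = [1, 2, 3, 4, 5, 6].map((id) =>
    limit(async () => {
      inFlight++
      peak = Math.max(peak, inFlight)
      await new Promise((resolve) => setTimeout(resolve, 10))
      inFlight--
      return `image-${id}`
    }),
  )

  const results = await Promise.all(tasks)
  return {results, peak}
}
```

In this sketch, demo() resolves with a peak of 2, and because Promise.all preserves input order, the results come back in the same order as the IDs regardless of which task finished first.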

You'll see the updated script has replaced the for...of loop with a map of functions, each wrapped in the limit function from p-limit. The script changes from staging each document individually to creating an array of asynchronous staging functions.

The migration tooling makes a limited Sanity Client version available inside a context variable. As this version does not allow uploading assets, the updated script creates a new, fully-featured instance of Sanity Client using the same projectId, dataset, and token config.

When uploading assets to Sanity, you can also append metadata about the "source" from which it came. This metadata enables more efficient re-running of the script to avoid re-uploading the same images on every invocation.

The WordPress REST API has a route for retrieving information about an image if you have its ID. A function to query that endpoint and return just the metadata we need to store in Sanity will make this more convenient.

Create a new helper function to query the WordPress REST API's /media route for an image by its id value.

import type {UploadClientConfig} from '@sanity/client'
import {decode} from 'html-entities'

import {BASE_URL} from '../constants'

// Get WordPress' asset metadata about an image by its ID
export async function wpImageFetch(id: number): Promise<UploadClientConfig | null> {
  const wpApiUrl = new URL(`${BASE_URL}/media/${id}`).toString()
  const imageData = await fetch(wpApiUrl).then((res) => res.json())

  if (!imageData || !imageData.source_url) {
    return null
  }

  const metadata: UploadClientConfig = {
    filename: imageData.source_url.split('/').pop(),
    source: {
      id: imageData.id,
      name: 'WordPress',
      url: imageData.source_url,
    },
    // Not technically part of the Sanity imageAsset schema, but used by the popular Media Plugin
    // @ts-expect-error
    altText: imageData.alt_text,
  }

  if (imageData?.title?.rendered) {
    metadata.title = decode(imageData.title.rendered)
  }

  if (imageData?.image_meta?.caption) {
    metadata.description = imageData.image_meta.caption
  }

  if (imageData?.image_meta?.credit) {
    metadata.creditLine = imageData.image_meta.credit
  }

  return metadata
}

When you use this function to retrieve an image record from WordPress, you'll need to pass it along to the function that uploads the image to Sanity.

Create a helper function to upload an image to Sanity – using its URL – along with optional metadata:

import {Readable} from 'node:stream'

import type {SanityClient, SanityImageAssetDocument, UploadClientConfig} from '@sanity/client'

export async function sanityUploadFromUrl(
  url: string,
  client: SanityClient,
  metadata: UploadClientConfig,
): Promise<SanityImageAssetDocument | null> {
  const {body} = await fetch(url)
  if (!body) {
    throw new Error(`No body found for ${url}`)
  }
  let data: SanityImageAssetDocument | null = null
  try {
    data = await client.assets.upload(
      'image',
      Readable.fromWeb(body),
      metadata,
    )
  } catch (error) {
    console.error(`Failed to upload image from ${url}`)
    console.error(error)

    return null
  }

  return data
}

This function returns a Sanity image asset document, whose _id value you'll use to create a reference to this asset.

The image schema type in Sanity stores a reference in the asset attribute. Since you'll be uploading many images, getting their ID, and creating a reference, having a helper function for this simple task makes sense.

Create a helper function to take the _id of an asset document and return the shape of an asset reference in a document:

import type {Post} from '../../../sanity.types'

export function sanityIdToImageReference(id: string): Post['featuredMedia'] {
  return {
    _type: 'image',
    asset: {_type: 'reference', _ref: id},
  }
}

Note that the return type of this function is set to the featuredMedia field of a post – but it should satisfy any image field.
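If you would rather not tie the helper to the Post type, one option is a locally defined type for the image field value. This is a minimal sketch; the ImageReference type and toImageReference function are hypothetical names, not part of the generated sanity.types file.

```typescript
// Hypothetical standalone type describing the value of an image field
type ImageReference = {
  _type: 'image'
  asset: {_type: 'reference'; _ref: string}
}

// Wrap an asset document _id in the shape an image field expects
function toImageReference(assetId: string): ImageReference {
  return {
    _type: 'image',
    asset: {_type: 'reference', _ref: assetId},
  }
}

const reference = toImageReference('image-1b007a770ea5a9902c39cf07e04cd5483ec05a7e-3405x2495-jpg')
console.log(JSON.stringify(reference, null, 2))
```

The trade-off is that a hand-written type can drift from your schema, whereas Post['featuredMedia'] follows the generated types.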

You now have functions to query WordPress for an image, upload it to Sanity, and create a reference in a document. One more function is worth adding: a query for existing images from the same source, run at the beginning of the migration script, to avoid re-uploading images unnecessarily.

Create a helper function to query for previously uploaded images from WordPress.

import type {SanityClient} from 'sanity'

const query = `*[
    _type == "sanity.imageAsset" 
    && defined(source.id)
    && source.name == "WordPress"
]{
    _id,
    "sourceId": source.id
}`

export async function sanityFetchImages(client: SanityClient) {
  const initialImages = await client.fetch<{_id: string; sourceId: number}[]>(query)
  const existingImages: Record<number, string> = {}

  for (let index = 0; index < initialImages.length; index++) {
    existingImages[initialImages[index].sourceId] = initialImages[index]._id
  }

  return existingImages
}

The query returns all images in the dataset that were uploaded with the source attributes our helpers use. The function then converts the response into an object that serves as a basic (but fast!) in-memory key-value cache.
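To see why this lookup object pays off, here is a minimal sketch with a counter standing in for real uploads. The sample row, the fakeUpload function, and the resolveImageId helper are hypothetical and used only for illustration.

```typescript
type ImageRow = {_id: string; sourceId: number}

// Convert query results into a lookup keyed by the WordPress media ID
function toImageCache(rows: ImageRow[]): Record<number, string> {
  const cache: Record<number, string> = {}
  for (const row of rows) {
    cache[row.sourceId] = row._id
  }
  return cache
}

let uploadCount = 0

// Hypothetical stand-in for a real upload: counts calls and returns a made-up asset ID
async function fakeUpload(sourceId: number): Promise<string> {
  uploadCount++
  return `image-fake${sourceId}-100x100-jpg`
}

// Return the cached asset ID when there is one, otherwise upload and remember the result
async function resolveImageId(sourceId: number, cache: Record<number, string>): Promise<string> {
  const cached = cache[sourceId]
  if (cached) {
    return cached
  }
  const assetId = await fakeUpload(sourceId)
  cache[sourceId] = assetId
  return assetId
}

async function demoCache() {
  // Pretend image 101 was uploaded on a previous run; 205 is new
  const cache = toImageCache([{_id: 'image-abc-800x600-jpg', sourceId: 101}])

  const first = await resolveImageId(101, cache) // cache hit, no upload
  const second = await resolveImageId(205, cache) // cache miss, one upload
  const third = await resolveImageId(205, cache) // now cached, no second upload

  return {first, second, third, uploadCount}
}
```

Three lookups trigger only one simulated upload here: the image from the "previous run" is found immediately, and the new image is uploaded once and then reused.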

Now, with a strategy to query for and upload images efficiently, update your migration script below to put these pieces into place.

Update your migration script to be asynchronous and throttled:

import {createClient} from '@sanity/client'
import pLimit from 'p-limit'
import {createOrReplace, defineMigration} from 'sanity/migrate'
import type {WP_REST_API_Post, WP_REST_API_Term} from 'wp-types'

import {getDataTypes} from './lib/getDataTypes'
import {sanityFetchImages} from './lib/sanityFetchImages'
import {transformToPost} from './lib/transformToPost'
import {wpDataTypeFetch} from './lib/wpDataTypeFetch'

const limit = pLimit(5)

// Add image imports, parallelized and limited
export default defineMigration({
  title: 'Import WP JSON data',

  async *migrate(docs, context) {
    // Create a full client to handle image uploads
    const client = createClient(context.client.config())

    // Create an in-memory image cache to avoid re-uploading images
    const existingImages = await sanityFetchImages(client)

    const {wpType} = getDataTypes(process.argv)
    let page = 1
    let hasMore = true

    while (hasMore) {
      try {
        const wpData = await wpDataTypeFetch(wpType, page)

        if (Array.isArray(wpData) && wpData.length) {
          // Create an array of concurrency-limited promises to stage documents
          // (named stagedDocs to avoid shadowing the migrate function's docs parameter)
          const stagedDocs = wpData.map((wpDoc) =>
            limit(async () => {
              if (wpType === 'posts') {
                wpDoc = wpDoc as WP_REST_API_Post
                const doc = await transformToPost(wpDoc, client, existingImages)
                return doc
              } else if (wpType === 'pages') {
                wpDoc = wpDoc as WP_REST_API_Post
              } else if (wpType === 'categories') {
                wpDoc = wpDoc as WP_REST_API_Term
              } else if (wpType === 'tags') {
                wpDoc = wpDoc as WP_REST_API_Term
              }

              // Only posts return a document above; every other type falls through to this error
              hasMore = false
              throw new Error(`Unhandled WordPress type: ${wpType}`)
            }),
          )

          // Resolve all documents concurrently, throttled by p-limit
          const resolvedDocs = await Promise.all(stagedDocs)

          yield resolvedDocs.map((doc) => createOrReplace(doc))
          page++
        } else {
          hasMore = false
        }
      } catch (error) {
        console.error(`Error fetching data for page ${page}:`, error)
        // Stop the loop in case of an error
        hasMore = false
      }
    }
  },
})

There are some significant changes in the migration script above:

Instead of staging documents one by one, the script now builds an array of staging functions, each wrapped in limit from p-limit, so that at most five run at a time. This prevents issues with rate limits as images are uploaded during the migration.
The in-memory cache of existing images is populated by a query before any migration begins.
This cache and the Sanity Client instance are passed into the transformToPost function.

With the migration script set up to handle asynchronous functions, the transformToPost function needs to be updated to perform the uploads.

Update the transformToPost function to add image uploads.

import {uuid} from '@sanity/uuid'
import {decode} from 'html-entities'
import type {SanityClient} from 'sanity'
import type {WP_REST_API_Post} from 'wp-types'

import type {Post} from '../../../sanity.types'
import {sanityIdToImageReference} from './sanityIdToImageReference'
import {sanityUploadFromUrl} from './sanityUploadFromUrl'
import {wpImageFetch} from './wpImageFetch'

// Remove these keys because they'll be created by Content Lake
type StagedPost = Omit<Post, '_createdAt' | '_updatedAt' | '_rev'>

export async function transformToPost(
  wpDoc: WP_REST_API_Post,
  client: SanityClient,
  existingImages: Record<number, string> = {},
): Promise<StagedPost> {
  const doc: StagedPost = {
    _id: `post-${wpDoc.id}`,
    _type: 'post',
  }

  // ...all other attributes!

  // Document has an image
  if (typeof wpDoc.featured_media === 'number' && wpDoc.featured_media > 0) {
    // Image exists already in dataset
    if (existingImages[wpDoc.featured_media]) {
      doc.featuredMedia = sanityIdToImageReference(existingImages[wpDoc.featured_media])
    } else {
      // Retrieve image details from WordPress
      const metadata = await wpImageFetch(wpDoc.featured_media)

      if (metadata?.source?.url) {
        // Upload to Sanity
        const asset = await sanityUploadFromUrl(metadata.source.url, client, metadata)

        if (asset) {
          doc.featuredMedia = sanityIdToImageReference(asset._id)
          existingImages[wpDoc.featured_media] = asset._id
        }
      }
    }
  }

  return doc
}

Once again, you can execute your import script the same way you did before. You'll notice the script taking a little longer to execute as images are uploaded. However, it should be faster on subsequent runs as re-uploads are avoided.

npx sanity@latest migration run import-wp --no-dry-run --type=posts

You should now see documents being created with a shape like this:

{
  "_id": "post-631475",
  "_type": "post",
  "title": "From NASA’s First Astronaut Class to Artemis II: The Importance of Military Jet Pilot Experience",
  "featuredMedia": {
    "_type": "image",
    "asset": {
      "_type": "reference",
      "_ref": "image-1b007a770ea5a9902c39cf07e04cd5483ec05a7e-3405x2495-jpg"
    }
  }
}
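The _ref value in this example packs several details into one string. As a rough sketch, and assuming the pattern visible in the example above (a hash, then width x height, then the file extension), it could be pulled apart like this. The parseImageRef function and its regular expression are hypothetical and inferred from the example, not taken from a Sanity library.

```typescript
// Assumed pattern, inferred from the example _ref above:
// image-<hash>-<width>x<height>-<extension>
const IMAGE_REF_PATTERN = /^image-([a-zA-Z0-9]+)-(\d+)x(\d+)-([a-z0-9]+)$/

type ParsedImageRef = {hash: string; width: number; height: number; extension: string}

function parseImageRef(ref: string): ParsedImageRef | null {
  const match = IMAGE_REF_PATTERN.exec(ref)
  if (!match) {
    return null
  }
  // The defaults only satisfy stricter compiler settings;
  // a successful match always fills all four groups
  const [, hash = '', width = '0', height = '0', extension = ''] = match
  return {hash, width: Number(width), height: Number(height), extension}
}

const parsed = parseImageRef('image-1b007a770ea5a9902c39cf07e04cd5483ec05a7e-3405x2495-jpg')
console.log(parsed)
```

A quick check like this can help confirm during a test run that featuredMedia points at an image asset rather than at some other document ID.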

As the script commits transactions, you should see new documents appear with images.

So far, you've imported several types of documents and uploaded images. Now it's time to get into the meat of these documents: block content and rich text.

