GPU Ready in 15 Seconds! Easy GPU Programming with Modal

2025年7月4日

This article is a machine translation from Japanese. It may contain translation errors.

I’m Tokunaga, the CEO of PredNext. We all run into workloads that call for high-performance GPUs we don’t have at home, right? Powerful GPUs like the NVIDIA H100 simply aren’t found in a typical household. Modal is convenient for exactly these situations.

Modal is a cloud service where you write a script in Python and run it directly on cloud servers. Think of it as the Heroku of the GPU cloud world and you’ll have roughly the right idea.

Let’s Start by Running Code on a GPU with Modal

Let’s skip the difficult explanations for now and actually run code on a GPU using Modal.

1. Prerequisites

  • Python 3.9 or higher
  • Modal account
  • Basic Python knowledge

2. Setup

First, install the Modal library. I’m using uv here, so if you’re not using it, please adjust accordingly. Or take this opportunity to start using uv.

uv init --app
uv add modal

Create an account on the official Modal website and obtain an API key.

Run the following command in your terminal to log in:

uv run modal setup

3. Your First GPU Program

Save the following code as gpu_test.py:

import modal

app = modal.App()
image = modal.Image.debian_slim().pip_install("torch", "numpy")

@app.function(gpu="T4", image=image)
def train_simple_model():
    """Execute simple machine learning processing on GPU"""
    import torch
    import torch.nn as nn

    print(f"Device in use: {torch.cuda.get_device_name(0)}")
    print(f"PyTorch version: {torch.__version__}")

    # Create dataset
    X = torch.randn(1000, 20).cuda()
    y = torch.randn(1000, 1).cuda()

    # Simple neural network
    model = nn.Sequential(nn.Linear(20, 64), nn.ReLU(), nn.Linear(64, 1)).cuda()
    optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
    loss_fn = nn.MSELoss()

    # Start GPU time measurement
    start_time = torch.cuda.Event(enable_timing=True)
    end_time = torch.cuda.Event(enable_timing=True)
    start_time.record()

    for epoch in range(100):
        optimizer.zero_grad()
        pred = model(X)
        loss = loss_fn(pred, y)
        loss.backward()
        optimizer.step()

        if epoch % 20 == 0:
            print(f"Epoch {epoch}, Loss: {loss.item():.4f}")

    end_time.record()
    torch.cuda.synchronize()
    elapsed_time = start_time.elapsed_time(end_time)

    return f"Training complete! Final loss: {loss.item():.4f}, Execution time: {elapsed_time:.2f}ms"

@app.local_entrypoint()
def main():
    print("Starting machine learning on GPU...")
    result = train_simple_model.remote()
    print(f"{result}")

After saving the script, you can execute it using the cloud GPU with the following command:

uv run modal run gpu_test.py

4. Execution Results

If successful, executing the above code will produce output like this:

Starting machine learning on GPU...
Device in use: Tesla T4
PyTorch version: 2.7.1+cu126
Epoch 0, Loss: 0.9950
Epoch 20, Loss: 0.9431
Epoch 40, Loss: 0.9039
Epoch 60, Loss: 0.8565
Epoch 80, Loss: 0.8005
Training complete! Final loss: 0.7446, Execution time: 913.92ms

With just this setup, you can run PyTorch on a GPU in the cloud.

Benefits of Modal

The most remarkable aspect of Modal is how short the wait is between invoking the command and your code actually running in the cloud: execution starts on the cloud side roughly 15 seconds after you hit enter. (The first run takes a bit longer because the image has to be built.)

This ease of execution is extremely important during development. On typical cloud services, you wait about 5 minutes just for instance allocation and startup. Suppose you are iterating on a script that takes 30 seconds to run: a traditional service allows maybe 10 iterations per hour (you could keep the instance running, but that’s a scary habit for individual development, right?), whereas Modal lets you do 20-30. On top of that, instances can be launched in parallel, so running a parameter search in parallel makes a dramatic difference in efficiency (see the sketch below).
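
As a sketch of what that parallelism looks like, Modal functions expose a map() call that fans the same function out across many containers. The evaluate function and the learning-rate grid below are made up for illustration; the point is only the map() call at the end:

import modal

app = modal.App()
image = modal.Image.debian_slim().pip_install("torch")

@app.function(gpu="T4", image=image)
def evaluate(lr: float) -> float:
    """Hypothetical evaluation: train a tiny model with the given learning rate and return its loss."""
    import torch
    model = torch.nn.Linear(20, 1).cuda()
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    X, y = torch.randn(256, 20).cuda(), torch.randn(256, 1).cuda()
    for _ in range(50):
        optimizer.zero_grad()
        loss = torch.nn.functional.mse_loss(model(X), y)
        loss.backward()
        optimizer.step()
    return loss.item()

@app.local_entrypoint()
def main():
    lrs = [1e-2, 1e-3, 1e-4]
    # .map() runs each learning rate in its own container, in parallel
    for lr, loss in zip(lrs, evaluate.map(lrs)):
        print(f"lr={lr}: loss={loss:.4f}")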

Additionally, with $30 in free credits provided monthly, small-scale usage can be effectively free.

How Modal Works

Modal enables magical remote function execution with a few simple decorators, but the mechanism behind it is not simple.

In Modal, the same script is executed in both the local environment and the cloud environment. When app.run() is called within a script, the running script and the scripts it depends on are automatically transferred to the cloud side, where the same script is loaded. Then, when you make a remote call like func.remote(), execution shifts to the cloud side. (In the first example above, app.run() is invoked for you by the @app.local_entrypoint() decorator, so it never appears directly.) The sketch below shows this flow explicitly.
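
Here is a minimal sketch with app.run() written out; the square function is just a stand-in:

import modal

app = modal.App()

@app.function()
def square(x: int) -> int:
    # This body executes on the cloud side
    return x * x

def main():
    with app.run():                # transfers the script and starts the app in the cloud
        print(square.remote(3))    # the call runs remotely; the result comes back locally

if __name__ == "__main__":
    main()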

The modal package provided by Modal abstracts away much of the complexity of remote execution, but it cannot hide every problem, so subtle issues can still occur. For example, if the Python version differs between the local and cloud environments, execution results may differ. Dealing with such problems is the user’s responsibility.
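
One simple mitigation is to pin the Python version when building the image so the cloud side matches your local interpreter; debian_slim() accepts a python_version argument for this (3.12 below is just an example):

import modal

# Pin the interpreter used in the cloud-side image to match your local environment
image = modal.Image.debian_slim(python_version="3.12").pip_install("torch", "numpy")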

Techniques for Using Modal Conveniently

The basic usage should be clear from the hello-world-style example above, but using Modal comfortably takes a bit more knowledge and ingenuity. As a next step, I recommend looking at the “A simple web scraper” example in the official documentation.

Below, I’ll introduce some convenient Modal features and things you should know to avoid trouble, covering points both inside and outside the official documentation.

Creating Custom Container Images

Modal uses Docker images to build environments in the cloud. In addition to the officially provided container images, you can also create customized images. The following example takes the official image, installs torch and numpy with pip, and also installs ffmpeg with apt.

my_image = (
    modal.Image.debian_slim()
    .pip_install("torch", "numpy")
    .apt_install("ffmpeg")
)

Specifying GPUs

The available GPUs cover a fairly wide range, from the T4 up to the B200. For testing, the cheaper T4 is a good choice.

@app.function(gpu="T4", image=my_image)
def my_function():
    # Run in custom environment
    pass

Using Multiple GPUs

You can use up to 8 GPUs on a single instance. With something like H100x8, you can run experiments of reasonable scale. However, the cost is several tens of dollars per hour, so be careful not to waste it. Here’s an example of securing two A100s:

@app.function(gpu="A100:2")  # Secure two A100s
def multi_gpu_training():
    import torch
    if torch.cuda.device_count() > 1:
        print(f"Using {torch.cuda.device_count()} GPUs")

Parsing Command Line Arguments Yourself

The @app.local_entrypoint() decorator, which appears in many official samples, automatically parses command-line arguments and passes them as function arguments. This is convenient, but it doesn’t give you fine-grained control, which can make it awkward. If you call app.run() yourself as shown below, you no longer need @app.local_entrypoint(), so you can parse the command line however you like.

def main(args):
    # Calling app.run() ourselves removes the need for @app.local_entrypoint()
    with app.run():
        func.remote(args)
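
Put together with argparse, a minimal sketch might look like the following; the --epochs option and the func function are placeholders for whatever your script actually needs. Note that a script structured this way is launched with plain python rather than modal run:

import argparse

import modal

app = modal.App()

@app.function(gpu="T4")
def func(epochs: int):
    print(f"Training for {epochs} epochs on the cloud side")

def main(args):
    with app.run():
        func.remote(args.epochs)

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--epochs", type=int, default=100)  # hypothetical option
    main(parser.parse_args())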

Running the Same Script in Both Local and Remote Environments

If you call app.run() yourself, it’s also easy to run the same script either locally or remotely. To run locally, simply don’t call app.run(). There’s a little extra work, such as switching between .remote() and .local(), but it’s not a major problem.

def main(args):
    if args.remote:
        # Remote: start the app and run the function in the cloud
        with app.run():
            func.remote(args)
    else:
        # Local: run the same function in the current process
        func.local(args)

Calling Methods of Remote Objects

Instead of adding the @app.function decorator to a function, you can add the @app.cls decorator to a class to define a remote object and call its methods. In this case, add the @modal.method() decorator to each remote method you want to call; setup that should run once per container (such as building the model) goes in a method decorated with @modal.enter().

@app.cls(gpu="T4", image=image)
class TrainingModel:
    @modal.enter()  # runs once when the container starts, so model and optimizer are reused across calls
    def setup(self):
        import torch
        self.model = torch.nn.Linear(10, 1).cuda()
        self.optimizer = torch.optim.Adam(self.model.parameters())

    @modal.method()
    def train_step(self, x, y):
        import torch
        # Arguments arrive as plain lists so the local side does not need torch installed
        x = torch.tensor(x, dtype=torch.float32).cuda()
        y = torch.tensor(y, dtype=torch.float32).cuda()
        loss = torch.nn.functional.mse_loss(self.model(x), y)
        self.optimizer.zero_grad()
        loss.backward()
        self.optimizer.step()
        return loss.item()

@app.local_entrypoint()
def main():
    trainer = TrainingModel()
    x_data = [[0.1] * 10] * 32  # dummy batch: 32 samples with 10 features each
    y_data = [[1.0]] * 32       # dummy targets
    loss = trainer.train_step.remote(x_data, y_data)
    print(f"Loss: {loss}")

Uploading Files from Local to Remote

Using Image.add_local_dir, you can make local files available on the Modal side at runtime. Files such as parameters or configuration in separate files are not uploaded automatically, so you need to send them over with a mechanism like this.

# Upload local directory
image_with_data = modal.Image.debian_slim().add_local_dir(
    local_path="./data",      # Local directory
    remote_path="/modal/data" # Modal side path
)

@app.function(image=image_with_data)
def process_data():
    import os
    # Files in /modal/data are available
    files = os.listdir("/modal/data")
    print(f"Available files: {files}")

Downloading Files from Remote to Local

Using a Volume, you can persist files written during execution and later download them to your local machine with the modal volume get command. Behind the scenes, a Volume is essentially a file system built on top of object storage such as S3.

# Create Volume
volume = modal.Volume.from_name("my-volume", create_if_missing=True)

@app.function(volumes={"/results": volume})
def save_model():
    import torch
    model = torch.nn.Linear(10, 1)
    # Train model here (omitted)
    torch.save(model.state_dict(), "/results/model.pth")

    # Also save log file
    with open("/results/training.log", "w") as f:
        f.write("Training completed successfully")

    return "File save complete"

After execution, download files with the following commands:

modal volume get my-volume model.pth ./local_model.pth
modal volume get my-volume training.log ./training.log

Troubleshooting

Below, I’ll explain some troubles I encountered with Modal and their solutions.

Problem of Multiple Instances Launching Automatically

Modal has auto-scaling enabled by default. For example, if you create a remote object and call training_step repeatedly in a loop (say, 100 times), the Modal side may judge that “the load is increasing” and automatically launch additional instances. This happens because Modal monitors requests per minute (or per second) rather than the number of requests actually running in parallel.

You can address this problem by setting allow_concurrent_inputs and concurrency_limit in the options of the @app.cls decorator. concurrency_limit is particularly important.

@app.cls(
    gpu="T4",
    allow_concurrent_inputs=100,  # Let a single container accept many inputs at once
    concurrency_limit=1           # Never run more than one container at a time
)
class Trainer:
    @modal.method()
    def training_step(self, data):
        import time
        time.sleep(1)
        return "Processing complete"

@app.local_entrypoint()
def main():
    trainer = Trainer()

    for i in range(100):
        result = trainer.training_step.remote(f"data_{i}")
        print(f"Step {i}: {result}")

Be Careful of Name Conflicts Between nfs and Volume

I won’t cover NFS (NetworkFileSystem) this time, since migrating to Volume is recommended, but be aware that you must not create an NFS share and a Volume with the same name. Even though they are separate systems, items with the same name interfere with each other and malfunction. It may have been fixed by now, but as of 2024 this produced mysterious errors. To spare yourself difficult troubleshooting, avoid reusing names, or simply don’t use NFS.

About Pricing

Modal’s pricing structure is a bit complicated, but basically, you’re charged per second for what you use. The nice thing is that you get $30 in free credits every month. This is enough for small-scale experiments and prototype creation, allowing even individual developers to try the latest GPUs casually.

Looking at the actual costs of this test, the machine learning training run took about 2 minutes including environment setup, and the estimated cost was about $0.02 (roughly 2 minutes of T4 time at about $0.59/hour). From the second run onward the image doesn’t need to be rebuilt, so there is no image-creation cost and the execution time drops significantly.

In this way, short jobs on the latest GPUs cost only a few cents per run, so trial and error during development can be done casually. The T4 GPU costs about $0.59/hour and the A100 GPU about $2.50/hour, which is expensive compared to some other services, but considering the reduced setup time and ease of operation, it’s a worthwhile investment.

Summary

This time, I introduced Modal, a cloud service that makes GPUs exceptionally easy to use. It’s a bit pricier than Runpod or Lambda Labs, but its ease of use and practicality are hard to ignore. Please give it a try!

Looking for Work

PredNext is currently accepting project requests. Our specialty is AI-related technologies centered on natural language processing and image processing, with a particular focus on model compression and optimization. If you’re interested, please contact us through our contact form.
