===============================================
Weights & Biases Experiment Tracking Tutorial
===============================================

.. meta::
    :description: A comprehensive guide to using Weights & Biases (WandB) for experiment tracking in ML workflows
    :keywords: WandB, Weights & Biases, experiment tracking, machine learning, visualization, hyperparameters

.. .. contents:: Table of Contents
..     :depth: 3
..     :local:

Introduction
============

Weights & Biases (WandB) is a machine learning platform that provides experiment
tracking, model management, and collaboration tools for ML teams. It offers powerful
visualization capabilities, hyperparameter optimization, and seamless integration
with popular ML frameworks.

.. note::

    WandB can be integrated into ML-Train for robust experiment tracking with
    cloud-based storage and advanced visualization features.

What Makes WandB Special
------------------------

Key advantages include:

- **Rich Visualizations**: Interactive charts, plots, and dashboards
- **Cloud Storage**: Secure cloud-based experiment storage and sharing
- **Team Collaboration**: Share experiments and insights with team members
- **Hyperparameter Optimization**: Built-in sweep functionality for automated hyperparameter tuning
- **Model Registry**: Track and version your trained models
- **Framework Integration**: Native support for PyTorch, TensorFlow, Keras, and more

Getting Started
===============

Basic Tracking
--------------

Install the client with ``pip install wandb`` and authenticate once with
``wandb login``. Then you are ready for your first WandB experiment:

.. code-block:: python

    import wandb
    import math

    # Initialize a new run
    wandb.init(project="my-first-project", name="basic-experiment")

    # Track metrics during training
    for step in range(100):
        loss = math.exp(-step/50) + 0.1 * math.sin(step/10)
        accuracy = 1 - math.exp(-step/30)

        wandb.log({
            "loss": loss,
            "accuracy": accuracy
        }, step=step)

    # Finish the run
    wandb.finish()

.. important::

    Always call ``wandb.finish()`` at the end of your training script to ensure all
    data is properly uploaded.

Advanced Tracking
=================

Configuration and Hyperparameters
---------------------------------

Track your experiment configuration:

.. code-block:: python

    import wandb

    # Define configuration
    config = {
        "learning_rate": 0.001,
        "batch_size": 32,
        "epochs": 10,
        "model_type": "transformer",
        "hidden_size": 768
    }

    # Initialize with config
    wandb.init(
        project="my-project",
        config=config,
        name="transformer-experiment"
    )

    # Access config during training
    lr = wandb.config.learning_rate
    batch_size = wandb.config.batch_size

Multiple Metrics at Once
------------------------

Log multiple metrics simultaneously:

.. code-block:: python

    # Log multiple metrics in one call; the values come from your training loop
    wandb.log({
        "train/loss": train_loss,
        "train/accuracy": train_acc,
        "val/loss": val_loss,
        "val/accuracy": val_acc,
        "learning_rate": current_lr,
        "epoch": epoch
    }, step=global_step)

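
You can also tell WandB how a metric should be summarized and which axis it should be
plotted against. A minimal sketch using ``wandb.define_metric``, assuming the
``train/``/``val/`` naming from the snippet above:

.. code-block:: python

    import wandb

    wandb.init(project="my-project")

    # Keep the best validation accuracy in the run summary, not just the last value
    wandb.define_metric("val/accuracy", summary="max")

    # Plot all validation metrics against the epoch counter instead of the global step
    wandb.define_metric("epoch")
    wandb.define_metric("val/*", step_metric="epoch")
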

Tracking Rich Objects
=====================

Images and Plots
----------------

WandB provides excellent support for tracking images and matplotlib figures:

.. code-block:: python

    import wandb
    import matplotlib.pyplot as plt
    import numpy as np
    from PIL import Image

    wandb.init(project="image-tracking")

    # Track matplotlib figures
    fig, ax = plt.subplots()
    ax.plot([1, 2, 3], [1, 4, 2])
    ax.set_title("Training Progress")
    wandb.log({"training_plot": wandb.Image(fig)})
    plt.close(fig)

    # Track PIL images
    img = Image.fromarray(np.random.randint(0, 255, (100, 100, 3), dtype=np.uint8))
    wandb.log({"sample_image": wandb.Image(img, caption="Generated Sample")})

    # Track image arrays directly
    img_array = np.random.random((32, 32, 3))
    wandb.log({"numpy_image": wandb.Image(img_array)})

Tables and DataFrames
---------------------

Track structured data with WandB Tables:

.. code-block:: python

    import pandas as pd
    import wandb

    # Create a table from a pandas DataFrame
    df = pd.DataFrame({
        "epoch": [1, 2, 3, 4, 5],
        "train_loss": [0.8, 0.6, 0.4, 0.3, 0.2],
        "val_loss": [0.9, 0.7, 0.5, 0.4, 0.3],
        "accuracy": [0.7, 0.8, 0.85, 0.9, 0.92]
    })

    table = wandb.Table(dataframe=df)
    wandb.log({"results_table": table})

    # Create a table manually; sample_images, predictions, and targets
    # are assumed to come from your evaluation loop
    columns = ["image", "prediction", "target", "correct"]
    data = []
    for i in range(10):
        img = wandb.Image(sample_images[i])
        data.append([img, predictions[i], targets[i], predictions[i] == targets[i]])

    table = wandb.Table(data=data, columns=columns)
    wandb.log({"predictions": table})

Audio and Video
---------------

Track multimedia content:

.. code-block:: python

    import numpy as np
    import wandb

    # Track audio files
    wandb.log({
        "generated_audio": wandb.Audio("path/to/audio.wav", caption="Generated Speech"),
        "sample_rate": 22050
    })

    # Track video files
    wandb.log({
        "training_animation": wandb.Video("path/to/video.mp4", caption="Training Progress")
    })

    # Track audio arrays
    audio_data = np.random.randn(22050 * 2)  # 2 seconds of random audio
    wandb.log({
        "numpy_audio": wandb.Audio(audio_data, sample_rate=22050, caption="Random Audio")
    })

3D Objects and Point Clouds
---------------------------

Track 3D data for computer vision and robotics applications:

.. code-block:: python

    import numpy as np
    import wandb

    # Track 3D point clouds
    points = np.random.uniform(-1, 1, (1000, 3))
    colors = np.random.randint(0, 255, (1000, 3))

    point_cloud = wandb.Object3D({
        "type": "lidar/beta",
        "points": points,
        "colors": colors
    })

    wandb.log({"point_cloud": point_cloud})

Model Checkpoints and Artifacts
===============================

Artifacts System
----------------

WandB Artifacts provide versioned storage for datasets, models, and other files:

.. code-block:: python

    import wandb

    wandb.init(project="artifact-example")

    # Save a model as an artifact
    artifact = wandb.Artifact("my-model", type="model")
    artifact.add_file("model.pth")
    artifact.add_file("config.json")
    wandb.log_artifact(artifact)

    # Use the artifact from another run (e.g., in a separate evaluation script)
    run = wandb.init(project="artifact-example")
    artifact = run.use_artifact("my-model:latest")
    artifact_dir = artifact.download()

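
Artifacts are not limited to single files; a whole directory can be versioned the
same way. A minimal sketch of versioning a dataset folder (the ``raw-dataset`` name
and ``data/`` path are placeholders):

.. code-block:: python

    import wandb

    wandb.init(project="artifact-example")

    # Version an entire directory; re-logging the same name with changed
    # contents creates a new version (v0, v1, ...)
    dataset = wandb.Artifact("raw-dataset", type="dataset")
    dataset.add_dir("data/")
    wandb.log_artifact(dataset)
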

Model Checkpointing
-------------------

Automatically save model checkpoints during training:

.. code-block:: python

    import torch
    import wandb

    wandb.init(project="checkpoint-example")

    # During the training loop
    for epoch in range(num_epochs):
        # ... training code ...

        # Save a checkpoint every 5 epochs
        if epoch % 5 == 0:
            checkpoint = {
                'epoch': epoch,
                'model_state_dict': model.state_dict(),
                'optimizer_state_dict': optimizer.state_dict(),
                'loss': loss,
            }
            torch.save(checkpoint, f'checkpoint_epoch_{epoch}.pth')

            # Log as an artifact; reusing the name creates a new version each time
            artifact = wandb.Artifact("model-checkpoint", type="model")
            artifact.add_file(f'checkpoint_epoch_{epoch}.pth')
            wandb.log_artifact(artifact)

Hyperparameter Optimization
===========================

WandB Sweeps
------------

Automate hyperparameter optimization with WandB Sweeps:

.. code-block:: yaml

    # sweep_config.yaml
    program: train.py
    method: bayes
    metric:
      goal: maximize
      name: val_accuracy
    parameters:
      learning_rate:
        distribution: log_uniform_values
        min: 0.0001
        max: 0.1
      batch_size:
        values: [16, 32, 64, 128]
      hidden_size:
        values: [128, 256, 512, 1024]
      dropout:
        distribution: uniform
        min: 0.1
        max: 0.5

Create and run the sweep:

.. code-block:: python

    import wandb
    import yaml

    # Load the sweep configuration
    with open('sweep_config.yaml') as f:
        sweep_config = yaml.safe_load(f)

    # Create the sweep
    sweep_id = wandb.sweep(sweep_config, project="hyperparameter-optimization")

    # Run sweep agents
    wandb.agent(sweep_id, function=train, count=50)

Training Function for Sweeps
----------------------------

.. code-block:: python

    def train():
        # Initialize wandb
        wandb.init()

        # Get hyperparameters from the sweep
        config = wandb.config

        # Build the model with sweep parameters
        model = build_model(
            hidden_size=config.hidden_size,
            dropout=config.dropout
        )

        optimizer = torch.optim.Adam(
            model.parameters(),
            lr=config.learning_rate
        )

        # Training loop
        for epoch in range(num_epochs):
            train_loss, train_acc = train_epoch(model, train_loader, optimizer)
            val_loss, val_acc = validate(model, val_loader)

            # Log metrics
            wandb.log({
                "epoch": epoch,
                "train_loss": train_loss,
                "train_accuracy": train_acc,
                "val_loss": val_loss,
                "val_accuracy": val_acc
            })

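
If you prefer to keep everything in Python, ``wandb.sweep`` also accepts the
configuration as a plain dict instead of a YAML file. A minimal sketch mirroring
part of the YAML configuration above:

.. code-block:: python

    import wandb

    sweep_config = {
        "method": "bayes",
        "metric": {"goal": "maximize", "name": "val_accuracy"},
        "parameters": {
            "learning_rate": {
                "distribution": "log_uniform_values",
                "min": 0.0001,
                "max": 0.1,
            },
            "batch_size": {"values": [16, 32, 64, 128]},
        },
    }

    sweep_id = wandb.sweep(sweep_config, project="hyperparameter-optimization")
    wandb.agent(sweep_id, function=train, count=50)  # train() as defined above
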

Framework Integrations
======================

.. tab-set::

    .. tab-item:: PyTorch

        WandB provides seamless PyTorch integration:

        .. code-block:: python

            import torch
            import torch.nn as nn
            import wandb

            # Initialize wandb
            wandb.init(project="pytorch-integration")

            # Define the model
            model = nn.Sequential(
                nn.Linear(784, 128),
                nn.ReLU(),
                nn.Dropout(0.2),
                nn.Linear(128, 10)
            )
            optimizer = torch.optim.Adam(model.parameters())

            wandb.watch(model, log_freq=100)  # Log gradients and parameters

            # Training loop
            for epoch in range(num_epochs):
                model.train()
                for batch_idx, (data, target) in enumerate(train_loader):
                    optimizer.zero_grad()
                    output = model(data)
                    loss = nn.CrossEntropyLoss()(output, target)
                    loss.backward()
                    optimizer.step()

                    # Log metrics
                    if batch_idx % 100 == 0:
                        wandb.log({
                            "batch_loss": loss.item(),
                            "epoch": epoch,
                            "batch": batch_idx
                        })

    .. tab-item:: Hugging Face

        Use WandB with Hugging Face Transformers:

        .. code-block:: python

            from transformers import TrainingArguments, Trainer
            import wandb

            # Initialize wandb
            wandb.init(project="huggingface-integration")

            # Set up training arguments with wandb
            training_args = TrainingArguments(
                output_dir='./results',
                num_train_epochs=3,
                per_device_train_batch_size=16,
                per_device_eval_batch_size=64,
                warmup_steps=500,
                weight_decay=0.01,
                logging_dir='./logs',
                report_to="wandb",  # Enable wandb logging
                run_name="bert-fine-tuning"
            )

            trainer = Trainer(
                model=model,
                args=training_args,
                train_dataset=train_dataset,
                eval_dataset=eval_dataset,
            )

            trainer.train()

    .. tab-item:: TensorFlow/Keras

        .. code-block:: python

            import tensorflow as tf
            from wandb.keras import WandbCallback
            import wandb

            # Initialize wandb
            wandb.init(project="tensorflow-integration")

            # Build the model
            model = tf.keras.Sequential([
                tf.keras.layers.Dense(128, activation='relu', input_shape=(784,)),
                tf.keras.layers.Dropout(0.2),
                tf.keras.layers.Dense(10, activation='softmax')
            ])

            model.compile(
                optimizer='adam',
                loss='sparse_categorical_crossentropy',
                metrics=['accuracy']
            )

            # Train with the WandB callback
            model.fit(
                x_train, y_train,
                batch_size=32,
                epochs=10,
                validation_data=(x_val, y_val),
                callbacks=[WandbCallback()]
            )

Best Practices
==============

Experiment Organization
-----------------------

1. **Use Meaningful Project Names:**

   .. code-block:: python

       wandb.init(
           project="image-classification-resnet",
           name=f"resnet50-lr{learning_rate}-bs{batch_size}",
           tags=["resnet", "baseline", "imagenet"]
       )

2. **Consistent Naming Conventions:**

   .. code-block:: python

       # Use hierarchical metric names
       wandb.log({
           "train/loss": train_loss,
           "train/accuracy": train_acc,
           "val/loss": val_loss,
           "val/accuracy": val_acc,
           "optimizer/learning_rate": current_lr
       })

3. **Use Tags and Groups:**

   .. code-block:: python

       wandb.init(
           project="my-project",
           group="experiment-1",  # Group related runs
           tags=["baseline", "bert", "fine-tuning"],  # Add searchable tags
           notes="Initial baseline with default hyperparameters"
       )

Configuration Management
------------------------

Store comprehensive experiment configuration:

.. code-block:: python

    config = {
        # Model config
        "model": {
            "type": "transformer",
            "num_layers": 12,
            "hidden_size": 768,
            "num_heads": 12,
            "dropout": 0.1
        },
        # Training config
        "training": {
            "learning_rate": 2e-5,
            "batch_size": 32,
            "num_epochs": 10,
            "warmup_steps": 1000,
            "weight_decay": 0.01
        },
        # Data config
        "data": {
            "dataset": "imdb",
            "max_length": 512,
            "train_size": 25000,
            "val_size": 5000
        },
        # Environment
        "environment": {
            "gpu_type": "V100",
            "pytorch_version": "1.9.0",
            "cuda_version": "11.1"
        }
    }

    wandb.init(project="my-project", config=config)

Error Handling and Robustness
-----------------------------

.. code-block:: python

    import wandb

    try:
        wandb.init(project="robust-training")

        # Training code here
        for epoch in range(num_epochs):
            try:
                train_loss = train_epoch()
                val_loss = validate()

                wandb.log({
                    "train_loss": train_loss,
                    "val_loss": val_loss,
                    "epoch": epoch
                })
            except Exception as e:
                wandb.log({"error": str(e), "epoch": epoch})
                print(f"Error in epoch {epoch}: {e}")
                continue
    except KeyboardInterrupt:
        print("Training interrupted by user")
    finally:
        wandb.finish()

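
For long-running jobs it can also help to surface failures proactively rather than
discovering them in the dashboard. A minimal sketch using ``wandb.alert`` (the
threshold and example values are arbitrary):

.. code-block:: python

    import wandb

    wandb.init(project="robust-training")

    train_loss, epoch = 12.3, 4  # example values; in practice from your training loop

    # Send an alert (email/Slack, per your WandB settings) when the loss diverges
    if train_loss > 10.0:
        wandb.alert(
            title="Loss diverged",
            text=f"train_loss reached {train_loss:.2f} at epoch {epoch}"
        )
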

Team Collaboration
==================

Sharing and Reports
-------------------

Create shareable reports for team collaboration:

.. code-block:: python

    # Add run notes and documentation
    wandb.init(
        project="team-project",
        notes="""
        ## Experiment Goals
        - Test new attention mechanism
        - Compare with baseline transformer

        ## Key Findings
        - 15% improvement in accuracy
        - 2x faster convergence

        ## Next Steps
        - Scale to larger dataset
        - Test on additional tasks
        """
    )

    # Log important insights
    wandb.log({
        "insight": "Attention mechanism shows significant improvement",
        "recommendation": "Deploy to production pipeline"
    })

Model Registry
--------------

Use the WandB Model Registry for model versioning:

.. code-block:: python

    import wandb

    # After training
    wandb.init(project="model-registry-demo")

    # Log the model to the registry
    artifact = wandb.Artifact("sentiment-classifier", type="model")
    artifact.add_file("model.pth")
    artifact.add_file("tokenizer.json")
    artifact.add_file("config.json")

    # Add metadata
    artifact.metadata = {
        "accuracy": 0.95,
        "f1_score": 0.94,
        "training_data": "imdb-50k",
        "framework": "pytorch"
    }

    wandb.log_artifact(artifact)

    # Link to the model registry
    wandb.link_artifact(artifact, "model-registry/sentiment-classifier")

Troubleshooting
===============

Common Issues
-------------

**Slow Upload Speeds:**

Configure WandB for better performance:

.. code-block:: python

    import os
    import wandb

    # Reduce upload frequency
    os.environ["WANDB_LOG_INTERNAL"] = "false"

    wandb.init(
        project="my-project",
        settings=wandb.Settings(
            _disable_stats=True,  # Disable system stats
            _disable_meta=True    # Disable metadata collection
        )
    )

**Authentication Issues:**

Check your API key setup:

.. code-block:: bash

    # Verify your setup
    wandb verify

    # Re-login if needed
    wandb login --relogin

**Offline Mode:**

Run experiments without an internet connection:

.. code-block:: python

    import os
    import wandb

    os.environ["WANDB_MODE"] = "offline"

    wandb.init(project="offline-project")
    # Your training code here
    wandb.finish()

    # Sync when back online:
    # wandb sync wandb/offline-run-*/

**Memory Issues:**

For large experiments:

.. code-block:: python

    # Reduce logging overhead
    wandb.init(
        project="large-experiment",
        settings=wandb.Settings(
            save_code=False,
            disable_git=True
        )
    )

    # Log less frequently
    if step % 100 == 0:  # Log every 100 steps instead of every step
        wandb.log(metrics, step=step)

Migration and Integration
=========================

From TensorBoard
----------------

Convert existing TensorBoard logs:

.. code-block:: bash

    # Install the tensorboard integration
    pip install wandb[tensorboard]

    # Sync TensorBoard logs
    wandb sync --tensorboard ./tensorboard_logs

From Other Platforms
--------------------

.. code-block:: python

    # Import existing experiment data
    import wandb
    import pandas as pd

    # Read existing experiment results
    df = pd.read_csv("previous_experiments.csv")

    for _, row in df.iterrows():
        wandb.init(
            project="migrated-experiments",
            name=row["experiment_name"],
            config=row["config"],
            reinit=True
        )

        # Log historical results
        for metric in ["loss", "accuracy", "f1_score"]:
            if metric in row:
                wandb.log({metric: row[metric]})

        wandb.finish()

Conclusion
==========

WandB provides a comprehensive platform for machine learning experiment tracking with
powerful visualization, collaboration, and optimization features. Key takeaways:

- **Start Simple**: Begin with basic metric and config tracking
- **Leverage Rich Media**: Use images, tables, and multimedia logging
- **Optimize Systematically**: Use Sweeps for hyperparameter optimization
- **Collaborate Effectively**: Share experiments and insights with your team
- **Scale Intelligently**: Use artifacts and the model registry for production workflows

With WandB, you can build a complete MLOps pipeline from experimentation to production,
ensuring reproducibility and enabling effective collaboration across your ML team.

.. tip::

    Create your first WandB project today by signing up at https://wandb.ai and running
    your first experiment. The platform's intuitive interface and powerful features will
    transform how you approach ML experimentation.