===============================================
Weights & Biases Experiment Tracking Tutorial
===============================================

.. meta::
    :description: A comprehensive guide to using Weights & Biases (WandB) for experiment tracking in ML workflows
    :keywords: WandB, Weights & Biases, experiment tracking, machine learning, visualization, hyperparameters

.. .. contents:: Table of Contents
..     :depth: 3
..     :local:

Introduction
============

Weights & Biases (WandB) is a machine learning platform that provides experiment
tracking, model management, and collaboration tools for ML teams. It offers powerful
visualization capabilities, hyperparameter optimization, and seamless integration
with popular ML frameworks.

.. note::

    WandB can be integrated into ML-Train for robust experiment tracking with
    cloud-based storage and advanced visualization features.

What Makes WandB Special
------------------------

Key advantages include:

- **Rich Visualizations**: Interactive charts, plots, and dashboards
- **Cloud Storage**: Secure cloud-based experiment storage and sharing
- **Team Collaboration**: Share experiments and insights with team members
- **Hyperparameter Optimization**: Built-in sweep functionality for automated hyperparameter tuning
- **Model Registry**: Track and version your trained models
- **Framework Integration**: Native support for PyTorch, TensorFlow, Keras, and more

Getting Started
===============

Basic Tracking
--------------

Install the client with ``pip install wandb`` and authenticate once with
``wandb login``. Then you are ready for your first WandB experiment:

.. code-block:: python

    import wandb
    import math

    # Initialize a new run
    wandb.init(project="my-first-project", name="basic-experiment")

    # Track metrics during training
    for step in range(100):
        loss = math.exp(-step/50) + 0.1 * math.sin(step/10)
        accuracy = 1 - math.exp(-step/30)

        wandb.log({
            "loss": loss,
            "accuracy": accuracy
        }, step=step)

    # Finish the run
    wandb.finish()

.. important::

    Always call ``wandb.finish()`` at the end of your training script to ensure all
    data is properly uploaded.

Advanced Tracking
=================

Configuration and Hyperparameters
---------------------------------

Track your experiment configuration:

.. code-block:: python

    import wandb

    # Define configuration
    config = {
        "learning_rate": 0.001,
        "batch_size": 32,
        "epochs": 10,
        "model_type": "transformer",
        "hidden_size": 768
    }

    # Initialize with config
    wandb.init(
        project="my-project",
        config=config,
        name="transformer-experiment"
    )

    # Access config during training
    lr = wandb.config.learning_rate
    batch_size = wandb.config.batch_size

Multiple Metrics at Once
------------------------

Log multiple metrics simultaneously:

.. code-block:: python

    # Log multiple metrics in one call; the values come from your training loop
    wandb.log({
        "train/loss": train_loss,
        "train/accuracy": train_acc,
        "val/loss": val_loss,
        "val/accuracy": val_acc,
        "learning_rate": current_lr,
        "epoch": epoch
    }, step=global_step)

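
You can also tell WandB how a metric should be summarized and which axis it should be
plotted against. A minimal sketch using ``wandb.define_metric``, assuming the
``train/``/``val/`` naming from the snippet above:

.. code-block:: python

    import wandb

    wandb.init(project="my-project")

    # Keep the best validation accuracy in the run summary, not just the last value
    wandb.define_metric("val/accuracy", summary="max")

    # Plot all validation metrics against the epoch counter instead of the global step
    wandb.define_metric("epoch")
    wandb.define_metric("val/*", step_metric="epoch")
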

Tracking Rich Objects
=====================

Images and Plots
----------------

WandB provides excellent support for tracking images and matplotlib figures:

.. code-block:: python

    import wandb
    import matplotlib.pyplot as plt
    import numpy as np
    from PIL import Image

    wandb.init(project="image-tracking")

    # Track matplotlib figures
    fig, ax = plt.subplots()
    ax.plot([1, 2, 3], [1, 4, 2])
    ax.set_title("Training Progress")
    wandb.log({"training_plot": wandb.Image(fig)})
    plt.close(fig)

    # Track PIL images
    img = Image.fromarray(np.random.randint(0, 255, (100, 100, 3), dtype=np.uint8))
    wandb.log({"sample_image": wandb.Image(img, caption="Generated Sample")})

    # Track image arrays directly
    img_array = np.random.random((32, 32, 3))
    wandb.log({"numpy_image": wandb.Image(img_array)})

Tables and DataFrames
---------------------

Track structured data with WandB Tables:

.. code-block:: python

    import pandas as pd
    import wandb

    # Create a table from a pandas DataFrame
    df = pd.DataFrame({
        "epoch": [1, 2, 3, 4, 5],
        "train_loss": [0.8, 0.6, 0.4, 0.3, 0.2],
        "val_loss": [0.9, 0.7, 0.5, 0.4, 0.3],
        "accuracy": [0.7, 0.8, 0.85, 0.9, 0.92]
    })

    table = wandb.Table(dataframe=df)
    wandb.log({"results_table": table})

    # Create a table manually; sample_images, predictions, and targets
    # are assumed to come from your evaluation loop
    columns = ["image", "prediction", "target", "correct"]
    data = []
    for i in range(10):
        img = wandb.Image(sample_images[i])
        data.append([img, predictions[i], targets[i], predictions[i] == targets[i]])

    table = wandb.Table(data=data, columns=columns)
    wandb.log({"predictions": table})

Audio and Video
---------------

Track multimedia content:

.. code-block:: python

    import numpy as np
    import wandb

    # Track audio files
    wandb.log({
        "generated_audio": wandb.Audio("path/to/audio.wav", caption="Generated Speech"),
        "sample_rate": 22050
    })

    # Track video files
    wandb.log({
        "training_animation": wandb.Video("path/to/video.mp4", caption="Training Progress")
    })

    # Track audio arrays
    audio_data = np.random.randn(22050 * 2)  # 2 seconds of random audio
    wandb.log({
        "numpy_audio": wandb.Audio(audio_data, sample_rate=22050, caption="Random Audio")
    })

3D Objects and Point Clouds
---------------------------

Track 3D data for computer vision and robotics applications:

.. code-block:: python

    import numpy as np
    import wandb

    # Track 3D point clouds
    points = np.random.uniform(-1, 1, (1000, 3))
    colors = np.random.randint(0, 255, (1000, 3))

    point_cloud = wandb.Object3D({
        "type": "lidar/beta",
        "points": points,
        "colors": colors
    })

    wandb.log({"point_cloud": point_cloud})

Model Checkpoints and Artifacts
===============================

Artifacts System
----------------

WandB Artifacts provide versioned storage for datasets, models, and other files:

.. code-block:: python

    import wandb

    wandb.init(project="artifact-example")

    # Save a model as an artifact
    artifact = wandb.Artifact("my-model", type="model")
    artifact.add_file("model.pth")
    artifact.add_file("config.json")
    wandb.log_artifact(artifact)

    # Use the artifact from another run (e.g., in a separate evaluation script)
    run = wandb.init(project="artifact-example")
    artifact = run.use_artifact("my-model:latest")
    artifact_dir = artifact.download()

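
Artifacts are not limited to single files; a whole directory can be versioned the
same way. A minimal sketch of versioning a dataset folder (the ``raw-dataset`` name
and ``data/`` path are placeholders):

.. code-block:: python

    import wandb

    wandb.init(project="artifact-example")

    # Version an entire directory; re-logging the same name with changed
    # contents creates a new version (v0, v1, ...)
    dataset = wandb.Artifact("raw-dataset", type="dataset")
    dataset.add_dir("data/")
    wandb.log_artifact(dataset)
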

Model Checkpointing
-------------------

Automatically save model checkpoints during training:

.. code-block:: python

    import torch
    import wandb

    wandb.init(project="checkpoint-example")

    # During the training loop
    for epoch in range(num_epochs):
        # ... training code ...

        # Save a checkpoint every 5 epochs
        if epoch % 5 == 0:
            checkpoint = {
                'epoch': epoch,
                'model_state_dict': model.state_dict(),
                'optimizer_state_dict': optimizer.state_dict(),
                'loss': loss,
            }
            torch.save(checkpoint, f'checkpoint_epoch_{epoch}.pth')

            # Log as an artifact; reusing the name creates a new version each time
            artifact = wandb.Artifact("model-checkpoint", type="model")
            artifact.add_file(f'checkpoint_epoch_{epoch}.pth')
            wandb.log_artifact(artifact)

Hyperparameter Optimization
===========================

WandB Sweeps
------------

Automate hyperparameter optimization with WandB Sweeps:

.. code-block:: yaml

    # sweep_config.yaml
    program: train.py
    method: bayes
    metric:
      goal: maximize
      name: val_accuracy
    parameters:
      learning_rate:
        distribution: log_uniform_values
        min: 0.0001
        max: 0.1
      batch_size:
        values: [16, 32, 64, 128]
      hidden_size:
        values: [128, 256, 512, 1024]
      dropout:
        distribution: uniform
        min: 0.1
        max: 0.5

Create and run the sweep:

.. code-block:: python

    import wandb
    import yaml

    # Load the sweep configuration
    with open('sweep_config.yaml') as f:
        sweep_config = yaml.safe_load(f)

    # Create the sweep
    sweep_id = wandb.sweep(sweep_config, project="hyperparameter-optimization")

    # Run sweep agents
    wandb.agent(sweep_id, function=train, count=50)

Training Function for Sweeps
----------------------------

.. code-block:: python

    def train():
        # Initialize wandb
        wandb.init()

        # Get hyperparameters from the sweep
        config = wandb.config

        # Build the model with sweep parameters
        model = build_model(
            hidden_size=config.hidden_size,
            dropout=config.dropout
        )

        optimizer = torch.optim.Adam(
            model.parameters(),
            lr=config.learning_rate
        )

        # Training loop
        for epoch in range(num_epochs):
            train_loss, train_acc = train_epoch(model, train_loader, optimizer)
            val_loss, val_acc = validate(model, val_loader)

            # Log metrics
            wandb.log({
                "epoch": epoch,
                "train_loss": train_loss,
                "train_accuracy": train_acc,
                "val_loss": val_loss,
                "val_accuracy": val_acc
            })

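
If you prefer to keep everything in Python, ``wandb.sweep`` also accepts the
configuration as a plain dict instead of a YAML file. A minimal sketch mirroring
part of the YAML configuration above:

.. code-block:: python

    import wandb

    sweep_config = {
        "method": "bayes",
        "metric": {"goal": "maximize", "name": "val_accuracy"},
        "parameters": {
            "learning_rate": {
                "distribution": "log_uniform_values",
                "min": 0.0001,
                "max": 0.1,
            },
            "batch_size": {"values": [16, 32, 64, 128]},
        },
    }

    sweep_id = wandb.sweep(sweep_config, project="hyperparameter-optimization")
    wandb.agent(sweep_id, function=train, count=50)  # train() as defined above
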

Framework Integrations
======================

.. tab-set::

    .. tab-item:: PyTorch

        WandB provides seamless PyTorch integration:

        .. code-block:: python

            import torch
            import torch.nn as nn
            import wandb

            # Initialize wandb
            wandb.init(project="pytorch-integration")

            # Define the model
            model = nn.Sequential(
                nn.Linear(784, 128),
                nn.ReLU(),
                nn.Dropout(0.2),
                nn.Linear(128, 10)
            )
            optimizer = torch.optim.Adam(model.parameters())

            wandb.watch(model, log_freq=100)  # Log gradients and parameters

            # Training loop
            for epoch in range(num_epochs):
                model.train()
                for batch_idx, (data, target) in enumerate(train_loader):
                    optimizer.zero_grad()
                    output = model(data)
                    loss = nn.CrossEntropyLoss()(output, target)
                    loss.backward()
                    optimizer.step()

                    # Log metrics
                    if batch_idx % 100 == 0:
                        wandb.log({
                            "batch_loss": loss.item(),
                            "epoch": epoch,
                            "batch": batch_idx
                        })

    .. tab-item:: Hugging Face

        Use WandB with Hugging Face Transformers:

        .. code-block:: python

            from transformers import TrainingArguments, Trainer
            import wandb

            # Initialize wandb
            wandb.init(project="huggingface-integration")

            # Set up training arguments with wandb
            training_args = TrainingArguments(
                output_dir='./results',
                num_train_epochs=3,
                per_device_train_batch_size=16,
                per_device_eval_batch_size=64,
                warmup_steps=500,
                weight_decay=0.01,
                logging_dir='./logs',
                report_to="wandb",  # Enable wandb logging
                run_name="bert-fine-tuning"
            )

            trainer = Trainer(
                model=model,
                args=training_args,
                train_dataset=train_dataset,
                eval_dataset=eval_dataset,
            )

            trainer.train()

    .. tab-item:: TensorFlow/Keras

        .. code-block:: python

            import tensorflow as tf
            from wandb.keras import WandbCallback
            import wandb

            # Initialize wandb
            wandb.init(project="tensorflow-integration")

            # Build the model
            model = tf.keras.Sequential([
                tf.keras.layers.Dense(128, activation='relu', input_shape=(784,)),
                tf.keras.layers.Dropout(0.2),
                tf.keras.layers.Dense(10, activation='softmax')
            ])

            model.compile(
                optimizer='adam',
                loss='sparse_categorical_crossentropy',
                metrics=['accuracy']
            )

            # Train with the WandB callback
            model.fit(
                x_train, y_train,
                batch_size=32,
                epochs=10,
                validation_data=(x_val, y_val),
                callbacks=[WandbCallback()]
            )

Best Practices
==============

Experiment Organization
-----------------------

1. **Use Meaningful Project Names:**

   .. code-block:: python

       wandb.init(
           project="image-classification-resnet",
           name=f"resnet50-lr{learning_rate}-bs{batch_size}",
           tags=["resnet", "baseline", "imagenet"]
       )

2. **Consistent Naming Conventions:**

   .. code-block:: python

       # Use hierarchical metric names
       wandb.log({
           "train/loss": train_loss,
           "train/accuracy": train_acc,
           "val/loss": val_loss,
           "val/accuracy": val_acc,
           "optimizer/learning_rate": current_lr
       })

3. **Use Tags and Groups:**

   .. code-block:: python

       wandb.init(
           project="my-project",
           group="experiment-1",  # Group related runs
           tags=["baseline", "bert", "fine-tuning"],  # Add searchable tags
           notes="Initial baseline with default hyperparameters"
       )

Configuration Management
------------------------

Store comprehensive experiment configuration:

.. code-block:: python

    config = {
        # Model config
        "model": {
            "type": "transformer",
            "num_layers": 12,
            "hidden_size": 768,
            "num_heads": 12,
            "dropout": 0.1
        },
        # Training config
        "training": {
            "learning_rate": 2e-5,
            "batch_size": 32,
            "num_epochs": 10,
            "warmup_steps": 1000,
            "weight_decay": 0.01
        },
        # Data config
        "data": {
            "dataset": "imdb",
            "max_length": 512,
            "train_size": 25000,
            "val_size": 5000
        },
        # Environment
        "environment": {
            "gpu_type": "V100",
            "pytorch_version": "1.9.0",
            "cuda_version": "11.1"
        }
    }

    wandb.init(project="my-project", config=config)

Error Handling and Robustness
-----------------------------

.. code-block:: python

    import wandb

    try:
        wandb.init(project="robust-training")

        # Training code here
        for epoch in range(num_epochs):
            try:
                train_loss = train_epoch()
                val_loss = validate()

                wandb.log({
                    "train_loss": train_loss,
                    "val_loss": val_loss,
                    "epoch": epoch
                })
            except Exception as e:
                wandb.log({"error": str(e), "epoch": epoch})
                print(f"Error in epoch {epoch}: {e}")
                continue
    except KeyboardInterrupt:
        print("Training interrupted by user")
    finally:
        wandb.finish()

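
For long-running jobs it can also help to surface failures proactively rather than
discovering them in the dashboard. A minimal sketch using ``wandb.alert`` (the
threshold and example values are arbitrary):

.. code-block:: python

    import wandb

    wandb.init(project="robust-training")

    train_loss, epoch = 12.3, 4  # example values; in practice from your training loop

    # Send an alert (email/Slack, per your WandB settings) when the loss diverges
    if train_loss > 10.0:
        wandb.alert(
            title="Loss diverged",
            text=f"train_loss reached {train_loss:.2f} at epoch {epoch}"
        )
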

Team Collaboration
==================

Sharing and Reports
-------------------

Create shareable reports for team collaboration:

.. code-block:: python

    # Add run notes and documentation
    wandb.init(
        project="team-project",
        notes="""
        ## Experiment Goals
        - Test new attention mechanism
        - Compare with baseline transformer

        ## Key Findings
        - 15% improvement in accuracy
        - 2x faster convergence

        ## Next Steps
        - Scale to larger dataset
        - Test on additional tasks
        """
    )

    # Log important insights
    wandb.log({
        "insight": "Attention mechanism shows significant improvement",
        "recommendation": "Deploy to production pipeline"
    })

Model Registry
--------------

Use the WandB Model Registry for model versioning:

.. code-block:: python

    import wandb

    # After training
    wandb.init(project="model-registry-demo")

    # Log the model to the registry
    artifact = wandb.Artifact("sentiment-classifier", type="model")
    artifact.add_file("model.pth")
    artifact.add_file("tokenizer.json")
    artifact.add_file("config.json")

    # Add metadata
    artifact.metadata = {
        "accuracy": 0.95,
        "f1_score": 0.94,
        "training_data": "imdb-50k",
        "framework": "pytorch"
    }

    wandb.log_artifact(artifact)

    # Link to the model registry
    wandb.link_artifact(artifact, "model-registry/sentiment-classifier")

Troubleshooting
===============

Common Issues
-------------

**Slow Upload Speeds:**

Configure WandB for better performance:

.. code-block:: python

    import os
    import wandb

    # Reduce upload frequency
    os.environ["WANDB_LOG_INTERNAL"] = "false"

    wandb.init(
        project="my-project",
        settings=wandb.Settings(
            _disable_stats=True,  # Disable system stats
            _disable_meta=True    # Disable metadata collection
        )
    )

**Authentication Issues:**

Check your API key setup:

.. code-block:: bash

    # Verify your setup
    wandb verify

    # Re-login if needed
    wandb login --relogin

**Offline Mode:**

Run experiments without an internet connection:

.. code-block:: python

    import os
    import wandb

    os.environ["WANDB_MODE"] = "offline"

    wandb.init(project="offline-project")
    # Your training code here
    wandb.finish()

    # Sync when back online:
    # wandb sync wandb/offline-run-*/

**Memory Issues:**

For large experiments:

.. code-block:: python

    # Reduce logging overhead
    wandb.init(
        project="large-experiment",
        settings=wandb.Settings(
            save_code=False,
            disable_git=True
        )
    )

    # Log less frequently
    if step % 100 == 0:  # Log every 100 steps instead of every step
        wandb.log(metrics, step=step)

Migration and Integration
=========================

From TensorBoard
----------------

Convert existing TensorBoard logs:

.. code-block:: bash

    # Install the tensorboard integration
    pip install wandb[tensorboard]

    # Sync TensorBoard logs
    wandb sync --tensorboard ./tensorboard_logs

From Other Platforms
--------------------

.. code-block:: python

    # Import existing experiment data
    import wandb
    import pandas as pd

    # Read existing experiment results
    df = pd.read_csv("previous_experiments.csv")

    for _, row in df.iterrows():
        wandb.init(
            project="migrated-experiments",
            name=row["experiment_name"],
            config=row["config"],
            reinit=True
        )

        # Log historical results
        for metric in ["loss", "accuracy", "f1_score"]:
            if metric in row:
                wandb.log({metric: row[metric]})

        wandb.finish()

Conclusion
==========

WandB provides a comprehensive platform for machine learning experiment tracking with
powerful visualization, collaboration, and optimization features. Key takeaways:

- **Start Simple**: Begin with basic metric and config tracking
- **Leverage Rich Media**: Use images, tables, and multimedia logging
- **Optimize Systematically**: Use Sweeps for hyperparameter optimization
- **Collaborate Effectively**: Share experiments and insights with your team
- **Scale Intelligently**: Use artifacts and the model registry for production workflows

With WandB, you can build a complete MLOps pipeline from experimentation to production,
ensuring reproducibility and enabling effective collaboration across your ML team.

.. tip::

    Create your first WandB project today by signing up at https://wandb.ai and running
    your first experiment. The platform's intuitive interface and powerful features will
    transform how you approach ML experimentation.