How to Read a CSV from Google Cloud Storage into a Pandas DataFrame

To read a CSV file from Google Cloud Storage (GCS) into a Pandas DataFrame, you need to follow these steps:

Step 1: Install the necessary packages

Ensure you have the google-cloud-storage and pandas packages installed. If not, you can install them with pip:

pip install google-cloud-storage pandas

Step 2: Set up authentication

Ensure you have a Google Cloud Platform service account key JSON file. If you don’t have one, follow the instructions in the GCP documentation to create one.

Once you have the JSON key file, set the GOOGLE_APPLICATION_CREDENTIALS environment variable to the path of your JSON key file:

export GOOGLE_APPLICATION_CREDENTIALS="/path/to/your/keyfile.json"
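To confirm the variable is visible to your Python process before creating any client (the client library reads it at client-creation time), a quick check that makes no GCS calls:

```python
import os

# The Google Cloud client libraries look up this variable when a
# client is created; if it is unset, they fall back to other
# Application Default Credentials sources.
key_path = os.environ.get("GOOGLE_APPLICATION_CREDENTIALS")
if key_path:
    print(f"Using service account key at: {key_path}")
else:
    print("GOOGLE_APPLICATION_CREDENTIALS is not set")
```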

Step 3: Read the CSV file from GCS and load it into a Pandas DataFrame

Here’s a Python script demonstrating how to read a CSV file from GCS into a Pandas DataFrame.

import io

import pandas as pd
from google.cloud import storage

# Set your GCS bucket and file path
bucket_name = "your-bucket-name"
file_path = "path/to/your/file.csv"


def read_csv_from_gcs(bucket_name, file_path):
  # Create a GCS client
  storage_client = storage.Client()

  # Get the bucket and blob objects (bucket() builds a reference
  # without making an extra metadata request)
  bucket = storage_client.bucket(bucket_name)
  blob = bucket.blob(file_path)

  # Download the contents of the blob as a string
  csv_data = blob.download_as_text()

  # Wrap the string in io.StringIO so read_csv can parse it like a file
  dataframe = pd.read_csv(io.StringIO(csv_data))
  return dataframe


# Read the CSV file and load it into a DataFrame
df = read_csv_from_gcs(bucket_name, file_path)

# Print the DataFrame
print(df)
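To see the parsing step in isolation (no GCS access or credentials needed), the same io.StringIO-to-read_csv pattern can be tried with inline data; the sample CSV below is made up for illustration:

```python
import io
import pandas as pd

# Once blob.download_as_text() has returned the CSV contents as a
# string, io.StringIO wraps it so pd.read_csv can parse it like a file.
csv_data = "name,score\nada,90\ngrace,95\n"
df = pd.read_csv(io.StringIO(csv_data))
print(df.shape)  # (2, 2)
```

This is exactly what happens inside read_csv_from_gcs after the download completes.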

Replace “your-bucket-name” and “path/to/your/file.csv” with your actual GCS bucket name and object path, then run the script. It will read the CSV file from GCS into a Pandas DataFrame, which you can then process or analyze as needed.

That’s it.