What is Random Sampling in Google BigQuery

In Google BigQuery, you can perform a random sampling of your data using a combination of SQL functions. One common method is to use the RAND() function along with LIMIT or TABLESAMPLE.

Here are two approaches to performing random sampling in BigQuery.

  1. Method 1: Using the RAND() function with ORDER BY and LIMIT
  2. Method 2: Using the TABLESAMPLE SYSTEM

Method 1: Using the RAND() function with ORDER BY and LIMIT

The RAND() function is suitable when selecting a specific number of random rows from your table. The function generates a random floating-point number between 0 and 1, which orders the rows. Then, you can use the LIMIT clause to select the desired number of rows.

SELECT *
FROM your_project.your_dataset.your_table
ORDER BY RAND()
LIMIT 1000;

Replace your_project.your_dataset.your_table with the appropriate identifiers for your project, dataset, and table. The LIMIT value (1000 in this example) represents the number of random rows you want to select.

Method 2: Using the TABLESAMPLE SYSTEM

The TABLESAMPLE SYSTEM method is more efficient when selecting a percentage of random rows from your table.

The TABLESAMPLE SYSTEM clause allows you to sample a percentage of the table’s data.

The SYSTEM sampling method is efficient but may not produce an exact percentage due to how BigQuery stores data.

SELECT *
FROM your_project.your_dataset.your_table
TABLESAMPLE SYSTEM (10 PERCENT);

Replace your_project.your_dataset.your_table with the appropriate identifiers for your project, dataset, and table.

The percentage value (10 in this example) represents the approximate percentage of rows you want to select randomly.

Note that the TABLESAMPLE SYSTEM method may not be suitable for small tables, as it might produce fewer rows than expected or no rows at all.

In such cases, using the RAND() function with ORDER BY and LIMIT would be a better option.

That’s it.

Leave a Comment