Google's BigQuery and Go

Notes

  • BigQuery(BQ) is Google’s serverless, highly scalable, enterprise data warehouse
  • SQL-like query language
  • Real time data analysis by “streaming ingestion” capability
  • Deals with gigabytes to petabytes of data volume
  • Offers data encryption and data access permissions via Cloud Identity and Access Management
  • Machine Learning and Bussiness Intel can be easily integrated with BQ
  • Support for Geospatial data

Interacting with BQ

To interact with the BQ we can use:

  • A CLI tool
  • The GCP BQ UI (New and Old Web UI)
  • REST API which also allows the use of several client libraries

First things first

To be able to use BQ, it must be enabled for a project. All new projects will have it enabled by default, but for old projects, it must be manually enabled.

On GCP Concepts

Projects

  • Any GCP resources that you allocate and use must belong to a project. In order to interact with most GCP resources, you must provide the identifying project information for every request. You can identify a project in either of two ways: a project ID, or a project number
  • Google Cloud Platform offers Cloud IAM, which lets you assign granular access to specific Google Cloud Platform resources and prevents unwanted access to other resources. Cloud IAM lets you control who (users) has what access (roles) to which resources by setting Cloud IAM policies on the resources.
  • The hierarchy of resources is as follows,
    image

On possible queries

BQ lets you perform,

  • Run interactive queries
  • Run batch queries
  • Create a view, which is a virtual table defined by a SQL query. Views can be created by,
    • Calling the tables.insert API method
    • Using the GCP Console or classic BigQuery web UI
    • Submitting a CREATE VIEW Data Definition Language (DDL) statement
  • Use partitioned tables to query a subset of your data,
    • A partitioned table is a special table that is divided into segments, called partitions, that make it easier to manage and query your data.
  • In addition to querying and viewing data, you can manage data in BigQuery in the following ways:
    • Listing projects, jobs, datasets, and tables
    • Getting information about jobs, datasets, and tables
    • Defining, updating, or patching datasets and tables
    • Deleting datasets and tables
    • Managing table partitions

BigQuery jobs

  • Jobs are actions that BigQuery runs on your behalf to load data, export data, query data, or copy data.
  • Because jobs can potentially take a long time to complete, they execute asynchronously and can be polled for their status.

Datasets

  • A dataset is contained within a specific project. Datasets are top-level containers that are used to organize and control access to your tables and views. A table or view must belong to a dataset, so you need to create at least one dataset before loading data into BigQuery.
  • Datasets are strongly linked to their location (an analogous concept to AWS regions)
  • Allow controlled access.

Useful snippets

Table schemas

  • BigQuery allows you to specify a table’s schema when you load data into a table, and when you create an empty table.
  • To specify the schema one can use,
    • The UI
    • CLI
    • Create a schema file in JSON format.
  • A column name must contain only letters (a-z, A-Z), numbers (0-9), or underscores (_), and it must start with a letter or underscore. The maximum column name length is 128 characters. Can not use prefixes like _TABLE_,_FILE_ or _PARTITION.
  • Column names are case insensitive
  • Columns use standard SQL types
  • Columns accept NULLable modes (allow NULL entries). This is enabled by default
  • Here’s a link in how to send a table schema to the BQ
  • BigQuery performs best when your data is denormalized, so it is better to denormalize your data and take advantage of nested and repeated columns. Nested and repeated columns can maintain relationships without the performance impact of preserving a relational (normalized) schema. For example, a relational database used to track library books would likely keep all author information in a separate table. A key such as author_id would be used to link the book to the authors. In BigQuery, you can preserve the relationship between book and author without creating a separate author table. Instead, you create an author column, and you nest fields within it such as the author’s first name, last name, date of birth, and so on. If a book has multiple authors, you can make the nested author column repeated.
    Here’s an example to have NESTED data in schema https://cloud.google.com/bigquery/docs/nested-repeated
  • Schemas can’t be changed, but can be improved by adding columns

BigQuery tables

  • A BigQuery table contains individual records organized in rows. Each record is composed of columns (also called fields).
  • Every table is defined by a schema that describes the column names, data types, and other information.
  • Table names must be unique for a dataset
  • Tables can be copied only to the same location (region, like in aws)
  • You can create a table in BigQuery go snippet to table creation:
    • Manually using the GCP Console, the classic BigQuery web UI, or the command-line tool’s bq mk command
    • Programmatically by calling the tables.insert API method
    • From query results
    • By defining a table that references an external data source
    • When you load data
  • Table names can have up to 1024 characters

Partitioned tables

  • A partitioned table is a special table that is divided into segments, called partitions, that make it easier to manage and query your data. By dividing a large table into smaller partitions, you can improve query performance, and you can control costs by reducing the number of bytes read by a query.
  • Partitioned tables are ordered by data load date or data date of arrival

Views

  • A view is a virtual table defined by a SQL query
  • You can query and create views in BigQuery by using either of,
    • Query editor box in the GCP Console
    • Compose Query option in the classic BigQuery web UI
    • BigQuery command-line tool’s bq query command
    • BigQuery REST API to programmatically call the jobs.query or query-type jobs.insert methods
    • BigQuery client libraries
  • To create a view take into account the snippet https://cloud.google.com/bigquery/docs/views
  • Views must be created before being viewed
  • Caution viewing views is not the same as viewing tables, views are virtual, tables not

Resources

  1. https://cloud.google.com/bigquery/docs/reference/libraries
  2. https://cloud.google.com/resource-manager/docs/cloud-platform-resource-hierarchy#projects
  3. https://cloud.google.com/bigquery/docs
comments powered by Disqus