7 Essential SQL Statements Every Data Scientist Should Know

Improve your data analysis skills with these 7 key SQL concepts.

Patricio Villanueva
8 min read · Mar 13, 2023

Introduction

Structured Query Language (SQL) is a powerful tool that data scientists use to extract insights from data. Whether you are working with small or large datasets, knowing the core SQL statements is essential for manipulating data effectively.

This post covers the following SQL statements and clauses:

  1. SELECT statement
  2. WHERE clause
  3. JOIN statement
  4. GROUP BY statement
  5. ORDER BY statement
  6. Subqueries
  7. UNION statement
  8. CREATE TABLE statement

SELECT Statement

The SELECT statement is the most basic SQL statement. It retrieves data from one or more tables: you list the columns you want after SELECT and name the table or tables to read from after FROM.

Here’s an example of how to use the SELECT statement to retrieve data from a single table:

SELECT column1, column2
FROM table_name;
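
To try this without setting up a database server, the query can be run against an in-memory SQLite database from Python. This is a minimal sketch: the employees table, its columns, and its rows are hypothetical, and Python's built-in sqlite3 module is used only to make the example runnable.

```python
import sqlite3

# Hypothetical table and data, used only for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (name TEXT, department TEXT, salary REAL)")
conn.executemany(
    "INSERT INTO employees VALUES (?, ?, ?)",
    [("Ana", "Data", 5200.0), ("Ben", "Sales", 4100.0)],
)

# Retrieve only two of the three columns.
rows = conn.execute("SELECT name, department FROM employees").fetchall()
print(rows)
```

Each row comes back as a tuple containing only the selected columns, so salary is not part of the result.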

JOIN Statement

The JOIN statement is used to combine data from two or more tables based on a common field. When working with relational databases, data is often stored across multiple tables, and JOIN statements provide a way to combine that data into a single result set.

There are several types of JOIN statements, including INNER JOIN, LEFT JOIN, RIGHT JOIN, and FULL OUTER JOIN, each used for a specific purpose. Here’s an overview of each type:

  • INNER JOIN: returns only the rows that have matching values in both tables.
  • LEFT JOIN: returns all rows from the left table and the matched rows from the right table. If a left-table row has no match, the right table's columns are returned as NULL.
  • RIGHT JOIN: returns all rows from the right table and the matched rows from the left table. If a right-table row has no match, the left table's columns are returned as NULL.
  • FULL OUTER JOIN: returns all rows from both tables. Where a row has no match in the other table, that table's columns are returned as NULL.

Here’s an example of how to use the INNER JOIN statement to combine data from two tables:

SELECT *
FROM table1
INNER JOIN table2
ON table1.column_name = table2.column_name;

This SQL code will combine data from table1 and table2 based on the matching values in column_name.

You can also use the LEFT JOIN, RIGHT JOIN, and FULL OUTER JOIN statements to combine data from multiple tables, depending on your specific data needs.
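
The difference between INNER JOIN and LEFT JOIN is easiest to see on a tiny dataset. The sketch below assumes two hypothetical tables, customers and orders, and runs in an in-memory SQLite database through Python's built-in sqlite3 module. RIGHT JOIN and FULL OUTER JOIN are left out because older SQLite versions do not support them.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical tables: Ben is a customer with no orders.
conn.executescript("""
    CREATE TABLE customers (customer_id INTEGER, name TEXT);
    CREATE TABLE orders (order_id INTEGER, customer_id INTEGER);
    INSERT INTO customers VALUES (1, 'Ana'), (2, 'Ben');
    INSERT INTO orders VALUES (10, 1);
""")

inner_rows = conn.execute("""
    SELECT customers.name, orders.order_id
    FROM customers
    INNER JOIN orders ON customers.customer_id = orders.customer_id
    ORDER BY customers.name
""").fetchall()

left_rows = conn.execute("""
    SELECT customers.name, orders.order_id
    FROM customers
    LEFT JOIN orders ON customers.customer_id = orders.customer_id
    ORDER BY customers.name
""").fetchall()

print(inner_rows)  # [('Ana', 10)]
print(left_rows)   # [('Ana', 10), ('Ben', None)]
```

The INNER JOIN drops Ben because he has no matching order, while the LEFT JOIN keeps him and fills order_id with NULL, which Python shows as None.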

GROUP BY Statement

The GROUP BY statement is used to group data by one or more columns. It is a useful SQL statement that allows you to aggregate data and perform calculations on groups of data.

Here’s an example of how to use the GROUP BY statement:

SELECT column1, COUNT(*)
FROM table_name
GROUP BY column1;

This SQL code will group data by column1 and count the number of rows in each group.

GROUP BY works with the other aggregate functions as well, such as AVG, SUM, MIN, and MAX. For example:

SELECT column1, AVG(column2)
FROM table_name
GROUP BY column1;

This SQL code will group data by column1 and calculate the average value of column2 for each group.

You can also use the HAVING clause to filter groups based on specific conditions. Unlike WHERE, which filters individual rows before they are grouped, HAVING is applied after the aggregates have been computed. For example:

SELECT column1, AVG(column2)
FROM table_name
GROUP BY column1
HAVING AVG(column2) > 10;

This SQL code will group data by column1, calculate the average value of column2 for each group, and then keep only the groups where that average is greater than 10.
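
Here is the same pattern on concrete numbers, as a small sketch. The sales table and its values are hypothetical, and the query runs in an in-memory SQLite database through Python's built-in sqlite3 module.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical data: north averages 12, south averages 5.
conn.executescript("""
    CREATE TABLE sales (region TEXT, amount REAL);
    INSERT INTO sales VALUES
        ('north', 8), ('north', 16), ('south', 4), ('south', 6);
""")

rows = conn.execute("""
    SELECT region, COUNT(*), AVG(amount)
    FROM sales
    GROUP BY region
    HAVING AVG(amount) > 10
""").fetchall()

print(rows)  # [('north', 2, 12.0)]
```

Only the north group survives the HAVING filter, because the south average of 5 is not greater than 10.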

In summary, the GROUP BY statement is a powerful SQL statement that allows you to group data by one or more columns and perform calculations on groups of data using aggregate functions. By understanding how to use the GROUP BY statement, you can extract valuable insights from large datasets.

ORDER BY Statement

The ORDER BY statement is used to sort data in ascending or descending order based on one or more columns. It is a simple yet powerful SQL statement that allows you to sort data in a way that makes it easier to analyze.

Here’s an example of how to use the ORDER BY statement:

SELECT column1, column2, column3
FROM table_name
ORDER BY column1;

This SQL code will sort the data in table_name in ascending order based on column1.

You can also use the ORDER BY statement to sort data in descending order by adding the DESC keyword. For example:

SELECT column1, column2, column3
FROM table_name
ORDER BY column1 DESC;

This SQL code will sort the data in table_name in descending order based on column1.

In addition to sorting data by a single column, you can also sort data by multiple columns. For example:

SELECT column1, column2, column3
FROM table_name
ORDER BY column1, column2 DESC;

This SQL code will first sort the data in table_name in ascending order based on column1, and then sort the data within each group of column1 in descending order based on column2.

By default, the ORDER BY statement sorts data in ascending order, but you can use the ASC keyword to explicitly specify ascending order. For example:

SELECT column1, column2, column3
FROM table_name
ORDER BY column1 ASC, column2 DESC;

This SQL code will sort the data in table_name in ascending order based on column1 and descending order based on column2.
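
A multi-column sort can be checked on a few rows, as in this sketch. The scores table and its contents are hypothetical, and the query runs in an in-memory SQLite database through Python's built-in sqlite3 module.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical data, inserted deliberately out of order.
conn.executescript("""
    CREATE TABLE scores (team TEXT, player TEXT, points INTEGER);
    INSERT INTO scores VALUES
        ('b', 'Dee', 7), ('a', 'Ana', 3), ('a', 'Ben', 9), ('b', 'Cy', 5);
""")

rows = conn.execute("""
    SELECT team, player, points
    FROM scores
    ORDER BY team ASC, points DESC
""").fetchall()

for row in rows:
    print(row)
# ('a', 'Ben', 9)
# ('a', 'Ana', 3)
# ('b', 'Dee', 7)
# ('b', 'Cy', 5)
```

Teams appear in ascending order, and within each team the highest score comes first.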

In summary, the ORDER BY statement is a powerful SQL statement that allows you to sort data in ascending or descending order based on one or more columns. By understanding how to use the ORDER BY statement, you can quickly sort and analyze large datasets.

Subqueries

Subqueries are queries embedded within another query. They are useful for retrieving data from multiple tables or for filtering data based on the result of another query. A non-correlated subquery, like the ones below, can be evaluated on its own, and the outer query then uses its result to retrieve data.

Here’s an example of how to use a subquery:

SELECT column1, column2
FROM table_name1
WHERE column1 IN (
    SELECT column1
    FROM table_name2
);

This SQL code will retrieve data from table_name1 where column1 is in the list of values returned by the subquery.

A subquery that returns a single value can also be used with a comparison operator such as =. For example:

SELECT column1, column2
FROM table_name1
WHERE column1 = (
    SELECT MAX(column1)
    FROM table_name2
);

This SQL code will retrieve data from table_name1 where column1 is equal to the maximum value of column1 returned by the subquery.
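
Both subquery forms can be tried side by side in the following sketch. The products and order_items tables are hypothetical, and the queries run in an in-memory SQLite database through Python's built-in sqlite3 module.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical data: 'ink' has never been ordered.
conn.executescript("""
    CREATE TABLE products (product_id INTEGER, name TEXT, price REAL);
    CREATE TABLE order_items (product_id INTEGER);
    INSERT INTO products VALUES (1, 'pen', 2.5), (2, 'ink', 9.0), (3, 'pad', 4.0);
    INSERT INTO order_items VALUES (1), (3), (3);
""")

# IN subquery: products that appear in at least one order.
ordered = conn.execute("""
    SELECT name
    FROM products
    WHERE product_id IN (SELECT product_id FROM order_items)
    ORDER BY name
""").fetchall()

# Single-value subquery: the product with the highest price.
priciest = conn.execute("""
    SELECT name
    FROM products
    WHERE price = (SELECT MAX(price) FROM products)
""").fetchall()

print(ordered)   # [('pad',), ('pen',)]
print(priciest)  # [('ink',)]
```

Note that 'pad' appears only once in the first result even though it was ordered twice, because IN only tests membership.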

In summary, subqueries are powerful SQL statements that allow you to retrieve data from multiple tables or filter data based on subqueries’ results. By understanding how to use subqueries, you can extract valuable insights from complex datasets.

UNION Statement

The UNION statement is used to combine the results of two or more SELECT statements into a single result set. Each SELECT statement must have the same number of columns, and the data types of the columns must be compatible. Here’s an example of how to use the UNION statement:

SELECT column1, column2
FROM table_name1
UNION
SELECT column1, column2
FROM table_name2;

This SQL code will combine the results of two SELECT statements, one from table_name1 and one from table_name2, into a single result set. The result set will contain the distinct combinations of column1 and column2 from both tables, because UNION removes duplicate rows by default.

Some dialects, such as MySQL and BigQuery, also accept an explicit DISTINCT keyword after UNION. It makes the duplicate removal visible in the query but behaves the same as a plain UNION. For example:

SELECT column1
FROM table_name1
UNION DISTINCT
SELECT column1
FROM table_name2;

This SQL code will combine the results of two SELECT statements and remove any duplicate values of column1.

Additionally, you can use the UNION ALL statement to combine the results of two or more SELECT statements, including duplicates:

SELECT column1
FROM table_name1
UNION ALL
SELECT column1
FROM table_name2;

This SQL code will return every value of column1 from both tables, including duplicates.
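
The effect of ALL is easy to verify by counting rows, as in this sketch. The list_a and list_b tables are hypothetical, and the queries run in an in-memory SQLite database through Python's built-in sqlite3 module. SQLite does not accept UNION DISTINCT, so only UNION and UNION ALL are shown.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical data: one address appears in both lists.
conn.executescript("""
    CREATE TABLE list_a (email TEXT);
    CREATE TABLE list_b (email TEXT);
    INSERT INTO list_a VALUES ('ana@example.com'), ('ben@example.com');
    INSERT INTO list_b VALUES ('ben@example.com'), ('cy@example.com');
""")

union_rows = conn.execute("""
    SELECT email FROM list_a
    UNION
    SELECT email FROM list_b
""").fetchall()

union_all_rows = conn.execute("""
    SELECT email FROM list_a
    UNION ALL
    SELECT email FROM list_b
""").fetchall()

print(len(union_rows))      # 3
print(len(union_all_rows))  # 4
```

UNION ALL is usually the cheaper choice when duplicates are impossible or acceptable, since the database can skip the deduplication step.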

In summary, the UNION statement is a powerful SQL statement that allows you to combine the results of two or more SELECT statements into a single result set. By understanding how to use the UNION statement, you can manipulate data from multiple tables and extract insights from complex datasets.

CREATE TABLE Statement

The CREATE TABLE statement is used to create a new table in a database. This statement allows you to specify the name of the table, the columns in the table, and the data types of the columns. Here’s an example of how to use the CREATE TABLE statement:

CREATE TABLE table_name (
    column1 datatype,
    column2 datatype,
    column3 datatype,
    ...
);

In this SQL code, you can replace “table_name” with the name of the new table you want to create. Then, you define each column and its data type inside parentheses. For example, if you want to create a table called “customers” with columns for first name, last name, and email address, you could use the following code:

CREATE TABLE customers (
    first_name VARCHAR(50),
    last_name VARCHAR(50),
    email VARCHAR(100)
);

In this code, the data type for the first_name and last_name columns is VARCHAR, which is a variable-length string, and the maximum length is specified as 50 characters. The data type for the email column is also VARCHAR, with a maximum length of 100 characters.

You can also specify additional constraints on the columns, such as whether the column can contain NULL values or must be unique. Here’s an example of how to use the CREATE TABLE statement with column constraints:

CREATE TABLE orders (
    order_id INT PRIMARY KEY,
    customer_id INT NOT NULL,
    order_date DATE DEFAULT CURRENT_DATE,
    total_amount DECIMAL(10, 2) CHECK (total_amount > 0)
);

In this code, the order_id column is specified as the primary key for the table, which means it uniquely identifies each row in the table. The customer_id column is specified as NOT NULL, which means it must contain a value for every row. The order_date column is specified with a DEFAULT value of CURRENT_DATE, which means that if a value is not provided when a row is inserted, the current date will be used. Finally, the total_amount column is specified with a CHECK constraint to ensure that the value is greater than 0.
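
To see these constraints in action, the sketch below creates the orders table in an in-memory SQLite database through Python's built-in sqlite3 module and tries to insert rows that break each rule. The inserted values are hypothetical, and other database systems may report constraint violations differently or need small syntax changes.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE orders (
        order_id INT PRIMARY KEY,
        customer_id INT NOT NULL,
        order_date DATE DEFAULT CURRENT_DATE,
        total_amount DECIMAL(10, 2) CHECK (total_amount > 0)
    )
""")

insert_sql = (
    "INSERT INTO orders (order_id, customer_id, total_amount) VALUES (?, ?, ?)"
)

# A valid row; order_date is omitted, so the default fills it in.
conn.execute(insert_sql, (1, 42, 19.99))

# Hypothetical rows that each violate one constraint.
bad_rows = [
    (1, 43, 5.00),    # duplicate primary key
    (2, None, 5.00),  # NULL in a NOT NULL column
    (3, 44, -1.00),   # fails the CHECK constraint
]

rejected = []
for row in bad_rows:
    try:
        conn.execute(insert_sql, row)
    except sqlite3.IntegrityError as exc:
        rejected.append(str(exc))

stored = conn.execute(
    "SELECT order_id, customer_id, order_date, total_amount FROM orders"
).fetchall()

print(len(rejected))  # 3
print(len(stored))    # 1
```

All three bad rows are rejected, and the one stored row has its order_date filled in automatically.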

In summary, the CREATE TABLE statement is a fundamental SQL statement that allows you to create new tables in a database. By understanding how to use the CREATE TABLE statement, you can design database schemas that meet the needs of your organization and support your data analysis efforts.

Conclusion

In conclusion, knowing SQL is essential for data scientists, and mastering these seven key SQL statements is a great way to get started. With practice and dedication, you can improve your SQL skills and become a more effective data analyst.

If you like this post, give it some claps 👏

If you want more content about data, startups, and tech, follow me!

Patricio Villanueva

I've been working in data and tech for more than 5 years now. Blogging about data, analytics, machine learning, and tech.