Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.

Table of Contents

Introduction

About SQL

What is a Database

A database is a structured collection of data that supports electronic storage of data as well as ability to manipulate the data.  Databases typically also provide the ability to enforce permissionsdata. It is designed to efficiently store, retrieve, and manage information.

The most widely used type of databases are relational databases (PostgresSQL, MySQL, SQL Server, Oracle, etc.).  In a relational database, data is stored in tables (rows and columns) and the tables can be have relationships with other tables.  For example, a Patient Demographic table will be related to a Patient Medication table through the Patient ID.

About SQL

 'SQL' (Structured Query Language) is the language used to manipulate data is relational databases.  SQL can be used to Create, Read, Update, and Delete data within relational databases.

Tutorial Prerequisites

Basic understanding of tabular data (like Excel).

Project Setup

We will use the following site to test SQL commands: https://sqliteonline.com/

Connect to the 'PostgreSQL' database.  Note that this site has a 15 minute idle limit.

Databases that use SQL

  • SQL Server
  • Oracle
  • Postgres SQL
  • MySQL
  • DB2
  • ... and more

Other types of databases besides relational databases:

  • NoSQL
    • document database
    • graph database
    • key-value
    • time series

However a lot of NoSQL databases have incorporated SQL-like syntax so that users can query their databases.

GUI Tool for SQL

Tutorial Prerequisites

Basic understanding of tabular data (like Excel).


Project Setup

We will use the following site to test SQL commands: https://sqliteonline.com/

Connect to the 'PostgreSQL' database.  Note that this site has a 15 minute idle limit.


Commands to insert data

These commands are also explained in later sections of the tutorial.  For teaching purposes, they have been replicated for project set up so that users can learn how to query data first (i.e. the most likely step that users are performing).

draw.io Diagram
bordertrue
diagramNameProject ER Model
simpleViewerfalse
width
linksauto
tbstyletop
lboxtrue
diagramWidth381
revision3


Code Block
languagesql
-- Create the following two tables for this demo:

Commands to insert data

These commands are also explained in later sections of the tutorial.  For teaching purposes, they have been replicated for project set up so that users can learn how to query data first (i.e. the most likely step that users are performing).

Code Block
languagesql
-- Create the following two tables for this demo:
CREATE TABLE person ( id INTEGER PRIMARY KEY, person_name VARCHAR(100), income NUMERIC(7,2), country_id integer );
CREATE TABLE country ( 
	id INTEGER PRIMARY KEY, 
	country_name VARCHAR(100) 
);

--CREATE InsertTABLE dataperson into( the 'country' table:
INSERT INTO country (id, country_name) VALUES (1, 'Canada'), (2, 'USA'), (3, 'Mexico'), (4, 'Sweden'
	id INTEGER PRIMARY KEY, 
	person_name VARCHAR(100), 
	income NUMERIC(7,2), 
	country_id integer,
	CONSTRAINT country_id_fk
		FOREIGN KEY(country_id) 
		REFERENCES country(id)
);

-- Insert data into the 'personcountry' table:
INSERT INTO personcountry (id, personcountry_name, income, country_id) VALUES (1, 'SallyCanada'), 60000(2, 1);
INSERT INTO person'USA'), (id3, person_name'Mexico'), income(4, country_id) VALUES (2, 'Bob', 70000'Sweden');

-- Insert data into the 'person' table:
INSERT INTO person (id, person_name, income, country_id) VALUES (1, 'Sally', 60000, 1);
INSERT INTO person (id, person_name, income, country_id) VALUES (32, 'LucyBob', 8000070000, 21);
INSERT INTO person (id, person_name, income, country_id) VALUES (43, 'BillLucy', 7500080000, 52);
INSERT INTO person (id, person_name, income, country_id) VALUES (4, 'Bill', 75000, NULL);


Result - Person Table

idperson_nameincomecountry_id
1Sally60000.01
2Bob70000.01
3Lucy80000.02
4Bill75000.05NULL
Result - Country Table
idcountry_name
1Canada
2USA
3Mexico
4Sweden


The two red columns (person.country_id and country.id) is meant to indicate a foreign-key relationship.  For example, it can be read as: "Sally lives in country USA".  There are different benefits for having the information in separate tables but the main reason is to reduce redundant data.


Basic Data Querying

Select Statement

Code Block
languagesql
-- select all datacolumns from person table

select 
	* 
from 
	person;
idperson_nameincomecountry_id
1Sally60000.01
2Bob70000.01
3Lucy80000.02
4Bill75000.05

'Order by' clause

NULL


Code Block
languagesql
select-- *we fromcan person
order by income asc;also specify the columns we want to query by replacing the asterisk with the column name

select * from person
order by income desc
	id, 
	person_name, 
	income 
from 
	person;
idperson_nameincomecountry_id
1Sally60000.01
2Bob70000.0
3Lucy80000.01
4Bill75000.05
3Lucy80000.02
idperson_nameincomecountry_id
3Lucy80000.02
4Bill75000.05
2Bob70000.01
1Sally60000.01

'Where' Clause

'Order by' clause

Code Block
languagesql
-- if we want to return the data and have it ordered according to a specific column(s), we can use the order by clause
Code Block
languagesql
-- filter data from person table by name

select * from person
order where person_name = 'Lucy'by income asc;

select * from person
order by income desc;
idperson_nameincomecountry_id
1Sally60000.01
2Bob70000.01
4Bill75000.0NULL
3Lucy80000.02

Join Types

There are 3 types of joins that are available in SQL.  The following is a good diagram that explains the results from each join type:  The diagram and more information can be found here: https://www.w3schools.com/sql/sql_join.asp

Image Removed

Inner Join

Code Block
languagesql
select * from person inner join country on person.country_id = country.id;

Return records that are matching in both employee and country table.

idperson_nameincomecountry_id
3Lucy80000.02
4Bill75000.0NULL
2Bob70000.01
idperson_nameincomecountry_idid
1Sally60000.011
2Bob70000.011
3Lucy80000.022

Left Join

'Where' Clause

Code Block
languagesql
-- filter data
Code Block
languagesql
select * from person left join country on person.country_id = country.id;

Returns all records from the employee table and only the matching records from the country table.

 table by name
select * from person where person_name = 'Lucy';

-- sometimes we don't know the exact spelling or word.  In that case we can use a wildcard
select * from person where person_name like '%Lu%';
1
idperson_nameincomecountry_id
idperson_nameincomecountry_idid
1Sally60000.011
2Bob70000.01
3Lucy80000.0224Bill75000.05
Right Join


'Distinct' Clause

Code Block
languagesql
--now doget athe rightunique join
selectcountries *that froma person right join country on person.lives in

select distinct country_id =from country.id;

Returns only matching records from the employee table and all records from the country table.

person;
country_id
1
2
NULL

Join Types

Join statements are a way to combine data from multiple tables.  In a database, you'll often find that data is spread out across multiple tables ranging from dozens to even thousands of tables.  There are multiple reasons database developers create multiple tables but the two main reasons are: make some process more efficient (inserting, querying or updating data) or it's just logically appropriate to store data in a separate table.  For example we have information about a person in the Person table and information about a country in the Country table. 

But what if we wanted to know the average income of people in Canada?  For this, we may need to join tables.

There are 4 types of joins that are available in SQL.  The following is a good diagram that explains the results from each join type:  The diagram and more information can be found here: https://www.w3schools.com/sql/sql_join.asp

Image Added


Inner Join

idperson_nameincomecountry_idid1Sally60000.0112Bob70000.0113Lucy80000.02243

Full Outer Join

Code Block
languagesql
select * from person full outer join country on person.country_id = country.id;

Returns all records from employee and country table.

idperson_nameincomecountry_idid1Sally60000.0112Bob70000.0113Lucy80000.0224Bill75000.0543

Group By and Aggregation

Select with Group By clause

The 'group by' statement in SQL groups rows that have the same values for a set of fields into summary rows.  The grouping helps to perform aggregation functions like count, max, min, sum, etc.

Here we are basically asking the database to return the count of people per country.

Code Block
languagesql
select country_name, count(*) from person
	* 
from 
	person 
inner join country 
	on person.country_id = country.id
group by country_name;;

Return records that are matching in both person and country table.

idperson_nameincomecountry_nameidcountid
1Sally60000.0USA1Canada1
2

Having Clause

Using the 'having' clause, we can take the result of the group by statement and filter it to only return countries where the number of people is greater than 1.

Bob70000.011
3Lucy80000.022

Left Join

Code Block
languagesql
select country_name, count(*) from person
inner
	* 
from 
	person 
left join 
	country 
	on person.country_id = country.id
group by country_name
having count(*) > 1;
country_namecountCanada2

Average

;

Returns all records from the person table and only the matching records from the country table.

idperson_nameincomecountry_idid
1Sally60000.011
2Bob70000.011
3Lucy80000.022
4Bill75000.0NULLNULL

Right Join

Return the average income per person per country.

Code Block
languagesql
select country_name, avg(income) from person
inner join country --now do a right join
select * 
from 
	person 
right join 
	country 
	on person.country_id = country.id
group by country_name;
country_nameavgUSA
80000.000000000000
Canada
65000.000000000000

Maximum Value

;

Returns only matching records from the person table and all records from the country table.

idperson_nameincomecountry_idid
1Sally60000.011
2Bob70000.011
3Lucy80000.022




4




3

Full Outer Join

Code Block
languagesql
select country_name, 'max' as agg_type, max(income) from person
inner join country * 
from 
	person 
full outer join 
	country 
	on person.country_id = country.id
group by country_name;
country_nameagg_typeavgUSAmax
80000.00
Canadamax
70000.00

Minimum Value

Code Block
languagesql
select country_name, 'min' as agg_type, min(income) from person
inner join country on person.country_id = country.id
group by country_name;
country_nameagg_typeavgUSAmin
80000.00
Canadamin
60000.00

Union Clause

Instead of giving the queries one at a time to get the avg, max, and min, we can use a union clause.  Union clause is used to combine the results of multiple queries.  The rules for using a union clause are as follows (see link):

;

Returns all records from person and country table.

idperson_nameincomecountry_idid
1Sally60000.011
2Bob70000.011
3Lucy80000.022
4Bill75000.0NULL




4




3

Group By and Aggregation

Select with Group By clause

The 'GROUP BY' clause in SQL is used when you want to group rows that have the same values in certain columns and perform some kind of aggregate function (avg, max, min, count) on each group. This is particularly useful when analyzing and summarizing data.  Common usage scenarios include:

  • calculating summaries
  • data categorization
  • reporting and visualization

For example, here we are asking the database to return the count of people per country.

  • Each SELECT statement within UNION must have the same number of columns
  • The columns must also have similar data types
  • The columns in each SELECT statement must also be in the same order

    Code Block
    languagesql
    select 
    	country_name, 'avg' as agg_type, avg(income) from 
    	count(*) 
    from 
    	person
    inner join 
    	country 
    	on person.country_id = country.id
    group by 
    	country_name
    
    UNION
    
    select ;
    country_namecount
    USA1
    Canada2

    Having Clause

    Using the 'having' clause, we can take the result of the group by statement and filter it to only return countries where the number of people is greater than 1.

    Code Block
    languagesql
    select 
    	country_name, 
    	count(*) 
    from 
    	person
    inner join 
    	country 
    		, 'max' as agg_type, max(income) from person
    inner join country on person.country_id = country.id
    group by 
    	country_name
    
    UNION
    select country_name, 'min' as agg_type, minhaving 
    	count(*) > 1;
    country_namecount
    Canada2

    Average

    Return the average income per person per country.

    Code Block
    languagesql
    select 
    	country_name, 
    	avg(income) 
    from 
    	person
    inner join 
    	country 
    		on person.country_id = country.id
    group by 
    	country_name;
    agg_typeCanada
    country_nameavg
    CanadaUSAavg65000
    80000.000000000000
    Canadamax
    7000065000.00000000000000


    Maximum Value

    min
    Code Block
    60000.00
    languagesql
    select 
    	country_name, 
    	'max' as agg_type, 
    	max(income) 
    from 
    	person
    inner join 
    	country 
    	on person.country_id = country.id
    group by 
    	country_name;
    country_nameagg_typeavg
    USAmax
    USAmax80000.00
    USAavg80000.000000000000
    USAmin
    80000.00

    Sub-queries

    Sub-queries can be used to create intermediary tables that are then joined within the larger dataset.  It can help with applying filter logic and performance tuning by filtering the amount of data being processed in a join condition.

    Code Block
    languagesql
    select * from person
    full outer join (
    	select * from country
    	where id = 2) country_table
    on person.country_id = country_table.id;
    idperson_nameincomecountry_ididcountry_name1Sally60000.0012Bob70000.0013Lucy80000.0022USA
    Canadamax
    70000.00


    Minimum Value

    Code Block
    languagesql
    select 
    	country_name, 
    	'min' as agg_type, 
    	min(income) 
    from 
    	person
    inner join 
    	country 
    		on person.country_id = country.id
    group by 
    	country_name;
    country_nameagg_typeavg
    USAmin
    80000.00
    Canadamin
    60000.00


    Avg, Max, and Min Values:

    Code Block
    languagesql
    select 
    	country_name, 
    	min(income), 
    	max(income), 
    	avg(income) 
    from 
    	person
    inner join 
    	country 
    		on person.country_id = country.id
    group by 
    	country_name;
    country_namemin(income)max(income)avg(income)
    Canada60000.0070000.0065000.000000000000
    USA80000.0080000.0080000.000000000000

    Union Clause

    Instead of giving the queries one at a time to get the avg, max, and min, we can use a union clause.  Union clause is used to combine the results of multiple queries.  The rules for using a union clause are as follows (see link):

    • Each SELECT statement within UNION must have the same number of columns
    • The columns must also have similar data types
    • The columns in each SELECT statement must also be in the same order


    Code Block
    languagesql
    select country_name, 'avg' as agg_type, avg(income) from person
    inner join country on person.country_id = country.id
    group by country_name
    
    UNION
    
    select country_name, 'max' as agg_type, max(income) from person
    inner join country on person.country_id = country.id
    group by country_name
    
    UNION
    select country_name, 'min' as agg_type, min(income) from person
    inner join country on person.country_id = country.id
    group by country_name;
    country_nameagg_typeavg
    Canadaavg

    65000.000000000000

    Canadamax
    70000.00
    Canadamin60000.00
    USAmax80000.00
    USAavg80000.000000000000
    USAmin80000.00

    Sub-queries

    Sub-queries can be used to create intermediary tables that are then joined within the larger dataset.  It can help with applying filter logic and performance tuning by filtering the amount of data being processed in a join condition.

    Code Block
    languagesql
    select * from person
    full outer join (
    	select * from country
    	where id = 2) country_table
    on person.country_id = country_table.id;
    idperson_nameincomecountry_ididcountry_name
    1Sally60000.001

    2Bob70000.001

    3Lucy80000.0022USA
    4Bill75000.00NULL


    Compare the sub-query to filtering after joining:

    Code Block
    languagesql
    select * from person
    full outer join country
    on person.country_id = country.id
    where country.id = 2;
    idperson_nameincomecountry_ididcountry_name
    3Lucy80000.0022USA


    When you filter after, it takes the results of the join condition and then applies the filter logic.  The sub-query will calculate the sub-query first and then apply the join logic.

    Case Statement

    Code Block
    languagesql
    SELECT person_name, CASE 
    WHEN income BETWEEN 50000 and 61000 THEN 1 
    WHEN income BETWEEN 62000 and 71000 THEN 2 
    WHEN income BETWEEN 72000 and 81000 THEN 3 END as income_category
    FROM person;
    person_nameincome_category
    Sally1
    Bob2
    Lucy3
    Bill3

    Data type conversion

    The following link explains how to perform conversion of data from one format into another:  https://www.postgresql.org/docs/current/functions-formatting.html

    Often times when working in a big data environment, data is dumped into tables and all the columns are treated as strings.  Therefore, if you need to perform math operations or other comparisons, you may not get the correct results.

    The most common conversions are string to number or string to date/timestamp.  Examples of each are below.

    Treat string as a number

    If you try to add two strings, it will result in an error:

    Code Block
    languagesql
    select ('100') + ('100'); -- query fails
    Code Block
    languagesql
    SELECT ('100'::INTEGER) + ('200'::Integer); -- result is successfully returned as 300
    SELECT ('10.1'::DECIMAL) + ('12.2':: DECIMAL)' -- result is successfully returned as 22.3
    SELECT ('-10.1'::DECIMAL) + ('12.2':: DECIMAL)' -- result is successfully returned as 2.1


    Treat string as a date

    Code Block
    languagesql
    select ('jan 01 2021') > ('feb 01 2021'); -- incorrectly results in true because 'j' is larger than 'f' alphabetically.
    Code Block
    languagesql
    select ('jan 01 2021'::DATE) > ('feb 01 2021'::DATE); -- correctly returns false
    select ('jan 01 2021 08:01:45'::TIMESTAMP) > ('jan 01 2021 08:01:46'::TIMESTAMP); -- correctly returns false
    select to_date('05$DEC$2000', 'DD$MON$YYYY') -- returns 2000-12-05T00:00:00.000Z
    
    


    Run query against table

    Code Block
    languagesql
    create table test_data_conv (income_1 varchar(100), income_2 varchar(100), date_1 varchar(100), date_2 varchar(100));
    
    INSERT INTO test_data_conv 
    (income_1, income_2, date_1, date_2) 
    values 
    ('10.1', '12.2', 'jan 01 2021', 'feb 01 2021'),
    ('13.1', '14.2', 'mar 01 2021', 'aug 01 2021');
    
    SELECT sum(income_1::DECIMAL) from test_data_conv; --23.2
    
    select * from test_data_conv where (date_1::DATE) < (date_2::DATE); -- correctly returns both rows.
    
    
    Code Block
    languagesql
    ALTER TABLE person ADD column bday text;
    
    UPDATE person
    SET bday = '2000-Jan-01'
    WHERE person_name = 'Lucy';
    
    UPDATE person
    SET bday = '2000-Feb-01'
    WHERE person_name = 'Bob';
    
    -- compare the results for the following queries:
    select * from person
    order by bday::date;
    
    select * from person
    order by bday;
    4Bill75000.005

    Case Statement

    Code Block
    languagesql
    SELECT person_name, CASE 
    WHEN income BETWEEN 50000 and 61000 THEN 1 
    WHEN income BETWEEN 62000 and 71000 THEN 2 
    WHEN income BETWEEN 72000 and 81000 THEN 3 END as income_category
    FROM person;
    person_nameincome_categorySally1Bob2Lucy3Bill3

    Performance Tuning

    There are ways to optimize performance of queries by reducing the amount of data being processed or by using SQL keywords that are optimized for certain processes.  Here are a few examples:

    DescriptionGoodBad

    Select particular columns to speed up performance.


    Makes a difference if the table has many columns.

    Code Block
    languagesql
    select name, age from employeeperson;
    Code Block
    languagesql
    select * from employeeperson;
    
    
    Reduce amount of data being processed in join clauses
    Code Block
    languagesql
    select * from 
    
    (select employee_id, country_id from employee where join_date > '2020-01-01') t1
    
    inner join
    
    (employee_id, salary from compensation where comp_date > '2020-01-01') t2
    
    on t1.employee_id = t2.employee_id;
    Code Block
    languagesql
    select * from employee t1
    
    inner join compensation t2
    
    on t1.employee_id = t2.employee_id
    
    where join_date > '2020-01-01'
    
    and comp_date > '2020-01-01';

    Inserting, Updating, and Deleting Tables and Data

    Step 1 -

    Create tables

    Code Block
    languagesql
    -- Create the following two tables for this demo:
    CREATE TABLE person ( id INTEGER PRIMARY KEY, person_name VARCHAR(100), income NUMERIC(7,2) );
    CREATE TABLE country ( id INTEGER PRIMARY KEY, country_name VARCHAR(100) );

    For more information on creating tables: https://www.postgresql.org/docs/9.2/sql-createtable.html

    Step 2 -


    Insert Data into the Tables

    Code Block
    languagesql
    -- Insert data into the 'person' table:
    INSERT INTO person (id, person_name, income) VALUES (1, 'Sally', 60000);
    INSERT INTO person (id, person_name, income) VALUES (2, 'Bob', 70000);
    INSERT INTO person (id, person_name, income) VALUES (3, 'Lucy', 80000);
    Insert data into the 'country' table:
    Code Block
    languagesql
    INSERT INTO country (id, country_name) VALUES (1, 'Canada'), (2, 'USA'), (3, 'Mexico');


    Update

    Update the person table to have a column for the person's country:

    Code Block
    languagesql
    ALTER TABLE person ADD column country_id integer;

    Update the data:

    Code Block
    languagesql
    UPDATE person
    SET country_id = 2
    WHERE person_name = 'Lucy';
    Code Block
    languagesql
    UPDATE person
    SET country_id = 1
    WHERE person_name in ('Sally', 'Bob');


    Delete Data

    The delete clause is used to delete data from a table.  Be careful when using the delete clause.  If you omit the 'where' clause, you'll end up deleting all data from the table.

    Code Block
    languagesql
    -- delete row from country table where the ID of the country = 3
    
    delete from country where id = 3;


    Drop tables

    We can drop tables from our database by performing the following command.

    Code Block
    languagesql
    drop table person;
    drop table country;

    Additional online resources:

    NameReference
    Postgres cheatsheet with common commandshttps://www.postgresqltutorial.com/postgresql-cheat-sheet/
    Postgres tutorialhttps://www.postgresqltutorial.com/
    Postgres documentationhttps://www.postgresql.org/docs/9.4/


    Past Recordings:

    SQL Introductory workshop conducted at KCNI Oct 2020: