Case-(in)sensitive, collation and quote in different DBMSs.

Case-(in)sensitive, collation and quote in different DBMSs.

April 21, 2023
Database

Case-(in)sensitive, collation and quote in different DBMSs.

In Oracle Database, you can use double quotes (") to quote SQL identifiers such as table names, column names, and aliases. By enclosing an identifier in double quotes, you can use mixed case, special characters, or reserved words as identifiers.

Oracle quote SQL identifiers #

In Oracle Database, you can use double quotes (") to quote SQL identifiers such as table names, column names, and aliases. By enclosing an identifier in double quotes, you can use mixed case, special characters, or reserved words as identifiers.

Here’s an example of using quoted identifiers in an Oracle SQL query:

SELECT "FirstName", "LastName", "Age"
FROM "Employees"
WHERE "Age" > 30;

In this example, the table name Employees and column names FirstName, LastName, and Age are enclosed in double quotes. This allows you to use mixed case for the table and column names.

When using quoted identifiers, keep the following in mind:

  • Quoted identifiers are case-sensitive. If you create a table with the name "Employees", you must use the same capitalization when referencing it in your queries. If you try to reference it as "employees" or "EMPLOYEES", Oracle will treat it as a different table.
  • If you don’t use double quotes, Oracle will automatically convert the identifiers to uppercase. For example, SELECT firstname FROM employees will be treated as SELECT FIRSTNAME FROM EMPLOYEES.
  • Enclosing an identifier in double quotes allows you to use reserved words or special characters. However, using such identifiers can make your SQL code harder to read and maintain. It’s generally recommended to choose descriptive and simple identifiers that don’t require quoting.

SQL Server quote SQL identifiers #

In SQL Server, you can use either square brackets ([]) or double quotes (") to quote SQL identifiers, such as table names, column names, and aliases. By enclosing an identifier in square brackets or double quotes, you can use mixed case, special characters, or reserved words as identifiers.

Here’s an example of using quoted identifiers in a SQL Server query:

SELECT [FirstName], [LastName], [Age]
FROM [Employees]
WHERE [Age] > 30;

In this example, the table name Employees and column names FirstName, LastName, and Age are enclosed in square brackets.

Alternatively, you can use double quotes:

SELECT "FirstName", "LastName", "Age"
FROM "Employees"
WHERE "Age" > 30;

To use double quotes as the identifier delimiter, you need to have the QUOTED_IDENTIFIER option set to ON (which is the default setting). You can enable or disable this option using the SET statement:

SET QUOTED_IDENTIFIER ON;

When using quoted identifiers, keep the following in mind:

  • Quoted identifiers are case-insensitive by default in SQL Server, as long as the server and database collation are case-insensitive. If the collation is case-sensitive, the quoted identifiers will be case-sensitive as well.
  • If you don’t use square brackets or double quotes, SQL Server will automatically convert the identifiers to the appropriate case according to the collation settings. For example, SELECT firstname FROM employees will work as long as the collation is case-insensitive.
  • Enclosing an identifier in square brackets or double quotes allows you to use reserved words or special characters. However, using such identifiers can make your SQL code harder to read and maintain. It’s generally recommended to choose descriptive and simple identifiers that don’t require quoting.

How to check SQL Server collation #

To determine the collation of your SQL Server, you can use the SERVERPROPERTY function. The Collation property returns the collation that is currently being used by the server.

Here’s a query to find the server collation:

SELECT CONVERT (varchar(256), SERVERPROPERTY('collation'));

Running this query will return the server collation as a single-row result.

To determine the collation of a specific database, you can use the COLLATION_NAME function or query the sys.databases catalog view. Here’s a query to find the collation of a specific database:

SELECT name, collation_name
FROM sys.databases
WHERE name = 'YourDatabaseName';

Replace YourDatabaseName with the name of the database you want to check the collation for. The result will show the database name and its collation.

Keep in mind that individual columns within a database can also have their own collation settings, which can be different from the database or server collation. To check the collation of a specific column, you can query the INFORMATION_SCHEMA.COLUMNS view:

SELECT TABLE_NAME, COLUMN_NAME, COLLATION_NAME
FROM INFORMATION_SCHEMA.COLUMNS
WHERE TABLE_NAME = 'YourTableName' AND COLUMN_NAME = 'YourColumnName';

Replace YourTableName and YourColumnName with the name of the table and column you want to check the collation for. The result will show the table name, column name, and the column’s collation.

A collation example #

Take collation SQL_Latin1_General_CP1_CI_AS as an example:

The collation SQL_Latin1_General_CP1_CI_AS is case-insensitive. In this collation name, CI stands for “case insensitive,” while AS stands for “accent sensitive.”

Here’s a breakdown of the different components in the collation name:

  • SQL: Indicates that it’s a SQL Server-defined collation.
  • Latin1_General: Represents the base Latin1 character set (which includes most Western European languages).
  • CP1: Represents the code page 1252 (Windows-1252), which is the character encoding used by this collation.
  • CI: Indicates that the collation is case-insensitive, meaning that it treats uppercase and lowercase characters as equal.
  • AS: Indicates that the collation is accent-sensitive, meaning that it treats accented characters as distinct from their unaccented counterparts.

In summary, the SQL_Latin1_General_CP1_CI_AS collation treats uppercase and lowercase characters as equal, but it differentiates between accented and unaccented characters.

What is ccented characters

Accented characters are letters of the alphabet that have been modified with diacritical marks, such as accents, to indicate a specific pronunciation or to distinguish them from other similar characters. Accents can appear above, below, or through a character. They are commonly used in many languages, especially those derived from Latin, to represent various phonetic features.

Some examples of accented characters are:

  • é (e with acute accent)
  • ê (e with circumflex)
  • ñ (n with tilde)
  • ü (u with diaeresis)
  • ç (c with cedilla)
  • Ã¥ (a with ring)

Accented characters are essential for accurate representation and pronunciation in many languages, such as French, Spanish, Portuguese, German, and Swedish. In some cases, the presence or absence of an accent can change the meaning of a word entirely.

When working with databases and text comparisons, it’s essential to choose the correct collation that respects the case sensitivity and accent sensitivity rules of the language being used. This ensures accurate query results and proper sorting of text data.

ClickHouse quote SQL identifiers #

In ClickHouse, you can use backticks (`) to quote SQL identifiers, such as table names, column names, and aliases. By enclosing an identifier in backticks, you can use mixed case, special characters, or reserved words as identifiers.

Here’s an example of using quoted identifiers in a ClickHouse query:

SELECT `Timestamp`, `UserId`
FROM `Events`
WHERE `UserId` = 123;

In this example, the table name Events and column names Timestamp and UserId are enclosed in backticks.

Using quoted identifiers is helpful if your table or column names have spaces, special characters, or are reserved keywords. However, using such identifiers can make your SQL code harder to read and maintain. It’s generally recommended to choose descriptive and simple identifiers that don’t require quoting.

Keep in mind that, unlike some other database systems, ClickHouse is case-sensitive for identifiers. This means that you must use the correct case when referencing table and column names, even when they are quoted.

MySQL quote SQL identifiers #

In MySQL, you can use backticks (`) to quote SQL identifiers, such as table names, column names, and aliases. By enclosing an identifier in backticks, you can use mixed case, special characters, or reserved words as identifiers.

Here’s an example of using quoted identifiers in a MySQL query:

SELECT `first_name`, `last_name`
FROM `employees`
WHERE `age` > 30;

In this example, the table name employees and column names first_name, last_name, and age are enclosed in backticks.

Using quoted identifiers is helpful if your table or column names have spaces, special characters, or are reserved keywords. However, using such identifiers can make your SQL code harder to read and maintain. It’s generally recommended to choose descriptive and simple identifiers that don’t require quoting.

MySQL also supports the ANSI_QUOTES SQL mode, which allows you to use double quotes (") to quote identifiers. To enable this mode, you can set it at the server level or for a specific session using the SET statement:

SET sql_mode = 'ANSI_QUOTES';

With ANSI_QUOTES enabled, you can use double quotes in addition to backticks to quote identifiers:

SELECT "first_name", "last_name"
FROM "employees"
WHERE "age" > 30;

Keep in mind that the case sensitivity of identifiers in MySQL depends on the file system and the collation settings of the underlying operating system. For example, on Windows, table and column names are case-insensitive, while on Linux, they are case-sensitive by default.

MySQL lower_case_table_names option #

The lower_case_table_names system variable in MySQL controls the way the server stores and compares table names. It has three possible values:

  • lower_case_table_names=0 (default on Unix-based systems): Table names are stored as specified and comparisons are case-sensitive. If you create a table named MyTable, you must refer to it using the same case (e.g., SELECT * FROM MyTable;).
  • lower_case_table_names=1 (default on Windows and macOS): Table names are stored in lowercase and comparisons are case-insensitive. If you create a table named MyTable, you can refer to it using any case (e.g., SELECT * FROM mytable; or SELECT * FROM MYTABLE;).
  • lower_case_table_names=2 (default on macOS with case-sensitive file systems): Table names are stored as specified but comparisons are case-insensitive. This is similar to lower_case_table_names=1, but table names are stored with the case specified during creation.

To change the lower_case_table_names value, you need to modify the MySQL server configuration file (e.g., my.cnf or my.ini). Add or modify the following line under the [mysqld] section:

lower_case_table_names=1

Replace 1 with the desired value. After making the change, you need to restart the MySQL server for the new setting to take effect.

Warning: Changing the lower_case_table_names setting after tables have been created can cause various issues, such as duplicate tables or inaccessible tables. It is recommended to set this option when you initially configure the MySQL server, before creating any tables. If you need to change this option on an existing server, make sure to backup your data, update the option, and then recreate your tables and reload the data.

Config MySQL collation setting different from OS collation #

MySQL server can have different collation settings from the operating system. Collation in MySQL determines how the server compares and sorts character strings. MySQL supports a variety of character sets and collations for different languages and sorting rules.

MySQL collation settings are independent of the operating system’s locale settings. You can set the character set and collation at different levels in MySQL:

  1. Server level: The default character set and collation for the entire server can be set in the MySQL server configuration file (e.g., my.cnf or my.ini). To set the server-level character set and collation, add or modify the following lines under the [mysqld] section:

    character-set-server=utf8mb4
    collation-server=utf8mb4_general_ci
    

    Replace utf8mb4 and utf8mb4_general_ci with your desired character set and collation. After making changes, restart the MySQL server for the new settings to take effect.

  2. Database level: You can set the character set and collation for a specific database when creating the database using the CREATE DATABASE statement:

    CREATE DATABASE my_database
    CHARACTER SET utf8mb4
    COLLATE utf8mb4_general_ci;
    

    Replace my_database, utf8mb4, and utf8mb4_general_ci with your desired database name, character set, and collation.

  3. Table level: You can set the character set and collation for a specific table when creating the table using the CREATE TABLE statement:

    CREATE TABLE my_table (
        id INT PRIMARY KEY,
        name VARCHAR(255)
    ) CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci;
    

    Replace my_table, utf8mb4, and utf8mb4_general_ci with your desired table name, character set, and collation.

  4. Column level: You can set the character set and collation for a specific column in a table when creating or altering the table:

    CREATE TABLE my_table (
        id INT PRIMARY KEY,
        name VARCHAR(255) CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci
    );
    

    Replace my_table, utf8mb4, and utf8mb4_general_ci with your desired table name, character set, and collation for the column.

By configuring the character set and collation settings in MySQL, you can have different sorting and comparison rules than those used by the operating system.

What’s collation #

Collation, in the context of databases and character data, refers to a set of rules that determine how character strings are compared and sorted. Collation defines the proper order of characters based on a specific language, cultural context, or application requirements. It takes into account various factors such as case sensitivity, accent sensitivity, and special characters.

Different languages and regions have different sorting rules, so databases provide a variety of collations to support these requirements. For example, in English, the letter “A” comes before “B,” whereas in Swedish, the letter “Ä” is considered to be after “Z.” Collations in databases handle such language-specific rules.

When working with databases, you can define collations at various levels, such as the server level, database level, table level, or even column level. Selecting the appropriate collation for your data ensures accurate query results, proper sorting of text data, and correct handling of case sensitivity and accents.

In most databases, collations are tied to character sets, which define the encoding of characters that can be stored and processed by the database. When choosing a collation, it’s essential to select one that’s compatible with the character set you’re using. For example, in MySQL, you might choose the utf8mb4 character set and utf8mb4_general_ci collation for a Unicode-aware, case-insensitive sorting and comparison of text data.

GBK, GB2312, and GB18030 #

GBK, GB2312, and GB18030 are character sets and encodings for simplified Chinese characters. Each of them represents a different stage of development and expansion of the Chinese character set in computing. Here is a brief overview of each:

  1. GB2312: GB2312 (Guojia Biaozhun, which means “National Standard” in Chinese) is one of the earliest character sets for simplified Chinese. It was published in 1980 and contains approximately 7,445 characters, including 6,763 Chinese characters and 682 other characters (e.g., Latin letters, numbers, and symbols). GB2312 is now considered outdated and has been replaced by more comprehensive character sets.

  2. GBK: GBK (an abbreviation of “Guojia Biaozhun Kuo Zhan,” which means “National Standard Extension” in Chinese) is an extension of the GB2312 character set. It was introduced in 1995 and includes all characters in GB2312, as well as about 18,000 additional Chinese characters, for a total of approximately 21,000 characters. GBK is compatible with GB2312, meaning that GBK-encoded text can be correctly displayed by a GB2312 decoder, although some characters may be missing. GBK is still widely used, but it has been superseded by the more comprehensive GB18030 standard.

  3. GB18030: GB18030 is the most recent and comprehensive character set for simplified Chinese. It was introduced in 2000 and revised in 2005. GB18030 is a superset of GBK and includes over 1.7 million characters, including all Unicode characters. It covers not only simplified Chinese characters but also traditional Chinese characters, as well as characters from other languages such as Japanese, Korean, and many others. GB18030 is now the mandatory character set for software sold in China, and it ensures better compatibility with international character sets, such as Unicode.

When choosing a character set for your application or data storage, it’s generally best to use GB18030 if you need to support simplified Chinese characters, as it is the most up-to-date and comprehensive character set. It also offers better compatibility with other international character sets and encodings.