0. Resources


1. Database-Management System (DBMS)

A Database-Management System (DBMS) is a collection of interrelated data and a set of programs to access those data. The collection of data is called Database. The primary goal of a DBMS is to provide a way to store and retrieve database information that is both convenient and efficient.

DBMS are designed to manage large bodies of information. Management of data involves:

  • defining structures for storage information;
  • providing mechanism for the manipulation of information;

Furthermore:

  • it must ensure the safety of the information stored, despite system crashes or attempts at unauthorised access.

Case 1 This is why it’s not better to use a flat-file database. A flat-file database is a database stored in a file called a flat file. Records follow a uniform format, and there are no structures for indexing or recognizing relationships between records. The file is simple. A flat file can be a plain text file (e.g. csv or txt), or a binary file. For instance, if you have a database in a plain text file with the following two tables:

and you want to “Get the year that GZA wen solo”, you can use this Python code:

for line in file.readlines():
	record = parse(line)
	if record[0] == "GZA":
		print(int(record[1]))

Case 2 Consider a university organization that keeps information about instructors, students, departments, and courses. One way to keep the information on a computer is to store it in operating-system files. The system has a number of application programs that manipulate the files, including programs to:

  • add new students, instructors, and courses;
  • register students for courses and generate class rosters;
  • assign grade to students, compute GPA, and generate transcripts.

Suppose a new major is created; then new application programs need to be created to deal with rules specific to the new major. This typical file-processing system is supported by conventional operating system. The system stores permanent records in various files, and it needs different application programs to extract records from, and add records to, the appropriate files. Keeping organizational information in a file-processing system has a number of major disadvantages:

  • Data redundancy and inconsistency. Different programmers create the files over time and they are likely to have different structures. Different programmers also create applications in several programming languages. Information may be duplicated in several files. Information redundancy leads to higher storage and access cost. Furthermore it leads to data inconsistency (a student’s address change affects one file, but not all the others.)
  • Difficulty in accessing data. If you need a specific information you can extract the needed information manually, or ask a programmer to write a new application program. Both alternatives are unsatisfactory. These operating-system files do not allow needed data to be retrieved in a convenient and efficient manner.
  • Data isolation. Data is scattered in various files, so they may have different formats.
  • Integrity problems. Data stored in the DB must satisfy certain types of consistency constraints.
  • Atomicity problems. A computer system is subject to failure. If there was a crash during a transaction, it would happen that moneys are removed from account A and not transferred to account B, resulting in an inconsistent database state. Transactions (in this case) must be atomic, meaning that they must happen in its entirely or not at all.
  • Concurrent-access anomalies. Suppose two transactions (500) happening at almost the same time from account A with a balance of 9400.
  • Security problems. Not every uses of the DB system should be able to access all the data.

2. Data Models

Underlying the structure of a database is the Data Model: a collection of conceptual tools for describing data, data relationship, data semantics, and consistency constraints.

  • Relational Model.
  • NoSQL:
    • Key/value
    • Graph
    • Document/xml/Object
  • Machine Learning:
    • Array/Matrix/Vectors
  • Obsolete/Legacy/Rare:
    • Hierarchical
    • Network
    • Multi-Value

A Schema is a description of a particular collection of data, using a given data model.

3. Database Languages

A DBMS provides:

  • a Data-Definition Language (DDL) to specify the database schema;
  • a Data-Manipulation Language (DML) to express database queries and updates.

They are not separate languages: they form parts of a single database language. Almost all relational DB employ the SQL language.

3.1. Data-Definition Language (DDL)

It’s used to specify a database schema and to specify additional properties of data like:

  • Domain Constraints. Domain of possible values with every attribute (integer types, character types, date/time types, etc.)
  • Referential Integrity. When you want to ensure that a value that appears in one relation for a given set of attributes also appears in a certain set of attributes in another relation.
  • Authorization. Different users can access different values in the database.

The output of the DDL statements is placed in the data dictionary, which contains metadata, that is data about data.

This language corresponds to manipulating tables through the CREATE TABLEALTER TABLE, and DROP TABLE.

The SQL Data-Definition Language SQL provides a rich DDL that allow to define tables with data types and integrity constraints. For instance:

create table department
	(dept_name char(20),
	building char (15),
	busget numeric (12,2));

As integrity constraint you could specify dept_name as primary key, ensuring that no two departments can have the same name.

3.2. Data-Manipulation Language (DML)

It enables users to access or manipulate data as organized by the appropriate data model. The types of access are:

  • retrieval of information stored in the DB;
  • insertion new data into the DB;
  • deletion of data from the DB;
  • modification of data store in the DB.

There are two types of DML:

  • Procedural DMLs: require a used to specify what data are needed and how to get those data.
  • **Declarative DMLs (**or non-procedural DMLs): require a user to specify what data are needed without specifying how to get those data. The DBMS has to figure out which is the most efficient solution to access data (i.e. SQL).

A query is a statement requesting the retrieval of information. The portion of a DML that involves information retrieval is called a Data Query Language (DQL). Although technically incorrect, it is common practice to use the terms query language (DQL) and DML synonymously.

In SQL, DML corresponds to the INSERTUPDATE, and DELETE.

The SQL Data-Manipulation Language The SQL Data-Query Language is non-procedural. A query takes as input one or more tables and always returns a single table. For example:

select instructor.name
from instructor
where instructor.dept_name='History';

3.3. Database Access from Application Programs

Some computations are not possible using SQL, but only general-purpose programming language like C++, Python, Java, etc. For instance, SQL does not support input from users, output to displays, or communication over the network. Application Programs are programs that allows us to do all these things.

Specific APIs allows us to send DML & DDL statements from the host to the database where they will be executed. For example, ODBC (Open Database Connectivity) standard defines APIs for use C, while JDBC (Java Database Connectivity) standard defines APIs for the Java language.


4. History of DBMS

1960s.

  • Early DB applications were difficult to build an maintain on available DBMSs.
  • Examples: IDS, IMS, CODASYL.
  • Computers were expensive, humans were cheap.
  • Tight coupling between logical and physical layers.
  • Programmers had to (roughly) know that queries the application would execute before the could deploy the DB.

Late 1960s and early 1970s.

  • Ted Codd in 1970 defined in a landmark paper the Relational Model and non-procedural ways of querying data in the relational model. Codd saw IBM’s developers rewriting DB programs every time the DB’s schema or layout changed. So, Relational DB were born.
  • The Relational Model was a mathematical abstraction on how you would represent both the DB and interaction with the DB.

Late 1970 and 1980s.

  • Relational Model was not used in practice initially because it’s perceived performance disadvantages.
  • This changed when IBM developed SQL/DS in 1981, the first commercial DBMS. At the same time, University California at Berkley developed Ingres and Oracle developed Oracle V2.
  • Relational DB replaced older network and hierarchical DB because with the latter programmers had to deal with many low-level implementation details, they had to code their queries in a procedural way, and they had to keep in mind efficiency when designing the applications.
  • In contrast, with Relational DB programmers needed to work only at logical level because the low-level tasks are carried out automatically by the system database.