Data refers to any recorded piece of information. For example, yesterday's recorded temperature is a form of data. The current day of the week is also a piece of information. Likewise, the text you're reading at this moment is data, as is the following image of three coffee beans!
These four examples cover the four forms of data we usually work with in data science: numeric, label, text, and image. Each form calls for its own techniques.
Let's dive in and examine our first real-world dataset. It is structured in a tabular format and contains both numerical and categorical data. Known as the iris dataset, it was published by the statistician and biologist Ronald Fisher in 1936, using measurements collected by the botanist Edgar Anderson. It is popular in data science education because it is well structured and makes inferences easy to see. We will keep returning to this dataset throughout the course.
Sepal.Length | Sepal.Width | Petal.Length | Petal.Width | Species |
---|---|---|---|---|
5.1 | 3.5 | 1.4 | 0.2 | setosa |
7 | 3.2 | 4.7 | 1.4 | versicolor |
6.3 | 3.3 | 6 | 2.5 | virginica |
These five variables were recorded for each flower and arranged in a tabular format, with columns representing the variables and rows containing the corresponding observations. In this instance, we have five distinct columns: Sepal.Length, Sepal.Width, Petal.Length, Petal.Width, and Species. The table shows 3 observations, one per row. The columns, placed side by side, are often referred to as variables or, alternatively, data items.
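The same table can be pulled from R itself, since `iris` ships with every R installation. The sketch below assumes the three rows shown above correspond to rows 1, 51, and 101 of that built-in copy (the first flower of each species); the name `sample_rows` is only illustrative.

```r
# `iris` is part of base R's datasets package, so nothing needs to be loaded.
# Assumption: the three rows in the table are rows 1, 51 and 101 of `iris`.
sample_rows <- iris[c(1, 51, 101), ]

print(sample_rows)   # the three observations, one per row
nrow(sample_rows)    # number of observations (rows)
ncol(sample_rows)    # number of variables (columns)
names(sample_rows)   # the column (variable) names
```

Running `dim(iris)` in the same session shows the size of the full dataset rather than this three-row extract.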
Taking a closer look at the table, the first observation reveals that the value for Sepal.Length is 5.1, and the value for Species is setosa.
Interestingly, these two values exhibit different characteristics. 5.1 is a numerical value, while setosa is a label.
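One way to see this difference is to ask R how it stores each value. This is a minimal sketch using the built-in `iris` dataset; the variable name `first_obs` is just an illustrative choice.

```r
# Take the first observation (row) of the built-in iris dataset.
first_obs <- iris[1, ]

first_obs$Sepal.Length          # the numerical value 5.1
class(first_obs$Sepal.Length)   # "numeric"

as.character(first_obs$Species) # the label "setosa"
class(first_obs$Species)        # "factor", R's type for labels (categories)
levels(iris$Species)            # all labels that Species can take
```

R stores labels as a factor, which is why the class of `Species` differs from that of the measurement columns.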
We'll cover the unfamiliar terms used in the following code snippet at a later stage. For now, let's focus on how we can create our own data. We can do this with the `data.frame` function in R, which constructs a dataframe. We'll store this dataframe in a variable named `df`. You are encouraged to execute the following code and try it out for yourself!
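A minimal two-line sketch that matches the walkthrough in this section, assuming columns named `a` and `b` holding the values 1, 2 and 3, 4:

```r
# Line 1: build a dataframe with two columns, a and b, and store it in df.
df <- data.frame(a = c(1, 2), b = c(3, 4))
# Line 2: display the contents of the dataframe.
print(df)
```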
Let's go through each line of code in the example above. In the first line, a new dataframe object named `df` is created. The dataframe is constructed using the `data.frame` function, which comes pre-loaded in every R session. Column `a` of the dataframe is populated with the values 1 and 2, specified using the `c` function (another function built into R; the "c" stands for combine). Column `b` is populated with the values 3 and 4, also specified using the `c` function.

The second line prints out the contents of the dataframe `df`. The `print` function is used to display the dataframe object. The output shows the values in columns `a` and `b`, with each row representing a separate observation.
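To check that the dataframe really has the shape described, a few built-in helper functions can be used. This is a small sketch that rebuilds the same `df` so it runs on its own; the helpers shown go slightly beyond what the walkthrough itself covers.

```r
# Rebuild the example dataframe so this block is self-contained.
df <- data.frame(a = c(1, 2), b = c(3, 4))

nrow(df)   # 2 observations (rows)
ncol(df)   # 2 variables (columns)
df$a       # the values stored in column a: 1 2
df$b       # the values stored in column b: 3 4
str(df)    # compact summary of each column and its type
```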