Geometry rasterization is about transforming vertices, which make up the corners of the triangles that get rendered on the screen (very simply put).

A vertex is just a triplet of coordinate values, like (5, 7, 9). That's called a three-vector (Vector3). The coordinate space that a three-vector is given in is known as "model space." In model space, the "center" of the model has the coordinate (0,0,0), and all vertices are relative to this center. Thus, a cube with 1-meter sides, centered on the model origin, would have corners at positions like (-0.5,-0.5,-0.5), (0.5,-0.5,-0.5), (0.5,0.5,-0.5) etc. in model space.
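As a concrete sketch (plain Python, no graphics API involved; `cube_corners` is just an illustrative helper name):

```python
# A vertex is just three floats; model space puts (0, 0, 0) at the model's center.
# Generate the 8 corners of a cube with 1-meter sides centered on the origin.
def cube_corners(side=1.0):
    h = side / 2.0  # half the side length, so corners straddle the origin
    return [(x, y, z) for x in (-h, h) for y in (-h, h) for z in (-h, h)]

corners = cube_corners()
print(corners)  # includes (-0.5, -0.5, -0.5), (0.5, -0.5, -0.5), ...
```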

Your simulated "world" is one big coordinate space, but it's not the same space as model space. For example, if you have one cube at position (-10,0,0) and another cube at position (-5,0,0), the model-space vertices are the same, but the world-space vertices are different. The matrix that transforms coordinates from model space to world space is called the World matrix. This matrix rotates and translates (moves) your model into the appropriate world coordinates. World space is a great space to do calculations like lighting and reflection mapping in, but you don't need to worry too much about that just yet.
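A minimal sketch of the two-cubes example, assuming row-major 4x4 matrices applied to column vectors (the helper names are made up, not any API's):

```python
# A world matrix moves model-space vertices into world space. Here, a pure
# translation matrix (row-major, applied as M * v with column vectors).
def translation(tx, ty, tz):
    return [[1, 0, 0, tx],
            [0, 1, 0, ty],
            [0, 0, 1, tz],
            [0, 0, 0, 1]]

def transform(m, v):
    # Treat the 3-vector as (x, y, z, 1) and multiply by the 4x4 matrix.
    p = (v[0], v[1], v[2], 1.0)
    return tuple(sum(m[r][c] * p[c] for c in range(4)) for r in range(3))

corner = (-0.5, -0.5, -0.5)  # the same model-space vertex for both cubes
print(transform(translation(-10, 0, 0), corner))  # (-10.5, -0.5, -0.5)
print(transform(translation(-5, 0, 0), corner))   # (-5.5, -0.5, -0.5)
```

Same model-space input, different world-space output: the only thing that changed is the World matrix.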

Now, once in world coordinates, you have a camera or eye that views the world. This is the space in which you calculate things like fog/haze -- fog is applied based on distance from the eye, not absolute world position. The View matrix takes world coordinates and puts them into the coordinate frame of the camera, where (in a right-handed system) positive X means right, positive Y means up, and negative Z means away from the camera. The camera itself is always at position (0,0,0) in view space. So far, we have only moved and rotated vertex values; there is still no perspective or scale involved. (World matrices sometimes include scale, to make things smaller or bigger in world space, but ignore that for now.)
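A sketch of the simplest possible case -- a camera that is translated but not rotated, so its view matrix is just the inverse translation (helper names are illustrative):

```python
# The view matrix is the inverse of the camera's own placement in the world.
# For an unrotated camera at position `eye`, that inverse is translation by -eye.
def view_matrix(eye):
    ex, ey, ez = eye
    return [[1, 0, 0, -ex],
            [0, 1, 0, -ey],
            [0, 0, 1, -ez],
            [0, 0, 0, 1]]

def transform(m, v):
    p = (v[0], v[1], v[2], 1.0)
    return tuple(sum(m[r][c] * p[c] for c in range(4)) for r in range(3))

eye = (0.0, 0.0, 5.0)  # camera 5 m "behind" the world origin on the +Z axis
print(transform(view_matrix(eye), eye))        # (0.0, 0.0, 0.0): camera sits at the origin
print(transform(view_matrix(eye), (0, 0, 0)))  # (0.0, 0.0, -5.0): 5 m in front (right-handed)
```

Note how the world origin lands at negative Z in view space: it is in front of the camera, per the right-handed convention above.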

There is a space called clip space. This space decides what ends up in the viewport on screen. In clip space, coordinates between -1 and 1 in X and Y, and between 0 and 1 in Z, end up being drawn (in DirectX -- in OpenGL, clip space Z is different). The job of the Projection matrix is to take the vertices in view space and transform them such that they end up in the -1..1 range that maps to the viewport. However, a cube that is 1 meter wide should look really big (say, from -0.9 to 0.9 in clip space) when it is close to the camera, but really small (say, from -0.01 to 0.01) when it is far away. To make this happen, you divide the X and Y values by the distance from the camera, so that the farther away objects get, the smaller their relative X/Y values become (and the wider a slice of view space fits into the viewport). Thus, the role of the projection matrix is to scale the view-space coordinates to fit the clip-space range, and to set up the appropriate perspective-divide value. That's where FOV and aspect come in -- if the screen is not square, you want to divide view-space Y by a different value than view-space X, or the image will look stretched (because clip space is always -1..1, even if the screen is not square).

After the perspective transform matrix has been applied to the vertex, the perspective divide is done automatically, and the coordinates are passed along to the viewport transform. (There is also something called scissoring, but ignore that for now.) The viewport transform takes the -1..1 clip-space coordinates and turns them into Z-buffer values (for depth testing) and screen coordinates (for rasterization). To set the viewport transform, you set the Viewport property, which is not a matrix but just a screen rectangle and a depth-buffer range -- although internally it turns into a simple matrix.
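A sketch of that internal "simple matrix" written out as plain scale-and-offset arithmetic; the parameter names loosely mirror a D3D-style viewport rectangle but the function itself is made up:

```python
# The viewport transform is just a scale and an offset: NDC x/y in -1..1
# become pixel coordinates, NDC z in 0..1 becomes a depth-buffer value.
def viewport_transform(ndc, x=0, y=0, width=800, height=600, min_z=0.0, max_z=1.0):
    nx, ny, nz = ndc
    sx = x + (nx + 1.0) * 0.5 * width     # -1..1 -> left..right pixels
    sy = y + (1.0 - ny) * 0.5 * height    # y flips: NDC +y is up, screen y grows downward
    depth = min_z + nz * (max_z - min_z)  # 0..1 -> the depth-buffer range
    return (sx, sy, depth)

print(viewport_transform((0.0, 0.0, 0.5)))   # (400.0, 300.0, 0.5): center of an 800x600 screen
print(viewport_transform((-1.0, 1.0, 0.0)))  # (0.0, 0.0, 0.0): top-left corner, nearest depth
```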

So, once past the viewport transform, the graphics card now has a screen-space value for the vertex, which is a corner of a triangle, and can start setting up rasterization to paint the triangle.
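The whole chain can be strung together for a single vertex. This is a sketch under the same assumptions as above (row-major matrices, D3DX-style right-handed projection, an unrotated camera); every name here is illustrative:

```python
import math

# End to end: model -> world -> view -> clip -> NDC -> screen, for one vertex.
def matmul(a, b):
    return [[sum(a[r][k] * b[k][c] for k in range(4)) for c in range(4)] for r in range(4)]

def translation(tx, ty, tz):
    return [[1, 0, 0, tx], [0, 1, 0, ty], [0, 0, 1, tz], [0, 0, 0, 1]]

def perspective(fov, aspect, zn, zf):
    ys = 1.0 / math.tan(fov / 2.0)
    return [[ys / aspect, 0, 0, 0], [0, ys, 0, 0],
            [0, 0, zf / (zn - zf), zn * zf / (zn - zf)], [0, 0, -1, 0]]

world = translation(-10, 0, 0)           # place the cube at (-10, 0, 0)
view = translation(0, 0, -20)            # unrotated camera sitting at (0, 0, 20)
proj = perspective(math.radians(60), 16 / 9, 0.1, 100.0)
mvp = matmul(proj, matmul(view, world))  # one combined matrix, applied right to left

v = (0.5, 0.5, 0.5, 1.0)                 # a model-space cube corner
cx, cy, cz, cw = (sum(mvp[r][c] * v[c] for c in range(4)) for r in range(4))
nx, ny, nz = cx / cw, cy / cw, cz / cw   # the perspective divide -> NDC
sx = (nx + 1.0) * 0.5 * 800              # viewport transform to an 800x600 screen
sy = (1.0 - ny) * 0.5 * 600
print((round(sx, 1), round(sy, 1)))      # a pixel position, ready for rasterization
```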

Voila! Transform pipeline complete.