
Final project: Sketch to Design

CS 294-026 Intro to Computer Vision and Computational Photography

Fall 2021, Xinwei Zhuang

A PDF version of this project is available here: [PDF].


When a design is requested, architects usually make preliminary sketches to visualise their ideas. These sketches are typically abstract and hard to interpret, yet they are useful and can hint at the style of the final design. This study uses pix2pix as the framework to build an application that, given a simple sketch as input, generates a more developed design with rendering.



Data preparation

The data is collected from ArchDaily. A total of 2,740 projects is collected, with multiple images per project. The resulting image set contains 45,776 images, each scaled and center-cropped to 256×256 pixels. A similarity matrix is then computed over these images. The ideal image shows a building against a clean background, so interior photos are eliminated during preprocessing. The images that pass the similarity check are shown below.

Dataset after preprocessing
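The scale-and-center-crop step can be sketched as follows. This is a minimal illustration of the preprocessing described above, assuming Pillow; the report's actual pipeline may differ in resampling filter and rounding.

```python
from PIL import Image

def center_crop(img, size=256):
    """Scale the shorter side to `size`, then center-crop to a
    size x size square (a sketch of the preprocessing above)."""
    w, h = img.size
    s = size / min(w, h)
    img = img.resize((round(w * s), round(h * s)), Image.BICUBIC)
    w, h = img.size
    left, top = (w - size) // 2, (h - size) // 2
    return img.crop((left, top, left + size, top + size))
```

Applied to, say, a 512×300 photo, this scales it to 437×256 and crops the central 256×256 region, so the shorter side is never padded.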
Next, image contours are computed. Several contour algorithms were tried. First, Harris corners were used to generate contours; however, the result can be noisy and does not highlight the architecture itself. Different alpha ratios were tested to find the optimal contour for the input data; given the image size (256×256 pixels), alpha is set to 2. Similar research uses an image-space contour rendering approach (Delanoy et al. 2018). Although more recent contour-generation algorithms exist, such as the suggestive contours proposed by DeCarlo et al. (2003), architectural sketches usually consist only of clean lines, and environmental factors can also be considered when sketching in the early design phase; thus simple Harris corners suffice for this project. Further investigation might include how to eliminate background contours. The standard deviation of the Gaussian filter is set to 2 for the first-round input and to 1 for the second-round edge input, which carries more detail. The contour images and the original photos form paired data, which is then fed to the network. Sample data is shown below.

Paired data with α = 2 (left) for the first training iteration, α = 1 (middle) for subsequent training iterations, and the real photo (right).
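The Gaussian-filtered edge extraction can be sketched as below. This is a hypothetical stand-in using a Gaussian blur followed by a Sobel gradient magnitude (scipy/numpy); the report's exact operator and its alpha parameter may differ.

```python
import numpy as np
from scipy import ndimage

def edge_map(gray, sigma=2.0, thresh=0.1):
    """Blur the greyscale image with a Gaussian of standard deviation
    `sigma`, take the Sobel gradient magnitude, and threshold it to a
    binary contour image. sigma=2 gives the coarse first-round
    contours; sigma=1 keeps more detail for the second round."""
    g = ndimage.gaussian_filter(np.asarray(gray, dtype=float), sigma)
    gx = ndimage.sobel(g, axis=1)
    gy = ndimage.sobel(g, axis=0)
    mag = np.hypot(gx, gy)
    mag /= mag.max() + 1e-8  # normalise so `thresh` is scale-free
    return (mag > thresh).astype(np.uint8)
```

A larger sigma suppresses fine texture before the gradient is taken, which is why the first round yields cleaner, sketch-like lines than the second.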

cGANs architecture

The conditional generative adversarial network (cGAN) is a variant of the generative adversarial network that feeds conditional input data to both the generator and the discriminator (Mirza and Osindero 2014).

These parameters are chosen following Dumoulin and Visin (2016).

For the upsampling layers, dropout is used, as recommended by Hinton et al. (2012). First, the edge images with α = 2, paired with the original photos, are trained for 30,000 iterations. The resulting intermediate model is then used for a second round of training on edge images with α = 1. The workflow is illustrated below.
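The dropout applied in the upsampling layers can be illustrated with a minimal numpy sketch of inverted dropout; the actual framework implementation will differ, but the behaviour is the same.

```python
import numpy as np

def dropout(x, rate=0.5, train=True, rng=None):
    """Inverted dropout (Hinton et al. 2012): during training, zero a
    random fraction `rate` of the activations and rescale the
    survivors so the expected activation matches test time."""
    if not train or rate == 0.0:
        return x  # dropout is a no-op at inference
    rng = np.random.default_rng() if rng is None else rng
    mask = rng.random(x.shape) >= rate
    return x * mask / (1.0 - rate)
```

In a pix2pix-style decoder, keeping dropout active even at test time is a common way to inject stochasticity, since the generator takes no explicit noise vector.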


Network pipeline


Training loss

Training curves: generator loss, discriminator loss, and generator L1 loss.

The training losses of the generator and discriminator are shown above. The generator loss remains high (~2.5), while the discriminator loss drops below 0.4 after ~140 steps. The generator L1 loss decreases continually.

Single photo input

The photos generated from a single input (with only a black-and-white edge diagram as input data) are shown below. The cGAN learned features such as: curved strokes are mostly interpreted as greenery, while straight lines mostly become the facade (with some preference for dark, modern metal materials). Shapes inside the contours can be interpreted as openings such as windows. The sky and the ground are separated with different colours, and the architecture itself shows diversity in colour, material and composition. Results are demonstrated below after 20,000 iterations of training with batch size 64.

Iterative input

The iterative process is achieved by feeding in the output of the first network overlaid with an edge image carrying more detail, such as the window frames and the roof depth. Results are demonstrated below after 10,000 iterations of training with batch size 64. The result of the second iteration shows a clearer distinction between the different components of the building: the roof and the windows can be easily separated. The results still preserve the ability to separate the background (a blue sky), the foreground (the greenery) and the main body (the architecture). However, it is worth noticing that the diversity in building materials seems to disappear: all the architecture converges to a white facade with brown frames.
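The overlay step that builds the second-iteration input can be sketched as below. This is an assumed compositing (edge pixels drawn in black over the first-pass rendering); the report's exact blending may differ.

```python
import numpy as np

def overlay_edges(render, edges):
    """Compose the first-pass rendering with a more detailed edge
    map to form the updater network's input.
    `render`: HxWx3 uint8 image; `edges`: HxW binary mask."""
    out = render.copy()          # leave the original rendering intact
    out[edges.astype(bool)] = 0  # draw edge pixels in black
    return out
```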

Successful cases

Failure cases


Comparison between the output from the single-photo input (left) and the second-iteration output (right).
By introducing the updater network, the resolution of the output image is enhanced, and the network can emphasise edges and sharpen the blurry output of the single-image iteration. Among the failure cases, some renderings produced by the updater network are blurry in the foreground, and details are lost. This might be due to overly complex edges, indicating that more careful data preparation is required. Another observation is that, due to data imbalance, the majority of the output photos from the first iteration have a 'style' of brick and warm light, whereas the subsequent iterations show a monotonous appearance with a cold metal facade, even though they are trained on the same real photos.

Summary and Future work

In summary, the proposed scheme can produce reasonable renderings from sketches, and the generated rendering can be updated within a second when a more detailed sketch is provided. However, due to the limited scope of the data, the generated renderings can present a monotonous modern metal appearance. There are two ways to improve. The first is to build a larger dataset with more diverse building styles. Since architectural style differs across building types, cultures and environments, it would be useful to create sub-categories for sketch generation, such as training only on residential housing data for residential sketches. The second is to introduce multiple outputs from a single sketch, so that the user can choose which rendering to continue with, and to use semantic segmentation to separate the components of an image, such as the background (sky, greenery, water...), the foreground (grass, people) and the architecture itself (facade, glazing, door, column...); the user can then fix the areas they are satisfied with and develop the remaining space further. Another interesting direction worth investigating is building a 3D generator from 2D sketches. A list of possible developments is given below.


[1] Delanoy, J., Aubry, M., Isola, P., Efros, A. A., Bousseau, A. (2018). 3D Sketching using Multi-View Deep Volumetric Prediction. Proceedings of the ACM on Computer Graphics and Interactive Techniques, 1(1), 1–22. https://doi.org/10.1145/3203197

[2] Isola, P., Zhu, J.-Y., Zhou, T., Efros, A. A. (2018). Image-to-Image Translation with Conditional Adversarial Networks. ArXiv:1611.07004 [Cs]. http://arxiv.org/abs/1611.07004

[3] Dumoulin, V., Visin, F. (2016). A guide to convolution arithmetic for deep learning. ArXiv:1603.07285 [Cs, Stat]. http://arxiv.org/abs/1603.07285

[4] Eigen, D., Fergus, R. (2015). Predicting Depth, Surface Normals and Semantic Labels with a Common Multi-Scale Convolutional Architecture. ArXiv:1411.4734 [Cs]. http://arxiv.org/abs/1411.4734

[5] Hinton, G. E., Srivastava, N., Krizhevsky, A., Sutskever, I., Salakhutdinov, R. R. (2012). Improving neural networks by preventing co-adaptation of feature detectors. ArXiv:1207.0580 [Cs]. http://arxiv.org/abs/1207.0580

[6] Mirza, M., Osindero, S. (2014). Conditional Generative Adversarial Nets. ArXiv:1411.1784 [Cs, Stat]. http://arxiv.org/abs/1411.1784

[7] Sangkloy, P., Burnell, N., Ham, C., Hays, J. (2016). The sketchy database: Learning to retrieve badly drawn bunnies. ACM Transactions on Graphics, 35(4), 119:1-119:12. https://doi.org/10.1145/2897824.2925954

[8] Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., ... Bengio, Y. (2014). Generative adversarial nets. In Advances in neural information processing systems (pp. 2672-2680).

[9] Huang, W. and Zheng, H. 2018. Architectural Drawings Recognition and Generation through Machine Learning. Mexico city, ACADIA.

[10] Chaillou, S. (2019). AI + Architecture, Towards a New Approach. Harvard University, 188.

[11] Radford, A., Metz, L., Chintala, S. (2015). Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks (DCGAN). ArXiv:1511.06434 [Cs]. http://arxiv.org/abs/1511.06434

[12] DeCarlo, D., Finkelstein, A., Rusinkiewicz, S., Santella, A. (2003). Suggestive Contours for Conveying Shape. ACM Transactions on Graphics, 22(3), 848–855.



Code Reference